Finally: STSS-131

The new dataset for evaluating STSS measures is now available on the datasets page. Years in the making, it has been produced using the best methods currently available. The paper from which it is extracted, “A new benchmark dataset with production methodology for Short Text Semantic Similarity algorithms”, is groundbreaking in establishing the measurement-theoretic and statistical validity of the methods used.

The dataset is more representative of the English language and more demanding than STSS-65, so be prepared for lower correlation coefficients between your algorithms and this dataset than with STSS-65. Both STASIS and LSA score considerably lower. This is a virtue of the dataset: it has much more headroom to demonstrate future improvements in STSS algorithms.
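For readers who want to check how their own algorithm compares, the following is a minimal sketch of computing a correlation coefficient between human ratings and algorithm scores. It assumes Pearson's r is the coefficient of interest. The function name `pearson_r` and all the numbers are made up for illustration and are not taken from STSS-131 or STSS-65.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one human rating and one algorithm score per sentence pair.
# These values are invented for the example, not drawn from any real dataset.
human_ratings = [0.1, 0.8, 2.3, 3.1, 3.9]
algorithm_scores = [0.05, 0.30, 0.55, 0.60, 0.95]

r = pearson_r(human_ratings, algorithm_scores)
print(f"Pearson r = {r:.3f}")
```

Running the same calculation against both datasets makes any drop in correlation directly comparable.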


Welcome to Semantic Similarity

This website is intended to disseminate my own findings in the fields of Text Processing, Text Understanding and Text Mining. Because I am particularly interested in the application of Short Text Semantic Similarity (the main focus of my PhD thesis) in these fields, I have called the site “Semantic Similarity”.

Apart from my work e-mail address at Manchester Metropolitan University, I have also set up an e-mail account specifically for contacts from this website:

drjamesdoshea <at> gmail <dot> com