The new dataset for evaluating STSS measures is now available on the datasets page. Years in the making, it was produced using the best methods currently available, and the paper from which it is extracted, “A new benchmark dataset with production methodology for Short Text Semantic Similarity algorithms”, is groundbreaking in establishing the measurement-theoretic and statistical validity of the methods used.
The dataset is more representative of the English language and more demanding than STSS-131, so be prepared for lower correlation coefficients between your algorithms and this dataset than with STSS-65. Both STASIS and LSA score considerably lower. This is a virtue of the dataset: it leaves much more headroom to demonstrate future improvements in STSS algorithms.
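For readers benchmarking their own measures, the evaluation referred to above is typically done by computing Pearson's product-moment correlation between an algorithm's similarity scores and the benchmark's human ratings. A minimal sketch follows; the score lists are illustrative placeholders, not values from STSS-65, STSS-131, or the new dataset.

```python
# Minimal sketch: evaluating an STSS measure against a benchmark by
# computing Pearson's r between machine scores and human ratings.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Covariance numerator and the two standard-deviation terms
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical human ratings (0..4 scale) and algorithm scores (0..1 scale);
# Pearson's r is invariant to such linear rescalings.
human = [0.2, 1.5, 2.9, 3.6, 0.8]
machine = [0.10, 0.35, 0.70, 0.95, 0.25]
print(round(pearson_r(human, machine), 3))
```

A lower r against a harder benchmark is not a flaw in the measure or the data; it simply indicates the sentence pairs discriminate more finely between algorithms.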
I sent off a copy of my new STSS benchmark dataset, STSS-131, to a researcher at the Max Planck Institute this evening. STSS-131 is more demanding than my original dataset (STSS-65) and is more representative of the English language in terms of Dialogue Acts and related features.
I will upload a copy to this site in due course as a means of general distribution.
This website is intended to disseminate my own findings in the fields of Text Processing, Text Understanding, and Text Mining. Because I am particularly interested in the application of Short Text Semantic Similarity in these fields (the main focus of my PhD thesis), I have called the site “Semantic Similarity”.
Apart from my work e-mail address at Manchester Metropolitan University, I have also set up an e-mail account specifically for contacts from this website:
drjamesdoshea <at> gmail <dot> com