Finally: STSS-131

The new dataset for evaluating STSS measures is now available on the datasets page. Years in the making, it has been produced using the best methods currently available. The paper from which it is extracted, “A new benchmark dataset with production methodology for Short Text Semantic Similarity algorithms”, is groundbreaking in establishing the measurement-theoretic and statistical validity of the methods used.

The dataset is more representative of the English language and more demanding than STSS-65, so be prepared for lower correlation coefficients between your algorithms and this dataset than you obtained with STSS-65. Both STASIS and LSA score considerably lower. This is a virtue of the dataset: it has much more headroom to demonstrate future improvements in STSS algorithms.
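As a rough illustration of how an algorithm is compared with a dataset of this kind, the sketch below computes a correlation coefficient between human ratings and algorithm scores. It assumes Pearson's product-moment correlation, which the post does not specify, and the function name `pearson` and all of the numbers are hypothetical values chosen for illustration, not data from STSS-131.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only (not taken from STSS-131):
# one human rating and one algorithm score per sentence pair.
human_ratings = [0.1, 0.9, 2.3, 3.1, 3.9]
algorithm_scores = [0.05, 0.30, 0.55, 0.70, 0.95]

r = pearson(human_ratings, algorithm_scores)
print(f"r = {r:.3f}")
```

A lower value of r on a more demanding dataset does not by itself mean the algorithm has got worse; it may simply mean the dataset leaves more room for improvement.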

Seminal Papers #1: Features of Similarity

Tversky, A., Features of Similarity. Psychological Review, 1977. 84(4): p. 327-352.

Tversky’s paper (Tversky, 1977) is fundamentally important as it set out to unify the existing work on set-theoretical models of similarity into a single model. The dominant models of similarity at the time were “geometric”, measuring distance rather than similarity, but always on the assumption that distance could be converted to (or negatively correlated with) similarity.

The paper includes an analysis using measurement theory (axiomatic measurement), which appealed to me because of my background in Software Engineering, which makes use of these axioms (Minimality, Symmetry, the Triangle Inequality).
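For readers who have not met these axioms, here is a minimal sketch of what they require of a distance function, as they are usually stated for geometric models: d(a, b) >= d(a, a) = 0 (Minimality), d(a, b) = d(b, a) (Symmetry) and d(a, c) <= d(a, b) + d(b, c) (the Triangle Inequality). The function `check_metric_axioms` and the two example distance functions are my own hypothetical illustrations and are not taken from the paper.

```python
from itertools import product
from math import dist as euclidean


def check_metric_axioms(items, dist, tol=1e-9):
    """Report which of the three axioms hold for `dist` over every combination of `items`."""
    # Minimality: d(a, a) is 0 and no distance is smaller than it.
    minimality = all(
        abs(dist(a, a)) <= tol and dist(a, b) >= dist(a, a) - tol
        for a, b in product(items, repeat=2)
    )
    # Symmetry: the distance is the same in both directions.
    symmetry = all(
        abs(dist(a, b) - dist(b, a)) <= tol
        for a, b in product(items, repeat=2)
    )
    # Triangle Inequality: going via b is never shorter than going direct.
    triangle = all(
        dist(a, c) <= dist(a, b) + dist(b, c) + tol
        for a, b, c in product(items, repeat=3)
    )
    return {"minimality": minimality, "symmetry": symmetry, "triangle": triangle}


# Euclidean distance between points in the plane satisfies all three axioms.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
print(check_metric_axioms(points, euclidean))


# A deliberately asymmetric, hypothetical "distance" on numbers:
# it only counts movement upwards, so d(1, 3) = 2 but d(3, 1) = 0.
def upward_only(a, b):
    return max(0.0, b - a)


numbers = [0.0, 1.0, 3.0, 7.0]
print(check_metric_axioms(numbers, upward_only))
```

The second example keeps Minimality and the Triangle Inequality but breaks Symmetry, which shows that the axioms can be tested independently of one another.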

The paper contains many interesting ideas, including practical implications for the collection of similarity judgements from humans.

All of these seminal papers are widely cited, but sometimes at second or third hand, and I recommend checking the original source if you are going to use one of them.

To the best of my knowledge, this paper is not available online. I got my copy through inter-library loan. If you know of a copy legitimately available online please post a comment to this blog entry.

Welcome to Semantic Similarity

This website is intended to disseminate my own findings in the fields of Text Processing, Text Understanding and Text Mining. Because I am particularly interested in the application of Short Text Semantic Similarity (the main focus of my PhD Thesis) in these fields, I have called the site “Semantic Similarity”.

Apart from my work e-mail address at Manchester Metropolitan University, I have also set up an e-mail account specifically for contacts from this website:

drjamesdoshea <at> gmail <dot> com