Benchmark Datasets

On this page you will find STSS-65, the original Short Text Semantic Similarity dataset, which was used in Li et al. (2006) to evaluate the STASIS STSS measure.

STSS-65 short text similarity benchmark dataset

On this page you will find STSS-131, a new benchmark dataset that is more representative of the English language and more demanding than STSS-65, which makes it important for evaluating improved algorithms. The page is an extract from a paper that is in press in ACM Transactions on Speech and Language Processing, with an estimated publication date of December 2013. Once the process to make it open access is complete, the full paper will be available for free download from the ACM TSLP website.

STSS-131 short text similarity benchmark dataset

Arabic Word Semantic Similarity

I lead a research group which is extending Short Text Semantic Similarity to non-English languages. I have uploaded our Arabic Word Semantic Similarity benchmark dataset. This is intended to play the same role in Arabic as the Rubenstein & Goodenough dataset plays in English.
We hope you will find it useful for evaluating and comparing word semantic similarity measures for Arabic. The dataset is contained in the paper “Arabic Word Semantic Similarity” on my publications page.
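
As a rough illustration of how a benchmark of this kind can be used, the sketch below correlates a measure's similarity scores with human ratings for the same word pairs. It assumes Pearson correlation as the comparison statistic, which is one common choice and not necessarily the one used in the paper. The function name `pearson` and all of the numbers are hypothetical placeholders and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder values only (NOT real benchmark data): one human rating and
# one score from the measure under test for each of five word pairs.
human_ratings = [0.1, 0.9, 2.0, 3.1, 3.9]
measure_scores = [0.05, 0.30, 0.45, 0.80, 0.95]

r = pearson(human_ratings, measure_scores)
print(f"Pearson r = {r:.3f}")
```

A higher correlation suggests that the measure ranks and spaces the word pairs more like the human raters did. To compare two measures, score the same pairs with each and compare the resulting correlations.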


One comment on “Benchmark Datasets”

  1. drjamesoshea says:

    I have attached it, but it is not showing up yet. I will fix this shortly.
