Function word lists

It is preferable to use the 277 word list for current work:

Alphabetical order 277 word new function word list

Clustered new 277 word function word list

The 264 word lists are included for historical reference; these were used in the initial publications:

Alphabetical order original 264 word function word list

Clustered original 264 word function word list

My recent work on dialogue act classification makes use of function words. My position on function words is that they are extremely important components of a sentence (or dialogue utterance, etc.). Content words tell you what the sentence is about. Function words tell you a great deal of what the sentence says about {what it is about}.
When I started my work in this area I could not find a list of function words. I did find a number of lists of stopwords, but these are high frequency words, and the lists are compiled with the intention of throwing these words away in information retrieval techniques (e.g. LSA) because they don’t distinguish between texts. Stopword lists may contain nouns, verbs, adjectives or adverbs if these occur with sufficient frequency in the corpus used to create the list.
So I compiled a list by combining stopword lists, removing the content words and then searching in dictionaries for likely candidates to add. There are two list lengths so far in this section of the blog. The lists postfixed 264 contain 264 function words and these were used for early work. The lists postfixed 277 contain some extra function words discovered later. It’s possible there may be some more lurking out there, so please let me know if you think I’ve missed something, but most of the additions are fairly obscure legalistic constructions. And therein lies a lesson: not all high frequency words are function words, and not all function words are high frequency!
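The first two steps of that compilation (combine the stopword lists, then remove the content words) can be sketched roughly as below. This is only an illustration: the function name `compile_candidates` and the tiny word lists are made up for the example, and the real filtering of content words was a manual judgement rather than a lookup against a ready-made set.

```python
def compile_candidates(stopword_lists, content_words):
    """Merge several stopword lists and drop known content words.

    Both arguments are hypothetical inputs for this sketch; deciding
    what counts as a content word is the manual part of the job.
    """
    combined = set()
    for words in stopword_lists:
        # Lower-case so "The" and "the" are treated as the same entry.
        combined.update(w.lower() for w in words)
    to_remove = {w.lower() for w in content_words}
    return sorted(combined - to_remove)


# Toy data, invented for illustration only.
list_a = ["the", "of", "and", "said", "new"]
list_b = ["The", "but", "whereas", "time", "of"]
content = ["said", "new", "time"]

candidates = compile_candidates([list_a, list_b], content)
print(candidates)
```

The dictionary search for extra candidates is not shown, since it has no obvious mechanical equivalent.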
It is possible to organize the function words in different ways to optimize their performance in classification. So far I have published alphabetically ordered lists and lists where the function words are clustered by grammatical / syntactic categories. More to follow.
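To show why the ordering of the list can matter, here is a minimal sketch of one possible way to turn an utterance into a feature vector, with one slot per function word in list order. The binary presence encoding, the function name `function_word_features` and the five-word lists are all assumptions made for illustration, not a description of the classifier used in the publications.

```python
import re


def function_word_features(utterance, ordered_function_words):
    """Return a 0/1 vector: one slot per function word, in list order."""
    # Simple tokenizer for the sketch: lower-case runs of letters/apostrophes.
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    return [1 if w in tokens else 0 for w in ordered_function_words]


# The same five words in two orders (toy lists, not the published ones).
alphabetical = ["and", "can", "the", "what", "you"]
clustered = ["the", "and", "what", "you", "can"]  # grouped by category

utterance = "What can you see?"
alpha_vec = function_word_features(utterance, alphabetical)
cluster_vec = function_word_features(utterance, clustered)
print(alpha_vec, cluster_vec)
```

Both vectors carry the same information, but in the clustered order related words sit in neighbouring positions, which may suit classifiers that are sensitive to where a feature appears.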

