FRELI (the Free Repository of English Lexical Information) is an ongoing project to provide simple lexical information for as many English words as possible. Each entry contains the word's part of speech, a rough indicator of its frequency, and other information. FRELI does not contain word definitions, and is not meant to be a dictionary; rather, it's designed for use by spell checkers, word games, text parsers, etc.
A new version of FRELI is now available. The total number of entries is now 73,735.
A new version of FRELI has now been released under the Creative Commons Attribution License (version 2.0). There aren’t many new word entries (e.g., cockler, "one who takes and sells cockles", and quadriga, "a car or chariot drawn by four horses abreast"), but the principal changes are the different license and the addition of the source code for frop, the command-line tool I wrote to manage the FRELI data files.
A new version of FRELI is now available for download. More than 22,000 entries have been added, pushing the total count past 73,000. Most of the new entries are for scientific terms (for example, kallikreinogen and poikilothermy), but some old standbys were finally added as well (e.g., sheepcote and bast).
The FRELI word list is best illustrated by example. The following is an excerpt of entries from the latest version:
kalemia (n) #57881 <domain:pathology>
kalends (n) #17941 @calends
kaleyard (n) #38206
kali (n) #17942 <roget-count:1>
kalian (n) #38207 "kind of tobacco pipe"
kaliemia (n) #57882 <domain:pathology>
kalif (n) #38208 @caliph <language:Arabic>
kalifate (n) #38209 @caliphate
kalimba (n) #38210 "kind of musical instrument" <domain:music>
kalinite (n) #38211
kaliopenia (n) #57883 <domain:pathology>
kaliopenic (adj) #72027
kaliph (n) #38212 @caliph
kalium (n) #59192
kaliuresis (n) #53967
kaliuretic (adj) #72028
kallikrein (n) #63531
kallikreinogen (n) #62135
kalmia (n) #38213 "kind of shrub" <taxon:Kalmia>
kalong (n) #38214 <language:Javanese>
kalpa (n) #38215 <language:Sanskrit>
kalpak (n) #38216 @calpac
kalpis (n) #38217
kalsomine (n,v) #38218 @calcimine
kaluresis (n) #53968
kaluretic (adj) #72029
kama'aina (n) #38219 <language:Hawaiian>
kamacite (n) #38220
kamala (n) #38221 <language:Sanskrit>
kambal (n) #38222 <language:Hindi>
kame (n) #38223
  |1 (n) "kind of ridge" <domain:geology>
  |2 (n) "combe"
kamelaukia (n) #38225 =kamelaukion+{pl}
kamelaukion (n) #38224 "type of ecclestiacal hat" <language:Greek>
The information in parentheses is the word's part(s) of speech. The number after the hash mark (#) is the word's FRELI ID, which uniquely identifies the word (as spelled). An at sign (@) indicates an alternate spelling, and angle brackets (<>) are used for key-value tags. For example, the <roget-count> information represents the word's frequency in one of the initial sources of words (the 1911 edition of Roget's thesaurus, now in the public domain).
Note that definitions are not given. The glosses in quotes are simply a rough means of distinguishing homographs.
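To make the markers described above concrete, here is a minimal Python sketch of how a single entry might be parsed. It is not part of FRELI or frop: the names Entry and parse_entry are hypothetical, and the sketch assumes one entry per line with the head fields in the order word, parts of speech, ID. It handles only the markers explained in the text (parentheses, #, @, <key:value> and quoted glosses) and makes no attempt to interpret the | and = markers that also appear in the excerpt.

```python
import re
from dataclasses import dataclass, field

# Head of an entry: word, (parts of speech), #ID. Assumed layout, not an official grammar.
HEAD = re.compile(r"^(?P<word>\S+) \((?P<pos>[^)]*)\) #(?P<id>\d+)")
GLOSS = re.compile(r'"([^"]*)"')          # quoted gloss distinguishing homographs
ALT = re.compile(r"@(\S+)")               # alternate spelling
TAG = re.compile(r"<([^:<>]+):([^<>]*)>") # key-value tag


@dataclass
class Entry:
    word: str
    pos: list
    freli_id: int
    alternates: list = field(default_factory=list)
    glosses: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)


def parse_entry(line):
    """Parse one entry line into an Entry; raise ValueError if the head is malformed."""
    line = line.strip()
    m = HEAD.match(line)
    if m is None:
        raise ValueError(f"not a recognizable entry: {line!r}")
    rest = line[m.end():]
    glosses = GLOSS.findall(rest)
    # Remove glosses first so an @ or < inside a gloss is not mistaken for a marker.
    rest = GLOSS.sub("", rest)
    return Entry(
        word=m.group("word"),
        pos=[p.strip() for p in m.group("pos").split(",")],
        freli_id=int(m.group("id")),
        alternates=ALT.findall(rest),
        glosses=glosses,
        tags=dict(TAG.findall(rest)),
    )


if __name__ == "__main__":
    print(parse_entry("kalif (n) #38208 @caliph <language:Arabic>"))
```

A spell checker or word game would typically need only the word and pos fields, while the FRELI ID gives a stable key for tracking an entry across releases.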
The quality of the data in the FRELI word list should be quite good. I started with a somewhat shorter list of words with part-of-speech information, compiled from a free wordlist available on the Net (the Link grammar project [LG]). I then wrote a Perl script to parse the 1911 edition of Roget's thesaurus, available from Project Gutenberg. Any words found in both Roget's and LG were left as is, modulo any later weeding. About 14,000 words were found in Roget's but not in LG; these were checked by hand against the American Heritage Dictionary [1st ed.] and the Oxford English Dictionary [also 1st ed.]. After that, I added irregular forms for all common irregular verbs (and some uncommon ones as well) and did a good deal of checking by grep to weed out regular inflected verb and noun forms and other superfluous words.
In late 2003, I began to develop tools to make it easier to gather words and add new entries to FRELI. The principal product of that effort so far has been frop, a command-line tool to search FRELI and manage new words. (The code for frop is now part of the FRELI release; it's a shell script with some supporting Perl code.)
In March 2004, I began an effort to expand FRELI, gleaning words from WordNet and the BSD Unix file /usr/share/dict/web2, which contains words from Webster's Second International Dictionary. Since that time some 30,000 words have been added.
In the near future, I hope to further automate new word discovery (e.g., using scripts to download Project Gutenberg texts in search of unrecognized words), but in order to maintain quality, I will continue to review all changes manually.
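As a rough illustration of the kind of automation described above, the following Python sketch lists the words in a text that are not already known. It is not the planned tooling: the function name unrecognized_words and the tokenizing pattern are assumptions made for this example, and downloading texts is left out entirely. Because FRELI omits regular inflected forms, a real script would also report many ordinary plurals and verb forms, which is one reason the candidates would still need manual review.

```python
import re

# A word is a run of letters, optionally with internal apostrophes (e.g. kama'aina).
# This tokenizer is an assumption for the sketch, not FRELI's definition of a word.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def unrecognized_words(text, known_words):
    """Return (word, count) pairs for words in text that are not in known_words.

    Comparison is case-insensitive. Results are sorted by descending count,
    then alphabetically, so the most frequent candidates come first.
    """
    known = {w.lower() for w in known_words}
    counts = {}
    for token in WORD.findall(text):
        word = token.lower()
        if word not in known:
            counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    known = ["the", "and", "a", "kalimba"]
    sample = "The kalimba and the kalong; a Kalong!"
    for word, count in unrecognized_words(sample, known):
        print(word, count)
```

Sorting by frequency puts the candidates most likely to be real words, rather than typos or proper names, at the top of the review list.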
Another near-term goal is to develop a suite of tools to support lexicon management in languages other than English. Preliminary work on this has included the refinement of the file format that FRELI now uses, which I've termed OFFLI (the Open File Format for Lexical Information).
Additions and corrections to FRELI are welcome, as are any suggestions for its future development.
Version control and software project management provided by CVSDude.
Last modified Friday, October 29, 2004 at 15:59:05 GMT -0500
webmaster@nkuitse.com