CESAR Aligned Wikipedia Headwords List

Please use the following text to cite this item or export to a predefined format:
Ljubešić, Nikola and Tadić, Marko, 2013, CESAR Aligned Wikipedia Headwords List, HR-CLARIN, http://hdl.handle.net/20.500.14615/2-33
Date issued
2013
Size
762662 entries
Description
The lexicon's 762,662 entries are built from the Wikipedia dumps of the six CESAR languages, using article titles and interlingual links to English and to the other five CESAR languages. In the first phase, a separate lexicon is built for each CESAR language. These lexicons are then merged by grouping together all entries connected by interlingual links. If more than one article in a language is connected to a group of articles in other languages (a situation that reflects errors in the structure of the Wikipedias), all of the article titles are retained, separated by semicolons. An example of such an entry is "Астеци; Империја Астека". In the final phase, category information from the English Wikipedia is added, with categories separated by semicolons, and for each non-English entry the number of links to that page in the Wikipedia of the respective language is given.
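The merging phase described above can be sketched as a connected-components grouping over interlingual links. This is only an illustration under stated assumptions, not the authors' actual pipeline: the function name `merge_lexicons`, the `(language, title)` node representation, the language codes, and the toy links are all hypothetical, and a union-find structure is just one plausible way to do the grouping.

```python
from collections import defaultdict


def merge_lexicons(links):
    """Group (lang, title) nodes that are connected by interlingual links.

    `links` is an iterable of ((lang, title), (lang, title)) pairs.
    Returns a list of entries, each a dict mapping a language code to its
    title(s); several titles of one language are joined by "; ".
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for source, target in links:
        union(source, target)

    # Collect the titles of every connected group, per language.
    groups = defaultdict(lambda: defaultdict(set))
    for node in list(parent):
        lang, title = node
        groups[find(node)][lang].add(title)

    return [
        {lang: "; ".join(sorted(titles)) for lang, titles in by_lang.items()}
        for by_lang in groups.values()
    ]


# Toy data (hypothetical): two articles of one language linked to the
# same English article end up in a single entry.
links = [
    (("sr", "Астеци"), ("en", "Aztecs")),
    (("sr", "Империја Астека"), ("en", "Aztecs")),
    (("hr", "Asteci"), ("en", "Aztecs")),
    (("hr", "Zagreb"), ("en", "Zagreb")),
]
entries = merge_lexicons(links)
```

In this toy run the two `sr` titles are kept together in one semicolon-separated field, mirroring the "Астеци; Империја Астека" example from the description.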
This item is Publicly Available
and licensed under:
 Files in this item
Name
archive.zip
Size
96.65 MB
Format
application/zip
Description
zip
MD5
c09af9268e87a2e46a17ce3f1fb3d406
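The MD5 value listed above can be used to check a downloaded copy of archive.zip. The following is a minimal sketch, assuming the file has been saved locally; the helper names `md5_of_file` and `matches_listed_md5` and the local path are hypothetical, and only the checksum itself comes from this record.

```python
import hashlib

# MD5 checksum as listed for archive.zip in this record.
EXPECTED_MD5 = "c09af9268e87a2e46a17ce3f1fb3d406"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path):
    """Return True if the file at `path` has the checksum listed above."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `matches_listed_md5("archive.zip")` would return `True` for an intact download saved under that (assumed) name in the current directory.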