CESAR Aligned Wikipedia Headwords List
Please use the following text to cite this item:
Ljubešić, Nikola and Tadić, Marko, 2013, CESAR Aligned Wikipedia Headwords List, HR-CLARIN, http://hdl.handle.net/20.500.14615/2-33
Authors
Item identifier
Date issued
2013
Size
762662 entries
Description
The 762,662 entries of the lexicon are built from the Wikipedia dumps of the six CESAR languages, using article titles and interlingual links to English and to the remaining five CESAR languages. In the first phase, one lexicon is built for each CESAR language. These lexicons are then merged by grouping together all entries that are connected by interlingual links. If more than one article in a language is connected to a group of articles in other languages (such cases are actually errors in the structure of the Wikipedias), all article titles are retained, separated by semicolons. An example of such an entry is "Астеци; Империја Астека". In the final phase, category information from the English Wikipedia is added, with categories separated by semicolons, and for each non-English entry the number of links to that page in the Wikipedia of the respective language is given.
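The merging phase described above can be pictured as finding connected groups of article titles. The following is only a minimal sketch of that idea, not the authors' actual code: the function name `merge_lexicons`, the `(language, title)` pair representation, the language codes, and the sample links are all assumptions made for illustration, and a union-find structure is used here simply as one convenient way to group linked entries.

```python
from collections import defaultdict


def merge_lexicons(links):
    """Group (language, title) pairs that are connected by interlingual links.

    `links` is an iterable of ((lang, title), (lang, title)) pairs.
    This input format is hypothetical and chosen only for the sketch.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own group, then walk up to the root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for source, target in links:
        union(source, target)

    # Collect the titles of each group, keyed by language.
    groups = defaultdict(lambda: defaultdict(list))
    for lang, title in list(parent):
        groups[find((lang, title))][lang].append(title)

    # Several titles in one language are all kept, separated by semicolons.
    return [
        {lang: "; ".join(sorted(titles)) for lang, titles in by_lang.items()}
        for by_lang in groups.values()
    ]


# Illustrative links only; apart from the Serbian titles quoted in the
# description, these pairs are not taken from the dataset.
sample_links = [
    (("sr", "Астеци"), ("en", "Aztecs")),
    (("sr", "Империја Астека"), ("en", "Aztecs")),
    (("hr", "Azteci"), ("en", "Aztecs")),
]
entries = merge_lexicons(sample_links)
```

In this sketch the two Serbian titles end up in the same group because both link to the same English article, which mirrors the semicolon-separated entry quoted in the description.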
Acknowledgement
European Commission
Project code: CIP-ICT-PSP-2009-4: 271022
Project name: Central and South-East European Resources
Subject(s)
Collections