

lexicalConceptualResource HR-CLARIN: FFZG
Author(s):
Description:
The 762,662 entries of the lexicon are built from the Wikipedia dumps of the six CESAR languages, using article titles and interlingual links to English and to the remaining five CESAR languages. In the first phase, one lexicon is built for each CESAR language; those lexicons are then merged by grouping together all entries that are connected by interlingual links. If more than one article of a language is connected to a group of articles in other languages (such cases are actually errors in the structure of the Wikipedias), all article titles are retained, separated by a semicolon. An example of such an entry is "Астеци; Империја Астека". In the final phase, category information from the English Wikipedia is added, with categories separated by semicolons, and for each non-English entry the number of links to that page in the Wikipedia of the respective language is given.
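The merging step described above can be sketched as grouping article titles into connected components. This is only an illustrative sketch, not the method actually used to build the lexicon: the function name `merge_lexicons`, the `(language, title)` pair representation, the language codes, and the sample links (including the English title) are all assumptions made for the example.

```python
from collections import defaultdict

def merge_lexicons(links):
    """Group (language, title) pairs connected by interlingual links.

    `links` is an iterable of ((lang, title), (lang, title)) pairs.
    Returns one dict per group, mapping language -> title(s); several
    titles of the same language are joined with "; ".
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the titles of every connected component, per language.
    groups = defaultdict(lambda: defaultdict(list))
    for node in list(parent):
        lang, title = node
        groups[find(node)][lang].append(title)

    return [
        {lang: "; ".join(sorted(titles)) for lang, titles in group.items()}
        for group in groups.values()
    ]

# Hypothetical links, for illustration only.
sample_links = [
    (("sr", "Астеци"), ("en", "Aztecs")),
    (("sr", "Империја Астека"), ("en", "Aztecs")),
    (("hr", "Zagreb"), ("en", "Zagreb")),
]
entries = merge_lexicons(sample_links)
```

With these sample links, the two Serbian titles end up in the same group and are stored as a single semicolon-separated field, mirroring the example entry quoted in the description.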
This item contains 1 file (96.65 MB).
Publicly Available
lexicalConceptualResource HR-CLARIN: FFZG
Author(s):
Description:
The Croatian Automatic Collocations Dictionary was created by Lexical Computing Ltd. and has been made available to the research community as part of the CESAR project deliverables.
This item contains 1 file (34.19 MB).
Publicly Available
lexicalConceptualResource HR-CLARIN: FFZG
Author(s):
Description:
This resource contains sets of n-grams of different sizes (from 1 to 3) computed from the Croatian National Corpus v2.5. N-grams were computed both from lowercased text and from text in its original character case. For every n above one (i.e. for bigrams and trigrams), n-grams were computed in two ways: once taking into account only n-grams that appear within a sentence, and once also including those that span sentence boundaries. Regarding the tokenization of the corpus, a token is considered to be a continuous sequence of non-whitespace characters. Punctuation marks are treated as separate tokens. Complex punctuation is tokenized as a sequence of simple punctuation marks. The resource consists of 10 text files, each computed with a different combination of parameters (i.e. n-gram length, character case, sentence boundaries). Each line in a file represents one unique n-gram and its absolute frequency in the corpus, separated by a tab character. N-grams are ordered by frequency, from highest to lowest. The n-gram lists were produced using the methodology and tools developed by the CESAR Polish partner IPIPAN.
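The counting scheme described above can be illustrated with a minimal sketch. It is not the IPIPAN tooling: the regular expression, the function names, the tie-breaking order, and the sample sentences are assumptions chosen to mirror the description (every punctuation character as its own token, optional lowercasing, n-grams either confined to a sentence or allowed to span sentence boundaries, and tab-separated output sorted by descending frequency).

```python
import re
from collections import Counter

# Assumption: a token is a run of word characters or one punctuation
# character, so "..." becomes three separate "." tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text, lowercase=False):
    if lowercase:
        text = text.lower()
    return TOKEN_RE.findall(text)

def count_ngrams(sentences, n, cross_boundaries=False, lowercase=False):
    """Count n-grams over a list of sentence strings."""
    tokenized = [tokenize(s, lowercase) for s in sentences]
    if cross_boundaries:
        # Treat the corpus as one token stream, so n-grams may span sentences.
        streams = [[tok for sent in tokenized for tok in sent]]
    else:
        # Keep every n-gram inside a single sentence.
        streams = tokenized
    counts = Counter()
    for tokens in streams:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def format_lines(counts):
    """One 'n-gram<TAB>frequency' line per n-gram, most frequent first.

    Ties are broken alphabetically; that ordering is an assumption.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{ngram}\t{freq}" for ngram, freq in ordered]

# Hypothetical two-sentence corpus, for illustration only.
sentences = ["Dobar dan.", "Dobar dan svima..."]
within = count_ngrams(sentences, 2)
across = count_ngrams(sentences, 2, cross_boundaries=True)
```

In this sketch the bigram `. Dobar` appears only in `across`, because it joins the end of the first sentence to the start of the second. Varying n-gram length, case, and boundary handling gives two unigram lists plus four lists each for bigrams and trigrams, which is consistent with the 10 files mentioned in the description.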
This item contains 1 file (67.8 MB).
Publicly Available
Most Viewed Items - Last Month
corpus HR-CLARIN: FFZG
Author(s):
Description:
Kindly refer to the following publication for additional information about the data sources: https://www.croris.hr/crosbi/publikacija/prilog-skup/849552
Publicly Available
corpus HR-CLARIN: FFZG
Author(s):
Description:
The corpus contains originals and translations in all seven languages, and the order of the segments has been changed. The first version (RomCro v.1.0) was published in 2022. RomCro v.2.0 contains 33 original texts, 213 texts in total, 166,738 translation units and 19.4 million words, an increase of 3.7 million compared to the previous version. In comparison to v.1.0, v.2.0 also contains texts in Catalan.
This item contains 2 files (301.01 MB).
Publicly Available