Croatian n-grams

This resource contains sets of n-grams of different sizes (from 1 to 3) computed from the Croatian National Corpus v2.5. N-grams were computed both from lowercased text and from text in its original character case. For every n greater than one (i.e. for bigrams and trigrams), n-grams were computed in two ways: counting only those that appear within a sentence, and counting those that span sentence boundaries as well. Regarding the tokenization of the corpus, a token is considered to be a continuous sequence of non-whitespace characters. Punctuation marks are treated as separate tokens. Complex punctuation is tokenized as a sequence of simple punctuation marks. The resource consists of 10 text files, each computed with a different combination of parameters (i.e. n-gram length, character case, sentence boundaries): 2 files for unigrams, 4 for bigrams and 4 for trigrams. Each line in a file represents one unique n-gram and its absolute frequency in the corpus, separated by a tab character. N-grams are ordered by frequency, from highest to lowest. The n-gram lists were produced using the methodology and tools developed by the CESAR Polish partner IPIPAN.
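The line format described above (one n-gram and its absolute frequency per line, separated by a tab) could be read along the following lines. This is only a minimal sketch: the function name `read_ngrams` is hypothetical, the sample lines are invented for illustration and are not taken from the actual files, and the code assumes that the frequency is the last tab-separated field on each line.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_ngrams(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (ngram, frequency) pairs from tab-separated lines.

    Assumption: the frequency is the last tab-separated field;
    everything before the last tab is kept as the n-gram text.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        ngram, sep, freq = line.rpartition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        yield ngram, int(freq)


# Invented sample data in the described format (not real corpus counts).
sample = io.StringIO("u Zagrebu\t3\nje bio\t2\n")
pairs = list(read_ngrams(sample))
total = sum(freq for _, freq in pairs)
```

For a file on disk, the same function can be given an open file object instead of the `io.StringIO` sample. The character encoding of the files is not stated here, so it would have to be checked before opening them.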

Publisher

University of Zagreb, Faculty of Humanities and Social Sciences, Department of Information Sciences

Acknowledgement