Please use the following text to cite this item:
Tadić, Marko, 2013, Croatian n-grams, HR-CLARIN, http://hdl.handle.net/20.500.14615/2-31
dc.contributor.author: Tadić, Marko
dc.date.accessioned: 2025-09-17T09:40:17Z
dc.date.available: 2025-09-17T09:40:17Z
dc.date.issued: 2013
dc.description: This resource contains sets of n-grams of sizes 1 to 3 computed from the Croatian National Corpus v2.5. N-grams were computed both from lowercased text and from text in its original character case. For every n above one (i.e. for bigrams and trigrams), n-grams were computed in two ways: taking into account only n-grams that appear within a sentence, and also taking into account n-grams that cross sentence boundaries. For tokenization, a token is a continuous sequence of non-whitespace characters. Punctuation marks are treated as separate tokens, and complex punctuation is tokenized as a sequence of simple punctuation marks. The resource consists of 10 text files, each computed with a different combination of parameters (n-gram length, character case, sentence boundaries). Each line in a file holds one unique n-gram and its absolute frequency in the corpus, separated by a tab character. N-grams are ordered by frequency, from highest to lowest. The n-gram lists were produced using the methodology and tools developed by IPIPAN, the Polish partner in CESAR.
dc.identifier.uri: http://hdl.handle.net/20.500.14615/2-31
dc.language.iso: hrv
dc.publisher: University of Zagreb, Faculty of Humanities and Social Sciences, Department of Information Sciences
dc.rights: The MIT Licence
dc.rights.label: PUB
dc.rights.uri: https://zzl-ffzg.mit-license.org/
dc.subject: n-grams
dc.subject: Croatian language
dc.subject: Croatian National Corpus
dc.title: Croatian n-grams
dc.type: lexicalConceptualResource
local.contact.person: Marko Tadić, marko.tadic@ffzg.hr, Faculty of Humanities and Social Sciences, University of Zagreb
local.files.count: 1
local.files.size: 71096144
local.has.files: yes
local.language.name: Croatian
local.size.info: 8681475 entries
local.sponsor: euFunds CIP-ICT-PSP-2009-4: 271022 European Commission Central and South-East European Resources
metashare.ResourceInfo#ContentInfo.detailedType: computationalLexicon
metashare.ResourceInfo#ContentInfo.mediaType: text
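
The tokenization and counting rules given in the description can be illustrated with a minimal Python sketch. It is not the IPIPAN tooling used to build the resource. The function names (`tokenize`, `count_ngrams`, `format_lines`), the regular expression, the single space that joins the tokens of an n-gram, and the alphabetical tie-break between equally frequent n-grams are all assumptions made for illustration; the record only states that each line holds an n-gram and its frequency separated by a tab, sorted from highest to lowest frequency.

```python
import re
from collections import Counter

# Assumption: word characters form one token, and every other non-whitespace
# character (punctuation) is its own token, so "..." becomes ".", ".", ".".
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(sentence, lowercase=False):
    """Split one sentence into tokens, optionally lowercasing it first."""
    if lowercase:
        sentence = sentence.lower()
    return TOKEN_RE.findall(sentence)


def count_ngrams(sentences, n, cross_boundaries=False, lowercase=False):
    """Count n-grams within sentences only, or across sentence boundaries."""
    tokenized = [tokenize(s, lowercase) for s in sentences]
    if cross_boundaries:
        # Treat the whole text as a single token stream.
        streams = [[tok for sent in tokenized for tok in sent]]
    else:
        # Each sentence is its own stream, so no n-gram spans two sentences.
        streams = tokenized
    counts = Counter()
    for tokens in streams:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def format_lines(counts):
    """Render 'n-gram<TAB>frequency' lines, highest frequency first."""
    # Assumption: ties are broken alphabetically; tokens are joined by a space.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [" ".join(ngram) + "\t" + str(freq) for ngram, freq in ordered]
```

Calling `count_ngrams(sentences, 2)` and `count_ngrams(sentences, 2, cross_boundaries=True)` on the same input shows the difference between the two modes: only the second can contain a bigram made of the last token of one sentence and the first token of the next.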
This item is Publicly Available
and licensed under: The MIT Licence
Files in this item
Name: archive.zip
Size: 67.8 MB
Format: application/zip
Description: zip
MD5: d761a4acdd4e240abd5f6afc929834e1