Publication date: 4 July 2024

How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams

Predicting the number of distinct word n-grams and their frequency distributions in text corpora is important in domains such as information processing and language modelling. With big data corpora, the large volume of data increases application complexity. Traditional studies have been confined to small or moderate-size corpora, and have led to statistical laws on word frequency distribution. However, for very large corpora, some of the assumptions underlying those laws need to be revised, namely those related to the corpus vocabulary and the numbers of word occurrences. So, although it becomes critical to know how corpus size influences those distributions, there is a lack of models that characterize such influence. We propose a model that aims to fill this gap, enabling the prediction of the impact of corpus growth on application time and space complexities. The model is fully principled and, distinctively, considers words and multiwords in very large corpora. For a given language, and assuming a finite n-gram vocabulary, it predicts, as a function of corpus size, the cumulative number of distinct n-grams whose frequency is greater than or equal to a given value, as well as the sizes of equal-frequency n-gram groups, from unigrams to hexagrams. The model applies to low occurrence frequencies, which encompass the large populations of n-grams. Practical assessment with real corpora shows relative errors of around 3%, stable over the considered ranges of n-gram frequencies, n-gram sizes and corpus sizes from millions to billions of words, for English and French.
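To make the two quantities named in the abstract concrete, the sketch below counts them empirically on a toy sentence. It is not the model presented in the talk, only an illustration of what such a model would predict; the function names `ngrams` and `frequency_profile` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield every contiguous word n-gram of length n as a tuple."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def frequency_profile(tokens, n):
    """Return (group_sizes, cumulative) for the n-grams of a token list.

    group_sizes[k] is the number of distinct n-grams occurring exactly k times
    (the size of an equal-frequency group); cumulative(k) is the number of
    distinct n-grams occurring at least k times.
    """
    counts = Counter(ngrams(tokens, n))       # n-gram -> occurrence count
    group_sizes = Counter(counts.values())    # frequency -> number of n-grams

    def cumulative(k):
        return sum(size for freq, size in group_sizes.items() if freq >= k)

    return group_sizes, cumulative

# Toy corpus; hypothetical whitespace tokenization, for illustration only.
tokens = "the cat sat on the mat and the cat sat on the rug".split()
group_sizes, cumulative = frequency_profile(tokens, 2)

for k in sorted(group_sizes):
    print(f"bigrams with frequency == {k}: {group_sizes[k]}, "
          f"with frequency >= {k}: {cumulative(k)}")
```

Running the same count on corpora of increasing size would give the empirical curves against which predictions of this kind can be compared.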


Joaquim Ferreira da Silva (NOVA FCT - NOVA LINCS)

Date: 10 July 2024, 2:00 pm
Location: DI Seminars Room and Zoom
Bio: Joaquim Ferreira da Silva is an assistant professor and holds a Ph.D. in Computer Science from the Universidade Nova de Lisboa (2004). His research interests are focused on Information Extraction, Text Mining and Machine Learning. He publishes regularly at scientific events in those areas and has been involved in several projects, such as ISTRION, VIP-ACCESS, PATRAS and WE-LEARN. Since 2005, he has been co-chair of TeMA, a track of the EPIA conference. He has served as a member of conference program committees, such as those of EPIA, ICCS and the MWE ACL workshop. For some years now, his work has focused on the distribution of n-grams in large text corpora, in addition to classifying cetacean vocalizations. He is an integrated member of the NOVA LINCS research laboratory.