Detail

Publication date: 27 January 2026

A Scalable Model of Frequency Distribution of Low Occurrence Multi-words Towards Handling Very Large Spectrum of Text Corpora Sizes

Predicting the diversity of words and multi-words (n-grams) in a text corpus, together with their frequency distributions, is important in NLP and language modelling, and is becoming critical to the design of modern applications such as Large Language Models, e.g. for guiding tokenization and corpus analysis for pre-training. This requires modelling the behaviour of very large corpora, handling multi-words as subwords or phrases, and capturing the distribution of n-grams across different frequency ranges, in particular the low-occurrence n-grams.
A scalable model is presented to predict the number of distinct n-grams and their frequency distributions over an extended range of corpus sizes, from hundreds of millions to hundreds of billions of words (a factor of 1000). This led to a novel approach that explicitly incorporates into the model the dependency of its parameters on corpus size across this extended range. Over such a range of corpus sizes, the model estimates, for a given language corpus, the cumulative numbers of distinct n-grams (1 ≤ n ≤ 6) occurring with frequency greater than or equal to a given k ≥ 1, as well as the numbers of n-grams with equal frequencies. Unlike most approaches, which assume an open, potentially infinite, word vocabulary, this model relies on the finiteness of the vocabulary. The model ensures very low and stable average relative errors (circa 2%) for the low frequencies, starting with singletons, from 1-grams to 6-grams, across the above very large range of corpus sizes, in English and German.
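As a concrete illustration of the quantities the model predicts, the sketch below (a minimal example, not the authors' implementation) computes, for a toy token sequence, the empirical counterparts: the number of distinct n-grams with frequency at least k, and the number with frequency exactly k.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def distinct_at_least(counts, k):
    """Number of distinct n-grams occurring at least k times (cumulative count)."""
    return sum(1 for c in counts.values() if c >= k)

def equal_frequency(counts, k):
    """Number of distinct n-grams occurring exactly k times (equal-frequency count)."""
    return sum(1 for c in counts.values() if c == k)

tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(tokens, 2)
print(distinct_at_least(bigrams, 1))  # 7 distinct bigrams in total
print(equal_frequency(bigrams, 1))    # 6 of them are singletons
```

The model in the talk predicts these cumulative and equal-frequency counts analytically from the corpus size, rather than counting them, which becomes the hard part at the hundreds-of-billions-of-words scale.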

Presenter


URL https://meet.google.com/fup-ddqu-iox
Date 04/02/2026 2:00 pm
Location DI Seminars Room and Google Meet
Host Bio Joaquim Ferreira da Silva is an assistant professor and holds a PhD in Computer Science from the Universidade Nova de Lisboa (2004). His research interests are focused on Information Extraction, Text Mining and Machine Learning. He publishes regularly at scientific events in those areas and has been involved in several projects, such as ISTRION, VIP-ACCESS, PATRAS and WE-LEARN. Since 2005, he has been co-chair of TeMA / NLP-TeMA, a track of the EPIA conference. He has served as a member of several conference Program Committees, such as EPIA, ICCS, ECMLPKDD and the MWE ACL workshop. Recently he has been working on n-gram distribution in big corpora. He is an integrated member of the NOVALINCS research laboratory.