A Model for Predicting n-gram Frequency Distribution in Large Corpora
Jun 2021
The statistical extraction of multiwords (n-grams) from natural
language corpora is challenged by computationally heavy searching
and indexing, which can be improved by low-error prediction of the
n-gram frequency distributions. For different n-gram sizes (n ≥ 1), we model
the sizes of groups of equal-frequency n-grams, for the low frequencies
k = 1, 2, ..., by predicting the influence of the corpus size on the exponent
of Zipf's law and on the n-gram group size. The average relative errors of
the model predictions, from 1-grams up to 6-grams, are near 4%, for
English and French corpora from 62 million to 8.6 billion words.
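
To make the modelled quantity concrete, the following minimal sketch counts, for a given n, how many distinct n-grams occur exactly k times in a tokenized text, for k = 1, 2, ..., max_k. It only measures the empirical group sizes that such a model would aim to predict; it is not the prediction model itself. The function names (`ngrams`, `group_sizes`), the whitespace tokenization, and the toy sentence are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield each contiguous n-gram of `tokens` as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def group_sizes(tokens, n, max_k):
    """Return {k: number of distinct n-grams occurring exactly k times}
    for k = 1..max_k (hypothetical helper, not taken from the paper)."""
    freq = Counter(ngrams(tokens, n))   # n-gram -> frequency in the text
    sizes = Counter(freq.values())      # frequency k -> number of distinct n-grams
    return {k: sizes.get(k, 0) for k in range(1, max_k + 1)}


# Toy corpus (assumption): whitespace-tokenized, ten words.
tokens = "the cat sat on the mat and the cat ran".split()
print(group_sizes(tokens, 1, 3))  # {1: 5, 2: 1, 3: 1}
print(group_sizes(tokens, 2, 3))  # {1: 7, 2: 1, 3: 0}
```

On a real corpus, computing these counts exactly requires indexing every n-gram, which is the cost that a low-error prediction of the group sizes is meant to reduce.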