In proceedings details

A Model for Predicting n-gram Frequency Distribution in Large Corpora

Jun 2021

The statistical extraction of multiwords (n-grams) from natural language corpora is challenged by computationally heavy searching and indexing, which can be improved by low-error prediction of the n-gram frequency distributions. For different n-gram sizes (n>=1), we model the sizes of groups of equal-frequency n-grams, for the low frequencies, k = 1, 2,..., by predicting the influence of the corpus size upon the Zipf's law exponent and the n-gram group size. The average relative errors of the model predictions, from 1-grams up to 6-grams, are near 4%, for English and French corpora from 62 million to 8.6 billion words.
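
As an illustration of the quantity the abstract refers to, the following minimal sketch counts, for a given n, how many distinct n-grams occur exactly k times in a tokenized text. It is not the authors' predictive model; it only computes the observed group sizes that such a model would try to predict. The helper names `ngrams` and `equal_frequency_group_sizes` and the toy sentence are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over successive n-grams (as tuples) of a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def equal_frequency_group_sizes(tokens, n):
    """Map each frequency k to the number of distinct n-grams occurring exactly k times."""
    freq = Counter(ngrams(tokens, n))   # n-gram -> number of occurrences
    return Counter(freq.values())       # k -> number of distinct n-grams with frequency k


# Toy corpus (hypothetical); real corpora in the abstract range up to billions of words.
tokens = "the cat sat on the mat and the cat ran".split()
sizes_1 = equal_frequency_group_sizes(tokens, 1)
sizes_2 = equal_frequency_group_sizes(tokens, 2)
```

On a real corpus, exact counting like this is the computationally heavy step that the abstract says can be eased by predicting the group sizes instead.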

Organization: ICCS

Publisher: Springer

Authors: Joaquim Ferreira da Silva, José Cardoso e Cunha

Publication Date: 1 Jun 2021
