In proceedings details

  • A Model for Predicting n-gram Frequency Distribution in Large Corpora
  • Jun 2021
  • The statistical extraction of multiwords (n-grams) from natural language corpora is challenged by computationally heavy searching and indexing, which can be improved by low-error prediction of the n-gram frequency distributions. For different n-gram sizes (n ≥ 1), we model the sizes of groups of equal-frequency n-grams, for the low frequencies, k = 1, 2, ..., by predicting the influence of the corpus size upon the Zipf's law exponent and the n-gram group size. The average relative errors of the model predictions, from 1-grams up to 6-grams, are near 4%, for English and French corpora from 62 million to 8.6 billion words.
  • ICCS
  • Springer
  • Joaquim Ferreira da Silva, José Cardoso e Cunha
  • 1 Jun 2021
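
To make the quantity named in the abstract concrete, the following is a minimal sketch of how the sizes of equal-frequency n-gram groups could be measured empirically on a tokenized text. It is not the authors' prediction model; it only counts, for each frequency k, how many distinct n-grams occur exactly k times. The function names `ngrams` and `equal_frequency_group_sizes` and the toy sentence are assumptions introduced for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def equal_frequency_group_sizes(tokens, n):
    """Map each frequency k to the number of distinct n-grams occurring exactly k times.

    Hypothetical helper: an empirical count, not the model described in the paper.
    """
    ngram_counts = Counter(ngrams(tokens, n))   # n-gram -> frequency
    return Counter(ngram_counts.values())       # frequency k -> group size


# Toy corpus (illustrative only; real corpora have millions of words).
tokens = "the cat sat on the mat and the cat ran".split()

unigram_groups = equal_frequency_group_sizes(tokens, 1)
bigram_groups = equal_frequency_group_sizes(tokens, 2)

for k in sorted(unigram_groups):
    print(f"1-grams with frequency {k}: {unigram_groups[k]}")
for k in sorted(bigram_groups):
    print(f"2-grams with frequency {k}: {bigram_groups[k]}")
```

On a large corpus, this exact counting is the computationally heavy step that, according to the abstract, a low-error prediction of the group sizes is meant to reduce.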