Seminar details

  • An Empirical Model for n-gram Frequency Distribution in Large Corpora
  • Statistical multiword extraction methods can benefit from knowledge of the n-gram (n >= 1) frequency distribution in natural language corpora, for indexing and for time/space optimization purposes. The appearance of increasingly large corpora raises new challenges in investigating the large-scale behavior of n-gram frequency distributions, which do not typically emerge in small-scale corpora. We propose an empirical model, based on the assumption of finite n-gram language vocabularies, to estimate the number of distinct n-grams in large corpora, as well as the sizes of the equal-frequency n-gram groups found at the lowest frequencies, starting from 1. The model was validated for n-grams with 1 <= n < 7 on a wide range of real corpora in English and French, from 60 million up to 8 billion words. These are full, non-truncated corpus data; that is, their associated frequency data include the entire range of observed n-gram frequencies, from 1 up to the maximum. The model predicts that the number of distinct n-grams grows monotonically, reaching asymptotic plateaux as the corpus size grows to infinity. It also predicts that the sizes of the equal-frequency n-gram groups are non-monotonic as a function of the corpus size.
  • 15/07/2020 13:00
  • Joaquim Ferreira da Silva is an assistant professor and holds a PhD in Computer Science from the Universidade Nova de Lisboa (2004). His research interests focus on Information Extraction, Text Mining and Machine Learning. He publishes regularly at scientific events in those areas and has been involved in several projects, such as ISTRION, VIP-ACCESS, PATRAS and WE-LEARN. Since 2005, he has been co-chair of TeMA, a track of the EPIA conference. He has served as a member of several conference Program Committees, such as EPIA, ICCS and the MWE ACL workshop. Recently, he has been working on n-gram distribution in big corpora. He is an integrated member of the NOVALINCS research laboratory.
  • Joaquim Ferreira da Silva
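
As a reading aid for the abstract above, the two quantities the model estimates can be illustrated on a toy corpus. The sketch below is not the speaker's model: it only counts, for a given n, the number of distinct n-grams and the size of each equal-frequency group (how many distinct n-grams occur exactly k times). The function names, the whitespace tokenization and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of n consecutive tokens) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def equal_frequency_group_sizes(counts):
    """Map each frequency k to the number of distinct n-grams occurring exactly k times."""
    return Counter(counts.values())


# Toy corpus (hypothetical); real corpora in the abstract range from
# 60 million to 8 billion words.
tokens = "the cat sat on the mat and the cat ran".split()

bigrams = ngram_counts(tokens, 2)
distinct_bigrams = len(bigrams)                # number of distinct 2-grams
groups = equal_frequency_group_sizes(bigrams)  # e.g. groups[1] = 2-grams seen once

print(distinct_bigrams, dict(groups))
```

Repeating this count on growing prefixes of a corpus gives the observed curves (distinct n-grams and group sizes versus corpus size) against which a model of this kind would be compared.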