Seminar details

n-gram Cache Performance in Statistical Extraction of Relevant Terms in Large Corpora

Abstract: Statistical extraction of relevant n-grams from natural language corpora is important for text indexing and classification because it can be language-independent. We show how a theoretical model identifies the distribution properties of the distinct n-grams and singletons appearing in large corpora, and how this knowledge contributes to understanding the performance of an n-gram cache system used for the extraction of relevant terms. This approach allowed us to evaluate the benefits of using Bloom filters to exclude singletons and of statically prefetching nonsingletons in an n-gram cache. In the context of the distributed and parallel implementation of the LocalMaxs extraction method, we analyze the cache miss ratio and size, and the efficiency of n-gram cohesion calculation with LocalMaxs.

Date: 23/01/2019 14:00

Host: Computer Systems

Speaker Bio: Carlos Gonçalves is a Professor at ADEETC/ISEL/IPL in Portugal and a researcher at NOVA-LINCS (@FCT/UNL) and GIATSI (ADEETC/ISEL/IPL). His main line of research focuses on parallel and distributed computing, including the processing of natural language corpora in the Big Data domain. He holds a PhD in Informatics from Faculdade de Ciências e Tecnologia / Universidade Nova de Lisboa.

Speaker: Carlos Gonçalves