Publication date: 1 June 2021
n-gram Cache Performance in Statistical Extraction of Relevant Terms in Large Corpora
Abstract: Statistical extraction of relevant n-grams from natural language corpora is important for text indexing and classification because it can be language independent. We show how a theoretical model identifies the distribution properties of the distinct n-grams and singletons appearing in large corpora, and how this knowledge helps explain the performance of an n-gram cache system used for extraction of relevant terms. This approach allowed us to evaluate the benefits of using Bloom filters to exclude singletons and of statically prefetching nonsingletons in an n-gram cache. In the context of the distributed and parallel implementation of the LocalMaxs extraction method, we analyze the cache miss ratio and cache size, and the efficiency of n-gram cohesion calculation with LocalMaxs.
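As an illustration of the singleton-exclusion idea mentioned in the abstract, the following is a minimal Python sketch of one common way a Bloom filter can keep singletons out of a count cache: the first occurrence of an n-gram is recorded only in the filter, and an n-gram enters the cache when it is seen again. The names `BloomFilter` and `count_nonsingletons`, the filter parameters, and the double-hashing scheme are assumptions made for this example; the abstract does not describe the actual cache design used with LocalMaxs.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter (hypothetical helper, not from the seminar)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def count_nonsingletons(tokens, n, bloom):
    """Count n-grams, storing only those seen at least twice in the cache."""
    cache = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        if ngram in cache:
            cache[ngram] += 1      # already known to be a nonsingleton
        elif ngram in bloom:
            cache[ngram] = 2       # second sighting: promote into the cache
        else:
            bloom.add(ngram)       # first sighting: remember in the filter only
    return cache


if __name__ == "__main__":
    counts = count_nonsingletons("a b a b c".split(), 2, BloomFilter())
    print(dict(counts))
```

Because a Bloom filter can return false positives, a true singleton may occasionally be promoted into the cache with a count of 2; sizing the filter for the expected number of distinct n-grams keeps that rate low.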
Date | 23/01/2019 |
---|---|
State | Concluded |