Publication date: 1 June 2021

Identification of Document Language in Hard Contexts

Automatically determining the language in which a document is written is not yet a completely solved problem. It is generally treated as a classification problem and, in the most common situations, namely documents written in a single language, the results obtained are 100 % precise. However, some texts are hard to classify, and there is currently no reliable solution for them. Hard texts include short tourist advertisements on the web that address foreigners but name local entities mostly with words from the local language, and texts written in both a national language and English to address two linguistic communities.
In this work, we present a statistics-based Language Identification (LID) approach that relies on a covariance similarity measure. This methodology is shown to be 100 % correct for normal texts written in 19 languages, and it remains robust when classifying both short normal texts and hard language identification documents.
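
The abstract does not say which features or which exact formula the approach uses, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes character bigram relative frequencies as features and a covariance normalized by the two standard deviations (that is, a Pearson correlation) as the similarity measure. All function names and the toy training sentences are hypothetical and chosen for illustration only.

```python
from collections import Counter
import math


def ngram_profile(text, n=2):
    """Relative frequencies of character n-grams (assumed feature set)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}


def covariance(p, q, features):
    """Covariance of two frequency profiles over a shared feature list."""
    mean_p = sum(p.get(f, 0.0) for f in features) / len(features)
    mean_q = sum(q.get(f, 0.0) for f in features) / len(features)
    return sum(
        (p.get(f, 0.0) - mean_p) * (q.get(f, 0.0) - mean_q) for f in features
    ) / len(features)


def similarity(p, q):
    """Normalized covariance (Pearson correlation) between two profiles."""
    features = sorted(set(p) | set(q))
    if not features:
        return 0.0
    denom = math.sqrt(covariance(p, p, features) * covariance(q, q, features))
    return covariance(p, q, features) / denom if denom > 0 else 0.0


def identify(text, profiles):
    """Return the best-scoring language and the score for every language."""
    doc = ngram_profile(text)
    scores = {lang: similarity(doc, prof) for lang, prof in profiles.items()}
    return max(scores, key=scores.get), scores


# Toy training data (hypothetical; a real system would use large corpora).
PROFILES = {
    "en": ngram_profile(
        "the quick brown fox jumps over the lazy dog and the cat "
        "sleeps in the house near the river"
    ),
    "pt": ngram_profile(
        "o gato dorme na casa perto do rio e a raposa salta por cima "
        "do muro do jardim"
    ),
}

if __name__ == "__main__":
    language, scores = identify("the dog sleeps in the house", PROFILES)
    print(language, scores)
```

With profiles this small the scores are only indicative. The hard cases described above, such as short texts that mix local names with another language, are where the choice of features and of similarity measure would matter most.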

Authors


Date 24/01/2007
State Concluded