
Publication date: 1 June 2021

TEXTCAT

TEXTCAT enabled an extensive comparison of classifiers, feature selection metrics and document representations. It is described in the paper “Text Categorization: An extensive comparison of classifiers, feature selection metrics and document representation” by Filipa Peleja, Joaquim Ferreira da Silva and Gabriel Pereira Lopes. In: Luis Antunes, H. Sofia Pinto, Rui Prada and Paulo Trigo (Eds.), “Proceedings of the 15th Portuguese Conference in Artificial Intelligence, EPIA 2011, Lisbon, October, 2011”, ISBN 978-989-95618-4-7, pages 660-674, Instituto Superior Técnico (Portugal).

Abstract of this paper:

In this paper, on automatic text categorization, we extensively compare several aspects which include document representation, feature selection, three classifiers, and their application to two language text collections. Regarding the computational representation of documents, we compare the traditional bag of words representation with 4 other alternative representations: bag of multiwords and bag of word prefixes with N characters (for N = 4, 5 and 6). Concerning the feature selection we compare the well known feature selection metrics Information Gain and Chi-Square with a new one based on the third moment statistics which enhances rare terms. As to the classifiers, we compare the well known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on Mahalanobis distance. Finally, the study performed is language independent and was applied over two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).
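
As an illustration of the document representations named in the abstract, the following is a minimal sketch of a bag of words next to a bag of word prefixes with N characters. It is not the authors' implementation: the function names, the regular-expression tokenizer, the lowercasing, and the choice to keep words shorter than N unchanged are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then split on runs of word characters.
    # \w is Unicode-aware in Python 3, so accented Portuguese letters are kept.
    return re.findall(r"\w+", text.lower())


def bag_of_words(text):
    # Traditional representation: one feature per distinct word.
    return Counter(tokenize(text))


def bag_of_prefixes(text, n):
    # Alternative representation: one feature per distinct n-character prefix.
    # Assumption: words shorter than n characters are kept whole.
    return Counter(token[:n] for token in tokenize(text))


doc = "Categorization and categorizing of categories"
words = bag_of_words(doc)
prefixes = bag_of_prefixes(doc, 5)
```

In this toy document the three inflected forms count as three separate word features but collapse into the single prefix feature `categ`, which shows how prefix length changes the size of the feature space.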

Authors

Joaquim Ferreira da Silva, Gabriel Pereira Lopes, Filipa Peleja

Date 01/06/2011