Dissertations details

  • Unsupervised, language-independent multiword extraction, clustering, characterization, and classification of documents
  • Jan 2004
  • The extraction of Multiword Lexical Units (MWLUs) from corpora is useful in Computational Linguistics and Information Retrieval, but existing methodologies have limitations. When the approaches are totally or partially symbolic, their morphosyntactic filters must be redefined whenever MWLUs are to be extracted from a corpus written in a new language; moreover, these approaches usually require tagged corpora, and if the morphosyntactic rules are unknown the extraction is not possible. When statistical techniques are used, extraction is usually limited to two- or three-word sequences, so many MWLUs are left unextracted. These limitations motivated the development of a new extractor, presented in Part I of this thesis: the LiPXtractor. This methodology is purely statistical and is based on three tools developed during this thesis: a new cohesion measure between textual elements (words, characters or tags), the Symmetric Conditional Probability (SCP); a transformation that assigns cohesion values to sequences of textual elements of any length, the Fair Dispersion Point Normalization; and an algorithm for selecting textual units, the LocalMaxs. This approach extracts Multielement Textual Units (which include MWLUs) of any length, and it does not depend on the language of the corpus from which the elements are extracted. The lowest precision and recall values of this extractor are about 75%, occurring for small corpora. Part II of this thesis focuses on unsupervised document clustering, cluster topic extraction and classification of documents. In this approach, MWLUs are first extracted from the documents and then used as base features to characterize them, so the external topic lists required by other approaches are no longer needed.
By combining the statistical techniques presented in this approach with clustering software, document clusters are obtained under the assumption that clusters may not be hyper-spherical and may not have equal volumes in a k-dimensional space, a restriction usually ignored by other approaches. Precision and recall in this clustering process are about 90%, and these results were obtained for different languages. Based on the most informative MWLUs according to this approach, the topics revealing the cluster contents are also detected. Finally, new documents can be classified according to the previously obtained clusters; the precision and recall values of the proposed classification criterion are about 90%.
  • Faculdade de Ciências e Tecnologia, UNL
  • Joaquim Ferreira da Silva
  • http://terra.di.fct.unl.pt/~jfs/publicacoes/tese_final.ps
  • This is a thesis written in Portuguese, supervised by Gabriel Pereira Lopes and co-supervised by João Tiago Mexia
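  • The SCP measure with the Fair Dispersion Point Normalization (SCP_f) and the LocalMaxs selection rule described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis's implementation: it assumes the commonly published formulation (SCP_f of an n-gram is its squared probability divided by the average product of probabilities over all two-part splits; LocalMaxs keeps an n-gram whose glue is a local maximum relative to its contained (n-1)-grams and containing (n+1)-grams), and it omits practical details such as a minimum-frequency filter. All names are mine.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count all n-grams (as tuples) of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def scp_f(gram, counts, total):
    """SCP with the Fair Dispersion Point Normalization.

    SCP_f(w1..wn) = p(w1..wn)^2 / Avp, where Avp averages
    p(w1..wi) * p(w_{i+1}..wn) over every split point i.
    """
    p = lambda g: counts[g] / total
    n = len(gram)
    if n == 1:
        return p(gram)
    avp = sum(p(gram[:i]) * p(gram[i:]) for i in range(1, n)) / (n - 1)
    return p(gram) ** 2 / avp

def local_maxs(tokens, max_n=4):
    """Select n-grams (2 <= n <= max_n) whose SCP_f glue is a local
    maximum w.r.t. their (n-1)-gram subparts and (n+1)-gram superparts."""
    counts = ngram_counts(tokens, max_n + 1)
    total = len(tokens)
    glue = {g: scp_f(g, counts, total) for g in counts}
    selected = []
    for g in counts:
        n = len(g)
        if n < 2 or n > max_n:
            continue
        subs = [g[1:], g[:-1]]                      # contained (n-1)-grams
        supers = [s for s in counts                 # containing (n+1)-grams
                  if len(s) == n + 1 and (s[1:] == g or s[:-1] == g)]
        ok = all(glue[g] > glue[s] for s in supers)
        if n > 2:                                   # bigrams have no sub check
            ok = ok and all(glue[g] >= glue[x] for x in subs)
        if ok:
            selected.append(g)
    return selected

# Toy usage: "new york" recurs with varied context, so it is kept.
tokens = "i love new york and he visits new york while she left new york today".split()
print(local_maxs(tokens))
```

On a real corpus the candidate set would be restricted to n-grams above a frequency threshold, both for speed and to avoid spurious local maxima among hapax sequences.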