Project details

Leonardo-1 - New Computer Technologies for Access to Vast Arrays of Multilingual Information

Mar 2003 - Jul 2003

This project aimed to train 12 Bulgarian students from the Universities of Plovdiv and Sofia who had quite different backgrounds: linguistics, mathematics and statistics, and informatics. The main objective was to have those students work on several topics towards a system that helps users access information written in either English or Bulgarian, even if the user has not mastered one of those languages. In addition, a syntactic parser for Bulgarian was developed. The work included a formal course. Students were divided into 5 groups, which worked on the following topics:

1. Extraction of multiword lexical units (MWUs), suffixes and prefixes from corpora, regardless of the language. Three students worked on this project: two informatics students and one linguistics student. The linguistics student evaluated the quality of the automatically extracted multiword units, suffixes and prefixes. The result of this work was used in the next project.

2. Adaptation to Bulgarian of an existing information search engine used for Portuguese and other languages. This work aimed at preparing the framework for multilingual access to multilingual documents. Two informatics students were involved in this task. It included the basic machinery that uses MWUs to let users select the information they are really looking for. This included maintaining a dialogue in one language while looking for information in other languages, using translation equivalents extracted in another project (topic 4 below) to keep control of the interaction with the corpora and the user. If machine translation (MT) were available, it could be used at the end to show the user the information the system had actually found.

3. Clustering of named entities, independently of their length or the language in which they are written. This project was carried out by a mathematics and statistics student under the supervision of Joaquim Ferreira da Silva. It used features of multiword relevant expressions to filter out those that might not be named entities. It then used features internal to the MWUs (writing variation, length, etc.) together with model-based clustering analysis to cluster named entities, and quadratic discriminant scores to classify new named entities that had not been clustered (either because of their number or because they were extracted later).

4. Extraction of translation equivalents from the Bulgarian-English parallel corpora that were collected. This important project used the aligner and the extractor that had been built for the European project TRADAUT.PT within the MLIS European Programme. The idea was to use these translation equivalents, together with the information on suffixes and prefixes for both English and Bulgarian extracted in topic 1, to build a multilingual information access system for information written either in Bulgarian (for English users) or in English (for Bulgarian users). Due to some miscoordination between groups, this part of the work was less developed. Three students were involved.

5. Building a parser for Bulgarian. This work used Definite Clause Grammars and, later, Extraposition Grammars in order to handle long-distance movement. It was done by three linguistics students.
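As an illustration of the classification step in topic 3, the following is a minimal sketch of how quadratic discriminant scores can assign a new named entity to an existing cluster. It is not the project's code. The two features (length in words, share of capitalised words), the cluster labels and all data values are hypothetical, and the clusters are assumed to have been produced already by a model-based clustering step that is not shown.

```python
import math


def fit_cluster(points):
    """Estimate the mean and the 2x2 covariance of one cluster of 2-D vectors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def quadratic_score(x, mean, cov, prior):
    """Quadratic discriminant score of x for one Gaussian cluster:
    -0.5*log|S| - 0.5*(x-m)^T S^-1 (x-m) + log(prior)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * maha + math.log(prior)


def classify(x, clusters):
    """Return the label of the cluster with the highest quadratic score."""
    total = sum(len(pts) for pts in clusters.values())
    scores = {}
    for label, pts in clusters.items():
        mean, cov = fit_cluster(pts)
        scores[label] = quadratic_score(x, mean, cov, len(pts) / total)
    return max(scores, key=scores.get)


# Hypothetical clusters: (length in words, share of capitalised words).
clusters = {
    "cluster_a": [(2, 1.0), (3, 0.9), (2, 0.8), (3, 1.0)],
    "cluster_b": [(5, 0.5), (6, 0.6), (5, 0.4), (7, 0.5)],
}

print(classify((2, 0.9), clusters))
print(classify((6, 0.5), clusters))
```

Because each cluster keeps its own covariance matrix, the decision boundary between clusters is quadratic rather than linear, which is what distinguishes this score from a linear discriminant.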

Project type: ICI

Project reference:

Coordinated by:

CITI - FCT/UNL - Centro de Informática e Tecnologias de Informação, FCT/UNL

Funding entities:

European Commission - Research Directorate General (RTD)

Total Funding Amount: 21.648

Local Funding Amount: 21.648

Start Date: 1 Mar 2003

End Date: 26 Jul 2003

Participations: Joaquim Ferreira da Silva [Researcher], Gabriel Pereira Lopes [Coordinator], Vitor Rocio, Tiago Ildefonso

Partnerships

Plovdiv University

Sofia University

Url: