
Publication date: 1 June 2021

PArallelism for Machine Learning of TRAnslationS

Within the projects DIXIT (contract PRAXIS 2/2.1/TIT/1670/95) and TRADAUT-PT (European contract MLIS-4005 TRADAUT-PT 26192), our team developed an innovative, language-independent procedure for aligning bitexts (parallel texts, i.e., texts that are translations of each other or of a common source text). Using that alignment information and an indexing and retrieval engine acquired for the TRADAUT-PT project, we built a translation extractor, a monolingual and a bilingual concordancer, and an MT validator. However, because we do not have access to the source code of that engine, the translation extraction process is computationally heavy, and it is difficult to adapt these tools to deliver services within a reasonable response time. Worse, it is not easy to reuse the data structures already built so that available information can be exploited for any volume of parallel text (terabytes of data) and for any number of languages. Our aim is therefore to implement a new indexing engine, using adequate data structures (suffix arrays), adequate algorithms, and adequate computational architectures (grid computing), in order to solve the problems identified above.
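
To illustrate why suffix arrays suit this kind of indexing, here is a minimal sketch in Python. It is not the project's implementation: the function names `build_suffix_array` and `find_phrase` are hypothetical, the text is assumed to be already tokenized into words, and the naive construction shown here would not scale to terabytes. It only shows the core idea: once all suffixes are sorted, every occurrence of a phrase sits in one contiguous block that two binary searches can locate.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`, sorted lexicographically."""
    # Naive construction (sorts by full suffix); adequate for a sketch only.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_phrase(tokens, suffix_array, phrase):
    """Return the sorted start positions at which `phrase` occurs in `tokens`."""
    m = len(phrase)

    def bound(after_equal):
        # Binary search over suffixes, comparing only their first m tokens.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = tokens[start:start + m]
            if prefix < phrase or (after_equal and prefix == phrase):
                lo = mid + 1
            else:
                hi = mid
        return lo

    first = bound(False)  # first suffix whose prefix is >= phrase
    last = bound(True)    # first suffix whose prefix is > phrase
    return sorted(suffix_array[first:last])


# Hypothetical toy corpus, used only for illustration.
tokens = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(tokens)
print(find_phrase(tokens, sa, ["the", "cat"]))  # prints [0, 6]
```

Each lookup costs a number of comparisons logarithmic in the corpus size, independent of how many phrases are queried, which is the property that makes the structure attractive for concordancing and translation extraction over large parallel texts.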

With this project, we aim to deliver a Global Computing infrastructure for translation services, using an innovative translation technique (Translation as Information Retrieval) that became apparent during the preparation of the PATRAS project. We aim to apply our technique to 6 languages, including Portuguese, Spanish, French, German, and Czech, from 3 different families (Latin, Germanic and Slavic), for a total of 15 language pairs. Our goal is to build a national, European and worldwide service provider for the translation needs of all agents working in the Translation and Cross-Language Information Retrieval areas. We have the knowledge and expertise in this field; what remains is to make our services available by effectively tackling response speed, very large numbers of users, and access to terabytes of information for as many languages as possible. The larger the quantity of aligned parallel text and the number of languages covered, the higher the probability of direct translation success, even for languages for which there is no parallel corpus.

Team

Joaquim Ferreira da Silva, Gabriel Pereira Lopes

Short name: PATRAS
Funding Total: 80
Funding Center: 80
State: Concluded
Start date: 01/05/2005
End date: 30/11/2007