In proceedings details

First steps towards coverage-based document alignment

Aug 2016

In this paper we describe a method for selecting pairs of parallel documents (documents that are translations of each other) from a large collection of documents obtained from the web. Our approach is based on a coverage score that reflects the number of distinct bilingual phrase pairs found in each pair of documents, normalized by the total number of unique phrases found in them. Since parallel documents tend to share more bilingual phrase pairs than non-parallel documents, our alignment algorithm selects a pair of documents when its coverage score is the maximum among all possible pairings involving either one of the two documents.
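
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact normalization, so the score below assumes "matched bilingual phrase pairs divided by the number of unique phrases in both documents". The names `coverage`, `align`, and `lexicon`, and the representation of documents as lists of phrases, are hypothetical choices made for illustration.

```python
from itertools import product


def coverage(src_phrases, tgt_phrases, lexicon):
    """Assumed coverage score: distinct bilingual phrase pairs found in the
    document pair, divided by the number of unique phrases in both documents."""
    src, tgt = set(src_phrases), set(tgt_phrases)
    total = len(src) + len(tgt)
    if total == 0:
        return 0.0
    # lexicon is a set of (source_phrase, target_phrase) tuples (hypothetical format)
    matched = sum(1 for s, t in lexicon if s in src and t in tgt)
    return matched / total


def align(src_docs, tgt_docs, lexicon):
    """Keep a pair only if its score is the maximum over all pairings that
    involve either of its two documents (a mutual-best selection)."""
    if not src_docs or not tgt_docs:
        return []
    scores = {
        (s, t): coverage(sp, tp, lexicon)
        for (s, sp), (t, tp) in product(src_docs.items(), tgt_docs.items())
    }
    best_for_src = {s: max(tgt_docs, key=lambda t: scores[s, t]) for s in src_docs}
    best_for_tgt = {t: max(src_docs, key=lambda s: scores[s, t]) for t in tgt_docs}
    return [
        (s, t)
        for s, t in best_for_src.items()
        if best_for_tgt[t] == s and scores[s, t] > 0
    ]


# Toy data, invented for this example only
lexicon = {("good morning", "bom dia"), ("thank you", "obrigado"), ("cat", "gato")}
src_docs = {"en1": ["good morning", "thank you"], "en2": ["cat"]}
tgt_docs = {"pt1": ["bom dia", "obrigado"], "pt2": ["gato"]}
pairs = align(src_docs, tgt_docs, lexicon)
```

The sketch scores every source/target combination, which is quadratic in the number of documents; a system working on a large web collection would presumably need to restrict the candidate pairings, a detail the abstract does not cover.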

Publisher: Association for Computational Linguistics

Authors: Luís Gomes, Gabriel Pereira Lopes

Url: http://www.aclweb.org/anthology/W/W16/W16-2369.pdf

Notes: The system described in this paper was ranked first among 22 competing systems in the Bilingual Document Alignment Shared Task at WMT16.

Pages: 697 to 702

Publication Date: 1 Aug 2016