In proceedings details

  • First steps towards coverage-based document alignment
  • Aug 2016
  • In this paper we describe a method for selecting pairs of parallel documents (documents that are translations of each other) from a large collection of documents obtained from the web. Our approach is based on a coverage score that reflects the number of distinct bilingual phrase pairs found in each pair of documents, normalized by the total number of unique phrases found in them. Since parallel documents tend to share more bilingual phrase pairs than non-parallel documents, our alignment algorithm selects pairs of documents with the maximum coverage score from all possible pairings involving either one of the two documents.
  • Association for Computational Linguistics
  • Luís Gomes, Gabriel Pereira Lopes
  • http://www.aclweb.org/anthology/W/W16/W16-2369.pdf
  • Note: The system described in this paper was ranked first among 22 competing systems in the Bilingual Document Alignment Shared Task at WMT16.
  • Pages 697 to 702
  • 1 Aug 2016
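
The selection rule summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `coverage` and `align`, the representation of documents as lists of phrases, the lexicon as a set of (source, target) phrase pairs, and the choice to normalize by the sum of unique phrases in both documents are all assumptions made for illustration. The abstract does not specify how phrases are extracted or how ties are broken.

```python
def coverage(src_phrases, tgt_phrases, lexicon):
    """Hypothetical coverage score: distinct bilingual phrase pairs found in
    the document pair, divided by the unique phrases in both documents
    (the exact normalization is an assumption)."""
    src, tgt = set(src_phrases), set(tgt_phrases)
    total = len(src) + len(tgt)
    if total == 0:
        return 0.0
    # Count lexicon entries whose source side occurs in the source document
    # and whose target side occurs in the target document.
    matched = sum(1 for s, t in lexicon if s in src and t in tgt)
    return matched / total


def align(src_docs, tgt_docs, lexicon):
    """Keep a pair only if its score is the maximum among all pairings
    involving either of its two documents (ties are all kept here)."""
    scores = {
        (i, j): coverage(s, t, lexicon)
        for i, s in src_docs.items()
        for j, t in tgt_docs.items()
    }
    # Best score seen for each source document and each target document.
    row_best = {
        i: max((scores[i, j] for j in tgt_docs), default=0.0) for i in src_docs
    }
    col_best = {
        j: max((scores[i, j] for i in src_docs), default=0.0) for j in tgt_docs
    }
    return [
        (i, j)
        for (i, j), sc in scores.items()
        if sc > 0 and sc == row_best[i] and sc == col_best[j]
    ]
```

Scoring every possible pairing, as this sketch does, is quadratic in the number of documents, so it is only meant to show the selection criterion on small inputs, not to suggest how a large web collection would be processed.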