FITE-TRT: A High Quality Translation Technique for OOV Words

Ari Pirkola, Jarmo Toivonen*, Heikki Keskustalo, Kalervo Järvelin

Department of Information Studies, 33014 University of Tampere, Finland

{ari.pirkola, heikki.keskustalo, kalervo.jarvelin}@uta.fi

ABSTRACT

We devised a novel statistical technique for the identification of the translation equivalents of source words obtained by transformation rule based translation (TRT). The effectiveness of the devised FITE (frequency-based identification of translation equivalents) technique was tested using biological and medical cross-lingual spelling variants and OOV words in Spanish-English and Finnish-English TRT. For Spanish-English, translation recall was 89.2%-91.0% and for Finnish-English 71.9%-72.9%. For both language pairs FITE-TRT achieved high translation precision, i.e., 97.0%-98.8%. The technique also reliably identified native source language words, i.e., source words that cannot be correctly translated by TRT. Dictionary-based CLIR augmented with FITE-TRT performed substantially better than dictionary-based CLIR where OOV keys were kept intact.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval

General Terms

Algorithms, Performance, Experimentation

Keywords

Cross-language information retrieval, OOV words, TRT

1. INTRODUCTION

Out-of-vocabulary (OOV) words constitute a major problem in cross-language information retrieval (CLIR) and machine translation (MT). In those cases where equivalent terms in different languages are etymologically related technical terms (cross-lingual spelling variants, such as German konstruktion and English construction) it is possible to use a transliteration type of translation to recognize the target language equivalents of the source language words. In [5] we automatically generated large collections of character correspondences in several language pairs for the translation of cross-lingual spelling variants. We call the regular correspondences augmented with statistical information "transformation rules" and the translation technique based on the generated rules "transformation rule based translation" (TRT).

* Institute of Signal Processing, Tampere University of Technology, Tampere, Finland. jarmo.toivonen@cs.tut.fi

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SAC’06, April 23-27, 2006, Dijon, France.

Copyright 2006 ACM 1-59593-108-2/06/0004…$5.00.

It is obvious that a technique where words not found in a dictionary are translated by transformation rules would be useful in many information systems where automatic translation is part of the system. However, the TRT technique may be useless if it just indicates a set of translation equivalent candidates for a source word but is not able to indicate the one correct equivalent, which was the case in [5] as well as in [7]. In the present research we combat this problem, and move TRT from what is called fuzzy translation towards dictionary-like translation where, for each source word, either a single translation equivalent is indicated (rather than a set of words possibly containing the equivalent) or the source word is indicated not to be translatable by means of TRT. For this we developed a novel statistical equivalent identification technique called frequency-based identification of translation equivalents (FITE). The identification of equivalents is based on regular frequency patterns associated with the target word forms obtained by TRT.
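The decision just described (one equivalent, or an explicit "not translatable" verdict) can be sketched as follows. This is a minimal illustration and not the FITE procedure itself: the function name `pick_equivalent`, the frequency table, and the `min_freq` and `dominance` thresholds are all assumptions made for the example. The text above states only that identification relies on frequency patterns of the word forms generated by TRT.

```python
from typing import Dict, Iterable, Optional


def pick_equivalent(
    candidates: Iterable[str],
    target_freq: Dict[str, int],
    min_freq: int = 10,        # hypothetical threshold
    dominance: float = 10.0,   # hypothetical threshold
) -> Optional[str]:
    """Return a single candidate, or None if no candidate stands out.

    Assumed rule (not taken from the paper): accept the most frequent
    candidate only if it is frequent enough in target language data and
    clearly more frequent than the runner-up.
    """
    # Rank the distinct candidates by their target language frequency.
    ranked = sorted(
        ((target_freq.get(c, 0), c) for c in set(candidates)),
        reverse=True,
    )
    if not ranked:
        return None

    best_freq, best = ranked[0]
    if best_freq < min_freq:
        # No candidate looks like a real target language word:
        # treat the source word as not translatable by TRT.
        return None

    if len(ranked) > 1:
        second_freq = ranked[1][0]
        if second_freq > 0 and best_freq / second_freq < dominance:
            # Two candidates are similarly frequent: refuse to guess.
            return None
    return best
```

With toy frequencies such as `{"construction": 50000, "construccion": 12}`, the candidate set `["construction", "construccion", "constructione"]` resolves to `"construction"`, while a candidate set whose forms are all rare yields `None`. Returning `None` rather than the best guess mirrors the dictionary-like behaviour the paragraph aims for.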

In this paper we also present a novel feature of TRT, viz., translation through indirect translation routes. If a direct translation from a source language into a target language fails to find an equivalent, the source word is retranslated into the target language through intermediate languages. As in the case of direct translation, the equivalents are searched for from TRT’s translation set by means of the novel FITE technique.
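The fallback through intermediate languages can be sketched in the same spirit. Everything named here is a stand-in: `translate_with_routes`, the `trt` callable (representing the application of transformation rules for one language pair), and the `identify` callable (representing FITE) are hypothetical. Trying the intermediate languages one at a time, in the order given, is also an assumption; only the overall control flow (direct route first, then indirect routes) follows the description above.

```python
from typing import Callable, Iterable, List, Optional, Sequence

# (word, source_lang, target_lang) -> candidate word forms
Trt = Callable[[str, str, str], List[str]]
# candidate word forms -> one equivalent, or None
Identify = Callable[[Iterable[str]], Optional[str]]


def translate_with_routes(
    word: str,
    source: str,
    target: str,
    intermediates: Sequence[str],
    trt: Trt,
    identify: Identify,
) -> Optional[str]:
    """Try the direct route, then each route through an intermediate language."""
    # Direct route: source -> target.
    equivalent = identify(trt(word, source, target))
    if equivalent is not None:
        return equivalent

    # Indirect routes: source -> pivot -> target.
    for pivot in intermediates:
        candidates: List[str] = []
        for pivot_form in trt(word, source, pivot):
            candidates.extend(trt(pivot_form, pivot, target))
        equivalent = identify(candidates)
        if equivalent is not None:
            return equivalent

    # No route produced an identifiable equivalent.
    return None
```

For the language pairs studied here, a call would look like `translate_with_routes(word, "es", "en", ["de", "fr"], trt, identify)`, where the language codes are again only illustrative.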

We study Spanish-English and Finnish-English TRT. For both language pairs German and French serve as intermediate languages. As test words we use terms in the domains of biology and medicine. The terms were selected from texts and real information requests of biomedical researchers.

The novel FITE-TRT technique is fundamentally different from other OOV methods/systems presented in the literature. For instance, Cheng et al. [1] and Zhang and Vines [8] both developed a Web-based translation method for Chinese-English OOV words where the OOV words were extracted from bilingual Chinese-English texts found in Chinese Web pages using word co-occurrence statistics and syntactic structures. Fujii and Ishikawa [2] used character-based rules to establish mapping between English characters and romanized Japanese katakana characters. They also utilized probabilistic character-based language models, which can be seen as a variation of the fuzzy
