RMIT University

Corpus effects on the evaluation of automated transliteration systems

conference contribution
posted on 2024-10-30, 18:41 authored by Sarvnaz Karimi, Andrew Turpin, Falk Scholer
Most current machine transliteration systems employ a corpus of known source-target word pairs to train their system, and typically evaluate their systems on a similar corpus. In this paper we explore the performance of transliteration systems on corpora that are varied in a controlled way. In particular, we control the number and prior language knowledge of the human transliterators used to construct the corpora, and the origin of the source words that make up the corpora. We find that the word accuracy of automated transliteration systems can vary by up to 30% (in absolute terms) depending on the corpus on which they are run. We conclude that at least four human transliterators should be used to construct corpora for evaluating automated transliteration systems, and that although absolute word accuracy metrics may not translate across corpora, the relative rankings of system performance remain stable across differing corpora.
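The distinction the abstract draws between absolute word accuracy and relative system ranking can be illustrated with a small sketch. Everything here is an assumption made for illustration: word accuracy is taken to be the fraction of source words whose single top-ranked system output matches at least one human transliteration, and the corpora, system names, and strings are hypothetical toy data, not taken from the paper.

```python
from typing import Dict, List, Set


def word_accuracy(predictions: Dict[str, str],
                  references: Dict[str, Set[str]]) -> float:
    """Fraction of source words whose predicted transliteration matches
    at least one human-provided reference (assumed metric definition)."""
    if not references:
        raise ValueError("references must not be empty")
    correct = sum(
        1 for source, accepted in references.items()
        if predictions.get(source) in accepted
    )
    return correct / len(references)


def rank_systems(outputs: Dict[str, Dict[str, str]],
                 references: Dict[str, Set[str]]) -> List[str]:
    """System names ordered from highest to lowest word accuracy."""
    return sorted(
        outputs,
        key=lambda name: word_accuracy(outputs[name], references),
        reverse=True,
    )


# Hypothetical corpora: the same source words, transliterated by one
# human versus by several humans (more accepted variants per word).
corpus_one = {"w1": {"a1"}, "w2": {"b1"}, "w3": {"c1"}, "w4": {"d1"}}
corpus_many = {
    "w1": {"a1", "a2"}, "w2": {"b1", "b2"},
    "w3": {"c1", "c2"}, "w4": {"d1", "d2"},
}

# Hypothetical top-1 outputs of two systems.
systems = {
    "system_x": {"w1": "a1", "w2": "b2", "w3": "c1", "w4": "d2"},
    "system_y": {"w1": "a2", "w2": "b2", "w3": "c1", "w4": "zz"},
}

for label, corpus in (("one transliterator", corpus_one),
                      ("several transliterators", corpus_many)):
    scores = {name: word_accuracy(out, corpus) for name, out in systems.items()}
    print(label, scores, rank_systems(systems, corpus))
```

In this toy setup the absolute scores shift when more reference variants are accepted, while the ordering of the two systems stays the same. It only illustrates the kind of effect the abstract reports and does not reproduce the paper's experiments.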

History

Start page

640

End page

647

Total pages

8

Outlet

Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

Editors

A. van den Bosch, A. Zaenen

Name of conference

Association for Computational Linguistics

Publisher

Association for Computational Linguistics

Place published

USA

Start date

2007-06-23

End date

2007-06-30

Language

English

Copyright

© 2007 Association of Computational Linguistics

Former Identifier

2006006574

Esploro creation date

2020-06-22

Fedora creation date

2009-10-08
