The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out-of-vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, so foreign words need to be identified before any stemming takes place. In this paper, we investigate three approaches for the identification of foreign words in Arabic text: lexicons, language patterns, and n-grams. Our results show that lexicon-based approaches outperform the other techniques.
Start page: 258
End page: 266
Total pages: 9
Outlet: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006)
Editors: E. Ringger
Name of conference: Conference on Empirical Methods in Natural Language Processing