There are several well-known approaches to parsing Arabic text in preparation for indexing and retrieval. Techniques such as stemming and stopping have been shown to improve search results on written newswire dispatches, but few comparisons are available on other data sources. In this paper, we apply several alternative stemming and stopping approaches to Arabic text automatically extracted from the audio soundtrack of news video footage, and compare these with approaches that rely on machine translation of the underlying text. Using the TRECVID video collection and queries, we show that normalisation, stopword removal, and light stemming increase retrieval precision, but that heavy stemming and trigrams have a negative effect. We also show that the choice of machine translation engine plays a major role in retrieval effectiveness.
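To illustrate the kinds of term processing named in the abstract, the following is a minimal sketch of normalisation, stopword removal, light stemming, and character trigrams. It is not the implementation evaluated in the paper: the stopword list, the affix lists, the minimum stem length, and all function names are assumptions chosen for illustration only.

```python
import re

# Diacritics (U+064B to U+0652) and tatweel (U+0640) are stripped.
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")


def normalise(token):
    """Fold common orthographic variants into one form (illustrative rules)."""
    token = DIACRITICS.sub("", token)
    token = re.sub("[\u0623\u0625\u0622]", "\u0627", token)  # alef variants -> bare alef
    token = token.replace("\u0649", "\u064A")  # alef maqsura -> ya
    token = token.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return token


# Tiny hypothetical stopword list, stored in normalised form.
STOPWORDS = {normalise(w) for w in ("في", "من", "على", "الى", "عن")}

# Hypothetical affix lists; longer prefixes are listed first so they win.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي")
MIN_STEM = 3  # assumed minimum number of letters left after stripping


def light_stem(token):
    """Strip at most one prefix and one suffix from a normalised token."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= MIN_STEM:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM:
            token = token[:-len(suffix)]
            break
    return token


def char_trigrams(token):
    """Return overlapping three-character substrings of a token."""
    return [token[i:i + 3] for i in range(len(token) - 2)]


def index_terms(text):
    """Normalise, drop stopwords, then light-stem whitespace-separated tokens."""
    terms = []
    for raw in text.split():
        token = normalise(raw)
        if token and token not in STOPWORDS:
            terms.append(light_stem(token))
    return terms
```

In this sketch `index_terms` produces light-stemmed terms, while `char_trigrams` could be applied to each normalised token instead of `light_stem` to produce trigram terms for comparison.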
Related Materials
1. ISBN - Is published in 0769528414 (urn:isbn:0769528414)
Start page
11
End page
16
Total pages
6
Outlet
Proceedings of the 6th IEEE International Conference on Computer and Information Science
Editors
R. Lee, M. Chowdhury, S. Ray & T. Lee
Name of conference
6th IEEE International Conference on Computer and Information Science