Double-pass clustering technique for multilingual document collections

Research output: Article, peer-reviewed

6 citations (Scopus)

Abstract

It is often necessary to automatically categorize multilingual document sets, which contain documents written in a variety of languages, into topically homogeneous subsets, for example when applying an automatic summarization system to multilingual news articles. However, there have been few studies on multilingual document clustering to date. In particular, it is not known whether clustering techniques are effective for medium- or large-scale multilingual document sets. For scalability, techniques should be based on dictionary-based translation and a single- or double-pass clustering algorithm. This article reports on experiments applying multilingual document clustering to medium-scale sets of English, French, German and Italian documents (Reuters news articles). The results show that the double-pass algorithm has a positive effect when each document is translated. On the other hand, the cluster translation strategy, in which clusters obtained by applying a clustering algorithm to each language's document set are translated, has almost no effect. In addition, translation disambiguation techniques improve the effectiveness of clustering, but only slightly.
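The abstract names single- and double-pass clustering without defining them, so the following is only a minimal sketch under stated assumptions, not the article's implementation. It assumes that documents have already been translated into one language and turned into sparse term-weight vectors, that the first pass is a leader-style single pass with a cosine similarity threshold, and that the second pass reassigns every document to the most similar centroid produced by the first pass. The function names, the `threshold` parameter, and the use of summed vectors as centroids are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dict-like)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold):
    """Assumed first pass: each document joins the most similar existing
    cluster if the similarity reaches `threshold`, else starts a new one."""
    centroids, labels = [], []
    for doc in docs:
        sims = [cosine(doc, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            # Counter.update adds weights, so the centroid is a running sum;
            # cosine similarity is unaffected by the missing normalization.
            centroids[best].update(doc)
            labels.append(best)
        else:
            centroids.append(Counter(doc))
            labels.append(len(centroids) - 1)
    return centroids, labels


def double_pass(docs, threshold):
    """Assumed second pass: reassign every document to the closest of the
    final first-pass centroids, reducing the effect of input order."""
    centroids, _ = single_pass(docs, threshold)
    labels = []
    for doc in docs:
        sims = [cosine(doc, c) for c in centroids]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


if __name__ == "__main__":
    # Toy term-frequency vectors standing in for translated news articles.
    docs = [
        Counter("oil price rise".split()),
        Counter("oil price fall".split()),
        Counter("football match result".split()),
        Counter("football match win".split()),
    ]
    print(double_pass(docs, threshold=0.5))
```

Both passes read each document once and compare it only against the current centroids, which is why approaches of this kind are attractive when scalability matters.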

Original language: English
Pages (from-to): 304-321
Number of pages: 18
Journal: Journal of Information Science
Volume: 37
Issue number: 3
DOI
Publication status: Published - Jun 2011

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
