High-speed rough clustering for very large document collections

Research output: Article, peer-reviewed

9 Citations (Scopus)

Abstract

Document clustering is an important tool, but it is not yet widely used in practice, probably because of its high computational complexity. This article explores techniques for high-speed rough clustering of documents, assuming that it is sometimes necessary to obtain a clustering result in a shorter time, even though the result is only an approximate outline of the document clusters. A promising approach for such clustering is to reduce the number of documents that must be checked when generating cluster vectors in the leader-follower clustering algorithm. Based on this idea, the present article proposes a modified Crouch algorithm and an incomplete single-pass leader-follower algorithm. It also develops a two-stage grouping technique, in which the first stage applies a quick merging technique to decrease the number of documents to be processed in the second stage. An experiment using a part of the Reuters corpus RCV1 showed empirically that both the modified Crouch and the incomplete single-pass leader-follower algorithms achieve clustering results more efficiently than the original methods, and also improve the effectiveness of the clustering results. On the other hand, the two-stage grouping technique did not reduce the processing time in this experiment.
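For readers unfamiliar with the baseline, the following is a minimal sketch of a generic single-pass leader-follower clustering loop, not the modified Crouch or incomplete single-pass algorithms proposed in the article, whose details are not given in the abstract. The cosine similarity measure, the `threshold` value, and the `build_limit` parameter are assumptions made for illustration only; `build_limit` is a hypothetical knob that shows one possible way to limit how many documents are used to generate cluster vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def leader_follower(docs, threshold=0.5, build_limit=None):
    """Generic single-pass leader-follower clustering (illustrative sketch).

    docs:        list of sparse term-weight dicts, processed in order.
    threshold:   assumed minimum cosine similarity for joining a cluster.
    build_limit: hypothetical parameter (not from the article). Only the
                 first `build_limit` documents may create clusters or update
                 cluster vectors; later documents are simply assigned to the
                 most similar existing cluster. None means no limit.
    Returns (labels, cluster_vectors); a label of -1 means no cluster existed.
    """
    cluster_vectors = []  # each cluster vector is the sum of its members
    labels = []
    for i, doc in enumerate(docs):
        # Find the most similar existing cluster vector.
        best, best_sim = -1, -1.0
        for c, vec in enumerate(cluster_vectors):
            sim = cosine(doc, vec)
            if sim > best_sim:
                best, best_sim = c, sim

        building = build_limit is None or i < build_limit
        if building and best_sim < threshold:
            # No sufficiently similar cluster: the document becomes a leader.
            cluster_vectors.append(dict(doc))
            labels.append(len(cluster_vectors) - 1)
        else:
            if building:
                # The document follows the leader and updates its vector.
                for t, w in doc.items():
                    cluster_vectors[best][t] = cluster_vectors[best].get(t, 0.0) + w
            labels.append(best)
    return labels, cluster_vectors


docs = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 0.9},
    {"x": 1.0},
    {"x": 1.0, "y": 0.2},
]
labels, vectors = leader_follower(docs, threshold=0.5)
print(labels, len(vectors))
```

In this sketch every document is compared against every existing cluster vector, which is where the cost grows for very large collections. Setting `build_limit` trades accuracy for speed in a crude way and is meant only to make the general idea concrete.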

Original language: English
Pages (from-to): 1092-1104
Number of pages: 13
Journal: Journal of the American Society for Information Science and Technology
Volume: 61
Issue number: 6
Publication status: Published - Jun 1 2010

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications
  • Artificial Intelligence
