TY - JOUR
T1 - Techniques of document clustering
T2 - A review
AU - Kishida, Kazuaki
PY - 2003/12/1
Y1 - 2003/12/1
N2 - Document clustering is widely recognized as a useful tool for applications such as information retrieval, web document organization, and text mining. The purpose of this paper is to review various document clustering techniques and to discuss research issues for enhancing the effectiveness or efficiency of clustering methods. We survey the extensive literature on non-hierarchical methods (single-pass methods), hierarchical methods (single-link, complete-link, etc.), dimensionality reduction methods (LSI, principal component analysis, etc.), probabilistic methods, and data mining techniques. In particular, this paper focuses on typical techniques such as the k-means algorithm, the leader-follower algorithm, the self-organizing map (SOM), single- and complete-link methods, bisecting k-means methods, latent semantic indexing (LSI), and the Gaussian mixture model. After reviewing the techniques and algorithms, we discuss research issues in document clustering: computational complexity, feature extraction (selection of words), methods for defining term weights and similarity, and evaluation of results.
UR - http://www.scopus.com/inward/record.url?scp=13444259727&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=13444259727&partnerID=8YFLogxK
M3 - Review article
AN - SCOPUS:13444259727
SN - 0373-4447
SP - 33
EP - 75
JO - Library and Information Science
JF - Library and Information Science
IS - 49
ER -