Techniques of document clustering: A review

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Document clustering is widely recognized as a useful tool for information retrieval, organizing web documents, text mining, and related tasks. The purpose of this paper is to review various document clustering techniques and to discuss research issues for enhancing the effectiveness or efficiency of clustering methods. We survey the extensive literature on non-hierarchical methods (single-pass methods), hierarchical methods (single-link, complete-link, etc.), dimensionality reduction methods (LSI, principal component analysis, etc.), probabilistic methods, data mining techniques, and other approaches. In particular, this paper focuses on typical techniques such as the k-means algorithm, the leader-follower algorithm, the self-organizing map (SOM), single- and complete-link methods, bisecting k-means, latent semantic indexing (LSI), and the Gaussian mixture model. After reviewing the techniques and algorithms, we discuss research issues in document clustering: computational complexity, feature extraction (selection of words), methods for defining term weights and similarity, and evaluation of results.

Original language: English
Pages (from-to): 33-75
Number of pages: 43
Journal: Library and Information Science
Issue number: 49
Publication status: Published - 2003
Externally published: Yes

ASJC Scopus subject areas

  • Library and Information Sciences

Cite this

Techniques of document clustering: A review. / Kishida, Kazuaki.

In: Library and Information Science, No. 49, 2003, p. 33-75.

@article{069709fe22b14cd3a2811591e18d55f4,
title = "Techniques of document clustering: A review",
abstract = "The document clustering technique is widely recognized as a useful tool for information retrieval, organizing web documents, text mining and so on. The purpose of this paper is to review various document clustering techniques, and to discuss research issues for enhancing effectiveness or efficiency of the clustering methods. We explore extensive literature on non-hierarchical methods (single-pass methods), hierarchical methods (single-link, complete-link, etc.), dimensional reduction methods (LSI, principal component analysis, etc.), probabilistic methods, data mining techniques, and so on. In particular, this paper focuses on typical techniques, such as the k-means algorithm, the leader-follower algorithm, self-organizing map (SOM), single- or complete-link methods, bisecting k-means methods, latent semantic indexing (LSI), Gaussian-Mixture model and so on. After reviewing the techniques and algorithms, we discuss research issues on document clustering; computational complexity, feature extraction (selection of words), methods for defining term weights and similarity, and evaluation of results.",
author = "Kazuaki Kishida",
year = "2003",
language = "English",
pages = "33--75",
journal = "Library and Information Science",
issn = "0373-4447",
publisher = "Mita Society for Library and Information Science",
number = "49",
}

TY - JOUR

T1 - Techniques of document clustering

T2 - A review

AU - Kishida, Kazuaki

PY - 2003

Y1 - 2003

AB - The document clustering technique is widely recognized as a useful tool for information retrieval, organizing web documents, text mining and so on. The purpose of this paper is to review various document clustering techniques, and to discuss research issues for enhancing effectiveness or efficiency of the clustering methods. We explore extensive literature on non-hierarchical methods (single-pass methods), hierarchical methods (single-link, complete-link, etc.), dimensional reduction methods (LSI, principal component analysis, etc.), probabilistic methods, data mining techniques, and so on. In particular, this paper focuses on typical techniques, such as the k-means algorithm, the leader-follower algorithm, self-organizing map (SOM), single- or complete-link methods, bisecting k-means methods, latent semantic indexing (LSI), Gaussian-Mixture model and so on. After reviewing the techniques and algorithms, we discuss research issues on document clustering; computational complexity, feature extraction (selection of words), methods for defining term weights and similarity, and evaluation of results.

UR - http://www.scopus.com/inward/record.url?scp=13444259727&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=13444259727&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:13444259727

SP - 33

EP - 75

JO - Library and Information Science

JF - Library and Information Science

SN - 0373-4447

IS - 49

ER -