Double-pass clustering technique for multilingual document collections

Research output: Contribution to journal › Article

5 Citations (Scopus)

Abstract

Multilingual document sets, which contain documents written in a variety of languages, often need to be categorized automatically into topically homogeneous subsets, for example when applying an automatic summarization system to multilingual news articles. However, there have been few studies on multilingual document clustering to date. In particular, it is not known whether clustering techniques are effective for medium- or large-scale multilingual document sets. For scalability, techniques should be based on dictionary-based translation and a single- or double-pass clustering algorithm. This article reports on experiments in which multilingual document clustering was applied to medium-scale sets of English, French, German and Italian documents (Reuters news articles). The results show that the double-pass algorithm has a positive effect when each document is translated. In contrast, the cluster translation strategy, in which each language's document set is clustered separately and the resulting clusters are then translated, has almost no effect. Translation disambiguation techniques improve clustering effectiveness, but only slightly.
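The abstract names dictionary-based document translation and a double-pass clustering algorithm but does not spell either out. The following is a minimal Python sketch of one common reading, not the article's implementation: a leader-style first pass builds clusters against a similarity threshold, and a second pass reassigns every document to its nearest centroid. The function names, the threshold value, and the toy bilingual dictionaries are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def translate(tokens, dictionary):
    """Dictionary-based translation without disambiguation: each term is
    replaced by all of its listed equivalents; unknown terms are kept."""
    out = []
    for tok in tokens:
        out.extend(dictionary.get(tok, [tok]))
    return out


def single_pass(vectors, threshold):
    """First pass: put each document in the most similar existing cluster,
    or open a new cluster if no similarity reaches the threshold."""
    centroids, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            centroids[best].update(vec)  # Counter.update adds the counts
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return centroids, labels


def double_pass(vectors, threshold):
    """Second pass: keep the first-pass centroids fixed and reassign
    every document to the centroid it is most similar to."""
    centroids, _ = single_pass(vectors, threshold)
    return [
        max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
        for vec in vectors
    ]


# Toy data (hypothetical): two finance documents and two sports documents.
fr_en = {"marché": ["market"], "actions": ["shares", "actions"]}
de_en = {"fussball": ["football"], "spiel": ["match", "game"]}
docs = [
    ["market", "shares", "fall"],
    translate(["marché", "actions"], fr_en),
    translate(["fussball", "spiel"], de_en),
    ["football", "match"],
]
labels = double_pass([Counter(d) for d in docs], threshold=0.3)
print(labels)
```

In this sketch every document is translated into a single pivot language (English) before clustering, which corresponds to the document translation setting mentioned in the abstract; the cluster translation strategy would instead cluster each language separately and translate the resulting centroids.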

Original language: English
Pages (from-to): 304-321
Number of pages: 18
Journal: Journal of Information Science
Volume: 37
Issue number: 3
DOI: 10.1177/0165551511404867
Publication status: Published - 2011 Jun

Fingerprint

Clustering algorithms
Glossaries
Scalability
news
Experiments
language
dictionary
experiment

Keywords

  • document translation
  • multilingual document clustering
  • translation disambiguation

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

Double-pass clustering technique for multilingual document collections. / Kishida, Kazuaki.

In: Journal of Information Science, Vol. 37, No. 3, 06.2011, p. 304-321.


@article{18a0d840600d47b1bd4b3674f0e83557,
title = "Double-pass clustering technique for multilingual document collections",
abstract = "It is often necessary to categorize automatically multilingual document sets, in which documents written in a variety of languages are included, into topically homogeneous subsets, such as when applying an automatic summarization system for multilingual news articles. However, there have been few studies on multilingual document clustering to date. In particular, it is not known whether clustering techniques are effective in medium- or large-scale multilingual document sets. For scalability, techniques should be based on dictionary-based translation and a single- or double-pass clustering algorithm. This article reports on experiments of applying multilingual document clustering to medium-scale sets of English, French, German and Italian documents (Reuters news articles). The results show that the double-pass algorithm has a positive effect in the case that each document is translated. On the other hand, the cluster translation strategy in which clusters obtained by applying a clustering algorithm to each language document set are translated has almost no effect. Also, translation disambiguation techniques can improve, but only slightly, the effectiveness of clustering.",
keywords = "document translation, multilingual document clustering, translation disambiguation",
author = "Kazuaki Kishida",
year = "2011",
month = "6",
doi = "10.1177/0165551511404867",
language = "English",
volume = "37",
pages = "304--321",
journal = "Journal of Information Science",
issn = "0165-5515",
publisher = "SAGE Publications Ltd",
number = "3",
}

TY - JOUR

T1 - Double-pass clustering technique for multilingual document collections

AU - Kishida, Kazuaki

PY - 2011/6

Y1 - 2011/6

N2 - It is often necessary to categorize automatically multilingual document sets, in which documents written in a variety of languages are included, into topically homogeneous subsets, such as when applying an automatic summarization system for multilingual news articles. However, there have been few studies on multilingual document clustering to date. In particular, it is not known whether clustering techniques are effective in medium- or large-scale multilingual document sets. For scalability, techniques should be based on dictionary-based translation and a single- or double-pass clustering algorithm. This article reports on experiments of applying multilingual document clustering to medium-scale sets of English, French, German and Italian documents (Reuters news articles). The results show that the double-pass algorithm has a positive effect in the case that each document is translated. On the other hand, the cluster translation strategy in which clusters obtained by applying a clustering algorithm to each language document set are translated has almost no effect. Also, translation disambiguation techniques can improve, but only slightly, the effectiveness of clustering.


KW - document translation

KW - multilingual document clustering

KW - translation disambiguation

UR - http://www.scopus.com/inward/record.url?scp=79959194131&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79959194131&partnerID=8YFLogxK

U2 - 10.1177/0165551511404867

DO - 10.1177/0165551511404867

M3 - Article

AN - SCOPUS:79959194131

VL - 37

SP - 304

EP - 321

JO - Journal of Information Science

JF - Journal of Information Science

SN - 0165-5515

IS - 3

ER -