Radical-level Ideograph Encoder for RNN-based Sentiment Analysis of Chinese and Japanese

Yuanzhi Ke, Masafumi Hagiwara

Research output: Contribution to journal › Conference article

1 Citation (Scopus)

Abstract

The character vocabulary can be very large in non-alphabetic languages such as Chinese and Japanese, which makes the neural network models that process such languages huge. We explored a model for sentiment classification that takes the embeddings of the radicals of the Chinese characters, i.e., hanzi of Chinese and kanji of Japanese. Our model is composed of a CNN word feature encoder and a bi-directional RNN document feature encoder. The results achieved are on par with those of the character embedding-based models, and close to the state-of-the-art word embedding-based models, with a 90% smaller vocabulary and at least 13% and 80% fewer parameters than the character embedding-based and word embedding-based models, respectively. The results suggest that the radical embedding-based approach is cost-effective for machine learning on Chinese and Japanese.
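As a rough illustration of the architecture described in the abstract (radical embeddings, a CNN word feature encoder, and a bi-directional RNN document feature encoder), below is a minimal PyTorch sketch. The layer sizes, the radical vocabulary size, the use of a GRU, and the pooling choices are illustrative assumptions and are not taken from the paper.

import torch
import torch.nn as nn


class RadicalSentimentClassifier(nn.Module):
    """Hypothetical radical-level encoder sketch; not the authors' exact model."""

    def __init__(self, num_radicals=300, radical_dim=64,
                 word_feature_dim=128, doc_hidden_dim=128, num_classes=2):
        super().__init__()
        # The radical vocabulary is only a few hundred symbols, so this
        # embedding table is far smaller than a character- or word-level one.
        self.radical_embedding = nn.Embedding(num_radicals, radical_dim, padding_idx=0)
        # CNN over the radical sequence of each word -> word feature vector.
        self.word_cnn = nn.Conv1d(radical_dim, word_feature_dim, kernel_size=3, padding=1)
        # Bi-directional RNN over the word features -> document representation.
        self.doc_rnn = nn.GRU(word_feature_dim, doc_hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * doc_hidden_dim, num_classes)

    def forward(self, radical_ids):
        # radical_ids: (batch, words_per_doc, radicals_per_word) integer ids.
        b, w, r = radical_ids.shape
        x = self.radical_embedding(radical_ids)            # (b, w, r, radical_dim)
        x = x.view(b * w, r, -1).transpose(1, 2)           # (b*w, radical_dim, r)
        x = torch.relu(self.word_cnn(x))                   # (b*w, word_feature_dim, r)
        word_feats = x.max(dim=2).values.view(b, w, -1)    # max-pool over radicals
        doc_out, _ = self.doc_rnn(word_feats)              # (b, w, 2*doc_hidden_dim)
        doc_vec = doc_out.mean(dim=1)                      # average over words
        return self.classifier(doc_vec)                    # sentiment logits


# Example: 2 documents, 10 words each, up to 6 radical ids per word.
logits = RadicalSentimentClassifier()(torch.randint(1, 300, (2, 10, 6)))
print(logits.shape)  # torch.Size([2, 2])

In a sketch like this, almost all of the vocabulary-dependent parameters sit in the embedding table, which is where a smaller radical vocabulary would translate into the kind of parameter savings reported in the abstract.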

Original language: English
Pages (from-to): 561-573
Number of pages: 13
Journal: Journal of Machine Learning Research
Volume: 77
Publication status: Published - 2017 Jan 1
Event: 9th Asian Conference on Machine Learning, ACML 2017 - Seoul, Korea, Republic of
Duration: 2017 Nov 15 - 2017 Nov 17

Keywords

  • Natural Language Processing
  • Sentiment Analysis

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence

Cite this

Radical-level Ideograph Encoder for RNN-based Sentiment Analysis of Chinese and Japanese. / Ke, Yuanzhi; Hagiwara, Masafumi.

In: Journal of Machine Learning Research, Vol. 77, 01.01.2017, p. 561-573.

Research output: Contribution to journal › Conference article

@article{8b5a5c195b8241a09a17dfb1971ebfd4,
title = "Radical-level Ideograph Encoder for RNN-based Sentiment Analysis of Chinese and Japanese",
abstract = "The character vocabulary can be very large in non-alphabetic languages such as Chinese and Japanese, which makes the neural network models that process such languages huge. We explored a model for sentiment classification that takes the embeddings of the radicals of the Chinese characters, i.e., hanzi of Chinese and kanji of Japanese. Our model is composed of a CNN word feature encoder and a bi-directional RNN document feature encoder. The results achieved are on par with those of the character embedding-based models, and close to the state-of-the-art word embedding-based models, with a 90{\%} smaller vocabulary and at least 13{\%} and 80{\%} fewer parameters than the character embedding-based and word embedding-based models, respectively. The results suggest that the radical embedding-based approach is cost-effective for machine learning on Chinese and Japanese.",
keywords = "Natural Language Processing, Sentiment Analysis",
author = "Yuanzhi Ke and Masafumi Hagiwara",
year = "2017",
month = "1",
day = "1",
language = "English",
volume = "77",
pages = "561--573",
journal = "Journal of Machine Learning Research",
issn = "1532-4435",
publisher = "Microtome Publishing",
}
