CNN-encoded radical-level representation for Japanese processing

Yuanzhi Ke, Masafumi Hagiwara

Research output: Contribution to journal › Article

Abstract

Although word embeddings are powerful, their weakness on rare and unknown words and the issues caused by large vocabularies have motivated the exploration of alternative representations. While character embeddings have been successful for alphabetic languages, Japanese is also difficult to process at the character level because of the large vocabulary of kanji, which are written in Chinese characters. To achieve fewer parameters and better generalization on infrequent words and characters, we propose a model that encodes Japanese texts from a radical-level representation, inspired by experimental findings in psycholinguistics. The proposed model comprises a convolutional local encoder and a recurrent global encoder. For the convolutional encoder, we propose a novel combination of two kinds of convolutional filters with different strides in one layer to extract information at different levels. We compare the proposed radical-level model with state-of-the-art word and character embedding-based models on a sentiment classification task. The proposed model outperformed the state-of-the-art models on both randomly sampled texts and texts that contain unknown characters, with 91% and 12% fewer parameters than the word embedding-based and character embedding-based models, respectively. In particular, on the test sets with unknown characters, the results of the proposed model were 4.01% and 2.38% above the word embedding-based and character embedding-based baselines, respectively. The proposed model performs well at lower computational and storage cost, and can be used on devices with limited storage and to process texts containing rare characters.
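The idea of combining two kinds of convolutional filters with different strides in one layer can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the ReLU and max-over-time pooling steps, and the assumption that every character is expanded into a fixed number of radicals (`radicals_per_char`) are all hypothetical choices made for illustration. The fine filter slides one radical at a time, while the coarse filter jumps one whole character at a time, so the two outputs describe the same input at two levels.

```python
from typing import List

Vector = List[float]


def conv1d(seq: List[Vector], kernel: List[Vector], stride: int) -> List[float]:
    """Slide one filter over a sequence of vectors; return one ReLU activation per window."""
    width = len(kernel)
    outputs = []
    for start in range(0, len(seq) - width + 1, stride):
        total = 0.0
        for offset in range(width):
            # Dot product between the input vector and the matching kernel row.
            for x, w in zip(seq[start + offset], kernel[offset]):
                total += x * w
        outputs.append(max(0.0, total))  # ReLU (an assumption, not from the abstract)
    return outputs


def two_stride_layer(
    seq: List[Vector],
    radical_kernel: List[Vector],
    char_kernel: List[Vector],
    radicals_per_char: int,
) -> List[float]:
    """Apply a stride-1 filter and a character-aligned filter in the same layer."""
    fine = conv1d(seq, radical_kernel, stride=1)
    coarse = conv1d(seq, char_kernel, stride=radicals_per_char)
    # Max-over-time pooling per filter, then concatenate (hypothetical pooling choice).
    return [max(fine), max(coarse)]


# Toy input: 2 characters x 3 radicals each, with 2-dimensional radical embeddings.
radicals = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
radical_kernel = [[1, 0], [0, 1]]          # width 2, moves radical by radical
char_kernel = [[1, 1], [1, 1], [1, 1]]     # width 3, covers exactly one character

features = two_stride_layer(radicals, radical_kernel, char_kernel, radicals_per_char=3)
print(features)
```

In a full model, the pooled features would be computed for many filters and passed to the recurrent global encoder mentioned in the abstract; that part is omitted here.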

Original language: English
Journal: Transactions of the Japanese Society for Artificial Intelligence
Volume: 33
Issue number: 4
DOI: 10.1527/tjsai.D-I23
Publication status: Published - 2018 Jan 1

Keywords

  • Convolutional neural networks
  • Deep learning
  • Natural language processing
  • Sub-character language modeling
  • Text classification

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

Cite this

CNN-encoded radical-level representation for Japanese processing. / Ke, Yuanzhi; Hagiwara, Masafumi.

In: Transactions of the Japanese Society for Artificial Intelligence, Vol. 33, No. 4, 01.01.2018.

@article{0214233c13354a26a31985d43a09c5e8,
title = "CNN-encoded radical-level representation for Japanese processing",
abstract = "Although word embeddings are powerful, weakness on rare words, unknown words and issues of large vocabulary motivated people to explore alternative representations. While the character embeddings have been successful for alphabetical languages, Japanese is difficult to be processed at the character level as well because of the large vocabulary of kanji, written in the Chinese characters. In order to achieve fewer parameters and better generalization on infrequent words and characters, we proposed a model that encodes Japanese texts from the radical-level representation, inspired by the experimental findings in the field of psycholinguistics. The proposed model is comprised of a convolutional local encoder and a recurrent global encoder. For the convolutional encoder, we propose a novel combination of two kinds of convolutional filters of different strides in one layer to extract information from the different levels. We compare the proposed radical-level model with the state-of-the-art word and character embedding-based models in the sentiment classification task. The proposed model outperformed the state-of-the-art models for the randomly sampled texts and the texts that contain unknown characters, with 91{\%} and 12{\%} fewer parameters than the word embedding-based and character embedding-based models, respectively. Especially for the test sets of unknown characters, the results by the proposed model were 4.01{\%} and 2.38{\%} above the word embedding-based and character embedding-based baselines, respectively. The proposed model is powerful with cheaper computational and storage cost, can be used for devices with limited storage and to process texts of rare characters.",
keywords = "Convolutional neural networks, Deep learning, Natural language processing, Sub-character language modeling, Text classification",
author = "Yuanzhi Ke and Masafumi Hagiwara",
year = "2018",
month = "1",
day = "1",
doi = "10.1527/tjsai.D-I23",
language = "English",
volume = "33",
journal = "Transactions of the Japanese Society for Artificial Intelligence",
issn = "1346-0714",
publisher = "Japanese Society for Artificial Intelligence",
number = "4",
}

TY - JOUR

T1 - CNN-encoded radical-level representation for Japanese processing

AU - Ke, Yuanzhi

AU - Hagiwara, Masafumi

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Although word embeddings are powerful, weakness on rare words, unknown words and issues of large vocabulary motivated people to explore alternative representations. While the character embeddings have been successful for alphabetical languages, Japanese is difficult to be processed at the character level as well because of the large vocabulary of kanji, written in the Chinese characters. In order to achieve fewer parameters and better generalization on infrequent words and characters, we proposed a model that encodes Japanese texts from the radical-level representation, inspired by the experimental findings in the field of psycholinguistics. The proposed model is comprised of a convolutional local encoder and a recurrent global encoder. For the convolutional encoder, we propose a novel combination of two kinds of convolutional filters of different strides in one layer to extract information from the different levels. We compare the proposed radical-level model with the state-of-the-art word and character embedding-based models in the sentiment classification task. The proposed model outperformed the state-of-the-art models for the randomly sampled texts and the texts that contain unknown characters, with 91% and 12% fewer parameters than the word embedding-based and character embedding-based models, respectively. Especially for the test sets of unknown characters, the results by the proposed model were 4.01% and 2.38% above the word embedding-based and character embedding-based baselines, respectively. The proposed model is powerful with cheaper computational and storage cost, can be used for devices with limited storage and to process texts of rare characters.

KW - Convolutional neural networks

KW - Deep learning

KW - Natural language processing

KW - Sub-character language modeling

KW - Text classification

UR - http://www.scopus.com/inward/record.url?scp=85049758629&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049758629&partnerID=8YFLogxK

U2 - 10.1527/tjsai.D-I23

DO - 10.1527/tjsai.D-I23

M3 - Article

VL - 33

JO - Transactions of the Japanese Society for Artificial Intelligence

JF - Transactions of the Japanese Society for Artificial Intelligence

SN - 1346-0714

IS - 4

ER -