TY - JOUR
T1 - A non-image-based subcharacter-level method to encode the shape of Chinese characters
AU - Ke, Yuanzhi
AU - Hagiwara, Masafumi
N1 - Publisher Copyright: © 2020, Japanese Society for Artificial Intelligence. All rights reserved.
PY - 2020
Y1 - 2020
AB - Most characters in the Chinese and Japanese languages are ideographic compound characters composed of subcharacter elements arranged in a planar manner. Token-based and image-based subcharacter-level models have been proposed to leverage these elements. However, conventional token-based subcharacter-level models are blind to the planar structural information, whereas image-based models struggle with ideographic characters that share similar shapes but have different meanings. These limitations motivate us to explore non-image-based methods to encode the planar structural information of characters. In this paper, we propose and discuss a method that encodes the planar structural information by learning embeddings of the structure-type categories. The proposed model adds the structure embeddings to the conventional subcharacter embeddings and position embeddings before they are input into the encoder. In this way, the model learns the planar structural and positional information while retaining the uniqueness of each character. We evaluated the method on a text classification task. In the experiment, the embeddings were encoded by a CNN encoder, and the encoded vectors were then input into an LSTM classifier to classify product reviews as positive or negative. We compared the proposed model with models that use only the subcharacter embeddings, the structure embeddings, or the position embeddings, as well as with the conventional models from previous work. The results show that adding both structure embeddings and position embeddings yields richer, more representative features and a better fit to the dataset. Compared with previous methods, the proposed method improves recall by up to 1.8%, F-score by up to 0.63%, and accuracy by up to 0.55% on the test datasets.
KW - Convolutional Neural Networks
KW - Deep Learning
KW - Natural Language Processing
KW - Subcharacter Language Modeling
KW - Text Classification
UR - http://www.scopus.com/inward/record.url?scp=85084345108&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084345108&partnerID=8YFLogxK
U2 - 10.1527/tjsai.C-J74
DO - 10.1527/tjsai.C-J74
M3 - Article
AN - SCOPUS:85084345108
SN - 1346-0714
VL - 35
JO - Transactions of the Japanese Society for Artificial Intelligence
JF - Transactions of the Japanese Society for Artificial Intelligence
IS - 2
M1 - C-J74
ER -