TY - JOUR
T1 - Effects of unlabeled data on classification error in normal discriminant analysis
AU - Takai, Keiji
AU - Hayashi, Kenichi
N1 - Funding Information:
The authors would like to acknowledge the associate editor and two anonymous reviewers for their helpful comments and suggestions. K. Takai is financially supported by JSPS KAKENHI (Grant-in-Aid for Scientific Research) Grant No. 20572019. K. Hayashi is financially supported by JSPS KAKENHI (Grant-in-Aid for Scientific Research) Grant No. 24700276.
PY - 2014/4
Y1 - 2014/4
N2 - Semi-supervised learning, i.e., the estimation of parameters based on both labeled and unlabeled data, is widely believed to be effective in constructing a boundary in classification problems. The present paper investigates whether this belief is true in the case of normal discrimination in terms of the classification error for normal and nonnormal data. For this investigation, we use the framework of missing-data analysis because data consisting of labeled and unlabeled individuals can be regarded as missing data. Based on this framework, we introduce two labeling mechanisms: feature-independent labeling and feature-dependent labeling. For each of these labeling mechanisms, we analytically derive the asymptotic relative efficiency based on the labeled data alone and based on both the labeled and unlabeled data. Numerical computations reveal that (i) under the feature-independent labeling mechanism, unlabeled data tend to contribute to the improvement of the classification error even for nonnormal data and (ii) under the feature-dependent labeling mechanism, unlabeled data from both normal and nonnormal distributions are helpful when the labeled data are informative, but unlabeled data can augment the classification error when the labeled data are not informative. Finally, we describe some future areas of research.
AB - Semi-supervised learning, i.e., the estimation of parameters based on both labeled and unlabeled data, is widely believed to be effective in constructing a boundary in classification problems. The present paper investigates whether this belief is true in the case of normal discrimination in terms of the classification error for normal and nonnormal data. For this investigation, we use the framework of missing-data analysis because data consisting of labeled and unlabeled individuals can be regarded as missing data. Based on this framework, we introduce two labeling mechanisms: feature-independent labeling and feature-dependent labeling. For each of these labeling mechanisms, we analytically derive the asymptotic relative efficiency based on the labeled data alone and based on both the labeled and unlabeled data. Numerical computations reveal that (i) under the feature-independent labeling mechanism, unlabeled data tend to contribute to the improvement of the classification error even for nonnormal data and (ii) under the feature-dependent labeling mechanism, unlabeled data from both normal and nonnormal distributions are helpful when the labeled data are informative, but unlabeled data can augment the classification error when the labeled data are not informative. Finally, we describe some future areas of research.
KW - Asymptotic relative efficiency
KW - Missing data
KW - Nonnormal data
KW - Normal discrimination
KW - Partially labeled data
KW - Semi-supervised learning
KW - Unlabeled data
UR - http://www.scopus.com/inward/record.url?scp=84892551592&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84892551592&partnerID=8YFLogxK
U2 - 10.1016/j.jspi.2013.11.004
DO - 10.1016/j.jspi.2013.11.004
M3 - Article
AN - SCOPUS:84892551592
SN - 0378-3758
VL - 147
SP - 66
EP - 83
JO - Journal of Statistical Planning and Inference
JF - Journal of Statistical Planning and Inference
ER -