Effects of unlabeled data on classification error in normal discriminant analysis

Keiji Takai, Kenichi Hayashi

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Semi-supervised learning, i.e., the estimation of parameters based on both labeled and unlabeled data, is widely believed to be effective in constructing a boundary in classification problems. The present paper investigates whether this belief is true in the case of normal discrimination in terms of the classification error for normal and nonnormal data. For this investigation, we use the framework of missing-data analysis because data consisting of labeled and unlabeled individuals can be regarded as missing data. Based on this framework, we introduce two labeling mechanisms: feature-independent labeling and feature-dependent labeling. For each of these labeling mechanisms, we analytically derive the asymptotic relative efficiency based on the labeled data alone and based on both the labeled and unlabeled data. Numerical computations reveal that (i) under the feature-independent labeling mechanism, unlabeled data tend to contribute to the improvement of the classification error even for nonnormal data and (ii) under the feature-dependent labeling mechanism, unlabeled data from both normal and nonnormal distributions are helpful when the labeled data are informative, but unlabeled data can augment the classification error when the labeled data are not informative. Finally, we describe some future areas of research.
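The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or their exact setup. It assumes univariate two-class normal data with equal priors and unit variance, a constant labeling probability for feature-independent labeling, and a hypothetical logistic labeling probability for feature-dependent labeling. All function names and parameter values are illustrative. The labeled-data-only fit uses the usual normal discrimination estimates, and the fit from labeled and unlabeled data uses EM, treating the unobserved labels as missing data.

```python
import math
import numpy as np


def simulate(n, delta=2.0, seed=0):
    """Two equally likely classes with x | y ~ N(delta * y, 1) (assumed setup)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=delta * y, scale=1.0)
    return x, y


def label_mask(x, mechanism, rng, p=0.2):
    """Return True where the class label is observed."""
    if mechanism == "feature-independent":
        prob = np.full(x.shape, p)               # does not depend on x
    elif mechanism == "feature-dependent":
        prob = 1.0 / (1.0 + np.exp(-(x - 1.0)))  # hypothetical dependence on x
    else:
        raise ValueError(mechanism)
    return rng.random(x.shape) < prob


def fit_labeled(x, y):
    """Estimates (pi1, mu0, mu1, common variance) from labeled data alone."""
    x0, x1 = x[y == 0], x[y == 1]
    mu0, mu1 = x0.mean(), x1.mean()
    var = (((x0 - mu0) ** 2).sum() + ((x1 - mu1) ** 2).sum()) / len(x)
    return y.mean(), mu0, mu1, var


def fit_em(x, y, labeled, n_iter=200):
    """EM estimates from labeled and unlabeled data.

    Entries of y at unlabeled positions are never used.
    """
    pi1, mu0, mu1, var = fit_labeled(x[labeled], y[labeled])
    for _ in range(n_iter):
        # E-step: posterior probability of class 1 for every observation.
        log1 = np.log(pi1) - (x - mu1) ** 2 / (2 * var)
        log0 = np.log(1 - pi1) - (x - mu0) ** 2 / (2 * var)
        post = 1.0 / (1.0 + np.exp(log0 - log1))
        r = np.where(labeled, y, post)  # observed labels stay fixed
        # M-step: weighted normal-discrimination estimates.
        pi1 = r.mean()
        mu1 = (r * x).sum() / r.sum()
        mu0 = ((1 - r) * x).sum() / (1 - r).sum()
        var = (r * (x - mu1) ** 2 + (1 - r) * (x - mu0) ** 2).sum() / len(x)
    return pi1, mu0, mu1, var


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def classification_error(params, delta=2.0):
    """True error of the rule 'class 1 if x > c' under the simulated model.

    Assumes the estimated means satisfy mu1 > mu0.
    """
    pi1, mu0, mu1, var = params
    c = 0.5 * (mu0 + mu1) + var * math.log((1 - pi1) / pi1) / (mu1 - mu0)
    return 0.5 * (1.0 - phi(c)) + 0.5 * phi(c - delta)


if __name__ == "__main__":
    x, y = simulate(2000, seed=1)
    rng = np.random.default_rng(2)
    for mech in ("feature-independent", "feature-dependent"):
        lab = label_mask(x, mech, rng)
        e_lab = classification_error(fit_labeled(x[lab], y[lab]))
        e_all = classification_error(fit_em(x, y, lab))
        print(mech, "labeled only:", e_lab, "labeled + unlabeled:", e_all)
```

A single run like this only compares two error values for one simulated sample. The asymptotic relative efficiency studied in the paper is an analytical quantity and is not computed here.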

Original language: English
Pages (from-to): 66-83
Number of pages: 18
Journal: Journal of Statistical Planning and Inference
Volume: 147
DOI: 10.1016/j.jspi.2013.11.004
Publication status: Published - 2014 Apr 1
Externally published: Yes

Keywords

  • Asymptotic relative efficiency
  • Missing data
  • Nonnormal data
  • Normal discrimination
  • Partially labeled data
  • Semi-supervised learning
  • Unlabeled data

ASJC Scopus subject areas

  • Statistics and Probability
  • Statistics, Probability and Uncertainty
  • Applied Mathematics

Cite this

Effects of unlabeled data on classification error in normal discriminant analysis. / Takai, Keiji; Hayashi, Kenichi.

In: Journal of Statistical Planning and Inference, Vol. 147, 01.04.2014, p. 66-83.

@article{9b93b03f17134eef94c1a21b0f749692,
title = "Effects of unlabeled data on classification error in normal discriminant analysis",
abstract = "Semi-supervised learning, i.e., the estimation of parameters based on both labeled and unlabeled data, is widely believed to be effective in constructing a boundary in classification problems. The present paper investigates whether this belief is true in the case of normal discrimination in terms of the classification error for normal and nonnormal data. For this investigation, we use the framework of missing-data analysis because data consisting of labeled and unlabeled individuals can be regarded as missing data. Based on this framework, we introduce two labeling mechanisms: feature-independent labeling and feature-dependent labeling. For each of these labeling mechanisms, we analytically derive the asymptotic relative efficiency based on the labeled data alone and based on both the labeled and unlabeled data. Numerical computations reveal that (i) under the feature-independent labeling mechanism, unlabeled data tend to contribute to the improvement of the classification error even for nonnormal data and (ii) under the feature-dependent labeling mechanism, unlabeled data from both normal and nonnormal distributions are helpful when the labeled data are informative, but unlabeled data can augment the classification error when the labeled data are not informative. Finally, we describe some future areas of research.",
keywords = "Asymptotic relative efficiency, Missing data, Nonnormal data, Normal discrimination, Partially labeled data, Semi-supervised learning, Unlabeled data",
author = "Keiji Takai and Kenichi Hayashi",
year = "2014",
month = "4",
day = "1",
doi = "10.1016/j.jspi.2013.11.004",
language = "English",
volume = "147",
pages = "66--83",
journal = "Journal of Statistical Planning and Inference",
issn = "0378-3758",
publisher = "Elsevier",
}

TY  - JOUR
T1  - Effects of unlabeled data on classification error in normal discriminant analysis
AU  - Takai, Keiji
AU  - Hayashi, Kenichi
PY  - 2014/4/1
Y1  - 2014/4/1
N2  - Semi-supervised learning, i.e., the estimation of parameters based on both labeled and unlabeled data, is widely believed to be effective in constructing a boundary in classification problems. The present paper investigates whether this belief is true in the case of normal discrimination in terms of the classification error for normal and nonnormal data. For this investigation, we use the framework of missing-data analysis because data consisting of labeled and unlabeled individuals can be regarded as missing data. Based on this framework, we introduce two labeling mechanisms: feature-independent labeling and feature-dependent labeling. For each of these labeling mechanisms, we analytically derive the asymptotic relative efficiency based on the labeled data alone and based on both the labeled and unlabeled data. Numerical computations reveal that (i) under the feature-independent labeling mechanism, unlabeled data tend to contribute to the improvement of the classification error even for nonnormal data and (ii) under the feature-dependent labeling mechanism, unlabeled data from both normal and nonnormal distributions are helpful when the labeled data are informative, but unlabeled data can augment the classification error when the labeled data are not informative. Finally, we describe some future areas of research.
AB  - Semi-supervised learning, i.e., the estimation of parameters based on both labeled and unlabeled data, is widely believed to be effective in constructing a boundary in classification problems. The present paper investigates whether this belief is true in the case of normal discrimination in terms of the classification error for normal and nonnormal data. For this investigation, we use the framework of missing-data analysis because data consisting of labeled and unlabeled individuals can be regarded as missing data. Based on this framework, we introduce two labeling mechanisms: feature-independent labeling and feature-dependent labeling. For each of these labeling mechanisms, we analytically derive the asymptotic relative efficiency based on the labeled data alone and based on both the labeled and unlabeled data. Numerical computations reveal that (i) under the feature-independent labeling mechanism, unlabeled data tend to contribute to the improvement of the classification error even for nonnormal data and (ii) under the feature-dependent labeling mechanism, unlabeled data from both normal and nonnormal distributions are helpful when the labeled data are informative, but unlabeled data can augment the classification error when the labeled data are not informative. Finally, we describe some future areas of research.
KW  - Asymptotic relative efficiency
KW  - Missing data
KW  - Nonnormal data
KW  - Normal discrimination
KW  - Partially labeled data
KW  - Semi-supervised learning
KW  - Unlabeled data
UR  - http://www.scopus.com/inward/record.url?scp=84892551592&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84892551592&partnerID=8YFLogxK
U2  - 10.1016/j.jspi.2013.11.004
DO  - 10.1016/j.jspi.2013.11.004
M3  - Article
VL  - 147
SP  - 66
EP  - 83
JO  - Journal of Statistical Planning and Inference
JF  - Journal of Statistical Planning and Inference
SN  - 0378-3758
ER  -