Finite-sample analysis of impacts of unlabeled data and their labeling mechanisms in linear discriminant analysis

Kenichi Hayashi, Keiji Takai

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

It is widely believed that unlabeled data are promising for improving prediction accuracy in classification problems. Although theoretical studies about when/how unlabeled data are beneficial exist, an actual prediction improvement has not been sufficiently investigated for a finite sample in a systematic manner. We investigate the impact of unlabeled data in linear discriminant analysis and compare the error rates of the classifiers estimated with/without unlabeled data. Our focus is a labeling mechanism that characterizes the probabilistic structure of occurrence of labeled cases. Results imply that an extremely small proportion of unlabeled data has a large effect on the analysis results.
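The comparison described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation design; it is a minimal example under stated assumptions: two Gaussian classes with a shared identity covariance, equal class priors, labels observed completely at random with probability `p_label`, and an EM fit of the shared-covariance mixture standing in for the estimator that uses unlabeled data. All function names and default parameter values here are hypothetical.

```python
import numpy as np


def fit_lda(X, R):
    """Weighted LDA fit; R[i, k] is the (soft) membership of sample i in class k."""
    Nk = R.sum(axis=0)                        # effective class counts
    if np.any(Nk <= 0):
        raise ValueError("each class needs at least one (soft) member")
    pi = Nk / Nk.sum()                        # class priors
    mu = (R.T @ X) / Nk[:, None]              # class means, shape (2, d)
    d = X.shape[1]
    S = np.zeros((d, d))
    for k in range(2):
        D = X - mu[k]
        S += (R[:, k, None] * D).T @ D        # weighted within-class scatter
    S /= Nk.sum()                             # pooled (shared) covariance
    return pi, mu, S


def log_joint(X, pi, mu, S):
    """log pi_k + log N(x | mu_k, S), up to a constant shared by both classes."""
    P = np.linalg.inv(S)
    out = np.empty((X.shape[0], 2))
    for k in range(2):
        D = X - mu[k]
        out[:, k] = np.log(pi[k]) - 0.5 * np.einsum("ij,jk,ik->i", D, P, D)
    return out


def predict(X, params):
    """Assign each row of X to the class with the larger posterior."""
    return log_joint(X, *params).argmax(axis=1)


def fit_semi_supervised(X, y, labeled, n_iter=50):
    """EM for the shared-covariance mixture; labeled rows keep their known class."""
    R_lab = np.eye(2)[y[labeled]]
    params = fit_lda(X[labeled], R_lab)       # start from the labeled-only fit
    R = np.zeros((X.shape[0], 2))
    R[labeled] = R_lab
    for _ in range(n_iter):
        lj = log_joint(X[~labeled], *params)  # E-step on unlabeled rows only
        lj -= lj.max(axis=1, keepdims=True)
        post = np.exp(lj)
        R[~labeled] = post / post.sum(axis=1, keepdims=True)
        params = fit_lda(X, R)                # M-step on all rows
    return params


def simulate(n=200, p_label=0.2, delta=2.0, d=2, n_test=20000, seed=0):
    """Return test error rates (labeled-only LDA, LDA with unlabeled data)."""
    rng = np.random.default_rng(seed)

    def draw(m):
        y = rng.integers(0, 2, size=m)        # equal priors (assumption)
        X = rng.standard_normal((m, d))       # identity covariance (assumption)
        X[:, 0] += delta * y                  # classes differ in the first coordinate
        return X, y

    X, y = draw(n)
    labeled = rng.random(n) < p_label         # labels missing completely at random
    X_test, y_test = draw(n_test)
    sup = fit_lda(X[labeled], np.eye(2)[y[labeled]])
    semi = fit_semi_supervised(X, y, labeled)
    err_sup = np.mean(predict(X_test, sup) != y_test)
    err_semi = np.mean(predict(X_test, semi) != y_test)
    return err_sup, err_semi


if __name__ == "__main__":
    print(simulate())
```

Repeating `simulate` over many seeds and varying `p_label` gives a rough finite-sample picture of how the two error rates compare. A labeling mechanism that depends on the features or on the class would replace the single line that draws `labeled`.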

Original language: English
Pages (from-to): 184-203
Number of pages: 20
Journal: Communications in Statistics: Simulation and Computation
Volume: 46
Issue number: 1
DOI: 10.1080/03610918.2014.957847
Publication status: Published - 2017 Jan 2
Externally published: Yes

Keywords

  • Classification error
  • Missing data
  • Monte Carlo simulation
  • Relative efficiency
  • Semi-supervised learning

ASJC Scopus subject areas

  • Statistics and Probability
  • Modelling and Simulation

Cite this

@article{a3f025472df342159628f5437b75bea1,
title = "Finite-sample analysis of impacts of unlabeled data and their labeling mechanisms in linear discriminant analysis",
abstract = "It is widely believed that unlabeled data are promising for improving prediction accuracy in classification problems. Although theoretical studies about when/how unlabeled data are beneficial exist, an actual prediction improvement has not been sufficiently investigated for a finite sample in a systematic manner. We investigate the impact of unlabeled data in linear discriminant analysis and compare the error rates of the classifiers estimated with/without unlabeled data. Our focus is a labeling mechanism that characterizes the probabilistic structure of occurrence of labeled cases. Results imply that an extremely small proportion of unlabeled data has a large effect on the analysis results.",
keywords = "Classification error, Missing data, Monte Carlo simulation, Relative efficiency, Semi-supervised learning",
author = "Kenichi Hayashi and Keiji Takai",
year = "2017",
month = "1",
day = "2",
doi = "10.1080/03610918.2014.957847",
language = "English",
volume = "46",
pages = "184--203",
journal = "Communications in Statistics Part B: Simulation and Computation",
issn = "0361-0918",
publisher = "Taylor and Francis Ltd.",
number = "1",
}

TY - JOUR

T1 - Finite-sample analysis of impacts of unlabeled data and their labeling mechanisms in linear discriminant analysis

AU - Hayashi, Kenichi

AU - Takai, Keiji

PY - 2017/1/2

Y1 - 2017/1/2

AB - It is widely believed that unlabeled data are promising for improving prediction accuracy in classification problems. Although theoretical studies about when/how unlabeled data are beneficial exist, an actual prediction improvement has not been sufficiently investigated for a finite sample in a systematic manner. We investigate the impact of unlabeled data in linear discriminant analysis and compare the error rates of the classifiers estimated with/without unlabeled data. Our focus is a labeling mechanism that characterizes the probabilistic structure of occurrence of labeled cases. Results imply that an extremely small proportion of unlabeled data has a large effect on the analysis results.

KW - Classification error

KW - Missing data

KW - Monte Carlo simulation

KW - Relative efficiency

KW - Semi-supervised learning

UR - http://www.scopus.com/inward/record.url?scp=84992395894&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84992395894&partnerID=8YFLogxK

U2 - 10.1080/03610918.2014.957847

DO - 10.1080/03610918.2014.957847

M3 - Article

AN - SCOPUS:84992395894

VL - 46

SP - 184

EP - 203

JO - Communications in Statistics Part B: Simulation and Computation

JF - Communications in Statistics Part B: Simulation and Computation

SN - 0361-0918

IS - 1

ER -