Retrieving of Data Similarity using Metadata on a Data Analysis Competition Platform

Hiroki Sakaji, Teruaki Hayashi, Yoshiaki Fukami, Takumi Shimizu, Hiroyasu Matsushima, Kiyoshi Izumi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In recent years, instead of keeping data and analysis skills in-house, there has been much interest in widely releasing data analysis knowledge on the web. A data exchange platform is a type of digital platform on which stakeholders, e.g., data owners, users, and analysts, exchange data. However, the datasets handled on such platforms are independently acquired and stored by the data providers for their own purposes. These datasets were not created on the premise of coordination and combination, and there is currently little information available for discussing their systematic organization and combination. In this study, we focus on metadata, i.e., summary information about data, and examine the similarity of data on a data exchange platform using natural language processing. In our experiments, we use metadata from the data exchange platform Kaggle. To compare the similarity of the data, our method employs word2vec and BERT as vectorization methods and converts data descriptions to vectors. Our method then measures the distance between each pair of vectors by calculating their cosine similarity. From the experimental results, we found that Kaggle has the same characteristics as other data exchange platforms. Additionally, the results indicate the usability of the natural language processing-based method for extracting similar data pairs.
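The pipeline described in the abstract (convert each data description to a vector, then compare vectors by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: to stay self-contained it swaps in a simple bag-of-words count vector where the paper uses word2vec and BERT, and the dataset names and descriptions are hypothetical, invented only for illustration. The helper names `embed`, `cosine_similarity`, and `rank_similar_pairs` are likewise assumptions.

```python
import math
import re
from collections import Counter
from itertools import combinations


def embed(text):
    """Stand-in vectorizer: bag-of-words term counts (NOT word2vec or BERT)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as term->count maps."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty description is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_similar_pairs(descriptions):
    """Return (similarity, name_a, name_b) for every pair, most similar first."""
    vectors = {name: embed(text) for name, text in descriptions.items()}
    pairs = [
        (cosine_similarity(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(pairs, reverse=True)


# Hypothetical dataset descriptions, invented for illustration only.
descriptions = {
    "housing": "house prices and housing sales records by city",
    "rent": "apartment rent prices by city and neighborhood",
    "tweets": "sentiment labels for short social media posts",
}

ranked = rank_similar_pairs(descriptions)
for score, a, b in ranked:
    print(f"{a} <-> {b}: {score:.3f}")
```

Replacing `embed` with a function that returns a dense embedding (for example, averaged word vectors or a sentence-level encoder output) would leave the pairwise ranking step unchanged, provided `cosine_similarity` is adapted to dense vectors.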

Original language: English
Title of host publication: Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
Editors: Yixin Chen, Heiko Ludwig, Yicheng Tu, Usama Fayyad, Xingquan Zhu, Xiaohua Tony Hu, Suren Byna, Xiong Liu, Jianping Zhang, Shirui Pan, Vagelis Papalexakis, Jianwu Wang, Alfredo Cuzzocrea, Carlos Ordonez
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3480-3485
Number of pages: 6
ISBN (Electronic): 9781665439022
Publication status: Published - 2021
Event: 2021 IEEE International Conference on Big Data, Big Data 2021 - Virtual, Online, United States
Duration: 2021 Dec 15 → 2021 Dec 18

Publication series

Name: Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021

Conference

Conference: 2021 IEEE International Conference on Big Data, Big Data 2021
Country/Territory: United States
City: Virtual, Online
Period: 21/12/15 → 21/12/18

Keywords

  • Data Analysis Competition Platform
  • Data Similarity
  • Metadata
  • Text Mining
  • Word Embeddings

ASJC Scopus subject areas

  • Information Systems and Management
  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Information Systems
