Retrieving of Data Similarity using Metadata on a Data Analysis Competition Platform

Hiroki Sakaji, Teruaki Hayashi, Yoshiaki Fukami, Takumi Shimizu, Hiroyasu Matsushima, Kiyoshi Izumi

Research output: Conference contribution

Abstract

In recent years, there has been much interest in releasing data and data analysis knowledge widely on the web rather than keeping data and analysis skills closed in-house. A data exchange platform is a type of digital platform on which stakeholders such as data owners, users, and analysts exchange data. However, the datasets handled on such platforms are acquired and stored independently by the data providers for their own purposes. These datasets were not created with coordination or combination in mind, and there is currently little information available for discussing how to organize and combine them systematically. In this study, we focus on metadata, the summary information of data, and examine the similarity of data on a data exchange platform using natural language processing. In our experiments, we use metadata from the data exchange platform Kaggle. To compare the similarity of the data, our method employs word2vec and BERT as vectorization methods and converts data descriptions into vectors. It then measures the distance between each pair of vectors by calculating their cosine similarity. The experimental results show that Kaggle has the same characteristics as other data exchange platforms. The results also indicate that the natural language processing-based method is usable for extracting similar data pairs.
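The pipeline described in the abstract (vectorize each data description, then compare vectors by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `vectorize` function is a toy stand-in for the word2vec and BERT embeddings used in the paper, and the dataset names and descriptions are hypothetical, invented only for illustration.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from itertools import combinations


def vectorize(description: str) -> Counter:
    """Toy stand-in for word2vec/BERT: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", description.lower()))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_similar_pairs(descriptions: dict[str, str]) -> list[tuple[str, str, float]]:
    """Score every pair of descriptions and sort from most to least similar."""
    vectors = {name: vectorize(text) for name, text in descriptions.items()}
    pairs = [
        (a, b, cosine_similarity(vectors[a], vectors[b]))
        for a, b in combinations(vectors, 2)
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical dataset descriptions, invented for illustration only.
descriptions = {
    "housing_a": "House prices and features for residential sales",
    "housing_b": "Residential house sales with prices and locations",
    "tweets": "Short text messages labeled with sentiment",
}
ranked = rank_similar_pairs(descriptions)
for name_a, name_b, score in ranked:
    print(f"{name_a} / {name_b}: {score:.3f}")
```

Swapping `vectorize` for a function that returns dense embeddings would require only a matching change to `cosine_similarity`; the pairwise ranking step stays the same.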

Original language: English
Title of host publication: Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
Editors: Yixin Chen, Heiko Ludwig, Yicheng Tu, Usama Fayyad, Xingquan Zhu, Xiaohua Tony Hu, Suren Byna, Xiong Liu, Jianping Zhang, Shirui Pan, Vagelis Papalexakis, Jianwu Wang, Alfredo Cuzzocrea, Carlos Ordonez
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3480-3485
Number of pages: 6
ISBN (Electronic): 9781665439022
DOI
Publication status: Published - 2021
Event: 2021 IEEE International Conference on Big Data, Big Data 2021 - Virtual, Online, United States
Duration: 15 Dec 2021 to 18 Dec 2021

Publication series

Name: Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021

Conference

Conference: 2021 IEEE International Conference on Big Data, Big Data 2021
Country/Territory: United States
City: Virtual, Online
Period: 21/12/15 to 21/12/18

ASJC Scopus subject areas

  • Information Systems and Management
  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Information Systems
