TY - GEN
T1 - Retrieving of Data Similarity using Metadata on a Data Analysis Competition Platform
AU - Sakaji, Hiroki
AU - Hayashi, Teruaki
AU - Fukami, Yoshiaki
AU - Shimizu, Takumi
AU - Matsushima, Hiroyasu
AU - Izumi, Kiyoshi
N1 - Funding Information:
This study was supported by JSPS KAKENHI (JP20H02384).
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - In recent years, rather than keeping data and analysis skills in-house, there has been much interest in releasing data analysis knowledge widely on the web. A data exchange platform is a type of digital platform on which stakeholders, e.g., data owners, users, and analysts, exchange data. However, the datasets handled on such platforms are independently acquired and stored by the data providers for their own purposes. These datasets were not created with coordination or combination in mind, and little information is currently available to support their systematic organization and combination. In this study, we focus on metadata, i.e., summary information about data, and examine the similarity of data on a data exchange platform using natural language processing. In our experiments, we use the metadata from the data exchange platform Kaggle. To compare the similarity of the data, our method uses word2vec and BERT as vectorization methods to convert data descriptions into vectors. It then measures how close the vectors are by calculating the cosine similarity between each pair of vectors. From the experimental results, we found that Kaggle has the same characteristics as other data exchange platforms. Additionally, the results indicated the usefulness of the natural language processing-based method for extracting similar data pairs.
AB - In recent years, rather than keeping data and analysis skills in-house, there has been much interest in releasing data analysis knowledge widely on the web. A data exchange platform is a type of digital platform on which stakeholders, e.g., data owners, users, and analysts, exchange data. However, the datasets handled on such platforms are independently acquired and stored by the data providers for their own purposes. These datasets were not created with coordination or combination in mind, and little information is currently available to support their systematic organization and combination. In this study, we focus on metadata, i.e., summary information about data, and examine the similarity of data on a data exchange platform using natural language processing. In our experiments, we use the metadata from the data exchange platform Kaggle. To compare the similarity of the data, our method uses word2vec and BERT as vectorization methods to convert data descriptions into vectors. It then measures how close the vectors are by calculating the cosine similarity between each pair of vectors. From the experimental results, we found that Kaggle has the same characteristics as other data exchange platforms. Additionally, the results indicated the usefulness of the natural language processing-based method for extracting similar data pairs.
KW - Data Analysis Competition Platform
KW - Data Similarity
KW - Metadata
KW - Text Mining
KW - Word Embeddings
UR - http://www.scopus.com/inward/record.url?scp=85125301491&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125301491&partnerID=8YFLogxK
U2 - 10.1109/BigData52589.2021.9671414
DO - 10.1109/BigData52589.2021.9671414
M3 - Conference contribution
AN - SCOPUS:85125301491
T3 - Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
SP - 3480
EP - 3485
BT - Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
A2 - Chen, Yixin
A2 - Ludwig, Heiko
A2 - Tu, Yicheng
A2 - Fayyad, Usama
A2 - Zhu, Xingquan
A2 - Hu, Xiaohua Tony
A2 - Byna, Suren
A2 - Liu, Xiong
A2 - Zhang, Jianping
A2 - Pan, Shirui
A2 - Papalexakis, Vagelis
A2 - Wang, Jianwu
A2 - Cuzzocrea, Alfredo
A2 - Ordonez, Carlos
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Conference on Big Data, Big Data 2021
Y2 - 15 December 2021 through 18 December 2021
ER -