SDC: Structured data collection by yourself

Takuya Ohshima, Motomichi Toyama

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this research, we focus on ways to crawl and scrape structured data embedded in web pages and to collect it as a usable data set. To realize large-scale structured data collection, we developed SDC, a crawler specialized for structured data extraction. SDC integrates the three modules required for extracting structured data: crawling, scraping, and output generation. In particular, we take advantage of the fact that structured data uses only simple syntax and semantics. As a result, information can be extracted from web pages across multiple domains with a single, simple configuration, without having to consider the hierarchical structure of websites or the DOM structure of web pages. Users can also display part of the extracted data as a preview to check whether it is what they need. However, even when structured data is used, the way elements are written can differ from domain to domain. For example, one e-book sales website gives each author of a book a separate element, whereas another site describes multiple authors as a single comma-separated string. In addition, some websites implement navigation such as pagination using Ajax, and the corresponding JavaScript must be executed to obtain hyperlinks from such sites. To cope with these differences between domains, users can quickly specify supplementary data extraction mechanisms for individual websites. In our experiment, we applied the system to 500 top domains with a large number of links on their websites; structured data was present on 254 of these sites, and we successfully extracted it from 243.
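
The author-format difference described in the abstract can be illustrated with a minimal sketch. This is not SDC's implementation: it assumes the structured data is published as JSON-LD in `<script type="application/ld+json">` blocks, and the names `JsonLdParser`, `normalize_authors`, `extract_authors`, `SITE_A`, and `SITE_B` are hypothetical, introduced only for this illustration. It shows how one normalization step could map both author representations to the same list.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of aborting the crawl


def normalize_authors(value):
    """Return a flat list of author names, however the site wrote them."""
    if value is None:
        return []
    if isinstance(value, dict):
        value = [value]           # a single author object
    elif isinstance(value, str):
        value = value.split(",")  # comma-separated string of authors
    names = []
    for entry in value:
        name = entry.get("name", "") if isinstance(entry, dict) else str(entry)
        name = name.strip()
        if name:
            names.append(name)
    return names


def extract_authors(html):
    """Parse one page and return the author list of each top-level JSON-LD object."""
    parser = JsonLdParser()
    parser.feed(html)
    return [
        normalize_authors(item.get("author"))
        for item in parser.items
        if isinstance(item, dict)
    ]


# Hypothetical pages: one element per author vs. one comma-separated string.
SITE_A = (
    '<script type="application/ld+json">'
    '{"@type": "Book", "name": "X", "author": ['
    '{"@type": "Person", "name": "Ann Lee"}, '
    '{"@type": "Person", "name": "Bo Chen"}]}'
    "</script>"
)
SITE_B = (
    '<script type="application/ld+json">'
    '{"@type": "Book", "name": "X", "author": "Ann Lee, Bo Chen"}'
    "</script>"
)

print(extract_authors(SITE_A))
print(extract_authors(SITE_B))
```

Both calls yield the same author list, which is the kind of per-site difference the abstract says users can resolve with a supplementary extraction rule. Microdata markup and Ajax-driven pagination are outside the scope of this sketch.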

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Information Systems and Technologies, ICIST 2018
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450364041
DOI: https://doi.org/10.1145/3200842.3200849
Publication status: Published - 2018 Mar 16
Event: 8th International Conference on Information Systems and Technologies, ICIST 2018 - Istanbul, Turkey
Duration: 2018 Mar 16 → 2018 Mar 18

Keywords

  • JSON-LD
  • Microdata
  • Schema.org
  • Semantic web
  • Structured data

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Ohshima, T., & Toyama, M. (2018). SDC: Structured data collection by yourself. In Proceedings of the 8th International Conference on Information Systems and Technologies, ICIST 2018 [Y] Association for Computing Machinery. https://doi.org/10.1145/3200842.3200849

@inproceedings{ac9ad22093f148e3b12e0353e834a0ec,
title = "SDC: Structured data collection by yourself",
keywords = "JSON-LD, Microdata, Schema.org, Semantic web, Structured data",
author = "Takuya Ohshima and Motomichi Toyama",
year = "2018",
month = "3",
day = "16",
doi = "10.1145/3200842.3200849",
language = "English",
booktitle = "Proceedings of the 8th International Conference on Information Systems and Technologies, ICIST 2018",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - SDC

T2 - Structured data collection by yourself

AU - Ohshima, Takuya

AU - Toyama, Motomichi

PY - 2018/3/16

Y1 - 2018/3/16

KW - JSON-LD

KW - Microdata

KW - Schema.org

KW - Semantic web

KW - Structured data

UR - http://www.scopus.com/inward/record.url?scp=85048027600&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85048027600&partnerID=8YFLogxK

U2 - 10.1145/3200842.3200849

DO - 10.1145/3200842.3200849

M3 - Conference contribution

AN - SCOPUS:85048027600

BT - Proceedings of the 8th International Conference on Information Systems and Technologies, ICIST 2018

PB - Association for Computing Machinery

ER -