SDC: Structured data collection by yourself

Takuya Ohshima, Motomichi Toyama

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In our research, we focus on ways to crawl and scrape structured data embedded in web pages and to collect it as a usable data set. To realize large-scale structured data collection, we developed SDC, a crawler specialized for structured data extraction. SDC integrates the three modules required for extracting structured data: crawling, scraping, and output generation. In particular, we exploit the fact that structured data uses only simple syntax and semantics. As a result, information can be extracted from web pages across multiple domains with a single, simple configuration, without having to consider the hierarchical structure of websites or the DOM structure of web pages. Users can also display part of the extracted data as a preview to check whether it is what they need. However, even when structured data is used, the way elements are written can differ from domain to domain. For example, one e-book sales website gives each author of a book a separate element, while another site describes multiple authors as a single comma-separated string. In addition, some websites implement navigation such as pagination using Ajax, so the corresponding JavaScript must be executed to obtain hyperlinks from them. To cope with these differences between domains, users can quickly specify a supplementary data extraction mechanism for individual websites. In an experiment, we applied our system to 500 top domains whose websites have a large number of links; structured data was present on 254 of these sites, and we successfully extracted it from 243 of them.
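The author example in the abstract can be illustrated with a minimal sketch. This is not the SDC implementation. It assumes the structured data is embedded as JSON-LD in `<script type="application/ld+json">` blocks and uses only the Python standard library. The names `JsonLdParser`, `extract_jsonld`, and `normalize_authors` are hypothetical and chosen for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of JSON-LD script blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the page


def extract_jsonld(html):
    """Return every JSON-LD object found in an HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items


def normalize_authors(value):
    """Map the author shapes described in the abstract to a list of names.

    Assumption: a plain string is a comma-separated list. This would
    wrongly split names written as "Last, First".
    """
    if value is None:
        return []
    if isinstance(value, str):
        return [name.strip() for name in value.split(",") if name.strip()]
    if isinstance(value, dict):
        return normalize_authors(value.get("name"))
    names = []
    for entry in value:  # a list with one element per author
        names.extend(normalize_authors(entry))
    return names
```

With a per-site rule like `normalize_authors`, a page that lists one `Person` object per author and a page that gives `"author": "Ann Lee, Bo Chen"` both yield the same list of names, which is the kind of domain-specific adjustment the abstract describes.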

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Information Systems and Technologies, ICIST 2018
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450364041
Publication status: Published - 2018 Mar 16
Event: 8th International Conference on Information Systems and Technologies, ICIST 2018 - Istanbul, Turkey
Duration: 2018 Mar 16 to 2018 Mar 18

Other

Other: 8th International Conference on Information Systems and Technologies, ICIST 2018
Country/Territory: Turkey
City: Istanbul
Period: 18/3/16 to 18/3/18

Keywords

  • JSON-LD
  • Microdata
  • Schema.org
  • Semantic web
  • Structured data

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software
