Abstract
This research focuses on crawling and scraping the structured data embedded in web pages and collecting it as a usable data set. To enable large-scale structured data collection, we developed SDC, a crawler specialized for structured data extraction. SDC integrates the three modules required for extracting structured data: crawling, scraping, and output generation. In particular, we exploit the fact that structured data uses only simple syntax and semantics. As a result, information can be extracted from web pages across multiple domains with a single, simple configuration, without considering the hierarchical structure of each website or the DOM structure of its pages. Users can also display part of the extracted data as a preview to check whether it is what they need. However, even when structured data is used, the way elements are written can differ from domain to domain. For example, one e-book sales website assigns a separate element to each author of a book, whereas another site describes multiple authors as a single comma-separated string. Some websites also implement navigation such as pagination with Ajax, so the corresponding JavaScript must be executed to obtain hyperlinks from those sites. To cope with these differences between domains, users can quickly specify supplementary data extraction mechanisms for individual websites. In our experiment, we applied our system to the 500 top domains with a large number of links; structured data was present on 254 of these sites, and we successfully extracted it from 243.
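The per-domain difference in author markup mentioned in the abstract can be illustrated with a minimal sketch. It is not SDC's implementation: the names `JsonLdParser`, `extract_jsonld`, and `normalize_authors` are hypothetical, only JSON-LD (not Microdata) is handled, and the comma split assumes that no single author name itself contains a comma.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed content of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block instead of failing the whole page


def extract_jsonld(html):
    """Return the list of JSON-LD objects embedded in an HTML string."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return parser.items


def normalize_authors(value):
    """Flatten an author value into a list of names.

    Accepts a list of objects or strings, a single object, or a
    comma-separated string (assumption: names contain no commas).
    """
    if value is None:
        return []
    entries = value if isinstance(value, list) else [value]
    names = []
    for entry in entries:
        if isinstance(entry, dict):
            entry = entry.get("name", "")
        names.extend(part.strip() for part in str(entry).split(",") if part.strip())
    return names
```

With this sketch, a page that lists one `author` object per person and a page that gives `"author": "Ann Lee, Bo Chen"` both normalize to the same list of names, which is the kind of site-specific adjustment the abstract describes users supplying.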
Original language | English |
---|---|
Title of host publication | Proceedings of the 8th International Conference on Information Systems and Technologies, ICIST 2018 |
Publisher | Association for Computing Machinery |
ISBN (Electronic) | 9781450364041 |
Publication status | Published - 2018 Mar 16 |
Event | 8th International Conference on Information Systems and Technologies, ICIST 2018, Istanbul, Turkey (2018 Mar 16 → 2018 Mar 18) |
Other
Other | 8th International Conference on Information Systems and Technologies, ICIST 2018 |
---|---|
Country/Territory | Turkey |
City | Istanbul |
Period | 2018 Mar 16 → 2018 Mar 18 |
Keywords
- JSON-LD
- Microdata
- Schema.org
- Semantic web
- Structured data
ASJC Scopus subject areas
- Human-Computer Interaction
- Computer Networks and Communications
- Computer Vision and Pattern Recognition
- Software