Ducky: A data extraction system for various structured web documents

Kei Kanaoka, Yotaro Fujii, Motomichi Toyama

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

The World Wide Web has become a primary source of information. Therefore, extracting data from Web sources has become a key technology. In this paper, we introduce Ducky, a semi-automatic system that includes a Web wrapper which extracts data from Web sources and translates them into structured data. In Ducky, users define a configuration file consisting of several parameters (the URL of the Web page, CSS selectors that locate the data to retrieve, and so on), so they do not need to write Web scraping programs at all. The definition is simple, yet it can extract data flexibly from various structured Web pages. Additionally, Ducky provides a Web API and several output data formats: XML, JSON, and CSV. Finally, experiments confirmed that Ducky can accurately extract data from 22 different structured Web sources.
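
The configuration-driven idea in the abstract can be illustrated with a minimal sketch. It is not Ducky's implementation or its configuration syntax: the `CONFIG` keys, the field names, the sample HTML, and the helper names are all hypothetical, the URL is never fetched, and only a tiny subset of CSS selectors (`tag` or `tag.class`) is supported, using the Python standard library alone.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical configuration; these keys are illustrative, not Ducky's format.
CONFIG = {
    "url": "http://example.com/books",  # assumed URL, not fetched in this sketch
    "fields": {"title": "span.title", "price": "span.price"},
}

# Inline sample page standing in for the document at CONFIG["url"].
SAMPLE_HTML = """
<ul>
  <li><span class="title">Book A</span> <span class="price">10</span></li>
  <li><span class="title">Book B</span> <span class="price">12</span></li>
</ul>
"""


class SimpleSelectorParser(HTMLParser):
    """Collects the text of elements matching 'tag' or 'tag.class' only."""

    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.depth = 0  # greater than 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1  # nested element with the same tag name
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def select_text(html, selector):
    """Return the stripped text of every element matching the selector."""
    parser = SimpleSelectorParser(selector)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


def extract(html, config):
    """Apply each configured selector and zip the columns into records."""
    columns = {name: select_text(html, sel) for name, sel in config["fields"].items()}
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


def to_json(records):
    return json.dumps(records)


def to_csv(records):
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


records = extract(SAMPLE_HTML, CONFIG)
print(to_json(records))
print(to_csv(records))
```

In this sketch, changing what is extracted means editing `CONFIG` rather than the code, which is the property the abstract attributes to Ducky. A real wrapper would need a full CSS selector engine and HTTP fetching, both omitted here.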

Original language: English
Title of host publication: Proceedings of the 18th International Database Engineering and Applications Symposium, IDEAS 2014
Publisher: Association for Computing Machinery
Pages: 342-347
Number of pages: 6
ISBN (Print): 9781450326278
DOIs: https://doi.org/10.1145/2628194.2628244
Publication status: Published - 2014 Jan 1
Event: 18th International Database Engineering and Applications Symposium, IDEAS 2014 - Porto, Portugal
Duration: 2014 Jul 7 to 2014 Jul 9

Publication series

Name: ACM International Conference Proceeding Series

Other

Other: 18th International Database Engineering and Applications Symposium, IDEAS 2014
Country: Portugal
City: Porto
Period: 14/7/7 to 14/7/9

Keywords

  • CSS selector
  • Data extraction
  • Web scraping
  • Web wrapper

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications

Cite this

Kanaoka, K., Fujii, Y., & Toyama, M. (2014). Ducky: A data extraction system for various structured web documents. In Proceedings of the 18th International Database Engineering and Applications Symposium, IDEAS 2014 (pp. 342-347). (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/2628194.2628244