Ducky: A data extraction system for various structured web documents

Kei Kanaoka, Yotaro Fujii, Motomichi Toyama

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

The World Wide Web has become a primary source of information, and extracting data from Web sources has therefore become a key technology. In this paper, we introduce Ducky, a semi-automatic system that includes a Web wrapper which extracts data from Web sources and translates it into structured data. In Ducky, users define a configuration file consisting of several parameters (the URL of the Web page, the CSS selectors that locate the data to retrieve, and so on), so they do not need to write Web scraping programs at all. The definition is simple, yet it can extract data flexibly from various structured Web pages. Additionally, Ducky provides a Web API and several output data formats: XML, JSON, and CSV. Finally, experiments confirmed that Ducky can accurately extract data from 22 different structured Web sources.
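
To make the configuration-driven idea in the abstract concrete, the following is a minimal Python sketch. It is not Ducky's implementation or its configuration format: the key names (`url`, `fields`), the sample HTML, and all function and class names are hypothetical, and the matcher handles only a tiny selector subset (`tag`, `.class`, `tag.class`) rather than full CSS selectors. The page is supplied as a string instead of being fetched, and only JSON and CSV output are shown.

```python
import csv
import io
import json
from html.parser import HTMLParser

# Hypothetical configuration; the key names are illustrative, not Ducky's format.
CONFIG = {
    "url": "http://example.com/books",  # placeholder, not fetched in this sketch
    "fields": {"title": "h2.title", "price": "span.price"},
}

SAMPLE_HTML = """
<div class="book"><h2 class="title">Alpha</h2><span class="price">10</span></div>
<div class="book"><h2 class="title">Beta</h2><span class="price">12</span></div>
"""


def matches(selector, tag, classes):
    """Match only the selector subset 'tag', '.class', or 'tag.class'."""
    want_tag, _, want_class = selector.partition(".")
    return (not want_tag or want_tag == tag) and (not want_class or want_class in classes)


class FieldCollector(HTMLParser):
    """Collect the text of every element that matches a configured selector."""

    def __init__(self, fields):
        super().__init__()
        self.fields = fields
        self.values = {name: [] for name in fields}
        self.open_field = None  # field whose matching element is currently open
        self.open_tag = None
        self.depth = 0          # nested elements with the same tag name

    def handle_starttag(self, tag, attrs):
        if self.open_field is not None:
            if tag == self.open_tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name, selector in self.fields.items():
            if matches(selector, tag, classes):
                self.open_field, self.open_tag = name, tag
                self.values[name].append("")
                break

    def handle_data(self, data):
        if self.open_field is not None:
            self.values[self.open_field][-1] += data

    def handle_endtag(self, tag):
        if self.open_field is None or tag != self.open_tag:
            return
        if self.depth:
            self.depth -= 1
            return
        texts = self.values[self.open_field]
        texts[-1] = texts[-1].strip()
        self.open_field = self.open_tag = None


def extract(html, config):
    """Return one record per row by pairing the i-th match of every field."""
    collector = FieldCollector(config["fields"])
    collector.feed(html)
    collector.close()
    names = list(config["fields"])
    columns = [collector.values[name] for name in names]
    return [dict(zip(names, row)) for row in zip(*columns)]


def to_csv(records, names):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = extract(SAMPLE_HTML, CONFIG)
print(json.dumps(records, indent=2))
print(to_csv(records, list(CONFIG["fields"])))
```

In this sketch, changing what is extracted means editing only `CONFIG`, which mirrors the abstract's claim that users describe the target data rather than write a scraper. A real system would need a complete selector engine and handling for rows in which a field is missing.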

Original language: English
Title of host publication: ACM International Conference Proceeding Series
Publisher: Association for Computing Machinery
Pages: 342-347
Number of pages: 6
ISBN (Print): 9781450326278
DOI: https://doi.org/10.1145/2628194.2628244
Publication status: Published - 2014
Event: 18th International Database Engineering and Applications Symposium, IDEAS 2014 - Porto, Portugal
Duration: 2014 Jul 7 to 2014 Jul 9

Other

Other: 18th International Database Engineering and Applications Symposium, IDEAS 2014
Country: Portugal
City: Porto
Period: 14/7/7 to 14/7/9

Fingerprint

World Wide Web
Websites
Application programming interfaces (API)
XML

Keywords

  • CSS selector
  • Data extraction
  • Web scraping
  • Web wrapper

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Kanaoka, K., Fujii, Y., & Toyama, M. (2014). Ducky: A data extraction system for various structured web documents. In ACM International Conference Proceeding Series (pp. 342-347). Association for Computing Machinery. https://doi.org/10.1145/2628194.2628244

@inproceedings{98d55a66f75844f59da5bfabe7da0d90,
title = "Ducky: A data extraction system for various structured web documents",
keywords = "CSS selector, Data extraction, Web scraping, Web wrapper",
author = "Kei Kanaoka and Yotaro Fujii and Motomichi Toyama",
year = "2014",
doi = "10.1145/2628194.2628244",
language = "English",
isbn = "9781450326278",
pages = "342--347",
booktitle = "ACM International Conference Proceeding Series",
publisher = "Association for Computing Machinery",

}

TY - GEN

T1 - Ducky

T2 - A data extraction system for various structured web documents

AU - Kanaoka, Kei

AU - Fujii, Yotaro

AU - Toyama, Motomichi

PY - 2014

Y1 - 2014

KW - CSS selector

KW - Data extraction

KW - Web scraping

KW - Web wrapper

UR - http://www.scopus.com/inward/record.url?scp=84906808321&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84906808321&partnerID=8YFLogxK

U2 - 10.1145/2628194.2628244

DO - 10.1145/2628194.2628244

M3 - Conference contribution

AN - SCOPUS:84906808321

SN - 9781450326278

SP - 342

EP - 347

BT - ACM International Conference Proceeding Series

PB - Association for Computing Machinery

ER -