Effective web data extraction with Ducky

Kei Kanaoka, Motomichi Toyama

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

The World Wide Web has become an invaluable source of data. However, extracting useful information from it can be challenging: depending on the amount of data needed, manual extraction or the creation of web scraping programs may be necessary, and both processes can be tedious and complicated. To address these problems, Ducky has been developed: a web wrapper that extracts data from web sources and translates it into structured data based on a user-defined configuration. Ducky can flexibly extract data from a variety of structured web pages, remove noise from the extracted data, and integrate multiple pages from different sites. In addition, the current version of Ducky automatically extracts data from Wikipedia and trending keywords from Google and Yahoo.
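The idea of configuration-driven extraction described in the abstract can be illustrated with a minimal sketch. This is not Ducky's actual configuration format or implementation, which this record does not describe. The `extract` function, the `CONFIG` mapping, and the restriction to simple `tag.class` selectors are all assumptions made for illustration. The sketch uses only the Python standard library, maps each output field to a selector, and normalizes whitespace as a crude stand-in for noise removal.

```python
from html.parser import HTMLParser


class SelectorExtractor(HTMLParser):
    """Collect the text of elements matching simple 'tag.class' selectors."""

    def __init__(self, config):
        super().__init__()
        # Hypothetical config format: field name -> 'tag.class' (class part optional).
        self.rules = {}
        for field, selector in config.items():
            tag, _, cls = selector.partition(".")
            self.rules[field] = (tag, cls)
        self.results = {field: [] for field in config}
        # Matched elements still being read: [tag, field, text_parts, depth].
        self._open = []

    def handle_starttag(self, tag, attrs):
        # A nested element with the same tag name deepens every open match.
        for entry in self._open:
            if entry[0] == tag:
                entry[3] += 1
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_cls) in self.rules.items():
            if tag == want_tag and (not want_cls or want_cls in classes):
                self._open.append([tag, field, [], 1])

    def handle_data(self, data):
        # Text belongs to every matched element currently open.
        for entry in self._open:
            entry[2].append(data)

    def handle_endtag(self, tag):
        still_open = []
        for entry in self._open:
            if entry[0] == tag:
                entry[3] -= 1
            if entry[3] == 0:
                # Collapse runs of whitespace before storing the value.
                text = " ".join("".join(entry[2]).split())
                if text:
                    self.results[entry[1]].append(text)
            else:
                still_open.append(entry)
        self._open = still_open


def extract(html, config):
    """Return {field: [values]} for the given HTML and selector config."""
    parser = SelectorExtractor(config)
    parser.feed(html)
    parser.close()
    return parser.results


# Toy input and config, invented for this example.
SAMPLE_HTML = """
<ul>
  <li class="item"><span class="name"> Alpha </span><span class="value">1</span></li>
  <li class="item"><span class="name">Beta</span><span class="value">2</span></li>
</ul>
"""
CONFIG = {"name": "span.name", "value": "span.value"}

result = extract(SAMPLE_HTML, CONFIG)
rows = list(zip(result["name"], result["value"]))
print(rows)
```

Keeping the selectors in a separate mapping means a change in page layout only requires editing the configuration, not the extraction code. A production tool would need full CSS selector support and page fetching, both of which are outside this sketch.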

Original language: English
Title of host publication: ACM International Conference Proceeding Series
Editors: Bipin C. Desai, Motomichi Toyama
Publisher: Association for Computing Machinery
Pages: 212-213
Number of pages: 2
ISBN (Electronic): 9781450334143
Publication status: Published - 2015 Jul 13
Event: 19th International Database Engineering and Applications Symposium, IDEAS 2015 - Yokohama, Japan
Duration: 2015 Jul 13 to 2015 Jul 15

Publication series

Name: ACM International Conference Proceeding Series
Volume: 0

Other

Other: 19th International Database Engineering and Applications Symposium, IDEAS 2015
Country/Territory: Japan
City: Yokohama
Period: 15/7/13 to 15/7/15

Keywords

  • CSS selector
  • Data extraction
  • Web scraping
  • Web wrapper

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications
