Browser GUI for generating web data extraction rules in Ducky

Kei Kanaoka, Motomichi Toyama

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

To benefit from the invaluable data on the World Wide Web, manual extraction or the creation of web scraping programs may be necessary. However, these processes can be tedious and complicated. To address these problems, we have proposed Ducky, a Web data extraction system that includes a web wrapper, which extracts data from web sources and translates it into structured data based on user-defined data extraction rules. Ducky can extract data flexibly from various structured web pages, remove noise from the extracted data, and integrate data distributed across multiple pages on different sites. In this paper, we propose a browser GUI for Ducky. Instead of manually writing a configuration file, users can simply click on the target elements or hover the cursor over them (mouse over). The users' actions are then automatically converted into data extraction rules and saved in a configuration file. Thus, we help users extract data through intuitive operations and reduce their burden of writing the configuration file.
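
As a rough illustration of the click-to-rule conversion described above, the following sketch derives a CSS selector for a selected element and stores it as a rule in a JSON configuration. Everything here is an assumption made for illustration: the `Node` class, the `css_selector_for` and `rule_from_click` helpers, and the JSON layout are hypothetical, and the abstract does not specify Ducky's actual rule syntax or configuration file format.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Minimal stand-in for a DOM element (hypothetical, not Ducky's model)."""
    tag: str
    id: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child element and return it for chaining."""
        child.parent = self
        self.children.append(child)
        return child


def css_selector_for(node: Node) -> str:
    """Walk up from the clicked node, stopping at the nearest element with an id."""
    parts = []
    current: Optional[Node] = node
    while current is not None:
        if current.id:
            parts.append(f"#{current.id}")
            break
        if current.parent is None:
            parts.append(current.tag)
            break
        # Position among siblings that share the same tag (1-based).
        same_tag = [c for c in current.parent.children if c.tag == current.tag]
        index = next(i for i, c in enumerate(same_tag) if c is current) + 1
        parts.append(f"{current.tag}:nth-of-type({index})")
        current = current.parent
    return " > ".join(reversed(parts))


def rule_from_click(field_name: str, node: Node) -> dict:
    """Turn one simulated click into one rule (assumed layout, not Ducky's)."""
    return {"field": field_name, "selector": css_selector_for(node)}


# Build a tiny page: html > body > div#products > ul > li x3 > span x2
html = Node("html")
body = html.add(Node("body"))
products = body.add(Node("div", id="products"))
ul = products.add(Node("ul"))
items = []
for _ in range(3):
    li = ul.add(Node("li"))
    items.append((li.add(Node("span")), li.add(Node("span"))))

# Simulate the user clicking the second span of the second list item.
clicked = items[1][1]
selector = css_selector_for(clicked)
config_text = json.dumps({"rules": [rule_from_click("price", clicked)]}, indent=2)
print(selector)
# → #products > ul:nth-of-type(1) > li:nth-of-type(2) > span:nth-of-type(2)
```

Anchoring the selector at the nearest ancestor with an `id` keeps it short and less sensitive to layout changes higher in the page. A real tool would likely also generalize the selector (for example, by dropping the `li` index) so that one click matches every repeated item.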

Original language: English
Title of host publication: 17th International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2015 - Proceedings
Publisher: Association for Computing Machinery, Inc
ISBN (Print): 9781450334914
Publication status: Published - 2015 Dec 11
Event: 17th International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2015 - Brussels, Belgium
Duration: 2015 Dec 11 to 2015 Dec 13

Other

Other: 17th International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2015
Country: Belgium
City: Brussels
Period: 15/12/11 to 15/12/13

Keywords

  • CSS selector
  • Data Extraction
  • Web scraping
  • Web Wrapper

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Computer Science Applications
