Abstract
To benefit from the invaluable data on the World Wide Web, manual extraction or the creation of web scraping programs may be necessary. However, these processes can be tedious and complicated. To address these problems, we have proposed Ducky, a Web data extraction system that includes a web wrapper, which extracts data from web sources and translates them into structured data based on user-defined data extraction rules. Ducky can extract data flexibly from various structured web pages, remove noise from extracted data, and integrate data distributed across multiple pages from different sites. In this paper, we propose a browser GUI for Ducky. Instead of manually writing a configuration file, users can simply click on or point the cursor at (mouse over) the target elements. The users' actions are then automatically converted to data extraction rules and saved in a configuration file. Thus, we help users extract data through intuitive operations and reduce the burden of writing the configuration file.
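The conversion of a user's click into a data extraction rule can be illustrated with a minimal sketch. It assumes, based only on the "CSS selector" keyword of this record, that a rule pairs a field name with a CSS selector. The `Node` class, the `css_selector` function, and the JSON rule layout are hypothetical and are not taken from Ducky itself.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a DOM element (not part of Ducky)."""
    tag: str
    id: Optional[str] = None
    parent: Optional[Node] = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list, repr=False)

    def add(self, child: Node) -> Node:
        """Attach `child` under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def css_selector(node: Node) -> str:
    """Build a CSS selector for `node` by walking up towards the root."""
    parts = []
    current: Optional[Node] = node
    while current is not None:
        if current.id:
            # An id is unique in a valid document, so the walk can stop here.
            parts.append(f"{current.tag}#{current.id}")
            break
        if current.parent is None:
            parts.append(current.tag)
        else:
            same_tag = [c for c in current.parent.children if c.tag == current.tag]
            # 1-based position among siblings that share the same tag.
            index = next(i for i, c in enumerate(same_tag, 1) if c is current)
            parts.append(f"{current.tag}:nth-of-type({index})")
        current = current.parent
    return " > ".join(reversed(parts))


# A tiny page: <body><div id="main"><ul><li/><li/></ul></div></body>
body = Node("body")
main = body.add(Node("div", id="main"))
ul = main.add(Node("ul"))
ul.add(Node("li"))
clicked = ul.add(Node("li"))  # the element the user is assumed to have clicked

# Hypothetical rule layout; the real configuration format is not given here.
rule = {"name": "item", "selector": css_selector(clicked)}
config_text = json.dumps({"rules": [rule]}, indent=2)
print(config_text)
```

In a real browser GUI the clicked element would come from a DOM event rather than a hand-built tree, and a production selector generator would also need to consider class names and robustness to page changes.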
Original language | English |
---|---|
Title of host publication | 17th International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2015 - Proceedings |
Publisher | Association for Computing Machinery, Inc |
ISBN (Print) | 9781450334914 |
DOIs | |
Publication status | Published - 2015 Dec 11 |
Event | 17th International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2015 - Brussels, Belgium; Duration: 2015 Dec 11 → 2015 Dec 13 |
Other
Other | 17th International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2015 |
---|---|
Country/Territory | Belgium |
City | Brussels |
Period | 2015 Dec 11 → 2015 Dec 13 |
Keywords
- CSS selector
- Data Extraction
- Web scraping
- Web Wrapper
ASJC Scopus subject areas
- Computer Networks and Communications
- Information Systems
- Computer Science Applications