TY - GEN
T1 - Ducky
T2 - 18th International Database Engineering and Applications Symposium, IDEAS 2014
AU - Kanaoka, Kei
AU - Fujii, Yotaro
AU - Toyama, Motomichi
PY - 2014/1/1
Y1 - 2014/1/1
N2 - The World Wide Web has become a primary source of information. Therefore, extracting data from Web sources has become a key technology. In this paper, we introduce Ducky, a semi-automatic system that includes a Web wrapper which extracts data from Web sources and translates it into structured data. In Ducky, users define a configuration file consisting of several parameters (the URL of the Web page, the CSS selectors that locate the data to retrieve, and so on), so they do not need to write Web scraping programs at all. The definition is simple, yet can extract data flexibly from various structured Web pages. Additionally, Ducky provides a Web API and several output data formats: XML, JSON, and CSV. Finally, experiments confirmed that Ducky can accurately extract data from 22 different structured Web sources.
AB - The World Wide Web has become a primary source of information. Therefore, extracting data from Web sources has become a key technology. In this paper, we introduce Ducky, a semi-automatic system that includes a Web wrapper which extracts data from Web sources and translates it into structured data. In Ducky, users define a configuration file consisting of several parameters (the URL of the Web page, the CSS selectors that locate the data to retrieve, and so on), so they do not need to write Web scraping programs at all. The definition is simple, yet can extract data flexibly from various structured Web pages. Additionally, Ducky provides a Web API and several output data formats: XML, JSON, and CSV. Finally, experiments confirmed that Ducky can accurately extract data from 22 different structured Web sources.
KW - CSS selector
KW - Data extraction
KW - Web scraping
KW - Web wrapper
UR - http://www.scopus.com/inward/record.url?scp=84906808321&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84906808321&partnerID=8YFLogxK
U2 - 10.1145/2628194.2628244
DO - 10.1145/2628194.2628244
M3 - Conference contribution
AN - SCOPUS:84906808321
SN - 9781450326278
T3 - ACM International Conference Proceeding Series
SP - 342
EP - 347
BT - Proceedings of the 18th International Database Engineering and Applications Symposium, IDEAS 2014
PB - Association for Computing Machinery
Y2 - 7 July 2014 through 9 July 2014
ER -