Dictionary matching: Review of the Aho-Corasick algorithm and vision for large dictionaries

Qiao ZhanPeng, Kento Goto, Takuya Ohshima, Masahiro Tajima, Motomichi Toyama

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Pattern-matching techniques have recently been applied to network security applications such as intrusion detection, virus protection, and spam filtering. The widely used Aho-Corasick algorithm can match multiple patterns simultaneously while providing a worst-case performance guarantee. On large-scale dictionary matching, however, the traditional Aho-Corasick algorithm has several problems: it takes a large amount of memory, building the trie takes a long time, and updates are costly because the trie must be rebuilt. In this paper, we summarize the original Aho-Corasick algorithm (AC) and several recent works that extend it: the parallel Aho-Corasick algorithm (PAC) and the parallel failure-less Aho-Corasick algorithm (PFAC). PAC uses multiple threads, dividing the input text into several parts that are processed in parallel, but this introduces the boundary detection problem: a pattern that spans the boundary between two parts can be missed. PFAC, which is relatively popular, discards the failure function and uses the parallel processing power of the GPU to increase AC throughput. PFAC not only speeds up the matching process but also solves the boundary detection problem. The original PFAC paper tested pattern matching only with a small dictionary. In this paper, we verify the feasibility of using PFAC with large-scale dictionaries and discuss the possibility of using the PFAC algorithm to match large-scale pattern dictionaries. Finally, we present our proposals for extending PFAC to large-scale dictionary pattern matching. One, called the pattern cut approach, cuts each pattern at the prefix it shares with other words. Another divides the large automaton according to word prefixes and constructs the sub-automata separately; when the input text is processed, each word is handled by the sub-automaton assigned to its prefix. In this way, we can improve the efficiency of parallel processing.
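The failure-less idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and it runs sequentially on the CPU. The names `build_trie`, `match_from`, and `pfac_match` are hypothetical, and the loop over start positions stands in for the one-thread-per-position scheme that PFAC runs on a GPU.

```python
from typing import Dict, List, Tuple

Trie = Tuple[List[Dict[str, int]], Dict[int, str]]


def build_trie(patterns: List[str]) -> Trie:
    """Build a plain trie: goto transitions only, no failure links."""
    goto: List[Dict[str, int]] = [{}]  # goto[state][char] -> next state
    accept: Dict[int, str] = {}        # accepting state -> matched pattern
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accept[state] = pattern
    return goto, accept


def match_from(text: str, start: int, trie: Trie) -> List[Tuple[int, str]]:
    """Work of one logical thread: walk the trie from text[start] and
    stop at the first missing transition (there is no failure function)."""
    goto, accept = trie
    found: List[Tuple[int, str]] = []
    state = 0
    for pos in range(start, len(text)):
        nxt = goto[state].get(text[pos])
        if nxt is None:
            break  # this "thread" terminates instead of following a failure link
        state = nxt
        if state in accept:
            found.append((start, accept[state]))
    return found


def pfac_match(text: str, patterns: List[str]) -> List[Tuple[int, str]]:
    """Simulate the failure-less scheme: one independent walk per start position."""
    trie = build_trie(patterns)
    results: List[Tuple[int, str]] = []
    for start in range(len(text)):  # on a GPU, each iteration would be its own thread
        results.extend(match_from(text, start, trie))
    return results


print(pfac_match("ushers", ["he", "she", "his", "hers"]))
# prints [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Because every start position gets its own walk that may read past any chunk boundary, no match depends on how the input is divided, which is why this scheme avoids the boundary detection problem described for PAC.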

Original language: English
Title of host publication: Proceedings of the 8th International Conference on Information Systems and Technologies, ICIST 2018
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450364041
DOIs: https://doi.org/10.1145/3200842.3200850
Publication status: Published - 2018 Mar 16
Event: 8th International Conference on Information Systems and Technologies, ICIST 2018 - Istanbul, Turkey
Duration: 2018 Mar 16 to 2018 Mar 18

Other

Other: 8th International Conference on Information Systems and Technologies, ICIST 2018
Country: Turkey
City: Istanbul
Period: 18/3/16 to 18/3/18

Keywords

  • Aho-Corasick algorithm
  • Dictionary matching
  • Parallel Aho-Corasick algorithm
  • Parallel algorithm
  • Parallel failure-less Aho-Corasick algorithm
  • Pattern matching
  • Trie

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

ZhanPeng, Q., Goto, K., Ohshima, T., Tajima, M., & Toyama, M. (2018). Dictionary matching: Review of the Aho-Corasick algorithm and vision for large dictionaries. In Proceedings of the 8th International Conference on Information Systems and Technologies, ICIST 2018 [Y] Association for Computing Machinery. https://doi.org/10.1145/3200842.3200850