Convolutional neural network based on SMILES representation of compounds for detecting chemical motif

Maya Hirohara, Yutaka Saito, Yuki Koda, Kengo Sato, Yasubumi Sakakibara

Research output: Contribution to journal › Article

Abstract

Background: Previous studies have suggested deep learning to be a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed by addressing the use of various kinds of fingerprints and graph convolution architectures. However, these methods are either advantageous or disadvantageous depending on whether they (1) can distinguish structural differences including chirality of compounds, and (2) can automatically discover effective features.

Results: We developed another deep learning model for compound classification. In this method, we constructed a distributed representation of compounds based on the SMILES notation, which linearly represents a compound structure, and applied the SMILES-based representation to a convolutional neural network (CNN). The use of SMILES allows us to process all types of compounds while incorporating a broad range of structure information, and representation learning by CNN automatically acquires a low-dimensional representation of input features. In a benchmark experiment using the TOX 21 dataset, our method outperformed conventional fingerprint methods, and performed comparably against the winning model of the TOX 21 Challenge. Multivariate analysis confirmed that the chemical space consisting of the features learned by SMILES-based representation learning adequately expressed a richer feature space that enabled the accurate discrimination of compounds. Using motif detection with the learned filters, not only important known structures (motifs) such as protein-binding sites but also structures of unknown functional groups were detected.

Conclusions: The source code of our SMILES-based convolutional neural network software in the deep learning framework Chainer is available at http://www.dna.bio.keio.ac.jp/smiles/, and the dataset used for performance evaluation in this work is available at the same URL.
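The idea in the Results paragraph, feeding a SMILES string to a convolution filter and reading back the substring that activates it most strongly, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the per-symbol features, so a simple one-hot encoding over a small hypothetical vocabulary (`VOCAB`) is used here, and the filter is built by hand with a hypothetical helper (`kernel_for`) rather than learned. The authors' actual Chainer code is at the URL given in the abstract.

```python
# Illustrative sketch only: one-hot SMILES encoding plus a single 1D convolution filter.
# VOCAB, the helper names, and the hand-built filter are assumptions, not the paper's method.

VOCAB = ["C", "N", "O", "c", "n", "=", "(", ")", "1", "@", "[", "]", "H"]
INDEX = {ch: i for i, ch in enumerate(VOCAB)}


def one_hot_smiles(smiles, max_len):
    """Encode a SMILES string as a max_len x len(VOCAB) matrix, zero-padded at the end."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string longer than max_len")
    matrix = [[0.0] * len(VOCAB) for _ in range(max_len)]
    for pos, ch in enumerate(smiles):
        matrix[pos][INDEX[ch]] = 1.0  # raises KeyError for characters outside VOCAB
    return matrix


def conv1d(matrix, kernel):
    """Slide a (width x len(VOCAB)) kernel along the sequence axis, without padding."""
    width = len(kernel)
    out = []
    for start in range(len(matrix) - width + 1):
        total = 0.0
        for k in range(width):
            for j, w in enumerate(kernel[k]):
                total += w * matrix[start + k][j]
        out.append(total)
    return out


def kernel_for(pattern):
    """Hand-built filter that responds maximally to `pattern` (stand-in for a learned filter)."""
    return [[1.0 if INDEX[ch] == j else 0.0 for j in range(len(VOCAB))] for ch in pattern]


def strongest_window(smiles, kernel, max_len=32):
    """Global max pooling, returning the peak activation and the substring that produced it."""
    activations = conv1d(one_hot_smiles(smiles, max_len), kernel)
    best = max(range(len(activations)), key=activations.__getitem__)
    return activations[best], smiles[best:best + len(kernel)]


if __name__ == "__main__":
    # Acetanilide, scanned with a filter tuned to an amide-like substring.
    score, window = strongest_window("CC(=O)Nc1ccccc1", kernel_for("C(=O)N"))
    print(score, window)  # prints: 6.0 C(=O)N
```

In a trained network the filter weights would come from optimisation on labelled compounds, and mapping the highest-activation windows back to substructures is one plausible reading of "motif detection with the learned filters".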

Original language: English
Article number: 526
Journal: BMC Bioinformatics
Volume: 19
DOI: 10.1186/s12859-018-2523-5
Publication status: Published - 2018 Dec 31

Fingerprint

Learning
Neural Networks
Dermatoglyphics
Fingerprint
Lead compounds
Chirality
Binding sites
Convolution
Functional groups
Websites
Benchmarking
Screening
Information Structure
Multivariate Analysis
Feature Space
Protein Binding
Notation
Discrimination

Keywords

  • Chemical compound
  • Convolutional neural network
  • SMILES
  • TOX 21 Challenge
  • Virtual screening

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. / Hirohara, Maya; Saito, Yutaka; Koda, Yuki; Sato, Kengo; Sakakibara, Yasubumi.

In: BMC Bioinformatics, Vol. 19, 526, 31.12.2018.

@article{eb961438b8574b5d9db83e8c5dc6d612,
title = "Convolutional neural network based on SMILES representation of compounds for detecting chemical motif",
keywords = "Chemical compound, Convolutional neural network, SMILES, TOX 21 Challenge, Virtual screening",
author = "Maya Hirohara and Yutaka Saito and Yuki Koda and Kengo Sato and Yasubumi Sakakibara",
year = "2018",
month = "12",
day = "31",
doi = "10.1186/s12859-018-2523-5",
language = "English",
volume = "19",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Convolutional neural network based on SMILES representation of compounds for detecting chemical motif

AU - Hirohara, Maya

AU - Saito, Yutaka

AU - Koda, Yuki

AU - Sato, Kengo

AU - Sakakibara, Yasubumi

PY - 2018/12/31

Y1 - 2018/12/31

AB - Background: Previous studies have suggested deep learning to be a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed by addressing the use of various kinds of fingerprints and graph convolution architectures. However, these methods are either advantageous or disadvantageous depending on whether they (1) can distinguish structural differences including chirality of compounds, and (2) can automatically discover effective features. Results: We developed another deep learning model for compound classification. In this method, we constructed a distributed representation of compounds based on the SMILES notation, which linearly represents a compound structure, and applied the SMILES-based representation to a convolutional neural network (CNN). The use of SMILES allows us to process all types of compounds while incorporating a broad range of structure information, and representation learning by CNN automatically acquires a low-dimensional representation of input features. In a benchmark experiment using the TOX 21 dataset, our method outperformed conventional fingerprint methods, and performed comparably against the winning model of the TOX 21 Challenge. Multivariate analysis confirmed that the chemical space consisting of the features learned by SMILES-based representation learning adequately expressed a richer feature space that enabled the accurate discrimination of compounds. Using motif detection with the learned filters, not only important known structures (motifs) such as protein-binding sites but also structures of unknown functional groups were detected. Conclusions: The source code of our SMILES-based convolutional neural network software in the deep learning framework Chainer is available at http://www.dna.bio.keio.ac.jp/smiles/ , and the dataset used for performance evaluation in this work is available at the same URL.

KW - Chemical compound

KW - Convolutional neural network

KW - SMILES

KW - TOX 21 Challenge

KW - Virtual screening

UR - http://www.scopus.com/inward/record.url?scp=85059287866&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85059287866&partnerID=8YFLogxK

U2 - 10.1186/s12859-018-2523-5

DO - 10.1186/s12859-018-2523-5

M3 - Article

C2 - 30598075

AN - SCOPUS:85059287866

VL - 19

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 526

ER -