Adaptive seeds tame genomic sequence comparison

Szymon M. Kiełbasa, Raymond Wan, Kengo Sato, Paul Horton, Martin C. Frith

Research output: Contribution to journal (Article)

370 Citations (Scopus)

Abstract

The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billion-base DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.
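The idea of choosing seeds by rareness can be illustrated with a small sketch. It is not LAST's implementation. It assumes that a seed at each query position is the shortest match occurring at most `max_hits` times in the reference. The names `build_suffix_array`, `match_range`, `adaptive_seeds`, and `max_hits` are hypothetical, and the naive suffix-array construction is only suitable for toy inputs.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Fine for a toy example; real tools use far more efficient builders.
    return sorted(range(len(text)), key=lambda i: text[i:])


def match_range(text, sa, pattern):
    # Binary search for the half-open range [first, last) of suffixes
    # in `sa` that begin with `pattern`.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def adaptive_seeds(reference, query, max_hits=2):
    # Assumption (hypothetical rule for this sketch): at each query position,
    # lengthen the match until it occurs at most `max_hits` times in the
    # reference, then report it as a seed.
    sa = build_suffix_array(reference)
    seeds = []
    for start in range(len(query)):
        for length in range(1, len(query) - start + 1):
            first, last = match_range(reference, sa, query[start:start + length])
            count = last - first
            if count == 0:
                break  # nothing this long matches, so no seed starts here
            if count <= max_hits:
                seeds.append((start, length, sorted(sa[first:last])))
                break
    return seeds


if __name__ == "__main__":
    # The A-rich prefix forces longer seeds; rarer letters give short seeds.
    for seed in adaptive_seeds("AAAAAAACGT", "AAACG", max_hits=1):
        print(seed)
```

With the A-rich reference above, the seed starting in the run of A's must grow to four bases before it becomes unique, whereas the seeds starting at C and G are a single base long. Because each query position yields at most `max_hits` reference hits, the total number of seed hits in this sketch is bounded by `max_hits` times the query length, however skewed the composition is.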

Original language: English
Pages (from-to): 487-493
Number of pages: 7
Journal: Genome Research
Volume: 21
Issue number: 3
DOI: https://doi.org/10.1101/gr.113985.110
Publication status: Published - 2011 Mar
Externally published: Yes

ASJC Scopus subject areas

  • Genetics
  • Genetics (clinical)

Cite this

Kiełbasa, S. M., Wan, R., Sato, K., Horton, P., & Frith, M. C. (2011). Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3), 487-493. https://doi.org/10.1101/gr.113985.110

@article{e2de7059294246d5ab4ed4a2a4c0fe3c,
title = "Adaptive seeds tame genomic sequence comparison",
abstract = "The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billion-base DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.",
author = "Kiełbasa, {Szymon M.} and Raymond Wan and Kengo Sato and Paul Horton and Frith, {Martin C.}",
year = "2011",
month = "3",
doi = "10.1101/gr.113985.110",
language = "English",
volume = "21",
pages = "487--493",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "3",

}

TY  - JOUR
T1  - Adaptive seeds tame genomic sequence comparison
AU  - Kiełbasa, Szymon M.
AU  - Wan, Raymond
AU  - Sato, Kengo
AU  - Horton, Paul
AU  - Frith, Martin C.
PY  - 2011/3
Y1  - 2011/3
AB  - The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billion-base DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.
UR  - http://www.scopus.com/inward/record.url?scp=79952256999&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=79952256999&partnerID=8YFLogxK
U2  - 10.1101/gr.113985.110
DO  - 10.1101/gr.113985.110
M3  - Article
VL  - 21
SP  - 487
EP  - 493
JO  - Genome Research
JF  - Genome Research
SN  - 1088-9051
IS  - 3
ER  -