In:
PLOS Computational Biology, Public Library of Science (PLoS), Vol. 18, No. 3 (2022-3-7), p. e1009492
Abstract:
Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
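Illustrative note (not part of the catalogued record): the constraint stated in the abstract, that no test sequence is p% or more identical to any training sequence, can be sketched as a graph problem in which sequences are vertices and an edge joins any pair at or above the threshold. The Python below is a minimal greedy sketch under stated assumptions, not the article's algorithms: the function names, the column-wise identity measure on pre-aligned sequences, and the random tie-breaking rule are all hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def percent_identity(a, b):
    """Percent of matching non-gap columns between two pre-aligned sequences.

    Assumption: both sequences come from the same alignment, so they have
    equal, nonzero length. Real pipelines may define identity differently.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def greedy_dissimilar_split(seqs, p, seed=0):
    """Split a {name: aligned_sequence} dict into train, test, and discarded sets.

    Guarantee: every (train, test) pair is less than p percent identical.
    This is a simple greedy heuristic, not the published method.
    """
    rng = random.Random(seed)
    names = list(seqs)

    # Build the similarity graph: an edge means "too similar to separate".
    neighbors = {n: set() for n in names}
    for u, v in combinations(names, 2):
        if percent_identity(seqs[u], seqs[v]) >= p:
            neighbors[u].add(v)
            neighbors[v].add(u)

    rng.shuffle(names)
    train, test, discarded = set(), set(), set()
    for n in names:
        near_train = bool(neighbors[n] & train)
        near_test = bool(neighbors[n] & test)
        if near_train and near_test:
            discarded.add(n)   # would violate the constraint in either set
        elif near_train:
            train.add(n)       # only safe alongside its similar training sequences
        elif near_test:
            test.add(n)        # only safe alongside its similar test sequences
        else:
            (train if rng.random() < 0.5 else test).add(n)
    return train, test, discarded


if __name__ == "__main__":
    family = {
        "s1": "ACGTACGT",
        "s2": "ACGTACGA",
        "s3": "TTTTCCCC",
        "s4": "TTTTCCCG",
        "s5": "GGGGAAAA",
    }
    train, test, discarded = greedy_dissimilar_split(family, p=50, seed=1)
    print(sorted(train), sorted(test), sorted(discarded))
```

A sequence is only ever added to a set when it has no similar neighbor in the opposite set, so the threshold holds by construction. The cost is that sequences similar to members of both sets are discarded, and the sketch makes no attempt to balance set sizes.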
Type of Medium:
Online Resource
ISSN:
1553-7358
DOI:
10.1371/journal.pcbi.1009492
10.1371/journal.pcbi.1009492.g001
10.1371/journal.pcbi.1009492.g002
10.1371/journal.pcbi.1009492.g003
10.1371/journal.pcbi.1009492.g004
10.1371/journal.pcbi.1009492.g005
10.1371/journal.pcbi.1009492.t001
10.1371/journal.pcbi.1009492.s001
10.1371/journal.pcbi.1009492.s002
10.1371/journal.pcbi.1009492.s003
10.1371/journal.pcbi.1009492.s004
10.1371/journal.pcbi.1009492.s005
10.1371/journal.pcbi.1009492.r001
10.1371/journal.pcbi.1009492.r002
10.1371/journal.pcbi.1009492.r003
10.1371/journal.pcbi.1009492.r004
10.1371/journal.pcbi.1009492.r005
10.1371/journal.pcbi.1009492.r006
Language:
English
Publisher:
Public Library of Science (PLoS)
Publication Date:
2022
ZDB-ID:
2193340-6