In:
PLOS Biology, Public Library of Science (PLoS), Vol. 19, No. 11 (2021-11-09), p. e3001421
Abstract:
The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
Type of Medium:
Online Resource
ISSN:
1545-7885
DOI:
10.1371/journal.pbio.3001421
DOI:
10.1371/journal.pbio.3001421.g001
DOI:
10.1371/journal.pbio.3001421.g002
DOI:
10.1371/journal.pbio.3001421.g003
DOI:
10.1371/journal.pbio.3001421.s001
DOI:
10.1371/journal.pbio.3001421.s002
DOI:
10.1371/journal.pbio.3001421.s003
DOI:
10.1371/journal.pbio.3001421.s004
DOI:
10.1371/journal.pbio.3001421.s005
DOI:
10.1371/journal.pbio.3001421.s006
DOI:
10.1371/journal.pbio.3001421.s007
DOI:
10.1371/journal.pbio.3001421.s008
DOI:
10.1371/journal.pbio.3001421.s009
DOI:
10.1371/journal.pbio.3001421.s010
DOI:
10.1371/journal.pbio.3001421.r001
DOI:
10.1371/journal.pbio.3001421.r002
DOI:
10.1371/journal.pbio.3001421.r003
DOI:
10.1371/journal.pbio.3001421.r004
DOI:
10.1371/journal.pbio.3001421.r005
DOI:
10.1371/journal.pbio.3001421.r006
DOI:
10.1371/journal.pbio.3001421.r007
DOI:
10.1371/journal.pbio.3001421.r008
Language:
English
Publisher:
Public Library of Science (PLoS)
Publication Date:
2021
ZDB-ID:
2126773-X