Overview of the filtering process within the amr.watch workflow applied to public genomes of priority bacterial pathogens available in the International Nucleotide Sequence Database Collaboration (INSDC) databases. The genome data are filtered in a series of steps, depicted from left to right in the table, with the numbers in each column representing a subset of those from the previous column.
For pathogens that are grouped together, we initially accept genomes annotated in the European Nucleotide Archive (ENA) with any of the taxonomy IDs belonging to that group, and use the Speciator species assignments in subsequent processing.
Pathogen | ENA entries (run accessions) | Illumina paired-end entries | Filtered entries¹ | Entries with geotemporal data² | Entries available for download in SRA | Entries associated with unique samples³ | Assembled genomes | Genomes with correct species | Genomes that passed QC | Genomes with collection date post-2010 |
---|---|---|---|---|---|---|---|---|---|---|
All Pathogens | 1,580,477 | 1,404,693 | 1,391,702 | 784,704 | 781,314 | 752,122 | 720,739 | 627,878 | 616,012 | 558,900 |
A. baumannii | 43,074 | 35,709 | 34,529 | 22,375 | 22,264 | 21,552 | 20,956 | 20,766 | 20,457 | 19,108 |
C. coli | 166,259 | 161,511 | 159,972 | 106,496 | 105,598 | 105,001 | 101,264 | 32,106 | 30,608 | 29,500 |
C. jejuni | 68,976 | 64,925 | 62,649 | |||||||
E. cloacae complex | 18,754 | 16,614 | 16,437 | 12,101 | 12,101 | 11,134 | 10,670 | 10,473 | 10,267 | 9,805 |
E. faecium | 42,480 | 40,961 | 40,649 | 25,384 | 25,384 | 24,453 | 24,188 | 24,018 | 22,970 | 21,771 |
E. coli | 565,608 | 473,442 | 469,001 | 249,904 | 249,507 | 240,206 | 230,016 | 202,023 | 196,699 | 179,518 |
S. flexneri | 10,420 | 10,095 | 9,301 | |||||||
S. sonnei | 13,792 | 13,481 | 12,499 | |||||||
H. influenzae | 16,279 | 15,089 | 15,027 | 6,170 | 6,168 | 5,991 | 5,397 | 5,386 | 5,248 | 4,634 |
K. pneumoniae | 110,643 | 97,350 | 96,066 | 58,278 | 58,276 | 53,734 | 52,385 | 50,380 | 49,207 | 47,233 |
N. gonorrhoeae | 72,927 | 70,187 | 70,099 | 46,053 | 46,053 | 43,342 | 40,102 | 39,732 | 38,434 | 37,658 |
P. aeruginosa | 76,305 | 56,773 | 55,940 | 28,067 | 27,939 | 26,864 | 25,432 | 25,201 | 24,502 | 22,829 |
Salmonella Typhi | 123,749 | 115,433 | 114,694 | 88,122 | 87,907 | 85,904 | 84,368 | 0 | 8,409 | 6,415 |
Salmonella Typhimurium | 0 | 0 | 0 | |||||||
Salmonella Enteritidis | 0 | 0 | 0 | |||||||
S. aureus | 172,852 | 157,126 | 155,327 | 88,154 | 86,517 | 82,703 | 78,492 | 77,378 | 74,663 | 63,062 |
S. pneumoniae | 171,547 | 164,498 | 163,961 | 53,600 | 53,600 | 51,238 | 47,469 | 47,227 | 46,047 | 32,918 |
1. Entries (runs) are filtered to include only those with two FASTQ files, ≥20x mean coverage (assessed via the "base_count" field), and a single associated sample accession.
2. Entries (runs) are filtered to include only those with a collection date decodable to at least the year and a sampling location decodable to at least the country level.
3. Entries (runs) are filtered so that only one run per sample accession is included, selecting the run with the highest number of bases (assessed via the "base_count" field).
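
The three footnoted filters can be sketched in a few lines of Python. This is an illustrative sketch only, not the amr.watch implementation. It assumes each run is a dictionary with ENA-style string fields (`run_accession`, `sample_accession`, `fastq_ftp`, `base_count`, `collection_date`, `country`), that multiple values are separated by semicolons, and that mean coverage is estimated as `base_count` divided by an expected genome length supplied by the caller. The function names, the `MISSING` vocabulary, and the date and country parsing rules are hypothetical simplifications. The intermediate "available for download in SRA" step is not modelled.

```python
import re
from typing import Optional

MIN_COVERAGE = 20  # threshold from footnote 1 (>=20x mean coverage)

# Assumed placeholder values that mean "no usable metadata" (hypothetical list).
MISSING = {"", "missing", "not collected", "not provided", "not applicable", "unknown"}


def passes_run_filter(run: dict, genome_length: int) -> bool:
    """Footnote 1: two FASTQ files, >=20x coverage, one sample accession."""
    fastqs = [f for f in (run.get("fastq_ftp") or "").split(";") if f]
    samples = [s for s in (run.get("sample_accession") or "").split(";") if s]
    # Assumption: coverage is approximated as sequenced bases / genome length.
    coverage = int(run.get("base_count") or 0) / genome_length
    return len(fastqs) == 2 and len(samples) == 1 and coverage >= MIN_COVERAGE


def decode_year(collection_date: str) -> Optional[int]:
    """Return the year if the date starts with a four-digit year, else None."""
    match = re.match(r"(\d{4})(?:\D|$)", (collection_date or "").strip())
    return int(match.group(1)) if match else None


def decode_country(location: str) -> Optional[str]:
    """Return the country part of a 'Country: region' string, else None."""
    country = (location or "").split(":", 1)[0].strip()
    return None if country.lower() in MISSING else country


def has_geotemporal(run: dict) -> bool:
    """Footnote 2: year-level date and country-level location are both decodable."""
    return (decode_year(run.get("collection_date", "")) is not None
            and decode_country(run.get("country", "")) is not None)


def one_run_per_sample(runs: list) -> list:
    """Footnote 3: keep the run with the highest base_count for each sample."""
    best = {}
    for run in runs:
        key = run["sample_accession"]
        if key not in best or int(run["base_count"]) > int(best[key]["base_count"]):
            best[key] = run
    return list(best.values())


def filter_runs(runs: list, genome_length: int) -> list:
    """Apply the three filters in the order the table columns list them."""
    kept = [r for r in runs if passes_run_filter(r, genome_length)]
    kept = [r for r in kept if has_geotemporal(r)]
    return one_run_per_sample(kept)
```

Calling `filter_runs(runs, genome_length=5_000_000)` on a list of such dictionaries returns at most one run per sample accession. The genome length here is a placeholder; a real workflow would need a per-species value.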