How SITE-100 data is generated
All specimens, sequences, and mitogenomes in this database were produced through a standardised two-track molecular pipeline — combining bulk metabarcoding with individual shotgun sequencing — followed by automated annotation and quality correction (Bian et al., 2022).
Sequencing pipeline
Each field sample contains approximately 50 morphologically distinct specimens (morphospecies). The pipeline operates on two parallel tracks: a bulk track generating community-level barcodes, and an individual track generating complete mitochondrial genomes for taxonomically sorted morphospecies.
[Figure: pipeline flowchart. Node labels, in order: sample; specimens; extraction; pool; barcoding; species; extraction; (COX1 amplicon); sequencing; ~15,000 bp; gene annotation; codon boundary refinement; Database.]
Bulk sample processing & metabarcoding
Specimens collected by standardised passive traps (flight interception traps) are sorted by parataxonomists into morphospecies. Up to about 50 specimens per sample are pooled for bulk DNA extraction. Amplification of the cox1 barcode region generates community-level amplicon sequence variants (ASVs), which are matched against reference sequences to produce species-level metabarcodes.
Individual barcoding & shotgun sequencing
Representative specimens from each morphospecies undergo individual DNA extraction. A cox1 barcode is sequenced per specimen. Separately, ~200 specimens are pooled for low-coverage Illumina shotgun sequencing (genome skimming), yielding 50–80% complete mitogenome assemblies per specimen.
Mitogenome assembly and specimen assignment
Mitogenome contigs are assembled from shotgun reads using standard genome assemblers. Each assembled mitogenome is then assigned back to its specimen using the individual cox1 barcode as an anchor: the barcode sequence identifies which contig in the pool belongs to which specimen.
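The anchoring step can be illustrated with a minimal sketch. It assumes exact substring matching on either strand, which is a simplification: a production pipeline would more likely use an alignment tool that tolerates mismatches. The function names, specimen IDs, contig IDs, and sequences are all hypothetical and are not taken from the SITE-100 codebase.

```python
from __future__ import annotations


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]


def assign_contigs(barcodes: dict[str, str], contigs: dict[str, str]) -> dict[str, str]:
    """Map each specimen ID to the single contig that contains its barcode.

    Assumption: a specimen is skipped when zero or several contigs match,
    so that ambiguous cases can be reviewed by hand.
    """
    assignments = {}
    for specimen_id, barcode in barcodes.items():
        barcode = barcode.upper()
        hits = [
            contig_id
            for contig_id, contig in contigs.items()
            if barcode in contig.upper() or barcode in reverse_complement(contig)
        ]
        if len(hits) == 1:
            assignments[specimen_id] = hits[0]
    return assignments


# Toy data (hypothetical): real barcodes are several hundred bp long.
barcodes = {"spec_A": "ACGTTGCA", "spec_B": "GGGCCCTT"}
contigs = {
    "contig_1": "TTTTACGTTGCATTTT",  # forward strand contains spec_A's barcode
    "contig_2": "AAAAAAGGGCCCAAAA",  # reverse strand contains spec_B's barcode
}
assignments = assign_contigs(barcodes, contigs)
print(assignments)
```

Checking both strands matters because an assembler may output a contig in either orientation.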
Gene annotation and correction with MitoCorrect
Protein-coding gene boundaries are annotated and then refined by MitoCorrect (Creedy et al., 2026), which corrects start and stop codon positions across all 13 mitochondrial protein-coding genes. The tool scores each candidate annotation against two criteria: conservation of intergenic spacing observed across the full alignment, and agreement with curated reference sequences. This eliminates frameshift errors that automated annotation tools commonly propagate.
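To make the idea of scoring candidates against two criteria concrete, here is a toy sketch. It is not MitoCorrect's algorithm or API: the `Candidate` fields, the linear spacing penalty, the 30 bp tolerance, and the equal weighting are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    start: int                 # proposed start-codon position in the contig
    spacing: int               # gap (bp) to the end of the upstream gene
    reference_identity: float  # fraction of matches to a curated reference, 0..1


def score(candidate: Candidate, consensus_spacing: int,
          spacing_weight: float = 0.5, tolerance: int = 30) -> float:
    """Combine spacing conservation and reference agreement into one value in 0..1.

    Assumption: the spacing score falls linearly from 1 to 0 as the candidate's
    spacing moves `tolerance` bp away from the alignment-wide consensus.
    """
    deviation = abs(candidate.spacing - consensus_spacing)
    spacing_score = max(0.0, 1.0 - deviation / tolerance)
    return spacing_weight * spacing_score + (1.0 - spacing_weight) * candidate.reference_identity


def best_candidate(candidates: list, consensus_spacing: int) -> Candidate:
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda c: score(c, consensus_spacing))


# Hypothetical candidates for one gene start; consensus spacing of 0 bp.
candidates = [
    Candidate(start=100, spacing=12, reference_identity=0.90),
    Candidate(start=103, spacing=0, reference_identity=0.95),
    Candidate(start=91, spacing=3, reference_identity=0.80),
]
chosen = best_candidate(candidates, consensus_spacing=0)
print(chosen.start)
```

In this sketch the candidate that preserves the consensus spacing and agrees best with the reference wins; how the real tool balances the two criteria is described in Creedy et al. (2026).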
Database integration and taxonomy
Curated mitogenomes are imported into the SITE-100 database and linked to specimen records via a three-step matching hierarchy: museum accession number → morphospecies name → image identifier.
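The matching hierarchy amounts to an ordered fallback lookup, sketched below. The field names (`accession`, `morphospecies`, `image_id`), the example values, and the rule that a key only counts when it matches exactly one specimen are assumptions made for illustration, not the database's actual schema or logic.

```python
from __future__ import annotations

from typing import Optional

# Keys are tried in this order: accession number, morphospecies name, image identifier.
MATCH_ORDER = ("accession", "morphospecies", "image_id")


def match_specimen(mitogenome: dict, specimens: list[dict]) -> Optional[tuple[dict, str]]:
    """Return (specimen, key_used) for the first key giving a unique match, else None."""
    for key in MATCH_ORDER:
        value = mitogenome.get(key)
        if not value:
            continue  # the mitogenome record lacks this field; try the next one
        hits = [s for s in specimens if s.get(key) == value]
        if len(hits) == 1:
            return hits[0], key
    return None


# Hypothetical specimen records.
specimens = [
    {"id": 1, "accession": "ACC-001", "morphospecies": "msp_A", "image_id": "IMG_1"},
    {"id": 2, "accession": None, "morphospecies": "msp_B", "image_id": "IMG_2"},
    {"id": 3, "accession": None, "morphospecies": "msp_C", "image_id": "IMG_3"},
    {"id": 4, "accession": None, "morphospecies": "msp_C", "image_id": "IMG_4"},
]

by_accession = match_specimen({"accession": "ACC-001"}, specimens)
by_name = match_specimen({"accession": "ACC-999", "morphospecies": "msp_B"}, specimens)
by_image = match_specimen({"morphospecies": "msp_C", "image_id": "IMG_4"}, specimens)
print(by_accession[1], by_name[1], by_image[1])
```

Returning the key that produced the match makes it possible to audit how many links rest on the weaker fallback steps.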
Taxonomic identities were resolved using an internal reference table compiled by Maria Pestana Correia through cross-matching records across iNaturalist, BOLD, NCBI, Open Tree of Life and Catalogue of Life (taxonomy.site100.org). Each specimen is assigned to a unique taxonomic entity reconciled across all project collection sites.
Mitogenome data sources
The database integrates mitogenomes from five source collections. Each source has a dedicated table and a unique prefix in the MT-ID field.
| Source | MT-ID prefix | Description | Records |
|---|---|---|---|
| BIOD | BIOD····· | NHM London SITE-100 specimens with physical vouchers | ~8,895 |
| GBDL | GBDL····· | GenBank-derived sequences with locality data | ~4,259 |
| QINL | QINL····· | Qinling (China) specimens with physical vouchers | ~1,015 |
| Others | CCCP / MIZA / … | Collaborator-contributed sequences (8 institutions) | ~1,007 |
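Because each source has its own MT-ID prefix, a record's source can be recovered from the identifier alone. The sketch below assumes the prefix is the first four characters of the MT-ID; the example identifiers are hypothetical, and only the two collaborator prefixes named in the table are included because the full list is not given.

```python
# Prefixes taken from the table above.
SOURCE_BY_PREFIX = {"BIOD": "BIOD", "GBDL": "GBDL", "QINL": "QINL"}
# Only the two collaborator prefixes named in the table; the list is incomplete.
COLLABORATOR_PREFIXES = {"CCCP", "MIZA"}


def source_of(mt_id: str) -> str:
    """Return the source collection for an MT-ID, or 'unknown' if unrecognised.

    Assumption: the prefix is the first four characters of the identifier.
    """
    prefix = mt_id[:4].upper()
    if prefix in SOURCE_BY_PREFIX:
        return SOURCE_BY_PREFIX[prefix]
    if prefix in COLLABORATOR_PREFIXES:
        return "Others"
    return "unknown"


# Hypothetical identifiers, for illustration only.
for mt_id in ("BIOD00001", "qinl00042", "MIZA00007", "XXXX00001"):
    print(mt_id, "->", source_of(mt_id))
```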
Geographic data structure
All collection sites are organised in a four-level geographic hierarchy. Coordinates are assigned at the finest available level and propagated upward when precise data are unavailable.
Current counts by level: 94 countries, 686 sites, 112 units, 598 plots.
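One reading of the coordinate rule is a fallback lookup: use the finest level that has coordinates, and otherwise move to the next coarser level. The sketch below assumes that reading and assumes the levels run plot, unit, site, country from finest to coarsest; the level keys and coordinate values are hypothetical.

```python
# Finest to coarsest; this ordering is an assumption based on the listed levels.
LEVELS = ("plot", "unit", "site", "country")


def resolve_coordinates(coords_by_level: dict):
    """Return (level, (lat, lon)) for the finest level with coordinates, else None."""
    for level in LEVELS:
        coords = coords_by_level.get(level)
        if coords is not None:
            return level, coords
    return None


# Hypothetical record: no plot or unit coordinates, so the site value is used.
record = {
    "plot": None,
    "unit": None,
    "site": (5.0, 117.5),
    "country": (4.0, 114.0),
}
resolved = resolve_coordinates(record)
print(resolved)
```

Returning the level alongside the coordinates lets downstream analyses filter out records that are only located at a coarse level.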
Data access
All data is published under CC BY 4.0. Specimen records, taxonomic data, and images are publicly accessible without registration. Mitogenome sequences with GenBank accession numbers are freely downloadable.
Sequences without public accession numbers are in pre-submission phase (ENA submission in progress). Downloading these sequences requires free account registration. Registration enables us to track data reuse and ensure datasets are properly cited — the licence permits any reuse provided attribution is given.
To request access or report a data issue, contact site100database@gmail.com.
Key references
- Bian et al. (2022). The SITE-100 project: site-based biodiversity genomics for species discovery, community ecology, and a global tree-of-life. Frontiers in Ecology and Evolution 10: 787560. doi:10.3389/fevo.2022.787560
- Creedy et al. (2026). Bioinformatics of combined nuclear and mitochondrial phylogenomics to define key nodes for the classification of Coleoptera. Systematic Biology 75(3): 445–467. doi:10.1093/sysbio/syaf031