Data Curation

How SITE-100 data is generated

All specimens, sequences, and mitogenomes in this database were produced through a standardised two-track molecular pipeline — combining bulk metabarcoding with individual shotgun sequencing — followed by automated annotation and quality correction (Bian et al., 2022).

Specimens: 25,451
Mitogenomes: 15,000+
Countries: 94
Collection period: 2019–2026

Sequencing pipeline

Each field sample contains approximately 50 morphologically distinct specimens (morphospecies). The pipeline operates on two parallel tracks: a bulk track generating community-level barcodes, and an individual track generating complete mitochondrial genomes for taxonomically sorted morphospecies.

Track A — Bulk community sample

Bulk sample → ~50 specimens → DNA extraction → One pool → Metabarcoding → ASVs → Metabarcodes

Track B — Individual morphospecies

Morphospecies → DNA extraction → Barcoding (COX1 amplicon) → Barcodes (COX1)
Morphospecies → DNA extraction → Shotgun sequencing → Mitogenomes (~15,000 bp)

Annotation & integration

Raw mitogenome (gene annotation) → MitoCorrect (codon boundary refinement) → SITE-100 Database

Bulk sample processing & metabarcoding

Specimens collected by standardised passive traps (flight interception traps) are sorted by parataxonomists into morphospecies. Up to ~50 specimens per sample are pooled for bulk DNA extraction. Amplification of the cox1 barcode region generates community-level ASVs, matched against reference sequences to produce species-level metabarcodes.

Individual barcoding & shotgun sequencing

Representative specimens from each morphospecies undergo individual DNA extraction. A cox1 barcode is sequenced per specimen. Separately, ~200 specimens are pooled for low-coverage Illumina shotgun sequencing (genome skimming), yielding 50–80% complete mitogenome assemblies per specimen.

Mitogenome assembly and specimen assignment

Mitogenome contigs are assembled from shotgun reads using standard genome assemblers. Each assembled mitogenome is then assigned back to its specimen using the individual cox1 barcode as an anchor: the contig that contains a specimen's barcode sequence is taken to be that specimen's mitogenome.
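The anchoring idea can be sketched in a few lines of Python. This is a minimal illustration only, not the project's actual code: the function names and toy sequences are hypothetical, and exact substring matching on either strand stands in for whatever similarity search the real pipeline uses.

```python
# Hypothetical sketch: exact substring matching stands in for a real
# similarity search between barcodes and assembled contigs.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def assign_contigs(barcodes: dict, contigs: dict) -> dict:
    """Map specimen ID -> contig ID when exactly one contig contains the barcode.

    Specimens whose barcode hits zero or several contigs stay unassigned.
    """
    assignments = {}
    for specimen_id, barcode in barcodes.items():
        forward = barcode.upper()
        reverse = reverse_complement(forward)
        hits = [
            contig_id
            for contig_id, seq in contigs.items()
            if forward in seq.upper() or reverse in seq.upper()
        ]
        if len(hits) == 1:
            assignments[specimen_id] = hits[0]
    return assignments


# Toy data (invented for illustration)
barcodes = {"spec_A": "ATGGCATTC", "spec_B": "TTTACCGGA", "spec_C": "GGGGGGGGG"}
contigs = {
    "contig_1": "CCCATGGCATTCAAA",  # contains the spec_A barcode
    "contig_2": "AAATCCGGTAAATTT",  # contains the reverse complement of spec_B
}
assignments = assign_contigs(barcodes, contigs)
```

Leaving ambiguous specimens unassigned, rather than picking the first hit, keeps a shared or contaminated barcode from silently linking a mitogenome to the wrong specimen.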

Gene annotation and correction with MitoCorrect

Protein-coding gene boundaries are annotated and then refined by MitoCorrect (Creedy et al., 2026), which corrects start and stop codon positions across all 13 mitochondrial protein-coding genes. The tool scores each candidate annotation against two criteria: conservation of intergenic spacing observed across the full alignment, and agreement with curated reference sequences. This eliminates frameshift errors that automated annotation tools commonly propagate.
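To make the kind of error concrete, the sketch below flags annotations whose boundaries look wrong. It is not MitoCorrect's algorithm and does not implement its two scoring criteria; it is a hypothetical sanity check that assumes the invertebrate mitochondrial genetic code (NCBI translation table 5) and accepts the truncated T or TA stop codons that are common in mitochondrial genes.

```python
# Hypothetical sanity check, NOT MitoCorrect: assumes NCBI translation table 5.
START_CODONS = {"ATA", "ATC", "ATG", "ATT", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG"}
INCOMPLETE_STOPS = ("T", "TA")  # completed to TAA by polyadenylation


def check_cds(seq: str) -> list:
    """Return a list of boundary problems found in one annotated CDS."""
    seq = seq.upper()
    problems = []
    if seq[:3] not in START_CODONS:
        problems.append("no recognised start codon")

    tail_len = len(seq) % 3
    body, tail = (seq[:-tail_len], seq[-tail_len:]) if tail_len else (seq, "")
    codons = [body[i:i + 3] for i in range(0, len(body), 3)]

    if tail:
        # Leftover bases are acceptable only as a truncated stop codon.
        if tail not in INCOMPLETE_STOPS:
            problems.append("length is not a multiple of three")
        internal = codons
    else:
        if not codons or codons[-1] not in STOP_CODONS:
            problems.append("no stop codon")
        internal = codons[:-1]

    if any(codon in STOP_CODONS for codon in internal):
        problems.append("internal stop codon")
    return problems


ok_full = check_cds("ATGAAACCCTAA")       # complete stop codon
ok_truncated = check_cds("ATGAAACCCT")    # truncated T stop
shifted = check_cds("ATGTAAACCCTAA")      # one extra base: frame is broken
```

An internal stop codon or a length that does not divide by three is a typical symptom of a misplaced boundary or a frameshift, which is why these two checks are reported separately from the start and stop codon checks.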

Tools: MitoCorrect, VSEARCH

Database integration and taxonomy

Curated mitogenomes are imported into the SITE-100 database and linked to specimen records via a three-step matching hierarchy: museum accession number → morphospecies name → image identifier.
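The matching hierarchy amounts to an ordered fallback: try the most reliable key first and move to the next only when it fails. A minimal sketch follows; the field names (`accession_number`, `morphospecies_name`, `image_id`) and the record layout are assumptions made for illustration, not the database's actual schema.

```python
from typing import Optional

# Hypothetical field names, tried in order of decreasing reliability.
MATCH_ORDER = ("accession_number", "morphospecies_name", "image_id")


def link_specimen(mitogenome: dict, specimens: list) -> Optional[tuple]:
    """Return (specimen record, key that matched), or None if nothing matches."""
    for key in MATCH_ORDER:
        value = mitogenome.get(key)
        if not value:
            continue  # this identifier is missing, fall through to the next
        for specimen in specimens:
            if specimen.get(key) == value:
                return specimen, key
    return None


# Toy records (invented for illustration)
specimens = [
    {"accession_number": "ACC-001", "morphospecies_name": "sp_12", "image_id": "IMG_9"},
    {"accession_number": None, "morphospecies_name": "sp_40", "image_id": "IMG_3"},
]
by_accession = link_specimen({"accession_number": "ACC-001"}, specimens)
by_name = link_specimen({"accession_number": None, "morphospecies_name": "sp_40"}, specimens)
unmatched = link_specimen({"image_id": "IMG_404"}, specimens)
```

Returning the key that produced the match makes it possible to audit how many links rest on the weaker identifiers.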

Taxonomic identities were resolved using an internal reference table compiled by Maria Pestana Correia through cross-matching records across iNaturalist, BOLD, NCBI, Open Tree of Life and Catalogue of Life (taxonomy.site100.org). Each specimen is assigned to a unique taxonomic entity reconciled across all project collection sites.

Mitogenome data sources

The database integrates mitogenomes from five source collections. Each source has a dedicated table and a unique prefix in the MT-ID field.

Source	MT-ID prefix	Description	Records
BIOD	BIOD·····	NHM London SITE-100 specimens with physical vouchers	~8,895
GBDL	GBDL·····	GenBank-derived sequences with locality data	~4,259
QINL	QINL·····	Qinling (China) specimens with physical vouchers	~1,015
Others	CCCP / MIZA / …	Collaborator-contributed sequences (8 institutions)	~1,007
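Because the prefix encodes the source, an MT-ID can be routed to its collection with a simple lookup. The sketch below assumes the prefix is always the first four letters of the ID and treats any prefix not named in the table as collaborator-contributed; both are assumptions, and the example IDs are invented.

```python
# Descriptions copied from the table above; everything else is hypothetical.
NAMED_SOURCES = {
    "BIOD": "NHM London SITE-100 specimens with physical vouchers",
    "GBDL": "GenBank-derived sequences with locality data",
    "QINL": "Qinling (China) specimens with physical vouchers",
}
OTHER_SOURCES = "Collaborator-contributed sequences"


def describe_source(mt_id: str) -> tuple:
    """Return (prefix, source description) for an MT-ID."""
    prefix = mt_id[:4].upper()
    if len(prefix) < 4 or not prefix.isalpha():
        raise ValueError(f"MT-ID does not start with a four-letter prefix: {mt_id!r}")
    # Assumption: unknown prefixes belong to the collaborator collections.
    return prefix, NAMED_SOURCES.get(prefix, OTHER_SOURCES)


biod = describe_source("BIOD00123")   # invented ID
other = describe_source("MIZA00007")  # invented ID
```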

Geographic data structure

All collection sites are organised in a four-level geographic hierarchy. Coordinates are assigned at the finest available level and propagated upward when precise data are unavailable.

Country (94 countries) → Locality (686 sites) → Precise location (112 units) → Plot (598 plots)
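One way to read the coordinate rule is as a fallback lookup: use the finest level that has coordinates, otherwise move one level up the hierarchy. The sketch below follows that reading; the record layout and key names are hypothetical and the coordinates are invented.

```python
from typing import Optional

# Finest to coarsest, following the four-level hierarchy; key names are hypothetical.
LEVELS = ("plot", "precise_location", "locality", "country")


def resolve_coordinates(site: dict) -> Optional[tuple]:
    """Return ((lat, lon), level used), or None if no level has coordinates."""
    for level in LEVELS:
        coords = site.get(level, {}).get("coords")
        if coords is not None:
            return coords, level
    return None


# Toy records (invented for illustration)
precise_site = {
    "plot": {"coords": (4.965, 117.804)},
    "locality": {"coords": (4.97, 117.80)},
}
coarse_site = {
    "plot": {"coords": None},
    "locality": {"coords": (4.97, 117.80)},
}
from_plot = resolve_coordinates(precise_site)
from_locality = resolve_coordinates(coarse_site)
```

Returning the level alongside the coordinates lets downstream analyses distinguish plot-level positions from coarser stand-ins.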

Data access

All data is published under CC BY 4.0. Specimen records, taxonomic data, and images are publicly accessible without registration. Mitogenome sequences with GenBank accession numbers are freely downloadable.

Sequences without public accession numbers are in the pre-submission phase (ENA submission in progress). Downloading these sequences requires free account registration, which lets us track data reuse and ensure datasets are properly cited; the licence permits any reuse provided attribution is given.

To request access or report a data issue, contact site100database@gmail.com.

Key references

Bian X, Garner BH, Liu H and Vogler AP. (2022). The SITE-100 project: site-based biodiversity genomics for species discovery, community ecology, and a global tree-of-life. Frontiers in Ecology and Evolution 10: 787560.
doi:10.3389/fevo.2022.787560
Creedy TJ, Ding Y, Gregory KM, Swaby L, Zhang F and Vogler AP. (2026). Bioinformatics of combined nuclear and mitochondrial phylogenomics to define key nodes for the classification of Coleoptera. Systematic Biology 75(3): 445–467.
doi:10.1093/sysbio/syaf031