SITE-100 Global Beetle Database
Data Curation

How SITE-100 data is generated

All specimens, sequences, and mitogenomes in this database were produced through a standardised two-track molecular pipeline — combining bulk metabarcoding with individual shotgun sequencing — followed by automated annotation and quality correction (Bian et al., 2022).

Specimens: 25,451
Mitogenomes: 15,000+
Countries: 14
Collection period: 2019–2026

Sequencing pipeline

Each field sample contains approximately 50 morphologically distinct specimens (morphospecies). The pipeline operates on two parallel tracks: a bulk track generating community-level barcodes, and an individual track generating complete mitochondrial genomes for taxonomically sorted morphospecies.

Track A (bulk community sample): bulk sample (~50 specimens) → DNA extraction (one pool) → metabarcoding (ASVs) → metabarcodes

Track B (individual morphospecies): morphospecies → DNA extraction → barcoding (COX1 amplicon) and shotgun sequencing → barcodes (COX1) and mitogenomes (~15,000 bp)

Annotation & integration: raw mitogenome (gene annotation) → MitoCorrect (codon boundary refinement) → SITE-100 Database
1. Bulk sample processing & metabarcoding

Specimens are collected with standardised passive traps (flight interception traps) and sorted into morphospecies by parataxonomists. Up to about 50 specimens per sample are pooled for bulk DNA extraction. Amplification of the cox1 barcode region yields community-level amplicon sequence variants (ASVs), which are matched against reference sequences to produce species-level metabarcodes.

2. Individual barcoding & shotgun sequencing

Representative specimens from each morphospecies undergo individual DNA extraction. A cox1 barcode is sequenced per specimen. Separately, ~200 specimens are pooled for low-coverage Illumina shotgun sequencing (genome skimming), yielding 50–80% complete mitogenome assemblies per specimen.

3. Mitogenome assembly and specimen assignment

Mitogenome contigs are assembled from the shotgun reads using standard genome assemblers. Each assembled mitogenome is then assigned back to its specimen using the individual COX1 barcode as an anchor: the barcode sequence identifies which contig in the pool belongs to which specimen.
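The barcode-as-anchor idea can be illustrated with a minimal sketch. It assumes exact substring matching on either strand, which stands in for the similarity search a real workflow would use; the function names, specimen IDs, and toy sequences are hypothetical and not taken from the SITE-100 pipeline.

```python
from typing import Dict, Optional


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def assign_contigs(barcodes: Dict[str, str],
                   contigs: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each specimen ID to the single contig that contains its barcode.

    A specimen is mapped to None when no contig, or more than one contig,
    contains its barcode on either strand (assumed policy for this sketch).
    """
    assignments: Dict[str, Optional[str]] = {}
    for specimen_id, barcode in barcodes.items():
        forward = barcode.upper()
        reverse = reverse_complement(forward)
        hits = [
            contig_id
            for contig_id, contig in contigs.items()
            if forward in contig.upper() or reverse in contig.upper()
        ]
        assignments[specimen_id] = hits[0] if len(hits) == 1 else None
    return assignments


# Toy data: two contigs from one pool and three specimen barcodes.
contigs = {"contig_1": "AAACCCGGGTTTACGT", "contig_2": "TTTTGGGGCCCCAAAA"}
barcodes = {"spec_A": "CCGGGTT", "spec_B": "GGGGCC", "spec_C": "ACGTACGTAC"}
assignments = assign_contigs(barcodes, contigs)
```

Returning None for ambiguous or missing hits keeps unresolved specimens visible rather than guessing which contig they belong to.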

4. Gene annotation and correction with MitoCorrect

Protein-coding gene boundaries are annotated and then refined by MitoCorrect (Creedy et al., 2026), which corrects start and stop codon positions across all 13 mitochondrial protein-coding genes. The tool scores each candidate annotation against two criteria: conservation of intergenic spacing observed across the full alignment, and agreement with curated reference sequences. This eliminates frameshift errors that automated annotation tools commonly propagate.
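To make the kind of boundary problem described above concrete, here is a small sketch that flags suspicious start and stop codons in a single protein-coding sequence. It is not MitoCorrect's scoring method; it only checks a sequence against the invertebrate mitochondrial genetic code (NCBI translation table 5) and accepts the incomplete T or TA stop codons common in mitogenomes. The function name and the issue labels are assumptions made for illustration.

```python
from typing import List

# NCBI translation table 5 (invertebrate mitochondrial code).
START_CODONS = {"ATT", "ATC", "ATA", "ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG"}


def boundary_issues(cds: str) -> List[str]:
    """Return a list of boundary problems found in one coding sequence."""
    cds = cds.upper()
    issues: List[str] = []

    if cds[:3] not in START_CODONS:
        issues.append("non-standard start codon")

    remainder = len(cds) % 3
    if remainder == 0:
        # Full-length frame: the last codon should be a complete stop.
        if cds[-3:] not in STOP_CODONS:
            issues.append("missing stop codon")
        body_end = len(cds) - 3
    else:
        # Partial last codon: accept only an incomplete stop (T or TA).
        if cds[len(cds) - remainder:] not in ("T", "TA"):
            issues.append("length not a multiple of 3")
        body_end = len(cds) - remainder

    # A stop codon inside the frame often signals a frameshift or a
    # misplaced boundary.
    if any(cds[i:i + 3] in STOP_CODONS for i in range(0, body_end, 3)):
        issues.append("internal stop codon")

    return issues
```

For example, `boundary_issues("ATGTAAGCATAA")` reports an internal stop codon, while a sequence ending in a lone T after complete codons is accepted as carrying an incomplete stop.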

Tools: MitoCorrect, VSEARCH

5. Database integration and taxonomy

Curated mitogenomes are imported into the SITE-100 database and linked to specimen records via a three-step matching hierarchy: museum accession number → morphospecies name → image identifier.
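The matching hierarchy can be sketched as an ordered fallback. The field names (`accession_number`, `morphospecies_name`, `image_id`, `specimen_id`) and the rule that an ambiguous match falls through to the next level are assumptions for illustration; the actual database schema may differ.

```python
from typing import List, Optional, Tuple

# Matching levels in priority order, following the hierarchy described above.
MATCH_ORDER = ("accession_number", "morphospecies_name", "image_id")


def match_specimen(mitogenome: dict,
                   specimens: List[dict]) -> Optional[Tuple[str, str]]:
    """Return (specimen_id, matched_field) for the first level that yields
    exactly one specimen, or None if no level gives a unique match."""
    for field in MATCH_ORDER:
        value = mitogenome.get(field)
        if not value:
            continue  # nothing to match on at this level
        hits = [s for s in specimens if s.get(field) == value]
        if len(hits) == 1:
            return hits[0]["specimen_id"], field
    return None


# Hypothetical specimen records.
specimens = [
    {"specimen_id": "S1", "accession_number": "ACC-001",
     "morphospecies_name": "Morpho sp. 1", "image_id": "IMG_1"},
    {"specimen_id": "S2", "accession_number": None,
     "morphospecies_name": "Morpho sp. 2", "image_id": "IMG_2"},
    {"specimen_id": "S3", "accession_number": None,
     "morphospecies_name": "Morpho sp. 2", "image_id": "IMG_3"},
]
```

Recording which field produced the match makes it possible to audit links made at the weaker levels later.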

Taxonomic identities were resolved using an internal reference table compiled by Maria Pestana Correia through cross-matching records across iNaturalist, BOLD, NCBI, Open Tree of Life and Catalogue of Life (taxonomy.site100.org). Each specimen is assigned to a unique taxonomic entity reconciled across all project collection sites.

Mitogenome data sources

The database integrates mitogenomes from the source collections listed below. Each source has a dedicated table and a unique prefix in the MT-ID field.

Source | MT-ID prefix | Description | Records
BIOD | BIOD····· | NHM London SITE-100 specimens with physical vouchers | ~8,895
GBDL | GBDL····· | GenBank-derived sequences with locality data | ~4,259
QINL | QINL····· | Qinling (China) specimens with physical vouchers | ~1,015
Others | CCCP / MIZA / … | Collaborator-contributed sequences (8 institutions) | ~1,007
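Because the source is encoded in the MT-ID prefix, a record's origin can be read straight from its identifier. The sketch below assumes the prefix is always the first four characters and that any prefix other than the three named ones belongs to a collaborator collection; the example IDs in the usage note are invented.

```python
# Prefix-to-source lookup built from the table above.
SOURCE_BY_PREFIX = {
    "BIOD": "NHM London SITE-100 specimens with physical vouchers",
    "GBDL": "GenBank-derived sequences with locality data",
    "QINL": "Qinling (China) specimens with physical vouchers",
}

# Assumed fallback label for all other prefixes (CCCP, MIZA, and so on).
OTHER_SOURCE = "Collaborator-contributed sequences"


def source_of(mt_id: str) -> str:
    """Return the source description for an MT-ID based on its prefix."""
    prefix = mt_id.strip().upper()[:4]
    return SOURCE_BY_PREFIX.get(prefix, OTHER_SOURCE)
```

With this lookup, a made-up identifier such as `BIOD00001` resolves to the NHM London collection, and `MIZA00042` falls back to the collaborator-contributed group.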

Geographic data structure

All collection sites are organised in a four-level geographic hierarchy. Coordinates are assigned at the finest available level and propagated upward when precise data are unavailable.

Country: 94 countries
Locality: 686 sites
Precise location: 112 units
Plot: 598 plots

Data access

All data are published under CC BY 4.0. Specimen records, taxonomic data, and images are publicly accessible without registration. Mitogenome sequences with GenBank accession numbers are freely downloadable.

Sequences without public accession numbers are in a pre-submission phase (ENA submission in progress), and downloading them requires free account registration. Registration lets us track data reuse and ensure that datasets are properly cited; the licence permits any reuse provided attribution is given.

To request access or report a data issue, contact site100database@gmail.com.

Key references