New: The detectors that scored perfect collapsed the hardest under attack.
For detection vendors and technical buyers

The science behind the score.

This page is the deep end, written for the teams who build detectors and the technical buyers who vet them. Nothing about how a number is produced is proprietary. The protocol is committed publicly before results, so you can check our work rather than trust our framing.

How an evaluation runs

Four stages, the same every time, every claim traceable to a number.

  1. 01 Input

    Benchmark dataset

    A robust, diverse corpus of real identities and frontier-quality attacks, spanning skin tones, conditions, and attack types, and expanded as new generators appear.

  2. 02 Transform

    Real-world conditions

    We put every sample through the same compression and re-encoding the real platforms apply, so the score reflects production, not a clean lab.

  3. 03 Measure

    The metrics that fit your field

    We score on the measures your field actually relies on, each with a margin of error, fixed in advance and reproducible from the same data.

  4. 04 Output

    Evaluation report (Deliverable)

    Pass, conditional, or fail. Audit-ready exhibits, per-group breakdowns, and the recipe that bypassed the system.
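Stage 03 promises a margin of error that is fixed in advance and reproducible from the same data, but the page does not say how it is computed. One common way to get that property is a percentile bootstrap with a fixed random seed. The sketch below assumes that approach; the function name `bootstrap_ci`, the seed, the resample count, and the toy outcomes are all illustrative, not taken from the protocol.

```python
import random
from typing import Callable, Sequence


def bootstrap_ci(
    samples: Sequence[float],
    metric: Callable[[Sequence[float]], float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (point_estimate, lower, upper) from a percentile bootstrap."""
    if not samples:
        raise ValueError("samples must be non-empty")
    rng = random.Random(seed)  # fixed seed: the same data yields the same interval
    n = len(samples)
    estimates = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original set.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(metric(resample))
    estimates.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return metric(samples), estimates[lo_idx], estimates[hi_idx]


def error_rate(outcomes: Sequence[float]) -> float:
    """Fraction of samples the detector got wrong (1 = error, 0 = correct)."""
    return sum(outcomes) / len(outcomes)


# Toy data (made up): 8 errors in 100 decisions.
outcomes = [1] * 8 + [0] * 92
point, lower, upper = bootstrap_ci(outcomes, error_rate)
print(f"error rate {point:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

Because the seed and resample count are arguments, they can be written into the protocol ahead of time, and anyone holding the same data can regenerate the same interval.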

The corpus

Run it on our corpus, or build it in yours.

Every evaluation needs both sides: genuine identities and the fakes built to beat them. By default we supply both, real bona fides and frontier-quality attacks generated against them.

When the threat is specific to your population, or nothing can leave your system, we bring our tools into your environment and build the corpus there. Your own images become the source identities, and we generate the impersonations against them, so the attack we run is the one an adversary would run on a real person in your funnel, not a generic sample.

Protocol

The load-bearing decisions.

Five controls that separate a measured result from a flattering one. Each is fixed before evaluation begins.

01

Pre-registration

Hypotheses, corpus, detector set, and perturbation slugs are committed to a public document before any result is computed. The git commit hash serves as the timing record. Confirmatory and exploratory analyses are separated, with multiple-comparison correction applied to each family.
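The page does not name the multiple-comparison correction. As a minimal sketch, assume the Holm step-down procedure applied separately to the confirmatory and exploratory families. The hypothesis labels and p-values below are invented for illustration.

```python
def holm_reject(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Holm step-down correction: map each hypothesis name to rejected / not rejected."""
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    decisions: dict[str, bool] = {}
    still_rejecting = True
    for rank, (name, p) in enumerate(ordered):
        # The k-th smallest p-value (k = rank, 0-based) is compared to alpha / (m - k).
        # Once one comparison fails, every larger p-value is also retained.
        if still_rejecting and p <= alpha / (m - rank):
            decisions[name] = True
        else:
            still_rejecting = False
            decisions[name] = False
    return decisions


# Each family is corrected on its own, so exploratory tests
# do not dilute the confirmatory ones (values are made up).
confirmatory = {"H1": 0.001, "H2": 0.030, "H3": 0.040}
exploratory = {"E1": 0.004, "E2": 0.200}

print(holm_reject(confirmatory))
print(holm_reject(exploratory))
```

In this toy family H3 would pass an uncorrected 0.05 test, but it is retained because H2 already failed its stricter threshold.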

02

Fairness stratification

Every metric is broken out across a 12-cell axis of skin tone by gender. We report the maximum disparity across cells, not the average, so a detector cannot hide a failing subgroup behind a strong pooled number. Cells with thin samples are tagged preliminary and carry wider confidence intervals rather than being silently dropped.

[Image grid: synthetic faces for Very light, Intermediate, Tan, and Dark skin tones, each shown as Female and Male]

12-cell axis: skin tone by gender

The synthetic attack side. Each cell is matched with genuine bona fides that the detector must tell apart from the attacks.
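The reporting rule for fairness stratification can be sketched as follows: compute a rate per cell, report the gap between the worst and best cell, and tag thin cells as preliminary instead of dropping them. The `CellResult` type, the `MIN_CELL_N` threshold of 100, and the counts are hypothetical; the page does not state what sample size counts as thin.

```python
from dataclasses import dataclass


@dataclass
class CellResult:
    skin_tone: str
    gender: str
    errors: int
    n: int  # assumed to be greater than zero

    @property
    def error_rate(self) -> float:
        return self.errors / self.n


MIN_CELL_N = 100  # hypothetical cut-off for a "thin" cell


def disparity_report(cells: list[CellResult], min_n: int = MIN_CELL_N) -> dict:
    """Summarize per-cell error rates by their maximum gap, keeping thin cells."""
    rates = {(c.skin_tone, c.gender): c.error_rate for c in cells}
    worst = max(rates, key=rates.get)
    best = min(rates, key=rates.get)
    return {
        "max_disparity": rates[worst] - rates[best],
        "worst_cell": worst,
        "best_cell": best,
        "mean_error_rate": sum(rates.values()) / len(rates),
        # Thin cells stay in the report; they are only flagged.
        "preliminary_cells": [(c.skin_tone, c.gender) for c in cells if c.n < min_n],
    }


# Made-up counts for four of the cells.
cells = [
    CellResult("Very light", "Female", errors=2, n=200),
    CellResult("Very light", "Male", errors=4, n=200),
    CellResult("Dark", "Female", errors=6, n=40),
    CellResult("Dark", "Male", errors=6, n=200),
]
report = disparity_report(cells)
print(report)
```

With these numbers the mean error rate looks modest while one cell sits far above the rest, which is the case the maximum-disparity rule is meant to expose.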

03

Cross-dataset and leave-one-out

Intra-dataset validation overstates skill. We evaluate on generators and sources held out of training, and rotate leave-one-out across the corpus to surface which conditions are genuinely hard rather than which were memorized.
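The rotation itself is simple to express. The sketch below assumes each sample carries a `source` field naming its generator or origin dataset; that field name and the sample records are hypothetical. It shows only the fold construction. A real fold would also need bona fides and attacks on both sides of the split.

```python
from typing import Iterator


def leave_one_source_out(
    samples: list[dict],
) -> Iterator[tuple[str, list[dict], list[dict]]]:
    """Yield (held_out_source, train, test), holding out one source per fold."""
    sources = sorted({s["source"] for s in samples})
    for held_out in sources:
        train = [s for s in samples if s["source"] != held_out]
        test = [s for s in samples if s["source"] == held_out]
        yield held_out, train, test


# Made-up sample records from three hypothetical generators.
samples = [
    {"id": "a1", "source": "gen_A"},
    {"id": "a2", "source": "gen_A"},
    {"id": "b1", "source": "gen_B"},
    {"id": "c1", "source": "gen_C"},
]

folds = list(leave_one_source_out(samples))
for held_out, train, test in folds:
    print(held_out, "train:", len(train), "test:", len(test))
```

Comparing the per-fold scores then shows which held-out source is hardest, rather than how well the detector recalls sources it trained on.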

04

Format parity

Bona fides and attacks pass through one encoding pipeline so the two classes share a compression signature. Without this control a detector can separate classes on format alone and report skill it does not have.
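One way to audit this control, sketched here under assumptions, is to compare the encoding signatures present in each class and flag any signature that appears in only one of them. The metadata fields (`codec`, `quality`, `resolution`) and the label values are hypothetical stand-ins for whatever the pipeline actually records.

```python
def encoding_signature(sample: dict) -> tuple:
    """The format properties a detector could exploit instead of synthesis artifacts."""
    return (sample["codec"], sample["quality"], sample["resolution"])


def format_parity_violations(samples: list[dict]) -> set[tuple]:
    """Return encoding signatures that occur in exactly one class."""
    seen = {"bona_fide": set(), "attack": set()}
    for s in samples:
        seen[s["label"]].add(encoding_signature(s))
    # Symmetric difference: signatures present in one class but not the other.
    return seen["bona_fide"] ^ seen["attack"]


# Before the shared pipeline: classes differ by format alone (made-up records).
raw = [
    {"label": "bona_fide", "codec": "jpeg", "quality": 85, "resolution": (720, 1280)},
    {"label": "attack", "codec": "png", "quality": 100, "resolution": (1024, 1024)},
]

# After both classes pass through one encoding pipeline.
normalized = [
    {**s, "codec": "jpeg", "quality": 85, "resolution": (720, 1280)} for s in raw
]

print(format_parity_violations(raw))
print(format_parity_violations(normalized))
```

An empty result is a necessary check, not a sufficient one: it shows the recorded metadata matches across classes, not that every compression trace is identical.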

05

Anonymized reporting

Detectors are reported anonymized unless the vendor chooses to self-reveal. Placement and results cannot be purchased. Each result card pins the detector build or commit hash, the subset identifier, per-cell sample sizes, the metric in fixed notation, the date window, and a named signatory.
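The fields pinned on a result card map naturally onto a small immutable record. The sketch below is one possible shape, not the published schema: the class name, field names, and every value are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ResultCard:
    detector_build: str            # build identifier or commit hash
    subset_id: str                 # which corpus subset was scored
    cell_sample_sizes: dict        # per-cell sample sizes
    metric: str                    # metric in fixed notation
    window_start: date             # start of the date window
    window_end: date               # end of the date window
    signatory: str                 # named person signing the result
    vendor_name: Optional[str] = None  # set only if the vendor self-reveals

    def __post_init__(self) -> None:
        if self.window_end < self.window_start:
            raise ValueError("date window ends before it starts")
        if not self.signatory:
            raise ValueError("a named signatory is required")

    @property
    def display_name(self) -> str:
        """Anonymized by default; the vendor name appears only on self-reveal."""
        return self.vendor_name or "Anonymized detector"


# Placeholder values only.
card = ResultCard(
    detector_build="abc1234",
    subset_id="subset-placeholder",
    cell_sample_sizes={("Very light", "Female"): 200, ("Dark", "Male"): 180},
    metric="ROC-AUC",
    window_start=date(2025, 1, 1),
    window_end=date(2025, 1, 31),
    signatory="Placeholder Signatory",
)
print(card.display_name)
```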

Glossary

The terms, in plain language.

The vocabulary an evaluation report uses, defined so a non-specialist reviewer can read the result without a statistics background.

ROC-AUC
A single score for how well a detector separates real from fake across all thresholds. 1.0 is perfect, 0.5 is a coin flip.
APCER / BPCER
The two error directions for presentation-attack detection: attacks wrongly accepted, and genuine samples wrongly rejected. We report both, never just the flattering one.
Per-group disparity
The gap between the best and worst demographic group. We report the maximum disparity, not the average, so a failing group cannot be hidden.
Format parity
Forcing real and fake media through one encoding pipeline so a detector cannot win by reading compression signature instead of synthesis artifacts.
Leave-one-out
Holding one generator or source out of training, testing on it, and rotating. It measures generalization to the unseen, not memorization.
Pre-registration
Committing the protocol publicly before any result is produced. The git commit hash is the timing record, so results cannot be reverse-fit.
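For readers who prefer arithmetic to definitions, the first two glossary entries can be worked through on a toy example. The sketch assumes a higher score means "more likely fake" and that attacks are pooled into one group; the presentation-attack standard defines APCER per attack type, so treat the pooling as a simplification. Scores and threshold are made up.

```python
def roc_auc(real_scores: list[float], fake_scores: list[float]) -> float:
    """Chance that a random fake scores above a random real; ties count half."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))


def apcer_bpcer(
    real_scores: list[float], fake_scores: list[float], threshold: float
) -> tuple[float, float]:
    """Scores at or above the threshold are flagged as attacks.

    APCER: share of attacks that slip under the threshold (wrongly accepted).
    BPCER: share of genuine samples at or above it (wrongly rejected).
    """
    apcer = sum(s < threshold for s in fake_scores) / len(fake_scores)
    bpcer = sum(s >= threshold for s in real_scores) / len(real_scores)
    return apcer, bpcer


real = [0.1, 0.2, 0.3, 0.6]
fake = [0.4, 0.7, 0.8, 0.9]

print(roc_auc(real, fake))           # 0.9375: 15 of 16 real/fake pairs are ordered correctly
print(apcer_bpcer(real, fake, 0.5))  # (0.25, 0.25): one attack accepted, one genuine rejected
```

ROC-AUC summarizes all thresholds at once, while APCER and BPCER describe the two error directions at the one threshold a deployment actually uses.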

A secure engagement, verifiable end to end.

Your model and the data you share stay inside the engagement, the methodology is public and pre-registered, and results are reported against named external standards. You can check every number yourself.

How we protect your data

Have a detector you need measured?

Tell us the threat model and the deployment. We will scope an evaluation against attacks it has not seen.