The detectors that scored perfectly collapsed the hardest under attack.
Benchmark

Detectors collapse from near-perfect to near-random.

An initial benchmark and robustness re-evaluation of leading open-source deepfake detectors, stratified across generators, platform degradations, and demographic groups. Findings first; the sequestered corpus and engagement terms are not published.

White paper · June 2026 · 6 minute read · DOI registered

Overview

We evaluated leading open-source deepfake detectors the way a buyer should: on attacks held out of training, under the conditions production imposes, and stratified so no group hides behind a strong average. The two detectors that scored a perfect clean number fell the furthest; six others lost less ground but still slipped. The pattern is the same across the field: clean-benchmark accuracy does not survive the test getting harder.

Detectors that score near-perfect on clean benchmarks fall to near-random once attacks are unseen and media is reprocessed the way platforms reprocess it.
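The held-out protocol described above can be sketched as a per-generator breakdown: score each generator's fakes against the same pool of reals, and report the generator the detector never trained on separately from the one it did. This is a minimal illustration, not the benchmark's harness. The sample layout, the names `gen_seen` and `gen_held_out`, the `toy_score` detector, and all numbers are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (fake, real) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one real and one fake sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_per_generator(samples, score_fn):
    """Return {generator: AUC of that generator's fakes against all reals}."""
    reals = [s for s in samples if s["label"] == 0]
    generators = sorted({s["generator"] for s in samples if s["label"] == 1})
    result = {}
    for gen in generators:
        subset = reals + [s for s in samples if s["generator"] == gen]
        labels = [s["label"] for s in subset]
        scores = [score_fn(s["x"]) for s in subset]
        result[gen] = roc_auc(labels, scores)
    return result


# Hypothetical toy data: "x" stands in for whatever cue the detector learned.
samples = (
    [{"x": x, "label": 0, "generator": None} for x in (0.10, 0.20, 0.30, 0.40)]
    + [{"x": x, "label": 1, "generator": "gen_seen"} for x in (0.70, 0.80, 0.90, 0.95)]
    + [{"x": x, "label": 1, "generator": "gen_held_out"} for x in (0.05, 0.25, 0.35, 0.50)]
)


def toy_score(x):
    # Hypothetical detector: higher score means "more likely fake".
    return x


per_generator = auc_per_generator(samples, toy_score)
for name, auc in per_generator.items():
    print(f"{name}: AUC = {auc:.4f}")
```

Reporting the two generators on separate lines, rather than pooling them, is the point: a pooled figure would blend the seen generator's strong result with the held-out one.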

Bar chart of detector ROC-AUC on SDXL and InstantID, colored by training-corpus family, with per-cell AUC range whiskers across 12 demographic cells.
Fig. 1. Overall AUC by detector, with the black whisker showing the per-group range. The spread inside a single bar is the fairness story a pooled number hides.

Source: Margen open-source detector benchmark · 14 detectors.

The clean leaderboard

On a clean benchmark, the field looks healthy. A handful of detectors post near-perfect scores and a tidy ranking emerges. This is the table a buyer usually sees, and the number a vendor usually quotes.

Detector             Clean AUC   Clean-board tier
DMimageDetection     1.0000      Top of clean board
Fusion               1.0000      Top of clean board
SigLIP2              0.8373      Mid clean board
Smogy                0.7519      Mid clean board
Xception             0.7065      Mid clean board
F3Net                0.5996      Near random
UCF                  0.4992      Near random
clipdet_latent10k    0.3005      Near random
SBI (FF c23)         0.2731      Near random

Clean-benchmark ROC-AUC on leading open-source detectors. The rest of this piece is about what happens to these numbers once the test stops being clean.

The format confound

A large share of reported detector skill comes from reading compression signature, not synthesis artifacts. When real and fake media are forced through one encoding pipeline so the two share a format, the apparent accuracy collapses.

Slope chart showing detector skill collapsing once real and fake media share one encoding pipeline.
Fig. 2. Format parity. Apparent skill that came from compression signature disappears once both classes pass through the same pipeline.

The control is a few lines. Score the detector as-is, then re-encode both classes through one pipeline and score again. The gap between the two ROC-AUC values is the portion of apparent skill that came from reading format rather than synthesis artifacts.

format_parity.py (Python)

    from sklearn.metrics import roc_auc_score

    # Not defined in this snippet and assumed to exist in the caller's scope:
    # `detector` (with a .score method), `images`, `y_true`, and `reencode`.

    # Score without parity: real and fake arrive in different formats
    auc_raw = roc_auc_score(y_true, detector.score(images))

    # Re-encode every image through ONE pipeline, then score again
    parity = [reencode(img, codec="h264", quality=70) for img in images]
    auc_parity = roc_auc_score(y_true, detector.score(parity))

    # The gap is the "skill" that was reading format, not artifacts
    print(round(auc_raw - auc_parity, 3))
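To see the control run end to end without a real detector or codec, here is a self-contained toy version. Everything in it is a stand-in: each sample is a pair of numbers (a hypothetical `codec_trace` and a hypothetical `artifact` strength), `toy_detector` is deliberately built to lean on the codec trace, and `reencode_toy` only overwrites that trace with a shared constant. The values are illustrative and are not benchmark results.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (fake, real) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def toy_detector(sample):
    # Hypothetical detector that mostly reads the codec trace.
    return sample["codec_trace"] + 0.1 * sample["artifact"]


def reencode_toy(sample, shared_trace=0.5):
    # Stand-in for pushing a file through one pipeline: every sample
    # ends up with the same codec trace; the artifact is left unchanged.
    return {"codec_trace": shared_trace, "artifact": sample["artifact"]}


# Reals arrive in one format (trace 0.0), fakes in another (trace 1.0).
reals = [{"codec_trace": 0.0, "artifact": a} for a in (0.20, 0.50, 0.40, 0.70)]
fakes = [{"codec_trace": 1.0, "artifact": a} for a in (0.30, 0.60, 0.45, 0.55)]
samples = reals + fakes
y_true = [0] * len(reals) + [1] * len(fakes)

auc_raw = roc_auc(y_true, [toy_detector(s) for s in samples])
auc_parity = roc_auc(y_true, [toy_detector(reencode_toy(s)) for s in samples])

print(f"raw AUC:    {auc_raw:.4f}")
print(f"parity AUC: {auc_parity:.4f}")
print(f"gap:        {auc_raw - auc_parity:.4f}")
```

In this toy the raw score is perfect only because the two classes never share a codec trace. Once they do, the detector is left with the weak artifact cue alone.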

Per-group failure

Pooled accuracy hides subgroup failure. We break every result out across skin-tone and gender groups and report the maximum disparity, not the average, so a failing group cannot be averaged away.
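The maximum-disparity idea can be sketched in a few lines: compute the true-positive rate per group, then report the gap between the best and worst group alongside the pooled rate. The group names, labels, and predictions below are hypothetical placeholders, and the benchmark's exact disparity definition may differ from this max-minus-min form.

```python
from collections import defaultdict


def tpr_by_group(labels, preds, groups):
    """True-positive rate (share of fakes flagged as fake) for each group."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        if y == 1:  # only fakes count toward the true-positive rate
            total[g] += 1
            hits[g] += int(p == 1)
    return {g: hits[g] / total[g] for g in total}


def max_disparity(rates):
    """Gap between the best-served and worst-served group."""
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: 12 fakes, 4 per group; preds are detector decisions.
groups = ["group_1"] * 4 + ["group_2"] * 4 + ["group_3"] * 4
labels = [1] * 12
preds = [1, 1, 1, 1,   1, 1, 1, 0,   1, 0, 0, 0]

rates = tpr_by_group(labels, preds, groups)
pooled_tpr = sum(preds) / len(preds)
disparity = max_disparity(rates)

print("per-group TPR:", rates)
print(f"pooled TPR:    {pooled_tpr:.4f}")
print(f"max disparity: {disparity:.4f}")
```

Here the pooled rate sits near two thirds, which says nothing about the group where three of four fakes slip through. The disparity figure surfaces that group directly.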

Heatmap of detector accuracy across skin-tone and gender groups.
Fig. 3. Per-group true-positive rate by skin tone and gender. Darker cells indicate lower detector confidence.

Source: Margen open-source detector benchmark · 14 detectors. Skin-tone labels are imperfect; read per-group results as directional, not precise.

Platform degradation

Real-world re-encoding, resizing, and recompression move numbers that clean-lab benchmarks never test. A detector validated only on pristine images is not validated for production.
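One way to structure such a test is a degradation sweep: apply each transform to every sample, re-score, and report one number per condition. The sketch below uses toy one-dimensional "images" of integer pixel values and two crude stand-ins for platform processing (`downsample` and `quantize`). These are hypothetical simplifications, not real codecs, and `toy_score` is a made-up detector that keys on high-frequency energy.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (fake, real) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def toy_score(row):
    # Hypothetical detector: total absolute difference between neighbours.
    return sum(abs(b - a) for a, b in zip(row, row[1:]))


def downsample(row):
    # Crude stand-in for resizing: average each adjacent pair of pixels.
    return [(row[i] + row[i + 1]) // 2 for i in range(0, len(row), 2)]


def quantize(row, step=20):
    # Crude stand-in for recompression: snap pixels down to a coarse grid.
    return [v // step * step for v in row]


def identity(row):
    return row


# Reals are smooth ramps; fakes are the same ramps plus an alternating +/-10 ripple.
ramps = [[i * slope for i in range(8)] for slope in (10, 12, 14)]
fakes = [[v + 10 if i % 2 == 0 else v - 10 for i, v in enumerate(r)] for r in ramps]
samples = ramps + fakes
y_true = [0] * len(ramps) + [1] * len(fakes)

conditions = {"clean": identity, "downsample": downsample, "quantize": quantize}
results = {
    name: roc_auc(y_true, [toy_score(transform(row)) for row in samples])
    for name, transform in conditions.items()
}

for name, auc in results.items():
    print(f"{name:<11} AUC = {auc:.4f}")
```

In this toy the ripple that separates the classes is exactly what pair-averaging removes, so the detector drops from perfect to chance under `downsample`. The same loop extends to any list of transforms a target platform is known to apply.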

Chart of detector performance dropping under platform re-encoding and recompression.
Fig. 4. Platform degradation. Performance falls under the reprocessing a real platform applies.

What this means for buyers

A published accuracy near 1.0 is not evidence a detector will hold against fraud. Before you rely on one, it should be tested on unseen generators, under platform-realistic conditions, and broken out by group. That is what a Margen evaluation produces.

Cite this paper

This benchmark is published openly with a permanent DOI. Cite the immutable record, not this page.


Babalola, D. (2026). Deepfake Detector Robustness Under Social-Media Re-encoding. Zenodo. https://doi.org/10.5281/zenodo.20781389