New: The detectors that scored perfect collapsed the hardest under attack.
Independent red team for biometric systems

We break the deepfake detectors your fraud defense relies on.

Margen is the independent red team for your deepfake defense. We attack the detection you have bought or built with the fraud that is actually circulating, then tell you where it holds, where it fails, and what to do about it.

Best fit for organizations holding large amounts of biometric data.

Detector report card

vendor-x v4.2

CONDITIONAL
  • Unseen generators: Fails
  • Platform-realistic conditions: Degrades
  • Per-group fairness: Gap found
  • Clean benchmark: Passes

Illustrative. Real engagements report confidence intervals and the exact failure path.

Results reported against these external standards, not certified against them

  • ISO/IEC 30107-3
  • ISO/IEC 19795
  • ISO/IEC 24027
  • NIST AI RMF
  • EU AI Act
Where the human layer ends

At scale, no one is checking by hand. The system is.

High-volume identity flows (onboarding, verification, remote interviews) move far more traffic than any team can review by hand. And the best fakes now slip past trained reviewers anyway. The human layer cannot hold the line.

So your detection system is the fraud-prevention layer. These systems run several checks at once, and a weak spot at any one of them lets the fraud through.

That technical control is what we evaluate. We find where it fails, before an attacker does.

The core challenge

Why a strong-looking detector still lets fraud through.

Trained on yesterday

Detectors learn from the generators that existed when they were built. New models ship every month, and the detector has never seen them.

Tested in a lab

Vendor numbers come from pristine images. Real fraud arrives compressed, resized, and re-encoded by the platforms it passes through.

Measured on the average

A strong overall score can hide groups the detector barely catches. The average looks fine while a whole subgroup is an open lane.
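
To make the averaging problem concrete, here is a minimal sketch in Python. The function name, the group labels "A" and "B", and the counts are all hypothetical and chosen only for illustration; they are not Margen results.

```python
from collections import defaultdict

def detection_rate_by_group(records):
    """records: iterable of (group, caught) pairs, one per fake sample.

    Returns (overall_rate, {group: rate}).
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for group, was_caught in records:
        total[group] += 1
        caught[group] += int(was_caught)
    overall = sum(caught.values()) / sum(total.values())
    per_group = {g: caught[g] / total[g] for g in total}
    return overall, per_group

# Hypothetical counts: group "A" dominates the test set, group "B" is small.
records = (
    [("A", True)] * 880 + [("A", False)] * 20
    + [("B", True)] * 40 + [("B", False)] * 60
)
overall, per_group = detection_rate_by_group(records)
print(f"overall: {overall:.2f}")
for group, rate in sorted(per_group.items()):
    print(f"group {group}: {rate:.2f}")
```

With these made-up counts the overall rate stays above 0.9 even though the detector misses most fakes in group "B", because the larger group dominates the average.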

The problem, measured

The detectors that scored perfect collapsed the hardest.

Detection score, where 1.00 is perfect and 0.50 is a coin flip.

Clean lab test

1.00 score

Two open-source detectors that hit a perfect score on a clean test.

Decay to below a coin flip

Real conditions

0.34 score

The same two, re-tested against fresh attacks and the compression real platforms apply. Six other detectors slipped too, but far less.

Source: Margen open-source detector benchmark · 14 detectors
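
The page does not name the metric behind these scores. One common score with this scale (1.00 perfect, 0.50 chance) is ROC AUC, so the sketch below assumes that reading. The function name and the sample scores are hypothetical.

```python
def roc_auc(fake_scores, real_scores):
    """Chance that a random fake gets a higher 'fake' score than a random real.

    Ties count as half a win. 1.00 means perfect separation, 0.50 is chance,
    and anything below 0.50 means the detector ranks reals as more
    fake-looking than fakes.
    """
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))

# Hypothetical detector outputs (higher = more likely fake).
clean_auc = roc_auc([0.90, 0.80, 0.95], [0.10, 0.20, 0.30])
degraded_auc = roc_auc([0.20, 0.60, 0.10], [0.30, 0.50, 0.70])
print(f"clean: {clean_auc:.2f}, degraded: {degraded_auc:.2f}")
```

Under this reading, a score below 0.50 means more than lost accuracy: the detector's ranking has inverted on the degraded inputs.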

What we test

We test against attacks your detector has never seen.

Fraudsters do not use last year's models. We evaluate against generators held out of your training and add new ones as threats emerge, so the score reflects tomorrow's attack.
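
A held-out evaluation of this kind can be sketched as a split by generator. This is a minimal illustration, not Margen's pipeline; the sample layout and the generator names ("gen_a", "gen_b", "gen_new") are hypothetical.

```python
def split_by_generator(samples, held_out):
    """Split samples so held-out generators never appear on the training side.

    samples:  list of dicts, each with a 'generator' key.
    held_out: set of generator names reserved for evaluation.
    """
    seen = [s for s in samples if s["generator"] not in held_out]
    unseen = [s for s in samples if s["generator"] in held_out]
    return seen, unseen

# Hypothetical corpus entries.
samples = [
    {"id": 1, "generator": "gen_a"},
    {"id": 2, "generator": "gen_b"},
    {"id": 3, "generator": "gen_new"},
    {"id": 4, "generator": "gen_a"},
]
seen, unseen = split_by_generator(samples, held_out={"gen_new"})
print(len(seen), "seen-generator samples;", len(unseen), "unseen-generator samples")
```

Scoring only on the `unseen` side is what separates a detector that generalizes from one that memorized the generators it was trained on.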

What we offer

Two ways to red-team a detector.

Both engagements run on the same dataset and the same methodology. They differ in who initiates them, who owns the customer relationship, and what the report says on the cover.

01

Evaluation

You submit a detector. We red-team it.

An independent red team against your own model. We attack your detector with the fraud that is actually circulating, under the conditions a real platform imposes, then return a verdict: pass, conditional, or fail. Where it breaks, you get the exact recipe that beat it, so you can fix it.

Initiated by: The detection vendor
Duration: 4 to 6 weeks, fixed scope
Deliverable: Red-team report with a verdict, per-group results, and the recipes that broke it
Used for: Procurement evidence, marketing-claim validation, pre-release QA
Request an evaluation

02

Co-delivered

Your red team brings us in for the technical layer.

A partnership with red-team and security-awareness firms. The partner runs the engagement and keeps the customer; we add an independent review of the detection technology in scope, so the end customer leaves with one joint report covering both the human and the technical layer.

Initiated by: A red team or security-awareness partner
Duration: Matches the host engagement
Deliverable: Single joint report covering human and technical layers
Used for: Enterprise security audits, joint customer engagements
Become a partner
Who we serve

Five teams, one measurement layer.

  • Somewhere else? Tell us your use case.
01 / 05 Vendor

The third-party red team that helps you close the deal.

Your buyers ask for proof that goes beyond your own benchmark. We are the independent red team that supplies it: an evaluation grounded in a corpus your team did not assemble and a method your team did not design, so the number holds up in the room where the deal is won.

What we measure for them

  • Per-group performance. Demographic and platform breakdowns of every score.
  • Bypass recipes. Every failure annotated with the recipe that surfaced it.
  • Pre-release QA. A second pair of eyes before you ship.
Why Margen

Three commitments the measurement layer cannot exist without.

01 / Coverage

Coverage that tracks the threat.

Our evaluation corpus expands toward the frontier generators adversaries are adopting, so the benchmark keeps pace with the attack.

02 / Method

Reproducible methodology.

Every claim is backed by a dataset, a documented pipeline that recompresses media the way platforms do, and a pre-registered statistical methodology. Results can be independently re-run by anyone with corpus access on request.

03 / Fit

Context-fit evaluation.

We tailor the assessment to the threat the enterprise actually faces. The methodology is rigorous within each context, not generic across them.

Versus the alternatives

Not an internal team. Not a generalist pentest.

A detector is a specialized control, and evaluating it takes a specialized, independent adversary. Here is how an engagement compares to the alternatives most teams reach for first.

Compared across: Margen · Internal red team · Generalist pentest

Capabilities:
  • Independent of the vendor under test
  • Deepfake-specific attack corpus
  • Per-group fairness breakdown
  • Platform-realistic conditions
  • Pre-registered, reproducible method
  • Hands back the breaking recipe

Rating scale: Yes · Partial · No

14 detectors evaluated
12 demographic groups covered per detector
1.00 to 0.34: top detectors, from perfect to below a coin flip
0 detectors we sell, by design

Image: synthetic face, tan skin tone, female (labels: Synthetic, Tan, Female)

We build the attacks

We build the impersonations your detector has to catch.

A detector is only proven on real and fake side by side. We build the hard half: frontier-quality fakes, including impersonations made to pass as a real person. Your detector is scored on both, so the number reflects whether it can still tell a genuine face from the attack built to wear it. The faces shown here are the synthetic attack side.

Find your blind spot before someone else does.

Submit a model for evaluation, or add the detection layer to a red-team engagement. We return a per-group report card showing both kinds of mistake and the margin of error on each, and where a detector fails, the recipe that broke it. For buyers, we can point you to the detection that actually holds for your use case.
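
As a rough illustration of "both kinds of mistake and the margin of error on each", the sketch below computes a missed-fake rate and a flagged-real rate for one group, each with a 95% Wilson score interval. The choice of the Wilson interval, the function name, and the counts are assumptions made for this example; the page does not state which interval method real reports use.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error rate errors/n (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical counts for a single demographic group.
missed_fakes, fakes = 12, 200    # fakes the detector let through
flagged_reals, reals = 3, 200    # genuine faces wrongly flagged

miss_lo, miss_hi = wilson_interval(missed_fakes, fakes)
flag_lo, flag_hi = wilson_interval(flagged_reals, reals)
print(f"missed fakes:  {missed_fakes / fakes:.3f}  [{miss_lo:.3f}, {miss_hi:.3f}]")
print(f"flagged reals: {flagged_reals / reals:.3f}  [{flag_lo:.3f}, {flag_hi:.3f}]")
```

Reporting the interval alongside each rate keeps small groups honest: with few samples the interval widens, and a gap between groups is only meaningful when their intervals clearly separate.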