New · The detectors that scored perfect collapsed the hardest under attack.
Vendor · Builders that need third-party validation.

The third-party red team that helps you close the deal.

Your buyers ask for proof that goes beyond your own benchmark. We are the independent red team that supplies it: an evaluation grounded in a corpus your team did not assemble and a method your team did not design, so the number holds up in the room where the deal is won.

The evidence

Your benchmark vs an independent one

What you were told

~1.00

on your own benchmark

The score your team measured and put in the deck.

What holds in your funnel

in front of a buyer

What that number is worth once the buyer knows your team picked the test, the data, and the threshold.

Buyers increasingly discount a vendor's own benchmark. A score from a party that did not build the model, on attacks it has never seen, is the one that moves a deal.

Read the benchmark

How an evaluation unfolds

Six weeks. Five milestones. One report on the other side.

  1. W1

    Kickoff and scope lock

    Attack classes, demographic axes, and the bar for a finding are agreed and frozen.

  2. W2

    Integration and calibration

    Your detector is wired into the harness. Baseline performance captured before any perturbation.

  3. W3 to W4

    Adversarial runs

    Benchmark and perturbation pipeline executed. Every decision logged for replay.

  4. W5

    Analysis and bypass recipes

    Per-group margins of error computed. Failures annotated with the recipe that surfaced them.

  5. W6

    Verdict and engineering debrief

    Evaluation report delivered. Working session with engineering. Verdict signed.

Co-delivered engagements are matched to the host scope and may run shorter or longer.

The key deliverable

We hand back the recipe that broke your model, ready to fine-tune.

Proof that closes deals

Independent evidence buyers trust, because your team did not pick the test, the data, or the threshold.

Re-evaluate fast

Re-run against fresh attacks as they appear, and see exactly where the detector starts to slip.

The exact attack, not just a score

For every miss, the attack that produced it, formatted to drop straight into your next training set.

Before you ship

A second, independent read on the model before it reaches a customer.

What the report looks like

Every miss comes with the recipe to fix it.

We rank what we find by severity and hand back the exact attack behind each one, so your team can reproduce it and fold it into the next fine-tune.

Severity · Finding · The recipe
Critical · Platform compression · Re-compress uploads at H.264 quality 70, the way real platforms do. The score fell from 1.00 to 0.34, below a coin flip.
Critical · Weakest group · Break results out per group: one cell fell below a coin flip while the overall average still looked healthy.
High · Format tell · Re-encode real and fake identically, removing the file-format signal the model had been leaning on instead of content.

Example findings drawn from our open benchmark, shown to illustrate the report format. Your evaluation returns findings specific to the detector you submit.

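To make the weakest-group finding concrete, here is a minimal sketch of how a healthy overall figure can hide one failing cell. It is an illustration only, not the evaluation harness: the group names, the records, and the use of plain accuracy as the score are all assumptions made for the example.

```python
from collections import defaultdict

def per_group_accuracy(records):
    """records: iterable of (group, label, prediction); 1 = fake, 0 = real."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, label, prediction in records:
        totals[group] += 1
        hits[group] += int(label == prediction)
    return {group: hits[group] / totals[group] for group in totals}

def summarize(records):
    """Return (overall accuracy, worst group, worst group's accuracy)."""
    records = list(records)
    by_group = per_group_accuracy(records)
    overall = sum(int(label == pred) for _, label, pred in records) / len(records)
    worst_group = min(by_group, key=by_group.get)
    return overall, worst_group, by_group[worst_group]

# Hypothetical data: group "A" is large and fully correct,
# group "B" is small and mostly wrong.
records = (
    [("A", 1, 1)] * 45 + [("A", 0, 0)] * 45    # 90 correct decisions
    + [("B", 1, 1)] * 2 + [("B", 0, 0)] * 2    # 4 correct decisions
    + [("B", 1, 0)] * 3 + [("B", 0, 1)] * 3    # 6 wrong decisions
)

overall, worst_group, worst_acc = summarize(records)
print(f"overall={overall:.2f} worst={worst_group} ({worst_acc:.2f})")
# prints: overall=0.94 worst=B (0.40)
```

On this made-up data the overall figure is 0.94 while group B sits at 0.40, below a coin flip on its balanced sample. That is why the breakdown is reported per group with the worst case shown.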
What you get

Everything procurement asks for is in the box.

  • Per-group performance

    Results broken out by demographic and platform, with the worst case shown.

  • Both kinds of mistake

    Fakes let through and real users wrongly blocked, with the margin of error on each.

  • Bypass recipes

    Every failure annotated with the recipe that surfaced it.

  • Methodology, documented

    Public, versioned, and signed by the lead researcher.

  • Platform-realistic conditions

    Scored under the re-encoding your deployment actually applies.

  • Audit-ready exhibits

    Findings packaged to hand to a board or a regulator.
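As a rough illustration of the "both kinds of mistake" item above, the sketch below computes the rate of fakes let through and the rate of real users wrongly blocked, each with a margin of error. The Wilson score interval is an assumption chosen for the example, not a statement of the method used in the report, and the counts are hypothetical.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def error_rates(labels, predictions):
    """labels and predictions use 1 = fake, 0 = real."""
    fakes = [p for l, p in zip(labels, predictions) if l == 1]
    reals = [p for l, p in zip(labels, predictions) if l == 0]
    missed = sum(1 for p in fakes if p == 0)    # fakes let through
    blocked = sum(1 for p in reals if p == 1)   # real users wrongly blocked
    return {
        "fakes_let_through": (missed / len(fakes), wilson_interval(missed, len(fakes))),
        "real_users_blocked": (blocked / len(reals), wilson_interval(blocked, len(reals))),
    }

# Hypothetical counts: 200 fakes with 20 missed, 300 real users with 6 blocked.
labels = [1] * 200 + [0] * 300
predictions = [0] * 20 + [1] * 180 + [1] * 6 + [0] * 294

rates = error_rates(labels, predictions)
for name, (rate, (low, high)) in rates.items():
    print(f"{name}: {rate:.3f} (95% interval {low:.3f} to {high:.3f})")
```

Reporting the two rates separately matters because one threshold trades them against each other, and the interval shows how far each rate can be trusted on a small group.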

The open benchmark

Put your detector on the public benchmark.

We publish an open, independent benchmark of deepfake detectors, all measured against the same attacks. Submit yours to be ranked alongside the field. If it holds, that is third-party proof you can put in front of buyers. The methodology is fixed and public, so a place on the board is earned by the result, never bought.

Put an independent number on your detector.

Submit your detector and we return where it holds, where it breaks, and the recipe behind every failure, ready for your next fine-tune.