The third-party red team that helps you close the deal.
Your buyers ask for proof that goes beyond your own benchmark. We are the independent red team that supplies it: an evaluation grounded in a corpus your team did not assemble and a method your team did not design, so the number holds up in the room where the deal is won.
Your benchmark vs an independent one
~1.00
on your own benchmark
The score your team measured and put in the deck.
in front of a buyer
What that number is worth once the buyer knows your team picked the test, the data, and the threshold.
Buyers increasingly discount a vendor's own benchmark. A score from a party that did not build the model, on attacks it has never seen, is the one that moves a deal.
Read the benchmark
Six weeks. Five milestones. One report on the other side.
- W1
Kickoff and scope lock
Attack classes, demographic axes, and the bar for a finding are agreed and frozen.
- W2
Integration and calibration
Your detector is wired into the harness. Baseline performance captured before any perturbation.
- W3 to W4
Adversarial runs
Benchmark and perturbation pipeline executed. Every decision logged for replay.
- W5
Analysis and bypass recipes
Per-group margins of error computed. Failures annotated with the recipe that surfaced them.
- W6
Verdict and engineering debrief
Evaluation report delivered. Working session with engineering. Verdict signed.
Co-delivered engagements are matched to the host's scope and may run shorter or longer than six weeks.
The key deliverable
We hand back the recipe that broke your model, ready to fine-tune.
Proof that closes deals
Independent evidence buyers trust, because your team did not pick the test, the data, or the threshold.
Re-evaluate fast
Re-run against fresh attacks as they appear, and see exactly where the detector starts to slip.
The exact attack, not just a score
For every miss, the attack that produced it, formatted to drop straight into your next training set.
Before you ship
A second, independent read on the model before it reaches a customer.
Every miss comes with the recipe to fix it.
We rank what we find by severity and hand back the exact attack behind each one, so your team can reproduce it and fold it into the next fine-tune.
| Severity | Finding | The recipe |
|---|---|---|
| Critical | Platform compression | Re-compress uploads at H.264 quality 70, the way real platforms do. The score fell from 1.00 to 0.34, below a coin flip. |
| Critical | Weakest group | Break results out per group: one cell fell below a coin flip while the overall average still looked healthy. |
| High | Format tell | Re-encode real and fake samples identically, which removes the file-format signal the model had been leaning on instead of content. |
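
To make the "Weakest group" finding concrete, here is a minimal sketch of a per-group breakout. It assumes plain accuracy as a stand-in for whatever score the report uses, and the function name, group labels, and toy records are hypothetical, chosen only to show how a healthy average can hide one failing cell.

```python
from collections import defaultdict


def per_group_accuracy(records):
    """Return accuracy per group.

    records: iterable of (group, label, prediction) tuples,
    where label and prediction are 0 (real) or 1 (fake).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, label, prediction in records:
        totals[group] += 1
        hits[group] += int(label == prediction)
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical toy data: a large group that scores well
# and a small group that scores badly.
records = (
    [("group_a", 1, 1)] * 90 + [("group_a", 1, 0)] * 10
    + [("group_b", 1, 1)] * 4 + [("group_b", 1, 0)] * 6
)

by_group = per_group_accuracy(records)
overall = sum(int(label == pred) for _, label, pred in records) / len(records)
worst_group = min(by_group, key=by_group.get)

print(f"overall: {overall:.2f}")
print(f"worst group: {worst_group} at {by_group[worst_group]:.2f}")
```

Reporting the minimum across groups next to the average is the design choice that matters here: the small group barely moves the overall number, so only the breakout exposes it.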
Everything procurement asks for is in the box.
Per-group performance
Results broken out by demographic and platform, with the worst case shown.
Both kinds of mistake
Fakes let through and real users wrongly blocked, with the margin of error on each.
Bypass recipes
Every failure annotated with the recipe that surfaced it.
Methodology, documented
Public, versioned, and signed by the lead researcher.
Platform-realistic conditions
Scored under the re-encoding your deployment actually applies.
Audit-ready exhibits
Findings packaged to hand to a board or a regulator.
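
As an illustration of "Both kinds of mistake", the sketch below computes the rate of fakes let through and the rate of real users wrongly blocked, each with a margin of error. It assumes a 95% Wilson score interval, which is one common choice and not necessarily the method used in the report. The function names and label encoding (1 = fake, 0 = real) are hypothetical.

```python
import math


def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def error_rates(labels, predictions):
    """Return (rate, low, high) for each kind of mistake.

    Assumed encoding: label 1 = fake, 0 = real;
    prediction 1 = flagged as fake, 0 = accepted as real.
    """
    pairs = list(zip(labels, predictions))
    fakes = sum(1 for y, _ in pairs if y == 1)
    reals = len(pairs) - fakes
    missed = sum(1 for y, p in pairs if y == 1 and p == 0)
    blocked = sum(1 for y, p in pairs if y == 0 and p == 1)
    return {
        "fakes_let_through": (missed / fakes, *wilson_interval(missed, fakes)),
        "real_users_blocked": (blocked / reals, *wilson_interval(blocked, reals)),
    }


# Hypothetical toy data: 100 fakes (10 missed), 100 reals (5 wrongly blocked).
labels = [1] * 100 + [0] * 100
predictions = [1] * 90 + [0] * 10 + [0] * 95 + [1] * 5

for name, (rate, low, high) in error_rates(labels, predictions).items():
    print(f"{name}: {rate:.3f} (95% interval {low:.3f} to {high:.3f})")
```

Keeping the two rates separate matters because a single accuracy figure can trade one mistake for the other without the total changing.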
Put your detector on the public benchmark.
We publish an open, independent benchmark of deepfake detectors, all measured against the same attacks. Submit yours to be ranked alongside the field. If it holds, that is third-party proof you can put in front of buyers. The methodology is fixed and public, so a place on the board is earned by the result, never bought.
One measurement layer, every side of it.
Whichever side you are on, the same arms race runs underneath. See how we serve the rest of the market, or go straight to scoping your own evaluation.
Put an independent number on your detector.
Submit your detector and we return where it holds, where it breaks, and the recipe behind every failure, ready for your next fine-tune.