Benchmarks
Compare Mesmer techniques across objectives, targets, and budgets.
Benchmarks let you compare techniques under shared metrics.
The benchmark example compares probe, frontier-search, and autonomous-agent techniques across several objective-specific success criteria:
uv run python examples/benchmark.py

Useful benchmark dimensions include the following (a short aggregation sketch appears after the list):
- Success rate.
- Turns.
- Target queries.
- Attacker, evaluator, and target token usage.
- Cost.
- Stop reason.
- Reproduction artifact availability.
- Row-level attempt evidence.
- Success as query budget increases.
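Many of these dimensions are simple aggregations over row-level attempt records. The sketch below groups attempts by technique and computes success rate, mean turns, and total cost; the record fields (`technique`, `success`, `turns`, `cost_usd`) are illustrative assumptions, not Mesmer's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Attempt:
    # Hypothetical row-level record; field names are assumptions,
    # not Mesmer's actual schema.
    technique: str
    success: bool
    turns: int
    cost_usd: float


def summarize(attempts: list[Attempt]) -> dict[str, dict[str, float]]:
    """Aggregate success rate, mean turns, and total cost per technique."""
    grouped: dict[str, list[Attempt]] = defaultdict(list)
    for attempt in attempts:
        grouped[attempt.technique].append(attempt)
    return {
        technique: {
            "success_rate": sum(a.success for a in rows) / len(rows),
            "mean_turns": sum(a.turns for a in rows) / len(rows),
            "total_cost_usd": sum(a.cost_usd for a in rows),
        }
        for technique, rows in grouped.items()
    }


if __name__ == "__main__":
    attempts = [
        Attempt("probe", True, 3, 0.02),
        Attempt("probe", False, 5, 0.04),
        Attempt("frontier-search", True, 8, 0.11),
    ]
    print(summarize(attempts))
```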
Benchmark reports can include two higher-resolution views:
- EvidenceMatrix keeps one row per attempt/evaluator result, including run, objective, technique, target, target capabilities, candidate, response, score, pass/fail, query count, and cost.
- BudgetCurve shows how success changes as the allowed target-call budget increases.
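To make the BudgetCurve idea concrete, here is a minimal sketch of how such a curve can be derived from evidence-matrix-style rows: for each candidate budget, count the attempts that passed using no more than that many target queries. The row fields (`passed`, `queries`) and the derivation are assumptions for illustration, not Mesmer's report format.

```python
def budget_curve(rows: list[dict], budgets: list[int]) -> dict[int, float]:
    """Success rate at each allowed target-query budget.

    Each row is assumed to carry a boolean "passed" and the number of
    target queries the attempt used ("queries"); these names are
    illustrative, not Mesmer's actual column names.
    """
    return {
        budget: sum(r["passed"] and r["queries"] <= budget for r in rows) / len(rows)
        for budget in budgets
    }


rows = [
    {"passed": True, "queries": 2},
    {"passed": True, "queries": 7},
    {"passed": False, "queries": 10},
]
# budget -> success rate, approximately {1: 0.0, 5: 0.33, 10: 0.67}
print(budget_curve(rows, budgets=[1, 5, 10]))
```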
Remote datasets are first-class:
from mesmer import DatasetColumnMap, DatasetFormat, RemoteDatasetSource

# Pull the first three objectives from a remote CSV, mapping its
# "goal" and "target" columns onto the fields Mesmer expects.
objectives = RemoteDatasetSource(
    url="https://example.com/dataset.csv",
    format=DatasetFormat.CSV,
    column_map=DatasetColumnMap(goal="goal", target="target"),
    limit=3,
)
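Here the column_map decouples the remote file's column names from the goal and target fields the objectives carry, and limit caps how many rows are loaded, which keeps exploratory benchmark runs small.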