Benchmarks

Compare Mesmer techniques across objectives, targets, and budgets.

Benchmarks run multiple techniques against the same objectives and targets so results can be compared under shared metrics.

The benchmark example compares probe, frontier-search, and autonomous-agent techniques across several objective-specific success criteria:

uv run python examples/benchmark.py

Useful benchmark dimensions include:

  • Success rate.
  • Turns.
  • Target queries.
  • Attacker, evaluator, and target token usage.
  • Cost.
  • Stop reason.
  • Reproduction artifact availability.
  • Row-level attempt evidence.
  • Success as query budget increases.
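The dimensions above can be aggregated from per-attempt records. A minimal sketch, assuming a hypothetical `Attempt` record whose field names are illustrative and not part of the mesmer API:

```python
from dataclasses import dataclass

# Hypothetical per-attempt record; these field names mirror the benchmark
# dimensions listed above but are not mesmer's own types.
@dataclass
class Attempt:
    success: bool
    turns: int
    target_queries: int
    cost_usd: float
    stop_reason: str

def summarize(attempts: list[Attempt]) -> dict:
    """Aggregate benchmark dimensions across a list of attempts."""
    n = len(attempts)
    return {
        "success_rate": sum(a.success for a in attempts) / n,
        "mean_turns": sum(a.turns for a in attempts) / n,
        "total_target_queries": sum(a.target_queries for a in attempts),
        "total_cost_usd": sum(a.cost_usd for a in attempts),
        "stop_reasons": sorted({a.stop_reason for a in attempts}),
    }
```

A report built this way makes it easy to compare techniques side by side on the same columns.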

Benchmark reports can include two higher-resolution views:

  • EvidenceMatrix keeps one row per attempt and evaluator result, recording run, objective, technique, target, target capabilities, candidate, response, score, pass/fail, query count, and cost.
  • BudgetCurve shows how success changes as the allowed target-call budget increases.
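The BudgetCurve idea can be illustrated with a few lines: an attempt counts as a success at budget B only if it succeeded using at most B target queries. This is a sketch of the computation, not mesmer's implementation; the `(queries_used, succeeded)` pair shape is an assumption.

```python
def budget_curve(
    attempts: list[tuple[int, bool]], budgets: list[int]
) -> dict[int, float]:
    """Success rate at each target-call budget.

    `attempts` is a list of (queries_used, succeeded) pairs. At budget B,
    only attempts that succeeded within B queries count as wins.
    """
    curve = {}
    for b in budgets:
        wins = sum(1 for used, ok in attempts if ok and used <= b)
        curve[b] = wins / len(attempts)
    return curve
```

Plotting the resulting mapping shows whether a technique's extra queries actually buy additional successes.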

Remote datasets are first-class:

from mesmer import DatasetColumnMap, DatasetFormat, RemoteDatasetSource

objectives = RemoteDatasetSource(
    url="https://example.com/dataset.csv",
    format=DatasetFormat.CSV,
    column_map=DatasetColumnMap(goal="goal", target="target"),
    limit=3,
)