Benchmarks

Compare Mesmer techniques across objectives, targets, and budgets.

Benchmarks run multiple techniques against the same objectives and targets so results can be compared under shared metrics.

The benchmark example compares probe, frontier-search, and autonomous-agent techniques across several objective-specific success criteria:

uv run python examples/benchmark.py

Useful benchmark dimensions include:

  • Success rate.
  • Turns.
  • Target queries.
  • Attacker, evaluator, and target token usage.
  • Cost.
  • Stop reason.
  • Reproduction artifact availability.
  • Row-level attempt evidence.
  • Success as query budget increases.
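The dimensions above can be aggregated from per-attempt records. A minimal sketch, assuming a hypothetical `Attempt` record whose field names are illustrative and not part of the mesmer API:

```python
from dataclasses import dataclass

# Hypothetical per-attempt record; these field names mirror the benchmark
# dimensions listed above but are not mesmer's own types.
@dataclass
class Attempt:
    success: bool
    turns: int
    target_queries: int
    cost_usd: float
    stop_reason: str

def summarize(attempts: list[Attempt]) -> dict:
    """Aggregate benchmark dimensions across a list of attempts."""
    n = len(attempts)
    return {
        "success_rate": sum(a.success for a in attempts) / n,
        "mean_turns": sum(a.turns for a in attempts) / n,
        "total_target_queries": sum(a.target_queries for a in attempts),
        "total_cost_usd": sum(a.cost_usd for a in attempts),
        "stop_reasons": sorted({a.stop_reason for a in attempts}),
    }
```

A report built this way makes it easy to compare techniques side by side on the same columns.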

Benchmark reports can include two higher-resolution views:

  • EvidenceMatrix keeps one row per attempt and evaluator result, recording run, objective, technique, target, target capabilities, candidate, response, score, pass/fail, query count, and cost.
  • BudgetCurve shows how success changes as the allowed target-call budget increases.
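The BudgetCurve idea can be illustrated with a few lines: an attempt counts as a success at budget B only if it succeeded using at most B target queries. This is a sketch of the computation, not mesmer's implementation; the `(queries_used, succeeded)` pair shape is an assumption.

```python
def budget_curve(
    attempts: list[tuple[int, bool]], budgets: list[int]
) -> dict[int, float]:
    """Success rate at each target-call budget.

    `attempts` is a list of (queries_used, succeeded) pairs. At budget B,
    only attempts that succeeded within B queries count as wins.
    """
    curve = {}
    for b in budgets:
        wins = sum(1 for used, ok in attempts if ok and used <= b)
        curve[b] = wins / len(attempts)
    return curve
```

Plotting the resulting mapping shows whether a technique's extra queries actually buy additional successes.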

Remote datasets are first-class:

from mesmer import DatasetColumnMap, DatasetFormat, RemoteDatasetSource

objectives = RemoteDatasetSource(
    url="https://example.com/dataset.csv",
    format=DatasetFormat.CSV,
    column_map=DatasetColumnMap(goal="goal", target="target"),
    limit=3,
)