Overall Score — averaged across all tasks
Overall = (Agentic + Task) / 2
Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (Segmentation): macro Dice × completion rate.
Avg Turns
Conversational turns / run
Avg Time
Wall-clock per run
Avg Tokens
Total LLM tokens / run
Avg Cost
USD / run
Failure Rate
% of runs that did not reach a final submission. A failure is a run that could not finish the end-to-end workflow or provide a complete submission.
Agent Workflow — S1 → S5
Every agent run flows through the same five stages. Each stage captures a distinct capability of the research loop. Scoring for each stage lives under .
Difficulty Tiers
Three tiers control how much help the agent gets. Same task, different level of scaffolding.
Dataset modality · samples · source
Each task is a standalone public medical imaging dataset. Agents run on the same held-out samples across all tiers.
Sandbox Rules
- Policy: Inference-only. Agents run pre-trained weights on real patient data; no training or fine-tuning.
- Allowed: /data/public/ (read-only) · /workspace/ (read-write)
- Forbidden: /data/private/ · benchmark source · other agents' runs
- On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
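The path rules above can be illustrated with a small check. This is a minimal sketch, not the benchmark's actual enforcement code: the function name `is_allowed` and the two root tuples are hypothetical, and treating every path outside the two listed roots as blocked is a simplifying assumption that covers only the data and workspace directories.

```python
import posixpath

# Roots taken from the Sandbox Rules list; the constant names are hypothetical.
READ_ONLY_ROOTS = ("/data/public",)
READ_WRITE_ROOTS = ("/workspace",)


def _under(path, root):
    """True if path is the root itself or lies inside it."""
    return path == root or path.startswith(root + "/")


def is_allowed(path, write=False):
    """Return True if the access would be permitted under the listed rules.

    Assumption: anything outside the two listed roots is blocked,
    which includes /data/private/.
    """
    # Normalise first so that '..' segments cannot escape an allowed root.
    norm = posixpath.normpath(path)
    if any(_under(norm, root) for root in READ_WRITE_ROOTS):
        return True
    if not write and any(_under(norm, root) for root in READ_ONLY_ROOTS):
        return True
    return False
```

Normalising before the prefix check matters: a path such as `/workspace/../data/private/x` starts with an allowed root as a string but resolves to the forbidden directory.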
Overall Score
Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how good the final output was).
- Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
- Task (50%): for segmentation, mean(per-target Dice ∈ [0, 1]). Targets are per-dataset (e.g. organ + lesion for KiTS19, 7 tissue classes for FeTA) and weighted equally.
Stage weights within the Agentic score:
- S1 PLAN: 25%
- S2 SETUP: 15%
- S3 VALIDATE: 35%
- S4 INFERENCE: 15%
- S5 SUBMIT: 10%
All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
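As a worked illustration of the formula, the sketch below combines stage scores and per-target Dice into an Overall Score. It is a sketch under stated assumptions rather than the benchmark's scoring code: all function names are hypothetical, Dice is computed over flat label sequences for brevity, an empty target in both prediction and ground truth is assumed to score 1.0, and the completion-rate factor mentioned in the summary card is exposed as an optional multiplier that defaults to 1.0.

```python
from statistics import mean

# Stage weights as listed in this section (S1..S5); they sum to 1.0.
STAGE_WEIGHTS = {"S1": 0.25, "S2": 0.15, "S3": 0.35, "S4": 0.15, "S5": 0.10}


def dice(pred, truth, label):
    """Dice coefficient for one target label over two equal-length label sequences."""
    p = [x == label for x in pred]
    t = [x == label for x in truth]
    intersection = sum(a and b for a, b in zip(p, t))
    denom = sum(p) + sum(t)
    # Assumption: a target absent from both prediction and truth counts as 1.0.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def agentic_score(stage_scores):
    """Weighted sum of S1..S5 stage scores, each in [0, 1]."""
    return sum(STAGE_WEIGHTS[s] * stage_scores[s] for s in STAGE_WEIGHTS)


def task_score(per_target_dice, completion_rate=1.0):
    """Equal-weight mean of per-target Dice, optionally scaled by completion rate."""
    return mean(per_target_dice) * completion_rate


def overall_score(stage_scores, per_target_dice, completion_rate=1.0):
    """Average of the Agentic and Task scores, shown on a 0-100 scale."""
    a = agentic_score(stage_scores)
    t = task_score(per_target_dice, completion_rate)
    return 100.0 * (a + t) / 2.0
```

For example, stage scores of 0.8, 1.0, 0.5, 1.0, 1.0 give an Agentic score of 0.775; with per-target Dice of 0.9 and 0.7 the Task score is 0.8, so the Overall Score is 100 × (0.775 + 0.8) / 2 = 78.75.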
Stage-by-stage breakdown S1 → S5 · weights & sub-criteria
The Agentic score is a weighted sum across the five stages.
Per-Task Leaderboard
Pick a task and difficulty tier.
About
AutoMedBench is a joint effort between University of California, Santa Cruz and NVIDIA, benchmarking autonomous agentic models across the medical AI research evaluation pipeline — plan, setup, validate, infer, submit.