A joint project from UC Santa Cruz and NVIDIA

Towards Medical AutoResearch
A benchmark for AI agents on medical research tasks.

AutoMedBench scores autonomous agentic models across the research evaluation pipeline (plan, setup, validate, infer, submit), not just final outputs.

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic — how well the agent ran the research pipeline: a weighted sum of the S1–S5 stage scores (plan, setup, validate, infer, submit), ∈ [0, 100].
Task (Segmentation) — macro Dice × completion rate.
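For instance (illustrative numbers only): an Agentic score of 72 and a Task score of 64 would give Overall = (72 + 64) / 2 = 68.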

Leaderboard metric chips (click a chip to toggle):

• Avg Turns · conversational turns per run
• Avg Time · wall-clock time per run
• Avg Tokens · total LLM tokens per run
• Avg Cost · USD per run
• Failure Rate · % of runs that did not reach a final submission (a failure is a run that does not finish the end-to-end workflow or produce a complete submission)
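For concreteness, here is a minimal Python sketch of how these per-run metrics might be aggregated into leaderboard entries; the Run fields and the aggregate helper are illustrative names, not the benchmark's actual API.

```python
from dataclasses import dataclass

@dataclass
class Run:
    # Hypothetical per-run record; field names are illustrative.
    turns: int        # conversational turns
    seconds: float    # wall-clock time
    tokens: int       # total LLM tokens
    cost_usd: float   # USD spent
    submitted: bool   # True if a complete final submission was produced

def aggregate(runs: list[Run]) -> dict[str, float]:
    """Average each metric over runs; failure rate is the share of
    runs without a complete final submission."""
    n = len(runs)
    assert n > 0, "need at least one run"
    return {
        "avg_turns": sum(r.turns for r in runs) / n,
        "avg_time_s": sum(r.seconds for r in runs) / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
        "avg_cost_usd": sum(r.cost_usd for r in runs) / n,
        "failure_rate": sum(not r.submitted for r in runs) / n,
    }
```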

Agent Workflow — S1 → S5

Every agent run flows through the same five stages. Each stage captures a distinct capability of the research loop. Scoring for each stage is detailed in the stage-by-stage breakdown below.

Click any stage to see its scoring rubric.
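As a rough sketch of that loop structure (the run_stage interface is hypothetical, not the benchmark harness's real API):

```python
# Sketch of the S1 -> S5 loop; `agent.run_stage` is a hypothetical
# interface, not the benchmark harness's real API.
STAGES = ["plan", "setup", "validate", "inference", "submit"]  # S1..S5

def run_pipeline(agent, task) -> dict[str, float]:
    """Run the agent through the five stages in order, collecting
    one score in [0, 1] per stage."""
    return {stage: agent.run_stage(task, stage) for stage in STAGES}
```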

Difficulty Tiers

Three tiers control how much help the agent gets. Same task, different level of scaffolding.
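Purely as an illustration of what tiered scaffolding could look like as configuration (the tier names and scaffolding fields below are invented for this sketch, not AutoMedBench's actual tier definitions):

```python
# Invented illustration of tiered scaffolding: tier names and fields
# are hypothetical, not AutoMedBench's actual tier definitions.
TIERS = {
    "tier1": {"step_by_step_guide": True,  "env_preconfigured": True},
    "tier2": {"step_by_step_guide": False, "env_preconfigured": True},
    "tier3": {"step_by_step_guide": False, "env_preconfigured": False},
}

def scaffolding_for(tier: str) -> dict[str, bool]:
    """Look up how much help the agent gets at a given tier;
    the task itself is identical across tiers."""
    return TIERS[tier]
```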

Datasets — modality · samples · source

Each task is a standalone public medical imaging dataset. Agents run on the same held-out samples across all tiers.

Sandbox Rules

Policy: Inference-only. Agents run pre-trained weights on real patient data; no training or fine-tuning.
Allowed: /data/public/ (read-only) · /workspace/ (read-write)
Forbidden: /data/private/ · benchmark source · other agents' runs
On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
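A minimal Python sketch of how this policy could be enforced; the path prefixes mirror the rules above, but the checker itself, and the mapping of each access pattern to a warning versus a disqualification, are assumptions of this sketch:

```python
from pathlib import Path

# Sketch of the sandbox policy above. Path prefixes mirror the rules;
# the checker and the warning-vs-disqualification mapping for each
# access pattern are assumptions of this sketch.
READ_ONLY  = ("/data/public/",)
READ_WRITE = ("/workspace/",)
FORBIDDEN  = ("/data/private/",)  # plus benchmark source, other agents' runs

def check_access(path: str, write: bool) -> str:
    """Classify a file access as 'ok', 'warning' (execution blocked,
    agent continues), or 'disqualified' (run zeroed, no credit)."""
    p = str(Path(path).resolve())
    if any(p.startswith(f) for f in FORBIDDEN):
        return "disqualified"
    if any(p.startswith(r) for r in READ_WRITE):
        return "ok"
    if any(p.startswith(r) for r in READ_ONLY):
        return "ok" if not write else "warning"  # writes to read-only data
    return "warning"                             # outside the sandbox
```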

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how good the final output was).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]
  • Agentic · 50% — weighted sum of S1–S5 stage scores, each ∈ [0, 1]; stays in [0, 1] internally, shown ×100.
  • Task · 50% — for segmentation, mean(per-target Dice ∈ [0, 1]). Targets are per-dataset (e.g. organ + lesion for KiTS19, 7 tissue classes for FeTA) and weighted equally.
Agentic = weighted sum over S1 → S5, ∈ [0, 1]
  • S1 PLAN · 25%
  • S2 SETUP · 15%
  • S3 VALIDATE · 35%
  • S4 INFERENCE · 15%
  • S5 SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
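Putting the formulas together, a minimal Python sketch of the scoring arithmetic described above (function names are ours; per-target Dice values would come from comparing predicted and ground-truth masks):

```python
STAGE_WEIGHTS = {  # S1..S5 weights from the rubric above
    "plan": 0.25, "setup": 0.15, "validate": 0.35,
    "inference": 0.15, "submit": 0.10,
}

def agentic_score(stage_scores: dict[str, float]) -> float:
    """Weighted sum of S1..S5 stage scores, each in [0, 1]."""
    return sum(w * stage_scores[s] for s, w in STAGE_WEIGHTS.items())

def task_score(per_target_dice: list[float]) -> float:
    """Segmentation task score: equally weighted mean of per-target
    Dice scores (e.g. organ + lesion for KiTS19)."""
    return sum(per_target_dice) / len(per_target_dice)

def overall_score(stage_scores, per_target_dice) -> float:
    """Overall = (0.5 * Agentic + 0.5 * Task) * 100, in [0, 100]."""
    return 100 * (0.5 * agentic_score(stage_scores)
                  + 0.5 * task_score(per_target_dice))

# Illustrative numbers only:
stages = {"plan": 0.8, "setup": 0.9, "validate": 0.6,
          "inference": 0.7, "submit": 1.0}
print(overall_score(stages, [0.85, 0.55]))  # -> 72.5
```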

Stage-by-stage breakdown — S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages. Click any stage for its sub-criteria and scoring formula.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Per-Task Leaderboard

Pick a task and difficulty tier.

Team

About

AutoMedBench is a joint effort between University of California, Santa Cruz and NVIDIA, benchmarking autonomous agentic models across the medical AI research evaluation pipeline — plan, setup, validate, infer, submit.

University of California, Santa Cruz · NVIDIA