AutoMedBench · Towards Medical AutoResearch · a benchmark for AI agents on medical research tasks

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (Segmentation) = macro Dice × completion rate.

Avg Turns

Conversational turns / run

Avg Time

Wall-clock per run

Avg Tokens

Total LLM tokens / run

Avg Cost

USD / run

Failure Rate · Failure = not able to finish the end-to-end workflow or provide a complete submission.

% of runs that did not reach a final submission.
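The per-run statistics above can be aggregated along these lines. This is a minimal sketch: the `Run` record and its field names are hypothetical, and it assumes failed runs still count towards the averages, which this page does not state.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One agent run; field names are illustrative, not the benchmark's schema."""
    turns: int
    seconds: float
    tokens: int
    cost_usd: float
    submitted: bool  # False when the run never reached a final submission


def summarize(runs):
    """Average the per-run metrics and report the share of failed runs."""
    return {
        "avg_turns": mean(r.turns for r in runs),
        "avg_time_s": mean(r.seconds for r in runs),
        "avg_tokens": mean(r.tokens for r in runs),
        "avg_cost_usd": mean(r.cost_usd for r in runs),
        # Failure rate: percentage of runs without a final submission.
        "failure_rate_pct": 100.0 * sum(not r.submitted for r in runs) / len(runs),
    }


example = [
    Run(turns=10, seconds=100.0, tokens=1000, cost_usd=0.5, submitted=True),
    Run(turns=20, seconds=300.0, tokens=3000, cost_usd=1.5, submitted=False),
]
print(summarize(example))
```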

Agent Workflow — S1 → S5

Every agent run flows through the same five stages. Each stage captures a distinct capability of the research loop. Scoring for each stage is described in the stage-by-stage breakdown.

Difficulty Tiers

Three tiers control how much help the agent gets. Same task, different level of scaffolding.

Dataset · modality · samples · source

Each task is a standalone public medical imaging dataset. Agents run on the same held-out samples across all tiers.

Sandbox Rules

Policy: Inference-only. Agents run pre-trained weights on real patient data; no training or fine-tuning.
Allowed: /data/public/ read-only · /workspace/ read-write
Forbidden: /data/private/ · benchmark source · other agents' runs
On violation: Warning: execution blocked; agent continues · Disqualified: run zeroed; no credit
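One way to picture the path rules above is a default-deny check. This is only a sketch: `check_access` and its constants are hypothetical, symlinks are not resolved, and the benchmark's actual enforcement mechanism is not shown on this page.

```python
import posixpath

# Path rules copied from the sandbox section; the helper itself is hypothetical.
READ_ONLY = ("/data/public",)
READ_WRITE = ("/workspace",)
FORBIDDEN = ("/data/private",)


def _under(path, root):
    """True when path is root itself or lies inside it."""
    return path == root or path.startswith(root + "/")


def check_access(path, mode):
    """Return 'allow' or 'block' for a read ('r') or write ('w') request."""
    path = posixpath.normpath(path)  # collapse '..' so traversal cannot dodge the rules
    if any(_under(path, root) for root in FORBIDDEN):
        return "block"
    if any(_under(path, root) for root in READ_WRITE):
        return "allow"
    if mode == "r" and any(_under(path, root) for root in READ_ONLY):
        return "allow"
    return "block"  # anything not explicitly allowed is denied


print(check_access("/workspace/preds/case_001.nii.gz", "w"))
print(check_access("/data/public/../private/labels.csv", "r"))
```

Denying by default also covers the forbidden items that have no fixed path here, such as benchmark source and other agents' runs.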

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how good the final output was).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (50%): Task (Segmentation) = mean(per-target Dice ∈ [0, 1]). Targets are per-dataset (e.g. organ + lesion for KiTS19, 7 tissue classes for FeTA) and weighted equally.
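The segmentation Task score can be illustrated with a small pure-Python sketch of per-target Dice and its unweighted mean. The label values and the handling of a target absent from both masks are assumptions; the benchmark's own evaluator may differ.

```python
def dice(pred, ref, label):
    """Dice overlap for one target label over two equally sized label sequences."""
    pred_hits = [p == label for p in pred]
    ref_hits = [r == label for r in ref]
    intersection = sum(p and r for p, r in zip(pred_hits, ref_hits))
    total = sum(pred_hits) + sum(ref_hits)
    # Assumption: a target absent from both masks counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def macro_dice(pred, ref, labels):
    """Unweighted mean of per-target Dice, so every target counts equally."""
    return sum(dice(pred, ref, lab) for lab in labels) / len(labels)


# Flattened toy masks: 0 = background, 1 = organ, 2 = lesion (hypothetical labels).
ref = [0, 1, 1, 1, 2, 2, 0, 0]
pred = [0, 1, 1, 0, 2, 1, 0, 0]
print(macro_dice(pred, ref, labels=[1, 2]))
```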

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
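The arithmetic above fits in a few lines. The sketch below uses the stage weights listed in this section; the function names and the example stage scores are made up for illustration.

```python
# Stage weights as listed in this section (they sum to 1.0).
STAGE_WEIGHTS = {"S1": 0.25, "S2": 0.15, "S3": 0.35, "S4": 0.15, "S5": 0.10}


def agentic_score(stage_scores):
    """Weighted sum of S1-S5 stage scores, each expected in [0, 1]."""
    for stage, score in stage_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{stage} score {score} is outside [0, 1]")
    return sum(STAGE_WEIGHTS[s] * stage_scores[s] for s in STAGE_WEIGHTS)


def overall_score(agentic, task):
    """Average the two halves in [0, 1], then scale to [0, 100]."""
    return (0.5 * agentic + 0.5 * task) * 100.0


# Hypothetical stage scores for one run.
stages = {"S1": 0.8, "S2": 1.0, "S3": 0.6, "S4": 0.9, "S5": 1.0}
agentic = agentic_score(stages)
print(agentic, overall_score(agentic, task=0.7))
```

Because S3 carries 35% of the weight, a weak validation stage costs more than a weak setup or submit stage.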

Stage-by-stage breakdown · S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Per-Task Leaderboard

Pick a task and difficulty tier.

Task

Difficulty · Three tiers: more scaffolding = lower difficulty.

Lite · follows a recipe
Standard · chooses within bounds
Pro · discovers and competes

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (VQA) = exact-match accuracy × completion rate.

Per-Task Leaderboard

Pick a task and difficulty tier.

Task

Difficulty · Two tiers: more scaffolding = lower difficulty.

Lite · exact VLM given
Standard · pick from VLM families

Agent Workflow — S1 → S5

Same single-LLM coding-agent loop as every other track. One long conversation, execute_code only, no auxiliary tools.

Same stages as Segmentation. See Overall Score for how the task half is scored.

Difficulty Tiers

Lite is model-locked for comparability; Standard opens the choice within a fixed five-VLM candidate pool.

Dataset

Task	Modality	Task Time Range	Source	Description
Pathology VQA	Histopathology	`—`	PathVQA	Open-ended pathology QA.
Radiology VQA	CT · MRI · X-ray	`—`	VQA-RAD	Mixed-modality radiology QA.
Multi-frame Medical VQA	Multi-frame video	`—`	MedFrameQA	Multi-frame MCQ (A–E).
Semantic Radiology VQA	Radiology · EN	`—`	SLAKE-EN	Semantic radiology QA.
Expert Multimodal VQA	Multi-modal	`—`	MedXpertQA · MM	Expert multi-image MCQ.
Endoscopy VQA	GI endoscopy	`—`	Kvasir-VQA	Open-ended GI-endoscopy QA — yes/no, MCQ, short-answer.
Comprehensive VQA	Multi-modal	`—`	OmniMedVQA	Multi-modality medical MCQ — 12 modalities, 73 source datasets.
Literature VQA	Mixed biomedical	`—`	PMC-VQA	MCQ on PubMed Central biomedical figures.
MMMU-Medical-VQA	Multi-modal	`—`	MMMU	Health & Medicine slice from MMMU (5 subjects, validation split).

Datasets remain the property of their respective authors. AutoMedBench reports leaderboard scores only and does not redistribute dataset content; benchmark runs download data directly from the linked sources.

Sandbox Rules

Policy: Inference-only. No fine-tuning on benchmark data.
Allowed: /data/public/ read-only · /workspace/ · API keys env-injected
Forbidden: /data/private/ · ground-truth files

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (answer quality).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (50%): Task (VQA) = mean(per-question score ∈ [0, 1]). Exact-match for MCQ, max(EM, token-F1) or LLM-judge for open-ended. Every question weighted equally.
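A rough sketch of the per-question scoring described above: exact match for MCQ and max(EM, token-F1) for open-ended answers. The normalization step is an assumption, the LLM-judge path is left out, and all function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation, and split into tokens (an assumed normalization)."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(pred, ref):
    return 1.0 if normalize(pred) == normalize(ref) else 0.0


def token_f1(pred, ref):
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    p, r = normalize(pred), normalize(ref)
    if not p or not r:
        return 1.0 if p == r else 0.0
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def question_score(pred, ref, kind):
    """kind is 'mcq' or 'open'; the LLM-judge path is not modelled here."""
    if kind == "mcq":
        return exact_match(pred, ref)
    return max(exact_match(pred, ref), token_f1(pred, ref))


def task_score(items):
    """Mean per-question score, every question weighted equally."""
    return sum(question_score(*item) for item in items) / len(items)


items = [("B", "b", "mcq"), ("left lung", "the left lung", "open"), ("yes", "no", "open")]
print(task_score(items))
```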

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Stage-by-stage breakdown · S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1, S3 · Continuous: S4 · Discrete: S2, S5

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (Report Gen) = mean(BLEU, METEOR, ROUGE-L, F1RadGraph) × completion rate.

Per-Task Leaderboard

Pick a task and difficulty tier.

Task

Difficulty · Two tiers: more scaffolding = lower difficulty.

Lite · exact MRG given
Standard · pick from MRG families

Agent Workflow — S1 → S5

Same coding-agent loop as every other track. The report-generation harness also exposes an S1–S3 judge (heuristic or online) that grades the plan/setup/validate stages, mirroring eval_seg/.

Benchmark unit is one study, not one JPEG: each case may contain multiple views under public/<case_id>/images/. See the stage-by-stage breakdown for how each stage is scored.

Difficulty Tiers

Defined in eval_report_gen/tier_config.py. Tiers differ by model-selection freedom and plan-artefact requirements at S1.

Dataset · study-level chest X-ray reporting

Task	Modality	Task Time Range	Source	Description
Chest X-ray Findings Report	Chest X-ray	`—`	MIMIC-CXR	Findings-only reports.
Chest X-ray Findings / Impression	Chest X-ray	`—`	IU / Open-i	Frontal chest X-ray findings or impression.
Chest X-ray Full Report	Chest X-ray	`—`	CheXpert Plus	Full structured reports (findings + impression).
Pathology Captioning · 100	Histopathology	`—`	PathCap	Histopathology image captioning · 100-case split.
Pathology Captioning · 500	Histopathology	`—`	PathCap	Histopathology image captioning · 500-case split.

Sandbox Rules

Policy: Inference-only. Agent produces one report.txt per study; partial submissions are rejected by the format checker.
Allowed: public/<case_id>/images/*.jpg read-only · public/<case_id>/manifest.json · /workspace/
Forbidden: private/<case_id>/report.txt · private/<case_id>/labels.json · private/<case_id>/manifest.json · other agents' runs
Credentials: eval_report_gen/api_keys/key.txt exactly four lines, keys only
Preflight: CheXbert + BERT + RadGraph checkpoints must exist on disk before scoring starts.
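The one-report-per-study rule could be checked roughly as follows. The output layout (`<output_dir>/<case_id>/report.txt`) is a guess for illustration and `check_submission` is a hypothetical helper, not the benchmark's format checker.

```python
import tempfile
from pathlib import Path


def check_submission(public_dir, output_dir):
    """Return the case IDs that lack a non-empty report.txt.

    Assumes outputs mirror the public layout as <output_dir>/<case_id>/report.txt;
    that output layout is a guess, not something this page specifies.
    """
    missing = []
    for case_dir in sorted(Path(public_dir).iterdir()):
        if not case_dir.is_dir():
            continue
        report = Path(output_dir) / case_dir.name / "report.txt"
        if not report.is_file() or not report.read_text(encoding="utf-8").strip():
            missing.append(case_dir.name)
    return missing


# Toy run: two studies, only one of which received a report.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for case_id in ("case_a", "case_b"):
        (root / "public" / case_id / "images").mkdir(parents=True)
    (root / "out" / "case_a").mkdir(parents=True)
    (root / "out" / "case_a" / "report.txt").write_text("No acute findings.", encoding="utf-8")
    missing = check_submission(root / "public", root / "out")

print(missing)  # a non-empty list means the submission is partial
```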

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (how good the generated report is against the reference).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (50%): Task (Report Generation) = mean(per-report score ∈ [0, 1]), where per-report = mean(BLEU, METEOR, ROUGE_L, F1RadGraph, μP, μR, μF1). All seven components weighted equally.
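As a sketch of the aggregation only, following the seven-component definition given here: the metric values are assumed to be precomputed in [0, 1], and the ASCII keys `muP`, `muR`, `muF1` stand in for μP, μR, μF1. None of these names come from the benchmark code.

```python
# ASCII stand-ins for the seven components named above.
COMPONENTS = ("BLEU", "METEOR", "ROUGE_L", "F1RadGraph", "muP", "muR", "muF1")


def report_score(metrics):
    """Equal-weight mean of the seven components for one report."""
    missing = [name for name in COMPONENTS if name not in metrics]
    if missing:
        raise KeyError(f"missing components: {missing}")
    return sum(metrics[name] for name in COMPONENTS) / len(COMPONENTS)


def task_score(per_report_metrics):
    """Mean of per-report scores across all studies."""
    return sum(report_score(m) for m in per_report_metrics) / len(per_report_metrics)


# Made-up metric values, for illustration only.
example = {"BLEU": 0.1, "METEOR": 0.2, "ROUGE_L": 0.3, "F1RadGraph": 0.4,
           "muP": 0.5, "muR": 0.6, "muF1": 0.7}
print(report_score(example))
```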

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Stage-by-stage breakdown · S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Overall Score — averaged across all tasks

Per-Task Leaderboard

Pick a task and difficulty tier.

Task

Difficulty · Two tiers: more scaffolding = lower difficulty.

Lite · exact detector given
Standard · pick from detector families

Agent Workflow — S1 → S5

Same coding-agent loop as every other track. S4 writes a bbox prediction list per image; S5 consolidates the submission and calls submit_results. Scoring is mAP@0.5 on the held-out private labels.

Benchmark unit is one chest X-ray. See the stage-by-stage breakdown for how each stage is scored.

Difficulty Tiers

Dataset

Task	Modality	Task Time Range	Source	Description
Chest X-ray Abnormality Detection	Chest X-ray	`—`	VinDr-CXR	14-class bbox detection. Task score = mAP@0.5.
Blood Cell Detection	Blood-smear microscopy	`—`	BCCD	3-class detection (rbc · wbc · platelets). Task score = mAP@0.5.
Dental Disease Detection	Panoramic dental X-ray	`—`	DENTEX	4-class disease detection (caries · deep caries · periapical lesion · impacted). Task score = mAP@0.5.
Wrist Anomaly Detection	Pediatric wrist X-ray	`—`	GRAZPEDWRI-DX	9-class anomaly detection (fracture · bone lesion · foreign body · …). Task score = mAP@0.5.

Sandbox Rules

Policy: Inference-only. No training on benchmark data.
Allowed: data/VinDrCXR_Detection1000/public/ read-only · /workspace/
Forbidden: data/VinDrCXR_Detection1000/private/ · ground-truth files
Limits: Time 3 600 s / run · IoU 0.5
On violation: Warning: execution blocked; agent continues · Disqualified: run zeroed; no credit

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how good the final bbox output was).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (50%): Task (Detection) = mAP@0.5 on the held-out private labels. ∈ [0, 1].
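For readers unfamiliar with mAP@0.5, here is a compact sketch: a prediction counts as a true positive when its IoU with a not-yet-matched ground-truth box of the same class is at least 0.5, and AP is the area under the resulting precision-recall curve. The interpolation variant and the input format are assumptions; the benchmark's evaluator may use a different convention.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, thr=0.5):
    """AP for one class. preds: (image_id, score, box); gts: {image_id: [box, ...]}."""
    n_gt = sum(len(boxes) for boxes in gts.values())
    if n_gt == 0:
        return 0.0
    used = {img: [False] * len(boxes) for img, boxes in gts.items()}
    hits = []
    # Visit predictions from most to least confident; each GT box matches once.
    for img, _score, box in sorted(preds, key=lambda p: -p[1]):
        best, best_iou = -1, thr
        for j, gt_box in enumerate(gts.get(img, [])):
            overlap = iou(box, gt_box)
            if overlap >= best_iou and not used[img][j]:
                best, best_iou = j, overlap
        if best >= 0:
            used[img][best] = True
        hits.append(best >= 0)
    # Precision at each rank, then its monotone envelope (all-point interpolation).
    precisions, tp = [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall, tp = 0.0, 0.0, 0
    for hit, precision in zip(hits, precisions):
        tp += hit
        recall = tp / n_gt
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_ap(per_class):
    """per_class: {class_name: (preds, gts)}; unweighted mean over classes."""
    return sum(average_precision(p, g) for p, g in per_class.values()) / len(per_class)


gts = {"img1": [(0, 0, 10, 10), (20, 20, 30, 30)]}
preds = [("img1", 0.9, (0, 0, 10, 10)), ("img1", 0.8, (50, 50, 60, 60)),
         ("img1", 0.7, (20, 20, 30, 30))]
print(average_precision(preds, gts))
```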

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Stage-by-stage breakdown · S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), judged from conversation.json. ∈ [0, 100].
Task (Enhancement) = mean SSIM × completion rate.

Failure Rate · Failure = not able to finish the end-to-end workflow or provide a complete submission.

% of repeats that could not be scored.

Per-Task Leaderboard

Pick a task and difficulty tier.

Task

Difficulty · Two tiers: more scaffolding = lower difficulty.

Lite · follows a recipe
Standard · chooses within bounds

Agent Workflow — S1 → S5

Same coding-agent loop. Two-container architecture (v5 gold standard): the agent container does S1–S5 with GPU + network, then a separate eval container with GPU but --network none scores the outputs.

Agent writes to /agent_outputs/; eval container reads them plus /data/private/ reference images and writes /results/detail_report.json. See Overall Score for how each metric is scored.
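The two-container split might be launched along these lines. This sketch only assembles `docker run` argument lists: the image names and host directories are hypothetical, and the flags are the ones this track's Sandbox Rules name, not a copy of the benchmark's launcher.

```python
import shlex


def agent_cmd(image, outputs_dir):
    """Agent container: GPU, bridge network, read-only rootfs, writable outputs mount."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "--network", "bridge",
        "--read-only",
        "--security-opt", "no-new-privileges",
        "--pids-limit", "4096",
        "-v", f"{outputs_dir}:/agent_outputs",
        image,
    ]


def eval_cmd(image, outputs_dir, private_dir, results_dir):
    """Eval container: GPU, no network, agent outputs and references mounted read-only."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "--network", "none",
        "-v", f"{outputs_dir}:/agent_outputs:ro",
        "-v", f"{private_dir}:/data/private:ro",
        "-v", f"{results_dir}:/results",
        image,
    ]


# Hypothetical image names and host paths.
print(shlex.join(agent_cmd("agent-image:latest", "/tmp/run1/agent_outputs")))
print(shlex.join(eval_cmd("eval-image:latest", "/tmp/run1/agent_outputs",
                          "/tmp/private", "/tmp/run1/results")))
```

Keeping /data/private out of the agent command entirely means the reference images are never mounted where agent code can read them.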

Difficulty Tiers

Lite locks the method (DRUNet for LDCT · Swin2SR for MRI-SR) so results are comparable across agents. Standard opens the choice inside the documented inference-only candidate pool.

Dataset · modality · samples · source

Each task is a standalone public medical imaging dataset. Agents run on the same held-out samples across all tiers.

Sandbox Rules · 2-container isolation

Policy: Inference-only. .backward(), optimizer.step(), model.train(), and torch.optim.* are all blocked by the sandbox.
Agent container: GPU + bridge network · --read-only rootfs · no-new-privileges · --pids-limit 4096 · 3-layer sandbox (static regex + Python sys.addaudithook + bash wrappers)
Eval container: GPU + --network none · no agent code · writes /results/detail_report.json
Forbidden (agent side): /data/private/ · /eval/ · /results/ · /bands/ · any reference.npy / ground_truth.csv / baseline_bands.json
On violation: Warning: execution blocked; agent continues · Disqualified: run zeroed; no credit
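The static-regex layer of such a sandbox could look something like the sketch below. The patterns are illustrative guesses for the four blocked calls named in the policy; the benchmark's real rules, its audit hook, and its bash wrappers are not shown on this page.

```python
import re

# Illustrative patterns for the training calls the policy lists as blocked.
# A static regex is approximate: it can miss aliased calls and flag harmless ones.
BLOCKED = {
    ".backward()": re.compile(r"\.backward\s*\("),
    "optimizer.step()": re.compile(r"optimizer\s*\.\s*step\s*\("),
    "model.train()": re.compile(r"\.train\s*\("),
    "torch.optim.*": re.compile(r"torch\.optim\b"),
}


def scan_source(code):
    """Return the sorted names of blocked patterns found in a code string."""
    return sorted(name for name, pattern in BLOCKED.items() if pattern.search(code))


print(scan_source("loss.backward()\noptimizer.step()"))
print(scan_source("with torch.no_grad():\n    out = model(x)"))
```

Static matching alone is easy to evade, which is presumably why the policy pairs it with a runtime audit hook and shell wrappers.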

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (how good the final reconstruction was).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. S1/S2/S3 LLM-judged from plan.md + conversation.json; S4/S5 deterministic from output completeness. Shown ×100.
Task (50%): Task (Image Enhancement) = mean(per-patient SSIM ∈ [0, 1]) between reconstruction and reference. PSNR and LPIPS are reported as diagnostic signals only.
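To make the SSIM term concrete, here is a simplified single-window version over flattened pixel lists, using the usual stabilizing constants. It is an assumption-laden sketch: widely used implementations compute SSIM over local sliding windows and average the results, and the benchmark's exact settings are not given here.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two equally sized flat pixel sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Conventional constants that keep the ratio stable near zero.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))


def task_score(pairs):
    """Mean per-patient SSIM; pairs is a list of (reconstruction, reference)."""
    return sum(global_ssim(rec, ref) for rec, ref in pairs) / len(pairs)


# Toy intensities in [0, 1]; the second reconstruction is slightly off.
reference = [0.1, 0.5, 0.9, 0.3]
print(task_score([(reference, reference), ([0.2, 0.5, 0.8, 0.3], reference)]))
```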

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Agentic workflow · S1 → S5

The Agentic score is a weighted sum across the five stages.

Team

About

AutoMedBench is a joint effort between University of California, Santa Cruz and NVIDIA, benchmarking autonomous agentic models across the medical AI research evaluation pipeline — plan, setup, validate, infer, submit.

University of California, Santa Cruz

NVIDIA
