Overall Score — averaged across all tasks
Overall = (Agentic + Task) / 2. Agentic: how well the agent ran the research pipeline, a weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), ∈ [0, 100]. Task (Segmentation) = macro Dice × completion rate.
Avg Turns
Conversational turns / run
Avg Time
Wall-clock per run
Avg Tokens
Total LLM tokens / run
Avg Cost
USD / run
Failure Rate (failure = unable to finish the end-to-end workflow or provide a complete submission)
% of runs that did not reach a final submission.
Agent Workflow — S1 → S5
Every agent run flows through the same five stages. Each stage captures a distinct capability of the research loop. Scoring for each stage lives under .
Difficulty Tiers
Three tiers control how much help the agent gets. Same task, different level of scaffolding.
Dataset modality · samples · source
Each task is a standalone public medical imaging dataset. Agents run on the same held-out samples across all tiers.
Sandbox Rules
- Policy: Inference-only. Agents run pre-trained weights on real patient data; no training or fine-tuning.
- Allowed: /data/public/ (read-only) · /workspace/ (read-write)
- Forbidden: /data/private/ · benchmark source · other agents' runs
- On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
Overall Score
Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how good the final output was).
- Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
- Task (50%): Task (Segmentation) = mean(per-target Dice ∈ [0, 1]). Targets are per-dataset (e.g. organ + lesion for KiTS19, 7 tissue classes for FeTA) and weighted equally.
- S1 PLAN: 25%
- S2 SETUP: 15%
- S3 VALIDATE: 35%
- S4 INFERENCE: 15%
- S5 SUBMIT: 10%
All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
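The arithmetic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's scoring code: the function names and example values are hypothetical, the Dice helper works on flat label lists rather than real volumes, and it assumes a target absent from both prediction and reference scores 1.0, a convention this page does not specify.

```python
# Stage weights as listed on this page (they sum to 1.0).
STAGE_WEIGHTS = {"S1": 0.25, "S2": 0.15, "S3": 0.35, "S4": 0.15, "S5": 0.10}


def dice(pred, ref, label):
    """Dice overlap for one target label over two equal-length label lists."""
    pred_mask = [v == label for v in pred]
    ref_mask = [v == label for v in ref]
    intersection = sum(p and r for p, r in zip(pred_mask, ref_mask))
    total = sum(pred_mask) + sum(ref_mask)
    # Assumption: a target missing from both inputs counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def task_score(pred, ref, labels):
    """Mean of per-target Dice, every target weighted equally."""
    return sum(dice(pred, ref, label) for label in labels) / len(labels)


def agentic_score(stage_scores):
    """Weighted sum of the five stage scores, each expected in [0, 1]."""
    return sum(weight * stage_scores[stage] for stage, weight in STAGE_WEIGHTS.items())


def overall_score(stage_scores, task):
    """Average of Agentic and Task, scaled to the 0-100 leaderboard range."""
    return 100.0 * (agentic_score(stage_scores) + task) / 2.0


# Hypothetical run: made-up stage scores and a six-voxel toy volume.
stages = {"S1": 0.8, "S2": 1.0, "S3": 0.6, "S4": 1.0, "S5": 1.0}
pred = [0, 1, 1, 2, 2, 0]
ref = [0, 1, 1, 1, 2, 0]
task = task_score(pred, ref, labels=[1, 2])
print(round(overall_score(stages, task), 2))
```

Because S3 carries the largest weight (35%), a weak validation stage lowers the Agentic half more than an equally weak setup or inference stage would.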
Stage-by-stage breakdown S1 → S5 · weights & sub-criteria
The Agentic score is a weighted sum across the five stages. Click any stage for its sub-criteria and scoring formula.
Per-Task Leaderboard
Pick a task and difficulty tier.
Overall Score — averaged across all tasks
Overall = (Agentic + Task) / 2. Agentic: how well the agent ran the research pipeline, a weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), ∈ [0, 100]. Task (VQA) = exact-match accuracy × completion rate.
Avg Turns
Conversational turns / run
Avg Time
Wall-clock per run
Avg Tokens
Total LLM tokens / run
Avg Cost
USD / run
Failure Rate (failure = unable to finish the end-to-end workflow or provide a complete submission)
% of runs that did not reach a final submission.
Per-Task Leaderboard
Pick a task and difficulty tier.
Agent Workflow — S1 → S5
The same single-LLM coding-agent loop as every other track: one long
conversation, execute_code only, no auxiliary tools.
Same stages as Segmentation. See for how the task half is scored.
Difficulty Tiers
Lite is model-locked for comparability; Standard opens the choice within a fixed five-VLM candidate pool.
Dataset
| Task | Modality | Task Time Range | Source | Description |
|---|---|---|---|---|
| Pathology VQA | Histopathology | — | PathVQA | Open-ended pathology QA. |
| Radiology VQA | CT · MRI · X-ray | — | VQA-RAD | Mixed-modality radiology QA. |
| Multi-frame Medical VQA | Multi-frame video | — | MedFrameQA | Multi-frame MCQ (A–E). |
| Semantic Radiology VQA | Radiology · EN | — | SLAKE-EN | Semantic radiology QA. |
| Expert Multimodal VQA | Multi-modal | — | MedXpertQA · MM | Expert multi-image MCQ. |
| Endoscopy VQA | GI endoscopy | — | Kvasir-VQA | Open-ended GI-endoscopy QA: yes/no, MCQ, short-answer. |
| Comprehensive VQA | Multi-modal | — | OmniMedVQA | Multi-modality medical MCQ: 12 modalities, 73 source datasets. |
| Literature VQA | Mixed biomedical | — | PMC-VQA | MCQ on PubMed Central biomedical figures. |
| MMMU-Medical-VQA | Multi-modal | — | MMMU | Health & Medicine slice from MMMU (5 subjects, validation split). |
Datasets remain the property of their respective authors. AutoMedBench reports leaderboard scores only and does not redistribute dataset content; benchmark runs download data directly from the linked sources.
Sandbox Rules
- Policy: Inference-only. No fine-tuning on benchmark data.
- Allowed: /data/public/ (read-only) · /workspace/ · API keys env-injected
- Forbidden: /data/private/ · ground-truth files
Overall Score
Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (answer quality).
- Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
- Task (50%): Task (VQA) = mean(per-question score ∈ [0, 1]). Exact-match for MCQ, max(EM, token-F1) or LLM-judge for open-ended. Every question weighted equally.
- S1 PLAN: 25%
- S2 SETUP: 15%
- S3 VALIDATE: 35%
- S4 INFERENCE: 15%
- S5 SUBMIT: 10%
All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
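The per-question rule for VQA can be illustrated with a short Python sketch. It is a hedged approximation: the normalization (lowercasing and whitespace splitting) and all function names are assumptions, the token-F1 follows the common bag-of-tokens definition rather than any implementation confirmed by this page, and the LLM-judge path is not modelled.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(pred, ref):
    return float(normalize(pred) == normalize(ref))


def token_f1(pred, ref):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens, ref_tokens = normalize(pred), normalize(ref)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2.0 * precision * recall / (precision + recall)


def question_score(pred, ref, is_mcq):
    """Exact match for MCQ; max(EM, token-F1) for open-ended answers."""
    if is_mcq:
        return exact_match(pred, ref)
    return max(exact_match(pred, ref), token_f1(pred, ref))


def vqa_task_score(items):
    """Mean per-question score; items are (prediction, reference, is_mcq)."""
    return sum(question_score(p, r, m) for p, r, m in items) / len(items)


# Hypothetical answers, for illustration only.
items = [
    ("B", "b", True),
    ("left lung", "the left lung", False),
    ("yes", "no", False),
]
print(vqa_task_score(items))
```

Taking the maximum of EM and token-F1 means a partially correct open-ended answer still earns partial credit, while MCQ answers remain all-or-nothing.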
Stage-by-stage breakdown S1 → S5 · weights & sub-criteria
The Agentic score is a weighted sum across the five stages. Click any stage for its sub-criteria and scoring formula.
Overall Score — averaged across all tasks
Overall = (Agentic + Task) / 2. Agentic: how well the agent ran the research pipeline, a weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), ∈ [0, 100]. Task (Report Gen) = mean(BLEU, METEOR, ROUGE-L, F1RadGraph) × completion rate.
Avg Turns
Conversational turns / run
Avg Time
Wall-clock per run
Avg Tokens
Total LLM tokens / run
Avg Cost
USD / run
Failure Rate (failure = unable to finish the end-to-end workflow or provide a complete submission)
% of runs that did not reach a final submission.
Per-Task Leaderboard
Pick a task and difficulty tier.
Agent Workflow — S1 → S5
Same coding-agent loop as every other track. The report-generation harness
also exposes an S1–S3 judge (heuristic or online) that grades the
plan/setup/validate stages, mirroring eval_seg/.
Benchmark unit is one study, not one JPEG — each case may contain
multiple views under public/<case_id>/images/. See
for how each stage is scored.
Difficulty Tiers
Defined in eval_report_gen/tier_config.py. Tiers differ by model-selection freedom and plan-artefact requirements at S1.
Dataset study-level chest X-ray reporting
| Task | Modality | Task Time Range | Source | Description |
|---|---|---|---|---|
| Chest X-ray Findings Report | Chest X-ray | — | MIMIC-CXR | Findings-only reports. |
| Chest X-ray Findings / Impression | Chest X-ray | — | IU / Open-i | Frontal chest X-ray findings or impression. |
| Chest X-ray Full Report | Chest X-ray | — | CheXpert Plus | Full structured reports (findings + impression). |
| Pathology Captioning · 100 | Histopathology | — | PathCap | Histopathology image captioning · 100-case split. |
| Pathology Captioning · 500 | Histopathology | — | PathCap | Histopathology image captioning · 500-case split. |
Sandbox Rules
- Policy
- Policy: Inference-only. Agent produces one report.txt per study; partial submissions are rejected by the format checker.
- Allowed: public/<case_id>/images/*.jpg (read-only) · public/<case_id>/manifest.json · /workspace/
- Forbidden: private/<case_id>/report.txt · private/<case_id>/labels.json · private/<case_id>/manifest.json · other agents' runs
- Credentials: eval_report_gen/api_keys/key.txt (exactly four lines, keys only)
- Preflight: CheXbert + BERT + RadGraph checkpoints must exist on disk before scoring starts.
Overall Score
Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (how good the generated report is against the reference).
- Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
- Task (50%): Task (Report Generation) = mean(per-report score ∈ [0, 1]), where per-report = mean(BLEU, METEOR, ROUGE_L, F1RadGraph, μP, μR, μF1). All seven components weighted equally.
- S1 PLAN: 25%
- S2 SETUP: 15%
- S3 VALIDATE: 35%
- S4 INFERENCE: 15%
- S5 SUBMIT: 10%
All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
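The two-level averaging for report generation can be sketched as follows. The component values are assumed to have been computed elsewhere and already scaled to [0, 1]; the function names and the example numbers are hypothetical, and this is not the harness's own code.

```python
# The seven equally weighted components named on this page.
COMPONENTS = ("BLEU", "METEOR", "ROUGE_L", "F1RadGraph", "μP", "μR", "μF1")


def per_report_score(metrics):
    """Unweighted mean of the seven components for one report."""
    missing = [name for name in COMPONENTS if name not in metrics]
    if missing:
        raise KeyError(f"missing components: {missing}")
    return sum(metrics[name] for name in COMPONENTS) / len(COMPONENTS)


def report_task_score(all_metrics):
    """Mean of per-report scores across every study."""
    return sum(per_report_score(m) for m in all_metrics) / len(all_metrics)


# Hypothetical metric values for two studies.
study_a = {"BLEU": 0.2, "METEOR": 0.3, "ROUGE_L": 0.4, "F1RadGraph": 0.3,
           "μP": 0.5, "μR": 0.4, "μF1": 0.7}
study_b = {name: 0.5 for name in COMPONENTS}
print(report_task_score([study_a, study_b]))
```

Equal weighting means the three lexical metrics together contribute 3/7 of each report's score, and the clinical-content metrics the remaining 4/7.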
Stage-by-stage breakdown S1 → S5 · weights & sub-criteria
The Agentic score is a weighted sum across the five stages. Click any stage for its sub-criteria and scoring formula.
Overall Score — averaged across all tasks
Overall = (Agentic + Task) / 2. Agentic: how well the agent ran the research pipeline, a weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), ∈ [0, 100]. Task (Detection) = mAP@0.5 × completion rate.
Avg Turns
Conversational turns / run
Avg Time
Wall-clock per run
Avg Tokens
Total LLM tokens / run
Avg Cost
USD / run
Failure Rate (failure = unable to finish the end-to-end workflow or provide a complete submission)
% of runs that did not reach a final submission.
Per-Task Leaderboard
Pick a task and difficulty tier.
Agent Workflow — S1 → S5
Same coding-agent loop as every other track. S4 writes a bbox
prediction list per image; S5 consolidates the submission and calls
submit_results. Scoring is mAP@0.5 on the
held-out private labels.
Benchmark unit is one chest X-ray. See for how each stage is scored.
Difficulty Tiers
Dataset
| Task | Modality | Task Time Range | Source | Description |
|---|---|---|---|---|
| Chest X-ray Abnormality Detection | Chest X-ray | — | VinDr-CXR | 14-class bbox detection. Task score = mAP@0.5. |
| Blood Cell Detection | Blood-smear microscopy | — | BCCD | 3-class detection (rbc · wbc · platelets). Task score = mAP@0.5. |
| Dental Disease Detection | Panoramic dental X-ray | — | DENTEX | 4-class disease detection (caries · deep caries · periapical lesion · impacted). Task score = mAP@0.5. |
| Wrist Anomaly Detection | Pediatric wrist X-ray | — | GRAZPEDWRI-DX | 9-class anomaly detection (fracture · bone lesion · foreign body · …). Task score = mAP@0.5. |
Sandbox Rules
- Policy: Inference-only. No training on benchmark data.
- Allowed: data/VinDrCXR_Detection1000/public/ (read-only) · /workspace/
- Forbidden: data/VinDrCXR_Detection1000/private/ · ground-truth files
- Limits: Time 3 600 s / run · IoU 0.5
- On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
Overall Score
Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how good the final bbox output was).
- Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
- Task (50%): Task (Detection) = mAP@0.5 on the held-out private labels. ∈ [0, 1].
- S1 PLAN: 25%
- S2 SETUP: 15%
- S3 VALIDATE: 35%
- S4 INFERENCE: 15%
- S5 SUBMIT: 10%
All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
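For readers unfamiliar with mAP@0.5, here is a minimal Python sketch of average precision for a single class at an IoU threshold of 0.5. It is one common formulation (greedy matching by descending confidence, area under the monotone precision envelope) and may differ in detail from the benchmark's evaluator, which this page does not describe. The box format (x_min, y_min, x_max, y_max), the input layout, and the function names are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truth, iou_threshold=0.5):
    """AP for one class.

    predictions: list of (image_id, confidence, box)
    ground_truth: dict mapping image_id -> list of boxes
    """
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    hits = []
    # Visit predictions from most to least confident.
    for image_id, _conf, box in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truth.get(image_id, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        # A hit needs enough overlap with a ground-truth box not yet claimed.
        if best_idx >= 0 and best_iou >= iou_threshold and not matched[image_id][best_idx]:
            matched[image_id][best_idx] = True
            hits.append(1)
        else:
            hits.append(0)
    precisions, recalls, true_pos = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_pos += hit
        precisions.append(true_pos / rank)
        recalls.append(true_pos / n_gt)
    # Make precision non-increasing in recall, then integrate.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(per_class):
    """mAP: mean AP over classes; per_class maps class -> (predictions, ground_truth)."""
    return sum(average_precision(p, g) for p, g in per_class.values()) / len(per_class)


# Hypothetical toy data: two images, one ground-truth box each.
gt = {"img1": [(0, 0, 10, 10)], "img2": [(0, 0, 10, 10)]}
preds = [
    ("img1", 0.9, (0, 0, 10, 10)),    # exact match
    ("img1", 0.8, (20, 20, 30, 30)),  # false positive
    ("img2", 0.7, (0, 0, 10, 5)),     # IoU exactly 0.5, counted as a hit
]
print(average_precision(preds, gt))
```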
Stage-by-stage breakdown S1 → S5 · weights & sub-criteria
The Agentic score is a weighted sum across the five stages. Click any stage for its sub-criteria and scoring formula.
Overall Score — averaged across all tasks
Overall = (Agentic + Task) / 2. Agentic: how well the agent ran the research pipeline, a weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), ∈ [0, 100]. Task (Classification) = accuracy (correct ÷ patients; a missing prediction counts as wrong).
Avg Turns
Conversational turns / run
Avg Time
Wall-clock per run
Avg Tokens
Total LLM tokens / run
Avg Cost
USD / run
Failure Rate (failure = unable to finish the end-to-end workflow or provide a complete submission)
% of runs that did not reach a final submission.
Per-Task Leaderboard
Pick a task and difficulty tier.
Agent Workflow — S1 → S5
Same coding-agent loop as every other track. S4 classifies every
patient and writes a predictions.csv (patient_id,label);
S5 verifies the predictions and calls submit_results.
Scoring is accuracy against the held-out private labels.
Benchmark unit is one image. See for how each stage is scored.
Difficulty Tiers
Dataset
| Task | Modality | Task Time Range | Source | Description |
|---|---|---|---|---|
| Brain Tumor MRI Classification | Brain MRI | — | Brain Tumor MRI | 4-class single-label (glioma · meningioma · notumor · pituitary), 100 patients balanced. Task score = accuracy. |
| Colorectal Tissue Histology | Histopathology (H&E) | — | NCT-CRC-HE-100K | 9-class colorectal tissue (tumor · stroma · lymphocytes · …), 100 cases. Task score = accuracy. |
| Lymph-Node Metastasis | Histopathology (H&E) | — | PatchCamelyon | Binary tumor / normal in 96×96 lymph-node patches, 100 cases. Task score = accuracy. |
| Chest X-ray Pneumonia | Chest X-ray | — | Kermany et al. | Binary pneumonia / normal, 100 cases. Task score = accuracy. |
| Skin Lesion | Dermoscopy | — | ISIC Archive | Multi-class dermoscopic skin-lesion classification, 100 cases. Task score = accuracy. |
Sandbox Rules
- Policy: Inference-only. No training on benchmark data.
- Allowed: data/<dataset>/public/ (read-only) · /workspace/
- Forbidden: data/<dataset>/private/ · ground-truth labels
- Limits: Time 3 600 s / run · Metric accuracy
- On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
Overall Score
Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how accurate the final per-patient labels were).
- Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
- Task (50%): Task (Classification) = accuracy = correct ÷ patients on the held-out private labels (a missing prediction counts as wrong). ∈ [0, 1].
- S1 PLAN: 25%
- S2 SETUP: 15%
- S3 VALIDATE: 35%
- S4 INFERENCE: 15%
- S5 SUBMIT: 10%
All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
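The accuracy rule, including the penalty for missing predictions, can be sketched in Python. The predictions.csv layout (patient_id,label) comes from this page; the presence of a header row, the function names, and the example data are assumptions made for illustration.

```python
import csv
import io


def load_predictions(csv_text):
    """Parse predictions.csv content, assumed to start with a patient_id,label header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["patient_id"]: row["label"] for row in reader}


def accuracy(predictions, truth):
    """correct / patients; a patient with no prediction counts as wrong."""
    correct = sum(1 for pid, label in truth.items() if predictions.get(pid) == label)
    return correct / len(truth)


# Hypothetical labels and an agent output that skips patient p4.
truth = {"p1": "glioma", "p2": "notumor", "p3": "pituitary", "p4": "meningioma"}
csv_text = "patient_id,label\np1,glioma\np2,notumor\np3,glioma\n"
print(accuracy(load_predictions(csv_text), truth))
```

Dividing by the number of patients rather than the number of predictions is what makes a skipped patient cost as much as a wrong label.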
Stage-by-stage breakdown S1 → S5 · weights & sub-criteria
The Agentic score is a weighted sum across the five stages. Click any stage for its sub-criteria and scoring formula.
Overall Score — averaged across all tasks
Overall = (Agentic + Task) / 2. Agentic: how well the agent ran the research pipeline, a weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), ∈ [0, 100]. Task (Synthetic) = mean SSIM of the synthetic CT against the held-out private ground-truth CT.
Avg Turns
Conversational turns / run
Avg Time
Wall-clock per run
Avg Tokens
Total LLM tokens / run
Avg Cost
USD / run
Failure Rate (failure = a round whose medal is fail: no valid output, or SSIM below the task baseline). The strict CT-SR thresholds (Good ≥ 0.98 / Okay ≥ 0.95) make this common even when the synthetic CT is close.
% of rounds scored as a fail medal.
Per-Task Leaderboard
Pick a task and difficulty tier.
Agent Workflow — S1 → S5
Same coding-agent loop as every other track. S4 generates one synthetic
CT per patient and writes agents_outputs/<patient_id>/sct.nii.gz;
S5 verifies the volumes and calls submit_results.
Scoring is mean SSIM against the held-out private CT.
Benchmark unit is one patient volume (a 3D NIfTI). See for how each stage is scored.
Difficulty Tiers
Dataset
| Task | Modality | Task Time Range | Source | Description |
|---|---|---|---|---|
| Head & Neck MRI-to-CT | MRI → CT | — | SynthRAD2025 | Generate synthetic CT from head-and-neck MRI, 20 cases. Task score = mean SSIM. |
| Body CT Super-Resolution | CT (×4 SR) | — | CT-ORG | Restore ×4 z-axis–degraded CT volumes, 20 cases. Task score = mean SSIM. |
| Pancreas CT Super-Resolution | CT (×4 SR) | — | MSD Task07 Pancreas | Restore ×4 z-axis–degraded pancreas CT, 20 cases. Task score = mean SSIM. |
| Whole-Body CT Super-Resolution | CT (×4 SR) | — | TotalSegmentator v2 | Restore ×4 z-axis–degraded CT volumes, 20 cases. Task score = mean SSIM. |
Sandbox Rules
- Policy: Inference-only. No training on benchmark data.
- Allowed: data/<dataset>/public/ (read-only) · /workspace/
- Forbidden: data/<dataset>/private/ · ground-truth CT
- Limits: Time 3 600 s / run · Metric SSIM
- On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
Overall Score
Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how close the synthetic CT is to the private ground-truth CT, by SSIM).
- Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
- Task (50%): Task (Synthetic) = mean SSIM of the synthetic CT against the held-out private ground-truth CT. ∈ [0, 1].
- S1 PLAN: 25%
- S2 SETUP: 15%
- S3 VALIDATE: 35%
- S4 INFERENCE: 15%
- S5 SUBMIT: 10%
All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
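To make the SSIM-based Task score concrete, the sketch below computes a simplified, single-window (global) SSIM over flattened intensity lists and averages it across patients. This is deliberately not the windowed SSIM that standard image libraries compute, and the benchmark's exact SSIM configuration (window size, data range, masking) is not stated on this page; the constants follow the usual defaults K1 = 0.01 and K2 = 0.03, and the function names and sample values are hypothetical.

```python
def global_ssim(x, y, data_range):
    """Single-window SSIM over two equal-length lists of intensities."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants from the standard SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_ssim(pairs, data_range):
    """Mean SSIM across patients; pairs are (synthetic, reference) volumes."""
    return sum(global_ssim(s, r, data_range) for s, r in pairs) / len(pairs)


# Hypothetical flattened intensities for two patients.
reference = [0.0, 100.0, 200.0, 300.0]
identical = list(reference)
noisy = [10.0, 90.0, 220.0, 280.0]
print(mean_ssim([(identical, reference), (noisy, reference)], data_range=300.0))
```

An identical volume scores 1.0 and any deviation lowers the score, which is why thresholds such as 0.98 leave very little room for error.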
Stage-by-stage breakdown S1 → S5 · weights & sub-criteria
The Agentic score is a weighted sum across the five stages. Click any stage for its sub-criteria and scoring formula.
Overall Score — averaged across all tasks
Overall = (Agentic + Task) / 2. Agentic: how well the agent ran the research pipeline, a weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), judged from conversation.json, ∈ [0, 100]. Task (Enhancement) = mean SSIM × completion rate.
Avg Turns
Conversational turns / run
Avg Time
Wall-clock per run
Avg Tokens
Total LLM tokens / run
Avg Cost
USD / run
Failure Rate (failure = unable to finish the end-to-end workflow or provide a complete submission)
% of repeats that could not be scored.
Per-Task Leaderboard
Pick a task and difficulty tier.
Agent Workflow — S1 → S5
Same coding-agent loop. Two-container architecture (v5 gold standard):
the agent container does S1–S5 with GPU + network, then a separate
eval container with GPU but --network none scores the outputs.
Agent writes to /agent_outputs/; eval container reads them plus
/data/private/ reference images and writes
/results/detail_report.json. See
for how each metric is scored.
Difficulty Tiers
Lite locks the method (DRUNet for LDCT · Swin2SR for MRI-SR) so results are comparable across agents. Standard opens the choice inside the documented inference-only candidate pool.
Dataset modality · samples · source
Each task is a standalone public medical imaging dataset. Agents run on the same held-out samples across all tiers.
Sandbox Rules 2-container isolation
- Policy: Inference-only. .backward(), optimizer.step(), model.train(), and torch.optim.* are all blocked by the sandbox.
- Agent container: GPU + bridge network · --read-only rootfs · no-new-privileges · --pids-limit 4096 · 3-layer sandbox (static regex + Python sys.addaudithook + bash wrappers)
- Eval container: GPU + --network none · no agent code · writes /results/detail_report.json
- Forbidden (agent side): /data/private/ · /eval/ · /results/ · /bands/ · any reference.npy / ground_truth.csv / baseline_bands.json
- On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
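Two of the three sandbox layers named above, the static regex scan and the Python audit hook, can be illustrated with a small sketch. It shows the general technique only and is not the benchmark's sandbox: the regex patterns, the function names, and the choice to raise PermissionError are assumptions, and the path check assumes POSIX-style absolute paths. Note that a hook installed with sys.addaudithook cannot be removed for the rest of the process.

```python
import os
import re
import sys

# Layer 1 (assumed patterns): flag training-related calls before code runs.
TRAINING_PATTERNS = [
    re.compile(r"\.backward\s*\("),
    re.compile(r"\boptimizer\.step\s*\("),
    re.compile(r"\bmodel\.train\s*\("),
    re.compile(r"\btorch\.optim\b"),
]


def find_training_calls(source):
    """Return the patterns that match anywhere in the submitted source text."""
    return [p.pattern for p in TRAINING_PATTERNS if p.search(source)]


# Layer 2: runtime guard on file opens under the forbidden directories.
FORBIDDEN_PREFIXES = ("/data/private/", "/eval/", "/results/", "/bands/")


def _block_forbidden_open(event, args):
    if event != "open":
        return
    path = args[0]
    if isinstance(path, int):  # already-open file descriptor, nothing to check
        return
    try:
        resolved = os.path.abspath(os.fsdecode(path))
    except TypeError:
        return
    if any(resolved.startswith(prefix) for prefix in FORBIDDEN_PREFIXES):
        # Raising inside an audit hook aborts the operation being audited.
        raise PermissionError(f"sandbox: access to {resolved} is blocked")


sys.addaudithook(_block_forbidden_open)

print(find_training_calls("loss.backward(); optimizer.step()"))
try:
    open("/data/private/reference.npy", "rb")
except PermissionError as exc:
    print("blocked:", exc)
```

The regex layer is cheap but easy to evade through indirection, which is presumably why a runtime hook and shell wrappers sit behind it.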
Overall Score
Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (how good the final reconstruction was).
- Agentic (50%): weighted sum of S1–S5 stage scores, each ∈ [0, 1]. S1/S2/S3 LLM-judged from plan.md + conversation.json; S4/S5 deterministic from output completeness. Shown ×100.
- Task (50%): Task (Enhancement) = mean(per-patient SSIM ∈ [0, 1]) between reconstruction and reference. PSNR and LPIPS are reported as diagnostic signals only.
- S1 PLAN: 25%
- S2 SETUP: 15%
- S3 VALIDATE: 35%
- S4 INFERENCE: 15%
- S5 SUBMIT: 10%
All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
Agentic workflow S1 → S5
The Agentic score is a weighted sum across the five stages. Click any stage for its sub-criteria and scoring formula.
About
AutoMedBench is a joint effort between University of California, Santa Cruz and NVIDIA, benchmarking autonomous agentic models across the medical AI research evaluation pipeline — plan, setup, validate, infer, submit.