AutoMedBench · Towards Medical AutoResearch · a benchmark for AI agents on medical research tasks

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (Segmentation) = macro Dice × completion rate.

Avg Turns

Conversational turns / run

Avg Time

Wall-clock per run

Avg Tokens

Total LLM tokens / run

Avg Cost

USD / run

Failure Rate

Failure = unable to finish the end-to-end workflow or provide a complete submission.

% of runs that did not reach a final submission.
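As an illustration of how the summary cards above could be derived from per-run logs, here is a minimal Python sketch. The record fields (`turns`, `seconds`, `tokens`, `cost_usd`, `submitted`) are hypothetical names, and averaging over all runs (failed ones included) is an assumption; this page does not specify either.

```python
from statistics import mean

def summarize_runs(runs):
    """Aggregate per-run records into the five summary metrics.

    Each record is a dict with hypothetical keys:
    turns, seconds, tokens, cost_usd, submitted (bool).
    """
    if not runs:
        raise ValueError("need at least one run")
    # A run counts as a failure when it never reached a final submission.
    failed = sum(1 for r in runs if not r["submitted"])
    return {
        "avg_turns": mean(r["turns"] for r in runs),
        "avg_time_s": mean(r["seconds"] for r in runs),
        "avg_tokens": mean(r["tokens"] for r in runs),
        "avg_cost_usd": mean(r["cost_usd"] for r in runs),
        "failure_rate_pct": 100.0 * failed / len(runs),
    }

runs = [
    {"turns": 10, "seconds": 100, "tokens": 1000, "cost_usd": 0.1, "submitted": True},
    {"turns": 20, "seconds": 200, "tokens": 2000, "cost_usd": 0.2, "submitted": True},
    {"turns": 30, "seconds": 300, "tokens": 3000, "cost_usd": 0.3, "submitted": True},
    {"turns": 40, "seconds": 400, "tokens": 4000, "cost_usd": 0.4, "submitted": False},
]
summary = summarize_runs(runs)
```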

Agent Workflow — S1 → S5

Every agent run flows through the same five stages. Each stage captures a distinct capability of the research loop. Scoring for each stage is described in the stage-by-stage breakdown.

Difficulty Tiers

Three tiers control how much help the agent gets. Same task, different level of scaffolding.

Dataset modality · samples · source

Each task is a standalone public medical imaging dataset. Agents run on the same held-out samples across all tiers.

Sandbox Rules

Policy: Inference-only. Agents run pre-trained weights on real patient data; no training or fine-tuning.
Allowed: /data/public/ read-only · /workspace/ read-write
Forbidden: /data/private/ · benchmark source · other agents' runs
On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
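The allow/forbid rules above can be read as a path-prefix policy. The sketch below is one possible way to express them in Python. The function name `check_access` and the "blocked" / "allowed" return values are illustrative assumptions, not the benchmark's actual enforcement code.

```python
import posixpath
from pathlib import PurePosixPath

# Roots taken from the rules above; everything else is treated as blocked.
READ_ROOTS = (PurePosixPath("/data/public"), PurePosixPath("/workspace"))
WRITE_ROOTS = (PurePosixPath("/workspace"),)
FORBIDDEN_ROOTS = (PurePosixPath("/data/private"),)

def _is_under(path, root):
    return path == root or root in path.parents

def check_access(path, mode="r"):
    """Return "allowed" or "blocked" for a read ("r") or write ("w") request."""
    # Normalise first so that ".." segments cannot escape an allowed root.
    p = PurePosixPath(posixpath.normpath(path))
    if any(_is_under(p, root) for root in FORBIDDEN_ROOTS):
        return "blocked"
    roots = WRITE_ROOTS if mode == "w" else READ_ROOTS
    return "allowed" if any(_is_under(p, root) for root in roots) else "blocked"
```

For example, `check_access("/data/public/case1/img.nii.gz")` is allowed, while a write to the same path or any access that normalises into `/data/private/` is blocked.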

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how good the final output was).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%) = weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (Segmentation, 50%) = mean(per-target Dice ∈ [0, 1]). Targets are per-dataset (e.g. organ + lesion for KiTS19, 7 tissue classes for FeTA) and weighted equally.
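To make the equal-weight averaging over targets concrete, here is a small NumPy sketch of per-target Dice and its macro average. Treating a target that is empty in both prediction and reference as a perfect score is an assumption; the page does not say how that case is handled.

```python
import numpy as np

def dice(pred_mask, ref_mask):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    ref_mask = np.asarray(ref_mask, dtype=bool)
    denom = pred_mask.sum() + ref_mask.sum()
    if denom == 0:
        return 1.0  # assumption: both masks empty counts as agreement
    return float(2.0 * np.logical_and(pred_mask, ref_mask).sum() / denom)

def macro_dice(pred_labels, ref_labels, target_ids):
    """Average Dice over targets, each target weighted equally."""
    pred_labels = np.asarray(pred_labels)
    ref_labels = np.asarray(ref_labels)
    scores = [dice(pred_labels == t, ref_labels == t) for t in target_ids]
    return float(np.mean(scores))

# Toy label maps with two targets (1 and 2); 0 is background.
pred = np.array([[1, 1, 0], [2, 0, 0]])
ref = np.array([[1, 0, 0], [2, 2, 0]])
score = macro_dice(pred, ref, target_ids=[1, 2])
```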

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.
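A worked example of the arithmetic, as a short Python sketch. The stage weights are the ones listed above; the function names and the sample stage scores are illustrative only.

```python
# Stage weights from the table above (they sum to 1.0).
STAGE_WEIGHTS = {"S1": 0.25, "S2": 0.15, "S3": 0.35, "S4": 0.15, "S5": 0.10}

def agentic_score(stage_scores):
    """Weighted sum of S1-S5 stage scores, each expected in [0, 1]."""
    for stage in STAGE_WEIGHTS:
        if not 0.0 <= stage_scores[stage] <= 1.0:
            raise ValueError(f"{stage} score must be in [0, 1]")
    return sum(w * stage_scores[s] for s, w in STAGE_WEIGHTS.items())

def overall_score(stage_scores, task_score):
    """Overall = (0.5 * Agentic + 0.5 * Task) * 100, in [0, 100]."""
    return 100.0 * (0.5 * agentic_score(stage_scores) + 0.5 * task_score)

# Hypothetical run: Agentic = 0.795, Task = 0.70.
stages = {"S1": 0.8, "S2": 1.0, "S3": 0.6, "S4": 0.9, "S5": 1.0}
overall = overall_score(stages, task_score=0.70)
```

With these sample values the Agentic score is 0.795 and the Overall score is 74.75.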

Stage-by-stage breakdown S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Per-Task Leaderboard

Difficulty: three tiers; more scaffolding = lower difficulty.

Lite · follows a recipe
Standard · chooses within bounds
Pro · discovers and competes

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (VQA) = exact-match accuracy × completion rate.

Per-Task Leaderboard

Difficulty: two tiers; more scaffolding = lower difficulty.

Lite · exact VLM given
Standard · pick from VLM families

Agent Workflow — S1 → S5

The same single-LLM coding-agent loop as every other track: one long conversation, execute_code only, no auxiliary tools.

Same stages as Segmentation. See Overall Score for how the task half is scored.

Difficulty Tiers

Lite is model-locked for comparability; Standard opens the choice within a fixed five-VLM candidate pool.

Dataset

Task	Modality	Task Time Range	Source	Description
Pathology VQA	Histopathology	`—`	PathVQA	Open-ended pathology QA.
Radiology VQA	CT · MRI · X-ray	`—`	VQA-RAD	Mixed-modality radiology QA.
Multi-frame Medical VQA	Multi-frame video	`—`	MedFrameQA	Multi-frame MCQ (A–E).
Semantic Radiology VQA	Radiology · EN	`—`	SLAKE-EN	Semantic radiology QA.
Expert Multimodal VQA	Multi-modal	`—`	MedXpertQA · MM	Expert multi-image MCQ.
Endoscopy VQA	GI endoscopy	`—`	Kvasir-VQA	Open-ended GI-endoscopy QA — yes/no, MCQ, short-answer.
Comprehensive VQA	Multi-modal	`—`	OmniMedVQA	Multi-modality medical MCQ — 12 modalities, 73 source datasets.
Literature VQA	Mixed biomedical	`—`	PMC-VQA	MCQ on PubMed Central biomedical figures.
MMMU-Medical-VQA	Multi-modal	`—`	MMMU	Health & Medicine slice from MMMU (5 subjects, validation split).

Datasets remain the property of their respective authors. AutoMedBench reports leaderboard scores only and does not redistribute dataset content; benchmark runs download data directly from the linked sources.

Sandbox Rules

Policy: Inference-only. No fine-tuning on benchmark data.
Allowed: /data/public/ read-only · /workspace/ · API keys env-injected
Forbidden: /data/private/ · ground-truth files

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (answer quality).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%) = weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (VQA, 50%) = mean(per-question score ∈ [0, 1]). Exact-match for MCQ, max(EM, token-F1) or LLM-judge for open-ended. Every question weighted equally.
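The per-question rule can be sketched in Python as follows. The text normalisation (lower-casing, stripping punctuation) is an assumption, since the page does not define it, and the LLM-judge path for open-ended questions is left out.

```python
import re
from collections import Counter

def normalize(text):
    """Lower-case, drop punctuation, collapse whitespace (assumed normalisation)."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(pred, ref):
    return float(normalize(pred) == normalize(ref))

def token_f1(pred, ref):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    p, r = normalize(pred).split(), normalize(ref).split()
    if not p or not r:
        return float(p == r)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def question_score(pred, ref, is_mcq):
    """Exact match for MCQ; max(EM, token-F1) for open-ended answers."""
    em = exact_match(pred, ref)
    return em if is_mcq else max(em, token_f1(pred, ref))

def vqa_task_score(items):
    """items: list of (prediction, reference, is_mcq); every question weighs the same."""
    return sum(question_score(*item) for item in items) / len(items)

items = [("B", "b", True), ("left lung", "the left lung", False)]
score = vqa_task_score(items)
```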

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Stage-by-stage breakdown S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1, S3 · Continuous: S4 · Discrete: S2, S5

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (Report Gen) = mean(BLEU, METEOR, ROUGE-L, F1RadGraph) × completion rate.

Per-Task Leaderboard

Difficulty: two tiers; more scaffolding = lower difficulty.

Lite · exact MRG given
Standard · pick from MRG families

Agent Workflow — S1 → S5

Same coding-agent loop as every other track. The report-generation harness also exposes an S1–S3 judge (heuristic or online) that grades the plan/setup/validate stages, mirroring eval_seg/.

Benchmark unit is one study, not one JPEG: each case may contain multiple views under public/<case_id>/images/. See the stage-by-stage breakdown for how each stage is scored.

Difficulty Tiers

Defined in eval_report_gen/tier_config.py. Tiers differ by model-selection freedom and plan-artefact requirements at S1.

Dataset study-level chest X-ray reporting

Task	Modality	Task Time Range	Source	Description
Chest X-ray Findings Report	Chest X-ray	`—`	MIMIC-CXR	Findings-only reports.
Chest X-ray Findings / Impression	Chest X-ray	`—`	IU / Open-i	Frontal chest X-ray findings or impression.
Chest X-ray Full Report	Chest X-ray	`—`	CheXpert Plus	Full structured reports (findings + impression).
Pathology Captioning · 100	Histopathology	`—`	PathCap	Histopathology image captioning · 100-case split.
Pathology Captioning · 500	Histopathology	`—`	PathCap	Histopathology image captioning · 500-case split.

Sandbox Rules

Policy: Inference-only. Agent produces one report.txt per study; partial submissions are rejected by the format checker.
Allowed: public/<case_id>/images/*.jpg read-only · public/<case_id>/manifest.json · /workspace/
Forbidden: private/<case_id>/report.txt · private/<case_id>/labels.json · private/<case_id>/manifest.json · other agents' runs
Credentials: eval_report_gen/api_keys/key.txt · exactly four lines, keys only
Preflight: CheXbert + BERT + RadGraph checkpoints must exist on disk before scoring starts.
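The "one report.txt per study" rule suggests a simple completeness check before submission. The sketch below is a hypothetical stand-in for the format checker, not its real implementation: the output layout `<output_dir>/<case_id>/report.txt` and both function names are assumptions.

```python
from pathlib import Path

def missing_reports(public_dir, output_dir):
    """Return case IDs under public_dir that lack a non-empty report.txt.

    Assumes (hypothetically) that reports are written to
    <output_dir>/<case_id>/report.txt.
    """
    public_dir, output_dir = Path(public_dir), Path(output_dir)
    case_ids = sorted(p.name for p in public_dir.iterdir() if p.is_dir())
    missing = []
    for case_id in case_ids:
        report = output_dir / case_id / "report.txt"
        if not report.is_file() or not report.read_text(encoding="utf-8").strip():
            missing.append(case_id)
    return missing

def is_complete(public_dir, output_dir):
    """A submission is complete only when every study has a report."""
    return not missing_reports(public_dir, output_dir)
```

An agent could call `missing_reports(...)` at S5 and regenerate the listed cases before submitting, since a partial submission is rejected.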

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (how good the generated report is against the reference).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%) = weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (Report Generation, 50%) = mean(per-report score ∈ [0, 1]), where per-report = mean(BLEU, METEOR, ROUGE_L, F1RadGraph, μP, μR, μF1). All seven components weighted equally.

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Stage-by-stage breakdown S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Overall Score — averaged across all tasks

Per-Task Leaderboard

Difficulty: two tiers; more scaffolding = lower difficulty.

Lite · exact detector given
Standard · pick from detector families

Agent Workflow — S1 → S5

Same coding-agent loop as every other track. S4 writes a bbox prediction list per image; S5 consolidates the submission and calls submit_results. Scoring is mAP@0.5 on the held-out private labels.

Benchmark unit is one chest X-ray. See the stage-by-stage breakdown for how each stage is scored.

Difficulty Tiers

Dataset

Task	Modality	Task Time Range	Source	Description
Chest X-ray Abnormality Detection	Chest X-ray	`—`	VinDr-CXR	14-class bbox detection. Task score = mAP@0.5.
Blood Cell Detection	Blood-smear microscopy	`—`	BCCD	3-class detection (rbc · wbc · platelets). Task score = mAP@0.5.
Dental Disease Detection	Panoramic dental X-ray	`—`	DENTEX	4-class disease detection (caries · deep caries · periapical lesion · impacted). Task score = mAP@0.5.
Wrist Anomaly Detection	Pediatric wrist X-ray	`—`	GRAZPEDWRI-DX	9-class anomaly detection (fracture · bone lesion · foreign body · …). Task score = mAP@0.5.

Sandbox Rules

Policy: Inference-only. No training on benchmark data.
Allowed: data/VinDrCXR_Detection1000/public/ read-only · /workspace/
Forbidden: data/VinDrCXR_Detection1000/private/ · ground-truth files
Limits: Time: 3600 s / run · IoU threshold: 0.5
On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how good the final bbox output was).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%) = weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (Detection, 50%) = mAP@0.5 on the held-out private labels. ∈ [0, 1].
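For readers unfamiliar with mAP@0.5, the sketch below shows the building blocks for a single class on a single image: box IoU, greedy matching at an IoU threshold of 0.5, and all-point interpolated average precision. It is a simplified illustration; the benchmark's exact matching and interpolation protocol is not specified on this page, and full mAP would pool detections across images and then average AP over classes.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_predictions(preds, gts, iou_thr=0.5):
    """preds: list of (score, box). Returns TP flags in descending-score order.

    Each ground-truth box can be matched at most once (greedy, best unmatched IoU).
    """
    matched, flags = set(), []
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j not in matched and iou(box, gt) > best_iou:
                best_j, best_iou = j, iou(box, gt)
        is_tp = best_j >= 0 and best_iou >= iou_thr
        if is_tp:
            matched.add(best_j)
        flags.append(is_tp)
    return flags

def average_precision(flags, n_gt):
    """Area under the precision-recall curve with a monotone precision envelope."""
    if n_gt == 0:
        return 0.0
    tp, precisions, recalls = 0, [], []
    for k, is_tp in enumerate(flags, start=1):
        tp += int(is_tp)
        precisions.append(tp / k)
        recalls.append(tp / n_gt)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 28))]
flags = match_predictions(preds, gts)
ap = average_precision(flags, n_gt=len(gts))
```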

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Stage-by-stage breakdown S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (Classification) = accuracy (correct ÷ patients; a missing prediction counts as wrong).

Per-Task Leaderboard

Difficulty: two tiers; more scaffolding = lower difficulty.

Lite · exact classifier given
Standard · pick from classifier families

Agent Workflow — S1 → S5

Same coding-agent loop as every other track. S4 classifies every patient and writes a predictions.csv (patient_id,label); S5 verifies the predictions and calls submit_results. Scoring is accuracy against the held-out private labels.

Benchmark unit is one image. See the stage-by-stage breakdown for how each stage is scored.

Difficulty Tiers

Dataset

Task	Modality	Task Time Range	Source	Description
Brain Tumor MRI Classification	Brain MRI	`—`	Brain Tumor MRI	4-class single-label (glioma · meningioma · notumor · pituitary), 100 patients balanced. Task score = accuracy.
Colorectal Tissue Histology	Histopathology (H&E)	`—`	NCT-CRC-HE-100K	9-class colorectal tissue (tumor · stroma · lymphocytes · …), 100 cases. Task score = accuracy.
Lymph-Node Metastasis	Histopathology (H&E)	`—`	PatchCamelyon	Binary tumor / normal in 96×96 lymph-node patches, 100 cases. Task score = accuracy.
Chest X-ray Pneumonia	Chest X-ray	`—`	Kermany et al.	Binary pneumonia / normal, 100 cases. Task score = accuracy.
Skin Lesion	Dermoscopy	`—`	ISIC Archive	Multi-class dermoscopic skin-lesion classification, 100 cases. Task score = accuracy.

Sandbox Rules

Policy: Inference-only. No training on benchmark data.
Allowed: data/<dataset>/public/ read-only · /workspace/
Forbidden: data/<dataset>/private/ · ground-truth labels
Limits: Time: 3600 s / run · Metric: accuracy
On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how accurate the final per-patient labels were).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%) = weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (Classification, 50%) = accuracy = correct ÷ patients on the held-out private labels (a missing prediction counts as wrong). ∈ [0, 1].
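The "missing prediction counts as wrong" rule means the denominator is the number of patients in the ground truth, not the number of rows the agent wrote. A minimal Python sketch, assuming a predictions.csv with the `patient_id,label` header described in the workflow (the helper names are illustrative):

```python
import csv
import io

def load_labels(csv_text):
    """Parse CSV text with a patient_id,label header into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["patient_id"]: row["label"] for row in reader}

def accuracy(predictions, ground_truth):
    """correct / patients; a patient without a prediction counts as wrong."""
    correct = sum(
        1 for pid, label in ground_truth.items() if predictions.get(pid) == label
    )
    return correct / len(ground_truth)

truth = load_labels("patient_id,label\np1,glioma\np2,notumor\np3,pituitary\np4,meningioma\n")
preds = load_labels("patient_id,label\np1,glioma\np2,notumor\np3,glioma\n")  # p4 missing
acc = accuracy(preds, truth)
```

Here two of four patients are correct, one is wrong, and one is missing, so the accuracy is 0.5.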

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Stage-by-stage breakdown S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit). ∈ [0, 100].
Task (Synthetic) = mean SSIM of the synthetic CT against the held-out private ground-truth CT.

Failure Rate

Failure = a round whose medal is `fail`: no valid output, or SSIM below the task baseline. The strict CT-SR thresholds (Good ≥ 0.98 / Okay ≥ 0.95) make this common even when the synthetic CT is close.

% of rounds scored as a fail medal.

Per-Task Leaderboard

Difficulty: two tiers; more scaffolding = lower difficulty.

Lite · target model given
Standard · pick from model families

Agent Workflow — S1 → S5

Same coding-agent loop as every other track. S4 generates one synthetic CT per patient and writes agents_outputs/<patient_id>/sct.nii.gz; S5 verifies the volumes and calls submit_results. Scoring is mean SSIM against the held-out private CT.

Benchmark unit is one patient volume (a 3D NIfTI). See the stage-by-stage breakdown for how each stage is scored.

Difficulty Tiers

Dataset

Task	Modality	Task Time Range	Source	Description
Head & Neck MRI-to-CT	MRI → CT	`—`	SynthRAD2025	Generate synthetic CT from head-and-neck MRI, 20 cases. Task score = mean SSIM.
Body CT Super-Resolution	CT (×4 SR)	`—`	CT-ORG	Restore ×4 z-axis–degraded CT volumes, 20 cases. Task score = mean SSIM.
Pancreas CT Super-Resolution	CT (×4 SR)	`—`	MSD Task07 Pancreas	Restore ×4 z-axis–degraded pancreas CT, 20 cases. Task score = mean SSIM.
Whole-Body CT Super-Resolution	CT (×4 SR)	`—`	TotalSegmentator v2	Restore ×4 z-axis–degraded CT volumes, 20 cases. Task score = mean SSIM.

Sandbox Rules

Policy: Inference-only. No training on benchmark data.
Allowed: data/<dataset>/public/ read-only · /workspace/
Forbidden: data/<dataset>/private/ · ground-truth CT
Limits: Time: 3600 s / run · Metric: SSIM
On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the research pipeline) and a Task score (how close the synthetic CT is to the private ground-truth CT, by SSIM).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%) = weighted sum of S1–S5 stage scores, each ∈ [0, 1]. Stays in [0, 1] internally, shown ×100.
Task (Synthetic, 50%) = mean SSIM of the synthetic CT against the held-out private ground-truth CT. ∈ [0, 1].
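As a rough illustration of the metric, the sketch below computes a single-window ("global") SSIM per volume and averages it over patients. This is a deliberate simplification: standard SSIM implementations average the index over local sliding windows, and the page does not state the window, data range, or masking used by the benchmark, so those are all assumptions here.

```python
import numpy as np

def global_ssim(x, y, data_range):
    """SSIM computed from whole-volume statistics (one window covering everything)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilisers from the usual SSIM definition
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return float(num / den)

def mean_ssim(pairs, data_range):
    """pairs: iterable of (synthetic_volume, reference_volume), one per patient."""
    return float(np.mean([global_ssim(s, r, data_range) for s, r in pairs]))

rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1.0, size=(4, 8, 8))
same = global_ssim(reference, reference, data_range=1.0)
inverted = global_ssim(reference, 1.0 - reference, data_range=1.0)
```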

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Stage-by-stage breakdown S1 → S5 · weights & sub-criteria

The Agentic score is a weighted sum across the five stages.

LLM-judge scored: S1–S3 · Continuous: S4 · Discrete: S5

Overall Score — averaged across all tasks

Overall = (Agentic + Task) / 2

Agentic: how well the agent ran the research pipeline. Weighted sum of S1–S5 stage scores (plan, setup, validate, infer, submit), judged from conversation.json. ∈ [0, 100].
Task (Enhancement) = mean SSIM × completion rate.

Failure Rate

Failure = unable to finish the end-to-end workflow or provide a complete submission.

% of repeats that could not be scored.

Per-Task Leaderboard

Difficulty: two tiers; more scaffolding = lower difficulty.

Lite · follows a recipe
Standard · chooses within bounds

Agent Workflow — S1 → S5

Same coding-agent loop. Two-container architecture (v5 gold standard): the agent container does S1–S5 with GPU + network, then a separate eval container with GPU but --network none scores the outputs.

The agent writes to /agent_outputs/; the eval container reads those outputs plus the /data/private/ reference images and writes /results/detail_report.json. See Overall Score for how each metric is scored.

Difficulty Tiers

Lite locks the method (DRUNet for LDCT · Swin2SR for MRI-SR) so results are comparable across agents. Standard opens the choice inside the documented inference-only candidate pool.

Dataset modality · samples · source

Each task is a standalone public medical imaging dataset. Agents run on the same held-out samples across all tiers.

Sandbox Rules 2-container isolation

Policy: Inference-only. .backward(), optimizer.step(), model.train(), and torch.optim.* are all blocked by the sandbox.
Agent container: GPU + bridge network · --read-only rootfs · no-new-privileges · --pids-limit 4096 · 3-layer sandbox (static regex + Python sys.addaudithook + bash wrappers)
Eval container: GPU + --network none · no agent code · writes /results/detail_report.json
Forbidden (agent side): /data/private/ · /eval/ · /results/ · /bands/ · any reference.npy / ground_truth.csv / baseline_bands.json
On violation: Warning (execution blocked; agent continues) · Disqualified (run zeroed; no credit)
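Two of the three sandbox layers named above (the static regex scan and the Python audit hook) can be sketched as follows. This is an illustrative approximation, not the benchmark's sandbox: the patterns, the function names, and the choice to intercept only the `open` audit event are assumptions. `sys.addaudithook` is the standard-library API (Python 3.8+), and hooks installed with it cannot be removed.

```python
import re
import sys

# Layer 1: static scan of agent code for training-related calls.
FORBIDDEN_PATTERNS = [
    re.compile(r"\.backward\s*\("),
    re.compile(r"optimizer\.step\s*\("),
    re.compile(r"model\.train\s*\("),
    re.compile(r"\btorch\.optim\b"),
]

def static_violations(source):
    """Return the patterns that match anywhere in the given source text."""
    return [p.pattern for p in FORBIDDEN_PATTERNS if p.search(source)]

# Layer 2: runtime audit hook that rejects opens under forbidden roots.
FORBIDDEN_PREFIXES = ("/data/private/", "/eval/", "/results/", "/bands/")

def audit_hook(event, args):
    # Only string paths are checked in this sketch; bytes and path-like
    # objects would need os.fspath() handling in a real implementation.
    if event == "open" and args and isinstance(args[0], str):
        if args[0].startswith(FORBIDDEN_PREFIXES):
            raise PermissionError(f"blocked by sandbox policy: {args[0]}")

def install_sandbox_hook():
    """Register the hook for the rest of the interpreter's lifetime."""
    sys.addaudithook(audit_hook)
```

A regex scan alone is easy to evade (for example via `getattr`), which is presumably why the page lists it as only one of three layers.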

Overall Score

Every run gets a single Overall Score (0–100). It is the average of an Agentic score (how well the agent ran the S1–S5 pipeline) and a Task score (how good the final reconstruction was).

Overall = (½ · Agentic + ½ · Task) × 100 → [0, 100]

Agentic (50%) = weighted sum of S1–S5 stage scores, each ∈ [0, 1]. S1/S2/S3 LLM-judged from plan.md + conversation.json; S4/S5 deterministic from output completeness. Shown ×100.
Task (Enhancement, 50%) = mean(per-patient SSIM ∈ [0, 1]) between reconstruction and reference. PSNR and LPIPS are reported as diagnostic signals only.
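PSNR, one of the diagnostic signals mentioned above, follows directly from the mean squared error. A minimal NumPy sketch, assuming the caller supplies the intensity range (`data_range`), which the page does not specify:

```python
import numpy as np

def psnr(reconstruction, reference, data_range):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    recon = np.asarray(reconstruction, dtype=np.float64)
    ref = np.asarray(reference, dtype=np.float64)
    mse = np.mean((recon - ref) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

reference = np.zeros((8, 8))
reconstruction = np.full((8, 8), 0.1)  # constant error of 0.1, so MSE = 0.01
value = psnr(reconstruction, reference, data_range=1.0)
```

With a constant error of 0.1 on a unit range, the MSE is 0.01 and the PSNR is 20 dB.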

Agentic = weighted S1 → S5 ∈ [0, 1]

S1 · PLAN · 25%
S2 · SETUP · 15%
S3 · VALIDATE · 35%
S4 · INFERENCE · 15%
S5 · SUBMIT · 10%

All sub-scores live in [0, 1]; the leaderboard just shows them ×100.

Agentic workflow S1 → S5

The Agentic score is a weighted sum across the five stages.

Team

About

AutoMedBench is a joint effort between University of California, Santa Cruz and NVIDIA, benchmarking autonomous agentic models across the medical AI research evaluation pipeline — plan, setup, validate, infer, submit.

University of California, Santa Cruz

NVIDIA
