Fully Open Meditron: The First Fully Open Clinical LLM Pipeline

Why It Was Selected

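Teams working on clinical AI or medical NLP finally have a complete pipeline that is auditable and reproducible, with no more guessing about data provenance or training details. It can be used directly or as a reference for building your own CDSS, and the implementation details are worth a look.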

AI Summary

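Fully Open Meditron is the first fully open pipeline for building clinical large language models (LLMs). It addresses the problem that existing "open" models release only their weights and lack transparency about data provenance and training procedures. The pipeline comprises a clinician-audited training corpus, a reproducible data construction and training framework, and an evaluation protocol aligned with clinical use. The corpus unifies eight public medical QA datasets and extends them with three kinds of clinician-vetted synthetic data: exam-style QA, QA grounded in 46,469 clinical practice guidelines, and clinical vignettes. Evaluation uses an LLM-as-a-judge protocol calibrated against 204 human raters. Applied to five fully open base models, every MeditronFO variant is preferred over its base; Apertus-70B-MeditronFO improves by 6.6 percentage points on aggregate medical benchmarks (47.2% to 53.8%), setting a new state of the art among fully open models. The results show that a fully open pipeline can reach state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.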

Original · arXiv cs.AI

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.
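
The abstract mentions system-wide decontamination but does not say how it is done. As a rough illustration only, the sketch below uses one common approach, word-level n-gram overlap filtering: any training example that shares an n-gram with a benchmark item is dropped. The function names, the 8-token window, and the tokenization are all assumptions made for this example, not details taken from the paper.

```python
import re
from typing import Iterable


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word-level n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    # Texts shorter than n tokens yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train: Iterable[str], benchmark: Iterable[str], n: int = 8) -> list[str]:
    """Keep only training examples sharing no n-gram with any benchmark item.

    Hypothetical helper: n=8 is an illustrative choice, not the paper's setting.
    """
    banned: set[tuple[str, ...]] = set()
    for item in benchmark:
        banned |= ngrams(item, n)
    return [example for example in train if not (ngrams(example, n) & banned)]


if __name__ == "__main__":
    train = [
        "A 54-year-old man presents with chest pain radiating to the left arm and jaw",
        "Metformin is a first-line therapy for type 2 diabetes",
    ]
    benchmark = [
        "Question: a 54-year-old man presents with chest pain radiating to the left arm. What next?",
    ]
    for kept in decontaminate(train, benchmark):
        print(kept)
```

With these toy inputs, the first training example overlaps the benchmark item and is filtered out, while the second is kept. A real pipeline would likely need fuzzier matching (paraphrases, reordered answer options), which exact n-gram overlap does not catch.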
