Biomedical Visual Instruction Tuning with Clinician Preference Alignment


1Stanford University, 2Emory University, 3Tongji University,

4University of Washington, 5Massachusetts General Hospital, 6Harvard Medical School

*Equal Contribution

In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both the generation and the selection of instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations so that the candidate data it produces is aligned with clinician preferences. Then, during the selection stage, we train a separate selection model that explicitly distills clinician and policy-guided model preferences into a rating function, which is used to select high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method improves significantly in open visual chat (an 18.5% relative improvement) and medical VQA (win rate up to 81.73%).
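The selection stage described above can be pictured as scoring every generated candidate with the rating function and keeping the best ones. The following is a minimal sketch under stated assumptions, not the released implementation: `select_top_rated`, `toy_rate`, and the `answer` field are hypothetical names, and the toy rating function (answer length in words) merely stands in for the trained selection model.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_rated(
    candidates: Sequence[T], rate: Callable[[T], float], keep: int
) -> List[T]:
    """Return the `keep` highest-rated candidates, best first."""
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # sorted() is stable, so ties keep their original relative order.
    ranked = sorted(candidates, key=rate, reverse=True)
    return ranked[:keep]


def toy_rate(sample: dict) -> float:
    """Hypothetical stand-in for the learned rating function."""
    return float(len(sample["answer"].split()))


if __name__ == "__main__":
    pool = [
        {"id": "a", "answer": "short"},
        {"id": "b", "answer": "a much longer answer here"},
        {"id": "c", "answer": "two words"},
    ]
    print([s["id"] for s in select_top_rated(pool, toy_rate, keep=2)])
```

In practice the rating function would be the trained selection model, and `keep` would be set by the target dataset size.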


  1. We introduce a data-centric framework, BioMed-VITAL, which generates and selects instruction-following data aligned with clinician preference for visual instruction tuning. Evaluation indicates improved data quality, and our instruction-tuned models improve substantially in both open visual chat (an 18.5% relative improvement) and three biomedical VQA benchmarks (win rate up to 81.73%).
  2. We propose a paradigm that involves clinician preference during generation, together with an effective data selection model based on a mixture of preferences. Our distilled data selection model matches human preferences more closely than the judgments of GPT-4 do.
  3. To facilitate further study, we release 80K clinician preference-aligned instruction-following samples generated and selected by our framework, along with the models instruction-tuned on them.
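One common way to distill preference judgments into a scalar rating function is a pairwise (Bradley-Terry style) ranking loss: the sample preferred by the annotator should receive the higher score. This page does not state the training objective of the selection model, so the sketch below is an assumption for illustration only, and `pairwise_preference_loss` is a hypothetical name.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Compute -log(sigmoid(score_preferred - score_rejected)) stably.

    The loss is small when the preferred sample is scored well above the
    rejected one, and grows roughly linearly when the order is reversed.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is only ever called on a non-positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Equal scores carry no preference signal, so the loss is log(2).
    print(pairwise_preference_loss(0.0, 0.0))
    # A correctly ordered pair costs less than a misordered one.
    print(pairwise_preference_loss(2.0, 0.0), pairwise_preference_loss(0.0, 2.0))
```

Averaging this loss over clinician-annotated and model-annotated pairs would be one way to mix the two preference sources; how the two are actually weighted is not specified here.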

Multimodal Medical Instruction-Following Data

Based on the PMC-15 dataset, we query GPT-4V and collect 80K unique language-image instruction-following samples in total. Please check out "BioMed-VITAL-Instruct-80K" on [HuggingFace Dataset].

Data file name                        File size   Sample size
BioMed-VITAL-instructions-80K.json    142 MB      80K
BioMed-VITAL-instructions-150K.json   309 MB      60K + 10K + 80K
BioMed-VITAL-instructions-210K.json   462 MB      60K + 10K + 80K + 60K
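The sample-size column lists the subsets that make up each file, and the parts add up to the number in the file name. The sketch below checks that arithmetic and shows one way to count the samples in a downloaded file. It assumes each file is a single top-level JSON array with one element per sample, which this page does not confirm; `MIX_PARTS_IN_THOUSANDS` and `count_samples` are hypothetical names.

```python
import json

# Subset sizes (in thousands) as listed in the table above.
MIX_PARTS_IN_THOUSANDS = {
    "BioMed-VITAL-instructions-80K.json": [80],
    "BioMed-VITAL-instructions-150K.json": [60, 10, 80],
    "BioMed-VITAL-instructions-210K.json": [60, 10, 80, 60],
}


def count_samples(path: str) -> int:
    """Count samples in a file, assuming a top-level JSON array."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of samples")
    return len(data)


if __name__ == "__main__":
    for name, parts in MIX_PARTS_IN_THOUSANDS.items():
        print(name, "->", sum(parts), "K samples expected")
```

Calling `count_samples` on a local copy of one of the files and comparing the result with the expected total is a quick sanity check after download.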

Case Study

Biomedical Visual Instruction-Following Example


  title={Biomedical Visual Instruction Tuning with Clinician Preference Alignment}, 
  author={Hejie Cui and Lingjun Mao and Xin Liang and Jieyu Zhang and Hui Ren and Quanzheng Li and Xiang Li and Carl Yang},


This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA and LLaVA-Med teams for giving us access to their models, as well as open-source projects including BioMed-CLIP.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaVA, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.


The source code of this repository is released under the Apache License 2.0. The model license and dataset license are listed on their corresponding webpages.