Biomedical Visual Instruction Tuning with Clinician Preference Alignment

BioMed-VITAL

1Stanford University, 2Emory University, 3Tongji University,

4University of Washington, 5Massachusetts General Hospital, 6Harvard Medical School

*Equal Contribution



In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both the generation and the selection of instruction data for tuning biomedical multimodal foundation models. In the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations to produce preference-aligned candidate data. In the selection stage, we train a separate selection model that explicitly distills clinician and policy-guided model preferences into a rating function and uses it to pick high-quality data for medical instruction tuning. The model tuned on the resulting instruction-following data shows a relative improvement of 18.5% in open visual chat and a win rate of up to 81.73% on medical VQA.
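As a concrete illustration of the generation stage, the following Python sketch shows how clinician-selected demonstrations could be assembled into a few-shot GPT-4V prompt. The function names, system prompt wording, and model identifier are illustrative assumptions and do not reproduce the exact pipeline used in the paper.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    # Base64-encode a local image for the vision API.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_generation_prompt(demonstrations, image_path, caption):
    # Assemble a few-shot prompt: clinician-selected demonstrations first,
    # then the new image-caption pair for which instruction data is generated.
    messages = [{"role": "system",
                 "content": ("Generate multi-turn instruction-following QA pairs "
                             "about the given biomedical image and its caption.")}]
    for demo_caption, demo_qa in demonstrations:  # (caption, QA text) pairs
        messages.append({"role": "user", "content": f"Caption: {demo_caption}"})
        messages.append({"role": "assistant", "content": demo_qa})
    messages.append({
        "role": "user",
        "content": [
            {"type": "text", "text": f"Caption: {caption}"},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64,"
                                  + encode_image(image_path)}},
        ],
    })
    return messages

# Example call (the model name is a placeholder for a GPT-4V-class endpoint):
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=build_generation_prompt(demos, "figure.jpg", "Chest X-ray showing ..."))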

Contributions

  1. We introduce a data-centric framework, BioMed-VITAL, which generates and selects instruction-following data aligned with clinician preferences for visual instruction tuning. Evaluations indicate improved data quality, and models instruction-tuned on our data achieve a relative improvement of 18.5% in open visual chat and a win rate of up to 81.73% across three biomedical VQA benchmarks.
  2. We propose a paradigm that incorporates clinician preferences during generation, together with an effective data selection model based on a mixture of preferences. Our distilled selection model matches human preferences more closely than GPT-4 judgments (a toy sketch of such a rating model follows this list).
  3. To facilitate further study, we release 80K clinician preference-aligned instruction-following samples generated and selected with our framework, along with the models instruction-tuned on them.
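Below is a minimal, self-contained sketch (not the authors' implementation) of a rating function trained on a mixture of clinician and model-derived preference labels and then used to keep the top-rated samples; the architecture, loss, and mixing weight alpha are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RatingModel(nn.Module):
    # Toy rating function: maps a pooled text-image embedding to a quality score.
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb):                  # emb: (batch, dim)
        return self.scorer(emb).squeeze(-1)  # (batch,) raw scores

def mixed_preference_loss(model, emb, clinician_label, model_label, alpha=0.5):
    # Blend clinician annotations with policy-guided model preferences.
    # Both label tensors are assumed to lie in [0, 1]; alpha weights the two sources.
    pred = torch.sigmoid(model(emb))
    target = alpha * clinician_label + (1.0 - alpha) * model_label
    return F.binary_cross_entropy(pred, target)

def select_top_k(model, embeddings, k):
    # Score every candidate instruction sample and keep the k highest rated.
    with torch.no_grad():
        scores = model(embeddings)
    return torch.topk(scores, k).indices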

Multimodal Medical Instruction-Following Data

Based on the PMC-15M dataset, we interact with GPT-4V and collect 80K unique language-image instruction-following samples in total. Please check out "BioMed-VITAL-Instruct-80K" on [HuggingFace Dataset]; a minimal loading sketch follows the table below.

Data file name                        File Size   Sample Size
BioMed-VITAL-instructions-80K.json    142 MB      80K
BioMed-VITAL-instructions-150K.json   309 MB      60K + 10K + 80K
BioMed-VITAL-instructions-210K.json   462 MB      60K + 10K + 80K + 60K
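The sketch below shows one way to load and inspect a released JSON file; the field names (image, conversations, from, value) assume a LLaVA-style schema and should be verified against the actual files.

import json

# File name taken from the table above.
with open("BioMed-VITAL-instructions-80K.json", "r") as f:
    samples = json.load(f)
print(f"{len(samples)} instruction-following samples loaded")

# Field names below assume a LLaVA-style schema; verify against the actual file.
example = samples[0]
print(example.get("image"))
for turn in example.get("conversations", []):
    print(turn["from"], ":", turn["value"][:120])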

Case Study

Biomedical Visual Instruction-Following Example

BibTeX


@misc{cui2024biomedical,
  title={Biomedical Visual Instruction Tuning with Clinician Preference Alignment}, 
  author={Hejie Cui and Lingjun Mao and Xin Liang and Jieyu Zhang and Hui Ren and Quanzheng Li and Xiang Li and Carl Yang},
  year={2024},
  eprint={2406.13173},
  archivePrefix={arXiv}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA and LLaVA-Med teams for giving us access to their models, as well as open-source projects including BioMed-CLIP.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaVA, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

License

The source code of this repository is released under the Apache License 2.0. The model license and dataset license are listed on their corresponding webpages.