Biomedical Visual Instruction Tuning with Clinician Preference Alignment

BioMed-VITAL

1Stanford University, 2Emory University, 3University of California, Berkeley,

4University of Washington, 5Massachusetts General Hospital, 6Harvard Medical School

*Equal Contribution



In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both the generation and the selection of instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations to produce preference-aligned data candidates. Then, during the selection stage, we train a separate selection model that explicitly distills clinician preferences and policy-guided model preferences into a rating function, which selects high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method achieves a significant improvement in open visual chat (18.5% relative improvement) and medical VQA (win rate up to 81.73%).
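To make the generation stage concrete, the snippet below is a minimal, assumption-laden sketch of few-shot prompting with the gpt-4-vision-preview API. The demonstration structure, prompt wording, and helper names (CLINICIAN_DEMOS, build of the prompt, encode_image) are hypothetical illustrations, not the exact prompts used in BioMed-VITAL.

```python
# Minimal sketch (not the exact BioMed-VITAL prompts): few-shot prompting of
# GPT-4V with clinician-selected demonstrations to generate instruction-data
# candidates for a new biomedical image. Demo content and prompt text are
# hypothetical placeholders.
import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical clinician-selected demonstrations: (caption, QA dialogue) pairs.
CLINICIAN_DEMOS = [
    {"caption": "Chest X-ray showing right lower lobe consolidation.",
     "dialogue": "Q: What abnormality is visible? A: A consolidation in the right lower lobe ..."},
]

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def generate_candidate(image_path: str, caption: str) -> str:
    demo_text = "\n\n".join(
        f"Caption: {d['caption']}\nConversation:\n{d['dialogue']}" for d in CLINICIAN_DEMOS
    )
    prompt = (
        "You are generating multi-round visual QA for biomedical images. "
        "Follow the style of these clinician-selected examples:\n\n"
        f"{demo_text}\n\nNow generate a conversation for the new image.\n"
        f"Caption: {caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content
```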

Contribution

  1. We introduce a data-centric framework, BioMed-VITAL, which generates and selects instruction-following data aligned with clinician preferences for visual instruction tuning. Evaluation indicates improved data quality, and our instruction-tuned models show notable gains in both open visual chat (18.5% relative improvement) and three biomedical VQA benchmarks (win rate up to 81.73%).
  2. We propose a paradigm that involves clinician preferences during data generation and an effective data selection model based on a mixture of preferences (a minimal sketch of the selection step follows this list). Our distilled data selection model is shown to match human preferences better than GPT-4 judgments.
  3. To facilitate further study, we release 80K clinician preference-aligned instruction-following samples generated and selected with our framework, along with the models instruction-tuned on them.
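As a rough illustration of the selection step referenced above, the sketch below scores each generated candidate with a learned rating function and keeps the top-rated fraction. The rating-function interface and the keep ratio are hypothetical placeholders, not the actual BioMed-VITAL selection model.

```python
# Minimal sketch of preference-distilled data selection (hypothetical interface):
# score every candidate with a learned rating function, keep the top fraction.
from typing import Callable

def select_top_candidates(
    candidates: list[dict],
    rate: Callable[[dict], float],  # rating function distilled from clinician
                                    # and model preferences (assumed interface)
    keep_ratio: float = 0.5,
) -> list[dict]:
    scored = sorted(candidates, key=rate, reverse=True)
    n_keep = max(1, int(len(scored) * keep_ratio))
    return scored[:n_keep]

# Usage (with a hypothetical trained rating model):
# selected = select_top_candidates(candidates, rate=rating_model.score, keep_ratio=0.4)
```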

Multimodal Medical Instruction-Following Data

Based on the PMC-15M dataset, we used the gpt-4-vision-preview API to generate multi-round QA instruction data and applied a two-stage clinician preference alignment process, selecting 60K and 80K language-image instruction-following samples. We then combined the selected 80K samples with the 10K and 60K samples provided by LLaVA-Med, resulting in a larger dataset of 150K samples (80K + 10K + 60K). We also provide an intermediate dataset of 60K samples that incorporates only the second-stage preference distillation; merging it with the 150K set yields a dataset of 210K samples (80K + 10K + 60K + 60K). [HuggingFace Dataset]

Data file name                         File Size   Sample Size
BioMed-VITAL-instructions-60K.json     127 MB      60K
BioMed-VITAL-instructions-80K.json     156 MB      80K
BioMed-VITAL-instructions-150K.json    309 MB      150K (80K + 10K + 60K)
BioMed-VITAL-instructions-210K.json    463 MB      210K (80K + 10K + 60K + 60K)
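To give a sense of how the released files can be used, here is a minimal loading sketch using Python's standard json module. The field names accessed below (id, image, conversations) assume a LLaVA-style instruction format and are an assumption, not a documented schema; check the actual keys in the downloaded file.

```python
# Minimal sketch: load one of the released instruction files and inspect a sample.
# The keys used below (id, image, conversations) are assumed to follow a
# LLaVA-style format; verify against the actual file contents.
import json

with open("BioMed-VITAL-instructions-80K.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} samples")
first = samples[0]
print(first.get("id"), first.get("image"))
for turn in first.get("conversations", []):
    print(turn)
```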

You can download the original images from the following link:

Data file name          File Size
PMC_image_urls.jsonl    129 MB
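The snippet below is a rough sketch for fetching the original images from a JSONL file of URLs. The per-record keys ("image" for the output file name and "url" for the source address) are assumptions; adjust them to the actual fields in PMC_image_urls.jsonl.

```python
# Minimal sketch: download images listed in PMC_image_urls.jsonl.
# The record keys "image" and "url" are assumed, not documented; adjust as needed.
import json
import pathlib
import requests

out_dir = pathlib.Path("pmc_images")
out_dir.mkdir(exist_ok=True)

with open("PMC_image_urls.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        target = out_dir / record["image"]              # assumed file-name field
        if target.exists():
            continue
        resp = requests.get(record["url"], timeout=30)  # assumed URL field
        resp.raise_for_status()
        target.write_bytes(resp.content)
```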

Case Study

Biomedical Visual Instruction-Following Example

BibTeX


@misc{cui2024biomedical,
  title={Biomedical Visual Instruction Tuning with Clinician Preference Alignment}, 
  author={Hejie Cui and Lingjun Mao and Xin Liang and Jieyu Zhang and Hui Ren and Quanzheng Li and Xiang Li and Carl Yang},
  year={2024},
  eprint={2406.13173},
  archivePrefix={arXiv}
}
  
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA and LLaVA-Med teams for providing access to their models, as well as open-source projects including BioMed-CLIP.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that comply with the license agreements of CLIP, LLaVA, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.

License

The source code of this repository is released under the Apache License 2.0. The model license and dataset license are listed on their corresponding webpages.