Biomedical Visual Instruction Tuning with Clinician Preference Alignment

BioMed-VITAL

1Stanford University, 2Emory University, 3University of California, Berkeley,

4University of Washington, 5Massachusetts General Hospital, 6Harvard Medical School

*Equal Contribution



In this work, we propose a data-centric framework, Biomedical Visual Instruction Tuning with Clinician Preference Alignment (BioMed-VITAL), that incorporates clinician preferences into both the generation and the selection of instruction data for tuning biomedical multimodal foundation models. First, during the generation stage, we prompt the GPT-4V generator with a diverse set of clinician-selected demonstrations to produce preference-aligned data candidates. Then, during the selection stage, we train a separate selection model that explicitly distills clinician preferences and policy-guided model preferences into a rating function, which selects high-quality data for medical instruction tuning. Results show that the model tuned with the instruction-following data from our method achieves a significant improvement in open visual chat (18.5% relative improvement) and medical VQA (win rate up to 81.73%).
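To make the generation stage concrete, the snippet below is a minimal, assumption-laden sketch of few-shot prompting with the gpt-4-vision-preview API. The demonstration structure, prompt wording, and helper names (CLINICIAN_DEMOS, build of the prompt, encode_image) are hypothetical illustrations, not the exact prompts used in BioMed-VITAL.

```python
# Minimal sketch (not the exact BioMed-VITAL prompts): few-shot prompting of
# GPT-4V with clinician-selected demonstrations to generate instruction-data
# candidates for a new biomedical image. Demo content and prompt text are
# hypothetical placeholders.
import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical clinician-selected demonstrations: (caption, QA dialogue) pairs.
CLINICIAN_DEMOS = [
    {"caption": "Chest X-ray showing right lower lobe consolidation.",
     "dialogue": "Q: What abnormality is visible? A: A consolidation in the right lower lobe ..."},
]

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def generate_candidate(image_path: str, caption: str) -> str:
    demo_text = "\n\n".join(
        f"Caption: {d['caption']}\nConversation:\n{d['dialogue']}" for d in CLINICIAN_DEMOS
    )
    prompt = (
        "You are generating multi-round visual QA for biomedical images. "
        "Follow the style of these clinician-selected examples:\n\n"
        f"{demo_text}\n\nNow generate a conversation for the new image.\n"
        f"Caption: {caption}"
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content
```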

Contribution

  1. We introduce a data-centric framework, BioMed-VITAL, which generates and selects instruction-following data aligned with clinician preferences for visual instruction tuning. Evaluation indicates improved data quality, and our instruction-tuned models show notable gains in both open visual chat (18.5% relative improvement) and three biomedical VQA benchmarks (win rate up to 81.73%).
  2. We propose a paradigm that involves clinician preferences during data generation and an effective data selection model based on a mixture of preferences (a minimal sketch of the selection step follows this list). Our distilled data selection model is shown to match human preferences better than GPT-4 judgments.
  3. To facilitate further study, we release 80K clinician preference-aligned instruction-following samples generated and selected with our framework, along with the models instruction-tuned on them.
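As a rough illustration of the selection step referenced above, the sketch below scores each generated candidate with a learned rating function and keeps the top-rated fraction. The rating-function interface and the keep ratio are hypothetical placeholders, not the actual BioMed-VITAL selection model.

```python
# Minimal sketch of preference-distilled data selection (hypothetical interface):
# score every candidate with a learned rating function, keep the top fraction.
from typing import Callable

def select_top_candidates(
    candidates: list[dict],
    rate: Callable[[dict], float],  # rating function distilled from clinician
                                    # and model preferences (assumed interface)
    keep_ratio: float = 0.5,
) -> list[dict]:
    scored = sorted(candidates, key=rate, reverse=True)
    n_keep = max(1, int(len(scored) * keep_ratio))
    return scored[:n_keep]

# Usage (with a hypothetical trained rating model):
# selected = select_top_candidates(candidates, rate=rating_model.score, keep_ratio=0.4)
```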

Multimodal Medical Instruction-Following Data

Based on the PMC-15M dataset, we used the gpt-4-vision-preview API to generate multi-round QA instruction data and applied a two-stage clinician preference alignment process, selecting 60K and 80K language-image instruction-following samples. We then combined the selected 80K samples with the 10K and 60K samples provided by LLaVA-Med, resulting in a larger dataset of 150K samples (80K + 10K + 60K). We also provide an intermediate dataset of 60K samples that incorporates only the second-stage preference distillation; merging it with the 150K set yields a dataset of 210K samples (80K + 10K + 60K + 60K). [HuggingFace Dataset]

Data file name                         File Size   Sample Size
BioMed-VITAL-instructions-60K.json     127 MB      60K
BioMed-VITAL-instructions-80K.json     156 MB      80K
BioMed-VITAL-instructions-150K.json    309 MB      150K (80K + 10K + 60K)
BioMed-VITAL-instructions-210K.json    463 MB      210K (80K + 10K + 60K + 60K)
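To give a sense of how the released files can be used, here is a minimal loading sketch using Python's standard json module. The field names accessed below (id, image, conversations) assume a LLaVA-style instruction format and are an assumption, not a documented schema; check the actual keys in the downloaded file.

```python
# Minimal sketch: load one of the released instruction files and inspect a sample.
# The keys used below (id, image, conversations) are assumed to follow a
# LLaVA-style format; verify against the actual file contents.
import json

with open("BioMed-VITAL-instructions-80K.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} samples")
first = samples[0]
print(first.get("id"), first.get("image"))
for turn in first.get("conversations", []):
    print(turn)
```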

You can download the original images from the following link:

Data file name          File Size
PMC_image_urls.jsonl    129 MB
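The snippet below is a rough sketch for fetching the original images from a JSONL file of URLs. The per-record keys ("image" for the output file name and "url" for the source address) are assumptions; adjust them to the actual fields in PMC_image_urls.jsonl.

```python
# Minimal sketch: download images listed in PMC_image_urls.jsonl.
# The record keys "image" and "url" are assumed, not documented; adjust as needed.
import json
import pathlib
import requests

out_dir = pathlib.Path("pmc_images")
out_dir.mkdir(exist_ok=True)

with open("PMC_image_urls.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        target = out_dir / record["image"]              # assumed file-name field
        if target.exists():
            continue
        resp = requests.get(record["url"], timeout=30)  # assumed URL field
        resp.raise_for_status()
        target.write_bytes(resp.content)
```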

Case Study

Biomedical Visual Instruction-Following Example

BibTeX


@misc{cui2024biomedical,
  title={Biomedical Visual Instruction Tuning with Clinician Preference Alignment}, 
  author={Hejie Cui and Lingjun Mao and Xin Liang and Jieyu Zhang and Hui Ren and Quanzheng Li and Xiang Li and Carl Yang},
  year={2024},
  eprint={2406.13173},
  archivePrefix={arXiv}
}
  
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaVA and LLaVA-Med teams for providing access to their models, as well as open-source projects including BioMed-CLIP.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that comply with the license agreements of CLIP, LLaVA, and GPT-4. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset should not be used outside of research purposes.

License

The source code of this repository is released under the Apache License 2.0. The model license and dataset license are listed on their corresponding webpages.