Bayesian Uncertainty Quantification in Medical Question Answering
Overview
This project integrates Bayesian uncertainty quantification into large language models to build a more trustworthy medical question-answering system. Using Monte Carlo Dropout (dropout probability p = 0.1, 10 stochastic forward passes) on LoRA-fine-tuned models, the system produces a confidence signal alongside each prediction and analyses uncertainty through entropy and mutual information. This is a key requirement for healthcare AI, where a calibrated "I'm not sure" matters as much as the answer itself.
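The two uncertainty measures named above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the stochastic forward passes have already been collected into an array of per-option probabilities of shape (passes, questions, options), uses natural logarithms (nats), and the function name `mc_dropout_uncertainty` is hypothetical. The probabilities are synthetic and only mimic 10 passes over two 5-option questions.

```python
import numpy as np


def mc_dropout_uncertainty(probs, eps=1e-12):
    """Summarise T stochastic forward passes (hypothetical helper).

    probs: array of shape (T, N, C) holding the softmax output of each of
    T dropout-enabled passes for N questions with C answer options.
    Returns the mean distribution, the predictive entropy, and the
    mutual information, all per question.
    """
    probs = np.asarray(probs, dtype=float)
    mean_probs = probs.mean(axis=0)  # (N, C): average over passes

    # Total uncertainty: entropy of the averaged distribution.
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)

    # Average entropy of the individual passes.
    expected_entropy = -np.sum(probs * np.log(probs + eps), axis=-1).mean(axis=0)

    # Mutual information: the part of the uncertainty caused by the passes
    # disagreeing with each other.
    mutual_information = predictive_entropy - expected_entropy
    return mean_probs, predictive_entropy, mutual_information


# Synthetic example: 10 passes, 5 answer options (assumed values, not results).
T, C = 10, 5
confident = np.full(C, 0.01)
confident[0] = 0.96

# Question 0: every pass gives the same confident distribution.
agree = np.tile(confident, (T, 1))

# Question 1: each pass is confident, but about a different option.
disagree = np.stack([np.roll(confident, t % C) for t in range(T)])

probs = np.stack([agree, disagree], axis=1)  # shape (T, 2, C)
mean_probs, entropy, mi = mc_dropout_uncertainty(probs)

print("predictive entropy:", entropy)
print("mutual information:", mi)
```

In this toy case the first question has low entropy and near-zero mutual information because the passes agree, while the second has high values for both because the passes contradict one another. High mutual information of this kind is the sort of signal that could prompt an "I'm not sure" response.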
Results (MedQA)
| Model | Accuracy | Avg. entropy |
|---|---|---|
| Llama 3.1 8B | 54% | 1.13 |
| Mistral 7B | 31% | 4.79 |
| Gemma 7B | 21% | 12.45 |
On the MedQA benchmark (10,178 train / 1,272 test, 5-option MCQ), Llama 3.1 8B was both the most accurate and the most confident: its low average entropy went hand in hand with its higher accuracy, whereas Gemma's broad entropy distribution reflected far less reliable predictions.
Citation
@article{bompilwar2025bayesian,
title = {Bayesian Uncertainty Quantification in Large Language Models for Medical Question Answering},
author = {Bompilwar, Ritik},
institution = {Northeastern University},
year = {2025}
}