Bayesian Uncertainty Quantification in Medical Question Answering

Ritik Bompilwar · Northeastern University

2024 · LLM uncertainty estimation

Overview

This project integrates Bayesian uncertainty quantification into large language models to build a more trustworthy medical question-answering system. Using Monte Carlo Dropout (p = 0.1, 10 stochastic forward passes) on LoRA-fine-tuned models, the system produces a confidence signal alongside each prediction and analyses uncertainty through entropy and mutual information. This is a key requirement for healthcare AI, where a calibrated "I'm not sure" matters as much as the answer itself.
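The entropy and mutual-information analysis described above can be illustrated with a minimal sketch. It assumes each stochastic forward pass has already been reduced to a probability vector over the answer options; the source does not say how this is done, and the function names, the dictionary keys, and the use of natural logarithms are all illustrative choices rather than the project's actual code. In a real setup the dropout layers would also need to stay active at inference time so that the passes differ.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a single probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mc_dropout_uncertainty(pass_probs):
    """Summarise T stochastic passes (hypothetical helper).

    pass_probs: list of T probability vectors, one per dropout pass,
    each over the same set of answer options.
    """
    n_passes = len(pass_probs)
    n_options = len(pass_probs[0])

    # Predictive distribution: average the per-pass probabilities.
    mean_probs = [
        sum(p[k] for p in pass_probs) / n_passes for k in range(n_options)
    ]

    # Total uncertainty: entropy of the averaged distribution.
    predictive_entropy = entropy(mean_probs)
    # Average per-pass entropy: uncertainty that remains within each pass.
    expected_entropy = sum(entropy(p) for p in pass_probs) / n_passes
    # Mutual information: disagreement between passes.
    mutual_information = predictive_entropy - expected_entropy

    prediction = max(range(n_options), key=lambda k: mean_probs[k])
    return {
        "prediction": prediction,
        "mean_probs": mean_probs,
        "predictive_entropy": predictive_entropy,
        "mutual_information": mutual_information,
    }


# Usage: 10 passes that agree exactly, versus passes that flip between options.
agreeing = [[0.7, 0.1, 0.1, 0.05, 0.05]] * 10
flipping = [[0.9, 0.025, 0.025, 0.025, 0.025],
            [0.025, 0.9, 0.025, 0.025, 0.025]] * 5
print(mc_dropout_uncertainty(agreeing)["mutual_information"])
print(mc_dropout_uncertainty(flipping)["mutual_information"])
```

When every pass returns the same distribution, the mutual information is essentially zero and all uncertainty comes from the distribution itself. When passes disagree, the mutual information grows, which is the signal that can be surfaced as "I'm not sure".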

Results (MedQA)

Model	Accuracy	Avg. entropy
Llama 3.1 8B	54%	1.13
Mistral 7B	31%	4.79
Gemma 7B	21%	12.45

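On the MedQA benchmark (10,178 train / 1,272 test, 5-option MCQ), Llama 3.1 8B was both the most accurate and the most confident: its low average entropy (1.13) tracked its higher accuracy (54%). Accuracy fell as average entropy rose, with Mistral 7B in between and Gemma 7B at the other extreme, where a broad entropy distribution (average 12.45) reflected far less reliable predictions (21% accuracy, close to the 20% chance level for five options).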
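The source does not state the log base or the outcome space over which these average entropies were computed. As a reference point when reading them, the entropy of any distribution over K outcomes cannot exceed log K, the value reached by the uniform distribution. The short sketch below (with a hypothetical helper name) computes that bound for the five answer options. Averages above it would suggest the entropy was taken over a larger space, such as the token vocabulary, but that is an inference the source does not confirm.

```python
import math


def max_entropy(n_outcomes, base=math.e):
    """Entropy of the uniform distribution over n_outcomes.

    This is the upper bound for any distribution on that many outcomes.
    """
    return math.log(n_outcomes) / math.log(base)


print(round(max_entropy(5), 3))     # 1.609 (nats)
print(round(max_entropy(5, 2), 3))  # 2.322 (bits)
```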

Citation

@article{bompilwar2025bayesian,
  title  = {Bayesian Uncertainty Quantification in Large Language Models for Medical Question Answering},
  author = {Bompilwar, Ritik},
  institution = {Northeastern University},
  year   = {2025}
}