Dangers of Applying Off-Policy Reinforcement Learning to Healthcare
Off-policy reinforcement learning learns and evaluates sequential decision-making policies using data collected under current practice. In healthcare, off-policy evaluation works around the unethical alternative of testing on patients purely for learning purposes. However, practical application is fraught with potential complications. We develop a simulator of sepsis treatment that lets us control the severity of these complications and quantify their effects: 1) Evaluation is challenging when few or no observed treatment sequences follow the learned policy. We demonstrate how learning and evaluation improve when a near-optimal policy with sufficient variation is in place. Additionally, we show how function approximation, especially in continuous or high-dimensional state-action spaces, may result in dangerous treatment recommendations. 2) Confounding factors that affect prescribed treatments and outcomes may be unrecorded and reflected only in a noisy proxy. We quantify how bias towards poor policies and variance in estimating their effects increase with proxy noise. 3) Model choice may not match underlying physiological mechanisms. We examine how the average-case analysis typical of reinforcement learning may fail to detect consistently poor policies for small subgroups of patients. This work serves as a warning to the clinical machine learning field: off-policy reinforcement learning can lead to poor treatment suggestions and incorrect estimates of patient outcomes, including mortality.
Authors: Christina X. Ji, Fredrik D. Johansson, David Sontag
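
To make point 1 concrete, the following minimal sketch (not the paper's sepsis simulator; the toy environment, function names, and probabilities are illustrative assumptions) shows ordinary importance-sampling off-policy evaluation in a small two-action setting. When the behavior policy rarely takes the actions the learned target policy recommends, only a few trajectories receive non-negligible weight, so the value estimate becomes high-variance even though it is unbiased.

```python
# Toy illustration of importance-sampling (IS) off-policy evaluation:
# estimate the value of a "learned" target policy from trajectories
# generated by a different behavior policy.
import numpy as np

rng = np.random.default_rng(0)

def run_episode(behavior_p1, horizon=5):
    """One episode in a toy chain; action 1 has higher expected reward."""
    actions, rewards = [], []
    for _ in range(horizon):
        a = int(rng.random() < behavior_p1)  # P(action = 1) under behavior
        actions.append(a)
        rewards.append(rng.normal(loc=1.0 if a else 0.2, scale=1.0))
    return np.array(actions), np.array(rewards)

def is_estimate(behavior_p1, target_p1, n_episodes=2000, horizon=5):
    """Per-trajectory IS value estimate and effective sample size (ESS)."""
    weights, returns = [], []
    for _ in range(n_episodes):
        actions, rewards = run_episode(behavior_p1, horizon)
        # product over time of target/behavior action probabilities
        ratio = np.prod(np.where(actions == 1,
                                 target_p1 / behavior_p1,
                                 (1 - target_p1) / (1 - behavior_p1)))
        weights.append(ratio)
        returns.append(rewards.sum())
    w = np.array(weights)
    ess = w.sum() ** 2 / (w ** 2).sum()      # effective sample size
    value = np.mean(w * np.array(returns))   # ordinary IS estimate
    return value, ess

target_p1 = 0.95  # learned policy almost always takes the better action
for behavior_p1 in [0.9, 0.5, 0.1]:
    v, ess = is_estimate(behavior_p1, target_p1)
    print(f"behavior P(a=1)={behavior_p1:.1f}  IS value={v:6.2f}  ESS={ess:7.1f}")
```

As the overlap between behavior and target policies shrinks, the effective sample size collapses and the estimate is dominated by a handful of heavily weighted trajectories, mirroring the evaluation difficulty described in point 1 when few observed treatment sequences follow the learned policy.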