A Hybridized Bayesian Parametric-Nonparametric Approach to the Pure Exploration Problem
Information-driven approaches to reinforcement learning (RL) and bandit problems largely rely on optimizing an expectation of Kullback-Leibler (KL) divergence values. Although the KL divergence can bound models of the problem domain, existing information-driven approaches provide no bounds on the expected KL divergence itself. We therefore focus our investigation on the pure exploration problem, a key component of RL and bandit problems, in which the objective is to efficiently gain knowledge about the problem domain. For this task, we develop an algorithm based on a Poisson exposure process Cox Gaussian process (Pep-CGP), a hybridized Bayesian parametric-nonparametric Lévy process, and theoretically derive a bound on the Pep-CGP expectation of the KL divergence. Our algorithm, Real-time Adaptive Prediction of Time-varying and Obscure Rewards (RAPTOR), is validated on four real-world datasets, where it outperforms baseline pure exploration approaches.
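The abstract does not spell out the Pep-CGP construction, but a Cox process driven by a Gaussian process intensity is a standard parametric-nonparametric hybrid. The following is a minimal sketch, assuming a log-Gaussian Cox process with a squared-exponential kernel and thinning-based simulation; the hyperparameters and grid are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid over the domain [0, T] (illustrative choices, not from the paper)
T, n = 10.0, 200
t = np.linspace(0.0, T, n)

# Squared-exponential GP prior over the log-intensity
# (hypothetical hyperparameters)
def rbf_kernel(x, y, variance=1.0, lengthscale=1.0):
    d = x[:, None] - y[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

K = rbf_kernel(t, t) + 1e-8 * np.eye(n)      # jitter for numerical stability
f = rng.multivariate_normal(np.zeros(n), K)  # one GP draw (nonparametric part)
lam = np.exp(f)                              # positive Poisson intensity

# Thinning (Lewis/Ogata): simulate a homogeneous Poisson process at the
# maximum intensity, then keep each candidate point with probability
# lam(t) / lam_max (parametric Poisson part conditioned on the GP).
lam_max = lam.max()
n_cand = rng.poisson(lam_max * T)
cand = rng.uniform(0.0, T, n_cand)
keep = rng.uniform(0.0, lam_max, n_cand) < np.interp(cand, t, lam)
events = np.sort(cand[keep])

print(f"Sampled {events.size} events from one Cox-process realization")
```

Because the intensity itself is random, event counts vary across realizations of the GP draw; an exploration algorithm in this setting must reason about uncertainty in both the intensity function and the observed events.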