The multi-armed bandit (MAB) is one of the foundational problems in sequential decision-making. You have a set of options, each with an unknown payoff, and a limited budget of attempts. Every choice gives you both a reward and information, and you want to accumulate as much reward as possible before your budget runs out. The name is a nod to slot machines, but the problem shows up far beyond casinos as one of the simplest cases of reinforcement learning.

Source: https://www.linkedin.com/pulse/multi-armed-bandits-simplest-reinforcement-learning-setup-alex-yudin-hdrrc/
The MAB problem can be viewed as a trade-off between exploration and exploitation. Every time you pull a lever, you face a fundamental dilemma: do you exploit the arm that looks best so far, or explore a different arm to learn more about it?
If you only exploit, you might lock in on a mediocre arm forever. If you only explore, you waste pulls on bad arms instead of maximizing reward. Striking a balance between the two is crucial.
You have $K$ arms. Each arm $i$ has an unknown reward distribution with mean $\mu_i$. At each time step $t = 1, 2, \ldots, T$, you choose an arm $i_t$ and receive a reward $r_t$ drawn from that arm's distribution.
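This interaction protocol can be sketched in a few lines of code. The sketch below assumes Bernoulli arms (reward 1 with probability $\mu_i$, else 0) and uses a deliberately naive placeholder policy that just cycles through the arms; the class and variable names are illustrative, not from the original post:

```python
import random

class BernoulliBandit:
    """K-armed bandit where arm i pays 1 with probability means[i], else 0."""
    def __init__(self, means, seed=None):
        self.means = means                # true means, unknown to the player
        self.rng = random.Random(seed)

    def pull(self, i):
        # Draw a reward from arm i's distribution.
        return 1 if self.rng.random() < self.means[i] else 0

# Interaction loop: at each step t, choose an arm i_t and observe reward r_t.
bandit = BernoulliBandit([0.2, 0.5, 0.7], seed=0)
T = 1000
total_reward = 0
for t in range(T):
    arm = t % 3                           # placeholder policy: cycle through arms
    total_reward += bandit.pull(arm)
```

A real bandit algorithm would replace the `t % 3` line with a rule that uses the observed rewards to decide which arm to pull next.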
Your goal is to maximize the total reward $\sum_{t=1}^T r_t$. Equivalently, we often talk about minimizing regret, which is the difference between what you earned and what you would have earned had you known the best arm from the start:
$$ \mathrm{Regret}(T) = T \cdot \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right] $$
where $\mu^* = \max_i \mu_i$ is the mean reward of the best arm. A good algorithm has regret that grows slowly with $T$ (ideally logarithmically).
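One simple algorithm that balances the two sides of the dilemma is $\epsilon$-greedy: with probability $\epsilon$ pick a random arm (explore), otherwise pick the arm with the best empirical mean so far (exploit). A minimal sketch, again assuming Bernoulli arms, which also computes the realized regret $T \cdot \mu^* - \sum_t r_t$ (the function name and parameters are illustrative):

```python
import random

def eps_greedy(means, T=5000, eps=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return (total reward, realized regret)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # number of pulls per arm
    estimates = [0.0] * K     # empirical mean reward per arm
    total = 0
    for t in range(T):
        if rng.random() < eps:
            arm = rng.randrange(K)                            # explore
        else:
            arm = max(range(K), key=lambda i: estimates[i])   # exploit
        r = 1 if rng.random() < means[arm] else 0             # pull the arm
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]  # running mean update
        total += r
    regret = T * max(means) - total
    return total, regret

total, regret = eps_greedy([0.2, 0.5, 0.7])
```

For comparison, pulling arms uniformly at random on these means would incur regret of roughly $T(\mu^* - \bar{\mu}) \approx 1167$; $\epsilon$-greedy does much better, though a fixed $\epsilon$ still forces linear regret in the long run, which is why logarithmic-regret algorithms like UCB decay exploration over time.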