The multi-armed bandit (MAB) is one of the foundational problems in sequential decision-making. You have a set of options, each with an unknown payoff, and a limited budget of attempts. Every choice gives you both a reward and information, and you want to accumulate as much reward as possible before your budget runs out. The name is a nod to slot machines, but the problem shows up far beyond casinos as one of the simplest cases of reinforcement learning.

Source: https://www.linkedin.com/pulse/multi-armed-bandits-simplest-reinforcement-learning-setup-alex-yudin-hdrrc/
The MAB problem can be viewed as a trade-off between exploration and exploitation. Every time you pull a lever, you face a fundamental dilemma: do you exploit the arm that looks best so far, or explore a different arm to learn more about it?
If you only exploit, you might lock in on a mediocre arm forever. If you only explore, you waste pulls on bad arms instead of maximizing reward. Striking a balance between the two is crucial.
You have $K$ arms. Each arm $i$ has an unknown reward distribution with mean $\mu_i$. At each time step $t = 1, 2, \ldots, T$, you choose an arm $i_t$ and receive a reward $r_t$ drawn from that arm's distribution.
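This interaction protocol can be sketched in a few lines of code. The sketch below assumes Bernoulli arms (reward 1 with probability $\mu_i$, else 0) and uses a deliberately naive placeholder policy that just cycles through the arms; the class and variable names are illustrative, not from the original post:

```python
import random

class BernoulliBandit:
    """K-armed bandit where arm i pays 1 with probability means[i], else 0."""
    def __init__(self, means, seed=None):
        self.means = means                # true means, unknown to the player
        self.rng = random.Random(seed)

    def pull(self, i):
        # Draw a reward from arm i's distribution.
        return 1 if self.rng.random() < self.means[i] else 0

# Interaction loop: at each step t, choose an arm i_t and observe reward r_t.
bandit = BernoulliBandit([0.2, 0.5, 0.7], seed=0)
T = 1000
total_reward = 0
for t in range(T):
    arm = t % 3                           # placeholder policy: cycle through arms
    total_reward += bandit.pull(arm)
```

A real bandit algorithm would replace the `t % 3` line with a rule that uses the observed rewards to decide which arm to pull next.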
Your goal is to maximize the total reward $\sum_{t=1}^T r_t$. Equivalently, we often talk about minimizing regret, which is the difference between what you earned and what you would have earned had you known the best arm from the start:
$$ \mathrm{Regret}(T) = T \cdot \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right] $$
where $\mu^* = \max_i \mu_i$ is the mean reward of the best arm. A good algorithm has regret that grows slowly with $T$ (ideally logarithmically).
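One simple algorithm that balances the two sides of the dilemma is $\epsilon$-greedy: with probability $\epsilon$ pick a random arm (explore), otherwise pick the arm with the best empirical mean so far (exploit). A minimal sketch, again assuming Bernoulli arms, which also computes the realized regret $T \cdot \mu^* - \sum_t r_t$ (the function name and parameters are illustrative):

```python
import random

def eps_greedy(means, T=5000, eps=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return (total reward, realized regret)."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K          # number of pulls per arm
    estimates = [0.0] * K     # empirical mean reward per arm
    total = 0
    for t in range(T):
        if rng.random() < eps:
            arm = rng.randrange(K)                            # explore
        else:
            arm = max(range(K), key=lambda i: estimates[i])   # exploit
        r = 1 if rng.random() < means[arm] else 0             # pull the arm
        counts[arm] += 1
        estimates[arm] += (r - estimates[arm]) / counts[arm]  # running mean update
        total += r
    regret = T * max(means) - total
    return total, regret

total, regret = eps_greedy([0.2, 0.5, 0.7])
```

For comparison, pulling arms uniformly at random on these means would incur regret of roughly $T(\mu^* - \bar{\mu}) \approx 1167$; $\epsilon$-greedy does much better, though a fixed $\epsilon$ still forces linear regret in the long run, which is why logarithmic-regret algorithms like UCB decay exploration over time.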