Regret Analysis of Multi Armed Bandits

Anand Deo, TIFR


Friday, 6 July 2018, 17:15 to 18:15


  • A-201 (STCS Seminar Room)


Consider a system with K arms, arm $i$ yielding a payoff $X_{i}$, according to the distribution $P_{i}$. Suppose now, a gambler who is unaware of $P_{i}$, samples rewardsfrom a sequence of arms, with the objective of maximizing his total expected reward. TheMulti Armed Bandit (MAB) problem consists of finding a (stochastic) sequence of armpulls that achieves a large expected reward. This is equivalent to minimizing regret, orthe difference between the expected payoff of a genie strategy, which knows the best armis beforehand, and the chosen sequence of arm pulls. In 1985, Lai and Robbins provedthat regret grows at least at the rate log(n), where n is the total number of armspulled. In this talk, we will see the construction of simple strategies, whichasymptotically achieve this bound.  This talk is based on the work of Auer, Cesa-Bianchi, Fischer (2002). I will assume only basic familiarity with probability for this talk.