Dissecting “Reinforcement Learning” by Richard S. Sutton with Custom Python Implementations, Episode III

We continue our deep dive into Sutton’s great book on RL [1] and here focus on Monte Carlo (MC) methods. These are able to learn from experience alone, i.e. they do not require any kind of model of the environment, as is e.g. required by the dynamic programming (DP) methods we introduced in the previous post.
This is extremely appealing, since often the model is not known, or the transition probabilities are hard to model. Consider the game of Blackjack: even though we fully understand the game and its rules, solving it via DP methods would be very tedious. We would have to compute all kinds of probabilities, e.g. given the cards played so far, how likely is a “blackjack”, how likely is it that another seven is dealt … With MC methods, we don’t have to deal with any of this: we simply play and learn from experience.
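To make this concrete, here is a minimal sketch of first-visit MC prediction on Blackjack. It is not the book’s code; it assumes the `Blackjack-v1` environment from `gymnasium` and a simple fixed policy that sticks on 20 or 21 and hits otherwise. The point is just that we estimate state values by playing episodes and averaging the observed returns, with no model of the dealing probabilities anywhere.

```python
# Minimal sketch of first-visit MC prediction (assumes gymnasium's Blackjack-v1).
from collections import defaultdict
import gymnasium as gym

env = gym.make("Blackjack-v1")

def policy(state):
    # Stick (0) on 20 or 21, hit (1) otherwise -- the policy from Sutton's example.
    player_sum, _dealer_card, _usable_ace = state
    return 0 if player_sum >= 20 else 1

returns = defaultdict(list)  # state -> list of observed returns
V = defaultdict(float)       # state -> estimated value

for _ in range(50_000):
    # Sample one full episode by simply playing the game.
    episode = []
    state, _ = env.reset()
    done = False
    while not done:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, reward))
        state = next_state
        done = terminated or truncated

    # First-visit MC: average the return following the first occurrence of each state.
    first_visit = {}
    for i, (s, _) in enumerate(episode):
        first_visit.setdefault(s, i)

    G = 0.0  # Blackjack is undiscounted, so the return is just the summed reward.
    for i in reversed(range(len(episode))):
        s, r = episode[i]
        G += r
        if first_visit[s] == i:
            returns[s].append(G)
            V[s] = sum(returns[s]) / len(returns[s])
```

After enough episodes, `V` approximates the state-value function of this stick-on-20 policy; we will build on this kind of loop throughout the post.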
Because they do not use a model, MC methods are unbiased. They are conceptually simple and easy to understand, but they exhibit high variance and do not bootstrap, i.e. they do not update value estimates from other estimates the way DP does.
As mentioned, here we will introduce these methods following Chapter 5 of Sutton’s book…