Dissecting “Reinforcement Learning” by Richard S. Sutton with custom Python implementations, Episode V

In our previous post, we wrapped up the introductory series on fundamental reinforcement learning (RL) techniques by exploring Temporal-Difference (TD) learning. TD methods merge the strengths of Dynamic Programming (DP) and Monte Carlo (MC) methods, leveraging their best features to form some of the most important RL algorithms, such as Q-learning.
Building on that foundation, this post delves into n-step TD learning, a versatile approach introduced in Chapter 7 of Sutton’s book [1]. This method bridges the gap between classical TD and MC techniques. Like TD, n-step methods use bootstrapping (leveraging prior estimates), but they also incorporate the next n rewards, offering a unique blend of short-term and long-term learning. In a future post, we’ll generalize this idea even further with eligibility traces.
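To make the idea concrete before we dive into the chapter, here is a minimal sketch (not taken from the accompanying repository) of how an n-step return combines the next n rewards with a bootstrapped value estimate; the function name n_step_return and its arguments are illustrative assumptions:

```python
def n_step_return(rewards, bootstrap_value, gamma, n):
    """Sketch of the n-step return: the next n discounted rewards plus a
    discounted bootstrapped value estimate for the state reached after n steps.
    `rewards` holds R_{t+1}, ..., R_{t+n}; `bootstrap_value` stands in for V(S_{t+n})."""
    assert len(rewards) == n
    discounted_rewards = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma**n * bootstrap_value


# Example: a 3-step return with discount factor 0.9
print(n_step_return([1.0, 0.0, 2.0], bootstrap_value=0.5, gamma=0.9, n=3))
```

With n = 1 this reduces to the one-step TD target, and as n grows toward the episode length it approaches the full Monte Carlo return, which is exactly the spectrum n-step methods span.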
We’ll follow a structured approach, starting with the prediction problem before moving to control. Along the way, we’ll:
Introduce n-step Sarsa,
Extend it to off-policy learning,
Explore the n-step tree backup algorithm, and
Present a unifying perspective with n-step Q(σ).
As always, you can find all accompanying code on GitHub. Let’s dive in!