Policy Gradients In Reinforcement Learning Explained

Learn all about policy gradient algorithms based on likelihood ratios (REINFORCE): the intuition, the derivation, the ‘log trick’, and update rules for Gaussian and softmax policies.


Wouter van Heeswijk, PhD · Towards Data Science · Apr 9, 2022

Photo by Scott Webb on Unsplash


When I first studied policy gradient algorithms, I did not find them particularly easy to fathom. Intuitively they seemed straightforward enough — sample actions, observe rewards, tweak the policy — but the initial idea was followed by many lengthy derivations, calculus tricks I had long forgotten, and an overwhelming amount of notation. At a certain point, it just became a blur of probability distributions and gradients.

In this article, I try to explain the concept step by step, including the key thought processes and mathematical operations. Admittedly, it’s a bit of a long read and requires some preliminary knowledge of Reinforcement Learning (RL), but hopefully it sheds some light on the idea behind policy gradients. The focus is on likelihood ratio policy gradients, which form the foundation of classical algorithms such as REINFORCE/vanilla policy gradient.

Given the length, let’s structure this article up front:

  1. Value approximation: learning deterministic policies
  2. Policy approximation methods: moving to stochastic policies
  3. Establishing the objective function
  4. Defining trajectory probabilities
  5. Introducing the policy gradient
  6. Deriving the policy gradient
  7. Gradient of the log probability function
  8. Approximating the gradient
  9. Defining the update rule
  10. Examples: Softmax and Gaussian policies
  11. Loss functions and automated gradient calculations
  12. Algorithmic implementation (REINFORCE)

I. Value approximation: learning deterministic policies

The objective of RL is to learn a good decision-making policy π that maximizes rewards over time. Although the notion of a (deterministic) policy π might seem a bit…