Policy Gradients In Reinforcement Learning Explained

Learn all about policy gradient algorithms based on likelihood ratios (REINFORCE): the intuition, the derivation, the ‘log trick’, and update rules for Gaussian and softmax policies.


Wouter van Heeswijk, PhD · Towards Data Science · Apr 9, 2022

Photo by Scott Webb on Unsplash


When I first studied policy gradient algorithms, I did not find them particularly easy to fathom. Intuitively they seemed straightforward enough — sample actions, observe rewards, tweak the policy — but the initial idea was followed by many lengthy derivations, calculus tricks I had long forgotten, and an overwhelming amount of notation. At a certain point, it just became a blur of probability distributions and gradients.

In this article, I try to explain the concept step by step, including the key thought processes and mathematical operations. Admittedly, it’s a bit of a long read and requires some preliminary knowledge of Reinforcement Learning (RL), but hopefully it sheds some light on the idea behind policy gradients. The focus is on likelihood ratio policy gradients, which form the foundation of classical algorithms such as REINFORCE/vanilla policy gradient.

Given the length, let’s structure this article up front:

  1. Value approximation: learning deterministic policies
  2. Policy approximation methods: moving to stochastic policies
  3. Establishing the objective function
  4. Defining trajectory probabilities
  5. Introducing the policy gradient
  6. Deriving the policy gradient
  7. Gradient of the log probability function
  8. Approximating the gradient
  9. Defining the update rule
  10. Examples: Softmax and Gaussian policies
  11. Loss functions and automated gradient calculations
  12. Algorithmic implementation (REINFORCE)

I. Value approximation: learning deterministic policies

The objective of RL is to learn a good decision-making policy π that maximizes rewards over time. Although the notion of a (deterministic) policy π might seem a bit…