
Summary

This blog post explores the practical application of Reinforcement Learning (RL) through Markov Decision Processes (MDPs) using OpenAI Gym. It begins with an introduction to RL and MDPs, highlighting the significance of OpenAI Gym in RL development. The post then covers implementations of classic RL algorithms like Value Iteration, Policy Iteration, and Q-Learning on simple MDPs such as FrozenLake-v0. Additionally, it addresses challenges posed by custom MDPs and continuous state spaces, offering insights and solutions for handling them effectively. Through examples and references, the post emphasizes the adaptability and potential of RL in diverse domains, underscoring the importance of understanding its core concepts and limitations.

Introduction

Reinforcement Learning (RL) is a powerful subset of machine learning where agents interact with an environment to hone their decision-making skills. At the core of RL lie Markov Decision Processes (MDPs), providing a mathematical structure to define states, actions, rewards, and the dynamics of how an environment transitions over time. OpenAI Gym has become an indispensable toolkit within the RL community, offering a standardized set of environments and streamlined tools for developing, testing, and comparing different RL algorithms.

In this blog post, we'll dive into practical implementations of classic RL algorithms using OpenAI Gym. We'll start with a simple MDP and then move on to setting up custom MDPs and tackling the challenges of discretization for continuous state spaces. Along the way, we'll also weave in the historical context that shaped these techniques. This article contains relevant code snippets, but you can also follow along by playing around with the Colab notebook gym_examples.ipynb.

Hello World

To get our hands dirty, let’s pick FrozenLake-v0, a simple MDP from Gym’s library. In this grid-world environment, our agent is located at a start tile and aims to reach a goal tile while avoiding holes. The agent can move up, down, left or right at each step. Let’s solve this MDP using the classical algorithms.

1. Value Iteration

Value Iteration (VI) has roots in the dynamic programming concepts pioneered by Richard Bellman [1]. At its core, this method updates state values based on the values of future states, governed by the Bellman equation. Starting from an initialization of state values (often zeros), each iteration refines the value estimate of a state using the maximum expected future reward achievable from neighboring states, taking the discount factor into account. We then extract the optimal policy from the converged value function (check out these lecture notes for a proof of convergence). For the implementation of VI, we use bettermdptools to create a planner object and run the VI algorithm with it.
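
Concretely, each sweep of Value Iteration applies the Bellman optimality update to every state s:

V_{k+1}(s) = max_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V_k(s') ]

where P(s'|s,a) is the transition probability, R(s,a,s') the one-step reward, and γ the discount factor.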

Value iteration implementation
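
The gist itself isn't embedded here, but a minimal sketch of the bettermdptools call might look like the following. The import path and signature follow the library's documented Planner interface and may differ slightly across versions, so treat this as an outline rather than the notebook's exact code. The snippets in this post use the Gymnasium package (the maintained continuation of OpenAI Gym) and its five-tuple step API; older gym releases differ slightly in the reset/step signatures.

import gymnasium as gym
from bettermdptools.algorithms.planner import Planner  # path may vary by library version

env = gym.make("FrozenLake-v1")  # FrozenLake-v0 in older Gym releases
# Planner works off the environment's transition model env.P.
V, V_track, pi = Planner(env.unwrapped.P).value_iteration(gamma=0.99)
# Policy Iteration and Q-Learning follow the same pattern:
#   Planner(env.unwrapped.P).policy_iteration() and RL(env).q_learning() from
#   bettermdptools.algorithms.rl (again, assuming the documented interface).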

Internally, VI iterates until the largest update to any state's value falls below a threshold. In each iteration, we use the Bellman equation to update the value of each state. After convergence, we construct the optimal policy by choosing, for each state, the action that maximizes its value. The optimal policy learned from Value Iteration can then be used to run an episode of the Frozen Lake problem by repeatedly applying the policy to step through states until we hit a terminal state. If you're following along in the notebook, verify that the agent successfully reaches the goal state and gets a reward of 1.
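
To illustrate what happens under the hood, here is a compact NumPy version of the same loop written directly against Gym's FrozenLake transition model (env.P maps each state and action to a list of (probability, next_state, reward, done) tuples). This is a sketch for exposition, not the bettermdptools implementation.

import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
P = env.unwrapped.P                      # P[s][a] -> list of (prob, next_state, reward, done)
n_states, n_actions = env.observation_space.n, env.action_space.n
gamma, theta = 0.99, 1e-8

V = np.zeros(n_states)
while True:
    delta = 0.0
    for s in range(n_states):
        # Bellman optimality backup for state s.
        q = [sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
             for a in range(n_actions)]
        best = max(q)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:                    # largest update across states is below the threshold
        break

# Extract the greedy policy from the converged value function.
policy = [int(np.argmax([sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
                         for a in range(n_actions)]))
          for s in range(n_states)]

# Roll out one episode with the learned policy. FrozenLake is slippery by default,
# so a single episode is not guaranteed to reach the goal.
state, _ = env.reset(seed=0)
done = False
while not done:
    state, reward, terminated, truncated, _ = env.step(policy[state])
    done = terminated or truncated
print("episode reward:", reward)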

2. Policy Iteration

Policy Iteration (PI), like Value Iteration, is rooted in dynamic programming principles [2]. PI tackles the MDP by iteratively refining both the policy and its associated value function. Starting with a random policy, each iteration consists of two phases. In the evaluation phase, the value function is estimated under the current policy. Then, the improvement phase leverages these values to greedily update the policy in each state. The algorithm continues iterating until the policy stabilizes (again, refer to these lecture notes for a proof of convergence). The code below implements this algorithm.

Policy iteration implementation
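
For reference, a bare-bones version of the two phases might look like this; as with the Value Iteration sketch above, it is written directly against env.P for exposition and is not the bettermdptools code from the gist.

import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
P = env.unwrapped.P
n_states, n_actions = env.observation_space.n, env.action_space.n
gamma, theta = 0.99, 1e-8

policy = np.zeros(n_states, dtype=int)          # start from an arbitrary initial policy
while True:
    # Evaluation phase: iterate the Bellman expectation backup under the current policy.
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2] * (not done))
                    for p, s2, r, done in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break

    # Improvement phase: act greedily with respect to the evaluated value function.
    new_policy = np.array([
        np.argmax([sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
                   for a in range(n_actions)])
        for s in range(n_states)
    ])
    if np.array_equal(new_policy, policy):      # stop once the policy is stable
        break
    policy = new_policy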

Again, if you're following the Colab notebook, the converged policy learned by the Policy Iteration algorithm should be able to find the goal state.

3. Q-Learning

Q-Learning departs from the previous two methods as a model-free, off-policy reinforcement learning algorithm. This means that Q-Learning doesn't require explicit knowledge of the environment's transition dynamics. At its core, Q-Learning directly learns the optimal Q-values (action-values) through experience. This is done by iteratively updating Q-values based on the rewards received and an estimate of future Q-values, guided by the Q-learning update rule [3]. The code to use Q-Learning via bettermdptools is shown below (lines 1-4). This implementation of Q-Learning supports decay schedules for epsilon and the learning rate; check the Q-learning README for more customization options.
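
For reference, the update applied after observing a transition (s, a, r, s') is:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]

where α is the learning rate and γ the discount factor.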

Epsilon-greedy Q-Learning implementation

In the gist above, I've also written a basic epsilon-greedy Q-Learning algorithm, where the agent explores randomly with probability epsilon and otherwise exploits the best action according to the learned Q-table (a minimal sketch of such a loop appears after the list below). This implementation is retained for anyone who wants to compare and analyze the benefits of epsilon decay; otherwise, the implementation from bettermdptools should be preferred. Q-Learning relies on a careful balance of exploration and exploitation to converge to an optimal policy. For anyone wondering, it was important to randomly break ties among actions that share the same Q-value to make this work: initially all Q-values are zero, so without the randomization the agent's trials all led it to holes, which give zero reward in Gym's version of Frozen Lake. The agent successfully learns to reach the goal state in around 1000 episodes of training for this toy problem. However, for more complicated MDPs, you might need to:

  • Tune the learning rate and exploration rate, and their decay schedules
  • Reduce the MDP size to ensure that the agent has enough chances to learn from rewards
  • Modify the reward structure by introducing more frequent rewards
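
A minimal sketch of such an epsilon-greedy loop, including the random tie-breaking mentioned above, is shown below. It uses fixed hyperparameters and no epsilon decay, so treat it as illustrative rather than as the notebook's implementation.

import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon, episodes = 0.1, 0.99, 0.1, 1000
rng = np.random.default_rng(0)

def greedy_with_random_ties(q_row):
    # Break ties uniformly at random; otherwise argmax always picks action 0
    # while all Q-values are still zero and the agent never finds the sparse reward.
    return rng.choice(np.flatnonzero(q_row == q_row.max()))

for _ in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Explore with probability epsilon, otherwise exploit the current Q-table.
        if rng.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = greedy_with_random_ties(Q[state])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-Learning update: move Q(s, a) toward the bootstrapped target.
        target = reward + gamma * (0.0 if terminated else Q[next_state].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state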

Custom MDPs: Extending OpenAI Gym’s Reach

While Gym offers a diverse set of environments, sometimes you'll want to design an MDP specifically tailored to your research or real-world problem. To that end, there is flexibility to create custom MDPs that can reuse existing Gym components (like the different types of spaces: boxes, graphs, etc.). This is done by extending the gym.Env class; let's walk through a simple custom MDP.

Example: The Resource Gathering Problem

Let’s imagine a scenario where an agent navigates a grid world containing resources and obstacles. The goal is to collect as many resources as possible within a limited number of steps. Here’s how we could design this as an MDP:

  • States: The agent’s position on the grid and the number of steps taken.
  • Actions: Movement (up, down, left, right).
  • Transitions: The agent's movement updates its position on the grid. An episode terminates when the agent hits an obstacle or a resource. We stop after hitting a resource to simplify the custom MDP implementation; otherwise the state space (env.observation_space) becomes a more complicated tuple of the form (grid_position, resources_collected, steps_taken), and the existing PI and VI implementations would need to be updated to handle multidimensional state spaces.
  • Rewards:
    • Positive reward for hitting a resource cell. To make the problem slightly more interesting, we uniformly sample this reward from the range [5, 15], so we can observe the trade-off between a higher reward and a longer distance from the start.
    • -100 for hitting an obstacle.
    • -1 per step to encourage efficient solutions.

Implementation

The following gist highlights some of the important aspects of implementing a custom MDP in Gym. It is necessary to define the action_space and observation_space attributes of the environment (as per the documentation). This environment also supports custom maps passed in through the constructor; otherwise we randomly generate a valid grid. Besides that, we also define the transition matrix self.P so that the Value Iteration and Policy Iteration algorithms can work. This is not always possible for real-world problems; we discuss the reasons and possible workarounds in the next subsection.

Gist for the custom MDP implementation

Other functions that need to be implemented are listed below (a trimmed skeleton tying these pieces together follows the list):

  • step(): perform the given action at the current state, and return a tuple of:
    • observation: for this case, it is just the next state.
    • reward: the scalar reward for the transition.
    • terminated: a bool indicating whether a terminal state was reached.
    • truncated: a bool indicating premature termination; it can be used to signal stoppage before a terminal state is reached, for example when a limit on the maximum number of steps is hit.
    • info: an arbitrary dictionary containing additional information for logging/debugging.
  • reset(): reset the state to a start position, and return a tuple of (observation, info dictionary) as defined above.
  • [optional] render(): GUI or string representation of the environment for debugging or visualization
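
To make the pieces above concrete, here is a heavily trimmed skeleton of what such an environment might look like. The class name, grid layout, and reward constants are illustrative stand-ins, not the notebook's actual implementation (which, for instance, samples resource rewards from [5, 15]).

import gymnasium as gym
from gymnasium import spaces

class ResourceGatheringEnv(gym.Env):
    """Trimmed sketch of a grid world with resources ('R'), obstacles ('X'), and empty cells ('.')."""

    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}      # up, down, left, right

    def __init__(self, grid=None):
        self.grid = grid if grid is not None else ["....R",
                                                   ".X...",
                                                   "...X.",
                                                   ".R...",
                                                   "....."]
        self.n = len(self.grid)
        self.action_space = spaces.Discrete(4)                      # required attribute
        self.observation_space = spaces.Discrete(self.n * self.n)   # flattened (row, col)
        self.rewards = {"R": 10.0, "X": -100.0, ".": -1.0}          # fixed stand-in values
        self.P = self._build_transition_matrix()                    # lets VI/PI run on this env
        self.state = 0

    def _build_transition_matrix(self):
        # P[s][a] = [(prob, next_state, reward, terminal)]; moves are deterministic here.
        P = {}
        for r in range(self.n):
            for c in range(self.n):
                s = r * self.n + c
                P[s] = {}
                for a, (dr, dc) in self.MOVES.items():
                    nr = min(max(r + dr, 0), self.n - 1)             # clip at the grid boundary
                    nc = min(max(c + dc, 0), self.n - 1)
                    cell = self.grid[nr][nc]
                    terminal = cell in "RX"                          # episode ends on resource or obstacle
                    P[s][a] = [(1.0, nr * self.n + nc, self.rewards[cell], terminal)]
        return P

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = 0                                               # start at the top-left corner
        return self.state, {}

    def step(self, action):
        _, next_state, reward, terminated = self.P[self.state][action][0]
        self.state = next_state
        return next_state, reward, terminated, False, {}             # (obs, reward, terminated, truncated, info)

    def render(self):
        # Pretty-print the grid, boxing the agent's current cell.
        for r in range(self.n):
            print(" ".join("[{}]".format(self.grid[r][c]) if r * self.n + c == self.state
                           else " {} ".format(self.grid[r][c]) for c in range(self.n)))

Because the entries in self.P follow the same (prob, next_state, reward, terminal) convention as FrozenLake, handing it to the planner code used earlier should work without changes.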

If you're following the notebook, the agent should be able to reach the resource cell with the highest value of (reward - distance from the start cell) using the policies learned by all of the above algorithms. The notebook uses a 10×10 grid, so it is a fairly simple problem for all the algorithms. For rendering, we pretty-print the grid with 'X' representing an obstacle, '.' representing an empty cell, and 'R(value)' representing a resource cell with the corresponding reward value. A boxed '[ ]' cell denotes the agent's current position.

Transition Probabilities

Obtaining accurate transition probabilities for an MDP can be very challenging in many real-world scenarios. Some of the reasons for this are:

  • Large state spaces (think a continuous state-space MDP) can make storing the transition matrix impossible
  • Systems are often intricate and involve factors beyond our complete control. It might be impractical or impossible to capture every possible state transition and its probability.
  • Real-world environments often have inherent stochasticity and the outcome might vary due to unforeseen factors.
  • In many cases, the agent might not have access to all the information about its environment, making it challenging to determine the exact state and the resulting transition probabilities.

The PI and VI algorithms rely on the exact transition matrix to converge to an optimal policy. Setting aside continuous state spaces for now (see the next section for those), one approach to get around these limitations is to estimate the transition function. This can be done through controlled MDP trials, where the agent is allowed to explore and we collect (state, action, next_state, reward) samples. These samples can be used to estimate the transition matrix. By incorporating domain knowledge, we can gather more samples from the stages of an episode where we need more certainty, in order to increase the confidence in those estimates.
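
As a rough sketch of the estimation step (a hypothetical helper, not something from the notebook): count the observed (state, action, next_state) triples from exploratory rollouts, normalize the counts into probabilities, and average the observed rewards alongside.

from collections import defaultdict
import numpy as np

def estimate_transition_model(samples):
    """Build an approximate P[s][a] = [(prob, next_state, avg_reward, done), ...] table
    from exploratory (state, action, next_state, reward, done) samples."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = defaultdict(list)
    dones = {}
    for s, a, s2, r, done in samples:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)].append(r)
        dones[(s, a, s2)] = done
    P = defaultdict(dict)
    for (s, a), next_states in counts.items():
        total = sum(next_states.values())
        P[s][a] = [(n / total, s2, float(np.mean(rewards[(s, a, s2)])), dones[(s, a, s2)])
                   for s2, n in next_states.items()]
    return P

Rollouts with a random policy can supply the samples, and the estimated table can then be fed to the same VI/PI code, with the caveat that rarely visited state-action pairs will have noisy or missing entries.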

Bridging the Gap: Tackling MDPs with Continuous State Spaces

In our exploration of MDPs so far, we've primarily focused on environments with discrete state spaces. However, many real-world scenarios involve continuous state variables, making direct application of the standard VI and PI algorithms challenging. For example, consider the CartPole problem [4], where a cart moves along a 1D track and has a pole attached to it by a hinge that constrains the pole's rotation to the plane along the track (see figure below). The action space is {+1, -1}, corresponding to applying a fixed force to the cart along the positive or negative x-axis. The goal of the agent is to keep the pole upright for as long as possible, so a reward of +1 is given for every timestep that the episode continues. Termination happens when the pole falls over (its angle with the vertical exceeds a certain threshold). The state is described by four continuous variables: the cart's position and velocity, and the pole's angle with the vertical and its angular velocity, making this a continuous state space MDP.

Figure for the Cart Pole system
Source: Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983) [4]

The continuous nature of the state variables leads to an infinite number of possible states, rendering exact tabular representation within these algorithms impractical. Additionally, as the number of continuous state dimensions increases, the computational complexity of algorithms designed for discrete spaces explodes. This section touches on some approaches for handling MDPs with continuous state spaces.

In a basic grid-based discretization setup, we divide each continuous attribute into intervals, thereby creating a multidimensional grid (four dimensions for the CartPole problem). Each cell in the grid can then be mapped to a discrete state, and we can either apply a model-free learning algorithm like Q-Learning or estimate the transition probabilities in order to use the PI and VI algorithms. The discretization granularity becomes an important hyperparameter here, because the state space size grows as the inverse of the bin size in each dimension. With four continuous dimensions, this can already become too large for the PI or VI algorithms we discussed. On the other hand, large bin sizes can end up masking important distinctions needed for decision making. Understanding this trade-off between finer resolution and computational cost motivates careful tuning or a search for alternative approaches.
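
A sketch of that binning for CartPole is shown below: np.digitize maps each of the four continuous state variables to a bin index, and the indices are combined into a single integer id that tabular methods can use. The bin counts and value ranges are arbitrary choices for illustration (the velocities are unbounded in the real environment, so we clip them to a finite range).

import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1")
n_bins = 10
lows  = np.array([-2.4, -3.0, -0.21, -3.0])    # position, velocity, angle, angular velocity
highs = np.array([ 2.4,  3.0,  0.21,  3.0])
edges = [np.linspace(l, h, n_bins - 1) for l, h in zip(lows, highs)]

def discretize(obs):
    # Map each continuous variable to a bin index, then flatten the four indices to one id.
    idx = [int(np.digitize(x, e)) for x, e in zip(obs, edges)]
    state = 0
    for i in idx:
        state = state * n_bins + i
    return state                                # an integer in [0, n_bins**4)

obs, _ = env.reset()
print(discretize(obs))                          # usable as a row index into a Q-table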

Oftentimes, the value function for a large continuous state space can be approximated as a linear function of a much smaller feature space. Radial basis functions and Fourier series analysis are common approaches for extracting a non-linear feature representation of the original state space. We can use this approximate value function to define an approximate Q-value function over states and actions via the feature space. Refer to Section 2.1 of [5] for more details on this approach. Going beyond the linear case, we can also aim for non-linear approximations. Deep Q-Networks (DQNs) [6] use deep neural networks to approximate the value function. DQNs also employ strategies like a replay buffer, which trains the network on memories from various stages of an episode within the same batch. An intuition behind this strategy is that it allows the agent to focus on the important decisions without forgetting what it learned at the beginning of an episode. This lets the agent capture complex relationships between states and values, but it requires careful training and can be computationally expensive.
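
As a small illustration of the linear case, the sketch below builds radial basis features of the CartPole state and performs a semi-gradient Q-Learning update on one weight vector per action. The centers, widths, step size, and episode count are arbitrary illustrative choices, not tuned values.

import numpy as np
import gymnasium as gym

env = gym.make("CartPole-v1")
rng = np.random.default_rng(0)

# Radial basis features: similarity of the state to a fixed set of random centers.
n_features, sigma = 64, 0.5
centers = rng.uniform(low=[-2.4, -3.0, -0.21, -3.0],
                      high=[2.4, 3.0, 0.21, 3.0],
                      size=(n_features, 4))

def features(obs):
    return np.exp(-np.sum((obs - centers) ** 2, axis=1) / (2 * sigma ** 2))

# Linear approximation: Q(s, a) = w[a] . phi(s).
w = np.zeros((env.action_space.n, n_features))
alpha, gamma, epsilon = 0.01, 0.99, 0.1

for _ in range(200):                                   # a short training run for illustration
    obs, _ = env.reset()
    done = False
    while not done:
        phi = features(obs)
        q = w @ phi
        action = env.action_space.sample() if rng.random() < epsilon else int(np.argmax(q))
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Semi-gradient Q-Learning update on the weights of the taken action.
        target = reward + gamma * (0.0 if terminated else np.max(w @ features(next_obs)))
        w[action] += alpha * (target - q[action]) * phi
        obs = next_obs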

Note: Function approximation in complex RL environments is an open problem, with plenty left for all of you to explore!

Conclusion

In this article, we delved into the fundamentals of Reinforcement Learning with a focus on Markov Decision Processes and their practical implementation using OpenAI Gym. Starting with classic algorithms like Value Iteration, Policy Iteration, and Q-Learning on a simple MDP, we explored the complexities of designing custom MDPs and the challenges posed by continuous state spaces. Reinforcement Learning provides a powerful framework for modeling sequential decision-making problems in diverse domains. While the core concepts are rooted in dynamic programming, their successful application in real-world scenarios demands adaptability. Whether it's designing custom environments tailored to your problem, obtaining transition probabilities, or tackling the complexities of continuous states, understanding the limitations and potential solutions is key.

References

  1. Bellman, R. (1957). Dynamic Programming. Princeton, NJ, USA: Princeton University Press.
  2. Howard, R. A. (1960). Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press.
  3. Watkins, C. J. C. H. (1989). Learning from delayed rewards.
  4. Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5), 834-846.
  5. Van Hasselt, H. (2012). Reinforcement learning in continuous state and action spaces. In Reinforcement Learning: State-of-the-Art (pp. 207-251). Berlin, Heidelberg: Springer Berlin Heidelberg.
  6. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  7. Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  8. Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. The Journal of Machine Learning Research, 4, 1107-1149.