OpenAI Gym | Simple SARSA and Q-Learning Reinforcement Learning Implementations

Vibhav Inna Kedege
12 min read · Sep 25, 2020
https://www.youtube.com/watch?v=ODpZ6IdsPBg&t=1s

In this blog post, Reinforcement Learning has been used to solve OpenAI Gym problems (these environments can be found here). More specifically, the SARSA and Q-Learning algorithms have been used to solve the problems of:

  1. Balancing a pole on a moving cart
  2. Driving a car up a mountain.

The aim of solving multiple problems using two reinforcement learning algorithms is to compare each algorithm in terms of the episodes taken for learning and the resulting behaviour of the agent in the environment.

To facilitate such an analysis in a methodical manner, this report is structured as follows: section 1 covers the related work that helped with the implementation and analysis. Section 2 gives an overview of the important aspects of each algorithm. Section 3 analyses the environments/problems being solved. Section 4 presents the results of the learning and the important inferences made, followed by section 5, a sensitivity analysis that studies the algorithms more deeply by varying important hyper-parameters. Section 6 concludes with links to the code and videos.

1 | Literature References

OpenAI Gym environments were chosen to compare the algorithms, mainly because of the ease of interfacing with agents and because the toolkit supports multiple environments. Ever since its inception in 2016, Gym has been widely used and many implementations have been built on top of it. In this report, the SARSA and Q-Learning implementations are similar to the Q-Learning implementation made by Sentdex in this link. For the analysis and conclusions drawn, the major reference book is Sutton and Barto, which can be found in this link.

2 | Algorithms

To solve the two problems mentioned above, the reinforcement learning algorithms SARSA and Q-Learning have been used. Both are explained in detail in this section.

SARSA

State, Action, Reward, next State, next Action, abbreviated as SARSA, is an on-policy temporal difference algorithm. Temporal difference learning is a type of model-free learning that learns by bootstrapping from the current value function estimate. Further, on-policy means that the agent learns the Q-values based on the policy the agent itself is carrying out.

SARSA Algorithm (source)

In the above algorithm, the learning rate (alpha), discount factor (gamma), exploration rate (epsilon), the number of episodes and a Q-table of size (states x actions) are first set up. Following this, multiple episodes of training are run, each consisting of a number of steps. At each step, the agent takes an action from its current state and records the resulting reward and next state. The corresponding action value is read from the Q-table and updated using the SARSA update equation shown in line 9 of the figure. The state and action are then replaced with the new state-action pair and the next step begins. An episode terminates after a predefined maximum number of steps or when the goal of the environment has been reached.
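To make this concrete, below is a minimal SARSA sketch in Python for the cart pole environment. It assumes the classic gym API (env.step returning four values); the bin bounds, window count and hyper-parameter values are illustrative choices rather than the exact settings used in the linked code, and the state is discretised over only the pole angle and angular velocity.

```python
import numpy as np
import gym

env = gym.make("CartPole-v0")

N_BINS = 20                          # windows per discretised state dimension (assumption)
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
EPISODES = 5000

# Discretise only pole angle and pole angular velocity (rough, hand-picked bounds)
obs_low  = np.array([-0.42, -4.0])
obs_high = np.array([ 0.42,  4.0])
bin_width = (obs_high - obs_low) / N_BINS

q_table = np.zeros((N_BINS, N_BINS, env.action_space.n))

def discretise(obs):
    # Keep pole angle (index 2) and pole angular velocity (index 3), map to bin indices
    clipped = np.clip(obs[2:4], obs_low, obs_high - 1e-9)
    return tuple(((clipped - obs_low) / bin_width).astype(int))

def epsilon_greedy(state):
    if np.random.random() < EPSILON:
        return env.action_space.sample()      # explore
    return int(np.argmax(q_table[state]))     # exploit

for episode in range(EPISODES):
    state = discretise(env.reset())
    action = epsilon_greedy(state)
    done = False
    while not done:
        next_obs, reward, done, _ = env.step(action)
        next_state = discretise(next_obs)
        next_action = epsilon_greedy(next_state)     # on-policy: the action that will actually be taken
        # SARSA update: bootstrap from the Q-value of the action actually taken next
        td_target = reward + GAMMA * q_table[next_state][next_action]
        q_table[state][action] += ALPHA * (td_target - q_table[state][action])
        state, action = next_state, next_action
```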

Q-Learning

Q-Learning is an off-policy temporal difference learning algorithm. The term off-policy refers to the fact that at each step the optimal policy/Q-value is learnt independently of the agent's actions.

Q-Learning (source)

The working of the algorithm is shown in the above figure. The initialisation of the hyper-parameters (alpha, gamma, etc.) is similar to SARSA. Comparing the figures for SARSA and Q-Learning, the main difference lies in the selection of the action and the update of the Q-table. In Q-Learning the action from the current step is not retained for the next step; a new action is selected at every step. In line 8 of Q-Learning, the Q-table is updated using the action that maximises the Q-value of the next state, rather than the action the agent will actually take from that state. The agent therefore learns from a different action value than the one it follows, going off its current policy, which is why it is classified as an off-policy algorithm. The Q-table update is repeated for a finite number of steps, after which the next episode of training starts with a newly initialised state while the Q-table is retained.
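Relative to the SARSA sketch above, only the inner loop changes. Reusing the env, q_table, discretise and epsilon_greedy definitions from that sketch (with a freshly initialised q_table), a hedged sketch of the Q-Learning loop looks like this:

```python
for episode in range(EPISODES):
    state = discretise(env.reset())
    done = False
    while not done:
        action = epsilon_greedy(state)               # a fresh action is chosen at every step
        next_obs, reward, done, _ = env.step(action)
        next_state = discretise(next_obs)
        # Off-policy: bootstrap from the best action in the next state,
        # regardless of what the behaviour policy will actually do there
        td_target = reward + GAMMA * np.max(q_table[next_state])
        q_table[state][action] += ALPHA * (td_target - q_table[state][action])
        state = next_state
```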

Some additional good sources to learn about Q-Learning are Simplilearn, Sentdex and DataCamp.

3 | Environment Analysis

To implement the algorithms, it is necessary to note the important aspects of the environments, shown in the figures below.

CartPole (left) and MountainCar (right) environments

The size and limits of the action and state spaces are captured in the table below.

Cart Pole

The state space of the cart pole problem is a 4 x 1 array consisting of the cart position, cart velocity, pole angle and pole angular velocity. The action space consists of 2 values: 0 pushes the cart to the left and 1 pushes it to the right. A reward of +1 is given for every step taken, and an episode terminates when:
1) The pole angle is ≥ 12 degrees
2) The cart centre reaches the display edge
3) The episode length is ≥ 200

Mountain Car

The state space of the mountain car problem is a 2 x 1 array containing the car position and car velocity. The 3 action values are 0 for leftward acceleration, 1 for no acceleration and 2 for accelerating to the right. A reward of 0 is received when the agent reaches the flag (position = 0.5) and a reward of -1 is given while the agent's position is less than 0.5. The episode terminates when:
1) The car position is ≥ 0.5
2) The episode length is ≥ 200
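The space sizes and limits summarised above can be confirmed directly from the environments; a small sketch, assuming the classic CartPole-v0 and MountainCar-v0 environment IDs:

```python
import gym

# Quick check of the state and action spaces quoted above
for env_id in ["CartPole-v0", "MountainCar-v0"]:
    env = gym.make(env_id)
    print(env_id)
    print("  state space: ", env.observation_space.shape,
          env.observation_space.low, env.observation_space.high)
    print("  action space:", env.action_space.n)
```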

4 | Implementation and Results

CartPole

To implement both the Q-Learning and SARSA solutions, a Q-table of size (state size x action size) was first constructed. The state size corresponded to the number of divisions into which the continuous pole position and velocity space was split. To give the agent an element of exploration, a constant value of the exploration constant (epsilon) was used. The values of each hyperparameter used for both SARSA and Q-Learning are shown in the table below.

Cartpole Hyperparameters

The training was run for a number of episodes, and at every episode the Q-value corresponding to the agent's pole position and pole velocity was updated. The environment was also rendered/displayed every 500 episodes. The corresponding graphs can be seen below. In each case, the graph plots the maximum, minimum and average reward per episode against the episode number. The average reward per episode is computed as the sum of the total rewards obtained up to a particular episode divided by the number of episodes up to that point.

SARSA and Q-Learning behaviour for Cartpole
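For reference, a minimal sketch of how the average-reward curve described above could be computed from a list of per-episode rewards; the plotting and aggregation code in the actual repository may differ:

```python
def average_reward_curve(episode_rewards):
    """Cumulative average: total reward collected so far divided by episodes so far."""
    total = 0.0
    curve = []
    for i, reward in enumerate(episode_rewards, start=1):
        total += reward
        curve.append(total / i)
    return curve

# Example: rewards of 10, 20 and 30 give running averages of 10, 15 and 20
print(average_reward_curve([10, 20, 30]))   # [10.0, 15.0, 20.0]
```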

Mountain Car

When implementing the algorithms on this environment, it was found that a constant value of epsilon was not able to solve the problem. Instead, a dynamic epsilon was used that started at a high value and was decreased as the episodes progressed. This ensured a higher probability of exploration in the beginning and a higher probability of exploitation (greedy actions) in later episodes. The hyperparameters used are shown in the table below.

Mountain Car Hyperparameters
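A minimal sketch of such a decaying epsilon schedule is shown below; the start value, end value and decay window are illustrative assumptions rather than the exact values from the table above:

```python
EPSILON_START = 1.0       # favour exploration in early episodes
EPSILON_END = 0.05        # favour exploitation later on
DECAY_UNTIL = 5000        # episode at which epsilon reaches its final value

def epsilon_for(episode):
    # Linear decay from EPSILON_START to EPSILON_END over the first DECAY_UNTIL episodes
    fraction = min(episode / DECAY_UNTIL, 1.0)
    return EPSILON_START + fraction * (EPSILON_END - EPSILON_START)

print(epsilon_for(0), epsilon_for(2500), epsilon_for(10000))   # approx. 1.0, 0.525, 0.05
```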

The corresponding graphs can be seen in the figures below, which show the trend of the maximum, average and minimum reward at every episode.

SARSA and Q-Learning behaviour for MountainCar

Inferences

Comparing the graphs of SARSA and Q-Learning for each case, we observe that:

  1. In both the cartpole and mountain car problems, the reward converges to a larger value with Q-Learning than with SARSA. This is possibly due to the action selection in the update step: in Q-Learning, the action corresponding to the largest Q-value is used for the update, which can lead to a higher reward in the long run.
  2. The maximum reward is obtained by the agent within about 2500 episodes for cart pole and 7500 episodes for mountain car. This points towards the cart pole example being easier to learn than the mountain car example for the same algorithm. This is also the reason why a dynamic epsilon was used in the mountain car case.
  3. The average reward almost converges to the maximum reward after 30000 episodes in the case of cart pole but not in the case of mountain car. This implies that the mountain car agent requires more training to converge to the maximum value.

5 | Sensitivity Analysis

In order to critically analyse the effect of each hyper-parameter, a single environment was considered and a particular hyper-parameter was varied across 4 different values while keeping the other values constant. For this analysis, we consider the cart-pole environment with Q-Learning and an epsilon-greedy strategy. Thus, we consider the update equation below and select for the sensitivity analysis the hyper-parameters that can potentially change the updated values at each iteration.

Q Learning Update Equation (source)

In each case, the average reward per episode is plotted against the episode number for 4 values of the same hyper-parameter.
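One way such a sweep could be organised is sketched below; train_q_learning is a hypothetical wrapper around the Q-Learning loop from section 2, and the baseline values are illustrative:

```python
def train_q_learning(alpha=0.1, gamma=0.95, epsilon=0.1, n_bins=20, episodes=30000):
    """Hypothetical wrapper around the Q-Learning loop from section 2;
    assumed to return the list of per-episode rewards (body omitted here)."""
    ...

# Vary one hyper-parameter at a time while keeping the others at their baseline values
epsilon_results = {eps: train_q_learning(epsilon=eps) for eps in [0.1, 0.25, 0.5, 0.75]}
```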

Exploration constant sensitivity

Epsilon, as mentioned previously, is used to give the agent an element of exploration. When epsilon = 0, the algorithm only takes the action corresponding to the highest Q-value. When epsilon = 1, the algorithm only selects random actions. The graph below shows the training results for each of the 4 values.

Epsilon Sensitivity Analysis

We make the following observations:

  1. A higher epsilon value (in this case 0.5 and 0.75) causes the agent to obtain average reward values of only around 48 even after 30000 episodes. This is expected: with a high value of epsilon the agent has a higher probability of exploration and only a small chance of actually taking the actions that give a higher reward at that moment.
  2. Decreasing epsilon to a mid-range value (like 0.25) improves training further, but the agent still explores frequently, which appears to cause it to fall back to lower rewards at times.
  3. A low value of epsilon (around 0.1) actually helps with the training for this example. Due to the simplicity of the example, a low probability of exploration is beneficial.

It is important to note that for more complicated examples, the value of epsilon must first be kept high and then decremented at each episode, as was done in the mountain car example. This helps with exploration in the beginning, after which the probability of exploration decreases and the actions corresponding to the highest Q-values are selected.

Discount Factor sensitivity

The discount factor (gamma) measures how far ahead in time the algorithm looks. When gamma = 0, none of the future rewards are considered in the Q-Learning equation, and when gamma = 1, future rewards are given a high weight. The results of running the training for various values of gamma are shown below.
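As a quick numeric illustration (not taken from the experiments), a reward received k steps in the future is weighted by gamma^k, so the choice of gamma directly controls how quickly future rewards fade:

```python
for gamma in [0.0, 0.5, 0.95, 1.0]:
    weights = [round(gamma ** k, 3) for k in range(5)]
    print(gamma, weights)
# 0.0  -> [1.0, 0.0, 0.0, 0.0, 0.0]        only the immediate reward counts
# 0.5  -> [1.0, 0.5, 0.25, 0.125, 0.062]   future rewards fade quickly
# 0.95 -> [1.0, 0.95, 0.902, 0.857, 0.815] future rewards keep most of their weight
# 1.0  -> [1.0, 1.0, 1.0, 1.0, 1.0]        all future rewards count equally
```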

Discount Factor Sensitivity Analysis

We make the following observations:

  1. With lower gamma values (like 0.5) there is no improvement in the average reward per episode that the agent receives, even over 30000 episodes. With low gamma values, future rewards are given very little weight and therefore barely influence the update of the Q-value at each step of an episode, so this result is expected.
  2. When the gamma value is increased beyond this, the average reward per episode increases. This is due to a better update of the Q-table values, with the agent learning more from future Q-values.
  3. With values of gamma close to 1, such as 0.95 and 1 itself, the agent performs best, because each update step changes the Q-table by a larger amount and can therefore better account for future Q-values when choosing actions the next time.

State window sensitivity

When working on the problems, it was found that the number of divisions between the minimum and maximum values of a particular state dimension (also called the window or bucket count) was important for good learning. This is primarily tied to the dimensions of the Q-table: a higher window count implies a larger Q-table and a higher resolution across each state dimension. The window count was varied across 4 values, as shown below.
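With the two discretised state dimensions and two actions used for cart pole, the Q-table size grows quadratically with the window count, which is part of why this choice matters; a quick illustration using the window values mentioned below:

```python
N_ACTIONS = 2                        # cart pole: push left or push right
for n_windows in [5, 25, 50]:
    entries = n_windows * n_windows * N_ACTIONS
    print(n_windows, "windows ->", entries, "Q-table entries")
# 5 -> 50, 25 -> 1250, 50 -> 5000 entries
```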

State Window Sensitivity Analysis

We make the following observations:

  1. A low state window value like 5 leads to a low resolution of the state space. Because of this, the same action is taken for large groups of state values, which leads to poorer learning: the agent receives an almost constant average reward per episode over a large number of episodes (30000 in this case).
  2. When the number of windows increases, the state space is differentiated better; with 25 or 50 windows this results in faster and better learning within 30000 episodes.

Learning rate sensitivity

The learning rate (alpha) is set between 0 and 1 and controls how strongly each update changes the Q-value. When alpha = 0, the Q-values are never updated and nothing is learnt. When alpha = 1, the existing Q-value is completely replaced by the newest estimate. Varying alpha between these values and running the simulations gives the trend shown in the figure below.
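As a small worked example with made-up numbers (current Q-value 2, reward 1, gamma 0.95, best next-state Q-value 3), the effect of alpha on a single update looks like this:

```python
q_old, reward, gamma, q_next_max = 2.0, 1.0, 0.95, 3.0
td_target = reward + gamma * q_next_max            # 3.85

for alpha in [0.0, 0.25, 1.0]:
    q_new = q_old + alpha * (td_target - q_old)
    print(alpha, q_new)
# alpha = 0.0  -> 2.0    (no update at all)
# alpha = 0.25 -> 2.4625 (a quarter of the way towards the new estimate)
# alpha = 1.0  -> 3.85   (the old value is discarded entirely)
```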

Learning Rate Sensitivity Analysis

From the figure, we make the following observations:

  1. Setting alpha = 1 causes no learning to take place here. This is due to the nature of the update equation (shown at the beginning of the section): with alpha = 1 the contribution of the existing Q-value vanishes, so each step simply overwrites the estimate with the latest noisy sample.
  2. Setting alpha = 0.25 appears to be an optimal level of learning, providing a balance between the very slow learning seen with alpha = 0.1 and the lack of learning with alpha = 1.
  3. Setting alpha to a very high value (like 0.75) can appear to converge to a maximum average reward in the beginning, but the reward may drop again with further training.

6 | Conclusion

In this blog post, SARSA and Q-Learning have been implemented to solve the cart pole and mountain car problems from the OpenAI Gym environments. The algorithms have been compared against each other, and a sensitivity analysis varying individual hyper-parameters and observing their effect on learning has also been performed.

The code has been made freely available, and recorded videos of the agents in action can be found below. Further, please reach out to me if anything is unclear or any corrections/modifications are needed.

  1. Code (link)
  2. Cart-Pole SARSA (link)
  3. Cart-Pole Q-Learning (link)
  4. Mountain car SARSA (link)
  5. Mountain car Q-Learning (link)

Thank you!
