## Description

TAs: Emilio, Varun, Yash, Aashi, Fabricio

START HERE: Instructions

Summary: In this assignment, you will implement a reinforcement learning algorithm for solving the classic mountain-car environment. As a warmup, Section 1 will lead you through an on-paper example of how value iteration and Q-learning work. Then, in Section 2, you will implement Q-learning with function approximation to solve the mountain car environment.

• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/~mgormley/courses/10601/about.html#6-general-policies

• Submitting your work: You will use Gradescope to submit answers to all questions, and Autolab to submit your code. Please follow instructions at the end of this PDF to correctly submit all your code to Autolab.

– Autolab: You will submit your code for programming questions on the homework to

Autolab (https://autolab.andrew.cmu.edu/). After uploading your code, our grading scripts will autograde your assignment by running your program on a virtual machine (VM). The software installed on the VM is identical to that on linux.andrew.cmu.edu, so you should check that your code runs correctly there. If developing locally, check that the version number of the programming language environment (e.g. Python 2.7, Octave 3.8.2, OpenJDK 1.8.0, g++ 4.8.5) and versions of permitted libraries (e.g. numpy 1.7.1) match those on linux.andrew.cmu.edu. (Octave users: Please make sure you do not use any Matlab-specific libraries in your code that might make it fail against our tests. Python3 users: Please pay special attention to the instructions at the end of this PDF) You have a total of 10 Autolab submissions. Use them wisely. In order to not waste Autolab submissions, we recommend debugging your implementation on your local machine (or the linux servers) and making sure your code is running correctly first before any Autolab submission.

• Materials: Download from Autolab the tar file (“Download handout”). The tar file will contain all the data that you will need in order to complete this assignment.

Instructions for Specific Problem Types

For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?
● Matt Gormley
○ Marie Curie
○ Noam Chomsky

If you need to change your answer, you may cross out the previous answer and bubble in the new answer:

Select One: Who taught this course?
⊗ Matt Gormley
○ Marie Curie
● Noam Chomsky

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
□ I don’t know

If you need to change your answer, you may cross out the previous answers and fill in the new answers:

Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
⊠ I don’t know

Fill in the blank: What is the course number?

10-601

1 Written Questions [32 points]

1.1 Value Iteration [6 points]

In this question you will carry out value iteration by hand to solve a maze. A map of the maze is shown in the table below, where ‘G’ represents the goal of the agent (it’s the terminal state); ‘H’ represents an obstacle; the zeros are the state values V(s) that are initialized to zero.

0 0 G
H 0 H
0 0 0

Table 1.1: Map of the maze

The agent can choose to move up, left, right, or down at each of the 6 states (the goal and obstacles are not valid initial states, “not moving” is not a valid action). The transitions are deterministic, so if the agent chooses to move left, the next state will be the grid to the left of the previous one. However, if it hits the wall (edge) or obstacle (H), it stays in the previous state. The agent receives a reward of -1 whenever it takes an action. The discount factor γ is 1.
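The synchronous value iteration updates described above can be sketched in code. This is a hedged illustration only (the grid encoding and helper names below are ours, not part of the handout): each non-terminal state takes the max over actions of the immediate reward −1 plus the value of the resulting state, with blocked moves leaving the agent in place.

```python
# Sketch: synchronous value iteration on the 3x3 maze of Table 1.1.
# 'G' is terminal (value fixed at 0), 'H' cells are obstacles, reward is -1
# per action, gamma = 1, and moves into a wall or obstacle stay in place.
GRID = [["0", "0", "G"],
        ["H", "0", "H"],
        ["0", "0", "0"]]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(r, c, dr, dc):
    nr, nc = r + dr, c + dc
    if 0 <= nr < 3 and 0 <= nc < 3 and GRID[nr][nc] != "H":
        return nr, nc
    return r, c  # bounced off a wall or an obstacle

def value_iteration_round(V):
    new_V = dict(V)
    for (r, c) in V:
        if GRID[r][c] == "G":
            continue  # terminal state keeps value 0
        new_V[(r, c)] = max(-1 + V[step(r, c, dr, dc)] for dr, dc in MOVES)
    return new_V

V = {(r, c): 0.0 for r in range(3) for c in range(3) if GRID[r][c] != "H"}
V = value_iteration_round(V)
# After round 1 every non-terminal state has value -1, matching Table 1.2.
```

Repeating `value_iteration_round` until the values stop changing gives the converged value function asked for in question 2.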

1. [1 points] How many possible deterministic policies are there in this environment, including both optimal and non-optimal policies?

2. [3 points] Compute the state values after each round of synchronous value iteration updates on the map of the maze before convergence. For example, after the first round, the values should look like this:

-1 -1 G
H -1 H
-1 -1 -1

Table 1.2: Value function after round 1

3. [2 points] Which of the following changes will result in the same optimal policy as the settings above?

Select all that apply:

□ The agent receives a reward of 10 when it takes an action that reaches G and receives a reward of -1 whenever it takes an action that doesn’t reach G. Discount factor is 1.

□ The agent receives a reward of 10 when it takes an action that reaches G and doesn’t receive any reward whenever it takes an action that doesn’t reach G. Discount factor is 1.

□ The agent receives a reward of 10 when it takes an action that reaches G and doesn’t receive any reward whenever it takes an action that doesn’t reach G. Discount factor is 0.9.

□ The agent receives a reward of -10 when it takes an action that reaches G and doesn’t receive any reward whenever it takes an action that doesn’t reach G. Discount factor is 0.9.

□ None of the above.

1.2 Q-learning [8 points]

In this question, we will practice using the Q-learning algorithm to play tic-tac-toe. Tic-tac-toe is a simple two-player game. Each player, either X (cross) or O (circle), takes turns marking a location in a 3×3 grid. The player who first succeeds in placing three of their marks in a column, a row, or a diagonal wins the game.

1 2 3
4 5 6
7 8 9

Table 1.6: tic-tac-toe board positions

We will model the game as follows: each board location corresponds to an integer between 1 and 9, illustrated in the graph above. Actions are also represented by an integer between 1 and 9. Playing action a results in marking the location a and an action a is only valid if the location a has not been marked by any of the players. We train the model by playing against an expert. The agent only receives a possibly nonzero reward when the game ends. Note a game ends when a player wins or when every location in the grid has been occupied. The reward is +1 if it wins, -1 if it loses and 0 if the game draws.

O | X |
O | O | X
  |   | X

Table 1.7: State 1 (circle’s turn)

To further simplify the question, let’s say we are the circle player and it’s our turn. Our goal is to try to learn the best end-game strategy given the current state of the game illustrated in Table 1.7. The possible actions we can take are the positions that are unmarked: 3, 7, and 8. If we select action 7, the game ends and we receive a reward of +1; if we select action 8, the expert will select action 3 to end the game and we’ll receive a reward of -1; if we select action 3, the expert will respond by selecting action 7, which results in the state of the game in Table 1.8. In this scenario, our only possible action is 8, which ends the game and we receive a reward of 0.

O | X | O
O | O | X
X |   | X

Table 1.8: State 2 (circle’s turn)

Suppose we apply a learning rate α = 0.01 and discount factor γ = 1. The Q-values are initialized as:

Q(1,3) = 0.6
Q(1,7) = −0.3
Q(1,8) = −0.5
Q(2,8) = 0.8

1. [1 points] In the first episode, the agent takes action 7, receives +1 reward, and the episode terminates. Derive the updated Q-value after this episode. Remember that given the sampled experience (s, a, r, s′) of (state, action, reward, next state), the update of the Q-value is:

Q(s,a) ← Q(s,a) + α (r + γ max_{a′} Q(s′,a′) − Q(s,a))    (1.1)

Note that if s′ is the terminal state, Q(s′,a′) = 0 for all a′. Please round to three decimal places.

−0.287
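The episode-1 answer above can be checked numerically. A small hedged sketch (the `q_update` helper name is ours, not the handout's):

```python
# Check of the tabular Q-update in Equation (1.1) for episode 1:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
alpha, gamma = 0.01, 1.0
Q = {(1, 3): 0.6, (1, 7): -0.3, (1, 8): -0.5, (2, 8): 0.8}

def q_update(Q, s, a, r, max_next_q):
    # max_next_q is 0 when s' is the terminal state
    Q[(s, a)] += alpha * (r + gamma * max_next_q - Q[(s, a)])

q_update(Q, s=1, a=7, r=1, max_next_q=0.0)  # episode 1: action 7, reward +1
print(round(Q[(1, 7)], 3))  # -0.287
```

The same helper, applied with the appropriate rewards and next-state maxima, reproduces the updates asked for in questions 2 and 3.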

2. [1 points] In the second episode, the agent takes action 8, receives a reward of -1, and the episode terminates. Derive the updated Q-value based on this episode. Please round to three decimal places.

3. [2 points] In the third episode, the agent takes action 3, receives a reward of 0, and arrives at State 2 (1.8). It then takes action 8, receives a reward of 0, and the episode terminates. Derive the updated Q-values after each of the two experiences in this episode. Suppose we update the corresponding Q-value right after every single step. Please round to three decimal places.

4. [2 points] If we run the three episodes in a cycle forever, what will be the final values of the four Q-values? Please round to three decimal places.

5. [2 points] What will happen if the agent adopts the greedy policy (always pick the action that has the highest current Q-value) during training? Calculate the final four Q-values in this case. Please round to three decimal places.

1.3 Function Approximation [8 points]

In this question we will motivate function approximation for solving Markov Decision Processes by looking at Breakout, a game on the Atari 2600. The Atari 2600 is a gaming system released in the 1980s, but nevertheless is a popular target for reinforcement learning papers and benchmarks. The Atari 2600 has a resolution of 160×192 pixels. In the case of Breakout, we try to move the paddle to hit the ball in order to break as many tiles above as possible. We have the following actions:

• Move the paddle left

• Move the paddle right

• Do nothing

(a) Atari Breakout (b) Black and white Breakout

Figure 1.1: Atari Breakout. 1.1a is what Breakout looks like. We have the paddle in the bottom of the screen aiming to hit the ball in order to break the tiles at the top of the screen. 1.1b is our transformation of Atari Breakout into black and white pixels for the purposes of some of the following problems.

1. [1 points] Suppose we are dealing with the black and white version of Breakout as in Figure 1.1b. Furthermore, suppose we are representing the state of the game as just a vector of pixel values without considering if a certain pixel is always black or white. Since we are dealing with the black and white version of the game, these pixel values can either be 0 or 1. What is the size of the state space?

2. [1 points] In the same setting as the previous part, suppose we wish to apply Q-learning to this problem. What is the size of the Q-value table we will need?

3. [1 points] Now assume we are dealing with the colored version of Breakout as in Figure 1.1a. Now each pixel is a tuple of real valued numbers between 0 and 1. For example, black is represented as (0,0,0) and white is (1,1,1).

What is the size of the state space and Q-value table we will need?

By now you should see that we will need a huge table in order to apply Q-learning (and similarly value iteration and policy iteration) to Breakout given this state representation. This table would not even fit in the memory of any reasonable computer! Now this choice of state representation is particularly naïve. If we choose a better state representation, we could drastically reduce the table size needed.

On the other hand, perhaps we don’t want to spend our days feature engineering a state representation for Breakout. Instead we can apply function approximation to our reinforcement algorithms! The whole idea of function approximation is that states nearby to the state of interest should have similar values. That is, we should be able to generalize the value of a state to nearby and unseen states.

Let us define qπ(s,a) as the true action value function of the current policy π. Assume qπ(s,a) is given to us by some oracle. Also define q(s,a;w) as the action value predicted by the function approximator parameterized by w. Here w is a matrix of size |S| × |A|. Clearly we want to have q(s,a;w) be close to qπ(s,a) for all (s,a) pairs we see. This is just our standard regression setting. That is, our objective function is just the Mean Squared Error:

ℓ(w) = (1/2) Σ_{(s,a)} (q(s,a;w) − qπ(s,a))²    (1.2)

Because we want to update for each example stochastically, we get the following update rule:

w ← w − α(q(s,a;w) − qπ(s,a))∇wq(s,a;w) (1.3)

However, more often than not we will not have access to the oracle that gives us our target qπ(s,a). So how do we get the target to regress q(s,a;w) on? One way is to bootstrap an estimate of the action value under a greedy policy using the function approximator itself. That is to say

qπ(s,a) ≈ r + γ max_{a′} q(s′,a′;w)    (1.4)

where r is the reward observed from taking action a at state s, γ is the discount factor, and s′ is the state resulting from taking action a at state s. This target is often called the Temporal Difference (TD) target, and gives rise to the following update for the parameters of our function approximator in lieu of a tabular update:

w ← w − α (q(s,a;w) − (r + γ max_{a′} q(s′,a′;w))) ∇w q(s,a;w)    (1.5)

The quantity q(s,a;w) − (r + γ max_{a′} q(s′,a′;w)) is the TD error.
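This TD update can be sketched for a linear approximator. A minimal, hedged example (dense NumPy vectors are assumed here for clarity; the handout's environment uses a sparse format):

```python
# One-step TD update for a linear approximator q(s,a;w) = s^T w_a,
# following the update rule above. The gradient of s^T w_a w.r.t. w_a is s.
import numpy as np

def td_update(W, s, a, r, s_next, gamma, alpha, done=False):
    q_sa = s @ W[:, a]
    target = r if done else r + gamma * np.max(s_next @ W)
    td_error = q_sa - target
    W[:, a] -= alpha * td_error * s
    return W

W = np.zeros((4, 3))                       # 4 features, 3 actions
s = np.array([1.0, 0.0, 0.0, 0.0])
W = td_update(W, s, a=2, r=-1.0, s_next=s, gamma=0.99, alpha=0.1)
```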

4. [2 points] Let us consider the setting where we can represent our state by some vector s, action a ∈ {0,1,2} and we choose a linear approximator. That is:

q(s,a;w) = sᵀ w_a    (1.6)

Again, assume we are in the black and white setting of Breakout as in Figure 1.1b. Show that tabular Q-learning is just a special case of Q-learning with a linear function approximator by describing a construction of s. (Hint: Engineer features such that we can encode a table in vector form)
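As a hedged illustration of the idea behind this question (a sketch, not a full answer): with a one-hot state vector, the linear form q(s,a;w) = sᵀ w_a reduces to indexing a table. The sizes and names below are ours for illustration.

```python
# With a one-hot state vector, the linear approximator is a table lookup,
# so tabular Q-learning is a special case of the linear setting.
import numpy as np

n_states, n_actions = 5, 3
W = np.random.randn(n_states, n_actions)  # stands in for a Q-table

def one_hot(state_index, n=n_states):
    s = np.zeros(n)
    s[state_index] = 1.0
    return s

s = one_hot(2)
q_val = s @ W[:, 1]                # linear approximator q(s, a=1; w)
assert np.isclose(q_val, W[2, 1])  # identical to indexing the table at (2, 1)
```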

5. [3 points] Stochastic Gradient Descent works because we can assume that the samples we receive are independent and identically distributed. Is that the case here? If not, why and what are some ways you think you could combat this issue?

No, the samples are not i.i.d., because the next state and action depend on the current state, so consecutive samples are highly correlated. One way to combat this is experience replay: save transitions in a buffer and, when updating w, sample transitions from the buffer at random to reduce the correlation between consecutive updates.
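A minimal sketch of such a replay buffer (the class and method names here are illustrative, not part of the assignment):

```python
# Experience-replay buffer: store transitions, then sample uniformly at
# random so that gradient updates see a roughly decorrelated minibatch.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest samples fall off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.push(s=t, a=t % 3, r=-1.0, s_next=t + 1, done=False)
batch = buf.sample(8)  # a near-i.i.d. minibatch for the update
```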

1.4 Empirical Questions [10 points]

The following questions should be completed after you work through the programming portion of this assignment (Section 2).

1. [4 points] Run Q-learning on the mountain car environment using both tile and raw features.

For the raw features: run for 2000 episodes with max iterations of 200, ε set to 0.05, γ set to 0.999, and a learning rate of 0.001.

For the tile features: run for 400 episodes with max iterations of 200, ε set to 0.05, γ set to 0.99, and a learning rate of 0.00005.

For each set of features, plot the return (sum of all rewards in an episode) per episode on a line graph. On the same graph, also plot the rolling mean over a 25 episode window. Comment on the difference between the plots.


Figure 1.2: Value function visualizations for both types of features

2. [2 points] For both raw and tile features, we have run Q-learning with some good (for some sense of “good”) parameters and created visualizations of the potential value functions. For each plot in Figure 1.2, write down which features were likely used in Q-learning with function approximation. Explain your reasoning. In addition, interpret each of these plots in the context of the mountain car environment.

Graph (a) used tile features, while graph (b) used raw features. We use a linear approximator for Q, and the value function is the maximum Q over actions. With raw features the Q-values are linear in the state, so the value function looks close to a plane; this matches graph (b), whose surface depends essentially only on position. With tile features the engineered state representation makes the value function nonlinear, depending on both position and velocity; this matches graph (a). In graph (a) the value is highest when position and velocity are both near zero, and the surface is roughly symmetric about (0, 0). In graph (b) the value is highest when the position is around -1.25, to the left of the starting point.

3. [2 points] We see that Figure 1.2b seems to look like a plane. Can the value function depicted in this plot ever be nonlinear? If so, describe a potential shape. If not explain why. (Hint: How do we calculate the value of a state given the Q-values?)

Yes, the value function can be nonlinear. Although each q(s,a;w) is linear with respect to the state, the value function is V(s) = max_a q(s,a;w), the maximum over the q-values of all actions. A maximum of planes is a piecewise-linear surface, which is not a single plane.


Figure 1.3: Policy visualizations for both types of features

4. [2 points] In a similar fashion to the previous part we have created visualizations of the potential policies learned. For each plot in Figure 1.3 write down which features were likely used in Q-learning with function approximation. Explain your reasoning. In addition, interpret each of these plots in the context of the mountain car environment.

Graph (a) used tile features while graph (b) used raw features. The policy picks the action with the highest Q-value, so the decision boundaries are the sets of states where two actions have equal Q-values. With raw features the Q-values are linear in the state, so the boundaries must be straight lines, which matches graph (b); with tile features the Q-values are nonlinear, so the boundaries are irregular, which matches graph (a). In graph (a), the car is mostly pushed right when velocity is positive, with some exceptions: when position is below about -0.6, the car is pushed right even at negative velocity. When velocity is negative the car is mostly pushed left, with a few regions (e.g. position above 0 and velocity below about -0.02) where it does nothing. In graph (b), the boundaries are clean: the car is pushed right when velocity is above about -0.02 and increasing with position, does nothing when position and velocity are both negative, and is pushed left when velocity is below zero and position is below about -0.1.

Collaboration Questions Please answer the following:

1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.

3. Did you find or come across code that implements any part of this assignment? If so, include full details.

2 Programming [68 points]

Your goal in this assignment is to implement Q-learning with linear function approximation to solve the mountain car environment. You will implement all of the functions needed to initialize, train, evaluate, and obtain the optimal policies and action values with Q-learning. In this assignment we will provide the environment for you.

Octave/MATLAB users: Note that we will be manually grading your code using MATLAB for this assignment only. This means that the autograder will not grade your code on submission. This is because Octave’s collections.Map is too slow for the assignment’s purposes. We strongly suggest you use one of the other three languages instead, since our reference solution takes 400 seconds on MATLAB and much longer on Octave.

2.1 Specification of Mountain Car

In this assignment, you will be given code that fully defines the Mountain Car environment. In Mountain Car you control a car that starts at the bottom of a valley. Your goal is to reach the flag at the top right, as seen in Figure 2.1. However, your car is under-powered and can not climb up the hill by itself. Instead you must learn to leverage gravity and momentum to make your way to the flag. It would also be good to get to this flag as fast as possible.

Figure 2.1: What the Mountain Car environment looks like. The car starts at some point in the valley. The goal is to get to the top right flag.

The state of the environment is represented by two variables, position and velocity. position can be between −1.2 and 0.6 inclusive and velocity can be between −0.07 and 0.07 inclusive. These are just measurements along the x-axis.

2.2 Q-learning With Linear Approximations

The Q-learning algorithm is a model-free reinforcement learning algorithm: we assume we don’t have access to a model of the environment we’re interacting with, and we don’t build a complete model of the environment during the learning process. A learning agent interacts with the environment solely through calls to the step and reset methods of the environment. The Q-learning algorithm then updates the q-values based on the values returned by these methods. Analogously, in the approximation setting the algorithm will instead update the parameters of the q-value approximator.

Let the learning rate be α and the discount factor be γ. Recall that after one interaction with the environment we have the experience (s, a, r, s′). The tabular update rule based on this information is:

Q(s,a) ← Q(s,a) + α (r + γ max_{a′} Q(s′,a′) − Q(s,a))

Instead, for the function approximation setting we get the following update rule, derived in Section 1.3:

w ← w − α (q(s,a;w) − (r + γ max_{a′} q(s′,a′;w))) ∇w q(s,a;w)

Where:

q(s,a;w) = sᵀ w_a + b

The epsilon-greedy action selection method selects the optimal action with probability 1 − ε and selects uniformly at random from one of the 3 actions (0, 1, 2) with probability ε. The reason we use epsilon-greedy action selection is that we would like the agent to explore as well. For the purpose of testing, we will test two cases: ε = 0 and 0 < ε ≤ 1. When ε = 0, the program becomes deterministic and your output has to match our reference output accurately. In this case, if there is a draw in the greedy action selection process, pick the action represented by the smallest number. For example, if we’re at state s and Q(s,0) = Q(s,2), then take action 0. And when 0 < ε ≤ 1, your output will need to fall in a certain range that we determine by running exhaustive experiments based on the input parameters.
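The selection rule above can be sketched as follows. This is a hedged example (the function name and RNG handling are ours); note that NumPy's `argmax` returns the first maximal index, which matches the required smallest-number tie-break.

```python
# Epsilon-greedy action selection: explore uniformly with probability
# epsilon, otherwise take the greedy action, breaking ties toward the
# smallest action index (as required for the deterministic epsilon = 0 case).
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform over actions
    return int(np.argmax(q_values))              # exploit: argmax breaks ties low

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.5, 0.1, 0.5]), epsilon=0.0, rng=rng)
# Q(s,0) = Q(s,2) here, so with epsilon = 0 the smallest index, 0, is chosen.
```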

2.3 Feature Engineering


(a) A discretization of the state space of Mountain Car (b) A tiling of the state space of Mountain Car

Figure 2.2: State representations for the states of Mountain Car

For the Mountain Car environment, we know that position and velocity are both bounded.

What we can do is draw a grid over the possible position-velocity combinations as seen in Figure 2.2a. We then enumerate the grid from bottom left to top right, row by row. Then we map all states that fall into a grid square to the corresponding one-hot encoding of the grid number. For efficiency reasons we will just use the index that is non-zero. For example the green point would be mapped to {6}. This is called a discretization of the state space.

The downside to the above approach is that although observing the green point will let us learn parameters that generalize to other points in the shaded blue region, we will not be able to generalize to the orange point even though it is nearby. We can instead draw two grids over the state space, each offset slightly from each other as in Figure 2.2b. Now we can map the green point to two indices, one for each grid, and get {6,39}. Now the green point has parameters that generalize to points that map to {6} (the blue shaded region) in the first discretization and parameters that generalize to points that map to {39} (the red shaded region) in the second. We can generalize this to multiple grids, which is what we do in practice. This is called a tiling or a coarse-coding of the state space.
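The discretization step can be sketched as follows. This is a hedged illustration only: a single 5×5 grid over the bounds from Section 2.1 is assumed (matching the axis ticks in Figure 2.2), whereas the handout's actual tile mode uses multiple offset grids with 2048 total indices.

```python
# Map a (position, velocity) state to a grid-square index, enumerating
# the grid from bottom left to top right, row by row, as in Figure 2.2a.
POS_LO, POS_HI = -1.2, 0.6
VEL_LO, VEL_HI = -0.07, 0.07
N = 5  # bins per dimension (illustrative)

def grid_index(position, velocity, n=N):
    col = min(int((position - POS_LO) / (POS_HI - POS_LO) * n), n - 1)
    row = min(int((velocity - VEL_LO) / (VEL_HI - VEL_LO) * n), n - 1)
    return row * n + col

# A second, slightly offset grid turns this discretization into a tiling:
# each state then maps to one index per grid, e.g. {6, 39} in Figure 2.2b.
idx = grid_index(-0.5, 0.0)
```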

2.4 Implementation Details

Here we describe the API to interact with the Mountain Car environment available to you in Python. The other languages will have an analogous API.

• init(mode): Initializes the environment to the mode specified by the value of mode. This can be a string of either “raw” or “tile”.

“raw” mode tells the environment to give you the state representation of raw features encoded in a sparse format: {0 → position,1 → velocity}.

In “tile” mode you are given the indices of the tiles which are active, in a sparse format: {T₁ → 1, T₂ → 1, …, Tₙ → 1} where Tᵢ is the tile index for the i-th tiling. All other tile indices are assumed to map to 0. For example the state representation of the example in Figure 2.2b would become {6 → 1, 39 → 1}.

The size of the state space of the “raw” mode is 2. The size of the state space of the “tile” mode is 2048. These values can be accessed from the environment through the state space property, and similarly for other languages.

• reset(): Reset the environment to starting conditions.

• step(action): Take a step in the environment with the given action. action must be either 0, 1 or 2. This will return a tuple of (state, reward, done) which is the next state, the reward observed, and a boolean indicating if you reached the goal or not, ending the episode. The state will be either a “raw” or “tile” representation, as defined above, depending on how you initialized Mountain Car. If you observe done = True then you should reset the environment and end the episode. Failure to do so will result in undefined behavior.

• [Python Only] render(self): Optionally render the environment. It is computationally intensive to render graphics, so only render a full episode once every 100 or 1000 episodes. Requires the installation of pyglet. This will be a no-op in autolab.

You should now implement your Q-learning algorithm with linear approximations as q learning.{py|java|cpp|m}. The program will assume access to a given environment file(s) which contains the Mountain Car environment which we have given you. Initialize the parameters of the linear model with all 0 (and don’t forget to include a bias!) and use the epsilon-greedy strategy for action selection.
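The training loop you need to implement might be organized as follows. This is a hedged sketch under stated assumptions, not the reference solution: the environment object, its sparse-dict state format, and the function signature follow the API described in Section 2.4, but all names here are illustrative.

```python
# Q-learning with a linear approximator q(s,a;w) = s^T w_a + b over the
# environment's sparse state dicts. Weights and bias start at zero.
import numpy as np

def train(env, episodes, max_iterations, epsilon, gamma, lr, state_size, rng):
    W = np.zeros((state_size, 3))  # one weight column per action
    b = 0.0                        # don't forget the bias
    returns = []

    def q(state, a):
        # state is a sparse dict {feature index: feature value}
        return sum(v * W[i, a] for i, v in state.items()) + b

    for _ in range(episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_iterations):
            if rng.random() < epsilon:
                a = int(rng.integers(3))                       # explore
            else:
                a = int(np.argmax([q(state, ai) for ai in range(3)]))
            next_state, reward, done = env.step(a)
            total += reward
            target = reward + (0.0 if done else
                               gamma * max(q(next_state, ap) for ap in range(3)))
            td_err = q(state, a) - target
            for i, v in state.items():  # gradient of s^T w_a + b w.r.t. w_a
                W[i, a] -= lr * td_err * v
            b -= lr * td_err
            state = next_state
            if done:
                break
        returns.append(total)
    return W, b, returns
```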

Your program should write an output file containing the total rewards (the returns) for every episode after running the Q-learning algorithm. There should be one return per line.

Your program should also write an output file containing the weights of the linear model. The first line should be the value of the bias. The following |S| × |A| lines should be the values of the weights, output in row-major order, assuming your weights are stored in a |S| × |A| matrix.
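The two output files described above might be written as follows. A hedged sketch (the helper name is ours); `flatten(order="C")` gives row-major order for a NumPy matrix.

```python
# Write the weight file (bias first, then the |S| x |A| weights in
# row-major order) and the returns file (one return per line).
import numpy as np

def write_outputs(weight_path, returns_path, W, b, returns):
    with open(weight_path, "w") as f:
        f.write(f"{b}\n")
        for value in W.flatten(order="C"):  # row-major order
            f.write(f"{value}\n")
    with open(returns_path, "w") as f:
        for r in returns:
            f.write(f"{r}\n")
```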

The autograder will use the following commands to call your function:

For Python: $ python q_learning.py [args…]

For Java: $ javac -cp “./lib/ejml-v0.33-libs/*:./” q_learning.java; java -cp “./lib/ejml-v0.33-libs/*:./” q_learning [args…]

For C++: $ g++ -g -std=c++11 -I./lib q_learning.cpp; ./a.out [args…]

Where above [args…] is a placeholder for command-line arguments: <mode> <weight out> <returns out> <episodes> <max iterations> <epsilon> <gamma> <learning rate>. These arguments are described in detail below:

1. <mode>: mode to run the environment in. Should be either ‘‘raw’’ or ‘‘tile’’.

2. <weight out>: path to output the weights of the linear model.

3. <returns out>: path to output the returns of the agent

4. <episodes>: the number of episodes your program should train the agent for. One episode is a sequence of states, actions and rewards, which ends with a terminal state or when the maximum episode length has been reached.

5. <max iterations>: the maximum length of an episode. When this is reached, we terminate the current episode.

6. <epsilon>: the value for the epsilon-greedy strategy

7. <gamma>: the discount factor γ.

8. <learning rate>: the learning rate α of the Q-learning algorithm.

Example command for Python users:

$ python q_learning.py raw weight.out returns.out 4 200 0.05 0.99 0.01

Example output from running the above command (your code won’t match exactly, but should be close).

<weight out>

-7.6610506220312296

1.3440159024460183

1.344872959883069

1.340055578403996

-0.0007770480987990149

0.0011306483117300896

0.0017559989206646666

<returns out>

-200.0

-200.0

-200.0

-200.0

2.5 Autolab Submission

You must submit a .tar file named rl.tar containing q_learning.{py|m|java|cpp}.

You can create that file by running:

tar -cvf rl.tar q_learning.{py|m|java|cpp}

Some additional tips: DO NOT compress your files; you are just creating a tarball. Do not use tar -czvf. DO NOT put the above files in a folder and then tar the folder. Autolab is case sensitive, so observe that all your files should be named in lowercase. You must submit this file to the corresponding homework link on Autolab. The autograder for Autolab prints out some additional information about the tests that it ran. You can view this output by selecting “Handin History” from the menu and then clicking one of the scores you received for a submission. For example, on this assignment, among other things, the autograder will print out which language it detects (e.g. Python, Octave, C++, Java).

Python3 Users: Please include a blank file called python3.txt (case-sensitive) in your tar submission and we will execute your submitted program using Python 3 instead of Python 2.7. Please do not rely on any ordering of dictionary elements.

Java EJML is a pure Java linear algebra package with three interfaces. We strongly recommend using the SimpleMatrix interface. Autolab will use EJML version 3.3. The command line arguments above demonstrate how we will call your code. The classpath inclusion -cp “./lib/ejml-v0.33-libs/*:./” will ensure that all the EJML jars are on the classpath as well as your code.

C++ Eigen is a header-only library, so there is no linking to worry about—just #include whatever components you need. Autolab will use Eigen version 3.3.4. The command line arguments above demonstrate how we will call your code. The argument -I./lib will include the lib/Eigen subdirectory, which contains all the headers. We have included the correct versions of EJML/Eigen in the handout.tar for your convenience. Do not include EJML or Eigen in your Autolab submission tar; the autograder will ensure that they are in place.

EJML: https://ejml.org
Eigen: http://eigen.tuxfamily.org/
