Q-Learning and Actor-Critic
1 Part 1: Q-Learning
1.1 Introduction
1.2 Installation
Obtain the code from https://github.com/berkeleydeeprlcourse/homework/tree/master/hw3. To run the code, go into the hw3 directory and execute python run_dqn_atari.py. It will not work, however, until you finish implementing the algorithm in dqn.py.
You will also need to install the dependencies: OpenAI Gym, TensorFlow, and OpenCV (which is used to resize the images). Remember to also follow the instructions for installing the Atari environments for OpenAI Gym, which can be found on the GitHub page for OpenAI Gym. To install OpenCV, run pip install opencv-python. If you have trouble with ffmpeg, you can install it via Homebrew or apt-get, depending on your system.
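If you want to confirm the installation before touching dqn.py, a quick smoke test along these lines can help. This is only an illustration, not part of the assignment code, and the environment id is an assumption; use whichever Pong variant your run script expects.

# Installation smoke test (illustrative only, not part of the assignment code).
# The environment id is an assumption; use whichever Pong variant your run script expects.
import cv2
import gym
import numpy as np

env = gym.make("PongNoFrameskip-v4")
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())

# The starter code uses OpenCV to resize frames, so check that cv2 works as well.
resized = cv2.resize(np.asarray(obs), (84, 84), interpolation=cv2.INTER_AREA)
print("raw frame:", np.asarray(obs).shape, "resized:", resized.shape)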
There are also some slight differences between versions of TensorFlow with regard to variable initialization. If you get an error inside dqn_utils.py related to variable initialization, check the comment inside initialize_interdependent_variables; it explains how to modify the code to be compatible with older versions of TensorFlow.
1.3 Implementation
To determine if your implementation of Q-learning is performing well, you should run it with the default hyperparameters on the Pong game. Our reference solution gets a reward of around -20 to -15 after 500k steps, -15 to -10 after 1M steps, -10 to -5 after 1.5M steps, and around +10 after 2M steps on Pong. The maximum score of around +20 is reached after about 4-5M steps. However, there is considerable variation between runs.
Another debugging option is provided in run_dqn_lander.py, which trains your agent to play Lunar Lander, a 1979 arcade game (also made by Atari) that has been implemented in OpenAI Gym. Our reference solution with the default hyperparameters achieves around 150 reward after 400k timesteps. We recommend using Lunar Lander to check the correctness of your code before running longer experiments with run_dqn_ram.py and run_dqn_atari.py.
1.4 Evaluation
Question 1: basic Q-learning performance. Include a learning curve plot showing the performance of your implementation on the game Pong. The x-axis should correspond to the number of time steps (consider using scientific notation), and the y-axis should show the mean 100-episode reward as well as the best mean reward. These quantities are already computed and printed in the starter code. Be sure to label the y-axis, since we need to verify that your implementation achieves a similar reward to ours. If you needed to modify the default hyperparameters to obtain good performance, include the hyperparameters in the caption. You only need to list hyperparameters that were modified from the defaults.
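If you log the printed statistics to a file during training, a plot along these lines will satisfy the requirements. The file and column names here are hypothetical placeholders, not something the starter code produces for you.

# Hypothetical plotting sketch: assumes you dumped (timestep, mean_100ep_reward,
# best_mean_reward) rows to a CSV with a header while training.
import matplotlib.pyplot as plt
import numpy as np

data = np.genfromtxt("pong_log.csv", delimiter=",", names=True)

plt.plot(data["timestep"], data["mean_100ep_reward"], label="mean 100-episode reward")
plt.plot(data["timestep"], data["best_mean_reward"], label="best mean reward")
plt.ticklabel_format(axis="x", style="sci", scilimits=(0, 0))  # scientific notation on the x-axis
plt.xlabel("time steps")
plt.ylabel("reward")
plt.legend()
plt.savefig("pong_learning_curve.png")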
Question 2: double Q-learning. Use the double estimator [1] to improve the accuracy of your learned Q values. This amounts to using the online Q network (instead of the target Q network) to select the best action when computing target values. Compare the performance of double DQN to vanilla DQN.
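Concretely, the only change relative to the vanilla target is which network picks the argmax action. A minimal TensorFlow sketch of the two target computations follows; the tensor names below are placeholders, not the ones used in dqn.py.

# Hedged sketch of the two target computations; none of these names match dqn.py.
# q_online and q_target stand for the online and target networks' Q-values at the
# next states, shape [batch_size, num_actions]; rew_t and done_mask are per-sample.
import tensorflow as tf

num_actions = 6
gamma = 0.99
q_online = tf.placeholder(tf.float32, [None, num_actions])
q_target = tf.placeholder(tf.float32, [None, num_actions])
rew_t = tf.placeholder(tf.float32, [None])
done_mask = tf.placeholder(tf.float32, [None])

# Vanilla DQN: the target network both selects and evaluates the best next action.
vanilla_target = rew_t + gamma * (1.0 - done_mask) * tf.reduce_max(q_target, axis=1)

# Double DQN: the online network selects the action, the target network evaluates it.
best_action = tf.argmax(q_online, axis=1)
double_q = tf.reduce_sum(q_target * tf.one_hot(best_action, num_actions), axis=1)
double_target = rew_t + gamma * (1.0 - done_mask) * double_q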
2 Part 2: Actor-Critic
2.1 Introduction
Recall the policy gradient from hw2:
$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{it} \mid s_{it}) \left( \left( \sum_{t'=t}^{T} \gamma^{t'-t} r(s_{it'}, a_{it'}) \right) - V_\phi^\pi(s_{it}) \right)$
In this formulation, we estimate the Q function by summing the rewards to go over each trajectory, and we subtract the value function baseline to obtain the advantage estimate.
In practice, the estimated advantage value suffers from high variance. Actor-critic addresses this issue by using a critic network to estimate the sum of rewards to go. The most common type of critic network is a value function, in which case our estimated advantage becomes
$A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V_\phi^\pi(s_{t+1}) - V_\phi^\pi(s_t)$
In this assignment we will use the same value function network from hw2 as the basis for our critic network. One additional consideration in actor-critic is updating the critic network itself. While we can use Monte Carlo rollouts to estimate the sum of rewards to go for updating the value function network, in practice we fit our value function to the following target values:
$y_t = r(s_t, a_t) + \gamma V_\phi^\pi(s_{t+1})$
We then regress onto these target values via the following regression objective, which we can optimize with gradient descent: $\min_\phi \sum_{i,t} \left( V_\phi^\pi(s_{it}) - y_{it} \right)^2$. Because the targets themselves depend on the current value function, we fit the critic with the following procedure:
1. Update targets with current value function
2. Regress onto targets to update value function by taking a few gradient steps
3. Redo steps 1 and 2 several times
Altogether, fitting the value function critic is an iterative process in which we alternate between computing target values and updating the value function to match them. Through experimentation, you will see that this iterative process is crucial for training the critic network.
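As a concrete illustration of steps 1-3, here is a self-contained toy sketch with a linear critic. None of the names below match train_ac_f18.py; the real version belongs in Agent.update_critic.

# Toy, self-contained illustration of the target/regression loop. The linear
# critic V(s) = s @ w and all variable names are stand-ins, not the starter-code API.
import numpy as np

rng = np.random.RandomState(0)
obs = rng.randn(64, 4)                          # s_t
next_obs = rng.randn(64, 4)                     # s_{t+1}
rewards = rng.randn(64)                         # r(s_t, a_t)
terminals = rng.randint(0, 2, 64).astype(float)
gamma, lr = 0.99, 0.01
w = np.zeros(4)                                 # critic parameters

num_target_updates = 10
num_grad_steps_per_target_update = 10
for _ in range(num_target_updates):
    # Step 1: recompute targets with the current critic, cutting off terminals.
    targets = rewards + gamma * (next_obs @ w) * (1.0 - terminals)
    # Step 2: take a few gradient steps regressing V(s_t) onto these fixed targets.
    for _ in range(num_grad_steps_per_target_update):
        err = obs @ w - targets
        w -= lr * (obs.T @ err) / len(obs)      # squared-error gradient (constants folded into lr)
    # Step 3: loop back and recompute the targets with the updated critic.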
2.2 Installation
Obtain the code from https://github.com/berkeleydeeprlcourse/homework/tree/master/hw3. To run the code, go into the hw3 directory and execute python train_ac_f18.py.
You should have already installed all the required dependencies in hw2. Refer to that assignment for installation instructions if you have issues.
2.3 Implementation
We have taken the train_pg_f18.py starter code from hw2 and modified it slightly to fit the actor-critic framework. Core functions, such as Agent.build_mlp, Agent.define_placeholders, Agent.policy_forward_pass, and Agent.get_log_prob, remain unchanged from last time. This assignment requires that you use your solution code from hw2. Before you begin, go through train_ac_f18.py and, in all places marked YOUR HW2 CODE HERE, paste in your corresponding hw2 solution code.
In order to accommodate actor-critic, the following functions have been modified or added:
• Agent.build_computation_graph: we now have actor_update_op for updating the actor network, and critic_update_op for updating the critic network.
• Agent.sample_trajectory: in addition to logging the observations, actions, and rewards, we now need to log the next observation and terminal values in order to compute the advantage function and update the critic network. Please implement these features.
• Agent.estimate_advantage: this function uses the critic network to estimate the advantage values. The advantage values are computed according to $A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma V_\phi^\pi(s_{t+1}) - V_\phi^\pi(s_t)$. Note: for terminal timesteps, you must make sure to cut off the reward to go, since there is no next state to bootstrap from, in which case we have $A^\pi(s_t, a_t) \approx r(s_t, a_t) - V_\phi^\pi(s_t)$ (see the sketch after this list).
• Agent.update_critic: perform the critic update according to the process outlined in the introduction. You must perform a total of self.num_grad_steps_per_target_update * self.num_target_updates updates, recomputing the target values every self.num_grad_steps_per_target_update steps.
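For the terminal cutoff in Agent.estimate_advantage, a (1 - terminal) mask is one convenient way to express both cases at once. A hedged sketch follows; the function and argument names are placeholders, with `critic` standing in for a critic forward pass rather than the starter-code API.

# Hedged sketch of advantage estimation with terminal masking; names are
# placeholders and do not match train_ac_f18.py.
import numpy as np

def estimate_advantage_sketch(critic, obs, next_obs, rewards, terminals, gamma=0.99):
    v_s = critic(obs)            # V(s_t)
    v_s_next = critic(next_obs)  # V(s_{t+1})
    # The (1 - terminal) mask cuts off the bootstrap term at terminal timesteps.
    adv = rewards + gamma * v_s_next * (1.0 - terminals) - v_s
    # Optionally normalize advantages, as in hw2.
    return (adv - adv.mean()) / (adv.std() + 1e-8)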
Go through the code and note the changes from policy gradient in detail. Then implement all requested features, which we have marked with YOUR CODE HERE.
2.4 Evaluation
Once you have a working implementation of actor-critic, you should prepare a report. The report should consist of one figure for each question below. You should turn in the report as one PDF (same PDF as part 1) and a zip file with your code (same zip file as part 1). If your code requires special instructions or dependencies to run, please include these in a file called README inside the zip file.
Question 1: Sanity check with CartPole. Now that you have implemented actor-critic, check that your solution works by running CartPole-v0. Using the same parameters as we did in hw2, run the following:
python train_ac_f18.py CartPole-v0 -n 100 -b 1000 -e 3 --exp_name 1_1 -ntu 1 -ngsptu 1
In the example above, we alternate between performing one target update and one gradient update step for the critic. As you will see, this probably doesn’t work, and you need to increase both the number of target updates and number of gradient updates. Compare the results for the following settings and report which worked best. Provide a short explanation for your results.
python train_ac_f18.py CartPole-v0 -n 100 --exp_name 100_1 -ntu 100 -ngsptu 1 -b 1000 -e 3
python train_ac_f18.py CartPole-v0 -n 100 --exp_name 1_100 -ntu 1 -ngsptu 100 -b 1000 -e 3
python train_ac_f18.py CartPole-v0 -n 100 --exp_name 10_10 -ntu 10 -ngsptu 10 -b 1000 -e 3
At the end, the best setting from above should match the policy gradient results from CartPole in hw2.
Question 2: Run actor-critic with more difficult tasks Use the best setting from the previous question to run InvertedPendulum and HalfCheetah:
python train_ac_f18.py InvertedPendulum-v2 -ep 1000 --discount 0.95 -n 100 -e 3 -l 2 -s 64 -b 5000 -lr 0.01 --exp_name <>_<> -ntu <> -ngsptu <>
python train_ac_f18.py HalfCheetah-v2 -ep 150 --discount 0.90 -n 100 -e 3 -l 2 -s 32 -b 30000 -lr 0.02 --exp_name <>_<> -ntu <> -ngsptu <>
Your results should roughly match those of policy gradient, though perhaps with slightly worse performance.
3 Submission
Turn in both parts of the assignment on Gradescope as one submission. Upload the zip file with your code to HW3 Code, and upload the PDF of your report to HW3.
References
[1] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. arXiv:1509.06461, 2015.