CMU10703 – Homework 3: Policy Gradient (Solution)

$ 29.99
Category:

Description

Instructions: START HERE
• Submitting your work:
– Gradescope: Please write your answers and copy your plots into the provided LaTeX template, and upload a PDF to the GradeScope assignment titled “Homework 3.” Additionally, export the code from your Colab notebook ([File → Export .py]) and upload it the GradeScope assignment titled “Homework 3: Code.” Each team should only upload one copy of each part. Regrade requests can be made within one week of the assignment being graded.
– Autolab: Autolab is not used for this assignment.
• Grading: This assignment is shorter than other homework assignments, and therefore will be graded out of 70 points. (The first two homeworks were graded out of 100 points).
Introduction
In this assignment, you will implement different RL algorithms and evaluate them on the OpenAI Gym LunarLander-v2 environment. This environment is considered solved if the agent can achieve an average score of at least 200.
This is a challenging assignment. Please start early!
Installation instructions (Linux)
apt-get install swig virtualenv env source env/bin/activate pip install -U -r requirements.txt
If your installation is successful, then you should be able to run the provided template code:
python reinforce.py python a2c.py
Note: You will need to install swig and box2d in order to install gym[box2d], which contains the LunarLander-v2 environment. You can install box2d by running pip install git+https://github.com/pybox2d/pybox2d
Problem 1: REINFORCE (30 pts)
In this section, you will implement episodic REINFORCE [6], a policy-gradient learning algorithm. Please write your code in reinforce.py; the template code provided inside is there to give you an idea on how you can structure your code, but is not mandatory to use.
Policy gradient methods directly optimize the policy π(A | S,θ), which is parameterized by θ. The REINFORCE algorithm proceeds as follows. We generate an episode by following policy π. After each episode ends, for each time step t during that episode, we update the policy parameters θ with the REINFORCE update. This update is proportional to the product of the return Gt experienced from time step t until the end of the episode and the gradient of logπ(At | St,θ). See Algorithm ?? for details.

Algorithm 1 REINFORCE

1: procedure REINFORCE
2: Start with policy model πθ 3: repeat:
4: Generate an episode S0,A0,r0,…,ST−1,AT−1,rT−1 following πθ(·)
5: for t from
6: P k t
7:
8: Optimize πθ using ∇L(θ)
9: end procedure

For the policy model π(A | S,θ), we recommend starting with a model that has:
• three fully connected layers with 16 units each, each followed by ReLU activations
• another fully connected layer with 4 units (the number of actions)
• a softmax activation (so the output is a proper distribution)
Initialize bias for each layer to zero. We recommend using a variance scaling kernel initializer

that draws samples from a uniform distribution over [−α,α] for α = p(3 ∗ scale/n) where scale = 1.0 and n is the average of the input and output units. HINT: Read the Keras documentation.
You can use the model.summary() and model.get config() calls to inspect the model architecture.
You can choose which optimizer and hyperparameters to use, so long as they work for learning on LunarLander-v2. We recommend using Adam as the optimizer. It will automatically adjust the learning rate based on the statistics of the gradients it’s observing. You can think of it like a fancier SGD with momentum. Keras provides a version of Adam https: //keras.io/optimizers/.
Train your implementation on the LunarLander-v2 environment until convergence . Be sure to keep training your policy for at least 1000 more episodes after it reaches 200 reward so that you are sure it consistently achieves 200 reward and so that this convergence is reflected in your graphs. Then, answer the following questions.
1. [10 pts] Describe your implementation, including the optimizer and any hyperparameters you used (learning rate, γ, etc.). Your description should be detailed enough that someone could reproduce your results.
2. [20 pts] Plot the learning curve: Every k episodes, freeze the current cloned policy and run 100 test episodes, recording the mean and standard deviation of the cumulative reward. Plot the mean cumulative reward on the y-axis with the standard deviation as error-bars against the number of training episodes on the x-axis. Write a paragraph or two describing your graph(s) and the learning behavior you observed. Be sure to address the following questions:
• What trends did you see in training?
• How does the final policy perform?
Hint: You can use matplotlib’s plt.errorbar() function. https://matplotlib.org/ api/_as_gen/matplotlib.pyplot.errorbar.html
Problem 2: Advantage-Actor Critic (40 pts)
In this section, you will implement N-step Advantage Actor Critic (A2C) [1]. Please write your code in a2c.py; the template code provided inside is there to give you an idea on how you can structure your code, but is not mandatory to use.
N-step A2C provides a balance between bootstraping using the value function and using the full Monte-Carlo return, using an N-step trace as the learning signal. See Algorithm ?? for details. N-step A2C includes both REINFORCE with baseline (N = ∞) and the 1-step A2C covered in lecture (N = 1) as special cases and is therefore a more general algorithm.
The critic updates the state-value parameters ω, and the actor updates the policy parameters θ in the direction suggested by the N-step trace.

Algorithm 2 N-step Advantage Actor-Critic
1: procedure N-Step Advantage Actor-Critic
2: Start with policy model πθ and value model Vω
3: repeat:
4: Generate an episode S0,A0,r0,…,ST−1,AT−1,rT−1 following πθ(·)
5: for t from T − 1 to 0:
6: Vend = 0 if (t + N ≥ T) else Vω(st+N)
7: else 0)
8:
9: 10: Optimize πθ using ∇L(θ)
11: Optimize Vω using ∇L(ω)
12: end procedure
Start off with the same policy architecture described in Problem 1 for both the actor and the critic. Play around with the network architecture of the critic’s state-value approximator to find one that works for LunarLander-v2. Once again, you can choose which optimizer and hyperparameters to use, so long as they work for learning on LunarLander-v2.
Answer the following questions:
1. [10 pts] Describe your implementation, including the optimizer, the critic’s network architecture, and any hyperparameters you used (learning rate, γ, etc.).
• What trends did you observe in training?
• How does the final policy perform?
• If you found A2C to be unstable or otherwise difficult to train, why might this be the case? What about the algorithm formulation could cause training instability, and what improvements might be made to improve it?
3. [10 pts] Discuss how the performance of your implementation of A2C compares with REINFORCE and how A2C’s performance varies with N. Which algorithm and N setting learns faster, and why do you think this is the case?
Extra credit (up to 15 pts)
A major bottleneck in training policy gradient algorithms is that only one episode (or batch of states) is generated at a time. However, once the policy has been updated once, the training data is no longer drawn from the current policy distribution, becoming “invalid” in a sense. A similar challenge occurs when parallelizing training, since once a parameter update is performed by one worker, the policy distribution changes and invalidates the data gathered and gradients computed by the other workers. Mnih et al. argue that the exploration noise from asynchronous policy updates can be beneficial to learning [2].
Then, implement multi-threaded synchronous Advantage Actor-Critic by gathering episode rollouts in parallel and performing a single gradient update. What speedup can you achieve? How might you measure this? Then, implement Asynchronous Advantage Actor-Critic (A3C) with multiple threads, using your multi-threaded synchronous Advantage Actor-Critic as a starting point. Do you see a learning speedup or increased stability compared to a synchronous implementation?
Guidelines on implementation
This homework requires a significant implementation effort. It is hard to read through the papers once and know immediately what you will need to be implement. We suggest you to think about the different components (e.g., model definition, model updater, model runner, …) that you will need to implement for each of the different methods that we ask you about, and then read through the papers having these components in mind. By this we mean that you should try to divide and implement small components with well-defined functionalities rather than try to implement everything at once. Much of the code and experimental setup is shared between the different methods so identifying well-defined reusable components will save you trouble.
Some hyperparameter and implementation tips and tricks:
• For efficiency, you should try to vectorize your code as much as possible and use as few loops as you can in your code. In particular, in lines 5 and 6 of Algorithm 1 (REINFORCE) and lines 5 to 7 of Algorithm 2 (A2C) you should not use two nested loops. How can you formulate a single loop to calculate the cumulative discounted rewards? Hint: Think backwards!
• Moreover, it is likely that it will take between 10K and 50K episodes for your model to converge, though you should see improvements within 5K episodes (about 30 minutes to one hour). On a NVIDIA GeForce GTX 1080 Ti GPU, it takes about five hours to run 50K training episodes with our REINFORCE implementation.
• For A2C, downscale the rewards by a factor of 1e-2 (i.e., divide by 100) when training (but not when plotting the learning curve) This will help with the optimization since the initial weights of the critic are far away from being able to predict a large range such as [−200,200]. You are welcome to try downscaling the rewards of REINFORCE as well.
• Likewise, batch normalization between layers can improve stability and convergence rate of both REINFORCE and A2C. Keras has a built-in batch normalization layer https://keras.io/layers/normalization/.
• We recommend using a discount factor of γ = 0.99.
• Instead of training one episode at a time, you can try generating a fixed number of steps in the environment, possibly encompassing several episodes, and training on such a batch instead.
References
[1] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In S. A. Solla, T. K. Leen, and K. Mu¨ller, editors, Advances in Neural Information Processing Systems 12, pages 1008–1014. MIT Press, 2000.
[3] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
[4] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv e-prints, page arXiv:1506.02438, Jun 2015.
[5] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[6] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine Learning, pages 229–256, 1992.

Reviews

There are no reviews yet.

Be the first to review “CMU10703 – Homework 3: Policy Gradient (Solution)”

Your email address will not be published. Required fields are marked *