Instructions: START HERE
• Submitting your work:
– Gradescope: Please write your answers and copy your plots into the provided LaTeX template, and upload a PDF to the Gradescope assignment titled “Homework 4.” Additionally, upload your code (as a zip file) to the Gradescope assignment titled “Homework 4: Code.” Each team should upload only one copy of each part. Regrade requests can be made within one week of the assignment being graded.
– Autolab: Autolab is not used for this assignment.
Environment
You are provided with a custom environment in 2Dpusher_env.py. To create the environment with gym, you can use the following code:
    import gym
    import envs
    env = gym.make('Pushing2D-v0')
The environment is considered “solved” once the success rate (i.e., the fraction of episodes in which the box reaches the goal within the episode) reaches 95%.
Problem 1: Deep Deterministic Policy Gradients (DDPG) [60+5 pts]
In this problem you will implement DDPG, an off-policy RL algorithm for continuous action spaces. In Homework 2 you implemented DQN, another off-policy RL algorithm. Like DQN, DDPG learns a Q-function. Recall that DQN chooses actions by computing the Q value for each action and then choosing the action with the maximum value:
\[ \pi(a \mid s) = \mathbb{1}\Big(a = \arg\max_{a'} Q(s, a')\Big) \]
While we would like to use this same policy in continuous action spaces, finding the optimal action involves solving an optimization problem. Since solving this optimization problem is expensive, you will amortize the cost of optimization by learning a policy that predicts the optimum. Intuitively, you will solve the following optimization problem:
\[ \max_{\theta^\mu} Q\big(s, a = \mu(s \mid \theta^\mu)\big) \]
Using TensorFlow/PyTorch, you can directly take the gradient of this objective w.r.t. the policy parameters, $\theta^\mu$. If you work this out by hand by applying the chain rule, you will get the same expression as in the algorithm shown in Figure 1. There are a few things to note:
1. You will learn an actor network with parameters $\theta^\mu$ and a critic network with parameters $\theta^Q$.
2. Similar to DQN, you will use a target network for both the actor and the critic. These target networks, with parameters $\{\theta^{Q'}, \theta^{\mu'}\}$, are slowly updated towards the trained weights.
Figure 1: DDPG algorithm presented by [3].
3. The algorithm requires a random process N to offset the deterministic actor policy. For this assignment, you can use an $\epsilon$-normal noise process: with probability $\epsilon$, you sample an action uniformly from the action space; otherwise, you sample from a normal distribution whose mean is given by your actor network and whose standard deviation is a hyperparameter.
4. There is a replay buffer R which can have a burn-in period, although this is not required to solve the environment.
5. The target values $y_i$ used to update the critic network are one-step TD backups, where the bootstrapped Q value uses the slow-moving target weights $\{\theta^{\mu'}, \theta^{Q'}\}$.
6. The update for the actor network differs from the traditional score function update used in vanilla policy gradient. Instead, DDPG uses information from the critic about how actions influence value estimates and pushes the policy in the direction that most increases the estimated reward.
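For reference, the updates described in the notes above can be written out as follows. This is a sketch in the notation of [3], not a replacement for Figure 1. The discount factor $\gamma$, the soft-update rate $\tau$, and the minibatch size $B$ do not appear in the text above; they are assumptions taken from the standard DDPG formulation ($B$ is used here instead of the paper's $N$ to avoid a clash with the noise process).

```latex
\begin{align*}
% One-step TD target, bootstrapped with the target actor and target critic
y_i &= r_i + \gamma\, Q'\big(s_{i+1},\, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big) \\
% Critic loss: mean squared TD error over the minibatch
L(\theta^Q) &= \frac{1}{B} \sum_i \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2 \\
% Actor gradient: chain rule through the critic
\nabla_{\theta^\mu} J &\approx \frac{1}{B} \sum_i
  \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\;
  \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s=s_i} \\
% Slow-moving target networks
\theta^{Q'} &\leftarrow \tau\, \theta^Q + (1-\tau)\, \theta^{Q'}, \qquad
\theta^{\mu'} \leftarrow \tau\, \theta^\mu + (1-\tau)\, \theta^{\mu'}
\end{align*}
```

The third line is what automatic differentiation computes when you maximize $Q(s, \mu(s \mid \theta^\mu))$ with respect to $\theta^\mu$ while holding the critic parameters fixed.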
To implement DDPG, we recommend following the steps below:
1. Create the actor and critic networks using algo/actornetwork.py and algo/criticnetwork.py, respectively. For this environment, a simple fully connected network with two layers should suffice. You can choose which optimizer and hyperparameters to use, so long as you are able to solve the environment. We recommend Adam as the optimizer, since it automatically adjusts the learning rate based on the statistics of the gradients it observes. You can check that the functions create_actor_network and create_critic_network build the networks you intended by printing the model architectures.
2. Connect the two implemented models in the DDPG method in the ddpg.py script.
3. In the file run.py, implement the main training and evaluation loops.
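Two small pieces that the training loop typically needs are the exploration noise and the slow target-network update. The sketch below shows one possible way to write them in plain Python on lists of floats. The function names, the `low`/`high` action bounds, the clipping step, and the rate `tau` are illustrative assumptions and are not part of the provided starter code.

```python
import random


def epsilon_normal_action(mean_action, low, high, epsilon, sigma, rng=random):
    """Sample an exploratory action (hypothetical helper, not starter code).

    With probability epsilon, draw uniformly from the box [low, high];
    otherwise add Gaussian noise with standard deviation sigma to the
    actor's output. Clipping to the bounds is an assumption made here.
    """
    if rng.random() < epsilon:
        return [rng.uniform(lo, hi) for lo, hi in zip(low, high)]
    noisy = [rng.gauss(m, sigma) for m in mean_action]
    return [min(max(a, lo), hi) for a, lo, hi in zip(noisy, low, high)]


def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau towards the online weight."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]
```

In a TensorFlow or PyTorch implementation the same soft update would be applied tensor by tensor to the actor and critic target networks after each gradient step.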
The file ReplayBuffer.py does not need to be changed. Generally, the places where we expect you to add code are marked with "NotImplementedError". For this part you don't need to modify the add_hindsight_replay_experience function. Train your implementation on the Pushing2D-v0 environment until convergence, and answer the following questions:
1. (15 pts) The neat trick of DDPG is that you can learn a policy by taking gradients of the Q function directly w.r.t. the policy parameters. An alternative approach that seems easier would be to directly take gradients of the cumulative reward w.r.t. the policy parameters, without having to learn a Q function. Why is this approach not feasible? Optional (0 pts): How could you make this approach feasible?
2. (10 pts) In 2-3 sentences, explain how you implemented the actor update.
3. (5 pts) Describe the hyperparameter settings that you used to train DDPG.
4. (20 pts) Plot the mean cumulative reward: Every k episodes, freeze the current cloned policy and run 10 test episodes, recording the mean and standard deviation of the cumulative reward. Plot the mean cumulative reward µ on the y-axis, with ±σ (one standard deviation) as error bars, against the number of training episodes on the x-axis. You don’t need to use the noise process N when testing. Hint: You can use matplotlib’s plt.errorbar() function. https://matplotlib.org/api/_as_gen/matplotlib.pyplot.errorbar.html
5. (10 pts) You might have noticed that the TD error is not a good predictor of whether your DDPG agent is learning. What other metric might you use to measure performance without collecting new transitions from the environment? Why is this a reasonable metric? Implement this metric, and then determine whether this metric is actually useful for predicting the agent’s performance.
6. (Extra credit: up to 5 pts) DDPG is known to overestimate the value of states and actions. A recent method, TD3 [2], proposes a correction that avoids this overestimation by learning two Q functions, $Q_1(s,a)$ and $Q_2(s,a)$, and then choosing actions according to the minimum (i.e., acting pessimistically):
\[ \max_a \min\big(Q_1(s,a),\, Q_2(s,a)\big) \]
Extend your DDPG implementation to implement TD3, and conduct an experiment to compare the two algorithms. Provide a plot comparing the two algorithms and write a few sentences explaining your results.
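To make the difference between the two critic targets concrete, the sketch below compares a single-critic target with a pessimistic two-critic target on scalar inputs. The function names, the `done` mask, and the default `gamma` are assumptions for illustration. The sketch covers only the minimum over two critics; TD3 as published also adds target policy smoothing and delayed actor updates, which are not shown.

```python
def ddpg_target(reward, done, q_next, gamma=0.99):
    """One-step TD target using a single target critic's estimate q_next."""
    return reward + gamma * (1.0 - float(done)) * q_next


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """One-step TD target using the smaller of two target critics' estimates."""
    return reward + gamma * (1.0 - float(done)) * min(q1_next, q2_next)
```

Whenever the two critics disagree, the second target is never larger than a single-critic target built from either estimate, which is the pessimism the question refers to.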
Problem 2: Hindsight Experience Replay (HER) [40+2 pts]
In this section, you will combine HER with DDPG to hopefully learn faster on the Pushing2D-v0 environment (see Figure 2 for the full algorithm). The motivation behind hindsight experience replay is that even if an episode did not successfully reach the goal, we can still use it to learn something useful about the environment. To do so, we turn a problem that usually has sparse rewards into one with less sparse rewards: we hallucinate different goal states that would hopefully have provided non-zero reward given the actions taken in an episode, and add the resulting transitions to the experience replay buffer.
To use HER in our setup, set hindsight=True in the train method. For this part, you will need to implement the add_hindsight_replay_experience function. To help you form new transitions to add to the replay buffer, the code for the Pushing2D-v0 environment provides a method, apply_hindsight, that computes the reward given a new goal state (r(·) in Figure 2) and modifies each state to set the goal to be the state actually reached.
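The relabeling idea can be sketched independently of the environment code. The example below is a minimal illustration under stated assumptions: transitions are plain dicts with hypothetical keys, the goal is replaced by the box position reached at the end of the episode, and the reward is a hypothetical sparse 0/-1 function. The real environment computes rewards and modified states through its own apply_hindsight method, whose signature is not shown in this handout.

```python
import math


def sparse_reward(achieved, goal, tol=0.1):
    """Hypothetical reward: 0 if the box is within tol of the goal, else -1."""
    return 0.0 if math.dist(achieved, goal) <= tol else -1.0


def relabel_with_final_goal(episode, reward_fn=sparse_reward):
    """Return hindsight copies of an episode's transitions.

    Each transition is a dict with (hypothetical) keys 'achieved', the box
    position after the step, 'goal', and 'reward'. The goal is replaced by
    the position reached at the end of the episode and the reward is
    recomputed for that goal. The original transitions are left untouched.
    """
    new_goal = episode[-1]["achieved"]
    relabeled = []
    for transition in episode:
        reward = reward_fn(transition["achieved"], new_goal)
        relabeled.append({**transition, "goal": new_goal, "reward": reward})
    return relabeled
```

The relabeled transitions would then be added to the replay buffer alongside the original ones, so that a failed episode still contributes at least one transition with a success reward.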
1. (5 pts) Describe the hyperparameter settings that you used to train DDPG with HER. Ideally, these should match the hyperparameters you used in Part 1 so we can isolate the impact of the HER component.
2. (15 pts) Plot the mean cumulative reward: Every k episodes, freeze the current cloned policy and run 100 test episodes, recording the mean and standard deviation of the cumulative reward. Plot the mean cumulative reward µ on the y-axis, with ±σ (one standard deviation) as error bars, against the number of training episodes on the x-axis. Do this on the same axes as the curve from Part 1 so that you can compare the two curves.
3. (5 pts) How does the learning curve for DDPG+HER compare to that for DDPG?
4. (10 pts) In the typical multi-goal RL setup, we are given a distribution over goals, $p(g)$. HER trains on a different distribution over goals, $\hat{p}(g)$. Mathematically define $\hat{p}(g)$. When will $\hat{p}(g)$ be very different from $p(g)$? Why might a big difference between $\hat{p}(g)$ and $p(g)$ be problematic?
5. (Extra credit: up to 2 pts) How might you solve this distribution shift problem?
6. (5 pts) What are settings where you cannot apply HER, or where applying HER would not speed up training?
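Both plotting questions need the same per-checkpoint statistics, so it may help to compute them in one place. The helper below is a sketch using only the standard library. Its name, its arguments, and the choice of the population standard deviation are assumptions, and the 95% threshold is the "solved" criterion stated in the Environment section.

```python
import statistics


def summarize_evaluation(returns, successes, solved_threshold=0.95):
    """Summarize one evaluation checkpoint.

    returns:   cumulative reward of each test episode
    successes: one bool per test episode (did the box reach the goal?)
    Returns (mean return, standard deviation, solved flag).
    """
    mean_return = statistics.fmean(returns)
    std_return = statistics.pstdev(returns)  # population std; an assumption
    success_rate = sum(successes) / len(successes)
    return mean_return, std_return, success_rate >= solved_threshold
```

Collecting the means and standard deviations over checkpoints gives the y and yerr arguments for plt.errorbar(), and calling it once for DDPG and once for DDPG+HER on the same axes produces the comparison plot.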
Figure 2: Hindsight Experience Replay [1].
General Advice
References
[1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. CoRR, abs/1707.01495, 2017.
[3] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.