Description
Instructions: START HERE
• Submitting your work:
– Gradescope: Please write your answers and copy your plots into the provided LaTeX template, and upload a PDF to the GradeScope assignment titled “Homework 6.” Additionally, upload your code (as a zip file) to the GradeScope assignment titled “Homework 6: Code.” Each team should only upload one copy of each part. Regrade requests can be made within one week of the assignment being graded.
– Autolab: Autolab is not used for this assignment.
Problem 1: Crowd-Sourcing Quiz Questions (16 pts)
This problem is aimed at helping you prepare for the Quiz 2. Propose four great quiz questions that cover material from the entire course. Please submit your four questions and answers to each using the following form (once per question):
https://forms.gle/xKqhqaxNaXHYRS2c7 You can view the proposed questions from all teams here: https://docs.google.com/spreadsheets/d/1lNGqHV7TQ7P-3RX42UM0mHWZ-QmKTQ5msSAOy2vQNeg
This question will be graded based on completion, not content, but we nonetheless encourage you to think carefully about it, as it will help you prepare for the quiz. Try your best to make sure that your proposed questions do not repeat any questions that have already been proposed. We recommend doing this problem early, before most of the “good” questions have been proposed.
Problem 2: Types of Uncertainty (25 pts)
This problem studies the two types of uncertainty and how they can be captured. Understanding these two types of uncertainty is useful for designing exploration algorithms and model-based RL algorithms.
2.1 Combined Variance (10 pts) As a warmup, prove the following identity:
Var(y) = Var(E[y | x]) + E[Var(y | x)], (1)
where x and y are random variables. Note what this identity says: the uncertainty over one random variable can be factored into two constituent terms that capture uncertainty.
2.2 Combined Variance and Dynamics Models (15 pts) Now, consider learning a dynamics model of the environment (i.e., a “forward” model): f : S × A → S. Our model fθ will have parameters θ, which we will treat as a random variable. After learning this dynamics model, we will care about the uncertainty over our predictions: Var[st+1 | st,at]. Parts 2.2.2, 2.2.3, and 2.2.4 can all be answered independently.
2.2.1 (6 pts) Use the identity from above to write this uncertainty as a combination of aleatoric and epistemic uncertainty. Hint: you will use Var[st+1 | st,at] as the left hand side of Equation 1.
2.2.2 (3 pts) In practice, how would you estimate each of the terms?
2.2.3 (3 pts) How is this decomposition related to model-based RL (e.g., [2])?
2.2.4 (3 pts) How is this decomposition related to exploration bonuses (e.g., [4,
3])?
Problem 3: Bayesian Neural Networks (30 pts)
The aim of this problem is to understand how we can capture uncertainty using Bayesian Neural Networks, and how these types of networks can be trained. We will use p(y | x) to represent the true conditional probability distribution. Using θ to represent the parameters of our neural network, we will define q(y | x,θ) as the corresponding conditional probability distribution of this neural network. To make this neural network Bayesian we will learn a distribution over parameters, q(θ), rather than learning a fixed parameter θ.
3.1 Uncertainty in BNNs (5 pts) Write the mean and variance of the network’s predictions y for a fixed input x, given the parameter distribution q(θ). This question shouldn’t take more than a few lines. You are welcome to write your answer in terms of expectations, rather than integrals. Problem 2.1.1 might be helpful for writing the variance.
3.2 An Objective for BNNs (5 pts) For learning our BNN, we will use qφ(θ) as our distribution over the parameters of the network, where φ is the parameters of the distribution. We can obtain the conditional distribution of y given x under our BNN by marginalizing over the parameters θ:
Z qφ(y | x) = q(y | x,θ)qφ(θ)dθ = Eθ∼qφ(θ)[q(y | x,θ)]. (2)
We will attempt to learn φ such that the BNN distribution qφ(y | x) equals the true distribution, p(y | x). To do this, we will maximize the log likelihood of data sampled from the true distribution:
maxEx,y∼p [logqφ(y | x)]. (3) φ
Relate the objective (on the LHS to a KL divergence.((((((
3.3 REINFORCE for BNNs (10 pts) Derive an algorithm for optimizing Equation 3 that uses a likelihood-ratio gradient (i.e., the REINFORCE trick [5]).
3.4 Variational Inference for BNNs (10 pts) Derive an algorithm for optimizing a lower bound on Equation 3. Hint: try substituting Equation 2 into Equation 3 and applying Jensen’s inequality.
Problem 4: LQR and iLQR (29 pts)
In this problem you will implement Linear Quadratic Regulation (LQR) and iterative Linear Quadratic Regulator (iLQR). We have provided you a coding template; ou do not have to use the provided code. You should be able to run all parts of this problem on a laptop (no need for AWS or GPUs).
Environment: You will be controlling a simple 2-link planar arm. It is based on the OpenAI gym API with a couple of additions you will need to approximate the dynamics. The environment comes with rendering code so that you can visually see what the arm is doing. The environments themselves define a cost matrix Q and a cost matrix R. Use these when calculating your trajectories. The rewards returned by the environment are computed using the LQR cost function. The step function includes an additional argument dt. When calculating the finite differences your dt will be much smaller than the dt that the simulator normally steps at when calling step. So when executing a command you should just use step with an action. When you are trying to approximate the dynamics using finite differences you should use step and override the dt argument as well. You can also explicitly set the state of this simulator using the state attribute. You will need this when doing finite differences. Just set this attribute equal to the q and ˙q values you want before calling step.
4.1 LQR (12pts) Implement LQR. Test your LQR implementation on the TwoLinkArm-v0 environment. Record the total reward and number of steps to reach the goal. Also plot q, ˙q, and your control inputs u.
4.2 iLQR [12pts] Implement iLQR. Test your iLQR implementation on the TwoLinkArm-v0 environment. Plot the total cost (intermediate cost + final cost) respect to iterations and record the total reward. Also plot q, ˙q, your control inputs u.
4.3 Comparing LQR and iLQR [5pts] In 2-3 sentences, describe how LQR and iLQR performed similarly/differently in your experiments, and use your knowledge of LQR/iLQR to suggest why they performed similarly/differently.
Hyperparameters: Set number of control steps to T = 100. The intermediate cost function of intermediate control steps 1···T − 1 is kuk2, where u is the control parameters. The final cost function for the final step is 104 × kxT − x∗k2, where xT is the final state generated from your algorithm and x∗ is the target state. Set the maximum number of optimization iterations to be 105. You can add any heuristic stopping criteria to early stop if the algorithm converges.
Suggestions: iLQR will take a lot longer to run than LQR, but you should still be able to run it on a regular laptop. If you face the numerical issue when you are doing the inverse operation, try to add λI to make the matrix full-rank. You are free to use any λ > 0 (e.g., λ = 1). We recommend using simple tasks for debugging.
References
[1] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
[3] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
[4] Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
[5] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
Reviews
There are no reviews yet.