Homework 6: Inference in Graphical Models, MDPs
Introduction
Please type your solutions after the corresponding problems using this LaTeX template, and start each problem on a new page.
Please submit the writeup PDF to the Gradescope assignment ‘HW6’. Remember to assign pages for each question.
Please submit your LaTeX file and code files to the Gradescope assignment ‘HW6 – Supplemental’.
You can use a maximum of 2 late days on this assignment. Late days will be counted based on the latest of your submissions.
Problem 1 (Explaining Away + Variable Elimination, 15 pts)
We have three binary variables: rain R, wet grass G, and sprinkler S. We assume the following factorization of the joint distribution:
Pr(R,S,G) = Pr(R)Pr(S)Pr(G|R,S).
The conditional probability tables look like the following:
Pr(R = 1) = 0.25
Pr(S = 1) = 0.5
Pr(G = 1 | R = 0, S = 0) = 0
Pr(G = 1 | R = 1, S = 0) = 0.75
Pr(G = 1 | R = 0, S = 1) = 0.75
Pr(G = 1 | R = 1, S = 1) = 1
1. Draw the graphical model corresponding to the factorization. Are R and S independent? [Feel free to use facts you have learned about studying independence in graphical models.]
2. You notice it is raining and check on the sprinkler without checking the grass. What is the probability that it is on?
3. You notice that the grass is wet and go to check on the sprinkler (without checking if it is raining). What is the probability that it is on?
4. You notice that it is raining and the grass is wet. You go check on the sprinkler. What is the probability that it is on?
5. What is the “explaining away” effect that is shown above?
Now suppose we introduce a new binary variable, cloudy C, in addition to the original three binary variables, so that the factorization of the joint distribution becomes:
Pr(C,R,S,G) = Pr(C)Pr(R|C)Pr(S|C)Pr(G|R,S).
6. For the marginal distribution Pr(R), write down the variable elimination expression with the elimination ordering S,G,C (where S is eliminated first, then G, then C).
7. For the marginal distribution Pr(R), write down the variable elimination expression with the elimination ordering C,G,S.
8. Give the computational complexity of each elimination ordering. Which ordering takes less computation?
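The elimination expressions in parts 6–8 can be sanity-checked numerically by brute force. Below is a minimal Python sketch that sums the joint over all assignments of the other variables to recover Pr(R = 1); the values used for Pr(C), Pr(R | C), and Pr(S | C) are hypothetical placeholders (the problem does not specify them), while Pr(G | R, S) is taken from the table above.

import itertools

# Hypothetical CPTs: Pr(C), Pr(R|C), Pr(S|C) are placeholders, not given in the
# problem; Pr(G|R,S) matches the table above.
p_c1 = 0.5                                               # Pr(C = 1)
p_r1_given_c = {0: 0.1, 1: 0.8}                          # Pr(R = 1 | C = c)
p_s1_given_c = {0: 0.5, 1: 0.1}                          # Pr(S = 1 | C = c)
p_g1_given_rs = {(0, 0): 0.0, (1, 0): 0.75,
                 (0, 1): 0.75, (1, 1): 1.0}              # Pr(G = 1 | R = r, S = s)

def bern(p1, x):
    """Probability of binary outcome x when Pr(X = 1) = p1."""
    return p1 if x == 1 else 1.0 - p1

def joint(c, r, s, g):
    """Pr(C, R, S, G) under the factorization Pr(C) Pr(R|C) Pr(S|C) Pr(G|R,S)."""
    return (bern(p_c1, c) * bern(p_r1_given_c[c], r)
            * bern(p_s1_given_c[c], s) * bern(p_g1_given_rs[(r, s)], g))

# Brute-force marginal Pr(R = 1): sum the joint over C, S, and G.
pr_r1 = sum(joint(c, 1, s, g) for c, s, g in itertools.product([0, 1], repeat=3))
print(pr_r1)  # should agree with the value your elimination expression produces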
Solution:
Problem 2 (Policy and Value Iteration, 15 pts)
This question asks you to implement policy and value iteration in a simple environment called Gridworld. The “states” in Gridworld are locations in a two-dimensional grid; each state and its reward are shown in the accompanying figure.
Note that the agent does not receive the reward of a state immediately upon entry, but only after it takes an action at that state. For example, if the agent moves right four times (deterministically, with no chance of slipping), the rewards would be +0, +0, -50, +0, and the agent would then reside in the +50 state. Regardless of what action the agent takes there, the next reward would be +50.
Your job is to implement the following three methods in the file T6_P2.ipynb. Please use the provided helper functions get_reward and get_transition_prob to implement your solution.
Embed all plots in your writeup.
Important: The state space is represented using integers, which range from 0 (the top left) to 19 (the bottom right). Therefore both the policy pi and the value function V are 1-dimensional arrays of length num_states = 20. Your policy and value iteration methods should implement only one update step of the iteration – they will be repeatedly called by the provided learn_strategy method to learn and display the optimal policy. You can change the number of iterations for which your code is run and displayed by changing the max_iter and print_every parameters of the learn_strategy function calls at the end of the code.
Note that we are doing infinite-horizon planning to maximize the expected reward of the traveling agent. For parts 1-3, set discount factor γ = 0.7.
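For reference, under the reward convention described above (the reward r(s) of the current state is received when an action is taken at that state, regardless of which action), the updates in the parts below are based on the Bellman equations

V^π(s) = r(s) + γ Σ_{s′} P(s′ | s, π(s)) V^π(s′)  and  V*(s) = r(s) + γ max_a Σ_{s′} P(s′ | s, a) V*(s′).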
1a. Implement the function policy_evaluation. Your solution should learn the value function V, either using a closed-form expression or iteratively using convergence tolerance θ = 0.0001 (i.e., if V^(t) denotes V on the t-th iteration of your policy-evaluation procedure, terminate and return V^(t+1) once |V^(t+1)[s] − V^(t)[s]| ≤ θ for all s). An illustrative sketch of one such update appears after part 5 below.
1b. Implement the function update_policy_iteration to update the policy pi given a value function V using one step of policy iteration.
1c. Set max_iter = 4, print_every = 1 to show the learned value function and the associated policy for the first 4 policy iterations. Do not modify the plotting code. Please fit all 4 plots onto one page of your writeup.
1d. Set the convergence tolerance ct = 0.01 and increase max_iter so that the algorithm converges. Include a plot of the final learned value function and policy. How many iterations does it take to converge? Now try ct = 0.001 and ct = 0.0001. How does this affect the number of iterations until convergence?
2a. Implement the function update_value_iteration, which performs one step of value iteration to update V and pi.
2b. Set max_iter = 4, print_every = 1 to show the learned value function and the associated policy for the first 4 value iterations. Do not modify the plotting code. Please fit all 4 plots onto one page of your writeup.
2c. Set the convergence tolerance ct = 0.01 and increase max_iter so that the algorithm converges. Include a plot of the final learned value function and policy. How many iterations does it take to converge? Now try ct = 0.001 and ct = 0.0001. How does this affect the number of iterations until convergence?
3. Compare and contrast the number of iterations, time per iteration, and overall runtime of policy iteration and value iteration. What do you notice?
4. Plot the learned policy for each of γ ∈ {0.6, 0.7, 0.8, 0.9}. Include all 4 plots in your writeup. Describe what you see and explain the differences in the observed policies. Also discuss the effect of γ on the runtime of both policy iteration and value iteration.
5. Now suppose that the game ends at any state with a positive reward, i.e., entering such a state immediately transitions you to a new state with zero reward from which you cannot transition away. What do you expect the optimal policy to look like as a function of γ? Numerical answers are not required; intuition is sufficient.
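The exact signatures of the provided helpers are defined in T6_P2.ipynb; the sketch below assumes (hypothetically) that get_reward(s) returns r(s), that get_transition_prob(s, a, sp) returns P(sp | s, a), and that there are num_actions = 4 moves. It illustrates one synchronous policy-evaluation sweep and one value-iteration backup under the Bellman equations above; it is a rough illustration under those assumptions, not the required implementation, and the placeholder stand-ins at the top are only there to make the snippet self-contained.

import numpy as np

num_states, num_actions, gamma = 20, 4, 0.7

# Placeholder stand-ins so the sketch runs on its own; in the assignment, use the
# helpers provided in T6_P2.ipynb instead (their real signatures may differ).
def get_reward(s):
    return 0.0                          # r(s)

def get_transition_prob(s, a, sp):
    return 1.0 / num_states             # P(sp | s, a); here a uniform dummy

def policy_evaluation_sweep(pi, V):
    """One synchronous sweep of iterative policy evaluation for policy pi."""
    V_new = np.zeros(num_states)
    for s in range(num_states):
        a = pi[s]
        V_new[s] = get_reward(s) + gamma * sum(
            get_transition_prob(s, a, sp) * V[sp] for sp in range(num_states))
    return V_new

def value_iteration_backup(V):
    """One synchronous Bellman-optimality backup; returns updated V and greedy pi."""
    Q = np.zeros((num_states, num_actions))
    for s in range(num_states):
        for a in range(num_actions):
            Q[s, a] = get_reward(s) + gamma * sum(
                get_transition_prob(s, a, sp) * V[sp] for sp in range(num_states))
    return Q.max(axis=1), Q.argmax(axis=1)

V = np.zeros(num_states)
pi = np.zeros(num_states, dtype=int)
V = policy_evaluation_sweep(pi, V)      # one evaluation sweep under pi
V, pi = value_iteration_backup(V)       # one value-iteration step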
Solution:
Problem 3 (Reinforcement Learning, 20 pts)
In 2013, the mobile game Flappy Bird took the world by storm. You’ll be developing a Q-learning agent to play a similar game, Swingy Monkey (see Figure 1a). In this game, you control a monkey that is trying to swing on vines and avoid tree trunks. You can either make him jump to a new vine or have him swing down on the vine he’s currently holding. You get points for successfully passing tree trunks while avoiding hitting them, falling off the bottom of the screen, or jumping off the top. There are several sources of randomness: some of the monkey’s jumps are higher than others, the gaps in the trees vary vertically, the gravity varies from game to game, and the distances between the trees differ. You can play the game directly by pushing a key on the keyboard to make the monkey jump. However, your objective is to build an agent that learns to play on its own.
You will need to install the pygame module (http://www.pygame.org/wiki/GettingStarted).
Task: Your task is to use Q-learning to find a policy for the monkey that can navigate the trees. The implementation of the game itself is in file SwingyMonkey.py, along with a few files in the res/ directory. A file called stub.py is the starter code for setting up your learner that interacts with the game. This is the only file you need to modify (but to speed up testing, you can comment out the animation rendering code in SwingyMonkey.py). You can watch a YouTube video of the staff Q-Learner playing the game at http://youtu.be/l4QjPr1uCac. It figures out a reasonable policy in a few dozen iterations. You’ll be responsible for implementing the Python function action_callback. The action callback will take in a dictionary that describes the current state of the game and return an action for the next time step. This will be a binary action, where 0 means to swing downward and 1 means to jump up. The dictionary you get for the state looks like this:
{ 'score': <current score>,
  'tree': { 'dist': <pixels to next tree trunk>,
            'top': <height of top of tree trunk gap>,
            'bot': <height of bottom of tree trunk gap> },
  'monkey': { 'vel': <current monkey y-axis speed>,
              'top': <height of top of monkey>,
              'bot': <height of bottom of monkey> } }
All of the units here (except score) will be in screen pixels. Figure 1b shows these graphically. Note that since the state space is very large (effectively continuous), the monkey’s relative position needs to be discretized into bins. The pre-defined function discretize_state does this for you.
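You do not need to write the discretization yourself, since discretize_state is provided; the toy snippet below only illustrates the general idea of binning a continuous pixel measurement, using a hypothetical bin width of 50 pixels.

def to_bin(value, bin_width=50):
    """Map a continuous pixel measurement to an integer bin index (illustration
    only; the provided discretize_state already handles the full state)."""
    return int(value // bin_width)

print(to_bin(130))   # -> 2: a 130-pixel offset falls into bin 2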
Requirements
Code: First, you should implement Q-learning with an ϵ-greedy policy yourself. You can improve performance by trying out different values for the learning rate α, discount rate γ, and exploration rate ϵ. Do not use outside RL code for this assignment. Second, you should use a method of your choice to further improve performance. This could be inferring gravity at each epoch (the gravity varies from game to game), updating the reward function, trying a decaying ϵ-greedy schedule, changing the features in the state space, and more. One of our staff solutions got scores over 800 before the 100th epoch, but you are only expected to reach scores over 50 before the 100th epoch. Make sure to turn in your code!
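As a starting point, a minimal tabular Q-learning sketch with ϵ-greedy action selection is shown below. It assumes states have already been mapped to hashable keys (e.g., via discretize_state), uses placeholder values for α, γ, and ϵ, and is a generic skeleton rather than the stub.py interface.

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.05        # placeholder hyperparameters
Q = defaultdict(lambda: [0.0, 0.0])           # Q[s] = [Q(s, swing=0), Q(s, jump=1)]

def choose_action(s):
    """Epsilon-greedy selection over the two actions (0 = swing down, 1 = jump)."""
    if random.random() < epsilon:
        return random.randint(0, 1)
    return 0 if Q[s][0] >= Q[s][1] else 1

def q_update(s, a, r, s_next):
    """Tabular update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])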
Evaluation: In 1–2 paragraphs, explain how your agent performed and what decisions you made and why. Make sure to provide evidence where necessary to support your decisions. You must include in your write-up at least one plot or table detailing the performance of the parameters you tried (e.g., plots of score vs. epoch number for different parameter settings).
Figure 1: (a) Screenshot of the Swingy Monkey game. (b) Interpretation of the various pieces of the state dictionary.
Solution:
Whom did you work with, and did you use any resources beyond cs181-textbook and your notes?
Calibration
Approximately how long did this homework take you to complete (in hours)?