CS 234 Midterm – Winter 2017-18 (Solution)

Do not turn this page until you are instructed to do so.
Instructions
3. While the faculty alone has the right and obligation to set academic requirements, the students and faculty will work together to establish optimal conditions for honorable academic work.
Signature:
Name:
SUNet ID:
Grading (For Midterm Graders Only)
Question # 1 2 3 4 5 6
Maximum Points 15 14 10 10 8 15
Student Grade
Question 1 True/False and Short answer [15pts]
In the questions below, V∗(s) is the optimal value function in state s, π is a policy, and V π is its value function. We use the subscript t to indicate their dependence on the timestep for finite horizon MDPs. We denote by H the horizon of the MDP.
A) Circle True or False. Read each statement completely before answering. [1 pt each]
1. True False In a discounted infinite horizon MDP, V∗(s) ≥ V π(s) for all states s and policies π.
2. True False In an undiscounted finite horizon MDP, V∗t(s) ≥ V πt(s) for all states s, policies π and timesteps 1 ≤ t ≤ H.
3. True False Suppose an MDP M = (S, A, P, R, γ) with finite state space, finite action space, finite rewards and γ = 1 contains a terminal state S_end that ends the episode, such that for every state s and action a there is a positive (non-zero) probability of reaching S_end. Then the value function V π of every policy π must be finite in all states s.
4. True False For a finite horizon MDP, we require at most H + 1 iterations of value iteration to compute the optimal policy.
5. True False For a given MDP, the optimal policy π∗ is always unique.
7. True False The Universal Approximation Theorem (Hornik, 1991) guarantees that, for a neural network with one hidden layer and a linear output unit, any RL algorithm can converge to the optimal parameter values given enough hidden units.
B) For the next few questions, answer in 2–3 sentences. [2 pts each]
1. In an infinite horizon MDP with S states and A actions, how many deterministic policies are there? How many stochastic policies are there?
2. Is the REINFORCE algorithm guaranteed to converge to the optimal policy if the policy class can represent the optimal policy? Explain briefly.
3. Under what conditions does SARSA for finite-state and finite-action MDPs converge to the optimal action values?
4. State the principle of maximum entropy in the context of inverse reinforcement learning.
Question 2 Commuting from Stanford [14 pts]
(With thanks to Andrew McCallum’s PhD thesis for the underlying problem.)
Every weekend you travel from Stanford to either San Francisco or San Jose, and return to Stanford at the end of the weekend. Since this is a repeated activity, we want to model it as an infinite horizon process. The goal is to figure out an optimal policy for how to commute from each location.
This can be represented exactly as a 3 state non-episodic (infinite horizon) Markov decision process, with states S1, S2 and S3, and actions bus and train, as shown in the figure below:

Taking either action, bus or train, from S1 or S2 takes you to state S3. Taking the bus from S3 takes you to S1, while taking the train from S3 takes you to S2.
a) If S1 represents San Francisco and S3 represents Stanford, add a short description of state S2.
[2 pts]
b) All dynamics and rewards in this 3 state MDP are deterministic. The true (but unknown) rewards are specified by the following table:
Rewards bus train
S1 +0.7 -1.0
S2 +1.0 -1.3
S3 -0.5 -0.7
The optimal non-episodic Q-values with γ = 0.9 were computed for you and are in the table below:
Q-values bus train
S1 +1.64 -0.06
S2 +1.94 -0.36
S3 +0.96 +1.03
What is the optimal policy? [2 pts]
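
Not part of the original exam, but a quick way to sanity-check the table above: the sketch below runs Q-value iteration on the 3-state commute MDP (deterministic dynamics and rewards as given, γ = 0.9). The state/action indexing and variable names are my own.

```python
import numpy as np

# Hypothetical sanity check (not part of the exam): Q-value iteration on the
# 3-state commute MDP above. States: 0 = S1, 1 = S2, 2 = S3; actions:
# 0 = bus, 1 = train. Dynamics and rewards are deterministic.
next_state = np.array([[2, 2],    # from S1, bus or train -> S3
                       [2, 2],    # from S2, bus or train -> S3
                       [0, 1]])   # from S3, bus -> S1, train -> S2
reward = np.array([[ 0.7, -1.0],
                   [ 1.0, -1.3],
                   [-0.5, -0.7]])
gamma = 0.9

Q = np.zeros((3, 2))
for _ in range(1000):                      # plenty of backups to converge
    V = Q.max(axis=1)                      # greedy value of each state
    Q = reward + gamma * V[next_state]     # deterministic Bellman backup

print(np.round(Q, 2))       # should be close to the Q-value table above
print(Q.argmax(axis=1))     # 0 = bus, 1 = train; the greedy (optimal) policy
```

The computed values should land close to the table above (small discrepancies come down to rounding and how many backups were run before the table was produced), and the final argmax gives the greedy action in each state.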
c) You now notice that a simpler state representation is sufficient to represent the optimal policy you wrote down in part b). Namely, you can simply distinguish between being at Stanford and not being at Stanford. Call the Not-Stanford state S12. The actions are still the same, i.e. to take the bus or the train, and both are available from S12 and S3 in this combined representation. Note that in this new representation taking either bus or train from S3 takes you to S12 and vice versa.

Can this representation be used to express the optimal policy in part b)? [2 pts]
d) We now explore whether we could have learned the optimal policy in this simpler state representation. You run ε-greedy Q-learning in the true non-episodic environment (representable with 3 states) but using the 2 state representation, with γ = 0.9 (same as in part b), for 10^9 (essentially infinite) time steps in the 2-state 2-action tabular setting, as specified in part c, and you get the following Q∗:
Q∗ bus train
S12 +2.08 0.08
S3 +1.38 1.18
What is the optimal policy given these Q-values? Why does the policy differ from the policy computed using the true 3 state representation? [4 pts]
e) There are 4 possible deterministic policies for this 2-state 2-action problem. Consider running Monte Carlo policy evaluation in the true non-episodic environment (representable with 3 states) but using the 2 state representation, with γ = 0.9 (same as above), on each of the 4 policies and then selecting the one with the highest value. Would this find the same policy as c) or d), or a different one? Why? [4 pts]
Question 3 Value Iteration
Non-Monotonic Convergence of Value Iteration [10 pts]
a) Consider an MDP M = (S, A, P, R, γ) with finite state and action spaces, and γ = 1. Let V∗(s) be the optimal value function in state s. As you learned in class, value iteration produces iterates V1(s), V2(s), V3(s), … that eventually converge to V∗(s). However, convergence need not be monotonic in all states. Find an example of an MDP such that, if you run value iteration, the error |V∗(s) − Vk(s)| increases in at least one state s when moving from the k-th to the (k+1)-th iteration. Your value function must be initialized to zero, and your MDP must have exactly 2 states and exactly 2 actions in each state. The rewards that you specify must be either -2, -1 or 3. Draw the MDP. [8 pts]
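
The question asks you to construct the MDP yourself; the helper below is only a hedged aid for checking a candidate construction, not an answer. It runs value iteration from V = 0 on a deterministic MDP and reports |V∗(s) − Vk(s)| per state per iteration, approximating V∗ by the final iterate. The placeholder 2-state MDP at the bottom is an assumption of mine, there purely to make the snippet runnable; it does not exhibit the required non-monotonicity.

```python
import numpy as np

def value_iteration_errors(next_state, reward, gamma=1.0, n_iter=20):
    """Run value iteration from V = 0 on a deterministic MDP and return
    |V* - V_k| for every state at every iteration, approximating V* by the
    final iterate (so use enough iterations for the values to settle)."""
    V = np.zeros(next_state.shape[0])
    iterates = [V.copy()]
    for _ in range(n_iter):
        V = (reward + gamma * V[next_state]).max(axis=1)   # Bellman backup
        iterates.append(V.copy())
    V_star = iterates[-1]
    return np.array([np.abs(V_star - Vk) for Vk in iterates])

# Placeholder 2-state, 2-action MDP purely to make the snippet runnable
# (state 1 is absorbing with zero reward, so the iterates settle even with
# gamma = 1). It does NOT show the non-monotonic behaviour the question asks
# for; plug in your own construction with rewards in {-2, -1, 3}.
next_state = np.array([[1, 1],
                       [1, 1]])
reward = np.array([[-1.0, 3.0],
                   [ 0.0, 0.0]])
print(value_iteration_errors(next_state, reward))  # one row per iteration k
```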
Question 4 MC Update vs TD Update [10 pts]
A) Monte Carlo update
a) Write down the formula for an MC update of a Q-value with a lookup table representation. [2 pts]
b) Identify which parts (if any) of the equation involve bootstrapping and/or sampling. [2 pts]
c) What is the computational complexity of performing MC updates for an entire episode? [2 pts]
B) Temporal Difference update
a) Write down the formula for a TD update of a Q-value with a lookup table representation. [2 pts]
b) Identify which parts (if any) of the equation involve bootstrapping and/or sampling. [2 pts]
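
For concreteness, here is a hedged sketch (not the graded answer) of what the two updates look like with a lookup-table Q: an incremental every-visit Monte Carlo update, whose target is the sampled return, and a SARSA-style TD(0) update, which samples one transition and bootstraps off the current estimate. All names (Q, N, gamma, alpha, mc_update, td_update) are my own.

```python
from collections import defaultdict

# Illustrative sketch (not the graded answer): a tabular incremental
# every-visit Monte Carlo update and a SARSA-style TD(0) update.
Q = defaultdict(float)   # Q[(state, action)] -> current estimate
N = defaultdict(int)     # visit counts for the MC running average
gamma, alpha = 1.0, 0.1

def mc_update(episode):
    """episode: list of (state, action, reward) tuples for one full episode.
    The target is the sampled return G_t: sampling, no bootstrapping."""
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G                           # return from (s, a)
        N[(state, action)] += 1
        Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]

def td_update(s, a, r, s_next, a_next):
    """One-step TD target r + gamma * Q[(s', a')]: samples one transition
    and bootstraps off the current Q estimate of the next state-action."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```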
Question 5 Value Function Approximation [8 pts]
A) Q-learning with VFA
Consider executing Q-learning using a linear value function approximation. Computing the greedy policy with respect to the current Q does not change the policy (no state would have a different action). Does this imply that the optimal policy has been found? Choose between:
• Yes
• No
• Depends
Provide 1-3 sentences to justify your answer. [4 pts]
B) Functional capacity
You are designing a deep RL system for a new consumer modeling problem. You are choosing between 3 neural network architectures
• S2. A fully connected neural network with an input layer, 2 nodes in the hidden layer, and an output layer.
• S100. A fully connected neural network with an input layer, 100 nodes in the hidden layer, and an output layer.
• D10. A fully connected deep neural network with an input layer, 10 hidden layers each with 10 nodes, and an output layer.
Assume that the input layer consists of 10 nodes, with each node representing one feature of a consumer, and that the output layer is a single node. Except for the input layer, assume that the activation function for each node in the network is a sigmoid. You do not need to consider dropout or batch normalization for this problem.
Which network would you pick, and why? Provide a short explanation. How does the number of trainable parameters vary across the 3 networks? You do not need to compute the exact number of parameters for each network; just do an order-of-magnitude comparison. [4 pts]
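
As a rough check (not part of the exam), the snippet below counts weights and biases for the three architectures under the stated assumptions of 10 input nodes, 1 output node, and fully connected layers with biases; the helper name dense_params is my own.

```python
# Rough parameter count (not part of the exam) under the stated assumptions.
def dense_params(layer_sizes):
    """Weights + biases of a fully connected net with the given layer widths
    (input and output included), counted layer pair by layer pair."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print("S2:  ", dense_params([10, 2, 1]))              # 25   (~1e1)
print("S100:", dense_params([10, 100, 1]))            # 1201 (~1e3)
print("D10: ", dense_params([10] + [10] * 10 + [1]))  # 1111 (~1e3)
```

Under these assumptions, S2 lands on the order of tens of parameters, while S100 and D10 both land on the order of a thousand.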
Question 6 Alice in Wonderland [15 pts]
Alice is taking CS234 and has just learned about Q-values. She is trying to explore a large finite-horizon MDP with γ = 1. The transitions are deterministic and Q_{H+1}(s,a) = 0 for all s, a. To help her with her MDP, you tell her the optimal policy π∗(s,t), defined in every state s and timestep t, that Alice should follow to maximize her reward. Denote with Q∗t(s,a) the Q-values of the optimal policy upon taking action a in state s at timestep t.
A) First Step Error
In the first timestep t = 1, Alice is in state s1 and chooses action a, which is suboptimal. If she then follows the optimal policy from t = 2 until the end of the episode, what is the value of this policy compared to the optimal one? Express your result using only the optimal Q-values Q∗. [3 pts]
B) Relating the Value Function to the Q-Values
Unfortunately, Alice does not seem to follow π∗ at all. Instead, she keeps following a policy π until the end of the episode. Denote with V1∗(s1) and V1π(s1) the value functions of policies π∗ and π, respectively, at the start state s1 in the initial timestep t = 1. Show that the loss V1∗(s1) − V1π(s1) can be computed using the expression below:
V1∗(s1) − V1π(s1) = Σ_{t=1}^{H} [ Q∗t(st, π∗(st,t)) − Q∗t(st, π(st,t)) ]    (1)
where s1,s2,s3,… are the deterministic states that Alice encounters upon following policy π. Hint: Use induction. [6 pts]
After getting low reward, Alice decides to improve her policy. She will try to infer the best policy using Q̃, a close approximation to the optimal Q-values Q∗. In particular, for every state s, action a and timestep t we have that |Q∗t(s,a) − Q̃t(s,a)| ≤ ε. Suppose that Alice chooses her policy as π̃(s,t) = argmax_a Q̃t(s,a). Show that the loss can be bounded by the expression below:

V1∗(s1) − V1π̃(s1) ≤ 2εH
Hint: Use equation (1) from part B. [6 pts]
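
One way to combine equation (1) with the ε-closeness assumption is sketched below; this is my own outline of the standard argument, not the official solution, and it assumes the intended bound is the 2εH one stated above.

```latex
% Sketch of the standard argument (not the official solution).
% Bound each term of equation (1) using |Q*_t(s,a) - \tilde{Q}_t(s,a)| <= eps
% and the fact that \tilde{\pi} is greedy with respect to \tilde{Q}.
\begin{align*}
Q^*_t(s_t,\pi^*(s_t,t)) - Q^*_t(s_t,\tilde{\pi}(s_t,t))
  &\le \bigl(\tilde{Q}_t(s_t,\pi^*(s_t,t)) + \epsilon\bigr)
     - \bigl(\tilde{Q}_t(s_t,\tilde{\pi}(s_t,t)) - \epsilon\bigr) \\
  &\le 2\epsilon,
\end{align*}
% since \tilde{Q}_t(s_t,\tilde{\pi}(s_t,t)) \ge \tilde{Q}_t(s_t,\pi^*(s_t,t))
% by the greedy choice of \tilde{\pi}. Summing the H terms of equation (1):
\[
  V_1^*(s_1) - V_1^{\tilde{\pi}}(s_1) \;\le\; 2\epsilon H .
\]
```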
