CS234
(Do not turn this page until you are instructed to do so!)

Problem   #1   #2   #3   #4   #5   #6   Total
Maximum   10   12    6   24   16   12      80
3. While the faculty alone has the right and obligation to set academic requirements, the students and faculty will work together to establish optimal conditions for honorable academic work.
Signature:
Name:
SUNetID:
1 Markov Decision Process (10 Points)
You are in a Las Vegas casino! You have $20 for this casino venture and will play until you lose it all or as soon as you double your money (i.e., increase your holdings to at least $40). You can choose between two slot machines: 1) slot machine A costs $10 to play and will return $20 with probability 0.05 and $0 otherwise; and 2) slot machine B costs $20 to play and will return $30 with probability 0.01 and $0 otherwise. Until you are done, you will choose to play machine A or machine B in each turn. In the space below, provide an MDP that captures the above description. Describe the state space, action space, rewards, and transition probabilities. Assume the discount factor γ = 1.
You may express your solution using equations or a diagram instead of a table.
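For concreteness, one possible (not necessarily the intended) encoding of the casino description above, with the current holding in dollars as the state and $0 / $40 treated as terminal. The reward choice and the handling of unavailable actions below are illustrative assumptions.

# One possible encoding of the casino MDP described above (a sketch, not
# necessarily the intended solution). States are current holdings in dollars;
# $0 and $40 are terminal (bankrupt / doubled the initial $20 stake).

GAMMA = 1.0
STATES = [0, 10, 20, 30, 40]           # 0 and 40 are terminal
ACTIONS = ["A", "B"]                   # slot machine A or B

def transitions(s, a):
    """Return a list of (probability, next_state) pairs for playing machine
    `a` with holding `s`. Machine A costs $10 and pays $20 w.p. 0.05;
    machine B costs $20 and pays $30 w.p. 0.01."""
    if s in (0, 40):                   # terminal: no further play
        return [(1.0, s)]
    if a == "A" and s >= 10:
        return [(0.05, min(s - 10 + 20, 40)), (0.95, s - 10)]
    if a == "B" and s >= 20:
        return [(0.01, min(s - 20 + 30, 40)), (0.99, s - 20)]
    return [(1.0, s)]                  # action unavailable: treat as a no-op

def reward(s, a, s_next):
    """One natural (assumed) choice: reward equals the change in holdings."""
    return s_next - s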
2 Bellman Operator with Function Approximation (12 Points)
Consider an MDP M = (S, A, R, P, γ) with finite discrete state space S and action space A. Assume M has dynamics model P(s′|s, a) for all s, s′ ∈ S and a ∈ A, and reward model R(s, a) for all s ∈ S and a ∈ A. Recall that the Bellman operator B applied to a function V : S → ℝ is defined as

B(V)(s) = max_{a∈A} [ R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V(s′) ]
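As a reference point, a minimal tabular sketch of applying B once, assuming hypothetical containers R[s][a] (rewards) and P[s][a] (a map from next state to probability); the container layout is an illustrative assumption.

def bellman_backup(V, R, P, gamma):
    """Apply B once: (BV)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].
    V: dict state -> value; R: dict with R[s][a] -> reward;
    P: dict with P[s][a] -> {next_state: probability}."""
    new_V = {}
    for s in V:
        new_V[s] = max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in R[s]
        )
    return new_V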
(a) (4 Points) Now, consider a new operator which first applies a Bellman backup and then applies a function approximation step, mapping the value function back to a space representable by the function approximator. We will consider a linear value function approximator over a continuous state space. Consider the following graphs:

The graphs show linear regression on the sample X0 = {0, 1, 2} for hypothetical underlying functions. On the left, a target function f (solid line) that evaluates to f(0) = f(1) = f(2) = 0, and its corresponding fitted function f̂(x) = 0. On the right, another target function g (solid line) that evaluates to g(0) = 0 and g(1) = g(2) = 1, and its corresponding fitted function ĝ.
What happens to the distance between the points {f(0), f(1), f(2)} and {g(0), g(1), g(2)} after we do the linear approximation? In other words, compare max_{x∈X0} |f(x) − g(x)| and max_{x∈X0} |f̂(x) − ĝ(x)| (a numerical sketch follows part (c) below).
(b) (4 Points) Is the linear function approximator here a contraction operator? Explain your answer.
(c) (4 Points) Will the new operator be guaranteed to converge to a single value function? If yes, will this be the optimal value function for the problem? Justify your answers.
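The comparison in part (a) can be checked numerically. A minimal sketch assuming ordinary least-squares regression (the fitting method implied by the figure) on X0 = {0, 1, 2}:

import numpy as np

# Least-squares linear fits on X0 = {0, 1, 2} for f and g defined above.
X = np.array([0.0, 1.0, 2.0])
f = np.array([0.0, 0.0, 0.0])
g = np.array([0.0, 1.0, 1.0])

A = np.vstack([X, np.ones_like(X)]).T           # design matrix [x, 1]
(slope_f, b_f), _, _, _ = np.linalg.lstsq(A, f, rcond=None)
(slope_g, b_g), _, _, _ = np.linalg.lstsq(A, g, rcond=None)

f_hat = slope_f * X + b_f                       # identically 0
g_hat = slope_g * X + b_g                       # roughly 0.5*x + 1/6

print("max |f - g|         =", np.max(np.abs(f - g)))            # 1.0
print("max |f_hat - g_hat| =", np.max(np.abs(f_hat - g_hat)))    # ~7/6

Note how the max-norm distance after fitting compares to the distance before fitting; this is the quantity relevant to parts (a) and (b).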
3 Probably Approximately Correct (6 Points)
Let A(α, β) be a hypothetical reinforcement learning algorithm, parametrized in terms of α and β, such that for any α > β > 1, it selects an action a for state s satisfying a near-optimality condition in all but N steps, with probability at least 1 − δ.
Per the definition of Probably Approximately Correct reinforcement learning, express N as a function of these parameters and γ. What is the resulting N? Is algorithm A probably approximately correct (PAC)? Briefly justify.
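For reference, one common formalization of PAC for reinforcement learning (the sample-complexity-of-exploration definition; the exact statement used in the course may differ in details): letting A_t denote the algorithm's policy at time t, the algorithm is PAC if, for every ε > 0 and δ ∈ (0, 1),

\[
\#\{\, t \;:\; V^{A_t}(s_t) < V^{*}(s_t) - \epsilon \,\}
\;\le\; \mathrm{poly}\!\Big(|S|,\, |A|,\, \tfrac{1}{\epsilon},\, \tfrac{1}{\delta},\, \tfrac{1}{1-\gamma}\Big)
\quad\text{with probability at least } 1 - \delta .
\]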
4 Model Approximation/Modification (24 Points)
Let M = (S, A, R, P, γ) be an MDP with |S| < ∞, |A| < ∞ and γ ∈ [0, 1). Let M̂ be a modification of M to be specified below. We compare the optimal value functions and policies in M and M̂.
(a) (12 Points) Let M̂ = (S, A, R̂, P, γ), where R̂(s, a) is a modified reward for all s ∈ S and a ∈ A. Besides the rewards, all other components of M̂ stay the same as in M. Prove the resulting relationship between V̂∗ and V∗. Will M and M̂ have the same optimal policy? Briefly explain. Note V∗ and V̂∗ are the optimal value functions in M and M̂, respectively.
(b) (12 Points) Now, let M̂ = (S, A, R̂, P, γ), where R̂(s, a) = R(s, a) + c for all s ∈ S and a ∈ A, for some constant c. Equivalently, the same constant c is added to each reward R(s, a). How are V∗ and V̂∗ related? Express V̂∗ in terms of V∗. Will M and M̂ have the same optimal policy? Briefly explain.
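For part (b), a sketch of the standard argument (not necessarily the intended proof): for any fixed policy π and γ < 1, adding the constant c to every reward shifts the return of every trajectory by the same geometric series,

\[
\hat V^{\pi}(s)
= \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\,\big(R(s_t,a_t) + c\big) \,\Big|\, s_0 = s,\ \pi\Big]
= V^{\pi}(s) + \frac{c}{1-\gamma}.
\]

Since the shift c/(1 − γ) is identical for every policy and every state, maximizing over policies shifts V∗ by the same amount and leaves the set of optimal policies unchanged.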
5 Q-Learning (16 Points)
An agent is exploring an MDP M = (S, A, R, P, γ) where S = {s1, s2, s3}, A = {a1, a2, a3}, γ = 0.5, and P(si | s, ai) = 1 for all s and all i (taking action ai from any state leads deterministically to si). The rewards for transitioning into a state are defined as R(si) = i. The maximum reward is 3.
(a) (8 Points) The agent follows the trajectory
(s1,a1,1,s1,a2,2,s2)
Consider a Q-learning agent using ε-greedy exploration. Assume a random (exploratory) action never happens to pick the greedy action. Ties are broken by choosing the ai with the smallest i. The learning rate α is set to 0.5. Initialize Q to zeros.
Could such an agent generate this trajectory for ε = 0? If yes, label which actions are greedy and which are random, and justify your answers.
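For part (a), the two tabular Q-learning updates along the trajectory can be computed mechanically; a minimal sketch with α = 0.5, γ = 0.5, and Q initialized to zeros (it only performs the updates and does not answer whether ε = 0 is consistent with the action choices):

from collections import defaultdict

gamma, alpha = 0.5, 0.5
Q = defaultdict(float)                 # Q[(state, action)], initialized to 0

def q_update(s, a, r, s_next, actions=("a1", "a2", "a3")):
    """Standard tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Trajectory (s1, a1, 1, s1, a2, 2, s2):
q_update("s1", "a1", 1, "s1")          # Q(s1,a1): 0 -> 0.5*(1 + 0.5*0) = 0.5
q_update("s1", "a2", 2, "s2")          # Q(s1,a2): 0 -> 0.5*(2 + 0.5*0) = 1.0
print(dict(Q))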
(b) (8 Points) Could the Rmax algorithm with m = 1 have generated this trajectory? Are any of the (s,a) pairs known at the end of the trajectory, and if so, which?
6 Neural Networks (12 Points)
Imagine there is a world whose states are determined by N political parties and represented by a bit string of length N. Each state corresponds to a possible composition of the government, with some parties holding seats and others not. More specifically, the i-th party holds a seat in a given state if and only if the i-th bit of the corresponding bit string is 1. We say the value of a state is 0 if an even number of political parties hold a seat in office (because the parties are argumentative and can vote against each other), and 1 if an odd number of political parties hold a seat (because ties in voting are then not possible). We want to represent the value function by linear function approximation, where the input is the binary feature vector s of a state, with si = 1 if the i-th party holds a seat and si = 0 otherwise.
(a) (3 Points) If N = 1, can we represent the value function perfectly using linear value function approximation? Briefly explain.
(b) (3 Points) If N = 2, can we represent the value function perfectly using linear value function approximation? Briefly explain.
(c) (3 Points) Consider using a 2-layer (one hidden layer and one output layer) neural network. The hidden neurons are defined as y = f(wᵀx + b) for an input x with weight vector w and bias b, and activation function f(z) = 1 if z > 0 and 0 if z ≤ 0. Can we represent the value function perfectly for N = 2? For any N? Briefly explain your answers (a construction sketch follows part (d) below).
(d) (3 Points) For N = 2, is there a reason to use a deeper neural network? How about for general N?
Briefly justify your answers.
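As a sketch related to part (c) for N = 2, one possible choice of weights for a two-layer threshold network that computes the parity of two bits (the specific weights below are illustrative assumptions, not the only construction):

import itertools

def f(z):
    """Threshold activation from part (c): 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def parity_net(x1, x2):
    """Two-layer threshold network computing the parity (XOR) of two bits.
    Hidden unit h1 fires if at least one bit is set, h2 fires only if both are;
    the output fires when exactly one bit is set."""
    h1 = f(1.0 * x1 + 1.0 * x2 - 0.5)    # x1 + x2 >= 1
    h2 = f(1.0 * x1 + 1.0 * x2 - 1.5)    # x1 + x2 >= 2
    return f(1.0 * h1 - 1.0 * h2 - 0.5)  # h1 AND NOT h2

for x1, x2 in itertools.product([0, 1], repeat=2):
    print((x1, x2), "->", parity_net(x1, x2))   # matches (x1 + x2) % 2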
