Problem 1 : Value Iteration
Let M be an MDP given by ⟨S, A, P, R, γ⟩ with |S| < ∞, |A| < ∞ and γ ∈ [0,1). We are given a policy π and the task is to evaluate V^π(s) for every state s ∈ S of the MDP. To this end, we use the iterative policy evaluation algorithm. It is the analog of the algorithm described in (Refer : Slides 15, 20 of Lecture 7) for the policy evaluation case. We start the iterative policy evaluation algorithm with an initial guess V_1 and let V_{k+1} be the (k+1)-th iterate of the value function corresponding to policy π. Our constraint on compute infrastructure does not allow us to wait for the successive iterates of the value function to converge to the true value function V^π given by V^π = (I − γP)^{−1}R. Instead, we let the algorithm terminate at time step k+1 when the distance between successive iterates satisfies ‖V_{k+1} − V_k‖_∞ ≤ ε for a given ε > 0.
(a) Prove that the error between the obtained value function estimate V_{k+1} and the true value function V^π satisfies ‖V_{k+1} − V^π‖_∞ ≤ γε/(1 − γ).
(5 Points)
The last iterate of the algorithm is V_{k+1} and we know that ‖V_{k+1} − V_k‖_∞ ≤ ε. By using the triangle inequality (of norms) and the fact that BV_k = V_{k+1}, where B is the Bellman evaluation backup operator (a γ-contraction in the max-norm with fixed point V^π), we have

‖V_k − V^π‖_∞ ≤ ‖V_k − V_{k+1}‖_∞ + ‖V_{k+1} − V^π‖_∞ = ‖V_k − V_{k+1}‖_∞ + ‖BV_k − BV^π‖_∞
              ≤ ‖V_k − V_{k+1}‖_∞ + γ‖V_k − V^π‖_∞ ≤ ε + γ‖V_k − V^π‖_∞

Therefore, ‖V_k − V^π‖_∞ ≤ ε/(1 − γ). This allows us to conclude that

‖V_{k+1} − V^π‖_∞ = ‖BV_k − BV^π‖_∞ ≤ γ‖V_k − V^π‖_∞ ≤ γε/(1 − γ).
(b) Prove that the iterative policy evaluation algorithm converges geometrically, i.e., ‖V_{k+1} − V^π‖_∞ ≤ γ^k ‖V_1 − V^π‖_∞.
(2 Points)
We already proved that ‖V_{k+1} − V^π‖_∞ ≤ γ‖V_k − V^π‖_∞. Applying this inequality recursively,

‖V_{k+1} − V^π‖_∞ ≤ γ‖V_k − V^π‖_∞ ≤ γ²‖V_{k−1} − V^π‖_∞ ≤ ··· ≤ γ^k‖V_1 − V^π‖_∞,

which is the desired geometric rate of convergence.
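As a quick numerical sanity check of parts (a) and (b), here is a minimal Python sketch (not part of the original solution; the 5-state MRP, its transition matrix P and reward vector R are made-up illustrative quantities) that runs iterative policy evaluation with the ε stopping rule and compares the terminal error ‖V_{k+1} − V^π‖_∞ against the bound γε/(1 − γ):

    import numpy as np

    def iterative_policy_evaluation(P, R, gamma, eps):
        """Iterate V <- R + gamma * P V (the Bellman evaluation backup B)
        until ||V_{k+1} - V_k||_inf <= eps, and return the last iterate."""
        V = np.zeros(len(R))                      # initial guess V_1 = 0
        while True:
            V_next = R + gamma * P @ V
            if np.max(np.abs(V_next - V)) <= eps:
                return V_next
            V = V_next

    # A made-up 5-state MRP induced by some fixed policy pi (illustrative only).
    rng = np.random.default_rng(0)
    P = rng.random((5, 5)); P /= P.sum(axis=1, keepdims=True)   # row-stochastic
    R = rng.random(5)
    gamma, eps = 0.9, 1e-3

    V_true = np.linalg.solve(np.eye(5) - gamma * P, R)          # V^pi = (I - gamma P)^{-1} R
    V_hat = iterative_policy_evaluation(P, R, gamma, eps)

    print(np.max(np.abs(V_hat - V_true)))   # observed error at termination
    print(gamma * eps / (1 - gamma))        # bound from part (a)

Halving ε roughly halves the observed error as well, consistent with the geometric contraction argument in part (b).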
(c) Let v denote a value function and consider the Bellman optimality operator given by

(Lv)(s) = max_{a ∈ A} [ R(s, a) + γ Σ_{s'} P(s'|s, a) v(s') ].
Prove that the Bellman optimality operator L satisfies the monotonicity property. That is, for any two value functions u and v such that u ≤ v (this means u(s) ≤ v(s) for all s ∈ S), we have L(u) ≤ L(v). (3 Points)
By the definition of the Bellman optimality operator L, for a fixed state s ∈ S and a finite action set A, there exist maximising actions a1, a2 ∈ A (with a1 and a2 possibly different) such that

(Lu)(s) = R(s, a1) + γ Σ_{s'} P(s'|s, a1) u(s')

and

(Lv)(s) = R(s, a2) + γ Σ_{s'} P(s'|s, a2) v(s').

It is then easy to observe (using the definition of the optimality operator: a2 maximises the bracketed term for v, so any other action, in particular a1, can only do worse) that

(Lv)(s) ≥ R(s, a1) + γ Σ_{s'} P(s'|s, a1) v(s').

Now, we have

(Lv)(s) − (Lu)(s) ≥ γ Σ_{s'} P(s'|s, a1) [ v(s') − u(s') ].

Since u(s') ≤ v(s') for every s' and the transition probabilities are non-negative, we have (Lv)(s) − (Lu)(s) ≥ 0, i.e., (Lu)(s) ≤ (Lv)(s). Since s was chosen arbitrarily, we have the desired result.
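As an aside (not part of the original solution), the monotonicity property is easy to check numerically. The sketch below uses a made-up random MDP with assumed shapes P ∈ R^{S×A×S} and R ∈ R^{S×A} and verifies componentwise that L(u) ≤ L(v) whenever u ≤ v:

    import numpy as np

    def bellman_optimality_operator(V, P, R, gamma):
        """(L V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ].
        Assumed shapes: P is (S, A, S), R is (S, A), V is (S,)."""
        return np.max(R + gamma * np.einsum('sap,p->sa', P, V), axis=1)

    rng = np.random.default_rng(1)
    S, A, gamma = 6, 3, 0.95
    P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # valid transition kernel
    R = rng.random((S, A))

    u = rng.random(S)
    v = u + rng.random(S)        # guarantees u <= v componentwise

    Lu = bellman_optimality_operator(u, P, R, gamma)
    Lv = bellman_optimality_operator(v, P, R, gamma)
    print(np.all(Lu <= Lv))      # expected: True, by monotonicity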
Problem 2 : On Contractions
(a) Let P and Q be two contractions defined on a normed vector space ⟨V, ‖·‖⟩. Prove that the compositions P ∘ Q and Q ∘ P are contractions on the same normed vector space. (3 Points)
Indeed, let γ_p, γ_q ∈ [0,1) be the contraction coefficients of P and Q. For all v, u ∈ V, we have

‖(P ∘ Q)(v) − (P ∘ Q)(u)‖ = ‖P(Q(v)) − P(Q(u))‖ ≤ γ_p ‖Q(v) − Q(u)‖ ≤ γ_p γ_q ‖v − u‖,

and the same argument with the roles of P and Q exchanged gives ‖(Q ∘ P)(v) − (Q ∘ P)(u)‖ ≤ γ_q γ_p ‖v − u‖. It is important to note that since γ_p and γ_q belong to [0,1), the product γ_p γ_q also belongs to [0,1), so both compositions are contractions.
(b) What can be suitable contraction (or Lipschitz) coefficients for the contractions P ∘ Q and Q ∘ P? (1 Point)
γ_p γ_q is a suitable contraction coefficient for both P ∘ Q and Q ∘ P.
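As a small illustration (my own sketch, not from the solution), the code below composes two made-up affine contractions P and Q on R^n with coefficients γ_p and γ_q and checks empirically that the Lipschitz ratio of P ∘ Q never exceeds γ_p γ_q:

    import numpy as np

    rng = np.random.default_rng(2)
    n, gamma_p, gamma_q = 4, 0.8, 0.5

    # Two made-up affine contractions of R^n under the max-norm:
    # P(x) = gamma_p * x + b_p  and  Q(x) = gamma_q * x + b_q.
    b_p, b_q = rng.random(n), rng.random(n)
    P = lambda x: gamma_p * x + b_p
    Q = lambda x: gamma_q * x + b_q
    inf_norm = lambda x: np.max(np.abs(x))

    ratios = []
    for _ in range(1000):
        u, v = rng.normal(size=n), rng.normal(size=n)
        ratios.append(inf_norm(P(Q(v)) - P(Q(u))) / inf_norm(v - u))

    print(max(ratios), gamma_p * gamma_q)   # the observed ratio never exceeds gamma_p * gamma_q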
(c) Define operator B as F ∘ L, where L is the Bellman optimality operator and F is any other suitable operator. For example, F could play the role of a function approximator to the Bellman backup L. Under what conditions would the value iteration algorithm converge to a unique solution if operator B is used in place of L (in the value iteration algorithm)? Explain your answer. (2 Points)
For a unique solution to exist, the operator B = F ∘ L must be a contraction under the max-norm; the Banach fixed-point theorem then guarantees a unique fixed point to which value iteration with B converges from any initialisation. By part (a), this holds when both F and L are contractions under the max-norm; since L is already a γ-contraction, it suffices that F be a contraction (or even just a non-expansion) in the max-norm.
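To illustrate part (c) (again my own sketch, not the assignment's), take F to be the made-up max-norm contraction F(v) = 0.5 v + 0.5 c for a fixed vector c, a stand-in for a well-behaved function approximator, so that B = F ∘ L contracts with coefficient 0.5 γ; iterating B from two different initial guesses then reaches the same unique fixed point, which in general is not V*:

    import numpy as np

    rng = np.random.default_rng(3)
    S, A, gamma = 6, 3, 0.95
    P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
    R = rng.random((S, A))
    c = rng.random(S)

    L = lambda V: np.max(R + gamma * np.einsum('sap,p->sa', P, V), axis=1)  # Bellman optimality backup
    F = lambda V: 0.5 * V + 0.5 * c   # made-up max-norm contraction standing in for the approximator
    B = lambda V: F(L(V))             # composition: contraction with coefficient 0.5 * gamma

    def iterate(V, n=500):
        for _ in range(n):
            V = B(V)
        return V

    V_a = iterate(np.zeros(S))
    V_b = iterate(100 * rng.random(S))
    print(np.max(np.abs(V_a - V_b)))  # ~0: both runs reach the same unique fixed point of B (not V*)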
Problem 3 : Monte Carlo Methods
Consider a Markov process with 2 states S = {S, A} with transition probabilities as shown in the table below, where p ∈ (0,1) is a non-zero probability. To generate an MRP from this Markov chain, assume that the rewards for being in states S and A are 1 and 0, respectively. In addition, let the discount factor of the MRP be γ = 1.
            to S      to A
from S      1 − p     p
from A      0         1
(a) Provide a generic form for a typical trajectory starting at state S. (1 Point)
(b) Estimate V(S) using first-visit MC. (2 Points)
(c) Estimate V(S) using every-visit MC. (2 Points)
(d) What is the true value of V(S)? (3 Points)
(e) Explain if the every-visit MC estimate is biased. (2 Points)
(f) In general, for an MRP, comment on the convergence properties of the first-visit MC and every-visit MC algorithms. (2 Points)
(a) All trajectories starting in state S look like S S S ··· S A, where there are l occurrences of S followed by absorption in A; l is random and geometrically distributed with parameter p.
(b) With γ = 1 and a reward of 1 per visit to S, the return from the first visit to S is l, so the first-visit estimate of V(S) from one trajectory is l.
(c) There are l every-visit returns of S for every trajectory, with values l, l − 1, ..., 1, and their sum is l(l + 1)/2. So the every-visit estimate per trajectory is (l + 1)/2.
(d) The true value of V(S) is 1/p: it equals the expected number of visits to S, and since l is geometric with parameter p, E[l] = 1/p.
(e) Yes. The expected value of the every-visit estimate is E[(l + 1)/2] = (1/p + 1)/2, which differs from the true value 1/p and is off by roughly a factor of 2 when p is small; hence the every-visit estimate is biased.
(f) Both first-visit and every-visit MC converge to the true value function as the number of visits to each state goes to infinity. First-visit MC averages independent, unbiased samples of the return and converges by the law of large numbers; every-visit MC is biased for a finite amount of data, but the bias vanishes asymptotically, so it is still a consistent estimator.
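A quick simulation (my own sketch; the value p = 0.1 and the episode count are arbitrary choices) reproduces the estimates in (b), (c) and (e):

    import numpy as np

    def sample_num_S_visits(p, rng):
        """Number of visits to S before absorption in A: geometric with parameter p."""
        l = 1
        while rng.random() >= p:     # stay in S with probability 1 - p
            l += 1
        return l

    rng = np.random.default_rng(4)
    p, n_episodes = 0.1, 100_000

    first_visit = []       # one return per episode: l
    every_visit = []       # per-episode average of the l returns l, l-1, ..., 1
    for _ in range(n_episodes):
        l = sample_num_S_visits(p, rng)
        first_visit.append(l)
        every_visit.append((l + 1) / 2)

    print(np.mean(first_visit))   # ~ 1/p = 10.0       (first-visit MC, matches (d))
    print(np.mean(every_visit))   # ~ (1/p + 1)/2 = 5.5 (per-trajectory every-visit MC, the bias in (e))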
Problem 4 : Temporal Difference Methods
Given a policy π of an MDP M, consider the one step TD error given by
δ_t = r_{t+1} + γ V^π(s_{t+1}) − V^π(s_t)
where (s_t, r_{t+1}, s_{t+1}) is a transition observed under policy π at time t (Refer : Slide 11 Lecture 10).
(a) Compute E_π[δ_t | s_t = s] if δ_t uses the true state value function V^π. (2 Points)

E_π[δ_t | s_t = s] = E_π[r_{t+1} + γ V^π(s_{t+1}) − V^π(s_t) | s_t = s]
                   = E_π[r_{t+1} + γ V^π(s_{t+1}) | s_t = s] − V^π(s)
                   = V^π(s) − V^π(s)
                   = 0

We have used the fact that δ_t uses the true state value function V^π (so that the Bellman evaluation equation E_π[r_{t+1} + γ V^π(s_{t+1}) | s_t = s] = V^π(s) applies) in the penultimate step.
(b) Compute E_π[δ_t | S_t = s, A_t = a], for an arbitrary action a taken at time t, if δ_t uses the true state value function V^π. (2 Points)

E_π[δ_t | S_t = s, A_t = a] = R(s, a) + γ Σ_{s'} P(s'|s, a) V^π(s') − V^π(s) = Q^π(s, a) − V^π(s),

which is exactly the advantage A^π(s, a) of taking action a in state s and following π thereafter.
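Both expectations can be verified numerically on a made-up random MDP and stochastic policy (all shapes and names below are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(5)
    S, A, gamma = 5, 3, 0.9
    P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
    R = rng.random((S, A))
    pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)   # a made-up stochastic policy

    # Policy-induced transition matrix and reward vector, then the true V^pi and Q^pi.
    P_pi = np.einsum('sa,sap->sp', pi, P)
    R_pi = np.einsum('sa,sa->s', pi, R)
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
    Q_pi = R + gamma * np.einsum('sap,p->sa', P, V_pi)

    # (a) E_pi[delta_t | s_t = s] = sum_a pi(a|s) (Q^pi(s,a) - V^pi(s)) = 0 for every s.
    print(np.einsum('sa,sa->s', pi, Q_pi - V_pi[:, None]))   # ~ all zeros
    # (b) E_pi[delta_t | S_t = s, A_t = a] = Q^pi(s,a) - V^pi(s), the advantage.
    print(Q_pi - V_pi[:, None])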
(c) In the TD(λ) algorithm, we use λ-returns as the target. The λ-return target is given by

G_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_t^{(n)}

where G_t^{(n)} is the n-step return defined as

G_t^{(n)} = r_{t+1} + γ r_{t+2} + ··· + γ^{n−1} r_{t+n} + γ^n V(s_{t+n}).
The parameter λ is used to determine the weights corresponding to each of the n-step returns in the λ-return target. We know that the weights decay exponentially with n. Therefore, in the G_t^λ sum, after some terms, the weights of subsequent terms would have fallen by more than half as compared to the weight of the first term. Let η(λ) denote the time by which the weighting sequence would have fallen to half of its initial value. Derive an expression that relates the parameter λ to η(λ). Use the expression derived to compute the value of λ for which the weights would drop to half after 3-step returns. (3 Points)
The λ-return weights the n-step return G_t^{(n)} by w_n = (1 − λ) λ^{n−1}, whose initial value is w_1 = 1 − λ. Defining τ = η(λ) to be the value of n such that w_τ = w_1 / 2, we get

λ^{τ−1} = 1/2,   i.e.,   τ = η(λ) = 1 + ln(1/2) / ln λ.

Thus, given λ, the half-life τ is given by this expression. For example, if λ = 1/√2 ≈ 0.707, then λ^{τ−1} = λ² = 1/2 and we compute that τ = 3, so we are looking that many steps ahead before our weighting drops by one half.
(Please check if this is correct.)
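A small numerical check of the half-life relation (my own sketch, using the relation λ^{τ−1} = 1/2 as reconstructed above):

    import numpy as np

    def half_life(lmbda):
        """Smallest n at which the weight (1 - lmbda) * lmbda**(n-1) has dropped
        to (at most) half of its initial value (1 - lmbda)."""
        n = 1
        while lmbda ** (n - 1) > 0.5:
            n += 1
        return n

    lmbda = 1 / np.sqrt(2)                   # ~0.707
    print(half_life(lmbda))                  # 3: weights drop to half after the 3-step return
    print(1 + np.log(0.5) / np.log(lmbda))   # closed form eta(lambda) = 1 + ln(1/2)/ln(lambda) = 3.0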
Problem 5 : On Learning Rates
In any TD based algorithm, the update rule is of the following form
V(s) ← V(s) + α_t [ r + γ V(s') − V(s) ]
where α_t is the learning rate at the t-th time step. Here, the time step t refers to the t-th time we are updating the value of the state s. Among other conditions, the learning rate α_t has to obey the Robbins-Monro conditions given by

Σ_{t=1}^{∞} α_t = ∞   and   Σ_{t=1}^{∞} α_t² < ∞

for convergence to the true V(s). Other conditions being the same, reason out whether the following values for α_t would result in convergence. (5 Points)
(1) α_t = 1/t
Generalize the above result to α_t = 1/t^p for any positive real number p (i.e., p ∈ R⁺).
The series Σ_{t=1}^{∞} 1/t is the harmonic series and it diverges. In fact, one can see this by re-grouping terms:

1 + 1/2 + (1/3 + 1/4) + (1/5 + 1/6 + 1/7 + 1/8) + ··· ≥ 1 + 1/2 + 1/2 + 1/2 + ··· = ∞.

So for α_t = 1/t we have Σ_t α_t = ∞ and Σ_t α_t² = Σ_t 1/t² < ∞, and the Robbins-Monro conditions are satisfied. A generalization of the harmonic series is the p-series (hyperharmonic series) defined as Σ_{t=1}^{∞} 1/t^p for any positive real number p. The p-series converges for all p > 1 (overharmonic series) and diverges for all p ≤ 1. One can now use this property to get the following results.
α_t          Σ_t α_t    Σ_t α_t²    Algorithm converges?
1/t²         < ∞        < ∞         No
1/t          ∞          < ∞         Yes
1/t^{2/3}    ∞          < ∞         Yes
1/t^{0.5}    ∞          ∞           No
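The table can be reproduced programmatically from the p-series test; the helper below (robbins_monro_ok is a made-up name of my own) simply encodes the two conditions for α_t = 1/t^p:

    # Classify alpha_t = 1/t**p by the p-series test:
    #   sum_t 1/t**p diverges     iff p <= 1    (needed: sum alpha_t = infinity)
    #   sum_t 1/t**(2p) converges iff 2p > 1    (needed: sum alpha_t^2 < infinity)
    def robbins_monro_ok(p):
        sum_alpha_diverges = (p <= 1)
        sum_alpha_sq_finite = (2 * p > 1)
        return sum_alpha_diverges and sum_alpha_sq_finite

    for p in [2, 1, 2/3, 0.5]:
        print(f"alpha_t = 1/t^{p}: converges = {robbins_monro_ok(p)}")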
Problem 6 : Programming Value and Policy Iteration
Implement the value iteration and policy iteration algorithms and test them on the 'Frozen Lake' environment in OpenAI Gym. 'Frozen Lake' is a grid-world-like environment available in Gym. The purpose of this exercise is to help you get hands-on with using Gym and to understand the implementation details of the value and policy iteration algorithm(s).
This question will not be graded but will still come in handy for future assignments.
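Since this part is ungraded, here is a minimal value iteration sketch of my own (assumptions: the environment id, which may be 'FrozenLake-v1' or 'FrozenLake-v0' depending on the installed gym version; the discount γ = 0.99; and the P[s][a] = list of (prob, next_state, reward, done) layout exposed by Gym's toy-text environments). Policy iteration follows the same pattern with an inner policy evaluation loop:

    import numpy as np
    import gym   # pip install gym

    # Older gym versions register 'FrozenLake-v0' instead of 'FrozenLake-v1'.
    env = gym.make('FrozenLake-v1')
    nS, nA = env.observation_space.n, env.action_space.n
    P = env.unwrapped.P    # P[s][a] = list of (prob, next_state, reward, done)

    def value_iteration(P, nS, nA, gamma=0.99, eps=1e-8):
        V = np.zeros(nS)
        while True:
            Q = np.zeros((nS, nA))
            for s in range(nS):
                for a in range(nA):
                    Q[s, a] = sum(prob * (r + gamma * V[s_next] * (not done))
                                  for prob, s_next, r, done in P[s][a])
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < eps:
                return V_new, Q.argmax(axis=1)    # optimal values and greedy policy
            V = V_new

    V_star, policy = value_iteration(P, nS, nA)
    print(V_star.reshape(4, 4))   # assumes the default 4x4 map
    print(policy.reshape(4, 4))   # 0=Left, 1=Down, 2=Right, 3=Up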
ALL THE BEST