AI 3000 / CS 5500 : Reinforcement Learning Assignment № 1 Solved

Problem 1 : Markov Reward Process
Consider the following snakes and ladders game as depicted in the figure below.

• Initial state is S and a fair four-sided die is used to decide the next state at each time step
• Player must land exactly on state W to win
• Die throws that take you further than state W leave the state unchanged
(a) Identify the states and the transition matrix of this Markov process. (1 point)
(b) Construct a suitable reward function and discount factor, and use the Bellman equation for the Markov reward process to compute how long it takes “on average” (the expected number of die throws) to reach the state W from any other state. (4 points)
(a) The states of the Markov process are given by S = {S, 1, 3, 5, 6, 7, 8, W}. Positions 2 and 4 of the grid are the same as positions 7 and 8, respectively.

The transition matrix, with rows and columns ordered as (S, 1, 3, 5, 6, 7, 8, W), is given by

\[
P = \begin{bmatrix}
0 & 0.25 & 0.25 & 0 & 0 & 0.25 & 0.25 & 0\\
0 & 0 & 0.25 & 0.25 & 0 & 0.25 & 0.25 & 0\\
0 & 0 & 0 & 0.25 & 0.25 & 0.25 & 0.25 & 0\\
0 & 0 & 0 & 0.25 & 0.25 & 0.25 & 0.25 & 0\\
0 & 0 & 0 & 0.25 & 0 & 0.25 & 0.25 & 0.25\\
0 & 0 & 0 & 0.25 & 0 & 0.25 & 0.25 & 0.25\\
0 & 0 & 0 & 0.25 & 0 & 0 & 0.5 & 0.25\\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\]

(b) The state W is an absorbing state: once the Markov process reaches W, no further state transitions are possible apart from staying at W.
The following are a suitable reward function and discount factor γ.
• A suitable discount factor is γ = 1, as we are estimating the “average” number of steps to reach W.
• The reward for any state can be R(s) = −1 for s ≠ W and R(s) = 0 for s = W. Then V(s) = 0 for s = W.
Solving the Bellman equation with this choice gives, for the states (S, 1, 3, 5, 6, 7, 8), the expected number of die throws |V(s)| = {7.0833, 7, 6.6667, 6.6667, 5.3333, 5.3333, 5.3333}.
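As a quick numerical check, the expected hitting times can be computed directly from the transition matrix reconstructed above by restricting it to the transient states; this is a minimal sketch, not part of the original solution.

```python
import numpy as np

# States ordered as (S, 1, 3, 5, 6, 7, 8, W); W is absorbing.
P = np.array([
    [0, 0.25, 0.25, 0.00, 0.00, 0.25, 0.25, 0.00],
    [0, 0.00, 0.25, 0.25, 0.00, 0.25, 0.25, 0.00],
    [0, 0.00, 0.00, 0.25, 0.25, 0.25, 0.25, 0.00],
    [0, 0.00, 0.00, 0.25, 0.25, 0.25, 0.25, 0.00],
    [0, 0.00, 0.00, 0.25, 0.00, 0.25, 0.25, 0.25],
    [0, 0.00, 0.00, 0.25, 0.00, 0.25, 0.25, 0.25],
    [0, 0.00, 0.00, 0.25, 0.00, 0.00, 0.50, 0.25],
    [0, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00],
])

# With R(s) = -1 for s != W, R(W) = 0 and gamma = 1, the magnitude of V equals the
# expected number of throws, obtained by solving (I - Q) t = 1 over the transient
# states, where Q is P restricted to the non-absorbing states.
Q = P[:7, :7]
t = np.linalg.solve(np.eye(7) - Q, np.ones(7))
print(np.round(t, 4))  # -> [7.0833 7.     6.6667 6.6667 5.3333 5.3333 5.3333]
```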
Problem 2 : Markov Decision Process
A production facility has N machines. If a machine starts up correctly in the morning, it generates a daily revenue of $1. A machine that does not start up correctly needs to be repaired. A visit by a repairman costs $ per day, and he repairs all broken machines on the same day. The repair cost is a lump-sum amount and does not depend on the number of machines that are repaired. A machine that has been repaired always starts up correctly the next day. The number of machines that start up correctly the next day depends on the number of properly working machines on the present day and is governed by the probability distribution given in the table below, where m stands for the number of (presently) working machines and n stands for the number that would start up correctly the next day. The goal of the facility manager is to maximize the profit (revenue − costs) earned.

(a) Formulate the above problem as a Markov decision process by enumerating the state space, action space, rewards and transition probabilities. (3 Points)
• State space : Number of working machines on any given day S = {0,1,··· ,N}
• Action space : {repair,no-repair}
• Reward : R(s, no-repair, s′) = s ; R(s, repair, s′) = −c, where c is the daily repair cost (taken as c = 5/2 in the calculations below)
• Transition probabilities, for the action no-repair: the next state follows the start-up distribution from the table, i.e., P(s′ = n | s = m, no-repair) = p(n | m) for n ∈ {0, 1, ..., m}.
• Transition probabilities, for the action repair: all N − m broken machines are repaired and start up correctly the next day, while the m currently working machines start up according to the table, so P(s′ = (N − m) + n | s = m, repair) = p(n | m) for n ∈ {0, 1, ..., m}.
(b) Would you use discounted or undiscounted setting for the above MDP formulation ? Justify the answer. (1 Point)
The discounted setting is preferred, since the problem as formulated is an infinite-horizon problem (the facility operates day after day with no terminal state), and discounting keeps the value of every policy finite.
(c) Suppose the facility manager adopts the policy to never call the repair man. Calculate the value of the policy. For this sub-problem assume that the number of machines in the facility to be five. (3 Points)
For this sub-problem and the next, let us assume γ = 0.9. One can use the Bellman evaluation equation in the following form to compute the value of the policy π = no-repair. With five machines the state space has six states, S = {0, 1, ..., 5}.
V^π = (I − γP^π)^{−1} R^π
With the identity matrix of size 6 × 6, we obtain the values for the six states as
V^π = {0, 1.818, 3.63, 5.45, 7.29, 9.09}
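The calculation can be reproduced with a short script; since the start-up table is not reproduced here, the sketch below assumes the number of machines that start up the next day is uniform over {0, ..., m}, which is consistent with the probabilities (0.5, 0.33, 0.25, ...) used in part (d).

```python
import numpy as np

N, gamma = 5, 0.9

# Assumed start-up table (the original table is not reproduced above): given m
# working machines today, the number that start up tomorrow is uniform on {0, ..., m}.
def p_next(m):
    return np.full(m + 1, 1.0 / (m + 1))

# Transition matrix and reward vector under the 'no-repair' policy.
P = np.zeros((N + 1, N + 1))
for m in range(N + 1):
    P[m, : m + 1] = p_next(m)
R = np.arange(N + 1, dtype=float)   # daily revenue = number of working machines

# V^pi = (I - gamma * P^pi)^{-1} R^pi
V = np.linalg.solve(np.eye(N + 1) - gamma * P, R)
print(np.round(V, 3))               # ~ [0. 1.818 3.636 5.455 7.273 9.091]
```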
(d) Perform one iteration of the policy iteration algorithm on the no-repair policy adopted by the facility manager to get an improved policy for the five machine scenario. (3 Points)
Improvement is achieved by acting greedily w.r.t. the no-repair policy π at every state s; in each argmax below, the first argument is the Q-value of no-repair and the second is that of repair.
At state s = 0, we have π1(0) = argmax_a(0 + γV^π(0), −2.5 + 0.9 · V^π(5)) = argmax(0, −2.5 + 0.9 · 9.09) = repair
π1(1) = argmax_a(1.818, −5/2 + 0.9 · (0.5 · V^π(5) + 0.5 · V^π(4))) = repair
π1(2) = argmax_a(3.63, −5/2 + 0.9 · (0.33 · V^π(5) + 0.33 · V^π(4) + 0.33 · V^π(3))) = repair
π1(3) = argmax_a(5.45, −5/2 + 0.9 · (0.25 · V^π(5) + 0.25 · V^π(4) + 0.25 · V^π(3) + 0.25 · V^π(2))) = no-repair
Similarly, it is possible to work out that for states s = 4 and s = 5 the better action is no-repair.
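The same improvement step can be sketched under the uniform start-up assumption and the reward formulation above (repair-day reward −c with c = 5/2, as in the −5/2 terms in the calculation):

```python
import numpy as np

N, gamma, cost = 5, 0.9, 2.5      # repair cost 2.5 assumed from the -5/2 terms above
# V^pi of the 'no-repair' policy from part (c)
V = np.array([0.0, 1.818, 3.636, 5.455, 7.273, 9.091])

def p_next(m):
    # Assumed start-up table: uniform over {0, ..., m} machines working tomorrow.
    return np.full(m + 1, 1.0 / (m + 1))

for s in range(N + 1):
    # Q(s, no-repair): today's revenue s, tomorrow's state lies in {0, ..., s}.
    q_no = s + gamma * p_next(s) @ V[: s + 1]
    # Q(s, repair): pay the repair cost, the N - s broken machines run tomorrow and the
    # s working ones start up per the table, so tomorrow's state lies in {N - s, ..., N}.
    q_rep = -cost + gamma * p_next(s) @ V[N - s:]
    print(s, "repair" if q_rep > q_no else "no-repair")
```

Running this reproduces the decisions above: repair for s = 0, 1, 2 and no-repair for s = 3, 4, 5.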
Problem 3 : On Ordering of Policies
Consider the MDP shown in Figure 1. The MDP has 4 states S = {A, B, C, D}, and two actions a1 and a2 are possible. The actions determine which direction to move from a given state. We consider a stochastic environment such that the action suggested by the policy succeeds 90% of the time and fails 10% of the time. Upon failure, the agent moves in the direction suggested by the other action. The state D is a terminal state with a reward of 100. One can think of terminal states as having only one action (an exit option) which gives the terminal reward 100.

Figure 1: Partial Ordering of Policies

We consider three policies for this MDP.
• Policy π1 is a deterministic policy that chooses action a1 at all states s ∈ S.
• Policy π2 is another deterministic policy that chooses action a2 at all states s ∈ S.
• Policy π3 is a stochastic policy described as follows
– Action a1 is chosen in states B and D with probability 1.0
– Action a2 is chosen in state C with probability 1.0
– Action a1 is chosen in state A with probability 0.4 and action a2 is chosen with probability 0.6
(a) Evaluate V π(s) for each policy described above using the Bellman evaluation equation for all states s ∈ S. (3 Points)
For states {A, B, C, D}, we have
• V^π1 = {75.61, 87.56, 68.05, 100}
• V^π2 = {75.61, 68.05, 87.56, 100}
• V^π3 = {77.78, 87.78, 87.78, 100}
(b) Which is the best policy among the suggested policies ? Why ? (1 Point)
Policy π3 is the best policy, as its value at each state is greater than or equal to the values obtained under policies π1 and π2.
(c) Are all policies comparable ? Provide reason for your answer. (1 Point)
Policies π1 and π2 are not comparable, since for state B we have V^π1(B) > V^π2(B), whereas for state C we have V^π2(C) > V^π1(C).
(d) Let π1 and π2 be two deterministic stationary policies of an MDP M . Construct a new policy π that is better than policies π1 and π2. Explain the answer. (3 Points)
[Note : M in sub-question (d) is any arbitrary MDP]
Here we are given two policies π1 and π2 for MDP M. At every state s ∈ S of the MDP, construct a new policy π as follows :

\[
\pi(s) =
\begin{cases}
\pi_1(s) & \text{if } V^{\pi_1}(s) \ge V^{\pi_2}(s)\\
\pi_2(s) & \text{otherwise}
\end{cases}
\]
This construction leads to the conclusion that for every state s ∈ S of the MDP M we have V^π(s) ≥ max(V^{π1}(s), V^{π2}(s)): applying the Bellman operator of π to the function max(V^{π1}, V^{π2}) can only increase it at every state, and by monotonicity of the operator its fixed point V^π dominates max(V^{π1}, V^{π2}). This makes the policy π at least as good as both π1 and π2.
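A minimal sketch of this construction, reusing the values from part (a) (actions encoded as 1 for a1 and 2 for a2, states ordered A, B, C, D):

```python
import numpy as np

def combine_policies(pi1, pi2, V1, V2):
    # Follow, at each state, whichever of the two policies has the larger value
    # (ties go to pi1).
    return np.where(np.asarray(V1) >= np.asarray(V2), pi1, pi2)

pi1 = np.array([1, 1, 1, 1])                      # always a1
pi2 = np.array([2, 2, 2, 2])                      # always a2
V1 = np.array([75.61, 87.56, 68.05, 100.0])       # V^pi1 from part (a)
V2 = np.array([75.61, 68.05, 87.56, 100.0])       # V^pi2 from part (a)
print(combine_policies(pi1, pi2, V1, V2))         # -> [1 1 2 1]
```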
Problem 4 : Effect of Noise and Discounting
Consider the grid world problem shown in Figure 2. The grid has two terminal states with positive payoff (+1 and +10). The bottom row is a cliff where each state is a terminal state with negative payoff (-10). The greyed squares in the grid are walls. The agent starts from the yellow state S.

Figure 2: Modified Grid World

As usual, the agent has four actions A = (Left, Right, Up, Down) to choose from at any non-terminal state, and actions that would take the agent off the grid leave the state unchanged. Notice that, if the agent follows the dashed path, it needs to be careful not to step into any terminal state at the bottom row that has negative payoff. There are four possible (optimal) paths that an agent can take.
• Prefer the close exit (state with reward +1) but risk the cliff (dashed path to +1)
• Prefer the distant exit (state with reward +10) but risk the cliff (dashed path to +10)
• Prefer the close exit (state with reward +1) by avoiding the cliff (solid path to +1)
• Prefer the distant exit (state with reward +10) by avoiding the cliff (solid path to +10)
(a) Identify what values of γ and η lead to each of the optimal paths listed above with reasoning. (8 Points)
[Hint : For the discount factor, consider high and low γ values like 0.9 and 0.1 respectively. For noise, consider deterministic and stochastic environment with noise level η being 0 or 0.5 respectively]
Answer
1. When γ is low, the RL agent is ‘short-sighted’ and better rewards available in the distant future are not given importance. Further, when the noise is zero, there is no danger of tripping into the cliff. Therefore, for low γ and low η, the agent prefers the close exit and risks the cliff.
2. When γ is low, the RL agent is ‘short-sighted’ and better rewards available in the distant future are not given importance. Further, when the noise is moderate or high, there is a danger of tripping into the cliff. Therefore, for low γ and high η, the agent prefers the close exit and does not risk the cliff.
3. When γ is high, the RL agent is ‘far-sighted’ and better rewards available in the distant future are given importance. Further, when the noise is zero or low, there is little or no danger of tripping into the cliff. Therefore, for high γ and low η, the agent prefers the distant exit and risks the cliff.
4. When γ is high, the RL agent is ‘far-sighted’ and better rewards available in the distant future are given importance. Further, when the noise is moderate or high, there is a danger of tripping into the cliff. Therefore, for high γ and high η, the agent prefers the distant exit and does not risk the cliff.
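These four regimes can be explored numerically with value iteration; the sketch below uses a hypothetical layout in the spirit of Figure 2 (the actual grid, wall positions and noise model are assumptions, not taken from the figure) and prints the start-state value for each (γ, η) setting.

```python
import numpy as np

# Hypothetical grid: '.' free, '#' wall, 'S' start; 'a' = +1 exit, 'b' = +10 exit,
# 'c' = cliff (-10). The real Figure 2 may differ.
LAYOUT = [".....",
          ".#...",
          ".#a#b",
          "S....",
          "ccccc"]
EXIT_REWARD = {"a": 1.0, "b": 10.0, "c": -10.0}
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
PERP = {"U": "LR", "D": "LR", "L": "UD", "R": "UD"}
ROWS, COLS = len(LAYOUT), len(LAYOUT[0])
CELLS = [(r, c) for r in range(ROWS) for c in range(COLS) if LAYOUT[r][c] != "#"]

def step(s, a):
    r, c = s
    dr, dc = ACTIONS[a]
    nr, nc = r + dr, c + dc
    if 0 <= nr < ROWS and 0 <= nc < COLS and LAYOUT[nr][nc] != "#":
        return (nr, nc)
    return s  # moving off the grid or into a wall leaves the state unchanged

def value_iteration(gamma, eta, iters=500):
    V = {s: 0.0 for s in CELLS}
    for _ in range(iters):
        new_V = {}
        for s in CELLS:
            ch = LAYOUT[s[0]][s[1]]
            if ch in EXIT_REWARD:            # terminal: a single exit action
                new_V[s] = EXIT_REWARD[ch]
                continue
            q_values = []
            for a in ACTIONS:
                # intended move with prob 1 - eta, perpendicular slips with prob eta/2 each
                moves = [(1 - eta, step(s, a))] + [(eta / 2, step(s, p)) for p in PERP[a]]
                q_values.append(sum(p * gamma * V[s2] for p, s2 in moves))
            new_V[s] = max(q_values)
        V = new_V
    return V

start = next(s for s in CELLS if LAYOUT[s[0]][s[1]] == "S")
for gamma, eta in [(0.1, 0.0), (0.1, 0.5), (0.9, 0.0), (0.9, 0.5)]:
    print(f"gamma={gamma}, eta={eta}: V(S) = {value_iteration(gamma, eta)[start]:.3f}")
```

Tracing the greedy actions from S under each setting shows which of the four paths the agent prefers.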
Problem 5 : Value Functions
Let M1 and M2 be two identical MDPs with |S| < ∞ and |A| < ∞, except for the reward formulation. That is, M1 = ⟨S, A, P, R1, γ⟩ and M2 = ⟨S, A, P, R2, γ⟩. Let M3 be another MDP such that M3 = ⟨S, A, P, R1 + R2, γ⟩. Assume γ < 1.
(a) For an arbitrary but fixed policy π, suppose we are given the action value functions Q^π_{M1} and Q^π_{M2} corresponding to MDPs M1 and M2, respectively. Explain whether it is possible to combine these action value functions in a simple manner to calculate Q^π_{M3} corresponding to MDP M3. (2 Points)
Yes, it is possible to combine the two action value functions into a single action value function for the composite MDP, since the action value function involves only the expectation operator, which is linear; hence Q^π_{M3} = Q^π_{M1} + Q^π_{M2}.
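A short derivation making the linearity explicit (for any policy π and all state-action pairs (s, a)):

\[
\begin{aligned}
Q^{\pi}_{M_3}(s,a)
&= \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}\,\big(R_1 + R_2\big)(s_{t+k}, a_{t+k}, s_{t+k+1}) \;\middle|\; s_t = s,\, a_t = a\right]\\
&= \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k} R_1(s_{t+k}, a_{t+k}, s_{t+k+1}) \;\middle|\; s_t = s,\, a_t = a\right]
 + \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k} R_2(s_{t+k}, a_{t+k}, s_{t+k+1}) \;\middle|\; s_t = s,\, a_t = a\right]\\
&= Q^{\pi}_{M_1}(s,a) + Q^{\pi}_{M_2}(s,a).
\end{aligned}
\]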
(b) Suppose we are given optimal policies π∗_{M1} and π∗_{M2} corresponding to MDPs M1 and M2, respectively. Explain whether it is possible to combine these optimal policies in a simple manner to formulate an optimal policy corresponding to MDP M3. (2 Points)
Combining optimal policies is not straightforward, as optimality involves the max operator, which is non-linear. Hence the optimal policies of the two MDPs cannot be combined in a straightforward fashion, as the example below illustrates.
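As a simple illustration (a single-state example constructed here for concreteness, not taken from the assignment): let there be one state s that loops back to itself under every action, three actions a1, a2, a3, and γ < 1, with rewards

\[
\begin{array}{c|ccc}
 & a_1 & a_2 & a_3\\\hline
R_1(s,a) & 2 & 0 & 1.5\\
R_2(s,a) & 0 & 2 & 1.5\\
R_1 + R_2 & 2 & 2 & 3
\end{array}
\]

The optimal policy for M1 always picks a1 and the optimal policy for M2 always picks a2, yet the optimal action in M3 is a3, which neither optimal policy ever selects; so no state-wise combination of π∗_{M1} and π∗_{M2} yields an optimal policy for M3.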
(c) Suppose π∗ is an optimal policy for both MDPs M1 and M2. Will π∗ also be an optimal policy for MDP M3 ? Justify the answer. (2 Points)
Given that π∗ is an optimal policy for both M1 and M2, we have, for all s ∈ S,
V^{π∗}_{M1}(s) = V^∗_{M1}(s) and V^{π∗}_{M2}(s) = V^∗_{M2}(s).
Adding the value functions of π∗ for MDPs M1 and M2, and using the linearity from part (a) together with the fact that M3 has reward R1 + R2, gives
V^{π∗}_{M3}(s) = V^{π∗}_{M1}(s) + V^{π∗}_{M2}(s) = V^∗_{M1}(s) + V^∗_{M2}(s).
The optimal value function for MDP M3 satisfies
V^∗_{M3}(s) = max_π [ V^π_{M1}(s) + V^π_{M2}(s) ] ≤ max_π V^π_{M1}(s) + max_π V^π_{M2}(s) = V^∗_{M1}(s) + V^∗_{M2}(s).
Comparing the last two relations, one can recognize that
V^∗_{M3} = V^{π∗}_{M3} = V^∗_{M1} + V^∗_{M2}.
The sum of the two optimal value functions, achieved by the common optimal policy π∗, is therefore the optimal value function for M3, and hence π∗ is also an optimal policy for MDP M3.
(d) Let ε be a fixed constant. Assume that the reward functions R1 and R2 are related as
R1(s, a, s′) − R2(s, a, s′) = ε
for all s, s′ ∈ S and a ∈ A. Let π be an arbitrary policy and let V^π_{M1} and V^π_{M2} be the corresponding value functions of policy π for MDPs M1 and M2, respectively. Derive an expression that relates V^π_{M1}(s) and V^π_{M2}(s) for all s ∈ S. (3 Points)
Considering the definition of V^π(s), the state value function under policy π, we have
V^π(s) = E_π[ ∑_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ].
We assume that each reward has the constant ε added to it. That is, we consider the reward r̂_{t+k+1} in terms of r_{t+k+1} given by
r̂_{t+k+1} = r_{t+k+1} + ε.
Then the state value function for this new sequence of rewards is given by
V̂^π(s) = E_π[ ∑_{k=0}^∞ γ^k (r_{t+k+1} + ε) | s_t = s ] = V^π(s) + ε ∑_{k=0}^∞ γ^k = V^π(s) + ε/(1 − γ).
Hence V^π_{M1}(s) = V^π_{M2}(s) + ε/(1 − γ) for all s ∈ S.
An alternative relation in matrix form,
V̂^π = V^π + (I − γP^π)^{−1} ε1,
where 1 is the all-ones vector, is dependent on the model of the MDP; since P^π is row-stochastic, (I − γP^π)^{−1} ε1 = (ε/(1 − γ)) 1, which agrees with the expression above.
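A quick numerical sanity check of this relation (a sketch with a randomly generated model and a fixed policy, not part of the original solution):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eps = 4, 2, 0.9, 0.7

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']: random transition model
R2 = rng.normal(size=(nS, nA))                  # expected rewards of M2
R1 = R2 + eps                                   # M1 rewards offset by the constant eps
pi = rng.integers(nA, size=nS)                  # an arbitrary deterministic policy

def evaluate(R):
    # V^pi = (I - gamma * P^pi)^{-1} R^pi
    P_pi = P[np.arange(nS), pi]
    R_pi = R[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

print(evaluate(R1) - evaluate(R2))   # every entry equals eps / (1 - gamma) = 7.0
print(eps / (1 - gamma))
```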
ALL THE BEST
