CS 234: Assignment #3

These questions require thought, but do not require long answers. Please be as concise as possible.
Please review any additional instructions posted on the assignment page at http://cs234.stanford.edu/assignment3. When you are ready to submit, please follow the instructions on the course website.
1 Frozen Lake
Now we are going to implement algorithms with smart exploration for the Frozen Lake environment. Recall that the layout of the Frozen Lake environment is described using a grid. The layout is as follows:
S H H H
F H H H
F H H H
F F F G
In this environment, the agent starts at the top left and needs to reach the goal at the bottom right. The maximum number of steps it can take in this 4×4 layout is 6. The agent cannot move into a hole's location. We have provided custom versions of this environment in the starter code.
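Purely as an illustration of this layout (the starter code ships its own custom environment, so the Gymnasium registration, the desc argument, and the step-limit wiring below are assumptions rather than the provided interface), the grid could be encoded like this:

# Illustration only: the assignment's starter code defines its own environment.
# This sketch assumes the Gymnasium FrozenLake-v1 registration and its
# `desc` / `is_slippery` arguments, plus a 6-step episode limit.
import gymnasium as gym

LAYOUT = [
    "SHHH",
    "FHHH",
    "FHHH",
    "FFFG",
]

env = gym.make("FrozenLake-v1", desc=LAYOUT,
               is_slippery=False,      # deterministic dynamics assumed here
               max_episode_steps=6)    # matches the 6-step limit above
obs, info = env.reset(seed=0)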
(a) (coding) Implement Rmax in q1.py. With the default hyper-parameters, it should reach the goal quickly. (A sketch of the R-max training loop appears after part (c).)
(b) (written) How does the threshold m influence performance? Please plot the running average score over the first 10,000 training episodes to support your argument. The x-axis should be the episode number, and the y-axis should be the average score over all training episodes so far. Include the plot in your writeup.
(c) (coding and written) Implement ε-greedy Q-learning (requiring only minimal changes from Assignment 1) in q2.py. Experiment with different values of ε and choose one that works well. Compare Rmax (with the best-performing m) against ε-greedy Q-learning (with your chosen ε). What do you observe? Plot the performance on the same axes as in part (b).
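For part (a), here is a minimal tabular R-max sketch, assuming a Gymnasium-style discrete environment with the five-tuple step API; the function names (rmax_planning, train_rmax), the value-iteration planner, and the default hyper-parameters are illustrative choices, not the interface of the starter code's q1.py.

import numpy as np

def rmax_planning(nS, nA, counts, trans, rew_sum, m, r_max, gamma=0.99, n_vi=200):
    # Value iteration on the optimistic empirical model: any (s, a) visited
    # fewer than m times is treated as having the maximum possible value.
    Q = np.zeros((nS, nA))
    for _ in range(n_vi):
        V = Q.max(axis=1)
        for s in range(nS):
            for a in range(nA):
                if counts[s, a] < m:
                    Q[s, a] = r_max / (1.0 - gamma)   # optimistic value for unknown pairs
                else:
                    p = trans[s, a] / counts[s, a]    # empirical P(s' | s, a)
                    r = rew_sum[s, a] / counts[s, a]  # empirical mean reward
                    Q[s, a] = r + gamma * (p @ V)
    return Q

def train_rmax(env, m=1, r_max=1.0, gamma=0.99, num_episodes=10_000):
    nS, nA = env.observation_space.n, env.action_space.n
    counts = np.zeros((nS, nA))        # visit counts n(s, a)
    trans = np.zeros((nS, nA, nS))     # transition counts n(s, a, s')
    rew_sum = np.zeros((nS, nA))       # summed observed rewards
    Q = rmax_planning(nS, nA, counts, trans, rew_sum, m, r_max, gamma)
    scores = []
    for _ in range(num_episodes):
        s, _ = env.reset()
        done, total = False, 0.0
        while not done:
            a = int(Q[s].argmax())     # act greedily w.r.t. the optimistic Q
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            if counts[s, a] < m:       # the model is only updated until (s, a) is "known"
                counts[s, a] += 1
                trans[s, a, s2] += 1
                rew_sum[s, a] += r
                Q = rmax_planning(nS, nA, counts, trans, rew_sum, m, r_max, gamma)
            total += r
            s = s2
        scores.append(total)
    return scores

Replanning after every model update is affordable here because the model changes at most m·|S|·|A| times in total; part (c)'s ε-greedy Q-learning needs only the usual tabular update plus an exploration parameter swept over a few values.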

2 Expected Regret Bounds
Assume a reinforcement learning algorithm A for discounted infinite-horizon MDPs has expected regret
\[
\mathbb{E}^{*}\!\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]
\;-\;
\mathbb{E}_{A}\!\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right]
\;\le\; f(T)
\]
for all T > 0, where E^* is the expectation over the probability distribution induced by the optimal policy π^* and E_A is the expectation with respect to the algorithm's behavior. We assume that γ ∈ [0,1) is the discount factor and that rewards are normalized, i.e., r_t ∈ [0,1].
(a) Let π be an arbitrary policy or algorithm. Show that for any ε > 0 and T ≥ H ln(H/ε), where H = 1/(1 − γ), we have
\[
\left|\, V_{\pi}(s) \;-\; \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} \gamma^{t-1} r_t \,\middle|\, s_1 = s \right] \right| \;\le\; \varepsilon,
\qquad \text{for all states } s.
\]
Note V_π is the value function associated with π and E_π is the expectation with respect to the randomization of π. (A sketch of where the threshold on T comes from appears after part (b).)
(b) From the regret guarantee of algorithm A and part (a), it follows that for any ε > 0 and T ≥ H ln(H/ε), we have
\[
0 \;\le\; V^{*}(s_1) \;-\; V_{A}(s_1) \;\le\; f(T) + 2\varepsilon,
\]
where s_1 denotes the initial state and V_A is the value function of the (possibly nonstationary) policy that algorithm A follows. Assume now f(T) = √T. Show that for any ε > 0 and any initial state s_1, we have
\[
V^{*}(s_1) \;-\; V_{A}(s_1) \;\le\; \sqrt{H \ln(H/\varepsilon)} \;+\; 2\varepsilon.
\]
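One way to see where the threshold T ≥ H ln(H/ε) used above comes from is to bound the discounted tail of the return; a short sketch, using only r_t ∈ [0,1] and H = 1/(1 − γ):
\[
\sum_{t=T+1}^{\infty} \gamma^{t-1} r_t
\;\le\; \frac{\gamma^{T}}{1-\gamma}
\;=\; \gamma^{T} H
\;\le\; H e^{-T/H}
\;\le\; \varepsilon
\qquad \text{whenever } T \ge H \ln(H/\varepsilon),
\]
where the second inequality uses γ = 1 − (1 − γ) ≤ e^{−(1−γ)} = e^{−1/H}.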