CS181-S22
Homework 3: Bayesian Methods and Neural Networks
Introduction
This homework is about Bayesian methods and Neural Networks. Section 2.9 in the textbook, as well as reviewing MLE and MAP, will be useful for Q1. Chapter 4 in the textbook will be useful for Q2. Please type your solutions after the corresponding problems using this LaTeX template, and start each problem on a new page.
Please submit the writeup PDF to the Gradescope assignment ‘HW3’. Remember to assign pages for each question. All plots you submit must be included in your writeup PDF. We will not be checking your code / source files except in special circumstances.
Please submit your LaTeX file and code files to the Gradescope assignment ‘HW3 – Supplemental’.
Problem 1 (Bayesian Methods)
This question helps build your understanding of making predictions with a maximum likelihood estimate (MLE), a maximum a posteriori (MAP) estimate, and a full posterior predictive.
Consider a one-dimensional random variable x = µ + ϵ, where it is known that ϵ ∼ N(0, σ²). Suppose we have a prior µ ∼ N(0, τ²) on the mean. You observe n iid data points (denote the data as D). We derive the distribution of x|D for you.
The full posterior predictive is computed using:
p(x|D) = ∫ p(x, µ|D) dµ = ∫ p(x|µ) p(µ|D) dµ
One can show that, in this case, the full posterior predictive distribution has a nice analytic form:
x|D ∼ N( (Σ_{xi∈D} xi) / (n + σ²/τ²) , (n/σ² + 1/τ²)^{-1} + σ² )    (1)
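As a quick numerical sanity check of the analytic form reconstructed in equation (1) (and of the variance limit asked about in part 2e below), the short Python sketch below plugs illustrative, made-up values of σ, τ, and simulated data into the formula; none of the specific numbers are part of the assignment.

import numpy as np

# Illustrative placeholder values; not specified by the problem.
sigma, tau, n = 1.0, 2.0, 50
rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, scale=sigma, size=n)  # stand-in for the observed data D

# Posterior predictive mean and variance as written in equation (1).
pred_mean = data.sum() / (n + sigma**2 / tau**2)
pred_var = 1.0 / (n / sigma**2 + 1.0 / tau**2) + sigma**2
print(pred_mean, pred_var)

# As n grows, the first term of the variance shrinks toward 0, so the
# predictive variance approaches sigma**2 (compare with part 2e).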
1. Derive the distribution of µ|D.
2. In many problems, it is often difficult to calculate the full posterior because we need to marginalize out the parameters as above (here, the parameter is µ). We can mitigate this problem by plugging in a point estimate µ∗ rather than a full distribution over µ.
a) Derive the MLE estimate µMLE.
b) Derive the MAP estimate µMAP.
c) What is the relation between µMAP and the mean of x|D?
d) For a fixed value of µ = µ∗, what is the distribution of x|µ∗? Thus, what is the distribution of x|µMLE and x|µMAP?
e) Is the variance of x|D greater or smaller than the variance of x|µMLE? What is the limit of the variance of x|D as n tends to infinity? Explain why this is intuitive.
3. Let us compare µMLE and µMAP (a small numerical sketch follows this problem statement). There are three cases to consider:
a) Assume Σ_{xi∈D} xi = 0. What are the values of µMLE and µMAP?
b) Assume Σ_{xi∈D} xi > 0. Is µMLE greater than µMAP?
c) Assume Σ_{xi∈D} xi < 0. Is µMLE greater than µMAP?
4. Compute:
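For intuition on parts 2 and 3 above, here is the small numerical sketch referenced in part 3. The closed-form expressions used below (the sample mean for the MLE, and a shrunken sum for the MAP under the N(0, τ²) prior) are the standard Gaussian-conjugate results, stated here without derivation; deriving them is exactly what parts 2a and 2b ask for, so treat this only as a sanity check rather than a solution.

import numpy as np

sigma, tau = 1.0, 2.0  # illustrative values; not specified by the problem

def point_estimates(data):
    n = len(data)
    mu_mle = np.mean(data)                           # standard MLE of a Gaussian mean
    mu_map = np.sum(data) / (n + sigma**2 / tau**2)  # standard MAP under a N(0, tau^2) prior
    return mu_mle, mu_map

# Three toy datasets whose sums are zero, positive, and negative, respectively.
for data in ([-1.0, 1.0], [0.5, 1.5, 2.0], [-0.5, -1.5, -2.0]):
    print(point_estimates(np.array(data)))

# When the sum is positive, the MAP estimate is shrunk toward zero and lies below
# the MLE; when the sum is negative, it is shrunk toward zero from below.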
Solution:
Problem 2 (Bayesian Frequentist Reconciliation)
In this question, we connect the Bayesian version of regression with the frequentist view we have seen in the first week of class by showing how appropriate priors could correspond to regularization penalties in the frequentist world, and how the models can be different.
Suppose we have a (p + 1)-dimensional labelled dataset with datapoints (xi, yi). We can assume that each yi is generated by the following random process:
yi = w⊤xi + ϵi
where all ϵi ∼ N(0, σ²) are iid. Using matrix notation, we denote by X the design matrix whose rows are the xi⊤, by y the column vector of labels yi, and by ϵ the column vector of noise terms ϵi. Then we can write y = Xw + ϵ. Now, we will suppose that w is random as well as our labels! We choose to impose the Laplacian prior p(w) ∝ exp(−∥w − µ∥1 / τ), where ∥·∥1 denotes the L1 norm, µ the location parameter, and τ the scale factor.
1. Compute the posterior distribution p(w|X,y) of w given the observed data X,y, up to a normalizing constant. You do not need to simplify the posterior to match a known distribution.
4. As τ decreases, what happens to the entries of the estimate wMAP? What happens in the limit as τ → 0?
5. Consider the point estimate wmean, the mean of the posterior w|X,y. Further, assume that the model assumptions are correct. That is, w is indeed sampled from the posterior provided in subproblem 1, and y|x,w ∼ N(w⊤x, σ²). Suppose as well that the data generating processes for x, w, y are all independent (note that w is random!). Between the models with estimates wMAP and wmean, which model would have a lower expected test MSE, and why? Assume that the data generating distribution for x has mean zero, and that distinct features are independent and each have variance 1. (The unit variance assumption simplifies computation, and is also commonly used in practical applications.)
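To make the generative assumptions in part 5 concrete, here is a minimal simulation sketch of the data generating process described above: w drawn coordinatewise from the Laplacian prior with location µ and scale τ, features that are independent with mean zero and unit variance, and Gaussian noise on y. All dimensions and numeric values below are illustrative placeholders, not quantities fixed by the problem.

import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100                    # placeholder weight dimension and number of datapoints
mu, tau, sigma = 0.0, 1.0, 0.5   # placeholder prior location/scale and noise std

w = rng.laplace(loc=mu, scale=tau, size=d)       # w sampled from the Laplacian prior
X = rng.normal(loc=0.0, scale=1.0, size=(n, d))  # zero-mean, unit-variance, independent features
y = X @ w + rng.normal(scale=sigma, size=n)      # y = Xw + eps, with eps ~ N(0, sigma^2)
print(X.shape, y.shape, w)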
Solution:
Problem 3 (Neural Net Optimization)
In this problem, we will take a closer look at how gradients are calculated for backprop with a simple multi-layer perceptron (MLP). The MLP will consist of a first fully connected layer with a sigmoid activation, followed by a second fully connected layer with a one-dimensional output and a sigmoid activation, to get a prediction for a binary classification problem. Assume the bias terms have not been merged into the weight matrices. Let:
• W1 be the weights of the first layer, b1 be the bias of the first layer.
• W2 be the weights of the second layer, b2 be the bias of the second layer.
The described architecture can be written mathematically as:
ŷ = σ(W2 [σ(W1x + b1)] + b2)
where ŷ is a scalar output of the net when passing in the single datapoint x (represented as a column vector), the additions are element-wise additions, and the sigmoid is an element-wise sigmoid.
1. Let:
• N be the number of datapoints we have
• M be the dimensionality of the data
• H be the size of the hidden dimension of the first layer. Here, hidden dimension is used to describe the dimension of the resulting value after going through the layer. Based on the problem description, the hidden dimension of the second layer is 1.
Write out the dimensionality of each of the parameters, and of the intermediate variables:
a1 = W1x + b1,  z1 = σ(a1)
a2 = W2z1 + b2,  ŷ = z2 = σ(a2)
and make sure they work with the mathematical operations described above.
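One quick way to check your dimensionality answers is to run the forward pass on random tensors and inspect the shapes. The sketch below uses one possible shape assignment consistent with the architecture equation above (placeholder sizes M and H, with the biases stored as column vectors); it is meant as a self-check, not as the answer itself.

import torch

M, H = 5, 3                      # placeholder input and hidden sizes
x = torch.randn(M, 1)            # a single datapoint as a column vector
W1, b1 = torch.randn(H, M), torch.randn(H, 1)
W2, b2 = torch.randn(1, H), torch.randn(1, 1)

a1 = W1 @ x + b1                 # first-layer pre-activation
z1 = torch.sigmoid(a1)           # first-layer activation
a2 = W2 @ z1 + b2                # second-layer pre-activation
y_hat = torch.sigmoid(a2)        # prediction (a 1x1 tensor)
print(a1.shape, z1.shape, a2.shape, y_hat.shape)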
2. We will derive the gradients for each of the parameters. The gradients can be used in gradient descent to find weights that improve our model’s performance. For this question, assume there is only one datapoint x, and that our loss is L = −(y log(ˆy) + (1 − y)log(1 − yˆ)). For all questions, the chain rule will be useful.
(a) Find ∂L/∂b2.
(b) Find ∂L/∂(W2)_h, where (W2)_h represents the hth element of W2.
(c) Find ∂L/∂(b1)_h, where (b1)_h represents the hth element of b1. (Hint: note that only the hth element of a1 and z1 depend on (b1)_h; this should help you with how to use the chain rule.)
(d) Find ∂L/∂(W1)_{h,m}, where (W1)_{h,m} represents the element in row h, column m of W1.
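A standard way to verify gradients derived by hand is a finite-difference check: perturb one parameter entry, recompute the loss, and compare the resulting slope against your analytic expression. The sketch below does this for one entry of b1, using PyTorch autograd only as a reference value; all shapes and numbers are illustrative placeholders, and the expressions you derive in (a)–(d) should agree with this kind of check.

import torch

torch.manual_seed(0)
M, H = 4, 3
x = torch.randn(M, 1)
y = torch.tensor(1.0)
W1, b1 = torch.randn(H, M), torch.randn(H, 1, requires_grad=True)
W2, b2 = torch.randn(1, H), torch.randn(1, 1)

def loss(b1_val):
    y_hat = torch.sigmoid(W2 @ torch.sigmoid(W1 @ x + b1_val) + b2).squeeze()
    return -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))

loss(b1).backward()                  # reference gradient from autograd
h, idx = 1e-4, (0, 0)                # perturb the first entry of b1
e = torch.zeros_like(b1)
e[idx] = h
with torch.no_grad():
    fd = (loss(b1 + e) - loss(b1 - e)) / (2 * h)  # central-difference estimate
print(b1.grad[idx].item(), fd.item())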
Solution:
Problem 4 (Modern Deep Learning Tools: PyTorch)
In this problem, you will learn how to use PyTorch. This machine learning library is massively popular and used heavily throughout industry and research. In T3_P3.ipynb you will implement an MLP for image classification from scratch. Copy and paste code solutions below and include a final graph of your training progress. Also submit your completed T3_P3.ipynb file.
You will receive no points for code not included below.
You will receive no points for code using built-in APIs from the torch.nn library.
Solution:
Plot:
Code:
n_inputs = 'not implemented'
n_hiddens = 'not implemented'
n_outputs = 'not implemented'

W1 = 'not implemented'
b1 = 'not implemented'
W2 = 'not implemented'
b2 = 'not implemented'

def relu(x):
    'not implemented'  # elementwise ReLU activation

def softmax(x):
    'not implemented'  # row-wise softmax over class scores

def net(X):
    'not implemented'  # forward pass of the MLP

def cross_entropy(y_hat, y):
    'not implemented'  # cross-entropy loss between predictions y_hat and labels y

def sgd(params, lr=0.1):
    'not implemented'  # manual SGD update of params with learning rate lr

def train(net, params, train_iter, loss_func=cross_entropy, updater=sgd):
    'not implemented'  # training loop: forward pass, loss, backward pass, update
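Once the template above is filled in, the sgd piece in particular follows a standard PyTorch pattern for manual parameter updates: modify each parameter in place under torch.no_grad() and then reset its gradient, without using torch.optim or torch.nn. The following self-contained toy example illustrates that pattern on a simple quadratic objective; it is a generic sketch of manual SGD, not the assignment's solution.

import torch

# Toy parameter and target, purely for illustration.
w = torch.tensor([1.0, -2.0], requires_grad=True)
target = torch.tensor([3.0, 0.5])

for step in range(100):
    loss = ((w - target) ** 2).sum()   # simple quadratic loss
    loss.backward()                    # populate w.grad
    with torch.no_grad():              # manual SGD step (no torch.optim)
        w -= 0.1 * w.grad
    w.grad.zero_()                     # clear gradients before the next step

print(w)  # converges toward the target values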
Name
Whom did you work with, and did you use any resources beyond cs181-textbook and your notes?
Calibration
Approximately how long did this homework take you to complete (in hours)?