CS181-S22
Homework 3: Bayesian Methods and Neural Networks
Introduction
This homework is about Bayesian methods and Neural Networks. Section 2.9 in the textbook as well as reviewing MLE and MAP will be useful for Q1. Chapter 4 in the textbook will be useful for Q2. Please type your solutions after the corresponding problems using this LaTeX template, and start each problem on a new page.
Please submit the writeup PDF to the Gradescope assignment "HW3". Remember to assign pages for each question. All plots you submit must be included in your writeup PDF. We will not be checking your code / source files except in special circumstances.
Please submit your LaTeX file and code files to the Gradescope assignment "HW3 - Supplemental".
Problem 1 (Bayesian Methods)
This question helps to build your understanding of making predictions with a maximum-likelihood estimate (MLE), a maximum a posteriori (MAP) estimate, and a full posterior predictive.
Consider a one-dimensional random variable x = µ + ε, where it is known that ε ∼ N(0, σ²). Suppose we have a prior µ ∼ N(0, τ²) on the mean. You observe n iid data points x_1, …, x_n (denote the data as D). We derive the distribution of x|D for you.
The full posterior predictive is computed using:
p(x|D) = ∫ p(x, µ | D) dµ = ∫ p(x|µ) p(µ|D) dµ
One can show that, in this case, the full posterior predictive distribution has a nice analytic form:

x | D ∼ N( τ² Σ_{x_i ∈ D} x_i / (nτ² + σ²),  σ² + σ²τ² / (nτ² + σ²) )        (1)
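Not part of the required solution: as a quick numerical sanity check on the analytic form in (1), one can approximate p(x|D) = ∫ p(x|µ) p(µ|D) dµ on a grid (with p(µ|D) ∝ p(D|µ) p(µ)) and compare it with the Gaussian density in (1). The sketch below does this with NumPy and SciPy; the chosen σ, τ, evaluation point x0, and synthetic data are illustrative assumptions.

import numpy as np
from scipy.stats import norm

sigma, tau = 1.0, 2.0                      # assumed noise std and prior std (illustrative)
rng = np.random.default_rng(0)
data = rng.normal(1.5, sigma, size=10)     # synthetic stand-in for the observed dataset D
n, s = len(data), data.sum()

# Analytic posterior predictive mean and variance from equation (1)
pred_mean = tau**2 * s / (n * tau**2 + sigma**2)
pred_var = sigma**2 + sigma**2 * tau**2 / (n * tau**2 + sigma**2)

# Brute-force check: compute p(mu|D) on a grid, then integrate p(x0|mu) p(mu|D) dmu
mu_grid = np.linspace(-10, 10, 20001)
log_post = norm.logpdf(data[:, None], mu_grid, sigma).sum(0) + norm.logpdf(mu_grid, 0.0, tau)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, mu_grid)            # normalize the gridded posterior

x0 = 2.0
numeric = np.trapz(norm.pdf(x0, mu_grid, sigma) * post, mu_grid)
print(numeric, norm.pdf(x0, pred_mean, np.sqrt(pred_var)))   # the two values should agree closely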
1. Derive the distribution of µ|D.
2. In many problems, it is often difficult to calculate the full posterior because we need to marginalize out the parameters as above (here, the parameter is µ). We can mitigate this problem by plugging in a point estimate µ* rather than a distribution.
a) Derive the MLE estimate µ_MLE.
b) Derive the MAP estimate µ_MAP.
c) What is the relation between µ_MAP and the mean of x|D?
d) For a fixed value of µ = µ*, what is the distribution of x|µ*? Thus, what is the distribution of x|µ_MLE and x|µ_MAP?
e) Is the variance of x|D greater or smaller than the variance of x|µ_MLE? What is the limit of the variance of x|D as n tends to infinity? Explain why this is intuitive.
3. Let us compare µ_MLE and µ_MAP. There are three cases to consider:
a) Assume Σ_{x_i ∈ D} x_i = 0. What are the values of µ_MLE and µ_MAP?
b) Assume Σ_{x_i ∈ D} x_i > 0. Is µ_MLE greater than µ_MAP?
c) Assume Σ_{x_i ∈ D} x_i < 0. Is µ_MLE greater than µ_MAP?
4. Compute:
Solution:
Problem 2 (Bayesian Frequentist Reconciliation)
In this question, we connect the Bayesian version of regression with the frequentist view we have seen in the first week of class by showing how appropriate priors can correspond to regularization penalties in the frequentist world, and how the models can differ.
Suppose we have a (p + 1)-dimensional labelled dataset with n data points (x_i, y_i). We can assume that each y_i is generated by the following random process:
y_i = w⊤x_i + ε_i
where all ε_i ∼ N(0, σ²) are iid. Using matrix notation, we denote by X ∈ R^{n×(p+1)} the design matrix whose ith row is x_i⊤, and by y ∈ R^n and ε ∈ R^n the column vectors stacking the y_i and ε_i respectively.
Then we can write y = Xw + ε. Now, we will suppose that w is random as well as our labels! We choose to impose the Laplacian prior p(w) ∝ exp(−‖w − µ‖_1 / τ), where ‖w‖_1 denotes the L1 norm of w, µ the location parameter, and τ is the scale factor.
1. Compute the posterior distribution p(w|X,y) of w given the observed data X,y, up to a normalizing constant. You do not need to simplify the posterior to match a known distribution.
4. As τ decreases, what happens to the entries of the estimate w_MAP? What happens in the limit as τ → 0?
5. Consider the point estimate w_mean, the mean of the posterior w|X, y. Further, assume that the model assumptions are correct. That is, w is indeed sampled from the posterior provided in subproblem 1, and that y|x, w ∼ N(w⊤x, σ²). Suppose as well that the data generating processes for x, w, y are all independent (note that w is random!). Between the models with estimates w_MAP and w_mean, which model would have a lower expected test MSE, and why? Assume that the data generating distribution for x has mean zero, and that distinct features are independent and each have variance 1.¹
¹ The unit variance assumption simplifies computation, and is also commonly used in practical applications.
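For intuition only (not required for the solution): since the problem highlights the link between priors and regularization penalties, note that the negative log posterior under this Laplacian prior is, up to additive constants, a squared-error term plus an L1 penalty, so maximizing the posterior resembles minimizing a Lasso-style objective. The sketch below evaluates that objective for candidate weights; the synthetic data, the choice µ = 0, and the values of σ and τ are illustrative assumptions.

import numpy as np

def neg_log_posterior(w, X, y, sigma=1.0, tau=1.0, mu=0.0):
    # -log p(w | X, y) up to an additive constant: a squared-error term from the
    # Gaussian likelihood plus an L1 term from the Laplacian prior
    resid = y - X @ w
    return resid @ resid / (2 * sigma**2) + np.abs(w - mu).sum() / tau

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                     # toy design matrix (illustrative)
w_true = np.array([2.0, 0.0, -1.0, 0.0])
y = X @ w_true + rng.normal(scale=1.0, size=50)
print(neg_log_posterior(w_true, X, y))           # objective near the generating weights
print(neg_log_posterior(np.zeros(4), X, y))      # objective at the prior location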
Solution:
Problem 3 (Neural Net Optimization)
In this problem, we will take a closer look at how gradients are calculated for backprop with a simple multi-layer perceptron (MLP). The MLP will consist of a first fully connected layer with a sigmoid activation, followed by a one-dimensional, second fully connected layer with a sigmoid activation to get a prediction for a binary classification problem. Assume bias has not been merged. Let:
โข W1 be the weights of the first layer, b1 be the bias of the first layer.
โข W2 be the weights of the second layer, b2 be the bias of the second layer.
The described architecture can be written mathematically as:
ŷ = σ(W2 [σ(W1 x + b1)] + b2)
where ŷ is a scalar output of the net when passing in the single datapoint x (represented as a column vector), the additions are element-wise additions, and the sigmoid is an element-wise sigmoid.
1. Let:
โข N be the number of datapoints we have
โข M be the dimensionality of the data
โข H be the size of the hidden dimension of the first layer. Here, hidden dimension is used to describe the dimension of the resulting value after going through the layer. Based on the problem description, the hidden dimension of the second layer is 1.
Write out the dimensionality of each of the parameters, and of the intermediate variables:
a1 = W1 x + b1,  z1 = σ(a1),
a2 = W2 z1 + b2,  ŷ = z2 = σ(a2),
and make sure they work with the mathematical operations described above.
2. We will derive the gradients for each of the parameters. The gradients can be used in gradient descent to find weights that improve our model's performance. For this question, assume there is only one datapoint x, and that our loss is L = −(y log(ŷ) + (1 − y) log(1 − ŷ)). For all questions, the chain rule will be useful.
(a) Find ∂L/∂b2.
(b) Find ∂L/∂(W2)_h, where (W2)_h represents the hth element of W2.
(c) Find ∂L/∂(b1)_h, where (b1)_h represents the hth element of b1. (*Hint: Note that only the hth element of a1 and z1 depend on (b1)_h – this should help you with how to use the chain rule.)
(d) Find ∂L/∂(W1)_{h,m}, where (W1)_{h,m} represents the element in row h, column m in W1.
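Not required for the solution: once the dimensionalities from question 1 and the gradients from question 2 have been worked out on paper, a short numerical sketch can help check them. The code below runs the forward pass with shape comments and estimates ∂L/∂b2 by central finite differences, which can be compared against a derived expression; the sizes M and H, the random parameters, and the single synthetic datapoint are illustrative assumptions.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
M, H = 5, 3                        # illustrative input and hidden dimensions
x = rng.normal(size=(M, 1))        # single datapoint as a column vector
y = 1.0                            # binary label

W1 = rng.normal(size=(H, M)); b1 = rng.normal(size=(H, 1))   # first layer: (H x M), (H x 1)
W2 = rng.normal(size=(1, H)); b2 = rng.normal(size=(1, 1))   # second layer: (1 x H), (1 x 1)

def loss(W1, b1, W2, b2):
    a1 = W1 @ x + b1               # (H x 1)
    z1 = sigmoid(a1)               # (H x 1)
    a2 = W2 @ z1 + b2              # (1 x 1)
    y_hat = sigmoid(a2).item()     # scalar
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Central finite-difference estimate of dL/db2, useful for checking a derived expression.
eps = 1e-6
num_grad_b2 = (loss(W1, b1, W2, b2 + eps) - loss(W1, b1, W2, b2 - eps)) / (2 * eps)
print(num_grad_b2)                 # compare with your derived dL/db2 evaluated at these values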
Solution:
Problem 4 (Modern Deep Learning Tools: PyTorch)
In this problem, you will learn how to use PyTorch. This machine learning library is massively popular and used heavily throughout industry and research. In T3_P3.ipynb you will implement an MLP for image classification from scratch. Copy and paste code solutions below and include a final graph of your training progress. Also submit your completed T3_P3.ipynb file.
You will receive no points for code not included below.
You will receive no points for code using built-in APIs from the torch.nn library.
Solution:
Plot:
Code:
n_inputs = 'not implemented'
n_hiddens = 'not implemented'
n_outputs = 'not implemented'

W1 = 'not implemented'
b1 = 'not implemented'
W2 = 'not implemented'
b2 = 'not implemented'

def relu(x):
    'not implemented'

def softmax(x):
    'not implemented'

def net(X):
    'not implemented'

def cross_entropy(y_hat, y):
    'not implemented'

def sgd(params, lr=0.1):
    'not implemented'

def train(net, params, train_iter, loss_func=cross_entropy, updater=sgd):
    'not implemented'
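For orientation only, and not the expected solution (your own completed T3_P3.ipynb code should be pasted above): one possible way to write the activation functions and the SGD update using only basic tensor operations, without torch.nn, is sketched below. The assumed input layout is (batch, features), and the max-subtraction in softmax is a standard numerical-stability choice.

import torch

def relu(x):
    # elementwise max(x, 0) using only basic tensor ops
    return torch.clamp(x, min=0.0)

def softmax(x):
    # row-wise softmax with max-subtraction for numerical stability
    x_exp = torch.exp(x - x.max(dim=1, keepdim=True).values)
    return x_exp / x_exp.sum(dim=1, keepdim=True)

def sgd(params, lr=0.1):
    # in-place gradient step; gradients are assumed to have been populated by backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad
            p.grad.zero_()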
Name
Whom did you work with, and did you use any resources beyond cs181-textbook and your notes?
Calibration
Approximately how long did this homework take you to complete (in hours)?