TAs: Aditi, Chongling, Hang, Hayden, Sami
Homework 6 covers topics on learning theory, MLE/MAP, Naive Bayes, and CNNs. The homework includes multiple choice, True/False, and short answer questions. There will be no consistency points in general, so please make sure to double check your answers to all parts of the questions!
START HERE: Instructions
• Late Submission Policy: See the late submission policy here: https://www.cs.cmu.edu/~mgormley/courses/10601/syllabus.html
• Submitting your work: You will use Gradescope to submit your answers to all questions.
Instructions for Specific Problem Types
For “Select One” questions, please fill in the appropriate bubble completely:
Select One: Who taught this course?
● Matt Gormley
⃝ Marie Curie
⃝ Noam Chomsky
If you need to change your answer, you may cross out the previous answer and bubble in the new answer:
Select One: Who taught this course?
● Henry Chai
⃝ Marie Curie
● Noam Chomsky (crossed out)
For “Select all that apply” questions, please fill in all appropriate squares completely:
Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
□ I don’t know
If you need to change your answer, you may cross out the previous answer(s) and fill in the new answer(s):
Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
■ I don’t know (crossed out)
Fill in the blank: What is the course number?
10-601
Written Questions (84 points)
1 LaTeX Bonus Point (1 point)
1. (1 point) Select one: Did you use LaTeX for the entire written portion of this homework?
Yes
⃝ No
2 Convolutional Neural Network (14 points)
1. In this problem, consider only the convolutional layer of a standard implementation of a CNN as described in lecture. We are given an image X and filter F below.
(a) (1 point) Let X be convolved with F using no padding and a stride of 1 to produce an output Y. What is the value of j in the output Y?
(b) (1 point) Suppose you had an input feature map of size 6 × 4 (height × width) and a filter of size 2 × 2. Using no padding and a stride of 2, what would be the resulting output size? Write your answer in the format height × width.
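For reference, the standard output-size computation can be sanity-checked with a few lines of Python; this is a sketch, and the function name and defaults are mine rather than part of the assignment.

import math

def conv_output_size(in_size, kernel, stride=1, padding=0):
    # length of the output along one spatial dimension of a standard convolution
    return math.floor((in_size + 2 * padding - kernel) / stride) + 1

# example: one spatial dimension of size 6, filter size 2, stride 2, no padding
print(conv_output_size(6, 2, stride=2))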
2. Parameter sharing is a very important concept for CNNs because it drastically reduces the complexity of the learning problem. For the following questions, assume that there is no bias term in our convolutional layer.
(a) (1 point) Select all that apply: Which of the following are parameters of a convolutional layer?
Stride size
□ Padding size
■ Image size
Filter size
Weights in the filter
□ None of the above
(b) (1 point) Select all that apply: Which of the following are hyperparameters of a convolutional layer?
■ Stride size
■ Padding size
□ Image size
■ Filter size
Weights in the filter
□ None of the above
(c) (1 point) Suppose for the convolutional layer, we are given grayscale images of size 22×22. Using one single 4 × 4 filter with a stride of 2 and no padding, what is the number of parameters you are learning in this layer?
(d) (1 point) Now suppose we do not do parameter sharing. That is, each output pixel of this layer is computed by a separate 4×4 filter. Again we use a stride of 2 and no padding. What is the number of parameters you are learning in this layer?
(e) (1 point) Now suppose you are given a 40 × 40 colored image, which consists of 3 channels (so your input is a 40 × 40 × 3 tensor), each representing the intensity of one primary color. Suppose you again do not do parameter sharing, using a unique 4 × 4 filter per output pixel, with a stride of 2 and no padding. What is the number of parameters you are learning in this layer?
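A small helper for sanity-checking parameter counts under the assumptions above (no bias term, and one k × k filter per output position when sharing is disabled); this is a sketch and the function name is mine, not part of the assignment.

import math

def conv_param_count(in_h, in_w, k, stride, channels=1, shared=True, padding=0):
    # spatial size of the output feature map
    out_h = math.floor((in_h + 2 * padding - k) / stride) + 1
    out_w = math.floor((in_w + 2 * padding - k) / stride) + 1
    weights_per_filter = k * k * channels         # weights in one k x k filter
    num_filters = 1 if shared else out_h * out_w  # one filter per output pixel if unshared
    return num_filters * weights_per_filter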
(f) (1 point) In one concise sentence, describe a reason why parameter sharing is a good idea for a convolutional layer applied to image data, besides the reduction in number of learned parameters.
3. Neural the Narwhal was expecting to implement a CNN for Homework 5, but he is disappointed that he only got to write a simple fully-connected neural network.
(a) (2 points) Neural decides to implement a CNN himself and comes up with the following naive implementation:
# image X has shape (H_in, W_in), and filter F has shape (K, K)
# the output Y has shape (H_out, W_out)
Y = np.zeros((H_out, W_out))
for r in range(H_out):
    for c in range(W_out):
        for i in range(K):
            for j in range(K):
                Y[r, c] += X[ blank ] * F[i, j]
What should be in the blank above so that the output Y is correct? Assume that H_out and W_out are pre-computed correctly.
Suppose you have an input vector x = [x1, x2, x3, x4, x5]^T and a 1D convolution filter w = [w1, w2, w3]^T. Then if the output is y = [y1, y2, y3]^T, we have y1 = w1x1 + w2x2 + w3x3, y2 = ···, y3 = ···. If you look at this closely, this is equivalent to the matrix-vector product y = Ax, where the matrix A is given as ···
What is the matrix A for this x, y, and w? Write only the final answer. Your work will not be graded.
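The equivalence between this sliding dot product and a matrix-vector product can be checked numerically; the snippet below is an illustrative sketch with made-up filter values and helper names of my own, not the graded answer.

import numpy as np

def valid_conv_matrix(w, n):
    # build A so that A @ x equals the sliding dot product of x (length n) with w
    k = len(w)
    m = n - k + 1                    # number of output entries
    A = np.zeros((m, n))
    for i in range(m):
        A[i, i:i + k] = w            # each row is a shifted copy of the filter
    return A

x = np.arange(1.0, 6.0)              # placeholder input of length 5
w = np.array([2.0, -1.0, 0.5])       # placeholder filter of length 3
A = valid_conv_matrix(w, len(x))
y_loop = np.array([w @ x[i:i + len(w)] for i in range(len(x) - len(w) + 1)])
assert np.allclose(A @ x, y_loop)    # matrix form matches the sliding-window form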
3 Learning Theory (19 points)
1. Neural the Narwhal is given a classification task to solve, which he decides to tackle with a decision tree learner over 2 binary features X1 and X2. On the other hand, you think that Neural should not have used a decision tree; instead, you think it would be best to use logistic regression with 16 real-valued features in addition to a bias term. You want to use PAC learning to check whether you are correct. You first train your logistic regression model on N examples to obtain a training error R̂.
(a) (1 point) Which of the following case of PAC learning should you use for your logistic regression model?
⃝ Finite and realizable
⃝ Finite and agnostic
⃝ Infinite and realizable
Infinite and agnostic
(c) (3 points) Select one: You want to argue your method has a lower bound on the true error. Assume that you have obtained enough data points to satisfy the PAC criterion with the same ϵ and δ as Neural. Which of the following is true?
⃝ Neural’s model will always classify unseen data more accurately because it only needs 2 binary features and therefore is simpler.
⃝ You must first regularize your model by removing 14 features to make any comparison at all.
⃝ It is sufficient to show that the VC dimension of your classifier is higher than that of Neural’s, therefore having a lower bound for the true error.
⃝ It is necessary to show that the training error you achieve is lower than the training error Neural achieves.
2. In lecture, we saw that we can use our sample complexity bounds to derive bounds on the true error for a particular algorithm. Consider the sample complexity bound for the infinite, agnostic case:
N = O( (1/ϵ²) [ VC(H) + ln(1/δ) ] ).
(a) (2 points) Rewrite the big-O bound in terms of N and δ using the definition of big-O notation (i.e. if N = O(M) (for some value M), then there exists a constant c ∈ R such that N ≤ cM).
(b) (2 points) Now, using the definition of ϵ (i.e. |R(h) − R̂(h)| ≤ ϵ), show that with probability at least (1 − δ):
R(h) ≤ R̂(h) + √( c [ VC(H) + ln(1/δ) ] / N ).
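A sketch of the algebra, assuming the bound takes the form shown above (with c the constant hidden by the big-O):

\[
N \le \frac{c}{\epsilon^2}\left[\mathrm{VC}(\mathcal{H}) + \ln\tfrac{1}{\delta}\right]
\quad\Longrightarrow\quad
\epsilon \le \sqrt{\frac{c\left[\mathrm{VC}(\mathcal{H}) + \ln\tfrac{1}{\delta}\right]}{N}},
\]

and combining this with |R(h) − R̂(h)| ≤ ϵ gives the displayed bound on R(h).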
3. (3 points) Consider the hypothesis space of functions that map M binary attributes to a binary label. A function f in this space can be characterized as f : {0,1}^M → {0,1}. Neural the Narwhal says that regardless of the value of M, this hypothesis space can always shatter 2^M points. Is Neural wrong? If so, provide a counterexample. If Neural is right, briefly explain why in 1-2 concise sentences.
4. Consider an instance space X which is the set of real numbers.
(a) (3 points) Select one: What is the VC dimension of hypothesis class H, where each hypothesis h in H is of the form “if a < x < b or c < x < d then y = 1; otherwise y = 0”? (i.e., H is an infinite hypothesis class where a,b,c, and d are arbitrary real numbers).
⃝ 2
⃝ 3
4
⃝ 5
⃝ 6
(b) (3 points) Given the set of points in X below, construct a labeling of some subset of the points to show that any dimension larger than the VC dimension of H by exactly 1 is incorrect (e.g. if the VC dimension of H is 3, only fill in the answers for 4 of the points). Fill in the boxes such that for each point in your example, the corresponding label is either 0 or 1. For points you are not using in your example, write N/A (do not leave the answer box blank).
−3 −2 −1 0 1 2 3
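As an aside, whether a particular set of points is shattered by this two-interval hypothesis class can be brute-forced; the sketch below is illustrative only (the helper names and the candidate-endpoint grid are my own choices), not part of the assignment.

import itertools
import numpy as np

def two_interval_label(points, a, b, c, d):
    # label 1 if a < x < b or c < x < d, else 0
    x = np.asarray(points, dtype=float)
    return (((a < x) & (x < b)) | ((c < x) & (x < d))).astype(int)

def can_shatter(points):
    # brute-force search over interval endpoints placed between and beyond the points
    pts = sorted(points)
    cands = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    achieved = {tuple(two_interval_label(pts, a, b, c, d))
                for a, b, c, d in itertools.product(cands, repeat=4)}
    return len(achieved) == 2 ** len(pts)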
4 MLE/MAP (32 points)
1. (1 point) True or False: Suppose you place a Beta prior over the Bernoulli distribution, and attempt to learn the parameter θ of the Bernoulli distribution from data. Further suppose an adversary chooses “bad” but finite hyperparameters for your Beta prior in order to confuse your learning algorithm. As the number of training examples grows to infinity, the MAP estimate of θ can still converge to the MLE estimate of θ.
True
⃝ False
2. (2 points) Select one: Let Γ be a random variable with the following probability density function (pdf):
f(γ) = 2γ if 0 ≤ γ ≤ 1, and f(γ) = 0 otherwise.
Suppose another random variable Y, which is conditioned on Γ, follows an exponential distribution with λ = 3γ. Recall that the exponential distribution with parameter λ has the following pdf:
f_exp(y) = λe^(−λy) if y ≥ 0, and f_exp(y) = 0 otherwise.
What is the MAP estimate of γ given an observed value of Y?
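As a reminder of the general recipe (independent of the particular observed value), the MAP estimate maximizes the posterior, which by Bayes' rule is proportional to the likelihood times the prior:

\[
\hat{\gamma}_{\text{MAP}} = \operatorname*{argmax}_{\gamma} f(\gamma \mid y)
= \operatorname*{argmax}_{\gamma} f(y \mid \gamma)\, f(\gamma)
= \operatorname*{argmax}_{\gamma} \left[\, \log f(y \mid \gamma) + \log f(\gamma) \,\right].
\]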
3. (4 points) Neural the Narwhal found a mystery coin and wants to know the probability of landing on heads by flipping this coin. He models the coin toss as sampling a value from Bernoulli(θ) where θ is the probability of heads. He flips the coin three times and the flips turned out to be heads, tails, and heads. An oracle tells him that θ ∈ {0,0.25,0.5,0.75,1}, and no other values of θ should be considered.
Find the MLE and MAP estimates of θ. Use the following prior distribution for the MAP estimate:
p(θ) = …    if θ = 0
       0.04 if θ = 0.25
       0.03 if θ = 0.5
       0.02 if θ = 0.75
       0.01 if θ = 1
Again, remember that θ ∈ {0,0.25,0.5,0.75,1}, so the MLE and MAP should also be one of them.
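A few lines of Python illustrate the mechanics of a discrete MLE/MAP search; this is a sketch, and the prior value used for θ = 0 is an assumed placeholder (it does not affect either argmax, because the likelihood of the observed flips is zero there).

import numpy as np

thetas = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
prior  = np.array([0.9, 0.04, 0.03, 0.02, 0.01])  # p(theta); the value at theta=0 is assumed

heads, tails = 2, 1                                # observed flips: H, T, H
likelihood = thetas**heads * (1 - thetas)**tails   # Bernoulli likelihood of the data

theta_mle = thetas[np.argmax(likelihood)]          # MLE ignores the prior
theta_map = thetas[np.argmax(likelihood * prior)]  # MAP weights the likelihood by the prior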
4. In a previous homework assignment, you derived the closed-form solution for linear regression. Now we return to linear regression, viewing it as a statistical model and deriving the MLE and MAP estimates of its parameters in the following questions. As a reminder, in MLE we have
θ̂_MLE = argmax_θ p(D | θ)
and for MAP we have
θ̂_MAP = argmax_θ p(θ | D).
Assume we have data D = {(x^(i), y^(i))}, i = 1, ..., N, where x^(i) ∈ R^M. So our data has N instances and each instance has M features. Each y^(i) is generated given x^(i) with additive noise ϵ^(i) ∼ N(0, σ²): that is, y^(i) = w^T x^(i) + ϵ^(i), where w is the parameter vector of linear regression.
(a) (2 points) Select one: Given this assumption, what is the distribution of y^(i)?
y^(i) ∼ N(w^T x^(i), σ²)
⃝ y^(i) ∼ N(0, σ²)
⃝ y^(i) ∼ Uniform(w^T x^(i) − σ, w^T x^(i) + σ)
⃝ None of the above
(b) (2 points) Select one: The next step is to learn the MLE of the parameters of the linear regression model. Which expression below is the correct conditional log likelihood ℓ(w) with the given data?
(c) (3 points) Select all that apply: Then, the MLE of the parameters is just argmaxw ℓ(w). Among the following expressions, select ALL that can yield the correct MLE.
argmax_w ···
■ argmax_w ···
■ argmax_w ···
argmax_w ···
■ argmax_w ···
□ None of the above
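For intuition, a sketch of the conditional log-likelihood under the stated Gaussian noise model:

\[
\ell(\mathbf{w}) = \sum_{i=1}^{N} \log \mathcal{N}\!\left(y^{(i)};\, \mathbf{w}^T \mathbf{x}^{(i)},\, \sigma^2\right)
= -\frac{N}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2,
\]

so maximizing ℓ(w) is equivalent to minimizing the sum of squared residuals.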
5. Now we are moving on to learn the MAP estimate of the parameters of the linear regression model. Consider the same data D we used for the previous problem.
(a) (3 points) Select all that apply: Which expression below is the correct optimization problem that the MAP estimate is trying to solve? Recall that D refers to the data, and w to the regression parameters (weights).
■ w_MAP = argmax_w p(D, w)
■ w_MAP = argmax_w ···
w_MAP = argmax_w ···
■ w_MAP = argmax_w p(D | w) p(w)
■ w_MAP = argmax_w p(w | D)
□ None of the above
(b) (3 points) Select one: Suppose we are using a Gaussian prior distribution with mean 0 and a fixed variance for each element w_m of the parameter vector w. Assume that w_1, ..., w_M are mutually independent of each other. Which expression below is the correct log joint-probability of the data and parameters, log p(D, w)? Please show your work below.
(c) (2 points) Select one: For the same linear regression model with a Gaussian prior on the parameters as in the previous question, maximizing the log posterior probability ℓ_MAP(w) gives you the MAP estimate of the parameters. Which of the following is an equivalent definition of max_w ℓ_MAP(w)?
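For intuition, a sketch assuming each w_m ∼ N(0, τ²) independently (τ² here is just a placeholder symbol for the prior variance):

\[
\ell_{\text{MAP}}(\mathbf{w}) = \log p(\mathcal{D} \mid \mathbf{w}) + \log p(\mathbf{w})
= \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2
- \frac{1}{2\tau^2}\sum_{m=1}^{M} w_m^2,
\]

so maximizing ℓ_MAP(w) is equivalent to minimizing the squared error plus an ℓ2 penalty whose coefficient grows as σ²/τ².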
(d) (2 points) Select one: You found a MAP estimator that has a much higher test error than train error using some Gaussian prior. Which of the following could be a possible approach to fixing this?
⃝ Increase the variance of the prior used
Decrease the variance of the prior used
6. (4 points) Select one: Suppose now the additive noise is different per data point. That is, each y^(i) is generated given x^(i) with additive noise ϵ^(i) ∼ N(0, σ_i²), i.e. y^(i) = w^T x^(i) + ϵ^(i). Unlike the standard regression model we have worked with until now, there is now an example-specific variance σ_i². Maximizing the log-likelihood of this new model is equivalent to minimizing the weighted mean squared error with which of the following as the weights? Please show your work below.
⃝ 1/y^(i)
1/σ_i²
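One way to see the connection (a sketch under the heteroscedastic model as stated):

\[
\ell(\mathbf{w}) = \sum_{i=1}^{N} \log \mathcal{N}\!\left(y^{(i)};\, \mathbf{w}^T \mathbf{x}^{(i)},\, \sigma_i^2\right)
= \sum_{i=1}^{N}\left[-\tfrac{1}{2}\log\!\left(2\pi\sigma_i^2\right)
- \frac{\left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2}{2\sigma_i^2}\right],
\]

so maximizing ℓ(w) is equivalent to minimizing a squared error in which each example is weighted by the reciprocal of its noise variance.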
7. (4 points) Select one: MAP estimation with what prior is equivalent to ℓ1 regularization? Please show your work below.
Note:
• The pdf of a uniform distribution over an interval [c, d] is f(x) = 1/(d − c) for c ≤ x ≤ d, and 0 otherwise.
• The pdf of an exponential distribution with rate parameter a is f(x) = a exp(−ax) for x > 0.
• The pdf of a Laplace distribution with location parameter a and scale parameter b is f(x) = (1/(2b)) exp(−|x − a|/b) for all x ∈ R.
⃝ Uniform distribution over [−1,1]
⃝ Uniform distribution over
⃝ Exponential distribution with rate parameter
⃝ Exponential distribution with rate parameter a = wT x(i)
Laplace distribution with location parameter a = 0
⃝ Laplace distribution with location parameter a = wT x(i)
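For intuition (a sketch, assuming an independent Laplace prior with location 0 and scale b on each weight):

\[
\log p(\mathbf{w}) = \sum_{m=1}^{M} \log\!\left[\frac{1}{2b}\exp\!\left(-\frac{|w_m|}{b}\right)\right]
= -M\log(2b) - \frac{1}{b}\sum_{m=1}^{M} |w_m|,
\]

so the MAP objective adds a penalty proportional to the ℓ1 norm of w.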
5 Naïve Bayes (18 points)
1. (3 points) Suppose that 0.3% of all people have cancer. If someone is tested for cancer, the outcome of the test can either be positive (cancer) or negative (no cancer). The test is not perfect—among people who have cancer, the test comes back positive 97% of the time. Among people who don’t have cancer, the test comes back positive 4% of the time. For this question, you should assume that the test results are independent of each other, given the true state (cancer or no cancer). What is the probability of a test subject having cancer, given that the subject’s test result is positive? Round the answer to the fourth decimal place, e.g. 0.1234.
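For reference, the requested quantity follows from Bayes' rule, with the law of total probability in the denominator:

\[
P(\text{cancer} \mid +) =
\frac{P(+ \mid \text{cancer})\, P(\text{cancer})}
{P(+ \mid \text{cancer})\, P(\text{cancer}) + P(+ \mid \text{no cancer})\, P(\text{no cancer})}.
\]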
2. The following dataset describes several features of a narwhal and then whether or not it has an Instagram account.
Color Size Has Instagram?
Rainbow Small N
Cyan Small N
Cyan Small Y
Cyan Medium Y
Rainbow Medium N
Fuchsia Medium Y
Fuchsia Large Y
Cyan Large Y
Neural the Narwhal is cyan and medium-sized. We would like to determine whether he has an Instagram account, using the Naïve Bayes assumption to estimate the following probabilities.
(a) (2 points) What is the probability that a narwhal is cyan, medium-sized, and has an Instagram account? Round the answer to the fourth decimal place, e.g. 0.1234.
(b) (2 points) What is the probability that a narwhal is cyan, medium-sized, and does not have an Instagram account? Round the answer to the fourth decimal place, e.g. 0.1234.
(c) (1 point) Select one: Does Neural the Narwhal have an Instagram account?
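A short script illustrates how such quantities would be estimated from the table by simple counting (MLE, no smoothing) under the Naïve Bayes factorization; the variable and function names are mine, and this is a sketch rather than the official solution.

data = [
    ("Rainbow", "Small",  "N"), ("Cyan",    "Small",  "N"),
    ("Cyan",    "Small",  "Y"), ("Cyan",    "Medium", "Y"),
    ("Rainbow", "Medium", "N"), ("Fuchsia", "Medium", "Y"),
    ("Fuchsia", "Large",  "Y"), ("Cyan",    "Large",  "Y"),
]

def joint(color, size, label):
    # P(Y = label) * P(color | label) * P(size | label), all estimated by counts
    rows = [r for r in data if r[2] == label]
    p_y = len(rows) / len(data)
    p_color = sum(r[0] == color for r in rows) / len(rows)
    p_size = sum(r[1] == size for r in rows) / len(rows)
    return p_y * p_color * p_size

for label in ("Y", "N"):
    print(label, round(joint("Cyan", "Medium", label), 4))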
3. (3 points) Select all that apply: Gaussian Naïve Bayes in general can learn non-linear decision boundaries. Consider the simple case where we have just one real-valued feature X1 ∈ R from which we wish to infer the value of the label Y ∈ {0,1}. The corresponding generative story would be:
Y ∼ Bernoulli(ϕ)
X1 | Y = y ∼ Gaussian(μ_y, σ_y²)
where the parameters are the Bernoulli parameter ϕ and the class-conditional Gaussian parameters (μ_0, σ_0²) and (μ_1, σ_1²) corresponding to Y = 0 and Y = 1, respectively.
A linear decision boundary in one dimension can be described by a rule of the form:
if X1 > c then Y = k, else Y = 1 − k
where c is a real-valued threshold and k ∈ {0,1}. Note that X1 > c here denotes a generic linear inequality. That is, we use X1 > c to mean any of X1 > c, X1 ≥ c, X1 < c, and X1 ≤ c.
Is it possible in this simple one-dimensional case to construct a Gaussian Naïve Bayes classifier with a decision boundary that cannot be expressed by a rule in the above form?
■ Yes, this can occur if the Gaussians are of equal means and unequal variances.
■ Yes, this can occur if the Gaussians are of unequal means and equal variances.
■ Yes, this can occur if the Gaussians are of unequal means and unequal variances.
□ None of the above
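One way to see where a decision region other than a half-line can come from (a sketch using the class-conditional parameters above):

\[
\log\frac{p(X_1 = x \mid Y = 1)\, p(Y = 1)}{p(X_1 = x \mid Y = 0)\, p(Y = 0)}
= -\frac{(x - \mu_1)^2}{2\sigma_1^2} + \frac{(x - \mu_0)^2}{2\sigma_0^2}
+ \log\frac{\phi\, \sigma_0}{(1 - \phi)\, \sigma_1},
\]

which is quadratic in x whenever σ₀ ≠ σ₁, so the set of x assigned to one class can be an interval rather than a half-line.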
4. Now we will expand this to 2D and derive the decision boundary of a 2D Gaussian Naïve Bayes.
■ (a)
■ (b)
□ (c)
■ (d)
□ None of the above
5. (2 points) Select all that apply: Neural the Narwhal used a 2D Gaussian Naïve Bayes model for a binary classification task, and obtained the following contour plot for the density f(x1, x2 | y = +). You instantly notice that he made a mistake. Why is his result theoretically impossible to obtain from a Gaussian Naïve Bayes model?
□ The mean of the distribution should be the mean of all points, not just positive points (+)
■ A Bernoulli Naïve Bayes model should be used instead for discrete labels (+ and −)
■ The covariance matrix of the distribution is not a diagonal matrix
□ None of the above
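For context on the covariance option (a sketch of the assumption, not the full solution): the Naïve Bayes conditional-independence assumption forces the class-conditional density to factorize,

\[
f(x_1, x_2 \mid y) = f(x_1 \mid y)\, f(x_2 \mid y)
= \mathcal{N}\!\left(x_1;\, \mu_{1,y},\, \sigma_{1,y}^2\right)\,
  \mathcal{N}\!\left(x_2;\, \mu_{2,y},\, \sigma_{2,y}^2\right),
\]

i.e. a Gaussian with diagonal covariance, whose density contours are axis-aligned ellipses.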
6 Collaboration Questions
1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.
3. Did you find or come across code that implements any part of this assignment? If so, include full details.