Description
Learning Theory and Generative Models
TAs: Sankalp Patro, Vinay Sameer Kadi, Hanyue Chai, Sana Lakdawala, Vishal Baskar
START HERE: Instructions
Homework 6 covers topics on learning theory, MLE/MAP, Naive Bayes, and revisits neural networks, logistic regression and regularization. The homework includes multiple choice, True/False, and short answer questions.
• Late Submission Policy: See the late submission policy here: http://www.cs.cmu. edu/~mgormley/courses/10601/about.html#6-general-policies
• Submitting your work:
correctly by our AI assisted grader. In addition, please tag the problems to the corresponding pages when submitting your work.
For multiple choice or select all that apply questions, shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions. For LATEXusers, use and for shaded boxes and circles, and don’t change anything else.
Instructions for Specific Problem Types
For “Select One” questions, please fill in the appropriate bubble completely:
Select One: Who taught this course?
Matt Gormley
Marie Curie
Noam Chomsky
Select One: Who taught this course?
Matt Gormley # Marie Curie
@ @ Noam Chomsky
For “Select all that apply” questions, please fill in all appropriate squares completely:
Select all that apply: Which are scientists?
2 Stephen Hawking Albert Einstein
Isaac Newton 2 None of the above
Select all that apply: Which are scientists?
Stephen Hawking
Albert Einstein Isaac Newton
@@ I don’t know
Fill in the blank: What is the course number?
10-601 10-S7601
S
1 Neural Networks, Logistic Regression, Regularization Revisited [28 pts]
Imagine you work for a pharmaceutical company that is trying to predict whether certain patients will respond well to a new drug. Specifically, these patients have high blood pressure in their lungs, a condition known as pulmonary hypertension. Doctors recommend that the best predictors of a treatment effect is the patient’s heart function (measured by cardiac output) and the blood pressure in their lungs (pulmonary blood pressure). You plot the data and visualize the following:
1. [3 pt] Draw on the graph above the decision boundaries of a trained neural network that minimizes the training error when classifying Good Responders vs all others (Poor or Average). Assume the neural network has two hidden units and one hidden layer. What is the smallest training error you can achieve?
Fill in the blank (write answer as a fraction):
2. [2 pt] Using your decision boundaries above, assuming a logistic activation function, which point has the highest probability of being a Good Responder? Provide the point number as shown on the graph (poitnts are labeled by their y value).
Fill in the blank:
3. [2 pt] True or False: Increasing the number of hidden units of a neural network will always guarantee a lower training error.
True
False
4. [2 pt] When performing linear regression, which of the following options will decrease mean-squared training error:
Select all that apply:
Increasing the order of the polynomial
Increasing the regularization weight
For the same weight on the regularizer, using an L1 regularizer instead of an L2
For the same weight on the regularizer, using an L1 regularizer instead of an L0
2 None of the above
5. [2 pt] Convolutional neural networks often consist of convolutional layers, max-pooling layers, and fully-connected layers. Select all the layer types below that have parameters (i.e. weights) which can be learned by gradient descent / backpropagation.
Select all that apply:
convolutional layer max-pooling layer 2 fully-connected layer
6. [3 pt] Consider the black-and-white 5 pixel by 5 pixel image shown below. Black pixels are represented by the value 0 and white pixels by 1.
1 1 0 0 0
1 1 1 0 0
0 1 1 1 0
0 0 1 1 1
0 0 0 1 1
Next consider the 3 by 3 convolution with weights shown below.
-1 0 1
-1 0 1
-1 0 1
Suppose we apply the above convolution (as one would in a CNN convolutional layer) to the above image to produce a new output image. Assume that we do not permit any padding to be used and the stride of the convolution is 1. What is the value of the pixel in the upper-left corner of the output image?
(Important Note: Convolution is sometimes defined differently in machine learning than in other fields, such as signal processing. So be sure to follow the method of convolution shown in the guest lecture on CNNs.) Fill in the blank:
7. [3 pt] For the same output image produced in the previous question, what is the value of the pixel in the upper-right corner of the output image?
Fill in the blank:
8. [2 pt] True or False: Long Short Term Memory (LSTM) networks partially address the vanishing gradient problem by incorporating input, output, and forget gates.
True
False
9. [2 pt] True or False: In a recurrent neural network (RNN), the weight vector passed to a hidden unit at the same hidden layer level from a prior unit will have different values.
True
False
10. [2 pt] True or False: Recurrent neural networks (RNNs) can accept sequence data as input, but can only output a single classification decision for that input sequence.
True
False
11. [2 pt] Which of the following are true about regularization? Select all that apply:
One of the goals of regularization is combating overfitting.
A model with regularization fits the training data better than a model without regularization
The L-0 norm (number of non-zero parameters) is rarely used in practice in part because it is non-differentiable.
One way to understand regularization is that it attempts to follow Occam’s razor and make the learning algorithm prefer ”simpler” solutions.
12. [3 pt] When performing linear regression, which of the following options will decrease mean-squared training error:
Select all that apply:
Adding higher-order functions of the features as separate features
Increasing the regularization weight
For the same weight on the regularizer, using an L1 regularizer instead of an L2
For the same weight on the regularizer, using an L1 regularizer instead of an L0
2 None of the above
2 Learning Theory [18 pts]
1. [3 pt] Let . According to the PAC theorems discussed in class, which of the following is correct? Select one.
Select one:
With probability at least 1−δ, every hypothesis with training error at most has true error 0.
With probability at least 1 − , a random hypothesis with training error 0 has true error at most δ.
With probability at least 1 − δ, every hypothesis with training error 0 has true error at most .
With probability at least 1 − , a random hypothesis with true error 0 has training error at most δ.
2. [3 pt] Consider a decision tree learner applied to data where each example is described by 10 boolean variables X1,X2,··· ,X10. What is the VC dimension of the hypothesis space used by this decision tree learner?
Fill in the blank:
3. [4 pt] Consider instance space X which is the set of real numbers. What is the VC dimension of hypothesis class H, where each hypothesis h in H is of the form “if a < x < b or c < x < d then y = 1; otherwise y = 0”? (i.e., H is an infinite hypothesis class where a, b, c, and d are arbitrary real numbers.
Select one:
4. Alex is given a classification task to solve. He has no idea where to start, so he decided to try out a decision tree learner with 2 binary features X1 and X2. He recently learned about PAC learning, and would like to know what is the minimum number (N) of data points that would suffice for the PAC criterion with 1 and δ = 0.01.
Fill in the blank:
Fill in the blank:
6. [3 pt] Sally wants to argue her method has lower bound on the true error. Assuming Sally has obtained enough data points to satisfy PAC criterion with 1 and δ = 0.01. Which of the following is true?
Select one:
Sally is wrong. Alex’s method will always classify unseen data more accurately since it is simpler as it only needs 2 binary features.
She must first regularize her model by removing 14 features to make any comparison at all.
It is sufficient to show that the VC Dimension of her classifier is higher than Alex’s, therefore having lower bound for the true error.
It is necessary to show that the training error she achieves is lower than the training error Alex achieves.
3 MLE/MAP [34 pts]
1. [3 pt] True or False: Suppose you place a Beta prior over the Bernoulli distribution, and attempt to learn the parameter of the Bernoulli distribution from data. Further suppose an adversary chooses “bad”, but finite hyperparameters for your Beta prior in order to confuse your learning algorithm. As the number of training examples grows to infinity, the MAP estimate of θ can still converge to the MLE estimate of θ.
Select One:
True
False
2. [3 pt] Let θ be a random variable with the following probability density function (pdf):
(
2θ if 0 ≤ θ ≤ 1
f(θ) =
0 otherwise
Suppose another random variable Y, which is conditioning on θ, follows an exponential distribution with λ = 3θ. Recall that the exponential distribution with parameter λ has the following pdf:
What is the MAP estimate of θ given is observed?
Select one:
3. In HW3, you have derived the closed form solution for linear regression. Now, we are coming back to linear regression, viewing it as a statistical model, and deriving the MLE and MAP estimate of the parameters in the following questions.
Assume we have data , where x ) . So our data has N instances and each instance has M attributes/features. Each y(i) is generated given x(i) with additive noise ), that is where w is the parameter vector of linear regression. Given this assumption, what is the distribution of y?
Select one:
Uniform(wTx(i) − σ,wTx(i) + σ)
None of the above
4. [4 pt] The next step is to learn the MLE of the parameters of the linear regression model. Which expression below is the correct conditional log likelihood `(w) with the given data?
Select one:
5. [4 pt] Then, the MLE of the parameters is just argmaxw `(w) . Among the following expressions, select ALL that can yield the correct MLE.
Select all that apply: argmaxw
argmaxw argmaxw argmaxw 2 argmaxw
6. According to the above derivations, is the MLE for the conditional log likelihood equivalent to minimizing mean squared errors (MSE) for the linear regression model when making predictions? Why or why not?
Select one:
Yes, because the derivative of the negative conditional log-likelihood has the same form as the derivative of the MSE loss.
Yes, because the parameters that maximize the conditional log-likelihood also minimize the MSE loss.
No, because one is doing maximization and the other is doing minimization.
No, because the MSE has an additional error term in the expression whereas the quantity to be minimized in MLE does not.
No, because the conditional log-likelihood has additional constant terms that do not appear in the MSE loss.
7. [3 pt] Now we are moving on to learn the MAP estimate of the parameters of the linear regression model. The MAP estimate is obtained through solving the following optimization problem.
wMAP = argmaxw p(w|D) = argmaxw p(D,w)
Suppose are using a Gaussian prior distribution with mean 0 and variance for each element wm of the parameter vector w(1 ≤ m ≤ M), i.e. wm ∼ N(0, λ1). Assume that w1,··· ,wM are mutually independent of each other. Which expression below is the correct log joint-probability of the data and parameters logp(D,w))?
(For simplicity, just use p(D|w) to denote the data likelihood.)
Select one:
8. A MAP estimator with a Gaussian prior N(0,σ2) you trained gives significantly higher test error than train error. What could be a possible approach to fixing this?
Select one:
Increase variance σ2
Decrease variance σ2
Try MLE estimator instead
None of the above
9. [4 pt] Maximizing the log posterior probability `MAP(w) gives you the MAP estimate of the parameters. The MAP estimate with Gaussian prior is actually equivalent to a L2 regularization on the parameters of linear regression model in minimizing an objective function J(w) that consists of a term related to log conditional likelihood `(w) and a L2 regularization term. The following options specify the two terms in J(w) explicitly. Which one is correct based on your derived log posterior probability in the previous question?
Select one:
10. [4 pt] MAP estimation with what prior is equivalent to L1 regularization?
Note:
The pdf of a Uniform distribution over [a,b] is ] and 0 otherwise.
The pdf of an exponential distribution with rate parameter a is f(x) = aexp(−ax) for x > 0.
The pdf of a Laplace distribution with location parameter a and scale parameter b is for all x ∈ R.
Select one:
Uniform distribution over [−wTx(i),wTx(i)]
Exponential distribution with rate parameter
Exponential distribution with rate parameter a = wTx(i)
Laplace prior with location parameter a = 0
Laplace prior with location parameter a = wTx(i)
Uniform distribution over [-1, 1]
4 Naive Bayes [27 pts]
1. [3 pt] I give you the following fact: for events A and B, P(A | B) = 2/3 and P(A | ¬B) = 1/3, where ¬B denotes the complement of B. Do you have enough information to calculate P(B | A)? If not, choose “not enough information”, if so, compute the value of P(B | A).
Select one:
Not enough information
2. [3 pt] Instead if I give you for events A and B, P(A | B) = 2/3, P(A | ¬B) = 1/3 and P(B) = 1/3 and P(A) = 4/9, where ¬B denotes the complement of B. Do you have information to calculate P(B | A)? If not, choose “not enough information”, if so, compute the value of P(B | A).
Select one:
Not enough information
3. [4 pt] Gaussian Naive Bayes in general can learn non-linear decision boundaries. Consider the simple case where we have just one real-valued feature X1 ∈ R from which we wish to infer the value of label Y ∈ {0,1}.The corresponding generative story would be:
Y ∼ Bernoulli(φ) X1 ∼ Gaussian( where the parameters are the Bernoulli parameter φ and the class-conditional Gaussian parameters and corresponding to Y = 0 and Y = 1 , respectively.
A linear decision boundary in one dimension, of course, can be described by a rule of the form “if X1 > c then Y = 1, else Y = 0”, where c is a real-valued threshold (see diagram provided). Is it possible in this simple one-dimensional case to construct a Gaussian Naive Bayes classifier with a decision boundary that cannot be expressed by a rule in the above form)?
Select one:
Yes, this can occur if the Gaussians are of equal means and equal variances.
Yes, this can occur if the Gaussians are of equal means and unequal variances.
Yes, this can occur if the Gaussians are of unequal means and equal variances. No, this cannot occur regardless of the relationship of the means or variances.
4. [4 pt] Suppose that 0.3% people have cancer. Someone decided to take a medical test for cancer. The outcome of the test can either be positive (cancer) or negative (no cancer). The test is not perfect – among people who have cancer, the test comes back positive 97% of the time. Among people who don’t have cancer, the test comes back positive 4% of the time. For this question, you should assume that the test results are independent of each other, given the true state (cancer or no cancer). What is the probability of a test subject having cancer, given that the subject’s test result is positive?
If your answer is in decimals, answer with precision 4, e.g. (6.051, 0.1230, 1.234e+7) Fill in the blank:
5. [4 pt] In a Naive Bayes problem, suppose we are trying to compute P(Y | X1,X2,X3,X4) . Furthermore, suppose X2 and X3 are identical (i.e., X3 is just a copy of X2 ). Which of the following are true in this case?
Select all that apply:
Naive Bayes will learn identical parameter values for P(X2|Y ) and P(X3|Y ).
Naive Bayes will output probabilities P(Y |X1,X2,X3,X4) that are closer to 0 and 1 than they would be if we removed the feature corresponding to X3.
This will not raise a problem in the output P(Y |X1,X2,X3,X4) because the conditional independence assumption will correctly treat this situation. 2 None of the above
6. [3 pt] Which of the following machine learning algorithms are probabilistic generative models?
Select one:
Decision Tree
K-nearest neighbors
Perceptron
Naive Bayes
Logistic Regression
Feed-forward neural network
7. [6 pt] Logistic Regression and Naive Bayes.
When Y is Boolean and X = hX1…Xni is a vector of continuous variables, then the assumptions of the Gaussian Naive Bayes classifier imply that P(Y | X) is given by the logistic function with appropriate parameters wi for all i and b. In particular:
and
Consider instead the case where Y is Boolean and X = hX1…Xni is a vector of Boolean variables.
Since the Xi are Boolean variables, you need only one parameter to define P(Xi | Y = yk). Define φi1 ≡ P(Xi = 1 | Y = 1), in which case P(Xi = 0 | Y = 1) = (1 − φi1). Similarly, use φi0 to denote P(Xi = 1|Y = 0).
(a) Given that we can simplify
Find expressions for P(Y = 1|X) and P(Y = 0|X) in terms of b and wi. Explicitly define b and wi.
Collaboration Questions Please answer the following:
1. Did you receive any help whatsoever from anyone in solving this assignment? Is so, include full details.
2. Did you give any help whatsoever to anyone in solving this assignment? Is so, include full details.
3. Did you find or come across code that implements any part of this assignment ? If so, include full details.
Solution:
Reviews
There are no reviews yet.