## Description

LEARNING THEORY AND GENERATIVE MODELS

TAs: Catherine, Zachary, Ari, Siyuan, Anoushka

Homework 6 covers topics on learning theory, MLE/MAP, Naive Bayes, and revisits neural networks, logistic regression and regularization. The homework includes multiple choice, True/False, and short answer questions.

START HERE: Instructions

˜mgormley/courses/10601/syllabus.html

• Late Submission Policy: See the late submission policy here: http://www.cs.cmu.edu/ ˜mgormley/courses/10601/syllabus.html

• Submitting your work: You will use Gradescope to submit answers to all questions and code. Please follow instructions at the end of this PDF to correctly submit all your code to Gradescope.

1

Instructions for Specific Problem Types

For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

Matt Gormley / Henry Chai

# Marie Curie

# Noam Chomsky

Select One: Who taught this course?

Matt Gormley / Henry Chai

# Marie Curie

@ Noam Chomsky

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

2 Stephen Hawking Albert Einstein

Isaac Newton 2 None of the above

Select all that apply: Which are scientists?

Stephen Hawking

Albert Einstein Isaac Newton

@ I don’t know

Fill in the blank: What is the course number?

10-S7601S

1 Neural Networks, Logistic Regression, Regularization Revisited

Imagine you work for a pharmaceutical company that is trying to predict whether certain patients will respond well to a new drug. Specifically, these patients have high blood pressure in their lungs, a condition known as pulmonary hypertension. Doctors recommend that the best predictors of a treatment effect is the patient’s heart function (measured by cardiac output) and the blood pressure in their lungs (pulmonary blood pressure). You plot the data and visualize the following:

1. (3 points) [SOLO] Draw on the graph above the decision boundaries of a trained neural network that minimizes the training error when classifying Good Responders vs all others (Poor or Average). Assume the neural network has two hidden units and one hidden layer. What is the smallest training error you can achieve?

Fill in the blank (write answer as a fraction):

2. (2 points) [SOLO] Using your decision boundaries above, assuming a logistic activation function, which point has the highest probability of being a Good Responder? Provide the point number as shown on the graph (points are labeled by their y value).

Fill in the blank:

3. (2 points) [SOLO] True or False: Increasing the number of hidden units of a neural network will always guarantee a lower training error.

True

False

4. (2 points) [SOLO] Convolutional neural networks often consist of convolutional layers, max-pooling layers, and fully-connected layers. Select all the layer types below that have parameters (i.e. weights) which can be learned by gradient descent / backpropagation.

Select all that apply:

convolutional layer max-pooling layer

2 fully-connected layer

5. (2 points) [SOLO] Regularization.

Which of the following are true about regularization? Select all that apply:

One of the goals of regularization is combating overfitting.

A model with regularization fits the training data better than a model without regularization

The L-0 norm (number of non-zero parameters) is rarely used in practice in part because it is non-differentiable.

One way to understand regularization is that it attempts to follow Occam’s razor and make the learning algorithm prefer ”simpler” solutions.

6. (6 points) [SOLO] Regularization in Linear Regression.

When performing linear regression, which of the following options will not increase mean-squared training error:

Select all that apply:

Adding higher-order functions of the features as separate features

Increasing the regularization weight

For the same weight on the regularizer, using an L1 regularizer instead of an L2 For the same weight on the regularizer, using an L1 regularizer instead of an L0 2 None of the above

7. (3 points) [GROUP] Recall the convolution operation in CNN from lecture. A kernel, or a convolutional filter, is a small matrix that can be used for blurring, sharpening, embossing, edge detection, etc. Given that convolving the below black-and-white picture (labeled as Original) with kernel X gives output image (a), choose the most probable output image for kernel Y .

Select one:

None of the above

2 Learning Theory

1. [SOLO] Alex is given a classification task to solve.

(a) (3 points) [SOLO] He has no idea where to start, so he decided to try out a decision tree learner with 2 binary features X1 and X2. He recently learned about PAC learning, and would like to know what is the minimum number (N) of data points that would suffice for the PAC criterion with .

Fill in the blank:

Fill in the blank:

(c) (3 points) [SOLO] Sally wants to argue her method has lower bound on the true error. Assuming Sally has obtained enough data points to satisfy PAC criterion with . Which of the following is true?

Select one:

Sally is wrong. Alex’s method will always classify unseen data more accurately since it is simpler as it only needs 2 binary features.

She must first regularize her model by removing 14 features to make any comparison at all.

It is sufficient to show that the VC Dimension of her classifier is higher than Alex’s, therefore having lower bound for the true error.

It is necessary to show that the training error she achieves is lower than the training error Alex achieves.

2. (4 points) [SOLO] In lecture, we saw that we can use our sample complexity bounds to derive bounds on the true error for a particular algorithm. Use the sample complexity bound for the infinite, agnostic case:

to prove that with probability at least (1 − δ):

!

Hint: Start by applying the definition of big-O notation (i.e. if N = O(M) (for some value M), then there exists a c ∈ R such that N ≤ cM).

3. (3 points) [GROUP] Consider an arbitrary decision tree learner applied to data where each example is described by M binary attributes, with no overlapping points (i.e. no points have identical attributes but different labels). Your friend Paul says that no matter the value of M, a decision tree can always shatter 2M points. Is Paul wrong? If so, provide a counterexample. If Paul is right, briefly explain why in 1-2 concise sentences.

4. [GROUP] Consider instance space X which is the set of real numbers.

(a) (3 points) [GROUP] What is the VC dimension of hypothesis class H, where each hypothesis h in H is of the form “if a < x < b or c < x < d then y = 1; otherwise y = 0”? (i.e., H is an infinite hypothesis class where a,b,c, and d are arbitrary real numbers).

Select one:

(b) (3 points) [GROUP] Given the set of points in X below, construct a labeling of some subset of the points to show that any dimension larger than your choice of VC dimension in part (a) by exactly 1 is incorrect (e.g. if the VC dimension of H is 3, only fill in the answers for 4 of the points). Fill in the boxes such that for each point in your example, the corresponding label is either 1 or 0 (for points you are not using in your example, leave the boxes blank).

−3 −2 −1 0 1 2 3

-3:

-2:

-1:

0:

1:

2:

3:

3 MLE/MAP

1. (3 points) [SOLO] True or False: Suppose you place a Beta prior over the Bernoulli distribution, and attempt to learn the parameter of the Bernoulli distribution from data. Further suppose an adversary chooses “bad”, but finite hyperparameters for your Beta prior in order to confuse your learning algorithm. As the number of training examples grows to infinity, the MAP estimate of θ can still converge to the MLE estimate of θ.

Select One:

True

False

2. (3 points) [SOLO] Let Θ be a random variable with the following probability density function (pdf):

(

2θ if 0 ≤ θ ≤ 1

f(θ) =

0 otherwise

Suppose another random variable Y, which is conditioning on Θ, follows an exponential distribution with λ = 3θ. Recall that the exponential distribution with parameter λ has the following pdf:

What is the MAP estimate of Θ given is observed?

Select one:

3. (3 points) [SOLO] In HW3, you have derived the closed form solution for linear regression. Now, we are coming back to linear regression, viewing it as a statistical model, and deriving the MLE and MAP estimate of the parameters in the following questions. As a reminder, in MLE, we have

θˆMLE = argmaxp(D|θ)

θ

For MAP, we have

θˆMAP = argmaxp(θ|D)

θ

Assume we have data , where x . So our data has N instances

and each instance has M attributes/features. Each y(i) is generated given x(i) with additive noise N(0,σ2), that is where w is the parameter vector of linear regression. Given this assumption, what is the distribution of y?

Select one:

Uniform(wT x(i) − σ,wT x(i) + σ)

None of the above

4. (3 points) [SOLO] The next step is to learn the MLE of the parameters of the linear regression model. Which expression below is the correct conditional log likelihood `(w) with the given data?

Select one:

5. (3 points) [SOLO] Then, the MLE of the parameters is just argmaxw `(w) . Among the following expressions, select ALL that can yield the correct MLE.

Select all that apply: argmaxw

argmaxw argmaxw argmaxw

2 argmaxw

6. (3 points) [SOLO] According to the above derivations, is the MLE for the conditional log likelihood equivalent to minimizing mean squared errors (MSE) for the linear regression model when making predictions? Why or why not?

Select one:

Yes, because the derivative of the negative conditional log-likelihood has the same form as the derivative of the MSE loss.

Yes, because the parameters that maximize the conditional log-likelihood also minimize the MSE loss.

No, because one is doing maximization and the other is doing minimization.

No, because the MSE has an additional error term in the expression whereas the quantity to be minimized in MLE does not.

No, because the conditional log-likelihood has additional constant terms that do not appear in the MSE loss.

7. (3 points) [SOLO] Now we are moving on to learn the MAP estimate of the parameters of the linear regression model. Which expression below is the correct optimization problem the MAP estimate is trying to solving? (recall that D refers to the data, and w to the regression parameters (weights)).

Select all that apply:

2 wMAP = argmaxw p(D,w)

2 wMAP = argmaxw p(Dp|w(D)p)(w)

wMAP = argmaxw wMAP = argmaxw p(D|w)p(w)

2 wMAP = argmaxw p(w|D)

8. (3 points) [SOLO] Suppose we are using a Gaussian prior distribution with mean 0 and variance for each element wm of the parameter vector w(1 ≤ m ≤ M), i.e. wm ∼ N(0, λ1). Assume that w1,··· ,wM are mutually independent of each other. Which expression below is the correct log jointprobability of the data and parameters logp(D,w))? Please show your work below.

Select one:

9. (3 points) [SOLO] Maximizing the log posterior probability `MAP(w) gives you the MAP estimate of the parameters. Which one is correct to estimate the parameters using maxw `MAP(w) based on your derived log posterior probability in the previous question? With the result, please specify how the MAP estimate with Gaussian prior related to the linear regression model.

Select one:

10. (2 points) [GROUP] A MAP estimator with a Gaussian prior N(0,σ2) you trained gives significantly higher test error than train error. What could be a possible approach to fixing this?

Select one:

Increase variance σ2

Decrease variance σ2

Try MLE estimator instead

None of the above

11. (4 points) [GROUP] MAP estimation with what prior is equivalent to L1 regularization? Please show your work below.

Note:

The pdf of a Uniform distribution over and 0 otherwise.

The pdf of an exponential distribution with rate parameter a is f(x) = aexp(−ax) for x > 0.

The pdf of a Laplace distribution with location parameter a and scale parameter for all x ∈ R.

Select one:

Uniform distribution over [−wT x(i),wT x(i)]

Exponential distribution with rate parameter

Exponential distribution with rate parameter a = wT x(i)

Laplace prior with location parameter a = 0

Laplace prior with location parameter a = wT x(i)

Uniform distribution over [-1, 1]

12. (2 points) [GROUP] When we estimate linear regression, we naturally choose the objective function of mean square error: . We could instead minimize the weighted mean squared error: where a1,…an are fixed weights of the training examples that are given alongside the training data. This includes ordinary least squares as the special case where all the weights ai = 1. Please suggest one reason that we would prefer minimizing weighted mean square error instead of just mean square error.

13. (4 points) [GROUP] Suppose we define a new regression model. Each y(i) is generated given x(i) with additive noise , that is . Unlike the standard regression model we’ve worked with until now, there is now an example specific variance σi2.

Show that maximizing the log-likelihood of this new model is equivalent to minimizing a weighted mean squared error in the box below. Then select the correct value of ai in the weighted mean squared error.

Select one:

Justification

4 Naive Bayes

1. (3 points) [SOLO] I give you the following fact: for events A and B, P(A | B) = 2/3 and P(A | ¬B) = 1/3, where ¬B denotes the complement of B. Do you have enough information to calculate P(B | A)? If not, choose “not enough information”, if so, compute the value of P(B | A).

Select one:

Not enough information

2. (3 points) [SOLO] Instead if I give you for events A and B, P(A | B) = 2/3, P(A | ¬B) = 1/3 and P(B) = 1/3 and P(A) = 4/9, where ¬B denotes the complement of B. Are the information consistent to calculate P(B | A)? If not, choose “conflicting information”, if so, compute the value of P(B | A).

Select one:

Conflicting information

3. (4 points) [SOLO] Suppose that 0.3% people have cancer. Someone decided to take a medical test for cancer. The outcome of the test can either be positive (cancer) or negative (no cancer). The test is not perfect – among people who have cancer, the test comes back positive 97% of the time. Among people who don’t have cancer, the test comes back positive 4% of the time. For this question, you should assume that the test results are independent of each other, given the true state (cancer or no cancer). What is the probability of a test subject having cancer, given that the subject’s test result is positive? If your answer is in decimals, answer with precision 4, e.g. (6.051, 0.1230, 1.234e+7) Fill in the blank:

4. (3 points) [SOLO] Which of the following machine learning algorithms are probabilistic generative models?

Select one:

Decision Tree

K-nearest neighbors

Perceptron

Naive Bayes

Logistic Regression

Feed-forward neural network

Select all that apply:

(a)

(b)

(c)

(d)

2 None of the above

6. (4 points) [GROUP] Logistic Regression and Naive Bayes.

When Y is Boolean and X = hX1…Xni is a vector of continuous variables, then the assumptions of the Gaussian Naive Bayes classifier imply that P(Y | X) is given by the logistic function with appropriate parameters wi for all i and b. In particular:

and

Consider instead the case where Y is Boolean and X = hX1…Xni is a vector of Boolean variables.

Since the Xi are Boolean variables, you need only one parameter to define P(Xi | Y = yk). Define φi1 ≡ P(Xi = 1 | Y = 1), in which case P(Xi = 0 | Y = 1) = (1−φi1). Similarly, use φi0 to denote P(Xi = 1|Y = 0).

1. Show that

can be written in the form of a Gaussian Naive Bayes classifier by finding expressions for P(Y = 1|X) and P(Y = 0|X) in terms of b and wi. Explicitly define b and wi.

7. (3 points) [GROUP] In a Naive Bayes problem, suppose we are trying to compute P(Y | X1,X2,X3,X4). Furthermore, suppose X2 and X3 are identical (i.e., X3 is just a copy of X2). Which of the following are true in this case?

Select all that apply:

Naive Bayes will learn identical parameter values for P(X2|Y ) and P(X3|Y ).

Naive Bayes will output probabilities P(Y |X1,X2,X3,X4) that are closer to 0 and 1 than they would be if we removed the feature corresponding to X3.

There is not enough information to determine the change in the output P(Y |X1,X2,X3,X4). 2 None of the above

8. (4 points) [GROUP] Gaussian Naive Bayes in general can learn non-linear decision boundaries. Consider the simple case where we have just one real-valued feature X1 ∈ R from which we wish to infer the value of label Y ∈ {0,1}.The corresponding generative story would be:

Y ∼ Bernoulli(φ)

X1 ∼ Gaussian(µy,σy2)

where the parameters are the Bernoulli parameter φ and the class-conditional Gaussian parameters and corresponding to Y = 0 and Y = 1 , respectively.

A linear decision boundary in one dimension, of course, can be described by a rule of the form “if X1 > c then Y = 1, else Y = 0”, where c is a real-valued threshold (see diagram provided). Is it possible in this simple one-dimensional case to construct a Gaussian Naive Bayes classifier with a decision boundary that cannot be expressed by a rule in the above form)?

Select one:

Yes, this can occur if the Gaussians are of equal means and equal variances.

Yes, this can occur if the Gaussians are of equal means and unequal variances.

Yes, this can occur if the Gaussians are of unequal means and equal variances.

Yes, this can occur if the Gaussians are of unequal means and unequal variances.

Draw the corresponding Gaussians and the decision boundaries:

5 Collaboration Questions

1. Did you receive any help whatsoever from anyone in solving this assignment? If so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? If so, include full details.

3. Did you find or come across code that implements any part of this assignment ? If so, include full details.

## Reviews

There are no reviews yet.