## Description

Learning Theory and Generative Models

TAs: Brynn Edmunds, Zhuoran Liu, Emilio Arroyo-Fang, George Xu

START HERE: Instructions

Homework 6 covers topics on learning theory, MLE/MAP, and Naive Bayes. The homework includes multiple choice, True/False, and short answer questions.

• Late Submission Policy: See the late submission policy here: http://www.cs.cmu. edu/~mgormley/courses/10601bd-f18/about.html#6-general-policies

• Submitting your work:

correctly by our AI assisted grader. In addition, please tag the problems to the corresponding pages when submitting your work.

For multiple choice or select all that apply questions, shade in the box or circle in the template document corresponding to the correct answer(s) for each of the questions. For LATEXusers, use and for shaded boxes and circles, and don’t change anything else.

Instructions for Specific Problem Types

For “Select One” questions, please fill in the appropriate bubble completely:

Select One: Who taught this course?

Matt Gormley

# Marie Curie # Noam Chomsky

Select One: Who taught this course?

Matt Gormley # Marie Curie

@ @ Noam Chomsky

For “Select all that apply” questions, please fill in all appropriate squares completely:

Select all that apply: Which are scientists?

Stephen Hawking

Albert Einstein

Isaac Newton

I don’t know

Select all that apply: Which are scientists?

Stephen Hawking

Albert Einstein Isaac Newton

@@ I don’t know

Fill in the blank: What is the course number?

10-601 10-S7601S

1 Learning Theory [22 pts]

1. [3 pt] Let . According to the PAC theorems discussed in class, which of the following is correct? Select one.

Select one:

With probability at least 1 − δ, every hypothesis with training error at most has true error 0.

With probability at least 1 − , a random hypothesis with training error 0 has true error at most δ.

With probability at least 1−δ, every hypothesis with training error 0 has true error at most .

With probability at least 1 − , a random hypothesis with true error 0 has training error at most δ.

2. [3 pt] Consider a decision tree learner applied to data where each example is described by 10 boolean variables X1,X2,··· ,X10. What is the VC dimension of the hypothesis space used by this decision tree learner?

Fill in the blank:

3. [4 pt] Consider instance space X which is the set of real numbers. What is the VC dimension of hypothesis class H, where each hypothesis h in H is of the form “if a < x < b or c < x < d then y = 1; otherwise y = 0”? (i.e., H is an infinite hypothesis class where a, b, c, and d are arbitrary real numbers.

Select one:

2

5

6

4. [3 pt] Alex is given a classification task to solve. He has no idea where to start, so he decided to try out a decision tree learner with 2 binary features X1 and X2. He recently learned about PAC learning, and would like to know what is the minimum number (N) of data points that would suffice for the PAC criterion with 1 and δ = 0.01.

Fill in the blank:

Fill in the blank:

6. [3 pt] Sally wants to argue her method has lower bound on the true error. Assuming Sally has obtained enough data points to satisfy PAC criterion with 1 and δ = 0.01. Which of the following is true?

Select one:

Sally is wrong. Alex’s method will always classify unseen data more accurately since it is simpler as it only needs 2 binary features.

She must first regularize her model by removing 14 features to make any comparison at all.

It is sufficient to show that the VC Dimension of her classifier is higher than Alex’s, therefore having lower bound for the true error.

It is necessary to show that the training error she achieves is lower than the training error Alex achieves.

2 MLE/MAP [34 pts]

1. [3 pt] True or False: Suppose you place a Beta prior over the Bernoulli distribution, and attempt to learn the parameter of the Bernoulli distribution from data. Further suppose an adversary chooses “bad”, but finite hyperparameters for your Beta prior in order to confuse your learning algorithm. As the number of training examples grows to infinity, the MAP estimate of θ can still converge to the MLE estimate of θ.

Select One:

True

False

2. [3 pt] Let θ be a random variable with the following probability density function (pdf):

(

2θ if 0 ≤ θ ≤ 1

f(θ) =

0 otherwise

Suppose another random variable Y, which is conditioning on θ, follows an exponential distribution with λ = 3θ. Recall that the exponential distribution with parameter λ has the following pdf:

What is the MAP estimate of θ given is observed?

Select one:

0

1/3

1

2

3. In HW3, you have derived the closed form solution for linear regression. Now, we are coming back to linear regression, viewing it as a statistical model, and deriving the MLE and MAP estimate of the parameters in the following questions.

Assume we have data , where x ) . So our data has N instances and each instance has M attributes/features. Each y(i) is generated given x(i) with additive noise ), that is where w is the parameter vector of linear regression. Given this assumption, what is the distribution of y?

Select one:

y(i) ∼ N(wTx(i),σ2) y(i) ∼ N(0,σ2) y(i) ∼ Uniform(wTx(i) − σ,wTx(i) + σ)

None of the above

4. [4 pt] The next step is to learn the MLE of the parameters of the linear regression model. Which expression below is the correct conditional log likelihood `(w) with the given data?

Select one:

5. [4 pt] Then, the MLE of the parameters is just argmaxw `(w) . Among the following expressions, select ALL that can yield the correct MLE.

Select all that apply:

argmaxw

argmaxw

argmaxw

argmaxw

argmaxw

6. According to the above derivations, is the MLE for the conditional log likelihood equivalent to minimizing mean squared errors (MSE) for the linear regression model when making predictions? Why or why not?

Select one:

Yes, because the derivative of the negative conditional log-likelihood has the same form as the derivative of the MSE loss.

Yes, because the parameters that maximize the conditional log-likelihood also minimize the MSE loss.

No, because one is doing maximization and the other is doing minimization.

No, because the MSE has an additional error term in the expression whereas the quantity to be minimized in MLE does not.

No, because the conditional log-likelihood has additional constant terms that do not appear in the MSE loss.

7. [3 pt] Now we are moving on to learn the MAP estimate of the parameters of the linear regression model. The MAP estimate is obtained through solving the following optimization problem.

wMAP = argmaxw p(w|D) = argmaxw p(D,w)

Suppose are using a Gaussian prior distribution with mean 0 and variance for each element wm of the parameter vector w(1 ≤ m ≤ M), i.e. wm ∼ N(0, λ1). Assume that w1,··· ,wM are mutually independent of each other. Which expression below is the correct log joint-probability of the data and parameters logp(D,w))?

(For simplicity, just use p(D|w) to denote the data likelihood.)

Select one:

8. A MAP estimator with a Gaussian prior N(0,σ2) you trained gives significantly higher test error than train error. What could be a possible approach to fixing this?

Select one:

Increase variance σ2

Decrease variance σ2

Try MLE estimator instead

None of the above

9. [4 pt] Maximizing the log posterior probability `MAP(w) gives you the MAP estimate of the parameters. The MAP estimate with Gaussian prior is actually equivalent to a L2 regularization on the parameters of linear regression model in minimizing an objective function J(w) that consists of a term related to log conditional likelihood `(w) and a L2 regularization term. The following options specify the two terms in J(w) explicitly. Which one is correct based on your derived log posterior probability in the previous question?

Select one:

−`(w) + λkwk2

10. [4 pt] MAP estimation with what prior is equivalent to L1 regularization?

Note:

The pdf of a Uniform distribution over [a,b] is ] and 0 otherwise.

The pdf of an exponential distribution with rate parameter a is f(x) = aexp(−ax) for x > 0.

The pdf of a Laplace distribution with location parameter a and scale parameter b is for all x ∈ R.

Select one:

Uniform distribution over [

Exponential distribution with rate parameter

Exponential distribution with rate parameter a = wTx(i)

Laplace prior with location parameter a = 0

Laplace prior with location parameter a = wTx(i)

Uniform distribution over [-1, 1]

3 Naive Bayes [44 pts]

1. [3 pt] I give you the following fact: for events A and B, P(A | B) = 2/3 and P(A | ¬B) = 1/3, where ¬B denotes the complement of B. Do you have enough information to calculate P(B | A)? If not, choose “not enough information”, if so, compute the value of P(B | A).

Select one:

1/2

2/3

1/3

Not enough information

2. [3 pt] Instead if I give you for events A and B, P(A | B) = 2/3, P(A | ¬B) = 1/3 and P(B) = 1/3 and P(A) = 4/9, where ¬B denotes the complement of B. Do you have information to calculate P(B | A)? If not, choose “not enough information”, if so, compute the value of P(B | A).

Select one:

1/2

2/3

1/3

Not enough information

3. Suppose you are given the following set of data with three Boolean input variables A, B, and C, and a single Boolean output variable K.

A B C K

1 0 1 1

1 1 1 1

0 1 1 0

1 1 0 0

1 0 1 0

0 0 0 1

0 0 0 1

0 0 1 0

Suppose you train a Naive Bayes classifier without any priors.

According to the Naive Bayes classifier, what is P(K = 1 | A = 1,B = 1,C = 0)?

If your answer is in decimals, answer with precision 4, e.g. (6.051, 0.1230, 1.234e+7). Fill in the blank:

4. [4 pt] Using the same table from the previous question, according to the Naive Bayes classifier, what is (K = 0 | A = 1,B = 1)?

If your answer is in decimals, answer with precision 4, e.g. (6.051, 0.1230, 1.234e+7).

Fill in the blank:

5. Gaussian Naive Bayes in general can learn non-linear decision boundaries. Consider the simple case where we have just one real-valued feature X1 ∈ R from which we wish to infer the value of label Y ∈ {0,1}.The corresponding generative story would be:

Y ∼ Bernoulli(φ) X1 ∼ Gaussian( where the parameters are the Bernoulli parameter φ and the class-conditional Gaussian parameters and corresponding to Y = 0 and Y = 1 , respectively.

A linear decision boundary in one dimension, of course, can be described by a rule of the form “if X1 > c then Y = 1, else Y = 0”, where c is a real-valued threshold (see diagram provided). Is it possible in this simple one-dimensional case to construct a Gaussian Naive Bayes classifier with a decision boundary that cannot be expressed by a rule in the above form)?

Select one:

Yes, this can occur if the Gaussians are of equal means and equal variances.

Yes, this can occur if the Gaussians are of equal means and unequal variances. Yes, this can occur if the Gaussians are of unequal means and equal variances.

No, this cannot occur regardless of the relationship of the means or variances.

6. Suppose that 0.3% people have cancer. Someone decided to take a medical test for cancer. The outcome of the test can either be positive (cancer) or negative (no cancer). The test is not perfect – among people who have cancer, the test comes back positive 97% of the time. Among people who don’t have cancer, the test comes back positive 4% of the time. For this question, you should assume that the test results are independent of each other, given the true state (cancer or no cancer). What is the probability of a test subject having cancer, given that the subject’s test result is positive?

If your answer is in decimals, answer with precision 4, e.g. (6.051, 0.1230, 1.234e+7)

Fill in the blank: 0.0680

7. [4 pt] In a Naive Bayes problem, suppose we are trying to compute P(Y | X1,X2,X3,X4) . Furthermore, suppose X2 and X3 are identical (i.e., X3 is just a copy of X2 ). Which of the following are true in this case?

Select all that apply:

Naive Bayes will learn identical parameter values for P(X2|Y ) and P(X3|Y ).

Naive Bayes will output probabilities P(Y |X1,X2,X3,X4) that are closer to 0 and 1 than they would be if we removed the feature corresponding to X3.

This will not raise a problem in the output P(Y |X1,X2,X3,X4) because the conditional independence assumption will correctly treat this situation.

None of the above

8. [3 pt] Which of the following machine learning algorithms are probabilistic generative models?

Select one:

Decision Tree

K-nearest neighbors

Perceptron Naive Bayes

Logistic Regression

Feed-forward neural network

9. [15 pt] Logistic Regression and Naive Bayes.

When Y is Boolean and X = hX1…Xni is a vector of continuous variables, then the assumptions of the Gaussian Naive Bayes classifier imply that P(Y | X) is given by the logistic function with appropriate parameters W. In particular:

and

Consider instead the case where Y is Boolean and X = hX1…Xni is a vector of Boolean variables. Prove for this case also that P(Y | X) follows this same form (and hence that Logistic Regression is also the discriminative counterpart to a Naive Bayes generative classifier over Boolean features).

Hints

(a) Simple notation will help. Since the Xi are Boolean variables, you need only one parameter to define P(Xi | Y = yk). Define φi1 ≡ P(Xi = 1 | Y = 1), in which case P(Xi = 0 | Y = 1) = (1−φi1). Similarly, use φi0 to denote P(Xi = 1|Y = 0). (b) Notice with the above notation you can represent P(Xi | Y = 1) as follows

Note when Xi = 1 the second term is equal to 1 because its exponent is zero. Similarly, when Xi = 0 the first term is equal to 1 because its exponent is zero.

Write your solution on the following page.

Collaboration Questions Please answer the following:

1. Did you receive any help whatsoever from anyone in solving this assignment? Is so, include full details.

2. Did you give any help whatsoever to anyone in solving this assignment? Is so, include full details.

3. Did you find or come across code that implements any part of this assignment ? If so, include full details.

## Reviews

There are no reviews yet.