Description
• This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam.
• This exam booklet contains four problems. You need to solve all problems to get 100%.
• Please check that the exam booklet contains 15 pages.
• You have 75 minutes to earn a total of 100 points.
• Answer each question in the space provided. If you need more room, write on the reverse side of the paper and indicate that you have done so.
• Besides having the correct answer, being concise and clear is very important. For full credit, you must show your work and explain your answers.
Good Luck!
Name (NetID): (1 Point)
Logistic Regression /25
Naive Bayes /25
Decision Trees /25
Short Answer /24
Total /100
1 Logistic Regression (25 pts)
4 Free Points!
(a) (1 pts) Briefly describe the difference between the generative approach and the discriminative approach for probabilistic classifiers.
The discriminative approach models P(Y|X) directly, whereas the generative approach models P(Y,X) and then conditions on X to derive P(Y|X). (Optional: Naive Bayes is an example of a generative model, whereas logistic regression is an example of a discriminative model.)
Rubric:
Defining what the discriminative and the generative approaches model (1pt)
(b) (3 pts) Choose from the table below and fill in the blank.
Logistic Regression is an example of a discriminative model that models P(y|x,w) according to a Bernoulli distribution. It is best suited for a binary classification problem.
discriminative | estimation | Bernoulli
generative | Poisson | creative
binary classification | regression | normal
Rubric:
Each correct answer gets (1pt)
(c) (2 pts) Why is the sigmoid function used in logistic regression? Give two reasons. Hint: Consider representation and optimization.
The sigmoid function maps R → (0,1), which is a good representation of a probability. It is also smooth and differentiable, which is preferred for optimization.
Rubric:
Each valid reason gets (1pt)
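As a quick numerical illustration of both points (a NumPy sketch; not part of the expected answer):

import numpy as np

def sigm(z):
    # representation: maps any real z into (0, 1), so the output can be read as a probability
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
print(sigm(z))                  # approx. [0.0067, 0.5, 0.9933], all valid probabilities
# optimization: the derivative has the simple smooth form sigm(z) * (1 - sigm(z))
print(sigm(z) * (1 - sigm(z)))  # largest at z = 0, vanishes smoothly in the tails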
(d) (15 pts) For the following questions, assume we have a dataset of size n with y_i ∈ {0,1} and x_i ∈ R^d. Also assume we are modeling P(y_i = 1 | x_i, w) = sigm(w^⊤ x_i) for 1 ≤ i ≤ n.
(i) (2 pts) Write down the logistic regression decision rule for maximizing accuracy.
ŷ(x_i) = 1 if sigm(w^⊤ x_i) > θ, and 0 otherwise,
where θ is the decision boundary. Any 0 < θ < 1 is correct.
Rubric:
Correct decision labels (1pt)
Correct conditions (1pt)
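For concreteness, a small NumPy sketch of this rule; θ = 0.5 is used here as the usual choice under 0/1 loss, though any threshold in (0, 1) was accepted above:

import numpy as np

def predict(w, X, theta=0.5):
    # X: (n, d) matrix of inputs, w: (d,) weight vector
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigm(w^T x_i) for every row of X
    return (probs > theta).astype(int)       # label 1 if above the threshold, 0 otherwise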
(ii) (6 pts) Write down the conditional log likelihood of Y conditioned on X and w, LL(w), for logistic regression in terms of sigm(w^⊤ x_i), y_i, and n.
The likelihood is
L(w) = ∏_{j=1}^{n} sigm(w^⊤ x_j)^{y_j} (1 − sigm(w^⊤ x_j))^{1−y_j},
so the log likelihood is
LL(w) = Σ_{j=1}^{n} [ y_j log(sigm(w^⊤ x_j)) + (1 − y_j) log(1 − sigm(w^⊤ x_j)) ].
Rubric:
Correct L(w) (2pts)
Correct LL(w) (4pts)
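A direct NumPy translation of LL(w), handy for checking the expression numerically (a sketch; not required for credit):

import numpy as np

def log_likelihood(w, X, y):
    # X: (n, d) inputs, y: (n,) labels in {0, 1}
    p = 1.0 / (1.0 + np.exp(-(X @ w)))        # sigm(w^T x_j)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))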
(iii) (7 pts) In addition to the assumption of P(y_i = 1 | x_i, w) = sigm(w^⊤ x_i), we now want to regularize our w by imposing a Gaussian prior,
p(w) = N(0, λ^{−1} I) = (λ / 2π)^{d/2} exp(−(λ/2) w^⊤ w),
where I is the identity matrix and λ is a positive number.
• (5 pts) Suppose we are interested in the MAP estimate of w; derive the new log likelihood, LL′(w).
L′(w) = p(Y | X, w) p(w)
       = [ ∏_{j=1}^{n} sigm(w^⊤ x_j)^{y_j} (1 − sigm(w^⊤ x_j))^{1−y_j} ] · (λ / 2π)^{d/2} exp(−(λ/2) w^⊤ w)

LL′(w) = Σ_{j=1}^{n} [ y_j log(sigm(w^⊤ x_j)) + (1 − y_j) log(1 − sigm(w^⊤ x_j)) ] − (λ/2) w^⊤ w + const
Rubric:
Error carried forward from previous part (ECF) (−1pt)
MAP Estimate (3pts)
Correct LL′(w) (2pts)
• (2 pts) Derive the gradient descent/ascent update rule for our new regularized logistic regression in terms of ∇LL(w^k), η, λ, and w^k. Let
LL′(w) = LL(w) − (λ/2) w^⊤ w + const
be the new objective function. Then,
∇LL′(w) = ∇LL(w) − λw
And the gradient ascent update rule is then
w^{k+1} = w^k + η(∇LL(w^k) − λw^k)
Rubric:
Correct derivation of ∇LL′(w) (1pt)
Correct derivation of the gradient descent/ascent update rule (1pt)
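A sketch of one regularized ascent step in NumPy, using the standard gradient ∇LL(w) = X^⊤(y − sigm(Xw)); that expression is not derived in this part, so treat it as background:

import numpy as np

def grad_LL(w, X, y):
    # gradient of the unregularized log likelihood: X^T (y - sigm(Xw))
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (y - p)

def ascent_step(w, X, y, eta, lam):
    # w^{k+1} = w^k + eta * (grad LL(w^k) - lam * w^k)
    return w + eta * (grad_LL(w, X, y) - lam * w)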
2 Naive Bayes (25 pts)
(a) (2pt) Is the following statement correct? Please write True or False, and briefly explain your answer. Incorrect explanation will result in 0 points.
A Naive Bayes classifier is trained on a dataset which satisfies the naive Bayes assumption, i.e. conditional independence. As the training dataset's size grows to infinity, the trained model will eventually always achieve zero TRAINING error.
False. Even when the conditional-independence assumption holds, the class-conditional distributions can overlap, so the same feature vector can occur with different labels; the trained model converges to the Bayes-optimal classifier, whose error is in general nonzero, rather than to zero training error.
(b) (2pt) Is the following statement correct? Please write True or False, and briefly explain your answer. Incorrect explanation will result in 0 points.
A Naive Bayes classifier is trained on a dataset which satisfies the naive Bayes assumption, i.e. conditional independence. As the training dataset’s size grows to infinity, the trained model will eventually always achieve zero TESTING error, given that the naive Bayes assumption is true on testing data as well.
False: same reason as above.
Other valid reasons will get full points as well, as long as they make sense.
(c) (5pt) You are asked to pick a ball from one of three colored jars, one of which is green. The probability of selecting a ball from each jar is uniform. If the overall probability of picking a red ball is ___ and the probability of picking a red ball from the green jar is ___, what is the probability that you selected the green jar given that you picked a red ball? Either show the exact computation or briefly explain how you derived the result.
For partial credits:
Stating ___ will get 1 point.
Stating ___ will get 1 point.
Correctly stating Bayes' formula and applying it will get 2 points.
The correct answer will get another 1 point.
(d) In this problem, we will train a probabilistic model to classify whether a CS graduate S is from UIUC or UMich, based on their grades X on 4 common courses.
Each student s_i is either from I (UIUC) or M (UMich), with grades x = {x_1, x_2, x_3, x_4}, where x_i denotes the grade in course C_i. All grades follow Bernoulli distributions, supposing there are only 2 grades, A or B. Given s_i,
P(x_i = A | s_i = I) = α_iI        P(x_i = B | s_i = I) = β_iI
P(x_i = A | s_i = M) = α_iM        P(x_i = B | s_i = M) = β_iM
We want to build a naive Bayes classifier to compute P(s_i = I | x).
(i) (3 pts): Using the naive Bayes assumption, write down an expression for P(s_i = I | x) using the following terms: P(s_i = I), P(s_i = M), P(x_i | s_i = I), P(x_i | s_i = M).
P(s_i = I | x) = [ P(s_i = I) ∏_{c=1}^{4} P(x_c | s_i = I) ] / [ P(s_i = I) ∏_{c=1}^{4} P(x_c | s_i = I) + P(s_i = M) ∏_{c=1}^{4} P(x_c | s_i = M) ]
For partial credits:
Correct Bayes formula on x including the expanded denominator will get 1 point.
Correct independence-assumption expansion of the joint probability (the numerator) will get 1 point.
Overall correctness gets another 1 point.
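A small NumPy sketch of the posterior described by this rubric; the function and argument names are illustrative, not from the exam:

import numpy as np

def posterior_I(prior_I, lik_I, lik_M):
    # lik_I[c] = P(x_c | s = I), lik_M[c] = P(x_c | s = M) for the observed grades x
    num = prior_I * np.prod(lik_I)                 # numerator: P(s = I) * prod_c P(x_c | s = I)
    den = num + (1 - prior_I) * np.prod(lik_M)     # expanded denominator over both schools
    return num / den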
(ii) (5 pts): How many parameters MUST be estimated to train such a naive Bayes classifier? Please list all of them using probability expressions or/and α, β. Note: not all parameters are required to be estimated for the predictive model. (2 pts for correct number, 3 pts for correct listing. )
There are 9 parameters in total that need to be estimated: P(s = I), α_1I, α_2I, α_3I, α_4I, α_1M, α_2M, α_3M, α_4M. For partial credits:
Not listing α_iI and β_iI together, or stating that they are redundant (β_iI = 1 − α_iI; similarly for α_iM), will get 1 point.
Listing P(s = I) or P(s = M) will get 1 point.
Overall correctness gets another 1 point.
(iii) (5 pts) Suppose you observe 4 students, 3 from UIUC and 1 from UMich. Among the 3 UIUC students, 2 of them have A grades in all 4 courses and 1 of them has B grades in all 4 courses. The UMich student has B grades in all 4 courses. Suppose this is all the data you have; using the MLE, what is the probability that the all-B-grade student comes from UMich? Either show the exact computation or briefly explain how you derived the result.
From the data, for a UIUC student each course has a 1/3 chance of being a B, and for the UMich student each course has probability 1 of being a B. With MLE priors P(I) = 3/4 and P(M) = 1/4,
P(M | x = BBBB) = (1/4 · 1) / (1/4 · 1 + 3/4 · (1/3)^4) = 27/28.
For partial credits:
Stating P(B | M) = 1 will get 1 point.
Stating P(B | I) = 1/3 for each course will get 1 point.
Correct naive Bayes formula on x including the expanded denominator will get 1 point.
Overall correctness gets another 2 points.
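The arithmetic can be double-checked in a few lines of Python, with all values read directly off the four observed students:

p_M, p_I = 1/4, 3/4                 # MLE priors: 1 UMich student and 3 UIUC students out of 4
p_BBBB_given_I = (1/3) ** 4         # per course, 1 of the 3 UIUC students got a B
p_BBBB_given_M = 1.0                # the single UMich student got B in every course
print(p_M * p_BBBB_given_M / (p_M * p_BBBB_given_M + p_I * p_BBBB_given_I))   # 27/28, about 0.964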
(iv) (3 pts) Someone argues that the Naive Bayes assumption is not appropriate for this question. Do you agree with this? Explain your answer.
The following answer, "It's not appropriate because the naive Bayes assumption assumes conditional independence, and this is not true," will receive 0 credit because it only restates the question using the "conditional independence" information already provided in part (a). We expect you to explain why you think this assumption is or is not true.
3 Decision Trees (25 pts)
(a) (8 pts) You are given a dataset where each sample has two continuous-valued features, x1 and x2, and a label, either a + or a ◦. The dataset has been plotted for you below. Each tick on the axis is one unit.
(i) (6 pts) Using tests in the form of either {x1 ≤ a}, {x1 ≥ a}, {x2 ≤ a}, or {x2 ≥ a}, where a ∈ Z, draw the decision tree that can correctly classify all 14 points in the figure using as few tests as possible. Be sure to clearly label which branch of the test corresponds to Yes and No. (Trees with fewer tests will earn more credit.)
One acceptable answer is given below. There are others, but the best trees use three tests.
(b) (8 pts) Consider three models trained over the same dataset. Model A is a single decision tree (with no depth limit or pruning), model B is trained using the Adaboost algorithm with decision stumps, and model C is a random forest. Compare the bias and variance of these models by filling in the blanks below with either <, ≈, or >, and provide a brief justification of your choice. If you would need more information to answer the problem, write “?” and state your reasoning. Each comparison is worth 1 point and each justification is worth 1 point.
(i) Bias(B) ≈ Bias(A)
Justification: A will have a low bias since it is not pruned. The individual stumps in B will have a high bias but the overall model will have a low bias. Therefore, Bias(A) ≈ Bias(B).
(ii) Var(B) < Var(A)
Justification: The variance of a decision tree with no depth limit or pruning is very high. However, each decision stump has low variance and weighting their votes together will only decrease the variance.
(iii) Bias(C) ≈ Bias(A)
Justification: A will have a low bias since it is not pruned. Each individual tree has a low bias and bagging does not increase the bias, so we can say that C will have a low bias as well.
(iv) Var(C) < Var(A)
Justification: Each tree in the random forest has high variance on its own, but averaging many bootstrapped (and feature-subsampled) trees reduces the variance of the ensemble below that of the single unpruned tree A.
(c) (9 pts) Boosting with Decision Trees
(i) (3 pts) When we use boosting with decision trees, we train a series of shallow decision trees (decision stumps) that, when combined, form a strong learner. In order for us to say this, what condition must the error of each of these shallow trees satisfy?
We need the weak learning assumption, that is, that the error rate of each individual weak learner is slightly better than chance, i.e., ε_i < 1/2. Full points were awarded if it was indicated that the error of each weak learner is less than (but not equal to) 0.5.
(ii) (2 pts) When training Adaboost, we want to decrease or leave (increase / decrease / leave) the weights of the samples that our weak learner predicted correctly, and we want to increase (increase / decrease / leave) the weights of the samples that our weak learner predicted incorrectly.
(iii) (4 pts) On the next page, we have provided you with a skeleton for the algorithm for binary Adaboost. Refer to the table below. For each blank line, write the letter corresponding to the appropriate expression.
(e) exp(α_i · 1{x correctly predicted by tree_i})
(f) w_x · exp(α_i · error_i)
(g) w_x · exp(α_i · 1{x correctly predicted by tree_i})
(h) w_x · exp(α_i · 1{x incorrectly predicted by tree_i})
Algorithm 1 Binary Adaboost Training
1: n ← number of training examples
2: k ← number of stumps to train
3: max_depth ← the depth of each stump
4: for x ∈ data do
5:     w_x ← 1/n
6: end for
7: for i ∈ {1,··· ,k} do
8:     tree_i ← train_stump(data, max_depth)
9:     error_i ← error(tree_i, data)
10:    α_i ← (1)   [Answer: (b)]
11:    for x ∈ data do
12:        w_x ← (2)   [Answer: (h)]
13:    end for
14:    for x ∈ data do
15:        w_x ← w_x / Σ_{x′ ∈ data} w_{x′}
16:    end for
17: end for
18: Return (tree_i, α_i) for i ∈ {1,··· ,k}
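A compact Python sketch of the same training loop, assuming scikit-learn decision stumps; blank (1) is filled with the standard Adaboost weight 0.5·ln((1 − error)/error), which is assumed here since option (b) from the original table is not reproduced above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, k, max_depth=1):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # line 5: uniform initial weights
    trees, alphas = [], []
    for _ in range(k):
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y, sample_weight=w)
        incorrect = tree.predict(X) != y
        error = np.sum(w[incorrect])              # weighted training error of this stump
        alpha = 0.5 * np.log((1 - error) / error) # blank (1): standard choice, assumed to match (b)
        w = w * np.exp(alpha * incorrect)         # blank (2): option (h), up-weight the mistakes
        w = w / w.sum()                           # lines 14-16: renormalize the weights
        trees.append(tree)
        alphas.append(alpha)
    return trees, alphas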
4 Short Answer Questions (24 pts)
Some useful guidelines for this section.
• If the question is asking to select multiple choices, you will get points only when all the choices are correct.
• If the question is asking whether a statement is true or false, then you will get points only when you explain your choice. Writing only true or false will get you 0 pt.
(a) (2 pts) (Select One)
What are the priors on the weights which correspond to L1 and L2 regularization?
(i) Bernoulli for L1 and Uniform for L2.
(ii) Uniform for L1 and Bernoulli for L2.
(iii) Gaussian for L1 and Laplace for L2.
(iv) Laplace for L1 and Gaussian for L2.
(iv)
(b) (2 pts) (Select One)
Suppose you figured out that the labels of nearby points in your training data are very different. Which of the following options would you consider in k-Nearest Neighbor model to improve the performance?
(i) Increase the value of k.
(ii) Decrease the value of k.
(iii) The performance of the model with noisy data does not depend on k.
(iv) None of the above.
(i)
(c) (2 pts) (Select One)
Suppose you want to learn a logistic regression model using gradient descent. You run gradient descent for 100 iterations with the learning rate, α = 0.1, and compute the negative of the log-likelihood after each iteration. You observe that the negative of the log-likelihood decreases quickly and then levels off after some iterations. What can you conclude from this observation?
(i) A larger value for the learning rate α is required.
(ii) A smaller value for the learning rate α is required.
(iii) α = 0.1 is an effective choice of the learning rate.
(iv) None of the above.
(iii)
(d) (3 pts) (Select Multiple)
Suppose the data is being generated from some probability distribution and you created two Decision Tree models, DT1 and DT2. DT1 is trained using a finite number of training examples, whereas DT2 is trained using an infinite number of training examples. In comparison to DT1, DT2 will have:
(i) lower variance.
(ii) same variance.
(iii) lower bias.
(iv) same bias.
(i) and (iv)
(e) (3 pts) (Select Multiple)
Which of the following is / are ensemble technique(s)?
(i) Random Forest.
(ii) Support Vector Machine.
(iii) Adaboost.
(iv) Neural Network.
(i) and (iii)
(f) (3 pts) Write down two ways to avoid overfitting.
Any two of the following:
(a) Go for simpler models.
(b) Reduce variance by taking into account fewer variables and parameters.
(c) Use cross-validation.
(d) Use regularization techniques (such as L1 penalty, L2 penalty, etc.)
1.5 points are given for each correct answer.
(g) (3 pts) When we minimize the sum of squared errors using gradient descent in the Linear Regression problem, we can get multiple local optimum solutions. Is this statement true or false? (You can assume that the number of data points is larger than the number of features.)
False. The objective function in Linear Regression is convex (strongly convex when the design matrix has full column rank), so every local optimum is a global optimum. (Full points are given even if the student only wrote "convex".)
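One way to see this empirically (a NumPy sketch on synthetic data; not part of the expected answer): gradient descent started from two different initializations reaches the same minimizer, as expected for a convex objective.

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)   # toy data with n > d

def gd(w, steps=5000, eta=0.01):
    for _ in range(steps):
        w = w - eta * X.T @ (X @ w - y)    # gradient of 0.5 * ||Xw - y||^2
    return w

print(gd(np.zeros(3)))
print(gd(rng.normal(size=3)))              # same solution from a different starting point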
(h) (3 pts) Suppose you and three of your friends are working on the CS446 project, individually. You developed a new learning model which helped you to achieve the best predictions for the given dataset. In order to beat your model predictions, your friends came up with their own models and claimed that they are better than yours. With whom would you agree and why?
Friend A: "Your model is nowhere near my model. Look at the training error rates of my model! They are well below yours."
Friend B: “Your model is just nothing in front of mine! Look at the test error rates! I got these results for the best value of α, chosen by experimenting on the test data.”
Friend C: “My model is better than yours. Look at the test error rates! I got these results for the best value of α, chosen with 5-fold cross validation.”
I would agree with C. A is claiming better error rates on training data, which is not a useful metric for comparison. B claims better error rates on the test data, but she used the test data as a validation set, so her numbers do not capture generalization error.
1 point for agreeing with C.
1 point for writing what is wrong with A (or writing why C is better on the same arguments).
1 point for writing what is wrong with B (or writing why C is better on the same arguments).
(i) (3 pts) Your friend shared a fantastic experience working on CS446 project and then asked you a question. She said, “In order to reduce overfitting, I restricted my Decision Tree model to have a small maximum depth. It seems to work much better on the test data, but I really don’t know why; what do you think?”. What answer would you give in order to clarify her confusion?
3 points: Having a bound on the depth reduces variance, hence it is generalizing well.
3 points: Having a bound on the depth reduces complexity of the model, hence it is generalizing well.
1 point: Deeper trees are more prone to noise in the data.
1 point: Shallow trees generalize better than deeper trees.
0 point: It reduces overfitting. (That is already given in the question. One needs to justify why it is working better on test data.)
This page is intentionally left blank. You can use it for scratch paper.