STUDENT ID: …………………………………………….. SIGNATURE: ……………………………………………..
COMP9417 Machine Learning and Data Mining – Final Examination
1. TIME ALLOWED — 24 HOURS
2. THIS EXAMINATION PAPER HAS 12 PAGES
3. TOTAL NUMBER OF QUESTIONS — 4
4. ANSWER ALL 4 QUESTIONS
5. TOTAL MARKS AVAILABLE — 100
6. OPEN BOOK EXAM – LECTURE NOTES, TUTORIALS, AND ONLINE RESOURCES ARE PERMITTED. PLEASE REFER TO EXAM INSTRUCTIONS ON MOODLE COURSE PAGE FOR SUBMISSION AND OTHER GUIDANCE.
7. DISCUSSION WITH OTHER STUDENTS IS STRICTLY PROHIBITED.
Question 1
(a) (Bias, Variance & MSE) Let X1,…,Xn be i.i.d. random variables with mean µ and variance σ². Define the estimator T = a1X1 + a2X2 + ··· + anXn for some constants a1,…,an.
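As a reminder (useful throughout this part), the MSE of an estimator T of µ admits the standard bias–variance decomposition, written here in LaTeX:

\mathrm{MSE}(T) \;=\; \mathbb{E}(T-\mu)^2 \;=\; \mathrm{var}(T) + \bigl(\mathbb{E}T - \mu\bigr)^2 \;=\; \mathrm{var}(T) + \mathrm{bias}(T)^2 .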
(i) What condition must the ai’s satisfy to ensure that T is an unbiased estimator of µ? Unbiased means that ET = µ.
What to submit: your working out, either typed or handwritten.
(ii) Under the condition identified in the previous part, which choice of the ai’s will minimize the MSE of T? Does this choice of ai’s surprise you? Provide some brief discussion. Hint: this is a constrained minimization problem.
What to submit: your working out, either typed or handwritten, and some commentary.
(iii) Suppose that instead, we let T = bX̄, where X̄ = (X1 + ··· + Xn)/n is the sample average and b is some constant. Find the value of b (in terms of µ and σ²) which minimizes the MSE of T. How does the answer here compare to the estimator found in the previous part? How do the two compare as the sample size n increases? Are there any obvious issues with using the result of this question in practice? What to submit: your working out, either typed or handwritten, and some commentary.
(iv) Suppose now that you are told that σ² = µ². For the choice of b identified in the previous part, find the MSE of the estimator in this setting. How does the MSE compare to the MSE of the
sample average X̄? (Recall that MSE(X̄) = var(X̄) = σ²/n = µ²/n.) Further, explain whether or not you can use this choice of b in practice.
What to submit: your working out, either typed or handwritten, and some commentary.
(b) (kNN Regression) Consider the usual data generating process Y = f(x) + ε, where f is some unknown function and ε is a noise variable with mean zero and variance σ². Recall that in kNN regression, we look at the k nearest neighbours (in our dataset) of an input point x0, consider their corresponding response values, and average them to get a prediction for x0. Given a dataset {(xi, yi)}, i = 1,…,n, we can write down the kNN prediction as
m̂(x0) = (1/k) Σ_{i ∈ Nk(x0)} yi,
where Nk(x0) is the set of indices of the k nearest neighbours of x0. Without loss of generality, label the k nearest neighbours of x0 as z1,…,zk and their corresponding response values by t1,…,tk.
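As an illustration of the prediction rule above, a minimal sketch in Python (the function name knn_predict and the use of NumPy arrays are illustrative, not part of the question):

import numpy as np

def knn_predict(X, y, x0, k):
    # X: (n, p) array of training inputs, y: (n,) responses, x0: (p,) query point
    dists = np.linalg.norm(X - x0, axis=1)    # distance from x0 to every training point
    neighbours = np.argsort(dists)[:k]        # indices N_k(x0) of the k nearest neighbours
    return y[neighbours].mean()               # average their responses to get m_hat(x0)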
(i) Show that
[bias(m̂(x0))]² = ( f(x0) − (1/k) Σ_{i=1}^k f(zi) )².
What to submit: your working out, either typed or handwritten.
(ii) Derive an expression for the variance var(mˆ(x0)).
What to submit: your working out, either typed or handwritten.
(iii) Using the results so far, write down an expression for the MSE of mˆ(x0). Describe what happens to the bias of the kNN estimator at x0 when k is very small (1NN), and what happens when k is very large (k → ∞). Similarly, what happens to the variance? What does this tell you about the relationship between bias and variance and choice of k?
What to submit: your working out, either typed or handwritten, and some commentary.
Question 2
(a) (Kernel SVM) Consider the following 2-dimensional data-set, where y denotes the class of each point.
index x1 x2 y
1 1 0 -1
2 0 1 -1
3 0 -1 -1
4 -1 0 +1
5 0 2 +1
6 0 -2 +1
7 -2 0 +1
(i) Use the transformation x = (x1,x2) ↦ (φ1(x),φ2(x)), where φ1(x) = … and φ2(x) = … . What is the equation of the best separating hyperplane in the new feature space? Provide a plot with the data set and hyperplane clearly shown.
What to submit: a single plot, the equation of the separating hyperplane, a screen shot of your code, a copy of your code in your .py file for this question.
(ii) Fit a hard-margin linear SVM to the transformed data set from the previous part. What are the estimated values of (α1,…,α7)? Based on this, which points are the support vectors? What error does your computed SVM achieve?
What to submit: the indices of your identified support vectors, the train error of your SVM, the computed α’s (rounded to 3 d.p.), a screen shot of your code, a copy of your code in your .py file for this question.
(iii) Consider now the kernel k(x, z) = (2 + xᵀz)². Run a hard-margin kernel SVM on the original (untransformed) data given in the table at the start of the question. What are the estimated values of (α1,…,α7)? Based on this, which points are the support vectors? What error does your computed SVM achieve?
What to submit: the indices of your identified support vectors, the train error of your SVM, the computed α’s (rounded to 3 d.p.), a screen shot of your code, a copy of your code in your .py file for this question.
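One possible way to set up parts (ii) and (iii), assuming scikit-learn is allowed, is sketched below; a very large C is a common way to approximate a hard margin, and dual_coef_ returns yi·αi for the support vectors (divide by yi to recover αi). This is a sketch only, not a prescribed solution:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 0], [0, 1], [0, -1], [-1, 0], [0, 2], [0, -2], [-2, 0]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1, 1])

def k(A, B):
    # the kernel from part (iii): k(x, z) = (2 + x^T z)^2, evaluated for all pairs of rows
    return (2.0 + A @ B.T) ** 2

svm = SVC(kernel=k, C=1e10)    # very large C approximates the hard-margin SVM
svm.fit(X, y)

print("support vector indices:", svm.support_)   # 0-based indices into X
print("y_i * alpha_i:", svm.dual_coef_)
print("train error:", np.mean(svm.predict(X) != y))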
(iv) Provide a detailed argument explaining your results in parts (i), (ii) and (iii). Your argument should explain the similarities and differences in the answers found. In particular, is your answer in (iii) worse than in (ii)? Why? To get full marks, be as detailed as possible, and use mathematical arguments or extra plots if necessary.
What to submit: some commentary and/or plots. If you use any code here, provide a screen shot of your code, and a copy of your code in your .py file for this question.
(b) (Test Set Linear Regression) Recall that the rMSE (root-MSE) of a hypothesis h on a test set {(xi, yi)}, i = 1,…,n, where xi is a p-dimensional vector and yi is a real number, is defined as
rMSE(h) = √( (1/n) Σ_{i=1}^n (yi − h(xi))² ).
For all parts, assume that the xi’s are known to you, but the yi’s are not. Suppose however that you are permitted to query rMSE(h). What this means is that you can query, for any hypothesis h, the rMSE of h on the test set. Suppose further that you know that rMSE(z) = c0, where z is the hypothesis that returns zero for any input, i.e. z(xi) = 0 for each i = 1,…,n, and c0 is some arbitrary positive number.
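To make the querying model concrete, here is a small sketch of what an rMSE oracle might look like (the helper name make_rmse_oracle is hypothetical; the y-values stay hidden inside the closure and can only be accessed through rMSE queries):

import numpy as np

def make_rmse_oracle(X_test, y_test):
    # y_test is hidden; the only thing a caller can do is query rMSE values
    def rmse(h):
        preds = np.array([h(x) for x in X_test])
        return np.sqrt(np.mean((preds - y_test) ** 2))
    return rmse

# e.g. querying the zero hypothesis z(x) = 0 returns the constant c0 described above:
# rmse_query = make_rmse_oracle(X_test, y_test); c0 = rmse_query(lambda x: 0.0)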
(i) Assume you have a set of T hypotheses H = {h1,h2,…,hT}. You are told that for each i, there is a hypothesis h in H such that h(xi) = yi. In other words, for any point in the test set, there is at least one hypothesis in H that predicts that point correctly. Suppose that you are permitted to blend predictions of different hypotheses in H. Design a naive, brute-force algorithm that constructs a hypothesis g from the elements of H such that rMSE(g) = 0. How many queries of rMSE do you need to make? How does your algorithm scale (in the worst case) with the test size n? Describe your algorithm in detail.
What to submit: a description of your algorithm and the number of queries required, either typed or handwritten.
(ii) We now consider a better approach than brute-force. For a given hypothesis h, define
h(X) = (h(x1), h(x2), …, h(xn))ᵀ and y = (y1, y2, …, yn)ᵀ.
We can always compute h(X), but we do not know y. What is the smallest number of queries required to compute y>h(X)? Describe your approach in detail.
What to submit: a description of your approach and the number of queries required, either typed or handwritten.
(iii) Given a set of K hypotheses H = {h1,…,hK}, use your result in the previous part to solve
min_{α1,…,αK} rMSE(α1h1 + α2h2 + ··· + αKhK),
and obtain the optimal weights α1,…,αK. Describe your approach in detail, and be sure to detail how many queries are needed and the exact values of the α's, in terms of X, y and the elements of H.
What to submit: a description of your algorithm and the number of queries required, either typed or handwritten.
Question 3
In the 9417 group project, many of you applied gradient boosting, which is sometimes regarded as the best out-of-the-box learning algorithm. In this question, we will derive and implement the gradient boosting algorithm from scratch, and apply it to a toy data set. The gradient boosting algorithm can be described as follows: let F be a set of base learning algorithms, and let ℓ(y, ŷ) be a loss function, where y is the target and ŷ is the predicted value. Note that this set-up is general enough to include both regression and classification algorithms. Suppose we have data (xi, yi) for i = 1,…,n, which we collect into a single data set D0. We then set the number of desired base learners to T, and proceed as follows:
(I) Initialize f0(x) = 0 (i.e. f0 is the zero function).
(II) For t = 1,2,…,T:
(GB1) Compute
rt,i = −[ ∂ℓ(yi, f(xi)) / ∂f(xi) ]_{f = ft−1}
for i = 1,…,n. We refer to rt,i as the i-th pseudo-residual at iteration t.
(GB2) Construct a new pseudo data set, Dt, consisting of pairs: (xi,rt,i) for i = 1,…,n.
(GB3) Fit a model to Dt using our base class F. That is, we solve
ht = argmin_{f ∈ F} Σ_{i=1}^n ℓ(rt,i, f(xi)).
(GB4) Choose a step-size. This can be done by either of the following methods:
(SS1) Pick a fixed step-size αt = α
(SS2) Pick a step-size adaptively according to
αt = argmin_α Σ_{i=1}^n ℓ(yi, ft−1(xi) + αht(xi)).
(GB5) Take the step
ft(x) = ft−1(x) + αtht(x).
(III) return fT.
We can view this algorithm either as a generalization of AdaBoost covered in lectures/labs, or as performing (functional) gradient descent over the base class F. Note that in (GB1), the notation means that after taking the derivative with respect to f(xi), every occurrence of f(xj) in the resulting expression is replaced with the prediction of the current model ft−1(xj), for all j. For example, [ ∂ℓ(yi, f(xi)) / ∂f(xi) ]_{f = ft−1} is obtained by first differentiating ℓ(yi, f(xi)) with respect to f(xi), and then substituting ft−1(xj) for every f(xj) appearing in the result.
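For concreteness, a minimal sketch of steps (I)–(III) in Python. The assumptions here are illustrative: the loss-specific pieces are passed in as functions pseudo_residual and step_size, and (GB3) is carried out by fitting a scikit-learn regression tree to the pseudo data set by least squares.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T, pseudo_residual, step_size, max_depth=1):
    # pseudo_residual(y, pred) should return the r_{t,i} of (GB1) for the chosen loss;
    # step_size(y, pred, h_pred) should return alpha_t, fixed or adaptive as in (GB4).
    models, alphas = [], []
    pred = np.zeros(len(y))                                        # (I)   f_0 = 0
    for t in range(T):                                             # (II)
        r = pseudo_residual(y, pred)                               # (GB1) pseudo-residuals
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # (GB2)+(GB3), fit to D_t
        h_pred = h.predict(X)
        alpha = step_size(y, pred, h_pred)                         # (GB4)
        pred = pred + alpha * h_pred                               # (GB5) take the step
        models.append(h)
        alphas.append(alpha)

    def f_T(X_new):                                                # (III) return f_T
        return sum(a * m.predict(X_new) for a, m in zip(alphas, models))
    return f_T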
(a) Consider the regression setting where we allow the y-values in our data set to be real numbers. Suppose that we use the squared error loss ℓ(y, ŷ) = (1/2)(y − ŷ)². For round t of the algorithm, show that rt,i = yi − ft−1(xi). Then, write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it).
What to submit: your working out, either typed or handwritten.
(b) Using the same setting as in the previous part, show that the adaptive approach (SS2) leads to a step-size
αt = ( Σ_{i=1}^n rt,i ht(xi) ) / ( Σ_{i=1}^n ht(xi)² ).
What to submit: your working out, either typed or handwritten.
(c) We now implement the gradient boosting algorithm and apply it to a toy data set. The following code generates the data and fits an example decision tree model:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

np.random.seed(123)
X, y = f_sampler(f, 160, sigma=0.2)   # f and f_sampler are assumed to be provided elsewhere (e.g. the exam's helper code)
X = X.reshape(-1, 1)

fig = plt.figure(figsize=(7, 7))
dt = DecisionTreeRegressor(max_depth=2).fit(X, y)   # example model
xx = np.linspace(0, 1, 1000)
plt.plot(xx, f(xx), alpha=0.5, color='red', label='truth')
plt.scatter(X, y, marker='x', color='blue', label='observed')
plt.plot(xx, dt.predict(xx.reshape(-1, 1)), color='green', label='dt')   # plot the example model
plt.legend()
plt.show()
The figure generated shows the scatter of observed data, the true function, and the example decision tree fit.
Your task is to generate a 5 × 2 figure of subplots showing the predictions of your fitted gradient boosted model. There are 10 subplots in total: the first should show the model with 5 base learners, the second with 10 base learners, and so on; the last subplot should show the gradient boosted model with 50 base learners. Each subplot should include the scatter of data as well as a plot of the true model (essentially the same as the plot provided above, but with your implemented gradient boosted model in place of dt). Comment on your results: what happens as the number of base learners is increased? You should do this twice (two 5 × 2 plots), once with the adaptive step size and once with the step size fixed at α = 0.1 throughout. There is no need to split into train and test data here. Comment on the differences between your fixed and adaptive step-size implementations. How does your model perform on different parts of the data?
What to submit: two 5 x 2 plots, one for adaptive and one for fixed step size, some commentary, and a screen shot of your code and a copy of your code in your .py file.
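A sketch of the 5 × 2 plotting scaffold, assuming the data X, y and the function f from the code above; fit_gradient_boost is a placeholder name for your own implementation:

import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(5, 2, figsize=(10, 20))
xx = np.linspace(0, 1, 1000)
for ax, n_learners in zip(axes.ravel(), range(5, 55, 5)):
    model = fit_gradient_boost(X, y, T=n_learners)   # placeholder for your implementation
    ax.scatter(X, y, marker='x', color='blue', label='observed')
    ax.plot(xx, f(xx), alpha=0.5, color='red', label='truth')
    ax.plot(xx, model(xx.reshape(-1, 1)), color='green', label='gradient boosting')
    ax.set_title(f'{n_learners} base learners')
    ax.legend()
plt.tight_layout()
plt.show()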
(d) Repeat the analysis in the previous question but with depth-2 decision trees as base learners instead. Provide the same plots. What do you notice for the adaptive case? What about the non-adaptive case? What to submit: two 5 × 2 plots, one for adaptive and one for fixed step size, some commentary, and a copy of your code in your .py file for this question.
(e) Now consider the classification setting where y is taken to be an element of {−1, 1}. We consider the following classification loss: ℓ(y, ŷ) = log(1 + e^{−yŷ}). For round t of the algorithm, what is rt,i? Write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it).
What to submit: your working out, either typed or handwritten.
(f) Using the same setting as in the previous part, write down an expression for αt using the adaptive approach in (SS2). Can you solve for αt in closed form? Explain.
What to submit: your working out, either typed or handwritten, and some commentary.
(g) Consider a different classification loss: ℓ(y, ŷ) = e^{−yŷ}. For round t of the algorithm, what is rt,i? Write down an expression for the optimization problem in step (GB3) that is specific to this setting (you don't need to actually solve it). Further, can you find an expression for αt in (SS2) for this setting? Explain.
What to submit: your working out, either typed or handwritten.
(h) In practice, if you cannot solve for αt exactly, explain how you might implement gradient boosting. Assume that using a constant step-size is not a valid alternative. Be as specific as possible in your answer. What, if any, are the additional computational costs of your approach relative to using a constant step size?
What to submit: some commentary.
Question 4
Throughout this question, try to keep your answers as brief as possible. For discussion questions you need to provide 2-3 sentences of explanation at most. For calculation questions, please circle your final answers and show all working.
(a) (Perceptron Power)
(i) Consider the following Boolean functions f1,f2,…,f5 which each take two binary inputs x1 and x2, and return a binary output. Their outputs can be represented in the following table
(each column represents the output of fi on the corresponding inputs x1 and x2.)
x1 x2 f1 f2 f3 f4 f5
0 0 0 0 0 0 1
0 1 0 1 1 0 0
1 0 0 0 1 1 0
1 1 0 1 0 0 1
Which of the functions can be learned exactly by a perceptron? Explain why. What to submit: some commentary.
(ii) Consider the following two layer network which implements the XOR function.
The numbers on the edges represent weights, and the numbers inside the nodes represent thresholds. A node returns 1 if its input is larger than the threshold value. For example, the top node in the middle layer takes as input w1x1 + w2x2 = x1 − x2, and if this is larger than the threshold b1 = 0.5, outputs a value of 1, and 0 otherwise. You should check that this network implements the XOR function by testing it for all possible boolean inputs (x1, x2) and comparing it to the XOR truth table. It turns out that a more efficient representation of the XOR function can be constructed, which is of the following form:
Provide weights and thresholds w1,…,w5 and b1,b2 that implement the XOR function using this new architecture. Demonstrate that the network works by computing its output on all possible inputs. Here, your weights w1,…,w5 must be whole numbers, e.g. ±1,±2,±3,… , and your thresholds b1,b2 should be chosen from 0.5,1,1.5,2,2.5,… . Try to find the smallest thresholds/weights that work.
What to submit: the weights and thresholds for the new network, along with a demonstration that the network does what it is meant to.
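One way to demonstrate that a candidate set of weights and thresholds works is to check it programmatically against the XOR truth table. The sketch below assumes one possible wiring for the new architecture (a single hidden threshold unit whose output feeds, together with the two inputs, into one output unit, consistent with the five weights and two thresholds requested); the exact wiring should be read off the figure:

from itertools import product

def unit(z, b):
    # a threshold unit fires (returns 1) iff its input exceeds the threshold b
    return 1 if z > b else 0

def new_network(x1, x2, w, b):
    # assumed wiring: hidden unit on (x1, x2); output unit on (x1, x2, hidden)
    h = unit(w[0] * x1 + w[1] * x2, b[0])
    return unit(w[2] * x1 + w[3] * x2 + w[4] * h, b[1])

def implements_xor(w, b):
    # compare the network output with x1 XOR x2 on all four boolean inputs
    return all(new_network(x1, x2, w, b) == (x1 ^ x2) for x1, x2 in product([0, 1], repeat=2))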
(b) (K-means) Note that the sub-questions in this question are independent of each other.
(i) Consider the following two-dimensional points: {(5,3),(4,4),(6,3),(5,4),(2,3)}. You run K-means (using Euclidean distance) on this dataset with K = 2 and initial cluster centers C0 = {(5,2),(4,5)}, where the subscript 0 denotes that these are the cluster centers at time 0. Compute C1 and C2, the cluster centers after 1 and 2 iterations of the K-means algorithm. Be sure to detail your working.
What to submit: your cluster centers and any working, either typed or handwritten.
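If you want to sanity-check your hand calculations, a minimal sketch of one K-means iteration (assignment to the nearest center, then centroid update) might look like this:

import numpy as np

def kmeans_step(X, centers):
    # assignment step: each point is assigned to its nearest center (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each center becomes the mean of the points assigned to it
    return np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])

# e.g. C1 = kmeans_step(points, C0) and C2 = kmeans_step(points, C1)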
(ii) Your friend, who studied accounting, has now been promoted to the role of Data Scientist at their consulting company. They excitedly tell you that they're working on a clustering problem. You ask for more details and they tell you they have an unlabelled dataset with p = 10000 features and that they ran K-means clustering using Euclidean distance. They identified 52 clusters and managed to define labellings for these clusters based on their expert domain knowledge.
What do you think about the usage of K-means here? Do you have any criticisms?
What to submit: some commentary.
(c) (Learning Theory) Note that the sub-questions in this question are independent of each other.
(i) Consider a hypothesis space H consisting of 1000 hypotheses. At least how many samples must you provide to the learning algorithm in order to ensure that, with probability 0.98, the learner will output a hypothesis in H with true accuracy of at least 0.95? What to submit: your answer, and any working either typed or handwritten.
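A relevant tool here is the standard sample-complexity bound for a finite hypothesis space, assuming the usual consistent-learner version with accuracy 1 − ε and confidence 1 − δ (written in LaTeX):

m \;\ge\; \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right).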
(ii) Consider a hypothesis class F with VC dimension equal to 4. Consider the following statements:
(I) Given a dataset of 4 samples, there is at least one classifier in F that can achieve an error of zero on these points.
(II) Classifiers in F will always achieve non-zero error on any dataset of size 5.
(III) Since the VC dimension of F is finite, the size of F must be finite.
Which of the statements are true? Which are false? Explain why or provide a counter-example.
What to submit: your answer, and any working either typed or handwritten.
(iii) Consider the hypothesis class V which contains all hypotheses of the form
va,b(x) = 1 if x ∈ [a, b], and 0 otherwise.
So the classifier assigns a positive label to points inside the interval [a,b], and zero to anything outside the interval. The class V contains all such classifiers (i.e. one classifier for each interval [a,b] for every pair of real numbers a and b). What is the VC dimension of V? Explain. What to submit: your answer, and any working either typed or handwritten.
(d) (Tree Construction) Consider the data set below, with features: W, X and Z, and target Y .
index Y W X Z
1 0 A 0 0
2 0 B 1 1
3 0 B 1 1
4 1 B 2 0
5 1 A 2 0
6 0 A 0 1
7 1 A 2 1
8 1 B 1 0
9 1 B 1 0
10 0 A 0 1
Throughout this question, use log2 in your calculations.
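For reference, the quantities required below are the empirical entropy and the information gain, written here in LaTeX with all logarithms base 2 as instructed:

H(Y) = -\sum_{y} p(y)\,\log_2 p(y), \qquad IG(Y, X) = H(Y) - \sum_{x} p(X = x)\, H(Y \mid X = x).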
(i) What is the entropy of Y ?
What to submit: your answer, and any working either typed or handwritten.
(ii) Suppose that feature X is chosen for the root of a decision tree. What is the information gain associated with this attribute?
What to submit: your answer, and any working either typed or handwritten.
(iii) Draw the full decision tree (without pruning) and with X as the feature chosen for the root node. Label your tree clearly and outline the prediction made at each leaf node. What can you say about feature W? Show all working.
What to submit: your answer, and any working either typed or handwritten.