CSC311 – Homework 2 (Solution)

$ 24.99
Category:

Description

Submission: You need to submit five files through MarkUs :
• Your answers to Questions 1, 2, 3, and 4, as a PDF file titled hw2_writeup.pdf. You can produce the file however you like (e.g. LATEX, Microsoft Word, scanner), as long as it is readable.
• Python files run_knn.py, logistic.py, and run_logistic_regression.py completed for Question 3.
• Python files q4.py completed for Question 4.
Neatness Point: One point will be given for neatness. You will receive this point as long as we don’t have a hard time reading your solutions or understanding the structure of your code.
Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.
Computing: To install Python and required libraries, see the instructions on the course web page. Homeworks are individual work. See the Course Information handout for detailed policies.
1. [9pts] Expected Loss and Bayes Optimality You are running an email service, and one of your key features is a spam filter. Every email is either spam or non-spam, which we represent with the target t ∈ {Spam,NonSpam}. You need to decide whether to keep it in the inbox or remove it to the spam folder. We represent this with the decision variable y ∈ {Keep,Remove}. We’d like to remove spam emails and keep non-spam ones, but the customers will be much more unhappy if we remove a non-spam email than if we keep a spam email. We can represent this with the following loss function L(y,t):
NonSpam Spam
Keep 0 1
Remove 100 0
Your studies indicate that 10% of the emails are spam, i.e. Pr(t = Spam) = 0.1.
(a) [2pts] Evaluate the expected loss E[L(y,t)] for the policy that keeps every email (y = Keep), and for the policy that removes every email (y = Remove).
(b) [2pts] Now suppose you get to observe a feature vector x for each email, and using your knowledge of the joint distribution p(x,t), you infer p(t|x). What is the Bayes optimal decision rule? I.e., say how to determine the Bayes optimal decision y∗ from the conditional probability Pr(t = Spam|x).
(c) [4pts] After some analysis, you’ve found two words that are indicative of an email being spam: “Viagra” and “Nigeria”. You define two input features: x1, which takes the value 1 if the email contains the word “Viagra” and 0 otherwise, and x2, which is the same for “Nigeria”. For a spam email, these features have the following joint distribution:
x2 = 0 x2 = 1
x1 = 0 0.4 0.3
x1 = 1 0.2 0.1
For a non-spam email, these features have the following joint distribution:
x2 = 0 x2 = 1
x1 = 0 0.998 0.001
x1 = 1 0.001 0
Determine the Bayes optimal decision rule, i.e. determine the Bayes optimal decision y∗ for each value of x. Justify your answer.
(d) [1pts] What is the expected loss E[L(y∗,t)]?
2. [4pts] Feature Maps. Suppose we have the following 1-D dataset for binary classification:
x t
-1 1
1 0
3 1
(a) [2pts] Argue briefly (at most a few sentences) that this dataset is not linearly separable. (Your argument should resemble the one we used in lecture to prove XOR is not linearly separable.)
(b) [2pts] Now suppose we apply the feature map
.
Assume we have no bias term, so that the parameters are w1 and w2. Write down the constraint on w1 and w2 corresponding to each training example, and then find a pair of values (w1,w2) that correctly classify all the examples. Remember that there is no bias term.
3. [15pts] kNN vs. Logistic Regression. In this problem, you will compare the performance and characteristics of different classifiers, namely k-Nearest Neighbors and Logistic Regression. You will complete the provided code in q3/ and experiment with the completed code. You should understand the code instead of using it as a black box.
The data you will be working with is a subset of MNIST hand-written digits, 4s and 9s, represented as 28×28 pixel arrays. We show the example digits in figure 1. There are two training sets: mnist_train, which contains 80 examples of each class, and mnist_train_small, which contains 5 examples of each class. There is also a validation set mnist_valid that you should use for model selection, and a test set mnist_test that you should use for reporting the final performance. Optionally, the code for visualizing the datasets is located at plot_digits.py.

Figure 1: Example digits. Top and bottom show digits of 4s and 9s, respectively.
3.1. k-Nearest Neighbors. Use the supplied kNN implementation to predict labels for mnist_valid, using the training set mnist_train.
(a) [2pts] Implement a function run_knn in run_knn.py that runs kNN for different values of k ∈ {1,3,5,7,9} and plots the classification rate on the validation set (number of correctly predicted cases, divided by total number of data points) as a function of k. Report the plot in the write-up.
(b) [2pts] Comment on the performance of the classifier and argue which value of k you would choose. What is the classification rate for k∗, your chosen value of k? Also report the classification rate for k∗ + 2 and k∗ − 2. How does the test performance of these values of k correspond to the validation performance ?
3.2. Logistic Regression. Read the provided code in run_logistic_regression.py and logistic.py. You need to implement the logistic regression model, where the cost is defined as:
,
where N is the total number of data points.
(a) [4pts] Implement functions logistic_predict, evaluate, and logistic located at logistic.py.
(c) [2pts] Examine how the cross entropy changes as training progresses. Generate and report 2 plots, one for each of mnist_train and mnist_train_small. In each plot, you need show two curves: one for the training set and one for the validation set. Run your code several times and observe if the results change. If they do, how would you choose the best parameter settings?
4. [8pts] Locally Weighted Regression.
(a) [2pts] Given {(x(1),y(1)),..,(x(N),y(N))} and positive weights a(1),…,a(N) show that the solution to the weighted least squares problem
w∗ = argmin (1)
is given by the formula
w∗ = XTAX Ay (2)
where X is the design matrix (defined in class) and A is a diagonal matrix where Aii =
a(i)
(b) [2pts] Locally reweighted least squares combines ideas from k-NN and linear regression. For each new test example x we compute distance-based weights for each training example i , computes w∗ = argmin
and predicts ˆy = xTw∗. Complete the implementation of locally reweighted least
squares by providing the missing parts for q4.py.
Important things to notice while implementing: First, do not invert any matrix, use a linear solver (numpy.linalg.solve is one example). Second, notice that
but if we use B = maxj Aj it is much more numerically stable as
overflows/underflows easily. This is handled automatically in the scipy package with the scipy.misc.logsumexp function .
(c) [2pt] Randomly hold out 30% of the dataset as a validation set. Compute the average loss for different values of τ in the range [10,1000] on both the training set and the validation set. Plot the training and validation losses as a function of τ (using a log scale for τ).
(d) [2pt] How would you expect this algorithm to behave as τ → ∞? When τ → 0? Is this what actually happened?

Reviews

There are no reviews yet.

Be the first to review “CSC311 – Homework 2 (Solution)”

Your email address will not be published. Required fields are marked *