CS760 – HOMEWORK4 (Solution)

>>NAME HERE<< >>ID HERE<<
Instructions:
• Please submit your answers in a single PDF file, preferably made using LaTeX. No need to submit the LaTeX code. See Piazza post @114 for recommendations on how to write up the answers.
• Submit code for the programming exercises. You can use any programming language you like, but we highly recommend Python and PyTorch.
• Submit all the material on time.
1 Best Prediction Under 0-1 Loss (10 pts)
Suppose the world generates a single observation $x \sim \text{multinomial}(\theta)$, where the parameter vector $\theta = (\theta_1, \ldots, \theta_k)$ with $\theta_i \geq 0$ and $\sum_{i=1}^{k} \theta_i = 1$. Note $x \in \{1, \ldots, k\}$. You know $\theta$ and want to predict $x$. Call your prediction $\hat{x}$.
What is your expected 0-1 loss:
$$\mathbb{E}\big[\mathbf{1}[\hat{x} \neq x]\big]$$
using the following two prediction strategies respectively? Prove your answer.
Strategy 1: $\hat{x} \in \arg\max_x \theta_x$, the outcome with the highest probability. (5 pts)
Strategy 2: You mimic the world by generating a prediction $\hat{x} \sim \text{multinomial}(\theta)$. (Hint: your randomness and the world's randomness are independent.) (5 pts)
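The following is a quick Monte Carlo sanity check (not a proof) of whichever expected losses you derive for the two strategies; the particular $\theta$, the 0-indexed outcomes, and the sample size are arbitrary illustrative assumptions.

```python
# Monte Carlo sanity check for the two prediction strategies under an
# illustrative theta (outcomes are 0-indexed here for convenience).
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.3, 0.2])        # illustrative parameter vector
n = 200_000                              # number of simulated rounds

x = rng.choice(len(theta), size=n, p=theta)       # the world's draws

x_hat1 = np.full(n, np.argmax(theta))             # Strategy 1: most probable outcome
x_hat2 = rng.choice(len(theta), size=n, p=theta)  # Strategy 2: independent mimicking draw

print("Strategy 1 empirical 0-1 loss:", np.mean(x_hat1 != x))
print("Strategy 2 empirical 0-1 loss:", np.mean(x_hat2 != x))
```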
2 Best Prediction Under Different Misclassification Losses (10 pts)
Like in the previous question, the world generates a single observation $x \sim \text{multinomial}(\theta)$. Let $c_{ij} \geq 0$ denote the loss you incur if $x = i$ but you predict $\hat{x} = j$, for $i, j \in \{1, \ldots, k\}$, with $c_{ii} = 0$ for all $i$. This is a way to generalize different costs on false positives vs. false negatives from binary classification to multi-class classification. You want to minimize your expected loss:
$$\mathbb{E}[c_{x\hat{x}}]$$
Derive your optimal prediction $\hat{x}$.
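As a numerical sanity check of whatever prediction rule you derive, a brute-force search over candidate predictions can be used; the $\theta$ and cost matrix below are arbitrary illustrative assumptions.

```python
# Brute-force check: evaluate the expected loss of every candidate prediction j
# under an illustrative theta and cost matrix c, where c[i, j] is the loss when
# x = i but the prediction is j (zero on the diagonal).
import numpy as np

theta = np.array([0.5, 0.3, 0.2])
c = np.array([[0, 1, 4],
              [2, 0, 1],
              [1, 3, 0]])

expected_loss = theta @ c                 # entry j equals E[c_{x, j}]
print("expected loss of each prediction:", expected_loss)
print("loss-minimizing prediction (0-indexed):", expected_loss.argmin())
```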
3 Best Prediction Under Online Learning (10 pts)
Here we consider a repeated game: In each round t = 1,2,…,T
• An adversary chooses a real number $y_t \in [0,1]$ and keeps it secret;
• You try to guess the real number, choosing $x_t \in [0,1]$;
• The adversary’s number is revealed and you pay the squared difference $(x_t - y_t)^2$.
To simplify the game a bit, we assume that the adversary draws the numbers i.i.d. from some fixed distribution over [0,1]. However, he is still free to decide which distribution to use at the beginning of the game. Suppose $\mu$ is the mean of the distribution he chose and $\sigma^2$ is its variance.
(i) If we knew the distribution, what would be the optimal strategy to minimize your expected payments over the T rounds? Derive your optimal strategy and give its expected payments over the T rounds. (5 pts)
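A standard decomposition that may be useful for part (i), stated here as a hint rather than the required derivation: for any fixed guess $x$ and a round's $y$ drawn from the adversary's distribution,
$$\mathbb{E}\big[(x - y)^2\big] = \mathbb{E}\big[(x - \mu + \mu - y)^2\big] = (x - \mu)^2 + \sigma^2,$$
since the cross term $2(x - \mu)\,\mathbb{E}[\mu - y]$ vanishes.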

(ii) However, we usually need to decide our strategy without knowing the distribution, so it is important to measure whether our strategy performs well. It is natural to benchmark our strategy against the optimal one (which you were asked to derive in (i)). Please design a quantity that measures performance relative to this benchmark. (5 pts)
4 Language Identification with Naive Bayes (5 pts each)
Implement a character-based Naive Bayes classifier that classifies a document as English, Spanish, or Japanese. All documents are written with the 26 lower-case characters and the space character.
The dataset is languageID.tgz; unpack it. It consists of 60 documents in English, Spanish, and Japanese. The correct class label is the first character of the filename: y ∈ {e, j, s}. (Note: each file is a document in the corresponding language and is treated as a single data point.)
Let S be the set of all character types and L be the set of all class labels, with |S| = K and |L| = L. Then, by additive smoothing with parameter $\alpha$, we can estimate the conditional probability as
$$\hat{p}(a_s \mid c_k) = \frac{\big(\text{number of occurrences of } a_s \text{ in training documents of class } c_k\big) + \alpha}{\big(\text{total number of characters in training documents of class } c_k\big) + K\alpha},$$
where $a_s \in S$ and $c_k \in L$. Similarly, we can estimate the prior probability as
$$\hat{p}(c_k) = \frac{\big(\text{number of training documents of class } c_k\big) + \alpha}{N + L\alpha},$$
where $c_k \in L$ and N is the number of training samples.
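A minimal sketch of the two smoothed estimates above, assuming the raw character and document counts have already been extracted from the training files; the function and variable names (char_counts, class_counts, alpha) are illustrative, not prescribed.

```python
# Additive-smoothing estimates for the multinomial Naive Bayes model.
import numpy as np

def smoothed_conditional(char_counts, alpha):
    """char_counts: (num_classes, 27) raw character counts per class.
    Returns theta[k, s] = p_hat(a_s | c_k) with additive smoothing."""
    K = char_counts.shape[1]                        # |S| = 27 character types
    totals = char_counts.sum(axis=1, keepdims=True)
    return (char_counts + alpha) / (totals + K * alpha)

def smoothed_prior(class_counts, alpha):
    """class_counts: (num_classes,) number of training documents per class."""
    L = len(class_counts)                           # |L| = 3 class labels
    N = class_counts.sum()
    return (class_counts + alpha) / (N + L * alpha)
```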
1. Use files 0.txt to 9.txt in each language as the training data. Estimate the prior probabilities $\hat{p}(y = e)$, $\hat{p}(y = j)$, $\hat{p}(y = s)$ using additive smoothing with parameter $\alpha$. Give the formula for additive smoothing with parameter $\alpha$ in this case. Print the prior probabilities.
2. Using the same training data, estimate the class conditional probability (multinomial parameter) for English,
$$\theta_{i,e} := \hat{p}(c_i \mid y = e),$$
where $c_i$ is the i-th character; that is, $c_1 = a, \ldots, c_{26} = z, c_{27} = \text{space}$. Again use additive smoothing with parameter $\alpha$. Give the formula for additive smoothing with parameter $\alpha$ in this case. Print $\theta_e$, which is a vector with 27 elements.
3. Print $\theta_j$ and $\theta_s$, the class conditional probabilities for Japanese and Spanish.
4. Treat e10.txt as a test document x. Represent x as a bag-of-words count vector (Hint: the vocabulary has size 27). Print the bag-of-words vector x.
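A minimal sketch of building the 27-dimensional count vector (indices 0–25 for 'a'–'z', index 26 for the space); the file path is an assumption about where the unpacked data lives.

```python
# Character-level bag-of-words vector for a single document.
import numpy as np

VOCAB = "abcdefghijklmnopqrstuvwxyz "

def bag_of_chars(path):
    counts = np.zeros(len(VOCAB), dtype=int)
    with open(path) as f:
        for ch in f.read():
            if ch in VOCAB:                  # ignore anything outside the vocabulary
                counts[VOCAB.index(ch)] += 1
    return counts

x = bag_of_chars("languageID/e10.txt")       # assumed path after unpacking the archive
```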
5. For the x of e10.txt, compute $\hat{p}(x \mid y)$ for y = e, j, s under the multinomial model assumption, respectively. Use the formula
$$\hat{p}(x \mid y) = \prod_{i=1}^{d} (\theta_{i,y})^{x_i},$$
where $x = (x_1, \ldots, x_d)$. Show the three values: $\hat{p}(x \mid y = e)$, $\hat{p}(x \mid y = j)$, $\hat{p}(x \mid y = s)$.
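In practice the product above underflows for document-length counts, so it is usually evaluated in log space as $\log \hat{p}(x \mid y) = \sum_i x_i \log \theta_{i,y}$. A minimal sketch, assuming theta is the matrix of class-conditional estimates from the earlier parts:

```python
# Log-likelihood of a count vector under each class's multinomial parameters.
import numpy as np

def log_likelihood(x, theta):
    """x: (27,) count vector; theta: (num_classes, 27). Returns one value per class."""
    return x @ np.log(theta).T
```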
6. For the x of e10.txt, use Bayes rule and your estimated prior and likelihood to compute the posterior $\hat{p}(y \mid x)$. Show the three values: $\hat{p}(y = e \mid x)$, $\hat{p}(y = j \mid x)$, $\hat{p}(y = s \mid x)$. Show the predicted class label of x.
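A minimal sketch of the posterior computation in log space, with a max-shift for numerical stability; log_lik and prior are assumed to come from the previous parts.

```python
# Posterior p_hat(y | x) via Bayes rule, normalized in log space.
import numpy as np

def posterior(log_lik, prior):
    log_joint = log_lik + np.log(prior)   # log p_hat(x | y) + log p_hat(y)
    log_joint -= log_joint.max()          # shift before exponentiating (stability)
    p = np.exp(log_joint)
    return p / p.sum()
```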
7. Evaluate the performance of your classifier on the test set (files 10.txt to 19.txt in three languages). Present the performance using a confusion matrix. A confusion matrix summarizes the types of errors your classifier makes, as shown in the table below. The columns are the true language a document is in, and the rows are the classified outcome of that document. The cells are the number of test documents in that situation. For example, the cell with row = English and column = Spanish contains the number of test documents that are really Spanish, but misclassified as English by your classifier.
                       True: English    True: Spanish    True: Japanese
Predicted: English
Predicted: Spanish
Predicted: Japanese
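A minimal sketch of filling such a table from (predicted, true) label pairs, following the convention above of rows = predicted language and columns = true language; the label strings are illustrative.

```python
# Confusion matrix with rows indexed by the predicted label and columns by the true label.
import numpy as np

LABELS = ["English", "Spanish", "Japanese"]

def confusion_matrix(pairs):
    """pairs: iterable of (predicted_label, true_label) strings drawn from LABELS."""
    M = np.zeros((len(LABELS), len(LABELS)), dtype=int)
    for pred, true in pairs:
        M[LABELS.index(pred), LABELS.index(true)] += 1
    return M
```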
8. Take a test document. Arbitrarily shuffle the order of its characters so that the words (and spaces) are scrambled beyond human recognition. How does this shuffling affect your Naive Bayes classifier’s prediction on this document? Explain the key mathematical step in the Naive Bayes model that justifies your answer.
5 Simple Feed-Forward Network (30 pts)
In this exercise, you will derive and implement back-propagation for a simple neural network and compare your output with a standard library's output. Consider the following 3-layer neural network.
$$\hat{y} = f(x) = g(W_2\,\sigma(W_1 x))$$
Suppose $x \in \mathbb{R}^d$, $W_1 \in \mathbb{R}^{d_1 \times d}$, and $W_2 \in \mathbb{R}^{k \times d_1}$, i.e. $f : \mathbb{R}^d \mapsto \mathbb{R}^k$. Let $\sigma(z) = [\sigma(z_1), \ldots, \sigma(z_n)]$ for any $z \in \mathbb{R}^n$, where $\sigma(z_i) = \frac{1}{1 + e^{-z_i}}$ is the sigmoid (logistic) activation function and $g$ is the softmax function. Suppose the true pair is $(x, y)$ where $y \in \{0,1\}^k$ with exactly one entry equal to 1, and you are working with the cross-entropy loss function given below,
$$L(x, y) = -\sum_{i=1}^{k} y_i \log(\hat{y}_i)$$
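One standard identity that may be useful in the derivation (your derivation should recover it): writing $z = W_2\,\sigma(W_1 x)$ for the logits, so that $\hat{y} = g(z)$, the combination of softmax and cross-entropy gives
$$\frac{\partial L}{\partial z_j} = \hat{y}_j - y_j, \qquad j = 1, \ldots, k,$$
after which the gradients with respect to $W_2$ and $W_1$ follow by the chain rule.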
• Derive backpropagation updates for the above neural network. (10 pts)
• Implement it in numpy or pytorch using basic linear algebra operations (e.g., you are not allowed to use autograd, built-in optimizers, or models in this step; you can use library functions for data loading, processing, etc.; a minimal numpy sketch of one training step appears after this list). Evaluate your implementation on the MNIST dataset, and report the test error and learning curve. (10 pts)
• Implement the same network in pytorch (or any other framework). You can use all the features of the framework, e.g. autograd. Evaluate it on the MNIST dataset, and report the test error and learning curve. (6 pts)
• Try different weight initializations: a) all weights initialized to 0, and b) weights initialized randomly between -1 and 1. Report the test error and learning curves for both. (You can use either of the implementations.) (4 pts)
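The following is a minimal numpy sketch of one SGD step on a single example for the network above, assuming column-vector shapes, one-hot labels, and the gradient expressions your derivation should produce; the variable names and learning rate are illustrative, and batching, data loading, and evaluation are omitted.

```python
# One forward/backward pass and SGD update for y_hat = softmax(W2 sigmoid(W1 x)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W1, W2, x, y, lr=0.1):
    """x: (d,) input, y: (k,) one-hot label. Updates W1, W2 in place, returns the loss."""
    # Forward pass
    a = W1 @ x                          # hidden pre-activations, shape (d1,)
    h = sigmoid(a)                      # hidden activations, shape (d1,)
    z = W2 @ h                          # logits, shape (k,)
    y_hat = softmax(z)                  # predicted class probabilities, shape (k,)
    loss = -np.sum(y * np.log(y_hat + 1e-12))

    # Backward pass
    dz = y_hat - y                      # dL/dz from the softmax + cross-entropy identity
    dW2 = np.outer(dz, h)               # dL/dW2, shape (k, d1)
    dh = W2.T @ dz                      # dL/dh, shape (d1,)
    da = dh * h * (1.0 - h)             # dL/da through the sigmoid
    dW1 = np.outer(da, x)               # dL/dW1, shape (d1, d)

    # SGD update (no momentum)
    W1 -= lr * dW1
    W2 -= lr * dW2
    return loss
```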
You should play with different hyperparameters like the learning rate, batch size, etc. for your own learning. You only need to report results for one particular setting of hyperparameters, and you should mention those values along with the results. Use $d_1 = 300$, $d_2 = 200$. For optimization, use SGD (stochastic gradient descent) without momentum, with some batch size, say 32 or 64. MNIST can be obtained from https://pytorch.org/vision/stable/datasets.html.
