EE-411, HomeWork 2 : Classifying digits & learning theory (Solution)

This homework involves some coding, in the language of your choice. Present your results with graphs, plots, and data.
Submission instructions : the homework must be uploaded on Moodle and the submission must be named name_surname_SCIPER_HW2.ipynb. All the results must be included in a single notebook. We recommend using Google Colab.
1 Classifying digits with Scikit-learn
We are going to use the UCI ML handwritten digits dataset. Using Python and Scikit-learn, it can be downloaded by writing : from sklearn.datasets import load_digits and then digits = load_digits(). This dataset contains ntot = 1797 images, each a set of d = 8 × 8 = 64 pixels, with each pixel being an integer between 0 and 16. The images represent handwritten digits, from 0 to 9. Our goal will be to train a classifier to recognize whether the digits represent even numbers or odd ones.
1) Import and prepare the data :
— Import the dataset using load_digits(return_X_y=True).
— The images are automatically imported as vectors x_i (i = 1,…,n), but so far the labels are numbers between 0 and 9. Change the targets y_i to 0/1 for even and odd numbers respectively.
— Split the data into a “training” and a “testing” set (using roughly 60% and 40% of the data).
— Check that these two subsets have roughly the same proportion of even and odd labels (a sketch of these steps is given after this list).
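A minimal sketch of these preparation steps (the variable names X_train, y_train, etc. are ours, and are reused in the later sketches) :

    # Sketch: load the digits, build even/odd targets, split, and check the class balance
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    X, y_digits = load_digits(return_X_y=True)   # X has shape (1797, 64)
    y = y_digits % 2                             # 0 for even digits, 1 for odd digits

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=0)     # roughly 60% train / 40% test

    print("odd fraction  train:", y_train.mean(), "  test:", y_test.mean())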
2) Logistic Regression :
— Using linear_model.LogisticRegression, train a classifier based on Logistic regression to decide if an image represents an odd or an even digit, comparing the ℓ2 and ℓ1 penalties (you can fix the penalty simply by setting the corresponding parameter when you call LogisticRegression() to create the method).
— Using GridSearchCV with n_splits=5, as seen during the exercise sessions, perform cross validation to fix the optimal hyperparameter (the regularization constant of the penalty), studying the interval λ ∈ [10^-1, 10^4] on a logarithmic scale. Then, with the resulting classifier, compute the Accuracy on the Test set, i.e. the rate of correctly classified images. A sketch is given after this list.
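A possible sketch, assuming the mapping C = 1/λ for scikit-learn's inverse regularization strength (adapt it to the convention used in class) :

    # Sketch: L2- vs L1-penalized logistic regression with cross-validated regularization
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    lambdas = np.logspace(-1, 4, 20)              # lambda in [1e-1, 1e4], log scale

    for penalty in ["l2", "l1"]:
        # assumption: C = 1 / lambda (C is scikit-learn's inverse regularization strength)
        grid = {"C": 1.0 / lambdas}
        clf = LogisticRegression(penalty=penalty, solver="liblinear", max_iter=5000)
        search = GridSearchCV(clf, grid, cv=5)    # 5-fold cross validation (n_splits=5)
        search.fit(X_train, y_train)
        print(penalty, "best C:", search.best_params_["C"],
              "test accuracy:", search.best_estimator_.score(X_test, y_test))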
3) Ridge and Hinge : Repeat these operations for Ridge and Hinge (also called SVM) :
— Use linear_model.RidgeClassifier to train a classifier based on Ridge, studying λ ∈ [10, 10^7];
— Use svm.LinearSVC to train a classifier based on SVM/Hinge, studying λ ∈ [1, 10^4].
— Compute the accuracy of the classifiers on the Test set as before (a sketch is given after this list).
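A sketch along the same lines; here alpha plays the role of λ for RidgeClassifier, and C = 1/λ for LinearSVC is again our assumption :

    # Sketch: Ridge and Hinge (linear SVM) classifiers with cross-validated regularization
    import numpy as np
    from sklearn.linear_model import RidgeClassifier
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import GridSearchCV

    # Ridge: alpha plays the role of lambda, searched in [10, 1e7]
    ridge_search = GridSearchCV(RidgeClassifier(),
                                {"alpha": np.logspace(1, 7, 20)}, cv=5)
    ridge_search.fit(X_train, y_train)
    print("Ridge test accuracy:", ridge_search.best_estimator_.score(X_test, y_test))

    # Hinge / SVM: assumption C = 1 / lambda, with lambda in [1, 1e4]
    svm_search = GridSearchCV(LinearSVC(max_iter=20000),
                              {"C": 1.0 / np.logspace(0, 4, 20)}, cv=5)
    svm_search.fit(X_train, y_train)
    print("SVM test accuracy:", svm_search.best_estimator_.score(X_test, y_test))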
4) Random Forest : Repeat the operations described in point 2 for a random forest classifier :
— Use sklearn.ensemble.RandomForestClassifier to train a random forest classifier. Study the tuning of the hyperparameter n_estimators with cross validation as in the previous points, fixing the search interval to n_estimators ∈ [10, 10^4].
— Compute the accuracy of the classifier on the test set as before (see the sketch after this list).
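A sketch with a coarse logarithmic grid over n_estimators (coarse only to keep the runtime reasonable) :

    # Sketch: random forest with cross-validated number of trees
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    n_estimators_grid = np.unique(np.logspace(1, 4, 8).astype(int))   # roughly 10 .. 10^4
    rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                             {"n_estimators": n_estimators_grid}, cv=5)
    rf_search.fit(X_train, y_train)
    print("best n_estimators:", rf_search.best_params_["n_estimators"],
          "test accuracy:", rf_search.best_estimator_.score(X_test, y_test))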
5) Random Feature :
— An alternative is to consider the random feature classifier. This is done as follows : first, we transform the vector x containing the pixels of each image into another vector u using the following transformation :

u = σ(F x), with σ(x) = 1/(1 + e^-x) (1)
Here the x_i are vectors of dimension d that contain the values of the pixels for a given image, and F is a random D × d Gaussian matrix, where each element is a Gaussian random number of variance 1/d (this can be created using random.normal in numpy). The function σ(·) is applied on each component. This transformation maps the original vectors x ∈ R^d into seemingly “randomish” vectors u ∈ R^D.
— Perform the same classification as before with ℓ2-Logistic regression, now using the vector u instead of the vector x. How does the accuracy behave as D increases? (Try D = 2d, D = 4d, D = 8d, …) We suggest using λ ∈ [10^-7, 10^-2] for the CV analysis. A sketch is given below.
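A sketch of the random-feature pipeline; the helper random_features is ours for illustration, and the C = 1/λ mapping is again an assumption :

    # Sketch: random feature map u = sigma(F x), followed by L2-logistic regression
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    d = X_train.shape[1]

    def random_features(X, D, F=None):
        """Map each row x to u = sigma(F x), with F a D x d Gaussian matrix of variance 1/d."""
        if F is None:
            F = rng.normal(0.0, np.sqrt(1.0 / d), size=(D, X.shape[1]))
        U = 1.0 / (1.0 + np.exp(-X @ F.T))        # logistic sigma applied componentwise
        return U, F

    for D in [2 * d, 4 * d, 8 * d]:
        U_train, F = random_features(X_train, D)
        U_test, _ = random_features(X_test, D, F=F)   # reuse the same F on the test set
        # assumption: C = 1 / lambda, with lambda in [1e-7, 1e-2]
        search = GridSearchCV(LogisticRegression(penalty="l2", max_iter=5000),
                              {"C": 1.0 / np.logspace(-7, -2, 15)}, cv=5)
        search.fit(U_train, y_train)
        print("D =", D, "test accuracy:", search.best_estimator_.score(U_test, y_test))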
2 Statistical Learning with Nearest-Neighbors
Here, we are going to follow the steps we took in lecture 3 to understand the limitations of learning with nearest neighbors. We shall assume that the data are generated from an unknown distribution P(X,Y), with X ∈ R^d and Y ∈ R, with variance Var[Y] = σ². Our task is to find a function f(x) that “learns” to output the corresponding y.
1. First, we assume that we are a “genie” that knows the distribution P(X,Y). Show that the “best” estimator f(X) in terms of minimizing the expected population risk E_{X,Y}[(f(X) − Y)²] is given by f_Bayes(X) = E[Y|X]. This will serve as our reference point.
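One standard intermediate step (a hint, not the full required argument) : for any candidate f, insert and remove E[Y|X] inside the square; by the tower property the cross term vanishes, leaving

E_{X,Y}[(f(X) − Y)²] = E_X[(f(X) − E[Y|X])²] + E_{X,Y}[(Y − E[Y|X])²],

and only the first term depends on f, so it is minimized by f(X) = E[Y|X].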
2. We are given a set of points x_1,…,x_n and their labels y_1,…,y_n, and consider the KNN regression function (where N_k(X) is the set of the k nearest neighbors of the point X) :

f_KNN(X) = (1/k) Σ_{i∈N_k(X)} y_i (2)
Show that the difference between the expected population risk (where we also average over the labels Y_i, i = 1,…,n) and the Bayes population risk follows a Bias-Variance decomposition given by

R_KNN − R_Bayes = E_X[ b(X)² ] + v (3)

with

b(X) = (1/k) Σ_{i∈N_k(X)} f_Bayes(x_i) − f_Bayes(X)   and   v = σ²/k (4)
3. To bound the bias term, we make the following assumptions : (i) we suppose that the regression function f_Bayes(x) is L-Lipschitz, and (ii) that the x_1, x_2,…,x_n are evenly spaced on a d-dimensional unit (hyper-)cube of volume 1. Show that this leads to
R_KNN − R_Bayes ≤ L² (k/n)^(2/d) + σ²/k (5)
Pause to think about the relationship between our results and L, k, n, and d. Does this dependence qualitatively make sense?
From now on, for the rest of the homework, we shall assume L = σ = 1 for simplicity.
4. To get an idea of what the terms in the bound look like and what the best k might be, plot the individual terms in the bound for the excess risk and the bound itself for n = 100 and d = 1 as a function of k to visualize the trade-off. Find the value of k that minimizes the excess risk. Repeat for different values of n and d.
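A plotting sketch, assuming the bound has the two-term form of Eq. (5) with L = σ = 1 and no extra constants :

    # Sketch: visualize the two terms of the bound (L = sigma = 1) as a function of k
    import numpy as np
    import matplotlib.pyplot as plt

    def bound_terms(k, n, d):
        bias2 = (k / n) ** (2.0 / d)   # bias term of the bound, as in Eq. (5)
        var = 1.0 / k                  # variance term sigma^2 / k with sigma = 1
        return bias2, var

    n, d = 100, 1
    k = np.arange(1, n + 1)
    bias2, var = bound_terms(k, n, d)
    plt.loglog(k, bias2, label="bias term")
    plt.loglog(k, var, label="variance term")
    plt.loglog(k, bias2 + var, label="bound")
    plt.xlabel("k"); plt.ylabel("excess risk bound"); plt.legend()
    plt.show()
    print("k minimizing the bound:", k[np.argmin(bias2 + var)])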
5. For general n and d, compute analytically the best k, denoted k*(n,d), as a function of n and d. How does k*(n,d) change with d and n? Does this relation with respect to n make sense?
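As a sanity check (assuming the bound keeps the two-term form of Eq. (5), with L = σ = 1) : setting the derivative of (k/n)^(2/d) + 1/k with respect to k to zero gives k^((d+2)/d) ∝ n^(2/d), i.e. k*(n,d) ∝ n^(2/(d+2)), which you can compare with the numerical minima found in the previous point.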
6. Plug this value of k∗(n,d) in the bound (5). Obtain an optimal estimate for the bound on the excess squared risk for k-nearest-neighbor regression in terms of n and d. Pause to think about the result.
7. To understand and get an intuition on the meaning of this result, plot the number of samples n required to achieve excess squared risk of 0.1 as a function of the dimension d. Do you find anything worrying about this plot?
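A numerical sketch, again assuming the two-term form of the bound with L = σ = 1; it doubles n until the bound, optimized over k, falls below 0.1 :

    # Sketch: smallest n such that the (optimized over k) bound drops below 0.1, vs dimension d
    import numpy as np
    import matplotlib.pyplot as plt

    def best_bound(n, d):
        # evaluate the bound on a log-spaced grid of integer k and keep the minimum
        k = np.unique(np.logspace(0, np.log10(n), 200).astype(int))
        return np.min((k / n) ** (2.0 / d) + 1.0 / k)

    dims = np.arange(1, 11)
    n_needed = []
    for d in dims:
        n = 10
        while best_bound(n, d) > 0.1:
            n *= 2                 # coarse doubling search; refine if needed
        n_needed.append(n)

    plt.semilogy(dims, n_needed, "o-")
    plt.xlabel("dimension d"); plt.ylabel("n needed for excess risk 0.1")
    plt.show()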