Supervised Learning (COMP0078) – Coursework 2

Mark Herbster
CW2-2223-v6
Submission
For questions, please use the Moodle forum or, alternatively, e-mail sl-support@cs.ucl.ac.uk.
1 PART I [40%]
1.1 Kernel perceptron (Handwritten Digit Classification)
Introduction: In this exercise you will train a classifier to recognize handwritten digits. The task is quasi-realistic, and you will (perhaps) encounter difficulties in working with a moderately large dataset which you would not encounter on a “toy” problem.

Figure 1: Scanned Digits
Adding a kernel: The kernel allows one to map the data to a higher-dimensional space, as we did with basis functions, so that the class of functions learned is larger than simply the linear functions. We will consider a single type of kernel, the polynomial K_d(p, q) = (p · q)^d, which is parameterized by a positive integer d controlling the degree of the polynomial.
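For concreteness, here is a minimal NumPy sketch of this kernel (the function name and the Gram-matrix form are our own, not prescribed by the assignment; computing the kernel matrix in one vectorised call is what keeps a moderately large dataset tractable):

import numpy as np

def polynomial_kernel(P, Q, d):
    # Gram matrix of K_d(p, q) = (p . q)^d for all rows p of P and q of Q.
    # P: (m, n) array of m patterns; Q: (k, n) array of k patterns.
    # Returns an (m, k) array with entry (i, j) = (P[i] . Q[j])^d.
    return (P @ Q.T) ** d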

Training over epochs: the perceptron is trained online by cycling repeatedly through the training set (here illustrated for a training set of 40 examples):

x_1, …, x_40 | x_41, …, x_80 | … | x_{(m−1)×40+1}, …
   epoch 1        epoch 2       …       epoch m

where x_1 = x_41 = x_81 = … = x_{(m−1)×40+1}, etc. Testing is performed as follows: once we have trained a classifier w on the training set, we simply apply the trained classifier, with only the prediction step, to each example in the test set. It is a mistake whenever the prediction ŷ_t does not match the desired output y_t; thus the test error is simply the number of mistakes divided by the test set size. Remember that in testing the update step is never performed.
Two-Class Kernel Perceptron (training)
Input: training sequence (x_1, y_1), …, (x_m, y_m)
Initialization: w_1 = 0 (α_0 = 0)
Prediction: Upon receiving the t-th instance x_t, predict
    ŷ_t = sign(w_t(x_t)) = sign(Σ_{i=1}^{t−1} α_i K(x_i, x_t))
Update: if ŷ_t = y_t then α_t = 0
    else α_t = y_t
    w_{t+1}(·) = w_t(·) + α_t K(x_t, ·)
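A minimal Python/NumPy sketch of this training loop, and of the testing step described above, under our own naming (it assumes the full Gram matrix fits in memory; over multiple epochs the coefficient of a repeatedly mistaken example accumulates):

import numpy as np

def train_kernel_perceptron(X, y, d, epochs=1):
    # X: (m, n) training patterns; y: (m,) labels in {-1, +1}.
    # Returns the dual coefficients alpha, one per training example.
    m = X.shape[0]
    K = (X @ X.T) ** d                        # precompute Gram matrix K(x_i, x_j)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for t in range(m):
            y_hat = np.sign(alpha @ K[:, t])  # sign of sum_i alpha_i K(x_i, x_t)
            if y_hat != y[t]:
                alpha[t] += y[t]              # update only on a mistake
    return alpha

def test_error(alpha, X_train, X_test, y_test, d):
    # Prediction step only: the update step is never performed in testing.
    y_hat = np.sign(alpha @ ((X_train @ X_test.T) ** d))
    return np.mean(y_hat != y_test)           # mistakes / test set size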
Generalizing to k classes: Design a method (or research a method) to generalise your two-class classifier to k classes. The method should return a vector κ ∈ ℝ^k where κ_i is the “confidence” in label i; then you should predict either with a label that maximises confidence or alternatively with a randomised scheme.
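One standard such scheme (the question asks you to design or research your own, so treat this only as an illustration) is one-vs-rest: train k two-class perceptrons, “digit i versus the rest”, and take κ_i to be the real-valued output of the i-th classifier. A sketch, with hypothetical names:

import numpy as np

def ovr_predict(K_test, Alpha):
    # Alpha: (m_train, k) dual coefficients, one column per two-class
    # classifier "class i vs. rest"; K_test: (m_train, m_test) kernel
    # matrix between training and test points.
    kappa = K_test.T @ Alpha         # kappa[t, i] = confidence in label i for x_t
    return np.argmax(kappa, axis=1)  # predict a label maximising confidence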
I’m providing you with Mathematica code for a 3-class classifier and a demonstration on a small subset of the data. Note, however, that my Mathematica implementation is flawed and relatively inefficient for large datasets. One of your goals is to improve my code so that it can work on larger datasets. The mathematical logic of the algorithm should not change; however, the program logic and/or the data structures will need to change. Also, I suspect that it will be considerably easier to implement sufficiently fast code in Python (or the language of your choice) rather than Mathematica.
Files: From http://www0.cs.ucl.ac.uk/staff/M.Herbster/SL/misc/, you will find files relevant to this assignment. These are,
poorCodeDemoDig.nb demo mathematica code
dtrain123.dat mini training set with only digits 1,2,3 (329 records)
dtest123.dat mini testing set with only digits 1,2,3 (456 records)
zipcombo.dat full data set with all digits (9298 records)
Experimental Protocol: Your report on the main results should contain the following (errors reported should be percentages not raw totals):
1. Basic Results: Perform 20 runs for d = 1,…,7; each run should randomly split zipcombo into 80% train and 20% test. Report the mean test and train error rates as well as standard deviations. Thus your data table here will be 2 × 7, with each “cell” containing a mean ± std.
2. Cross-validation: Perform 20 runs: within each run, use the 80% training data, splitting it further to perform 5-fold cross-validation to select the “best” parameter d∗; then retrain on the full 80% training set using d∗ and record the test error on the remaining 20%. Thus you will find 20 values of d∗ and 20 test errors. Your final result will consist of a mean test error ± std and a mean d∗ with std. (A split/cross-validation sketch appears after item 6.)
3. Confusion matrix: Perform 20 runs: within each run, use the 80% training data, splitting it further to perform 5-fold cross-validation to select the “best” parameter d∗; retrain on the full “80%” training set using d∗ and then produce a confusion matrix. Here the goal is to find “confusions”: if the true label (on the test set) was “7” and “2” was predicted, then an error should be recorded for “(7,2)”. The final output will be a 10 × 10 matrix where each cell contains a confusion error rate and its standard deviation (averaged over the 20 runs). Note the diagonal will be 0.
4. Within the dataset, relative to your experiments, there will be five “pixelated images” that are hardest to predict correctly. Print out the visualisation of these five digits along with their labels. Is it surprising that these are hard to predict?
5. Repeat 1 and 2 above (d∗ is now c, and {1,…,7} is now S) with a Gaussian kernel

K(p, q) = exp(−c‖p − q‖²),

where c, the width of the kernel, is now a parameter which must be optimised during cross-validation; however, you will also need to perform some initial experiments to decide a reasonable set S of values over which to cross-validate c.
6. Choose (research) an alternate method to generalise the kernel perceptron to k classes, then repeat 1 and 2.
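As referenced in item 2, here is a sketch of the scaffolding these experiments share (our own helper names, not prescribed by the assignment): the Gaussian kernel of item 5, the random 80/20 split, and the 5-fold partition used for cross-validation.

import numpy as np

def gaussian_kernel(P, Q, c):
    # K(p, q) = exp(-c * ||p - q||^2), via ||p-q||^2 = ||p||^2 + ||q||^2 - 2 p.q
    sq = (P**2).sum(1)[:, None] + (Q**2).sum(1)[None, :] - 2 * P @ Q.T
    return np.exp(-c * sq)

def split_80_20(m, rng):
    # Random 80% train / 20% test index split.
    perm = rng.permutation(m)
    cut = int(0.8 * m)
    return perm[:cut], perm[cut:]

def five_folds(train_idx, rng):
    # Partition the training indices into 5 folds for cross-validation.
    return np.array_split(rng.permutation(train_idx), 5)

rng = np.random.default_rng(0)  # reseed (or draw fresh) for each of the 20 runs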
Assessment: In your report you will not only be assessed on the correctness/quality of your experiment (e.g., sound methods for choosing parameters, reasonable final test errors) but also on the clarity of presentation and the insightfulness of your observations. Thus the aim is that your report is sufficiently detailed so that the reader could largely repeat your experiments based on the description in your report alone. The report should also contain the following.
• A discussion of any parameters of your method which were not cross-validated over.
• A discussion of the two methods chosen for generalising 2-class classifiers to k-class classifiers.
• A discussion comparing results of the Gaussian to the polynomial Kernel.
• A discussion of your implementation of the kernel perceptron. This should at least discuss how the sum w(·) = Σ_t α_t K(x_t, ·) was i) represented, ii) evaluated, and iii) how new terms are added to the sum during training.
2 PART II [20%]
Notation
Recall that [m] := {1,2,…,m} and we also overload notation so that

[pred] := { 1 if pred is true
            0 if pred is false }.
The i-th “coordinate” vector is denoted e_i := ([j = i] : j ∈ [m]). The notation M^+ denotes the pseudoinverse of M.
2.1 Semi-supervised Learning via Laplacian Interpolation
In this section, you are asked to implement two intimately connected semi-supervised learning methods over a graph. The two methods will be Laplacian (semi-norm) interpolation (LI) and Laplacian kernel interpolation (LKI).
Semi-supervised learning sits somewhere between unsupervised and supervised learning. In the context of label prediction problems, the semi-supervised task is similar to the supervised one: to predict the labels of unlabelled data by using the available label information. Supervised learning constructs a model using only labelled data points and then applies the model to predict on unlabelled data. In semi-supervised learning, on the other hand, a model is constructed using both labelled and unlabelled data to make predictions.
In the following, we focus on the Laplacian interpolation method, one of the established graph-based semi-supervised learning methods. It can be described as follows. We are given a data set represented by a data matrix X = (x_1, x_2, …, x_m), where each datum x_i ∈ ℝ^n is a vector. Let y ∈ {−1,1}^m be a label vector, where y_i ∈ {−1,1} is the label of x_i. We assume that we know some of the values of y_i, and the task is to predict the rest. More formally, we are given {(i, y_i) : i ∈ L}, where L ⊂ [m] is a set of indices of the known labels, and the task is to predict {(i, y_i) : i ∈ [m]∖L}. In general, we know only very few labels, i.e., |L| ≪ m. We will build a graph using the 3-NN method (with the Euclidean distance). For the 3-NN graph, we create a weight-“1” edge if the i-th datum is among the 3 nearest neighbours of the j-th datum or if the j-th datum is among the 3 nearest neighbours of the i-th datum; otherwise, we put weight 0. Thus, we produce an m × m weight matrix as

W_ij := { 1 if x_i is a 3-NN of x_j or x_j is a 3-NN of x_i
          0 otherwise }.
Note that W_ii = 0. Then, we solve the following optimisation:

v := argmin_{u ∈ ℝ^m} { u⊤Lu : u_i = y_i, ∀i ∈ L },   (1)

where L is the graph Laplacian created from the matrix W. We then predict by assigning ŷ_i = sign(v_i) for i ∈ [m]∖L (as summarised in Figure 2). For Laplacian kernel interpolation the set-up is the same, but we will exploit the induced Laplacian differently than in Equation (1); see Figure 2 for details.
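A minimal sketch of the 3-NN graph and its Laplacian under the definitions above (dense distance computations are fine at these dataset sizes):

import numpy as np

def three_nn_laplacian(X, k=3):
    # W_ij = 1 iff x_i is among the k nearest neighbours (Euclidean) of x_j,
    # or vice versa; W_ii = 0. Returns L = D - W.
    m = X.shape[0]
    sq = (X**2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                  # a point is not its own neighbour
    W = np.zeros((m, m))
    for i in range(m):
        W[i, np.argsort(d2[i])[:k]] = 1           # k nearest neighbours of x_i
    W = np.maximum(W, W.T)                        # symmetrise: the "or" in the definition
    return np.diag(W.sum(1)) - W                  # L = D - W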
For each dataset ∈ {dtrain13_50.dat, dtrain13_100.dat, dtrain13_200.dat, dtrain13_400.dat}
  For each ℓ ∈ {1,2,4,8,16} do
    Do 20 runs:
      L = random sample (w/o replacement) of size ℓ from each class (|L| = 2ℓ)
      errLI = LaplacianInterpolation(dataset, L)
      errLKI = LaplacianKernelInterpolation(dataset, L)
    The estimated generalisation error is the mean empirical error of these 20 runs
Figure 3: Experimental Protocol A
Inputs: Data: X := (x_1, x_2, …, x_m), y := (y_1, …, y_m)
Observed label set: L ⊂ [m]
Adjacency weight matrix W:
  W_ij := { 1 if x_i is a 3-NN of x_j or x_j is a 3-NN of x_i
            0 otherwise }
Note that W_ii = 0.
Degree matrix D: D_ii := Σ_{j≠i} W_ij, and if i ≠ j then D_ij := 0.
Graph Laplacian L: L := D − W
Laplacian (semi-norm) Interpolation (LI): v := argmin_u { u⊤Lu : u_i = y_i, ∀i ∈ L }
Laplacian Kernel Interpolation (LKI): Kernel matrix K := ((L^+)_ij : i, j ∈ L); y_L := (y_i : i ∈ L); α∗ := K^+ y_L;
  v := Σ_{i∈L} α∗_i e_i⊤ L^+
LKI implementation note: K, y_L, α∗ are indexed by the elements of L, and hence the indices are not likely to be consecutive.
Discrete prediction vector: ŷ := (sign(v_i) : i ∈ [m])
Empirical generalisation error: err := (1/|[m]∖L|) Σ_{i∈[m]∖L} [ŷ_i ≠ y_i]
Figure 2: Laplacian Interpolation and Laplacian Kernel Interpolation
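A minimal NumPy sketch of both methods under the Figure 2 definitions, reusing three_nn_laplacian from above (Lap is the graph Laplacian, lbl the labelled index set; dense pseudoinverses are fine at these sizes):

import numpy as np

def laplacian_interpolation(Lap, y, lbl):
    # LI: minimise u'Lu subject to u_i = y_i on the labelled set. The minimiser
    # is "harmonic" on the unlabelled block, so we solve that linear system.
    m = Lap.shape[0]
    unl = np.setdiff1d(np.arange(m), lbl)
    v = np.zeros(m)
    v[lbl] = y[lbl]
    A = Lap[np.ix_(unl, unl)]
    b = -Lap[np.ix_(unl, lbl)] @ y[lbl]
    v[unl] = np.linalg.lstsq(A, b, rcond=None)[0]  # least squares = pseudo-solve
    return np.sign(v)

def laplacian_kernel_interpolation(Lap, y, lbl):
    # LKI: K = (L^+)_{LL}, alpha* = K^+ y_L, v = sum_{i in L} alpha*_i e_i' L^+.
    Lp = np.linalg.pinv(Lap)
    K = Lp[np.ix_(lbl, lbl)]
    alpha = np.linalg.pinv(K) @ y[lbl]
    v = Lp[:, lbl] @ alpha   # L^+ is symmetric, so e_i' L^+ is its i-th column
    return np.sign(v)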
Files: From http://www0.cs.ucl.ac.uk/staff/M.Herbster/SL/misc/, you will find files relevant to this assignment. These are
dtrain13_50.dat   subset of USPS digits 1,3 (50 examples per class)
dtrain13_100.dat  subset of USPS digits 1,3 (100 examples per class)
dtrain13_200.dat  subset of USPS digits 1,3 (200 examples per class)
dtrain13_400.dat  subset of USPS digits 1,3 (400 examples per class)
Experiments [10pts]:
In this experiment, you will implement Laplacian interpolation and Laplacian kernel interpolation (see Figure 2). The experiment uses the four datasets listed in Files; these are subsets of the USPS dataset created by randomly sampling 50, 100, 200 and 400 examples per class (m ∈ {100, 200, 400, 800}). For each dataset, we will experiment with label sets L that contain 1, 2, 4, 8 or 16 labels per class. The set L is formed by sampling uniformly at random without replacement from [m], per class, so that we are given {1,2,4,8,16} labels per class; as a consequence, |L| ∈ {2,4,8,16,32}. The task is to predict the labels of [m]∖L. You will repeat each experiment 20 times per dataset paired with label-set size. The protocol for the experiments is detailed in Figure 3, and a sketch of its loop appears below.
You will include the results of your experiments in the report requested below, rather than separately.
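A sketch of Protocol A's sampling-and-averaging loop, reusing the sketches above (sample_L is our hypothetical helper; errors are measured on the unlabelled points only):

import numpy as np

def sample_L(y, ell, rng):
    # ell labelled indices per class, sampled without replacement (|L| = 2*ell).
    return np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=ell, replace=False)
        for cls in np.unique(y)
    ])

def protocol_A(X, y, ell, runs=20, rng=None):
    # Mean empirical error of LI and LKI over `runs` random label sets.
    rng = rng or np.random.default_rng()
    Lap = three_nn_laplacian(X)
    errs = []
    for _ in range(runs):
        lbl = sample_L(y, ell, rng)
        unl = np.setdiff1d(np.arange(len(y)), lbl)
        errs.append((np.mean(laplacian_interpolation(Lap, y, lbl)[unl] != y[unl]),
                     np.mean(laplacian_kernel_interpolation(Lap, y, lbl)[unl] != y[unl])))
    return np.mean(errs, axis=0)  # (mean errLI, mean errLKI)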
                          # of known labels (per class)
                       1           2           4           8           16
# data points   50  ??? ± ???   ??? ± ???   ??? ± ???   ??? ± ???   ??? ± ???
per label      100  ??? ± ???   ??? ± ???   ??? ± ???   ??? ± ???   ??? ± ???
               200  ??? ± ???   ??? ± ???   ??? ± ???   ??? ± ???   ??? ± ???
               400  ??? ± ???   ??? ± ???   ??? ± ???   ??? ± ???   ??? ± ???
Table 1: Errors and standard deviations for semi-supervised learning via Laplacian (kernel) interpolation.
Experimental Report [10pts]: You should write a brief experimental report (1 to 2 pages expected length). The report should include the following:
1. You should give two tables with your results, each in the form of Table 1, where one table is for Laplacian interpolation and the other for Laplacian kernel interpolation.
2. You should give observations about each table that you believe are of interest.
3. You should compare the performances of the two methods. If, in your opinion, one method performed significantly better than the other in some aspects, you are encouraged to propose reasons why this occurred. Note: it is better to say that one cannot explain than to give unreasonable explanations.
4. If there is a significant difference in the performance of the two algorithms and you believe the reasons depend in some way on the datasets, you are encouraged to speculate (with reasons) about datasets where the weaker method might outperform the other.
Note: As in Part I, your report will be assessed on both its clarity and scientific excellence.
3 PART III [40%]
3.1 Questions
1. [(a,b,c,d 24pts, e 16pts)] Sparse learning:
The ‘just a little bit’ problem. In the following we will consider the sample complexity of the perceptron, winnow, least squares, and 1-nearest-neighbours algorithms for a specific problem.
Problem (‘just a little bit’): The m patterns x_1, …, x_m are sampled uniformly at random from {−1,1}^n, and each label is defined as y_i := x_{i,1}, i.e., the label of a pattern x is just its first coordinate. Thus, for example, here is a typical data set with m = 4 examples in n = 3 dimensions:
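For instance, one possible such sample (any ±1 patterns will do, since each label is forced by the rule y_i = x_{i,1}):

x_1 = (+1, −1, +1),  y_1 = +1
x_2 = (−1, −1, +1),  y_2 = −1
x_3 = (−1, +1, −1),  y_3 = −1
x_4 = (+1, +1, +1),  y_4 = +1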

We are concerned with estimating the sample complexity of this problem as a function of the dimension (n) of the data, where our “working definition” of sample complexity is the minimum number of examples (m) needed to incur no more than 10% generalisation error (on average).
(a) In this part, you will implement the four classification algorithms and then use them to estimate the sample complexity of these algorithms. Here you will plot m (left axis) versus n (bottom axis). As an illustration, I include an example plot of estimated sample complexity for least squares (Figure 4). Please include sample complexity plots for all four algorithms. (A naive estimation sketch appears after part (e).)

Figure 4: Estimated number of samples (m) to obtain 10% generalisation error versus dimension (n) for least squares.
(b) Computing the sample complexity “exactly” by simulation would be extremely expensive computationally. Thus, for your method in part (a), it is necessary to trade off accuracy and computation time. Hence: i) please describe your method for estimating sample complexity in detail; ii) please discuss the trade-offs and biases of your method.
(c) Please estimate how m grows as a function of n as n → ∞ for each of the four algorithms, based on experimental or any other “analytical” observations. Here the use of O(·), Ω(·), Θ(·) will be useful. For example, experimentally, from the plot given for least squares it seems that sample complexity grows linearly as a function of dimension, i.e., m = Θ(n). Discuss your observations and compare the performance of the four algorithms.
(d) Now, additionally, suppose we sample an integer s ∈ {1,…,m} uniformly at random. Derive a non-trivial upper bound p̂_{m,n} on the probability that the perceptron will make a mistake on the s-th example after being trained on examples (x_1, y_1), …, (x_{s−1}, y_{s−1}). The derived upper bound p̂_{m,n} should be a function of only m and n (i.e., it is independent of s) and should be with respect to the dataset described above. Justify your derivation analytically.
(e) [Challenge] Find a simple function f(n) which is a good lower bound on the sample complexity of the 1-nearest-neighbour algorithm for the ‘just a little bit’ problem. Prove m = Ω(f(n)). There is no partial credit for this problem.
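For part (a), a naive Monte-Carlo sketch (all names are our own; the exact average over test points and training sets is replaced by random sampling, and the linear scan over m is precisely the kind of accuracy/computation trade-off that part (b) asks you to discuss):

import numpy as np

def mean_error(train, predict, m, n, trials=20, test_size=2000, rng=None):
    # Estimate E[E(A_{S_m})]: average over random training sets, and replace
    # the exact sum over {-1,1}^n by a random test sample (labels = x_1).
    rng = rng or np.random.default_rng()
    errs = []
    for _ in range(trials):
        Xtr = rng.choice([-1, 1], size=(m, n))
        Xte = rng.choice([-1, 1], size=(test_size, n))
        model = train(Xtr, Xtr[:, 0])
        errs.append(np.mean(predict(model, Xte) != Xte[:, 0]))
    return np.mean(errs)

def sample_complexity(train, predict, n, **kw):
    # Smallest m whose estimated error is <= 10% (naive linear scan over m).
    m = 1
    while mean_error(train, predict, m, n, **kw) > 0.1:
        m += 1
    return m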
Notes
1. Winnow: Use {0,1}^n for the patterns and {0,1} for the labels. This follows our presentation in the notes.
2. Linear Regression:
(a) We use the regression vector w to define a classifier f_w(x) := sign(w⊤x).
3. Sample Complexity: For this problem, let S_m denote a set of m patterns drawn uniformly at random from {−1,1}^n with their derived labels. Then let A_S(x) denote the prediction of an algorithm A, trained from data sequence S, on pattern x. Thus the generalisation error is

E(A_S) := 2^{−n} Σ_{x∈{−1,1}^n} [A_S(x) ≠ x_1],

and thus the sample complexity on average at 10% generalisation error is

C(A) := min{ m ∈ {1,2,…} : E[E(A_{S_m})] ≤ 0.1 }
      = min{ m ∈ {1,2,…} : 2^{−nm} Σ_{S∈({−1,1}^n)^m} E(A_S) ≤ 0.1 }.
Note: Under the same guidance about reports as in Part I, the reporting for Part III (parts b and c) is normally expected to be less than one page.
