## Description

Clustering

Antoni Chan

In this programming assignment you will implement and test several clustering algorithms on both synthetic and real data. Given a set of points X = {x1,··· ,xn}, where xi ∈ Rd, the goal is to assign a cluster label to each point, yi ∈ {1,··· ,K}, where K is the number of clusters. In this programming assignment, you will consider three clustering algorithms:

1. K-means algorithm – The K-means algorithm calculates the cluster centers µj using the current assignment to each cluster (K is assumed to be known). In each iteration, K-means performs the following two steps:

( 2

1, j = argmink∈{1,···,K} kxi − µkk

Cluster Assignment : zij = (1)

0, otherwise.

Estimate Center : . (2)

The cluster label for point xi is the label of the closest cluster center, yi = argmaxj zij.

2. EM algorithm for Gaussian mixture models (EM-GMM) – The EM algorithm calculates a maximum likelihood estimate of a GMM with K components, with parameters is also assumed to be known. The E- and M-steps are:

E − step : ˆ . (3)

M − step :, (4)

(5)

After convergence, the cluster label for xi is the component with the largest posterior probability, yi = argmaxj zˆij. The document “gmm-tips”, available on the course website, contains some useful tips on implementing EM for GMMs.

3. Mean-shift algorithm – The mean-shift algorithm uses gradient ascent with an adaptive step-size to find local peaks in a kernel density estimate of X. We will use a Gaussian kernel with bandwidth h. Given an initial point ˆx(0) (where the superscript denotes the iteration number), the mean-shift algorithm update rule is:

. (6)

To obtain the clustering, the mean-shift algorithm is run using each data point xi as an initial point. After convergence, the points that converged to the same local peak are given the same cluster label. In this case, the number of clusters K is not fixed (and depends on the selected bandwidth h).

Problem 1 Clustering synthetic data

In the first problem, you will test these clustering methods on synthetic data, and examine how each method performs on different configurations of data. The zip file PA2-cluster-data.zip contains the synthetic data files. Inside the MATLAB data file cluster data.mat (or cluster data *.txt if you are not using MATLAB) are three data sets (points and ground-truth labels). The file contains the following variables:

• dataA X – dataset A points. Each column is a data point, xi ∈ R2.

• dataA Y – dataset A true labels, {yi}.

• dataB X – dataset B points.

• dataB Y – dataset B true labels.

• dataC X – dataset C points.

• dataC Y – dataset C true labels.

The three datasets are shown in the following figure, where the color represents the ground-truth label.

The goal is to discover the clusters in each dataset using the data points X.

(a) Implement the three clustering algorithms. You will reuse these algorithms in the next problem,so try to make them as general as possible.

(b) Run the algorithms on the three synthetic datasets. Qualitatively, how does each clusteringalgorithm perform on each dataset? Comment on the advantages and limitations of each algorithm, in terms of the configuration of the data.

(c) How sensitive is mean-shift to the bandwidth parameter h?

………

Problem 2 A real world clustering problem – image segmentation

In this problem we will consider a real world clustering problem, image segmentation, where the goal is to find homogeneous regions in the image, which hopefully belong to objects or object parts. Below is one of the test images and an example segmentation using K-means (In the right image, each segment is colored based on the average color within the segment).

original segments segments (colored)

505050

100100100

150150150

200200200

250250250

300300300

100 200 300 400 100 200 300 400 100 200 300 400

To segment the image, we first extract feature vectors from each pixel in the image (or a subset of pixels along a regular grid). In particular, we form a window centered around each pixel and extract a four-dimensional feature vector, , where (u,v) are the average chrominance values (color without brightness) in the window, and (cx,cy) is the pixel location (center of the window in x-y-coordinates). The feature values for the example image are shown below (each subplot represents one feature dimension over the entire image).

x x

100 200 300 400 100 200 300 400

x

100 200 300 400 100 200 300 400

The zip file PA2-cluster-images.zip contains the images and MATLAB code for extracting the features, and creating the segmentation image from the cluster labels. The following MATLAB functions are provided:

• getfeatures.m – MATLAB function for extracting features from an image.

• labels2segm.m – MATLAB function for generating a segmentation image from the cluster labels.

• colorsegm.m – MATLAB function to create a nicely colored segmentation image.

As an example, the following code was used to generate the above segmentation images:

% read the image and view it img = imread(’images/12003.jpg’); subplot(1,3,1); imagesc(img); axis image;

% extract features (stepsize = 7)

[X, L] = getfeatures(img, 7);

XX = [X(1:2,:) ; X(3:4,:)/10]; % downscale the coordinate features (see part (b))

% run kmeans — this is the MATLAB version. You have to write your own!

[Y, C] = kmeans(XX’, 2);

% make a segmentation image from the labels segm = labels2segm(Y, L); subplot(1,3,2); imagesc(segm); axis image;

% color the segmentation image csegm = colorsegm(segm, img); subplot(1,3,3); imagesc(csegm); axis image

Documentation for each function can be viewed in MATLAB using “help getfeatures”, etc.

For Python users, the file PA2-cluster-python.zip contains Python versions of the above demo code and helper functions for extracting features. This Python code requires the numpy, scipy, matplotlib, and Image modules. Alternatively, the file PA2-cluster-imfeatures-txt.zip contains a text files of the extracted image features for a step size of 7.

(a) Use the three clustering algorithms to segment a few of the provided images. Qualitatively,which algorithm gives better results? How do the results change with different K and h? Which is less sensitive to changes in the parameters? Comment on any interesting properties or limitations observed about the clustering algorithms.

The feature vector x contains two types of features that are on different scales; the chrominance values (u,v) range from 0-255, while the pixel locations (cx,cy) range from 0-512. The EM-GMM clustering algorithm can adapt to the different scales of each feature by changing the covariance matrix. On the other hand, K-means and mean-shift assume a fixed isotropic covariance matrix. We can modify these two algorithms to allow different scaling of the features:

• For K-means, the distance to a cluster center x0 is modified to apply a weighting between the feature types,

, (7)

where λ ≥ 0 controls the weighting.

• For mean-shift, the kernel is modified to use separate bandwidths on each feature type,

, (8)

where hc is the bandwidth for the color features, and hp is the bandwidth for the pixel locations.

(b) Modify your K-means and mean-shift implementations to allow different feature scaling. Hint: changing the distance in (7) or kernel in (8) is equivalent to scaling each dimension in the feature vector x by an appropriate amount. Rerun the segmentation experiment. Do the segmentation results improve?

………

Problem 3 Quantitative Evaluation of Clustering (Optional)

In this optional problem, you will quantitatively evaluate your results from Problems 1 and 2. Consider a set S = {s1,··· ,sn} with n elements, and two partitions of S into subsets, Y = {Y1,··· ,YR} and Z = {Z1,··· ,ZC}, where Yi are the subsets of Y, and similarly for Zi and Z.

The Rand index can be used to quantitatively measure the amount of agreement between the two clusterings of the set S. Intuitively, the Rand index corresponds to the probability of pair-wise agreement between the Y and Z, i.e. the probability that the assignment of any two items will be correct with respect to each other (in the same cluster, or in different clusters). Formally, the Rand index is calculated as follows. Define the following counts:

• a – the number of pairs in S that are in the same set in Y and in the same set in Z,

• b – the number of pairs in S that are in different sets in Y and in different sets in Z, • c – the number of pairs in S that are in the same set in Y and in different sets in Z,

• d – the number of pairs in S that are in different sets in Y and in the same set in Z, then the Rand index is the fraction of pairwise agreements

. (9)

The Rand index can be efficiently calculated from a contingency table:

Partition Z

Class Z1 Z2 … ZC Sums

Y1 n11 n12 … n1C r1

Y2 n21 n22 … n2C r2

… … … … …

YR nR1 nR2 … nRC rR

Sums c1 c2 … cC n

Partition Y

Each entry nij is the number of items in S that are common to Yi and Zj, and cj and ri are the sums over the jth column and ith row, respectively. Using the contingency table, the number of agreements (the numerator in (9)) can be calculated as

. (10)

(a) Problem 1: Use the Rand index to evaluate the your clusterings with the ground-truth clusters.Which algorithm performs best overall, and for each individual dataset? How do the results change with the hyperparameter value (K or h)? (e.g., make a plot)

References for the Rand index are available here:

• W. M. Rand. “Objective criteria for the evaluation of clustering methods”. Journal of the American Statistical Association, 66 (336): 846-850, 1971.

• Lawrence Hubert and Phipps Arabie. “Comparing partitions”. Journal of Classification, 2 (1): 193218, 1985.

………

• What to hand in – You need to turn in the following things:

1. Answers, plots, analysis, discussion, etc. for the above problems.

2. Source code files.

You must submit your programming assignment using the Canvas website. Go to “Assignments” ⇒ “Programming Assignment 2” and submit the assignment.

• Grading – The marks for this assignment will be distributed as follows:

– 20% – Implementation of the K-means, EM, and mean-shift algorithm (1a).

– 20% – Results on toy datasets (1b).

– 10% – Sensitivity to parameters (1c).

– 20% – Results on image segmentation and sensitivity to parameters (2a).

– 10% – Results using feature scaling (2b).

– 20% – Quality of the written report. More points for insightful observations and analysis.

Note: if you really cannot implement the algorithms correctly, it is okay to use 3rd party software. You will not get marks for implementation, but you can still get marks for presenting the results. If you use a 3rd party implementation, you must acknowledge the source in your report.

• Optional Problems – The optional problem 3 will not be graded.

## Reviews

There are no reviews yet.