Model AI Assignment - Clustering

Assignment created and tested by:
Todd Neller, Gettysburg College
Laura E. Brown, Michigan Technological University

Assignment Summary

Summary	An Introduction to k-means Clustering - students learn the practical basics of k-means clustering experientially through programming, use of common data mining tools, on-line demo apps, and observation.
Topics	unsupervised learning, clustering problem, k-means clustering, local minima, elbow methods, gap statistic, feature selection, k-medoids clustering
Audience	Introductory Artificial Intelligence (AI) students
Difficulty	Suitable for an upper-level undergraduate or graduate course with a focus on AI programming
Strengths	Multifaceted approach allows students to learn both through implementation and through application of the Weka data mining tool. A variety of instructor resources are made available.
Weaknesses	Difficulty of characterizing when k-means clustering "works"
Dependencies	Students must be mathematically literate and able to comprehend pseudocode with mathematical notation. Students must also be competent programmers in a fast, high-level programming language.
Variants	Instructors may easily incorporate new data sets. Different means for choosing the number of clusters, k, may be taught. Other clustering algorithms such as hierarchical approaches are a natural extension.

Assignment

Student Materials

All assignment materials are available in the zip-file, with the following materials.

Exercises.docx
Exercises.pdf   
*.dat               # data files for Exercises
gap.pdf             # reference paper for Gap statistic question
data2d.m            # Octave script for visualizing input data 
cluster2d.m         # Octave script for visualizing output cluster data
doc/                # Javadoc documentation for sample solutions
iris/               # folder containing Weka tutorial, data, and Exercises
ppt/                # Powerpoint and PDF presentations for k-means
java/               # Java version of programming exercises with starter code

As part of this assignment, you will run your k-means algorithm on several sample data sets. The data files are indicated with the .dat extension, e.g., hardgaussian{1-9}.dat, easyguassian{1-9}.dat, etc. All the data files use the following format.

Data Set Input Format

The first two lines of input data are:

% {number1} dimensions
% {number2} points

Each of the number2 lines following has number1 space-separated numbers, where number2 and number1 are positive integers.

A MATLAB or Octave script data2d.m is available to visualize this 2D input data.

Clustered Output Format

On the standard output, the first four lines of output cluster data are:

% {number1} dimensions
% {number2} points
% {number3} clusters/centroids 
% {number4} within-cluster sum of squares

Each of the next number3 + number2 lines is space-separated and contains point coordinates of number1 dimensions preceded by that point's associated 0-based cluster number.

In the first number3 lines following the 4 lines above, the coordinates are of the centroids, ordered by ascending centroid cluster number.

In the next number2 lines, the coordinates are from the original data in the same order.

The value number4 is the computed within cluster sum of squares (WCSS).

A MATLAB or Octave script cluster2d.m is available to visualize the 2D output data.

k-means Programming Exercises

The assignment [PDF] [Word] has 3 main components to develop a method to perform k-means clustering using the Standard Algorithm with Forgy initialization.

k-means with given k

Create a program that:

takes the number of clusters k to compute as a command line parameter,
reads input data of the form above from the standard input,
computes k clusters using Forgy initialization and the standard k-means algorithm, and
prints output cluster data of the form above on the standard output.

The program should be invoked from the command line:

{your program} {k} < {input file} > {output file}
# Java example
java KMeans 2 < bullseye2.dat > bullseye2-out.dat

You can examine the first 4 lines of the output, with the UNIX head command, e.g., :

head -4 bullseye2-out.dat

You can visualize the 2D input/output data files using MATLAB or Octave.

Copy the files data2d.m and cluster2d.m into the working directory with your data
Start Octave or MATLAB and cd to change to your working directory
Visualize the input: data2d('<filename>')
Visualize the output: cluster2d('<filename>')

You will answer several questions related to the k-means algorithm and how the method performs on the provided data.

Iterated k-means with given k

Create an augmented version of the first program that performs 10 independent runs of the k-means algorithm and outputs only the clustering result with the lowest WCSS value.

You will answer questions on whether and why the iterated k-means helps improve the quality of the output clusters.

k-means using the Gap statistics to choose k

One very difficult problem concerns how one can automatically determine the number of clusters in a data set. There are many, many suggested techniques:

A simple technique is to plot the lowest WCSS found for each k and look for an "elbow", that is a k after which reductions in WCSS are not significant as k increases.

A difficulty with the Elbow Method, is that the most significant elbow bend is not always easily discernable from this graph. What one would like is some means of comparing the gap between the expected WCSS on uniformly-distributed random data and the WCSS obtained on the given data to see which k yields the most significant gap. This measure is what is known as the Gap Statistic described here and included in the project zip-file.

Create a program that

takes the maximum number of clusters k to compute as a command line parameter,
reads the input data of the form above from the standard input,
for each k from 1 to the maximum given at the command line:

(a) computes the log of the minimum WCSS form 10 iterations of k-means clustering as in the previous exercise,
(b) generates 100 data sets of an equal number of data points randomly distributes within the [minimum, maximum] range for each dimension of the given data, computes the k-means algorithm for each data set, and computes the average log of the WCSS for each data set, and
(c) computes the gap statistic for this k by subracting the value of (a) above from the value of (b); and,
for the k yielding the maximum gap statistic, prints the minimum WCSS output cluster data of the form above on the standard output.