Model AI Assignment  Clustering
Assignment created and tested by:
Todd Neller, Gettysburg College
Laura E. Brown, Michigan Technological University
Assignment Summary
Summary  An Introduction to kmeans Clustering  students learn the practical basics of kmeans clustering experientially through programming, use of common data mining tools, online demo apps, and observation. 
Topics  unsupervised learning, clustering problem, kmeans clustering, local minima, elbow methods, gap statistic, feature selection, kmedoids clustering 
Audience  Introductory Artificial Intelligence (AI) students 
Difficulty  Suitable for an upperlevel undergraduate or graduate course with a focus on AI programming 
Strengths  Multifaceted approach allows students to learn both through implementation and through application of the Weka data mining tool. A variety of instructor resources are made available. 
Weaknesses  Difficulty of characterizing when kmeans clustering "works" 
Dependencies  Students must be mathematically literate and able to comprehend pseudocode with mathematical notation. Students must also be competent programmers in a fast, highlevel programming language. 
Variants  Instructors may easily incorporate new data sets. Different means for choosing the number of clusters, k, may be taught. Other clustering algorithms such as hierarchical approaches are a natural extension. 
Assignment
Student Materials
All assignment materials are available in the zipfile, with the following materials.
Exercises.docx
Exercises.pdf
*.dat # data files for Exercises
gap.pdf # reference paper for Gap statistic question
data2d.m # Octave script for visualizing input data
cluster2d.m # Octave script for visualizing output cluster data
doc/ # Javadoc documentation for sample solutions
iris/ # folder containing Weka tutorial, data, and Exercises
ppt/ # Powerpoint and PDF presentations for kmeans
java/ # Java version of programming exercises with starter code
As part of this assignment, you will run your kmeans algorithm on several sample data sets. The data files are indicated with the .dat
extension, e.g., hardgaussian{19}.dat
, easyguassian{19}.dat
, etc. All the data files use the following format.
Data Set Input Format
The first two lines of input data are:
% {number1} dimensions
% {number2} points
Each of the number2
lines following has number1
spaceseparated numbers, where number2
and number1
are positive integers.
A MATLAB or Octave script data2d.m
is available to visualize this 2D input data.
Clustered Output Format
On the standard output, the first four lines of output cluster data are:
% {number1} dimensions
% {number2} points
% {number3} clusters/centroids
% {number4} withincluster sum of squares
Each of the next number3
+ number2
lines is spaceseparated and contains point coordinates of number1
dimensions preceded by that point's associated 0based cluster number.
In the first number3
lines following the 4 lines above, the coordinates are of the centroids, ordered by ascending centroid cluster number.
In the next number2
lines, the coordinates are from the original data in the same order.
The value number4
is the computed within cluster sum of squares (WCSS).
A MATLAB or Octave script cluster2d.m
is available to visualize the 2D output data.
kmeans Programming Exercises
The assigment [PDF] [Word] has 3 main components to develop a method to perform kmeans clustering using the Standard Algorithm with Forgy initialization.
kmeans with given k
Create a program that:
 takes the number of clusters k to compute as a command line parameter,
 reads input data of the form above from the standard input,
 computes k clusters using Forgy initialization and the standard kmeans algorithm, and
 prints output cluster data of the form above on the standard output.
The program should be invoked from the command line:
{your program} {k} < {input file} > {output file}
# Java example
java KMeans 2 < bullseye2.dat > bullseye2out.dat
You can examine the first 4 lines of the output, with the UNIX head
command, e.g., :
head 4 bullseye2out.dat
You can visualize the 2D input/output data files using MATLAB or Octave.
 Copy the files
data2d.m
andcluster2d.m
into the working directory with your data  Start Octave or MATLAB and
cd
to change to your working directory  Visualize the input:
data2d('<filename>')
 Visualize the output:
cluster2d('<filename>')
You will answer several questions related to the kmeans algorithm and how the method performs on the provided data.
Iterated kmeans with given k
Create an augmented version of the first program that performs 10 independent runs of the kmeans algorithm and outputs only the clustering result with the lowest WCSS value.
You will answer questions on whether and why the iterated kmeans helps improve the quality of the output clusters.
kmeans using the Gap statistics to choose k
One very difficult problem concerns how one can automatically determine the number of clusters in a data set. There are many, many suggested techniques:
A simple technique is to plot the lowest WCSS found for each k and look for an "elbow", that is a k after which reductions in WCSS are not significant as k increases.
A difficulty with the Elbow Method, is that the most significant elbow bend is not always easily discernable from this graph. What one would like is some means of comparing the gap between the expected WCSS on uniformlydistributed random data and the WCSS obtained on the given data to see which k yields the most significant gap. This measure is what is known as the Gap Statistic described here and included in the project zipfile.
Create a program that
 takes the maximum number of clusters k to compute as a command line parameter,
 reads the input data of the form above from the standard input,

for each k from 1 to the maximum given at the command line:
(a) computes the log of the minimum WCSS form 10 iterations of kmeans clustering as in the previous exercise,
(b) generates 100 data sets of an equal number of data points randomly distributes within the [minimum, maximum] range for each dimension of the given data, computes the kmeans algorithm for each data set, and computes the average log of the WCSS for each data set, and
(c) computes the gap statistic for this k by subracting the value of (a) above from the value of (b); and, 
for the k yielding the maximum gap statistic, prints the minimum WCSS output cluster data of the form above on the standard output.