Model AI Assignment - Clustering
Assignment created and tested by:
Todd Neller, Gettysburg College
Laura E. Brown, Michigan Technological University
Assignment Summary
Summary | An Introduction to k-means Clustering - students learn the practical basics of k-means clustering experientially through programming, use of common data mining tools, on-line demo apps, and observation.
Topics | unsupervised learning, clustering problem, k-means clustering, local minima, elbow method, gap statistic, feature selection, k-medoids clustering
Audience | Introductory Artificial Intelligence (AI) students |
Difficulty | Suitable for an upper-level undergraduate or graduate course with a focus on AI programming |
Strengths | Multifaceted approach allows students to learn both through implementation and through application of the Weka data mining tool. A variety of instructor resources are made available. |
Weaknesses | Difficulty of characterizing when k-means clustering "works" |
Dependencies | Students must be mathematically literate and able to comprehend pseudocode with mathematical notation. Students must also be competent programmers in a fast, high-level programming language. |
Variants | Instructors may easily incorporate new data sets. Different means for choosing the number of clusters, k, may be taught. Other clustering algorithms such as hierarchical approaches are a natural extension. |
Assignment
Student Materials
All assignment materials are available in the zip-file, which contains the following:
Exercises.docx
Exercises.pdf
*.dat # data files for Exercises
gap.pdf # reference paper for Gap statistic question
data2d.m # Octave script for visualizing input data
cluster2d.m # Octave script for visualizing output cluster data
doc/ # Javadoc documentation for sample solutions
iris/ # folder containing Weka tutorial, data, and Exercises
ppt/ # Powerpoint and PDF presentations for k-means
java/ # Java version of programming exercises with starter code
As part of this assignment, you will run your k-means algorithm on several sample data sets. The data files are indicated with the .dat extension, e.g., hardgaussian{1-9}.dat, easygaussian{1-9}.dat, etc. All the data files use the following format.
Data Set Input Format
The first two lines of input data are:
% {number1} dimensions
% {number2} points
Each of the following number2 lines has number1 space-separated numbers, where number1 and number2 are positive integers.
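For example, a (hypothetical) 2-dimensional data file containing 3 points would look like:
% 2 dimensions
% 3 points
1.0 2.0
3.5 0.25
-1.0 4.2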
A MATLAB or Octave script, data2d.m, is available to visualize this 2D input data.
Clustered Output Format
On the standard output, the first four lines of output cluster data are:
% {number1} dimensions
% {number2} points
% {number3} clusters/centroids
% {number4} within-cluster sum of squares
Each of the next number3 + number2 lines is space-separated and contains the coordinates of a point in number1 dimensions, preceded by that point's associated 0-based cluster number. In the first number3 of these lines, the coordinates are those of the centroids, ordered by ascending centroid cluster number. In the remaining number2 lines, the coordinates are those of the original data points, in the same order as the input. The value number4 is the computed within-cluster sum of squares (WCSS).
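Continuing the hypothetical 3-point input example above, one possible clustering output for k = 2 (the exact values depend on the random initialization) would be:
% 2 dimensions
% 3 points
% 2 clusters/centroids
% 4.65625 within-cluster sum of squares
0 2.25 1.125
1 -1.0 4.2
0 1.0 2.0
0 3.5 0.25
1 -1.0 4.2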
A MATLAB or Octave script, cluster2d.m, is available to visualize the 2D output data.
k-means Programming Exercises
The assignment [PDF] [Word] has three main components in which you develop programs that perform k-means clustering using the Standard Algorithm with Forgy initialization.
k-means with given k
Create a program that:
- takes the number of clusters k to compute as a command line parameter,
- reads input data of the form above from the standard input,
- computes k clusters using Forgy initialization and the standard k-means algorithm, and
- prints output cluster data of the form above on the standard output.
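For orientation, here is a minimal, self-contained sketch of the core computation, assuming the data have already been read into a double[][] array. The class and method names are illustrative (not part of the provided starter code), and reading the input format, computing the WCSS, and printing the output format are left to you; the tiny data set in main() is hard-coded only so the sketch runs on its own.

import java.util.Arrays;
import java.util.Random;

// Sketch of Forgy initialization plus the standard (Lloyd's) k-means loop.
public class KMeansSketch {

    public static void main(String[] args) {
        // Hard-coded toy data; your program should instead parse the input format above.
        double[][] data = { {1, 1}, {1.2, 0.9}, {0.9, 1.1}, {5, 5}, {5.1, 4.8}, {4.9, 5.2} };
        double[][] centroids = cluster(data, 2, new Random());
        System.out.println(Arrays.deepToString(centroids));
    }

    public static double[][] cluster(double[][] data, int k, Random rng) {
        int n = data.length, d = data[0].length;

        // Forgy initialization: choose k distinct data points as the initial centroids.
        int[] chosen = rng.ints(0, n).distinct().limit(k).toArray();
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = data[chosen[c]].clone();

        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);               // no point has a cluster yet
        boolean changed = true;
        while (changed) {                          // repeat until assignments stop changing
            changed = false;

            // Assignment step: attach each point to its nearest centroid
            // (squared Euclidean distance).
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = data[i][j] - centroids[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }

            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; j++) sums[assignment[i]][j] += data[i][j];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
        }
        return centroids;
    }
}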
The program should be invoked from the command line:
{your program} {k} < {input file} > {output file}
# Java example
java KMeans 2 < bullseye2.dat > bullseye2-out.dat
You can examine the first 4 lines of the output with the UNIX head command, e.g.:
head -4 bullseye2-out.dat
You can visualize the 2D input/output data files using MATLAB or Octave.
- Copy the files data2d.m and cluster2d.m into the working directory with your data
- Start Octave or MATLAB and cd to change to your working directory
- Visualize the input: data2d('<filename>')
- Visualize the output: cluster2d('<filename>')
You will answer several questions related to the k-means algorithm and how the method performs on the provided data.
Iterated k-means with given k
Create an augmented version of the first program that performs 10 independent runs of the k-means algorithm and outputs only the clustering result with the lowest WCSS value.
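A rough sketch of the outer loop only, assuming the single-run algorithm from the previous exercise has been refactored into a method that can be passed in (the class, record, and method names below are illustrative, not part of the provided starter code):

import java.util.function.BiFunction;

// Sketch of iterated k-means: run the single-run algorithm several times with
// independent Forgy initializations and keep only the lowest-WCSS result.
public class IteratedKMeansSketch {

    // One clustering result: per-point cluster assignments, centroids, and the run's WCSS.
    public record Run(int[] assignment, double[][] centroids, double wcss) {}

    public static Run bestOf(double[][] data, int k, int runs,
                             BiFunction<double[][], Integer, Run> singleRun) {
        Run best = null;
        for (int r = 0; r < runs; r++) {
            Run candidate = singleRun.apply(data, k);   // fresh random initialization each run
            if (best == null || candidate.wcss() < best.wcss()) {
                best = candidate;                       // keep the clustering with the lowest WCSS
            }
        }
        return best;   // print this run in the output format described above
    }
}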
You will answer questions on whether and why the iterated k-means helps improve the quality of the output clusters.
k-means using the Gap statistic to choose k
One very difficult problem concerns how one can automatically determine the number of clusters in a data set. There are many, many suggested techniques:
A simple technique is to plot the lowest WCSS found for each k and look for an "elbow", that is, a k after which further reductions in WCSS are not significant as k increases.
A difficulty with the Elbow Method is that the most significant elbow bend is not always easily discernible from this graph. What one would like is some means of comparing the expected WCSS on uniformly-distributed random data with the WCSS obtained on the given data, to see which k yields the most significant gap. This measure is known as the Gap Statistic, described in the reference paper (gap.pdf) included in the project zip-file.
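Concretely, for a given k the statistic compares the log of the observed WCSS with its expectation under a uniform reference distribution over the data's bounding box:

Gap(k) = E*[ log(W_k) ] - log(W_k)

where W_k is the WCSS obtained with k clusters and E* is estimated by averaging log(W_k) over randomly generated uniform reference data sets, as in step (b) of the program below.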
Create a program that
- takes the maximum number of clusters k to compute as a command line parameter,
- reads the input data of the form above from the standard input,
- for each k from 1 to the maximum given at the command line:
  (a) computes the log of the minimum WCSS from 10 iterations of k-means clustering as in the previous exercise,
  (b) generates 100 data sets with an equal number of data points randomly distributed within the [minimum, maximum] range of each dimension of the given data, runs the k-means algorithm on each of these data sets, and computes the average of their log WCSS values, and
  (c) computes the gap statistic for this k by subtracting the value of (a) above from the value of (b); and
- for the k yielding the maximum gap statistic, prints the minimum-WCSS output cluster data of the form above on the standard output.
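A rough sketch of this loop, assuming a method minWcss(data, k) that returns the lowest WCSS over 10 runs of your iterated k-means from the previous exercise (all names are illustrative, and the simple maximum-gap selection rule from this exercise is used rather than the standard-error rule described in the paper):

import java.util.Arrays;
import java.util.Random;
import java.util.function.BiFunction;

// Sketch of choosing k with the gap statistic; minWcss(data, k) is assumed to
// return the lowest WCSS over 10 iterated k-means runs (previous exercise).
public class GapStatisticSketch {

    public static int chooseK(double[][] data, int kMax, int refSets,
                              BiFunction<double[][], Integer, Double> minWcss, Random rng) {
        int n = data.length, d = data[0].length;

        // [minimum, maximum] range of each dimension, used to generate the uniform reference data.
        double[] lo = new double[d], hi = new double[d];
        Arrays.fill(lo, Double.POSITIVE_INFINITY);
        Arrays.fill(hi, Double.NEGATIVE_INFINITY);
        for (double[] p : data)
            for (int j = 0; j < d; j++) {
                lo[j] = Math.min(lo[j], p[j]);
                hi[j] = Math.max(hi[j], p[j]);
            }

        int bestK = 1;
        double bestGap = Double.NEGATIVE_INFINITY;
        for (int k = 1; k <= kMax; k++) {
            double logW = Math.log(minWcss.apply(data, k));            // step (a)

            double expectedLogW = 0.0;                                 // step (b)
            for (int b = 0; b < refSets; b++) {                        // refSets = 100 in the exercise
                double[][] ref = new double[n][d];
                for (double[] p : ref)
                    for (int j = 0; j < d; j++)
                        p[j] = lo[j] + rng.nextDouble() * (hi[j] - lo[j]);
                expectedLogW += Math.log(minWcss.apply(ref, k));
            }
            expectedLogW /= refSets;

            double gap = expectedLogW - logW;                          // step (c): (b) minus (a)
            if (gap > bestGap) { bestGap = gap; bestK = k; }
        }
        return bestK;   // rerun iterated k-means with this k and print that clustering
    }
}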