Nate Derbinsky: https://derbinsky.info
Laney Strange: http://www.ccs.neu.edu/home/laney
College of Computer and Information Science
Northeastern University
Summary This assignment engages students in basic Machine-Learning concepts and implementation, including classification and similarity-based search, with minimal background knowledge. Students are tasked with implementing representative distance functions and applying them towards the task of classification across several small datasets.
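To make the summary concrete, here is a minimal sketch of two representative distance functions and a nearest-neighbor classifier built on them. It is not taken from the supplied handout: the function names (`euclidean_distance`, `manhattan_distance`, `nearest_neighbor`), the choice of these two distances, and the tiny dataset are all illustrative assumptions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length lists of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences ("city block" distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_neighbor(query, examples, distance):
    """Return the label of the example closest to `query`.

    `examples` is a list of (features, label) pairs; `distance` is any
    function that takes two feature lists and returns a number.
    """
    best_label = None
    best_distance = None
    for features, label in examples:
        d = distance(query, features)
        if best_distance is None or d < best_distance:
            best_distance = d
            best_label = label
    return best_label


# Hypothetical toy dataset: two features per example, one label each.
examples = [
    ([1.0, 1.0], "small"),
    ([2.0, 1.5], "small"),
    ([8.0, 9.0], "large"),
    ([9.0, 8.5], "large"),
]

print(nearest_neighbor([1.5, 1.0], examples, euclidean_distance))
print(nearest_neighbor([8.5, 9.0], examples, manhattan_distance))
```

Passing the distance function as an argument is one possible design; it lets students swap distance functions without touching the classification loop.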
Topics Similarity-based search, classification
Audience K-12/CS1 or AI/CS-Outreach
Difficulty Easy/moderate. Supplied homework would require 1 week to complete.
Strengths The primary strength of this project is exposure to real-world Machine-Learning concepts and a useful algorithm with little-to-no computational background. The presentation and type-along program assume no CS/AI/ML/programming background, with only minor arithmetic assumptions. The homework can be used within one month of a typical CS1 class (dependent upon instruction language), includes self-grading tests for fast feedback, and offers numerous possibilities for student extension/exploration. The accompanying practice problems prepare students for the syntactic and conceptual programming necessary to implement nearest-neighbor classification and associated distance functions. The materials have been successfully employed within an undergraduate CS1-level course, a graduate introductory CS course, and one outreach event to attract women of diverse backgrounds to study Computer Science.
Weaknesses Given the objective to minimize requisite background knowledge, the depth and extent of the assignment are quite limited. Supplied materials assume Python 3; nearly any language would work, though the rate of deployment in a class might slow.
Dependencies Knowledge:
  • CS1: Python (variables, expressions, lists, loops, conditionals, functions, interactive execution)
  • Outreach: nothing
Requirements:
  • Python 3
Variants Obvious extensions include file I/O for more interesting datasets, visualizing/explaining results automatically, and kNN (i.e., k > 1). For outreach, once similarity-based search is explained, similarity-based clustering is a natural next step.
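As a rough illustration of the kNN variant, the sketch below lets the k closest examples vote on the label. It is an assumption about how such an extension might look, not part of the supplied materials: `squared_distance`, `knn_classify`, the tie-breaking behavior, and the dataset (including one deliberately mislabeled point) are hypothetical.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance; ranks neighbors the same as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(query, examples, k=3):
    """Return the most common label among the k examples nearest to `query`.

    `examples` is a list of (features, label) pairs. If two labels tie,
    the one whose first vote came from the nearer example wins
    (an arbitrary choice made for this sketch).
    """
    ranked = sorted(examples, key=lambda ex: squared_distance(query, ex[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy dataset; the third example is deliberately mislabeled.
examples = [
    ([1.0, 1.0], "small"),
    ([2.0, 1.5], "small"),
    ([1.2, 1.1], "large"),
    ([8.0, 9.0], "large"),
    ([9.0, 8.5], "large"),
]

query = [1.2, 1.0]
print(knn_classify(query, examples, k=1))  # follows the single closest point
print(knn_classify(query, examples, k=3))  # majority of the three closest
```

With k=1 the mislabeled point decides the answer, while k=3 outvotes it, which is one way to motivate the extension to students.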
Acknowledgments We would like to acknowledge and express our gratitude to Byron Wallace for valuable discussion on the assignment, contributions to the handout, and support in evaluating the assignment within an introductory data science course.

Assignment Components