Introduction to Python for Data Science

Marion Neumann (m.neumann@wustl.edu) and Jonathan Chen (jonathan.chen@wustl.edu), Department of Computer Science & Engineering, Washington University in St. Louis

Overview

We provide an interactive guided lab that introduces Python for data science (DS). It consists of two Jupyter notebooks: the first introduces the basics of Python, and the second introduces the DS workflow using the Iris dataset. The first notebook interactively introduces expressions, variables, strings, printing, lists, dictionaries, control flow, and functions to students who are already familiar with a programming language from an introductory CS course. The second lab aims to motivate students to acquire skills such as using statistics to model and analyze data and designing and using algorithms to store, process, and visualize data, while not forgetting the importance of domain expertise. We begin by establishing the example problem to be studied, based on the Iris dataset. The next step is to acquire and process the data; here students practice loading data and converting strings into numeric arrays. Then, we explain different plotting methods for data exploration, such as box plots, histograms, and scatter plots. Finally, we split the data into training and test sets, build a model, use it for predictions, and evaluate the results. The main learning objectives are for students to get to know and practice Python in the context of data science and machine learning.
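The acquire, split, model, and evaluate steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the notebook's actual code: it loads scikit-learn's bundled copy of Iris instead of the data file provided with lab1, and it uses scikit-learn's KNeighborsClassifier as a stand-in for the lab's Model class. The split ratio, random seed, and number of neighbors are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use scikit-learn's bundled Iris copy rather than the lab's data file.
iris = load_iris()
X, y = iris.data, iris.target  # X: 150 x 4 measurements, y: 150 class labels

# Converting strings into a numeric array, as one would after reading a text line.
raw_line = "5.1,3.5,1.4,0.2"
features = np.array([float(s) for s in raw_line.split(",")])

# Hold out 30% of the samples for testing (ratio and seed chosen arbitrarily).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Stand-in model: k-nearest neighbors in place of the lab's black-box Model class.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict on the held-out set and report the fraction of correct predictions.
y_pred = model.predict(X_test)
accuracy = float(np.mean(y_pred == y_test))
print(f"test accuracy: {accuracy:.2f}")
```

Evaluating on a held-out test set, rather than on the training data, is what makes the reported accuracy an estimate of how the model behaves on unseen flowers.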

Meta Information

Summary

This is a sequence of two interactive guided labs (lab0 and lab1). The first lab is an introduction to the basics of Python and can be used as a refresher or omitted if students are already familiar with the Python programming language. The second and main lab introduces the data science workflow using the Iris dataset.

Topics

Introduction to Python. Introduction to data science and machine learning.

Audience

Early career undergraduate students in CS (major and minor). Familiarity with a high-level programming language (gained from an Introduction to Programming course) is required.

Difficulty

Introductory level. Lab0 is designed for students to complete in a 1-2 hour lab session or as a take-home. Lab1 is designed for students to complete in a 1-2 hour lab session with an instructor or teaching assistants present to provide help.

Strengths

Introduction to Python using a relevant and exciting field in CS. Easy-to-understand introduction that can serve as a precursor (or refresher) to courses related to data science, such as data mining, AI, or ML, or as a stand-alone activity to spark students' interest in AI/ML.

Weaknesses

The labs do not aim to provide a comprehensive introduction to Python. It is challenging to decide what to include for the sake of completeness and what to leave out for the sake of focusing on the essentials. Students with different background knowledge (CS majors vs. minors vs. students from other programs) will need different levels of this introduction.

Dependencies

Coding is done in Python 3.6 (using numpy, matplotlib, and scikit-learn). We recommend installing Python via the Anaconda distribution. The Iris data used in lab1 is provided.

Variants

Lab0 could be extended to provide an introduction to the basics of programming in general. Lab1 could be used as a more sophisticated introduction to machine learning by replacing the provided Model class (used purposely as a black box here) with an explicit implementation of an ML method (such as kNN or logistic regression).
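As an illustration of the Lab1 variant, an explicit k-nearest-neighbors (kNN) classifier might look like the sketch below. The class name KNNModel and its fit/predict interface are hypothetical; the interface of the lab's provided Model class is not described here, so this should not be read as a drop-in replacement. The small two-cluster dataset is made up for the example and is not Iris data.

```python
import numpy as np


class KNNModel:
    """Hypothetical kNN classifier; the name and interface are assumptions."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # kNN has no training step beyond memorizing the labeled samples.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        predictions = []
        for x in X:
            # Euclidean distance from x to every stored training sample.
            distances = np.linalg.norm(self.X - x, axis=1)
            # Indices of the k closest training samples.
            nearest = np.argsort(distances)[: self.k]
            # Majority vote among the labels of those neighbors.
            labels, counts = np.unique(self.y[nearest], return_counts=True)
            predictions.append(labels[np.argmax(counts)])
        return np.array(predictions)


# Made-up toy data: two well-separated clusters with labels 0 and 1.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

model = KNNModel(k=3).fit(X_train, y_train)
preds = model.predict([[0.5, 0.5], [5.5, 5.5]])
print(preds)  # [0 1]
```

Writing the distance computation and the majority vote by hand exposes the two ideas that a black-box model hides, at the cost of a longer lab.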

Instructions and Starter Code

We provide two Jupyter notebooks (lab0 and lab1), each combining instructions and starter code.

Supporting Materials

This assignment should be supported by 1-2 hours of lecture covering an introduction to data science and machine learning. Relevant slides and a worksheet of in-class activities that help students engage with the presented material are available on request.