Introducing the Data Science Workflow using Sentiment Analysis

Marion Neumann (m.neumann@wustl.edu) and Zac Christensen (zac.d.christensen@wustl.edu), Department of Computer Science & Engineering, Washington University in St. Louis

Overview

We provide an interactive guided lab with a follow-up homework assignment that introduces the basic data science workflow through sentiment analysis. The lab introduces the machinery using a given dataset of movie reviews, and the assignment highlights data acquisition and exploration. After introducing sentiment analysis, we explain a simple rule-based approach to predicting the sentiment of textual reviews using three handcrafted examples. This introduction shows how to preprocess text data and how to use lists of positive and negative expressions to compute a sentiment score. Students then implement the approach to predict the sentiment of movie reviews and evaluate the results. The lab concludes with a discussion of the limitations of the rule-based approach and a quick introduction to sentiment classification via machine learning. The homework assignment revisits the process of building and analyzing a sentiment predictor, with a focus on collecting and preprocessing a dataset that students scrape from Twitter using an API. The main learning objective of this activity is to get to know the inference problem and to walk through the entire data science workflow to tackle it. The module requires only minimal programming background and is an ideal precursor to introducing machine learning.
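
To make the rule-based approach concrete, the following is a minimal sketch of how preprocessing, scoring, and evaluation could fit together. It is not the lab's starter code: the word lists, the function names, the tie-breaking rule, and the "pos"/"neg" labels are all assumptions made for illustration.

```python
import re
import string

# Hypothetical word lists; the lists used in the lab are not reproduced here.
POSITIVE = {"good", "great", "fun", "love", "excellent"}
NEGATIVE = {"bad", "boring", "awful", "hate", "terrible"}


def preprocess(text):
    """Lowercase the text, replace punctuation with spaces, split into tokens."""
    text = text.lower()
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)
    return text.split()


def sentiment_score(text):
    """Count positive tokens minus negative tokens."""
    tokens = preprocess(text)
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return pos - neg


def predict(text):
    """Map the score to a label; a score of 0 counts as positive (arbitrary choice)."""
    return "pos" if sentiment_score(text) >= 0 else "neg"


def accuracy(reviews, labels):
    """Fraction of reviews whose predicted label matches the given label."""
    correct = sum(predict(r) == y for r, y in zip(reviews, labels))
    return correct / len(labels)
```

With real movie reviews, a sketch like this will misjudge negated or sarcastic phrases (for example "not good"), which is one of the limitations of the rule-based approach discussed at the end of the lab.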

Meta Information

Summary

This is an interactive guided lab with a follow-up homework assignment that introduces the basic data science workflow through sentiment analysis. The lab introduces the machinery using a given dataset of movie reviews, and the assignment highlights data acquisition and exploration.

Topics

Sentiment analysis: rule-based prediction, evaluation, data acquisition and exploration

Audience

Early-career undergraduate students in CS (majors and minors). Familiarity with a high-level programming language (gained from an Introduction to Programming course) is required.

Difficulty

Introductory level. The lab is designed for students to complete in a 1-2 hour lab session with an instructor or teaching assistants present to provide help. The assignment can be completed as (part of) a one-week assignment after completing the lab.

Strengths

The problem is well motivated and should be easy to grasp for students with no background in ML. Students will enjoy using an API to acquire, explore, and understand their own dataset.

Weaknesses

Some concepts and extensions are left unexplained. Text preprocessing could be covered in a more principled way, and an ML-based approach such as logistic regression could be applied as a follow-up to the lab to showcase the difference between the rule-based and the learning-based approach.

Dependencies

Coding is done in Python 3.6 (using string, re, os, zipfile, shutil) and makes use of a Twitter API (twitter, sys). Visualizations are generated using wordcloud and matplotlib (and numpy). We recommend installing Python via the Anaconda distribution and adding the wordcloud package (pip install wordcloud) and the Twitter API (pip install python-twitter). The movie review data (negative examples, positive examples) used in the lab and a static snapshot of the Twitter data used in the homework are provided. Note that scraping the Twitter data is part of the exercise; the provided snapshot should serve only as a backup if the Python Twitter API fails to work.

Variants

A more in-depth introduction to the ML-based approach to sentiment prediction (including a discussion of bag-of-words or TF-IDF feature generation) and a comparison of the two approaches could be added to the lab. More sophisticated text preprocessing and negation handling could also be applied.
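
As a rough illustration of what bag-of-words and TF-IDF feature generation could look like in such a variant, here is a small sketch using only the Python standard library. The function name, the toy documents, and the particular weighting (term frequency normalized by document length, unsmoothed logarithmic inverse document frequency) are assumptions; libraries such as scikit-learn use slightly different formulas.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs is a list of token lists. The result is one dict per document,
    mapping each term to tf * idf, where tf is the term count divided by
    the document length and idf is log(N / number of docs containing the term).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)  # the bag-of-words representation
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy example with two already-tokenized reviews (made up for illustration).
docs = [
    "a great fun movie".split(),
    "a boring movie".split(),
]
vectors = tf_idf(docs)
```

In this toy example, terms that occur in every document ("a", "movie") receive weight zero, while terms that distinguish the reviews ("great", "boring") receive positive weight. Vectors of this kind could then serve as input features for a classifier.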

Instructions and Starter Code

We provide two Jupyter notebooks, each combining instructions and starter code:

Supporting Materials

This assignment should be supported by 2-3 hours of lecture covering sentiment analysis, text processing, and an introduction to classification. Relevant slides and a worksheet with in-class activities that help students engage with the presented material are available on request. An example solution for the lab is also available on request.