Building a Fake News Detector

Michael Guerzhoy (Princeton University, University of Toronto, and the Li Ka Shing Knowledge Institute, St. Michael's Hospital, guerzhoy@princeton.edu) and Lisa Zhang (University of Toronto, lczhang@cs.toronto.edu)

Overview

In this assignment, students classify news headlines as "real" or "fake." Students build and compare several standard classifiers: Naive Bayes, Logistic Regression, and a Decision Tree. In addition, students have the option of training their own custom classifier and entering it into a classwide fake news detection competition.

Fake news, and efforts to combat it, have recently come to prominence. Students are excited to work on this problem. Interested students are encouraged to augment the training set with data collected from the internet. These students will quickly see the challenges of applying machine learning to real-world problems: labelling news items as "real" or "fake" is an inherently subjective task, determining which features to use can be an art as well as a science, and different trained classifiers can offer conflicting interpretations of which features are important.

Our pedagogical approach emphasizes having students analyze the models that they build. In particular, we ask students to obtain keywords whose presence or absence most strongly indicates that a headline is "fake" or "real". Students also explore the connection between Naive Bayes and Logistic Regression.

Our fake news detection competition allows interested students to explore a variety of NLP and Machine Learning algorithms, and to collect their own datasets in order to improve their performance in the competition. Our secret test set is collected and curated manually. We chose headlines that are recent, and that therefore would not appear in any training set that we provide or that students collect (details are available to interested instructors upon request). In our experience, the students who took the most sophisticated approaches performed best in the competition.

We used this assignment in an introductory machine learning class. However, introductory AI classes that cover spam filtering can adopt this assignment as well.

Meta Information

Summary

Students use Machine Learning to classify news headlines as "real" or "fake". Students build and compare several standard classifiers. A classwide competition is held for the best fake news detector.

Topics

Text classification, Naive Bayes, Logistic Regression, Decision Trees, (optionally) ensemble methods, (optionally) advanced Natural Language Processing techniques.

Audience

Third and fourth year students in Intro Machine Learning classes or Intro to Artificial Intelligence classes.

Difficulty

This is a fairly easy assignment for an undergraduate machine learning course. Some students spent a lot of time on their entry into the competition.

Strengths
  • Relevance. Students enjoy working on a problem that they've read about in the (real) news.
  • Exploration. There are many ways to improve on the performance of standard classifiers on the fake news detection task, making it a natural candidate for an in-class competition. Students enjoy trying to improve their systems in a variety of ways, and more sophisticated systems do win the in-class competition. Students tried out methods that were discussed throughout the semester.
  • Working with real data. The competition gives students a more realistic feel for working with messy data. Students need to figure out how to label the extra data that they collect and what data to use for training. Students need to decide how to compare multiple classifiers and how to collect training and validation sets in order to perform well on an unseen test set. Just like in real-life scenarios, students don't find out that they're overfitting until it's too late.
  • Interpretability. Students explore which keywords influence the output by analyzing the coefficients of the trained classifiers. This gives them practice interpreting model parameters rather than treating classifiers as black boxes.
  • The fun factor. Students liked building fake news detectors. For example, some students ran our announcement "headlines" through the systems that they built.
Weaknesses
  • There is no good large-scale fake news dataset. Many of the headlines labelled "fake" merely come from websites that publish opinion pieces, or are hyperbolic rather than outright false.
  • We chose to collect and curate a small test set for the competition. This was quite labor-intensive. Ideally, a new test set would be collected after the deadline for competition entries.
  • The Naive Bayes classifier works frustratingly well on the task. It is certainly possible to do better than NB, but many students struggled to do so.
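For context, the Naive Bayes baseline takes only a few lines with scikit-learn. This is a sketch on hypothetical toy data; in the assignment, students train on the provided corpus (and may implement NB themselves rather than use a library).

```python
# Sketch of a Naive Bayes headline baseline with scikit-learn.
# The toy headlines below are hypothetical placeholders for the
# assignment's real training and validation sets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_headlines = [
    "president signs trade agreement",        # "real" (1)
    "markets close higher on friday",         # "real"
    "celebrity clone spotted at rally",       # "fake" (0)
    "moon landing staged again say sources",  # "fake"
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer(binary=True)
X_train = vec.fit_transform(train_headlines)
nb = MultinomialNB().fit(X_train, train_labels)

# Evaluate on a (hypothetical) held-out validation set.
val_headlines = ["markets close higher", "clone spotted at rally"]
val_labels = [1, 0]
X_val = vec.transform(val_headlines)
print("validation accuracy:", nb.score(X_val, val_labels))
```

The brevity of this baseline is part of the weakness noted above: beating it requires meaningfully more effort than building it.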
Dependencies

Python, Logistic Regression, Naive Bayes, and Decision Trees. Some students were technically savvy enough to scrape their own datasets from the web, and to use packages such as scikit-learn, pandas, and spaCy.
Variants
  • We chose to ask students to use scikit-learn to train a Decision Tree, mostly because we wanted students to get practice with using off-the-shelf libraries. It is possible to ask them to write the code themselves.
  • Courses that put emphasis on Natural Language Processing can require students to use, for example, part-of-speech tagging (though the gain from doing so is quite minimal).
  • Using entire articles, rather than just the headlines, is possible.

Handout

Lessons learned