Building a Fake News Detector
Michael Guerzhoy (Princeton University, University of Toronto, and the Li Ka Shing Knowledge Institute, St. Michael's Hospital, guerzhoy@princeton.edu) and Lisa Zhang (University of Toronto, lczhang@cs.toronto.edu)
Overview
In this assignment, students classify news headlines as "real" or "fake." Students build and compare several standard classifiers: Naive Bayes, Logistic Regression, and a Decision Tree. In addition, students have the option of training their own custom classifier and entering it into a class-wide fake news detection competition.
Fake news, and the efforts to combat fake news, have come to prominence recently. Students are excited to work on this problem. Interested students are encouraged to augment the training set with data collected from the internet. These students will quickly see the challenges of applying machine learning to real-world problems: labelling news items as "real" or "fake" is an inherently subjective task, determining which features to use can be an art as well as a science, and different trained classifiers can provide conflicting interpretations about which features are important.
Our pedagogical approach emphasizes having students analyze the models that they build. In particular, we ask students to obtain the keywords whose presence or absence most strongly indicates that a headline is "fake" or "real". Students explore the connection between Naive Bayes and Logistic Regression.
Our fake news detection competition allows interested students to explore a variety of NLP and Machine Learning algorithms, and to collect their own datasets in order to improve their performance in the competition. Our secret test set is collected and curated manually. We chose headlines that are recent, and that therefore would not appear in any training set that we provide or that students collect (details are available to interested instructors upon request). Our experience shows that students who took the most sophisticated approaches performed better in the competition.
We used this assignment in an introductory machine learning class. However, introductory AI classes that cover spam filtering can adopt this assignment as well.
Meta Information
Summary 
Students use Machine Learning to classify news headlines as "real" or "fake". Students build and compare several standard classifiers. A class-wide competition is held for the best fake news detector.

Topics 
Text classification, Naive Bayes, Logistic Regression, Decision Trees, (optionally) ensemble methods, (optionally) advanced Natural Language Processing techniques.

Audience 
Third- and fourth-year students in Intro Machine Learning or Intro Artificial Intelligence classes.

Difficulty 
This is a fairly easy assignment for an undergraduate machine learning course. Some students spent a lot of time on their entry into the competition.

Strengths 

Relevance. Students enjoy working on a problem that they've read about in the (real) news.

Exploration. There are many ways to improve on the performance of standard classifiers on the fake news detection task, making it a natural candidate for an in-class competition. Students enjoy trying to improve their systems in a variety of ways, and more sophisticated systems do win the in-class competition. Students tried out methods that were discussed throughout the semester.

Working with real data. The competition gives students a more realistic feel for working with messy data. Students need to figure out how to label the extra data that they collect and what data to use for training. Students need to decide how to compare multiple classifiers and how to collect training and validation sets in order to perform well on an unseen test set. Just like in real-life scenarios, students don't find out that they're overfitting until it's too late.

Interpretability. Students explore the keywords that most influence a classifier's output by analyzing the coefficients of the trained model. This is good practice in interpreting learned models.

The fun factor. Students liked building fake news detectors. For example, some students ran our announcement "headlines" through the systems that they built.
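The data-splitting decisions described under "Working with real data" can be sketched as follows, assuming scikit-learn. The placeholder data and the 70/15/15 split are illustrative choices, not requirements of the assignment:

```python
# Sketch: hold out a validation set so that overfitting shows up before
# the hidden competition test set does. Placeholder data; integer
# test_size values give exact split sizes.
from sklearn.model_selection import train_test_split

headlines = ["headline %d" % i for i in range(100)]  # placeholder headlines
labels = [i % 2 for i in range(100)]                 # placeholder labels

# First carve off a local test set, then split the rest into train/validation.
train_val, test, y_train_val, y_test = train_test_split(
    headlines, labels, test_size=15, random_state=0)
train, val, y_train, y_val = train_test_split(
    train_val, y_train_val, test_size=15, random_state=0)
```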

Weaknesses 

There is no good large-scale fake news dataset. Many of the headlines labelled "fake" merely come from websites that publish opinion pieces, or are hyperbolic rather than fake.

We chose to collect and curate a small test set for the competition. This was quite labor-intensive. Ideally, a new test set would be collected after the deadline for competition entries.

The Naive Bayes classifier works frustratingly well on the task. It is certainly possible to do better than Naive Bayes, but many students struggled to do so.

Dependencies 
Python, Logistic Regression, Naive Bayes, and Decision Trees. Some students were technically savvy enough to scrape their own datasets from the web, and to use packages such as scikit-learn, pandas, and spaCy.

Variants 
- We chose to ask students to use scikit-learn to train a Decision Tree, mostly because we wanted students to get practice with using off-the-shelf libraries. It is possible to ask them to write the code themselves.
- Courses that put emphasis on Natural Language Processing can require students to use, for example, part-of-speech tagging (though the gain from doing that is quite minimal).
- Using entire articles, rather than just the headlines, is possible.

Handout
Lessons learned
- Running a competition in a large class is challenging. It is better to set strict rules about the format in which submissions will be accepted: allowable packages, the maximum acceptable size of the weights file (which can easily run into the gigabytes), and the output format.
- Student code may not be deterministic, and in that case we averaged their accuracy over multiple runs. To reduce staff workload, future instructors can mandate that students set a random seed or disqualify non-reproducible submissions altogether.
- It is challenging to ensure fairness for participants from different sections of the course: students in one section may be exposed to useful techniques several days earlier than students in another section.
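One way to enforce the random-seed requirement is to mandate a short preamble at the top of every submission, along these lines (the seed value 0 is arbitrary):

```python
# Sketch: seed the common sources of randomness so a submission produces
# the same accuracy on every run. The seed value 0 is arbitrary.
import random

import numpy as np

SEED = 0
random.seed(SEED)
np.random.seed(SEED)
# scikit-learn estimators additionally accept a random_state argument,
# e.g. LogisticRegression(random_state=SEED).
```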