Party Affiliation Classification from State of the Union Addresses

Summary Text Classification with Naive Bayes
Students will explore and implement methods for text classification using the multinomial and Bernoulli Naive Bayes models. The task is to predict a US President's party affiliation from their State of the Union (SOTU) addresses.
Topics Machine Learning; supervised learning; classification; Naive Bayes; bag-of-words representation; text classification
Audience Undergraduate students in an introductory course in artificial intelligence, machine learning, data mining, or information retrieval
Difficulty The assignment is of medium difficulty, extending the ideas of basic Naive Bayes to consider the two models for text classification. The assignment should be completed in approximately 2 weeks.
Strengths Text categorization is relevant to many current applications for analyzing information, e.g., sentiment analysis and article categorization. The assignment may also give students the opportunity to learn new skills and software for analyzing and processing text.
Weaknesses As is, the assignment focuses on a single classification method, Naive Bayes (though it could be extended to include SVMs, kNN, or other approaches).
Dependencies Students must have basic programming skills, an understanding of probability and statistical distributions, and enough knowledge of English to follow the text processing methods. Note that an understanding of the US political system is not necessary to complete the assignment; however, foreign students may appreciate a short description of the American party system and presidential SOTU addresses, to place the assignment in context relative to their own culture.
Any programming language that provides support for text comparison and analysis is appropriate. Languages such as Python or R have libraries that may support students' processing of the text, e.g., "nltk" (the Natural Language Toolkit) in Python and "tm" (the Text Mining package) in R, but any language could be used for the assignment.
Variants Several variants of the assignment can be created, each with topical relevance. For example, if the assignment is given after a presidential election year, inaugural addresses can be used rather than State of the Union addresses. Other publicly available speeches, poems, novels, and plays could be considered as alternative data sets, e.g., determine whether a play or scene by Shakespeare is a comedy, tragedy, or history.
The assignment could be extended to incorporate other classification methods including kNN and SVMs.

Additional assignment extensions and variants are discussed in the Project Description below.
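
To make the contrast between the two models concrete, the following is a minimal sketch in Python, not the assignment's reference solution. It assumes a bag-of-words representation and add-one (Laplace) smoothing. The function names, the class labels "A" and "B", and the toy documents are all hypothetical and stand in for the real SOTU data. The multinomial model scores every token occurrence in the test document, while the Bernoulli model scores the presence or absence of every vocabulary term.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


def train(docs, labels):
    """Collect the per-class counts needed by both event models."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in tokenize(d)}
    n_docs = Counter(labels)                       # documents per class
    term_counts = {c: Counter() for c in classes}  # multinomial: token counts
    doc_counts = {c: Counter() for c in classes}   # Bernoulli: docs containing term
    for d, c in zip(docs, labels):
        tokens = tokenize(d)
        term_counts[c].update(tokens)
        doc_counts[c].update(set(tokens))
    return {"classes": classes, "vocab": vocab, "n_docs": n_docs,
            "term_counts": term_counts, "doc_counts": doc_counts}


def log_prior(model, c):
    """Log of the fraction of training documents that belong to class c."""
    return math.log(model["n_docs"][c] / sum(model["n_docs"].values()))


def score_multinomial(model, text, c):
    """Log score: one factor per token occurrence, add-one smoothed."""
    counts = model["term_counts"][c]
    denom = sum(counts.values()) + len(model["vocab"])
    score = log_prior(model, c)
    for w in tokenize(text):
        if w in model["vocab"]:  # tokens unseen in training are skipped
            score += math.log((counts[w] + 1) / denom)
    return score


def score_bernoulli(model, text, c):
    """Log score: one factor per vocabulary term, present or absent."""
    present = set(tokenize(text))
    n_c = model["n_docs"][c]
    score = log_prior(model, c)
    for w in model["vocab"]:
        p = (model["doc_counts"][c][w] + 1) / (n_c + 2)
        score += math.log(p if w in present else 1.0 - p)
    return score


def classify(model, text, scorer):
    """Return the class with the highest log score under the given model."""
    return max(model["classes"], key=lambda c: scorer(model, text, c))


# Hypothetical toy corpus; the real assignment uses the SOTU addresses.
docs = ["taxes jobs jobs economy", "economy taxes growth",
        "health schools health", "schools care health"]
labels = ["A", "A", "B", "B"]
model = train(docs, labels)

print(classify(model, "jobs and taxes", score_multinomial))
print(classify(model, "jobs and taxes", score_bernoulli))
```

Working in log space avoids numerical underflow on long documents such as full addresses, where the product of many small probabilities would otherwise round to zero.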

Project Information

Project Description: [PDF]
Project Data: [ZIP]

Resources for Students

Manning, C.D., Raghavan, P. and Schütze, H., Introduction to Information Retrieval, Cambridge University Press, 2009.
[Online Website] [Ch 13. Text classification and Naive Bayes]

Sebastiani, F. "Machine Learning in Automated Text Categorization" ACM Computing Surveys, 34(1):1-47, 2002.

McCallum, A. and Nigam, K. "A Comparison of Event Models for Naive Bayes Text Classification" in Proceedings of AAAI, 1998.

Mitchell, T., Machine Learning, McGraw-Hill, 1997. [Online Website]

Resources for Faculty

Project Sources: Manning, Raghavan, Schütze, Introduction to Information Retrieval

Further Information

Contact: Laura E. Brown at Michigan Technological University [lebrown at mtu dot edu]