|Summary||Text Classification with Naive Bayes
Students will explore and implement methods for text classification using the multinomial and Bernoulli Naive Bayes models. The assignment task is to predict US presidents' party affiliations from their State of the Union (SOTU) addresses.
|Topics||Machine Learning; supervised learning; classification; Naive Bayes; bag-of-words representation; text classification|
|Audience||Undergraduate students in an introductory course in artificial intelligence, machine learning, data mining, or information retrieval|
|Difficulty||The assignment is of medium difficulty, extending the ideas of basic Naive Bayes to consider the two models for text classification. The assignment should be completed in approximately 2 weeks.|
|Strengths||The problem of text categorization is relevant to many current applications for analyzing information, such as sentiment analysis and article categorization. The assignment also gives students the opportunity to learn new skills and to become familiar with available software for analyzing and processing text.|
|Weaknesses||As is, the assignment focuses on a single classification method, Naive Bayes (though it could be extended to include SVMs, kNN, or other approaches).|
|Dependencies||Students must have basic programming skills, an understanding of probability and statistical distributions, and enough knowledge of English to follow the text-processing methods. Note that an understanding of the US political system is not necessary to complete the assignment; however, foreign students may appreciate a short description of the American party system and presidential SOTU addresses to help them relate the assignment to their own culture.
Any programming language that supports text comparison and analysis is appropriate. Languages such as Python and R have libraries that may help students process the text, e.g., "nltk", the Natural Language Toolkit in Python, and "tm", the Text Mining package in R, but any language could be used for the assignment.
Several variants of the assignment can be created, each with its own topical relevance. For example, if the assignment is given after a presidential election year, inaugural addresses can be used rather than State of the Union addresses. Other publicly available speeches, poems, novels, and plays could be considered as alternative data sets, e.g., determining whether a play or scene by Shakespeare is a comedy, tragedy, or history.
The assignment could be extended to incorporate other classification methods including kNN and SVMs.
Additional assignment extensions and variants are discussed in the Project Description below.
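To make the difference between the two models named in the Summary concrete, the following is a minimal sketch in plain Python, not the assignment's reference solution. The function names (`tokenize`, `train`, `predict`), the whitespace tokenizer, the add-one smoothing, and the four toy documents with placeholder labels "A" and "B" are all assumptions made for illustration; they are not taken from the SOTU data. The multinomial model scores each token occurrence, while the Bernoulli model scores the presence or absence of every vocabulary word.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(docs, labels, model="multinomial"):
    """Estimate class priors and smoothed per-class word probabilities."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    classes = sorted(set(labels))
    prior = {c: labels.count(c) / len(labels) for c in classes}
    stats = {}
    for c in classes:
        class_docs = [tokenize(d) for d, y in zip(docs, labels) if y == c]
        if model == "multinomial":
            # P(w | c): share of all token occurrences in class c (add-one smoothing).
            counts = Counter(w for toks in class_docs for w in toks)
            total = sum(counts.values())
            stats[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
        else:
            # P(w present | c): share of class-c documents containing w (add-one smoothing).
            df = Counter(w for toks in class_docs for w in set(toks))
            stats[c] = {w: (df[w] + 1) / (len(class_docs) + 2) for w in vocab}
    return {"model": model, "vocab": vocab, "prior": prior, "stats": stats}


def predict(nb, text):
    """Return the class with the highest log-posterior score for the text."""
    tokens = tokenize(text)
    present = set(tokens)
    scores = {}
    for c, p_c in nb["prior"].items():
        p_w = nb["stats"][c]
        score = math.log(p_c)
        if nb["model"] == "multinomial":
            # Each in-vocabulary token occurrence contributes; unseen words are skipped.
            score += sum(math.log(p_w[w]) for w in tokens if w in p_w)
        else:
            # Every vocabulary word contributes, whether present or absent.
            score += sum(
                math.log(p_w[w] if w in present else 1.0 - p_w[w])
                for w in nb["vocab"]
            )
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus; "A" and "B" are placeholder class labels.
docs = [
    "taxes cut taxes growth",
    "cut spending cut taxes",
    "health care jobs",
    "jobs education health",
]
labels = ["A", "A", "B", "B"]

for model in ("multinomial", "bernoulli"):
    nb = train(docs, labels, model=model)
    print(model, predict(nb, "cut taxes"), predict(nb, "health jobs"))
```

Working in log space avoids numerical underflow on long documents such as full speeches. Students using a library instead could compare their results against an existing Naive Bayes implementation, but the sketch above relies only on the standard library.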