The Mario Project

This project will focus on using code from the RL 2009 competition. The assignment will be written for the Generalized Mario domain.
The product of this project will be:

a write-up describing your experiments (most important)
all the code in one zip file (not very important)
an in-person demonstration of the code (a little important)
an in-person discussion of your write-up (a little important)

Step 0:
Install the RL competition code and run Mario. You should be able to see a visualizer and run with a demo agent. Change the agent so that it only runs to the right. This is not hard, but will force you to install everything and get started early.
Send the code for this simple agent to the instructor by date.

Step 1:
Test how well the example ExMarioAgent.java agent plays. Run two experiments, using the following parameters:

Level Seed = 121
Level Type = 0 and 1
Level Difficulty = 0
Instance = 0

Run thirty trials for level type 0 and thirty trials for level type 1. Report the average and standard deviation of the final reward for this simple agent on both level types.
Hint: You will want to set up a script to run this test and report the final reward for each trial. If you try to use runDemo.bash exclusively, this project will take much longer.
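
Once your script has collected the final reward of each trial, the summary statistics take only a few lines. The sketch below is one way to compute them; the class name TrialStats and the sample numbers are made up for illustration, and it assumes you want the sample standard deviation (dividing by n - 1), which is the usual choice for thirty trials.

```java
/** Summary statistics for the final rewards of a batch of trials (illustrative helper). */
final class TrialStats {
    /** Arithmetic mean of the rewards. */
    static double mean(double[] rewards) {
        double sum = 0.0;
        for (double r : rewards) sum += r;
        return sum / rewards.length;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two trials. */
    static double sampleStdDev(double[] rewards) {
        double m = mean(rewards);
        double squaredError = 0.0;
        for (double r : rewards) squaredError += (r - m) * (r - m);
        return Math.sqrt(squaredError / (rewards.length - 1));
    }

    public static void main(String[] args) {
        // Made-up placeholder rewards, NOT real Mario results.
        double[] rewards = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        System.out.printf("mean=%.3f sd=%.3f%n", mean(rewards), sampleStdDev(rewards));
    }
}
```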

Step 2:
Given the information used by getAction() in ExMarioAgent.java, what do you think would make a good set of state variables to describe Mario's state?
Example: Suppose I defined Mario's state variables as follows:

Is there a pit to my right? (as done in ExMarioAgent)
Is there a smashable block above me? (using the getTileAt() function)

Then, Mario could potentially learn to jump if he is under a smashable block or if there's a pit next to him. However, he would ignore all monsters and would only notice pits at the last moment. Thus, a better representation of state might involve putting some information about the nearest monster into the state, and/or the distance to any pit to Mario's right.
Send your proposed state representation to the instructor by date for feedback.
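
To make the idea concrete, here is a minimal sketch of how a handful of discrete features could be packed into a single state index. The features, bucket counts, and the class name MarioStateEncoder are all hypothetical; they are not part of the competition code, and your own representation will likely differ. Note that the number of states is simply the product of the feature sizes, so it grows quickly as you add features.

```java
/** Packs a few hypothetical discrete features into one table index. */
final class MarioStateEncoder {
    // Hypothetical feature sizes; pick your own after instructor feedback.
    static final int PIT_DISTANCE_BUCKETS = 4;     // 0 = adjacent ... 3 = none in view
    static final int MONSTER_DISTANCE_BUCKETS = 5; // 0 = adjacent ... 4 = none in view

    // Total number of distinct states: 2 * 4 * 5 = 40.
    static final int NUM_STATES = 2 * PIT_DISTANCE_BUCKETS * MONSTER_DISTANCE_BUCKETS;

    /** Mixed-radix encoding: every feature combination maps to a unique index. */
    static int encode(boolean smashableAbove, int pitBucket, int monsterBucket) {
        if (pitBucket < 0 || pitBucket >= PIT_DISTANCE_BUCKETS
                || monsterBucket < 0 || monsterBucket >= MONSTER_DISTANCE_BUCKETS) {
            throw new IllegalArgumentException("bucket out of range");
        }
        int index = smashableAbove ? 1 : 0;
        index = index * PIT_DISTANCE_BUCKETS + pitBucket;
        index = index * MONSTER_DISTANCE_BUCKETS + monsterBucket;
        return index;
    }
}
```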

Step 3:
Using the state representation you designed in step 2 (after taking into account instructor feedback), finalize a tabular representation of the action-value function. I recommend somewhere between 10,000 and 100,000 states. What are the (dis)advantages of having a small or a large state space?

Step 4:
Figure out how you can debug your (to be developed) learning algorithm without relying on Mario. For instance, you may want to design your own very simple test environment so that you can give the agent a state and a return and see if the learning update is working correctly.
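
As one possible starting point, the sketch below is a tiny deterministic corridor: the agent starts at cell 0 and the episode ends when it reaches the last cell. Everything here (the class name CorridorEnv, the two actions, the reward values of -1 per step and +10 at the goal) is an assumption chosen for illustration. The point is that the optimal behavior is obvious, so a broken update rule shows up immediately.

```java
/** A deterministic 1-D corridor: start at cell 0, goal at the last cell (illustrative). */
final class CorridorEnv {
    static final int LEFT = 0;
    static final int RIGHT = 1;

    final int length;
    private int position;

    CorridorEnv(int length) {
        if (length < 2) throw new IllegalArgumentException("length must be >= 2");
        this.length = length;
    }

    /** Starts a new episode and returns the initial state. */
    int reset() {
        position = 0;
        return position;
    }

    /** Applies an action; returns -1 for an ordinary step and +10 on reaching the goal. */
    double step(int action) {
        if (action == RIGHT) {
            position = Math.min(position + 1, length - 1);
        } else {
            position = Math.max(position - 1, 0); // walls block movement past cell 0
        }
        return isTerminal() ? 10.0 : -1.0;
    }

    int state() { return position; }

    boolean isTerminal() { return position == length - 1; }
}
```

A correct learner should end up preferring RIGHT in every cell; if it does not, the bug is in your update, not in Mario.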

Step 5:
Using the tabular representation of the action-value function, implement Sarsa. You will need to modify the start, step, end, and getAction functions.
Hint: Try multiple learning rates (alpha) and exploration rates (epsilon).
Send a learning curve from this step, showing successful agent improvement, to the instructor by date for feedback. A single episode is sufficient at this step.
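
For reference, here is a minimal sketch of the core of tabular Sarsa with epsilon-greedy action selection. It is deliberately independent of the competition code: the class TabularSarsa and its method names are hypothetical, and you will need to decide how they map onto your agent. One plausible arrangement is to call chooseAction when an episode starts, update on every step, and updateTerminal when the episode ends.

```java
import java.util.Random;

/** Minimal tabular Sarsa learner; all names here are illustrative. */
final class TabularSarsa {
    final double[][] q;      // q[state][action], initialised to zero
    final double alpha;      // learning rate
    final double gamma;      // discount factor
    final double epsilon;    // exploration rate
    private final Random rng;

    TabularSarsa(int numStates, int numActions,
                 double alpha, double gamma, double epsilon, long seed) {
        this.q = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.epsilon = epsilon;
        this.rng = new Random(seed);
    }

    /** Index of the highest-valued action (lowest index wins ties). */
    int greedyAction(int state) {
        int best = 0;
        for (int a = 1; a < q[state].length; a++) {
            if (q[state][a] > q[state][best]) best = a;
        }
        return best;
    }

    /** Epsilon-greedy: random action with probability epsilon, else greedy. */
    int chooseAction(int state) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(q[state].length);
        return greedyAction(state);
    }

    /** Sarsa update for a non-terminal transition (s, a, r, sNext, aNext). */
    void update(int s, int a, double r, int sNext, int aNext) {
        double target = r + gamma * q[sNext][aNext];
        q[s][a] += alpha * (target - q[s][a]);
    }

    /** Update for the last transition of an episode: no bootstrapping. */
    void updateTerminal(int s, int a, double r) {
        q[s][a] += alpha * (r - q[s][a]);
    }
}
```

Note that the update uses the action actually chosen in the next state (aNext), which is what makes Sarsa on-policy.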

Step 6:
Test your learning algorithm on Mario. Let Mario play for many episodes - the reward you receive should increase. Run 10 trials using your algorithm on level type 0, using the same parameters as in Step 1. Plot the average reward vs. episode number, along with the standard deviation.

Step 7:
Implement Q-Learning or Monte Carlo. Tune the learning parameters and compare with Sarsa. Graph and explain your results.
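
If you choose Q-Learning, the only change to the update rule is the bootstrap term: instead of the value of the action actually taken next, it uses the maximum value over all actions in the next state. The sketch below shows that update on a plain q[state][action] array; the class name QLearningUpdate and the method signatures are hypothetical.

```java
/** One-step tabular Q-learning update (illustrative helper). */
final class QLearningUpdate {
    /** Largest entry of a non-empty array. */
    static double max(double[] values) {
        double best = values[0];
        for (double v : values) best = Math.max(best, v);
        return best;
    }

    /**
     * Moves q[s][a] toward r + gamma * max_a' q[sNext][a'].
     * When the transition ends the episode, the bootstrap term is zero.
     */
    static void update(double[][] q, int s, int a, double r, int sNext,
                       boolean terminal, double alpha, double gamma) {
        double bootstrap = terminal ? 0.0 : max(q[sNext]);
        q[s][a] += alpha * (r + gamma * bootstrap - q[s][a]);
    }
}
```

Because the target ignores which action the agent actually takes next, Q-Learning is off-policy; this difference is worth discussing when you compare it with Sarsa.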

Step 8:
Using the same state features, change from a tabular representation to using function approximation. Matt suggests a neural network or a CMAC, and will suggest how to set it up for your state representation. Tune the parameters of the function approximator and compare to learning with the same algorithm in the tabular representation. Graph and explain your results.
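
To show the general shape of a function approximator without committing to a particular architecture, the sketch below uses a plain linear approximator with one weight vector per action. This is a simpler stand-in, not the neural network or CMAC suggested above, although a CMAC is itself a linear approximator over binary tile features. The class name LinearQ and the way the target is passed in are assumptions made for illustration.

```java
/** Linear action-value approximator: Q(s, a) = weights[a] . features(s) (illustrative). */
final class LinearQ {
    final double[][] weights; // weights[action][feature], initialised to zero
    final double alpha;       // step size

    LinearQ(int numActions, int numFeatures, double alpha) {
        this.weights = new double[numActions][numFeatures];
        this.alpha = alpha;
    }

    /** Estimated value of taking the action in the state described by features. */
    double value(double[] features, int action) {
        double sum = 0.0;
        for (int i = 0; i < features.length; i++) {
            sum += weights[action][i] * features[i];
        }
        return sum;
    }

    /**
     * One gradient step toward the target, where target is whatever your
     * algorithm bootstraps from (for Sarsa: r + gamma * Q(s', a')).
     */
    void update(double[] features, int action, double target) {
        double error = target - value(features, action);
        for (int i = 0; i < features.length; i++) {
            weights[action][i] += alpha * error * features[i];
        }
    }
}
```

Unlike the table, an update here changes the estimate for every state that shares active features, which is the source of both generalization and instability.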

Step 9:
Update one or more of your algorithms to use eligibility traces. Tune the value of lambda that you use, and then compare the learning results to learning with lambda=0. Graph and explain your results.
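
As a rough guide, here is a sketch of tabular Sarsa(lambda) with accumulating traces. The class name SarsaLambda and its methods are hypothetical, and accumulating traces are only one option (replacing traces are another). With lambda=0 the traces vanish after each step and the update reduces to ordinary one-step Sarsa.

```java
/** Tabular Sarsa(lambda) with accumulating eligibility traces (illustrative). */
final class SarsaLambda {
    final double[][] q;     // action values
    final double[][] trace; // eligibility trace per (state, action)
    final double alpha, gamma, lambda;

    SarsaLambda(int numStates, int numActions, double alpha, double gamma, double lambda) {
        this.q = new double[numStates][numActions];
        this.trace = new double[numStates][numActions];
        this.alpha = alpha;
        this.gamma = gamma;
        this.lambda = lambda;
    }

    /** Traces must be cleared at the start of every episode. */
    void startEpisode() {
        for (double[] row : trace) java.util.Arrays.fill(row, 0.0);
    }

    /** Spreads the TD error over all recently visited (state, action) pairs. */
    void update(int s, int a, double r, int sNext, int aNext, boolean terminal) {
        double next = terminal ? 0.0 : q[sNext][aNext];
        double delta = r + gamma * next - q[s][a];
        trace[s][a] += 1.0; // accumulating trace
        for (int i = 0; i < q.length; i++) {
            for (int j = 0; j < q[i].length; j++) {
                q[i][j] += alpha * delta * trace[i][j];
                trace[i][j] *= gamma * lambda; // decay after use
            }
        }
    }
}
```

Looping over the whole table on every step is slow for tens of thousands of states; keeping a list of pairs with non-negligible traces is a common optimization.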

Step 10:
Using one of your learning methods, allow Mario to learn in level 0 and save your action-value function. Now, compare learning in level 1 between I) learning as normal and II) beginning with the old action-value function. This is an example of transfer learning --- if the two levels are similar, what Mario learned on level 0 should help.
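
Saving and reloading a tabular action-value function can be done with standard Java serialization, as in the sketch below. The class name QTableStore is hypothetical, and the sketch assumes your table is a plain double[][] and that both levels use the same state representation and action set, since otherwise the saved values would not line up with the new table.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Saves and restores a q[state][action] table (illustrative helper). */
final class QTableStore {
    /** Writes the table to the given file, replacing any existing contents. */
    static void save(double[][] q, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(q);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back a table previously written by save. */
    static double[][] load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (double[][]) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("unexpected file contents", e);
        }
    }
}
```

For condition II, load the table learned on level 0 before the first episode on level 1; for condition I, start from a freshly initialised table.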

Step 11:
In this step, you are teaching the computer to play Mario. First, develop a keyboard interface for the Mario game. Second, write to a text file that records all the states the agent sees, and what action you took (if any). Third, use your ID3 algorithm from project 0 to learn to classify all of your data into a policy (i.e., given a state, what action would you most likely take). Fourth, use this learned policy to play Mario. How well does it do? Does the amount of demonstration that you give the agent affect its performance?

Grading Rubric

All code and the write-up are due by date.
Pass: You generate a learning curve in step 6.

Maximum benefits from the following conditions:
+1/2 letter grade: The learning curve has a positive slope (i.e., it does learn)
+1/2 letter grade: Mario is able to learn to outperform the example policy
+1/2 letter grade: overall thoroughness, presentation quality, insight, etc.
+1/2 letter grade: in person demo + discussion with instructor goes well
+1 letter grade: Step 7, Step 10
+2 letter grades: Step 8, Step 9
+3 letter grades: Step 11

-1/2 letter grade: Miss any of the intermediate checkpoints (Steps 0, 2, and 5)
-1 letter grade: Every day the assignment is late

My hope is that you'll do more than the minimum number of steps in this project, as this should be a fun project, and it will ensure that you receive a high grade. The maximum grade I will give for this project is 100%, but there are other, less tangible reasons for showing off (e.g., "geek cred" and good rec letters).