The Mario Project
This project focuses on using code from the 2009 RL Competition.
The assignment is written for the Generalized Mario domain.
The products of this project will be:
- a write-up describing your experiments (most important)
- all the code in one zip file (not very important)
- an in-person demonstration of the code (a little important)
- an in-person discussion of your write-up (a little important)
Step 0:
Install the RL competition code and run Mario. You should be able to
see a visualizer and run with a demo agent. Change the agent so that
it only runs to the right. This is not hard, but will force you to
install everything and get started early.
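For reference, here is a minimal sketch of such an agent's
getAction(). It assumes the three-integer action layout used by
ExMarioAgent (direction, jump, speed) and the RL-Glue Java codec
types; check the slot meanings and constructor arguments against your
copy of the competition code.

    import org.rlcommunity.rlglue.codec.types.Action;
    import org.rlcommunity.rlglue.codec.types.Observation;

    public class RunRightAgent {
        // Always press "right" with the run button held, and never jump.
        // The three-slot integer encoding is an assumption taken from
        // ExMarioAgent; verify each slot's meaning in your copy of the code.
        public Action getAction(Observation observation) {
            Action action = new Action(3, 0, 0);  // 3 ints, no doubles, no chars
            action.intArray[0] = 1;               // direction: -1 left, 0 none, 1 right
            action.intArray[1] = 0;               // jump button released
            action.intArray[2] = 1;               // speed/run button pressed
            return action;
        }
    }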
Send the code for this simple agent to the instructor
by date.
Step 1:
Test how well the example ExMarioAgent.java agent plays. Run two
experiments, using the following parameters:
- Level Seed = 121
- Level Type = 0 and 1
- Level Difficulty = 0
- Instance = 0
Run thirty trials for level type 0 and thirty trials for level type 1.
Report the average and standard deviation for this simple agent on
both level types.
Hint: You will want to set up a script to run this test and report the
final reward for each trial. If you try to use runDemo.bash
exclusively, this project will take much longer.
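If you would rather stay in Java than write a shell script, the
sketch below is one way to structure the experiment. It assumes the
standard RL-Glue Java codec experiment calls (RLGlue.RL_init,
RL_episode, RL_return, RL_cleanup) and that the environment has
already been configured with the Step 1 parameters; the class name is
arbitrary.

    import org.rlcommunity.rlglue.codec.RLGlue;

    public class MarioExperiment {
        public static void main(String[] args) {
            final int numTrials = 30;
            double[] returns = new double[numTrials];

            RLGlue.RL_init();
            for (int t = 0; t < numTrials; t++) {
                RLGlue.RL_episode(0);             // 0 = run until the episode ends
                returns[t] = RLGlue.RL_return();  // cumulative reward for this trial
                System.out.println("trial " + t + ": " + returns[t]);
            }
            RLGlue.RL_cleanup();

            double mean = 0.0;
            for (double r : returns) mean += r;
            mean /= numTrials;

            double sumSq = 0.0;
            for (double r : returns) sumSq += (r - mean) * (r - mean);
            double std = Math.sqrt(sumSq / (numTrials - 1));

            System.out.printf("average = %.2f, standard deviation = %.2f%n", mean, std);
        }
    }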
Step 2:
Given the information used by getAction() in ExMarioAgent.java, what
do you think would make a good set of state variables to describe
Mario's state?
Example: Suppose I defined Mario's state variables as follows:
- Is there a pit to my right? (as done in ExMarioAgent)
- Is there a smashable block above me? (using the getTileAt() function)
Then, Mario could potentially learn to jump if he is under a smashable
block or if there is a pit next to him. However, he would ignore all
monsters and would only notice pits at the last moment. Thus, a better
state representation might also include some information about the
nearest monster and/or the distance to the nearest pit on Mario's
right.
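To make this concrete, here is a minimal sketch of turning those two
example checks into boolean state variables. The tile character codes
are placeholders, and how the tiles are fetched is left out: look at
how ExMarioAgent.java and getTileAt() represent the level before
copying anything.

    public class MarioFeatures {
        // True if the tile directly above Mario (fetched elsewhere, e.g. with
        // getTileAt) is a smashable brick. The code 'b' is a placeholder; check
        // the Generalized Mario documentation for the real tile encoding.
        public static boolean smashableBlockAbove(char tileAbove) {
            return tileAbove == 'b';
        }

        // True if none of the ground-level tiles just to Mario's right are
        // solid, i.e., there is a pit. The ' ' (empty) code is also a placeholder.
        public static boolean pitToRight(char[] groundTilesToRight) {
            for (char tile : groundTilesToRight) {
                if (tile != ' ') {
                    return false;
                }
            }
            return true;
        }
    }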
Send your proposed state representation to the instructor
by date for feedback.
Step 3:
Using the state representation you designed in step 2 (after taking
into account instructor feedback), finalize a tabular representation
of the action-value function. I recommend somewhere between 10,000
and 100,000 states. What are the (dis)advantages of having a small or
a large state space?
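One common way to build such a table is to discretize each feature
into a small number of buckets and pack the bucket indices into a
single integer, as in the sketch below. The particular features and
bucket counts here are hypothetical; the point is that the product of
the bucket counts is the size of your state space.

    public class StateEncoder {
        // Hypothetical features: two booleans, the nearest monster's x- and
        // y-offsets in 16 buckets each, and the distance to the next pit in
        // 16 buckets: 2 * 2 * 16 * 16 * 16 = 16,384 states.
        public static final int NUM_STATES = 2 * 2 * 16 * 16 * 16;

        public static int encode(boolean pitRight, boolean blockAbove,
                                 int monsterDxBin, int monsterDyBin, int pitDistBin) {
            int s = pitRight ? 1 : 0;
            s = s * 2 + (blockAbove ? 1 : 0);
            s = s * 16 + monsterDxBin;   // each bin index must lie in 0..15
            s = s * 16 + monsterDyBin;
            s = s * 16 + pitDistBin;
            return s;                    // 0 .. NUM_STATES - 1
        }
    }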
Step 4:
Figure out how you can debug your (yet to be developed) learning
algorithm without relying on Mario. For instance, you may want to
design your own very simple test environment so that you can give the
agent a state and a return and see whether the learning update is
working correctly.
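For example, here is a sketch of a tiny hand-rolled environment whose
correct action values are known in closed form: with gamma = 1,
Q(0, 1) = 1 and every other value is 0. Drive your agent's start,
step, and end methods with it instead of Mario; if the learned values
drift away from those numbers, the bug is in your update rather than
in your Mario features.

    public class TinyTestEnv {
        // State 0: action 1 gives reward +1 and ends the episode;
        //          action 0 gives reward 0 and moves to state 1.
        // State 1: either action gives reward 0 and ends the episode.
        private int state;
        private boolean terminal;

        public int reset() {
            state = 0;
            terminal = false;
            return state;
        }

        public int getState()       { return state; }
        public boolean isTerminal() { return terminal; }

        // Returns the immediate reward for taking `action` in the current state.
        public double step(int action) {
            if (state == 0 && action == 1) { terminal = true; return 1.0; }
            if (state == 0)                { state = 1;       return 0.0; }
            terminal = true;
            return 0.0;
        }
    }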
Step 5:
Using the tabular approximation of the action-value function,
implement Sarsa. You will need to modify the start, step, end, and
getAction functions.
Hint: Try multiple learning rates (alpha) and exploration
rates (epsilon).
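Below is a minimal tabular Sarsa sketch. It assumes the state index
and action count come from your Step 3 encoding; alpha, epsilon, and
gamma are the parameters you will tune, and wiring it into the start,
step, and end callbacks is up to you.

    import java.util.Random;

    public class TabularSarsa {
        protected final double[][] q;            // q[state][action]
        protected final double alpha, epsilon, gamma;
        protected final int numActions;
        private final Random rng = new Random();

        public TabularSarsa(int numStates, int numActions,
                            double alpha, double epsilon, double gamma) {
            this.q = new double[numStates][numActions];
            this.numActions = numActions;
            this.alpha = alpha;
            this.epsilon = epsilon;
            this.gamma = gamma;
        }

        // Epsilon-greedy action selection over the current estimates.
        public int selectAction(int s) {
            if (rng.nextDouble() < epsilon) return rng.nextInt(numActions);
            int best = 0;
            for (int a = 1; a < numActions; a++) {
                if (q[s][a] > q[s][best]) best = a;
            }
            return best;
        }

        // Sarsa update for a non-terminal transition (s, a) -> (sPrime, aPrime).
        public void update(int s, int a, double r, int sPrime, int aPrime) {
            q[s][a] += alpha * (r + gamma * q[sPrime][aPrime] - q[s][a]);
        }

        // Update for a transition into a terminal state (no bootstrap term).
        public void updateTerminal(int s, int a, double r) {
            q[s][a] += alpha * (r - q[s][a]);
        }
    }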
Send a learning curve from this step, showing successful agent
improvement, to the instructor by date for feedback. A single trial
(one learning run, rather than an average over many) is sufficient at
this step.
Step 6:
Test your learning algorithm on Mario. Let Mario play for many
episodes - the reward you receive should increase. Run 10 trials of
your algorithm on level type 0 with the same parameters as in Step 1.
Plot the average reward vs. episode number, along with the standard
deviation.
Step 7:
Implement Q-learning or Monte Carlo. Tune the learning parameters and
compare with Sarsa. Graph and explain your results.
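If you choose Q-learning, the only change from the Sarsa sketch in
Step 5 is the bootstrap target: it uses the greedy value of the next
state rather than the value of the action actually taken. A sketch,
reusing the (assumed) TabularSarsa class from above:

    public class TabularQLearning extends TabularSarsa {
        public TabularQLearning(int numStates, int numActions,
                                double alpha, double epsilon, double gamma) {
            super(numStates, numActions, alpha, epsilon, gamma);
        }

        // Q-learning ignores the action actually taken next and bootstraps on
        // the maximum action value available in the successor state.
        @Override
        public void update(int s, int a, double r, int sPrime, int aPrimeIgnored) {
            double maxNext = q[sPrime][0];
            for (int a2 = 1; a2 < numActions; a2++) {
                maxNext = Math.max(maxNext, q[sPrime][a2]);
            }
            q[s][a] += alpha * (r + gamma * maxNext - q[s][a]);
        }
    }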
Step 8:
Using the same state features, change from a tabular representation
to function approximation. Matt suggests a neural network or a CMAC,
and will help you figure out how to set it up for your state
representation.
Tune the parameters of the function approximator and compare to
learning with the same algorithm in the tabular representation. Graph
and explain your results.
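A CMAC (tile coding) ultimately reduces to a linear function of
binary features: each active tile contributes its weight, and the
gradient of Q with respect to those weights is 0 or 1. The sketch
below shows that linear case for Sarsa; the feature extraction (your
tiles, or any binary encoding of your Step 2 features) is left to
you, and dividing alpha by the number of active features is a common
stabilizing choice.

    public class LinearSarsa {
        private final double[][] w;          // one weight vector per action
        private final double alpha, gamma;

        public LinearSarsa(int numFeatures, int numActions, double alpha, double gamma) {
            this.w = new double[numActions][numFeatures];
            this.alpha = alpha;
            this.gamma = gamma;
        }

        // Q(s, a) is the sum of the weights of the features active in s.
        public double qValue(int[] activeFeatures, int action) {
            double sum = 0.0;
            for (int f : activeFeatures) sum += w[action][f];
            return sum;
        }

        // Gradient Sarsa update for a non-terminal transition; for binary
        // features the gradient is 1 on active features and 0 elsewhere.
        public void update(int[] phi, int a, double r, int[] phiPrime, int aPrime) {
            double delta = r + gamma * qValue(phiPrime, aPrime) - qValue(phi, a);
            for (int f : phi) {
                w[a][f] += alpha * delta;
            }
        }
    }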
Step 9:
Update one or more of your algorithms to use eligibility traces. Tune
the value of lambda that you use, and then compare the learning
results to learning with lambda=0. Graph and explain your results.
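Here is a sketch of tabular Sarsa(lambda) with replacing traces.
Note that the naive sweep over the whole trace table on every step
will be slow for a 10,000-100,000 state table; keeping only the
nonzero traces (e.g., in a HashMap) is a worthwhile optimization.

    import java.util.Arrays;

    public class SarsaLambda {
        private final double[][] q;          // action values
        private final double[][] e;          // eligibility traces
        private final double alpha, gamma, lambda;
        private final int numStates, numActions;

        public SarsaLambda(int numStates, int numActions,
                           double alpha, double gamma, double lambda) {
            this.q = new double[numStates][numActions];
            this.e = new double[numStates][numActions];
            this.numStates = numStates;
            this.numActions = numActions;
            this.alpha = alpha;
            this.gamma = gamma;
            this.lambda = lambda;
        }

        // Traces must be cleared at the start of every episode.
        public void startEpisode() {
            for (double[] row : e) Arrays.fill(row, 0.0);
        }

        public void update(int s, int a, double r, int sPrime, int aPrime,
                           boolean terminal) {
            double target = terminal ? r : r + gamma * q[sPrime][aPrime];
            double delta = target - q[s][a];
            e[s][a] = 1.0;                   // replacing trace
            for (int i = 0; i < numStates; i++) {
                for (int j = 0; j < numActions; j++) {
                    q[i][j] += alpha * delta * e[i][j];
                    e[i][j] *= gamma * lambda;
                }
            }
        }
    }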
Step 10:
Using one of your learning methods, allow Mario to learn on level
type 0 and save your action-value function. Now, compare learning on
level type 1 between I) learning as normal and II) beginning with the
old action-value function. This is an example of transfer learning ---
if the two levels are similar, what Mario learned on level type 0
should help.
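Saving and reloading the table can be as simple as Java serialization
of the double[][] array, as in this sketch (the file name and the
calling code are up to you):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;

    public class QTableIO {
        // Write the whole table to disk after the level type 0 run.
        public static void save(double[][] q, String path) throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(path))) {
                out.writeObject(q);          // arrays of doubles are Serializable
            }
        }

        // Read it back to initialize the learner for the level type 1 run.
        public static double[][] load(String path)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(path))) {
                return (double[][]) in.readObject();
            }
        }
    }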
Step 11:
In this step, you are teaching the computer to play Mario. First,
develop a keyboard interface for the Mario game. Second, write a text
file that records every state the agent sees and the action you took
(if any). Third, use your ID3 algorithm from project 0 to learn to
classify all of your data into a policy (i.e., given a state, what
action would you most likely take). Fourth, use this learned policy to
play Mario. How well does it do? Does the amount of demonstration you
give the agent affect its performance?
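A sketch of the logging part is below: one comma-separated line per
time step, with the discretized state features first and the action
you pressed last, which a simple ID3 implementation can read as
labeled examples. The file format and the action encoding are your
choice.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class DemoLogger implements AutoCloseable {
        private final PrintWriter out;

        public DemoLogger(String path) throws IOException {
            // Append so that demonstrations from several play sessions accumulate.
            out = new PrintWriter(new FileWriter(path, true));
        }

        // One line per time step: feature values, then the action label.
        public void log(int[] features, int action) {
            StringBuilder line = new StringBuilder();
            for (int f : features) {
                line.append(f).append(',');
            }
            line.append(action);
            out.println(line);
        }

        @Override
        public void close() {
            out.close();
        }
    }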
Grading Rubric
All code and the write-up are due by date.
Pass: You generate a learning curve in step 6.
The following conditions adjust your grade, each by at most the
amount listed:
+1/2 letter grade: The learning curve has a positive slope (i.e., it does learn)
+1/2 letter grade: Mario is able to learn to outperform the example policy
+1/2 letter grade: overall thoroughness, presentation quality, insight, etc.
+1/2 letter grade: in person demo + discussion with instructor goes well
+1 letter grade: Step 7, Step 10
+2 letter grades: Step 8, Step 9
+3 letter grades: Step 11
-1/2 letter grade: Miss any of the intermediate checkpoints (Steps 0, 2, and 5)
-1 letter grade: Each day the assignment is late
My hope is that you'll do more than the minimum number of steps, as
this should be a fun project and doing more will ensure that you
receive a high grade. The maximum grade I will give for this project
is 100%, but there are other, less tangible reasons for showing off
(e.g., "geek cred" and good rec letters).