**Rook Jumping Maze Instructions:**
Starting at the circled cell in the upper-left corner, find a path to the goal
cell marked “G”. From each numbered cell, one may move that exact
number of cells horizontally or vertically in a straight line. How many
moves does the shortest path have?

Solution (select to reveal): [19 moves: RDLRUDRLRLUDLDRRUDD]

**Problem:** Use
ε-greedy and softmax strategies to learn an
approximately optimal restart period that maximizes generated maze utility per
unit of computation.

"One-armed bandit" is a slang reference to a
slot machine. A well-known problem in psychology and machine learning is
the n-armed bandit problem where the learner is faced with a set of single
actions with variable payouts and must balance *explorative *actions
that gain information about expected action utility, versus *exploitative *
actions that use such information to maximize utility.

In this case, let us suppose that our goal is to create a Rook Jumping Maze
generator that uses hill descent with random restarts in order to yield mazes
with the longest solutions most efficiently. Further, let us suppose that
we value a maze with goal depth *d* twice as much as a maze with goal depth
(*d* - 1). Finally, suppose that computation has a uniform cost per
iteration, and let *d* = 0 for a maze without a solution. Then we
can measure the utility of a single hill descent of *i* iterations as 2^{d}
/ *i*.
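As a quick sanity check, the utility measure translates directly into code (a sketch; the function name `descent_utility` is our own, not part of the assignment):

```python
def descent_utility(d, i):
    """Utility of a single hill descent of i iterations reaching
    goal depth d (d = 0 for an unsolvable maze): 2^d / i."""
    return 2 ** d / i

# A depth-17 maze found in 800 iterations is worth 2^17 / 800:
print(descent_utility(17, 800))  # 163.84
```

Note how strongly the exponential reward favors deeper goals: one extra level of depth doubles the utility, while doubling the iteration count only halves it.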

In this exercise, we will implement two simple techniques for learning the
approximately best number of iterations *i* that maximizes the expected
utility 2^{d} / *i* for hill descent. One can then restart after *i*
iterations and expect to maximize the utility of the generator per unit of
computation.

For each method, we will create, in effect, a 6-armed bandit by allowing 6
choices for the number of iterations between restarts: 100, 200, 400, 800,
1600, 3200, that is, 100 * 2^{a} for *a* in {0, ..., 5}.
Each method will learn the utility of actions by using an action selection
strategy that allows explorative actions while largely taking actions that
are expected to be superior.

The ε-greedy strategy for action selection chooses a random action with probability ε, and otherwise chooses an action with the maximum average utility experienced so far. If ε = 1, this is purely explorative. If ε = 0, this is purely exploitative.
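A minimal sketch of ε-greedy selection (the name `epsilon_greedy` and the dict-based representation of average utilities are our own, not a required interface):

```python
import random

def epsilon_greedy(avg_utility, epsilon):
    """With probability epsilon, pick a uniformly random action;
    otherwise pick the action with the highest average utility so far.
    avg_utility maps each action to its mean observed utility."""
    if random.random() < epsilon:
        return random.choice(list(avg_utility))
    return max(avg_utility, key=avg_utility.get)

# Example: restart periods with hypothetical running average utilities
averages = {100: 29.1, 200: 55.2, 400: 133.1, 800: 154.6, 1600: 123.7, 3200: 71.1}
print(epsilon_greedy(averages, 0.1))  # usually 800, occasionally a random arm
```

With ε = 0 this degenerates to pure greed on current averages; with ε = 1 it ignores the averages entirely.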

The softmax strategy for action selection assigns a probability to the choice
of each action *a* according to the normalized weight *e*^{U(a)/τ} for some
given constant τ. That is, we compute the sum of *e*^{U(a)/τ} over all *a*,
and divide each individual term by that sum to get the action's probability.
A simple algorithm to compute a softmax action is as follows:

- For each action *a*:
  - prob[*a*] = *e*^{U(*a*)/τ}
  - sum = sum + prob[*a*]
- r = random number in [0, 1)
- For each action *a*:
  - prob[*a*] = prob[*a*] / sum
  - cumProb = cumProb + prob[*a*]
  - if r < cumProb, return *a*
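The pseudocode above translates directly into, for example, Python (a sketch; the function name and dict representation are our own):

```python
import math
import random

def softmax_action(utility, tau):
    """Choose action a with probability proportional to e^(U(a)/tau)."""
    weight = {a: math.exp(u / tau) for a, u in utility.items()}
    total = sum(weight.values())
    r = random.random()  # r in [0, 1)
    cum_prob = 0.0
    for a, w in weight.items():
        cum_prob += w / total
        if r < cum_prob:
            return a
    return a  # guard against floating-point round-off when cum_prob ~ 1
```

As τ grows, the weights flatten toward a uniform choice; as τ shrinks toward 0, the selection concentrates on the highest-utility action.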

Evaluate the ε-greedy and softmax strategies for hill descent generation of Rook Jumping Mazes. Evaluate three different values of ε for the ε-greedy strategy and three different values of τ for the softmax strategy, running 10000 learning iterations each. Print tables of the data collected for each 10000 iterations of learning. (See the transcript below for an example.)

Possible background readings include:

- Russell and Norvig. Artificial Intelligence: A Modern Approach, 3rd edition, Chapter 21 through section 21.3.1 (exploration). Also, read the p. 841 sidebar on "Exploration and Bandits"
- Sutton and Barto, Reinforcement Learning: An Introduction, Chapter 2 (Evaluative Feedback), esp. through 2.3 (Softmax Action Selection)

**Sample Transcript:**

```
e-greedy(epsilon=0.100000):
    Iters      Avg. U         Total U  Count
      100   29.093091     6400.480000    220
      200   55.150983     9541.120000    173
      400  133.123432    49655.040000    373
-->   800  154.603927  1368399.360000   8851
     1600  123.704698    18432.000000    149
     3200   71.111111    16640.000000    234

e-greedy(epsilon=0.500000):
    Iters      Avg. U         Total U  Count
      100   29.425384    24158.240000    821
      200   62.412564    50866.240000    815
      400  125.300579   108134.400000    863
-->   800  159.189108   933007.360000   5861
     1600  130.904069   106163.200000    811
     3200   71.226055    59046.400000    829

e-greedy(epsilon=1.000000):
    Iters      Avg. U         Total U  Count
      100   25.534548    42617.160000   1669
      200   67.629684   113550.240000   1679
      400  123.813365   211225.600000   1706
-->   800  157.323225   267292.160000   1699
     1600  119.995910   197153.280000   1643
     3200   70.797406   113559.040000   1604

softmax(tau=1.000000):
    Iters      Avg. U         Total U  Count
      100    0.000000        0.000000      0
      200    0.000000        0.000000      0
      400    0.000000        0.000000      0
-->   800  155.814144  1558141.440000  10000
     1600    0.000000        0.000000      0
     3200    0.000000        0.000000      0

softmax(tau=50.000000):
    Iters      Avg. U         Total U  Count
      100   34.719614    14408.640000    415
      200   68.736124    35467.840000    516
      400  126.309737   312111.360000   2471
-->   800  159.523264   633466.880000   3971
     1600  121.644326   228326.400000   1877
     3200   68.771840    51578.880000    750

softmax(tau=100.000000):
    Iters      Avg. U         Total U  Count
      100   31.231344    24391.680000    781
      200   57.177993    61523.520000   1076
      400  129.181492   275285.760000   2131
-->   800  158.914955   461171.200000   2902
     1600  120.441905   227635.200000   1890
     3200   74.624000    91041.280000   1220
```

Questions:

- Empirically, how does varying ε vary the behavior of ε-greedy strategy? What happens for extreme high/low values of ε?
- Empirically, how does varying τ vary the behavior of softmax strategy? What happens for extreme high/low values of τ?
- Which strategy do you prefer and why?
- In the transcript above, we got lucky for the τ = 1 case. As long as an action yields only positive-utility experiences, the softmax strategy will tend not to deviate from it for low τ; similar behavior occurs for the ε-greedy strategy with low ε. One approach to this problem is to begin the search with each action having a prior, false experience of positive utility in order to encourage initial exploration. Experiment with this approach and observe the results. How should one choose the initial positive utility value?
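One way to sketch the prior-experience idea (all names here are our own; `prior_u` is the fake initial utility whose choice the question asks you to investigate):

```python
def make_tallies(actions, prior_u):
    """Seed each action with one fake experience of utility prior_u,
    so every arm initially looks worth trying."""
    total = {a: prior_u for a in actions}
    count = {a: 1 for a in actions}
    return total, count

def record(total, count, a, u):
    """Fold one real experience of utility u into action a's tallies."""
    total[a] += u
    count[a] += 1

def average(total, count, a):
    return total[a] / count[a]

total, count = make_tallies([100, 200, 400, 800, 1600, 3200], 200.0)
record(total, count, 100, 29.1)    # one real, mediocre experience
print(average(total, count, 100))  # (200 + 29.1) / 2 = 114.55
```

If `prior_u` exceeds any plausible real utility, each arm stays attractive until it has actually been sampled, after which real experiences quickly dilute the fake one.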