An Introduction to Monte Carlo Techniques in Artificial Intelligence - Part I

Todd W. Neller, Gettysburg College Department of Computer Science

Monte Carlo (MC) techniques have become important and pervasive in the work of AI practitioners. In general, they provide a relatively easy means of providing deep understanding of complex systems as long as important events are not infrequent.

Consider this non-exhaustive list of areas of work with relevant MC techniques:

General: Monte Carlo simulation for probabilistic estimation
Machine Learning: Monte Carlo reinforcement learning
Uncertain Reasoning: Bayesian network reasoning with the Markov Chain Monte Carlo method
Robotics: Monte Carlo localization
Search: Monte Carlo tree search
Game Theory: MC regret-based techniques

Herein, we will recommend readings and provide novel exercises to support the experiential learning of the first two of these uses of MC techniques: MC simulation, and MC reinforcement learning. Through these exercises, we hope that the student gains a deeper understanding of the strengths and weaknesses of these techniques, is able to add such techniques to their problem-solving repertoire, and can discern appropriate applications.

Summary	An Introduction to Monte Carlo (MC) Techniques in Artificial Intelligence - Part I: Learn the strengths and weaknesses of MC simulation and MC reinforcement learning (RL).
Topics	Monte Carlo (MC) simulation, MC reinforcement learning and related concepts (e.g., Markov Decision Processes (MDPs), states, actions, rewards, returns, state- and action- value estimations, policy evaluation, on-/off-policy control)
Audience	Introductory Artificial Intellegence (AI) students
Difficulty	Suitable for an upper-level undergraduate or graduate course with a focus on AI programming
Strengths	Provides a rich collection of tested, interesting, and fun Monte Carlo exercises with an emphasis on game AI applications. Exercises follow a progression of understanding and development complexity that provides an entry point to more complex MC techniques. Solutions to main exercises are available upon request to instructors.
Weaknesses	Solution code is currently only available in the Java programming language. (Solution ports to other languages are invited!)
Dependencies	Students must be mathematically literate and able to comprehend pseudocode with mathematical notation. Students must also be competent programmers in a fast, high-level programming language.
Variants	Given that there are many simple and complex games that have yet to receive research attention (see "Further Exercises" sections), there are many possible alternative exercises, follow-on projects, and parallel examples that might be presented by the instructor. Suggestions are given for further exercises in each section. The core exercises of MC RL are parameterized.

Monte Carlo Simulation

Prerequisite Reading

Skim the Wikipedia article on Monte Carlo methods
Optional, but highly recommended for further exercises: Digital Dice: Computational Solutions to Practical Probability Problems by Paul J. Nahin

As can be seen from the Wikipedia article on Monte Carlo methods, there is some disagreement on what defines a "Monte Carlo simulation". Here, we define a Monte Carlo (MC) simulation as a large representative sampling of stochastic system trajectories (sometimes called "stochastic simulations"), where one gains insight to the system through collection and analysis of trajectory data. To demonstrate, consider the following problem: Let us suppose that a disk-head is reading uniformly-distributed random requests in first-come-first-served order. If the distance over which the requests are distributed is one unit distance, what is the average distance the disk-head will travel per request?

Approaching this with MC simulation, we can quickly program a large number of simulations and compute the average distance traveled:

		double distance = 0;
		int numReads = 100000000;
		Random random = new Random(0);
		double position = random.nextDouble();
		for (int i = 0; i < numReads; i++) {
			double newPosition = random.nextDouble();
			distance += Math.abs(newPosition - position);
			position = newPosition;
		}
		System.out.println("Average distance per read: " + (distance / numReads));

These 10 lines of Java code quickly returns this result:

Average distance per read: 0.3333392024169018

The correct value is 1/3. This value can also be computed analytically by expressing a function f(x) for average distance for the next read from a position x, and then integrating f(x) over the interval [0, 1]. However, let us next consider the case where we would wish to compare different disk-head scheduling algorithms. Whereas our MC simulation program would increase only moderately in complexity, we would in many circumstances find great difficulty expressing and/or solving an equation for more complex scheduling algorithms. Add some greedy direction and/or locality heuristics with buffering of disk requests and analytical solutions fly out of reach. Once process dynamics move beyond simple examples, simulation becomes an essential tool for understanding.

In his book Digital Dice: Computational Solutions to Practical Probability Problems, Paul J. Nihin expressed his motivational philosophical theme as follows:

No matter how smart you are, there will always be probabilistic problems that are too hard for you to solve analytically.
Despite (1), if you know a good scientific programming language that incorporates a random number generator (and if it is good it will), you may still be able to get numerical answers to those "too hard" problems.

For the following MC simulation exercises, we recommend the following approach:

Program a single simulation with enough printed output to convince you of the correctness of your model.
Add your statistical measure of interest and test its correctness as well.
Remove printing from the code.
Wrap the code in a loop of many iterations.
Add printing to summarize the analysis of the collected statistical data.

Exercises

Note: In each exercise (except the "The Limits of Monte Carlo Simulation"), numeric answers (but not online code) are available via the given links to check your work.

Yahtzee in 3 Rolls: Read the rules of the category dice game Yahtzee (see also here). Let us assume that one attempts to score a Yahtzee by keeping dice of any single value occurring most frequently and rerolling others. Use MC simulation to estimate the probability of getting a Yahtzee on a single 3-roll turn.

Monte Carlo Estimations in the Game of Pig:

a) Estimate the probability of different turn outcomes given a "hold at 20" play policy.
b) Estimate the expected number of turns for a single player to reach the goal score given a "hold at 20" play policy.
c) Estimate the first player advantage in win probability given two "hold at 20" players.

Risk Attack Rollouts: In a territory attack of the board game Risk, let us assume an attacker with a armies attacks a defender with d armies and does not call off the attack until either the defender is eliminated (reduced to 0 armies) or the attacker is reduced to 1 army. Further, assume that both player roll the maximum number of dice allowable. Following attacking rules, use MC simulation to estimate the probability of eliminating the defender for each (a, d) pair where 2 <= a <= 10 and 1 <= d <= 10. Note: In the low-precision table "Probabilities of attacker winning a whole battle in Risk", the "number of attacking armies" is 1 less than the total number of armies a.

The Limits of Monte Carlo Simulation: One of the weaknesses of MC simulation is the occurrence of large relative errors when making estimations that are strongly dependent on infrequent events. In this exercise, you will explore the reduction of relative error when estimating the probability of events of various probabilities as one increases the number of simulations.

Let us assume we are interested in the probability of rolling all 1's with n standard dice where 1 <= n <= 10. We begin by computing the exact probability of rolling n 1's for each n: (1/6)ⁿ

Next, we perform one billion dice rolling simulations. If the first n of these are equal to 1, tally the occurence. Report both the estimated probability and the relative error of this estimation for each n after 10 iterations, 100 iterations, 1,000 iterations, etc. up to 1,000,000,000 iterations. For a suggested output format, see this sample transcript with MC data replaced by "#.######e-##". (We used Java's printf output with "%e" for scientific e notation.)

How does relative error appear to decrease as the number of simulations increases? For which pairs of number of iterations and n do you tend to observe a count of 0 and the maximum relative error?

Further exercises:

Pick a game! It's easy to devise many interesting MC simulation exercises based on card and dice games. Pick a simple game, devise a simple play policy, and use MC simulation to learn about expected outcomes. This approach has been used for many games from determining expected payouts of Monopoly properties to relative hand strengths in Poker.
Paul Nihin's book Digital Dice: Computational Solutions to Practical Probability Problems is an excellent source of interesting MC simulation exercises.

Monte Carlo Reinforcement Learning

Prerequisite Reading

Markov Decision Processes: Learn the following terms from a number of good sources.
- Terms: states (s, set of states S), actions (a, set of actions A), rewards (r, R^a_{s, s'} reward from action a in state s transitioning to state s'), return, transition function (P^a_{s,
  s'} probability of action a in state s transitioning to state s'), policy π(s) (π^* optimal policy), state-value function V(s) (V^π state-value function given π, V^* optimal state-value function), action-value function Q(s, a) (Q^π action-value function given π, Q^* optimal action-value function), Markov Decision Process (MDP)
- Sources (choose one):
  - Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Chapter 3
  - Kaelbling, L.P., Littman, M.L., and Moore, A.W. Reinforcement learning: a survey, sections 1 and 3.1
    - Note that the MDP formulation here has the reward as a function of state and action and unlike Sutton and Barto do not additionally parameterize it in terms of the next state s'
    - State transition function T relates to Sutton and Barto's P^a_{s,
      s'}.

.Russell, S. and Norvig, P. Artificial Intelligence: a modern approach, 3rd ed., section 17.1
- Returns (sums of rewards) are referred to here as utilities and notated as U.

Monte Carlo Reinforcement Learning:
- Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Chapter 5

With MC Reinforcement Learning (RL) methods, one uses MC simulation of a MDP in order to estimate state values or (state, action)-pair values given control policies (i.e. action selection in a given state), or optimize control policies on- or off-policy. In these exercises, we will sample a few such approaches to provide a basis for understanding.

Exercises

In jeopardy approach dice games, the object is to most closely approach a goal total without exceeding it. These include Macao, Twenty-One (also known as Vingt-et-Un, Pontoon, Blackjack), Sixteen (also known as Golden Sixteen), Octo, Poker Hunt, Thirty-Six, and Altars. (See here pp. 43-44 for references.) Here we will describe a simple, parameterizable, 2-player dice game we have created as a prototype for such games:

The dice game "Approach n" is played with 2 players and a single standard 6-sided die (d6). The goal is to approach a total of n without exceeding it. The first player roll a die until they either (1) "hold" (i.e. end their turn) with a roll sum less than or equal to n, or (2) exceed n and lose. If the first player holds at exactly n, the first player wins immediately. If the first player holds with less than n, the second player must roll until the second player's roll sum (1) exceeds the first player's roll sum without exceeding n and wins, or (2) exceeds n and loses. Note that the first player is the only player with a choice of play policy. For n >= 10, the game is nearly fair, i.e. can be won approximately half of the time with optimal decisions.

In the following exercises, we will evolve our approach from the previous MC simulation to comparative MC simulation, to policy evaluation with full state-value estimation, to various forms of MC control where policy is learned through the reinforcement of MC simulations:

Comparative MC simulation
First-visit MC method for policy evaluation (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Section 5.1)
MC control with exploring starts (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Section 5.3)
Epsilon-soft on-policy MC control (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Section 5.4)
Off-policy MC control (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Section 5.6)

For each of the following exercises, you will focus on the problem of optimal play for Approach n for some specific value of n > 10 (perhaps assigned by your instructor). Solutions will be given for n = 10, so that if you parameterize your code according to n, you can both check the correctness of your approach and enjoy a process of discovery as well.

Comparative MC simulation:

When one can (uncommonly) discern that the optimal policy for a control problem is one of a few, the simplest approach is to apply MC simulation to each policy and see which has the optimal outcome. For Approach n, the optimal policy will take the form "hold at s", that is, the first player should hold when the roll sum is equal to or exceeds s. For each holding sum s in the range [n - 5, n], perform 1,000,000 MC simulations to estimate and print the probability of winning. Which holding sum s maximizes the probability of winning? For n = 10, our computed results are:

  s: estimated player 1 win probability
 5: 0.268023
 6: 0.389456
 7: 0.477032
 8: 0.492804
 9: 0.441647
10: 0.288904
For n = 10, player 1 should hold at 8.

Note: Since this simple technique works well for Approach n, one may question the usefulness of the more complex techniques that follow. Although Approach n is chosen for the simplicity of its implementation, problems that are simple may yield complex optimal control policies. These exercises may be repeated with slightly more complex problems listed in the extra exercises where there is not a clear small selection of candidate optimal policies. Such a candidate selection is a limiting requirement for the application of this technique.

First-visit MC method for policy evaluation (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Section 5.1):

For the optimal s computed in the previous exercise, print the estimated probability of winning at [and occurrence count of] each possible player 1 roll sum in the game using the first-visit MC method in Figure 5.1 of Section 5.1. In order to estimate the probability of winning, the return R will be 1 for a player 1 win and 0 for a player 1 loss. Since states are not repeated, each player 1 roll sum in the sequence of simulated roll sums will be the first occurrence in that episode (i.e. simulation). For greater memory and time efficiency, we recommend that, rather than keeping a list of returns to compute the average return, one instead accumulates a sum of returns and a count of returns for each state. Generate at least 1,000,000 episodes. For n = 10 and s = 8, our computed results are:

sum: estimated player 1 win probability [visit count]
  0: 0.492756 [1000000]
  1: 0.474970 [166701]
  2: 0.475012 [194557]
  3: 0.509303 [227448]
  4: 0.579096 [265227]
  5: 0.496524 [307802]
  6: 0.423126 [360195]
  7: 0.364895 [253750]
  8: 0.475646 [268658]
  9: 0.711896 [235484]
 10: 1.000000 [197330]

Note: Since the second player has no real choice for their rollout, one could imagine the first player performing this player 2 rollout with no difference in outcome. Thus, we could treat each moment after a player 2 roll as a state with a single dictated roll/hold action in order to gain insight to expected outcomes throughout that rollout. This is left as an option to the student here and in the exercises to follow. We will omit such states for brevity.

MC control with exploring starts (MCES) (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Section 5.3):

In this exercise, we now let go of our assumption that we have a policy or policies to evaluate. Implement Monte Carlo ES from Figure 5.4 of Section 5.3 with the following design decisions: Let the initial probability estimate of each state-action pair, Q(s,a), be a random probability. Let the initial policy be a random choice of roll/hold. Each episode will start with a random roll sum in the range [0, n] and a random action. Generate at least 1,000,000 episodes. Print the table of Q(s,a) values that results. For n = 10, our computed results are:

           Q(s, a)
sum:   roll     hold   [action]
  0: 0.000000 0.493921 [roll]
  1: 0.000000 0.477271 [roll]
  2: 0.000000 0.479286 [roll]
  3: 0.000000 0.505797 [roll]
  4: 0.000000 0.580803 [roll]
  5: 0.051270 0.498027 [roll]
  6: 0.170403 0.427388 [roll]
  7: 0.297536 0.365115 [roll]
  8: 0.478196 0.285432 [hold]
  9: 0.711051 0.164235 [hold]
 10: 1.000000 0.000000 [hold]

Epsilon-soft on-policy MC control (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Section 5.4):

Sometimes, exploring starts (ES) are not practical (as in the case with on-line robotics learning) or undesirable (as in the case where a large proportion of states occur with very low probability, or are made unreachable by reasonable control that is easily discerned). One way of eliminating the need for ES is to restrict the policy such that it is always exploring with some probability. Here, we modify the MC control with ES such that episodes are generated without exploring starts and the policy is always epsilon-soft. An epsilon-soft policy chooses one action with probability (1 - epsilon + epsilon/|A(s)|) and all other actions with probability (epsilon/|A(s)|), where |A(s)| is the number of actions in state s. This is equivalent to saying that the policy is to choose a random action with some small probability epsilon, and another single (dominant) action otherwise. For our problem and epsilon = .1, this means that the dominant of the 2 actions will be chosen with probability .95 with the other chosen with probability .05.

Implement epsilon-soft on-policy control for Approach n according to Figure 5.6 of . Use epsilon = .1, initialize Q(s,a), be a random probability, and initialize the epsilon-soft policy to favor a random action. In your results, print only the action that maximizes Q(s,a). Generate at least 1,000,000 episodes. For n = 10, our computed results are:

           Q(s, a)
sum:   hold     roll   [action]
  0: 0.000000 0.442053 [roll]
  1: 0.000000 0.431764 [roll]
  2: 0.000000 0.439501 [roll]
  3: 0.000000 0.479471 [roll]
  4: 0.000000 0.549074 [roll]
  5: 0.052468 0.472950 [roll]
  6: 0.176265 0.408975 [roll]
  7: 0.302062 0.350095 [roll]
  8: 0.476712 0.273652 [hold]
  9: 0.711675 0.160527 [hold]
 10: 1.000000 0.000000 [hold]

Compare these results to the previous results. In particular, how does it differ from MC control with ES, and why?

Note: Implementation may be simplified by internally representing the policy as deterministic, but applying it as epsilon-greedy.

Off-policy MC control (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: an introduction, Section 5.6):

It is possible to estimate the value of one policy while generating episodes with a different behavior policy. For example, we can learn an optimal policy for Approach n, while using an epsilon-soft policy to generate episodes. The trick is that we can only update Q using the tails of episodes, starting after the last non-greedy action. Rather than computing simple return totals and visit counts, we effectively weight returns and visits by the inverse probability of with the latter of the start state or the last occurence of an action inconsistent with the current estimation policy.

Implement the off-policy MC control algorithm of Figure 5.7 of . Use the same epsilon-greedy policy as your epsilon-soft behavior policy continuing to use epsilon = .1. Use epsilon = .1, initialize Q(s,a), be a random probability, and initialize your deterministic evaluation policy to favor a random action. Your epsilon-soft behavior policy will always be the epsilon-greedy version of your current evaluation policy (which was randomly initialized). Generate at least 1,000,000 episodes. For n = 10, our computed results are:

           Q(s, a)
sum:   hold     roll   [action]
  0: 0.000000 0.493405 [roll]
  1: 0.000000 0.475422 [roll]
  2: 0.000000 0.475861 [roll]
  3: 0.000000 0.509050 [roll]
  4: 0.000000 0.580554 [roll]
  5: 0.053596 0.497167 [roll]
  6: 0.168275 0.425661 [roll]
  7: 0.300287 0.363330 [roll]
  8: 0.477287 0.290901 [hold]
  9: 0.711453 0.161685 [hold]
 10: 1.000000 0.000000 [hold]

Further exercises:

Hog Solitaire: The objective of the game Hog Solitaire is to reach the goal score of 100 in the minimum number of turns. Each turn, a player simultaneaously rolls a number of dice in the range [1, 10]. If any of the dice show a value of 1 ("hog"), the player scores no points for that turn. However, if no 1s are rolled, the player scores the sum of the dice rolled. (The game of Hog is similar to Pig where a player commits to the number of rolls they will take in a single turn.) From the perspective of MC reinforcement learning the state is simply the player score (i), and the actions are the number of dice to roll in the range [1, 10]. For each turn completed, let there be a reward (a cost in this case) r = -1. Thus, the return R would be the negated number of turns remaining in the solitaire game episode. Apply the above techniques of MC RL to (1) evaluate a strategy that you devise, and/or (2) compute an optimal (or approximately optimal) strategy. What is your policy's expected number of turns? Additional exercises:

The expected average score gain is the same for rolling 5 or 6 dice. (One can calculate these values as a separate exercise.) Does this imply that your policy should be indifferent between rolling 5 or 6 dice in all states? Why or why not?
With a score of 98 or 99, where any non-1 roll will end the game, it is clearly optimal to roll 1. What is the Q value for each of these states paired with the action of rolling 1 die?
With a score of 97, is the expected number of turns rolling 1 die better than, worse than, or equivalent to the expected number of turns for rolling 2 dice? That is, is Q(97,1) > Q(97,2)? Or Q(97,1) < Q(97,2)? Or Q(97,1) = Q(97,2)?

Pig Solitaire: Read the rules to the dice game Pig. Now suppose we define a solitaire game of Pig where a single player seeks to reach the goal score of 100 in the minimum number of turns. From the perspective of MC reinforcement learning, the state is the player score (i) and the current turn total (k). Actions are roll or hold. For each turn completed, let there be a reward (a cost in this case) r = -1. Thus, the return R would be the negated number of turns remaining in the solitaire game episode. Apply the above techniques of MC RL to (1) evaluate a strategy that you devise, and/or (2) compute an optimal (or approximately optimal) strategy. (Given the lengths of Pig-type game episodes, we recommend an on-policy method.) Note that your expected reward in the initial game state is the negated expected number of turns for your policy. What is your policy's expected number of turns? (Note: Convergence is slow for this problem given the long action sequences. Applying MCES to this problem requires < 1.5x10^8 iterations to come within 1% of optimal play expectations.)

Yahtzee or Chance: Suppose a Yahtzee player has only the following two scoring categories remaining: Yahtzee and Chance. Yahtzee (5 of a kind) scores 50 points. Chance scores the sum of the dice. Compute the policy maximizing the expected score gain for such a single 3-roll turn. What is the expected score gain for that turn with your policy?
- Variation: Solve the same problem for 3 (rather than 5) dice with a 3 of a kind scoring 30 points (rather than 50). What is the expected score gain for that turn with your policy? Does there exist a situation where one should reroll a 6 on one's third roll?
- Note: Since the number of rolls is always advancing with each action, Yahtzee has no repeated states and such problems are better solved using a dynamic-programming pattern. One can compute all action expectations and form policy decisions backwards through a topologically sorted state graph.
Sutton and Barto text exercises: Reproduce Sutton and Barto's Blackjack results from Example 5.3 of Section 5.3, or program racetrack control of Exercise 5.4 in Section 5.6. Other exercise problem domains throughout the book provide interesting avenues for discovery and learning as well.