An Introduction to Monte Carlo Techniques in Artificial Intelligence - Part I

Todd W. Neller, Gettysburg College Department of Computer Science

Monte Carlo (MC) techniques have become important and pervasive in the work of AI practitioners.  In general, they offer a relatively easy means of gaining deep insight into complex systems, as long as the important events are not infrequent.

Consider this non-exhaustive list of areas of work with relevant MC techniques:

Herein, we will recommend readings and provide novel exercises to support the experiential learning of the first two of these uses of MC techniques: MC simulation, and MC reinforcement learning.  Through these exercises, we hope that the student gains a deeper understanding of the strengths and weaknesses of these techniques, is able to add such techniques to their problem-solving repertoire, and can discern appropriate applications.

Summary An Introduction to Monte Carlo (MC) Techniques in Artificial Intelligence - Part I: Learn the strengths and weaknesses of MC simulation and MC reinforcement learning (RL).
Topics Monte Carlo (MC) simulation, MC reinforcement learning and related concepts (e.g., Markov Decision Processes (MDPs), states, actions, rewards, returns, state- and action- value estimations, policy evaluation, on-/off-policy control)
Audience Introductory Artificial Intelligence (AI) students
Difficulty Suitable for an upper-level undergraduate or graduate course with a focus on AI programming
Strengths Provides a rich collection of tested, interesting, and fun Monte Carlo exercises with an emphasis on game AI applications.  Exercises follow a progression of understanding and development complexity that provides an entry point to more complex MC techniques.  Solutions to main exercises are available upon request to instructors.
Weaknesses Solution code is currently only available in the Java programming language.  (Solution ports to other languages are invited!)
Dependencies Students must be mathematically literate and able to comprehend pseudocode with mathematical notation. Students must also be competent programmers in a fast, high-level programming language.
Variants Given that there are many simple and complex games that have yet to receive research attention (see "Further Exercises" sections), there are many possible alternative exercises, follow-on projects, and parallel examples that might be presented by the instructor.  Suggestions are given for further exercises in each section.  The core exercises of MC RL are parameterized.

Monte Carlo Simulation

Prerequisite Reading

As can be seen from the Wikipedia article on Monte Carlo methods, there is some disagreement on what defines a "Monte Carlo simulation".  Here, we define a Monte Carlo (MC) simulation as a large representative sampling of stochastic system trajectories (sometimes called "stochastic simulations"), where one gains insight into the system through collection and analysis of trajectory data.  To demonstrate, consider the following problem: Let us suppose that a disk-head is reading uniformly-distributed random requests in first-come-first-served order.  If the distance over which the requests are distributed is one unit distance, what is the average distance the disk-head will travel per request?

Approaching this with MC simulation, we can quickly program a large number of simulations and compute the average distance traveled:

		double distance = 0;
		int numReads = 100000000;
		Random random = new Random(0);
		double position = random.nextDouble();
		for (int i = 0; i < numReads; i++) {
			double newPosition = random.nextDouble();
			distance += Math.abs(newPosition - position);
			position = newPosition;
		}
		System.out.println("Average distance per read: " + (distance / numReads));

These 10 lines of Java code quickly return this result:

Average distance per read: 0.3333392024169018

The correct value is 1/3.  This value can also be computed analytically by expressing a function f(x) for average distance for the next read from a position x, and then integrating f(x) over the interval [0, 1].  However, let us next consider the case where we would wish to compare different disk-head scheduling algorithms.  Whereas our MC simulation program would increase only moderately in complexity, we would in many circumstances find great difficulty expressing and/or solving an equation for more complex scheduling algorithms.  Add some greedy direction and/or locality heuristics with buffering of disk requests and analytical solutions fly out of reach.  Once process dynamics move beyond simple examples, simulation becomes an essential tool for understanding.
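The analytical computation alluded to above can be carried out directly.  Writing f(x) for the average distance of the next read from position x:

```latex
f(x) = \int_0^1 |y - x|\,dy
     = \int_0^x (x - y)\,dy + \int_x^1 (y - x)\,dy
     = \frac{x^2}{2} + \frac{(1-x)^2}{2},
\qquad
\int_0^1 f(x)\,dx = \frac{1}{6} + \frac{1}{6} = \frac{1}{3},
```

in agreement with the simulated estimate.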

In his book Digital Dice: Computational Solutions to Practical Probability Problems, Paul J. Nahin expressed his motivational philosophical theme as follows:

  1. No matter how smart you are, there will always be probabilistic problems that are too hard for you to solve analytically.
  2. Despite (1), if you know a good scientific programming language that incorporates a random number generator (and if it is good it will), you may still be able to get numerical answers to those "too hard" problems.

For the following MC simulation exercises, we recommend the following approach:


Note: In each exercise (except "The Limits of Monte Carlo Simulation"), numeric answers (but not online code) are available via the given links to check your work.

Yahtzee in 3 Rolls: Read the rules of the category dice game Yahtzee (see also here).  Let us assume that one attempts to score a Yahtzee by keeping dice of any single value occurring most frequently and rerolling others.  Use MC simulation to estimate the probability of getting a Yahtzee on a single 3-roll turn.
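To make the recommended structure concrete, a simulation for this exercise might be sketched as follows.  This is only one possible shape for a solution (not the provided instructor solution); the class and method names are our own, and ties among most-frequent values are broken arbitrarily:

```java
import java.util.Random;

public class YahtzeeSim {
	/** Estimate P(Yahtzee in 3 rolls), keeping the most frequent die value. */
	public static double estimate(int numTrials, long seed) {
		Random random = new Random(seed);
		int successes = 0;
		for (int trial = 0; trial < numTrials; trial++) {
			int[] dice = new int[5];
			for (int i = 0; i < 5; i++)
				dice[i] = random.nextInt(6) + 1;
			for (int roll = 2; roll <= 3; roll++) {
				// Find a most frequently occurring value ...
				int[] counts = new int[7];
				for (int die : dice)
					counts[die]++;
				int best = 1;
				for (int value = 2; value <= 6; value++)
					if (counts[value] > counts[best]) best = value;
				if (counts[best] == 5) break; // Yahtzee already rolled
				// ... and reroll all dice not showing that value.
				for (int i = 0; i < 5; i++)
					if (dice[i] != best) dice[i] = random.nextInt(6) + 1;
			}
			boolean yahtzee = true;
			for (int die : dice)
				if (die != dice[0]) yahtzee = false;
			if (yahtzee) successes++;
		}
		return (double) successes / numTrials;
	}

	public static void main(String[] args) {
		System.out.println("P(Yahtzee in 3 rolls) ~ " + estimate(1000000, 0));
	}
}
```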

Monte Carlo Estimations in the Game of Pig:

a) Estimate the probability of different turn outcomes given a "hold at 20" play policy.
b) Estimate the expected number of turns for a single player to reach the goal score given a "hold at 20" play policy.
c) Estimate the first player advantage in win probability given two "hold at 20" players.
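For part (a), recall that in Pig a turn consists of repeated die rolls: a roll of 1 ("pig") ends the turn with a score of 0, while any other roll adds to the turn total, which the player banks upon holding.  A turn simulation under the "hold at 20" policy might be sketched as follows (class and method names are our own):

```java
import java.util.Random;

public class PigTurnSim {
	/** Estimate the probability of each "hold at 20" turn outcome (0 or 20-25). */
	public static double[] outcomeProbabilities(int numTurns, long seed) {
		Random random = new Random(seed);
		int[] counts = new int[26]; // turn outcome score -> count
		for (int turn = 0; turn < numTurns; turn++) {
			int turnTotal = 0;
			while (turnTotal < 20) {
				int roll = random.nextInt(6) + 1;
				if (roll == 1) { // "pig": the turn total is lost
					turnTotal = 0;
					break;
				}
				turnTotal += roll;
			}
			counts[turnTotal]++;
		}
		double[] probabilities = new double[26];
		for (int score = 0; score < counts.length; score++)
			probabilities[score] = (double) counts[score] / numTurns;
		return probabilities;
	}

	public static void main(String[] args) {
		double[] p = outcomeProbabilities(1000000, 0);
		System.out.printf("P(0) = %f%n", p[0]);
		for (int score = 20; score <= 25; score++)
			System.out.printf("P(%d) = %f%n", score, p[score]);
	}
}
```

Parts (b) and (c) follow by wrapping this turn simulation in a loop over turns (and, for (c), over two alternating players) until the goal score is reached.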

Risk Attack Rollouts:  In a territory attack of the board game Risk, let us assume an attacker with a armies attacks a defender with d armies and does not call off the attack until either the defender is eliminated (reduced to 0 armies) or the attacker is reduced to 1 army.  Further, assume that both players roll the maximum number of dice allowable.  Following attacking rules, use MC simulation to estimate the probability of eliminating the defender for each (a, d) pair where 2 <= a <= 10 and 1 <= d <= 10.  Note: In the low-precision table "Probabilities of attacker winning a whole battle in Risk", the "number of attacking armies" is 1 less than the total number of armies a.
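A battle simulation for this exercise might be sketched as follows, assuming the standard dice rules (attacker rolls up to 3 dice with at least 2 armies, defender rolls up to 2 dice, sorted dice are compared pairwise, and ties favor the defender); class and method names are our own:

```java
import java.util.Arrays;
import java.util.Random;

public class RiskSim {
	/** Estimate P(attacker eliminates defender) for a attacker vs. d defender armies. */
	public static double winProbability(int a, int d, int numBattles, Random random) {
		int wins = 0;
		for (int battle = 0; battle < numBattles; battle++) {
			int attackers = a, defenders = d;
			while (attackers > 1 && defenders > 0) {
				int[] attackDice = rollSorted(Math.min(3, attackers - 1), random);
				int[] defendDice = rollSorted(Math.min(2, defenders), random);
				// Compare highest dice pairwise; ties favor the defender.
				for (int i = 0; i < Math.min(attackDice.length, defendDice.length); i++)
					if (attackDice[i] > defendDice[i]) defenders--;
					else attackers--;
			}
			if (defenders == 0) wins++;
		}
		return (double) wins / numBattles;
	}

	/** Roll numDice d6 and return them sorted in descending order. */
	private static int[] rollSorted(int numDice, Random random) {
		int[] dice = new int[numDice];
		for (int i = 0; i < numDice; i++)
			dice[i] = random.nextInt(6) + 1;
		Arrays.sort(dice);
		for (int i = 0; i < numDice / 2; i++) { // reverse to descending order
			int tmp = dice[i]; dice[i] = dice[numDice - 1 - i]; dice[numDice - 1 - i] = tmp;
		}
		return dice;
	}

	public static void main(String[] args) {
		Random random = new Random(0);
		for (int a = 2; a <= 10; a++) {
			for (int d = 1; d <= 10; d++)
				System.out.printf("%.3f ", winProbability(a, d, 100000, random));
			System.out.println();
		}
	}
}
```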

The Limits of Monte Carlo Simulation: One of the weaknesses of MC simulation is the occurrence of large relative errors when making estimations that are strongly dependent on infrequent events. In this exercise, you will explore the reduction of relative error when estimating the probability of events of various probabilities as one increases the number of simulations.

Let us assume we are interested in the probability of rolling all 1's with n standard dice where 1 <= n <= 10.  We begin by computing the exact probability of rolling n 1's for each n: (1/6)^n

Next, we perform one billion dice rolling simulations.  If the first n of these are equal to 1, tally the occurrence.  Report both the estimated probability and the relative error of this estimation for each n after 10 iterations, 100 iterations, 1,000 iterations, etc. up to 1,000,000,000 iterations.  For a suggested output format, see this sample transcript with MC data replaced by "#.######e-##".  (We used Java's printf output with "%e" for scientific e notation.)

How does relative error appear to decrease as the number of simulations increases?  For which pairs of number of iterations and n do you tend to observe a count of 0 and the maximum relative error?
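The core of such an experiment, for a single (n, iteration count) pair, might be sketched as follows (a simplified version of the full exercise; class and method names are our own):

```java
import java.util.Random;

public class RareEventSim {
	/** Estimate P(all n dice show 1) and report the relative error vs. the exact value. */
	public static double relativeError(int n, int numTrials, long seed) {
		Random random = new Random(seed);
		int count = 0;
		for (int trial = 0; trial < numTrials; trial++) {
			boolean allOnes = true;
			for (int i = 0; i < n; i++)
				if (random.nextInt(6) + 1 != 1) { allOnes = false; break; }
			if (allOnes) count++;
		}
		double exact = Math.pow(1.0 / 6.0, n);
		double estimate = (double) count / numTrials;
		return Math.abs(estimate - exact) / exact;
	}

	public static void main(String[] args) {
		for (int numTrials = 10; numTrials <= 1000000; numTrials *= 10)
			for (int n = 1; n <= 3; n++)
				System.out.printf("n=%d, trials=%d: relative error %e%n",
						n, numTrials, relativeError(n, numTrials, 0));
	}
}
```

Note that when the tally remains 0, the estimate is 0 and the relative error is exactly 1, which is the behavior this exercise asks you to observe for rare events.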

Further exercises:

Monte Carlo Reinforcement Learning

Prerequisite Reading

With MC Reinforcement Learning (RL) methods, one uses MC simulation of an MDP to estimate state values or (state, action)-pair values under given control policies (i.e., rules for selecting an action in a given state), or to optimize control policies on- or off-policy.  In these exercises, we will sample a few such approaches to provide a basis for understanding.


In jeopardy approach dice games, the object is to most closely approach a goal total without exceeding it. These include Macao, Twenty-One (also known as Vingt-et-Un, Pontoon, Blackjack), Sixteen (also known as Golden Sixteen), Octo, Poker Hunt, Thirty-Six, and Altars. (See here pp. 43-44 for references.)  Here we will describe a simple, parameterizable, 2-player dice game we have created as a prototype for such games:

The dice game "Approach n" is played with 2 players and a single standard 6-sided die (d6). The goal is to approach a total of n without exceeding it.  The first player rolls the die until they either (1) "hold" (i.e., end their turn) with a roll sum less than or equal to n, or (2) exceed n and lose.  If the first player holds at exactly n, the first player wins immediately. If the first player holds with less than n, the second player must roll until the second player's roll sum (1) exceeds the first player's roll sum without exceeding n, winning, or (2) exceeds n, losing.  Note that the first player is the only player with a choice of play policy.  For n >= 10, the game is nearly fair, i.e., it can be won by the first player approximately half of the time with optimal decisions.

In the following exercises, we will evolve our approach from the previous MC simulation to comparative MC simulation, to policy evaluation with full state-value estimation, to various forms of MC control where policy is learned through the reinforcement of MC simulations:

For each of the following exercises, you will focus on the problem of optimal play for Approach n for some specific value of n > 10 (perhaps assigned by your instructor).  Solutions will be given for n = 10, so that if you parameterize your code according to n, you can both check the correctness of your approach and enjoy a process of discovery as well.

Comparative MC simulation:

When one can (uncommonly) discern that the optimal policy for a control problem is one of a few, the simplest approach is to apply MC simulation to each policy and see which has the optimal outcome.  For Approach n, the optimal policy will take the form "hold at s", that is, the first player should hold when the roll sum is equal to or exceeds s.  For each holding sum s in the range [n - 5, n], perform 1,000,000 MC simulations to estimate and print the probability of winning. Which holding sum s maximizes the probability of winning?  For n = 10, our computed results are:

  s: estimated player 1 win probability
 5: 0.268023
 6: 0.389456
 7: 0.477032
 8: 0.492804
 9: 0.441647
10: 0.288904
For n = 10, player 1 should hold at 8.
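A comparative simulation producing a table of this form might be sketched as follows (class and method names are our own):

```java
import java.util.Random;

public class ApproachSim {
	/** Estimate player 1's win probability in Approach n with a "hold at s" policy. */
	public static double winProbability(int n, int s, int numGames, Random random) {
		int wins = 0;
		for (int game = 0; game < numGames; game++) {
			int sum1 = 0;
			while (sum1 < s) // player 1 rolls until reaching holding sum s
				sum1 += random.nextInt(6) + 1;
			if (sum1 > n)
				continue; // player 1 exceeded n and loses
			if (sum1 == n) { wins++; continue; } // player 1 wins immediately
			int sum2 = 0;
			while (sum2 <= sum1) // player 2 must roll until exceeding player 1's sum
				sum2 += random.nextInt(6) + 1;
			if (sum2 > n)
				wins++; // player 2 exceeded n; player 1 wins
		}
		return (double) wins / numGames;
	}

	public static void main(String[] args) {
		int n = 10;
		Random random = new Random(0);
		for (int s = n - 5; s <= n; s++)
			System.out.printf("%2d: %f%n", s, winProbability(n, s, 1000000, random));
	}
}
```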

Note: Since this simple technique works well for Approach n, one may question the usefulness of the more complex techniques that follow.  Although Approach n was chosen for the simplicity of its implementation, even simply stated problems may have complex optimal control policies. These exercises may be repeated with the slightly more complex problems listed in the further exercises, where there is not a clear, small selection of candidate optimal policies.  Such a candidate selection is a limiting requirement for the application of this technique.

First-visit MC method for policy evaluation (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction, Section 5.1):

For the optimal s computed in the previous exercise, print the estimated probability of winning at [and occurrence count of] each possible player 1 roll sum in the game using the first-visit MC method in Figure 5.1 of Section 5.1.  In order to estimate the probability of winning, the return R will be 1 for a player 1 win and 0 for a player 1 loss.  Since states are not repeated, each player 1 roll sum in the sequence of simulated roll sums will be the first occurrence in that episode (i.e. simulation).  For greater memory and time efficiency, we recommend that, rather than keeping a list of returns to compute the average return, one instead accumulates a sum of returns and a count of returns for each state.  Generate at least 1,000,000 episodes.  For n = 10 and s = 8, our computed results are:

sum: estimated player 1 win probability [visit count]
  0: 0.492756 [1000000]
  1: 0.474970 [166701]
  2: 0.475012 [194557]
  3: 0.509303 [227448]
  4: 0.579096 [265227]
  5: 0.496524 [307802]
  6: 0.423126 [360195]
  7: 0.364895 [253750]
  8: 0.475646 [268658]
  9: 0.711896 [235484]
 10: 1.000000 [197330]
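Following the sum-and-count recommendation above, one possible shape for this evaluation (class and variable names are our own) is:

```java
import java.util.Random;

public class FirstVisitMC {
	/** Estimate P(player 1 win) from each player 1 roll sum under a "hold at s" policy. */
	public static double[] evaluate(int n, int s, int numEpisodes, Random random) {
		double[] returnSums = new double[n + 1];
		int[] visitCounts = new int[n + 1];
		for (int episode = 0; episode < numEpisodes; episode++) {
			// Generate the episode, recording each player 1 roll sum visited.
			int[] states = new int[n + 1];
			int numStates = 0;
			int sum1 = 0;
			states[numStates++] = sum1;
			while (sum1 < s) {
				sum1 += random.nextInt(6) + 1;
				if (sum1 <= n)
					states[numStates++] = sum1;
			}
			// Determine the return R: 1 for a player 1 win, 0 for a loss.
			int reward;
			if (sum1 > n)
				reward = 0; // player 1 exceeded n
			else if (sum1 == n)
				reward = 1; // player 1 wins immediately
			else {
				int sum2 = 0;
				while (sum2 <= sum1)
					sum2 += random.nextInt(6) + 1;
				reward = (sum2 > n) ? 1 : 0;
			}
			// States never repeat, so every visit is a first visit.
			for (int i = 0; i < numStates; i++) {
				returnSums[states[i]] += reward;
				visitCounts[states[i]]++;
			}
		}
		double[] estimates = new double[n + 1];
		for (int sum = 0; sum <= n; sum++)
			if (visitCounts[sum] > 0)
				estimates[sum] = returnSums[sum] / visitCounts[sum];
		return estimates;
	}

	public static void main(String[] args) {
		double[] estimates = evaluate(10, 8, 1000000, new Random(0));
		for (int sum = 0; sum <= 10; sum++)
			System.out.printf("%3d: %f%n", sum, estimates[sum]);
	}
}
```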

Note: Since the second player has no real choice for their rollout, one could imagine the first player performing this player 2 rollout with no difference in outcome.  Thus, we could treat each moment after a player 2 roll as a state with a single dictated roll/hold action in order to gain insight into expected outcomes throughout that rollout.  This is left as an option to the student here and in the exercises to follow.  We will omit such states for brevity.

MC control with exploring starts (MCES) (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction, Section 5.3):

In this exercise, we now let go of our assumption that we have a policy or policies to evaluate.  Implement Monte Carlo ES from Figure 5.4 of Section 5.3 with the following design decisions:  Let the initial probability estimate of each state-action pair, Q(s,a), be a random probability.  Let the initial policy be a random choice of roll/hold.  Each episode will start with a random roll sum in the range [0, n] and a random action.  Generate at least 1,000,000 episodes.  Print the table of Q(s,a) values that results. For n = 10, our computed results are:

           Q(s, a)
sum:   hold     roll   [action]
  0: 0.000000 0.493921 [roll]
  1: 0.000000 0.477271 [roll]
  2: 0.000000 0.479286 [roll]
  3: 0.000000 0.505797 [roll]
  4: 0.000000 0.580803 [roll]
  5: 0.051270 0.498027 [roll]
  6: 0.170403 0.427388 [roll]
  7: 0.297536 0.365115 [roll]
  8: 0.478196 0.285432 [hold]
  9: 0.711051 0.164235 [hold]
 10: 1.000000 0.000000 [hold]
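One possible shape for such an implementation (class and variable names are our own, and every visit is treated as a first visit since states never repeat) is sketched below:

```java
import java.util.Random;

public class MonteCarloES {
	static final int ROLL = 0, HOLD = 1;

	/** MC control with exploring starts for Approach n; returns Q[sum][action]. */
	public static double[][] learn(int n, int numEpisodes, Random random) {
		double[][] qSums = new double[n + 1][2];
		int[][] counts = new int[n + 1][2];
		double[][] q = new double[n + 1][2];
		int[] policy = new int[n + 1];
		for (int sum = 0; sum <= n; sum++) { // random initial Q values and policy
			q[sum][ROLL] = random.nextDouble();
			q[sum][HOLD] = random.nextDouble();
			policy[sum] = random.nextInt(2);
		}
		for (int episode = 0; episode < numEpisodes; episode++) {
			// Exploring start: random state and random first action.
			int sum1 = random.nextInt(n + 1);
			int action = random.nextInt(2);
			int[] stateActions = new int[n + 1]; // encoded as 2 * sum + action
			int numStateActions = 0;
			// Play out player 1's turn, following the current policy after the start.
			while (true) {
				stateActions[numStateActions++] = 2 * sum1 + action;
				if (action == HOLD) break;
				sum1 += random.nextInt(6) + 1;
				if (sum1 > n) break; // player 1 exceeded n and loses
				action = policy[sum1];
			}
			// Compute the return: 1 for a player 1 win, 0 otherwise.
			int reward = 0;
			if (sum1 == n)
				reward = 1;
			else if (sum1 <= n) { // player 1 held below n; player 2 rollout
				int sum2 = 0;
				while (sum2 <= sum1)
					sum2 += random.nextInt(6) + 1;
				if (sum2 > n) reward = 1;
			}
			// Update Q toward average return; update the policy greedily.
			for (int i = 0; i < numStateActions; i++) {
				int sum = stateActions[i] / 2, a = stateActions[i] % 2;
				qSums[sum][a] += reward;
				counts[sum][a]++;
				q[sum][a] = qSums[sum][a] / counts[sum][a];
				policy[sum] = (q[sum][HOLD] > q[sum][ROLL]) ? HOLD : ROLL;
			}
		}
		return q;
	}

	public static void main(String[] args) {
		double[][] q = learn(10, 1000000, new Random(0));
		for (int sum = 0; sum <= 10; sum++)
			System.out.printf("%3d: %f %f [%s]%n", sum, q[sum][HOLD], q[sum][ROLL],
					q[sum][HOLD] > q[sum][ROLL] ? "hold" : "roll");
	}
}
```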

Epsilon-soft on-policy MC control (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction, Section 5.4):

Sometimes, exploring starts (ES) are not practical (as in the case of on-line robotics learning) or undesirable (as in the case where a large proportion of states occur with very low probability, or are made unreachable by reasonable control that is easily discerned).  One way of eliminating the need for ES is to restrict the policy such that it is always exploring with some probability.  Here, we modify the MC control with ES such that episodes are generated without exploring starts and the policy is always epsilon-soft.  An epsilon-soft policy chooses one action with probability (1 - epsilon + epsilon/|A(s)|) and all other actions with probability (epsilon/|A(s)|), where |A(s)| is the number of actions in state s.  This is equivalent to saying that the policy is to choose a random action with some small probability epsilon, and another single (dominant) action otherwise.  For our problem and epsilon = .1, this means that the dominant of the 2 actions will be chosen with probability .95 and the other with probability .05.

Implement epsilon-soft on-policy control for Approach n according to Figure 5.6 of Section 5.4.  Use epsilon = .1, initialize each Q(s,a) to a random probability, and initialize the epsilon-soft policy to favor a random action.  In your results, print only the action that maximizes Q(s,a).  Generate at least 1,000,000 episodes.  For n = 10, our computed results are:

           Q(s, a)
sum:   hold     roll   [action]
  0: 0.000000 0.442053 [roll]
  1: 0.000000 0.431764 [roll]
  2: 0.000000 0.439501 [roll]
  3: 0.000000 0.479471 [roll]
  4: 0.000000 0.549074 [roll]
  5: 0.052468 0.472950 [roll]
  6: 0.176265 0.408975 [roll]
  7: 0.302062 0.350095 [roll]
  8: 0.476712 0.273652 [hold]
  9: 0.711675 0.160527 [hold]
 10: 1.000000 0.000000 [hold]

Compare these results to the previous results.  In particular, how does it differ from MC control with ES, and why?

Note: Implementation may be simplified by internally representing the policy as deterministic, but applying it as epsilon-greedy.
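The simplification in the note above amounts to an action selection helper along these lines (a sketch with names of our own choosing):

```java
import java.util.Random;

public class EpsilonGreedy {
	/**
	 * Choose the deterministic policy's action with probability
	 * 1 - epsilon + epsilon / numActions, and each other action with
	 * probability epsilon / numActions.
	 */
	public static int chooseAction(int greedyAction, int numActions, double epsilon, Random random) {
		if (random.nextDouble() < epsilon)
			return random.nextInt(numActions); // explore: uniformly random action
		return greedyAction; // exploit: the deterministic policy's action
	}
}
```

With numActions = 2 and epsilon = .1, this selects the dominant action with the required probability .95.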

Off-policy MC control (see Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction, Section 5.6):

It is possible to estimate the value of one policy while generating episodes with a different behavior policy.  For example, we can learn an optimal policy for Approach n while using an epsilon-soft policy to generate episodes.  The trick is that we can only update Q using the tails of episodes, beginning with the last action that is inconsistent with the current estimation policy (or the start state, if every action is consistent).  Rather than accumulating simple return totals and visit counts, we weight each return and visit by the inverse probability of the behavior policy generating the remainder of the episode from that point onward.

Implement the off-policy MC control algorithm of Figure 5.7 of Section 5.6.  Continue to use epsilon = .1, initialize each Q(s,a) to a random probability, and initialize your deterministic estimation policy to favor a random action.  Your epsilon-soft behavior policy will always be the epsilon-greedy version of your current estimation policy (which was randomly initialized).  Generate at least 1,000,000 episodes.  For n = 10, our computed results are:

           Q(s, a)
sum:   hold     roll   [action]
  0: 0.000000 0.493405 [roll]
  1: 0.000000 0.475422 [roll]
  2: 0.000000 0.475861 [roll]
  3: 0.000000 0.509050 [roll]
  4: 0.000000 0.580554 [roll]
  5: 0.053596 0.497167 [roll]
  6: 0.168275 0.425661 [roll]
  7: 0.300287 0.363330 [roll]
  8: 0.477287 0.290901 [hold]
  9: 0.711453 0.161685 [hold]
 10: 1.000000 0.000000 [hold]
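One possible shape for this algorithm, using the weighted numerator/denominator accumulators of Figure 5.7 (class and variable names are our own), is sketched below.  Note that since every action after the last inconsistent action agrees with the estimation policy, its behavior probability is the constant 1 - epsilon + epsilon/2:

```java
import java.util.Random;

public class OffPolicyMC {
	static final int ROLL = 0, HOLD = 1;

	/** Off-policy MC control for Approach n with an epsilon-greedy behavior policy. */
	public static int[] learn(int n, double epsilon, int numEpisodes, Random random) {
		double[][] numer = new double[n + 1][2], denom = new double[n + 1][2];
		double[][] q = new double[n + 1][2];
		int[] policy = new int[n + 1];
		for (int sum = 0; sum <= n; sum++) { // random initial Q values and policy
			q[sum][ROLL] = random.nextDouble();
			q[sum][HOLD] = random.nextDouble();
			policy[sum] = random.nextInt(2);
		}
		for (int episode = 0; episode < numEpisodes; episode++) {
			// Generate an episode with the epsilon-greedy behavior policy.
			int[] sums = new int[n + 1], actions = new int[n + 1];
			int length = 0, sum1 = 0;
			while (true) {
				int action = (random.nextDouble() < epsilon) ? random.nextInt(2) : policy[sum1];
				sums[length] = sum1;
				actions[length++] = action;
				if (action == HOLD) break;
				sum1 += random.nextInt(6) + 1;
				if (sum1 > n) break; // player 1 exceeded n and loses
			}
			int reward = 0;
			if (sum1 == n)
				reward = 1;
			else if (sum1 <= n) { // player 1 held below n; player 2 rollout
				int sum2 = 0;
				while (sum2 <= sum1)
					sum2 += random.nextInt(6) + 1;
				if (sum2 > n) reward = 1;
			}
			// tau: time of the last action inconsistent with the estimation policy.
			int tau = 0;
			for (int t = 0; t < length; t++)
				if (actions[t] != policy[sums[t]]) tau = t;
			// Weighted updates over the episode tail from tau onward.
			for (int t = tau; t < length; t++) {
				double w = 1.0;
				for (int k = t + 1; k < length; k++)
					w /= (1 - epsilon + epsilon / 2); // behavior prob of each greedy action
				numer[sums[t]][actions[t]] += w * reward;
				denom[sums[t]][actions[t]] += w;
				q[sums[t]][actions[t]] = numer[sums[t]][actions[t]] / denom[sums[t]][actions[t]];
			}
			for (int sum = 0; sum <= n; sum++)
				policy[sum] = (q[sum][HOLD] > q[sum][ROLL]) ? HOLD : ROLL;
		}
		return policy;
	}

	public static void main(String[] args) {
		int[] policy = learn(10, 0.1, 1000000, new Random(0));
		for (int sum = 0; sum <= 10; sum++)
			System.out.printf("%3d: [%s]%n", sum, policy[sum] == HOLD ? "hold" : "roll");
	}
}
```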

Further exercises: