DeepStack
Summary 
In this assignment, students thoroughly absorb a recent seminal paper in machine learning by traversing the tree of dependencies that led to that paper. Targeting, "DeepStack", the successful result tackling the game of poker (simultaneously accomplished by Libratus), students will gain understanding of how to solve 1v1 games of incomplete information. Through six sessions, they will build up a strong understanding of game theory fundamentals and Counterfactual Regret Minimization, culminating in using this firm base to grasp the background and signficance of the DeepStack paper. 
Topics  Normal Form Games, Extensive Form Games, Game Theory, Counterfactual Regret Minimization, Optimality & Equilibrium, Games of Incomplete Information, Poker. 
Audience  Advanced: late stage undergraduate to professional practitioners. There is no expectation of topic maturity, but there is an expectation of familiarity with probability theory, statistics, the practice of using formal mathematical notating and proving theorems, and an understanding of artificial neural network training. Participants should have the ability and desire to engage indepth with the material. This content may not be ideal for classes with a predefined syllabus. A better fit would be seminars, reading groups, elective classes such as "Advanced Topics in AI" or "Advanced Topics in Game Theory". 
Difficulty  The expectation is that each of the 6 sessions in this class will take on the order of 610 hours to complete. This includes the assignments, which are embedded as Questions. Note that solutions are available to teachers upon request to depthfirstlearning@gmail.com. 
Strengths 

Weaknesses 

Dependencies  We assume that students are familiar with the basics of probability theory, some basic statitics, the practice of using formal mathematical notation and formally proving theorems and an understanding of artificial neural network training. Other than these dependencies, our approach guides students through a careful study of all of the other dependencies that this material depends on. Note that there was a previous Model Assignment in 2013 on Counterfactual Regret Minimization that is very related to this material. We incorporate two of the sections from that assignment's handout as a required dependency in week four ("ICRM" below). 
Variants  The assignment is meant to be done in order. Optional readings are marked as such, but otherwise all of the required readings are necessary to answer the questions. 
Why
Along with Libratus, DeepStack is one of two approaches to solving NoLimit Texas Holdem that debuted coincidentally. This game was notoriously difficult to solve as it has just as large a branching factor as Go, but additionally is a game of imperfect information.
The main idea behind both DeepStack and Libratus is to use Counterfactual Regret Minimization (CFR) to find a mixed strategy that approximates a Nash Equilibrium strategy. CFR’s convergence properties guarantee that we will yield such a strategy and the closer we are to it, the better our outcome will be. They differ in their implementation. In particular, DeepStack uses deep neural networks to approximate the counterfactual value of each hand at specific points in the game. While still being mathematically tight, this lets it cut short the necessary computation to reach convergence.
In this curriculum, you will explore the study of games with a tour through game theory and counterfactual regret minimization while building up the requisite understanding to tackle DeepStack. Along the way, you will learn all of the necessary topics, including what is the branching factor, all about Nash Equilibria, and CFR (Quora).
Common Resources:
 MAS: Multi Agent Systems (get PDF with 'eBook Download' link).
 LT: Marc Lanctot’s Thesis. c
 ICRM: Introduction to Counterfactual Regret Minimization.
 PLG: Prediction, Learning, and Games.
1 Normal Form Games & Poker
Motivation: Most of game theory, as well as the particular techniques used in DeepStack and Libratus, is built on the framework of Normal Form Games. These are game descriptions and are familiarly represented as a matrix, a famous example being the Prisoner’s Dilemma. In this section, we cover the basics of Normal Form Games. In addition, we go over the rules of Poker and why it had proved so difficult to solve.
Required Reading:
 MAS: Sections 3.1 & 3.2.
 LT: Pages 57.
 The Game of Poker: Supplementary #1 on pages 1617.
Optional Reading:
 The State of Solving Large IncompleteInformation Games, and Application to Poker (2010)
 Why Poker is Difficult Very good video by Noam Brown, the main author of Libratus. The first eighteen minutes are the most relevant.
 Pagat reference for Texas Hold'em.
 Wikipedia reference for Texas Hold'em.
Questions:
 LT: Prove that in a zerosum game, the Nash Equilibrium strategies are interchangeable.
Hint
Use the definition of a Nash Equilibrium along with the fact that ${\mu}_{i}({\sigma}_{i},{\sigma}_{i})+{\mu}_{i}({\sigma}_{i},{\sigma}_{i})=c$.
 LT: Prove that in a zerosum game, the expected payoff to each player is the same for every equilibrium.
Solution
We will solve both this problem and the one above here. We have that if ${\mu}_{i}(\sigma )=\mu ({\sigma}_{i},{\sigma}_{i})$ and ${\mu}_{i}({\sigma}^{\prime})=\mu ({\sigma}_{i}^{\prime},{\sigma}_{i}^{\prime})$ are both Nash Equilibria, then:
$\begin{array}{rl}{\mu}_{i}({\sigma}_{i},{\sigma}_{i})& \ge {\mu}_{i}({\sigma}_{i}^{\prime},{\sigma}_{i})\\ & =c{\mu}_{i}({\sigma}_{i}^{\prime},{\sigma}_{i})\\ & \ge c{\mu}_{i}({\sigma}_{i}^{\prime},{\sigma}_{i}^{\prime})\\ & ={\mu}_{i}({\sigma}_{i}^{\prime},{\sigma}_{i}^{\prime})\end{array}$
In a similar fashion, we can show that $\mu ({\sigma}_{i}^{\prime},{\sigma}_{i}^{\prime})\ge \mu ({\sigma}_{i},{\sigma}_{i})$.
Consequently, $\mu ({\sigma}_{i}^{\prime},{\sigma}_{i}^{\prime})=\mu ({\sigma}_{i},{\sigma}_{i})$, which also implies that the strategies are interchangeable, i.e. $\mu ({\sigma}_{i}^{\prime},{\sigma}_{i}^{\prime})=\mu ({\sigma}_{i}^{\prime},{\sigma}_{i})$.  MAS: Prove Lemma 3.1.6 (repeated below). In more natural English, this states that if there is a relationship among
preferences that satisfy four important axioms (see MAS for details on those axioms), then
the relationship among those preferences is linear with regards to the probabilities of attaining them.
$\mathit{\text{Lemma}}$: If a preference relation $\u2ab0$ satisfies the axioms completeness, transitivity, decomposability, and monotonicity, and if preference ${o}_{1}\succ {o}_{2}$ and preference ${o}_{2}\succ {o}_{3}$, then there exists probability $p$ such that for all $\mathrm{\forall}{p}^{\prime}<p$, ${o}_{2}\succ [{p}^{\prime}:{o}_{1};(1{p}^{\prime}):{o}_{3}]$ and for all ${p}^{\u2033}>p$, $[{p}^{\u2033}:{o}_{1};(1{p}^{\u2033}):{o}_{3}]\succ {o}_{2}.$  MAS: Theorem 3.1.8 ensures that rational agents need only maximize the expectation
of singledimensional utility functions. Prove this result as a good test of your
understanding.
$\mathit{\text{Theorem}}$: If a preference relation $\u2ab0$ satisfies the axioms completeness, transitivity, substitutability, decomposability, monotonicity, and continuity, then there exists a function $u:\mathbb{L}\mapsto [0,1]$ with the properties that: $u({o}_{1})\ge u({o}_{2})$ iff ${o}_{1}\u2ab0{o}_{2}$.
 $u([{p}_{1}:{o}_{1},...,{p}_{k}:{o}_{k}])=\sum _{i=1}^{k}{p}_{i}u({o}_{i})$.
2 Optimality & Equilibrium
Motivation: How do you reason about games? The best strategies in multiagent scenarios depend on the choices of others. Game theory deals with this problem by identifying subsets of outcomes called solution concepts. In this section, we discuss the fundamental solution concepts: Nash Equilibrium, Pareto Optimality, and Correlated Equilibrium. For each solution concept, we cover what it implies for a given game and how difficult it is to discover a representative strategy.
Required Reading:
 MAS: Sections 3.3, 3.4.5, 3.4.7, 4.1, 4.2.4, 4.3, & 4.6.
 LT: Section 2.1.1.
Optional Reading:
 MAS: Section 3.4.
Questions:
 Why must every game have a Pareto optimal strategy?
Solution
Say that a game does not have a Pareto optimal outcome. Then, for every outcome $O$, there was another ${O}^{\prime}$ that Paretodominated $O$. Say ${O}_{2}>{O}_{1}$. Because ${O}_{2}$ is not Pareto optimal, there is some ${O}_{k}>{O}_{2}$. There cannot be a max in this chain (because that max would be Pareto optimal) and thus there must be some cycle. Consequently, there exists for some agent a strategy ${O}_{j}$ s.t. ${O}_{j}>{O}_{j}$, which is a contradiction.
 Why must there always exist at least one Pareto optimal strategy in which all players adopt pure strategies?
 Why in commonpayoff games do all Pareto optimal strategies have the same payoff?
Solution
Say two strategies $S$ and ${S}^{\prime}$ are Pareto optimal. Then neither dominates the other, so either $\mathrm{\forall}i{\mu}_{i}(S)={\mu}_{i}({S}^{\prime})$ or there are two players $i,j$ for which $m{u}_{i}(S)<{\mu}_{i}({S}^{\prime})$ and $m{u}_{j}(S)>{\mu}_{j}({S}^{\prime})$. In the former case, we see that the two strategies have the same payoff as desired. In the latter case, we have a contradiction because ${\mu}_{j}({S}^{\prime})={\mu}_{i}({S}^{\prime})>{\mu}_{i}(S)={\mu}_{j}(S)>{\mu}_{j}({S}^{\prime})$. Thus, all of the Pareto optimal strategies must have the same payoff.
 MAS: Why does definition 3.3.12 imply that the vertices of a simplex must
all receive different labels?
Solution
This follows from the definitions of $\mathbb{L}(v)$ and $\chi (v)$. At the vertices of the simplex, $\chi $ will only have singular values in its range defined by the vertice itself. Consequently, $\mathbb{L}$ must as well.
 MAS: Why in definition 3.4.12 does it not matter that the mapping is to pure strategies rather than to mixed strategies?
 Take your favorite normalform game, find a Nash Equilibrium, and then find a corresponding Correlated Equilibrium.
3 Extensive Form Games
Motivation: What happens when players don’t act simultaneously? Extensive Form Games are an answer to this question. While this representation of a game always has a comparable Normal Form, it’s much more natural to reason about sequential games in this format. Examples include familiar ones like Go, but also more exotic games like Magic: The Gathering and Civilization. This section is imperative as Poker is best described as an Extensive Form Game.
Required Reading:
 MAS: Sections 5.1.1  5.1.3.
 MAS: Sections 5.2.1  5.2.3.
 Accelerating Best Response Calculation in Large Extensive Games: This is important for understanding how to evaluate Poker algorithms.
Optional Reading:
 LT: Section 2.1.2.
Questions:
 What is the intuition for why not all normal form games can be transformed
into perfectform extensive games?
Solution
The problem is one of modeling simultaneity. Perfect information extensive form games have trouble modeling concurrent moves because they have an explicit temporal structure of moves.
 Why does that change when the transformation is to imperfect extensive games?
 How are the set of behavioral strategies different from the set of mixed strategies?
Solution
The set of mixed strategies are each distributions over pure strategies. The set of behavioral strategies are each vectors of distributions over the actions and assign that distribution independently at each Information Set.
 Succinctly describe the technique demonstrated in the Accelerating Best Response paper.
4 Counterfactual Regret Minimization #1
Motivation: Counterfactual Regret Minimization (CFR) is only a decade old but has already achieved huge success as the foundation underlying DeepStack and Libratus. In the first of two weeks dedicated to CFR, we learn how the algorithm works practically and get our hands dirty coding up our implementation.
The optional readings are papers introducing CFRD and CFR+, further iterations upon CFR. These are both used in DeepStack.
Required Reading:
 ICRM: Sections 2.12.4.
 ICRM: Sections 3.13.4.
 LT: Section 2.2.
 Regret Minimization in Games with Incomplete Information.
Optional Reading: These two papers are CFR extensions used in DeepStack.
 Solving Imperfect Information Games Using Decomposition: CFRD.
 Solving Large Imperfect Information Games Using CFR+: CFR+.
Questions:
 What is the difference between external regret, internal regret, swap regret,
and counterfactual regret?
Hint
The definitions of the three are the following:
 External Regret: How much the algorithm regrets not taking the best single decision in hindsight. We compare to a policy that performs a single action in all timesteps.
 Internal Regret: How much the algorithm regrets making one choice over another in all instances. An example is whenever you bought Amazon stock, you instead bought Microsoft stock.
 Swap Regret: Similar to Internal Regret but instead of one categorical action being replaced wholesale with another categorical action, now we allow for any number of categorical swaps.
 Counterfactual Regret: Assuming that your actions take you to a node, this is the expectation of that node over your opponents' strategies. The counterfactual component is that we assume you get to that node with a probability of one.
 Why is Swap Regret important?
Hint
Swap Regret is connected to Correlated Equilibrium. Explain why.
 Leduc Poker (Southey et al) and Liar’s Dice (Wikipedia) are two different games that are more tractable than games with larger state spaces like Texas Hold'em while still being intuitive to grasp. Test your understanding by implementing CFR (or CFR+ / CFRD) to solve one of these two games in your favorite programming language.
 How do you know if you’ve implemented CFR correctly?
Solution
One way is to test it by implementing Local Best Response. It should perform admirably against that algorithm, which is meant to best it.
5 Counterfactual Regret Minimization #2
Motivation: In the last section, we saw the practical side of CFR and how effective it can be. In this section, we’ll understand the theory underlying it. This will culminate with Blackwell’s Approachability Theorem, a generalization of repeated twoplayer zerosum games. This is a challenging session but the payoff will be a much keener understanding of CFR’s strengths.
Required:
 PLG: Sections 7.3  7.7, 7.9.
Optional:
 A Simple Adaptive Procedure Leading to Correlated Equilibrium.
 Prof. Johari’s 2007 Class  11.
 Prof. Johari’s 2007 Class  13.
 Prof. Johari’s 2007 Class  14.
 Prof. Johari’s 2007 Class  15.
Questions:

PLG: Prove Lemma 7.1.
$\mathit{\text{Lemma}}$: A probability distribution $P$ over the set of all $K$tuples $i=({i}_{1},...,{i}_{K})$ of actions is a correlated equilibrium iff, for every player $k\in 1,...,K$ and actions $j,{j}^{\prime}\in 1,...,{N}_{k}$, we have$$\sum _{i:{i}_{k}=j}P(i)(\mathcal{l}(i)\mathcal{l}({i}^{},{j}^{\prime}))\le 0$$where $({i}^{},{j}^{\prime})=({i}_{1},...,{i}_{k1},{j}^{\prime},{i}_{k+1},...,{i}_{K})$.

It’s brushed over in the proof of Theorem 7.5 in PLG, but prove that if set $S$ is approachable, then every halfspace $H$ containing $S$ is approachable.
Solution
Because $S\in H$ is approachable, we can always find a strategy for player one s.t. the necessary approachability clauses hold (see Johari's Lecture 13). Namely, choose the strategy in $S$ that asserts $S$ as being approachable.
6 DeepStack
Motivation: Let’s read the paper! A summary of what’s going on to help with your understanding:
DeepStack runs counterfactual regret minimization at every decision. However, it uses two separate neural networks, one for after the flop and one for after the turn, to estimate the counterfactual values without having to continue running CFR after those moments. This approach is trained beforehand and helps greatly with cutting short the search space at inference time. Each of the networks take as input the size of the pot and the current Bayesian ranges for each player across all hands. They output the counterfactual values for each hand for each player.
In addition to DeepStack, we also include Libratus as required reading. This paper highlights game theory and CFR as the really important concepts in this curriculum; deep learning is not necessary to build a champion Poker bot.
Required Reading:
 DeepStack: ExpertLevel Artificial Intelligence in HeadsUp NoLimit Poker.
 DeepStack Supplementary Materials.
 Libratus.
 Michael Bowling on DeepStack.
Optional Reading:
 (Github) DeepStack Implementation for Leduc Hold’em.
 Noam Brown on Libratus.
 DepthLimited Solving for ImperfectInformation Games: This paper is fascinating because it is achieves a pokerplaying bot almost as good as Libratus but using a fraction of the necessary computation and disk space.
Questions:
 What are the differences between the approaches taken in DeepStack and in Libratus?
Solution
Here are some differences:
 A clear difference is that DeepStack uses a deep neural network to reduce the necessary search space, and Libratus does not.
 DeepStack does not use any action abstraction and instead melds those considerations into the pot size input. Libratus does use a dense action abstraction but adapts it each game and additionally constructs new subgames on the fly for actions not in its abstraction.
 DeepStack uses card abstraction by first clustering the hands into 1000 buckets and then considering probabilities over that range. Libratus does not use any card abstraction preflop or on the flop, but does use it on later rounds such that the game's ${10}^{61}$ decision points are reduced to ${10}^{12}$.
 DeepStack does not have a way to learn from recent games without further neural network training. On the other hand, Libratus improves via a background process that adds novel opponent actions to its action abstraction.
 Can you succinctly explain “Continual Resolving”?
 Can you succinctly explain AIVAT?