DeepStack

Summary	In this assignment, students thoroughly absorb a recent seminal paper in machine learning by traversing the tree of dependencies that led to that paper. Targeting, "DeepStack", the successful result tackling the game of poker (simultaneously accomplished by Libratus), students will gain understanding of how to solve 1v1 games of incomplete information. Through six sessions, they will build up a strong understanding of game theory fundamentals and Counterfactual Regret Minimization, culminating in using this firm base to grasp the background and signficance of the DeepStack paper.
Topics	Normal Form Games, Extensive Form Games, Game Theory, Counterfactual Regret Minimization, Optimality & Equilibrium, Games of Incomplete Information, Poker.
Audience	Advanced: late stage undergraduate to professional practitioners. There is no expectation of topic maturity, but there is an expectation of familiarity with probability theory, statistics, the practice of using formal mathematical notating and proving theorems, and an understanding of artificial neural network training. Participants should have the ability and desire to engage in-depth with the material. This content may not be ideal for classes with a pre-defined syllabus. A better fit would be seminars, reading groups, elective classes such as "Advanced Topics in AI" or "Advanced Topics in Game Theory".
Difficulty	The expectation is that each of the 6 sessions in this class will take on the order of 6-10 hours to complete. This includes the assignments, which are embedded as Questions. Note that solutions are available to teachers upon request to depthfirstlearning@gmail.com.
Strengths	Students do not just read the paper but also study the base of dependencies that led to it. This provides a firm understanding of what underlies the culminating work. In this way, they are more capable of grasping its nuances, building upon this knowledge, and advancing the state of the art. Students implement their own version of the Counterfactual Regret Minimization algorithm and use it to solve a simpler game. Once completed, students are encouraged to continue their studies in similar ways through resources at depthfirstlearning.com.
Weaknesses	The assignments can be rather challenging and are best done in discussion groups. Solo participants may be discouraged.
Dependencies	We assume that students are familiar with the basics of probability theory, some basic statitics, the practice of using formal mathematical notation and formally proving theorems and an understanding of artificial neural network training. Other than these dependencies, our approach guides students through a careful study of all of the other dependencies that this material depends on. Note that there was a previous Model Assignment in 2013 on Counterfactual Regret Minimization that is very related to this material. We incorporate two of the sections from that assignment's handout as a required dependency in week four ("ICRM" below).
Variants	The assignment is meant to be done in order. Optional readings are marked as such, but otherwise all of the required readings are necessary to answer the questions.

Why

Along with Libratus, DeepStack is one of two approaches to solving No-Limit Texas Hold-em that debuted coincidentally. This game was notoriously difficult to solve as it has just as large a branching factor as Go, but additionally is a game of imperfect information.

The main idea behind both DeepStack and Libratus is to use Counterfactual Regret Minimization (CFR) to find a mixed strategy that approximates a Nash Equilibrium strategy. CFR’s convergence properties guarantee that we will yield such a strategy and the closer we are to it, the better our outcome will be. They differ in their implementation. In particular, DeepStack uses deep neural networks to approximate the counterfactual value of each hand at specific points in the game. While still being mathematically tight, this lets it cut short the necessary computation to reach convergence.

In this curriculum, you will explore the study of games with a tour through game theory and counterfactual regret minimization while building up the requisite understanding to tackle DeepStack. Along the way, you will learn all of the necessary topics, including what is the branching factor, all about Nash Equilibria, and CFR (Quora).

Common Resources:

MAS: Multi Agent Systems (get PDF with 'eBook Download' link).
LT: Marc Lanctot’s Thesis.
ICRM: Introduction to Counterfactual Regret Minimization.
PLG: Prediction, Learning, and Games.

1 Normal Form Games & Poker

Motivation: Most of game theory, as well as the particular techniques used in DeepStack and Libratus, is built on the framework of Normal Form Games. These are game descriptions and are familiarly represented as a matrix, a famous example being the Prisoner’s Dilemma. In this section, we cover the basics of Normal Form Games. In addition, we go over the rules of Poker and why it had proved so difficult to solve.

Required Reading:

MAS: Sections 3.1 & 3.2.
LT: Pages 5-7.
The Game of Poker: Supplementary #1 on pages 16-17.

Optional Reading:

The State of Solving Large Incomplete-Information Games, and Application to Poker (2010)
Why Poker is Difficult Very good video by Noam Brown, the main author of Libratus. The first eighteen minutes are the most relevant.
Pagat reference for Texas Hold'em.
Wikipedia reference for Texas Hold'em.

Questions:

LT: Prove that in a zero-sum game, the Nash Equilibrium strategies are interchangeable.
Hint

Use the definition of a Nash Equilibrium along with the fact that $\mu_{i}(\sigma_{i}, \sigma_{-i}) + \mu_{-i}(\sigma_{i}, \sigma_{-i}) = c$ .
LT: Prove that in a zero-sum game, the expected payoff to each player is the same for every equilibrium.
Solution

We will solve both this problem and the one above here. We have that if $\mu_{i}(\sigma) = \mu(\sigma_{i}, \sigma_{-i})$ and $\mu_{i}(\sigma') = \mu(\sigma_{i}', \sigma_{-i}')$ are both Nash Equilibria, then:

$\begin{align} \mu_{i}(\sigma_{i}, \sigma_{-i}) &\geq \mu_{i}(\sigma_{i}', \sigma_{-i}) \\ &= c - \mu_{-i}(\sigma_{i}', \sigma_{-i}) \\ &\geq c - \mu_{-i}(\sigma_{i}', \sigma_{-i}') \\ &= \mu_{i}(\sigma_{i}', \sigma_{-i}') \end{align}$

In a similar fashion, we can show that $\mu(\sigma_{i}', \sigma_{-i}') \geq \mu(\sigma_{i}, \sigma_{-i})$ .
Consequently, $\mu(\sigma_{i}', \sigma_{-i}') = \mu(\sigma_{i}, \sigma_{-i})$ , which also implies that the strategies are interchangeable, i.e. $\mu(\sigma_{i}', \sigma_{-i}') = \mu(\sigma_{i}', \sigma_{-i})$ .
MAS: Prove Lemma 3.1.6 (repeated below). In more natural English, this states that if there is a relationship among preferences that satisfy four important axioms (see MAS for details on those axioms), then the relationship among those preferences is linear with regards to the probabilities of attaining them.
$\textit{Lemma}$ : If a preference relation $\succeq$ satisfies the axioms completeness, transitivity, decomposability, and monotonicity, and if preference $o_1 \succ o_2$ and preference $o_2 \succ o_3$ , then there exists probability $p$ such that for all $\forall p' < p$ , $o_2 \succ [p': o_1; (1 - p'): o_3]$ and for all $p'' > p$ , $[p'': o_1; (1 - p''): o_3] \succ o_2.$
MAS: Theorem 3.1.8 ensures that rational agents need only maximize the expectation of single-dimensional utility functions. Prove this result as a good test of your understanding.
: If a preference relation satisfies the axioms completeness, transitivity, substitutability, decomposability, monotonicity, and continuity, then there exists a function with the properties that:
1. $u(o_1) \geq u(o_2)$ iff $o_1 \succeq o_2$ .
2. $u([p_1 : o_1, ..., p_k: o_k]) = \sum_{i=1}^k p_{i}u(o_i)$ .

2 Optimality & Equilibrium

Motivation: How do you reason about games? The best strategies in multi-agent scenarios depend on the choices of others. Game theory deals with this problem by identifying subsets of outcomes called solution concepts. In this section, we discuss the fundamental solution concepts: Nash Equilibrium, Pareto Optimality, and Correlated Equilibrium. For each solution concept, we cover what it implies for a given game and how difficult it is to discover a representative strategy.

Required Reading:

MAS: Sections 3.3, 3.4.5, 3.4.7, 4.1, 4.2.4, 4.3, & 4.6.
LT: Section 2.1.1.

Optional Reading:

MAS: Section 3.4.

Questions:

Why must every game have a Pareto optimal strategy?
Solution

Say that a game does not have a Pareto optimal outcome. Then, for every outcome $O$ , there was another $O'$ that Pareto-dominated $O$ . Say $O_2 > O_1$ . Because $O_2$ is not Pareto optimal, there is some $O_k > O_2$ . There cannot be a max in this chain (because that max would be Pareto optimal) and thus there must be some cycle. Consequently, there exists for some agent a strategy $O_j$ s.t. $O_j > O_j$ , which is a contradiction.
Why must there always exist at least one Pareto optimal strategy in which all players adopt pure strategies?
Why in common-payoff games do all Pareto optimal strategies have the same payoff?
Solution

Say two strategies $S$ and $S'$ are Pareto optimal. Then neither dominates the other, so either $\forall i \mu_{i}(S) = \mu_{i}(S')$ or there are two players $i, j$ for which $mu_{i}(S) < \mu_{i}(S')$ and $mu_{j}(S) > \mu_{j}(S')$ . In the former case, we see that the two strategies have the same payoff as desired. In the latter case, we have a contradiction because $\mu_{j}(S') = \mu_{i}(S') > \mu_{i}(S) = \mu_{j}(S) > \mu_{j}(S')$ . Thus, all of the Pareto optimal strategies must have the same payoff.
MAS: Why does definition 3.3.12 imply that the vertices of a simplex must all receive different labels?
Solution

This follows from the definitions of $\mathbb{L}(v)$ and $\chi(v)$ . At the vertices of the simplex, $\chi$ will only have singular values in its range defined by the vertice itself. Consequently, $\mathbb{L}$ must as well.
MAS: Why in definition 3.4.12 does it not matter that the mapping is to pure strategies rather than to mixed strategies?
Take your favorite normal-form game, find a Nash Equilibrium, and then find a corresponding Correlated Equilibrium.

3 Extensive Form Games

Motivation: What happens when players don’t act simultaneously? Extensive Form Games are an answer to this question. While this representation of a game always has a comparable Normal Form, it’s much more natural to reason about sequential games in this format. Examples include familiar ones like Go, but also more exotic games like Magic: The Gathering and Civilization. This section is imperative as Poker is best described as an Extensive Form Game.

Required Reading:

MAS: Sections 5.1.1 - 5.1.3.
MAS: Sections 5.2.1 - 5.2.3.
Accelerating Best Response Calculation in Large Extensive Games: This is important for understanding how to evaluate Poker algorithms.

Optional Reading:

LT: Section 2.1.2.

Questions:

What is the intuition for why not all normal form games can be transformed into perfect-form extensive games?
Solution

The problem is one of modeling simultaneity. Perfect information extensive form games have trouble modeling concurrent moves because they have an explicit temporal structure of moves.
Why does that change when the transformation is to imperfect extensive games?
How are the set of behavioral strategies different from the set of mixed strategies?
Solution

The set of mixed strategies are each distributions over pure strategies. The set of behavioral strategies are each vectors of distributions over the actions and assign that distribution independently at each Information Set.
Succinctly describe the technique demonstrated in the Accelerating Best Response paper.

4 Counterfactual Regret Minimization #1

Motivation: Counterfactual Regret Minimization (CFR) is only a decade old but has already achieved huge success as the foundation underlying DeepStack and Libratus. In the first of two weeks dedicated to CFR, we learn how the algorithm works practically and get our hands dirty coding up our implementation.

The optional readings are papers introducing CFR-D and CFR+, further iterations upon CFR. These are both used in DeepStack.

Required Reading:

ICRM: Sections 2.1-2.4.
ICRM: Sections 3.1-3.4.
LT: Section 2.2.
Regret Minimization in Games with Incomplete Information.

Optional Reading: These two papers are CFR extensions used in DeepStack.

Questions:

What is the difference between external regret, internal regret, swap regret, and counterfactual regret?
Hint

The definitions of the three are the following:
- External Regret: How much the algorithm regrets not taking the best single decision in hindsight. We compare to a policy that performs a single action in all timesteps.
- Internal Regret: How much the algorithm regrets making one choice over another in all instances. An example is whenever you bought Amazon stock, you instead bought Microsoft stock.
- Swap Regret: Similar to Internal Regret but instead of one categorical action being replaced wholesale with another categorical action, now we allow for any number of categorical swaps.
- Counterfactual Regret: Assuming that your actions take you to a node, this is the expectation of that node over your opponents' strategies. The counterfactual component is that we assume you get to that node with a probability of one.
Why is Swap Regret important?
Hint

Swap Regret is connected to Correlated Equilibrium. Explain why.
Leduc Poker (Southey et al) and Liar’s Dice (Wikipedia) are two different games that are more tractable than games with larger state spaces like Texas Hold'em while still being intuitive to grasp. Test your understanding by implementing CFR (or CFR+ / CFR-D) to solve one of these two games in your favorite programming language.
How do you know if you’ve implemented CFR correctly?
Solution

One way is to test it by implementing Local Best Response. It should perform admirably against that algorithm, which is meant to best it.

5 Counterfactual Regret Minimization #2

Motivation: In the last section, we saw the practical side of CFR and how effective it can be. In this section, we’ll understand the theory underlying it. This will culminate with Blackwell’s Approachability Theorem, a generalization of repeated two-player zero-sum games. This is a challenging session but the payoff will be a much keener understanding of CFR’s strengths.

Required:

PLG: Sections 7.3 - 7.7, 7.9.

Optional:

Questions:

PLG: Prove Lemma 7.1.
$\textit{Lemma}$ : A probability distribution $P$ over the set of all $K$ -tuples $i = (i_{1}, ..., i_{K})$ of actions is a correlated equilibrium iff, for every player $k \in {1, ..., K}$ and actions $j, j' \in {1, ..., N_{k}}$ , we have
$\sum_{i : i_{k} = j} P (i) (l (i) - l (i^{-}, j^{'})) \leq 0$ $\sum_{i: i_{k} = j} P(i)\big(\mathcal{l}(i) - \mathcal{l}(i^{-}, j')\big) \leq 0$
where $(i^{-}, j') = (i_{1}, ..., i_{k-1}, j', i_{k+1}, ..., i_{K})$ .
It’s brushed over in the proof of Theorem 7.5 in PLG, but prove that if set $S$ is approachable, then every halfspace $H$ containing $S$ is approachable.

Solution

Because $S \in H$ is approachable, we can always find a strategy for player one s.t. the necessary approachability clauses hold (see Johari's Lecture 13). Namely, choose the strategy in $S$ that asserts $S$ as being approachable.

6 DeepStack

Motivation: Let’s read the paper! A summary of what’s going on to help with your understanding:

DeepStack runs counterfactual regret minimization at every decision. However, it uses two separate neural networks, one for after the flop and one for after the turn, to estimate the counterfactual values without having to continue running CFR after those moments. This approach is trained beforehand and helps greatly with cutting short the search space at inference time. Each of the networks take as input the size of the pot and the current Bayesian ranges for each player across all hands. They output the counterfactual values for each hand for each player.

In addition to DeepStack, we also include Libratus as required reading. This paper highlights game theory and CFR as the really important concepts in this curriculum; deep learning is not necessary to build a champion Poker bot.

Required Reading:

Optional Reading:

(Github) DeepStack Implementation for Leduc Hold’em.
Noam Brown on Libratus.
Depth-Limited Solving for Imperfect-Information Games: This paper is fascinating because it is achieves a poker-playing bot almost as good as Libratus but using a fraction of the necessary computation and disk space.

Questions:

What are the differences between the approaches taken in DeepStack and in Libratus?
Solution

Here are some differences:
- A clear difference is that DeepStack uses a deep neural network to reduce the necessary search space, and Libratus does not.
- DeepStack does not use any action abstraction and instead melds those considerations into the pot size input. Libratus does use a dense action abstraction but adapts it each game and additionally constructs new sub-games on the fly for actions not in its abstraction.
- DeepStack uses card abstraction by first clustering the hands into 1000 buckets and then considering probabilities over that range. Libratus does not use any card abstraction preflop or on the flop, but does use it on later rounds such that the game's $10^{61}$ decision points are reduced to $10^{12}$ .
- DeepStack does not have a way to learn from recent games without further neural network training. On the other hand, Libratus improves via a background process that adds novel opponent actions to its action abstraction.
Can you succinctly explain “Continual Re-solving”?
Can you succinctly explain AIVAT?