Reinforcement learning (RL) is a machine learning technique in which an agent chooses actions to evolve states so as to maximize expected rewards. RL is distinct from supervised learning, in which an algorithm trained on labeled examples learns to predict a target class, and from unsupervised learning, in which an algorithm finds its own clusters in the data. RL owes much to the concept of Markov Decision Processes, which are explained in more detail here: https://en.wikipedia.org/wiki/Markov_decision_process .
This report describes code that can be found in the GitHub repository https://github.com/ggraham-412/RLDiceGame (7c26s3). The code is written for Python version 3.4 or greater. The repo contains a simple dice roll generator module (DiceGen), a module for identifying roll patterns (RollIDentifier), a module implementing a simple one-step RL framework (RLGame), a module implementing an example RLGame based on Yamslam, a popular family dice game from Blue Orange Games (http://www.blueorangegames.com/), and a driver module, YamslamTrain, that provides a command line interface for training, playing, and evaluating the Yamslam engine.
The RL framework has a set of states \(S\), a set of actions \(A\), a function \(p\) to pick an action \(a\) given an initial state \(s\),
\(a = p(s), \quad \text{with } s \in S,\ a \in A\)
a function \(f\) to evaluate an action against an input state \(s\)
\(s' = f(s,a), \quad \text{with } s, s' \in S,\ a \in A\)
and a function \(g\) to score the action \(a\) taken in state \(s\). Note that in order for RL to be interesting, \(f\) should not be a deterministic function. If \(f\) is deterministic, the RL problem essentially reduces to a supervised learning problem: classifying the pairs \((s,a)\), with label \(R(s') = 1\) if the action was optimal and \(R(s') = 0\) if it was not.
In many dice games, \(S\) is the set of all possible rolls of \(N\) \(M\)-sided dice, and \(A\) is the set of all possible re-rolls, which can be encoded as integers from 0 to \(2^{N}-1\), where each bit determines whether the corresponding die is held (0) or re-rolled (1). \(f\) is a stochastic function that re-rolls the indicated dice, and \(g(s,a,s')\) computes the score of the action \(a\). Generally \(g>0\) if the roll improved, \(g<0\) if the roll got worse, and \(g=0\) if the roll maintained its value. \(p\) is simply a function that picks an action \(a\) maximizing the expected score improvement. The expected score improvement is kept in an action table, which forms the basis of the learning component in this framework.
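As a concrete illustration of the encoding, here is a minimal Python sketch of the stochastic transition \(f\). The function names and the bit ordering (bit \(i\) controls die \(i\) of the sorted roll) are illustrative assumptions, not the repo's actual API.

```python
import random

def roll_dice(n=5, sides=6):
    """Roll n dice; rolls are kept as sorted tuples so that equivalent rolls coincide."""
    return tuple(sorted(random.randint(1, sides) for _ in range(n)))

def reroll(state, action, sides=6):
    """Stochastic transition f(s, a): bit i of 'action' set to 1 means die i is re-rolled."""
    new_dice = [random.randint(1, sides) if (action >> i) & 1 else die
                for i, die in enumerate(state)]
    return tuple(sorted(new_dice))

# Example: hold the first three dice and re-roll the last two (bits 3 and 4 -> 0b11000 = 24).
s = roll_dice()
s_prime = reroll(s, 0b11000)
```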
Note that this framework is capable of learning while it plays. Since each play involves an initial state, an action, and a final state, the scoring function can be applied incrementally to the action table after each play.
Yamslam is a popular game by Blue Orange Games (http://www.blueorangegames.com/). During each turn, a player has three chances to roll the highest-scoring pattern possible on five 6-sided dice. All dice are rolled at first, and then the player chooses which dice to hold and which to re-roll on the subsequent rolls. The final roll is scored according to the following table.
| Roll Pattern | Point Value | Explanation |
|---|---|---|
| Yamslam | 50 | Five of a Kind |
| Large Straight | 50 | Five in a row |
| Four of a Kind | 40 | |
| Full House | 30 | Three and a Pair |
| Flush | 25 | All odds or evens |
| Small Straight | 20 | Four in a Row |
| Three of a Kind | 10 | |
| Two Pair | 5 | |
| One Pair, Bupkis | 0 | |
Bupkis is an unofficial fan name for a “nothing” hand. More information about the probabilities of rolling particular patterns may be found at https://graciesdad.wordpress.com/2015/08/25/yamslam-odds-part-1-2/.
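To make the point values concrete, the scoring table can be written as a small function. The sketch below is illustrative only, not the repo's RollIDentifier module, and it assumes the highest-valued matching pattern wins.

```python
def score_roll(roll):
    """Point value of a 5d6 roll under the table above (highest matching pattern wins)."""
    counts = sorted((roll.count(v) for v in set(roll)), reverse=True)
    unique = sorted(set(roll))
    longest_run = 1
    run = 1
    for a, b in zip(unique, unique[1:]):
        run = run + 1 if b == a + 1 else 1
        longest_run = max(longest_run, run)
    if counts[0] == 5:
        return 50   # Yamslam (five of a kind)
    if longest_run == 5:
        return 50   # Large Straight
    if counts[0] == 4:
        return 40   # Four of a Kind
    if counts[0] == 3 and counts[1] == 2:
        return 30   # Full House
    if all(v % 2 == 0 for v in roll) or all(v % 2 == 1 for v in roll):
        return 25   # Flush (all odds or all evens)
    if longest_run >= 4:
        return 20   # Small Straight
    if counts[0] == 3:
        return 10   # Three of a Kind
    if counts[0] == 2 and counts[1] == 2:
        return 5    # Two Pair
    return 0        # One Pair or Bupkis
```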
For the Yamslam game implemented in the module Yamslam.py, the state space \(S\) is the set of distinct sorted rolls of 5 6-sided dice (5d6). There are \(6^5 = 7776\) possible 5d6 rolls, but these yield only 252 distinct sorted dice patterns, so \(|S| = 252\). Since \(N=5\), the action space \(A\) consists of the integers in the closed interval \([0,31]\), so \(|A| = 32\). \(f\) is a stochastic function that re-rolls dice in state \(s\) according to the bitmask in \(a\) to produce a new roll \(s'\), and \(g\) evaluates an action \(a\) by the point difference between the initial and final rolls as indicated by the above table.
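The 252 distinct sorted rolls are easy to enumerate; the following snippet simply confirms the count.

```python
from itertools import combinations_with_replacement

# All sorted rolls of five six-sided dice: "6 multichoose 5" = C(10, 5) = 252 states.
states = list(combinations_with_replacement(range(1, 7), 5))
assert len(states) == 252
```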
The action table itself consists of three numbers for every (state, action) pair: a sum of the point differences seen in games played, a count of the games played, and the average point difference per game. For a given state \(s\), \(p\) selects the action with the maximum average point increase in the action table.
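A minimal sketch of such an action table and of \(p\), assuming the table is held in memory as a dictionary keyed by (state, action); the names here are hypothetical, not the repo's.

```python
from collections import defaultdict

# Hypothetical in-memory action table: (state, action) -> [sum of deltas, count, average].
action_table = defaultdict(lambda: [0.0, 0, 0.0])

def update(table, state, action, delta):
    """Incremental learning step: fold one observed point difference into the table."""
    entry = table[(state, action)]
    entry[0] += delta
    entry[1] += 1
    entry[2] = entry[0] / entry[1]

def pick_action(table, state, n_actions=32):
    """p(s): choose the action with the largest average point improvement seen so far."""
    return max(range(n_actions), key=lambda a: table[(state, a)][2])
```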
This implementation omits some of the finer points of Yamslam play, namely the chip tokens that limit the number of rolls that can be scored in each category, and a multi-step reward system for modeling two re-rolls instead of only one. These are addressed in the section “For Further Study” below.
The RL code comes with a training module, YamslamTrain.py.
$ python YamslamTrain.py --help
Usage: YamslamTrain.py [options]

Options:
  -h, --help            show this help message and exit
  -n NAME, --name=NAME  Name of Yamslam instance
  -t NTRAIN, --train=NTRAIN
                        Number of learning trials per roll/action
  -i INAME, --import=INAME
                        Name of game to import training from
  -p NPLAY, --play=NPLAY
                        Number of demo games to play
  -r RNAME, --rms=RNAME
                        Name of game to compare (rms) training
  -c CNAME, --compare=CNAME
                        Name of game to compare (% agree) training
The README.md markdown file contains some usage examples.
YamslamTrain contains a function to produce \(N\) trials of directed learning for every roll and possible action. Directing the engine to play without any training results in some comical plays, because the actions are essentially random. But after only a few learning trials (~100 per state-action pair), the engine plays a reasonable game of Yamslam.
$ python YamslamTrain.py -n Yamslam100 -t 100
$ python YamslamTrain.py -n Yamslam100 -p 5
Initial roll: [1, 1, 1, 2, 2]
Keeping dice: [1, 2, 3, 4, 5]
Final roll: [1, 1, 1, 2, 2]
# Knew to hold on to the Full House
Initial roll: [1, 2, 5, 6, 6]
Keeping dice: [2]
Final roll: [2, 3, 5, 5, 6]
# This is an error
Initial roll: [2, 3, 3, 4, 6]
Keeping dice: [1, 3, 4]
Final roll: [1, 2, 2, 3, 4]
# Going for small straight
Initial roll: [2, 3, 4, 6, 6]
Keeping dice: [1, 2, 3]
Final roll: [2, 3, 3, 4, 5]
# Going for small straight
Initial roll: [1, 2, 4, 6, 6]
Keeping dice: [2, 3, 4, 5]
Final roll: [2, 4, 4, 6, 6]
# Going for flush
To answer the question of how much training is enough, one can look at the differences between the action tables of game instances trained to different levels. For the following measurements, a reference action table was generated with 10,000 trials per state-action pair. This action table is provided in the repo as YamslamReference_actiontable.dat.
Figure 1 shows the fraction of states where the given action table agrees with the reference table on the optimal action \(a_{max}\), versus training set size. Figure 2 shows the average RMS difference between each entry of the given action table and the reference table. In both plots, the point at 10,000 is a comparison of a separately generated 10,000-trial action table against the reference table.
Figure 1: Fraction of states that agree between given and reference training sets.
In figure 1, we see that the agreement fraction has a knee around 800 trials per roll-action pair, at about 90%. Further gains in agreement beyond 800 trials come at increasing cost.
Figure 2: Mean RMS difference of all corresponding entries between given and reference training sets.
In figure 2, we see that the mean RMS difference has a knee around 1,600 trials per roll-action pair, at about 0.3. Further reductions in the RMS difference come at increasing cost.
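For concreteness, the two comparison metrics can be computed roughly as follows, assuming each trained table has been reduced to a dict mapping (state, action) to its average point gain. This is one plausible reading of the metrics, an illustrative sketch rather than the comparison code in YamslamTrain or loadActionTable.R.

```python
import math

def compare_tables(table, reference, states, n_actions=32):
    """Fraction of states whose best action agrees with the reference table, and the
    RMS difference between corresponding average-gain entries."""
    agree, sq_sum, n = 0, 0.0, 0
    for s in states:
        best = max(range(n_actions), key=lambda a: table[(s, a)])
        best_ref = max(range(n_actions), key=lambda a: reference[(s, a)])
        agree += int(best == best_ref)
        for a in range(n_actions):
            sq_sum += (table[(s, a)] - reference[(s, a)]) ** 2
            n += 1
    return agree / len(states), math.sqrt(sq_sum / n)
```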
What accounts for the sharp knee in figure 1? It can be explained by looking at the relative optimality of the actions in the action table for each roll. Some roll patterns have obvious optimal actions, but others have several actions that are “close” to optimal. Using functions provided in the accompanying script loadActionTable.R from the GitHub repository, one can look at the histogram of the expected difference in point improvement between following the best action and following the next-best action (figures 3a, 3b and 3c).
Figure 3: (a) For all of the rolls in the Yamslam reference table, difference in point value between following the best action and following the next best action, and (b) same for the subset of rolls that do NOT match the best action predicted by the Yamslam100 model, and (c) same for the subset of rolls that do NOT match the best action predicted by the Yamslam1600 model.
Figure 3(a) shows the distribution of optimal differences for each roll in the reference table. Figures 3(b) and 3(c) show that for the rolls where the reference action table disagrees with the 100-trial and 1,600-trial action tables, the optimal differences are increasingly small. This suggests that the residual disagreement is benign and simply reflects small differences in strategy for ambiguous rolls: for example, 24633 could be re-rolled to go for a small straight (hold 234), a three of a kind (hold 33), or a flush (hold 246). If the differences between the average point gains for these actions are small, then the training needs a larger number of trials to resolve them. However, it also suggests that strategy convergence is not uniform across all possible rolls, and further convergence may be very slow once the low-hanging fruit is gathered.
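The quantity histogrammed in figure 3 is simply the gap between the best and second-best expected gains for a roll. A minimal Python equivalent of the R helper might look like this (the data layout is an assumption):

```python
def best_action_gap(averages_for_state):
    """Gap in expected point improvement between the best and next-best action for one
    roll; 'averages_for_state' is the list of 32 average values for that roll."""
    top_two = sorted(averages_for_state, reverse=True)[:2]
    return top_two[0] - top_two[1]

# Histogramming this gap over all 252 rolls gives the shape of figure 3(a); restricting
# to the rolls where two trained tables disagree gives figures 3(b) and 3(c).
```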
This report presents a simple framework for reinforcement learning (RL) in dice games. In particular, a simplified form of Yamslam was presented, and the performance of the learning algorithm was explored. The RL framework was based on maximizing the expected point improvement from re-rolling dice, and the framework is capable of learning by adjusting its action table after every play.
This learning framework is remarkably similar to using Monte Carlo methods to perform the stochastic integrals that correspond to the entries in the action table (writing \(g(s)\) here for the point value of roll \(s\)):
\(\int_{S} \big(g(s') - g(s)\big)\, dP(s' \mid s,a) = \sum_{s' \in S} \big(g(s') - g(s)\big)\, P(s' \mid s,a)\)
For a simple dice game, these integrals could of course be calculated directly using the probability-weighted sum; however, there are rather a lot of them! And the Monte Carlo approach remains valid in cases where the integrals are not easily calculated.
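As a sketch of what the direct calculation looks like, the following enumerates every equally likely outcome of the re-rolled dice for one (state, action) pair; `score` stands in for a roll-scoring function such as the score_roll sketch above, and none of these names come from the repo.

```python
from itertools import product

def exact_expected_gain(state, action, score, sides=6):
    """Exact probability-weighted sum for one (state, action) entry: enumerate every
    equally likely outcome of the re-rolled dice instead of sampling them."""
    rerolled = [i for i in range(len(state)) if (action >> i) & 1]
    if not rerolled:
        return 0.0
    total, n = 0.0, 0
    for faces in product(range(1, sides + 1), repeat=len(rerolled)):
        dice = list(state)
        for idx, face in zip(rerolled, faces):
            dice[idx] = face
        total += score(tuple(sorted(dice))) - score(tuple(state))
        n += 1
    return total / n
```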
This also sheds light on the slow convergence for some roll patterns in figure 3. Such convergence issues are well known in Monte Carlo methods when small regions of interest must be resolved. The directed training could be modified to target such regions more effectively, for example by preferentially directing more training to the more nearly optimal actions, although one must be careful not to bias the results.
The actions chosen by \(p\) in this report are always maximal. However, the framework also supports a tolerance value, so that the returned action may be chosen at random from the set of actions whose expected point increases are within tolerance of the maximum. Tolerance should be > 0, but a special value of -1 is recognized to mean that an action should be chosen at random from among all possible actions. The RMS difference plot in figure 2 may be used as a rough guide for setting the tolerance at a given level of training. With a little more information in the action table (e.g., a running standard deviation), one could use statistical inference to set the tolerance to a value that is guaranteed to include the optimal action at a specified confidence level.
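A minimal sketch of tolerance-based selection, assuming the per-state averages are available as a list indexed by action; this is illustrative, not the framework's own implementation.

```python
import random

def pick_action_with_tolerance(averages, tolerance):
    """Choose at random among actions whose expected gain is within 'tolerance' of the
    best; the special value -1 means choose uniformly from all actions."""
    if tolerance == -1:
        return random.randrange(len(averages))
    best = max(averages)
    candidates = [a for a, v in enumerate(averages) if v >= best - tolerance]
    return random.choice(candidates)
```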
Setting the tolerance to -1 reflects the need for strategy exploration. In situations where there is not great confidence that the training has captured the optimal plays, exploration plays an important role in gathering data on lesser actions that would otherwise never be played. While the data in this report relied instead on directed training, that is not always practical. Other possibilities include using a Boltzmann distribution to select actions, or, if it is suspected that the probabilities are not stationary, using an exponential moving average to update the values in the action table.
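Sketches of those two alternatives, under the same assumed list-of-averages layout; neither is part of the repo.

```python
import math
import random

def boltzmann_pick(averages, temperature=1.0):
    """Softmax (Boltzmann) exploration: better actions are chosen more often, but every
    action keeps a nonzero probability controlled by the temperature."""
    weights = [math.exp(v / temperature) for v in averages]
    r = random.uniform(0, sum(weights))
    cumulative = 0.0
    for action, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return action
    return len(weights) - 1

def ema_update(old_average, delta, alpha=0.05):
    """Exponential moving average: recent plays count more, which helps if the
    underlying probabilities drift over time."""
    return (1 - alpha) * old_average + alpha * delta
```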
This report has looked at a simplified version of Yamslam consisting of a single roll and re-roll. A more realistic example would include the constraints on roll patterns and multi-step action evaluation. In Yamslam, each roll pattern can be scored only three times, so if players collectively score three “Large Straights” for 50 points each, a “Large Straight” on subsequent rolls will be worth 0 points. This can be accounted for in the action table by including expected point increases for each target roll pattern.
Multi-step actions can be incorporated even more easily by updating the action table N times with exponentially decreasing weights per step. For example, using a weight of 0.1: if we roll 11246, hold dice 1 and 2, re-roll to get 11136, then hold dice 1, 2, 3, and 4 and re-roll to get 11135, we score a Flush for 25 points. The action table would then be credited the full 25 points for the action leading to the third roll and 2.5 points for the action leading to the second roll.
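A sketch of that multi-step credit assignment, reusing the hypothetical [sum, count, average] table layout from the earlier sketch; the trajectory is the list of (state, action) pairs taken during one turn.

```python
def update_multistep(table, trajectory, final_gain, step_weight=0.1):
    """Credit every (state, action) pair in a turn with an exponentially decreasing share
    of the final point gain: the last re-roll gets the full amount, the one before it
    gets final_gain * step_weight, and so on."""
    weight = 1.0
    for state, action in reversed(trajectory):
        entry = table[(state, action)]          # same [sum, count, average] layout as above
        entry[0] += final_gain * weight
        entry[1] += 1
        entry[2] = entry[0] / entry[1]
        weight *= step_weight
```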
For reference, the data plotted in figures 1 and 2 are tabulated below.

| Trials per state-action pair | Fraction agree (fig. 1) | Mean RMS difference (fig. 2) |
|---|---|---|
| 10 | 0.512 | 3.584 |
| 20 | 0.583 | 2.584 |
| 40 | 0.663 | 1.788 |
| 80 | 0.742 | 1.262 |
| 100 | 0.762 | 1.143 |
| 200 | 0.825 | 0.822 |
| 400 | 0.893 | 0.580 |
| 800 | 0.877 | 0.424 |
| 1600 | 0.889 | 0.307 |
| 3200 | 0.917 | 0.232 |
| 6400 | 0.929 | 0.182 |
| 10000 | 0.937 | 0.162 |