■ Unravelling Tesauro's world champion backgammon player mystery

The first major achievement of artificial intelligence against humans — before chess playing Deep Blue beat Kasparov — was Gerald Tesauro's TD-Gammon back in the early 90s. Tesauro, an IBM researcher combined Reinforcement Learning (RL) with Artificial Neural Networks (ANN) into an AI program that learned how to play backgammon "itself" through unsupervised self-play. TD-Gammon reached world champion performance level in 1995 (narrowly beaten at the time).

I frequently came across citations of TD-Gammon in all AI books I read, but lacking any substantial information that would allow one to understand how Tesauro did it. Tesauro himself was very thin on details. Apparently he used Temporal Difference (the TD part) learning, and a weirdly configured neural net to learn the utility (or value) function of each backgammon board state. But there are so many question marks, at every turn of the implementation, in terms of mapping the backgammon play as reinforcement learning and approximating the value function through an ANN. Can the feat be reproduced? Now there's a challenge!

I found some python code on gihub that claims to reproduce TDGammon using TensorFlow, but nothing wrong with reinventing the wheel for personal edification

The game of backgammon and reinforcement learning AI agents

If you've never heard of backgammon then you haven't worked for the Greek fire department <g> It is a very old game with simple rules combining strategy with luck. The idea with reinforcement learning (RL) is to let the program (agent) figure out which moves are good and which are bad, trying out things at random (within the legal game moves), and offered a reward for winning, and a penalty for losing. RL algorithms then work out backwards assigning the ultimate game payoff to the actual moves the computer played, estimating the utility of each state-action pair.

For games with a small number of possible states and actions (state transitions) like tic-tac-toe, there are exact algorithms like Q-learning, that can be proved to converge to the true value of the utility (the likelihood of winning) for each game board state. After many games exploring the complete range of possible game states and actions, the agent learns the optimal policy, and knows which is the best move at each point of the game. Such table-driven policies are not feasible for more complex games like backgammon that has 10^20 states. The idea then is to approximate the utility function through self-play using e.g a neural net. As it is impossible to explore the complete range of possible state-action pairs, we can never reach the true utility function, but surprisingly the agent can become quite capable player after only a few hundred thousand games. There is no theoretical justification or convergence guarantee, but it somehow works automagically!

Probably the utility function estimate is good enough relatively; it is not the true utility but it can compare alternative moves successfuly.

The main ingredients for teaching the computer backgammon are:

Game rules. Each player has 15 checkers originally in a fixed layout, then rolls the dice. He has to move checkers towards the home quadrant. For each possible arrangement of the 2 player's checkers and dice roll, the agent must know how many alternative moves are possible. This is the basis for evaluating the best move. We don't tell the computer in advance not to do silly moves, like leaving checkers exposed. All this he'll figure out it(him?)self.
Explorative self-play. The AI plays many games against itself, using the current estimate of the "winning game strategy" as a guide. Each game is a Monte Carlo random simulation, where part of the time the move is chosen at random. At other times the best move is chosen using the current utility function. As this is an adversarial game, what's best for one side is worst for the other, so the AI chooses the highest utility move for one side and the lowest utility when playing for the opponent (the minimax game principle). At the beginning of this greedy/random exploration, all moves will be basically random, but as experience builds up, the AI will choose better moves.
Learning the winning moves. The AI sees no intermediate rewards, except when a game ends. If the AI side wins, a positive reward is offered, otherwise nothing (or a penalty). Based on this result, we go back to all the recent game moves and teach the AI that they are all good or bad, depending on who won. At first sight this doesn't make sense because a good move could be penalized because it was followed by many bad moves that caused a game loss. However that's the best we can do and is similar to SARSA learning. After lots of repetitions, the truly good moves will statistically have better utility values than the bad ones. This part of the algorithm is essentially supervised learning using a neural net.

The beauty of RL is that we never directly tell the computer what to do. After thousands of repetitions it understands that leaving checkers exposed is a bad idea, hitting opponent's checkers is a good idea and so on. But as far as we as teachers are concerned, we only reward a win, there is no other game model involved. Notice there is no look-ahead (tree exploration) either, the ANN-based utility function estimate has the final word deciding the best move. The neural net input is just the board state (position of the checkers) without any intelligent precomputed board features (blots, pip count, chance of getting hit etc) that human players use to guide their game.

To use Temporal Difference learning or not to?

General ideas aside, how do we actually create our own backgammon AI player? Sutton's original paper on temporal difference learning never made an impression on me. I may be thick but other than the "sequential calculation" idea, I don't see the point of the TD method in general, and for learning backgammon in particular. Its ANN weight update rule, instead of using data directly from the game, spends most of the time trying to keep the (inaccurate) board value estimate self consistent with each ply (move) of the game. The TD weight θi update formula is: temporal difference weight update rule

For all game moves except for the last one, the driving force (error minimization) is to keep 2 consecutive predictions in agreement (the future U(s') and current U(s)), and only at the very end of the game the actual reward is used, which is the only real data of the experiment. Then instead of the normal delta error terms and derivatives used in backpropagation, "eligibility traces" must be used. I don't get it at all.

In backgammon, it's no big deal to save all states and moves comprising a full game (not more than 100 plies or so), so the professed advantage of storage minimization of TD does not apply. You just need to keep track of the moves, and the states can be replayed on the fly. Instead of the TD error terms, we try to make the ANN learn to predict the actual game outcome, which is now known. Then we go back and generate N training points (one for each game move), each associating a board state with the final game outcome, then train the network to predict them using standard backpropagation.

Without intermediate rewards and without discounting (γ=1) each state utility has the same value as the final game outcome, as U(s)=R(s)+γU(s+1)

This idea of mine to use the game outcome for training all intermediate game moves is intellectually appealing (to me at least :) but sadly this flat function profile didn't rear smart backgammon AI players. By introducing a very small discount factor γ=0.99 the ANN was much better behaved. So instead of each move getting the same training value, only the final winning move got the full reward U, and earlier moves got progressively smaller chunks of it (γ*U then γ*γ*U and so on).

Backgammon doesn't need discounting to ensure any theoretical convergence properties (there aren't any!), but discounting the early moves of the game to smaller values ensures that early moves (possibly randomly chosen) will not bear heavily on the training of the ANN either way.

Another important RL parameter is the balance between random exploration and greedy moves. As we cannot keep track of how many times each board state has been visited, I decided on a fixed percentage of random moves (20% seemed a good level after many trials) — meaning that 80% of moves was chosen using the utility function. For long training sessions (over 200,000 games) I observed that a ramping policy worked best: instead of a flat 20% random moves, I started with 40% and linearly reduced it to 1% by the end of the games. This makes sense because as experience grows we want to minimize the random moves and help with the convergence of the utility function.

Neural network utility estimation

The heart of the learning algorithm is the ANN that learns to predict the utility of each board state. Given a starting state and a dice roll, we enumerate all possible moves and choose the one that has the biggest chance of winning (or smallest if the opponent is playing). ANNs are glorified function estimators of the form:

U = F(s)

where U is the utility and s is the board state (position of the checkers)

Note we train for the board state utility, not the Q state-action function

Tesauro uses a wacky encoding of the backgammon board state and I thought I could do better. As there are 24 board positions and only one opponent can have checkers at each position, wouldn't it be reasonable to have 24 input variables (the checkers at each position), positive values for X player, negative for O player, and zero for empty slots? Plus 2 inputs for pawns on the bar (hit), another 2 for checkers already collected (out of the game) and 1 for the player who is next to play, a total of 29 ANN inputs. Alas, this model didn't learn backgammon at all!

Obviously Tesauro put a lot of thinking (and trial and error) before deciding on his "weird" input encoding. It comprises 198 input neurons, and most inputs are 0-1 numbers. For example each position is described by 8(!) input variables, 4 for each opponent. If one checker is present, the assignment is [1-0-0-0], if 3 it's [1-1-1-0], and for 7 checkers it is [1-1-1-2]. Quirky or what? Here is a complete description (click on programming assignment #4)

I don't know what's the magic about this input scheme, it could be the 0-1 values or the great number of them (lots of adjustable weights to train the ANN), but it works nicely and that's that!

There are many other architectural and learning parameters to consider for the ANN. You will find full details below. Some of them are close to what Tesauro used, some had to be tweaked a lot. A single hidden layer of 45 neurons is linked to 2 outputs, the probability of winning for X or O. This 198x45x2 network has 9292 weights (including biases), a toy problem in terms of today's deep learning standards. Therefore there was no need for TensorFlow, PyTorch or cloud GPU computing, just a simple backpropagation C code did the training.

⊕ Click here to see ANN design details →

The learner was very sensitive to these parameters; any deviation to either side would lead to a severely worse backgammon player

Backgammon training runs

Armed with this reinforcement learning algorithm, using the simple backpropagation library and some C++ code of my own to generate board state alternatives, it was time to learn some backgammon. A dice roll decides who goes first (X or O), then they play against each other using the same ANN as utility estimator (in the minimax sense mentioned above). The roll of dice brings lots of randomness and variability to the game. Once a winner is confirmed, all current game states are replayed and used to update the ANN utility function with the discounted scheme mentioned. The learning procedure is repeated again and again for thousands of games.

How do we know when to stop the training? This is no regular ANN learning of experimental data or pattern recognition. Luckily for all RL learners, Tesauro has published a decent backgammon player called pubeval that is used as a benchmark and measure of the training progress. It is just a linear utility function based on checker positions, and it can only play the X side, but it is quite capable player, and it was used to measure the proficiency of the self-made players.

In tournament playing mode, AI plays the X side and always chooses the best move without random explorations

I spent quite some time tweaking learning parameters, but as it turns out, good learners reveal themselves in relatively few training games (100,000 or so), and further training achieves little, as you can see in this table:

Extra training could even make things worse (!) as you can see with the player trained on 480K games

# of games	pubeval performance	training time
60,000	34% wins	870 sec
120,000	40% wins	1608 sec
240,000	41% wins	2921 sec
360,000	51% wins	4404 sec
480,000	50% wins	5574 sec

Table 1. Progress of best learner agent program
(all parameters fixed except training length)

Due to the randomness of the game and the exploration, these results have large variability, but you can see that the 120K game learner is quite close to the 360K "champion" that is just better than Tesauro's pubeval automaton (wins it 51% of the time on average). Tripling the training returned merely 10% extra wins. Faced with such diminishing returns and having the scalp of pubeval in the pocket (just!) I call it end of backgammon school for this time round.

Deep (or shallow) AI learning requires lots of CPU resources. Using the profiler I managed to improve performance by 25% using a variety of tweaks, but it still takes more than one hour for 360K training games. I take pity of my laptop, straining it for even longer sessions like that. One possibility is to use a better ANN library. I found one called FANN that is 30% faster but has bugs. We will see if this ever grows to a world champion player... For the time being it is only fit to compete in local Cyprus backgammon competitions :)

Hello to all firemen

So my PC didn't grow itself into a world champion backgammon player, but it has taught itself the game very well, and it is a very good match for my skills — and probably yours too if you want to take the challenge and play against it. Do you feel lucky? Let's roll the dice!

Click to download backgammon AI player (168 KB)

for best results make sure you maximize the player window

It is a C++ console program, so it looks lame, but it is good fireman's fun. To play just follow the instructions. Each board slot has a number 1 to 24, so when it's your turn to play you give 2 numbers for the checkers to move (as the dice dictate). Last century playstation 0 technology. You will discover that it is an eccentric aggressive player that always goes for your exposed checkers, but manages to win me most of the time.

ps. I wasted lots of electricity tweaking parameters and training lengths, the darn thing won't get better than 52% wins. I even tried TD(1) without much improvement. Looks like I hit a brick wall, a ceiling to the achiavable performance!? There's more mystery to be unravelled here but it must wait for some other day...

ADDENDUM: It is possible to improve substantially on the basic backgammon player using look-ahead (tree search for subsequent player moves) as you can read in the sequel article