"Rise of the Machines"
The first major achievement of artificial intelligence against humans — before chess playing Deep Blue beat Kasparov — was Gerald Tesauro's TDGammon back in the early 90s. Tesauro, an IBM researcher combined Reinforcement Learning (RL) with Artificial Neural Networks (ANN) into an AI program that learned how to play backgammon "itself" through unsupervised selfplay. TDGammon reached world champion performance level in 1995 (narrowly beaten at the time).
I frequently came across citations of TDGammon in all AI books I read, but lacking any substantial information that would allow one to understand how Tesauro did it. Tesauro himself was very thin on details. Apparently he used Temporal Difference (the TD part) learning, and a weirdly configured neural net to learn the utility (or value) function of each backgammon board state. But there are so many question marks, at every turn of the implementation, in terms of mapping the backgammon play as reinforcement learning and approximating the value function through an ANN. Can the feat be reproduced? Now there's a challenge! I found some python code on gihub that claims to reproduce TDGammon using TensorFlow, but nothing wrong with reinventing the wheel for personal edification 
For games with a small number of possible states and actions (state transitions) like tictactoe, there are exact algorithms like Qlearning, that can be proved to converge to the true value of the utility (the likelihood of winning) for each game board state. After many games exploring the complete range of possible game states and actions, the agent learns the optimal policy, and knows which is the best move at each point of the game. Such tabledriven policies are not feasible for more complex games like backgammon that has 10^20 states. The idea then is to approximate the utility function through selfplay using e.g a neural net. As it is impossible to explore the complete range of possible stateaction pairs, we can never reach the true utility function, but surprisingly the agent can become quite capable player after only a few hundred thousand games. There is no theoretical justification or convergence guarantee, but it somehow works automagically!
Probably the utility function estimate is good enough relatively; it is not the true utility but it can compare alternative moves successfuly.
The main ingredients for teaching the computer backgammon are:
The beauty of RL is that we never directly tell the computer what to do. After thousands of repetitions it understands that leaving checkers exposed is a bad idea, hitting opponent's checkers is a good idea and so on. But as far as we as teachers are concerned, we only reward a win, there is no other game model involved. Notice there is no lookahead (tree exploration) either, the ANNbased utility function estimate has the final word deciding the best move. The neural net input is just the board state (position of the checkers) without any intelligent precomputed board features (blots, pip count, chance of getting hit etc) that human players use to guide their game.
In backgammon, it's no big deal to save all states and moves comprising a full game (not more than 100 plies or so), so the professed advantage of storage minimization of TD does not apply. You just need to keep track of the moves, and the states can be replayed on the fly. Instead of the TD error terms, we try to make the ANN learn to predict the actual game outcome, which is now known. Then we go back and generate N training points (one for each game move), each associating a board state with the final game outcome, then train the network to predict them using standard backpropagation.
Without intermediate rewards and without discounting (γ=1) each state utility has the same value as the final game outcome, as U(s)=R(s)+γU(s+1)
This idea of mine to use the game outcome for training all intermediate game moves is intellectually appealing (to me at least :) but sadly this flat function profile didn't rear smart backgammon AI players. By introducing a very small discount factor γ=0.99 the ANN was much better behaved. So instead of each move getting the same training value, only the final winning move got the full reward U, and earlier moves got progressively smaller chunks of it (γ*U then γ*γ*U and so on).
Backgammon doesn't need discounting to ensure any theoretical convergence properties (there aren't any!), but discounting the early moves of the game to smaller values ensures that early moves (possibly randomly chosen) will not bear heavily on the training of the ANN either way.
Another important RL parameter is the balance between random exploration and greedy moves. As we cannot keep track of how many times each board state has been visited, I decided on a fixed percentage of random moves (20% seemed a good level after many trials) — meaning that 80% of moves was chosen using the utility function. For long training sessions (over 200,000 games) I observed that a ramping policy worked best: instead of a flat 20% random moves, I started with 40% and linearly reduced it to 1% by the end of the games. This makes sense because as experience grows we want to minimize the random moves and help with the convergence of the utility function.
U = F(s)
Note we train for the board state utility, not the Q stateaction function
Tesauro uses a wacky encoding of the backgammon board state and I thought I could do better. As there are 24 board positions and only one opponent can have checkers at each position, wouldn't it be reasonable to have 24 input variables (the checkers at each position), positive values for X player, negative for O player, and zero for empty slots? Plus 2 inputs for pawns on the bar (hit), another 2 for checkers already collected (out of the game) and 1 for the player who is next to play, a total of 29 ANN inputs. Alas, this model didn't learn backgammon at all!
Obviously Tesauro put a lot of thinking (and trial and error) before deciding on his "weird" input encoding. It comprises 198 input neurons, and most inputs are 01 numbers. For example each position is described by 8(!) input variables, 4 for each opponent. If one checker is present, the assignment is [1000], if 3 it's [1110], and for 7 checkers it is [1112]. Quirky or what? Here is a complete description (click on programming assignment #4)
I don't know what's the magic about this input scheme, it could be the 01 values or the great number of them (lots of adjustable weights to train the ANN), but it works nicely and that's that!
There are many other architectural and learning parameters to consider for the ANN. You will find full details below. Some of them are close to what Tesauro used, some had to be tweaked a lot. A single hidden layer of 45 neurons is linked to 2 outputs, the probability of winning for X or O. This 198x45x2 network has 9292 weights (including biases), a toy problem in terms of today's deep learning standards. Therefore there was no need for TensorFlow, PyTorch or cloud GPU computing, just a simple backpropagation C code did the training.
⊕ Click here to see ANN design details →
The learner was very sensitive to these parameters; any deviation to either side would lead to a severely worse backgammon player
How do we know when to stop the training? This is no regular ANN learning of experimental data or pattern recognition. Luckily for all RL learners, Tesauro has published a decent backgammon player called pubeval that is used as a benchmark and measure of the training progress. It is just a linear utility function based on checker positions, and it can only play the X side, but it is quite capable player, and it was used to measure the proficiency of the selfmade players.
In tournament playing mode, AI plays the X side and always chooses the best move without random explorations
I spent quite some time tweaking learning parameters, but as it turns out, good learners reveal themselves in relatively few training games (100,000 or so), and further training achieves little, as you can see in this table:
Extra training could even make things worse (!) as you can see with the player trained on 480K games

Table 1. Progress of best learner agent program
(all parameters fixed except training length)
Due to the randomness of the game and the exploration, these results have large variability, but you can see that the 120K game learner is quite close to the 360K "champion" that is just better than Tesauro's pubeval automaton (wins it 51% of the time on average). Tripling the training returned merely 10% extra wins. Faced with such diminishing returns and having the scalp of pubeval in the pocket (just!) I call it end of backgammon school for this time round.
Deep (or shallow) AI learning requires lots of CPU resources. Using the profiler I managed to improve performance by 25% using a variety of tweaks, but it still takes more than one hour for 360K training games. I take pity of my laptop, straining it for even longer sessions like that. One possibility is to use a better ANN library. I found one called FANN that is 30% faster but has bugs. We will see if this ever grows to a world champion player... For the time being it is only fit to compete in local Cyprus backgammon competitions :)
for best results make sure you maximize the player window
It is a C++ console program, so it looks lame, but it is good fireman's fun. To play just follow the instructions. Each board slot has a number 1 to 24, so when it's your turn to play you give 2 numbers for the checkers to move (as the dice dictate). Last century playstation 0 technology. You will discover that it is an eccentric aggressive player that always goes for your exposed checkers, but manages to win me most of the time.
ps. I wasted lots of electricity tweaking parameters and training lengths, the darn thing won't get better than 52% wins. I even tried TD(1) without much improvement. Looks like I hit a brick wall, a ceiling to the achiavable performance!? There's more mystery to be unravelled here but it must wait for some other day...
Post a comment on this topic »