22 March 2020

■ Improved reinforcement learning backgammon player with rollouts


In an earlier blog post we guided the computer to teach itself how to play decent backgammon, following the guidelines of Tesauro, who first built a world-champion-beating player using TD(λ) reinforcement learning (RL) based on an artificial neural network (ANN) state value estimator. Our player was just about level with the benchmark backgammon player called pubeval.

At first I was rather disappointed that, despite extensive tweaking of the learning parameters, including very long self-play training sessions, the RL learner didn't improve beyond 52% over pubeval (meaning it was winning only about half of the time). Pubeval is a simple linear state value function, so how come it is so hard to beat? Then I realized that Tesauro's own RL player wasn't much better straight out of the box. To reach world champion status he had to add various "supervised" tweaks on top. One of those is the idea of rollouts (Tesauro and Galperin, 1997).

Backgammon is a game of skill and luck, because it involves rolling dice. The basic RL-ANN player, given a starting board position and a dice roll, tries to figure out the best possible move just by comparing the immediate resulting board positions in terms of their value estimate U(s). This playing strategy has several disadvantages: it looks only one step ahead, it ignores what the dice may bring in subsequent turns, and it relies on value estimates far from the end of the game, where they are presumably less reliable.
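For reference, the baseline selection rule is easy to state in code. The sketch below is only illustrative, built around hypothetical helpers legal_moves(board, dice) and apply_move(board, move) and a trained value(board) network standing in for U(s); it is not the actual code from the earlier post.

def greedy_move(board, dice, value, legal_moves, apply_move):
    """Pick the move whose immediate resulting position scores highest under U(s)."""
    best_move, best_value = None, float("-inf")
    for move in legal_moves(board, dice):
        v = value(apply_move(board, move))  # one-step look-ahead only
        if v > best_value:
            best_move, best_value = move, v
    return best_move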

Using rollouts we can tackle these problems. Instead of committing to a single "best" move, we keep the top N (e.g. 5) alternatives ranked by their U(s) value, and then we simulate each one with random play and see which one leads to the best results (wins). To get accurate value estimates with Monte Carlo search, we must play many such games from our fixed starting position. This strategy is sound but computationally intensive, especially if each candidate move is evaluated by playing all the way to the end of the game.
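As a rough sketch of the idea (reusing the same hypothetical helpers, plus a play_out(board) routine that plays one game to the end with random dice and returns 1 for a win, 0 otherwise), a Monte Carlo rollout of the shortlisted moves might look like this:

def rollout_move(board, dice, value, legal_moves, apply_move, play_out,
                 n_best=5, n_games=100):
    """Monte Carlo rollout: shortlist the top-N moves by U(s), then estimate
    each one's real worth by playing many games out from the resulting position."""
    # Shortlist candidate moves by the ANN value of their immediate successor
    candidates = sorted(legal_moves(board, dice),
                        key=lambda m: value(apply_move(board, m)),
                        reverse=True)[:n_best]
    best_move, best_win_rate = None, -1.0
    for move in candidates:
        start = apply_move(board, move)
        # Average outcome of n_games simulated games from this position
        win_rate = sum(play_out(start) for _ in range(n_games)) / n_games
        if win_rate > best_win_rate:
            best_move, best_win_rate = move, win_rate
    return best_move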

Another idea, which I followed, is to examine all possible dice rolls that the opponent can throw (and the dice throws that follow), thereby looking a couple of board positions ahead and taking an average (expected) value over the possible future game developments. So instead of deciding on the spot, we compare the best 5 move alternatives in terms of their average future utility (see the figure and the code sketch after it). As there are 21 unique dice combinations per turn, we cannot look very far into the future this way because of combinatorial explosion, but looking ahead 2 plies (one for the opponent and then one more for the AI) is quite manageable (21*21=441 alternatives). When looking ahead we select only a single best move (no alternatives), judged by the ANN value function. So each of the 5 candidate moves generates 441 board positions, whose values are averaged, and the move with the best average is selected for immediate play. Here are the benefits:
  • We take into account every possible dice outcome
  • We base our choice on state values closer to the end of the game (presumably more reliable)
  • No extra learning is involved, just a little tree search before each move
[Figure: rollout tree]
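A rough sketch of this 2-ply look-ahead is shown below. It reuses the hypothetical helpers from the earlier sketches, extended with a side argument so that moves can be generated for either player, and it takes a plain average over the 21 roll combinations as described above; it illustrates the idea rather than reproducing the actual implementation.

from itertools import combinations_with_replacement

# The 21 distinct rolls of two dice (the order of the dice does not matter)
DICE_ROLLS = list(combinations_with_replacement(range(1, 7), 2))

def lookahead_move(board, dice, value, legal_moves, apply_move, n_best=5):
    """2-ply look-ahead: rank the top-N candidate moves by the average U(s)
    reached after every opponent roll and every own reply (21*21 = 441 lines)."""
    candidates = sorted(legal_moves(board, dice, "ai"),
                        key=lambda m: value(apply_move(board, m, "ai")),
                        reverse=True)[:n_best]

    def best_reply(pos, roll, side):
        # Deeper plies keep only the single best move according to the ANN:
        # the opponent minimises our win estimate, the AI maximises it
        successors = [apply_move(pos, m, side) for m in legal_moves(pos, roll, side)]
        if not successors:  # no legal move: the position stays as it is
            return pos
        pick = min if side == "opponent" else max
        return pick(successors, key=value)

    best_move, best_avg = None, float("-inf")
    for move in candidates:
        after = apply_move(board, move, "ai")
        total = 0.0
        for opp_roll in DICE_ROLLS:  # ply 1: the opponent's turn
            opp_pos = best_reply(after, opp_roll, "opponent")
            for ai_roll in DICE_ROLLS:  # ply 2: our turn again
                total += value(best_reply(opp_pos, ai_roll, "ai"))
        # Plain average over the 441 leaves, as in the post (an exact expectation
        # would weight non-double rolls twice as heavily as doubles)
        avg = total / len(DICE_ROLLS) ** 2
        if avg > best_avg:
            best_move, best_avg = move, avg
    return best_move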

This simple look-ahead trick results in a backgammon player that achieves 60% wins over pubeval (a 20% relative improvement). A simple common-sense improvement that goes a long way towards a better player! I tried to play a little with the rollout parameters, but as you see below there isn't much to gain on average. The experiments are based on 1000 games of AI-rollout vs pubeval:

rollout plies | alternatives per move | moves considered | AI wins | ppg    | CPU time (sec)
0             | 1                     | 1                | 50%     | -0.052 | 0.35
2             | 441                   | 5                | 59.09%  | 0.17   | 2496
2             | 441                   | 3                | 58.40%  | 0.22   | 2291
2             | 441                   | 7                | 59.88%  | 0.20   | 1608
4             | 194481                | 4                | ?       | ?      | ?

Table 1. Effect of look-ahead on AI player performance vs pubeval

The base case (0 plies) is the original player without look-ahead. It is just about level with pubeval; in fact it trails in terms of points per game (ppg) because of double wins. Once we start looking ahead 2 dice rolls (plies), there is a clear advantage, both in wins and in ppg. These runs differ in the number of moves considered, whether the top 3, 5 or 7 moves by value. The differences between them are slight and probably due to the randomness of the games. Clearly rollouts cost a lot in terms of CPU time per move (note the table lists time for 100 complete games), but it only gets prohibitive if we roll out 4 plies (194481 alternatives per move). I am still waiting for the results of the 4-ply experiment <g>


