Notwithstanding the lamentable present state of AI, I have always been fascinated by learning, consciousness, and similar philosophical matters. Is our brain just a computer? Can an electronic device acquire consciousness? Over the past few months I have been busy reading AI books and experimenting with simple learning algorithms. To cut a long story short, this past week I was trying to teach the PC to play Tic-Tac-Toe using the Reinforcement Learning approach.
The gist of Reinforcement Learning (RL) is that instead of telling the computer how to solve a task, we let it wander randomly through the possible states and give it a reward when it does something right. For example, in the game of tic-tac-toe the computer is rewarded when it wins and gets a negative reward when it loses. How does the computer learn to play when it only gets a reward at the end of the game? By using the Bellman equation to discount future rewards while exploring the possible moves of the two players, X and O.
A good introduction to RL is chapter 13 of Tom Mitchell's book "Machine Learning". However, the iterative equation that is the basis of Q-learning is stated for tasks where the agent acts on its own, e.g. finding its way out of a maze:
Q(s,a) = Reward + γ max Q(s',a')
This rule says that the updated estimate of the optimal policy function Q for state s, performing action (move) a and leading to state s', is the sum of the immediate reward (if any) and the value of the best action available at the next board position s' (considering all a' allowable at s'), discounted by a factor γ (usually 0.9).
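To make the rule concrete, here is a minimal sketch of that tabular update in Python. The dictionary-based Q table, the `actions_at` helper, and the placeholder states are my own assumptions, not the author's implementation:

```python
GAMMA = 0.9  # discount factor, as in the text

def q_update(Q, s, a, s_next, reward, actions_at):
    """One tabular Q-learning step: Q(s,a) = reward + gamma * max_a' Q(s',a').
    Q maps (state, action) pairs to values; actions_at(state) lists the
    moves allowable at that state (empty for a terminal state)."""
    future = max((Q.get((s_next, a2), 0.0) for a2 in actions_at(s_next)),
                 default=0.0)  # no future reward from a terminal state
    Q[(s, a)] = reward + GAMMA * future
```

For instance, if the best action at s' is already valued at 10, the state-action pair leading to s' with no immediate reward gets 0.9 × 10 = 9.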
A few months ago I made an unbeatable tic-tac-toe player using the minimax method with a fixed board-state heuristic function, but reinforcement learning (Exercise 13.3 in Mitchell's book) was getting (the computer) nowhere. Games are adversarial, so whatever is good for you is usually bad for the opponent and vice versa. I just could not figure out how to evaluate Q(s,a) in game mode. My RL learner, playing against a random* opponent, was comprehensively beaten.
* The opponent did the right thing only when a win was obvious or to avoid an obvious loss (forced play), and played randomly in all other cases.
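Such a "forced play" opponent could be sketched like this. The board encoding (a flat list of nine cells) and the helper names are my own illustration, not the code used in the experiment:

```python
import random

# the eight winning lines on a 3x3 board, as cell indices 0..8
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for i, j, k in LINES:
        if board[i] != ' ' and board[i] == board[j] == board[k]:
            return board[i]
    return None

def semi_random_move(board, me, foe):
    """Take an immediate win if one exists, otherwise block the
    opponent's immediate win, otherwise play a random empty cell."""
    empty = [i for i in range(9) if board[i] == ' ']
    for mark in (me, foe):  # first look for our win, then for a forced block
        for i in empty:
            board[i] = mark
            forced = winner(board) == mark
            board[i] = ' '
            if forced:
                return i
    return random.choice(empty)
```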
After a lot of trial and error, I realized that in the presence of negative rewards and a game opponent, the Q(s,a) update formula must differ depending on whose turn it is to play. Assuming we are rooting for X to win, when it is X's turn it must choose the action with the largest expected payoff (even when this is negative). When the next state s' is for O to play, the Q(s,a) formula must use the worst Q(s',a'), which is usually negative too. The idea will become clearer with this picture:
Figure 1. State transitions and tic-tac-toe rewards
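In code, the turn-dependent backup described above might look like the sketch below. The helper names (`actions_at`, `x_to_move_at`) and the dictionary Q table are hypothetical placeholders of my own:

```python
GAMMA = 0.9  # discount factor

def q_update_game(Q, s, a, s_next, reward, actions_at, x_to_move_at):
    """Adversarial Q-update: if X moves at s', back up the best Q(s',a');
    if O moves at s', back up the worst Q(s',a'), since O plays to win.
    Q maps (state, action) pairs to values."""
    values = [Q.get((s_next, a2), 0.0) for a2 in actions_at(s_next)]
    if not values:                     # s' is terminal: no future reward
        future = 0.0
    elif x_to_move_at(s_next):         # X picks its most favorable move
        future = max(values)
    else:                              # O picks the move worst for X
        future = min(values)
    Q[(s, a)] = reward + GAMMA * future
```

With the figure's numbers, a successor position where O can choose between -100 (O wins) and +90 (X wins next) backs up as 0.9 × (-100) = -90, not the wishful 0.9 × 90 = +81.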
In this small fragment of the tic-tac-toe state-space tree, we are trying to assign a payoff to a board position where it is O's turn to play. If O plays the middle spot, it wins (reward -100 for X), whereas if it plays the lower-left slot, X wins on the next move (reward +100, discounted to +90 for the previous step). Which is the best result for X in this context? Sadly, because O will play to win, we can only expect the inevitable -100 (discounted to -90); the +81 is just wishful thinking! So to recap, the correct Q(s',a') to use in the update formula is: