Notwithstanding the lamentable present state of AI, I have always been fascinated by learning, consciousness, and similar philosophical matters. Is our brain just a computer? Can an electronic device acquire consciousness? Over the past few months I have been busy reading AI books and experimenting with simple learning algorithms. To cut a long story short, this past week I was trying to teach the PC to play Tic-Tac-Toe using the Reinforcement Learning approach.
The gist of Reinforcement Learning (RL) is that instead of telling the computer how to solve a task, we let it wander randomly through the possible states and give it a reward when it does something right. For example, in the game of tic-tac-toe the computer is rewarded when it wins and gets a negative reward when it loses. How does the computer learn to play when it only gets a reward at the end of the game? By using the Bellman equation to discount future rewards while exploring the possible moves of the two players, X and O.
A good introduction to RL is chapter 13 of Tom Mitchell's book "Machine Learning". However, the iterative equation that is the basis of Q-learning is stated for tasks where the agent acts on its own, e.g. finding its way out of a maze:
Q(s,a) = Reward + γ max Q(s',a')
This rule says that the updated estimate of the optimal policy function Q for state s, performing action (move) a and leading to state s', is the sum of the immediate reward (if any) and the value of the best action available at the next board position s' (considering all a' allowable at s'), discounted by a factor γ (usually 0.9).
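To make the rule concrete, here is a minimal sketch of that tabular update in Python. The dictionary-based Q table, the `actions_at` helper, and the placeholder states are my own assumptions, not the author's implementation:

```python
GAMMA = 0.9  # discount factor, as in the text

def q_update(Q, s, a, s_next, reward, actions_at):
    """One tabular Q-learning step: Q(s,a) = reward + gamma * max_a' Q(s',a').
    Q maps (state, action) pairs to values; actions_at(state) lists the
    moves allowable at that state (empty for a terminal state)."""
    future = max((Q.get((s_next, a2), 0.0) for a2 in actions_at(s_next)),
                 default=0.0)  # no future reward from a terminal state
    Q[(s, a)] = reward + GAMMA * future
```

For instance, if the best action at s' is already valued at 10, the state-action pair leading to s' with no immediate reward gets 0.9 × 10 = 9.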
A few months ago I made an unbeatable tic-tac-toe player using the minimax method with a fixed board-state heuristic function, but reinforcement learning (Exercise 13.3 in Mitchell's book) was getting (the computer) nowhere. Games are adversarial, so whatever is good for you is usually bad for the opponent and vice versa. I just could not figure out how to evaluate Q(s,a) in game mode. My RL learner, playing against a random* opponent, was comprehensively beaten.
* The opponent did the right thing only when a win was obvious or to avoid an obvious loss (forced play), and played randomly in all other cases.
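Such a "forced play" opponent could be sketched like this. The board encoding (a flat list of nine cells) and the helper names are my own illustration, not the code used in the experiment:

```python
import random

# the eight winning lines on a 3x3 board, as cell indices 0..8
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for i, j, k in LINES:
        if board[i] != ' ' and board[i] == board[j] == board[k]:
            return board[i]
    return None

def semi_random_move(board, me, foe):
    """Take an immediate win if one exists, otherwise block the
    opponent's immediate win, otherwise play a random empty cell."""
    empty = [i for i in range(9) if board[i] == ' ']
    for mark in (me, foe):  # first look for our win, then for a forced block
        for i in empty:
            board[i] = mark
            forced = winner(board) == mark
            board[i] = ' '
            if forced:
                return i
    return random.choice(empty)
```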
After a lot of trial and error, I realized that in the presence of negative rewards and a game opponent, the Q(s,a) update formula must differ depending on whose turn it is to play. Assuming we are rooting for X to win, when it is X's turn it must choose the action with the largest expected payoff (even when this is negative). When the next state s' is for O to play, the Q(s,a) formula must use the worst Q(s',a'), which is usually negative too. The idea will become clearer with this picture:
Figure 1. State transitions and tic-tac-toe rewards
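In code, the turn-dependent backup described above might look like the sketch below. The helper names (`actions_at`, `x_to_move_at`) and the dictionary Q table are hypothetical placeholders of my own:

```python
GAMMA = 0.9  # discount factor

def q_update_game(Q, s, a, s_next, reward, actions_at, x_to_move_at):
    """Adversarial Q-update: if X moves at s', back up the best Q(s',a');
    if O moves at s', back up the worst Q(s',a'), since O plays to win.
    Q maps (state, action) pairs to values."""
    values = [Q.get((s_next, a2), 0.0) for a2 in actions_at(s_next)]
    if not values:                     # s' is terminal: no future reward
        future = 0.0
    elif x_to_move_at(s_next):         # X picks its most favorable move
        future = max(values)
    else:                              # O picks the move worst for X
        future = min(values)
    Q[(s, a)] = reward + GAMMA * future
```

With the figure's numbers, a successor position where O can choose between -100 (O wins) and +90 (X wins next) backs up as 0.9 × (-100) = -90, not the wishful 0.9 × 90 = +81.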
In this small fragment of the tic-tac-toe state-space tree, we are trying to assign a payoff to a board position where it is O's turn to play. If O plays the middle spot, it wins (reward -100 for X), whereas if it plays the lower-left slot, X wins on the next move (reward +100, discounted to +90 for the previous step). Which is the best result for X in this context? Sadly, because O will play to win, we can only expect the inevitable -100 (discounted to -90); the +81 is just wishful thinking! So to recap, the correct Q(s',a') to use in the update formula is: