Applications of Deep Reinforcement Learning in Games

Calum McCartan
8 min read · Apr 11, 2021

AlphaGo made headlines when it beat world champion Lee Sedol at the board game Go, but what other types of games can deep reinforcement learning be applied to?

In this post I will be covering two papers on the topic of deep reinforcement learning in games. The first is Deep Reinforcement Learning for General Game Playing by Goldwaser, A., & Thielscher, M., which explores how the famous AlphaZero can be generalised to play a much wider variety of games. The second is Playing FPS Games with Deep Reinforcement Learning by Lample, G., & Chaplot, D. S., which expands beyond the realm of 2D, tackling the challenge of 3D video games by using deep reinforcement learning to train an agent to play the classic video game Doom.

Introduction

Traditional heuristic-based AI systems have long been effective at playing many board games, and some simple games have even been solved by fully exploring their state space. Where these systems begin to fall short, however, is in games with many possible moves, where the number of reachable states explodes. To keep improving state-of-the-art board game AI, researchers turned to machine learning.

AlphaGo combined Monte Carlo tree search (MCTS) with deep reinforcement learning to produce the first AI capable of beating world champion Go player Lee Sedol. This demonstrated the potential of machine learning for building state-of-the-art board game AI.

Later, AlphaZero improved on AlphaGo and generalised beyond Go to other board games such as chess. The major change that made this possible was having AlphaZero learn entirely from self-play, rather than bootstrapping with supervised learning on recordings of professional games.

AlphaZero is an extremely effective deep reinforcement learning system for playing board games, but what if it could be generalised further to play other games? This is the focus of the general game playing paper.

Deep Reinforcement Learning for General Game Playing

To generalise AlphaZero further, the authors focus on removing restrictions so that the system can play games with the following attributes.

  • Multiplayer games (more than 2 players)
  • Cooperative games (rather than zero-sum games)
  • Asymmetric games (players have different roles)
  • Real-time games (rather than turn-based)
  • Non-board games (e.g. Pacman)

To explain how this was accomplished, I’ll first describe the architecture of the network.

Architecture of Generalised AlphaZero

The details of the game to be learned are provided to the system via a game description language (GDL). This description is encoded as a propositional network: a set of boolean nodes and logic gates defining the features of the game, which allows far more than just board games to be described. The encoding includes the game’s reward structure and the logic used to determine whether a given move is legal.
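To make this concrete, here is a toy sketch of how a propositional network might represent one legality rule as boolean nodes and gates. The class names and the example rule are illustrative assumptions, not the paper’s actual implementation.

```python
# Toy propositional network fragment: boolean propositions wired through a
# logic gate. The structure and names here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Proposition:
    name: str
    value: bool = False

@dataclass
class AndGate:
    inputs: list  # upstream Proposition nodes

    def evaluate(self) -> bool:
        # The gate's output is true only if every input proposition is true.
        return all(node.value for node in self.inputs)

# Example rule: a move is legal only if the target cell is empty AND it is our turn.
cell_empty = Proposition("cell_empty", True)
our_turn = Proposition("our_turn", True)
legal_move = AndGate(inputs=[cell_empty, our_turn])
print(legal_move.evaluate())  # True
```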

Next come several fully connected layers with ReLU activations, which halve in size at each layer until a layer has fewer than 50 hidden units. At this point, the network splits into a separate head for each role in the game. This is the crucial step that allows asymmetric and multiplayer games to be played. Because the model learns by playing against itself, multiplayer games with different roles require implementations of all players in order to train the network. The split-head approach lets all roles be trained in parallel within the same model, while also letting them reuse features from the shared layers to train faster.

Each head contains further fully connected layers, followed by an output layer with a softmax activation for the move probabilities and a sigmoid activation for the expected reward. The policy output is a list of probabilities over the possible moves, and the expected reward is a value between -1 and 1. For cooperative games, where rewards are never negative, the reward is restricted to values between 0 and 1.
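A minimal PyTorch sketch of this trunk-and-heads layout is shown below. The halving rule and the split into per-role policy and value heads follow the description above; the exact layer sizes, names, and how the 50-unit cut-off is handled are assumptions.

```python
import torch
import torch.nn as nn

class GeneralisedAlphaZeroNet(nn.Module):
    """Sketch of the generalised AlphaZero network: shared trunk + per-role heads."""

    def __init__(self, n_inputs: int, n_roles: int, n_moves: int):
        super().__init__()
        # Shared fully connected ReLU layers that halve in width down to
        # roughly the 50-unit cut-off described in the text.
        layers, width = [], n_inputs
        while width // 2 >= 50:
            layers += [nn.Linear(width, width // 2), nn.ReLU()]
            width //= 2
        self.trunk = nn.Sequential(*layers)

        # One head per role: move probabilities and an expected-reward value.
        self.policy_heads = nn.ModuleList([nn.Linear(width, n_moves) for _ in range(n_roles)])
        self.value_heads = nn.ModuleList([nn.Linear(width, 1) for _ in range(n_roles)])

    def forward(self, x):
        h = self.trunk(x)
        policies = [torch.softmax(head(h), dim=-1) for head in self.policy_heads]
        values = [torch.sigmoid(head(h)) for head in self.value_heads]  # expected reward per role
        return policies, values

# Usage sketch: a 3-role game with a 400-bit state encoding and 50 possible moves.
net = GeneralisedAlphaZeroNet(n_inputs=400, n_roles=3, n_moves=50)
policies, values = net(torch.rand(1, 400))
```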

Evaluation

So how well does the generalised AlphaZero perform when playing other types of games? To evaluate the system, the authors compare it against UCT (Upper Confidence bounds applied to Trees), a standard tree search baseline; a sketch of UCT’s selection rule follows the list of games below. They evaluated four games that have been used in the International General Game Playing Competition. The games are well known, but some are slightly simplified (e.g. a smaller board size than usual).

  • Connect-4 (turn-based game on a 6x7 board)
  • Breakthrough (turn-based game on a 6x6 board)
  • Babel (3-player turn-based cooperative game)
  • Pacman (3-player simultaneous-move game on a 6x6 board; zero-sum for Pacman but cooperative for the 2 ghosts)
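For reference, UCT selects which move to explore in the search tree by maximising an upper confidence bound over simulated playouts. The sketch below shows that selection rule; the node representation and the exploration constant are assumptions, not the competition’s exact implementation.

```python
import math

C = 1.41  # exploration constant; sqrt(2) is a common default (assumed here)

def uct_score(value_sum: float, visits: int, parent_visits: int) -> float:
    """Upper confidence bound for one child move in the search tree."""
    if visits == 0:
        return float("inf")  # always try unvisited moves first
    exploitation = value_sum / visits                        # average simulated reward
    exploration = C * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

def select_move(children):
    """Pick the move with the highest UCT score; children is [(move, value_sum, visits)]."""
    parent_visits = sum(visits for _, _, visits in children)
    return max(children, key=lambda c: uct_score(c[1], c[2], parent_visits))[0]
```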

After running 50 games of generalised AlphaZero against UCT for each game, the following results were produced.

Connect-4 (top-left), Pacman (top-right), Babel (bottom-left), Breakthrough (bottom-right)

As you can see, generalised AlphaZero outperforms UCT on three of the four games, the exception being the cooperative game Babel. The authors attribute this to Babel’s smaller state space, a result of less branching per move, which gives tree search techniques an advantage.

We have now seen how a generic approach can work for a wide variety of simple games, but what about a more complex 3D game? That is the focus of the next paper.

Playing FPS Games with Deep Reinforcement Learning

In this paper, pixel information alone is used to train an agent to play the classic 3D video game Doom. Game-playing AI is usually given access to internal game state, so the ability of this agent to perform using only pixels is particularly impressive.

A screenshot of Doom

To play the game, the agent uses two separately trained networks. The first, called the navigation network, is trained on a level free of enemies, with the sole goal of navigating the map and collecting items. The second, called the action network, is trained to fight enemies when they are encountered. When an enemy is within the agent’s view and the agent still has ammunition, the action network is given control; otherwise the navigation network is used. Using two separate networks allows for faster, parallel training, and each network can be tweaked and re-trained independently. In addition, the authors found that with only a single network the agent would remain stationary and wait for enemies to come to it.
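The switching logic is simple enough to sketch. The observation fields and network interface below are assumptions used for illustration, not the paper’s actual code.

```python
# A minimal sketch of the two-network control switch described above.
def choose_action(observation, action_network, navigation_network):
    """Hand control to the action network only when there is something to shoot at."""
    enemy_visible = observation["enemy_in_view"]  # assumed boolean game feature
    has_ammo = observation["ammo"] > 0
    if enemy_visible and has_ammo:
        return action_network.act(observation["frame"])
    return navigation_network.act(observation["frame"])
```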

The action network uses a DRQN (deep recurrent Q-network), while the navigation network uses a simpler DQN (deep Q-network). In deep Q-networks, the goal is to learn the optimal Q-function, which estimates the expected reward of taking a certain action from a given state. The recurrent version adds an extra input to the Q-function carrying information from previous steps. In the paper this is done with an LSTM (long short-term memory) network, a form of recurrent network that can aggregate information across multiple frames.
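As a reminder of what the Q-network is trained towards, here is the standard one-step Q-learning target written as a small sketch; the discount factor and tensor shapes are assumptions.

```python
import torch

GAMMA = 0.99  # discount factor (assumed value)

def td_target(reward, next_q_values, done):
    """One-step target: r + gamma * max_a' Q(s', a'), zeroed at episode end."""
    return reward + GAMMA * next_q_values.max(dim=-1).values * (1.0 - done)

# In the recurrent (DRQN) case, the Q-network also carries an LSTM hidden state
# from the previous step: q_values, hidden = drqn(frame, hidden)
```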

Shown here is the architecture of the network. The input image is fed through two convolutional layers, which reduce its spatial dimensions. The network then splits into two streams: the bottom stream flattens the features and feeds them into an LSTM, which produces the action scores that control the agent. The top stream passes through a hidden layer and outputs the detected game features (i.e. game objects currently on screen).

Architecture of the Doom playing neural network
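Below is a rough PyTorch sketch of this two-stream layout. Only the overall structure (two convolutional layers, an LSTM action stream, and a game-feature stream) follows the figure; the filter counts, layer sizes, and pooling step are assumptions.

```python
import torch
import torch.nn as nn

class DoomDRQN(nn.Module):
    """Sketch of the two-stream DRQN: conv layers -> (LSTM action stream, feature stream)."""

    def __init__(self, n_actions: int, n_game_features: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 9)),  # fixed grid so the flattened size is known (sketch convenience)
        )
        conv_out = 64 * 6 * 9
        # Bottom stream: LSTM over time, producing the action scores.
        self.lstm = nn.LSTM(conv_out, 512, batch_first=True)
        self.action_scores = nn.Linear(512, n_actions)
        # Top stream: hidden layer predicting which game features are on screen.
        self.feature_head = nn.Sequential(
            nn.Linear(conv_out, 512), nn.ReLU(), nn.Linear(512, n_game_features)
        )

    def forward(self, frames, hidden=None):
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        x = self.conv(frames.flatten(0, 1)).flatten(1)                  # (b*t, conv_out)
        features = torch.sigmoid(self.feature_head(x)).view(b, t, -1)   # on-screen game features
        q, hidden = self.lstm(x.view(b, t, -1), hidden)                 # recurrent action stream
        return self.action_scores(q), features, hidden
```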

To improve the performance of the network, the authors used several techniques, including dropout, frame-skipping, and providing information about on-screen game features during training. The impact of each is shown below.

The effect of dropout, game feature information, and frame-skipping respectively

Skipping frames greatly increases the rate at which the network can learn; however, skipping too many frames can make it impossible for the agent to play effectively. In the rightmost graph above, you can see that acting on every 5th frame performs better than acting on every 10th frame or on every frame.
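A minimal sketch of frame skipping is shown below, using an assumed Gym-style environment interface rather than the paper’s actual setup: the agent chooses an action only on every k-th frame and repeats it in between.

```python
def run_episode(env, agent, skip: int = 5):
    """Act on every `skip`-th frame and repeat the last action on skipped frames."""
    obs = env.reset()
    action, total_reward, done, frame = None, 0.0, False, 0
    while not done:
        if frame % skip == 0:
            action = agent.act(obs)            # fresh decision every `skip` frames
        obs, reward, done = env.step(action)   # assumed (obs, reward, done) return
        total_reward += reward
        frame += 1
    return total_reward
```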

Although the final trained agent must play without game state information, this information is provided during training. In particular, boolean values are set to true when a specific game object is present in the agent’s view. This greatly improves the performance of the model, as can be seen in the middle graph above.

The metric used to judge the agent is the kill/death ratio: the number of enemies it defeats divided by the number of times it is defeated. However, there is a long delay between the actions that eventually lead to a kill (exploring the map, finding an item, aiming and firing the weapon) and the reward itself. This would make it very difficult for the model to learn desirable actions, so reward shaping is used. Reward shaping adds intermediate rewards, such as for collecting an item or damaging an enemy, as well as penalties, for example for consuming limited ammunition or taking damage.
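A shaped reward might look something like the sketch below; the event names and coefficients are made-up illustrations of the idea, not the paper’s actual values.

```python
def shaped_reward(events: dict) -> float:
    """Combine the sparse kill signal with intermediate rewards and penalties."""
    reward = 0.0
    reward += 1.0 * events.get("kills", 0)            # main objective
    reward += 0.1 * events.get("items_collected", 0)  # intermediate goal
    reward += 0.01 * events.get("damage_dealt", 0)    # intermediate goal
    reward -= 0.01 * events.get("damage_taken", 0)    # penalty
    reward -= 0.05 * events.get("ammo_used", 0)       # penalty for wasting ammunition
    reward -= 1.0 * events.get("deaths", 0)           # penalty
    return reward
```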

Evaluation

The agent was evaluated against Doom’s built-in bots in a series of 15-minute games. The ‘limited’ version of the game uses only a single weapon on known maps, while the ‘full’ version includes more weapon variety and unknown maps.

The agent’s performance against Doom’s built-in bots

The results show that the agent greatly outperforms the bots, achieving an impressive kill/death ratio (ranging from 2.83 to 6.94) in each of the scenarios. In addition, the paper states that the agent can even outperform the average human.

Conclusion

We’ve seen how deep reinforcement learning can be used to play a wide variety of board games, Atari-style games, and even 3D games. The generic game playing network performed particularly well against more traditional methods when the state space was large. Future work based on the Doom paper could perhaps follow the approach of the general game playing paper in order to generalise to other 3D FPS games. The system’s ability to navigate scenes using only visual information could even have real-world applications.
