Shadows: Games with Reinforcement Learning, Part I
I'm interested in how machine learning can be used to produce more compelling behaviour for non-player characters (NPCs; that is, the agents controlled by the computer rather than a human) in video games. My go-to examples are strategy games like the Civilization series: wouldn't it be nice if harder difficulty levels meant the enemy AI used more complex strategies, rather than simply getting higher combat stats and development bonuses? It would of course be quite difficult to get modern machine learning models working well in such a complex setting, so I started much smaller: with a game of tag.
Before getting into the details, you can find the code here and a web browser demo here.
The Game
The idea was to make a game that was as simple as possible while still being interesting. It also had to have multiple players, so that there is a computer agent to play against. I settled on a 2D, top-down game of tag with two agents, one controlled by the human player and the other controlled by the computer. The agents are circles. When "it", an agent must chase down and touch the enemy agent to tag them, making them "it". When not "it", an agent must avoid being caught by the enemy agent while collecting treasures around the map to gain points.
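To make the rules concrete, here is a minimal sketch of the treasure and tag checks, assuming circular agents and treasures. The names and radii are hypothetical; the real game's collision handling may differ.

```python
import math

AGENT_RADIUS = 0.25     # hypothetical
TREASURE_RADIUS = 0.1   # hypothetical

def touching(pos_a, pos_b, radius_a, radius_b):
    """Two circles touch when the distance between their centres is at most the sum of radii."""
    return math.dist(pos_a, pos_b) <= radius_a + radius_b

def apply_rules(state):
    """Apply the treasure and tag rules for one timestep of the game."""
    it_agent, runner = state["it_agent"], state["runner"]

    # Treasure collection: only the agent that is not "it" scores points.
    remaining = []
    for treasure_pos in state["treasures"]:
        if touching(runner["pos"], treasure_pos, AGENT_RADIUS, TREASURE_RADIUS):
            runner["score"] += 1
        else:
            remaining.append(treasure_pos)
    state["treasures"] = remaining

    # Tagging: if the "it" agent touches the runner, the roles swap.
    if touching(it_agent["pos"], runner["pos"], AGENT_RADIUS, AGENT_RADIUS):
        state["it_agent"], state["runner"] = runner, it_agent
```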
The map contains obstacles that you must navigate around. As a twist, the area behind the obstacles is hidden from the player's field of view, so that neither the computer agent nor the treasures can be seen there.
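The occlusion rule boils down to a line-of-sight test. Below is a hedged sketch that assumes obstacles are stored as lists of edges (pairs of 2D points); the actual game may represent and test them differently.

```python
def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): >0 left turn, <0 right turn, 0 collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_intersect(a1, a2, b1, b2):
    """True if segment a1-a2 properly crosses segment b1-b2 (collinear edge cases ignored)."""
    d1 = _orientation(b1, b2, a1)
    d2 = _orientation(b1, b2, a2)
    d3 = _orientation(a1, a2, b1)
    d4 = _orientation(a1, a2, b2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def visible(viewer_pos, target_pos, obstacle_edges):
    """The target is visible from the viewer if the sight line hits no obstacle edge."""
    return not any(
        segments_intersect(viewer_pos, target_pos, e1, e2) for e1, e2 in obstacle_edges
    )
```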
Reinforcement Learning
While one could certainly write good policies for the computer agent without machine learning (classical game AI is pretty impressive!), I wanted to try to train the agent using reinforcement learning (RL). Fairly simple deep RL models from over ten years ago were able to play Atari games proficiently just by looking at images of the game's screen (and considerably more sophisticated ones have more recently played Dota 2 and StarCraft II), so I figured I could use a similar approach for my game.
I used the stable-baselines3 implementations of RL algorithms to learn the behaviour policy for the computer agent. I initially wanted to have the RL model learn directly from images (like the Atari work), so that the computer gets the exact same information that a human player does. However, I ended up providing the model with the exact positions of both agents directly, even if the other agent was hidden behind an obstacle. I did this for two reasons: the resulting policy performs much better, and it is easier to deploy in the browser (more on this below). I do, however, want to go back to learning directly from images in subsequent games and projects.
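For comparison, the two kinds of observation might be declared roughly like this using Gymnasium spaces. The shapes are illustrative assumptions (the Atari-style 84x84 grayscale frame is just the conventional preprocessing), not the game's actual values.

```python
import numpy as np
from gymnasium import spaces

# Image observations, as in the Atari work: the rendered screen as a pixel array.
image_obs_space = spaces.Box(low=0, high=255, shape=(84, 84, 1), dtype=np.uint8)

# State observations, as used here: exact positions fed directly to the model.
# Hypothetical layout: [own x, own y, own angle, enemy x, enemy y, three treasure (x, y) pairs].
state_obs_space = spaces.Box(low=-np.inf, high=np.inf, shape=(3 + 2 + 2 * 3,), dtype=np.float32)
```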
There are actually two learned models/policies: one for when the computer is "it" and one for when it is not. The "it" policy is trained by controlling the agent to reach a randomly placed (stationary) enemy position from a random start position. The "not it" policy is trained against an "it" enemy that always moves along the shortest path toward the computer agent to tag it. Both are trained for one million timesteps. A single static map is used, so the policies are particular to that one map. Some more details (with illustrative code sketches after the list):
- Algorithm: I first tried deep Q-networks (DQN) and proximal policy optimization (PPO), but found that the best policy performance was obtained using soft actor-critic (SAC).
- Reward: When "it", the reward is +1 for catching the other agent (which ends the episode; otherwise the episode ends after 500 timesteps). When not "it", there is a -1 reward for being caught (again ending the episode; otherwise the episode ends after 1,000 timesteps), as well as +0.5 for each treasure collected. Finally, in both cases the reward is also shaped with a potential function that encourages moving toward the enemy when "it" and away from the enemy when not "it".
- Action: Each agent is commanded with a linear forward/backward velocity and an angular turning velocity. The action space I used for learning is only the one-dimensional angular velocity, with the linear velocity always set to the maximum forward value. This reduces the richness of possible motions, but it is easier to learn and is pretty much the strategy I find myself using when I play the game manually.
- Observation: The observation space consists of the computer agent's position and angle, the enemy's position, and the positions of any treasures.
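Putting those pieces together, the "it" training environment could look roughly like the skeleton below. This is a hedged sketch only: the map, obstacles, treasures, physics constants, and shaping weight are illustrative assumptions, not the actual implementation.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ItTagEnv(gym.Env):
    """Sketch of the "it" training setup: chase a randomly placed, stationary enemy
    from a random start. Obstacles are omitted and all constants are hypothetical."""

    MAX_STEPS = 500       # "it" episodes end after 500 timesteps
    DT = 0.05             # integration timestep (s), hypothetical
    SPEED = 1.0           # forward speed, always maximal, hypothetical
    MAX_TURN_RATE = 3.0   # rad/s, hypothetical
    CATCH_DIST = 0.5      # distance at which the enemy counts as caught, hypothetical
    SHAPING_SCALE = 0.1   # weight of the potential-based shaping term, hypothetical

    def __init__(self):
        # Action: one-dimensional angular velocity (forward speed is fixed at maximum).
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        # Observation: own position and angle plus the enemy's position.
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.steps = 0
        self.pos = self.np_random.uniform(-5.0, 5.0, size=2)
        self.angle = self.np_random.uniform(-np.pi, np.pi)
        self.enemy_pos = self.np_random.uniform(-5.0, 5.0, size=2)
        self.prev_dist = np.linalg.norm(self.enemy_pos - self.pos)
        return self._obs(), {}

    def step(self, action):
        self.steps += 1
        # Unicycle-style kinematics: turn according to the action, always drive forward.
        self.angle += float(action[0]) * self.MAX_TURN_RATE * self.DT
        self.pos = self.pos + self.SPEED * self.DT * np.array([np.cos(self.angle), np.sin(self.angle)])

        dist = np.linalg.norm(self.enemy_pos - self.pos)
        caught = dist < self.CATCH_DIST
        # Sparse reward for catching, plus potential-based shaping that rewards progress
        # toward the enemy (the sign of the shaping term flips for the "not it" policy).
        reward = (1.0 if caught else 0.0) + self.SHAPING_SCALE * (self.prev_dist - dist)
        self.prev_dist = dist

        terminated = bool(caught)
        truncated = self.steps >= self.MAX_STEPS
        return self._obs(), reward, terminated, truncated, {}

    def _obs(self):
        return np.array([*self.pos, self.angle, *self.enemy_pos], dtype=np.float32)
```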
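Training a policy with stable-baselines3 then amounts to something like the following, shown with default hyperparameters purely for illustration and using the environment sketched above; the saved model name is hypothetical.

```python
from stable_baselines3 import SAC

# Train the "it" policy on the sketched environment for one million timesteps.
env = ItTagEnv()
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("it_policy")

# The "not it" policy would be trained the same way against a pursuing enemy,
# in a separate environment (not shown), and saved as its own model.
```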
In the Browser
The training of the RL models was done in Python, but I wanted to make a browser version of the game for convenient sharing. The first step was to rewrite the game in JavaScript, which wasn't too hard, even though it's been a while since I've touched JS and it lacks good built-in vector math support (at some point I want to look into WebAssembly more, though).
I then had to figure out how to get the RL model running in the browser. The standard way is to export the model to the ONNX format and load it using ONNX Runtime Web. In the browser demo linked at the start of this post, the models are just hosted on my personal web server.
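The Python side of that export might look roughly like the sketch below, following the general recipe of wrapping the trained policy before calling `torch.onnx.export`. Attribute names such as `policy.actor` reflect SB3's SAC implementation as I understand it and should be checked against the installed version; the file names are hypothetical.

```python
import torch as th
from stable_baselines3 import SAC

class OnnxActor(th.nn.Module):
    """Thin wrapper exposing only the deterministic actor for export."""

    def __init__(self, policy):
        super().__init__()
        self.actor = policy.actor

    def forward(self, observation):
        # Deterministic action for deployment (no exploration noise).
        return self.actor(observation, deterministic=True)

model = SAC.load("it_policy", device="cpu")
onnx_actor = OnnxActor(model.policy)
dummy_obs = th.zeros(1, model.observation_space.shape[0])

th.onnx.export(
    onnx_actor,
    dummy_obs,
    "it_policy.onnx",
    input_names=["observation"],
    output_names=["action"],
)
```

The resulting `.onnx` file is what the browser loads with the web runtime.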
Future
I've optimistically included a "Part I" in the title of this post. I hope to return to this project to develop some other games with more sophisticated learned behaviours.