Reinforcement-learning AIs are vulnerable to a new kind of attack

Adversarial attacks against the technique that powers game-playing AIs and could control self-driving cars show it may be less robust than we thought.

Feb 28, 2020

MIT Technology Review / Adam Gleave

The soccer bot lines up to take a shot at the goal. But instead of getting ready to block it, the goalkeeper drops to the ground and wiggles its legs. Confused, the striker does a weird little sideways dance, stamping its feet and waving one arm, and then falls over. 1-0 to the goalie.

It’s not a tactic you’ll see used by the pros, but it shows that an artificial intelligence trained via deep reinforcement learning—the technique behind cutting-edge game-playing AIs like AlphaZero and the OpenAI Five—is more vulnerable to attack than previously thought. And that could have serious consequences.

In the last few years researchers have found many ways to break AIs trained using labeled data, known as supervised learning. Tiny tweaks to an AI’s input—such as changing a few pixels in an image—can completely flummox it, making it identify a picture of a sloth as a race car, for example. These so-called adversarial attacks have no sure fix.

Compared with supervised learning, reinforcement learning is a relatively new technique and has been studied less. But it turns out that it is also vulnerable to doctored input. Reinforcement learning teaches an AI how to behave in different situations by giving it rewards for doing the right thing. Eventually the AI learns a plan for action, known as a policy. Policies allow AIs to play games, drive cars, or run automated trading systems.
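To make the idea of learning a policy from rewards concrete, here is a minimal sketch in Python. It is a toy, not anything from the research described in this article: the five-cell corridor, the reward of 1.0 at the goal, and the names `step`, `train`, and `greedy_policy` are all assumptions made for illustration. It uses tabular Q-learning, one textbook reinforcement-learning method among many.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (toy assumption)
ACTIONS = (-1, +1)    # index 0 = step left, index 1 = step right


def step(state, action_index):
    """Move one cell, clamped to the corridor; reward 1.0 only on reaching the goal."""
    next_state = max(0, min(N_STATES - 1, state + ACTIONS[action_index]))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning: estimate the long-run value of each action in each cell."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore uniformly at random; Q-learning still learns greedy values.
            action = rng.randrange(2)
            nxt, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_policy(q):
    """The learned policy: in each non-goal cell, pick the highest-valued action."""
    return [max(range(2), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]


q_table = train()
policy = greedy_policy(q_table)
print(policy)  # [1, 1, 1, 1]: step right in every cell
```

The policy here is just a table saying which way to step in each cell. The policies in the article are neural networks, but the principle of being shaped by reward is the same.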

In 2017, Sandy Huang, who is now at DeepMind, and her colleagues looked at an AI trained via reinforcement learning to play the classic video game Pong. They showed that adding a single rogue pixel to frames of video input would reliably make it lose. Now Adam Gleave at the University of California, Berkeley, has taken adversarial attacks to another level.

Gleave is not too worried about most of the examples we have seen so far. “I'm a bit skeptical of them being a threat,” he says. “The idea that an attacker is going to break our machine-learning system by adding a small amount of noise doesn't seem realistic.” But instead of fooling an AI into seeing something that isn’t really there, you can change how things around it act. In other words, an AI trained using reinforcement learning can be tricked by weird behavior. Gleave and his colleagues call this an adversarial policy. It’s a previously unrecognized threat model, says Gleave.

Losing control

In some ways, adversarial policies are more worrying than attacks on supervised learning models, because reinforcement learning policies govern an AI’s overall behavior. If a driverless car misclassifies input from its camera, it could fall back on other sensors, for example. But sabotage the car’s control system—governed by a reinforcement learning algorithm—and it could lead to disaster. “If policies were to be deployed without solving these problems, it could be very serious,” says Gleave. Driverless cars could go haywire if confronted with an arm-waving pedestrian.

Gleave and his colleagues used reinforcement learning to train stick-figure bots to play a handful of two-player games, including kicking a ball at a goal, racing across a line, and sumo wrestling. The bots were aware of the position and movement of their limbs and those of their opponents.

They then trained a second set of bots to find ways to exploit the first, and this second group quickly discovered adversarial policies. The team found that the adversaries learned to beat their victims reliably after training for less than 3% of the time it took the victims to learn to play the games in the first place.

The adversaries learned to win not by becoming better players but by performing actions that broke their opponents’ policies. In the soccer game and the running game, the adversary sometimes never even stood up, yet that was enough to make the victim collapse into a contorted heap or wriggle around in circles. What’s more, the victims actually performed far better when they were “masked” and unable to see their adversary at all.
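The logic of such an attack, and of the masking result, can be sketched with a deliberately tiny Python toy. It is not the researchers' setup, which involved simulated stick-figure bots and deep reinforcement learning. The lookup-table victim, the four "poses", and the names `VICTIM_POLICY`, `victim_act`, and `train_adversary` are hypothetical, chosen only to show a frozen victim being probed by a reward-seeking adversary.

```python
import random

# Hypothetical frozen victim: a lookup table from the opponent's observed pose
# to the victim's action. Poses 0-2 stand in for ordinary play. Pose 3 stands
# in for an odd pose the victim never met in training, so it responds badly.
VICTIM_POLICY = {0: "shoot", 1: "shoot", 2: "shoot", 3: "fall"}
DEFAULT_POSE = 0  # what a masked victim "sees" in place of the real pose


def victim_act(adversary_pose, masked=False):
    """The victim's policy is fixed; masking only changes what it observes."""
    observed = DEFAULT_POSE if masked else adversary_pose
    return VICTIM_POLICY[observed]


def adversary_reward(adversary_pose, masked=False):
    """The adversary scores 1.0 only when the victim falls over."""
    return 1.0 if victim_act(adversary_pose, masked) == "fall" else 0.0


def train_adversary(rounds=200, epsilon=0.2, seed=0):
    """Epsilon-greedy bandit over poses; the victim is never updated."""
    rng = random.Random(seed)
    n_poses = len(VICTIM_POLICY)
    counts = [0] * n_poses
    values = [0.0] * n_poses
    for t in range(rounds):
        if t < n_poses:
            pose = t  # try every pose once before exploiting
        elif rng.random() < epsilon:
            pose = rng.randrange(n_poses)
        else:
            pose = max(range(n_poses), key=lambda p: values[p])
        counts[pose] += 1
        # Running average of the reward seen for this pose.
        values[pose] += (adversary_reward(pose) - values[pose]) / counts[pose]
    return max(range(n_poses), key=lambda p: values[p])


best_pose = train_adversary()
win_rate_unmasked = adversary_reward(best_pose, masked=False)
win_rate_masked = adversary_reward(best_pose, masked=True)
print(best_pose, win_rate_unmasked, win_rate_masked)  # 3 1.0 0.0
```

In this toy the adversary never gets better at the game itself. It only finds the one input the victim handles badly, and hiding that input from the victim removes the advantage.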

The research, to be presented at the International Conference on Learning Representations in Addis Ababa, Ethiopia, in April, shows that policies that appear robust can hide serious flaws. “In deep reinforcement learning we're not really evaluating policies in a comprehensive enough fashion,” says Gleave. A supervised learning model, trained to classify images, say, is tested on a different data set from the one it was trained on to ensure that it has not simply memorized a particular bunch of images. But with reinforcement learning, models are typically trained and tested in the same environment. That means that you can never be sure how well the model will cope with new situations.

The good news is that adversarial policies may be easier to defend against than other adversarial attacks. When Gleave fine-tuned the victims to take into account the weird behavior of their adversaries, the adversaries were forced to try more familiar tricks, such as tripping their opponents up. That’s still dirty play but doesn’t exploit a glitch in the system. After all, human players do it all the time.