Artificial Intelligence / Machine Learning

Uber has cracked two classic ’80s video games by giving an AI algorithm a new type of memory

An algorithm that remembers previous explorations in Montezuma’s Revenge and Pitfall! could make computers and robots better at learning how to succeed in the real world.

Nov 26, 2018
Montezuma's Revenge.
Squakenet

A new kind of machine-learning algorithm just mastered a couple of throwback video games that have proved to be a big headache for AI.

Those following along will know that AI algorithms have bested the world’s top human players at the ancient, elegant strategy game Go, one of the most difficult games imaginable. But two pixelated classics from the era of 8-bit computer games—Montezuma’s Revenge and Pitfall!—have stymied AI researchers.

There’s a reason for this seeming contradiction. Although deceptively simple, both Montezuma’s Revenge and Pitfall! have been immune to mastery via reinforcement learning, a technique that’s otherwise adept at learning to conquer video games. DeepMind, a subsidiary of Alphabet focused on artificial intelligence, famously used it to develop algorithms capable of learning how to play several classic video games at an expert level. Reinforcement-learning algorithms mesh well with most games, because they tweak their behavior in response to positive feedback—the score going up. The success of the approach has generated hope that AI algorithms could teach themselves to do all sorts of useful things that are currently impossible for machines.
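To make that feedback loop concrete, here is a minimal sketch of the idea: an agent tries actions, watches the score, and shifts toward the actions that made the score go up. It uses generic tabular Q-learning on a made-up toy game; the environment and every name in it are illustrative assumptions, not DeepMind's or Uber's code.

```python
import random

class ToyGame:
    """Hypothetical stand-in for a game: walk right along 5 positions to score."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                 # action: 0 = move left, 1 = move right
        self.pos = max(0, min(4, self.pos + (1 if action == 1 else -1)))
        reward = 1.0 if self.pos == 4 else 0.0   # the "score" only goes up at the goal
        return self.pos, reward, self.pos == 4

q = {s: [0.0, 0.0] for s in range(5)}       # state -> estimated value of each action
alpha, gamma, epsilon = 0.1, 0.95, 0.2
env = ToyGame()
for episode in range(300):
    state, done = env.reset(), False
    for _ in range(100):                    # cap the episode length
        explore = random.random() < epsilon or q[state][0] == q[state][1]
        action = random.randrange(2) if explore else q[state].index(max(q[state]))
        next_state, reward, done = env.step(action)
        # Nudge the value of the chosen action toward reward plus discounted future value.
        q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
        state = next_state
        if done:
            break
```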

The problem with both Montezuma’s Revenge and Pitfall! is that there are few reliable reward signals. Both titles involve typical scenarios: protagonists explore blockish worlds filled with deadly creatures and traps. But in each case, lots of behaviors that are necessary to advance within the game do not help increase the score until much later. Ordinary reinforcement-learning algorithms usually fail to get out of the first room in Montezuma’s Revenge, and in Pitfall! they score exactly zero.

The new algorithms come from Uber’s AI research team in San Francisco, led by Jeff Clune, who is also an associate professor at the University of Wyoming. The team demonstrated a fundamentally different approach to machine learning within an environment that offers few clues to show an algorithm how it is doing.

The approach could lead to some interesting practical applications, Clune and his team write in a blog post released today—in robot learning, for example. That’s because future robots will need to figure out what to do in environments that are complex and offer only sparse rewards.

Uber launched its AI lab in December 2016, with the goal of making fundamental breakthroughs that could prove useful to its business. Better reinforcement-learning algorithms could ultimately prove useful for things like autonomous driving and optimizing vehicle routes.

AI researchers have typically tried to get around the issues posed by Montezuma’s Revenge and Pitfall! by instructing reinforcement-learning algorithms to explore randomly at times, while adding rewards for exploration—what’s known as “intrinsic motivation.”
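In code, intrinsic motivation often amounts to adding a small novelty bonus to the game's own score. The sketch below uses a simple count-based bonus; the function name and the formula are illustrative assumptions, not any particular team's exact method.

```python
from collections import defaultdict

visit_counts = defaultdict(int)             # how often each game state has been seen

def shaped_reward(state, game_reward, bonus_scale=0.1):
    """Combine the sparse game score with a bonus for visiting rarely seen states."""
    visit_counts[state] += 1
    exploration_bonus = bonus_scale / (visit_counts[state] ** 0.5)
    return game_reward + exploration_bonus
```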

But the Uber researchers believe this fails to capture an important aspect of human curiosity. “We hypothesize that a major weakness of current intrinsic motivation algorithms is detachment,” they write, “wherein the algorithms forget about promising areas they have visited, meaning they do not return to them to see if they lead to new states.”

The team’s new family of reinforcement-learning algorithms, dubbed Go-Explore, remembers where it has been before and will return to a particular area or task later on to see if it might help provide better overall results. The researchers also found that adding a little bit of domain knowledge, by having human players highlight interesting or important areas, sped up the algorithms’ learning and progress by a remarkable amount. This is significant because there may be many real-world situations where you would want an algorithm and a person to work together to solve a hard task.
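The blog post describes that loop only at a high level; the sketch below is one plausible reading of it. Keep an archive of states already reached, pick one, return to it by restoring a saved simulator state (which assumes a deterministic emulator), explore from there, and archive anything new or better. The toy environment and all details are assumptions for illustration, not Uber's released code.

```python
import random

class ToyEnv:
    """Hypothetical deterministic game: the whole state is a position 0..99."""
    def reset(self):
        self.pos = 0
        return self.pos
    def save_state(self):
        return self.pos
    def restore_state(self, saved):
        self.pos = saved
    def step(self, action):                       # action in {-1, +1}
        self.pos = max(0, min(99, self.pos + action))
        reward = 1.0 if self.pos == 99 else 0.0   # sparse reward, as in Pitfall!
        return self.pos, reward

env = ToyEnv()
start = env.reset()
archive = {start: (env.save_state(), 0.0)}        # state -> (saved snapshot, best score)
for _ in range(2000):
    cell = random.choice(list(archive))           # "Go": pick an archived state...
    saved, score = archive[cell]
    env.restore_state(saved)                      # ...and return to it directly.
    for _ in range(20):                           # "Explore": act randomly from there.
        state, reward = env.step(random.choice([-1, 1]))
        score += reward
        if state not in archive or score > archive[state][1]:
            archive[state] = (env.save_state(), score)
```

Returning by restoring saved emulator states is what makes the "go" step reliable; as discussed below, it is also the design choice some researchers question.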

Their code scores an average of 400,000 points in Montezuma’s Revenge—an order of magnitude higher than the average for human experts. In Pitfall! it racks up 21,000 on average, far better than most human players.

“These results are very impressive,” says Emma Brunskill, an assistant professor at Stanford University who specializes in reinforcement learning. She says it is surprising, and exciting, that the techniques produced such big advantages.

Other AI researchers have been chipping away at these notoriously hard video games. In October, a team at OpenAI, a nonprofit in San Francisco, demonstrated an algorithm capable of making significant progress in Montezuma’s Revenge.

Brunskill’s group at Stanford recently made more modest progress on Pitfall! using an approach similar to the Uber team’s.

Now that AI algorithms can solve these video games, the challenge is to emerge from the arcade and solve real-world problems.

Brunskill agrees that this sort of work could have a big impact in robotics. But she says other real-world situations, especially those that involve modeling human behavior, are far more difficult. “It will be very interesting to see how well this approach works for more complicated settings,” she says.

Not everyone is enthralled by the Uber research, however.

Alex Irpan, a software engineer working on machine learning and robotics at Google, wrote a blog post in which he questions why the Uber AI team had not provided a technical paper, alongside a press release, to give more details of their work.

Irpan also points out that by altering the state of the game, in order to facilitate their approach, the Uber AI researchers may have changed the playing field in a significant way. Given this fact, he questions how practical the approach might be.

“The blog post says that this approach could be used for simulated robotics tasks, and then combined with sim-to-real transfer to get real-world policies. On this front, I’m fairly pessimistic,” he writes.

Updated 11.28 with comment from Alex Irpan.