Imagine you’re on vacation in Mexico. A sunny beach looks perfect for swimming, but there’s a conspicuous sign on the sand, and you don’t know enough Spanish to read it. You pull the iPhone from your pocket and point its camera at the sign. On the screen, something has changed: the sign now warns, in English, “Beach closed—recent attack of shark.”
This is the power of Word Lens, an iPhone app that identifies Spanish text in the live view captured by the phone’s camera and translates the words in a fraction of a second, replacing the originals in the same color, size, and orientation. The translated text seems to be right there in front of you, as if the sign were printed in English. (The app can also translate English into Spanish.) It works its magic on signs, newspapers, restaurant menus, and Web pages, giving its users a feeling of familiarity with the territory that is unavailable to a tourist equipped only with a guidebook.
Word Lens is the most impressive commercially available example of the stunning potential for augmented reality—software applications that overlay computer-generated imagery on representations of the real world. Augmented reality (AR for short) became a hot topic two years ago, when demo videos flooded the Internet with examples of games, virtual shopping, and search engines that inserted digital information into live images or photos. But now AR is poised to become more than a nifty mode of entertainment. Thanks to a coming wave of more powerful, location-aware smart phones, it will profoundly change the way we interact with our surroundings.
The range of applications goes beyond word translation. The Google Goggles app can recognize products or landmarks—say, the Itsukushima shrine in Japan—and instantly display information that Google has compiled about them. The Netherlands Architecture Institute’s app draws on archival images to show buildings not as they are but as they used to be. Metaio is developing a printer-repair app that can guide a technically challenged office worker in diagnosing and fixing a recalcitrant machine.
This sort of maintenance-guide app was the original vision for augmented reality, which was named in 1992 by Thomas Caudell, then a Boeing researcher. Software that inserted relevant instructions into a real-time image viewed on a head-mounted display, Caudell realized, could help workers on a factory floor as they navigated the maze of electrical wiring for a giant airplane. The term now applies to everything from games to medical imaging to real-world guides like Word Lens.
With the best apps, “the phone isn’t just a window, it’s a magic wand,” says Christopher Stapleton, a researcher at the University of Central Florida, who has spent more than a decade developing AR apps, including simulations for military operations in tight urban combat zones. Those applications, however, required special equipment to achieve what is now becoming possible on phones.
The possibilities are taking shape for six reasons. First, phone CPUs—the central processing units that do most of the computing work—recently reached the one-gigahertz threshold. That’s not far from what is available in many little laptops; some of Intel’s Atom chips for netbooks clock in around 1.5 gigahertz. Second, top smart phones also have graphics processing units (GPUs) meant for gaming and YouTube watching. Third, phone cameras are now sophisticated enough to feed abundant raw data about their environs into computer-vision algorithms. Fourth, screen resolution on mobile devices has advanced from grainy to super-slick. Fifth, wireless data networks are becoming faster and more widely available. But most important of all, smart phones have accelerometers, gyroscopes, and compasses that detect their location and orientation. That means an AR app can tell where you’re standing and in which direction you’re pointing the camera. Location detection is done either through GPS or by scanning for local Wi-Fi networks and matching the list of names to a database.
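The Wi-Fi trick, stripped to its essentials, is a lookup: take the network names the phone can see and match them against a database of access points whose positions have already been mapped. A minimal sketch in Python, with an invented access-point table, looks something like this:

```python
# Minimal sketch of Wi-Fi-based positioning, not the method any particular
# phone OS actually uses; the network table below is made up for illustration.
KNOWN_NETWORKS = {
    # network name -> (latitude, longitude) where it was previously observed;
    # real services also key on each router's unique hardware address
    "CafePlayaAzul": (19.4326, -99.1332),
    "HotelMirador_Guest": (19.4330, -99.1340),
}

def estimate_location(scanned_names):
    """Average the known positions of whatever networks the phone can see."""
    hits = [KNOWN_NETWORKS[n] for n in scanned_names if n in KNOWN_NETWORKS]
    if not hits:
        return None  # nothing recognized; fall back to GPS
    lat = sum(p[0] for p in hits) / len(hits)
    lon = sum(p[1] for p in hits) / len(hits)
    return lat, lon

print(estimate_location(["CafePlayaAzul", "HotelMirador_Guest"]))
```

Commercial positioning services weight signal strength across millions of surveyed access points, but the principle is the same.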
Word Lens, in its current form, doesn’t use either location detection or a network connection. But it pushes the boundaries of handheld computing, given that optical character recognition—a trick it performs in real time—was designed for the less challenging task of reading scans of paper documents.
“We have to be able to tell a word from a tree or a face,” says Otávio Good, the app’s primary developer. “To do that, we run the image through a filter to remove shadows. Text is sharp, so remove whatever is not sharp. We make the image black and white, to help figure out where letters are. Still, these are blobs that may or may not be letters. Maybe that’s a tree and a house, not an A and a V.”
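Good’s filters are Word Lens’s own, but the steps he describes map onto standard computer-vision operations. The sketch below, written against the OpenCV library with arbitrary parameters, shows one way to suppress shadows, reduce the frame to black and white, and group what survives into candidate blobs; it illustrates the idea rather than the app’s actual code.

```python
import cv2

frame = cv2.imread("sign.jpg")                       # stand-in for one camera frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Adaptive thresholding compares each pixel with its local neighborhood, which
# suppresses slow changes such as shadows while keeping sharp, high-contrast
# strokes: a crude stand-in for "remove whatever is not sharp."
binary = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, 31, 15)

# Group the remaining white pixels into connected blobs; each blob is a
# candidate letter that later stages must accept or reject.
count, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
print(f"{count - 1} candidate blobs found")
```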
Once Word Lens has identified letters, it calculates their rotation and the perspective from which the viewer is seeing them. Then it tries to recognize each letter by consulting a library of reference font sets.
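One plausible way to picture that step, though not necessarily how Word Lens implements it: warp each detected region back to a flat, upright patch, then score the patch against reference glyph images. The corner points, patch size, and glyph library below are placeholders for illustration.

```python
import cv2
import numpy as np

def rectify(frame_gray, corners):
    """Undo rotation and perspective: `corners` are the four (x, y) points of a
    letter region in the frame, listed clockwise from the top left."""
    src = np.float32(corners)
    dst = np.float32([[0, 0], [32, 0], [32, 32], [0, 32]])   # flatten to 32x32 pixels
    warp = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(frame_gray, warp, (32, 32))

def best_glyph(patch, glyph_images):
    """glyph_images: {letter: 32x32 grayscale template from a reference font}.
    Returns the best-matching letter and its correlation score."""
    scored = ((cv2.matchTemplate(patch, template, cv2.TM_CCOEFF_NORMED)[0, 0], letter)
              for letter, template in glyph_images.items())
    score, letter = max(scored)
    return letter, score
```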
“At that point, we have a string of letters,” says Good. “But we’re not sure what each one is. We do a dictionary lookup of this probabilistic string of letters for the closest match, if there is one.”
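In other words, the recognizer hands over scores rather than certainties, and the dictionary arbitrates. A toy version of that lookup, with an invented dictionary and invented confidence values, might look like this:

```python
# Each position carries scores for the letters it might be; pick the dictionary
# word that best explains the whole string. Dictionary and scores are made up.
DICTIONARY = ["PLAYA", "PLAZA", "PLATA"]

def best_match(position_scores, dictionary):
    """position_scores: one {letter: confidence} dict per detected blob."""
    best_word, best_score = None, float("-inf")
    for word in dictionary:
        if len(word) != len(position_scores):
            continue              # a real matcher would tolerate split or merged blobs
        score = sum(scores.get(ch, 0.0) for ch, scores in zip(word, position_scores))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# The recognizer is torn between Y and Z in the fourth position:
scores = [{"P": 0.9}, {"L": 0.8}, {"A": 0.9}, {"Y": 0.5, "Z": 0.4}, {"A": 0.9}]
print(best_match(scores, DICTIONARY))   # -> PLAYA
```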
If there’s a match, Word Lens’s final stunt is to “repaint” the sign. “We erase the original and use the existing orientation, foreground, and background color, which may be a gradient [rather than a constant color], to put new text on top,” Good says. “That’s a pretty straightforward computer graphics operation. It’s like using Photoshop.”
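A heavily simplified version of that repaint, written with the Pillow imaging library and ignoring the gradient backgrounds and per-frame GPU work Good mentions, might look like this:

```python
from PIL import Image, ImageDraw, ImageFont

def repaint(img, box, new_text, angle=0.0):
    """Erase the original word inside `box` (left, top, right, bottom) and draw
    the translation on top at the original orientation. Colors are sampled
    naively here; Word Lens itself preserves gradients and redraws every frame."""
    left, top, right, bottom = box
    background = img.getpixel((left - 2, top - 2))               # just outside the word
    foreground = img.getpixel(((left + right) // 2, (top + bottom) // 2))

    draw = ImageDraw.Draw(img)
    draw.rectangle(box, fill=background)                         # wipe the original word

    # Render the translation on a transparent layer, rotate it to match the
    # sign's tilt, and composite it back into the frame.
    layer = Image.new("RGBA", img.size, (0, 0, 0, 0))
    ImageDraw.Draw(layer).text((left, top), new_text,
                               fill=foreground, font=ImageFont.load_default())
    layer = layer.rotate(angle, center=(left, top))
    img.paste(layer, (0, 0), layer)
    return img
```

Calling repaint on a frame with a hypothetical word box and a small tilt angle would wipe the Spanish word and write its English counterpart in roughly the same place and color.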
He makes it sound easy, but Word Lens wasn’t cobbled together from off-the-shelf software. Good, a former Xbox 360 programmer, found that the iPhone’s GPU was nowhere near powerful enough to perform the image-processing tricks he’d learned on the Xbox. Instead, he had to route computations through the CPU, whose single-core architecture limited his ability to run operations in parallel to speed up text recognition and translation. He resorted to some old-school assembly language programming for maximum efficiency. As a result, Word Lens on an iPhone 4 can redraw Spanish to English, or vice versa, up to 10 times per second as you move the phone around.
Just wait until Good can tap into an iPhone 5, or a forthcoming Android phone with a dual-core CPU and a more powerful graphics chip. When these hit the market this year, expect an even more head-spinning version of Word Lens. It will be able to recognize more fonts and more languages, and it won’t be stumped by a sign with rust from its mounting bolts dripping down over its letters. Good also expects to reduce any visible flickering in the app. “Photorealism makes it much more effective,” he says.
That’s the litmus test for AR: can you forget that you’re looking at a computer screen? To get to that point, many applications will need more precise input than is possible on today’s phones. “GPS on a phone is accurate to a few meters,” says Bruce Thomas, who directs a wearable-computing lab at the University of South Australia. He builds systems that require backpacks and headsets in order to offer AR suitable for military training, or for peacefully walking around a proposed suburban development site to see how it will look when built up. “We’re using $3,000 sensors that can pinpoint your location to within the width of your head, and the tilt of your head to within five degrees,” he says. That increased precision lets Thomas’s system overlay imaginary buildings onto your view as you move your head around. This concoction of hardware and software costs about $30,000. But it’s not unrealistic to think that such high-precision location detection could become possible in a handheld device. Already, in fact, the resolution on phone screens has leapfrogged ahead of the resolution in Thomas’s headset displays.
If augmented reality is to reach its mass potential, though, apps also need to be easier to build. A project at Georgia Tech is working on an open technical platform for mobile AR content. Others are working on establishing proprietary platforms; the software maker Layar, for example, builds tools that help other companies create apps. If developing apps is easy enough, the challenge for app creators won’t be technical, says Gene Becker, a Layar strategist. It will be to create “an experience that people will want to use instead of checking Twitter.”
Like GPS navigation, which in retrospect looks like a kind of proto-AR, the best augmented-reality technologies will be those that make the miraculous seem mundane. Visualize an app that guides you to the correct shelf at the supermarket, talks you through changing a flat tire, or reminds you who the other people in the room are. Once it exists, you probably won’t want to live without it.
Paul Boutin is a freelance technology writer in Los Angeles who also contributes to Wired and the New York Times. He reviewed Google’s social-networking efforts in the November/December issue of TR.