AR adds virtual objects to the real-world environment. But how does the program know what your environment looks like?
What do cats have to do with computer vision?
One of this year’s hottest things in AR has been the Google AR animals. We mean those cats, lions, and sharks that materialize in your room when you use the AR feature in a Chrome search:
You might ask what a 3D cat has to do with computer vision. Well, think about it. You scan the floor or a table, and a lion appears. How does Google know where the floor or table is? How does it know that there is a couch in the scene and the cat should be behind it?
Or take another example – AR filters for Instagram and Snapchat. To apply a panda mask, the program needs to know where your nose and ears are and follow their position as you move. Does that mean it can see you?
Augmented reality inserts virtual 3D objects into the physical environment around you. But to do that, the app has to “see” that environment. Let’s explore how this happens.
The 4 elements of computer vision
AI vision tries to replicate how the human eye and brain see – or at least as far as we understand that process. For this to work, you need four things:
1) An imaging device (camera). This does the same thing as your eye: collects visual information and sends it to the “brain” (CPU) in a format that the computer can understand. In the case of computer vision, the image is broken into pixels, and each pixel is transformed into a series of ones and zeros.
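To make that "pixels into ones and zeros" step concrete, here is a minimal sketch in plain Python (no real camera involved – the tiny 2×2 grayscale "image" below is invented for illustration):

```python
# A made-up 2x2 grayscale image: each pixel is an 8-bit brightness value.
image = [
    [0, 255],   # black pixel, white pixel
    [128, 64],  # two shades of gray
]

for row in image:
    for pixel in row:
        # Each 8-bit pixel value is stored as a string of ones and zeros.
        print(f"{pixel:3d} -> {pixel:08b}")
```

A real camera sensor produces millions of such values per frame (usually three per pixel, one each for red, green, and blue), but the principle is the same.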
2) Labeled data set. You need a set of data where different objects are clearly labeled, as in "this is a car," "this is also a car," "this is an apple," etc. According to Cognilytica, an AI project team spends 80% of its time organizing and labeling data. There are many annotation techniques for this, such as bounding boxes and landmark points. Here's an example with landmark points marked:
The more training data you have, the better. Ideally, you want thousands of labeled images. The largest publicly available set is ML Images from Tencent, with 18 million images in 11,000+ categories. The second-largest is Google Open Images, with 9 million pictures.
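To show what a single labeled example might look like, here is a sketch in Python. The field names, file name, and pixel coordinates are illustrative assumptions, not the format of any particular annotation tool:

```python
# A hypothetical annotation record for one training image, combining the two
# techniques mentioned above: a bounding box and named landmark points.
labeled_example = {
    "image_file": "photo_0001.jpg",  # made-up file name
    "objects": [
        {
            "label": "car",
            # Bounding box: top-left (x, y) and bottom-right (x, y), in pixels.
            "bbox": [34, 120, 310, 290],
        },
        {
            "label": "face",
            "bbox": [400, 60, 470, 150],
            # Landmark points: named keypoints inside the box.
            "landmarks": {
                "left_eye": [418, 88],
                "right_eye": [452, 90],
                "nose": [435, 112],
            },
        },
    ],
}

# The kind of sanity check an annotation pipeline might run:
for obj in labeled_example["objects"]:
    x1, y1, x2, y2 = obj["bbox"]
    assert x1 < x2 and y1 < y2, "box corners must be ordered"
```

Multiply a record like this by thousands of images and you get a training set a learning algorithm can work with.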
3) Learning algorithms. A deep-learning algorithm first analyzes labeled data and finds patterns. Then it creates equations based on these patterns to recognize objects in unlabeled bodies of data. In simple words: the computer goes through a thousand annotated images of an apple to understand what one is. The next time you give it a random picture of an apple, it will identify it instantly.
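In spirit, that "find patterns, then build equations" loop can be sketched with a toy nearest-centroid classifier. This is a deliberate simplification of real deep learning, and the feature vectors below are invented stand-ins for real image features:

```python
# Toy sketch of learn-then-recognize: average the labeled examples of each
# class ("find the pattern"), then assign a new, unlabeled example to the
# closest average ("apply the equation").

def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance_sq(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# "Thousands of annotated images" boiled down to a few fake feature vectors:
training_data = {
    "apple": [[0.9, 0.1, 0.3], [0.8, 0.2, 0.4], [0.95, 0.15, 0.35]],
    "car":   [[0.1, 0.8, 0.7], [0.2, 0.9, 0.6], [0.15, 0.85, 0.75]],
}

# Training: one learned "pattern" (centroid) per labeled class.
patterns = {label: centroid(vs) for label, vs in training_data.items()}

def classify(features):
    """Return the label whose learned pattern is closest to the input."""
    return min(patterns, key=lambda label: distance_sq(features, patterns[label]))

print(classify([0.85, 0.18, 0.33]))  # close to the apple pattern -> "apple"
```

A real deep-learning model learns millions of parameters instead of one average per class, but the workflow – fit on labeled data, then predict on unlabeled data – is the same.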
4) Processing power – and lots of it. Luckily, with cloud computing, this is no longer a problem.
Computer vision technology is built around patterns, because that’s how we think the brain sees.
It’s often said that the brain is a pattern recognition machine. For instance, you’ve seen many dogs in the past, so you’ll always recognize a dog when you see one.
Or… will you?
This picture of a weird animal recently went viral on Twitter:
People spent lots of time trying to understand what they were looking at. It's clearly a mammal, but which one? A monster baby goat?
Hint: that ear with a brown tip isn’t an ear, it’s a nose. Can you see it now? A completely normal puppy. As soon as you shift your viewpoint by 90 degrees, the brain gets an “aha!” moment: now the picture fits into the “little dog” pattern.
Can AI do a better job? No, not in this case. Computer vision algorithms are much more precise than humans when it comes to identifying objects like dog breeds, spare parts, faces, etc. But there are many situations where their patterns don’t help, such as when an object is partially occluded or positioned in an unusual way, like the dog in the picture.
Still, what AI vision can do already is enough to take augmented reality to a completely new level.
Computer vision and AR: a perfect combo
Here are a few examples of high-quality AI vision algorithms used in AR.
1) Google Lens
Google Lens keeps getting more powerful with each update. It’s also the simplest way to see computer vision in action. Point it at a flower, and it will tell you its name. Direct it at a restaurant sign, and it will give you reviews for it. Aim your phone at a text in another language, and Google Lens will translate it for you.
2) Virtual makeup try-on
This virtual makeup try-on is impressively precise when applying lipstick and eyeliner. The best part is that it's available in WebAR, so you don't have to download an app. If you've been following our blog, you know that here at Mozilla we are big WebAR fans!
3) IKEA Place
IKEA's app is great at evaluating size and distance, and it almost always puts the furniture on the floor (as opposed to suspending it in the air). But what we particularly like is that Place understands where the light falls from and creates light and shadow effects accordingly.
What's next?
Predicting the future of technology is a risky business. Still, here are a few AR + AI vision implementations we expect to see in the next few years:
– SocialAR: amazing full-body masks that follow your every movement.
– Complex, hyper-realistic scenes: AR objects will be seamlessly integrated into the environment and even interact with it. For instance, why not make an AR cat jump from the floor onto a desk?
– Engineering & maintenance: an AR app will be able to recognize which part of a machine is broken simply by "looking" at it and show technicians how to fix it.
Are you curious about augmented reality? Would you like to use AR content in your marketing strategy? Then send us a message at firstname.lastname@example.org, and let’s discuss the possibilities!