The premature quest for AI-powered facial recognition to simplify screening
In 2009, 22-year-old student Nicholas George was going through a checkpoint at Philadelphia International Airport when Transportation Security Administration agents pulled him aside. A search of his luggage turned up flashcards with English and Arabic words. George was handcuffed, detained for hours, and questioned by the FBI.
George had been singled out by behavior-detection officers—people trained in picking out gestures and facial expressions that supposedly betrayed malicious intentions—as part of a US program called Screening of Passengers by Observation Techniques, or SPOT. But the officers were wrong in singling him out, and George was released without charge the same day.
As the incident suggests, SPOT produced very little useful information throughout its decade-long history. And in light of the technique’s failures, some computer scientists have recently concluded that a machine could do a better job at this task than humans. But the machine techniques they intend to use share a surprising history with SPOT’s training procedures. In fact, both can be traced back to the same man—Paul Ekman, now an emeritus professor of psychology at the University of California, San Francisco.
The rise of FACS
Ekman and his fellow psychologist, Wallace V. Friesen, were originally interested in whether nonverbal clues could betray a liar. In 1969, they published a paper titled “Nonverbal Leakage and Clues to Deception” that considered whether liars involuntarily communicated their deception. The face, they concluded, was equipped to “lie the most and betray the most.”
Facial cues haven’t been the only window into deception over the years. Alternative ideas for lie detection include the polygraph, a device measuring things like heart rate, skin conductivity, and capillary dilation. There are also EEG or fMRI brain scans, eye-tracking, and voice analysis. But analyzing facial expressions seemed the least invasive and didn’t need expensive equipment. So, both psychologists decided to give the human face a much closer examination.
Back then, most facial-expression researchers simply showed pictures to subjects and asked for their interpretation. Ekman suspected this measured what observers thought about a face, not what the face itself expressed. Instead, he and Friesen set about building the Facial Action Coding System (FACS), a scheme that could distinguish all possible facial movements without the bias of human observers. They derived FACS from the anatomy of the face itself.
They spent a year learning to fire their facial muscles separately. When necessary, they used needles and electrical currents to stimulate a muscle. They photographed their faces with each muscle action and used that to describe a set of 28 original action units—basic building blocks of human facial expressions. (AU 1 stood for inner brow raiser, AU 2 for outer brow raiser, and so on.)
While Ekman and Friesen meant their system for use by humans, FACS eventually proved to be a dream come true for the artificial intelligence community. With pictures linked to concise labels and explanations, it was a perfect training database for facial recognition algorithms. As far as making sense of human faces in computer science was concerned, FACS, along with its 500-page manual written by Ekman himself, became the bible.
Making sense of the face
Generally, all facial and affect recognition algorithms follow three basic steps. They begin by finding a face in an input picture or video. Perhaps the most popular approach today is a method proposed by Paul Viola and Michael Jones in 2001. An algorithm searches the input picture for groups of pixels arranged in patterns indicative of a human face—the eyes are usually darker than the cheeks, the nose bridge is usually brighter than the eyes, and so on. If those patterns appear in the right arrangement, the Viola-Jones algorithm labels the region a face.
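To make this step concrete, here is a minimal sketch of how a Viola-Jones-style detector is typically run using OpenCV’s bundled Haar cascade. The image file name is a placeholder, and the detection parameters are illustrative defaults, not anything the researchers mentioned here used.

```python
import cv2

# OpenCV ships a pretrained Viola-Jones (Haar cascade) frontal-face detector.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

def find_faces(image_path):
    """Return bounding boxes (x, y, w, h) for faces detected in an image."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # The detector slides Haar-like filters (dark eye regions vs. brighter
    # cheeks, a bright nose bridge, etc.) over the image at multiple scales.
    return face_detector.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40)
    )

# Hypothetical usage: list every detected face region in "passenger.jpg".
for (x, y, w, h) in find_faces("passenger.jpg"):
    print(f"face at x={x}, y={y}, width={w}, height={h}")
```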
Facial recognition software next moves on to feature extraction. One approach relies on discerning geometric features such as the eyes and the line of the mouth and calculating the relative distances and angles between them. Another approach is appearance-based feature extraction. This relies on a large training set; whenever a new picture is thrown at the algorithm, it calculates the distance between the newcomer and each of the pictures it has been trained on. Both techniques end up with a feature vector: a set of numbers describing the way we look, along with pretty much every silly face we can come up with.
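As a rough illustration of the geometric approach, the sketch below turns a handful of landmark coordinates into a feature vector of pairwise distances and line angles. The landmark names and pixel values are made up for the example; in a real system they would come from a landmark detector.

```python
import numpy as np

# Hypothetical landmark coordinates (in pixels) for one face.
landmarks = {
    "left_eye":    np.array([120.0, 95.0]),
    "right_eye":   np.array([180.0, 96.0]),
    "nose_tip":    np.array([150.0, 130.0]),
    "mouth_left":  np.array([130.0, 160.0]),
    "mouth_right": np.array([170.0, 161.0]),
}

def geometric_feature_vector(pts):
    """Stack pairwise distances and a few line angles into one feature vector."""
    names = sorted(pts)
    features = []
    # Relative distance between every pair of landmarks.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            features.append(np.linalg.norm(pts[a] - pts[b]))
    # Angle of the eye line and of the mouth line relative to horizontal.
    for left, right in [("left_eye", "right_eye"), ("mouth_left", "mouth_right")]:
        dx, dy = pts[right] - pts[left]
        features.append(np.arctan2(dy, dx))
    return np.array(features)

vector = geometric_feature_vector(landmarks)
print(vector.shape)  # one number per distance/angle -- the "feature vector"
```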
This can be used for identification, where the software decides if a face under consideration belongs to a given person. But it can also be used for recognizing expressions based on which combination of FACS action units are present. Software is now quite good at this job. FACET, one of the best commercially available affect-recognition software systems, scores way above 80 percent in recognizing emotional expressions.
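The mapping from action units to basic emotions can be expressed as a simple lookup. The combinations below follow commonly cited EMFACS-style groupings, which vary somewhat between sources, so the exact lists are an approximation rather than the coding any particular product uses.

```python
# Commonly cited FACS action-unit combinations for the six basic emotions
# (EMFACS-style groupings; exact lists vary between sources).
EMOTION_AUS = {
    "happiness": {6, 12},          # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},       # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},    # brow raisers + upper lid raiser + jaw drop
    "anger":     {4, 5, 7, 23},
    "disgust":   {9, 15},
    "fear":      {1, 2, 4, 5, 20, 26},
}

def label_expression(detected_aus):
    """Return emotions whose characteristic AUs are all present in the frame."""
    detected = set(detected_aus)
    return [emotion for emotion, aus in EMOTION_AUS.items() if aus <= detected]

# A frame coded with AU 6 and AU 12 would be labeled a happy expression.
print(label_expression([6, 12]))   # ['happiness']
```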
Still, some real-world scenarios can prove challenging for AI. A team of researchers at the University of Notre Dame tried to use FACET to identify children’s boredom, confusion, and delight in classrooms, but the software could hardly identify a face, much less emotions. The software’s performance depends heavily on the material it has been trained on. And most training databases usually consist of posed expressions where subjects were asked to stand motionless for a while. Kids simply failed to sit still.
If AI struggles to pick out emotions that are there for everyone to see, recognizing emotions we deliberately try to hide seems like an impossible challenge. So why do experts still believe? How can a machine catch a sudden expression of surprise, disgust, fear, or anger that, for a fraction of a second, betrays a terrorist?
Signs of deception
There’s a world of apps for affect recognition nowadays. Affectiva, an MIT spin-off co-founded by Rosalind Picard, offers emotion-recognition AI to the entertainment and advertising industries. Research on emotion is underway at Facebook, and Apple bought Emotient, an affect-recognition startup, a year or so ago.
But nearly all of this software is designed to catch basic facial expressions that usually indicate one of the six basic emotions: anger, disgust, fear, happiness, sadness, and surprise. There’s no expression for deception.
Ekman and Friesen, however, long ago claimed they identified the secret to picking this up: micro-expressions.
In 1969, both psychologists were struggling with a particularly tricky patient. At first glance, she appeared normal, sometimes even cheerful, yet she had made numerous suicide attempts. After long hours of watching the video footage of her counseling session frame by frame, they finally saw what they’d expected. Somewhere in between her smiles, Friesen and Ekman caught a fleeting look of anguish. It lasted two frames.
After researching this phenomenon more closely, they concluded micro-expressions were involuntary expressions of a person’s internal state. They lasted just 1/25 to 1/5 of a second before being suppressed.
Ekman found that untrained observers scored only slightly better than chance in recognizing micro-expressions. So he developed the Micro Expression Training Tool, a program intended to train law enforcement officers in catching micro-expressions during interrogations. Trained interrogators could achieve 70 percent accuracy in spotting, and 80 percent in interpreting, micro-expressions. Some among the computer vision community began to believe a carefully designed artificial intelligence algorithm would do better.
AI for lie detection?
One of the first to try was a team led by Matthew Shreve at the University of South Florida. Back in 2009, Shreve came up with a technique for measuring the force exerted on the facial skin by its underlying tissue. He called this optical strain. “If you sum up all the strain magnitude over the entire face, for every skin pixel, you’ll see every pixel of the face has a certain value between zero and one associated with how much it is deformed. So, once we add up all those values over the time series, we’re looking at the signal,” says Shreve, who’s now working at PARC (a Xerox company focused on pioneering technology). Shreve adapted this technique for spotting micro-expressions. Whenever peaks in an optical strain signal lasted 1/25 to 1/5 of a second, an algorithm flagged that as a micro-expression.
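The gist of that pipeline can be sketched roughly as follows: compute dense optical flow between consecutive frames, sum a strain-like magnitude over all face pixels to get a per-frame signal, then flag runs where that signal stays above a threshold for between 1/25 and 1/5 of a second. The strain formula, threshold, and frame rate here are simplifying assumptions, not Shreve’s published parameters.

```python
import cv2
import numpy as np

def strain_signal(gray_frames):
    """Per-frame sum of a simplified optical-strain magnitude over all pixels."""
    signal = []
    for prev, curr in zip(gray_frames, gray_frames[1:]):
        # Dense optical flow between consecutive frames (u, v per pixel).
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        u, v = flow[..., 0], flow[..., 1]
        u_y, u_x = np.gradient(u)
        v_y, v_x = np.gradient(v)
        # Magnitude of a 2-D strain tensor built from the flow gradients.
        strain = np.sqrt(u_x**2 + v_y**2 + 0.5 * (u_y + v_x)**2)
        signal.append(strain.sum())
    return np.array(signal)

def spot_micro_expressions(signal, fps, threshold):
    """Flag runs above threshold lasting between 1/25 and 1/5 of a second."""
    above = signal > threshold
    flagged, start = [], None
    for i, hit in enumerate(np.append(above, False)):
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            duration = (i - start) / fps
            if 1 / 25 <= duration <= 1 / 5:
                flagged.append((start, i))
            start = None
    return flagged

# Hypothetical usage, assuming `frames` is a list of grayscale face crops:
# events = spot_micro_expressions(strain_signal(frames), fps=30.0, threshold=1500.0)
```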
But his team took a debatable approach to building a micro-expression database. Participants were shown sample pictures and asked to mimic those in front of a camera; the resulting images all look a bit artificial. “All of this early work was done on posed expressions,” says Shreve. “But we also tested this on videos of politicians in debates. So, you can imagine there were a few genuine micro-expressions. We also analyzed the video of Alex Rodriguez when he was asked by Katie Couric if he had doped. We found a scorn micro-expression, just a flash.”
While this algorithm could spot an expression, it had no idea what it meant. Another team picked up the gauntlet. Senya Polikovsky at the University of Tsukuba joined forces with Yoshinari Kameda and Yuichi Ohta to build a dedicated tool designed from scratch to fight violence and extremism. “It was a hot topic. We were still in this 9/11 phase and such projects got a lot of funding. There was also this TV show Lie to Me, and lie detection had caught public imagination,” says Polikovsky, now the head of the Optics and Sensing Laboratory at the Max Planck Institute in Tübingen, Germany. “People were hooked on what could be done with analyzing faces and particularly micro-expressions.”
Polikovsky was one of the first to realize the need for a training database dedicated specifically to micro-expressions. Ten students were asked to enact one of a few facial expressions with low muscle intensity and get back to neutral as quickly as possible in order to simulate micro-expressions. Once the database was ready, all frames in the gathered videos were manually FACS-coded. Polikovsky also thought recognizing micro-expressions required better sensors, so his team recorded faces with a 200fps camera, giving the software at least 10 frames to work with.
His algorithm divided faces into 12 regions—forehead, left eyebrow, etc.—and then identified facial action units in each region. But Polikovsky’s technology suffered from the opposite problem of Shreve’s. It could recognize a micro-expression once it knew it was there, but this AI couldn’t spot it in a longer video.
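As a loose sketch of that region-based idea, the code below splits a cropped face into a 3×4 grid of 12 regions and computes a toy per-region motion descriptor from frame-to-frame differences. Both the grid layout and the descriptor are simplifications for illustration, not Polikovsky’s actual method.

```python
import numpy as np

def split_into_regions(face_gray, rows=3, cols=4):
    """Split a cropped face image into a 3x4 grid of 12 regions."""
    h, w = face_gray.shape
    return [face_gray[r * h // rows:(r + 1) * h // rows,
                      c * w // cols:(c + 1) * w // cols]
            for r in range(rows) for c in range(cols)]

def region_motion_descriptor(region_prev, region_curr):
    """Toy descriptor: normalized histogram of frame-to-frame differences."""
    diff = region_curr.astype(np.float32) - region_prev.astype(np.float32)
    hist, _ = np.histogram(diff, bins=16, range=(-255, 255))
    return hist / max(hist.sum(), 1)
```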
MESR
Apart from technicalities, Polikovsky’s and Shreve’s algorithms had a more profound flaw: both were trained on posed expressions. They scored reasonably well on their respective databases (above 86-percent accuracy for Polikovsky and 100 percent for Shreve), but their performance in real-world scenarios was a mystery. Ekman argued a micro-expression would occur in a very specific set of conditions—a high-stake situation where a liar was under immense pressure. Recreating that in a lab with 22-year-old students seemed dubious, to say the least.
So Tomas Pfister of Oxford University invited 20 participants to the lab and made them watch videos proven to trigger strong emotional reactions. Subjects were asked to hide their true feelings and keep a neutral face during the session (they were told they’d be punished whenever any emotions leaked onto their faces). The resulting database, known as SMIC, contained 164 samples of spontaneous micro-expressions from 16 subjects. Pfister’s team coded them as positive, surprised, or negative. A team at the Chinese Academy of Sciences built a similar database named CASME that was labeled with FACS action units.
Xiaobai Li, one of the scientists who worked on CASME, later teamed up with Pfister to build the first fully automatic system for spotting and recognizing micro-expressions, dubbed MESR (Micro-expressions Spotting and Recognition). Unlike the software built by Shreve and Polikovsky, it can autonomously spot and interpret micro-expressions in long, spontaneous videos.
The system starts by detecting three points on a subject’s face (the inner corners of the eyes and the spine of the nose) and places a 6×6 grid around them. The algorithm takes the first and the last frame of the footage, derives feature vectors from both, and averages them to create a point of reference. It then performs feature difference analysis, comparing each frame of the video to this reference frame. When the difference exceeds a certain threshold for a time between 1/25 and 1/5 of a second, MESR flags a micro-expression.
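A minimal sketch of that feature-difference analysis appears below. The per-frame feature extractor here is a deliberately simple placeholder (mean intensity per grid cell), not the descriptor MESR actually uses; the point is the comparison against an averaged reference frame.

```python
import numpy as np

def frame_features(gray_frame, grid=6):
    """Placeholder features: mean intensity of each cell in a 6x6 grid."""
    h, w = gray_frame.shape
    return np.array([gray_frame[r * h // grid:(r + 1) * h // grid,
                                c * w // grid:(c + 1) * w // grid].mean()
                     for r in range(grid) for c in range(grid)])

def feature_difference_signal(gray_frames):
    """Distance of each frame's features from a first/last-frame reference."""
    features = [frame_features(f) for f in gray_frames]
    reference = (features[0] + features[-1]) / 2.0   # averaged reference frame
    return np.array([np.linalg.norm(f - reference) for f in features])

# Frames where this signal exceeds a threshold for 1/25 to 1/5 of a second
# would be flagged as micro-expressions, using the same duration test as the
# optical-strain sketch earlier in the article.
```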
The interpretation module works with the flagged clips, applying a technique called Eulerian motion magnification, a technology invented at MIT that registers even the tiniest traces of movement. MESR can classify a micro-expression using either SMIC or CASME labels.
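For a sense of the Eulerian idea, here is a heavily simplified sketch: each pixel’s intensity is run through a temporal band-pass filter built from two first-order low-pass filters, and an amplified copy of that band is added back so subtle motions become visible. The filter constants and amplification factor are arbitrary assumptions, not the parameters of the MIT implementation.

```python
import numpy as np

def magnify_motion(frames, alpha=20.0, r_fast=0.4, r_slow=0.05):
    """Toy Eulerian magnification: amplify a temporal band-pass of each pixel."""
    frames = [f.astype(np.float32) for f in frames]
    low_fast = frames[0].copy()   # fast first-order low-pass state
    low_slow = frames[0].copy()   # slow first-order low-pass state
    output = []
    for frame in frames:
        low_fast = r_fast * frame + (1 - r_fast) * low_fast
        low_slow = r_slow * frame + (1 - r_slow) * low_slow
        band = low_fast - low_slow            # temporal band-pass of intensities
        amplified = frame + alpha * band      # add back an exaggerated copy
        output.append(np.clip(amplified, 0, 255).astype(np.uint8))
    return output
```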
So far, results are promising. In the spotting task, MESR achieved 84-percent accuracy on SMIC and nearly 93 percent on the CASME database (most false alarms were due to eye blinks). For interpretation, the algorithm scored about 80 percent on both databases. Li and Pfister hope MESR, after some fine-tuning, will eventually find its way to border control, where officers will use it during screening interviews. Another possible application lies in forensic sciences.
New kids on the block
Despite improving test results, there are some who say none of these tools should ever be used for judging people. Part of this is because all of them unquestioningly follow Ekman’s FACS guidelines. This means all of them, just like the TSA officers who pointed a finger at Nicholas George back in 2009 at Philadelphia International Airport, rest on two underlying assumptions: that current technology is good enough to understand our emotions and that Paul Ekman was right.
Both ideas have been challenged. While micro-expressions can potentially identify concealed emotions, they also appear in people who are telling the truth. And these fleeting expressions can be masked entirely by things happening elsewhere on the face. Thus, it’s an open question whether they can be used to consistently identify deception. And that’s before we get to whether we have the technology to do so.
“Although MESR is the most recent development in the field, the technology behind it is relatively old,” says Polikovsky. “To recognize facial expressions, you need to look for some changes in pixels, in the texture of the face, and find a way to quantify these changes. So, people were coming up with new descriptors, ways of describing handcrafted features. That’s how computer vision was done five, maybe 10 years ago,” he adds. Thus, MESR, just like the Battleship Yamato, may be the last and finest specimen of a dying breed.
“The field is dwindling down, and it’s partially because existing psychological models behind its core ideas need some improvement. Right now, they are still very basic. People simply realized this whole thing was way more complicated than they’d thought,” says Polikovsky.
In other words, micro-expressions may not be all they have been hyped to be. Polikovsky is not the only one to think so. While Ekman got lots of love from the computer vision community, his fellow psychologists were less impressed. Charles Honts, a psychologist at Boise State University and a polygraph expert, said he had been trained on FACS back in the 1980s, but he was unable to replicate Ekman’s results on facial expressions. David Raskin, a professor of psychology at the University of Utah, told Nature he had “yet to see a comprehensive evaluation” of Ekman’s work.
Hatice Gunes and Hayley Hung, computer scientists specializing in affect recognition, took Ekman to task in a 2016 opinion paper about the state of their field in Image and Vision Computing. His six basic emotions idea, Gunes and Hung argued, was completely outdated—the field fell for it because it framed the problem of detecting emotions in a very AI-friendly way. But facial expressions can communicate different things in different contexts, and the wide variety of human emotional states didn’t map neatly into Ekman’s six categories. “This simplification of the task, while serving us well in the early days, needs to change significantly,” they wrote, arguing that it created an illusion that affect recognition was a solved problem, while it absolutely wasn’t.
“I 100 percent agree with them,” says Polikovsky, but he points to a viable way forward. “The current notion in computer vision is that we just need to have enough data for neural networks to learn all the necessary stuff on their own, rather than manually program them to look for predefined features,” he told Ars. But all existing micro-expression databases contain a few hundred samples, tops. Deep learning needs hundreds of thousands, if not millions of them.
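To make that contrast concrete, below is a minimal sketch of the kind of small convolutional network that learns its own features from raw face crops instead of relying on handcrafted descriptors. It assumes PyTorch, 64×64 grayscale inputs, and the three SMIC-style labels; it is an illustration of the data-hungry approach, not a model anyone in the field has published.

```python
import torch
import torch.nn as nn

# A minimal convolutional classifier that learns its own features from raw
# face crops (three SMIC-style classes: positive, negative, surprise).
class MicroExpressionNet(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # for 64x64 inputs

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# With only a few hundred labeled clips, a network like this would overfit
# badly -- the data bottleneck Polikovsky describes.
model = MicroExpressionNet()
dummy_batch = torch.randn(8, 1, 64, 64)   # 8 grayscale 64x64 face crops
print(model(dummy_batch).shape)           # torch.Size([8, 3])
```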
Read more at Ars Technica