Think & Build Bigger, Faster, Better

Imagine you’re scrolling through the photos on your phone and come across an image you don’t recognize at first. It looks like something fluffy on the couch; could it be a cushion or a coat? After a few seconds it clicks: of course, that fluffy ball is Mocha, your friend’s cat. Most of your photos were instantly understandable, so why was this one cat photo so difficult?

Despite the importance of understanding visual data in critical areas ranging from health care to transportation to household devices, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers found that the question of how difficult an image is for humans to recognize has been almost entirely ignored. Datasets have been a primary driver of success in deep learning-based AI, but beyond the observation that bigger is better, we know little about how data drives progress in large-scale deep learning.

Although models perform well on existing datasets, including those expressly designed to challenge machines with debiased images or distribution shifts, humans still outperform object recognition models in real-world applications that require understanding visual input. The problem persists in part because we have no measure of the absolute difficulty of an image or dataset. Without controlling for the difficulty of the images used for evaluation, it is hard to objectively assess progress toward human-level performance, to cover the range of human abilities, or to increase the challenge posed by a dataset.

David Mayo, an MIT PhD student in electrical engineering and computer science and a CSAIL affiliate, dug into the deep world of image datasets to bridge this knowledge gap, investigating why certain photos are harder for people and machines to recognize than others. “Some images inherently take longer to recognize, and it’s critical to understand the brain’s activity during this process and how it relates to machine learning models,” he says. Perhaps current models are missing complex neural circuitry or unique mechanisms that become apparent only when tested with difficult visual inputs. “This research is critical for understanding and improving machine vision models,” says Mayo, lead author of a recent paper on the study.

This work led to a new metric, “minimum viewing time” (MVT), which quantifies the difficulty of recognizing an image by how long a person needs to view it before correctly identifying it. Using a subset of ImageNet, a popular dataset in machine learning, and ObjectNet, a dataset designed to test object recognition robustness, the team showed participants images for durations ranging from 17 milliseconds to 10 seconds and asked them to choose the correct object from a set of 50 options. They found that existing test sets, including ObjectNet, are skewed toward easier, short-MVT images, with the great majority of benchmark performance derived from images that are easy for humans.
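The team’s released tooling is not reproduced here, but the core idea behind MVT can be sketched in a few lines: for each image, find the shortest presentation time at which viewers reliably identify it. In the sketch below, the trial record format, the per-duration aggregation, and the 50% accuracy threshold are illustrative assumptions, not the paper’s exact protocol.

```python
# Minimal sketch (not the authors' released tools): estimating minimum viewing
# time (MVT) per image from human trial data. Each trial is assumed to record
# an image ID, the presentation duration in milliseconds, and whether the
# 50-way choice was correct. All names and the 0.5 threshold are hypothetical.
from collections import defaultdict

def minimum_viewing_time(trials, accuracy_threshold=0.5):
    """Return {image_id: shortest duration (ms) at which accuracy >= threshold}."""
    # Tally correct / total responses per (image, presentation duration).
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for t in trials:
        stats = counts[t["image_id"]][t["duration_ms"]]
        stats[0] += int(t["correct"])
        stats[1] += 1

    mvt = {}
    for image_id, per_duration in counts.items():
        for duration in sorted(per_duration):
            correct, total = per_duration[duration]
            if correct / total >= accuracy_threshold:
                mvt[image_id] = duration  # shortest exposure at which viewers reliably succeed
                break
        else:
            mvt[image_id] = None  # never reliably recognized, even at the longest exposure
    return mvt

# Toy example: one image recognized at 150 ms, another only at 1 second.
trials = [
    {"image_id": "dog_01", "duration_ms": 17, "correct": False},
    {"image_id": "dog_01", "duration_ms": 150, "correct": True},
    {"image_id": "cat_07", "duration_ms": 150, "correct": False},
    {"image_id": "cat_07", "duration_ms": 1000, "correct": True},
]
print(minimum_viewing_time(trials))  # {'dog_01': 150, 'cat_07': 1000}
```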

The experiments revealed intriguing trends in model performance, particularly with respect to scaling. Larger models improved considerably on easy images but made far less progress on harder ones. CLIP models, which combine language and vision, stood out, moving closer to human-like recognition.
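One way to surface this kind of trend is to stratify a model’s accuracy by MVT-based difficulty bins rather than reporting a single aggregate score. The sketch below assumes hypothetical bin edges and input formats; it is an illustration of the evaluation idea, not the paper’s analysis code.

```python
# Illustrative sketch: break a model's accuracy down by MVT difficulty bin.
# Bin edges (in milliseconds of MVT) and the prediction/label formats are assumptions.
from collections import defaultdict

BINS = ((0, 150), (150, 1000), (1000, float("inf")))  # easy / medium / hard (hypothetical)

def accuracy_by_difficulty(predictions, mvt_labels, bins=BINS):
    """predictions: {image_id: bool correct}; mvt_labels: {image_id: MVT in ms or None}."""
    tallies = defaultdict(lambda: [0, 0])  # bin -> [correct, total]
    for image_id, correct in predictions.items():
        mvt = mvt_labels.get(image_id)
        if mvt is None:
            continue  # image never reliably recognized by humans; leave it out of the bins
        for lo, hi in bins:
            if lo <= mvt < hi:
                tallies[(lo, hi)][0] += int(correct)
                tallies[(lo, hi)][1] += 1
                break
    return {b: correct / total for b, (correct, total) in tallies.items()}

# Comparing a small and a large model bin-by-bin makes the pattern visible:
# accuracy climbs sharply in the easy bin but much less in the hardest one.
```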

“Traditionally, object recognition datasets have been skewed toward less-complex images, a practice that has inflated model performance metrics so they do not truly reflect a model’s robustness or its ability to tackle complex visual tasks. Our research shows that harder images pose a more acute challenge, causing a distribution shift that is frequently overlooked in standard evaluations,” says Mayo. “We released image sets labeled by difficulty, along with tools to compute MVT automatically, allowing MVT to be added to existing benchmarks and extended to a variety of applications. These include measuring test-set difficulty before deploying real-world systems, discovering neural correlates of image difficulty, and advancing object recognition techniques to close the gap between benchmark and real-world performance.”

“One of the most important takeaways for me is that we now have another dimension on which to evaluate models. We want models that can recognize any image, even one that is difficult for a human to recognize, and we are the first to quantify what that would imply. Our findings show that not only is this not the case with today’s state of the art, but also that our current evaluation methods are unable to tell us when it is, because standard datasets are so skewed toward easy images,” says Jesse Cummings, an MIT graduate student in electrical engineering and computer science and co-first author of the paper with Mayo.