Jump To Top


This AI system learned to understand videos by watching YouTube

Elevate your enterprise data technology and strategy at Transform 2021.

Humans understand events in the world contextually, performing what’s called multimodal reasoning across time to make inferences about the past, present, and future. Given text and an image that seem innocuous when considered apart — e.g., “Look how many people love you” and a picture of a barren desert — people recognize that these elements take on potentially hurtful connotations when they’re paired or juxtaposed, for example.

Even the best AI systems struggle in this area. But there’s been progress, most recently from a team at the Allen Institute for Artificial Intelligence and the University of Washington’s Paul G. Allen School of Computer Science & Engineering. In a preprint paper published this month, the researchers detail Multimodal Neural Script Knowledge Models (Merlot), a system that learns to match images in videos with words and even follow events globally over time by watching millions of YouTube videos with transcribed speech. It does all this in an unsupervised manner, meaning that the videos haven’t been labeled or categorized — forcing the system to learn from the videos’ inherent structures.

Learning from videos

Our capacity for commonsense reasoning is shaped by how we experience causes and effects. Teaching machines this type of “script knowledge” is a significant challenge, in part because of the amount of data it requires. For example, even a single photo of people dining at a restaurant can imply a wealth of information, like the fact that the people had to meet up, agree where to go, and enter the restaurant before sitting down.

Merlot attempts to internalize these concepts by watching YouTube videos. Lots of YouTube videos. Drawing on a dataset of 6 million videos, the researchers trained the model to match individual frames with a contextualized representation of the video transcripts, divided into segments. The dataset contained instructional videos, lifestyle vlogs of everyday events, and YouTube’s auto-suggested videos for popular topics like “science” and “home improvement,” each selected explicitly to encourage the model to learn about a broad range of objects, actions, and scenes.

The goal was to teach Merlot to contextualize the frame-level representations over time and over spoken words, so that it could reorder scrambled video frames and make sense of “noisy” transcripts — including those with erroneously lowercase text, missing punctuation, and filler words like “umm,” “hmm,” and “yeah.” The researchers largely accomplished this. They that in a series of qualitative and quantitative tests, Merlot had a strong “out-of-the-box” understanding of everyday events and situations, enabling it to take a scrambled sequence of events from a video and order the frames to match the captions in a coherent narrative, like people riding a carousel.

Future work

Merlot is only the latest work on video understanding in the AI research community. In 2019, researchers at Georgia Institute of Technology and the University of Alberta created a system that could automatically generate commentary for “let’s play” videos of video games. More recently, researchers at Microsoft published a preprint paper describing a system that could determine whether statements about video clips were true, by learning from visual and textual clues. And Facebook has trained a computer vision system that can automatically learn audio, textual, and visual representations from publicly available Facebook videos.

Above: Merlot can understand the sequence of events in videos, as demonstrated here.

The Allen Institute and University of Washington researchers note that, like previous work, Merlot has limitations, some owing to the data selected to train the model. For example, Merlot could exhibit undesirable biases because it was only trained on English data and largely local news segments, which can spend a lot of time covering crime stories in a sensationalized way. It’s “very likely” that training models like Merlot on mostly news content could cause them to learn racist patterns as well as sexist patterns, the researchers concede, given that the most popular YouTubers in most countries are men. Studies have demonstrated a correlation between watching local news and having more explicit, racialized beliefs about crime.

For these reasons, the team advises against deploying Merlot into a production environment. But they say that Merlot is still a promising step for future work in multimodal understanding. “We hope that Merlot can inspire future work for learning vision+language representations in a more human-like fashion compared to learning from literal captions and their corresponding images,” the coauthors wrote. “The model achieves strong performance on tasks requiring event-level reasoning over videos and static images.”


  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more

Source: Read Full Article