TwelveLabs Documentation • Guide

Automated Video Data Labeler

Replace hours of manual video annotation with AI-powered labeling. Upload raw footage, let TwelveLabs' multimodal models generate structured training data, and export production-ready datasets — all from a single dashboard.

[Architecture diagram]

The Data Labeling Bottleneck

In computer vision, video data is abundant — thousands of terabytes flow through IP-based camera networks daily. But labeled video data? That's the bottleneck. Manually scrubbing through hours of footage to find a specific event — a forklift violation, a safety breach, a product defect — is prohibitively expensive. Companies spend an estimated $25–$50 per hour on human annotators, a cost that has fueled an entire market of labeling services such as Amazon SageMaker Ground Truth and other data labeling vendors.

Breakthroughs in semantic video understanding allow us to invert this workflow entirely. Instead of manually hunting for events, we treat video as data that can be queried, clustered, and auto-labeled. This isn't just a time-saver — it's a force multiplier for deploying vision-language models (VLMs) in production environments.

  • 97% faster than manual annotation
  • 90%+ cost reduction
  • 3 export formats: JSON, CSV, COCO
  • 512-dimensional Marengo embeddings

Core Features

Everything you need to go from raw footage to training-ready datasets.

Video Index Management

Upload videos into named indexes. Organize datasets by project, domain, or experiment. Track video count, duration, and status at a glance.

AI-Powered Annotation

Define custom label taxonomies, then let TwelveLabs generate frame-accurate annotations with timestamps, descriptions, and confidence scores.

Semantic Video Search

Search for specific moments across your entire video library using natural language queries. No keywords needed — search by meaning.

Embedding Visualization

Visualize 512-dimensional Marengo embeddings projected into 2D space via PCA. See how your videos cluster by semantic similarity.

Multi-Format Export

Download annotations as JSON for raw access, CSV for spreadsheets, or COCO format for direct use in object detection pipelines.

ROI Calculator

See real-time cost and time comparisons between manual annotation and TwelveLabs-powered labeling for your selected videos.
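As a sketch of the arithmetic behind the calculator: the annotator rate below reflects the $25–$50/hour range cited earlier, while the per-minute API price and the 5x real-time annotation factor are purely illustrative defaults, not actual pricing.

```javascript
// Sketch: cost comparison for labeling `videoMinutes` of footage.
// All default parameters are illustrative assumptions.
function roi(videoMinutes, { annotatorRate = 35, apiCostPerMinute = 0.05 } = {}) {
    // Assume manual annotation takes roughly 5x real time.
    const manualCost = (videoMinutes * 5 / 60) * annotatorRate;
    const autoCost = videoMinutes * apiCostPerMinute;
    return { manualCost, autoCost, savings: 1 - autoCost / manualCost };
}
```

Substitute your own rates and throughput factor; the structure of the comparison stays the same.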

How It Works

A three-step pipeline from raw video to training-ready dataset.

1. Upload & Index Videos

Videos are uploaded to TwelveLabs via the tasks.create() API. Each video is processed through the Marengo 3.0 engine, which generates multimodal embeddings encoding visual, audio, and textual content into a 512-dimensional vector space.

route.js — Video ingestion (JavaScript)
const task = await tl_client.tasks.create({
    indexId: indexId,
    videoUrl: videoURL,
    userMetadata: JSON.stringify({
        indexName: "Autonomous Driving",
        description: "Dashcam footage for perception model training"
    })
});

// Wait for TwelveLabs to finish processing
const completed = await tl_client.tasks.waitForDone(task.id, {
    sleepInterval: 5
});

console.log(`Video ${completed.videoId} indexed successfully`);

2. Auto-Annotate with Custom Labels

Define your domain-specific label taxonomy (e.g., "car_turning_left", "pedestrian_crossing"), then trigger automated annotation. The system uses TwelveLabs' generative video understanding to produce frame-accurate labels with precise start and end timestamps.

Annotation prompt construction (JavaScript)
const prompt = `Analyze this video and generate annotations.
For each distinct event, provide:
- label: one of [${domainLabels.join(', ')}]
- start_timestamp: exact seconds when the event begins
- end_timestamp: exact seconds when the event ends
- description: brief description of what's happening

Return as JSON array.`;

const response = await fetch('/api/annotate', {
    method: 'POST',
    body: JSON.stringify({ videoId, prompt })
});
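Model output should be validated before it is stored as training data. A minimal sketch — `validateAnnotations` is a hypothetical helper, and the field names simply follow the prompt above:

```javascript
// Sketch: filter the model's raw JSON output down to well-formed
// annotations. `domainLabels` is the taxonomy array from the prompt.
function validateAnnotations(raw, domainLabels) {
    const allowed = new Set(domainLabels);
    return raw.filter(a =>
        allowed.has(a.label) &&
        typeof a.start_timestamp === 'number' &&
        typeof a.end_timestamp === 'number' &&
        a.start_timestamp < a.end_timestamp
    );
}
```

Anything outside the taxonomy or with inverted timestamps is dropped rather than silently written into the dataset.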

3. Export & Train

Download your annotations in the format your ML pipeline expects. The COCO export includes category mappings and bounding box placeholders, making it ready for fine-tuning object detection or action recognition models.

COCO format export (JSON)
{
  "info": {
    "description": "Autonomous Driving Dataset",
    "date_created": "2026-02-15T00:00:00Z"
  },
  "videos": [
    { "id": 1, "file_name": "dashcam_001.mp4", "duration": 124.5 }
  ],
  "annotations": [
    {
      "id": 1,
      "video_id": 1,
      "category_id": 3,
      "start": 12.4,
      "end": 15.8,
      "description": "Vehicle executing left turn at intersection"
    }
  ],
  "categories": [
    { "id": 1, "name": "pedestrian_crossing" },
    { "id": 2, "name": "lane_change" },
    { "id": 3, "name": "car_turning_left" }
  ]
}
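The CSV export can be generated client-side from the same structure. A minimal sketch — `toCsv` is illustrative, not the app's actual exporter:

```javascript
// Sketch: flatten the COCO-style structure above into CSV rows,
// resolving category and video IDs to names.
function toCsv(coco) {
    const cat = Object.fromEntries(coco.categories.map(c => [c.id, c.name]));
    const vid = Object.fromEntries(coco.videos.map(v => [v.id, v.file_name]));
    const header = 'video,label,start,end,description';
    const rows = coco.annotations.map(a =>
        [vid[a.video_id], cat[a.category_id], a.start, a.end,
         `"${a.description.replace(/"/g, '""')}"`].join(',')
    );
    return [header, ...rows].join('\n');
}
```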

Understanding Your Data Through Embeddings

Every video indexed by TwelveLabs is represented as a 512-dimensional embedding vector generated by the Marengo 3.0 model. These vectors capture the semantic meaning of video content across visual, audio, and textual modalities.

The Embeddings tab in each index uses Principal Component Analysis (PCA) to project these high-dimensional vectors into 2D space. Videos that are semantically similar to each other appear as clusters — giving you an intuitive way to audit data quality, identify duplicates, and discover patterns before training.

Custom power-iteration PCA, runs in-browser (JavaScript)
// Build the N×N Gram matrix instead of the 512×512 covariance matrix:
// when N (the number of videos) is much smaller than 512, the
// eigenproblem shrinks from 512×512 down to N×N
const gram = Array.from({ length: N }, (_, i) =>
    Array.from({ length: N }, (_, j) =>
        centered[i].reduce((s, v, k) => s + v * centered[j][k], 0)
    )
);

// Power iteration to find the top eigenvector (deflate the Gram
// matrix and repeat to get the second component)
let v = Array.from({ length: N }, () => Math.random() - 0.5);
for (let iter = 0; iter < 100; iter++) {
    const next = gram.map(row =>
        row.reduce((s, g, j) => s + g * v[j], 0)
    );
    const norm = Math.sqrt(next.reduce((s, x) => s + x * x, 0));
    v = next.map(x => x / norm);
}
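One detail the snippet leaves implicit: the converged eigenvector gives only the direction. Each video's coordinate along that component is its eigenvector entry scaled by the square root of the eigenvalue. A sketch — `componentCoords` is a hypothetical helper, assuming `gram` and a unit-norm `v` as computed above:

```javascript
// Sketch: recover the eigenvalue via the Rayleigh quotient (v·Gv for
// unit v), then scale the eigenvector by sqrt(eigenvalue) to get each
// video's 1-D coordinate along this principal component.
function componentCoords(gram, v) {
    const gv = gram.map(row => row.reduce((s, g, j) => s + g * v[j], 0));
    const lambda = v.reduce((s, vi, i) => s + vi * gv[i], 0);
    return v.map(vi => vi * Math.sqrt(Math.max(lambda, 0)));
}
```

Running this once per eigenvector yields the x and y axes of the 2D scatter plot.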

Why TwelveLabs?

TwelveLabs provides the foundational video understanding models that power every feature in this application.

Marengo 3.0 — Multimodal Embeddings

State-of-the-art video representation model that encodes visual, audio, and textual content into a unified 512-dimensional vector space. Powers semantic search, clustering, and similarity detection.
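Similarity between two videos then reduces to the cosine of the angle between their embedding vectors — the basic operation behind semantic search, clustering, and duplicate detection. A minimal sketch:

```javascript
// Sketch: cosine similarity between two embedding vectors of equal
// length (512 for the Marengo embeddings described above).
function cosineSimilarity(a, b) {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```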

Pegasus 1.2 — Generative Video Understanding

Generates structured, human-readable descriptions and labels from video content. Understands temporal relationships, object interactions, and scene transitions with frame-level accuracy.

Enterprise-Grade Infrastructure

SOC 2 compliant, built for scale. Process thousands of hours of video through a simple REST API with consistent, predictable pricing and 99.9% uptime.

Research-Backed Innovation

TwelveLabs' research team publishes cutting-edge work on video understanding, continuously improving model accuracy and expanding capabilities into new domains.

From Curated Data to Business Impact

The output of this tool — structured, labeled datasets — is not just a file; it's an actionable asset that drives business intelligence and model performance.

Accelerating VLM Fine-Tuning

  • Skip Feature Extraction — Pre-computed embeddings mean you can train a classifier in seconds, not hours.
  • Reduce Hallucinations — Curated data ensures your model learns from distinct, well-separated examples.
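As an illustration of "train a classifier in seconds": with embeddings precomputed, even a plain nearest-centroid classifier needs only a few lines. This is a sketch, not the app's actual pipeline; with Marengo embeddings the vectors would be 512-dimensional.

```javascript
// Sketch: average each label's embedding vectors into a centroid.
// `examples` is an array of { label, vector } pairs.
function trainCentroids(examples) {
    const sums = {};
    for (const { label, vector } of examples) {
        if (!sums[label]) sums[label] = { n: 0, acc: vector.map(() => 0) };
        sums[label].n += 1;
        vector.forEach((x, i) => { sums[label].acc[i] += x; });
    }
    return Object.fromEntries(Object.entries(sums).map(
        ([label, { n, acc }]) => [label, acc.map(x => x / n)]
    ));
}

// Predict by squared Euclidean distance to the nearest centroid.
function classify(centroids, vector) {
    let best = null, bestDist = Infinity;
    for (const [label, c] of Object.entries(centroids)) {
        const d = c.reduce((s, ci, i) => s + (ci - vector[i]) ** 2, 0);
        if (d < bestDist) { bestDist = d; best = label; }
    }
    return best;
}
```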

Operational Intelligence

  • Heatmap of Hazards — Clustering reveals systemic operational failures, not just one-off events.
  • Trend Analysis — Track if specific violation clusters are growing or shrinking over time.

Technology Stack

Layer         | Technology             | Purpose
Frontend      | Next.js 16 + React 19  | Server-side rendering, routing, and UI
Video AI      | TwelveLabs API         | Embeddings (Marengo), annotations (Pegasus)
Styling       | Tailwind CSS           | Utility-first responsive design
Visualization | Canvas 2D + custom PCA | Embedding scatter plots with power iteration
Storage       | Vercel Blob            | Video file hosting before indexing
Export        | Client-side generation | JSON, CSV, and COCO format downloads

Ready to automate your video annotation?

Get started with the API documentation, explore the source code, or talk to our team about enterprise deployment options.