Automated Video Data Labeler
Replace hours of manual video annotation with AI-powered labeling. Upload raw footage, let TwelveLabs' multimodal models generate structured training data, and export production-ready datasets — all from a single dashboard.


The Data Labeling Bottleneck
In computer vision, video data is abundant — thousands of terabytes flow through IP-based camera networks daily. But labeled video data? That's the bottleneck. Manually scrubbing through hours of footage to find a specific event — a forklift violation, a safety breach, a product defect — is prohibitively expensive. Companies spend an estimated $25–$50 per hour on human annotators, a cost that has fueled an entire industry of labeling services, from AWS SageMaker Ground Truth to specialist vendors.
Breakthroughs in semantic video understanding allow us to invert this workflow entirely. Instead of manually hunting for events, we treat video as data that can be queried, clustered, and auto-labeled. This isn't just a time-saver — it's a force multiplier for deploying vision-language models (VLMs) in production environments.
- 97% faster than manual annotation
- 90%+ cost reduction
- 3 export formats: JSON, CSV, COCO
- 512-dimensional Marengo embeddings

Core Features
Everything you need to go from raw footage to training-ready datasets.
Video Index Management
Upload videos into named indexes. Organize datasets by project, domain, or experiment. Track video count, duration, and status at a glance.
AI-Powered Annotation
Define custom label taxonomies, then let TwelveLabs generate frame-accurate annotations with timestamps, descriptions, and confidence scores.
Semantic Video Search
Search for specific moments across your entire video library using natural language queries. No keywords needed — search by meaning.
Embedding Visualization
Visualize 512-dimensional Marengo embeddings projected into 2D space via PCA. See how your videos cluster by semantic similarity.
Multi-Format Export
Download annotations as JSON for raw access, CSV for spreadsheets, or COCO format for direct use in object detection pipelines.
ROI Calculator
See real-time cost and time comparisons between manual annotation and TwelveLabs-powered labeling for your selected videos.
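The ROI comparison boils down to simple arithmetic over footage hours, annotator rates, and API cost. The sketch below is illustrative only: the annotator rate, review-speed multiplier, and per-hour API cost are assumed placeholder figures, not product pricing.

```typescript
// Illustrative ROI estimate: all rate constants below are assumptions,
// not TwelveLabs pricing or guaranteed annotator market rates.
interface RoiEstimate {
  manualHours: number;  // human labor hours required
  manualCost: number;   // USD
  apiCost: number;      // USD
  savingsPct: number;   // percent saved vs. manual
}

function estimateRoi(
  videoHours: number,
  annotatorRatePerHour = 35, // assumed $/hr, within the $25–$50 range above
  reviewMultiplier = 3,      // assumed hrs of labor per hr of footage
  apiCostPerVideoHour = 1.0  // assumed indexing + generation cost
): RoiEstimate {
  const manualHours = videoHours * reviewMultiplier;
  const manualCost = manualHours * annotatorRatePerHour;
  const apiCost = videoHours * apiCostPerVideoHour;
  return {
    manualHours,
    manualCost,
    apiCost,
    savingsPct: 100 * (1 - apiCost / manualCost),
  };
}

const roi = estimateRoi(10); // 10 hours of raw footage
```

Under these placeholder rates, 10 hours of footage costs $1,050 to label manually versus $10 via the API — the kind of 90%+ reduction the dashboard surfaces per selection.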

How It Works
A three-step pipeline from raw video to training-ready dataset.
Upload & Index Videos
Videos are uploaded to TwelveLabs via the tasks.create() API. Each video is processed through the Marengo 3.0 engine, which generates multimodal embeddings encoding visual, audio, and textual content into a 512-dimensional vector space.
```typescript
const task = await tl_client.tasks.create({
  indexId: indexId,
  videoUrl: videoURL,
  userMetadata: JSON.stringify({
    indexName: "Autonomous Driving",
    description: "Dashcam footage for perception model training"
  })
});

// Wait for TwelveLabs to finish processing
const completed = await tl_client.tasks.waitForDone(task.id, {
  sleepInterval: 5
});
console.log(`Video ${completed.videoId} indexed successfully`);
```

Auto-Annotate with Custom Labels
Define your domain-specific label taxonomy (e.g., "car_turning_left", "pedestrian_crossing"), then trigger automated annotation. The system uses TwelveLabs' generative video understanding to produce frame-accurate labels with precise start and end timestamps.
```typescript
const prompt = `Analyze this video and generate annotations.
For each distinct event, provide:
- label: one of [${domainLabels.join(', ')}]
- start_timestamp: exact seconds when the event begins
- end_timestamp: exact seconds when the event ends
- description: brief description of what's happening
Return as JSON array.`;

const response = await fetch('/api/annotate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ videoId, prompt })
});
```

Export & Train
Download your annotations in the format your ML pipeline expects. The COCO export includes category mappings and bounding box placeholders, making it ready for fine-tuning object detection or action recognition models.
```json
{
  "info": {
    "description": "Autonomous Driving Dataset",
    "date_created": "2026-02-15T00:00:00Z"
  },
  "videos": [
    { "id": 1, "file_name": "dashcam_001.mp4", "duration": 124.5 }
  ],
  "annotations": [
    {
      "id": 1,
      "video_id": 1,
      "category_id": 3,
      "start": 12.4,
      "end": 15.8,
      "description": "Vehicle executing left turn at intersection"
    }
  ],
  "categories": [
    { "id": 1, "name": "pedestrian_crossing" },
    { "id": 2, "name": "lane_change" },
    { "id": 3, "name": "car_turning_left" }
  ]
}
```
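CSV export is a straightforward client-side transformation of the same annotation records. A minimal sketch, with field names mirroring the JSON export above (the exact columns the app emits may differ):

```typescript
// Minimal client-side CSV generation for annotation records.
// Column names mirror the JSON export; quoting handles embedded
// commas and quotes in free-text descriptions.
interface AnnotationRow {
  video_id: number;
  category: string;
  start: number;
  end: number;
  description: string;
}

function toCsv(rows: AnnotationRow[]): string {
  const header = "video_id,category,start,end,description";
  const escape = (s: string) => `"${s.replace(/"/g, '""')}"`;
  const lines = rows.map(r =>
    [r.video_id, escape(r.category), r.start, r.end, escape(r.description)].join(",")
  );
  return [header, ...lines].join("\n");
}

const csv = toCsv([
  {
    video_id: 1,
    category: "car_turning_left",
    start: 12.4,
    end: 15.8,
    description: "Vehicle executing left turn at intersection"
  },
]);
```

In the browser, the resulting string can be handed to a `Blob` and downloaded via `URL.createObjectURL` — no server round trip needed.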
Understanding Your Data Through Embeddings
Every video indexed by TwelveLabs is represented as a 512-dimensional embedding vector generated by the Marengo 3.0 model. These vectors capture the semantic meaning of video content across visual, audio, and textual modalities.
The Embeddings tab in each index uses Principal Component Analysis (PCA) to project these high-dimensional vectors into 2D space. Videos that are semantically similar to each other appear as clusters — giving you an intuitive way to audit data quality, identify duplicates, and discover patterns before training.
```typescript
// embeddings: one 512-dim Marengo vector per video (N × 512)
const N = embeddings.length;
const mean = embeddings[0].map((_, k) =>
  embeddings.reduce((s, e) => s + e[k], 0) / N
);
const centered = embeddings.map(e => e.map((v, k) => v - mean[k]));

// Build the N×N gram matrix instead of the 512×512 covariance matrix.
// Eigendecomposition then works in N×N space, which is much cheaper
// whenever the number of videos N is smaller than the embedding dimension.
const gram = Array.from({ length: N }, (_, i) =>
  Array.from({ length: N }, (_, j) =>
    centered[i].reduce((s, v, k) => s + v * centered[j][k], 0)
  )
);

// Power iteration converges to the top eigenvector; deflating the gram
// matrix and iterating again yields the second component for the 2D plot.
let v = Array.from({ length: N }, () => Math.random() - 0.5);
for (let iter = 0; iter < 100; iter++) {
  const next = gram.map(row =>
    row.reduce((s, g, j) => s + g * v[j], 0)
  );
  const norm = Math.sqrt(next.reduce((s, x) => s + x * x, 0));
  v = next.map(x => x / norm);
}
```
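Power iteration by itself recovers only the dominant eigenvector. A standard way to obtain the second component for the 2D projection is deflation: subtract the first eigenpair from the gram matrix, then iterate again. A self-contained sketch (the deterministic starting vector is a simplification for reproducibility):

```typescript
// Power iteration returning both the eigenvector and its eigenvalue.
function powerIteration(m: number[][], iters = 200): { vec: number[]; val: number } {
  const n = m.length;
  // Deterministic nonzero start so results are reproducible.
  let v = Array.from({ length: n }, (_, i) => Math.sin(i + 1));
  let val = 0;
  for (let t = 0; t < iters; t++) {
    const next = m.map(row => row.reduce((s, g, j) => s + g * v[j], 0));
    val = Math.sqrt(next.reduce((s, x) => s + x * x, 0));
    v = next.map(x => x / val);
  }
  return { vec: v, val };
}

// Deflation: gram' = gram - λ₁·v₁v₁ᵀ removes the first component,
// so a second power iteration converges to the second eigenvector.
function topTwoComponents(gram: number[][]) {
  const first = powerIteration(gram);
  const deflated = gram.map((row, i) =>
    row.map((g, j) => g - first.val * first.vec[i] * first.vec[j])
  );
  const second = powerIteration(deflated);
  return [first, second];
}
```

Each video's 2D coordinate is then its entry in the two eigenvectors (scaled by the square roots of the eigenvalues), which is what the scatter plot renders.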
Why TwelveLabs?
TwelveLabs provides the foundational video understanding models that power every feature in this application.
Marengo 3.0 — Multimodal Embeddings
State-of-the-art video representation model that encodes visual, audio, and textual content into a unified 512-dimensional vector space. Powers semantic search, clustering, and similarity detection.
Pegasus 1.2 — Generative Video Understanding
Generates structured, human-readable descriptions and labels from video content. Understands temporal relationships, object interactions, and scene transitions with frame-level accuracy.
Enterprise-Grade Infrastructure
SOC 2 compliant, built for scale. Process thousands of hours of video through a simple REST API with consistent, predictable pricing and 99.9% uptime.
Research-Backed Innovation
TwelveLabs' research team publishes cutting-edge work on video understanding, continuously improving model accuracy and expanding capabilities into new domains.

From Curated Data to Business Impact
The output of this tool — structured, labeled datasets — is not just a file; it's an actionable asset that drives business intelligence and model performance.
Accelerating VLM Fine-Tuning
- Skip Feature Extraction — Pre-computed embeddings mean you can train a classifier in seconds, not hours.
- Reduce Hallucinations — Curated data ensures your model learns from distinct, well-separated examples.
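As a concrete illustration of the "train a classifier in seconds" claim, here is a minimal nearest-centroid classifier over precomputed embedding vectors. The 2D toy vectors and labels are stand-ins for real 512-dimensional Marengo embeddings; any simple classifier (logistic regression, k-NN) works similarly once features are precomputed.

```typescript
// Nearest-centroid classifier over precomputed embedding vectors —
// training is a single averaging pass, so it runs in seconds even for
// large datasets. Vectors below are toy 2D stand-ins, not real output.
type Labeled = { vec: number[]; label: string };

function trainCentroids(data: Labeled[]): Map<string, number[]> {
  const sums = new Map<string, { acc: number[]; n: number }>();
  for (const { vec, label } of data) {
    const e = sums.get(label) ?? { acc: new Array(vec.length).fill(0), n: 0 };
    vec.forEach((x, i) => (e.acc[i] += x));
    e.n += 1;
    sums.set(label, e);
  }
  const centroids = new Map<string, number[]>();
  for (const [label, { acc, n }] of sums) centroids.set(label, acc.map(a => a / n));
  return centroids;
}

function predict(centroids: Map<string, number[]>, vec: number[]): string {
  let best = "", bestDist = Infinity;
  for (const [label, c] of centroids) {
    const d = c.reduce((s, ci, i) => s + (ci - vec[i]) ** 2, 0);
    if (d < bestDist) { bestDist = d; best = label; }
  }
  return best;
}

const centroids = trainCentroids([
  { vec: [1, 0], label: "lane_change" },
  { vec: [0.9, 0.1], label: "lane_change" },
  { vec: [0, 1], label: "pedestrian_crossing" },
]);
```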
Operational Intelligence
- Heatmap of Hazards — Clustering reveals systemic operational failures, not just one-off events.
- Trend Analysis — Track if specific violation clusters are growing or shrinking over time.

Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Next.js 16 + React 19 | Server-side rendering, routing, and UI |
| Video AI | TwelveLabs API | Embeddings (Marengo), annotations (Pegasus) |
| Styling | Tailwind CSS | Utility-first responsive design |
| Visualization | Canvas 2D + Custom PCA | Embedding scatter plots with power iteration |
| Storage | Vercel Blob | Video file hosting before indexing |
| Export | Client-side generation | JSON, CSV, and COCO format downloads |

Ready to automate your video annotation?
Get started with the API documentation, explore the source code, or talk to our team about enterprise deployment options.