How to Automate Video Captions with an API and n8n Workflows

Mar 10, 2026 By smrht@icloud.com

The Bottom Line

Manual captioning costs $1-3 per video minute and takes 5-10x the video length in human labor. An automated caption pipeline using the AutoCaptions API and n8n processes a 60-second video in under 90 seconds, costs a fraction of manual work, and scales to hundreds of videos per day without additional headcount.

85% of Facebook videos are watched without sound. On Instagram, that number is 70%. TikTok's algorithm actively favors videos with captions because they increase watch time. Captions aren't optional anymore - they're a ranking factor.

This guide covers the complete pipeline: speech-to-text transcription, subtitle generation, burn-in rendering with custom styling, and batch processing at scale.

Manual vs. Automated Captioning

Before diving into the technical setup, here's why automation wins:

| Factor | Manual Captioning | Automated (API + n8n) |
| --- | --- | --- |
| Time per minute of video | 5-10 minutes | 15-30 seconds |
| Cost per video (1 min) | $1-3 (freelancer) | $0.05-0.15 (API) |
| Accuracy | 98-99% (human) | 95-97% (AI, improving) |
| Styling consistency | Varies by editor | Pixel-perfect every time |
| Scale capacity | Limited by team size | 1,000+ videos/day |
| Turnaround for 100 videos | 3-5 business days | 2-4 hours |
| Language support | Need bilingual staff | 50+ languages via API |

The accuracy gap is closing fast. Modern speech-to-text models handle accents, technical jargon, and overlapping speech better than they did even a year ago. For most content types - talking head videos, product demos, presentations - automated accuracy is indistinguishable from human work.

Where human captioning still wins: heavy background noise, multiple speakers talking simultaneously, and domain-specific terminology (medical, legal). For these edge cases, use the automated pipeline for the first pass and add a human review step.

How the AutoCaptions API Works

The AutoCaptions API handles three distinct operations that can be used independently or chained together.

1. Speech-to-Text Transcription

Upload a video (or provide a URL), and the API returns a time-stamped transcription. Each word gets a start time, end time, and confidence score.

// POST /api/autocaptions/transcribe
{
  "video_url": "https://your-cdn.com/video.mp4",
  "language": "en"
}

Response:

{
  "transcription": {
    "text": "Today we're looking at three ways to automate your video workflow.",
    "words": [
      { "word": "Today", "start": 0.0, "end": 0.35, "confidence": 0.98 },
      { "word": "we're", "start": 0.38, "end": 0.52, "confidence": 0.97 },
      { "word": "looking", "start": 0.55, "end": 0.82, "confidence": 0.99 },
      { "word": "at", "start": 0.84, "end": 0.92, "confidence": 0.99 },
      { "word": "three", "start": 0.95, "end": 1.15, "confidence": 0.96 },
      { "word": "ways", "start": 1.18, "end": 1.42, "confidence": 0.98 }
    ],
    "language_detected": "en",
    "duration": 4.8
  }
}
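The word-level timings are what make styled, word-highlighted captions possible. As a rough illustration (this helper is not part of the API response, just a sketch of how you might consume it), here's how the `words` array could be grouped into caption chunks of at most 42 characters in an n8n Function node or any JavaScript runtime:

```javascript
// Illustrative helper (not an API feature): group word timings into
// caption chunks no longer than maxChars, preserving start/end times.
function chunkWords(words, maxChars = 42) {
  const chunks = [];
  let current = null;

  for (const w of words) {
    // Start a new chunk if adding this word would exceed the limit
    if (!current || (current.text + " " + w.word).length > maxChars) {
      current = { text: w.word, start: w.start, end: w.end };
      chunks.push(current);
    } else {
      current.text += " " + w.word;
      current.end = w.end;
    }
  }
  return chunks;
}

// Example with the words from the response above
const words = [
  { word: "Today", start: 0.0, end: 0.35, confidence: 0.98 },
  { word: "we're", start: 0.38, end: 0.52, confidence: 0.97 },
  { word: "looking", start: 0.55, end: 0.82, confidence: 0.99 },
  { word: "at", start: 0.84, end: 0.92, confidence: 0.99 },
  { word: "three", start: 0.95, end: 1.15, confidence: 0.96 },
  { word: "ways", start: 1.18, end: 1.42, confidence: 0.98 }
];

const chunks = chunkWords(words, 42);
// All six words fit within 42 characters, so they form a single chunk
```

In practice the subtitle endpoint (next section) does this grouping for you; rolling your own is only necessary for fully custom rendering from the JSON word data.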

2. Subtitle File Generation

From the transcription, generate industry-standard subtitle files. Supported formats:

  • SRT - Most widely supported, works everywhere
  • VTT (WebVTT) - HTML5 video standard, supports styling
  • ASS - Advanced SubStation Alpha, full styling control
  • JSON - Raw timed data for custom rendering

// POST /api/autocaptions/subtitles
{
  "video_url": "https://your-cdn.com/video.mp4",
  "format": "srt",
  "max_chars_per_line": 42,
  "max_lines": 2
}

The API handles line breaking intelligently - it splits on natural phrase boundaries, not in the middle of words or clauses. Max characters per line and max lines per subtitle block are configurable.
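For reference, the SRT output for the example transcription might look like the following (illustrative only; the exact cue splits and timestamps depend on your `max_chars_per_line` and `max_lines` settings):

```
1
00:00:00,000 --> 00:00:01,420
Today we're looking at three ways

2
00:00:01,450 --> 00:00:04,800
to automate your video workflow.
```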

3. Burn-In Rendering

This is where it gets powerful. Instead of generating a subtitle file that viewers can toggle on/off, burn-in rendering hardcodes the captions directly into the video pixels. The output is a new video file with permanent, styled captions.

// POST /api/autocaptions/render
{
  "video_url": "https://your-cdn.com/video.mp4",
  "style": {
    "font": "Montserrat",
    "font_size": 42,
    "color": "#FFFFFF",
    "background": "rgba(0,0,0,0.7)",
    "position": "bottom-center",
    "padding": 8,
    "border_radius": 4,
    "highlight_color": "#FFD700",
    "highlight_style": "word"
  },
  "callback_url": "https://your-n8n.com/webhook/caption-done"
}

Burn-in is essential for social media platforms. Instagram Reels, TikTok, and YouTube Shorts don't support external subtitle files. The captions must be part of the video itself.

Caption Styling Options

Generic white-text-on-black captions look dated. Modern caption styles are a design element that matches your brand. Here are the main styles the API supports.

Word-by-Word Highlight (TikTok Style)

The most engaging caption style in 2026. Each word highlights as it's spoken, creating a karaoke-like effect that keeps viewers reading.

{
  "highlight_style": "word",
  "highlight_color": "#FFD700",
  "color": "#FFFFFF",
  "font": "Montserrat Bold",
  "font_size": 48,
  "background": "none",
  "text_shadow": "2px 2px 4px rgba(0,0,0,0.8)",
  "position": "center"
}

This style works best for short-form content (under 60 seconds) where the text is large and centered. It's the default on TikTok and Instagram Reels for good reason - it increases average watch time by 12-15% compared to no captions.

Full Sentence with Background Box

The classic approach. A colored box behind the text ensures readability over any video content.

{
  "highlight_style": "none",
  "color": "#FFFFFF",
  "font": "Inter",
  "font_size": 36,
  "background": "rgba(0,0,0,0.75)",
  "padding": 12,
  "border_radius": 8,
  "position": "bottom-center",
  "margin_bottom": 80
}

Best for longer content, tutorials, and presentations where readability matters more than visual flair. The background box guarantees contrast regardless of what's happening in the video behind it.

Animated Text Reveal

Words or phrases animate onto the screen - fade in, slide up, pop in. Each animation style changes the feel of the captions.

{
  "animation": "slideUp",
  "animation_duration": 0.3,
  "color": "#FFFFFF",
  "font": "Poppins SemiBold",
  "font_size": 44,
  "background": "none",
  "text_shadow": "1px 1px 3px rgba(0,0,0,0.9)",
  "position": "bottom-center"
}

Available animations: fadeIn, slideUp, slideDown, popIn, typewriter, bounceIn. The typewriter effect works particularly well for dramatic or narrative content.

Custom Fonts and Brand Colors

Upload your brand font (TTF/OTF) and the API uses it for rendering. Combined with your brand colors, the captions become part of your visual identity rather than a generic overlay.

{
  "font": "custom",
  "font_url": "https://your-cdn.com/fonts/YourBrandFont-Bold.ttf",
  "color": "#E63946",
  "highlight_color": "#F1FAEE",
  "font_size": 40
}

Step-by-Step n8n Workflow

Here's the complete n8n workflow for automated captioning. This handles single videos and batch processing.

Workflow Structure

[Webhook: Receive video URL]
    → [HTTP Request: Send to AutoCaptions API]
    → [Wait: For render callback]
    → [Function: Validate output]
    → [IF: Quality check passed?]
        → Yes: [HTTP Request: Deliver to destination]
        → No: [HTTP Request: Flag for human review]
    → [HTTP Request: Log to analytics]

Node 1: Webhook Trigger

Set up a Webhook node that accepts POST requests with a video URL and desired caption style.

// Expected incoming payload
{
  "video_url": "https://your-cdn.com/raw-video.mp4",
  "style_preset": "tiktok",
  "language": "en",
  "destination": "s3://your-bucket/captioned/",
  "callback_meta": {
    "project_id": "proj_123",
    "video_id": "vid_456"
  }
}

Node 2: Function Node - Build API Request

Map the style preset to actual styling parameters:

const input = $input.first().json;

const stylePresets = {
  tiktok: {
    font: "Montserrat Bold",
    font_size: 48,
    color: "#FFFFFF",
    highlight_style: "word",
    highlight_color: "#FFD700",
    background: "none",
    text_shadow: "2px 2px 4px rgba(0,0,0,0.8)",
    position: "center"
  },
  youtube: {
    font: "Inter",
    font_size: 36,
    color: "#FFFFFF",
    highlight_style: "none",
    background: "rgba(0,0,0,0.75)",
    padding: 12,
    border_radius: 8,
    position: "bottom-center",
    margin_bottom: 60
  },
  instagram: {
    font: "Poppins SemiBold",
    font_size: 44,
    color: "#FFFFFF",
    highlight_style: "word",
    highlight_color: "#FF6B6B",
    background: "none",
    text_shadow: "1px 1px 3px rgba(0,0,0,0.9)",
    position: "center",
    animation: "popIn"
  },
  professional: {
    font: "Inter",
    font_size: 32,
    color: "#FFFFFF",
    highlight_style: "none",
    background: "rgba(0,0,0,0.6)",
    padding: 8,
    position: "bottom-center",
    margin_bottom: 40
  }
};

const style = stylePresets[input.style_preset] || stylePresets.tiktok;

return [{
  json: {
    video_url: input.video_url,
    style: style,
    language: input.language || "en",
    callback_url: `${$env.WEBHOOK_URL}/webhook/caption-callback`,
    callback_meta: input.callback_meta
  }
}];

Node 3: HTTP Request - Submit to AutoCaptions API

POST the request to the AutoCaptions API render endpoint. Include your API key in the header.

  • Method: POST
  • URL: https://api.samautomation.com/v1/autocaptions/render
  • Headers: Authorization: Bearer {{$credentials.samautomation_api_key}}
  • Body: JSON from the previous Function node

Node 4: Wait Node

Configure the Wait node to resume when a webhook callback is received. Set the webhook path to /webhook/caption-callback and a timeout of 10 minutes (captioning a 5-minute video takes roughly 2-4 minutes).

Node 5: Function Node - Validate Output

Check the callback response for errors and validate the output URL:

const result = $input.first().json;

if (result.status !== "completed") {
  return [{
    json: {
      success: false,
      error: result.error || "Unknown rendering error",
      video_id: result.callback_meta?.video_id
    }
  }];
}

return [{
  json: {
    success: true,
    captioned_video_url: result.output_url,
    duration: result.duration,
    word_count: result.word_count,
    video_id: result.callback_meta?.video_id,
    destination: result.callback_meta?.destination
  }
}];

Node 6: IF Node - Quality Gate

Route based on the success field. Successful renders continue to delivery. Failed renders go to a notification node that alerts your team via Slack, email, or however you prefer.
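A small Function node on the failure branch can shape the alert before the HTTP Request node sends it. This is a sketch, not a prescribed format: the Slack-style `text` payload and field names here are assumptions you'd adapt to your own notification channel.

```javascript
// Illustrative helper: format a Slack-style alert for a failed render.
// Send the returned JSON to a Slack Incoming Webhook (or any channel)
// via an HTTP Request node on the IF node's "false" branch.
function buildFailureAlert(result) {
  return {
    text: [
      ":warning: Caption render failed",
      `Video: ${result.video_id || "unknown"}`,
      `Error: ${result.error || "no error message returned"}`
    ].join("\n")
  };
}

// In an n8n Function node you'd wire it up as:
// return [{ json: buildFailureAlert($input.first().json) }];

const alert = buildFailureAlert({
  video_id: "vid_456",
  error: "source file unreadable"
});
```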

Multi-Language Caption Support

The AutoCaptions API supports 50+ languages for transcription and can generate captions in a different language than the spoken audio. This opens up two use cases.

Same-Language Captions

The straightforward case. English video gets English captions, Spanish video gets Spanish captions. Just set the language parameter to match the spoken language.

{ "language": "es" }

For auto-detection, omit the language parameter. The API detects the spoken language from the first 30 seconds of audio and uses it for the full transcription.

Cross-Language Translation

Render captions in a different language than what's spoken. The API transcribes the original audio, translates the transcription, and generates timed captions in the target language.

{
  "video_url": "https://cdn.example.com/english-video.mp4",
  "language": "en",
  "translate_to": "es",
  "style": { "font_size": 36, "color": "#FFFFFF" }
}

This is particularly useful for reaching international audiences without re-recording content. A single English video can be captioned in Spanish, Portuguese, French, German, and Japanese in one batch run.

Batch Multi-Language Pipeline

// n8n Function node: Generate render requests for multiple languages
const videoUrl = $input.first().json.video_url;
const languages = ["es", "pt", "fr", "de", "ja"];

const requests = languages.map(lang => ({
  json: {
    video_url: videoUrl,
    language: "en",
    translate_to: lang,
    style: {
      font: "Noto Sans",  // Supports all language character sets
      font_size: 38,
      color: "#FFFFFF",
      background: "rgba(0,0,0,0.7)"
    },
    output_filename: `video_captioned_${lang}.mp4`
  }
}));

return requests;

Use n8n's SplitInBatches node after this to process languages in parallel (3-5 concurrent renders is safe) without overwhelming the API.

Batch Captioning: Processing 100+ Videos

Single-video workflows are the starting point. The real value is batch processing. Here's how to caption 100+ videos in a single workflow run.

Input: Spreadsheet or Database Query

Pull video URLs from Google Sheets, Airtable, a database, or a CSV file. Each row contains:

  • Video URL or file path
  • Desired caption style
  • Target language(s)
  • Output destination

Throttling and Concurrency

Don't submit 100 render requests simultaneously. The API has rate limits, and your n8n instance has memory limits for tracking 100 concurrent Wait nodes.

The pattern:

  1. SplitInBatches node: Process 5-10 videos at a time
  2. Submit batch: Send all 5-10 to the API
  3. Wait for all callbacks: Each video gets its own Wait node within the batch
  4. Process results: Download, validate, deliver
  5. Next batch: Move to the next 5-10 videos

// n8n Function node: Batch progress tracker
const batchSize = 10;
const allVideos = $input.all();
const totalBatches = Math.ceil(allVideos.length / batchSize);

// Add batch metadata
return allVideos.map((item, index) => ({
  json: {
    ...item.json,
    batch_number: Math.floor(index / batchSize) + 1,
    total_batches: totalBatches,
    index_in_batch: index % batchSize
  }
}));

For 100 videos at 90 seconds average processing time per video, with 10 concurrent renders, the full batch completes in approximately 15 minutes. That's 100 captioned videos ready for distribution.
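The arithmetic behind that estimate is simple enough to sanity-check in a few lines. This helper is purely illustrative (not part of the workflow) and assumes each wave of concurrent renders takes roughly the per-video processing time:

```javascript
// Estimate wall-clock minutes for a batch: videos are processed in
// waves of `concurrency`, each wave taking about `secondsPerVideo`.
function estimateBatchMinutes(videoCount, concurrency, secondsPerVideo) {
  const waves = Math.ceil(videoCount / concurrency);
  return (waves * secondsPerVideo) / 60;
}

// 100 videos, 10 concurrent renders, ~90s each -> about 15 minutes
const minutes = estimateBatchMinutes(100, 10, 90);
```

Useful for setting realistic Wait node timeouts before kicking off a large batch.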

Error Recovery

In a batch of 100 videos, 2-3 might fail (network timeout, corrupt source file, unusual audio). Build retry logic into the workflow:

  1. Collect failed video IDs after each batch
  2. After all batches complete, retry failed videos once
  3. If still failing, route to a manual review queue

Don't let one failure block the entire batch. Process what succeeds, handle errors separately.
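One way to sketch that retry pass in a Function node is to partition results after each batch. The `success` flag matches the validation output above; the `retry_count` field is an assumption you'd carry through your workflow metadata:

```javascript
// Illustrative retry logic: separate failures from successes and
// re-queue each failure once before routing it to manual review.
function partitionForRetry(results, maxRetries = 1) {
  const succeeded = [];
  const retry = [];
  const manualReview = [];

  for (const r of results) {
    if (r.success) {
      succeeded.push(r);
    } else if ((r.retry_count || 0) < maxRetries) {
      // First failure: re-queue with an incremented retry counter
      retry.push({ ...r, retry_count: (r.retry_count || 0) + 1 });
    } else {
      // Already retried: hand off to a human
      manualReview.push(r);
    }
  }
  return { succeeded, retry, manualReview };
}

const { succeeded, retry, manualReview } = partitionForRetry([
  { video_id: "vid_1", success: true },
  { video_id: "vid_2", success: false },
  { video_id: "vid_3", success: false, retry_count: 1 }
]);
```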

Integrating Captions into Existing Video Pipelines

If you're already using the JSON-to-Video API to render videos, adding captions is a single additional step in your pipeline.

Pipeline Without Captions

[Template JSON] → [Render Video] → [Deliver]

Pipeline With Captions

[Template JSON] → [Render Video] → [AutoCaption] → [Deliver]

The AutoCaptions step takes the rendered video URL from the JSON-to-Video callback and submits it for captioning. The output is a new video with burned-in captions.

For workflows where you need both a captioned and uncaptioned version (e.g., YouTube gets the uncaptioned version with a separate SRT file, while Instagram gets the burned-in version), fork the pipeline after the initial render:

[Render Video] → Fork:
    → Path A: [Generate SRT] → [Upload to YouTube with SRT]
    → Path B: [Burn-in Captions] → [Upload to Instagram]

This is a common pattern for content repurposing workflows. One render, multiple outputs optimized for each platform. Check our automation guides for platform-specific workflow templates.

Platform-Specific Caption Requirements

Each social platform has different expectations for captioned video. Getting these wrong means your captions get cut off, overlap with UI elements, or look unprofessional.

| Platform | Safe Zone (caption area) | Max Video Length | Recommended Font Size | Notes |
| --- | --- | --- | --- | --- |
| TikTok | Bottom 15% is UI overlay | 10 min | 44-52px | Keep captions in center-bottom, above the description text |
| Instagram Reels | Bottom 20% is UI overlay | 90 sec | 40-48px | Username and caption text overlap bottom |
| YouTube Shorts | Bottom 15% | 60 sec | 36-44px | Subscribe button and title overlap |
| Instagram Feed | Full frame available | 60 sec | 32-40px | Smaller font OK, less UI competition |
| YouTube (landscape) | Full frame, standard | No limit | 28-36px | Use standard subtitle positioning |
| LinkedIn | Full frame | 10 min | 32-40px | Professional styling, background box recommended |

The margin_bottom and position parameters in the AutoCaptions API let you adjust for these safe zones. For TikTok, set margin_bottom: 250 to keep captions above the UI overlay. For Instagram Reels, use margin_bottom: 300.

{
  "style": {
    "position": "bottom-center",
    "margin_bottom": 250,
    "font_size": 46
  }
}

Quality Control: Reviewing Auto-Generated Captions

Automated captions are 95-97% accurate. That remaining 3-5% can include embarrassing errors - misheard brand names, incorrect numbers, or garbled technical terms.

Automated Quality Checks

Before burn-in rendering, you can retrieve the transcription text and run automated checks:

  1. Spell check: Flag words not in a dictionary or your custom word list
  2. Brand name verification: Ensure your brand and product names are transcribed correctly (add them to a custom vocabulary list)
  3. Number validation: Cross-reference transcribed numbers against expected values
  4. Profanity filter: Catch misheard words that accidentally become inappropriate

// n8n Function node: Basic caption quality check
const transcription = $input.first().json.transcription;
const brandTerms = ["SamAutomation", "AutoCaptions", "JSON-to-Video"];
const flagged = [];

// Flag brand terms that appear in a mangled form (wrong case, hyphens
// dropped) but never with their exact spelling
const normalizedText = transcription.text.toLowerCase().replace(/[^a-z0-9]/g, '');
brandTerms.forEach(term => {
  const normalizedTerm = term.toLowerCase().replace(/[^a-z0-9]/g, '');
  if (normalizedText.includes(normalizedTerm) && !transcription.text.includes(term)) {
    flagged.push(`Possible misheard brand term: "${term}"`);
  }
});

// Check confidence scores
const lowConfidence = transcription.words.filter(w => w.confidence < 0.85);
if (lowConfidence.length > 0) {
  flagged.push(`${lowConfidence.length} words with low confidence: ${
    lowConfidence.map(w => `"${w.word}" (${(w.confidence * 100).toFixed(0)}%)`).join(', ')
  }`);
}

return [{
  json: {
    passed: flagged.length === 0,
    flags: flagged,
    transcription: transcription.text
  }
}];

Human Review Queue

For high-stakes content (paid ads, client deliverables, corporate communications), add a human review step between transcription and burn-in:

  1. API generates the transcription
  2. n8n sends the transcription text to a review interface (Google Sheet, Slack message, custom dashboard)
  3. Reviewer approves or edits the text
  4. Edited transcription is sent back to the API for burn-in rendering

This hybrid approach gives you automation speed with human accuracy. The reviewer only needs 30-60 seconds per video (reading text is much faster than captioning from scratch), so you maintain most of the throughput benefit.
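To make step 2 scannable, the transcript sent to the reviewer can have low-confidence words pre-marked so they read only the risky spots. This is a sketch; the `[?word]` marker convention is just an example, not an API feature:

```javascript
// Illustrative: mark low-confidence words so a reviewer can scan
// the transcript quickly instead of proofreading every word.
function markForReview(words, threshold = 0.85) {
  return words
    .map(w => (w.confidence < threshold ? `[?${w.word}]` : w.word))
    .join(" ");
}

// Example with one low-confidence word flagged
const reviewText = markForReview([
  { word: "Today", confidence: 0.98 },
  { word: "SamAutomation", confidence: 0.72 },
  { word: "launches", confidence: 0.95 }
]);
// reviewText: "Today [?SamAutomation] launches"
```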

CapCut API vs. AutoCaptions API for Subtitles

Many people search for CapCut API subtitles when they actually need a dedicated captioning API. Here's how they compare:

CapCut is primarily a video editing tool. Its auto-caption feature is designed for interactive use within the CapCut editor. API access to CapCut's captioning is limited, undocumented, and subject to change without notice.

The AutoCaptions API is built specifically for programmatic caption generation. It's designed for automation workflows: accept a video URL, return captioned video. No editor UI, no manual steps, no undocumented endpoints.

For automated workflows in n8n or Make.com, a purpose-built captioning API saves you the headaches of reverse-engineering consumer editing tools. Check the full API documentation for endpoint specifications, rate limits, and authentication details.

Getting Started

The fastest path to automated captions:

  1. Get an API key from the AutoCaptions page
  2. Set up n8n using our n8n setup guide
  3. Test with a single video: submit it to the transcribe endpoint, review the output, then submit for burn-in rendering
  4. Build the full workflow (webhook trigger, API call, wait for callback, deliver)
  5. Process your first batch of 10 videos to validate the pipeline
  6. Scale to your full video library

The workflow template is available in our templates marketplace - import it into n8n and configure your API key. Total setup time from zero to first captioned video: about 30 minutes.
