How to Automate Video Captions with an API and n8n Workflows

Mar 10, 2026 By smrht@icloud.com

The Bottom Line

Manual captioning costs $1-3 per video minute and takes 5-10x the video length in human labor. An automated caption pipeline using the AutoCaptions API and n8n processes a 60-second video in under 90 seconds, costs a fraction of manual work, and scales to hundreds of videos per day without additional headcount.

85% of Facebook videos are watched without sound. On Instagram, that number is 70%. TikTok's algorithm actively favors videos with captions because they increase watch time. Captions aren't optional anymore - they're a ranking factor.

This guide covers the complete pipeline: speech-to-text transcription, subtitle generation, burn-in rendering with custom styling, and batch processing at scale.

Manual vs. Automated Captioning

Before diving into the technical setup, here's why automation wins:

| Factor | Manual Captioning | Automated (API + n8n) |
| --- | --- | --- |
| Time per minute of video | 5-10 minutes | 15-30 seconds |
| Cost per video (1 min) | $1-3 (freelancer) | $0.05-0.15 (API) |
| Accuracy | 98-99% (human) | 95-97% (AI, improving) |
| Styling consistency | Varies by editor | Pixel-perfect every time |
| Scale capacity | Limited by team size | 1,000+ videos/day |
| Turnaround for 100 videos | 3-5 business days | 2-4 hours |
| Language support | Need bilingual staff | 50+ languages via API |

The accuracy gap is closing fast. Modern speech-to-text models handle accents, technical jargon, and overlapping speech better than they did even a year ago. For most content types - talking head videos, product demos, presentations - automated accuracy is indistinguishable from human work.

Where human captioning still wins: heavy background noise, multiple speakers talking simultaneously, and domain-specific terminology (medical, legal). For these edge cases, use the automated pipeline for the first pass and add a human review step.

How the AutoCaptions API Works

The AutoCaptions API handles three distinct operations that can be used independently or chained together.

1. Speech-to-Text Transcription

Upload a video (or provide a URL), and the API returns a time-stamped transcription. Each word gets a start time, end time, and confidence score.

// POST /api/autocaptions/transcribe
{
  "video_url": "https://your-cdn.com/video.mp4",
  "language": "en"
}

Response:

{
  "transcription": {
    "text": "Today we're looking at three ways to automate your video workflow.",
    "words": [
      { "word": "Today", "start": 0.0, "end": 0.35, "confidence": 0.98 },
      { "word": "we're", "start": 0.38, "end": 0.52, "confidence": 0.97 },
      { "word": "looking", "start": 0.55, "end": 0.82, "confidence": 0.99 },
      { "word": "at", "start": 0.84, "end": 0.92, "confidence": 0.99 },
      { "word": "three", "start": 0.95, "end": 1.15, "confidence": 0.96 },
      { "word": "ways", "start": 1.18, "end": 1.42, "confidence": 0.98 }
    ],
    "language_detected": "en",
    "duration": 4.8
  }
}
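The word-level timings are what make styled, word-highlighted captions possible. As a rough illustration (this helper is not part of the API response, just a sketch of how you might consume it), here's how the `words` array could be grouped into caption chunks of at most 42 characters in an n8n Function node or any JavaScript runtime:

```javascript
// Illustrative helper (not an API feature): group word timings into
// caption chunks no longer than maxChars, preserving start/end times.
function chunkWords(words, maxChars = 42) {
  const chunks = [];
  let current = null;

  for (const w of words) {
    // Start a new chunk if adding this word would exceed the limit
    if (!current || (current.text + " " + w.word).length > maxChars) {
      current = { text: w.word, start: w.start, end: w.end };
      chunks.push(current);
    } else {
      current.text += " " + w.word;
      current.end = w.end;
    }
  }
  return chunks;
}

// Example with the words from the response above
const words = [
  { word: "Today", start: 0.0, end: 0.35, confidence: 0.98 },
  { word: "we're", start: 0.38, end: 0.52, confidence: 0.97 },
  { word: "looking", start: 0.55, end: 0.82, confidence: 0.99 },
  { word: "at", start: 0.84, end: 0.92, confidence: 0.99 },
  { word: "three", start: 0.95, end: 1.15, confidence: 0.96 },
  { word: "ways", start: 1.18, end: 1.42, confidence: 0.98 }
];

const chunks = chunkWords(words, 42);
// All six words fit within 42 characters, so they form a single chunk
```

In practice the subtitle endpoint (next section) does this grouping for you; rolling your own is only necessary for fully custom rendering from the JSON word data.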

2. Subtitle File Generation

From the transcription, generate industry-standard subtitle files. Supported formats:

  • SRT - Most widely supported, works everywhere
  • VTT (WebVTT) - HTML5 video standard, supports styling
  • ASS - Advanced SubStation Alpha, full styling control
  • JSON - Raw timed data for custom rendering

// POST /api/autocaptions/subtitles
{
  "video_url": "https://your-cdn.com/video.mp4",
  "format": "srt",
  "max_chars_per_line": 42,
  "max_lines": 2
}

The API handles line breaking intelligently - it splits on natural phrase boundaries, not in the middle of words or clauses. Max characters per line and max lines per subtitle block are configurable.
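For reference, the SRT output for the example transcription might look like the following (illustrative only; the exact cue splits and timestamps depend on your `max_chars_per_line` and `max_lines` settings):

```
1
00:00:00,000 --> 00:00:01,420
Today we're looking at three ways

2
00:00:01,450 --> 00:00:04,800
to automate your video workflow.
```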

3. Burn-In Rendering

This is where it gets powerful. Instead of generating a subtitle file that viewers can toggle on/off, burn-in rendering hardcodes the captions directly into the video pixels. The output is a new video file with permanent, styled captions.

// POST /api/autocaptions/render
{
  "video_url": "https://your-cdn.com/video.mp4",
  "style": {
    "font": "Montserrat",
    "font_size": 42,
    "color": "#FFFFFF",
    "background": "rgba(0,0,0,0.7)",
    "position": "bottom-center",
    "padding": 8,
    "border_radius": 4,
    "highlight_color": "#FFD700",
    "highlight_style": "word"
  },
  "callback_url": "https://your-n8n.com/webhook/caption-done"
}

Burn-in is essential for social media platforms. Instagram Reels, TikTok, and YouTube Shorts don't support external subtitle files. The captions must be part of the video itself.

Caption Styling Options

Generic white-text-on-black captions look dated. Modern caption styles are a design element that matches your brand. Here are the main styles the API supports.

Word-by-Word Highlight (TikTok Style)

The most engaging caption style in 2026. Each word highlights as it's spoken, creating a karaoke-like effect that keeps viewers reading.

{
  "highlight_style": "word",
  "highlight_color": "#FFD700",
  "color": "#FFFFFF",
  "font": "Montserrat Bold",
  "font_size": 48,
  "background": "none",
  "text_shadow": "2px 2px 4px rgba(0,0,0,0.8)",
  "position": "center"
}

This style works best for short-form content (under 60 seconds) where the text is large and centered. It's the default on TikTok and Instagram Reels for good reason - it increases average watch time by 12-15% compared to no captions.

Full Sentence with Background Box

The classic approach. A colored box behind the text ensures readability over any video content.

{
  "highlight_style": "none",
  "color": "#FFFFFF",
  "font": "Inter",
  "font_size": 36,
  "background": "rgba(0,0,0,0.75)",
  "padding": 12,
  "border_radius": 8,
  "position": "bottom-center",
  "margin_bottom": 80
}

Best for longer content, tutorials, and presentations where readability matters more than visual flair. The background box guarantees contrast regardless of what's happening in the video behind it.

Animated Text Reveal

Words or phrases animate onto the screen - fade in, slide up, pop in. Each animation style changes the feel of the captions.

{
  "animation": "slideUp",
  "animation_duration": 0.3,
  "color": "#FFFFFF",
  "font": "Poppins SemiBold",
  "font_size": 44,
  "background": "none",
  "text_shadow": "1px 1px 3px rgba(0,0,0,0.9)",
  "position": "bottom-center"
}

Available animations: fadeIn, slideUp, slideDown, popIn, typewriter, bounceIn. The typewriter effect works particularly well for dramatic or narrative content.

Custom Fonts and Brand Colors

Upload your brand font (TTF/OTF) and the API uses it for rendering. Combined with your brand colors, the captions become part of your visual identity rather than a generic overlay.

{
  "font": "custom",
  "font_url": "https://your-cdn.com/fonts/YourBrandFont-Bold.ttf",
  "color": "#E63946",
  "highlight_color": "#F1FAEE",
  "font_size": 40
}

Step-by-Step n8n Workflow

Here's the complete n8n workflow for automated captioning. This handles single videos and batch processing.

Workflow Structure

[Webhook: Receive video URL]
    → [HTTP Request: Send to AutoCaptions API]
    → [Wait: For render callback]
    → [Function: Validate output]
    → [IF: Quality check passed?]
        → Yes: [HTTP Request: Deliver to destination]
        → No: [HTTP Request: Flag for human review]
    → [HTTP Request: Log to analytics]

Node 1: Webhook Trigger

Set up a Webhook node that accepts POST requests with a video URL and desired caption style.

// Expected incoming payload
{
  "video_url": "https://your-cdn.com/raw-video.mp4",
  "style_preset": "tiktok",
  "language": "en",
  "destination": "s3://your-bucket/captioned/",
  "callback_meta": {
    "project_id": "proj_123",
    "video_id": "vid_456"
  }
}

Node 2: Function Node - Build API Request

Map the style preset to actual styling parameters:

const input = $input.first().json;

const stylePresets = {
  tiktok: {
    font: "Montserrat Bold",
    font_size: 48,
    color: "#FFFFFF",
    highlight_style: "word",
    highlight_color: "#FFD700",
    background: "none",
    text_shadow: "2px 2px 4px rgba(0,0,0,0.8)",
    position: "center"
  },
  youtube: {
    font: "Inter",
    font_size: 36,
    color: "#FFFFFF",
    highlight_style: "none",
    background: "rgba(0,0,0,0.75)",
    padding: 12,
    border_radius: 8,
    position: "bottom-center",
    margin_bottom: 60
  },
  instagram: {
    font: "Poppins SemiBold",
    font_size: 44,
    color: "#FFFFFF",
    highlight_style: "word",
    highlight_color: "#FF6B6B",
    background: "none",
    text_shadow: "1px 1px 3px rgba(0,0,0,0.9)",
    position: "center",
    animation: "popIn"
  },
  professional: {
    font: "Inter",
    font_size: 32,
    color: "#FFFFFF",
    highlight_style: "none",
    background: "rgba(0,0,0,0.6)",
    padding: 8,
    position: "bottom-center",
    margin_bottom: 40
  }
};

const style = stylePresets[input.style_preset] || stylePresets.tiktok;

return [{
  json: {
    video_url: input.video_url,
    style: style,
    language: input.language || "en",
    callback_url: `${$env.WEBHOOK_URL}/webhook/caption-callback`,
    callback_meta: input.callback_meta
  }
}];

Node 3: HTTP Request - Submit to AutoCaptions API

POST the request to the AutoCaptions API render endpoint. Include your API key in the header.

  • Method: POST
  • URL: https://api.samautomation.com/v1/autocaptions/render
  • Headers: Authorization: Bearer {{$credentials.samautomation_api_key}}
  • Body: JSON from the previous Function node

Node 4: Wait Node

Configure the Wait node to resume when a webhook callback is received. Set the webhook path to /webhook/caption-callback and a timeout of 10 minutes (captioning a 5-minute video takes roughly 2-4 minutes).

Node 5: Function Node - Validate Output

Check the callback response for errors and validate the output URL:

const result = $input.first().json;

if (result.status !== "completed") {
  return [{
    json: {
      success: false,
      error: result.error || "Unknown rendering error",
      video_id: result.callback_meta?.video_id
    }
  }];
}

return [{
  json: {
    success: true,
    captioned_video_url: result.output_url,
    duration: result.duration,
    word_count: result.word_count,
    video_id: result.callback_meta?.video_id,
    destination: result.callback_meta?.destination
  }
}];

Node 6: IF Node - Quality Gate

Route based on the success field. Successful renders continue to delivery. Failed renders go to a notification node that alerts your team via Slack, email, or however you prefer.
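A small Function node on the failure branch can shape the alert before the HTTP Request node sends it. This is a sketch, not a prescribed format: the Slack-style `text` payload and field names here are assumptions you'd adapt to your own notification channel.

```javascript
// Illustrative helper: format a Slack-style alert for a failed render.
// Send the returned JSON to a Slack Incoming Webhook (or any channel)
// via an HTTP Request node on the IF node's "false" branch.
function buildFailureAlert(result) {
  return {
    text: [
      ":warning: Caption render failed",
      `Video: ${result.video_id || "unknown"}`,
      `Error: ${result.error || "no error message returned"}`
    ].join("\n")
  };
}

// In an n8n Function node you'd wire it up as:
// return [{ json: buildFailureAlert($input.first().json) }];

const alert = buildFailureAlert({
  video_id: "vid_456",
  error: "source file unreadable"
});
```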

Multi-Language Caption Support

The AutoCaptions API supports 50+ languages for transcription and can generate captions in a different language than the spoken audio. This opens up two use cases.

Same-Language Captions

The straightforward case. English video gets English captions, Spanish video gets Spanish captions. Just set the language parameter to match the spoken language.

{ "language": "es" }

For auto-detection, omit the language parameter. The API detects the spoken language from the first 30 seconds of audio and uses it for the full transcription.

Cross-Language Translation

Render captions in a different language than what's spoken. The API transcribes the original audio, translates the transcription, and generates timed captions in the target language.

{
  "video_url": "https://cdn.example.com/english-video.mp4",
  "language": "en",
  "translate_to": "es",
  "style": { "font_size": 36, "color": "#FFFFFF" }
}

This is particularly useful for reaching international audiences without re-recording content. A single English video can be captioned in Spanish, Portuguese, French, German, and Japanese in one batch run.

Batch Multi-Language Pipeline

// n8n Function node: Generate render requests for multiple languages
const videoUrl = $input.first().json.video_url;
const languages = ["es", "pt", "fr", "de", "ja"];

const requests = languages.map(lang => ({
  json: {
    video_url: videoUrl,
    language: "en",
    translate_to: lang,
    style: {
      font: "Noto Sans",  // Supports all language character sets
      font_size: 38,
      color: "#FFFFFF",
      background: "rgba(0,0,0,0.7)"
    },
    output_filename: `video_captioned_${lang}.mp4`
  }
}));

return requests;

Use n8n's SplitInBatches node after this to process languages in parallel (3-5 concurrent renders is safe) without overwhelming the API.

Batch Captioning: Processing 100+ Videos

Single-video workflows are the starting point. The real value is batch processing. Here's how to caption 100+ videos in a single workflow run.

Input: Spreadsheet or Database Query

Pull video URLs from Google Sheets, Airtable, a database, or a CSV file. Each row contains:

  • Video URL or file path
  • Desired caption style
  • Target language(s)
  • Output destination

Throttling and Concurrency

Don't submit 100 render requests simultaneously. The API has rate limits, and your n8n instance has memory limits for tracking 100 concurrent Wait nodes.

The pattern:

  1. SplitInBatches node: Process 5-10 videos at a time
  2. Submit batch: Send all 5-10 to the API
  3. Wait for all callbacks: Each video gets its own Wait node within the batch
  4. Process results: Download, validate, deliver
  5. Next batch: Move to the next 5-10 videos

// n8n Function node: Batch progress tracker
const batchSize = 10;
const allVideos = $input.all();
const totalBatches = Math.ceil(allVideos.length / batchSize);

// Add batch metadata
return allVideos.map((item, index) => ({
  json: {
    ...item.json,
    batch_number: Math.floor(index / batchSize) + 1,
    total_batches: totalBatches,
    index_in_batch: index % batchSize
  }
}));

For 100 videos at 90 seconds average processing time per video, with 10 concurrent renders, the full batch completes in approximately 15 minutes. That's 100 captioned videos ready for distribution.
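The arithmetic behind that estimate is simple enough to sanity-check in a few lines. This helper is purely illustrative (not part of the workflow) and assumes each wave of concurrent renders takes roughly the per-video processing time:

```javascript
// Estimate wall-clock minutes for a batch: videos are processed in
// waves of `concurrency`, each wave taking about `secondsPerVideo`.
function estimateBatchMinutes(videoCount, concurrency, secondsPerVideo) {
  const waves = Math.ceil(videoCount / concurrency);
  return (waves * secondsPerVideo) / 60;
}

// 100 videos, 10 concurrent renders, ~90s each -> about 15 minutes
const minutes = estimateBatchMinutes(100, 10, 90);
```

Useful for setting realistic Wait node timeouts before kicking off a large batch.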

Error Recovery

In a batch of 100 videos, 2-3 might fail (network timeout, corrupt source file, unusual audio). Build retry logic into the workflow:

  1. Collect failed video IDs after each batch
  2. After all batches complete, retry failed videos once
  3. If still failing, route to a manual review queue

Don't let one failure block the entire batch. Process what succeeds, handle errors separately.
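One way to sketch that retry pass in a Function node is to partition results after each batch. The `success` flag matches the validation output above; the `retry_count` field is an assumption you'd carry through your workflow metadata:

```javascript
// Illustrative retry logic: separate failures from successes and
// re-queue each failure once before routing it to manual review.
function partitionForRetry(results, maxRetries = 1) {
  const succeeded = [];
  const retry = [];
  const manualReview = [];

  for (const r of results) {
    if (r.success) {
      succeeded.push(r);
    } else if ((r.retry_count || 0) < maxRetries) {
      // First failure: re-queue with an incremented retry counter
      retry.push({ ...r, retry_count: (r.retry_count || 0) + 1 });
    } else {
      // Already retried: hand off to a human
      manualReview.push(r);
    }
  }
  return { succeeded, retry, manualReview };
}

const { succeeded, retry, manualReview } = partitionForRetry([
  { video_id: "vid_1", success: true },
  { video_id: "vid_2", success: false },
  { video_id: "vid_3", success: false, retry_count: 1 }
]);
```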

Integrating Captions into Existing Video Pipelines

If you're already using the JSON-to-Video API to render videos, adding captions is a single additional step in your pipeline.

Pipeline Without Captions

[Template JSON] → [Render Video] → [Deliver]

Pipeline With Captions

[Template JSON] → [Render Video] → [AutoCaption] → [Deliver]

The AutoCaptions step takes the rendered video URL from the JSON-to-Video callback and submits it for captioning. The output is a new video with burned-in captions.

For workflows where you need both a captioned and uncaptioned version (e.g., YouTube gets the uncaptioned version with a separate SRT file, while Instagram gets the burned-in version), fork the pipeline after the initial render:

[Render Video] → Fork:
    → Path A: [Generate SRT] → [Upload to YouTube with SRT]
    → Path B: [Burn-in Captions] → [Upload to Instagram]

This is a common pattern for content repurposing workflows. One render, multiple outputs optimized for each platform. Check our automation guides for platform-specific workflow templates.

Platform-Specific Caption Requirements

Each social platform has different expectations for captioned video. Getting these wrong means your captions get cut off, overlap with UI elements, or look unprofessional.

| Platform | Safe Zone (caption area) | Max Video Length | Recommended Font Size | Notes |
| --- | --- | --- | --- | --- |
| TikTok | Bottom 15% is UI overlay | 10 min | 44-52px | Keep captions in center-bottom, above the description text |
| Instagram Reels | Bottom 20% is UI overlay | 90 sec | 40-48px | Username and caption text overlap bottom |
| YouTube Shorts | Bottom 15% | 60 sec | 36-44px | Subscribe button and title overlap |
| Instagram Feed | Full frame available | 60 sec | 32-40px | Smaller font OK, less UI competition |
| YouTube (landscape) | Full frame, standard | No limit | 28-36px | Use standard subtitle positioning |
| LinkedIn | Full frame | 10 min | 32-40px | Professional styling, background box recommended |

The margin_bottom and position parameters in the AutoCaptions API let you adjust for these safe zones. For TikTok, set margin_bottom: 250 to keep captions above the UI overlay. For Instagram Reels, use margin_bottom: 300.

{
  "style": {
    "position": "bottom-center",
    "margin_bottom": 250,
    "font_size": 46
  }
}

Quality Control: Reviewing Auto-Generated Captions

Automated captions are 95-97% accurate. That remaining 3-5% can include embarrassing errors - misheard brand names, incorrect numbers, or garbled technical terms.

Automated Quality Checks

Before burn-in rendering, you can retrieve the transcription text and run automated checks:

  1. Spell check: Flag words not in a dictionary or your custom word list
  2. Brand name verification: Ensure your brand and product names are transcribed correctly (add them to a custom vocabulary list)
  3. Number validation: Cross-reference transcribed numbers against expected values
  4. Profanity filter: Catch misheard words that accidentally become inappropriate

// n8n Function node: Basic caption quality check
const transcription = $input.first().json.transcription;
const brandTerms = ["SamAutomation", "AutoCaptions", "JSON-to-Video"];
const flagged = [];

// Flag brand terms that appear in a mangled form (wrong case, hyphens
// dropped) but never with their exact spelling
const normalizedText = transcription.text.toLowerCase().replace(/[^a-z0-9]/g, '');
brandTerms.forEach(term => {
  const normalizedTerm = term.toLowerCase().replace(/[^a-z0-9]/g, '');
  if (normalizedText.includes(normalizedTerm) && !transcription.text.includes(term)) {
    flagged.push(`Possible misheard brand term: "${term}"`);
  }
});

// Check confidence scores
const lowConfidence = transcription.words.filter(w => w.confidence < 0.85);
if (lowConfidence.length > 0) {
  flagged.push(`${lowConfidence.length} words with low confidence: ${
    lowConfidence.map(w => `"${w.word}" (${(w.confidence * 100).toFixed(0)}%)`).join(', ')
  }`);
}

return [{
  json: {
    passed: flagged.length === 0,
    flags: flagged,
    transcription: transcription.text
  }
}];

Human Review Queue

For high-stakes content (paid ads, client deliverables, corporate communications), add a human review step between transcription and burn-in:

  1. API generates the transcription
  2. n8n sends the transcription text to a review interface (Google Sheet, Slack message, custom dashboard)
  3. Reviewer approves or edits the text
  4. Edited transcription is sent back to the API for burn-in rendering

This hybrid approach gives you automation speed with human accuracy. The reviewer only needs 30-60 seconds per video (reading text is much faster than captioning from scratch), so you maintain most of the throughput benefit.
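To make step 2 scannable, the transcript sent to the reviewer can have low-confidence words pre-marked so they read only the risky spots. This is a sketch; the `[?word]` marker convention is just an example, not an API feature:

```javascript
// Illustrative: mark low-confidence words so a reviewer can scan
// the transcript quickly instead of proofreading every word.
function markForReview(words, threshold = 0.85) {
  return words
    .map(w => (w.confidence < threshold ? `[?${w.word}]` : w.word))
    .join(" ");
}

// Example with one low-confidence word flagged
const reviewText = markForReview([
  { word: "Today", confidence: 0.98 },
  { word: "SamAutomation", confidence: 0.72 },
  { word: "launches", confidence: 0.95 }
]);
// reviewText: "Today [?SamAutomation] launches"
```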

CapCut API vs. AutoCaptions API for Subtitles

Many people search for CapCut API subtitles when they actually need a dedicated captioning API. Here's how they compare:

CapCut is primarily a video editing tool. Its auto-caption feature is designed for interactive use within the CapCut editor. API access to CapCut's captioning is limited, undocumented, and subject to change without notice.

The AutoCaptions API is built specifically for programmatic caption generation. It's designed for automation workflows: accept a video URL, return captioned video. No editor UI, no manual steps, no undocumented endpoints.

For automated workflows in n8n or Make.com, a purpose-built captioning API saves you the headaches of reverse-engineering consumer editing tools. Check the full API documentation for endpoint specifications, rate limits, and authentication details.

Getting Started

The fastest path to automated captions:

  1. Get an API key from the AutoCaptions page
  2. Set up n8n using our n8n setup guide
  3. Test with a single video: submit it to the transcribe endpoint, review the output, then submit for burn-in rendering
  4. Build the full workflow (webhook trigger, API call, wait for callback, deliver)
  5. Process your first batch of 10 videos to validate the pipeline
  6. Scale to your full video library

The workflow template is available in our templates marketplace - import it into n8n and configure your API key. Total setup time from zero to first captioned video: about 30 minutes.
