How It All Works

Transcription to Metadata to Thumbnails Pipeline


The Pipeline Overview

1. Transcribe - multi-track audio
2. Metadata   - 70/30 weighting
3. Thumbnail  - content-aware
4. Upload     - batch to YouTube

Step 1: Transcription

  • Multi-Track Detection: Loki Studio detects up to 32 audio tracks in your video
  • Track Naming: You assign meaningful names to each track (System Audio, Microphone 1, etc.)
  • Primary Track Selection: Mark which track contains YOUR voice (radio button)

Output Files:

  • video_transcript.txt - All tracks with timestamps
  • video_captions.srt - Human voice tracks only (PRIMARY and SECONDARY, for YouTube)
  • video_metadata.json - Full JSON data
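
For reference, video_captions.srt uses the standard SubRip format: numbered cues, a start --> end timing line in HH:MM:SS,mmm form, then the caption text. The cue text below is invented for illustration:

```
1
00:00:01,000 --> 00:00:04,200
Welcome back to the channel!

2
00:00:04,500 --> 00:00:07,000
Today we're looking at multi-track audio.
```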

Step 2: Metadata Generation

  • Weighted Content Loading: AI receives 70% PRIMARY track, 30% context tracks
  • Timestamps Stripped: Clean text without [HH:MM:SS] clutter
  • Character Budget: Respects header/footer templates (max 5000 chars)
  • Focus: Description focuses on what YOU said, not just game sounds
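
The loading step above can be sketched as follows. This is a minimal illustration of the described behavior (70/30 split, timestamp stripping, 5000-character budget); the function name, parameters, and exact trimming strategy are assumptions, not Loki Studio's actual code:

```python
import re

def build_ai_context(primary_text, context_text, budget=5000,
                     header="", footer="", primary_share=0.7):
    """Blend transcripts ~70/30 in favor of the PRIMARY track,
    strip [HH:MM:SS] timestamps, and respect the character budget."""
    def strip_ts(text):
        # Remove [HH:MM:SS] timestamp markers and trailing whitespace
        return re.sub(r"\[\d{2}:\d{2}:\d{2}\]\s*", "", text)

    primary = strip_ts(primary_text)
    context = strip_ts(context_text)

    # Reserve room for the header/footer templates first
    available = budget - len(header) - len(footer)
    primary_budget = int(available * primary_share)   # ~70% for YOUR voice
    context_budget = available - primary_budget       # ~30% for context

    body = primary[:primary_budget] + "\n" + context[:context_budget]
    return (header + body + footer)[:budget]
```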

Step 3: Thumbnail Generation

  • Same Weighting: Uses 70% YOUR words, 30% context
  • Real Content: Title/subtitle reflect actual topics discussed
  • Personality System: Applies Dad Joke, Brain Rot, or other styles

Track Priority System

Track Type   Weight        Examples                                In Captions?
PRIMARY      70%           Microphone, Host, Commentary, My Voice  Yes
SECONDARY    20% (shared)  Guest 1, Guest 2, Speaker, Co-Host      Yes
CONTEXT      10%           System Audio, Game Audio, Music, SFX    No
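
The table translates into a small weighting routine like the one below. This is a sketch based on the table (note that the podcast example later in this page shows the split can differ when several guests are present), and the function shape is an assumption:

```python
def track_weights(tracks):
    """Assign content weights per the priority table:
    PRIMARY gets 70%, SECONDARY tracks share 20%, CONTEXT tracks share 10%.
    `tracks` maps track name -> role string."""
    secondary = [n for n, role in tracks.items() if role == "SECONDARY"]
    context = [n for n, role in tracks.items() if role == "CONTEXT"]
    weights = {}
    for name, role in tracks.items():
        if role == "PRIMARY":
            weights[name] = 0.70
        elif role == "SECONDARY":
            weights[name] = 0.20 / max(len(secondary), 1)  # shared slice
        else:
            weights[name] = 0.10 / max(len(context), 1)    # shared slice
    return weights
```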

Track Naming Best Practices

Track Type   Good Names                         Bad Names               Why?
Your Mic     Microphone 1, Host, Commentary     Mic, Track 1, Audio     Auto-detection works best with full words
Game Audio   System Audio, Game Audio, Desktop  Game, Sounds, Track 0   Clear context prevents mixing with voices
Guests       Guest 1, Speaker, Interview        Person, Other, Track 2  Identifies supporting speakers

Avoid Human Names for Tracks

Problem: What if you name tracks "Joe" and "Sally"?

  • Auto-detection won't recognize "Joe" or "Sally" as primary tracks
  • They may be treated as context tracks instead (bad!)
  • Solution: Manually select Primary radio button, or rename to "Joe (Host)" / "Sally (Guest)"
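
This is why "Joe (Host)" works where "Joe" fails: auto-detection presumably matches known keywords in the track name. The keyword lists and fallback below are illustrative assumptions, not the app's actual rules:

```python
# Illustrative keyword lists -- the real detector's vocabulary may differ
PRIMARY_HINTS = ("microphone", "host", "commentary", "my voice")
SECONDARY_HINTS = ("guest", "speaker", "interview", "co-host")

def guess_role(track_name):
    """Guess a track's role from its name."""
    name = track_name.lower()
    # Check SECONDARY first so "Co-Host" isn't misread as PRIMARY
    if any(h in name for h in SECONDARY_HINTS):
        return "SECONDARY"
    if any(h in name for h in PRIMARY_HINTS):
        return "PRIMARY"
    # Unrecognized names (e.g. plain "Joe") never become PRIMARY and
    # fall through to CONTEXT -- hence the renaming advice above.
    return "CONTEXT"
```

Renaming "Joe" to "Joe (Host)" makes the name match a primary keyword, so manual selection is no longer needed.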

Real-World Examples

Gaming Commentary (Good)

Track 0: "System Audio"     → Context (game sounds)  ○
Track 1: "Microphone 1"     → Primary (your voice)   ●

Result: Metadata focuses on YOUR commentary, game audio adds context

Podcast with Guests (Good)

Track 0: "System Audio"     → Context (intro music)  ○
Track 1: "Host"             → Primary (you)          ●
Track 2: "Guest 1"          → Secondary              ○
Track 3: "Guest 2"          → Secondary              ○

Result: Weighting = 50% host, 40% guests, 10% system

Poor Naming (Bad)

Track 0: "Track 1"          → Unclear                ○
Track 1: "Audio"            → Unclear                ○
Track 2: "Mic"              → Too vague              ○

Problem: Auto-detection fails, unclear weighting, poor results

Captions (SRT) Special Handling

What Goes in Captions?

  • INCLUDED: All PRIMARY and SECONDARY tracks (human voices)
  • EXCLUDED: CONTEXT tracks (game audio, music, SFX)

Note: Speaker labels are not yet supported. All human voices merge chronologically by timestamp.
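
The merge step can be sketched as: collect cues from every voice track, drop CONTEXT tracks, sort chronologically, and renumber. The data layout below is an assumption for illustration:

```python
def merge_caption_cues(tracks):
    """tracks: list of (role, cues) where each cue is (start_sec, end_sec, text).
    Keep PRIMARY/SECONDARY cues, drop CONTEXT, sort by start time."""
    merged = [cue for role, cues in tracks
              for cue in cues if role in ("PRIMARY", "SECONDARY")]
    merged.sort(key=lambda cue: cue[0])

    def fmt(t):
        # Seconds -> SRT timestamp "HH:MM:SS,mmm"
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((t - int(t)) * 1000))
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    # Emit numbered SRT cues in chronological order (no speaker labels)
    return "\n\n".join(
        f"{i}\n{fmt(a)} --> {fmt(b)}\n{text}"
        for i, (a, b, text) in enumerate(merged, start=1))
```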

Pro Tip

Good track naming is the foundation of great metadata! Take 10 seconds to name tracks properly, and the AI will generate descriptions and thumbnails that truly reflect your content.
