Everything under the hood, for those who want to know.
| Component | Requirement |
|---|---|
| Operating System | Windows 10/11 (64-bit) |
| Processor | Intel Core i5 / AMD Ryzen 5 (minimum) · Intel Core i7 / AMD Ryzen 7 (recommended) |
| RAM | 8 GB (minimum) · 16 GB (recommended for large models) |
| Storage | 2 GB for application · 10-15 GB for AI models |
| GPU (for fast transcription) | NVIDIA GTX 1060 6 GB or better · CUDA 11.8+ supported |
| Display | 1920×1080 minimum · 4K supported |
| Internet | Required for: LLM API calls (Claude/GPT), YouTube upload, model downloads |
Note: GPU is optional but strongly recommended. CPU transcription works but is 10-20x slower.
| GPU | VRAM | Max Model | Performance |
|---|---|---|---|
| RTX 4090 | 24 GB | large-v3 | Excellent |
| RTX 4080 / 3090 | 16-24 GB | large-v3 | Excellent |
| RTX 4070 / 3080 | 10-12 GB | large-v3-turbo | Excellent |
| RTX 3070 / 4060 | 8 GB | medium / turbo | Very Good |
| RTX 3060 / 2070 | 6-8 GB | medium | Good |
| GTX 1660 / 1060 | 6 GB | small / base | Usable |
| CPU Only | — | Any (slow) | Works |
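Not sure which row applies to your machine? Any tool that reports the GPU model and VRAM will do; here is a minimal sketch that queries the NVIDIA driver's nvidia-smi utility (a generic system check, not something Loki Studio runs):

```python
import subprocess

# Generic system check (not part of Loki Studio): ask nvidia-smi for the GPU
# name and total VRAM so you can find your row in the table above.
try:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.strip().splitlines():
        name, vram = (part.strip() for part in line.split(","))
        print(f"GPU: {name}, VRAM: {vram}")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("No NVIDIA GPU/driver detected; transcription will fall back to CPU.")
```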
Loki Studio uses OpenAI's Whisper model via the faster-whisper implementation (CTranslate2). Models are downloaded once and run entirely on your machine.
| Model | Size | VRAM | Speed vs Quality |
|---|---|---|---|
| large-v3-turbo (recommended) | 1.6 GB | ~6 GB | Best balance — near large-v3 quality at 4x speed |
| large-v3 | 3.1 GB | ~10 GB | Maximum accuracy, slower |
| medium | 1.5 GB | ~5 GB | Good accuracy, fast |
| small | 488 MB | ~3 GB | Faster, reduced accuracy |
| base | 147 MB | ~2 GB | Fastest, basic accuracy |
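The model names above map directly onto faster-whisper checkpoints. As a rough illustration of what happens under the hood, a minimal faster-whisper call might look like this (the model name, device, and file name are assumptions for the example, not Loki Studio's exact configuration):

```python
from faster_whisper import WhisperModel

# Sketch of a faster-whisper (CTranslate2) transcription pass.
# "large-v3-turbo" and float16 are illustrative choices, not Loki Studio's settings.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")

segments, info = model.transcribe(
    "my_video.wav",          # hypothetical input file
    word_timestamps=True,    # word-level timing, as used for captions
    vad_filter=True,         # Silero VAD pre-filtering of silence
)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
```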
| Aspect | Details | Notes |
|---|---|---|
| Model | NLLB-200 (Meta's neural machine translation) | 34+ languages, word-level timing preserved |
| Model Size | ~3 GB (downloaded on first use) | Runs locally on GPU/CPU |
| Languages | European, Asian, Middle Eastern, Cyrillic | Any-to-any translation supported |
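For reference, NLLB-200 is openly available through Hugging Face Transformers, so the kind of call involved looks roughly like this (the distilled 600M checkpoint and the language codes are illustrative choices, not Loki Studio's internal setup):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Sketch of an NLLB-200 translation call via Hugging Face Transformers.
# Checkpoint and language codes are illustrative; Loki Studio ships its own model setup.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("Welcome to the channel!", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    # NLLB selects the target language by forcing its code as the first generated token.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("jpn_Jpan"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```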
| Feature | Description |
|---|---|
| Character-Level Timing | Distributes characters evenly across segment duration (no word boundaries in CJK); see the sketch below the table |
| Auto-Detection | Recognizes CJK content via language setting OR Unicode analysis |
| All Caption Styles | OneWordPop (2 chars), FullSegment (character highlight), WindowSlide (character scroll) |
| ASS Export | Character-level timing preserved in subtitle export |
| Languages | Chinese (Simplified/Traditional), Japanese, Korean |
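The "distributes characters evenly" rule from the first row is simple enough to show directly. A minimal sketch (illustrative only, not Loki Studio's internal code):

```python
def character_timings(text: str, start: float, end: float) -> list[tuple[str, float, float]]:
    """Spread per-character timestamps evenly across a segment's duration.

    CJK text has no word boundaries to anchor on, so each character gets an
    equal slice of the segment. Illustrative sketch, not Loki Studio's code.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return []
    step = (end - start) / len(chars)
    return [(c, start + i * step, start + (i + 1) * step) for i, c in enumerate(chars)]

# Example: a 2-second Japanese segment starting at 12.0 s.
for char, t0, t1 in character_timings("こんにちは世界", 12.0, 14.0):
    print(f"{char}: {t0:.2f}s -> {t1:.2f}s")
```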
Loki Studio generates titles, descriptions, tags, and chapters using your choice of LLM. You provide your own API key — no middleman markup.
| Provider | Models | Notes |
|---|---|---|
| Anthropic Claude (recommended) | Claude 3.5 Sonnet, Claude 3 Opus | Best creative writing quality |
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4-turbo | Reliable, fast |
| Ollama | Llama 3, Mistral, custom models | Free, runs locally, requires setup |
| LM Studio | Any GGUF model | Free, runs locally, requires setup |
| Grimnir (Built-in) | Bundled llama.cpp models | No setup required, moderate quality |
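All of the hosted providers above are reached over plain HTTP, and the local options (Ollama, LM Studio) expose OpenAI-compatible endpoints, so a metadata request comes down to a single chat-completion call. A rough sketch in that OpenAI-compatible style (the endpoint, model name, and prompt are placeholders, not Loki Studio's actual configuration or prompts):

```python
import json
import urllib.request

# Sketch of a chat-completion request in the OpenAI-compatible style that
# OpenAI, Ollama, and LM Studio all accept. Endpoint, model, and prompt are
# placeholders; hosted providers also require an API key header.
ENDPOINT = "http://localhost:11434/v1/chat/completions"  # e.g. a local Ollama server
payload = {
    "model": "llama3",
    "messages": [
        {"role": "system", "content": "You write YouTube titles, descriptions, and tags."},
        {"role": "user", "content": "Transcript excerpt: ...\n\nSuggest a title and 10 tags."},
    ],
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    reply = json.loads(response.read())
print(reply["choices"][0]["message"]["content"])
```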
| Category | Formats | Notes |
|---|---|---|
| Video Input | MP4, MKV, MOV, AVI, WebM, WMV, FLV, M4V, TS, MTS | Any resolution up to 8K supported |
| Audio Input | MP3, WAV, FLAC, AAC, OGG, M4A, WMA | Multi-track audio from OBS/recording software |
| Video Export | MP4 (H.264/H.265), WebM (VP9) | Stream copy when possible for fast export (see sketch below) |
| Subtitle Export | SRT, ASS/SSA, VTT, JSON | Word-level timestamps, CJK character-level timing |
| Thumbnail Export | PNG, JPG, WebP | 1280×720 default, customizable |
| Project Files | JSON-based project format, EDL import/export | Human-readable, version control friendly |
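"Stream copy" means the already-encoded audio and video bitstreams are passed through untouched rather than re-encoded, so exports run at roughly disk speed. For reference, the equivalent standalone FFmpeg command looks something like this (file names and cut points are placeholders, and this is not Loki Studio's internal invocation):

```python
import subprocess

# Illustrative FFmpeg stream-copy trim: copies the existing video/audio streams
# without re-encoding, so the export finishes at roughly disk speed.
# File names and cut points are placeholders.
subprocess.run(
    [
        "ffmpeg",
        "-ss", "00:01:30",   # start of the kept range
        "-to", "00:12:00",   # end of the kept range
        "-i", "input.mp4",
        "-c", "copy",        # stream copy: no re-encode
        "output.mp4",
    ],
    check=True,
)
```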
| Feature | Details |
|---|---|
| Multi-Track Support | Separate audio tracks from OBS, Streamlabs, etc. are processed independently |
| VAD (Voice Activity Detection) | Silero VAD v5.1 (ONNX) — filters silence before transcription |
| LUFS Analysis | ITU-R BS.1770-4 compliant loudness measurement |
| Waveform Generation | SIMD-optimized (AVX2) waveform visualization |
| Sample Rates | 8 kHz to 192 kHz (internally resampled to 16 kHz for Whisper) |
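For context, the same ITU-R BS.1770 integrated-loudness (LUFS) figure can be reproduced with off-the-shelf Python tooling. A sketch using the pyloudnorm and soundfile packages (illustrative only; Loki Studio uses its own analysis engine):

```python
import soundfile as sf
import pyloudnorm as pyln

# Sketch of an ITU-R BS.1770 integrated-loudness (LUFS) measurement using
# pyloudnorm. Illustrative only, not Loki Studio's internal engine.
data, rate = sf.read("narration.wav")   # hypothetical input file
meter = pyln.Meter(rate)                # K-weighted BS.1770 meter
loudness = meter.integrated_loudness(data)
print(f"Integrated loudness: {loudness:.1f} LUFS")
```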
| Feature | Details |
|---|---|
| API Version | YouTube Data API v3 |
| Authentication | OAuth 2.0 (you authorize Loki Studio to upload on your behalf) |
| Upload Features | Video, thumbnail, title, description, tags, category, playlists, scheduling |
| Privacy Settings | Public, Unlisted, Private, Scheduled |
| Multiple Channels | Switch between authenticated channels |
| Upload Queue | Batch uploads with progress tracking and retry on failure |
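Under the hood this is the standard videos.insert resumable upload from the YouTube Data API v3. A stripped-down sketch using Google's official Python client (OAuth credential handling is omitted and the metadata values are placeholders; Loki Studio's own upload queue adds batching, retries, and progress tracking on top):

```python
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# Sketch of a YouTube Data API v3 resumable upload (videos.insert).
# `credentials` must be OAuth 2.0 user credentials obtained beforehand;
# the metadata values are placeholders, not Loki Studio's generated output.
def upload_video(credentials, path: str):
    youtube = build("youtube", "v3", credentials=credentials)
    request = youtube.videos().insert(
        part="snippet,status",
        body={
            "snippet": {"title": "My video", "description": "...", "tags": ["demo"]},
            "status": {"privacyStatus": "unlisted"},
        },
        media_body=MediaFileUpload(path, chunksize=-1, resumable=True),
    )
    response = None
    while response is None:
        # next_chunk() uploads one chunk and reports progress until done.
        status, response = request.next_chunk()
        if status:
            print(f"Uploaded {int(status.progress() * 100)}%")
    print(f"Video id: {response['id']}")
```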
Tested on an RTX 4070 with the large-v3-turbo model. Your results may vary based on hardware.
| Operation | 30-min Video | 1-hour Video |
|---|---|---|
| Transcription (GPU) | ~2-3 minutes | ~5-6 minutes |
| Transcription (CPU) | ~30-45 minutes | ~60-90 minutes |
| Metadata Generation | ~15-30 seconds | ~20-40 seconds |
| Thumbnail Frame Extraction | ~5 seconds | ~8 seconds |
| Video Metadata Read | <5 ms | <5 ms |
| Waveform Generation | ~2-3 seconds | ~4-5 seconds |
| Component | Technology |
|---|---|
| Application Framework | Qt 6.10 (C++17, QML) |
| Transcription Engine | CTranslate2 (faster-whisper), CUDA 11.8+ |
| Video Playback | GStreamer 1.0 MSVC |
| Video Processing | FFmpeg (statically linked in isolated DLLs) |
| Image Compositing | Custom AVX2-optimized engine |
| Local LLM | llama.cpp (optional Skuld module) |
| Model Inference | ONNX Runtime 1.19+ |
| Networking | Qt Network, OpenSSL 3.x |
| Aspect | Policy |
|---|---|
| Telemetry | None. Zero data collection. |
| Local Processing | Transcription, thumbnail creation, video editing — all local |
| Cloud Connections | Only when you request: LLM API calls, YouTube uploads, model downloads |
| API Keys | Stored locally, encrypted, never transmitted except to the provider you chose |
| Videos | Never uploaded anywhere except YouTube when you click Upload |
Not sure if your system can run Loki Studio? Just ask.
Email Craig and you'll get an honest answer about whether Loki Studio is right for your setup.