Video Generation & Image-to-Video: Sora 2, Veo 3.1, Runway, Open-Source
A practical guide to AI video generation in 2026: closed APIs (Sora 2/Pro, Veo 3.1, Runway Gen-4, Kling 2.5, Hailuo 02, Pika 2.0), open-source models (Wan 2.5, HunyuanVideo, LTX-Video, Mochi 1, CogVideoX-5B), image-to-video with motion brushes and character consistency, self-hosting on RTX 5090/4090, pricing comparisons, and code examples with the Runway SDK and ComfyUI.
By Jose Nobile | Updated 2026-04-27 | 26 min read
1. Sora 2 / Sora 2 Pro (OpenAI)
Sora 2 (announced September 30, 2025) is OpenAI's flagship text-to-video and image-to-video model. The headline upgrade over Sora 1 is synchronized audio: dialog, sound effects, ambience, and music are generated together with the video, eliminating the need for a separate audio pass. Sora 2 produces up to 20-second clips at 1080p in the standard tier; Sora 2 Pro adds higher fidelity and longer durations with cinematic motion and improved physics.
Access is via the Sora app (sora.com), the OpenAI API (/v1/videos endpoint), and the ChatGPT Pro/Plus tiers. Pricing is per second of generated video: roughly $0.10/sec for Sora 2 and $0.30–$0.50/sec for Sora 2 Pro at 1080p. Generation time is 30–120 seconds for typical prompts. The model accepts text prompts, reference images, and reference videos for style transfer and continuation.
Sora 2's advantage is world simulation: it understands rigid-body physics, fluids, fabric, and human motion better than prior generations. Failure modes have shifted from "mangled hands" to subtler issues like inconsistent shadows or character drift across longer clips. Watermarking via C2PA provenance metadata is enforced on all outputs.
Native Synchronized Audio
First major model with integrated dialog, foley, ambient sound, and music in a single pass. Lip-sync to generated speech, footsteps that match terrain, and Doppler-correct passing vehicles. No separate TTS or sound-design step.
Up to 1080p, 20s
Sora 2 standard outputs 720p/1080p at up to 20 seconds. Sora 2 Pro pushes higher fidelity, sharper detail and 30-60s segments suitable for short-form ads, film previs, and stylized music video sequences.
~$0.10–$0.50 / second
API pricing: Sora 2 ~$0.10/s at 720p, ~$0.15/s at 1080p. Sora 2 Pro ~$0.30–$0.50/s at 1080p. ChatGPT Pro includes a monthly generation quota; pay-per-use beyond that. Caching does not apply to video generation.
Improved World Simulation
Better handling of rigid bodies, gravity, collisions, and fluid dynamics. Multi-shot continuity (cuts and camera changes) is more reliable. Still struggles with very fine-grained hand-object interaction and consistent text rendering.
OpenAI Videos API
REST endpoint POST /v1/videos accepts a prompt, an optional reference image (input_reference), a duration (seconds), and a resolution/aspect ratio (size). It returns a job ID; poll GET /v1/videos/{id} until status=completed, then download the MP4 (audio track included) from /v1/videos/{id}/content. Async only.
C2PA Watermarking
All outputs carry C2PA cryptographic provenance metadata identifying them as AI-generated. Visible watermark on free tier; metadata-only on paid tiers. Likeness protection blocks named public figures unless explicitly licensed.
2. Google Veo 3.1 (Vertex AI)
Veo 3.1 is Google DeepMind's flagship video model, accessed via Vertex AI, the Gemini API, the consumer Gemini app, and embedded in Google Vids and YouTube Shorts. Like Sora 2, Veo 3.1 generates native audio — speech, music, and ambient sound — in a single pass. Veo 3.1 produces 1080p clips up to 8 seconds (extendable via "Extend video" iterations to 30s+).
Vertex AI pricing: $0.40/sec for Veo 3.1 with audio, $0.20/sec for Veo 3.1 Fast. Generation latency is 60–180 seconds. Veo's standout feature is "Ingredients" mode: provide multiple reference images (character, object, scene) and Veo composes them into a coherent shot — ideal for advertising and product visualization where character/product consistency matters.
Veo also exposes camera controls (pan, tilt, dolly, orbit) as explicit prompt parameters, and "Frames to Video" which interpolates between two keyframes to produce a smooth shot — powerful for storyboard-driven workflows. SynthID watermarks are embedded invisibly in every frame and audio track.
Native Audio Generation
Synchronized dialog, foley, music, and ambience generated jointly with the visuals. Dialog can be specified verbatim in the prompt with quoted speaker tags. Multi-language speech supported across 30+ languages.
Multi-Reference "Ingredients"
Pass up to 3 reference images (character, product, environment) and Veo blends them into a single coherent shot. Critical for brand-consistent advertising where the actor, product, and location must all match supplied references.
Camera Controls
Explicit camera directives in the prompt: pan-left, dolly-in, orbit-right, crane-up, handheld. Interprets cinematic language ("Dutch angle", "rack focus") more reliably than competitors. 16:9, 9:16, and 1:1 aspect ratios.
Frames-to-Video Interpolation
Provide a start frame and an end frame; Veo generates the motion between them. Enables storyboard-driven production: artists draw key beats, Veo fills the in-betweens. Reduces the unpredictability of pure text-to-video.
Vertex AI API
Long-running operation pattern: POST to publishers/google/models/veo-3.1-generate-preview:predictLongRunning, poll the operation, then fetch the MP4 (from GCS when an output bucket is configured). IAM-controlled, VPC-SC compatible, and auditable, so it suits regulated enterprises.
SynthID Watermarking
Invisible watermark embedded in every video frame and audio sample. Survives compression, cropping, and re-encoding. Detectable via Google's SynthID Detector. Mandatory on all Veo outputs across consumer and API surfaces.
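The prompt conventions described in this section (camera directives written into the prompt, dialog quoted verbatim with a speaker tag) lend themselves to a small template helper. The sketch below is a hypothetical convenience function, not part of any Google SDK; the function name, argument names, and sentence layout are assumptions, and the phrasing Veo responds to best should be checked against Google's own prompt guidance.

```python
def build_veo_prompt(scene, camera=None, dialog=None):
    """Compose a text prompt from a scene, an optional camera directive,
    and optional (speaker, line) dialog pairs.

    Hypothetical helper: the layout is an assumption, not an official format.
    """
    parts = [scene.strip().rstrip(".") + "."]
    if camera:
        parts.append(f"Camera: {camera}.")
    for speaker, line in dialog or []:
        # Quote the line verbatim so it reads as spoken dialog.
        parts.append(f'{speaker} says "{line}"')
    return " ".join(parts)


prompt = build_veo_prompt(
    "A barista pulls an espresso shot, steam rises",
    camera="dolly-in",
    dialog=[("Narrator", "Single origin, slow extracted.")],
)
print(prompt)
```

Keeping scene, camera, and dialog as separate arguments makes it easy to vary one element across a batch of shots while holding the others fixed.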
3. Runway, Kling, Hailuo & Pika
Beyond Sora and Veo, four commercial APIs dominate professional creative workflows in 2026. Each has a distinct strength: Runway Gen-4 for character/world consistency, Kling 2.5 for cinematic motion, Hailuo 02 for prompt adherence at low cost, and Pika 2.0 for stylized social-format video with effects.
All four offer 1080p output, image-to-video, and 5–10 second clips. Pricing ranges from about $0.03/sec (Hailuo 02 at 720p) at the low end to about $0.20/sec (Kling 2.5 Master mode) at the high end. Most production teams maintain accounts on at least two providers and route prompts by shot type: consistent character work to Runway, motion-heavy action to Kling, bulk b-roll to Hailuo.
Runway Gen-4 + Gen-4 Turbo
Best-in-class for character and world consistency across shots. The "References" feature locks an actor or location to a single image set across an entire production. 1080p, up to 10s. Credit-based billing; a 5s clip costs ~$0.25–0.50 (~$0.05–$0.10/s). Native API + SDK.
Kling 2.5 (Kuaishou)
Strongest cinematic motion and physics among Chinese-origin models. 1080p up to 10s, 4K upscale available. Excellent dance/sports/action footage. Pricing ~$0.10/s standard, ~$0.20/s Master mode. API via Kling Cloud and fal.ai.
Hailuo 02 (MiniMax)
Aggressive pricing: ~$0.03–$0.05/s at 720p, ~$0.08/s at 1080p. Strong prompt adherence and stable motion for the price. 6–10s clips. Ideal for high-volume b-roll generation, A/B testing, and stock-style content.
Pika 2.0 + Pikaffects
Stylized output with signature "Pikaffects" (squish, crush, melt, inflate, explode). Strong for social-first, meme-style, and music-video creative. Up to 5s, 1080p. Subscription tiers from $10/mo with credit-based usage.
Aggregators: fal.ai, Replicate
fal.ai and Replicate expose Runway, Kling, Hailuo, Pika, plus open-source models behind a single API and key. Useful for multi-model A/B routing without separate billing. Slight markup vs. direct provider pricing.
When to Pick Which
Sora 2 / Veo 3.1: best overall + audio. Runway Gen-4: character continuity across shots. Kling 2.5: action and cinematic motion. Hailuo 02: cheap bulk generation. Pika 2.0: social/stylized creative with effects.
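The routing rule of thumb above can be written down as a lookup table. This is a minimal sketch: the shot-type labels and the function name are hypothetical, and the provider assignments simply restate the recommendations in this section.

```python
# Shot-type routing table restating the recommendations above.
# The keys are hypothetical labels chosen for this sketch.
ROUTES = {
    "audio": ["Sora 2", "Veo 3.1"],
    "character": ["Runway Gen-4"],
    "action": ["Kling 2.5"],
    "b-roll": ["Hailuo 02"],
    "stylized": ["Pika 2.0"],
}


def pick_providers(shot_type, available=None):
    """Return candidate providers for a shot type.

    If `available` is given, keep only providers you hold accounts with.
    """
    candidates = ROUTES.get(shot_type)
    if candidates is None:
        raise ValueError(f"unknown shot type: {shot_type!r}")
    if available is None:
        return list(candidates)
    return [p for p in candidates if p in available]


print(pick_providers("action"))
print(pick_providers("audio", available={"Veo 3.1", "Hailuo 02"}))
```

An empty result signals that none of your accounts covers that shot type, which is a useful prompt to either add a provider or fall back to an aggregator.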
4. Open-Source Video Models
Open-source video generation closed most of the gap with closed APIs in 2025-2026. Five models matter: Wan 2.5 (Alibaba), HunyuanVideo (Tencent), LTX-Video (Lightricks), Mochi 1 (Genmo), and CogVideoX-5B (Zhipu/THUDM). All publish weights on Hugging Face under permissive or research licenses, and all have first-class ComfyUI nodes.
For local self-hosting on a single RTX 4090 (24GB) or RTX 5090 (32GB), LTX-Video is by far the fastest — real-time generation of 5s clips at 768x512. Wan 2.5 and HunyuanVideo deliver the highest quality and now match Kling 2.0 / Runway Gen-3 on most benchmarks. Mochi 1 excels at motion fidelity. CogVideoX-5B is the easiest entry point for fine-tuning and LoRA training.
Wan 2.5 (Alibaba)
14B-parameter DiT (Diffusion Transformer). 1080p up to 10s. Wan 2.5 adds native audio generation, narrowing the gap with Sora 2 and Veo 3.1. Apache 2.0. Runs on 24GB VRAM with offloading; 48GB recommended for 1080p. ComfyUI + Diffusers support.
HunyuanVideo (Tencent)
13B params, dual text encoders (LLM + CLIP). 720p, 5–15s clips. Best-in-class open-source quality on VBench. Strong I2V variant. Released under the Tencent Hunyuan Community License rather than Apache 2.0, so check its regional and usage terms. Needs 60GB VRAM at full precision; FP8 + offloading runs on 24GB at 720p with longer gen times.
LTX-Video (Lightricks)
2B-param DiT optimized for speed. Real-time 5s @ 768x512 on a single RTX 4090 (~5s gen for 5s output). Open RAIL-M license. Best choice for interactive UIs, batch pipelines, and rapid prototyping. T2V + I2V + V2V.
Mochi 1 (Genmo)
10B-param AsymmDiT. Strong motion fidelity and prompt adherence. 480p base, with HD upsampler. Apache 2.0. ~24GB VRAM with sequential offloading. Active community fine-tunes for film/anime/photorealism on Hugging Face.
CogVideoX-5B (Zhipu)
5B params, 720x480, 6s clips. Most mature ecosystem for LoRA fine-tuning, trainable on 1x A100 40GB. Released under the CogVideoX License (the smaller 2B model is Apache 2.0). CogVideoX-5B-I2V variant for image-to-video. Excellent baseline for character/style customization.
ComfyUI & Diffusers
All five models ship with ComfyUI custom nodes (visual graph editor, fastest iteration) and Hugging Face Diffusers pipelines (Pythonic API, batch workflows). xformers, SageAttention, and TeaCache accelerate inference 2–3x.
5. Image-to-Video Techniques
Image-to-video (I2V) is often more useful than pure text-to-video for production. You start from a known still — an illustration, a product photo, a generated image — and the model only has to invent motion, not the scene. This collapses prompt-engineering complexity and gives you tight control over composition, branding, and identity.
Three I2V control mechanisms have converged across providers in 2026: motion brushes (paint motion vectors directly on the source image), character/identity references (lock a face or product across multiple shots), and start+end keyframes (interpolate between two images for shot-perfect timing). Runway, Kling, Pika, Veo, and HunyuanVideo I2V all expose at least two of the three.
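To make the motion-brush idea concrete: a brush stroke is essentially a binary mask plus a direction and an intensity, which together define a sparse per-pixel motion field. The sketch below builds such a field with plain Python lists. It illustrates the concept only; how each provider encodes brush input internally is not covered here, and the function name and data layout are assumptions.

```python
import math


def motion_field(mask, angle_deg, intensity):
    """Turn a binary mask into a grid of (dx, dy) motion vectors.

    Angles use image coordinates: 0 points right and 90 points down,
    because image rows grow downward. Brushed cells (mask value 1) get a
    vector of length `intensity`; unbrushed cells stay at (0.0, 0.0).
    """
    dx = intensity * math.cos(math.radians(angle_deg))
    dy = intensity * math.sin(math.radians(angle_deg))
    return [
        [(dx, dy) if cell else (0.0, 0.0) for cell in row]
        for row in mask
    ]


mask = [
    [0, 1, 1],
    [0, 0, 1],
]
field = motion_field(mask, angle_deg=0, intensity=2.0)
print(field[0][1])  # a brushed cell
print(field[1][0])  # an unbrushed cell
```

The static regions carry a zero vector, which is what keeps the rest of the image still while the brushed area moves.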
Motion Brush
Paint a mask on the source image and assign a direction vector + intensity. The brushed region animates along that vector while the rest stays static. Pioneered by Runway Gen-2/3, now in Kling, Pika, and ComfyUI nodes for HunyuanVideo I2V.
Character Consistency
Runway "References", Kling "Custom Character", and Veo "Ingredients" all let you upload a small set of reference images of a face, actor, or product (Veo accepts up to 3) and lock identity across an entire production. For open-source, train a CogVideoX or HunyuanVideo LoRA (~30 images, 1-2 hours on A100).
Start + End Keyframes
Provide a first frame and a last frame; the model interpolates motion between them. Veo "Frames-to-Video", Kling "Start & End Frame", LTX-Video "Image Conditioning". Eliminates motion ambiguity — the model only invents the in-betweens.
Camera Path Control
Specify camera motion (pan, dolly, zoom, orbit) independently of subject motion. Runway "Camera Controls" and Kling "Cinematic Lens Movement" expose explicit sliders. ComfyUI exposes per-frame camera trajectory tensors for fine control.
Video Extension & Looping
Feed the last frame of one clip as the start frame of the next to chain shots into 30-60s sequences. "Loop" mode forces the last frame to match the first — ideal for reels, ads, and animated backgrounds.
Upscaling & Frame Interpolation
Generate at 720p/24fps, then upscale with Topaz Video AI / SeedVR2 to 4K, and interpolate to 60fps with RIFE / Practical-RIFE. Common pipeline: cheap base generation → high-quality post-processing for delivery.
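The chaining workflow described under "Video Extension & Looping" can be sketched independently of any provider. In the example below, `generate_clip` is a stand-in that returns fake frame labels instead of calling a real model; in practice it would wrap whichever image-to-video API or ComfyUI workflow you use. All names are hypothetical.

```python
def generate_clip(start_frame, prompt, num_frames=4):
    """Stand-in for an image-to-video call; returns a list of frame labels.

    A real implementation would send `start_frame` and `prompt` to a model
    and return decoded frames. Here the first frame is the conditioning
    image, followed by placeholder labels.
    """
    return [start_frame] + [f"{prompt}#{i}" for i in range(1, num_frames)]


def chain_clips(first_frame, prompts, num_frames=4):
    """Generate one clip per prompt, seeding each with the previous last frame."""
    sequence = []
    start = first_frame
    for prompt in prompts:
        clip = generate_clip(start, prompt, num_frames)
        # After the first clip, drop the duplicated seam frame.
        sequence.extend(clip if not sequence else clip[1:])
        start = clip[-1]
    return sequence


frames = chain_clips("hero.jpg", ["walk", "turn"], num_frames=3)
print(frames)
```

Dropping the seam frame avoids a one-frame stutter at each join, since the last frame of one clip and the first frame of the next are the same image.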
6. Self-Hosting (RTX 5090 / 4090)
Self-hosting open-source video models has crossed the practicality threshold for solo creators and small studios. A single RTX 5090 (32GB GDDR7) can run Wan 2.5 and HunyuanVideo at 720p with reasonable generation times; an RTX 4090 (24GB) handles LTX-Video, Mochi 1, and CogVideoX-5B comfortably. Two-card setups (2x 4090 or 1x 5090 + 1x 4090) unlock 1080p and longer clips via tensor parallelism.
VRAM is the binding constraint. Memory optimization techniques — FP8 quantization, sequential CPU offloading, VAE tiling, attention slicing, SageAttention — let larger models fit on 24GB, at the cost of 1.5–3x longer generation. For production pipelines, batch generations overnight; for interactive iteration, stick to LTX-Video.
RTX 5090 (32GB) — Sweet Spot
32GB GDDR7 + Blackwell tensor cores. Runs HunyuanVideo and Wan 2.5 at 720p natively without offloading. Wan 2.5 5s @ 720p in ~90s. LTX-Video real-time. ~$2,000 retail when in stock. Best single-card option for serious creators in 2026.
RTX 4090 (24GB) — Mainstream
24GB GDDR6X. Mochi 1 and CogVideoX-5B run natively. HunyuanVideo and Wan 2.5 require FP8 + offloading. Wan 2.5 5s @ 720p in ~3-5 min with offloading. LTX-Video 5s in ~5-10s. Excellent value at ~$1,600 used.
RTX 3090 (24GB) — Budget
Same 24GB VRAM as 4090 at half the price (~$700 used). 30-50% slower than 4090. Still capable of running every open-source model with offloading. The most cost-effective entry point for VRAM-bound workflows.
RTX 6000 Ada / Pro 6000 (48GB-96GB)
48GB (6000 Ada) or 96GB (Pro 6000 Blackwell) VRAM eliminates all offloading. HunyuanVideo and Wan 2.5 at 1080p without compromise. 4-7x more expensive than 4090/5090 but the only single-card path to high-end production at 1080p+.
Cloud Alternative: Runpod / Vast.ai
A100 80GB ~$1.20-1.80/hr, H100 80GB ~$2-3/hr. Spin up a ComfyUI instance on demand, generate, shut down. Cheaper than a 5090 for sub-50 hours/month of usage. Pre-built ComfyUI templates available.
VRAM Optimization Stack
FP8 or GGUF Q8_0 quantization, SageAttention, TeaCache, VAE tiling, sequential CPU offload, torch.compile. Combined: HunyuanVideo on 12GB VRAM (RTX 4070), Wan 2.5 on 16GB. 2-3x slower but accessible.
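A back-of-the-envelope calculation shows why VRAM is the binding constraint. The sketch below estimates storage for the diffusion transformer weights alone, using the parameter counts quoted in this guide and the standard byte widths per data type. It deliberately ignores text encoders, the VAE, activations, and framework overhead, so real requirements are noticeably higher; the function name is hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}


def weight_gb(params_billions, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BYTES_PER_PARAM[dtype]


for name, size in [("HunyuanVideo", 13), ("Wan 2.5", 14), ("LTX-Video", 2)]:
    print(f"{name}: {weight_gb(size, 'bf16')} GB bf16, "
          f"{weight_gb(size, 'fp8')} GB fp8")
```

At bf16, a 13B model already needs about 26 GB for weights, more than a 24GB card holds, which is why 8-bit quantization or CPU offloading is required before anything else fits.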
7. Comparison Table & Pricing
Side-by-side comparison of the major models on the four dimensions that decide a production purchase: price per second, maximum clip length, native audio support, and maximum resolution. Closed APIs lead on audio and consistency; open-source leads on cost-at-scale and customization (LoRA fine-tunes).
| Model | Price / sec | Max length | Audio | Max res | License |
|---|---|---|---|---|---|
| Sora 2 | $0.10–$0.15 | 20s | Yes (native) | 1080p | API |
| Sora 2 Pro | $0.30–$0.50 | 30–60s | Yes (native) | 1080p+ | API |
| Veo 3.1 | $0.40 | 8s (extend) | Yes (native) | 1080p | API |
| Veo 3.1 Fast | $0.20 | 8s | Yes (native) | 1080p | API |
| Runway Gen-4 | $0.05–$0.10 | 10s | No | 1080p | API |
| Kling 2.5 | $0.10–$0.20 | 10s | No | 1080p (4K up) | API |
| Hailuo 02 | $0.03–$0.08 | 10s | No | 1080p | API |
| Pika 2.0 | ~$0.05 | 5s | No | 1080p | API |
| Wan 2.5 | Self-hosted | 10s | Yes | 1080p | Apache 2.0 |
| HunyuanVideo | Self-hosted | 15s | No | 720p | Tencent Hunyuan Community |
| LTX-Video | Self-hosted | 5s | No | 768x512 | Open RAIL-M |
| Mochi 1 | Self-hosted | 5s | No | 480p (HD up) | Apache 2.0 |
| CogVideoX-5B | Self-hosted | 6s | No | 720x480 | CogVideoX License |
Cost reality check: at $0.10/sec, a 30-second TikTok ad costs $3 to generate, cheaper than a single stock-footage license. At Sora 2 Pro pricing ($0.50/sec), the same 30s costs $15. Against API prices of roughly $0.08–$0.17/sec, a ~$2,000 RTX 5090 pays for itself after ~200–400 minutes of generated video (ignoring electricity); a cloud A100 at $1.50/hr breaks even with Hailuo 02 at $0.05/sec once it produces about 30s of output per hour of compute.
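The arithmetic behind these figures is simple enough to script for your own prices. The sketch below uses the per-second rates and hardware costs quoted in this guide; the function names are hypothetical, and the break-even figures ignore electricity, depreciation, and operator time.

```python
def clip_cost(seconds, price_per_sec):
    """API cost in dollars for one clip."""
    return seconds * price_per_sec


def breakeven_minutes(hardware_cost, price_per_sec):
    """Minutes of generated video at which hardware cost equals API spend."""
    return hardware_cost / price_per_sec / 60


def breakeven_output_seconds(rental_per_hour, price_per_sec):
    """Seconds of output per rented GPU-hour needed to match an API price."""
    return rental_per_hour / price_per_sec


print(f"30s at $0.10/s: ${clip_cost(30, 0.10):.2f}")
print(f"30s at $0.50/s: ${clip_cost(30, 0.50):.2f}")
print(f"$2,000 GPU vs $0.10/s: {breakeven_minutes(2000, 0.10):.0f} min")
print(f"$1.50/hr vs $0.05/s: {breakeven_output_seconds(1.50, 0.05):.0f} s/hr")
```

With these inputs the script reports $3.00 and $15.00 per 30-second clip, a hardware break-even of about 333 minutes of output, and 30 seconds of output per rented hour.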
Audio Capability Matters
Only Sora 2, Veo 3.1, and Wan 2.5 generate native audio. Everyone else needs a separate TTS + foley + music pass. Native audio cuts post-production from hours to minutes for short-form content.
Top-Tier Quality (VBench)
As of early 2026: Sora 2 Pro and Veo 3.1 lead, followed closely by Kling 2.5 and Runway Gen-4. Wan 2.5 and HunyuanVideo are the best open-source models, within 5-10% of Kling on most VBench metrics.
High-Volume Pipelines
Generating 1000+ clips/day? Hailuo 02 ($0.03-0.08/s) or self-hosted LTX-Video on 4-8 GPUs. Closed APIs hit rate limits and budgets fast. Open-source wins on $/clip at scale.
Enterprise Compliance
Vertex AI Veo 3.1 is the only option with VPC-SC, IAM, audit logs, and data residency guarantees out of the box. Self-hosting open-source models in a private VPC covers the remaining cases. Sora 2 has an enterprise tier with stricter data handling.
Watermarking & Provenance
Sora (C2PA metadata) and Veo (SynthID watermark) embed provenance signals in every output. Most open-source models don't, so you must add C2PA tags downstream if your platform requires "AI-generated" disclosure (EU AI Act, YouTube, Meta).
Quick Decision Tree
Need audio + best quality? Sora 2 / Veo 3.1. Need character consistency? Runway Gen-4. Cinematic action? Kling 2.5. Bulk b-roll cheap? Hailuo 02. Privacy + customization? Self-host Wan 2.5 or HunyuanVideo on 5090.
8. Code: Runway SDK + ComfyUI
Two integration paths cover most production work: the Runway SDK (Python/JavaScript) for closed-API workflows, and ComfyUI (visual graph) for self-hosted open-source models. The Runway SDK is the most polished commercial video API; ComfyUI is the de-facto standard for open-source video generation, with custom nodes for every model in this guide.
For the OpenAI Sora 2 API and Google Veo 3.1, the OpenAI Python SDK and the Google google-genai SDK both expose long-running operation patterns nearly identical to the Runway example below: submit a job, poll the status, download the output. ComfyUI workflows can also be invoked headlessly via its REST API for fully programmatic pipelines.
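Because every provider follows the same submit, poll, download shape, the polling step can be factored into one reusable helper with a timeout. The sketch below is provider-agnostic and hypothetical (it is not part of any SDK): `fetch_status` is any zero-argument callable that returns a status string, for example a lambda wrapping an SDK's retrieve call. The `sleep` and `clock` parameters are injectable so the loop can be exercised without real waiting.

```python
import time


def poll_until_done(fetch_status, done_states, interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `fetch_status()` until it returns a terminal state or time runs out."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in done_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        sleep(interval)


# Demo with a fake job that finishes on the third poll.
states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
result = poll_until_done(lambda: next(states), {"SUCCEEDED", "FAILED"},
                         sleep=lambda s: None)
print(result)
```

Passing every terminal state a provider can return (including cancelled or failed variants) in `done_states` keeps the loop from spinning forever on a job that will never succeed.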
Runway Gen-4 Image-to-Video (Python)
Submit an image + prompt, poll for completion, download the MP4. Async-friendly with SSE events. Install: pip install runwayml.
from runwayml import RunwayML
import base64, time, requests
client = RunwayML()  # reads the RUNWAYML_API_SECRET environment variable
# Encode the source image as a data URI
with open("hero.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()
prompt_image = f"data:image/jpeg;base64,{img_b64}"
# Submit a Gen-4 Turbo image-to-video job
task = client.image_to_video.create(
    model="gen4_turbo",
    prompt_image=prompt_image,
    prompt_text="Slow camera dolly-in, warm sunset light, "
                "subject smiles, hair moves in the wind",
    duration=5,  # 5 or 10 seconds
    ratio="1280:720",
)
# Poll until the task reaches a terminal state
while True:
    task = client.tasks.retrieve(task.id)
    if task.status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
if task.status == "SUCCEEDED":
    url = task.output[0]  # URL of the rendered MP4
    with open("out.mp4", "wb") as f:
        f.write(requests.get(url, timeout=60).content)
    print("Saved out.mp4")
else:
    print(f"Task ended with status {task.status}")
OpenAI Sora 2 Generate (Python)
OpenAI Videos API. Submit prompt, poll job, fetch video. Native audio is on by default in Sora 2.
from openai import OpenAI
import time
client = OpenAI()  # reads the OPENAI_API_KEY environment variable
job = client.videos.create(
    model="sora-2",
    prompt="A golden retriever surfs a small wave at sunset, "
           "GoPro POV, water splashes, dog barks happily",
    size="1280x720",
    seconds="8",
    # input_reference=open("hero.png", "rb"),  # optional I2V
)
# Poll until the job reaches a terminal state
while job.status not in ("completed", "failed"):
    time.sleep(5)
    job = client.videos.retrieve(job.id)
if job.status == "completed":
    # Download the rendered MP4 (audio track included)
    content = client.videos.download_content(job.id, variant="video")
    content.write_to_file("sora.mp4")
else:
    print(f"Job ended with status {job.status}")
Google Veo 3.1 (Vertex AI)
Long-running operation pattern. Install: pip install google-genai. Auth via gcloud auth application-default login.
from google import genai
from google.genai import types
import time
client = genai.Client(
    vertexai=True, project="my-proj", location="us-central1"
)
op = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt='A barista pulls an espresso shot, steam rises, '
           'narrator says "Single origin, slow extracted."',
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        duration_seconds=8,
        number_of_videos=1,
        generate_audio=True,  # native audio
    ),
)
# Poll the long-running operation
while not op.done:
    time.sleep(10)
    op = client.operations.get(op)
if op.error:
    raise RuntimeError(f"Veo generation failed: {op.error}")
video = op.response.generated_videos[0].video
# On Vertex AI the bytes come back inline unless output_gcs_uri is set in the
# config; client.files.download() applies to the Gemini Developer API only.
video.save("veo.mp4")
ComfyUI: Wan 2.5 / HunyuanVideo
ComfyUI is the standard graph-based UI for open-source video. Install custom nodes for the model, drop the workflow JSON, hit Queue.
# Install ComfyUI on a Linux box with RTX 4090/5090
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI && pip install -r requirements.txt
# Install model-specific nodes via ComfyUI Manager
# - ComfyUI-WanVideoWrapper (Wan 2.5)
# - ComfyUI-HunyuanVideoWrapper (HunyuanVideo)
# - ComfyUI-LTXVideo (LTX-Video)
# - ComfyUI-MochiWrapper (Mochi 1)
# - ComfyUI-CogVideoXWrapper (CogVideoX-5B)
# Download weights to models/diffusion_models/
huggingface-cli download Wan-AI/Wan2.5-T2V-14B \
--local-dir models/diffusion_models/Wan2.5
# Launch
python main.py --listen 0.0.0.0 --port 8188
# Drag a workflow .json onto the UI:
# https://github.com/kijai/ComfyUI-WanVideoWrapper/tree/main/example_workflows
Diffusers: HunyuanVideo (Python)
Pure-Python pipeline for batch jobs and CI/CD. Memory optimizations let HunyuanVideo run on a single 24GB card.
import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video
# Diffusers-format weights live in the hunyuanvideo-community repo
pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    torch_dtype=torch.bfloat16,
)
pipe.vae.enable_tiling()  # decode the video in tiles to cap VAE memory
pipe.enable_sequential_cpu_offload()  # fits on 24GB VRAM
video = pipe(
    prompt="A cat dressed as a wizard casts a sparkling spell, "
           "soft volumetric lighting, cinematic 35mm",
    height=720, width=1280,
    num_frames=85,  # ~3.5s at 24fps
    num_inference_steps=30,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "wizard_cat.mp4", fps=24)
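The num_frames=85 in the example above is not arbitrary. Assuming the video VAE compresses time by a factor of 4 (as HunyuanVideo's does), frame counts of the form 4k + 1 map onto whole latent frames. The hypothetical helper below picks the nearest such count for a target duration; the stride default is an assumption to verify for other models.

```python
def valid_num_frames(duration_s, fps=24, temporal_stride=4):
    """Nearest frame count of the form stride*k + 1 for a target duration.

    Assumes the video VAE compresses time by `temporal_stride`.
    """
    target = int(duration_s * fps)
    k = max(round((target - 1) / temporal_stride), 0)
    return temporal_stride * k + 1


print(valid_num_frames(3.5))
print(valid_num_frames(5))
```

For 3.5 seconds at 24fps this returns 85, matching the value used in the pipeline call; for 5 seconds it returns 121.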
ComfyUI Headless API (Python)
Drive ComfyUI from production code via its websocket+HTTP API. Submit a workflow JSON, get the output filename, fetch the MP4.
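import json, uuid, requests, websocket  # websocket comes from websocket-client
HOST = "localhost:8188"
client_id = str(uuid.uuid4())
# Load a workflow exported via "Save (API Format)" in ComfyUI
with open("ltx_video_i2v.json") as f:
    workflow = json.load(f)
# Override the prompt and source image. Node IDs ("6", "10", "12") are
# specific to this workflow; hero.jpg must already be in ComfyUI's input folder.
workflow["6"]["inputs"]["text"] = "Subject smiles, sunset light"
workflow["10"]["inputs"]["image"] = "hero.jpg"
# Open the websocket before queueing so no status messages are missed
ws = websocket.create_connection(f"ws://{HOST}/ws?clientId={client_id}")
r = requests.post(f"http://{HOST}/prompt",
                  json={"prompt": workflow, "client_id": client_id})
r.raise_for_status()
prompt_id = r.json()["prompt_id"]
# Wait for the end-of-run signal: an "executing" message with node == None
while True:
    raw = ws.recv()
    if not isinstance(raw, str):
        continue  # binary frames carry preview images
    msg = json.loads(raw)
    data = msg.get("data", {})
    if (msg.get("type") == "executing" and data.get("node") is None
            and data.get("prompt_id") == prompt_id):
        break
ws.close()
# Fetch the resulting video from /history
hist = requests.get(f"http://{HOST}/history/{prompt_id}").json()
# "gifs" is the output key used by the Video Combine node (VideoHelperSuite)
video = hist[prompt_id]["outputs"]["12"]["gifs"][0]
params = {"filename": video["filename"],
          "subfolder": video["subfolder"], "type": "output"}
with open("out.mp4", "wb") as f:
    f.write(requests.get(f"http://{HOST}/view", params=params).content)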