AI / GENERATIVE VIDEO

Video Generation & Image-to-Video: Sora 2, Veo 3.1, Runway, Open-Source

A practical guide to AI video generation in 2026: closed APIs (Sora 2/Pro, Veo 3.1, Runway Gen-4, Kling 2.5, Hailuo 02, Pika 2.0), open-source models (Wan 2.5, HunyuanVideo, LTX-Video, Mochi 1, CogVideoX-5B), image-to-video with motion brushes and character consistency, self-hosting on RTX 5090/4090, pricing comparisons, and code examples with the Runway SDK and ComfyUI.

By Jose Nobile | Updated 2026-04-27 | 26 min read

1. Sora 2 / Sora 2 Pro (OpenAI)

Sora 2 (released October 2025) is OpenAI's flagship text-to-video and image-to-video model. The headline upgrade over Sora 1 is synchronized audio — dialog, sound effects, ambience, and music are generated together with the video, eliminating the need for a separate audio pass. Sora 2 produces up to 20-second clips at 1080p in the standard tier; Sora 2 Pro extends this to higher fidelity and longer durations with cinematic motion and improved physics.

Access is via the Sora app (sora.com), the OpenAI API (/v1/videos endpoint), and the ChatGPT Pro/Plus tiers. Pricing is per second of generated video: roughly $0.10/sec for Sora 2 and $0.30–$0.50/sec for Sora 2 Pro at 1080p. Generation time is 30–120 seconds for typical prompts. The model accepts text prompts, reference images, and reference videos for style transfer and continuation.

Sora 2's advantage is world simulation: it understands rigid-body physics, fluids, fabric, and human motion better than prior generations. Failure modes have shifted from "mangled hands" to subtler issues like inconsistent shadows or character drift across longer clips. Watermarking via C2PA provenance metadata is enforced on all outputs.

AUDIO

Native Synchronized Audio

Among the first major models to generate dialog, foley, ambient sound, and music in a single pass. Lip-sync to generated speech, footsteps that match terrain, and Doppler-correct passing vehicles. No separate TTS or sound-design step.

RESOLUTION

Up to 1080p, 20s

Sora 2 standard outputs 720p/1080p at up to 20 seconds. Sora 2 Pro pushes higher fidelity, sharper detail and 30-60s segments suitable for short-form ads, film previs, and stylized music video sequences.

PRICING

~$0.10–$0.50 / second

API pricing: Sora 2 ~$0.10/s at 720p, ~$0.15/s at 1080p. Sora 2 Pro ~$0.30–$0.50/s at 1080p. ChatGPT Pro includes a monthly generation quota; pay-per-use beyond that. Caching does not apply to video generation.

PHYSICS

Improved World Simulation

Better handling of rigid bodies, gravity, collisions, and fluid dynamics. Multi-shot continuity (cuts and camera changes) is more reliable. Still struggles with very fine-grained hand-object interaction and consistent text rendering.

API

OpenAI Videos API

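REST endpoint POST /v1/videos accepts a prompt, an optional reference image, a duration, and an output size (which also fixes the aspect ratio); in the current OpenAI SDK these fields are named input_reference, seconds, and size. It returns a job ID; poll GET /v1/videos/{id} until status=completed, then download the MP4 (audio is muxed into the same file) from /v1/videos/{id}/content. Async only.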

SAFETY

C2PA Watermarking

All outputs carry C2PA cryptographic provenance metadata identifying them as AI-generated. Visible watermark on free tier; metadata-only on paid tiers. Likeness protection blocks named public figures unless explicitly licensed.

2. Google Veo 3.1 (Vertex AI)

Veo 3.1 is Google DeepMind's flagship video model, accessed via Vertex AI, the Gemini API, the consumer Gemini app, and embedded in Google Vids and YouTube Shorts. Like Sora 2, Veo 3.1 generates native audio — speech, music, and ambient sound — in a single pass. Veo 3.1 produces 1080p clips up to 8 seconds (extendable via "Extend video" iterations to 30s+).

Vertex AI pricing: $0.40/sec for Veo 3.1 with audio, $0.20/sec for Veo 3.1 Fast. Generation latency is 60–180 seconds. Veo's standout feature is "Ingredients" mode: provide multiple reference images (character, object, scene) and Veo composes them into a coherent shot — ideal for advertising and product visualization where character/product consistency matters.

Veo also exposes camera controls (pan, tilt, dolly, orbit) as explicit prompt parameters, and "Frames to Video" which interpolates between two keyframes to produce a smooth shot — powerful for storyboard-driven workflows. SynthID watermarks are embedded invisibly in every frame and audio track.

AUDIO

Native Audio Generation

Synchronized dialog, foley, music, and ambience generated jointly with the visuals. Dialog can be specified verbatim in the prompt with quoted speaker tags. Multi-language speech supported across 30+ languages.

INGREDIENTS

Multi-Reference "Ingredients"

Pass up to 3 reference images (character, product, environment) and Veo blends them into a single coherent shot. Critical for brand-consistent advertising where the actor, product, and location must all match supplied references.

CAMERA

Camera Controls

Explicit camera directives in the prompt: pan-left, dolly-in, orbit-right, crane-up, handheld. Interprets cinematic language ("Dutch angle", "rack focus") more reliably than competitors. 16:9, 9:16, and 1:1 aspect ratios.

FRAMES

Frames-to-Video Interpolation

Provide a start frame and an end frame; Veo generates the motion between them. Enables storyboard-driven production: artists draw key beats, Veo fills the in-betweens. Reduces the unpredictability of pure text-to-video.

VERTEX

Vertex AI API

Long-running operation pattern: POST to publishers/google/models/{model}:predictLongRunning (for example veo-3.1-generate-preview), poll the operation, fetch the GCS-hosted MP4. IAM-controlled, VPC-SC compatible, and auditable, which suits regulated enterprises.

SYNTHID

SynthID Watermarking

Invisible watermark embedded in every video frame and audio sample. Survives compression, cropping, and re-encoding. Detectable via Google's SynthID Detector. Mandatory on all Veo outputs across consumer and API surfaces.

3. Runway, Kling, Hailuo & Pika

Beyond Sora and Veo, four commercial APIs dominate professional creative workflows in 2026. Each has a distinct strength: Runway Gen-4 for character/world consistency, Kling 2.5 for cinematic motion, Hailuo 02 for prompt adherence at low cost, and Pika 2.0 for stylized social-format video with effects.

All four offer 1080p output, image-to-video, and 5–10 second clips. Pricing ranges from about $0.03/sec (Hailuo 02 at 720p) to about $0.20/sec (Kling 2.5 Master mode). Most production teams maintain accounts on at least two providers and route prompts by shot type: consistent character work to Runway, motion-heavy action to Kling, bulk b-roll to Hailuo.

CONSISTENCY

Runway Gen-4 + Gen-4 Turbo

Best-in-class for character and world consistency across shots. The "References" feature locks an actor or location to a single image set across an entire production. 1080p, up to 10s. ~$0.05–$0.10/s, so a 5s clip costs ~$0.25–$0.50. Native API + SDK.

CINEMATIC

Kling 2.5 (Kuaishou)

Strongest cinematic motion and physics among Chinese-origin models. 1080p up to 10s, 4K upscale available. Excellent dance/sports/action footage. Pricing ~$0.10/s standard, ~$0.20/s Master mode. API via Kling Cloud and fal.ai.

VALUE

Hailuo 02 (MiniMax)

Aggressive pricing: ~$0.03–$0.05/s at 720p, ~$0.08/s at 1080p. Strong prompt adherence and stable motion for the price. 6–10s clips. Ideal for high-volume b-roll generation, A/B testing, and stock-style content.

EFFECTS

Pika 2.0 + Pikaffects

Stylized output with signature "Pikaffects" (squish, crush, melt, inflate, explode). Strong for social-first, meme-style, and music-video creative. Up to 5s, 1080p. Subscription tiers from $10/mo with credit-based usage.

PROVIDERS

Aggregators: fal.ai, Replicate

fal.ai and Replicate expose Runway, Kling, Hailuo, Pika, plus open-source models behind a single API and key. Useful for multi-model A/B routing without separate billing. Slight markup vs. direct provider pricing.

SELECTION

When to Pick Which

Sora 2 / Veo 3.1: best overall + audio. Runway Gen-4: character continuity across shots. Kling 2.5: action and cinematic motion. Hailuo 02: cheap bulk generation. Pika 2.0: social/stylized creative with effects.

4. Open-Source Video Models

Open-source video generation closed most of the gap with closed APIs in 2025-2026. Five models matter: Wan 2.5 (Alibaba), HunyuanVideo (Tencent), LTX-Video (Lightricks), Mochi 1 (Genmo), and CogVideoX-5B (Zhipu/THUDM). All publish weights on Hugging Face under permissive or research licenses, and all have first-class ComfyUI nodes.

For local self-hosting on a single RTX 4090 (24GB) or RTX 5090 (32GB), LTX-Video is by far the fastest — real-time generation of 5s clips at 768x512. Wan 2.5 and HunyuanVideo deliver the highest quality and now match Kling 2.0 / Runway Gen-3 on most benchmarks. Mochi 1 excels at motion fidelity. CogVideoX-5B is the easiest entry point for fine-tuning and LoRA training.

FLAGSHIP

Wan 2.5 (Alibaba)

14B-parameter DiT (Diffusion Transformer). 1080p up to 10s. Wan 2.5 adds native audio generation, narrowing the gap with Sora 2 and Veo 3.1. Apache 2.0. Runs on 24GB VRAM with offloading; 48GB recommended for 1080p. ComfyUI + Diffusers support.

QUALITY

HunyuanVideo (Tencent)

13B params, dual text encoders (LLM + CLIP). 720p, 5–15s clips. Best-in-class open-source quality on VBench. Strong I2V variant. Apache 2.0. Needs 60GB VRAM at full precision; FP8 + offloading runs on 24GB at 720p with longer gen times.

SPEED

LTX-Video (Lightricks)

2B-param DiT optimized for speed. Real-time 5s @ 768x512 on a single RTX 4090 (~5s gen for 5s output). Open RAIL-M license. Best choice for interactive UIs, batch pipelines, and rapid prototyping. T2V + I2V + V2V.

MOTION

Mochi 1 (Genmo)

10B-param AsymmDiT. Strong motion fidelity and prompt adherence. 480p base, with HD upsampler. Apache 2.0. ~24GB VRAM with sequential offloading. Active community fine-tunes for film/anime/photorealism on Hugging Face.

FINETUNE

CogVideoX-5B (Zhipu)

5B params, 720x480, 6s clips. Most mature ecosystem for LoRA fine-tuning — trainable on 1x A100 40GB. Apache 2.0. CogVideoX-5B-I2V variant for image-to-video. Excellent baseline for character/style customization.

TOOLING

ComfyUI & Diffusers

All five models ship with ComfyUI custom nodes (visual graph editor, fastest iteration) and Hugging Face Diffusers pipelines (Pythonic API, batch workflows). xformers, SageAttention, and TeaCache accelerate inference 2–3x.

5. Image-to-Video Techniques

Image-to-video (I2V) is often more useful than pure text-to-video for production. You start from a known still — an illustration, a product photo, a generated image — and the model only has to invent motion, not the scene. This collapses prompt-engineering complexity and gives you tight control over composition, branding, and identity.

Three I2V control mechanisms have converged across providers in 2026: motion brushes (paint motion vectors directly on the source image), character/identity references (lock a face or product across multiple shots), and start+end keyframes (interpolate between two images for shot-perfect timing). Runway, Kling, Pika, Veo, and HunyuanVideo I2V all expose at least two of the three.

CONTROL

Motion Brush

Paint a mask on the source image and assign a direction vector + intensity. The brushed region animates along that vector while the rest stays static. Pioneered by Runway Gen-2/3, now in Kling, Pika, and ComfyUI nodes for HunyuanVideo I2V.
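
To make the idea concrete, here is a minimal sketch of what a motion brush encodes: a binary mask plus one direction and one intensity, expanded into a per-pixel motion field. The function name, the angle convention, and the list-of-lists representation are illustrative assumptions, not the input format of Runway, Kling, Pika, or any ComfyUI node.

```python
import math

def motion_field(mask, angle_deg, intensity):
    """Expand a brushed mask into per-pixel (dx, dy) motion vectors.

    mask: 2D list of 0/1 values (1 = brushed pixel).
    angle_deg: direction of motion, 0 = right, 90 = up (assumed convention).
    intensity: vector length in pixels per frame (assumed unit).
    """
    dx = intensity * math.cos(math.radians(angle_deg))
    # Image rows grow downward, so "up" means a negative dy.
    dy = -intensity * math.sin(math.radians(angle_deg))
    return [
        [(dx, dy) if cell else (0.0, 0.0) for cell in row]
        for row in mask
    ]

# A 2x3 image where only the right-hand column is brushed.
mask = [
    [0, 0, 1],
    [0, 0, 1],
]
field = motion_field(mask, angle_deg=0, intensity=2.0)
print("brushed pixel:", field[0][2])
print("static pixel:", field[0][0])
```

Real tools typically accept several brushes with different vectors and feather the mask edges; this sketch keeps a single hard-edged region to show the data involved.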

IDENTITY

Character Consistency

Runway "References", Kling "Custom Character", Veo "Ingredients" all let you upload 1-5 images of a face/actor/product and lock identity across an entire production. For open-source, train a CogVideoX or HunyuanVideo LoRA (~30 images, 1-2 hours on A100).

KEYFRAMES

Start + End Keyframes

Provide a first frame and a last frame; the model interpolates motion between them. Veo "Frames-to-Video", Kling "Start & End Frame", LTX-Video "Image Conditioning". Eliminates motion ambiguity — the model only invents the in-betweens.

CAMERA

Camera Path Control

Specify camera motion (pan, dolly, zoom, orbit) independently of subject motion. Runway "Camera Controls" and Kling "Cinematic Lens Movement" expose explicit sliders. ComfyUI exposes per-frame camera trajectory tensors for fine control.

EXTEND

Video Extension & Looping

Feed the last frame of one clip as the start frame of the next to chain shots into 30-60s sequences. "Loop" mode forces the last frame to match the first — ideal for reels, ads, and animated backgrounds.
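
The chaining logic described above can be sketched independently of any provider. In the sketch below, `generate` and `last_frame_of` are hypothetical caller-supplied callables (wrapping an API call and a frame extractor respectively); the stand-in versions exist only so the example runs without a model.

```python
import math

def plan_segments(target_seconds, clip_seconds):
    """Number of chained clips needed to reach target_seconds."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    return math.ceil(target_seconds / clip_seconds)

def chain_clips(first_frame, prompts, generate, last_frame_of):
    """Generate clips in sequence, seeding each with the previous last frame.

    generate(start_frame, prompt) -> clip and last_frame_of(clip) -> frame
    are hypothetical callables supplied by the caller.
    """
    clips = []
    start = first_frame
    for prompt in prompts:
        clip = generate(start, prompt)
        clips.append(clip)
        start = last_frame_of(clip)  # becomes the next clip's first frame
    return clips

# Stand-in implementations so the sketch runs without a model.
def fake_generate(start_frame, prompt):
    return {"start": start_frame, "end": f"{prompt}-end", "prompt": prompt}

def fake_last_frame(clip):
    return clip["end"]

segments = plan_segments(target_seconds=30, clip_seconds=8)
prompts = [f"shot{i}" for i in range(segments)]
clips = chain_clips("hero.jpg", prompts, fake_generate, fake_last_frame)
print(segments, [c["start"] for c in clips])
```

Because each clip inherits only a single frame, motion and identity can still drift over a long chain, so it is worth reviewing the joins before committing to a 30-60s sequence.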

UPSCALE

Upscaling & Frame Interpolation

Generate at 720p/24fps, then upscale with Topaz Video AI / SeedVR2 to 4K, and interpolate to 60fps with RIFE / Practical-RIFE. Common pipeline: cheap base generation → high-quality post-processing for delivery.

6. Self-Hosting (RTX 5090 / 4090)

Self-hosting open-source video models has crossed the practicality threshold for solo creators and small studios. A single RTX 5090 (32GB GDDR7) can run Wan 2.5 and HunyuanVideo at 720p with reasonable generation times; an RTX 4090 (24GB) handles LTX-Video, Mochi 1, and CogVideoX-5B comfortably. Two-card setups (2x 4090 or 1x 5090 + 1x 4090) unlock 1080p and longer clips via tensor parallelism.

VRAM is the binding constraint. Memory optimization techniques — FP8 quantization, sequential CPU offloading, VAE tiling, attention slicing, SageAttention — let larger models fit on 24GB, at the cost of 1.5–3x longer generation. For production pipelines, batch generations overnight; for interactive iteration, stick to LTX-Video.
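
A rough way to see why quantization matters is to estimate the size of the weights alone: parameter count times bytes per parameter. The sketch below uses the parameter counts quoted in this guide; the 6 GB headroom for activations and the VAE is an assumed placeholder, and text encoders are ignored, so treat the results as a lower bound rather than a measurement.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "fp8": 1.0}

def weight_gb(params_billion, dtype):
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]

def fits(params_billion, dtype, vram_gb, headroom_gb=6.0):
    """True if weights plus an assumed headroom fit in VRAM.

    headroom_gb is a guess covering activations and the VAE; real usage
    depends on resolution, frame count, and attention implementation.
    """
    return weight_gb(params_billion, dtype) + headroom_gb <= vram_gb

# Parameter counts (in billions) as quoted in this guide.
models = {
    "Wan 2.5": 14,
    "HunyuanVideo": 13,
    "Mochi 1": 10,
    "CogVideoX-5B": 5,
    "LTX-Video": 2,
}

for name, size in models.items():
    print(
        f"{name}: bf16 {weight_gb(size, 'bf16'):.0f} GB, "
        f"fp8 {weight_gb(size, 'fp8'):.0f} GB, "
        f"fits 24 GB at fp8: {fits(size, 'fp8', 24)}"
    )
```

Under these assumptions a 13B model needs about 26 GB for bf16 weights, which already exceeds a 24 GB card, while FP8 halves that and leaves room for activations.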

5090

RTX 5090 (32GB) — Sweet Spot

32GB GDDR7 + Blackwell tensor cores. Runs HunyuanVideo and Wan 2.5 at 720p natively without offloading. Wan 2.5 5s @ 720p in ~90s. LTX-Video real-time. ~$2,000 retail when in stock. Best single-card option for serious creators in 2026.

4090

RTX 4090 (24GB) — Mainstream

24GB GDDR6X. Mochi 1 and CogVideoX-5B run natively. HunyuanVideo and Wan 2.5 require FP8 + offloading. Wan 2.5 5s @ 720p in ~3-5 min with offloading. LTX-Video 5s in ~5-10s. Excellent value at ~$1,600 used.

3090

RTX 3090 (24GB) — Budget

Same 24GB VRAM as 4090 at half the price (~$700 used). 30-50% slower than 4090. Still capable of running every open-source model with offloading. The most cost-effective entry point for VRAM-bound workflows.

PRO

RTX 6000 Ada / Pro 6000 (48GB-96GB)

48GB (6000 Ada) or 96GB (Pro 6000 Blackwell) VRAM eliminates all offloading. HunyuanVideo and Wan 2.5 at 1080p without compromise. 4-7x more expensive than 4090/5090 but the only single-card path to high-end production at 1080p+.

CLOUD

Cloud Alternative: Runpod / Vast.ai

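A100 80GB ~$1.20-1.80/hr, H100 80GB ~$2-3/hr. Spin up a ComfyUI instance on demand, generate, shut down. At ~$1.50/hr, the price of a ~$2,000 RTX 5090 buys roughly 1,300 rental hours, so renting stays cheaper for about two years at under 50 hours/month of usage. Pre-built ComfyUI templates available.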

OPTIMIZE

VRAM Optimization Stack

FP8 or 8-bit GGUF (Q8_0) quantization, SageAttention, TeaCache, VAE tiling, sequential CPU offload, torch.compile. Combined: HunyuanVideo on 12GB VRAM (RTX 4070), Wan 2.5 on 16GB. 2-3x slower but accessible.

7. Comparison Table & Pricing

Side-by-side comparison of the major models on the four dimensions that decide a production purchase: price per second, maximum clip length, native audio support, and maximum resolution. Closed APIs lead on audio and consistency; open-source leads on cost-at-scale and customization (LoRA fine-tunes).

Model        | Price / sec  | Max length  | Audio        | Max res       | License
Sora 2       | $0.10–$0.15  | 20s         | Yes (native) | 1080p         | API
Sora 2 Pro   | $0.30–$0.50  | 30–60s      | Yes (native) | 1080p+        | API
Veo 3.1      | $0.40        | 8s (extend) | Yes (native) | 1080p         | API
Veo 3.1 Fast | $0.20        | 8s          | Yes (native) | 1080p         | API
Runway Gen-4 | $0.05–$0.10  | 10s         | No           | 1080p         | API
Kling 2.5    | $0.10–$0.20  | 10s         | No           | 1080p (4K up) | API
Hailuo 02    | $0.03–$0.08  | 10s         | No           | 1080p         | API
Pika 2.0     | ~$0.05       | 5s          | No           | 1080p         | API
Wan 2.5      | Self-hosted  | 10s         | Yes          | 1080p         | Apache 2.0
HunyuanVideo | Self-hosted  | 15s         | No           | 720p          | Apache 2.0
LTX-Video    | Self-hosted  | 5s          | No           | 768x512       | Open RAIL-M
Mochi 1      | Self-hosted  | 5s          | No           | 480p (HD up)  | Apache 2.0
CogVideoX-5B | Self-hosted  | 6s          | No           | 720x480       | Apache 2.0

Cost reality check: at $0.10/sec, a 30-second TikTok ad costs $3 to generate, cheaper than a single stock-footage license. At Sora 2 Pro pricing ($0.50/sec), the same 30s costs $15. Self-hosting on a ~$2,000 5090 amortizes after roughly 200-400 minutes of generated video when compared against $0.08–$0.15/sec API pricing (electricity excluded); a cloud A100 at $1.50/hr breaks even with Hailuo 02 (at ~$0.05/sec) once it produces about 30s of output per hour of compute.
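
The arithmetic behind these figures is simple enough to script when comparing quotes. The sketch below reproduces the numbers above; the function names are illustrative, and the model deliberately ignores electricity, failed generations, and differences in generation speed between GPUs.

```python
def api_cost(seconds, price_per_sec):
    """Cost in USD of generating `seconds` of video at a per-second price."""
    return seconds * price_per_sec

def amortization_minutes(hardware_usd, api_price_per_sec):
    """Minutes of generated video at which hardware cost equals API spend."""
    return hardware_usd / (api_price_per_sec * 60)

def breakeven_output_seconds(rental_usd_per_hour, api_price_per_sec):
    """Seconds of output per rented hour needed to match API pricing."""
    return rental_usd_per_hour / api_price_per_sec

print(f"30s at $0.10/s: ${api_cost(30, 0.10):.2f}")
print(f"30s at $0.50/s: ${api_cost(30, 0.50):.2f}")
for price in (0.08, 0.10, 0.15):
    minutes = amortization_minutes(2000, price)
    print(f"$2,000 GPU vs ${price:.2f}/s API: {minutes:.0f} min of video")
print(f"A100 at $1.50/hr vs $0.05/s: "
      f"{breakeven_output_seconds(1.50, 0.05):.0f}s of output per hour")
```

Plugging in your own hardware price and the per-second rate of the API you would otherwise use gives a quick first estimate of whether self-hosting is worth investigating.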

AUDIO

Audio Capability Matters

Only Sora 2, Veo 3.1, and Wan 2.5 generate native audio. Everyone else needs a separate TTS + foley + music pass. Native audio cuts post-production from hours to minutes for short-form content.

QUALITY

Top-Tier Quality (VBench)

As of early 2026: Sora 2 Pro and Veo 3.1 lead, followed closely by Kling 2.5 and Runway Gen-4. Wan 2.5 and HunyuanVideo are the best open-source models, within 5-10% of Kling on most VBench metrics.

VOLUME

High-Volume Pipelines

Generating 1000+ clips/day? Hailuo 02 ($0.03-0.08/s) or self-hosted LTX-Video on 4-8 GPUs. Closed APIs hit rate limits and budgets fast. Open-source wins on $/clip at scale.

ENTERPRISE

Enterprise Compliance

Vertex AI Veo 3.1 is the only option with VPC-SC, IAM, audit logs, and data residency guarantees out of the box. Open-source on private VPC closes the rest. Sora 2 has enterprise tier with stricter data handling.

PROVENANCE

Watermarking & Provenance

Sora embeds C2PA provenance metadata and Veo embeds SynthID watermarks. Most open-source models embed neither, so you must add C2PA tags downstream if your platform requires "AI-generated" disclosure (EU AI Act, YouTube, Meta).

DECISION

Quick Decision Tree

Need audio + best quality? Sora 2 / Veo 3.1. Need character consistency? Runway Gen-4. Cinematic action? Kling 2.5. Bulk b-roll cheap? Hailuo 02. Privacy + customization? Self-host Wan 2.5 or HunyuanVideo on 5090.
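
For teams that route requests programmatically, the decision tree above can be written as a small function. This is a sketch: the parameter names and the order in which the checks take precedence (privacy first, then audio, and so on) are assumptions, since the guide does not rank the criteria against each other.

```python
def pick_model(need_audio=False, need_character_consistency=False,
               cinematic_action=False, bulk_cheap=False,
               private_or_custom=False):
    """Return the guide's suggested model(s) for a set of requirements.

    The precedence of the checks is an assumption made for this sketch.
    """
    if private_or_custom:
        return "Wan 2.5 or HunyuanVideo (self-hosted)"
    if need_audio:
        return "Sora 2 / Veo 3.1"
    if need_character_consistency:
        return "Runway Gen-4"
    if cinematic_action:
        return "Kling 2.5"
    if bulk_cheap:
        return "Hailuo 02"
    return "Sora 2 / Veo 3.1"  # default: best overall quality

print(pick_model(need_character_consistency=True))
print(pick_model(bulk_cheap=True))
print(pick_model(private_or_custom=True, need_audio=True))
```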

8. Code: Runway SDK + ComfyUI

Two integration paths cover 90% of production work: the Runway SDK (Python/JavaScript) for closed-API workflows, and ComfyUI (visual graph) for self-hosted open-source models. The Runway SDK is the most polished commercial video API; ComfyUI is the de-facto standard for open-source video generation, with custom nodes for every model in this guide.

For the OpenAI Sora 2 API and Google Veo 3.1, the OpenAI Python SDK and Google's google-genai SDK follow the same pattern as the Runway example below: submit a job, poll the status, download the output. ComfyUI workflows can also be invoked headlessly via its HTTP and websocket API for fully programmatic pipelines.
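
Since every provider uses the same submit, poll, download shape, the polling step can be factored into one helper with a timeout, so a stalled job cannot block a pipeline forever. The helper below is a sketch; `poll_until` and its parameters are names chosen here, not part of any SDK, and the stand-in job only simulates status changes.

```python
import time

def poll_until(fetch, is_terminal, interval=5.0, timeout=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until is_terminal(result) is true or timeout elapses.

    sleep and clock are injectable so the helper can be tested without
    real waiting.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if is_terminal(result):
            return result
        if clock() >= deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        sleep(interval)

# Stand-in job that completes on the third poll.
statuses = iter(["queued", "in_progress", "completed"])
final = poll_until(
    fetch=lambda: next(statuses),
    is_terminal=lambda status: status in ("completed", "failed"),
    interval=0.0,
)
print(final)
```

With a real client, `fetch` would wrap the provider's retrieve call (for example a lambda around `client.tasks.retrieve(task.id)`) and `is_terminal` would inspect the returned status field.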

RUNWAY SDK

Runway Gen-4 Image-to-Video (Python)

Submit an image + prompt, poll for completion, download the MP4. Async-friendly with SSE events. Install: pip install runwayml.

from runwayml import RunwayML
import time, requests, base64

client = RunwayML()  # uses RUNWAYML_API_SECRET

# Encode source image as data URI
with open("hero.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()
prompt_image = f"data:image/jpeg;base64,{img_b64}"

# Submit Gen-4 image-to-video job
task = client.image_to_video.create(
    model="gen4_turbo",
    prompt_image=prompt_image,
    prompt_text="Slow camera dolly-in, warm sunset light, "
                "subject smiles, hair moves in the wind",
    duration=5,             # 5 or 10 seconds
    ratio="1280:720",
)

# Poll until the task reaches a terminal state
while True:
    task = client.tasks.retrieve(task.id)
    if task.status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

if task.status == "SUCCEEDED":
    url = task.output[0]  # first output URL
    with open("out.mp4", "wb") as f:
        f.write(requests.get(url, timeout=60).content)
    print("Saved out.mp4")
else:
    print(f"Task ended with status {task.status}")

SORA 2 API

OpenAI Sora 2 Generate (Python)

OpenAI Videos API. Submit prompt, poll job, fetch video. Native audio is on by default in Sora 2.

from openai import OpenAI
import time

client = OpenAI()  # uses OPENAI_API_KEY

# Parameter names follow the OpenAI Python SDK at the time of writing
job = client.videos.create(
    model="sora-2",
    prompt="A golden retriever surfs a small wave at sunset, "
           "GoPro POV, water splashes, dog barks happily",
    size="1280x720",
    seconds="8",  # clip length is passed as a string
    # input_reference=open("hero.png", "rb"),  # optional I2V
)

# Poll until the job reaches a terminal state
while True:
    job = client.videos.retrieve(job.id)
    if job.status in ("completed", "failed"):
        break
    time.sleep(5)

if job.status == "completed":
    # The MP4 is fetched through the API rather than from a public URL
    content = client.videos.download_content(job.id, variant="video")
    content.write_to_file("sora.mp4")

VEO 3.1

Google Veo 3.1 (Vertex AI)

Long-running operation pattern. Install: pip install google-genai. Auth via gcloud auth application-default login.

from google import genai
from google.genai import types
import time

client = genai.Client(
    vertexai=True, project="my-proj", location="us-central1"
)

op = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    prompt='A barista pulls an espresso shot, steam rises, '
           'narrator says "Single origin, slow extracted."',
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        duration_seconds=8,
        number_of_videos=1,
        generate_audio=True,        # native audio
    ),
)

while not op.done:
    time.sleep(10)
    op = client.operations.get(op)

video = op.response.generated_videos[0].video
# On Vertex AI the bytes come back inline (or as a GCS URI if
# output_gcs_uri is set); client.files.download() is Gemini API only.
video.save("veo.mp4")

COMFYUI

ComfyUI: Wan 2.5 / HunyuanVideo

ComfyUI is the standard graph-based UI for open-source video. Install custom nodes for the model, drop the workflow JSON, hit Queue.

# Install ComfyUI on a Linux box with RTX 4090/5090
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI && pip install -r requirements.txt

# Install model-specific nodes via ComfyUI Manager
# - ComfyUI-WanVideoWrapper      (Wan 2.5)
# - ComfyUI-HunyuanVideoWrapper  (HunyuanVideo)
# - ComfyUI-LTXVideo             (LTX-Video)
# - ComfyUI-MochiWrapper         (Mochi 1)
# - ComfyUI-CogVideoXWrapper     (CogVideoX-5B)

# Download weights to models/diffusion_models/
huggingface-cli download Wan-AI/Wan2.5-T2V-14B \
  --local-dir models/diffusion_models/Wan2.5

# Launch (0.0.0.0 exposes the UI to the whole network;
# use 127.0.0.1 unless you need remote access)
python main.py --listen 0.0.0.0 --port 8188

# Drag a workflow .json onto the UI:
# https://github.com/kijai/ComfyUI-WanVideoWrapper/tree/main/example_workflows

DIFFUSERS

Diffusers: HunyuanVideo (Python)

Pure-Python pipeline for batch jobs and CI/CD. Memory optimizations let HunyuanVideo run on a single 24GB card.

import torch
from diffusers import HunyuanVideoPipeline
from diffusers.utils import export_to_video

pipe = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",  # Diffusers-format weights
    torch_dtype=torch.bfloat16,
)
pipe.vae.enable_tiling()
pipe.enable_sequential_cpu_offload()  # trades speed for fitting on 24GB VRAM

video = pipe(
    prompt="A cat dressed as a wizard casts a sparkling spell, "
           "soft volumetric lighting, cinematic 35mm",
    height=720, width=1280,
    num_frames=85,        # must be 4k+1; ~3.5s at 24fps
    num_inference_steps=30,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "wizard_cat.mp4", fps=24)

COMFYUI API

ComfyUI Headless API (Python)

Drive ComfyUI from production code via its websocket+HTTP API. Submit a workflow JSON, get the output filename, fetch the MP4.

import json, uuid
import requests
import websocket  # pip install websocket-client

CUI = "http://localhost:8188"
client_id = str(uuid.uuid4())

# Load a workflow exported via "Save (API Format)" in ComfyUI
with open("ltx_video_i2v.json") as f:
    workflow = json.load(f)

# Override the prompt and source image (node IDs depend on your workflow)
workflow["6"]["inputs"]["text"] = "Subject smiles, sunset light"
workflow["10"]["inputs"]["image"] = "hero.jpg"

# Connect before queueing so no status message is missed
ws = websocket.create_connection(
    f"ws://localhost:8188/ws?clientId={client_id}")

# Queue the prompt
r = requests.post(f"{CUI}/prompt",
    json={"prompt": workflow, "client_id": client_id})
prompt_id = r.json()["prompt_id"]

# Wait for completion: ComfyUI sends "executing" with node=None when done
while True:
    raw = ws.recv()
    if not isinstance(raw, str):
        continue  # binary frames carry preview images
    msg = json.loads(raw)
    data = msg.get("data", {})
    if msg.get("type") == "executing" and data.get("node") is None \
       and data.get("prompt_id") == prompt_id:
        break
ws.close()

# Fetch the resulting video from /history
hist = requests.get(f"{CUI}/history/{prompt_id}").json()
# Node "12" and the "gifs" key depend on the workflow's video output node
video = hist[prompt_id]["outputs"]["12"]["gifs"][0]
params = {"filename": video["filename"],
          "subfolder": video["subfolder"], "type": "output"}
with open("out.mp4", "wb") as f:
    f.write(requests.get(f"{CUI}/view", params=params).content)
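
Hard-coded node IDs such as "6" and "10" change whenever the graph is edited and re-exported. One way to make the override step less brittle is to look nodes up by their class_type, which every node in an API-format export carries. The helper names below are illustrative, and the stand-in workflow is a two-node fragment rather than a real LTX-Video graph.

```python
def nodes_of_type(workflow, class_type):
    """Return (node_id, node) pairs whose class_type matches."""
    return [
        (node_id, node)
        for node_id, node in workflow.items()
        if node.get("class_type") == class_type
    ]

def set_input(workflow, class_type, key, value):
    """Set inputs[key] on the single node of the given class_type."""
    matches = nodes_of_type(workflow, class_type)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one {class_type} node, found {len(matches)}")
    node_id, node = matches[0]
    node["inputs"][key] = value
    return node_id

# Tiny stand-in workflow in ComfyUI API format.
workflow = {
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "old prompt"}},
    "10": {"class_type": "LoadImage", "inputs": {"image": "old.jpg"}},
}
changed = set_input(workflow, "LoadImage", "image", "hero.jpg")
print(changed, workflow[changed]["inputs"])
```

Real graphs usually contain two CLIPTextEncode nodes (positive and negative prompt), so the strict single-match check will raise for those; in that case select among the results of `nodes_of_type` by another field such as the node title.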

Related Technologies