CraftStory and the Race to Realistic Long‑Form AI Video
- Aykut Onat
- 6 days ago
- 2 min read
For all the hype around AI video, most models still struggle with the thing businesses actually need: believable people talking, teaching, and presenting on camera for more than a few seconds. Into that gap steps CraftStory, a new AI video startup founded by OpenCV co‑creator Victor Erukhimov, which has just emerged from stealth with Model 2.0—a system that generates realistic, human‑centric videos up to five minutes long. Where rivals like
OpenAI’s Sora and Google’s Veo rely on sequential diffusion (generating video chunk by chunk, each chunk conditioned on the last), CraftStory uses a parallelized diffusion architecture that processes the entire video duration at once under bidirectional constraints. Multiple smaller diffusion processes run in parallel across the full timeline, so later frames can influence earlier ones.
This avoids the artifact accumulation and jitter that plague long clips and enables temporal coherence far beyond the ~10–25 seconds typical today. Instead of scraping the internet, the team trains on proprietary, high‑frame‑rate studio footage of actors. Professional studios shoot with specialized cameras to avoid motion blur in fingers, facial micro‑expressions, and fast gestures, producing crisp motion and detailed gesture reproduction while keeping data and compute requirements relatively modest. Actors whose “driving” performances power the model receive revenue shares when their motion data is used—an unusually creator‑friendly touch.
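The bidirectional idea can be illustrated with a toy sketch. This is not CraftStory's actual algorithm (which is unpublished); it is a minimal numerical analogy in which every chunk of the timeline is refined simultaneously, and each update blends context from both the previous and the next chunk, so information flows backward as well as forward:

```python
import numpy as np

def parallel_denoise(noisy_chunks, steps=50, coupling=0.3):
    """Toy analogy for parallel, bidirectionally constrained refinement.

    All chunks advance one step per iteration (parallel), and each
    chunk's update uses BOTH neighbours (bidirectional), so later
    chunks influence earlier ones -- unlike sequential generation,
    where chunk i only ever sees chunks 0..i-1.
    """
    x = [c.copy() for c in noisy_chunks]
    for _ in range(steps):
        updated = []
        for i, chunk in enumerate(x):
            # Gather bidirectional context: previous AND next chunk.
            left = x[i - 1] if i > 0 else chunk
            right = x[i + 1] if i < len(x) - 1 else chunk
            context = (left + right) / 2
            # One toy "refinement" step pulls the chunk toward a
            # target consistent with both neighbours.
            updated.append(chunk + coupling * (context - chunk))
        x = updated  # every chunk stepped in parallel
    return x
```

Run long enough, the chunks converge toward mutual consistency across the whole timeline, which is the intuition behind avoiding the drift that accumulates when each clip is generated only from what came before.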
Model 2.0 is currently video‑to‑video. Users provide a still image to animate plus a “driving video” of a person. CraftStory offers preset driving clips from its actor pool, or customers can upload their own footage. Features include advanced lip‑sync to a script or audio and gesture alignment that matches speech rhythm and emotional tone. Today, the system produces ~30‑second low‑resolution clips in roughly 15 minutes of processing, with support for videos up to five minutes—far beyond Sora 2’s ~25‑second limit.
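The workflow above (still image + driving video, plus optional script or audio for lip-sync) can be sketched as a job specification. All field and class names here are hypothetical illustrations; CraftStory has not published an API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VideoJob:
    """Hypothetical video-to-video job spec; names are illustrative only."""
    still_image: str               # image of the person to animate
    driving_video: str             # preset actor clip, or user-supplied footage
    script: Optional[str] = None   # text to lip-sync to, or...
    audio: Optional[str] = None    # ...an audio track to sync against
    match_gestures: bool = True    # align gestures to speech rhythm and tone

    def validate(self) -> None:
        # Lip-sync needs something to sync to.
        if self.script is None and self.audio is None:
            raise ValueError("provide a script or an audio track for lip-sync")

job = VideoJob(still_image="presenter.png",
               driving_video="preset_actor_07.mp4",
               script="Welcome to our product launch.")
job.validate()
```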
Backed by a relatively modest $2 million in funding (primarily from Andrew Filev, who sold Wrike to Citrix and now runs the AI coding company Zencoder), CraftStory is deliberately narrow in its focus. Target customers include software and enterprise companies producing training, product, and launch videos, as well as creative agencies seeking faster, cheaper corporate content. The promise: a small business can create in minutes what might previously have cost ~$20,000 and taken two months. Filev expects big labs to become “API providers of powerful, general‑purpose generation models,” while companies like CraftStory build specialized “production studio and assembly line” layers on top. In that spirit, the roadmap includes text‑to‑video generation from scripts and support for moving‑camera, “walk‑and‑talk” style scenarios, ultimately positioning CraftStory as a focused layer atop foundation video APIs rather than as yet another generalist model. Erukhimov’s deep computer vision background in OpenCV and automotive perception shapes CraftStory’s emphasis on motion, facial dynamics, and temporal coherence—areas that sit at the intersection of generative modeling and classic vision.
The result is a bet that parallel diffusion, high‑quality motion data, and a tight enterprise niche will let a small, expert team compete credibly with AI video giants. Model 2.0 is available with early access for users and enterprises at app.craftstory.com/model-2.0. If CraftStory’s thesis is right, the next wave of AI video won’t just be about more spectacular clips—it will be about reliable, long‑form, human‑centric content that quietly replaces the studio for everyday business storytelling.



