By attribute · 9 models · 105 documented failure modes
Which AI video model keeps a consistent face across cuts?
Runway Gen-4, in Scenes mode, is the only documented model built specifically to hold a character across multiple cuts; the others drift after a few cuts.
Short answer
Runway Gen-4 (Scenes mode) is the only documented AI video model built to hold the same character across multiple cuts; other models visibly drift after a few cuts. For multi-shot work with a recurring face it has the strongest documented identity track record, though no model is reliable on extreme close-ups.
Identity holding is the consistency most people actually mean when they ask which model is “consistent.” Runway’s Scenes mode references a locked character, which is why its documented identity failures are narrower than its peers’. The other models can match a face within a single take but drift across cuts, because each clip is generated as a fresh sample. If your project is one continuous shot, the gap narrows; if it spans cuts with the same person, Runway is the documented pick.
See the documented evidence: Runway Gen-4 profile, the full failure catalogue, or the overall consistency ranking.
Full context
Documented failure profile, every model
| Model | Documented modes | Holds best on | Documented weak spot |
|---|---|---|---|
| Google Veo 3 | 13 | native audio, single-shot photoreal, lighting | long-prompt instruction drop, camera-motion-ignored on locked-off shots |
| Runway Gen-4 | 13 | character identity across cuts (Scenes mode) | hand anatomy on close-ups, prompt-ignored on dense prompts |
| OpenAI Sora 2 | 12 | stylized motion (historically) | camera-control failures, multi-character interaction |
| ByteDance Seedance | 12 | short stylized clips | style-preset drift, motion drift over long clips |
| Luma Dream Machine Ray-2 | 12 | lighting realism, atmospheric single takes | identity drift past ~3 cuts, camera-path drift |
| Vidu | 11 | reference-to-video character carry | motion plausibility, color drift |
| Pika 2.0 | 11 | stylized short-form, the closest Sora-style substitute | face distortion on long clips, motion failures |
| Kling 1.6 | 11 | human motion on simple single-subject shots | motion-blur overload, prompt adherence on complex scenes |
| Hailuo MiniMax | 10 | expressive faces on close-ups | camera-shake artifacts, physics collapse |
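For readers who want to filter the profile table programmatically, here is a minimal Python sketch that encodes it as data. The `Profile` type and the `holds_best` helper are hypothetical names introduced for illustration, not part of any published tool; the values are taken from the table, and the keyword match is a plain substring check rather than anything smarter.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    model: str
    modes: int        # number of documented failure modes
    holds_best: str   # "Holds best on" column
    weak_spot: str    # "Documented weak spot" column


# Values taken from the failure-profile table.
PROFILES = [
    Profile("Google Veo 3", 13, "native audio, single-shot photoreal, lighting",
            "long-prompt instruction drop, camera-motion-ignored on locked-off shots"),
    Profile("Runway Gen-4", 13, "character identity across cuts (Scenes mode)",
            "hand anatomy on close-ups, prompt-ignored on dense prompts"),
    Profile("OpenAI Sora 2", 12, "stylized motion (historically)",
            "camera-control failures, multi-character interaction"),
    Profile("ByteDance Seedance", 12, "short stylized clips",
            "style-preset drift, motion drift over long clips"),
    Profile("Luma Dream Machine Ray-2", 12, "lighting realism, atmospheric single takes",
            "identity drift past ~3 cuts, camera-path drift"),
    Profile("Vidu", 11, "reference-to-video character carry",
            "motion plausibility, color drift"),
    Profile("Pika 2.0", 11, "stylized short-form, the closest Sora-style substitute",
            "face distortion on long clips, motion failures"),
    Profile("Kling 1.6", 11, "human motion on simple single-subject shots",
            "motion-blur overload, prompt adherence on complex scenes"),
    Profile("Hailuo MiniMax", 10, "expressive faces on close-ups",
            "camera-shake artifacts, physics collapse"),
]


def holds_best(keyword: str) -> list[str]:
    """Return models whose 'Holds best on' cell mentions the keyword."""
    needle = keyword.lower()
    return [p.model for p in PROFILES if needle in p.holds_best.lower()]


# The per-model counts add up to the 105 documented failure modes.
TOTAL_MODES = sum(p.modes for p in PROFILES)
```

Calling `holds_best("identity")` returns only Runway Gen-4, which matches the short answer above; `holds_best("lighting")` returns both Google Veo 3 and Luma Dream Machine Ray-2.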
Which model holds…
Pick by the thing that has to stay consistent
Holds readable on-screen text
No model reliably does
Text rendering is a documented failure mode for every covered model; all of them garble text past roughly six characters. Add text in post instead of relying on the model.
Holds correct hands in close-up
No model reliably does
Hand-anatomy failure is documented across every model. Frame hands away from camera or expect to re-roll; no model has solved close-up finger topology.
Holds cinematic lighting in a single take
Luma Ray-2
Luma Ray-2 has the fewest documented lighting-related failures and leads on photoreal cinematic light for mood-led single-shot work.
Holds native audio with the video
Veo (Google Veo 3)
Veo is the only covered model with native audio generation; the rest produce silent video that needs separate audio.
Holds long, multi-instruction prompts
No model reliably does
Every model has a documented instruction-drop or prompt-adherence failure that worsens as prompt length grows. Front-load must-haves and keep prompts short.
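The front-loading advice can be sketched as a small prompt builder. This is an illustration only: `build_prompt`, the default `max_words` budget of 40, and the comma-joined format are assumptions made for the example, not documented limits or input formats of any model.

```python
def build_prompt(must_haves: list[str], nice_to_haves: list[str], max_words: int = 40) -> str:
    """Put must-have instructions first, then add optional ones while they fit the budget."""
    parts = list(must_haves)  # must-haves are always kept, in their given order
    used = sum(len(p.split()) for p in parts)
    for extra in nice_to_haves:
        cost = len(extra.split())
        if used + cost > max_words:
            break  # stop at the first optional item that would exceed the budget
        parts.append(extra)
        used += cost
    return ", ".join(parts)


prompt = build_prompt(
    ["woman in red coat", "static camera"],
    ["light rain", "neon signs in background"],
    max_words=8,
)
```

With an 8-word budget, the two must-haves and "light rain" fit, while "neon signs in background" is dropped. The word budget is a crude stand-in for prompt length; the right cutoff would have to be found by testing against the model you use.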