This comparison draws on a catalogue of 105 documented, observable failure modes across 9 AI video models, not an invented pass-rate score. It shows where each model is known to break so you can pick by track record.


Which AI video model generates native audio with the video?

Veo is the only covered model with native audio generation; the rest produce silent video that needs separate audio.

Last updated June 16, 2026 · Methodology: documented-failure-mode catalogue, not invented scores

Short answer

Veo (Google Veo 3) is the only covered AI video model with native audio generation — it produces synchronized sound with the video. Every other model outputs silent video that needs a separate audio pass. If native audio matters to your workflow, Veo is currently the only documented option.

Native audio is a clear-cut capability question rather than a consistency one: Veo generates sound with the clip, the rest do not. That makes Veo the default for talking-head or ambient-sound work where syncing audio separately is friction. Its documented weak spots are elsewhere — long multi-instruction prompts, where camera directions get dropped — so pick Veo for audio, then keep prompts short to stay inside its reliable zone.


Full context

Documented failure profile, every model

Model	Documented failure modes	Holds best on	Documented weak spot
Veo (Google Veo 3)	13	native audio, single-shot photoreal, lighting	long-prompt instruction drop, camera-motion-ignored on locked-off shots
Runway (Runway Gen-4)	13	character identity across cuts (Scenes mode)	hand anatomy on close-ups, prompt-ignored on dense prompts
Sora (OpenAI Sora 2)	12	stylized motion (historically)	camera-control failures, multi-character interaction
Seedance (ByteDance Seedance)	12	short stylized clips	style-preset drift, motion drift over long clips
Luma (Luma Dream Machine Ray-2)	12	lighting realism, atmospheric single takes	identity drift past ~3 cuts, camera-path drift
Vidu	11	reference-to-video character carry	motion plausibility, color drift
Pika (Pika 2.0)	11	stylized short-form, the closest Sora-style substitute	face distortion on long clips, motion failures
Kling (Kling 1.6)	11	human motion on simple single-subject shots	motion-blur overload, prompt adherence on complex scenes
Hailuo (Hailuo MiniMax)	10	expressive faces on close-ups	camera-shake artifacts, physics collapse
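
The table above can also be read as data. The following is a minimal sketch, assuming a hypothetical `CATALOGUE` dict and a hypothetical `models_with_strength` helper (neither belongs to any published tool); the counts and "holds best on" strings are copied from the table.

```python
# Hypothetical encoding of the failure-profile table. The structure and
# names are illustrative; the values are copied from the table.
CATALOGUE = {
    "Veo": {"modes": 13, "holds": ["native audio", "single-shot photoreal", "lighting"]},
    "Runway": {"modes": 13, "holds": ["character identity across cuts (Scenes mode)"]},
    "Sora": {"modes": 12, "holds": ["stylized motion (historically)"]},
    "Seedance": {"modes": 12, "holds": ["short stylized clips"]},
    "Luma": {"modes": 12, "holds": ["lighting realism", "atmospheric single takes"]},
    "Vidu": {"modes": 11, "holds": ["reference-to-video character carry"]},
    "Pika": {"modes": 11, "holds": ["stylized short-form", "the closest Sora-style substitute"]},
    "Kling": {"modes": 11, "holds": ["human motion on simple single-subject shots"]},
    "Hailuo": {"modes": 10, "holds": ["expressive faces on close-ups"]},
}


def models_with_strength(keyword):
    """Return the models whose 'holds best on' entries mention the keyword."""
    keyword = keyword.lower()
    return [
        name
        for name, row in CATALOGUE.items()
        if any(keyword in item.lower() for item in row["holds"])
    ]


# Sum of the per-model documented failure-mode counts.
total_modes = sum(row["modes"] for row in CATALOGUE.values())
```

With this encoding, `models_with_strength("native audio")` returns only Veo, and the per-model counts sum to the 105 failure modes the catalogue reports.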

Which model holds…

Pick by the thing that has to stay consistent

Holds a consistent face across cuts

Runway Gen-4

Runway Gen-4, in Scenes mode, is the only covered model documented as built specifically to hold a character across multiple cuts; the others drift after a few cuts.

Holds readable on-screen text

No model reliably does

Text rendering is a documented failure mode for every covered model — all garble past roughly six characters. Add text in post instead of relying on the model.

Holds correct hands in close-up

No model reliably does

Hand-anatomy failure is documented across every model. Frame hands away from camera or expect to re-roll; no model has solved close-up finger topology.

Holds cinematic lighting in a single take

Luma Ray-2

Luma Ray-2 has the fewest documented lighting-related failures and leads on photoreal cinematic light for mood-led single-shot work.

Holds long, multi-instruction prompts

No model reliably does

Every model has a documented instruction-drop or prompt-adherence failure that worsens as prompt length grows. Front-load must-haves and keep prompts short.
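
One way to apply that advice is to put must-have clauses first and stop adding optional detail once a length budget is reached. This is a minimal sketch, assuming a hypothetical `build_prompt` helper and an arbitrary 40-word default budget; the catalogue does not state a specific length threshold.

```python
def build_prompt(must_haves, nice_to_haves, max_words=40):
    """Join clauses with must-haves first; skip optional detail past the budget.

    max_words is an assumed budget, not a documented limit. Must-have
    clauses are always kept, even if they exceed it on their own.
    """
    clauses = list(must_haves)
    used = sum(len(clause.split()) for clause in clauses)
    for clause in nice_to_haves:
        words = len(clause.split())
        if used + words > max_words:
            break  # stop adding optional detail once the budget is hit
        clauses.append(clause)
        used += words
    return ", ".join(clauses)
```

For example, `build_prompt(["a woman walking"], ["soft morning light"], max_words=5)` keeps only the must-have clause, because adding the optional one would exceed the five-word budget.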
