Enabling Stable Long Human Animation with Any Video DiT

Recent DiT-based human animation models deliver impressive results—but only for a few seconds.

Our Lookahead Anchoring is a simple yet effective method that keeps generation stable against error accumulation and identity drift, even within straightforward temporal autoregressive approaches.

HunyuanAvatar* / HunyuanAvatar* + Ours
(* HunyuanAvatar is modified to take 5 starting frames for the autoregressive generation scheme.)

OmniAvatar / OmniAvatar + Ours


Keyframe Ahead of the Generation Window

Traditional keyframe-based approaches follow a two-step pipeline: keyframe generation → video interpolation, where keyframes typically mark the start and end of each segment.

When audio drives the generation, keyframes must be audio-synced. This rigid constraint forces the video to hit exact facial poses at fixed timestamps, limiting motion expressiveness and dynamics.

Distant keyframes eliminate the audio-sync stage.

But what if keyframes don't have to be in the generated video? What if we place them beyond the segment endpoints—further ahead in time?

We explore this idea. When keyframes are placed ahead of the generation window, they transform from rigid constraints into soft directional guidance. This dramatically simplifies the framework while delivering superior performance.
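
To make the idea concrete, here is a minimal sketch of an autoregressive loop with a lookahead anchor. Everything here is illustrative: `model.generate_segment`, its arguments, and the default window and lookahead lengths are our own placeholders standing in for whatever interface the underlying video DiT exposes.

```python
# Minimal sketch of lookahead anchoring inside a temporal autoregressive loop.
# `model.generate_segment`, its arguments, and the default window/lookahead lengths
# are hypothetical placeholders, not a real API of any of the models shown above.

def animate(model, reference_image, audio_features,
            num_frames, window=48, lookahead=72, context_frames=5):
    video = []
    context = [reference_image] * context_frames      # starting motion frames
    t = 0
    while t < num_frames:
        # The anchor sits `lookahead` frames beyond the end of the current window,
        # so it acts as a soft directional target rather than a hard boundary.
        segment = model.generate_segment(
            context=context,                          # last few frames for continuity
            audio=audio_features[t : t + window],     # audio driving this window
            anchor=reference_image,                   # self-keyframing: reuse the reference
            anchor_index=t + window + lookahead,      # where the anchor "lives" in time
        )
        video.extend(segment)
        context = segment[-context_frames:]           # roll the context forward
        t += window
    return video[:num_frames]
```

Because the anchor never falls inside the window being generated, no frame is forced to match it exactly; the model only has to keep moving toward it while following the audio.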

KeyFace (keyframe-based approach) / Ours


Temporal Distance as a Conditioning Strength

Models naturally learn that longer temporal distances allow for greater scene variation. We exploit this prior strategically: distant keyframes provide high-level guidance without imposing strict physical constraints, enabling diverse yet coherent generation.

Temporal distance 𝐷 reflects inter-frame relevance.
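
One way to picture how the temporal distance acts as a conditioning strength is through the anchor's temporal position: the anchor tokens look like ordinary frame tokens, except that they are stamped with an index D frames past the window's end. The sketch below assumes a sinusoidal temporal embedding and a simple token concatenation; both are illustrative assumptions, not the actual architecture.

```python
# Sketch of temporal distance D acting as a conditioning strength. We assume the anchor
# is appended as extra tokens that differ from ordinary frame tokens only in their
# temporal position embedding; the sinusoidal embedding and token layout are
# illustrative assumptions, not the real implementation.

import torch

def temporal_embedding(frame_index: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding over frame indices (dim must be even)."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half).float() * (torch.log(torch.tensor(10000.0)) / half))
    args = frame_index.float()[..., None] * freqs
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

def add_anchor_tokens(window_tokens, anchor_tokens, window_indices, lookahead_D):
    """Stamp anchor tokens with an index D frames past the window's end.

    A small D ties the window tightly to the anchor (strong identity adherence);
    a large D lets the window deviate more (freer, more expressive motion).
    """
    dim = window_tokens.shape[-1]
    anchor_index = torch.full((anchor_tokens.shape[0],),
                              float(window_indices.max()) + lookahead_D)
    return torch.cat(
        [window_tokens + temporal_embedding(window_indices, dim),
         anchor_tokens + temporal_embedding(anchor_index, dim)],
        dim=0,
    )
```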


Abstract

Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures.
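
In practice, this trade-off reduces to a single knob. Reusing the hypothetical `animate` sketch from above (the numeric values are placeholders, not settings reported here):

```python
# Reusing the hypothetical `animate` sketch above; lookahead values are placeholders.
consistent = animate(model, ref_image, audio, num_frames=720, lookahead=24)   # tighter identity adherence
expressive = animate(model, ref_image, audio, num_frames=720, lookahead=120)  # freer, more dynamic motion
```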

Citation


Acknowledgements

TBD. The website template was borrowed from Michaël Gharbi.