📄 DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
👥 Authors: Xu Guo, Fulong Ye, Qichao Sun, Liyang Chen, Bingchuan Li, Pengze Zhang, Jiawei Liu, Songtao Zhao, Qian He, Xiangwang Hou
🏢 Organization: ByteDance
📅 Published: February 12, 2026
🔥 Upvotes: 20
⭐ GitHub Stars: 26
🎯 What This Research Is About
DreamID-Omni is a groundbreaking unified framework that tackles three critical human-centric AI tasks in one system:
- Reference-Based Audio-Video Generation (R2AV): Create videos with synchronized audio from reference images and voice samples
- Video Editing (RV2AV): Edit existing videos while maintaining character identity and audio consistency
- Audio-Driven Video Animation (RA2V): Generate animated videos driven by audio input with lip-sync precision
What makes this especially impressive is its ability to handle multiple characters with distinct identities and voice timbres in a single scene, a challenge that has plagued previous approaches. The sketch below illustrates how one entry point could serve all three task modes.
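A minimal sketch of what a unified three-task interface might look like, assuming each mode consumes a different subset of conditioning signals. All names here (`Request`, `REQUIRED`, `validate`, the file arguments) are hypothetical illustrations; the official code has not been released yet.

```python
# Hypothetical sketch of a unified three-task interface. All names and the
# required-inputs mapping are assumptions for illustration only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    task: str                                 # "R2AV", "RV2AV", or "RA2V"
    prompt: str = ""                          # text description of the scene
    reference_images: Optional[list] = None   # per-character identity refs
    reference_voices: Optional[list] = None   # per-character timbre refs
    source_video: Optional[str] = None        # used only for RV2AV editing
    driving_audio: Optional[str] = None       # used only for RA2V animation

# Which conditioning signals each task mode needs (an assumption based on
# the task descriptions above, not the paper's actual API).
REQUIRED = {
    "R2AV":  {"reference_images", "reference_voices"},
    "RV2AV": {"reference_images", "reference_voices", "source_video"},
    "RA2V":  {"reference_images", "driving_audio"},
}

def validate(req: Request) -> None:
    """Check that the conditioning signals the chosen task mode needs are present."""
    missing = [f for f in REQUIRED[req.task] if getattr(req, f) is None]
    if missing:
        raise ValueError(f"{req.task} requires: {', '.join(missing)}")

# Example: a two-character reference-based audio-video generation request.
validate(Request(
    task="R2AV",
    prompt="Two hosts discuss the weather.",
    reference_images=["host_a.png", "host_b.png"],
    reference_voices=["host_a.wav", "host_b.wav"],
))
```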
💡 Why This Matters
- Commercial-Grade Quality: Achieves state-of-the-art performance that outperforms leading proprietary commercial models, a rare feat for openly published research.
- Unified Framework: Previous solutions treated these tasks separately. DreamID-Omni handles all three in one coherent system, dramatically simplifying workflows.
- Multi-Character Control: Solves the notorious "identity-timbre binding" problem where voices get mixed up between characters in multi-person scenes.
- Precise Control: Offers independent, disentangled control over visual identity and voice characteristics, which is crucial for creative applications.
- Open Source Promise: The team plans to release the code, democratizing access to commercial-grade audio-video generation technology.
🔬 Technical Innovations
1. Symmetric Conditional Diffusion Transformer
Integrates heterogeneous conditioning signals (video, audio, text) through a symmetric conditional injection scheme, ensuring balanced influence from all modalities.
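A minimal sketch of symmetric conditioning, assuming each modality is projected to a shared width and joined into one attention sequence so no condition is privileged. The dimensions and the joint-attention design are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of symmetric condition injection: every modality goes through the
# same project-then-attend path, so no condition structurally dominates.
import torch
import torch.nn as nn

class SymmetricConditionBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, text_dim=768, audio_dim=128, video_dim=512):
        super().__init__()
        # One projection per modality, mapping everything to a shared width.
        self.proj_text = nn.Linear(text_dim, dim)
        self.proj_audio = nn.Linear(audio_dim, dim)
        self.proj_video = nn.Linear(video_dim, dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents, text_tok, audio_tok, video_tok):
        # Inject all conditions symmetrically: every token attends to every
        # other token in a single joint sequence.
        cond = torch.cat([
            self.proj_text(text_tok),
            self.proj_audio(audio_tok),
            self.proj_video(video_tok),
        ], dim=1)
        seq = self.norm(torch.cat([latents, cond], dim=1))
        out, _ = self.attn(seq, seq, seq)
        # Keep only the noisy-latent positions for the residual update.
        return latents + out[:, : latents.shape[1]]

# Smoke test with toy shapes: batch 2, 16 latent tokens.
block = SymmetricConditionBlock()
x = block(torch.randn(2, 16, 512), torch.randn(2, 8, 768),
          torch.randn(2, 20, 128), torch.randn(2, 4, 512))
print(x.shape)  # torch.Size([2, 16, 512])
```

The design point is the symmetry itself: because every condition passes through the same projection-plus-attention path, balanced influence across modalities falls out of the architecture rather than hand-tuned weighting.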
2. Dual-Level Disentanglement Strategy
- Synchronized RoPE: At the signal level, aligns rotary position embeddings across modalities to enforce rigid attention-space binding between each character and their voice, preventing speaker confusion (see the sketch after this list)
- Structured Captions: At the semantic level, establishes explicit attribute-subject mappings ("Person A has deep voice, Person B has high-pitched voice")
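Here is a minimal sketch of how synchronized RoPE could work, assuming a character's face tokens and voice tokens receive the same rotary position indices so attention can bind them. The pairing rule, shapes, and the example caption are illustrative assumptions.

```python
# Sketch of signal-level binding via shared rotary position ids.
import torch

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to x using explicit position ids."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = pos[..., None].float() * freqs          # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Two speakers: speaker A owns positions 0..3, speaker B owns 4..7.
video_pos = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7])   # face tokens, A then B
audio_pos = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7])   # voice tokens, same ids

q = rope(torch.randn(8, 64), video_pos)  # queries from video tokens
k = rope(torch.randn(8, 64), audio_pos)  # keys from audio tokens
# Because A's voice and A's face share position ids, their relative rotary
# offset is zero while the other speaker sits at a distinct offset, which
# lets attention bind each identity to its own timbre.
print((q @ k.T).shape)  # torch.Size([8, 8])

# At the semantic level, a structured caption makes the mapping explicit
# (an illustrative example, not a caption from the paper):
caption = "Person A: deep male voice, gray suit. Person B: high-pitched voice, red dress."
```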
3. Multi-Task Progressive Training
Cleverly uses weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting while harmonizing different objectives.
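A toy sketch of what a progressive multi-task schedule could look like, assuming the weakly-constrained generative prior dominates early stages and decays as the strongly-constrained tasks ramp up. The task names, weights, and stage counts are assumptions, not the paper's actual schedule.

```python
# Sketch of progressive multi-task sampling: the weakly-constrained prior
# ("prior") regularizes early training, then strongly-constrained tasks
# take over. All weights and stages below are illustrative assumptions.
import random

STAGES = [
    {"prior": 0.6, "R2AV": 0.2, "RV2AV": 0.1, "RA2V": 0.1},
    {"prior": 0.3, "R2AV": 0.3, "RV2AV": 0.2, "RA2V": 0.2},
    {"prior": 0.1, "R2AV": 0.3, "RV2AV": 0.3, "RA2V": 0.3},
]

def sample_task(stage: int) -> str:
    """Draw the task for one training step from the stage's mixture."""
    tasks, weights = zip(*STAGES[stage].items())
    return random.choices(tasks, weights=weights, k=1)[0]

for stage in range(len(STAGES)):
    counts = {t: 0 for t in STAGES[stage]}
    for _ in range(1000):  # stand-in for the training steps in this stage
        counts[sample_task(stage)] += 1
    print(f"stage {stage}: {counts}")
```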
🎮 Potential Applications
- Content Creation: Automated video production with synthetic actors and voices
- Film & Animation: Rapid prototyping of scenes with voice-synced characters
- Gaming: Dynamic NPC animations with real-time voice synchronization
- Education: Creating engaging educational videos with virtual instructors
- Marketing: Personalized video ads with controllable characters and voices
- Accessibility: Generating sign language videos or audio descriptions
📊 Performance Highlights
DreamID-Omni achieves comprehensive state-of-the-art performance across:
- ✅ Video quality and consistency
- ✅ Audio generation quality
- ✅ Audio-visual synchronization
- ✅ Multi-character identity preservation
📖 Read Full Paper → 🌐 Project Page → 💻 GitHub (26⭐) →
Curated from Hugging Face daily papers by AMS IT Services AI Research Curator