China's JD.com Open-Sources JoyAI-Echo Long Audio-Video Generation Framework
2026-06-04 09:29

en.Wedoany.com Reported - On June 3, JD.com launched JoyAI-Echo, a framework for long audio-video generation, and fully open-sourced its code and weights. The framework introduces a "director assistant" component called Director Agent and a cross-modal audio-video memory bank that continuously saves and recalls character appearance features and speaker timbre during multi-shot generation.

JoyAI-Echo targets the persistent stability problems of long video generation. Current video generation models perform well on short clips, single shots, and single-character scenes. In multi-shot narratives with recurring characters, dialogue, and long running times, however, they often suffer from drifting character appearance, inconsistent timbre, fragmented shot logic, and slow generation. JoyAI-Echo uses its cross-modal audio-video memory bank to record character identity, visual appearance, and audio context, so that later shots can reuse information from earlier ones. The Director Agent handles script, character, and shot decomposition and lets users submit creation and modification requests in natural language, which reduces the cost of repeatedly regenerating an entire piece during long video production.
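The memory-bank idea described above can be illustrated with a toy Python sketch. It is not JoyAI-Echo's implementation: the class names, the plain-list feature vectors, and the running-average update are all assumptions made for illustration, showing only how per-character appearance and timbre features might be saved after one shot and recalled for the next.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CharacterMemory:
    """Hypothetical per-character record: one visual and one audio feature vector."""
    appearance: List[float]
    timbre: List[float]
    shots_seen: int = 1


@dataclass
class MemoryBank:
    """Toy cross-modal memory mapping a character ID to its stored features."""
    entries: Dict[str, CharacterMemory] = field(default_factory=dict)

    def save(self, character_id: str, appearance: List[float], timbre: List[float]) -> None:
        """Store features from a finished shot, averaging with earlier shots."""
        entry = self.entries.get(character_id)
        if entry is None:
            self.entries[character_id] = CharacterMemory(list(appearance), list(timbre))
            return
        n = entry.shots_seen
        # Running mean over all shots seen so far (an assumed update rule).
        entry.appearance = [(old * n + new) / (n + 1)
                            for old, new in zip(entry.appearance, appearance)]
        entry.timbre = [(old * n + new) / (n + 1)
                        for old, new in zip(entry.timbre, timbre)]
        entry.shots_seen = n + 1

    def recall(self, character_id: str) -> Optional[CharacterMemory]:
        """Return stored features for conditioning the next shot, or None if unseen."""
        return self.entries.get(character_id)


# Example: the same character appears in two shots, then is recalled for a third.
bank = MemoryBank()
bank.save("alice", appearance=[1.0, 0.0], timbre=[0.0, 2.0])
bank.save("alice", appearance=[3.0, 0.0], timbre=[0.0, 4.0])
memory = bank.recall("alice")
```

In this sketch a later shot would look up each character before generation and condition on the recalled vectors, while a character with no entry would be generated from the prompt alone.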

According to JD.com's open-source repository, JoyAI-Echo supports minute-level multi-shot audio-video generation, can generate a coherent story from a single JSON prompt, and uses DMD (distribution matching distillation) with few-step inference to speed up generation.

The significance of this framework lies in moving long audio-video generation from "single-shot generation results" toward "sustainable, editable creative workflows." In scenarios such as film pre-visualization, brand marketing videos, digital human content, virtual story creation, and live-streaming short dramas, creators need more than a single generated frame: they need characters to keep a consistent appearance, voice, and narrative style across multiple story segments. JoyAI-Echo integrates audio, video, character memory, shot planning, and conversational editing into a single framework, which helps lower the technical barrier to long-form content production. With the code and weights fully open-sourced, developers can also build on the framework for secondary development, model evaluation, and customization for vertical scenarios, further expanding China's domestic long audio-video generation ecosystem.

The open questions going forward are how well the open-source community adopts the framework, what it actually costs to deploy, how consistent its long videos prove to be, how the interactive editing experience holds up, and how quickly it reaches commercial use. As AI video generation moves from short-clip demonstrations to more complex content production, character memory, voice consistency, shot continuity, and editability will become key points of competition among model frameworks. The open-sourcing of JoyAI-Echo provides a reproducible and scalable technical entry point for the field of long audio-video generation.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com