China's Catnip Releases Streaming Audio-Video Model MaineCoon

2026-06-21 09:49


en.Wedoany.com Reported - Chinese startup Catnip recently launched MaineCoon, a streaming audio-video model that generates synchronized audio and video in real time for 30 minutes or longer. It reaches an inference speed of 47.5 FPS on a single H100 GPU, at a cost that can be kept under $0.001 per second.

MaineCoon was developed by Catnip, a China-headquartered startup with a team of just 10 people. The project officially started in March of this year, and three core researchers delivered the full stack (architecture design, model training, data infrastructure, and inference system) within two months.

Unlike traditional audio-video generation models, MaineCoon is the first to focus its application scenario on social interaction. The model generates and plays back content simultaneously, outputs audio and video together, and can show the first frame within one second of receiving an instruction. Under full GPU load, the inference cost can be reduced to $0.00025 per second, which is 1/2000 of the cost of Veo 3 and 1/560 of the cost of Seedance. The model has 22B parameters, runs stably on a single H100, and maintains a real-time speed of over 30 FPS even on the lower-cost RTX Pro 6000 inference card.
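The quoted cost ratios can be turned into absolute figures with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check using the numbers in this article; the function names are illustrative, and the implied competitor prices are derived from the stated ratios rather than taken from any published price list.

```python
# Figures quoted in the article (USD per second under full GPU load).
MAINECOON_COST_PER_SEC = 0.00025
# Stated ratios: MaineCoon costs 1/2000 of Veo 3 and 1/560 of Seedance.
COST_RATIOS = {"Veo 3": 2000, "Seedance": 560}


def implied_cost_per_sec(model: str) -> float:
    """Per-second cost implied for a competitor by the article's ratio."""
    return MAINECOON_COST_PER_SEC * COST_RATIOS[model]


def stream_cost(minutes: float) -> float:
    """Cost of a continuous MaineCoon stream of the given length."""
    return minutes * 60 * MAINECOON_COST_PER_SEC


def frame_time_ms(fps: float) -> float:
    """Average generation time per frame at a given throughput."""
    return 1000.0 / fps


if __name__ == "__main__":
    for name in COST_RATIOS:
        print(f"{name}: ${implied_cost_per_sec(name):.2f} per second (implied)")
    print(f"30-minute stream: ${stream_cost(30):.2f}")
    print(f"H100 at 47.5 FPS: {frame_time_ms(47.5):.1f} ms per frame")
```

Under these figures, a 30-minute stream comes to about $0.45, and the ratios imply roughly $0.50 per second for Veo 3 and $0.14 per second for Seedance.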

The Catnip team detailed the training and inference architecture of MaineCoon in a technical report. The training framework is divided into three stages. Self-Resampling addresses the gap between training and inference. Representation Alignment accelerates the convergence of joint audio-video training using a frozen pre-trained V-JEPA 2 visual encoder. Domain-Aware Preference Optimization (DPO), combined with Reinforced Online Policy Distillation (ROPD), trains specialized preference expert models for different social scenarios. The entire model was trained on 64 H100 GPUs using fewer than 1 million data points, consuming 10,000 GPU hours.
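To put the training budget in perspective, the reported GPU hours can be converted into elapsed time. This is a rough estimate that assumes all 64 GPUs ran concurrently for the whole job and treats "10k" as exactly 10,000; neither detail is confirmed by the report as summarized here.

```python
GPU_COUNT = 64      # H100 GPUs, as reported
GPU_HOURS = 10_000  # total GPU hours, as reported ("10k"), treated as exact


def wall_clock_hours(gpu_hours: float, gpu_count: int) -> float:
    """Elapsed hours if every GPU is busy for the whole run (an assumption)."""
    return gpu_hours / gpu_count


hours = wall_clock_hours(GPU_HOURS, GPU_COUNT)
days = hours / 24
print(f"{hours:.2f} hours, about {days:.1f} days")  # 156.25 hours, about 6.5 days
```

On that assumption, the full training run fits in under a week of continuous cluster time.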

On the inference side, MaineCoon adopts an agentic inference framework composed of three independent controllers. The Director handles narrative and error correction, generating structured prompts beat by beat through a planner and monitoring generation quality through an observer. The Cache Manager manages the retention and eviction strategies of the KV cache, treating character appearances and scene-establishing frames as long-term memory anchors. The Buffer Controller manages the look-ahead buffer, balancing real-time performance against interactive responsiveness.
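The Cache Manager's behavior of keeping anchor frames while discarding other entries can be illustrated with a toy eviction policy. The sketch below is not Catnip's implementation: the class name, the fixed entry-count capacity, and the oldest-first eviction rule are all assumptions made for illustration, and the payloads are placeholder strings rather than real key/value tensors.

```python
from collections import OrderedDict


class AnchoredKVCache:
    """Toy cache index: anchor frames are pinned, others are evicted oldest-first."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        # frame_id -> (kv_payload, is_anchor), kept in insertion order
        self._entries = OrderedDict()

    def add(self, frame_id, kv_payload, is_anchor=False):
        """Insert or refresh an entry, then evict if over capacity."""
        self._entries[frame_id] = (kv_payload, is_anchor)
        self._entries.move_to_end(frame_id)
        self._evict()

    def _evict(self):
        # Drop the oldest non-anchor entries until the cache fits.
        while len(self._entries) > self.capacity:
            victim = next(
                (fid for fid, (_, anchor) in self._entries.items() if not anchor),
                None,
            )
            if victim is None:
                break  # only anchors remain; a real system needs a policy here
            del self._entries[victim]

    def frame_ids(self):
        return list(self._entries)


cache = AnchoredKVCache(capacity=3)
cache.add(0, "kv0", is_anchor=True)  # e.g. a scene-establishing frame
for i in range(1, 6):
    cache.add(i, f"kv{i}")           # ordinary frames
print(cache.frame_ids())             # [0, 4, 5]
```

The anchor frame survives even though it is the oldest entry, while ordinary frames behave like a sliding window. A production system would more likely budget by memory or token count than by entry count.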

The Catnip team also built the first dedicated benchmark for social short videos, SocialVideo Bench, covering seven major scenarios: dense speech, two-person interaction, music singing, emotional performance, dance, creative challenges, and social memes. Evaluation shows that MaineCoon achieved a comprehensive score of 0.934, surpassing seven mainstream audio-video generation models including SoulX-FlashTalk (0.895).

The Catnip team first proposed the concept of a "Social World Model," which they believe comprises three layers: the perception layer (understanding user emotions), the simulation layer (predicting social behavior), and the rendering layer (real-time audio-video generation). MaineCoon is seen as a breakthrough at the rendering layer. The team plans to move beyond the traditional half-duplex interaction mode of AI dialogue toward human-like real-time bidirectional interaction that is continuous, interleaved, and multimodal, and to promote the model's deployment as an interactive content platform.

Team founder Yang Shurui, a serial entrepreneur, previously worked at TikTok and PixVerse, where he was responsible for launching viral template effects. Chief Scientist Xie Zeke is an Assistant Professor at the Hong Kong University of Science and Technology (Guangzhou), with a bachelor's degree from the University of Science and Technology of China and a Ph.D. from the University of Tokyo. He previously participated in cutting-edge large model research at Baidu Research and has long served as an Area Chair for top AI conferences such as NeurIPS, ICLR, and ICML. Other team members are primarily recent graduates.

The Catnip team previously released the technical report on the social platform X, where it quickly garnered widespread attention and prompted LTX officials to reach out about a collaboration. The team revealed that it had secured angel round financing from investment institutions such as Sequoia and Mingshi Capital earlier this year.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com
