China's Soul App Open-Sources SoulX-Transcriber, a Multi-Person Conversation Transcription Model Integrating Speaker, Timestamp, and Text Recognition
2026-06-03 16:22

en.Wedoany.com Reported - On June 3rd, the Soul App AI team (Soul AI Lab), in collaboration with the ASLP@NPU team from Northwestern Polytechnical University and Moonstep AI, officially open-sourced the end-to-end multi-person conversation transcription model SoulX-Transcriber. Designed for long-audio, multi-speaker dialogue scenarios, this model can directly generate structured results from multi-person conversation audio, including timestamps, speaker identities, and transcribed text.

SoulX-Transcriber targets the difficulties of speech recognition in real-world dialogue. In meetings, podcasts, group chats, customer service quality checks, interviews, and multi-person voice social scenarios, speakers rarely take orderly turns one at a time. Instead, the audio frequently involves rapid speaker turns, interruptions, overlapping speech, confusion between similar voices, background noise, and inaccurate segment boundaries. Traditional approaches typically split the task into a chain of serial modules: voice activity detection, speaker diarization, speaker clustering, and automatic speech recognition. An error in any one module propagates to the modules after it and can be amplified in the final transcript. SoulX-Transcriber instead adopts an end-to-end framework that handles "who is speaking, when they are speaking, and what they are saying" within a unified model, aiming to reduce error propagation in cascaded systems and improve structured understanding in multi-speaker scenarios.
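The error-propagation argument can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that each cascaded stage succeeds independently at a fixed rate; the stage names come from the paragraph above, but the numbers are hypothetical and are not measurements of any real system.

```python
from math import prod

# Hypothetical per-stage success rates (illustrative only, not measured values).
stage_success = {
    "voice_activity_detection": 0.97,
    "speaker_diarization": 0.92,
    "speaker_clustering": 0.95,
    "speech_recognition": 0.93,
}


def pipeline_success(rates):
    """Chance that every stage handles a segment correctly.

    Assumes stage errors are independent, which is a simplification:
    in practice an upstream error often makes downstream errors more likely.
    """
    return prod(rates.values())


overall = pipeline_success(stage_success)
weakest = min(stage_success.values())
print(f"weakest single stage: {weakest:.2f}")
print(f"whole cascade:        {overall:.3f}")
```

Under these assumptions the cascade as a whole is less reliable than its weakest stage, which is the motivation the article gives for handling all three questions in one model.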

According to the open-source repository, SoulX-Transcriber provides downloadable model weights for both Chinese and English and is licensed under Apache 2.0.

From a technical perspective, the model is built on a large audio language model framework and uses a speaker-aware, multi-stage training strategy to strengthen speaker representation, boundary perception, and overlapping speech recognition. According to its technical report, training combines pseudo-labeled real conversation data with simulated multi-person conversation data. The real data preserves the acoustic environment and interaction characteristics of actual recordings, while the controllable simulated data improves speaker differentiation, dialogue structure modeling, and cross-domain generalization. The report evaluates SoulX-Transcriber's multi-person speech transcription on multi-speaker meeting datasets such as AISHELL-4, AliMeeting, and AMI. Internal general-scenario evaluations also cover more complex multi-domain data, including daily conversations, film and television audio, and podcasts. For developers, the model outputs not only standard transcribed text but also speaker labels and time boundaries alongside it, making audio content easier to use for meeting minutes, content moderation, knowledge base organization, customer service analysis, and multimedia retrieval.
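To make the idea of structured output concrete, here is a minimal sketch of how text, speaker labels, and time boundaries might be consumed once they are available together. The `Segment` schema, its field names, and the sample data are assumptions made for illustration; the article does not describe SoulX-Transcriber's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical structured transcription unit (not the model's real schema)."""
    start: float   # seconds from the beginning of the audio
    end: float     # seconds from the beginning of the audio
    speaker: str   # speaker label, e.g. "S1"
    text: str      # transcribed text


def format_timestamp(seconds: float) -> str:
    """Render whole seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def to_minutes(segments):
    """Render segments as time-ordered, speaker-attributed lines."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )


def overlapping_pairs(segments):
    """Return pairs of segments from different speakers whose time spans overlap."""
    ordered = sorted(segments, key=lambda s: s.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # later segments start even later, so none can overlap
            if second.speaker != first.speaker:
                pairs.append((first, second))
    return pairs


# Made-up sample data for illustration.
segments = [
    Segment(0.0, 4.2, "S1", "Let's start with the roadmap."),
    Segment(3.8, 6.0, "S2", "Sorry, one quick question."),
    Segment(6.5, 9.0, "S1", "Go ahead."),
]

print(to_minutes(segments))
print(len(overlapping_pairs(segments)), "overlapping pair(s)")
```

Because each segment carries its own speaker and time span, interruptions such as the second sample segment can be found from the timestamps alone, without re-processing the audio.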

Models of this type hold direct value for voice interaction products and enterprise audio data processing. Many enterprises have accumulated meeting recordings, phone call recordings, training audio, interview materials, podcast content, and customer service dialogues. However, unless the speakers, time segments, and text in these recordings can be identified accurately, the recordings are difficult to turn into searchable, analyzable, and reusable data assets. By converting raw audio into structured results, multi-person conversation transcription models can feed downstream applications such as summary generation, topic extraction, sentiment analysis, knowledge consolidation, and business quality inspection. Soul App itself is built around multi-person voice interaction and social scenarios. The Soul AI Lab's continued open-sourcing of models for voice, digital humans, and podcast generation also suggests that its AI technology roadmap is taking shape around real-time interaction, multimodal expression, and dialogue understanding.
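As a rough sketch of what "searchable and analyzable" can mean in practice, the example below computes per-speaker talk time and runs a keyword search over structured segments. The dictionary keys and the sample call are hypothetical and chosen for illustration; they do not reflect any documented SoulX-Transcriber output.

```python
from collections import defaultdict

# Made-up customer service call in a hypothetical structured format.
segments = [
    {"start": 0.0, "end": 4.0, "speaker": "agent",
     "text": "Thanks for calling, how can I help?"},
    {"start": 4.0, "end": 9.5, "speaker": "customer",
     "text": "I want a refund for my last order."},
    {"start": 9.5, "end": 12.0, "speaker": "agent",
     "text": "I can process that refund now."},
]


def talk_time(segments):
    """Total speaking time in seconds for each speaker."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def search(segments, keyword):
    """Return (start time, speaker) for every segment mentioning the keyword."""
    needle = keyword.lower()
    return [
        (seg["start"], seg["speaker"])
        for seg in segments
        if needle in seg["text"].lower()
    ]


print(talk_time(segments))
print(search(segments, "refund"))
```

Simple aggregates like these are the kind of input that quality inspection or summary generation could build on, since every hit can be traced back to a speaker and a position in the recording.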

From the perspective of the language processing industry, speech recognition is moving from single-sentence transcription toward "real multi-person dialogue understanding." Enterprises and platforms will need not just to convert sound into text, but to restore complex audio into structured content that is traceable, attributable, editable, and searchable. With SoulX-Transcriber open-sourced, researchers and developers can build on it for meeting transcription, long-audio processing, multi-speaker recognition, podcast content structuring, and voice social data analysis. Open questions include stability on real long-form audio, expansion to more languages, robustness in noisy environments, the upper limit on the number of speakers, inference cost, and how well the model integrates with enterprise workflows and content platform systems.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com