DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, Zhizheng Wu
Paper accepted to Interspeech 2025

Abstract

Neural audio codecs serve as the fundamental building blocks for speech language model-based speech generation. To improve speech generation performance, recent neural audio codec SpeechTokenizer proposed to distill the first-layer codec tokens from semantic-rich self-supervised (SSL) representations. Our work improves on this idea of semantically-enhanced audio codec, making it more usable for speech generation. 1) We increase the semantic information accuracy in first-layer codec tokens by proposing a dual encoding method. Dual encoding replaces semantic distillation with a two-stream encoding of SSL and waveform in an end-to-end codec framework. 2) We achieve outstanding low bitrate audio reconstruction quality with DAC-based waveform encoding and adopting larger codebooks. 3) We adopt low frame rates of 25Hz and 12.5Hz to improve speech generation efficiency. The resulting codec model, DualCodec, outperforms existing audio codecs including SpeechTokenizer and Mimi in both audio reconstruction and text-to-speech performance, making it ideal for efficient speech synthesis. We open-source our models and codes.

Architecture of DualCodec

There are two encoding streams in DualCodec: an SSL encoding stream and a waveform encoding stream. The SSL encoding stream captures semantic-rich information to the first-layer codec tokens by directly encoding from SSL feature. The waveform encoding stream encodes and decodes high-quality audio with the proven DAC framework. We apply downsampling to both streams to achieve a low frame rate of 12.5hz or 25hz.

For speech generation model training, both encoding streams are required to obtain training tokens. During SLM inference, only the codec decoder is used to produce audio.

A high-level comparison between codec systems is as follows.

System Semantic Enhancement Audio Quality Frame Rate
Encodec Good 75Hz
SpeechTokenizer ✔ (distill) Good 50Hz
DAC Great 75Hz/50Hz
Mimi ✔ (distill) Good 12.5Hz
DualCodec (our work) ✔ (dual encoding) Great 12.5Hz/25Hz

Low-Bitrate Audio Reconstruction Quality

Neural audio codecs struggle to get high quality audio at low bitrates. In this section, we demonstrate DualCodec's superior audio reconstruction quality compared to other audio codecs.

We select some challenging cases as well as multilingual cases. Note that we only use a small number of quantization levels to compare at low bitrates; in practice, more quantization levels can be used in TTS systems to achieve better audio quality.

Ground-Truth DualCodec 12.5Hz (0.93kbps) DualCodec 25Hz (0.85kbps) Mimi 12.5Hz (0.83kbps) WavTokenizer-large 75Hz (0.9kbps) DAC-official 75Hz (0.75kbps) Encodec 75Hz (1.5kbps) SpeechTokenizer 50Hz (1.0kbps) DAC-reproduced 12.5Hz (0.75kbps) DAC-reproduced 25Hz (0.75kbps)
Lang: EN
Lang: EN
Lang: EN
Lang: EN
Lang: JA
Lang: JA
Lang: ZH
Lang: ZH

Decoding DualCodec's RVQ-1 token (Discretized SSL token)

DualCodec's RVQ-1 tokens are directly extracted from self-supervised (SSL) features. In this section, we demonstrate that DualCodec's RVQ-1 tokens (akin to semantic tokens) has higher intelligibility than semantic-distillation based codec SpeechTokenizer, as well as vanilla audio codec like DAC. Increasing the codebook size can help capture more semantic information.

Ground-Truth DualCodec 12.5Hz 1VQ
Codebook Size=16384
DualCodec 25Hz 1VQ
Codebook Size=16384
SpeechTokenizer 50Hz 1VQ
Codebook Size=1024
DAC-reproduced 12.5Hz 1VQ
Codebook Size=1024
DAC-reproduced 25Hz 1VQ
Codebook Size=1024
DAC-reproduced 25Hz 1VQ
Codebook Size=16384
Lang: EN
Lang: EN
Lang: EN
Lang: EN
Lang: JA
Lang: JA
Lang: ZH
Lang: ZH

Text-to-Speech Performance

Text Reference Audio Ground Truth DualCodec-VALLE 12.5Hz DualCodec-VALLE 25Hz Mimi-VALLE SpeechTokenizer-VALLE DualCodec-AR+SoundStorm 25Hz SpeechTokenizer-AR+SoundStorm Mimi-AR+SoundStorm
Take these capsules over to Mrs. David's house.
He has the ability to change his head by up to eight different types.
They can also hike to the panorama restaurant at the top of the mountain.
My dog is an alsatian, we call her 'Poppy'.
眨眼工夫,几个混混亮明警察身份,一瞬间将其将张某制服。
该系列首秀于品牌在伦敦举办的,二零一七二月时装秀上。
可怕的力量,如惊涛骇浪一般汹涌澎湃而出。
什么打仗啦当海盗呀他全无了兴趣。

Open-Source

We open-source DualCodec 12.5Hz and 25Hz, including training codes. It will be public after the anonymous review period. We are working on open-sourcing DualCodec-based SLMs—DualCodec-VALLE, DualCodec-MaskGCT, and DualCodec-FlattenedAR, with more models in planning. These models offer a range of configurations, including autoregressive and non-autoregressive approaches and different frame-rates, supporting broader exploration of speech language modeling.

TTS Result Visualization
A summary of speech language model performance on Seed-TTS-Eval. We add ``DualCodec-'' prefix for DualCodec-based models. We are open-sourcing DualCodec-VALLE, DualCodec-MaskGCT, and DualCodec-FlattenedAR.

Citation

@inproceedings{dualcodec,
  title     = {DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation},
  author    = {Li, Jiaqi and Lin, Xiaolong and Li, Zhekai and Huang, Shixi and Wang, Yuancheng and 
    Wang, Chaoren and Zhan, Zhenpeng and Wu, Zhizheng},
  booktitle = {Proceedings of Interspeech 2025},
  year      = {2025}
}