Designing Neural Synthesizers for Low Latency Interaction
Franco Caspe - Jordie Shier - Mark Sandler - Charalampos Saitis - Andrew McPherson*
Centre for Digital Music - Queen Mary University of London
*Dyson School of Design Engineering - Imperial College
Paper - Code - Plugin

Abstract

Neural Audio Synthesis (NAS) models offer interactive musical control over high-quality, expressive audio generators. While these models can operate in real-time, they often suffer from high latency, making them unsuitable for intimate musical interaction. The impact of architectural choices in deep learning models on audio latency remains largely unexplored in the NAS literature. In this work, we investigate the sources of latency and jitter typically found in interactive NAS models. We then apply this analysis to the task of timbre transfer using RAVE, a convolutional variational autoencoder for audio waveforms introduced by Caillon et al. in 2021. Finally, we present an iterative design approach for optimizing latency. This culminates with a model we call BRAVE (Bravely Realtime Audio Variational autoEncoder), which is low-latency and exhibits better pitch and loudness replication while showing timbre modification capabilities similar to RAVE. We implement it in a specialized inference framework for low-latency, real-time inference and present a proof-of-concept audio plugin compatible with audio signals from musical instruments. We expect the challenges and guidelines described in this document to support NAS researchers in designing models for low-latency inference from the ground up, enriching the landscape of possibilities for musicians.

Introducing BRAVE: A re-design of RAVE for low-latency interaction

BRAVE achieves adequate latency (< 10 ms) and jitter (< 3 ms) by addressing several sources of delay in the model, namely buffering, representation, and cumulative delays (see the paper). This is achieved by removing RAVE's noise generator, using a smaller encoder compression ratio, reducing PQMF attenuation, and training causally. The number of parameters is also reduced to improve the Real-Time Factor. Numbers below the blocks in the architecture diagram denote the compression ratio of the intermediate representations.
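To make the buffering source concrete: a model that compresses audio by a factor Cr must accumulate Cr input samples before it can emit anything, so one frame's duration is a hard latency floor. A minimal sketch (illustrative only, not taken from the released code):

```python
def buffering_latency_ms(compression_ratio: int, sample_rate: int = 44100) -> float:
    """Duration of one input frame in milliseconds: the minimum latency
    the model incurs just by waiting to fill a full frame."""
    return compression_ratio / sample_rate * 1000.0

# RAVE's compression ratio of 2048 alone exceeds the ~10 ms target,
# while a ratio of 128 keeps this floor below 3 ms.
print(f"ratio 2048: {buffering_latency_ms(2048):.1f} ms")  # ~46.4 ms
print(f"ratio  128: {buffering_latency_ms(128):.1f} ms")   # ~2.9 ms
```

This is why reducing the compression ratio from 2048 down to 128 is the first step in the redesign below: no amount of implementation optimization can recover latency spent waiting for input samples.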

Model Summary

| Model | Hidden Sizes | Strides | Dilations | PQMF Att. (dB) | Compression Ratio | Receptive Field (ms) | Parameters (M) |
|---|---|---|---|---|---|---|---|
| RAVE v1 (non-causal) | [64, 128, 256, 512] | [4, 4, 4, 2] | [1, 3, 5] | 100 | 2048 | 1047 | 17.6 |
| c2048_r10 | [64, 128, 256, 512] | [4, 4, 4, 2] | [1, 3, 5] | 100 | 2048 | 1047 | 17.5 |
| c1024_r10 | [64, 128, 256, 512] | [4, 4, 2, 2] | [3, 9, 27] | 100 | 1024 | 1070 | 16.9 |
| c512_r10 | [64, 128, 256, 512] | [4, 2, 2, 2] | [3, 9, 18, 36] | 100 | 512 | 960 | 18.4 |
| c256_r10 | [64, 128, 256, 512] | [2, 2, 2, 2] | [3, 9, 27, 36] | 100 | 256 | 973 | 16.2 |
| c128_r10 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 100 | 128 | 955 | 17.3 |
| c128_r10_p70 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 70 | 128 | 947 | 17.3 |
| c128_r10_p40 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 40 | 128 | 941 | 17.3 |
| c128_r05_p40 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 36] | 40 | 128 | 517 | 15.2 |
| BRAVE | [32, 64, 128, 256] | [2, 2, 2, 1] | [3, 9, 27, 36] | 40 | 128 | 517 | 4.9 |
Models implemented in the paper. All models have a latent vector size of 128. All models except RAVE v1 are causal and omit the noise generator. Receptive fields assume a sample rate of 44.1 kHz.
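The receptive-field column can be understood with the standard formula for stacked 1-D convolutions: each layer widens the field by (kernel − 1) × dilation samples, scaled by the cumulative stride of the layers before it. The sketch below is illustrative only: it assumes a fixed kernel size of 3 per layer, which is not the exact RAVE/BRAVE block structure, so its numbers will not match the table.

```python
def receptive_field_samples(layers, kernel_size=3):
    """Receptive field (in input samples) of stacked 1-D conv layers.

    `layers` is a list of (stride, dilation) pairs, one per layer.
    A fixed kernel size is assumed for illustration; real blocks differ.
    """
    rf = 1     # start from a single input sample
    jump = 1   # cumulative stride seen at the current layer's input
    for stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf

# Fewer, larger strides vs. more, smaller strides with heavier dilation:
shallow = receptive_field_samples([(4, 1), (4, 3), (4, 5)])   # 187 samples
deep    = receptive_field_samples([(2, 3), (2, 9), (2, 27)])  # 259 samples
print(shallow / 44100 * 1000, "ms vs", deep / 44100 * 1000, "ms")
```

This is why the low-compression models in the table compensate with larger dilations: shrinking the strides reduces the compression ratio (and hence buffering latency) but also shrinks the receptive field unless the dilations grow.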

Audio Examples

Here we present audio examples showing the synthesis capabilities of the different models we trained for the paper. We train two families of models, one on the Drumset dataset and one on the Filosax dataset, and perform forward passes over different test sets not seen during training.

Models trained on Filosax dataset.

Reconstructions with varying compression ratio.
Instrument Original RAVE c2048_r10 c1024_r10 c512_r10 c256_r10 c128_r10
Filosax
Svoice
Viola
Reconstructions with the same compression ratio (128) and varying receptive field and PQMF attenuation.
Instrument Original RAVE c128_r10 c128_r10_p70 c128_r10_p40 c128_r05_p40 BRAVE
Filosax
Svoice
Viola
BRAVE is a lightweight version of c128_r05_p40, with half the hidden channels in both the encoder and decoder.

Models trained on Drumset dataset.

Reconstructions with varying compression ratio.
Instrument Original RAVE c2048_r10 c1024_r10 c512_r10 c256_r10 c128_r10
Drumset
Beatbox
Candombe
Reconstructions with the same compression ratio (128) and varying receptive field and PQMF attenuation.
Instrument Original RAVE c128_r10 c128_r10_p70 c128_r10_p40 c128_r05_p40 BRAVE
Drumset
Beatbox
Candombe

Adversarial Training

We illustrate how adversarial training degrades melody rendering, likely due to the relatively small dataset size. However, models with a small compression ratio, including BRAVE, do not seem to suffer from this problem.
We show reconstructions of the Filosax test set by models first trained for 1M steps with multi-scale spectral loss only, denoted (mss only), and then trained for an additional 500k steps with the adversarial objective (1.5M steps total), denoted (adversarial).
Instrument Original RAVE c2048_r10 c1024_r10 c512_r10 c256_r10 c128_r10
Filosax (adversarial)
Filosax (mss only)
Instrument c128_r10_p70 c128_r10_p40 c128_r05_p40 BRAVE
Filosax (adversarial)
Filosax (mss only)