Designing Neural Synthesizers for Low Latency Interaction
Franco Caspe - Jordie Shier - Mark Sandler - Charalampos Saitis - Andrew McPherson*
Centre for Digital Music - Queen Mary University of London
*Dyson School of Design Engineering - Imperial College
Paper - Code - Plugin

Abstract

Neural Audio Synthesis (NAS) models offer interactive musical control over high-quality, expressive audio generators. While these models can operate in real-time, they often suffer from high latency, making them unsuitable for intimate musical interaction. The impact of architectural choices in deep learning models on audio latency remains largely unexplored in the NAS literature. In this work, we investigate the sources of latency and jitter typically found in interactive NAS models. We then apply this analysis to the task of timbre transfer using RAVE, a convolutional variational autoencoder for audio waveforms introduced by Caillon et al. in 2021. Finally, we present an iterative design approach for optimizing latency. This culminates with a model we call BRAVE (Bravely Realtime Audio Variational autoEncoder), which is low-latency and exhibits better pitch and loudness replication while showing timbre modification capabilities similar to RAVE. We implement it in a specialized inference framework for low-latency, real-time inference and present a proof-of-concept audio plugin compatible with audio signals from musical instruments. We expect the challenges and guidelines described in this document to support NAS researchers in designing models for low-latency inference from the ground up, enriching the landscape of possibilities for musicians.

Introducing BRAVE: A re-design of RAVE for low-latency interaction

BRAVE achieves adequate latency (< 10 ms) and jitter (< 3 ms) by addressing several sources of delay in the model, namely buffering, representation, and cumulative delays (see the paper). This is achieved by removing RAVE's noise generator, using a smaller encoder compression ratio, reducing PQMF attenuation, and training causally. The number of parameters is also reduced to improve the Real-Time Factor. Numbers below the blocks in the architecture diagram denote the compression ratio of the intermediate representations.
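To make the buffering source concrete: a model that compresses audio by a factor Cr must accumulate Cr input samples before it can emit anything, so one frame's duration is a hard latency floor. A minimal sketch (illustrative only, not taken from the released code):

```python
def buffering_latency_ms(compression_ratio: int, sample_rate: int = 44100) -> float:
    """Duration of one input frame in milliseconds: the minimum latency
    the model incurs just by waiting to fill a full frame."""
    return compression_ratio / sample_rate * 1000.0

# RAVE's compression ratio of 2048 alone exceeds the ~10 ms target,
# while a ratio of 128 keeps this floor below 3 ms.
print(f"ratio 2048: {buffering_latency_ms(2048):.1f} ms")  # ~46.4 ms
print(f"ratio  128: {buffering_latency_ms(128):.1f} ms")   # ~2.9 ms
```

This is why reducing the compression ratio from 2048 down to 128 is the first step in the redesign below: no amount of implementation optimization can recover latency spent waiting for input samples.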

Model Summary

| Model | Hidden Sizes | Strides | Dilations | PQMF Att. (dB) | Compression Ratio | Receptive Field (ms) | Parameters (M) |
|---|---|---|---|---|---|---|---|
| RAVE v1 (non-causal) | [64, 128, 256, 512] | [4, 4, 4, 2] | [1, 3, 5] | 100 | 2048 | 1047 | 17.6 |
| c2048_r10 | [64, 128, 256, 512] | [4, 4, 4, 2] | [1, 3, 5] | 100 | 2048 | 1047 | 17.5 |
| c1024_r10 | [64, 128, 256, 512] | [4, 4, 2, 2] | [3, 9, 27] | 100 | 1024 | 1070 | 16.9 |
| c512_r10 | [64, 128, 256, 512] | [4, 2, 2, 2] | [3, 9, 18, 36] | 100 | 512 | 960 | 18.4 |
| c256_r10 | [64, 128, 256, 512] | [2, 2, 2, 2] | [3, 9, 27, 36] | 100 | 256 | 973 | 16.2 |
| c128_r10 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 100 | 128 | 955 | 17.3 |
| c128_r10_p70 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 70 | 128 | 947 | 17.3 |
| c128_r10_p40 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 40 | 128 | 941 | 17.3 |
| c128_r05_p40 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 36] | 40 | 128 | 517 | 15.2 |
| BRAVE | [32, 64, 128, 256] | [2, 2, 2, 1] | [3, 9, 27, 36] | 40 | 128 | 517 | 4.9 |
Models implemented in the paper. All models have a latent vector size of 128. All models except RAVE v1 are causal and omit the noise generator. Receptive fields assume a sample rate of 44.1 kHz.
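The receptive-field column can be understood with the standard formula for stacked 1-D convolutions: each layer widens the field by (kernel − 1) × dilation samples, scaled by the cumulative stride of the layers before it. The sketch below is illustrative only: it assumes a fixed kernel size of 3 per layer, which is not the exact RAVE/BRAVE block structure, so its numbers will not match the table.

```python
def receptive_field_samples(layers, kernel_size=3):
    """Receptive field (in input samples) of stacked 1-D conv layers.

    `layers` is a list of (stride, dilation) pairs, one per layer.
    A fixed kernel size is assumed for illustration; real blocks differ.
    """
    rf = 1     # start from a single input sample
    jump = 1   # cumulative stride seen at the current layer's input
    for stride, dilation in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf

# Fewer, larger strides vs. more, smaller strides with heavier dilation:
shallow = receptive_field_samples([(4, 1), (4, 3), (4, 5)])   # 187 samples
deep    = receptive_field_samples([(2, 3), (2, 9), (2, 27)])  # 259 samples
print(shallow / 44100 * 1000, "ms vs", deep / 44100 * 1000, "ms")
```

This is why the low-compression models in the table compensate with larger dilations: shrinking the strides reduces the compression ratio (and hence buffering latency) but also shrinks the receptive field unless the dilations grow.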

Audio Examples

Here we present audio examples showing the synthesis capabilities of the different models we trained for the paper. We train two families of models, one on the Drumset dataset and one on the Filosax dataset, and perform forward passes over different test sets not seen during training.

Models trained on Filosax dataset.

Reconstructions with varying compression ratio.
Instrument Original RAVE c2048_r10 c1024_r10 c512_r10 c256_r10 c128_r10
Filosax
Svoice
Viola
Reconstructions with the same compression ratio (128) and varying receptive field and PQMF attenuation.
Instrument Original RAVE c128_r10 c128_r10_p70 c128_r10_p40 c128_r05_p40 BRAVE
Filosax
Svoice
Viola
BRAVE is a lightweight version of c128_r05_p40, with half the hidden channels in both the encoder and decoder.

Models trained on Drumset dataset.

Reconstructions with varying compression ratio.
Instrument Original RAVE c2048_r10 c1024_r10 c512_r10 c256_r10 c128_r10
Drumset
Beatbox
Candombe
Reconstructions with the same compression ratio (128) and varying receptive field and PQMF attenuation.
Instrument Original RAVE c128_r10 c128_r10_p70 c128_r10_p40 c128_r05_p40 BRAVE
Drumset
Beatbox
Candombe

Adversarial Training

We illustrate how adversarial training degrades melody rendering, likely due to the relatively small dataset size. However, models with a small compression ratio, including BRAVE, do not seem to suffer from this problem.
We show reconstructions of the Filosax test set by models first trained for 1M steps with multi-scale spectral loss only, denoted (mss only), and then trained for an additional 500k steps with the adversarial objective (1.5M steps total), denoted (adversarial).
Instrument Original RAVE c2048_r10 c1024_r10 c512_r10 c256_r10 c128_r10
Filosax (adversarial)
Filosax (mss only)
Instrument c128_r10_p70 c128_r10_p40 c128_r05_p40 BRAVE
Filosax (adversarial)
Filosax (mss only)