Franco Caspe - Jordie Shier - Mark Sandler - Charalampos Saitis - Andrew McPherson*
Centre for Digital Music - Queen Mary University of London
*Dyson School of Design Engineering - Imperial College
Paper - Code - Plugin
Abstract
Neural Audio Synthesis (NAS) models offer interactive musical control over high-quality, expressive audio generators. While these models can operate in real-time, they often suffer from high latency, making them unsuitable for intimate musical interaction. The impact of architectural choices in deep learning models on audio latency remains largely unexplored in the NAS literature. In this work, we investigate the sources of latency and jitter typically found in interactive NAS models. We then apply this analysis to the task of timbre transfer using RAVE, a convolutional variational autoencoder for audio waveforms introduced by Caillon et al. in 2021. Finally, we present an iterative design approach for optimizing latency. This culminates with a model we call BRAVE (Bravely Realtime Audio Variational autoEncoder), which is low-latency and exhibits better pitch and loudness replication while showing timbre modification capabilities similar to RAVE. We implement it in a specialized inference framework for low-latency, real-time inference and present a proof-of-concept audio plugin compatible with audio signals from musical instruments. We expect the challenges and guidelines described in this document to support NAS researchers in designing models for low-latency inference from the ground up, enriching the landscape of possibilities for musicians.
Introducing BRAVE: A re-design of RAVE for low-latency interaction

Model Summary
Model | Hidden Sizes | S: Strides | D: Dilations | PQMF Att. (dB) | Cr: C. Ratios | Rf: Rec. Field (ms) | # Parameters (M) |
---|---|---|---|---|---|---|---|
RAVE v1 (non causal) | [64, 128, 256, 512] | [4, 4, 4, 2] | [1, 3, 5] | 100 | 2048 | 1047 | 17.6 |
c2048_r10 | [64, 128, 256, 512] | [4, 4, 4, 2] | [1, 3, 5] | 100 | 2048 | 1047 | 17.5 |
c1024_r10 | [64, 128, 256, 512] | [4, 4, 2, 2] | [3, 9, 27] | 100 | 1024 | 1070 | 16.9 |
c512_r10 | [64, 128, 256, 512] | [4, 2, 2, 2] | [3, 9, 18, 36] | 100 | 512 | 960 | 18.4 |
c256_r10 | [64, 128, 256, 512] | [2, 2, 2, 2] | [3, 9, 27, 36] | 100 | 256 | 973 | 16.2 |
c128_r10 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 100 | 128 | 955 | 17.3 |
c128_r10_p70 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 70 | 128 | 947 | 17.3 |
c128_r10_p40 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 45, 63] | 40 | 128 | 941 | 17.3 |
c128_r05_p40 | [64, 128, 256, 512] | [2, 2, 2, 1] | [3, 9, 27, 36] | 40 | 128 | 517 | 15.2 |
BRAVE | [32, 64, 128, 256] | [2, 2, 2, 1] | [3, 9, 27, 36] | 40 | 128 | 517 | 4.9 |
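The compression ratios in the table appear to follow from the strides alone. The sketch below assumes a 16-band PQMF front end, which the table does not state, so that the ratio is 16 times the product of the strides. It also converts each ratio into the duration of one latent frame at a hypothetical 44.1 kHz sample rate. The names `compression_ratio` and `block_duration_ms` are illustrative and do not come from the BRAVE code base.

```python
from math import prod

# Assumption: 16 PQMF bands (not stated in the table above).
PQMF_BANDS = 16


def compression_ratio(strides, pqmf_bands=PQMF_BANDS):
    """Input audio samples consumed per latent frame."""
    return pqmf_bands * prod(strides)


def block_duration_ms(ratio, sample_rate=44100):
    """Duration of one latent frame in milliseconds.

    Assumption: the 44.1 kHz default sample rate is hypothetical.
    """
    return 1000.0 * ratio / sample_rate


# Strides copied from the Model Summary table.
configs = {
    "c2048_r10": [4, 4, 4, 2],
    "c1024_r10": [4, 4, 2, 2],
    "c512_r10": [4, 2, 2, 2],
    "c256_r10": [2, 2, 2, 2],
    "BRAVE": [2, 2, 2, 1],
}

for name, strides in configs.items():
    ratio = compression_ratio(strides)
    print(f"{name}: ratio={ratio}, frame={block_duration_ms(ratio):.2f} ms")
```

Under these assumptions the computed ratios match the Cr column for every listed model. The frame duration is only a lower bound on the buffering a model needs before it can emit output; it does not account for the other latency and jitter sources discussed in the paper.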
Audio Examples
Models trained on Filosax dataset.
Reconstructions with varying compression ratio.
Instrument | Original | RAVE | c2048_r10 | c1024_r10 | c512_r10 | c256_r10 | c128_r10 |
---|---|---|---|---|---|---|---|
Filosax | |||||||
Svoice | |||||||
Viola |
Instrument | Original | RAVE | c128_r10 | c128_r10_p70 | c128_r10_p40 | c128_r05_p40 | BRAVE |
---|---|---|---|---|---|---|---|
Filosax | |||||||
Svoice | |||||||
Viola |
Models trained on Drumset dataset.
Reconstructions with varying compression ratio.
Instrument | Original | RAVE | c2048_r10 | c1024_r10 | c512_r10 | c256_r10 | c128_r10 |
---|---|---|---|---|---|---|---|
Drumset | |||||||
Beatbox | |||||||
Candombe |
Instrument | Original | RAVE | c128_r10 | c128_r10_p70 | c128_r10_p40 | c128_r05_p40 | BRAVE |
---|---|---|---|---|---|---|---|
Drumset | |||||||
Beatbox | |||||||
Candombe |
Adversarial Training
We show reconstructions of the Filosax test set by models trained for 1M steps, denoted (mss only), and by the same models trained for an additional 500k steps with adversarial training (1.5M steps in total), denoted (adversarial).
Instrument | Original | RAVE | c2048_r10 | c1024_r10 | c512_r10 | c256_r10 | c128_r10 |
---|---|---|---|---|---|---|---|
Filosax (adversarial) | |||||||
Filosax (mss only) |
Instrument | c128_r10_p70 | c128_r10_p40 | c128_r05_p40 | BRAVE |
---|---|---|---|---|
Filosax (adversarial) | ||||
Filosax (mss only) |