I have been training models with RAVE and it takes days even on a high-end GPU (depending on the size of your database and the capacity/resolution). I've been using a V100 with 8GB sponsored by Pawsey Supercomputing as part of my research.

The technique is quite different from timbral mapping with LetItBee and NMF-based timbral morphing. I'd say the process is like automating the concatenation of audio chunks in the sample space, using learned filters and inverse convolutions instead of handcrafted features and similarity metrics. RAVE affords end-to-end resynthesis in real time. The main drawback is that finding the sweet spot of the GAN stage is a pain in the neck, and there isn't much guidance on tuning the hyperparameters beyond trial and error, which feels like magic.
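For contrast, here's a toy sketch of the handcrafted-feature side of that comparison (LetItBee-style concatenative matching, hugely simplified): chop a corpus into chunks, describe each with one handcrafted feature (I'm using plain RMS energy as a stand-in), and rebuild a target by picking the nearest corpus chunk per frame. The function names and the RMS feature are my own illustrative choices, not anything from LetItBee or RAVE; RAVE replaces this whole feature-plus-lookup pipeline with a learned encoder/decoder.

```python
# Toy concatenative resynthesis with a handcrafted feature (RMS energy).
# Illustrative sketch only -- not LetItBee's or RAVE's actual algorithm.

def rms(chunk):
    """Root-mean-square energy of one audio chunk (the handcrafted feature)."""
    return (sum(x * x for x in chunk) / len(chunk)) ** 0.5

def chop(signal, size):
    """Split a signal into non-overlapping chunks of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def concatenative_resynth(target, corpus, size=4):
    """Rebuild `target` from `corpus` chunks, matched by nearest RMS."""
    corpus_chunks = chop(corpus, size)
    features = [rms(c) for c in corpus_chunks]
    out = []
    for frame in chop(target, size):
        f = rms(frame)
        # Similarity metric: absolute difference of the handcrafted feature.
        best = min(range(len(features)), key=lambda i: abs(features[i] - f))
        out.extend(corpus_chunks[best])
    return out
```

With a silent target and a corpus containing one loud and one silent chunk, it picks the silent chunk for every frame: `concatenative_resynth([0.0] * 8, [0.5] * 4 + [0.0] * 4)` returns eight zeros. Everything above (feature, metric, chunking) is hand-designed, which is exactly what the learned filters replace.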