drumGAN - how to make boring drum sounds in more steps

Came across this today:

I like the idea here a lot but, as is so painfully often the case, the implemented use case is the most boring shit imaginable.

Even the examples in the video are painfully generic and seem like they could be replaced with a corpus-browsing approach of the same 900k entries.

The code is on GitHub:

But warns that it’s not up to date etc…

It’s interesting too that, since drum sounds are typically short, this kind of stuff is easier to do on them. I imagine it still means running for weeks on a rack of GPUs or something anyway.

I do look forward to when it’s possible to use/implement something like this in a CCE.


You can now with RAVE:


Isn’t this more like the LetItBee-type thing where it produces a timbre map of the input? (as opposed to generating synthetic/static-ish variations à la GAN)

I’m quite over my head in terms of the ML lingo/field though.

They’re not so different in effect (but they are in architecture). Both give you a generative model, meaning they should, in principle, be able to produce new data based on having learned the statistical distributions of features in the training sets (kinda sorta), so one can make draws on those distributions. Let It Bee, by contrast, is trying to find the best way to reconstruct a patch of input using mixtures of stuff from a database that’s basically fixed once it’s made.
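To make that contrast concrete, here is a toy sketch in Python with NumPy. It is neither RAVE nor the Let It Bee code: the "generative" half fits a plain Gaussian to made-up feature vectors and samples from it, and the "fixed database" half finds non-negative mixing weights for a made-up set of stored spectra, in the spirit of NMF with a fixed basis. All data, sizes and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Generative idea: learn a distribution, then draw new points from it ---
# Hypothetical training set: 200 two-dimensional feature vectors.
train = rng.normal(loc=[1.0, 3.0], scale=[0.5, 0.2], size=(200, 2))
mean = train.mean(axis=0)
cov = np.cov(train, rowvar=False)
# New points that were never in the training set, drawn from the fitted Gaussian.
new_points = rng.multivariate_normal(mean, cov, size=5)

# --- Fixed-database idea: rebuild a target as a mixture of stored items ---
# Hypothetical database: 8 stored magnitude spectra with 16 bins each.
database = np.abs(rng.normal(size=(16, 8)))
true_mix = np.abs(rng.normal(size=8))
target = database @ true_mix  # a target that the database can represent

activations = np.full(8, 0.5)  # non-negative starting weights
initial_error = np.linalg.norm(target - database @ activations) / np.linalg.norm(target)

for _ in range(500):
    # Multiplicative update (Euclidean cost) with the database held fixed;
    # it keeps the weights non-negative.
    numer = database.T @ target
    denom = database.T @ (database @ activations) + 1e-12
    activations *= numer / denom

reconstruction = database @ activations
rel_error = np.linalg.norm(target - reconstruction) / np.linalg.norm(target)
print("sampled points:", new_points.shape)
print("relative error before/after:", initial_error, rel_error)
```

In the first half the output is new data; in the second the output can only ever be a weighted sum of what is already stored in `database`.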

That’s right, I have been investigating RAVE, made a Windows build of their nn~ Max object and it works great. But yeah, making your own .ts files takes loads of GPU power and computing time. That whole experience got me interested in GANs, so thanks for posting!


That’s where I got blocked too, though I reckon with a bit of deep massage a Colab notebook could be made to do the training. Something I plan to investigate in the near future for sure.

I’ve been running the RAVE training on the Colab notebook that’s been set up. A few hours a day of runtime is what I get. I would be willing to pay something, but I don’t want Google to have my credit card number. GPU with cuDNN won’t work on my system, on either Linux or Windows (sorry, I know I’ve got off topic!)

How long does something like this typically take to train up? Like ballpark. 20h+ (on a GPU farm)? 2 weeks straight on a laptop?

I wouldn’t even bother on a laptop, unless you had an M1 or more recent and could leverage the neural chippy stuff inside through TensorFlow.

Other than that I think the training at best might just be an overnight job.


I have been training models with RAVE and it takes days even on a high-end GPU (depending on the size of your database and the capacity/resolution). I’ve been using a V100 with 8GB, sponsored by Pawsey Supercomputing, as part of my research. The technique is quite different from timbral mapping with LetItBee and NMF-based timbral morphing. I’d say the process is like automating the concatenation of audio chunks in the sample space using learned filters and inverse convolutions instead of handcrafted features and similarity metrics. RAVE affords end-to-end resynthesis in real time. The main drawback is that finding the sweet spot of the GAN is a pain in the neck, and there is not much guidance on tuning the hyperparameters besides trial and error, which feels like magic.


That’s good to know.

The real-time resynthesis I’m less interested in as (by the sounds of it on the video) there’s a fair amount of latency to the process. I’m more interested in synthesizing hybrid/variations from unusual input sounds.

Yeah - ca. 21 Colab hours and I am about a third of the way done, by my estimate.


You’ve inspired me to try training. They’ve done a great job creating that notebook template to work from. I’ll report back my model here once it finishes training :slight_smile:


Out of curiosity, how many epochs in are you? I’m on 1300 but it shows no indication of ever stopping till I tell it to.

I am almost at 9,000 epochs (1,439,000 STEPS). The code is set to stop at 3 million STEPS. How many steps are in an epoch is still a mystery variable to me, and I am not sure it is always the same. There was some discussion about this on one of the discussion threads of RAVE’s GitHub, but no clear answer.
As you probably know, you have to go at least a million STEPS for the “warmup” session to end, which for me was about 6,000 epochs. Then the distance and validation numbers soar up, and should go down and level out again. And I am hoping that happens before 3 million steps :slight_smile:
As a noob I have a lot more to say and ask about this process, but this is probably not the right forum. However, the proper discussion on their GitHub is a bit quiet now.
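For what it’s worth, the figures quoted above already pin down the average number of steps per epoch for this particular run. The arithmetic below is a rough sketch only. The `steps_in_epoch` helper is hypothetical and assumes the common convention that one step is one batch; whether RAVE counts steps that way is not confirmed in this thread.

```python
import math

# Figures quoted in the post above.
steps_so_far = 1_439_000
epochs_so_far = 9_000
warmup_steps = 1_000_000
target_steps = 3_000_000

# Average steps per epoch implied by those figures (about 160).
steps_per_epoch = steps_so_far / epochs_so_far

# Epoch counts at which the warmup ends and the run would stop,
# assuming steps per epoch stays constant.
epochs_at_warmup_end = warmup_steps / steps_per_epoch
epochs_at_target = target_steps / steps_per_epoch


def steps_in_epoch(num_examples: int, batch_size: int) -> int:
    """Hypothetical helper: steps per epoch if one step is one batch."""
    return math.ceil(num_examples / batch_size)


print(f"steps per epoch: {steps_per_epoch:.1f}")
print(f"warmup ends near epoch {epochs_at_warmup_end:.0f}")
print(f"run stops near epoch {epochs_at_target:.0f}")
```

Under that assumption, the implied warmup end lands a little past 6,000 epochs, which fits the figure reported above, and the 3 million step limit would fall somewhere near 18,800 epochs.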


No, go for it! I’m going to leave mine as a long form process to explore but it’s not that interesting yet. I wonder if a TPU is much faster for this.


I went ahead and got Colab Pro, so I would like to try using a TPU. I haven’t worked out how to implement it in RAVE’s code yet, because I think you have to dock or install TensorFlow and add some code somewhere.

Hello @jamesbradbury and @bledsoeflute

I’m very curious to hear the sound your respective models produce - I presume you didn’t use these drum sounds :slight_smile:

I will have a guess: is your input dataset made of 160 examples? Before I guess an explanation, knowing what hyperparameters you’ve used can help. I am far from an expert like @groma but I have a feeling it is either a number of batches or a number of items compared…

Also, your 21h for a third on non-Pro Colab tempts me, but if that is for a small-ish model then I’m afraid… especially in the light of @renatrigiorese’s comments about the tuning of hyperparameters…


I haven’t done the math, but my dataset consisted of ca. 52 minutes of lupophone playing at 48000 Hz (.wav), with a batch size of 8. I experimented with other sizes but nothing really seemed to make much of a difference speed-wise, which surprised me. There is a “Fidelity” parameter that affects the speed. I set it to 90%.
Also, I haven’t yet worked out from the code how the set is divided for the testing/training process.
To be honest, the results are ones only a mother could love :slight_smile: . I am making a second attempt as a Pro+ user with an even smaller dataset at 41000 and a higher fidelity. Still, I never get more than 24 hours at a time on Colab. I can share the .ts file with you if it turns out well.


I gave up. The training constantly stopped (they note that this might happen in the notebook) and honestly, I just lost interest.