I am trying to render the creepiest voice-over ever with fluid.bufnmfcross~ in Max. My (mono) target is the voice-over, which is 5m27s long. My source is the concatenated buffer of my (mono) short sound gestures, around 5000 of them, 1h30m long in total.
fftsettings 1024 -1 -1,
More or less defaults, unless I’ve missed something.
Now, rendering this is simply impossible: after around 12 hours I was still at around 0.2 progress, and that is if it didn’t crash, which it eventually did. The longer the target, the more likely I am to get a random crash during the process. So I sliced my voice-over into short 5-15s slices, and I am trying to render the slices one by one. Baking a 15s segment takes around 5-6 hours on my Intel Core i7-10750H with 16GB RAM on Windows. The bottleneck seems to be the RAM, not the CPU: the latter sometimes ramps up to several fully loaded cores, then drops back to just one for most of the time.
I was thinking, since it fills up my 16GB RAM in an instant, maybe I lose a lot of time redoing things a million times (that couldn’t fit in the RAM)? Is there any way I could speed things up by maybe pre-rendering NMFs and referencing them somehow? Or by any other trick? Let me know, thanks.
The cost of NMF grows very quickly with the size of the inputs, so maybe you can try to cut the 5-minute file into smaller chunks and see? @weefuzzy would be able to tell you the formula in O but I know it is bad and is a ‘feature’ of the algorithm.
OK, thanks, @tremblap, that’s what I thought. I wanted to keep my whole dataset as my source, so I guess I have to accept the consequences…
It is worth testing with a subset: a 10th will likely take wayyyyy less time! And with your sexy HD space with valid nearest neighbours you could decimate the redundant sounds.
I think it’s the size of your source which is really hurting you here.
IIRC NMFCross basically uses the STFT magnitude spectrum of the source as its basis matrix: IOW, it’s then trying to do NMF with your target based on having as many components as there are frames in your source: (90 * 60 * 44100) / 512 => ~465k components.
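To put numbers on that, here is a rough back-of-the-envelope sketch in Python. It assumes a hop of 512 samples at 44.1 kHz (as in the formula above) and, purely as an assumption about what has to be held in memory, one dense activation matrix of components × target frames stored as 8-byte floats. The function names are made up for illustration.

```python
def stft_frames(duration_s, sample_rate=44100, hop=512):
    """Approximate number of STFT frames in a mono buffer (edge padding ignored)."""
    return int(duration_s * sample_rate) // hop


def activation_gigabytes(source_s, target_s, bytes_per_value=8):
    """Size of a dense (source frames x target frames) matrix, in GB.

    Assumption: one component per source frame, one column per target frame.
    """
    entries = stft_frames(source_s) * stft_frames(target_s)
    return entries * bytes_per_value / 1e9


full_source = 90 * 60   # 1h30m of gestures, in seconds
small_source = 150      # a 2m30s decimated pool, in seconds
slice_target = 15       # one 15 s slice of the voice-over

print("components, full source:", stft_frames(full_source))
print("components, small source:", stft_frames(small_source))
print("GB per matrix, full source:", round(activation_gigabytes(full_source, slice_target), 2))
print("GB per matrix, small source:", round(activation_gigabytes(small_source, slice_target), 2))
```

Even for a single 15 s slice, one such matrix against the full source would run to several gigabytes under these assumptions, which would be consistent with the RAM filling up straight away.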
If you condense the pool of gestures quite dramatically, it should get quicker (quadratically, quite possibly). Which might make it more practical to experiment with different combinations of your target and subsets of the gesture pool…
Thanks for the help, guys. 465k components feels sick, all those years of study and practice now seem to be worth it!
Now I am thinking about how I could condense my source without losing “fidelity”. First I was thinking that if I have the full dataset there then it should be the “most accurate”, even though it takes much longer to render. But now I think there are probably diminishing returns beyond a certain resolution/length, since the dataset is around 5k gestures, but made with just around 50 objects. So I guess there could be a way to either make or choose just 50 that basically cover all the spectrum I need.
Does time in the source (transients, envelope) make any difference for the process? I am thinking, probably not so much? Then maybe if I overlay each item per class and sum the fft magnitudes and then normalize, I could drastically simplify the problem. There is no drastic timbral variation within a class, most of them are actually quite homogeneous in the spectral domain, the variations are mostly on the meso-envelope level (“gesture”).
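For what it is worth, the "sum the FFT magnitudes per class and normalize" idea could be sketched like this in Python with NumPy. This is only the analysis half and rests on assumptions: the function names, the Hann window, the 1024/512 window and hop, and the peak normalization are all choices made for the example. It also throws away all timing information, and since fluid.bufnmfcross~ takes audio buffers, the templates would still have to be turned back into sound somehow, which is not shown.

```python
import numpy as np


def mean_magnitude_spectrum(signal, win=1024, hop=512):
    """Average STFT magnitude over all frames of one mono gesture."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < win:
        # Zero-pad very short gestures so that there is at least one frame.
        signal = np.pad(signal, (0, win - len(signal)))
    window = np.hanning(win)
    starts = range(0, len(signal) - win + 1, hop)
    frames = np.stack([signal[s:s + win] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)


def class_templates(gestures_by_class, win=1024, hop=512):
    """Sum the mean spectra of every gesture in a class, then peak-normalize."""
    templates = {}
    for name, gestures in gestures_by_class.items():
        total = np.sum([mean_magnitude_spectrum(g, win, hop) for g in gestures], axis=0)
        peak = total.max()
        templates[name] = total / peak if peak > 0 else total
    return templates


# Hypothetical stand-in data: two "classes" made of plain sine tones.
sr = 44100
t = np.arange(sr) / sr
gestures_by_class = {
    "glass": [np.sin(2 * np.pi * 1000 * t), 0.5 * np.sin(2 * np.pi * 1000 * t)],
    "metal": [np.sin(2 * np.pi * 3000 * t)],
}
templates = class_templates(gestures_by_class)
for name, spectrum in templates.items():
    print(name, "bins:", spectrum.shape[0], "peak bin:", int(spectrum.argmax()))
```

That gives one spectrum per class (about 50 in this case) instead of hundreds of thousands of frames, at the cost of everything that happens on the envelope level.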
I was thinking about my idea of decimating your dataset intelligently. @weefuzzy and @groma will probably find my proposal dangerous, but you could try to use fluid.grid~ to fit your UMAP’d data from the great Balint descriptor space onto a grid, and then just pick every third square in both dimensions, which should remove 8/9ths of the data while respecting proximity… does that sound crazy?
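The "every third square" step can be sketched in Python with NumPy. This assumes the points have already been snapped to integer (column, row) grid cells, which is the part fluid.grid~ would do in Max; the function name and the 9 × 9 example grid are made up for illustration.

```python
import numpy as np


def decimate_grid(grid_coords, step=3):
    """Return indices of points whose grid cell lies on every `step`-th column and row."""
    coords = np.asarray(grid_coords, dtype=int)
    keep = (coords[:, 0] % step == 0) & (coords[:, 1] % step == 0)
    return np.flatnonzero(keep)


# Hypothetical example: 81 points that already sit on a 9 x 9 grid.
cols, rows = np.meshgrid(np.arange(9), np.arange(9))
grid_coords = np.column_stack([cols.ravel(), rows.ravel()])

kept = decimate_grid(grid_coords, step=3)
print(len(kept), "of", len(grid_coords), "points kept")
```

With a step of 3 in both dimensions, one cell in nine survives, which is where the 8/9ths figure comes from.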
Thanks, I am looking at this now, never tried fluid.grid~ before. Previously I was thinking to
- get bounding box over 3D dataset
- define grid of xyz
- pick points closest to cell centers
How much do you think the version with fluid.grid~ would differ from this (beyond the difference between 2D and 3D)?
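For comparison, the three DIY steps listed above could look roughly like this in Python with NumPy. It is a sketch under assumptions: the function name and the number of cells per axis are hypothetical, and each occupied cell only considers the points that fall inside it, so empty cells contribute nothing.

```python
import numpy as np


def pick_nearest_to_cell_centers(points, cells_per_axis=4):
    """Return, for each occupied grid cell, the index of the point nearest its center."""
    points = np.asarray(points, dtype=float)

    # 1. Bounding box over the dataset.
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero on flat axes

    # 2. Regular grid: the integer cell index of every point along each axis.
    cells = np.floor((points - lo) / span * cells_per_axis).astype(int)
    cells = np.clip(cells, 0, cells_per_axis - 1)  # points on the upper edge

    # 3. Distance of every point to the center of its own cell.
    centers = lo + (cells + 0.5) * span / cells_per_axis
    dist = np.linalg.norm(points - centers, axis=1)

    best = {}  # cell -> (distance, point index)
    for i, (cell, d) in enumerate(zip(map(tuple, cells), dist)):
        if cell not in best or d < best[cell][0]:
            best[cell] = (d, i)
    return sorted(i for _, i in best.values())


# Hypothetical 3D descriptor points, just to exercise the function.
rng = np.random.default_rng(0)
points = rng.random((5000, 3))
selected = pick_nearest_to_cell_centers(points, cells_per_axis=4)
print(len(selected), "points selected out of", len(points))
```

One difference from a grid assignment: clustered data leaves many cells empty here, so the number of survivors depends on how the points are spread and cannot be set directly.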
fluid.grid~ does the spatial optimisation for you - check the cool demo @jamesbradbury did:
You could run a PCA before to lower the redundancy, aiming at 0.95 - I’m deep in other things now but I wanted to code it for you
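Outside Max, the "PCA aiming at 0.95" step might look like the following Python/NumPy sketch. It reads 0.95 as the fraction of variance to retain, which is an assumption, and the function name and test data are invented for the example; it is not a description of how the FluCoMa PCA object works internally.

```python
import numpy as np


def pca_keep_variance(data, target=0.95):
    """Project onto the fewest principal components retaining `target` of the variance."""
    X = np.asarray(data, dtype=float)
    Xc = X - X.mean(axis=0)  # center each descriptor column
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # First component count whose cumulative variance reaches the target.
    k = min(int(np.searchsorted(cumulative, target)) + 1, len(s))
    return Xc @ Vt[:k].T, k, float(cumulative[k - 1])


# Hypothetical descriptors: two strong dimensions and three nearly constant ones.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 0.1, 0.1, 0.1])

reduced, k, retained = pca_keep_variance(data, target=0.95)
print("kept", k, "of", data.shape[1], "dimensions, variance retained:", round(retained, 4))
```

The reduced points would then go into the grid step in place of the raw descriptors.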
Aaarggh. Pure porn. @jamesbradbury is a wizard.
I am almost done mocking up my DIY version and now I realize that the problem is a bit trickier than I thought, haha. Off to fluid.grid~ now, thanks.
Oh I just finished decimating with fluid.grid~, it is awesome! I sampled 100 elements from my original 5k dataset this way. Just by listening to the concatenated list (that is around 2m30s) it feels quite comprehensive, almost embarrassingly, actually.
Of course, with that, bufnmfcross~ renders incomparably faster: in 1 minute instead of 8 hours (in the case of a 15s target). The result is slightly different, perhaps a little more repetitive in the decimated version, but (surprisingly) very close. And this is working from 100 files instead of 5000, so with 98% of the data discarded.
What a great lesson I’ve just learned, thanks, guys!
This grid is great. When was it added? It solves so many problems.
And here is another version using a decimated version of the dataset (400 sounds, took around 10 minutes):
The most popular theory about the origin of the Hum is that it was a kind of autonomous weapon, since it can, albeit in rare cases, affect people and systematically destroy their memories.
I’ll give it a spin later but the results are great, not perfect but you can play with the various parameters of voiced and things with a rendering time that is a lot more realistic!
This sounds awesome!
I’ll have a gander at the patch when I’m done teaching.
(as an aside, I don’t know if this is a discourse thing, but I can’t get the videos to play on Safari. They play on Chrome but only the audio, the video says this:
I have the same error and results on Firefox
This is because Safari and Firefox only support specific web codecs, and the video that Balint uploaded is not in one of those formats. Unfortunately I wouldn’t think this is solvable on the Discourse side, unless it duplicates the file on the server and converts it.
The video thing is just me being a bad person. There is no video track, just the audio wrapped into a webm. Feature request for this Discourse: could you support just audio upload? :))
@jamesbradbury will think of a modern codec that is light and fun I’m sure