This is sort of a usage question and sort of not.
My understanding is that the nmf process takes a single matrix and decomposes it into some number of matrices without any negative elements. What place does the FFT have in this process, and what does the configuration of window and hop size have to do with perceived ‘accuracy’?
Do larger window sizes produce ‘better’ results, or does the process just keep working until it reaches that non-negative set of matrices at the end?
I’m curious what kind of decisions I could/should be making around these settings depending on…
a) type of sound
b) what kind of output I am looking for
This is a very, very interesting (and geek) question. @weefuzzy and I have a sort of KE part on the FFT in general in the pipeline. Now I reckon you have read this page on the KE website
DISCLAIMER: creative coder understanding follows
The key thing missing in your explanation is that nmf processes the magnitude part of the spectrogram, which is by definition non-negative. Then it “use(s) the discovered components to generate masks for the original, complex-valued, spectrum.” So you get masks defining the components from the process, then apply them (multiplying) to each complex frame (on both the real and imaginary parts), changing the amplitude (masking) without changing the phase.
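To make that concrete, here is a minimal sketch of that magnitude-then-mask pipeline in Python, assuming scipy and scikit-learn as stand-ins (this is not the FluCoMa implementation, and the signal, FFT settings and component count are all just illustrative):

```python
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

fs = 44100
x = np.random.randn(fs * 2)            # stand-in for a real mono signal

# the STFT gives complex-valued frames; NMF only ever sees the magnitudes
f, t, Z = stft(x, fs=fs, nperseg=1024, noverlap=768)
V = np.abs(Z)

model = NMF(n_components=5, init='nndsvd', max_iter=500)
W = model.fit_transform(V)             # spectral bases (bins x components)
H = model.components_                  # activations (components x frames)

# each component becomes a soft mask over the *complex* spectrogram:
# the mask scales the amplitude of each bin, the phase passes through
recon = W @ H + 1e-12
for k in range(W.shape[1]):
    mask = np.outer(W[:, k], H[k]) / recon
    _, xk = istft(Z * mask, fs=fs, nperseg=1024, noverlap=768)
    # xk is the k-th resynthesised component
```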
Does it help?
So all the usual Gabor uncertainties apply, with the great addition of decoupling the FFT size from the window size, which allows you to oversample the FFT, although I’m still not super clear on what that does for spectral precision. @a.harker, @weefuzzy and @groma have explained it to me many times, but I can’t yet say I understand it for real. Hence the need for an FFT primer for geeks.
Practically: I usually play with the settings for each algo. For nmf~ I found the algorithm quite sturdy against imprecision, so I had fun going low even on mixed signals and hearing it break magnificently. Otherwise my rule of thumb is to go to a large FFT when significant low end is present, and to use an overlap of at least 4, if not 8, when there are also transients. But then I get pre-ring, obviously, because my mask opens for too long. Does that make sense?
The thing about zero padding is very simple. It does not increase resolution (which in this context has the specific meaning of the ability to resolve sinusoidal components in close proximity). What it does do is produce ideal spectral interpolation, and hence finer, smoother output.
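A quick numpy illustration of that distinction (all values arbitrary): zero padding samples the same underlying spectrum on a denser grid, but only a longer window actually separates two close tones:

```python
import numpy as np

fs = 1000
n = 256                                   # short analysis window
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 103 * t)

plain  = np.abs(np.fft.rfft(x))           # grid spacing fs/n, about 3.9 Hz
padded = np.abs(np.fft.rfft(x, n=8192))   # same 256 samples, zero padded

# `padded` traces a smooth, finely interpolated curve, but the two tones
# still merge into one peak: resolution comes from window length, not FFT size
t4 = np.arange(4 * n) / fs
x4 = np.sin(2 * np.pi * 100 * t4) + np.sin(2 * np.pi * 103 * t4)
longer = np.abs(np.fft.rfft(x4))          # 4x the window: two distinct peaks
```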
The density of frequency bins has a more profound effect on how the algorithm decomposes than the density of temporal frames. This stands to reason – for the ‘plain’ NMF algorithm here – because it gives the process more detail on which to discriminate between different things. That is, plain NMF only works on a frame-by-frame basis (no dependencies over time), and only uses STFT magnitudes.
It follows that there will be a sweet spot for particular material: too coarse a frequency grid, and you’ll find things that should be perceptually separate get grouped. Too fine, and (again) the decomposition will make less perceptual sense.
The main effect of increasing temporal density (besides increasing processing time) is on the way that the decomposed components are amplitude modulated.
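To see the frame-by-frame point in code, here is a hedged numpy sketch of ‘plain’ NMF (multiplicative updates in the Euclidean flavour, after Lee and Seung). Nothing in the updates couples neighbouring frames, so shuffling the input frames (along with the matching initialisation) just shuffles the activations; and since H holds one value per component per hop, each component’s amplitude envelope is effectively sampled at the hop rate:

```python
import numpy as np

def plain_nmf(V, k, iters=100, W0=None, H0=None, eps=1e-12):
    # Euclidean multiplicative updates (Lee & Seung): no time dependencies
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], k)) if W0 is None else W0.copy()
    H = rng.random((k, V.shape[1])) if H0 is None else H0.copy()
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # column j of H only sees V[:, j]
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
V = rng.random((513, 400))                     # stand-in magnitude spectrogram
W0, H0 = rng.random((513, 5)), rng.random((5, 400))

W, H = plain_nmf(V, 5, W0=W0, H0=H0)
perm = rng.permutation(400)                    # shuffle the frames in time
Wp, Hp = plain_nmf(V[:, perm], 5, W0=W0, H0=H0[:, perm])

# differences sit at floating-point noise level: time order never entered
print(np.abs(W - Wp).max(), np.abs(H[:, perm] - Hp).max())
```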
The role of the STFT in NMF is just to provide a representation of the signal that is non-negative (i.e. the magnitudes), can be inverted (we make masks over the original spectrogram) and has some bearing on our perception (although we know that the STFT is pretty rough in that regard). Other things can be used, like constant-Q transforms, and you’d expect these to have more radical effects on the outcome. Invertibility and/or perceptual correlation are more poorly understood for many other representations.
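For instance, a hedged sketch of swapping in a constant-Q magnitude as the non-negative representation, assuming librosa (the example asset is just a bundled demo file; the invertibility caveat shows up in that the inverse CQT, unlike the STFT round trip, is only approximate):

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.example('trumpet'))   # any mono file works

C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)
V = np.abs(C)        # non-negative, log-spaced in frequency: feed this to NMF

# masks built from the NMF factors can be applied to C exactly as with the
# STFT, but the inverse constant-Q transform is only approximate
y_hat = librosa.icqt(C, sr=sr, hop_length=512, bins_per_octave=12)
```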
There are extensions to NMF for modelling temporal dependencies as well. With these, you’d expect things like hop-size settings to have greater consequences.
Some of the bleeding-edge research into NMF-like techniques tries to learn a signal representation/transform directly from the training data, which is quite cool.
The help patch certainly puts into perspective how the decomposition works under different circumstances. Definitely something to consider when using the tools in the future.
Indeed a super useful patch! Once we start sorting the ranks it is even easier to make sound comparisons (although they don’t strictly align, it is even more didactic to hear similar-ish components compared).
Thanks Owen for the help patch. Though, for me only the first component makes sound.
Both source buffers are loaded and the Max window reports that the calculation finished, but components 2-10 are silent for all examples.
Super helpful comparison patch @weefuzzy (changed polybuffer~ to buffer~ for 2a).
Actually, sorry, just getting back into looking at this. Why did you use polybuffer~ in the first place in this specific example?
Because @weefuzzy loves them: they can grow programmatically. I agree with him, but we had to drop support for the time being because of code efficiency: the benefits were not worth the code mess. One can still use them by naming instances (mybuf.1, etc.)…