Melbands vs MFCCs and "noise"

rodrigo.constanzo · March 14, 2021, 12:51am

In many discussions @weefuzzy (and others) have brought up how fragile MFCCs can be in the face of noise, with suggestions of using higher order melbands in a similar manner. My tests with this were never super promising, as I always seemed to get better results out of 20 MFCCs than even 40 melbands (for vanilla classification/matching).

The reason I’m making this thread now is that I’m finding that MFCCs are struggling a bit with regards to finding the exact match from a bunch of examples, particularly when being followed by UMAP-ing.

I’m wondering if this is an issue of file analysis vs real-time onset detection sometimes being +/- a few samples apart on the same (or very similar) sounds. I’m trying to mitigate this by running my offline analysis through the same onset detection algorithm (with a bit of preroll on all the files), but the results still aren’t great.

I’m not getting great results from melbands either (a separate discussion/thread perhaps), but my question here is about what happens to MFCCs, and specifically dimensionally reduced MFCCs, if there are small differences in phase. Since it’s an “FFT of an FFT” which is then getting shoved through some crazy manifold thing, it seems to me that having even the same audio file +/- a few samples could create radically different values.

AND / OR

Is there a way to make MFCCs (potentially when being used with subsequent dimensionality reduction) more robust with regards to minor changes in attack/onset?

tremblap · March 14, 2021, 1:32pm

A few things to try:

gating the signal by amplitude to remove the quiet noisy frames and keep just those with valid signal in. That applies to MFCC but you could also do that on mel bands and normalise the result after to dismiss the amplitude and keep just the contour.
denoising the signal by splitting it: HPSS or sines, then analyse the pitched component.
autoencoder as denoiser, as proposed before. The bottleneck lets you remove the details and keep the gist, which is great if you think about it in this case as you train on a complete dataset.

I know you want stuff to be fast, but trying it first to see a ballpark of potential results will give you ideas of what to expect and what to optimise.
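A minimal sketch of the first suggestion (gate by amplitude, then normalise what is left to keep only the contour), in plain Python. The function name, the `threshold_db` parameter and the choice of measuring level relative to the loudest frame are all assumptions made for illustration; this is not how any FluCoMa object does it.

```python
import math

def gate_and_normalise(frames, threshold_db=-60.0):
    """Drop quiet frames, then scale each kept frame so its bands sum to 1.

    frames: list of frames, each a list of non-negative mel-band magnitudes.
    threshold_db: hypothetical gate level, relative to the loudest frame.
    """
    energies = [sum(band * band for band in frame) for frame in frames]
    peak = max(energies, default=0.0)
    if peak <= 0.0:
        return []  # nothing but silence
    kept = []
    for frame, energy in zip(frames, energies):
        if energy <= 0.0:
            continue  # fully silent frame
        level_db = 10.0 * math.log10(energy / peak)
        if level_db < threshold_db:
            continue  # gated out as too quiet
        total = sum(frame)
        # Dividing by the frame total discards overall amplitude
        # and keeps only the spectral contour.
        kept.append([band / total for band in frame])
    return kept
```

With this, a frame and a louder copy of the same frame end up with identical normalised bands, while a frame far below the threshold is simply left out of the statistics.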

rodrigo.constanzo · March 14, 2021, 2:48pm

Interesting. On the short time scale there’s only 7 frames, and they all tend to have something. I even spent a bit over an hour yesterday massaging the JIT analysis frame recording to make sure I capture as much of the attack (and tail) as possible. In the end I went with 16 samples before the onset was detected, and that seemed to work the best. Visualizing and browsing the clustering post UMAP is what I used here to assess the effectiveness. (You had suggested something similar ages ago, but other than comparing raw matching %, I had no meaningful way of knowing whether it was working).

This too is interesting. I was exploring some pre-decomposed analysis stuff last year and got really good and promising results. I just got slowed down on that by the amount of HD (and RAM) required. That could connect nicely with a decomposed input analysis as that could fork off there.

I hadn’t really considered doing this as a denoise-ing thing though.

I guess this will be an interesting thing to try, vs the UMAP I’m using for timbre at the moment. We’ll geek later this week and see how this goes.

tedmoore · March 17, 2021, 8:22pm

Have you compared both of these with using the output of FluidSpectralShape? For noise, things like centroid, skewness, flatness, crest might actually be more indicative. I use these all the time (often in combination with some number of MFCCs).

tremblap · March 17, 2021, 8:33pm

Along those lines, it is interesting that Diemo has opted for zero-crossing and centroid as his 2 spectral descriptors for the SKataRT…

rodrigo.constanzo · March 17, 2021, 8:40pm

Are MFCCs now out of style?! I only just started using them…

I’m presently experimenting to see if I can get an autoencoder to make better sense of the MFCCs but it may be useful to throw in some “oldschool” spectral descriptors.

Where have you found info for that stuff? On the webpage I’ve only seen a tiny screenshot and that’s it.

tremblap · March 17, 2021, 8:43pm

I’ll answer in the right thread

rodrigo.constanzo · March 17, 2021, 10:03pm

When you’re “crossing the streams” as it were, what’s your data processing workflow here? Do you standardize(/norm/scale/whatever) each descriptor and associated stats separately, then put them together, then run them through TSNE (or whatever)? Do you try to have somewhat equal proportionality in the aggregate “timbre” pool (e.g. roughly as many perceptual descriptors as you have MFCCs)?

tedmoore · March 18, 2021, 9:55am

I have been using an analysis vector of SpectralShape (7 params), pitch, pitch conf., zero crossing, sensory dissonance, and then 11 or so MFCCs. So yeah, that was my thinking: about half “old school” (as you call them) analysis params and then an equal number of MFCCs, so as to not overweight the MFCCs (with 40 for example).

I put them all in one dataset and normalize them all together. Normalize because I’ve been using neural nets with sigmoid activation (0 to 1 squasher) in the hidden layer.

tremblap · March 18, 2021, 9:57am

do you mean you normalise the whole matrix to keep the relative values between descriptor vectors, or do you normalise per vector (like our object does)?

tedmoore · March 18, 2021, 10:22am

Do you mean per column?

//=====//

I normalize using FluidNormalize, so that the centroids all get normalized with the centroids (you know, values 20-20k ish) and the pitch conf gets normalized with the pitch conf (0 to 1 ish), etc.

I just meant all together as in with one FluidNormalize in one FluidDataSet.

Per some other thread (somewhere) I’ve also been considering not normalizing some values, MFCCs for example (was it @tremblap who suggested this?) and seeing how that compares, however Wikipedia seems to suggest that MFCC normalizing is a good idea (at least for speech).

Wikipedia:
MFCC values are not very robust in the presence of additive noise, and so it is common to normalise their values in speech recognition systems to lessen the influence of noise. Some researchers propose modifications to the basic MFCC algorithm to improve robustness, such as by raising the log-mel-amplitudes to a suitable power (around 2 or 3) before taking the DCT (Discrete Cosine Transform), which reduces the influence of low-energy components.[8]

tremblap · March 18, 2021, 12:50pm

yes. sorry. indeed we do that, but what I mostly meant is that you lose the relative scale that you would keep if you were doing a matrix-wide normalisation… each have their consequences but I am still exploring those consequences…
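To make the two options concrete, here is a small Python sketch contrasting per-column min-max normalisation with a matrix-wide one. The function names are made up for this example, and the per-column version is only assumed to resemble what FluidNormalize does with its default 0 to 1 range.

```python
def normalise_per_column(rows):
    """Min-max each column to 0..1 independently of the others."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # Fall back to a span of 1.0 for constant columns to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]

def normalise_whole_matrix(rows):
    """Min-max with one global minimum and maximum for every value."""
    flat = [v for row in rows for v in row]
    lo = min(flat)
    span = (max(flat) - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in rows]

# Columns: spectral centroid in Hz, pitch confidence (made-up values).
data = [[200.0, 0.1], [2000.0, 0.9], [20000.0, 0.5]]
per_column = normalise_per_column(data)
whole = normalise_whole_matrix(data)
```

Per column, both descriptors fill the 0 to 1 range. Matrix-wide, the relative scale is kept, so the pitch confidence column is squashed to almost nothing next to the centroid. That is the trade-off in a nutshell: equal weight per descriptor versus preserved relative magnitude.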

rodrigo.constanzo · March 18, 2021, 1:28pm

I’ll give something like this a spin.

Yeah that’s it. With vanilla MFCCs my results look better without standardizing (in my case), so just thinking how to best fold all those things together since it’s a mix of whatever the hell MFCCs are, MIDI pitch, dB, and a ratio too? A whole bunch of linear/nonlinear stuff mixed together.

@tremblap mentioned, in passing, yesterday about having something in between these options of “doing nothing” or “making all the values gigantic”, as it seems like there may be situations where you want to normalize (or whatever) an entire dataset, and not per-column. My intuition, based on everyone’s comments here, would be to somehow scale all the spectral stuff to similar ranges, then standardize that chunk, then standardize the MFCCs down to be in the same ballpark (as a whole dataset, and not per column), and then put all of that together through UMAP (or whatever) next. So things are relatively the same size, but not completely spread out in the available range.

Is that a thing for this kind of stuff?
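One possible reading of that idea, sketched in Python: standardise the spectral descriptors per column, standardise the MFCCs as a single block (one mean and one standard deviation for all coefficients), then concatenate. All function names are hypothetical, and whether this actually helps matching is exactly the open question.

```python
import statistics

def standardise_columns(rows):
    """Zero mean, unit variance for each column separately."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def standardise_block(rows):
    """Zero mean, unit variance using ONE mean and std for the whole block,
    so the relative sizes of the columns survive."""
    flat = [v for row in rows for v in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat) or 1.0
    return [[(v - mean) / std for v in row] for row in rows]

def combine(spectral_rows, mfcc_rows):
    """Concatenate per-column-standardised descriptors with block-standardised MFCCs."""
    spectral = standardise_columns(spectral_rows)
    mfccs = standardise_block(mfcc_rows)
    return [s + m for s, m in zip(spectral, mfccs)]
```

The combined rows can then go to whatever comes next (UMAP, a KD-tree, etc.). The lower MFCCs still dominate the higher ones inside their block, but the block as a whole sits in the same ballpark as the other descriptors.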

rodrigo.constanzo · March 18, 2021, 1:38pm

Oh, I wanted to ask your experience with this. Back at the penultimate plenary @groma mentioned the usefulness of higher order MFCCs nearing “pitch” resolution, so I did some testing and found that at 20 MFCCs I got improved results, enough so over the classic 13 to be worthwhile. Beyond 20 seemed to not do much (with the material I was using anyways).

So wondering if you’ve arrived at 11 from testing. As well as knowing your thoughts on the 0th coefficient, which is not very popular 'round these parts.

tedmoore · March 18, 2021, 1:42pm

I do always take off the zeroth.

The 11 comes from pairing it with the 11 other descriptors and finding that it works well for my uses. At one point I was using 40 MFCCs for some testing, then 20, then 13… I don’t remember exact values to report back (I should probably keep better record of my testings…), but I found that for the tasks I was trying to perform, the low teens # of MFCCs was working pretty well.

Also, when I do my analysis, I generally just analyze for 40. That way I can toss a few more in the dataset as needed if it helps my accuracy…
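As a sketch of that workflow in Python: keep the full 40-coefficient analysis around, then slice out what a given experiment needs, skipping the zeroth coefficient. `select_mfccs` is a hypothetical helper written for this example, not a FluCoMa call.

```python
def select_mfccs(frame, count=11, drop_zeroth=True):
    """Return `count` coefficients from a stored MFCC frame,
    optionally skipping coefficient 0."""
    start = 1 if drop_zeroth else 0
    if start + count > len(frame):
        raise ValueError("not enough coefficients stored")
    return frame[start:start + count]

# Stand-in for one stored frame of 40 coefficients (index used as value).
stored = list(range(40))
eleven = select_mfccs(stored)                           # coefficients 1..11
thirteen = select_mfccs(stored, 13, drop_zeroth=False)  # coefficients 0..12
```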

tremblap · March 18, 2021, 1:44pm

Let’s talk about that tonight. I have heard new information that helped me understand a little more how wrong I was about it all… not that what @tedmoore quoted (from the Wikipedia article) is not interesting, although I don’t know what the consequences would be. @groma is daboss on this so he might have a clever pointer here for helping me (and you, I presume) understand a bit more…

rodrigo.constanzo · March 18, 2021, 1:46pm

Cool, I’ll give that a test. It may be more robust to the minor variances in timing which prompted this thread in the first place.

Oh, do MFCCs work like PCAs in that way, in that the first 11 out of 40 are the same as 11 out of 11?

Cool, looking forward to what new info you have.

tedmoore · March 18, 2021, 1:59pm

I believe that they’ll be different if you change the number of bands used.

But if you specify 40 bands and ask it to return 13 MFCCs, yes those 13 will be the same as if you specified 40 bands, asked it to return 40 MFCCs, but then only used the first 13.

In SuperCollider it’s two arguments: numBands (default=40) and numCoefs (default=13).

I’d be curious to hear from @groma about the relationship between bands and coefs in this case. Why are 40 and 13 the defaults here?
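A small Python sketch of why that holds, assuming the textbook definition of MFCCs as a DCT-II of the log mel-band energies (scaling, liftering and other details of the actual FluCoMa implementation are not modelled here, and the band energies are made up). Unlike PCA, the DCT basis is fixed and does not depend on the data, so asking for fewer coefficients just stops the computation earlier. Changing the number of bands changes the input to the DCT, so every coefficient changes.

```python
import math

def dct_ii(values, num_coefs):
    """First `num_coefs` unnormalised DCT-II coefficients of `values`."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(num_coefs)
    ]

def mfcc_from_bands(band_energies, num_coefs):
    """MFCC-style coefficients: DCT-II of the log band energies."""
    log_bands = [math.log(max(e, 1e-12)) for e in band_energies]
    return dct_ii(log_bands, num_coefs)

# Made-up energies standing in for a 40-band mel analysis of one frame.
bands_40 = [1.0 + 0.5 * math.sin(0.3 * i) for i in range(40)]
# Crude 20-band version of the same frame: sum neighbouring pairs.
bands_20 = [bands_40[2 * i] + bands_40[2 * i + 1] for i in range(20)]

first_13_of_40 = mfcc_from_bands(bands_40, 40)[:13]  # ask for 40, keep 13
only_13 = mfcc_from_bands(bands_40, 13)              # ask for 13 directly
from_20_bands = mfcc_from_bands(bands_20, 13)        # different band count
```

Here `first_13_of_40` and `only_13` come out the same, while `from_20_bands` differs from both.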