MFCC comparison

rodrigo.constanzo · September 15, 2020, 10:23am

In the “Hans method” video, I’m searching the whole lot. So choosing a random sample, and then finding the 4 nearest matches to it.

In context, I was using a smaller and more controlled set, and I train individual samples from the training set, and then play back a single sample from the matching set and see if the numbers are correct (so quantitative vs qualitative).

tremblap · September 15, 2020, 10:37am

ok here I have 4 comparisons going on, all of which return very similar friends, all mostly convincing (more than your video but I was saying to @weefuzzy that it is hard for me to assess since I don’t know the data nor the search, let alone the subjective ‘more similar’ - in my messy dirty synth dataset I definitely get 4 different outputs, all of which are convincing to a certain extent… maybe I should send my sounds to @tutschku and you so you can try it in your respective patches)

(one question I had: you seem to use 20 mfccs, but is that 21 of which you remove 0, or 20 and you keep the very strongly loudness correlated #0?)

rodrigo.constanzo · September 15, 2020, 10:55am

For the comparison with the Orchidea ones, I did 20 MFCCs for both (including the 0th). That was to keep parity between the analyses since there was no control over the Orchidea stuff (as far as I could see).

So with this are you just doing spectrally-weighted MFCCs? Or is there some more of the Orchidea “secret sauce” in there too? (I remember @weefuzzy mentioning some stuff about normalizing across the whole dataset or something like that).

weefuzzy · September 15, 2020, 11:11am

I was never able to work out what Orchidea was doing w/r/t normalizing or dropping MFCC 0. IIRC, when I looked at the output from @tutschku’s sounds, the first two components didn’t look like they’d been standardized / normalized at all, but the later ones looked like they might have been standardized. However, I was never able to massage a librosa analysis with weightings to produce comparable distributions.

tremblap · September 15, 2020, 11:42am

I’ll try with 20 here. at the moment I did 13 in 4 flavours

all 13 with 14 stats, unweighted
all 13 with 14 stats, weighted
dismissing 0, with 8 stats (the fluidcorpusmap ones: mean/std/min/max and the same on derivative1), unweighted
dismissing 0, with 8 stats, weighted

I did not normalise (add 4 more comparisons here that I won’t do yet as I need to work on the other secret-ish addition) and I am not clear if you were using derivatives either (add another batch of all).

then we can run all of it in 20 mfccs
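For anyone following along outside Max, the 8-stat summary mentioned above (mean/std/min/max, then the same four on the first derivative) could be sketched per coefficient roughly like this. It is plain Python, not the FluCoMa objects: `four_stats` and `summarise` are hypothetical helper names, the derivative is assumed to be a simple frame-to-frame difference, and the standard deviation is assumed to be the population one, so the numbers may not match what the FluCoMa stats objects report.

```python
from statistics import fmean, pstdev


def four_stats(xs):
    """Mean, population std, min and max of one sequence."""
    return [fmean(xs), pstdev(xs), min(xs), max(xs)]


def summarise(track):
    """Return 8 stats for one MFCC coefficient tracked over time.

    Assumption: the 'derivative' is the first difference between
    consecutive analysis frames.
    """
    deriv = [b - a for a, b in zip(track, track[1:])]
    return four_stats(track) + four_stats(deriv)


# One coefficient over five analysis frames (made-up values).
track = [1.0, 2.0, 4.0, 7.0, 11.0]
stats = summarise(track)
```

Under those assumptions, 13 coefficients would give 13 × 8 = 104 values per sound, and dropping coefficient 0 would leave 12 × 8 = 96.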

rodrigo.constanzo · September 15, 2020, 2:35pm

Curious on the results. Also curious on the impact of weighting on “perceptual” descriptors (e.g. centroid etc…) vs MFCCs too.

For my 20FluCoMa vs 20Orchidea I did no stats (only mean), no derivs.

tremblap · September 15, 2020, 3:12pm

that is probably why I get so much better results than you

thanks for the info. soon, a world of experimentations more for you.

rodrigo.constanzo · September 15, 2020, 3:23pm

With the comparisons in the other thread, I did loads of diff stats and amounts of stats etc…, but since I had no idea what Orchidea was doing, I wanted to compare like-to-like, and take the MFCCs as they came.

rodrigo.constanzo · September 17, 2020, 8:57pm

I guess this would primarily apply to dimensionality reduction stuff, but it was interesting what @b.hackbarth mentioned during the last geek chat with regards to how Norbert Schnell would apply MFCC clipping (was that the term?) where the higher dimensions would be phased out first as part of the process.

So I wonder if there’s some stuff like that going on where MFCCs are kind of rescaled or exaggerated along those lines?

weefuzzy · September 18, 2020, 9:49am

I’ve found that bit in yesterday’s video now. Unless I misunderstand @b.hackbarth’s description, this is liftering (filtering in the cepstral domain) but with a non-rectangular window, which certainly isn’t unheard of. I can see how it would possibly tame matching behaviour in some circumstances by emphasising the gross portions of the spectral envelope relative to the finer detail.

b.hackbarth · September 18, 2020, 10:17am

@weefuzzy’s description of what is happening is correct. When you type d("mfccs") into audioguide, it is expanded internally into a longer list of individual mfcc coeffs with liftering weights.

probably TMI, but…

d("mfccs") = d("mfcc1", weight=0.121), d("mfcc2", weight=0.116), d("mfcc3", weight=0.110), d("mfcc4", weight=0.105), d("mfcc5", weight=0.098), d("mfcc6", weight=0.092), d("mfcc7", weight=0.084), d("mfcc8", weight=0.076), d("mfcc9", weight=0.068), d("mfcc10", weight=0.057), d("mfcc11", weight=0.045), d("mfcc12", weight=0.027)

Note that mfcc0 is omitted, and the weights are (linspace(1, 0) ** 0.5) / sumOfWeights. Dividing each weight by the sum of weights ensures that, when searching for multiple descriptors, mfccs have equal import compared to a single dimension descriptor, like centroid.

Pierre’s comment about normalization was spot on here. Audioguide normalizes each mfcc independently, so the weightings mimic the smaller values usually present in higher mfcc coeffs.
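A rough sketch of the weighting described above (square root of a linear ramp, divided by the sum of weights), in plain Python. This is not audioguide's code: `lifter_weights` and `weighted_distance` are hypothetical names, and the endpoint handling is a guess. A literal 12-point ramp that reaches 0 would make the last weight 0, which the quoted list does not show, so the default here stops one step short of 0. Treat the output as the same shape as the quoted weights rather than the same numbers. How the weights enter the distance (here, multiplying the squared differences) is also an assumption.

```python
import math


def lifter_weights(n=12, end=0.0, include_end=False):
    """sqrt of a linear ramp from 1 towards `end`, normalised to sum to 1.

    include_end=False stops one step short of `end` (assumed here so the
    last weight stays above zero).
    """
    steps = n - 1 if include_end else n
    ramp = [1.0 + (end - 1.0) * i / steps for i in range(n)]
    raw = [math.sqrt(r) for r in ramp]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two MFCC vectors (mfcc0 dropped).

    Assumption: each weight scales the squared difference of its coefficient.
    """
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


weights = lifter_weights(12)
```

Because the weights sum to 1, the 12 coefficients together count about as much as one single-dimension descriptor in a combined search, which is the point made about centroid above.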

tremblap · September 18, 2020, 1:15pm

Please, call me p.a.

jamesbradbury · September 18, 2020, 10:06pm

Is this in a paper somewhere? I’d be curious whether librosa has it somewhere in its weighting stuff, so that I could hook into this as natively as possible.

tremblap · September 19, 2020, 9:32am

beware of the rabbit hole
Librosa vs Essentia MFCC comparisons can offer a lot of long evening readings… for instance:

weefuzzy · September 19, 2020, 9:49am

I figure it’ll be in lots, but here’s one that looks at its usefulness in (old) speech recognition schemes
Juang, B. H., Rabiner, L., & Wilpon, J. G. (1987). On the use of bandpass liftering in speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(7), 947-954.

librosa.feature.mfcc seems to have offered raised-sine window liftering since 0.7.1
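As far as I can tell from the librosa documentation, its `lifter` parameter scales coefficient n by 1 + (lifter / 2) · sin(π(n + 1) / lifter). Below is a plain-Python sketch of that window so the shape can be inspected without installing anything. It is not librosa's code, and `raised_sine_lifter` and `apply_lifter` are hypothetical names.

```python
import math


def raised_sine_lifter(n_mfcc, lifter):
    """Per-coefficient gains 1 + (lifter / 2) * sin(pi * (n + 1) / lifter)."""
    if lifter <= 0:
        return [1.0] * n_mfcc  # lifter of 0 means no liftering
    return [
        1.0 + (lifter / 2.0) * math.sin(math.pi * (n + 1) / lifter)
        for n in range(n_mfcc)
    ]


def apply_lifter(frame, lifter):
    """Scale one frame of MFCCs by the raised-sine gains."""
    gains = raised_sine_lifter(len(frame), lifter)
    return [g * c for g, c in zip(gains, frame)]


# 13 coefficients with lifter=22 (example values, not a recommendation).
gains = raised_sine_lifter(13, 22)
```

With these example settings the gains rise from the lowest coefficient to a peak around coefficient 10, so this window boosts the middle coefficients rather than tapering the higher ones the way the audioguide weights do.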

jamesbradbury · September 19, 2020, 11:02am

Cool! I’ll have to have a punt at putting this in. I think for the next few months I will be writing PhD stuff, so I won’t have anything concrete to apply this to, but I’m watching this space.