Mfcc comparison

I asked those questions to Carmine a while ago. The only detail he gave in an email:

There are several ways to compute the MFCC that depends on which version of the dct you use. I DCT-II and, as told you, I perform an energy based weighted average of the coefficients.
PS: my computation is compatible with Dan Ellis’ version and I checked this with him several years ago.

He takes thus the amplitudes into account, but I don’t know how.

I would suggest that you reach out to him directly as those details (Dan Ellis’ implementation etc.) are outside of my expertise.

I understand that part. For now I’m just comparing his blackboxes results with a median average over the fluid.mfcc

Indeed but that is a shot in the dark, as what he does inside that blackbox is definitely where the difference is - and might have very little to do with MFCCs computation… hence my questions. Do you know if he is doing median or which other time-series statistical analysis they do?

I think what @tutschku quoted from Carmine gives us something to go on

  • They use DCT II. I can’t remember offhand which one we use, but IIRC we followed Essentia. Sounds like they followed librosa if they’re talking about Dan Ellis (or at least its default: librosa provides different options
  • I can sort of guess what an energy weighted average is, and I think it tells us pretty much how they’re getting their summary statistic.

IOW, I think we’ve got something we can test empirically in all that (and I’d be very surprised if they weren’t normalizing the coefficients at some point, but it should be pretty easy to determine).


another thing we don’t know from their blackbox is now many melbands they use at first… so many variables!

We can use librosa’s defaults as a starting hypothesis. @tutschku has already opened a dialogue between Carmine and I, so if we get stuck I can trouble him for further details.

very much looking forward to your findings :smiley:

me too to be honest. There are so many variables I am slightly overwhelmed. I know that the last time I tried orchidea I did not get good result on anything that was not as pitch-driven as both your examples, but I’ll give it a try once I can compare alikes.

the voice examples from yesterday’s video are not so much pitch driven

Hi @tutschku,
Turns out I needed the Jasch objects as well. There’s still one object I can’t find though, called [recursivefolder]. Is this one of yours?

Edit: Ah, ok. It’s one of Alex’s. Sorry for the noise

actually the distances are using our KDtree :wink:

Hi @tremblap, sorry for the delay,
I haven’t coded the descriptors in orchidea myself, but I’m sure they do not rely on ircamdescriptors, the code is very lightweighted and I think Carmine rewrote them from scratch. I think Carmine is going to release the sources on github at some point, so you will also be able to doublecheck directly.

As for the distances in dada, if you refer to dada.distances, indeed that is multidimensional scaling using the simpmatrix library, as @jamesbradbury said. If you mean instead how do we measure the distances for knn, for the time being it’s just euclidean distance, but you can assign weights to each dimension to compensate for differences of magnitude (e.g. knn 4 [400 6000] [1. 0.1]) That’s all manual, and I really wouldn’t take my knn handling as a reference, it’s very naively coded.

1 Like

@danieleghisi thanks for the update! Github is a good news for us (@weefuzzy and @groma). And distance @tutschku used were in effect the flucoma ones on all settings of mfccs so that simplified comparisons.

The idea of weighting by scaling is a good one indeed, I will get thinking about those indeed too.

1 Like


1 Like

Not to send you down a rabbit hole, but Dan Ellis’s stuff can also be found here:

Written in Java in 2007-8. All the DSP was Dan’s.

So to kind of bump this here as it’s starting to overlap with some of the explorations I’ve been doing in the time travel thread.

Based on some super helpful input from @jamesbradbury I was starting to do some more qualitative testing where I would play the target, and listen back to the nearest matches, very similar to what @tutschku was doing in this thread.

I went through and pulled out the bit from @tutschku’s patch above that does the Orchidea MFCCs and painfully parsed it into a fluid.dataset~ (I’m not a big bach user, so this involved pasting things into a spreadsheet app, and dumping things through two colls etc…) and made a short video comparing the results.

So this is using my tiny analysis windows where for FluCoMa stuff the fft setting are 256 64 512 and for Orchidea it’s just 256 64 as I don’t know how to do oversampling there. Also don’t know how it handles hops or anything like that either, so that might be something to explore/unpack, but even with that, the match accuracy is on par with what @tutschku was getting. So I’m inclined to believe most of that magic is happening in how the MFCCs are being processed, rather than which frames I’m grabbing.

So here’s the video:

I mention it in the video, but this is showing kind of ‘best case’ performance from the FluCoMa stuff. I’ve done loads of different permutations (as can be seen in the other thread), but all of them are roughly in this ballpark. For this video I’m using 20MFCCs, ignoring the first, then adding loudness and pitch. I then take mean, std, min, max for everything, including one derivative. And for the Orchidea I’m taking only the 20MFCCs, nothing else.

So yeah, the results are really striking. Particularly in the overall clustering. This is a testing set of around 800 hits, so big, but not gigantic, and you can really hear how well it handles all 4 neighbors. The FluCoMa stuff does pretty well (though not great) for the nearest match, and maybe two, but the overall “space” that it’s finding isn’t as well defined as with the Orchidea MFCCs.

I’m wondering now if part of the “energy weighted average” that @weefuzzy was wondering about is like what @b.hackbarth has been doing in AudioGuide, where the MFCCs are just weighted by loudness in each frame (perhaps also combined with some standardization across the whole set, rather than per-sample), which is producing better overall results here.

Either way, wanted to bump this as I’m now seeing how accurate/useful this is here, and how to best try to leverage whatever “secret sauce” is going on underneath to get similar results in the fluid.verse~.

To play devil’s advocate I think they just fail differently and the consistency is about the same. I think it would be worth knowing what the implementation fo MFCC is (dct type, liftering, weighting). Did we ever find that out @tutschku @tremblap

To my ear the Orchidea ones were way more consistent in terms of overall timbre and loudness. The ‘worst’ example from that one was way better than the worst from the native flucoma ones, and same goes for the other end of the spectrum.

More so, the ballpark was a bit more consistent. Some of the flucoma ones sounded like a collection of 5 random samples.

It’s also worth noting that this was just MFCCs, no (additional) stats, proper loudness measurement, pitch, derivatives, etc… So presumably, things could improve from there.

I think @weefuzzy was doing some casual spelunking (as per his info above), but from memory there was nothing conclusive. Or do you mean on the native flucoma versions?

Oh, as a more concrete comparison, I ran the flucoma version through the same exact process as I did the Orchidea ones. In the video example above the flucoma tools were going through the real-time onset detection process, which I’ve tried to get exactly the same for offline and real-time, but there is a biquad~ step in the real-time version which isn’t present in the offline, hence sometimes finding a nearest match which isn’t itself (due to it being offset a sample or two).

So if I do what I’ve done with the Orchidea example and load numbers, and then query based on only numbers, I always get the single point returned back (as one would expect), but the overall matching and clustering isn’t much improved. By the time we get to the 3rd and 4th nearest, it gets very outlier-y.

(I can post a video, but it’s basically the same overall matching as the first video, except the nearest match is always the same now)