Fancy descriptors (Essentia)

While having a geek-out session yesterday with @jamesbradbury (and then later with @tremblap), we discussed some of the fancy/funky descriptors in Essentia, many of which I wasn’t even aware of.

In those discussions @tremblap mentioned that many of these aren’t great, so I wanted to create a thread here to discuss which of them, if any, would be useful, and what @groma makes of the idea.

It is a bit difficult to reply without reference to specific descriptors. There are indeed many, and their usefulness depends a lot on the application. They have also been adding more recently. Many of the descriptors are designed with music in a traditional sense in mind, and often to be used in machine learning, so they don’t always have a clear intuition. Others are designed for specific MIR tasks such as tempo tracking or predominant melody extraction. There is also significant overlap, including in our own set: for example, MFCCs and spectral shape descriptors are measures of the same thing, except that the spectral shape ones are more intuitive.

@jamesbradbury has worked with them more, but he spoke fondly of PitchSalience for example.

I guess at some point the legibility of the information to a human becomes minimal, with MFCCs a good example of one that’s useful but largely illegible.

Here we have to be careful not to do the same thing as Essentia, i.e. end up with a list of implemented descriptors that is unmaintainable (for a team of our size).

There are a few on the waiting list: @spluta and @tutschku have asked for a chromagram, for instance, which is not covered by the current set. But maybe we should find a way for people to suggest specific ones, with a rationale, so we can think about what is really missing in the current set without becoming yet another descriptor library provider (we have neither the means nor the need, given the main project’s aims).

Another option, available eventually, will be for people to code them themselves and submit a pull request on GitHub… but that is advanced C++, beyond my capacity (and current interest) for sure!

About PitchSalience, it would be interesting to know how it compares to the confidence of YinFFT (available in our object) using the same frequency range.

In my anecdotal and feelings-based approach, salience and confidence don’t feel interchangeable; each latches on to something different that is audible. Scrubbing through a database of sounds based on (normalised) confidence, 0 -> 0.5 feels like just mush, and then around 0.67 -> 1.0 you start to hear considerably more pitched sounds. The distinction between something at 0.67 and 0.8 is something, but not as stark as 0.8 -> 1.0, where you might get completely sinusoidal samples at the upper end of the scale. Salience seems to be smoother and more linear, at least on my set of sounds. I also feel like it gives me a more representative and hierarchically interesting spread across the salience range, especially if a pitched sound is masked by noise.

Again, anecdotal/my own sounds that I am familiar with etc to account for.


What is important here is to find out whether it is only a mapping issue, where a user can easily decide on a range and scale it in a normalisation pass, or whether the two are genuinely uncorrelated… Did you get, subjectively, moments where the confidence was high but the salience was low, for instance, or did they always move in the same direction?

That’s hard to know as I was always using one or the other without inspecting both. What I can tell you is that the numbers can be de-correlated from each other in a single sample analysis as well as between what you could expect from samples.

Some examples from the data are here, if it interests you. These analyses were taken over longer samples (rather than segments), so in some ways the source and the analysis are not that focused.
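One way to separate "just a mapping issue" from "genuinely uncorrelated" would be to compare a linear (Pearson) and a rank-based (Spearman) correlation over per-sound values of the two descriptors. If the rank correlation is close to 1 while the linear one is lower, a rescaling pass would probably be enough; if both are low, the descriptors are measuring different things. The sketch below is only an illustration: the function names (`ranks`, `pearson`, `spearman`) are hypothetical, and it assumes the two input vectors have the same length and hold one value per sound.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// 1-based ranks of v; tied values share their mean rank.
std::vector<double> ranks(const std::vector<double>& v) {
    std::vector<std::size_t> idx(v.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](std::size_t a, std::size_t b) { return v[a] < v[b]; });

    std::vector<double> r(v.size());
    std::size_t i = 0;
    while (i < idx.size()) {
        // Find the end of the run of equal values starting at i.
        std::size_t j = i;
        while (j + 1 < idx.size() && v[idx[j + 1]] == v[idx[i]]) ++j;
        const double meanRank = 0.5 * static_cast<double>(i + j) + 1.0;
        for (std::size_t k = i; k <= j; ++k) r[idx[k]] = meanRank;
        i = j + 1;
    }
    return r;
}

// Pearson (linear) correlation; returns 0 if either input is constant or empty.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    if (n == 0) return 0.0;
    const double mx = std::accumulate(x.begin(), x.end(), 0.0) / n;
    const double my = std::accumulate(y.begin(), y.end(), 0.0) / n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = x[i] - mx;
        const double dy = y[i] - my;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    const double denom = std::sqrt(sxx * syy);
    return denom > 0.0 ? sxy / denom : 0.0;
}

// Spearman correlation: Pearson correlation of the ranks.
// High Spearman with lower Pearson suggests a monotonic (remappable) relation.
double spearman(const std::vector<double>& x, const std::vector<double>& y) {
    return pearson(ranks(x), ranks(y));
}
```

Calling `spearman(confidence, salience)` and `pearson(confidence, salience)` on the same set of sounds, analysed with the same frequency range, would give two numbers to compare against the subjective impression described above.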

OK, it is worth pursuing. Both are based on spectrum autocorrelation, so it should not be very difficult to get Salience as well if there is a significant difference. Sorry to insist, but I assume you are using the same frequency boundary, which is 100 Hz to 5 kHz by default in PitchSalience.

In my own extractor, this is how I create the algorithm from the factory:

Algorithm* pitchSalience    = factory.create("PitchSalience", "highBoundary", 22050);

So I raise the highBoundary threshold, as many of my sounds have a lot of content from 5k up.