Musical use of descriptors discussion

Came across this via the cycling74 instagram.

Looks like it’s getting at something similar, but packaged in a slicker, M4L-y way. It doesn’t look like it’s available yet, but from what I can piece together from the videos, it’s running audio-rate descriptor analysis and primarily taking only the centroid (“timbre”). It’s also running off analysis windows of 256 samples (so smaller than the 512 I was doing).

Also curious what onset detection algorithm is being used, as his control interface looks quite similar to the Sensory Percussion one (p.s. something like this would be handy for the fluid. onset detectors!):

If/when it comes out I’ll get it (unless it’s stupid expensive) and poke around the code to see what he’s doing.

Also makes me wonder what would be a better way to leverage all the :chocolate_bar: :cake: :ribbon:𝒮𝓅𝑒𝒸𝓉𝓇𝒶𝓁 𝑀𝓸𝓂𝑒𝓃𝓉𝓈:ribbon: :cake: :chocolate_bar: into something more meaningful.

/////////////////////////////////////////////////////////////////////////////////////////////////////////

On that note, is it possible to do small-scale dimensionality reduction that potentially retains “weights” or something similar? Like taking the centroid, spread, skewness, kurtosis (then perhaps flatness + crest as a separate “combined” one) and fusing them into a single “timbre” descriptor which still carries, um, some kind of directional meaning (?).

Thinking out loud here, so not exactly sure what I mean (surprise!), but I’m picturing something that takes various spectral moments into account yet still produces a value that correlates with perception (i.e. “that sounds brighter”). (Somewhat related to what was being discussed in this thread.)
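
(One hedged way to read that, assuming Python and sklearn rather than anything in the toolbox: a one-component PCA over standardised spectral moments gives a single fused “timbre” value per slice, and its components_ are exactly the kind of signed “weights” being asked about. The descriptor values below are made up for illustration.)

# a minimal sketch: fuse several spectral moments into one axis while keeping
# signed "weights", using a 1-component PCA from sklearn
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical per-slice descriptors: columns = centroid, spread, skewness, kurtosis
moments = np.array([
    [1200.0, 800.0, 1.2, 4.5],
    [3400.0, 1500.0, 0.4, 2.9],
    [900.0,  600.0, 1.8, 6.1],
])

scaled = StandardScaler().fit_transform(moments)  # put all moments on comparable scales
pca = PCA(n_components=1).fit(scaled)

timbre = pca.transform(scaled)[:, 0]  # one fused "timbre" value per slice
weights = pca.components_[0]          # signed contribution of each moment to that axis

print(timbre)
print(weights)  # e.g. a large positive weight on centroid would read as "brighter"

The sign of each weight then tells you which direction along the fused axis corresponds to “brighter”, “noisier”, etc.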

my hopes are with a log/log centroid approach, which is on the mid-term radar for @groma and myself. For the second toolbox we are currently working on various normalisation ideas for the descriptor space. I talked about that in the Sandbox#3 paper with Diemo a decade ago (how time flies!), and @a.harker covered it in his talk on descriptors too, with a very elegantly put question: what timbral variation ‘value’ is equivalent to a semitone, or to a dB?
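
(As a rough illustration of that question - my sketch, not the toolbox’s normalisation: converting a centroid in Hz to mtof/ftom-style semitones and an amplitude to dB puts both on scales where “one unit” already has a perceptual meaning, so a timbral step can at least be lined up against a semitone or a dB.)

# a rough sketch of perceptual-ish units, assuming nothing about the toolbox itself
import math

def hz_to_midi(f):
    # ftom-style: 69 = A440, one unit = one semitone
    return 69.0 + 12.0 * math.log2(f / 440.0)

def amp_to_db(a):
    return 20.0 * math.log10(a)

# a centroid moving from 1000 Hz to ~1059 Hz is roughly a one-semitone "step"
print(hz_to_midi(1059.46) - hz_to_midi(1000.0))  # ~1.0
# an amplitude moving from 0.5 to ~0.561 is roughly a one-dB "step"
print(amp_to_db(0.561) - amp_to_db(0.5))         # ~1.0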


This just means linear-izing it? (so it transposes and acts like mtof/ftom would?)

And, yeah, I remember you mentioning in the last geek out session with @jamesbradbury that you were building an ATP (TAP, PAT?!) multidimensional space that did something-ish like this.

not really - check the spectralshape tutorial: when I explain that the filter is log but the centroid value gets pulled up because the centroid calculation is linear, that should be clear.
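
(A numerical illustration of the “pulled up” point - my own sketch, not the tutorial’s code: the same magnitude spectrum gives a noticeably different centroid depending on whether it is averaged on a linear frequency axis or a log/semitone one.)

# sketch: weighted-mean centroid on a linear vs a log (semitone-style) frequency axis
import numpy as np

freqs = np.linspace(50.0, 8000.0, 512)  # bin centre frequencies (Hz)
mags = 1.0 / freqs                      # a roughly pink-ish magnitude slope

# linear-frequency centroid: the high bins dominate the average and pull the value up
centroid_lin = np.sum(freqs * mags) / np.sum(mags)

# log-frequency centroid: average in semitone-like units, then convert back to Hz
midi = 69.0 + 12.0 * np.log2(freqs / 440.0)
centroid_log = 440.0 * 2.0 ** ((np.sum(midi * mags) / np.sum(mags) - 69.0) / 12.0)

print(centroid_lin, centroid_log)  # the linear one sits much higher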

I was talking about PAT for my initials for the last 18 months, but if I’m being honest (and modest) APT is more accurate: Amplitude Pitch Timbre, since I believe it is for me the order of importance of perceptual features… and also the pun is better (an APT space) :smiley:


I guess, on a conceptual level, is this dimensionality reduction primarily useful for human-legible “mapping”-type stuff?

Like, any ML algorithm would prefer (?) just to have all the individual data points, numbers, and statistics, rather than having an aggregate “timbre” descriptor, yes?

Oh, I forgot to include this in my rebump, but I would have to imagine that, on the order of 512/256 samples, statistical derivatives are probably not very meaningful, since not much can happen in that small a window (even with fast/transient sounds)?

yes. For ML the weighting is still a problem, but a different one. @groma and I are trying stuff there too, but you can already play with his NIME paper (flucoma.org/publications) and the SC code we showed at the last plenary.

it depends on how many frames of analysis you have. If you do 128/64 then you will still have 5 windows, so all of it might help to find what you want (mostly going downwards, for instance, or mostly upwards, might help assess the rapidity of the attack…)
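
(A hypothetical version of that in Python, with a random buffer standing in for a real slice: analyse one 512-sample slice at 128/64, collect the per-frame centroids, and fit a line through them - the slope gives a crude “which way is it heading, and how fast” number. The exact frame count depends on the padding/framing convention.)

# sketch: direction/speed of the centroid within one short slice (128 window / 64 hop)
import numpy as np

sr = 44100
slice_samples = np.random.randn(512)  # placeholder: stand-in for one analysed 512-sample slice
win, hop = 128, 64

centroids = []
for start in range(0, len(slice_samples) - win + 1, hop):
    frame = slice_samples[start:start + win] * np.hanning(win)
    mags = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(win, 1.0 / sr)
    centroids.append(np.sum(freqs * mags) / (np.sum(mags) + 1e-12))

# straight-line fit through the per-frame centroids: the sign says whether the
# centroid heads up or down across the slice, the magnitude says how fast
slope = np.polyfit(np.arange(len(centroids)), centroids, 1)[0]
print(len(centroids), slope)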

in Python land, with sklearn-style packages like umap-learn, the dimensionality reduction process has two phases which are often smashed into one line of code:

import umap  # the umap-learn package, which follows the sklearn fit/transform API

reduction = umap.UMAP(n_components=2, n_neighbors=umap_neighbours, min_dist=umap_mindist)
data = reduction.fit_transform(data)

reduction.fit_transform(data) is a kind of sugar for doing

reduction.fit(data)        # learn the mapping from the data
reduction.transform(data)  # apply that mapping to the same data

So in reality, you could actually not transform the data and just keep the fit, and re-use it in the future on whatever data you want - it just happens that the data I process is also the data I initially use to make the fit(), so I smoosh it all together. So if your question/curiosity at this point is about storing scaling values and transformations to be applied later, then the answer is yes.
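
(Concretely, a sketch built on the snippet above, with corpus_descriptors / new_descriptors as made-up placeholder names: fit once on one corpus, transform anything else later, and joblib can persist the fitted object between sessions.)

# sketch: fit once, re-use the same mapping on new material later
import joblib
import numpy as np
import umap

corpus_descriptors = np.random.rand(200, 12)  # placeholder: descriptors used to learn the mapping
new_descriptors = np.random.rand(10, 12)      # placeholder: unseen material to project later

reduction = umap.UMAP(n_components=2)
reduction.fit(corpus_descriptors)                    # learn the mapping once
embedded_new = reduction.transform(new_descriptors)  # re-use it on whatever comes later

joblib.dump(reduction, "umap_fit.joblib")  # persist the fit for another session
reduction = joblib.load("umap_fit.joblib")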

As @tremblap has alluded to, weighting is an issue, but for me, I only use one kind of analysis with lots of values, so it’s less of an issue than when scaling multi-modal data sets.
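
(For the multi-modal case, the usual sklearn-style move is something like the sketch below - standardise each column before the reduction so that, say, a Hz-scale centroid doesn’t swamp a 0-1 flatness. The descriptor columns and numbers are invented for illustration.)

# sketch: put columns with wildly different units on a comparable scale before reducing
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical multi-modal descriptors: centroid (Hz), loudness (dB), flatness (0-1)
descriptors = np.array([
    [1200.0, -18.0, 0.12],
    [3400.0,  -6.0, 0.45],
    [900.0,  -30.0, 0.08],
])

scaled = StandardScaler().fit_transform(descriptors)  # each column: zero mean, unit variance
print(scaled.std(axis=0))                             # ~[1, 1, 1]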
