Musical use of descriptors discussion

(evil grind full on)

This is it: the fastest, most nervous, crazy, musical relative attack detector. It takes the slow fast one (my original) and the tweaked median filter one I optimised, and keeps the fastest of the 2 :wink:

Enjoy! It even rocks on brushes :wink:


----------begin_max5_patcher----------
3853.3oc6bs2aaajD+uc9TrUnnvtQVl6x20w8hatl1bHIMHNAEA0EFTjqjXL
EoNRpXqqH4y9M6CRQJQZSYQJETTCaJ580ryuY1Y2c1cze8nC5ML5VZROzOf9
CzAG7WO5fC3IwR3.4+ePuoN25F3jvKVuP5MQC+Xu9hrRo2lxSdd3LG2qQDcE
zwlJHLFg0P1H7.UDYfdVw883EFZfi00xRLZdZ.MMcwLpnezyOLsW+69iQAQN
EdA8mx1Jb9T+Pn038UrLwjzEA71tWukESPUd4LkoNyI0che33qhotoh9hA1X
fRejlN6I1j+ODx.EzexpxmeziXO5ucPGvAdCcBGWEHgqEj30X63a78x2XNea
nrUrsazzoTlXaE9NWAXTTXZh++i2Uw.UVGGT0KV1PmoB157XemfrblESS.x3
j5GEVfSz0sGn2GoaqxXBE4isC3TpG3zHpCHL5IfNrMG5TZYnScCfN6s.5rXP
mAQa2BcZFcHzQZNzog2JniMxQ2X2BclVcHzg2.nir0Pmk8NE5Lv3ND5T1.na
KFvZvM5XRT1sPmFoEftD2nYzuT0jfJ2GaCyKy5FJ36msIxDigoOkSkxqfVgJ
rMPhlhXfHgOcCVUI+i1ETL50Pt6XKdW3XMk1gCUIFCLANTkX1JbXciXf9cyG
yn2.yMst1uNlCDZhEGRr5DCGfbaCfA88GLnhM6RXPeSfAi8HLnY2kvfwl.CV
6QXvB2kvf4F.CpZ6QaCXsNDFlR8Pi7CRownzIvTiS1.P4guzVCUMwVh02IKx
nL4v5JaETNJHBnbk3jVk3Do1Mfy79P4MgeGPoSLjNHnthF5LT.EJMClMMaNL
OJJdpCu9FMD3I2OvK5.RbWaWg6p6Mb2VgsXx8HtWpCrk3dMtbRXt3KHbcNmy
tV3OwebH.vUCHpaummzsUD7MemyxoQZYOtcbUqnd4Bs1LdlzZ7La+T5FDXkT
vd35DFeLM7KavjDMGQZ4oVuKOSJ1rops7ikvTw5PikfiDcfdne.8Sz3DXzdA
JbPOmYyJj7AEpBCR+XTbAODyRxOTjjZdRwzO4mUe87TchgddJzsmGK3zaMxV
HBqYh7nwgy84cEQhfv8QYMXFyh04boImK0Eaq2z1dI2B5AiciBDcI1VYs6W3
AtXAod9oLnrwUX33fH2qodELZBZByng9gEscVJaO5Hm4AoWUR4BOnx7G43Rq
sxUpLcPuww9dQgrNQIgHK4LxAnlXKoJEYFdIBclUQkAUc.vqIyDfImmLzIlI
ikyfPxxLMJJnbV40KfNJUl8L+vvUPwznY0mYr+3I2QcGFAYN8tZadNIWMOTj
6UfUgzqRb9TYzN0IHPZlnbyeqSnOLcFM0WHBHJ4YJlEcRhabTPPI9UjympHG
OX3gK8FeuzIbBUTY.Jt+rLknd4RYO+wzjzxok5LNobJqYDARZ9P4v+qRoSmE
.bQ4BT53tJNVunA0Rou1Zw8nrR1eY90nqu5TpEqx5KpIWScMqoO4UfZYxLXv
h2OVrMpxvZM3RsFXyo1piEqahJ9HKUUwrU7YnMrsJLNiUOFDwI+IeO5x3KCu
LE1nhuSnbuJ8kINzIA1+RTHJY9LZLXSJvGrIhJUVzvEnON.M000INMjtPV0i
Y+Hdk82ErF3YYMPL0I.wzcQNy87iPIKBSmPS7Sf2R.shKCQvOOKZ1B9.Mzgt
GwTwIn+Cf3InW49LAsFfNOH.wKSBhYuK9STuAL5AiNm8CmbxM2by.NI3TX.r
KMHSdi+tI.0lEGMFV8IBdcTLkhRhFkdiSL8Tzhn4HWfGiAVMIM1e37TJxOE4
D5cRDf.Qd9iVHZIH04gd7M3QQ.fLMAEMh+O+xqeO5WngzXfaey7gA9tnWBiy
BS.9FHNKkjI..OT1Rr57bV+3BY+.87Hno4lvOEQ8g7iQxYBQjLpHax9nnXQy
bnSJq+Gih3iaOB5zKPrwY4UdPcnvRl0C4Gxa9If8V3EnMAF8Fe.uGRQySnil
GzWzHPwQ+9Kd2u9au+cnye8GP+94u8sm+528gSghmNAznQvr5hFyGFv6CsMv
bwNgoK.dPzFu5me6y9UnRm+Su3ku3ce.3Ezyew6d8OewEnm+auEcN5Mm+128
hm89Wd9aQu48u8M+1E+7.D5BJqqQEMwcf1i3hL.O8noN9AI47+G.wbBzEC7P
S.iuf31k5C5PHGjKn8c+BRQy3.acZLmaghuDQOE4OBEFk1GcSLrNCTZz5hXQ
CrTN2G8hP2ArCmDJmS30AfX3hTnFPq7b+Q.EddPTDLB8mhRRYE+UmiTHXrxw
XUELB89KN+xvu+D9HO3u+sSpC5SNAGpqbTejyXJ6kSY47qfnNJdAZ5gxDdCa
iXHlUxCgkY.FtNi.e3b6YY0fMGE5LdIPGiv7zlEk.Icr3efEcLmUB+Pw+m82
ImfbClOMYAjiOvGL8wgfsUPg7FmE+KYQ.rh2ztSXdFOAcCUHS.TKlxpHq6yI
ynC4E7aNCM8HzeILx.h3Cu9LkSQW+DO+oGBE8H38GeFNuDWlNK5ZJKm9nqge
EL0koel8A+wTI2k24YMpOqQ8eBOcjewFjIpOeLEpzLJ8ZQC6m0pPmLO+yPLn
qTGggZ9E6.WlRC.M07hjU4GelDoKy.xrKPvkrA6ual.qk9PDPnu4LEzxtLkK
B48WPqnOq.GiEMAh2oExvmfXkbUniWCVF75cTEcqk.gngWqfIniWxPeFI34g
v7AWeZUL.qBeyJvWU7viKwCHAS7isAS73JYhG2LlXIA48nBMzRpoTt4gmfES
bIFjqteB4niJMppW1z478cJ2dxCbISk1KZ48iB1eva9RovOzkRUZuDqsFJks
XMT0tVIqB9+EmsY9tGXYSL9.PVsdOLnaSW9YiVloo3h.npemf2xs3AylU2Z6
4zfke0nZBrfF2L8GoFFp7hZg8p.6dNeCu+Q9h5QMUlto8A8F1GzpsOHSTjxl
5.o4SGRiy7Nymbhy0DJmSAkGh0F3y1p7LK9tbYDz67mNeJ2IKJ05FInhYkhr
wNdUbiSTUrYdgSSwJ26qEcuDaezdW4jJWJ6x2RV0iSf404znQq6Vok7Naocq
.r+AV8OW5mnkEEVIYbZ0BgRkKC0UpHuo.7vAgAUmqysB3sxrYqRBrArt7ppR
IPT6kiYkpkOHcQXaYIvzYq4MS9B09VbU9x8NN9fNyKtY5O.9YZgIXXgtZ16L
2WSL2atulnXJNzWCt2JI7KbYayzNoQdCqjwMZLiWyQ9PrDmVKoMuxY2EdkSO
9Q7nY0EJIR6kegwIvOUMJgnusHGVWbLEZ1M+lWsE.WF4DdEGi2tqicclU3.l
JNWshMmsKrG5zhbQQXTaOXrQSWq.TH8Kmd6pB4MLMxoB8Fr8lxv31fgE1REW
yHsN43wp0FC15gYbE2dFW22FKvp6u4WDm0WKMruNY+vjpj76OAOFFNytJoXw
HcUyt5.gQO0adL2xNz7JUcc.r26Z+x0VIuso+s87dME22giwJhXtvVb+wz0T
953jbI65SxUOKJj9mSx8eNI2c6I4RpyIYFci6G2QGgqLHFwhn0vvfeVtDU85
ND2rixQL74PPnW7Dcxl43PMMrB21L6rcvKclrenaLkc6UQmwsIcRdUVVjQnC
EsN5ImgTPe22IOllC8CwGwNgC7QR2rKKVAGjy9f6i7BsxOhTxcLedMjubL5v
7tzQkOijBtGOqk3z9z8jyvefd7tK8oclxi3tWsRXgtG7pcctttKcNsDCzvVU
srrul7NsQG5cZi8p2oq6FwGSCjWEdTyuli38PvCQDtJSyRqMtU60AGvdZ13H
CXeDBQ1hflCapzknwlDIUZ6iXmgGEiDaira7cWD33aBFXtOBblAKisvNBCHa
BFrWBkLgy7r6PL.uIXf9dCCjN9QauGS16gXHS3sKMYfDauSC6IrVmG9Mcefz
HmiUFII63.XBq+2.DTEy+B.ocPvZCk8k1iaVj52RAxNryW1TMJZsRXreeAZk
B76FXvAqX21wdiZ64MVcKo2I6BWRmvtF2bDqn2nueHy7qQDCqaIVZqMe2hV1
cHhwbAC+QywLK0uF0xrkechH9trQgrUX1FG4ol+Mvvc1WCIxcT0py8Uzxs5V
diu1gSkIryW73xa8KNvnX5+cNMzcA5aw8Q07U5FAuGts.RtWDloXrZWXERDw
NdTlgCzSihYQshJ5ov7eyhXwIwD+wSlAsA5ooQyhBhFufEbGPUtIJNcRUm3b
8iD87cY9YxIdQ2cziYqLPbgCTM5DPy0Iw0wq5u1i1eG3NNygLhcdYrcG8rfl
UvgpVcl60jKTwTbyqEKtSs3BGy1TXwl2Ozida9IztgbIu+WkVL9gIFUZiaPf
MWLJt4HD8rmaNHvqQYWXKvjUcarDYV+xLy7CrQNgq1WwruzGP0oisQTR+9nD
osnj18PI1khtcnj58PIrdaQIx8QIs1hR36SNY1VTR49jSshtGtATRuMPOlur
ueJUtP7ojq27VSIMwnAj1nZJqrcT1rIBxNgoyLMc23ssXvy1QIRSoDda0Vsa
.kraAkUrUCHjV2H2TZJow6jQj1cw3hFgupJcAo0T2.RucC9ajRTaXEuQlY3N
naOIJWoClSZR2aWmSZbaOJE2DsHhQaHa0ahrsMVyFtIVW4cGbaKIsZBbhaig
JlMlI2MvoV0voR2SZNb15CLTajonVYNZkc0pAHMYQx31Xg+3lfd5psvj+XbS
ozVuEilL1tMre0jIhVwPSKMdqIT1ztKVsXSFBn1JSNn03kXr0pl5MlTa6VWH
MZZb7Z3mv6RqDXFLhrR.YrRvXrdfXTePXrZ.XvC9Bgu0VwiV4GmyTmaMxue3
YQzPYDf4FWdHXbkeHquW7XpW9nPnXT9aqOrLW4CkAKcakiqKMLsT3cXwJgkp
MAavdyTw1RQS15OZY7jvNmEnpq51vdSgdue1fyhR2RLbAuXVMKWpKoQT0I7H
NQ0jnqweiXpoia6tUoSbo8jEi7CBxYnhAOTlSZ6MN1wye4kGRx+3bYhA7iY+
JeqTD3vqFIqZlVV.x0ux2VuZY0hXqfs4xdUorm8FjDdkZ4DNVFdKlEhQiYwQ
yhhyiymAp14kedZTNilY9N+h4Vs3q7nilIEKdP.MP6hKyz.0JEB+MfmUsaSM
qTmY2Umo3g6BFudoSZD5krH5o2Cg9Pc97i9+.DQqJ6.
-----------end_max5_patcher-----------

Thanks for the explanation. For some reason I thought that everything stayed in the signal world until descriptors~ in the previous examples.

Cool - a new test for Rod. Play a snare by hand as consistently as possible and compare the two patches (i.e. what happens in the acoustic world, where exact repeats don’t exist).

None of my sound cards really work at the moment (!!), so it’s hard for me to test with live audio, but I ran some other training data I had with more similar hits. They aren’t identical, but they are fairly consistent.

It’s a loop of 7 hits.

And with the old version:

Whoops, here is the old (Max slop) version for comparison:

The centroid looks alright here, but the sfm is a mess again. The loudness does look better, particularly since the snare hits are all quite similar in dynamics.

Hmm, definitely need to improve things there.

What’s surprising is that the window is big enough for the centroid to (appear to) be meaningful, but the loudness is way off (or rather, way inconsistent).

Here is the audio file I used as a point of reference.

snare hits.wav.zip (520.3 KB)

Ok, played with it some more. It seems that the settings for thresh~ have a massive impact on how the loudness is reported. Using a thresh of 15 20 produces more perceptually relevant results, whereas my “standard” settings of 10 15 produce more erratic ones.

The first half here is 15 20 and the second (right side) is 10 15:

Where it gets weirder is that if I start and stop the audio playback, loudness reports a really loud hit on the first return. In the second half of this you can see me turning playback off after two hits, then starting it again. So it goes LOUD soft LOUD soft LOUD soft, even though the hits are fairly similar in dynamics.

One thing that sticks out to me is that you (@a.harker) specifically mentioned analyzing a window that is a multiple of the FFT size, so I was going for a window size of 512 samples (with FFT settings of 256/64), which works out to 11.609977ms. If I try to analyze a window that big, Max rounds it to 11.61 (as in, I can’t make a - 11.609977 object), so maybe that tiny bit of a semi-empty frame has something to do with this?
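
For reference, a quick sanity check of those numbers (a minimal sketch in Python; the 44.1kHz sample rate is my assumption, as the post doesn’t state it):

# Convert an analysis window length in samples to milliseconds.
def samples_to_ms(n_samples, sample_rate=44100.0):
    return 1000.0 * n_samples / sample_rate

window_ms = samples_to_ms(512)   # 11.609977... ms
truncated = round(window_ms, 2)  # 11.61 ms, as Max displays it

# How big is the rounding error, expressed back in samples?
error_samples = (truncated - window_ms) * 44100.0 / 1000.0
print(window_ms, truncated, error_samples)  # error is roughly 0.001 samples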

edit:
One more point of reference. The first half of this is a quiet noise~ burst (0.01 in amplitude), and the second half is a loud noise~ burst (1.0 in amplitude):

The idea here is to test on the exact same signal. If that is not consistent, then you cannot trust the patch, since it should be. You have proven the new one is better, so stick to it.

As for thresh and a varying signal, it makes sense. What you should do is take the new fastest version and listen to the sampled grain by ear - that is always a good start for me (I trust my ears) to check whether I get what I expect - if it is erratic, then I can troubleshoot before the descriptors.

Also, be careful with @a.harker’s descriptor object. We have talked about errors in another thread. In my performance patch, with the current version, I sometimes have to re-send it the fft parameters to reset it. @weefuzzy had similar issues in his. It is a hard bug to reproduce, but again, it was affecting energy readings quite a lot, so it might be that.

With the same exact signal I get consistent (but not the same) results from the Max and MSP versions.

I’ll try listening to the grains and playing with the window/delay a bit more. I figured that even if the thresh was reporting a different slice of time (all inside the MSP version), grabbing a bigger window and/or waiting longer (or less) to analyze it would make for more consistent results. What I tried so far on that front didn’t have much of an impact, but I’ll do some more testing.

I do remember there being some talk of problems with how loudness was computed, but it was odd that the Max version’s results looked better.

what I saw in the graphs was that you had a lot of variation in the Max one, and much more focused values in the MSP one.

the bigger the window, with a consistent distance from the trigger, the more consistent I would expect it to be. but again, you might reach a size that is too big and starts to grab stuff from elsewhere (other sounds)… a typical loudness measure, from the ITU paper, is around 400ms to be perceptually accurate… I recommend this kind of geek paper if you want to see what the cutting-edge commercial algorithms are doing:

EBU Tech 3341, “Loudness Metering: ‘EBU Mode’ Metering to Supplement EBU R 128 Loudness Normalisation” (https://tech.ebu.ch/docs/tech/tech3341.pdf)
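
for a rough idea of what that family of meters does at its core - a sliding 400ms mean-square energy measurement - here is a minimal numpy sketch. the real EBU Mode meter K-weights the signal first, adds a fixed -0.691 dB offset, and gates quiet passages; this is only the windowed-energy part:

import numpy as np

def windowed_energy_db(signal, sample_rate=44100, window_ms=400.0):
    # Mean-square energy over a sliding 400ms window, in dB.
    n = int(sample_rate * window_ms / 1000.0)
    hop = n // 4  # 100ms hop, matching 'momentary' EBU Mode update rates
    out = []
    for start in range(0, len(signal) - n + 1, hop):
        mean_square = np.mean(signal[start:start + n] ** 2)
        out.append(10.0 * np.log10(mean_square + 1e-12))
    return np.array(out)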

Don’t forget that I highpass the signal too, so getting energy might be problematic if your signal is low-end heavy…

if you run it again, you might get different results. That is what killed me in my piece - I got the bug, then ran it for hours without issues, then got it again. The fft trick saved the gig, but it is a strange one…

Yeah, super sloppy, but the loudness even in the Max version was consistent (surprisingly, perfectly consistent).

I didn’t do that at all, so I’ll try that. And actually, I can probably pull up the bottom end of what I’m analyzing for in descriptors~ land, as I’m querying for 10 20000 in terms of frequency. If my math is right, an FFT size of 256 at 44.1kHz gives a bin spacing of about 172Hz, so nothing much below that is resolved anyways, and I can chuck a hipass on and then query a narrower range inside descriptors~ too.
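
Checking the bin math (again assuming 44.1kHz):

import numpy as np

fft_size = 256
sample_rate = 44100.0
bin_spacing = sample_rate / fft_size                    # ~172.3 Hz per bin
bin_freqs = np.arange(fft_size // 2 + 1) * bin_spacing  # centre frequency of each bin
print(bin_spacing, bin_freqs[:4])  # 172.27 [0. 172.27 344.53 516.80]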


be careful

it will still represent the full signal, just not with much precision in the low end for analysis. I think loudness takes the energy in all bins, including DC, so it should not change much. you can send a test signal to it at different fft sizes and the reported amplitude should not change - but if you only capture part of a wave cycle, yes, it won’t represent the full energy. that is why they use 400ms for loudness in EBU (but then they have filters to approximate the equal loudness contour and such)
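
that invariance is just Parseval’s theorem: the energy summed across FFT bins matches the time-domain energy regardless of FFT size. a quick numpy check (my illustration, not taken from the objects themselves):

import numpy as np

x = np.random.randn(2048)  # any test signal

time_energy = np.sum(x ** 2)
for n in (256, 512, 1024):
    frames = x.reshape(-1, n)            # non-overlapping frames, no window
    spectra = np.fft.rfft(frames, axis=1)
    # Parseval: sum |X[k]|^2 / N equals the frame's time-domain energy.
    # rfft keeps half the spectrum, so double the non-DC/non-Nyquist bins.
    e = (np.abs(spectra[:, 0]) ** 2
         + 2 * np.sum(np.abs(spectra[:, 1:-1]) ** 2, axis=1)
         + np.abs(spectra[:, -1]) ** 2) / n
    print(n, np.allclose(np.sum(e), time_energy))  # True for every fft size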

So I’m working on a version of the “onset descriptors” using the new fluid.buf...descriptors~ objects, which I’ll post as soon as it’s done, but there was a lot of chatter on an older thread on the Max forum dealing with this same problem, which threw up some interesting links I hadn’t come across before.

The HandSolo: A Hand Drum Controller for Natural Rhythm Entry and Production

Real-Time Hit Classification in a Smart Cajón

Hybrid Percussion: Extending Physical Instruments Using Sampled Acoustics


Came across this via the cycling74 instagram.

Looks like it’s getting at something similar, but packaged in a slicker M4L-y way. It doesn’t look like it’s available yet, but from what I can piece together from the videos, it’s running audio-rate descriptor analysis and primarily only taking the centroid (“timbre”). It’s also running off analysis windows of 256 samples (so smaller than the 512 I was using).

Also curious what onset detection algorithm is being used, as his control interface (p.s. something like this would be handy for the fluid. onset detectors!) looks quite similar to the Sensory Percussion one:

If/when it comes out I’ll get it (unless it’s stupid expensive) and poke around the code to see what he’s doing.

Also makes me wonder of what would be a better way to leverage all the :chocolate_bar: :cake: :ribbon:𝒮𝓅𝑒𝒸𝓉𝓇𝒶𝓁 𝑀𝓸𝓂𝑒𝓃𝓉𝓈:ribbon: :cake: :chocolate_bar: into something more meaningful.

/////////////////////////////////////////////////////////////////////////////////////////////////////////

On that note, is it possible to do small-scale dimensionality reduction that potentially retains “weights” or something similar? Like taking the centroid, spread, skewness, kurtosis (then perhaps flatness + crest as a separate “combined” one) and fusing them into a single “timbre” descriptor which still carries, um, some kind of directional meaning (?).

Thinking out loud here, so not exactly sure what I mean (surprise!), but picturing something that takes various spectral moments into account, but still produces a value that is correlated to perception (i.e. “that sounds brighter”). (somewhat related to what was being discussed in this thread)
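
(One off-the-shelf reduction that does keep explicit, inspectable weights is PCA: each fused dimension is just a weighted sum of the input descriptors. A minimal sklearn sketch with made-up descriptor data, only to make the question concrete:)

import numpy as np
from sklearn.decomposition import PCA

# Rows: analysed slices; columns: centroid, spread, skewness, kurtosis.
descriptors = np.random.randn(100, 4)  # placeholder data

pca = PCA(n_components=1)
timbre = pca.fit_transform(descriptors)  # one fused "timbre" value per slice

# The loadings say how strongly each descriptor pulls the fused value -
# the "directional meaning" asked about above.
print(pca.components_)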

my hopes are with a log/log centroid approach, which is on the mid-term radar of @groma and myself. For the second toolbox we are currently working on various normalisation ideas for the descriptor space. I talked about that in the Sandbox#3 paper with Diemo a decade ago (how time flies!) and @a.harker talked about it in his talk on descriptors too, with a very elegantly put question: what timbral variation ‘value’ is equivalent to a semitone, or a dB?
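
(to make that concrete: a rough sketch of a centroid taken over log2 frequency, so that an octave transposition shifts the value by exactly 1, the way mtof/ftom behaves - my interpretation of the idea, not the actual toolbox code:)

import numpy as np

def log_centroid(magnitudes, freqs):
    # Weighted mean of log2(frequency): transposing a spectrum up an
    # octave shifts this value by exactly 1, like a pitch axis would.
    keep = freqs > 0                  # drop the DC bin before taking log2
    w = magnitudes[keep]
    return np.sum(w * np.log2(freqs[keep])) / np.sum(w)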


This just means linear-izing it? (so it transposes and acts like mtof/ftom would?)

And, yeah, I remember you mentioning in the last geek out session with @jamesbradbury that you were building an ATP (TAP, PAT?!) multidimensional space that did something-ish like this.

not really - check the spectralshape tutorial, where I explain that the filter is log while the centroid calculation is linear (so the value gets pulled up) - that should make it clear.

I was talking about PAT for my initials for the last 18 months, but if I’m being honest (and modest) APT is more accurate: Amplitude Pitch Timbre, since I believe that is, for me, the order of importance of perceptual features… and also the pun is better (an APT space) :smiley:


I guess, on a conceptual level, is this dimensionality reduction primarily useful for human-legible “mapping” type stuff?

Like, any ML algorithm would prefer (?) just to have all the individual data points, numbers, and statistics, rather than an aggregate “timbre” descriptor, yes?

Oh, I forgot to include this in my rebump, but I would have to imagine that on the order of 512/256 samples, statistical derivatives are probably not very meaningful, since not much can happen in that small a window (even with fast/transient sounds)?

yes. For ML the weighting is still a problem, but a different one. @groma and I are trying stuff there too, but you can already play with his NIME paper (flucoma.org/publications) and the SC code we showed at the last plenary.

it depends on how many frames of analysis you have. if you do 128/64 then you will still have 5 windows, so all of it might help to find what you want (mostly going downwards, for instance, or upwards, might help assess the rapidity of the attack…)
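
(as an illustration: with a handful of frames per slice you can fit a slope per descriptor, which captures the “going downwards or upwards” part. a hypothetical sketch with made-up centroid values:)

import numpy as np

# Centroid value for each of the 5 analysis frames inside one slice
# (made-up numbers, purely for illustration).
centroids = np.array([3200.0, 2100.0, 1500.0, 1250.0, 1100.0])

# Least-squares slope in Hz per frame: strongly negative here, i.e. the
# spectrum darkens quickly after the attack.
slope = np.polyfit(np.arange(len(centroids)), centroids, 1)[0]
print(slope)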

in Python land, with the sklearn package, the dimensionality reduction process has two phases which are often smashed into one line of code.

import umap

reduction = umap.UMAP(n_components=2, n_neighbors=umap_neighbours, min_dist=umap_mindist)
data = reduction.fit_transform(data)

reduction.fit_transform(data) is a kind of sugar for doing

reduction.fit(data)
data = reduction.transform(data)

So in reality, you could skip transforming the data, just keep the fit, and re-use it in the future on whatever data you want - it just happens that the data I process is also the data I initially use to make the fit(), so I smoosh it all together. So if your question/curiosity at this point is about storing scaling values and transformations to be applied later, then the answer is yes.
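
In other words, something like this (with placeholder data standing in for the real descriptor sets):

import numpy as np
import umap

training_data = np.random.rand(200, 12)  # placeholder descriptor data
live_data = np.random.rand(10, 12)       # new points that arrive later

reduction = umap.UMAP(n_components=2)
reduction.fit(training_data)                 # learn the mapping once
new_points = reduction.transform(live_data)  # re-use it on new data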

As @tremblap has alluded to, weighting is an issue, but I only use one kind of analysis with lots of values, so it’s less of an issue than when scaling multi-modal data sets.
