None of my sound cards really work at the moment (!!), so it's hard for me to test with live audio, but I ran some other training data I had with more similar hits. They aren't identical, but they are fairly consistent.
Ok, I played with it some more. It seems that the settings for thresh~ have a massive impact on how the loudness is reported. Using a thresh of 15 20 produces more perceptually relevant results, whereas my "standard" settings of 10 15 produce more erratic ones.
The first half here is 15 20 and the second (right side) is 10 15:
Where it gets weirder is that if I start and stop the audio playback, loudness reports a really loud hit on the first return. In the second half of this you can see me turning playback off after two hits, then starting it again. So it goes LOUD soft LOUD soft LOUD soft, even though the hits are fairly similar in dynamics.
One thing that sticks out to me is that you (@a.harker) specifically mentioned analyzing a window that is a multiple of the FFT size, so I was going for a window size of 512 samples (with FFT settings of 256/64), which works out to 11.609977ms. If I try to analyze a window that big, Max rounds it to 11.61 (as in, I can't make a [- 11.609977] object), so maybe that tiny bit of a semi-empty frame has something to do with this?
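(For reference, the exact duration is easy to compute; the 44.1kHz sample rate is my assumption here:)

```python
sr = 44100                         # assuming a 44.1kHz sample rate
window_samples = 512
print(window_samples / sr * 1000)  # 11.609977... ms, more precision than
                                   # the 11.61 that survives in the patch
```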
One more point of reference. The first half of this is a quiet noise~ burst (0.01 in amplitude), and the second half is a loud noise~ burst (1.0 in amplitude):
The idea here is to test on the exact same signal. If that is not consistent, then you cannot trust the patch, since it should be. You have proven the new one is better, so stick with it.
As for thresh~ and a varying signal, that makes sense. What you should do is take the new, fastest version and listen to the sampled grain by ear. That is always a good start for me (I trust my ears) to check that I am getting what I expect; if it is erratic, then I can troubleshoot before the descriptor stage.
Also, be careful with @a.harker's descriptors object. We have talked about errors with it in another thread. In my performance patch, with the current version, I sometimes have to re-send the FFT parameters to reset it. @weefuzzy had similar issues in his. It is a hard bug to reproduce, but it was affecting energy readings quite a lot, so it might be that.
With the exact same signal I get consistent (but not identical) results from the Max and MSP versions.
I'll try listening to the grains and playing with the window/delay a bit more. I figured that even if thresh~ was reporting a different slice of time (all inside the MSP version), grabbing a bigger window and/or waiting longer (or less) to analyze it would make for more consistent results. What I've tried so far on that front didn't have much of an impact, but I'll do some more testing.
I do remember there being some talk of problems with how loudness was computed, but it was odd that the Max version’s results looked better.
What I saw in the graphs was a lot of variation in the Max one, and much more focused values in the MSP one.
The bigger the window, with a consistent distance from the trigger, the more consistent I would expect it to be. But again, you might reach a size that is too big and starts to grab stuff from elsewhere (other sounds)… A typical loudness measure, from the ITU paper, is around 400ms to be perceptually accurate… I recommend this kind of geek paper if you want to see what the cutting-edge commercial algorithms are doing:
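To make the 400ms idea concrete, here is a toy sketch (plain RMS per block; the real ITU-R BS.1770 / EBU R128 measure adds K-weighting filters, overlapping blocks, and gating):

```python
import numpy as np

def block_loudness_db(signal, sr, block_ms=400):
    """RMS level per block in dB: a toy stand-in for ITU-style loudness
    (BS.1770 proper adds K-weighting filters, block overlap, and gating)."""
    n = int(sr * block_ms / 1000)
    blocks = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    return [20 * np.log10(np.sqrt(np.mean(b ** 2)) + 1e-12) for b in blocks]
```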
Don’t forget that I highpass the signal too, so getting energy might be problematic if your signal is low-end heavy…
If you run it again, you might get different results. That is what killed me in my piece: I got the bug, then ran it for hours without issues, then got it again. The FFT trick saved the gig, but it is a strange one…
Yeah, super sloppy, but the loudness even in the Max version was consistent (surprisingly, perfectly consistent).
I didn't do that at all, so I'll try it. And actually, I can probably pull up the bottom end of what I'm analyzing for in descriptors~ land, as I'm querying for 10 20000 in terms of frequency. If my math is right, an FFT size of 256 doesn't resolve anything below roughly 172Hz at 44.1kHz (44100/256) anyway, so I can chuck a hipass in and then query a narrower range inside descriptors~ too.
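A quick sanity check on that bin math (the 44.1kHz sample rate is my assumption):

```python
sr, fft_size = 44100, 256  # assuming 44.1kHz
bin_hz = sr / fft_size     # ~172.3 Hz between bins; the first non-DC bin sits here,
print(bin_hz)              # so anything below it mostly smears into DC and bin 1
```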
It will still represent the full signal, just not with much precision in the low end for analysis. I think loudness takes the energy in all bins, including DC, so it should not change much. You can send a test signal to it at different FFT sizes and the amplitude should not change - but if you only capture part of a wave, then yes, it won't represent the full energy. That is why they use 400ms for loudness in EBU (but then they also have filters to approximate the equal-loudness contour and such).
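A quick NumPy check of that "amplitude shouldn't change with FFT size" point, under simplifying assumptions (rectangular window, no overlap), where Parseval's theorem guarantees the match:

```python
import numpy as np

x = np.random.randn(2048)               # any test signal
time_energy = np.sum(x ** 2)
for n in (128, 256, 512):               # different FFT sizes
    frames = x.reshape(-1, n)           # non-overlapping rectangular frames
    spectral = sum(np.sum(np.abs(np.fft.fft(f)) ** 2) / n for f in frames)
    print(n, spectral, time_energy)     # identical totals for every FFT size
```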
So I'm working on a version of the "onset descriptors" approach using the new fluid.buf...descriptors~ objects, which I'll post as soon as it's done, but there was a lot of chatter on an older thread on the Max forum dealing with this same problem that threw up some interesting links I hadn't come across before.
It looks like it's getting at something similar, but packaged in a slicker M4L-y way. It doesn't look like it's available yet, but from what I can piece together from the videos, it's running audio-rate descriptor analysis and primarily taking only the centroid ("timbre"). It's also running off analysis windows of 256 samples (so smaller than the 512 I was using).
I'm also curious what onset detection algorithm is being used, as his control interface (p.s. something like this would be handy for the fluid. onset detectors!) looks quite similar to the Sensory Percussion one:
On that note, is it possible to do small-scale dimensionality reduction that potentially retains “weights” or something similar? Like taking the centroid, spread, skewness, kurtosis (then perhaps flatness + crest as a separate “combined” one) and fusing them into a single “timbre” descriptor which still carries, um, some kind of directional meaning (?).
Thinking out loud here, so I'm not exactly sure what I mean (surprise!), but I'm picturing something that takes various spectral moments into account yet still produces a value that correlates with perception (i.e. "that sounds brighter"). (Somewhat related to what was being discussed in this thread.)
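One hedged sketch of what that could look like (illustrative only - plain PCA and made-up data, not a FluCoMa recipe): reduce the spectral moments to a single component and read the component loadings back as the "weights":

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical per-slice spectral moments: columns stand in for
# centroid, spread, skewness, kurtosis (random data for illustration).
moments = np.random.rand(100, 4)

scaled = StandardScaler().fit_transform(moments)  # comparable scales first
pca = PCA(n_components=1)
timbre = pca.fit_transform(scaled)   # one "timbre" value per slice
print(pca.components_[0])            # the weights: each moment's contribution
```

The sign of the axis is arbitrary, but the loadings at least tell you which way each moment pulls, which is the nearest thing to "directional meaning" I can picture.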
My hopes are pinned on a log/log centroid approach, which is on the mid-term radar of @groma and myself. For the second toolbox we are currently working on various normalisation ideas for the descriptor space. I talked about that in the Sandbox #3 paper with Diemo a decade ago (how time flies!) and @a.harker talked about it in his descriptors talk too, with a very elegantly put question: what timbral variation 'value' is equivalent to a semitone, or a dB?
Not really - check the spectralshape tutorial, where I explain the filter being log and the value being pulled up because the centroid calculation is linear; that should make it clear.
I was talking about PAT, after my initials, for the last 18 months, but if I'm being honest (and modest) APT is more accurate: Amplitude, Pitch, Timbre, since that is for me the order of importance of perceptual features… and the pun is better too (an APT space).
I guess on a conceptual level: is this dimensionality reduction primarily useful for human-legible "mapping" type stuff?
Like, any ML algorithm would prefer (?) just to have all the individual data points, numbers, and statistics, rather than an aggregate "timbre" descriptor, yes?
Oh, I forgot to include this in my re-bump, but I would have to imagine that on the order of 512/256 samples, statistical derivatives are probably not very meaningful, since not much can happen in that small a window (even with fast/transient sounds)?
Yes. For ML the weighting is still a problem, but a different one. @groma and I are trying stuff there too, but you can already play with his NIME paper (flucoma.org/publications) and the SC code we showed at the last plenary.
It depends on how many frames of analysis you have. If you do 128/64 then you will still have 5 windows, so all of it might help find what you want (values mostly going downwards, for instance, or upwards, might help assess the rapidity of the attack…)
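A sketch of how the frame count falls out of slice length, window size, and hop (real objects may pad or align differently, so take the exact numbers loosely):

```python
def frame_count(n_samples, window, hop):
    """Full analysis frames that fit inside a slice (no padding assumed)."""
    return 1 + (n_samples - window) // hop

print(frame_count(512, 256, 64))  # the earlier 512-sample slice at 256/64: 5 frames
print(frame_count(512, 128, 64))  # the same slice at 128/64: 7 frames
```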
In Python land, with the sklearn package, the dimensionality reduction process is two phases, which are often smashed into one line of code:
import umap

reduction = umap.UMAP(n_components=2, n_neighbors=umap_neighbours, min_dist=umap_mindist)
data = reduction.fit_transform(data)
reduction.fit_transform(data) is a kind of sugar for doing
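reduction.fit(data)                # learn the mapping from the data
data = reduction.transform(data)   # then apply it to that same data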
So in reality, you could skip transforming the data and just keep the fit, then re-use it in the future on whatever data you want - it just happens that the data I process is also the data I initially used to make the fit(), so I smoosh it all together. So if your question/curiosity at this point is about storing scaling values and transformations to be applied later, then the answer is yes.
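For instance (a sketch - the file name and the random stand-in data are made up):

```python
import joblib
import numpy as np
import umap

training_data = np.random.rand(200, 12)    # stand-in for your descriptor data
reduction = umap.UMAP(n_components=2)
reduction.fit(training_data)               # learn the mapping once
joblib.dump(reduction, "umap_fit.joblib")  # store the fit for later

# ...later, in another session:
reduction = joblib.load("umap_fit.joblib")
new_points = reduction.transform(np.random.rand(5, 12))  # same space, new data
```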
As @tremblap has alluded to, weighting is an issue, but I only use one kind of analysis with lots of values, so scaling multi-modal data sets is less of an issue for me.
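If you did need to mix analysis types, one common approach (a sketch, not a recipe from this thread) is to standardize each feature family before stacking them, so no family dominates by numeric range alone:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical multi-modal features on very different numeric ranges
# (shapes and ranges are made up for illustration).
loudness_feats = np.random.rand(100, 2) * 60 - 60   # dB-ish values
spectral_feats = np.random.rand(100, 8) * 10000     # Hz-ish values

combined = np.hstack([
    StandardScaler().fit_transform(loudness_feats),
    StandardScaler().fit_transform(spectral_feats),
])  # each family now has zero mean / unit variance before reduction
```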