Onset-based regression (JIT-MFCC example)

rodrigo.constanzo · April 28, 2020, 6:08pm

Ok, did some more testing today and the best results I get, by far, are using the Sensory Percussion for onset detection, and the DPA to feed the MFCC machine. I tried all kinds of combinations of highpass, 5k spike, and mic correction on the Sensory Percussion pickup and they were all very erratic (with the same settings and training points across the board).

The best I was able to get out of it was going through the mic correction convolution followed by a light highpass (110Hz), and this wasn’t great.

For my use case I can just do the double mic setup, but I’d like to generalize the approach, towards the end of building a set of tools/abstractions for using this pickup natively in Max-land.

(more on this below)

https://twitter.com/r_constanzo/status/1086560091954917377

Indeed…

I’m obviously not thinking this through correctly, but if in this example @tremblap has created a 96 dimensional space, are you saying I should have 2^96 (79,228,162,514,264,337,593,543,950,336(!!!)) amount of training points? I guess if I start now, my great-great-great-grandkids can test it out to see if it works…

In my testing today, I didn’t really notice a difference when querying for anything between 3 and 10 nearest points. Most of the tests were around 100-150 total points, with a 96 dimensional space.
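As a rough illustration of why the number of neighbours may not matter much here: the sketch below assumes a plain k-nearest-neighbour lookup with a majority vote over Euclidean distances, which may not be what the patch does internally. The 2-D points are made-up stand-ins for the 96-D descriptor vectors, and `knn_label` is a hypothetical helper.

```python
import math
from collections import Counter

def knn_label(query, points, labels, k):
    """Return the majority label among the k points closest to `query`."""
    # Rank every training point by its Euclidean distance to the query.
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D data standing in for the 96-D descriptor vectors.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = ["center", "center", "center", "edge", "edge", "edge"]

for k in (1, 3, 5):
    print(k, knn_label((0.3, 0.3), points, labels, k))
```

When the query sits clearly inside one group, the vote comes out the same for every k tried; k only starts to matter for hits that land between the groups.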

Granted, my testing method was “hmm, this isn’t matching too well…try higher numbers…still not matching well…try higher numbers…fuck it, clear and try again”.

Now given that the sounds of a snare are pretty damn similar, particularly between center and end of drum, I wanted to try to tweak what is being analyzed, to see if I can highlight and/or focus in on what would be different about the sounds.

So I figured I’d play with this part of the patch. And man, the descriptors->stats->compose workflow is sooo unpleasant and daunting. It would take me like 15min to make sense of your p extract-FCM-data because I have no idea what each sample…of each channel… is. Every time I’ve had to do this in one of my patches it’s a “full fat” thing where I need to sit down, manually peek~ a buffer to find the sample and channel I want, and then hope I got it right.

I know I’ve banged the buffer-as-data drum to death, but out of curiosity, when you were putting this together, you presumably thought, “Ok, I’ll take 12 of the MFCCs and their min/max/mean of main and deriv” and were you able to intuit (or know) what samples and channels all of that data corresponds to? When creating that patch and copying over, did you manually check that you were taking the right data?

I guess when you’re packing “the kitchen sink” into something, it doesn’t matter, but picking and choosing what descriptors, and what stats, is really difficult, confusing (and error prone) at the moment…
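One way to picture the bookkeeping problem is to carry a label alongside every value when flattening. This is only a sketch outside Max: it assumes 12 coefficients, four statistics (min, mean, max, std, as listed later in the thread) and one derivative, which gives 96 values. The layout, the label names, and the use of a population standard deviation are all hypothetical and do not describe how the FluCoMa buffers are actually arranged.

```python
import statistics

STATS = ("min", "mean", "max", "std")

def describe(series):
    """Four summary statistics of one coefficient's time series."""
    return (min(series), statistics.fmean(series), max(series), statistics.pstdev(series))

def flatten(frames_per_coeff):
    """Flatten a list of per-coefficient time series into (labels, values)."""
    labels, values = [], []
    for c, series in enumerate(frames_per_coeff):
        deriv = [b - a for a, b in zip(series, series[1:])]  # first difference
        for source, data in (("main", series), ("deriv", deriv)):
            for name, value in zip(STATS, describe(data)):
                labels.append(f"mfcc{c}_{source}_{name}")
                values.append(value)
    return labels, values

# 12 coefficients x 7 frames of made-up numbers.
fake = [[(c + 1) * 0.1 * f for f in range(7)] for c in range(12)]
labels, values = flatten(fake)
print(len(values))
print(labels.index("mfcc5_deriv_std"))
```

With labels attached, picking a subset of descriptors or statistics becomes a lookup by name rather than a hunt for the right sample and channel.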

Ok, ra(/ve)nt over…

So, I will try to mess with this part of the patch some to see if some better differentiation can be made.

rodrigo.constanzo · April 28, 2020, 6:12pm

Ok, so on a hunch I went back and tried a larger sample window (@numframes 512 instead of @numframes 256) thinking that there would be more difference in that initial decay time between the two kinds of hits on the snare. The initial transient (or at least the first 256 samples worth) probably sounds very similar.

The results were… very promising!

I was able to get ‘almost as good as the double mic’ setup results. And this was more-or-less across the board, with unprocessed pickup audio, or filtered/convolved, etc…

This amount of latency isn’t too perceivable either way (ca. 5ms more waiting), and I’m quite used to the 512-sample delay time now anyways.

I didn’t notice too much of a difference between @fftsettings of 128 or 256 though.
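For reference, the window-to-latency arithmetic can be checked like this. The 44.1 kHz sample rate is an assumption (the thread does not state it); at 48 kHz the numbers come out slightly smaller.

```python
def samples_to_ms(n_samples, sample_rate=44100):
    """Duration of n_samples in milliseconds at the given sample rate."""
    return 1000.0 * n_samples / sample_rate

for n in (256, 512):
    print(n, "samples =", round(samples_to_ms(n), 2), "ms")
print("extra wait:", round(samples_to_ms(512 - 256), 2), "ms")
```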

So this could be a useful vector of exploration.

tremblap · April 28, 2020, 7:21pm

another vector is the time difference of the ampslice - that will decide what you actually analyse, which 256 samples. a slightly later sample could yield more difference after the (shared) (quite similar) transient.

A quick way to see what you are doing is to record the clicks and the audio in the object. that way you can see in a daw what is being taken (like we did in Montréal with my prototype of attack detector, you remember?) Another option is to take the message you get out of the snapshot and send that to a bufcompose and look at the copied buffer in a waveform~…

rodrigo.constanzo · April 28, 2020, 8:05pm

I also delayed this by 512 samples, so it would be “in time”. Are you saying to massage those numbers further? (so the analysis window would be x number of samples after the onset is detected?)

Do you mean for the descriptor/stats? Aren’t these numbers > -1/1, so they would just show up as black blocks in any DAW? Or do you just mean to see when things temporally line up?

I did this for the onset descriptors stuff, though not exactly sure what that would do in terms of picking what kinds of descriptors/mfccs/stats to choose.

tremblap · April 28, 2020, 8:18pm

yes. I found that for my bd/sn/hh pedagogical example, it made a huge difference on the false trigger of the bd as a snare.

I meant for the audio you send to the analysis chain (the 256 or 512 samples of the ring buffer)

rodrigo.constanzo · April 29, 2020, 12:43pm

Results were not good with this. I tried further delaying the analysis by 64/128 samps, at both 256 and 512 numframes (so things like delay 320 320) and in none of the tests did this work any better, and generally performed worse.

In testing some more permutations, however, I did get the (second) best results using a 512 analysis window, and a 5k bump in the preprocessing. I guess the pickup is most sensitive in that region, so it would follow that it would also see the most amount of differentiation (like how we can see more shades of green or whatever).

I also did some casual testing of lower amounts of training hits (ca. 5) and that still seemed to work ok. I didn’t try to see how generalisable it was though. I was just doing this while testing all the variations/permutations.

So I’ve done a manual version of this. I made a simple fluid.bufcompose~ patch that writes the contents of the analysis buffer to another buffer, with a bit of gap after each hit, and saved the files. (man, the “newer” fluid.bufcompose~ syntax is 10000x better than the original flag-based version, I was able to whip that patch up in a couple of minutes)

The results are surprising.

blipAudioComparison.zip (83.2 KB)

For each example I hit 10 hits in the center, and 10 hits towards the rim, with a bit of variation in dynamic each time. It is also a new take each time (though I could/should probably do the same by feeding in a prerecorded loop to hear the difference in signal chain only).

What is super surprising to me is that the version that I get the best results with (DPA audio) is the one I hear the least amount of difference between the hits. Granted this is for the 512 window, whereas I had good results with 256 window for the DPA, so it’s possible that that first half is more differentiated.

The raw sensor.aif one has a clear, audible, differentiation between the two types of sounds. Like, night and day. Same goes for the 5k bump.aif one, with obviously a hyped top end.

I also included the convolved version as a point of reference, but you can hear that it kind of flattens the difference some (i.e. the raw sensor sounds more differentiated).

Also surprising is that even a mild highpass (80Hz) before the signal really smushed the differentiation.

So all of that is to say, that these tiny fragments do sound very different. So it’s a matter of picking the descriptors and statistics that best highlight this difference. (oh, as a quick mid-post test, I tried @numframes 512 @fftsettings 256 instead of @numframes 512 @fftsettings 128 and that seemed pretty good, which would make sense with the amount of low frequency content coming through on the snippets.)

//////////////////////////////////////////////////////////////////////////////////////////

SO, with that bit of comparison audio, do you think the 12 MFCCs + min/mean/max/std would be a good way to represent it?

tremblap · April 29, 2020, 1:15pm

I mean, this is why you guys are hired - to contribute to interface research in creative coding - so I’m happy we got there in the end

Seriously, I agree, especially for readability of old patches, that verbose attribute names, if painful to enter, are much easier to read…

It depends™. What I tend to do now is trial and error. What is it you cannot do now? If you ask a specific optimisation question, then maybe @weefuzzy @groma and/or me will have specific ideas. For me, if you get good results with similar training sizes (40-50 hits per class) then stop there and make music until something clear pops out… this is your usual modus operandi but now I’ve lost the plot in this thread on what is actually needed to improve…

rodrigo.constanzo · May 1, 2020, 4:27pm

Ok I managed some more testing with this today.

The first thing I did was create audio recordings for both training and matching the data, just to rule out other variables while testing all the many permutations.

This let me hone in on some settings that worked better across the board.

That being said, the best results I got, overall, were using a larger analysis window (@numframes 512 @fftsettings 256) and using only the audio from the pickup, with a 5k boost. This actually works better (on my pre-recorded audio) for matching everything, as compared to the DPA/SP combo that worked the best before.

Granted this is with a fixed set of data, so it may not be the most generalizable, so I may very well try doing the same thing but with larger/longer training and matching data to further refine things. (mid post edit: it seems that the DPA/SP combo at 256/128 generalizes better still (tested with a much longer training and performance dataset), but that SP/SP at 512/256 is the best single mic solution)

As per @tremblap’s suggestion I tried shorter delay times (5-10 samples) but this seemed to not have a positive result at all. I don’t think things got worse, but it didn’t seem to make a big impact.

I also tried @numframes 256 @fftsettings 64 to see if having more information in terms of time series was useful, but it was not.

Lastly, I tried plugging the same data (I think) into fluid.kdtree~ to see what I get in terms of the ‘kNearest’ and more importantly ‘kNearestDist’.

On this front I’m not exactly sure what units the kNearestDist is in, but I guess it’s related to the amount of entries in the dataset. With a dataset of 46 points (spread across two labels, with about equal amounts of hits each) I get this kind of spread:

(multislider range is 0. to 30.)

(the hits that are “off the chart” were it hearing an onset while I wasn’t at the drum, and those returned values in the 60-70 range)

I tried playing from the center to the edge of the drum (the two labels I trained) and I didn’t really notice a meaningful crossfade in kNearestDist, or if there was it was subtle. It largely stayed in the 12-20 range for the most part, and honestly seemed random.

Here is me playing 10 hits in the center of the drum, 10 hits moving towards the edge, and 10 hits at the edge:
[Image: Screenshot 2020-05-01 at 5.14.22 pm]

While doing this I did notice that the “center” label is the dominant one in terms of the matching. By that I mean that I get it matched until I move almost completely towards the edge. If you take the radius of the drum as a unit, the changeover happens around 80% of the way through.

So yeah, not really sure what I’m looking at with the distances, or if I’ve setup that part of the patch correctly, but I was able to at least land on some settings that work a bit better.

weefuzzy · May 1, 2020, 8:25pm

Are you still using MFCCs in all this? I wonder if they will be the most effective feature for tracking the change in tone across a drum. Thinking out loud, the spectral envelope of a drum strike might be dominated by some resonances that shift in frequency, but don’t change shape or distribution very much.

What happens if you used a small number of mel bands as your feature instead? Does the discrimination get better or worse?

We should try and set up some clustering examples with kmeans…

rodrigo.constanzo · May 1, 2020, 8:56pm

All of this has been using a slightly tweaked version of @tremblap’s patch from the help file.

The variance between the center/edge hits (I’m using the most similar sounds to test with as these would probably be the most difficult ones to differentiate between) is fairly pronounced (perceptually).

They do have pretty pronounced resonant peaks which don’t vary too much between the two, as those are the harmonics of the head.

I’ve not actually tried different descriptor types here. I do find that part of the patch pretty heavy/confusing in terms of picking out what descriptors/stats/etc… Definitely open to picking different analysis that is more suited to the drum stuff.

@tremblap also mentioned that MFCCs don’t respond nicely to noise, and the SP pickup has a shit signal-to-noise ratio, so this could perhaps mitigate that.

As in for visualization or for improved matching? Curious and open for all of it!

weefuzzy · May 1, 2020, 11:28pm

Just saw this, sorry. kNearestDist will be reporting the distances between the data point supplied and the k nearest in the tree. It’s diagnostically useful, both for getting an impression of how well described your data is by looking at the overall spread of distances of points to each other, but also practically useful, e.g. for determining if the data point you have is a complete outlier relative to the rest of the data.
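A minimal sketch of that outlier check, under stated assumptions: distances are taken to be Euclidean, and the rule "mean distance to the k nearest is more than `factor` times the dataset's median" is an illustrative heuristic, not something the kd-tree object provides. `k_nearest_dist` and `is_outlier` are hypothetical helper names.

```python
import math

def k_nearest_dist(query, points, k):
    """Sorted distances from `query` to its k nearest neighbours."""
    return sorted(math.dist(query, p) for p in points)[:k]

def is_outlier(query, points, k=3, factor=3.0):
    """Flag `query` if its mean k-NN distance dwarfs the dataset's typical one."""
    typical = []
    for i, p in enumerate(points):
        others = points[:i] + points[i + 1:]  # leave the point itself out
        typical.append(sum(k_nearest_dist(p, others, k)) / k)
    baseline = sorted(typical)[len(typical) // 2]  # median spacing inside the data
    mine = sum(k_nearest_dist(query, points, k)) / k
    return mine > factor * baseline

# Made-up dataset: a 3x3 grid of points with unit spacing.
points = [(float(x), float(y)) for x in range(3) for y in range(3)]

print(is_outlier((1.2, 0.8), points))    # a query inside the cloud
print(is_outlier((10.0, 10.0), points))  # a query far away from everything
```

The same idea would apply to the stray onsets reported above, whose distances sat far outside the range of the played hits.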

Indeed – it’s possible that MFCCs won’t be the most sensitive feature for capturing those distinctions. MFCCs (kind of) tell you about periodicities in the spectral shape of a signal frame, i.e how much wiggle there is at different scales in the spectrum. What I hear in those sounds though is a shift in resonances that might be better differentiated with something closer to the spectrum itself.

Yeah, the buffer flattening in particular is a bit gnarly to do in Max. Remedial work is underway! I might be tempted to start simpler for these sounds, and build it up. Replace the bufmfcc~ with a bufmelbands~ (maybe 40 bands to start) and use a smaller range of statistics, maybe just mean and standard deviation with no derivatives.

With pretty minimal testing, here’s that part of the patch, changed thusly, and a couple of comments added to the buffer flattening. You will also need to change the size of the buffer~ entry to 80 samples


----------begin_max5_patcher----------
3358.3oc0bs0iaaiE94Y9UPXjGZ2NYJI0EJt.K1zl1jEXS5VrIKxVDrXfrLs
G0QVRPhdtzhle6KuHIKIqqdjlwMO3vvKRG9wuygmC4Q42O+rEKitmkt.7WAe
Fb1Y+94mclpJYEmk8uOawV268BbSUcaQH6tnk+5hKzMwY2yUUGCDkRb83u7M
u98ubkK2MuK9qTcPLnWRn4UFtaazNd.iqdnnrZ0UweHloknEKcC2r.7+xZN1
k6cse3lqRXdbcOvWBu.PLj+hPlx+BKppXDhWieX82h5wvRxliYSxyVr1OfcK
KI0OJrTuOagabbopOqzPjHyuFodPNWTTkentJTQUIra8yGuQQstIhICWLS1k
nltKt21bw9GSzJVR3Ne0SRW4ebdtHoVMBc2xRic8zCVtnk27d3whpPFnBYrM
gx+xzAtGgDq1aBh7tgoVjf4UFEyB8CiSXorPtKOS1KZdEas6t.9UqiB4o9+l
RBPxkhFZeclD1Xix4fR7+tDe2fhIvlD+UQgRgnxJgr57W2mAHK0byp7jQ0iP
23FFrf+IfkVZLULI2ktzMQtPsLP8Fv4MxihBp1Tw3BXq4YMG6GFVCE4Qws2X
h+lq6XrKiDMtsqmspkzq1Epa8JAmfeUp6sUQataPPlJa0G+8tg9ac4LtudI.
CKZjE5JlnWm5kDEDTY9pa41FZYkfi6wtyeE+Z0KpLYPzc+3bRzhhU4U9aXo7
p0wc2jVslT9CZPuTU6VloCeEmsMNPLKp1gJF0JqvV13Vk5qYjyKZ6VAyuPcr
rkt6hRtAHrTAtSH.L.+ZF3+xuFj5JjDVJHZMv6ZWwJR.3W.gL1J.OBrIB3GJ
FTBXsPZARyif3H+pug.+PlWztPdY1WMKnVkGPUinvRszjkRaxkVByiVHkURG
k1igQIsmCMXlW+eb944EtXFPzOcsum..kFZR49do.WAtdGCD66ciXR72WzLV
X9XwBjsSIr.COEvh2H3Gbow2M.WvxcqWyRjTJI3jJ3PBtjacNzkfOIfK+f.v
0Bse.Q24K.tgq.h4vKWybk6wjlyKSAeE6xMWBPF.eQkBgIJTLdAG98u40uVL
vnDfAXs32XeAz8sAQ6VExRS+VOgPmD4u5xKu7qak3Zz7hkwQuXgv1pkGplvp
nuljSg0peZ21k5kG+PQ+jqYEPL3uIw9rlyWAZgHiOdrw.q2fG9rPjq3KXM+A
c8tA3C7aaJ2wbFUpkC7HbQkYVCPhi1KPRluN0cI7P.AOg.hvirT2MrFQDgVY
BWxO.u.0lEMq4BUxs9i0NLSdZgkN3IbvxNnIlNsCHFcAHp.Gt.rPsEq9uFFw
AgU+k1U4tQnmHMIgCdcqKgHsCR3t.oQCNXjRexFcx.NABOFVuS4sErMND8HU
pjO6gxZn5HrbdZAltL2.AsYkwFMu1dycqhps3bBXj4aZEKnNGIVLDklb2Lo1
Jj3zAOX2Gm.dgOB7Mfu5E9XveAf+51.n4diZScX7PxI1N0qSb2x.u.eAPD.K
ubUnK.5sWZFvfyLfQ0ZW5i+3jfNsNXmHn.QvJBmfiiRYeA7pTQrtdrrvVdkD
A8CUmiDPFCwCfWIjSEblBPp+gzwHU4kxiiR5JMtMyWzGoOAqChbUa7MPD2P6
zjE14zwmfeyW3hRqa44X9HgnxtO0GHYAKEBhE5okUpdpsfBDyiTSrxoN28zV
6mnQwusOqKGIWdPkkOzJkvGthc+TSizSu1vHxQG6oMQC.lCEAPOVDX+47JOw
gVNfOkjJauYnQaXJagN26YgcmxRcIyUk5nLjUYGGzRyXkB0CGN.oPt8hriyh
THi+bPRgzF77IENCUJPysTfGxJhybJEzAhEJZ7bIE4.cuRg4bxKxo9OurS5P
Ym4JSyhTH2gePVsHj4VJFhNh4bpiH2mePqHNlyIuXnVvyM0OeVsFBufNm6lk
aLZXRQyXQVk4W58B4Mat5J8s3dkKmm3ubGW6DP4aweT213lfnktAY2kXwcQ2
vkQd9dgS8qFjdDoKQQHSpHj1Gvz10ddh.lzwMo9sgzn.6zbZTXzYZTzXjOsl
SEl5fMMn8jSEy.jrkEHD3UkPE+v3cbvqJtHqLTZeLjXKaQyq4ohosXxHBjD6
nZW8fDNol+1Z5NVq.sj4GZU9MaouZBCqgfrpdT0oWMPWWwLCtOz8xJJjsYXp
jQ5CVRG7axY.uIB8f2jdJVKWajujZ4XSs7q4vbqo87podN0nLsnyufZv59zR
Y2J+nOnRNjqdOKbmlbjmBKUghkaV6GD3EEDcPpEkyWVnasHMdx66mAvKwTSD
xQ.KWZfLHXaUIQAKqxFz0iAkOHSKSJDK6poMlXZoJ4fMrLjkf0FFd+6Bhn52
.j5.M0kDUgDuqxCSPtyRFFRoL5HNIJNJoHqftzfVz+c7nMItq7yhsCVwv4EY
LoDQq0SqiEaE.pe9XJy7JVLd2NO2tV.XALYLkUPUhisMQAKNPjiitjoEDiqL
SWjJFrmbFUczXnMUslfrIXMl5.gHypCVZBSMtq7CkzWVANaPwHENiKVnvDGS
ip3rf5T40RglHBUIpTKSBQKzXpdhHL0kOtJixjXgLT8.SMDCT+ZQhYP0WV47
tRnO9uhYgfOHO7sOv15uLJX0hRxU+T5hk6VX0CjoUkW6TvlcHPKGGcoC.t57
ZaKnd4xV9Ghtj9IMKz5LXx0yqNuyPtfqlrl4xjTW1LesXRTHduvMntTHJmdd
eFff0HA4Yl2my8C7QKP+72wuRML3HDqx9V3EvDlu2CjpSSTnPHTBrzKrB9.w
Xfpepwl+Sm5aZBYSkJOrpCxRZwVqPZRfJNtjhazs5lNQG+SqF1fG17qh0mx9
jvn+zGeyX2xwzFgf5choXaS0RiEEIPoCIFkTCbZmehrMDTL01PDSaJMqDj1A
8DIchPuegjSp15gj6NQoAU3WWURkzGD09cHLxxVaIStdqdNz1H2+iGVkDsgE
9QE08Ymi2mQ3SDmiZmhaAI1p2Oz1x.oY.hED7jtKhJvLeugawlNk1dcLnDsU
DCZl8TYUVFcvsMyLy1To941sJY0HyukExt0cbz3F7uGMtY5QyFqipYNddXoi
hC2KUs20jojq9863bgr0Ai8PeHbJ.RXVoIWr9ghuggSHg5st9geYfRkXktXy
hCKM8hV5cxDLtKg6.lFJSqc+OyhLgOoDp+YZf+JVxn8DYf1IaN32gN5ZFza0
75SHd8N1pwhU1CUrq0qoTrEwyk3euGOYjVQlOAhwYIeYtnc0HNOold9onU4o
i+zOwZ2cowsI8jNg62DROpwSpzn9zH5RXZxlzgaZNCh1O66wmIysSpXJ+1L4
mbda7AunX1W59XxOJcpZgXmeJDTC4elKtvGcWNVRv.87dwyGy4ihP1VNZGnq
sac6ABLuFN9XzFY.KibQwPDFMAo2Pwz.oBhxrxMj7L5E7+YaOWrzSniSex8V
15njseYrll6lCCuz54DnW4lbyKCkeu0uTcj.cM4JNPiQdhmsbXClEFAHEW.T
wUA4nfmI3bxP8Pzm7yWXxOirdOegdOt8Igo7qY4BBDl8AiLEOJ7XNZMqVuKj
NOOSXuLw8GW9AEZ+xJVFvD6SnudvuOJ5lo6zvF1b4zlsNQTN+vaFczpBgG5T
rSqgMTXQpkaEaDi4PJ29qQP0cixk5+HDFvqr9wAOPqsi80zlNwEU9spf8nRm
i9komPa2OAWFWGS2IQM4FcLYeRlZOO2NKEHckneWnaTwsXyeqBqc4K7cc1HP
pYVFcPoFNjrRDmIc2ust2aW7+iP83dzwceO0VRJ6WXCjJmpgQQxH4S6DtzWP
z.NVDSrgElVc2JoEKzTKVU99Dmt0hGsGlNEW19gk5vTEwQlUZWzXomvMteL4
oVUsigsJV9auZnG5Vob5ax8uUm5seevN1KQi5FNjYkHIKID1aCiV1x0TJfIQ
2ENZIrWaqSnD95GbGu.BwTKc9fPHFFXstCFI7wZ5Ev2lvXGgDpPuhyTCYK8G
a5Et2KrrExcGs34XBss0m1mMTHbJnzlhnVSuL9uYqFs7QJbbCUDUCLaGioV9
9EVPPzccKhGlocVTHzoy8bwV5DMg3fbTdiJFiko0rMAv+YbBnt9h2FEbDTjh
SuvR3WikNmdJuk1jHfw6RhCNNmGHEo.oIxRn+qLRIMCX0iGplPsiQXKnMUoD
fcrrwNS47Jg4w7us6atqINQQrj6mSvIOWASzgnHoEi3BP5kPzXDulB6fzr8O
HYYHtfjSmzcRRYgqRO01AYOH2stWsL7kffD8EFYZCoYYTOAY068Y7znvxci6
yNXoyk6ct7Hv6jQgtXxd+ob15cAA7dufk5ImpikERev11FDC8W6fsP6pqK7R
9stPTc0tvEe4IHPblVPUOo3G0kFcjgpicHNZeTLJzBLvTCpyyOdH+.HOk9dA
tSdZNOLbQxvt2ibDJARU3kFY3mPEN6tCFzGMvv9FRZ4.EQTJT6VpPAvL6t1s
gPCTm4M8O3mve.7iahNJ0Y8mNWK+eB84+w4+elQI9bB
-----------end_max5_patcher-----------
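The buffer size of 80 follows from the feature count: 40 bands times two statistics, with no derivatives. A small sketch of that arithmetic, assuming every descriptor gets the same statistics for the raw series and for each derivative (`feature_count` is a hypothetical helper, not part of the toolbox):

```python
def feature_count(n_descriptors, n_stats, n_derivatives=0):
    """Values per hit: n_stats for the raw series and for each derivative."""
    return n_descriptors * n_stats * (1 + n_derivatives)

# 12 MFCCs, min/mean/max/std, raw series plus first derivative.
print(feature_count(12, 4, 1))
# 40 mel bands, mean and std only, no derivatives.
print(feature_count(40, 2, 0))
```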

tremblap · May 2, 2020, 1:42pm

to make it even quicker replacement, you could use 13 melbands between 200 and 4k and keep everything else the same in the patch. you get the same number of values so it is a one object replacement (a quick experiment!)
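To get a feel for where such bands would sit, here is a sketch that spaces band centres evenly on a mel scale. It assumes the common HTK-style mel formula; the actual filterbank in the melbands object may use a different formula or band shape, so the numbers are indicative only.

```python
import math

def hz_to_mel(f):
    """HTK-style mel scale (an assumption, see above)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centres(n_bands, min_freq, max_freq):
    """Centre frequencies of n_bands filters equally spaced in mel."""
    lo, hi = hz_to_mel(min_freq), hz_to_mel(max_freq)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

for n in (13, 40):
    c = mel_centres(n, 200.0, 4000.0)
    print(n, "bands: first gap %.1f Hz, last gap %.1f Hz" % (c[1] - c[0], c[-1] - c[-2]))
```

The gaps widen towards the top of the range, and more bands in the same range means narrower bands throughout.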

rodrigo.constanzo · May 2, 2020, 1:59pm

Interestingly, with this melbands version of the patch, the numbers I get from kNearestDist are more in the order of 0.001-0.002 range (rather than 13.-20. from the prior version). I guess this has to do with the types of units in the dataset(?).

Ok, so I gave this a spin and it’s not massively different. I seem to get marginally better results with the smaller sample/fft size (256/128) and about identical for the larger one (512/128).

Maybe it generalizes better, but it’s kind of hard to tell. Basically for these tests I run it with small and large pre-recorded files, and then playing. I can get exactly the same results on the fixed files, but that doesn’t necessarily translate to better matching on the drum.

I just tested both on more real-world-ly different sounds and both perform really well. I think for the sake of honing in on things, the center to edge difference is probably the smallest difference I’d be trying to train.

Based on this I tried doing the 40bands, but in a smaller frequency range (@minfreq 100 @maxfreq 6000 and @minfreq 200 @maxfreq 4000) and both did much worse in terms of matching. Don’t know if that’s too many bands for that frequency range, or if this isn’t how it’s supposed to work at all.

Perhaps something like this would be useful in terms of figuring out what returns the most differentiated clusters?

And as a spitball here, as I don’t think this is how ML shit works, but it would be handy to have a couple of sounds and be able to have a meta-algorithm which you can specify that these two sounds are different, so it would then iterate over descriptors and statistics to find what most accurately captures that difference between the sounds. Rather than manually testing/tweaking (in the dark).

tremblap · May 2, 2020, 2:22pm

yes. distances mean nothing in themselves, they are dependent on units and normalisation, so you cannot compare between descriptor spaces… hence the importance of LPT’s sanitisation of descriptor space before the distances are calculated!
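A tiny sketch of that point, with made-up numbers: the same two points expressed in units 10,000 times smaller give a distance 10,000 times smaller, even though nothing about how well they match has changed.

```python
import math

a = [12.0, -3.0, 7.5]
b = [10.0, -1.0, 9.0]

def scaled(vector, factor):
    """The same vector expressed in different units."""
    return [x * factor for x in vector]

d_raw = math.dist(a, b)
d_small = math.dist(scaled(a, 0.0001), scaled(b, 0.0001))
print(d_raw, d_small, d_raw / d_small)
```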

melbands are a (rough) perceptual model of critical bands. you could go online, find how many critical bands there are in the range you care about and choose that. one way to decide the range you care about could be that you try (in reaper) high and low pass filters on it to see what happens.

something in between is already possible but you will never get a free ride. as ML people say: ‘just add data’

with immense amount of data, and immense amount of descriptors, and immense cpu time, you could do what you want. this is what amazon does. but an in-between is possible with a compromise on you curating and training and tweaking, and many descriptors, and data reduction algo that will remove redundancy in the latter… but no free ride for small batch + time series + low latency + low cpu…

rodrigo.constanzo · May 2, 2020, 2:36pm

Is there a way to normalize the output relative to the space?

As in, getting a number between 0. and 1. as the output? Just to see what’s matching and why.

I meant more in terms of requesting a high number of melbands for a frequency range that is (potentially) too small (e.g. @numbands 40 @minfreq 1000 @maxfreq 1111). Does that “break” the melband computation or are they just somewhat proportionally distributed in that space?

Obviously to do it well would require lots of time/cpu/data/etc…, but isn’t the principle the same? I meant more in terms of a workflow where, rather than sifting through algorithms and parameters, you pick what you want to be classed as different, and it does the best it can with what it has (similar to your autothreshold picker thing: rather than running it over and over hoping to get the right amount, you specify what you want and let the algorithm iterate). At the moment each permutation of this is very time consuming (and again, error prone) to set up, with very little to tell me how effective it has been.

Even given the two versions that work “well” now. As far as I can tell, both work equally well in terms of results (MFCCs with 512/256 and MEL with 256/128), but without being able to see clustering, or have a more objective way of measuring success other than feeding it test audio and saying it found all the notes or not, I don’t know what to do as a next step, or what/how to approach it differently.

Obviously there is no free ride, but at the moment it’s shooting in the dark.
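One generic way to put a number on "how well does this feature set separate my two labels" is leave-one-out nearest-neighbour accuracy: hold out each hit in turn, find its nearest neighbour among the rest, and count how often the labels agree. This is a standard evaluation idea sketched with made-up data, not a feature of the toolbox; `loo_accuracy` and the candidate names are hypothetical.

```python
import math

def loo_accuracy(points, labels):
    """Leave-one-out 1-nearest-neighbour accuracy for one feature set."""
    hits = 0
    for i, p in enumerate(points):
        # Nearest neighbour among all the other points.
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: math.dist(p, points[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(points)

labels = ["center"] * 4 + ["edge"] * 4
# Hypothetical feature sets computed from the same eight hits.
candidates = {
    "features_A": [(0.0,), (0.1,), (0.2,), (0.3,), (1.0,), (1.1,), (1.2,), (1.3,)],
    "features_B": [(0.0,), (1.0,), (0.2,), (1.2,), (0.1,), (1.1,), (0.3,), (1.3,)],
}
scores = {name: loo_accuracy(pts, labels) for name, pts in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Running each descriptor/statistics permutation through the same score would at least rank them on identical recordings, though a small dataset can make the ranking unstable.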

tremblap · May 2, 2020, 2:55pm

Try fluid.normalize and fluid.standardize on your data. There is an example in LPT and also in the simple learning examples to see their interface. Let us know how you understand them or not.

no, but you might get too much detail to make generalisation. imagine having too much precision on pitch, it won’t help you find trends…

play with it, make music, and that will give you a feel of what works well with each. Then you might have clearer edge cases that help you train and test the mechanism you implement for your specific task.

Maybe @weefuzzy will have more information here, but shooting in the dark is what it’s all about - otherwise you can use tools where someone else has done that experimental training for you, curating the experiments and the results. Either you try to get a feel, or you look at numbers, and you have both possibilities now - you can even look at it with the same paradigm we have in the learning example (json or graphics)

Your next step is to try data sanitisation and normalisation/standardisation and the interaction with the choice of descriptors. There is very little more than learning how they interact since they all fail in various ways.
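For orientation, these are the textbook operations the two names suggest: min-max scaling of each descriptor to 0..1, and shifting each descriptor to zero mean and unit standard deviation. The sketch is not the implementation of the FluCoMa objects, and the data is made up. Whatever scaling is learned from the training set would also need to be applied to each incoming query.

```python
import statistics

def normalize(columns):
    """Min-max scale each column (one descriptor) to the range 0..1."""
    out = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid dividing by zero on constant columns
        out.append([(x - lo) / span for x in col])
    return out

def standardize(columns):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    out = []
    for col in columns:
        mean = statistics.fmean(col)
        std = statistics.pstdev(col) or 1.0
        out.append([(x - mean) / std for x in col])
    return out

# Two descriptors on wildly different scales.
columns = [[-40.0, -20.0, 0.0, 20.0], [0.001, 0.002, 0.004, 0.003]]
print(normalize(columns))
print(standardize(columns))
```

After either step both descriptors contribute on a comparable scale, so one with large raw numbers no longer dominates the distance.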

tremblap · May 2, 2020, 2:56pm

you could also use the MFCC clustering app we talked about on another thread and see how it classifies your entries chopped at the length you will feed…

rodrigo.constanzo · May 2, 2020, 3:28pm

Would that be useful here? At the moment, anyways, I’m feeding in the same kind of data for training that I am matching, so presumably they would be scaled “the same”? Or do you mean getting things like MFCCs to behave more reasonably?

Shooting in the dark means different things to you and me though. You (and the rest of the team) know what’s going on with the algorithms, their implementations, intended use cases, etc… Whereas I’m clueless to (most(/all) of) that.

Do you remember who posted it? Or how long ago? There was that language one this week, but I don’t think that’s the one.

Having some kind of visualization of clustering would definitely be useful though, as with that I could more easily figure out if something is working better or not.

rodrigo.constanzo · May 2, 2020, 6:03pm

I went back and tested permutations of this (as in, having melbands be the main descriptors but with more stats) and I want to say I got some great results at 256/128 with one of the permutations, but in going back to try to figure out which one, I wasn’t able to reproduce it. Perhaps I had a nice sweet spot of amount of training data or something, but as it stands I’m still getting the best results from MFCCs with 512/256 (and more stats).

tremblap · May 3, 2020, 9:39am

This is a vector of truth. My understanding is building candidly on this website and with the PA learning experiments folder. The other two actually know what they are doing. The dialogue between my ignorance (and also those of you who share questions like you) and their knowledge is the point of this project: trying to find a point where we can discuss and share knowledge and exploration and vision without one person doing all the tasks (coding, exploring, poking, implementing, bending) so we all win in this dialogue. Does it make sense? So the extent of my knowledge is shared here 100%, and it isn’t much, but it keeps our questions focused on what we want. So the playing/poking part is quite important, because it allows you to build a different bias in your knowledge and quest. Otherwise I can send you SciKit Learn tutorials (which I have not done myself for the reason I give you) and you can come at it from a data scientist perspective… an in-between could be Rebecca Fiebrink’s class on Kadenze, which I’ve done a part of, but again she offers ways to poke at data.

Now if you want to see your data, you have a problem that we all have: visualising 96D in 2D. see below for some examples on how imperfect that is. Another approach is the clustering you get in our objects for now: it will tell you how many items per cluster and you can try to see which descriptor gives you 2 strong classes.
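A sketch of that cluster-count check, using a plain Lloyd's k-means with two clusters on made-up 2-D points. It is not the toolbox's clustering object: seeding with the first k points is a simplification, and `agreement` is a hypothetical helper that only makes sense for two clusters and two labels.

```python
import math
from collections import Counter

def kmeans(points, k=2, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment

def agreement(assignment, labels):
    """Share of points whose cluster matches their label, up to cluster renaming."""
    names = sorted(set(labels))
    same = sum((a == 0) == (l == names[0]) for a, l in zip(assignment, labels))
    return max(same, len(labels) - same) / len(labels)

points = [(0.0, 0.1), (1.0, 0.9), (0.2, 0.0), (0.1, 0.2), (0.9, 1.1), (1.1, 1.0)]
labels = ["center", "edge", "center", "center", "edge", "edge"]

assignment = kmeans(points)
print(Counter(assignment), agreement(assignment, labels))
```

If a descriptor set yields two clusters of roughly the trained sizes that also line up with the labels, that is a hint it separates the classes; lopsided counts suggest it does not.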

This thread is about the app I was talking about. This and this threads are about other visualisation possibilities.

Again, as we discussed, you could try more segregated values (further apart in the space, very different) as your training data; another approach is to train the ambiguous space. I’ve never tried either but you can now.