Onset-based regression (JIT-MFCC example)

I also delayed this by 512 samples, so it would be “in time”. Are you saying to massage those numbers further? (so the analysis window would be x number of samples after the onset is detected?)

Do you mean for the descriptor/stats? Aren’t these numbers outside the -1/1 range, so they would just show up as black blocks in any DAW? Or do you just mean to see when things temporally line up?

I did this for the onset descriptors stuff, though I’m not exactly sure what that would do in terms of picking what kinds of descriptors/mfccs/stats to choose.

yes. I found that for my bd/sn/hh pedagogical example, it made a huge difference on the false triggering of the bd as a snare.

I meant for the audio you send to the analysis chain (the 256 or 512 samples of the ring buffer)


Results were not good with this. I tried further delaying the analysis by 64/128 samps, at both 256 and 512 numframes (so things like delay 320 320), and in none of the tests did this work any better; it generally performed worse.

In testing some more permutations, however, I did get the (second) best results using a 512 analysis window, and a 5k bump in the preprocessing. I guess the pickup is most sensitive in that region, so it would follow that it would also see the most amount of differentiation (like how we can see more shades of green or whatever).

I also did some casual testing of lower amounts of training hits (ca. 5) and that still seemed to work ok. I didn’t try to see how generalisable it was though. I was just doing this while testing all the variations/permutations.

So I’ve done a manual version of this. I made a simple fluid.bufcompose~ patch that writes the contents of the analysis buffer to another buffer, with a bit of gap after each hit, and saved the files. (man, the “newer” fluid.bufcompose~ syntax is 10000x better than the original flag-based version, I was able to whip that patch up in a couple of minutes)

The results are surprising.

blipAudioComparison.zip (83.2 KB)

For each example I played 10 hits in the center and 10 hits towards the rim, with a bit of variation in dynamic each time. It is also a new take each time (though I could/should probably do the same by feeding in a prerecorded loop to hear the difference in signal chain only).

What is super surprising to me is that the version that I get the best results with (DPA audio) is the one I hear the least amount of difference between the hits. Granted this is for the 512 window, whereas I had good results with 256 window for the DPA, so it’s possible that that first half is more differentiated.

The raw sensor.aif one has a clear, audible differentiation between the two types of sounds. Like, night and day. Same goes for the 5k bump.aif one, though obviously with a hyped top end.

I also included the convolved version as a point of reference, but you can hear that it kind of flattens the difference some (i.e. the raw sensor sounds more differentiated).

Also surprising is that even a mild highpass (80Hz) early in the signal chain really smushed the differentiation.

So all of that is to say that these tiny fragments do sound very different. So it’s a matter of picking the descriptors and statistics that best highlight this difference. (Oh, as a quick mid-post test, I tried @numframes 512 @fftsettings 256 instead of @numframes 512 @fftsettings 128 and that seemed pretty good, which would make sense with the amount of low frequency content coming through on the snippets.)


SO, with that bit of comparison audio, do you think the 12 MFCCs + min/mean/max/std would be a good way to represent it?

I mean, this is why you guys are hired - to contribute to interface research in creative coding - so I’m happy we got there in the end :slight_smile:

Seriously, I agree, especially for readability of old patches: verbose attribute names, even if painful to enter, are much easier to read…

It depends™ :slight_smile: what I tend to do now is trial and error. What is it you cannot do now? If you ask a specific optimisation question, then maybe @weefuzzy, @groma and/or I will have specific ideas. For me, if you get good results with similar training sizes (40-50 hits per class) then stop there and make music until something clear pops out… this is your usual modus operandi, but now I’ve lost the plot in this thread on what is actually needed to improve…

Ok I managed some more testing with this today.

The first thing I did was create audio recordings for both training and matching the data, just to rule out other variables while testing all the many permutations.

This let me hone in on some settings that worked better across the board.

That being said, the best results I got, overall, were using a larger analysis window (@numframes 512 @fftsettings 256) and using only the audio from the pickup, with a 5k boost. This actually works better (on my pre-recorded audio) for matching everything, as compared to the DPA/SP combo that worked the best before.

Granted this is with a fixed set of data, so it may not be the most generalizable, so I may very well try doing the same thing but with larger/longer training and matching data to further refine things. (mid post edit: it seems that the DPA/SP combo at 256/128 generalizes better still (tested with a much longer training and performance dataset), but that SP/SP at 512/256 is the best single mic solution)

As per @tremblap’s suggestion I tried shorter delay times (5-10 samples) but this seemed to not have a positive result at all. I don’t think things got worse, but it didn’t seem to make a big impact.

I also tried @numframes 256 @fftsettings 64 to see if having more information in terms of time series was useful, but it was not.

Lastly, I tried plugging the same data (I think) into fluid.kdtree~ to see what I get in terms of the kNearest and, more importantly, kNearestDist outputs.

On this front I’m not exactly sure what units kNearestDist is in, but I guess it’s related to the number of entries in the dataset. With a dataset of 46 points (spread across two labels, with about equal amounts of hits each) I get this kind of spread:

(multislider range is 0. to 30.)

(the hits that are “off the chart” were cases where it heard an onset while I wasn’t at the drum, and those returned values in the 60-70 range)

I tried playing from the center to the edge of the drum (the two labels I trained) and I didn’t really notice a meaningful crossfade in kNearestDist, or if there was one, it was subtle. It largely stayed in the 12-20 range for the most part, and honestly seemed random.

Here is me playing 10 hits in the center of the drum, 10 hits moving towards the edge, and 10 hits at the edge:
Screenshot 2020-05-01 at 5.14.22 pm

While doing this I did notice that the “center” label is the dominant one in terms of the matching. By that I mean that I get it matched until I move almost completely towards the edge. If you take the radius of the drum as a unit, the changeover happens around 80% of the way through.

So yeah, not really sure what I’m looking at with the distances, or if I’ve set up that part of the patch correctly, but I was able to at least land on some settings that work a bit better.

Are you still using MFCCs in all this? I wonder if they will be the most effective feature for tracking the change in tone across a drum. Thinking out loud, the spectral envelope of a drum strike might be dominated by some resonances that shift in frequency, but don’t change shape or distribution very much.

What happens if you used a small number of mel bands as your feature instead? Does the discrimination get better or worse?

We should try and set up some clustering examples with kmeans…


All of this has been using a slightly tweaked version of @tremblap’s patch from the help file.

The variance between the center/edge hits (I’m using the most similar sounds to test with as these would probably be the most difficult ones to differentiate between) is fairly pronounced (perceptually).

They do have pretty pronounced resonant peaks which don’t vary too much between the two, as those are the harmonics of the head.

I’ve not actually tried different descriptor types here. I do find that part of the patch pretty heavy/confusing in terms of picking out what descriptors/stats/etc… Definitely open to picking different analysis that is more suited to the drum stuff.

@tremblap also mentioned that MFCCs don’t respond nicely to noise, and the SP pickup has a shit signal-to-noise ratio, so this could perhaps mitigate that.

As in for visualization or for improved matching? Curious and open for all of it!

Just saw this, sorry. kNearestDist will be reporting the distances between the data point supplied and the k nearest in the tree. It’s diagnostically useful, both for getting an impression of how well described your data is by looking at the overall spread of distances of points to each other, but also practically useful, e.g. for determining if the data point you have is a complete outlier relative to the rest of the data.
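If it helps, here’s a rough sketch of that idea, not FluCoMa code: scipy’s cKDTree standing in for fluid.kdtree~, with completely made-up 96-D feature data, just to show why the nearest distances flag outliers.

```python
# Rough sketch, not FluCoMa code: scipy's cKDTree standing in for
# fluid.kdtree~, with made-up feature data, to show what the nearest
# distances report and why they flag outliers.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(46, 96))  # 46 pretend 96-D feature vectors
tree = cKDTree(data)

# a query drawn from the same distribution: modest distances
dists_in, _ = tree.query(rng.normal(0.0, 1.0, size=96), k=3)

# a far-away query (like an onset with nobody at the drum): big distances
dists_out, _ = tree.query(np.full(96, 10.0), k=3)

print(dists_out.mean() > dists_in.mean())  # True: outliers sit much further away
```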

Indeed – it’s possible that MFCCs won’t be the most sensitive feature for capturing those distinctions. MFCCs (kind of) tell you about periodicities in the spectral shape of a signal frame, i.e. how much wiggle there is at different scales in the spectrum. What I hear in those sounds, though, is a shift in resonances that might be better differentiated with something closer to the spectrum itself.
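A little sketch of that intuition, with made-up Gaussian “resonance bump” spectra (scipy’s orthonormal DCT standing in for the final stage of MFCC computation):

```python
# Made-up illustration: two Gaussian "resonance bumps" as pretend 40-band
# log mel spectra. The orthonormal DCT (the last stage of MFCC computation)
# preserves total distance, so keeping only low-order coefficients can only
# discard some of the difference between the two shifted resonances.
import numpy as np
from scipy.fft import dct

bands = np.arange(40, dtype=float)

def bump(centre):
    # one resonant bump, shifting in frequency but not changing shape
    return np.exp(-0.5 * ((bands - centre) / 4.0) ** 2)

centre_hit, edge_hit = bump(12.0), bump(18.0)

mfcc_centre = dct(centre_hit, type=2, norm='ortho')
mfcc_edge = dct(edge_hit, type=2, norm='ortho')

d_mel = np.linalg.norm(centre_hit - edge_hit)              # full spectral distance
d_mfcc5 = np.linalg.norm(mfcc_centre[:5] - mfcc_edge[:5])  # first 5 MFCCs only

print(d_mfcc5 < d_mel)  # True: truncation throws part of the shift away
```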

Yeah, the buffer flattening in particular is a bit gnarly to do in Max. Remedial work is underway! I might be tempted to start simpler for these sounds, and build it up. Replace the bufmfcc~ with a bufmelbands~ (maybe 40 bands to start) and use a smaller range of statistics, maybe just mean and standard deviation with no derivatives.

With pretty minimal testing, here’s that part of the patch, changed thusly, and a couple of comments added to the buffer flattening. You will also need to change the size of the buffer~ entry to 80 samples.
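As a sketch of where the 80 comes from (plain numpy with pretend analysis data, not the patch itself): 40 mel bands per frame, summarised by mean and standard deviation only, gives a 40 * 2 = 80-value feature vector.

```python
# Sketch (plain numpy, pretend analysis data) of the feature layout being
# described: 40 mel bands per frame, summarised by mean and standard
# deviation only, i.e. 40 * 2 = 80 values, hence the buffer~ at 80 samples.
import numpy as np

rng = np.random.default_rng(1)
melbands = rng.random((4, 40))  # pretend analysis: 4 frames x 40 bands

feature = np.concatenate([melbands.mean(axis=0),  # 40 per-band means
                          melbands.std(axis=0)])  # 40 per-band std devs

print(feature.shape)  # (80,)
```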


to make it an even quicker replacement, you could use 13 melbands between 200 and 4k and keep everything else the same in the patch. you get the same number of values, so it is a one-object replacement (a quick experiment!)


Interestingly, with this melbands version of the patch, the numbers I get from kNearestDist are more in the 0.001-0.002 range (rather than the 13.-20. from the prior version). I guess this has to do with the types of units in the dataset(?).

Ok, so I gave this a spin and it’s not massively different. I seem to get marginally better results with the smaller sample/fft size (256/128) and about identical for the larger one (512/128).

Maybe it generalizes better, but it’s kind of hard to tell. Basically for these tests I run it with small and large pre-recorded files, and then playing. I can get exactly the same results on the fixed files, but that doesn’t necessarily translate to better matching on the drum.

I just tested both on more real-world-ly different sounds and both perform really well. I think for the sake of honing in on things, the center to edge difference is probably the smallest difference I’d be trying to train.

Based on this I tried doing the 40bands, but in a smaller frequency range (@minfreq 100 @maxfreq 6000 and @minfreq 200 @maxfreq 4000) and both did much worse in terms of matching. Don’t know if that’s too many bands for that frequency range, or if this isn’t how it’s supposed to work at all.

Perhaps something like this would be useful in terms of figuring out what returns the most differentiated clusters?

And as a spitball here, since I don’t think this is how ML shit works: it would be handy to have a couple of sounds and a meta-algorithm where you can specify that these two sounds are different, and it would then iterate over descriptors and statistics to find what most accurately captures the difference between the sounds. Rather than manually testing/tweaking (in the dark).

yes. distances mean nothing in themselves; they are dependent on units and normalisation, so you cannot compare between descriptor spaces… hence the importance of LPT’s sanitisation of the descriptor space before the distances are calculated!

melbands are a (rough) perceptual model of critical bands. you could go online, find how many critical bands there are in the range you care about and choose that. one way to decide the range you care about could be that you try (in reaper) high and low pass filters on it to see what happens.
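One way to do that count without going online: Zwicker’s Bark formula (one common critical-band model; the helper name here is just for illustration) gives roughly 15 critical bands between 200 Hz and 4 kHz, in the same ballpark as the 13 melbands suggested earlier.

```python
# Estimate how many critical bands fall in a frequency range, using
# Zwicker's approximation of the Bark (critical band) scale.
import math

def bark(freq_hz):
    # Zwicker's approximation of the Bark scale
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

n_bands = bark(4000.0) - bark(200.0)
print(round(n_bands))  # 15
```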

something in between is already possible but you will never get a free ride. as ML people say: ‘just add data’

with immense amount of data, and immense amount of descriptors, and immense cpu time, you could do what you want. this is what amazon does. but an in-between is possible with a compromise on you curating and training and tweaking, and many descriptors, and data reduction algo that will remove redundancy in the latter… but no free ride for small batch + time series + low latency + low cpu…

Is there a way to normalize the output relative to the space?

As in, getting a number between 0. and 1. as the output? Just to see what’s matching and why.

I meant more in terms of requesting a high number of melbands for a frequency range that is (potentially) too small (e.g. @numbands 40 @minfreq 1000 @maxfreq 1111). Does that “break” the melband computation or are they just somewhat proportionally distributed in that space?

Obviously to do it well would require lots of time/cpu/data/etc…, but isn’t the principle the same? I meant more in terms of a workflow where, rather than sifting through algorithms and parameters, you pick what you want to be classed as different, and it does the best it can with what it has (similar to your autothreshold picker thing: rather than running it over and over hoping to get the right amount, you specify what you want and let the algorithm iterate). At the moment each permutation of this is very time consuming (and again, error prone) to set up, with very little to tell me how effective it has been.

Even given the two versions that work “well” now: as far as I can tell, both work equally well in terms of results (MFCCs with 512/256 and MEL with 256/128), but without being able to see clustering, or having a more objective way of measuring success other than feeding it test audio and seeing whether it found all the notes or not, I don’t know what to do as a next step, or what/how to approach it differently.

Obviously there is no free ride, but at the moment it’s shooting in the dark.
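(As a rough illustration of what I mean by a more objective measure, here’s a numpy sketch with completely made-up feature data: hold out some labelled hits and score a simple 1-nearest-neighbour match on them, rather than judging each permutation by ear.)

```python
# Completely made-up feature data, sketching one objective measure of
# success: hold out every 4th labelled hit and score a simple
# 1-nearest-neighbour match on the held-out hits.
import numpy as np

rng = np.random.default_rng(2)
center = rng.normal(0.0, 0.3, size=(20, 8))  # 20 pretend "center" hits
edge = rng.normal(1.0, 0.3, size=(20, 8))    # 20 pretend "edge" hits
X = np.vstack([center, edge])
y = np.array([0] * 20 + [1] * 20)

held_out = np.arange(40) % 4 == 0  # keep every 4th hit for testing
X_train, y_train = X[~held_out], y[~held_out]
X_test, y_test = X[held_out], y[held_out]

# 1-nearest-neighbour matching, like a kdtree lookup with k=1
dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
predicted = y_train[dists.argmin(axis=1)]

accuracy = (predicted == y_test).mean()  # fraction of held-out hits matched
```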

Try fluid.normalize and fluid.standardize on your data. There is an example in LPT and also in the simple learning examples to see their interface. Let us know how you understand them or not.
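A plain-numpy sketch (not the FluCoMa objects themselves) of what the two operations do to each column of a dataset:

```python
# Plain-numpy sketch of per-column normalization and standardization.
import numpy as np

data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [4.0, 200.0]])  # two features on very different scales

# normalize: rescale each column into the 0. to 1. range
lo, hi = data.min(axis=0), data.max(axis=0)
normalized = (data - lo) / (hi - lo)

# standardize: shift/scale each column to mean 0, standard deviation 1
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

print(normalized.min(), normalized.max())  # 0.0 1.0
```

Either way, distances computed after a step like this live on a comparable scale, which is one reason kNearestDist numbers change so much between descriptor spaces.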

no, but you might get too much detail to generalise well. imagine having too much precision on pitch, it won’t help you find trends…

play with it, make music, and that will give you a feel of what works well with each. Then you might have clearer edge cases that help you train and test the mechanism you implement for your specific task.

Maybe @weefuzzy will have more information here, but shooting in the dark is what it’s all about - otherwise you can use tools where someone else has done that experimental training for you, curating the experiments and the results. Either you try to get a feel, or you look at numbers, and you have both possibilities now - you can even look at it with the same paradigm we have in the learning example (json or graphics)

Your next step is to try data sanitisation and normalisation/standardisation and the interaction with the choice of descriptors. There is very little more than learning how they interact since they all fail in various ways.

you could also use the MFCC clustering app we talked about on another thread and see how it classifies your entries chopped at the length you will feed…

Would that be useful here? At the moment, anyways, I’m feeding in the same kind of data for training that I am matching, so presumably they would be scaled “the same”? Or do you mean getting things like MFCCs to behave more reasonably?

Shooting in the dark means different things to you and me though. You (and the rest of the team) know what’s going on with the algorithms, their implementations, intended use cases, etc… Whereas I’m clueless to (most(/all) of) that.

Do you remember who posted it? Or how long ago? There was that language one this week, but I don’t think that’s the one.

Having some kind of visualization of clustering would definitely be useful though, as with that I could more easily figure out if something is working better or not.

I went back and tested permutations of this (as in, having melbands be the main descriptors but with more stats) and I want to say I got some great results at 256/128 with one of the permutations, but in going back to try to figure out which one, I wasn’t able to reproduce it. Perhaps I had a nice sweet spot of amount of training data or something, but as it stands I’m still getting the best results from MFCCs with 512/256 (and more stats).


This is a vector of truth. My understanding is building candidly on this website and with the PA learning experiments folder. The other two actually know what they are doing. The dialogue between my ignorance (and that of those of you who share similar questions) and their knowledge is the point of this project: trying to find a point where we can discuss and share knowledge and exploration and vision without one person doing all the tasks (coding, exploring, poking, implementing, bending) so we all win in this dialogue. Does it make sense? So the extent of my knowledge is shared here 100%, and it isn’t much, but it keeps our questions focused on what we want. So the playing/poking part is quite important, because it allows you to build a different bias into your knowledge and quest. Otherwise I can send you SciKit Learn tutorials (which I have not done myself, for the reason I give you) and you can come at it from a data scientist perspective… an in-between could be Rebecca Fiebrink’s class on Kadenze, which I’ve done part of, but again she offers ways to poke at data.

Now if you want to see your data, you have a problem that we all have: visualising 96D in 2D. see below for some examples on how imperfect that is. Another approach is the clustering you get in our objects for now: it will tell you how many items per cluster and you can try to see which descriptor gives you 2 strong classes.

This thread is about the app I was talking about. This and this threads are about other visualisation possibilities.

Again, as we discussed, one approach is to use more segregated values (further apart in the space, very different) as your training data; another approach is to train the ambiguous space. I’ve never tried either but you can now.


and again, XKCD has the answer with Friday’s graph:

Yeah, that’s a bummer. I guess with data reduction you (potentially) lose some of how differentiated things are.

Sadly this appears fucked. I downloaded it (AudioStellar) and set it up, and it just sits on 0% initializing or gives me an “error path not found” whenever I try to load samples.

@jamesbradbury’s example isn’t online anymore either.

I created a mini set of samples so there are 20 of each of center and edge hits, all 512 samples long.
tiny samples.zip (44.1 KB)

Hmmm. At the moment I’m getting pretty solid differentiation when I train sounds that are actually different (e.g. center and rim), but for most of those sounds there’s not really a middle ground or ambiguous space, as there’s a different surface involved. I did try training hitting the rim nearer the tip of the stick vs nearer the shoulder, and that worked without any mistakes. I didn’t test to see where that difference started though.

The center to edge, I guess, is the hardest one to tell, in terms of regular snare hits, since they sound the most similar. Perhaps this is not the case, but my thinking is that if I can get those working smoothly, the rest will be a piece of cake.

I plan to rectify this soon, as it’s going to be part of a potential journal submission. Stay tuned…