Regression + Classification = Regressification?

I think you misunderstood what I said: if you listen only to the bit that is analysed (which you can't be doing now, since I hear a lot more than 256 samples), that should give you an intuition of how it analyses. For instance, what you play at 1:10 is quite similar by ear… so it means your actual data is not discriminated enough, if you see what I mean?

In other words, you need to listen only to what you ask the machine to listen to. If that is not clearly segregated by way of description and/or listening, you won't get anything better further down the line.

Not sure I follow.

The analysis/matching happens only on the 256 samples. The 100ms version we are hearing is, for the input data, the entire "real" 100ms, and for the matched version, the composited "predicted" version.

I can listen more closely to the initial snippet and the matched snippet to see what I'm comparing against (I think this is what you're saying?).

Those sounds are similar, in that particular example, but different enough that I think the descriptors were able to differentiate.

Today I'm going to try the MFCCs+stats route, as that got me around 80% accuracy with hits as similar-sounding as the center and edge of the snare (hits 1 and 2 in this video). So I'm hoping that works better.

If you only use 256 samples, this is what you should listen to. If your ears cannot clearly discriminate, and if your descriptors give you false nearest neighbours, then you have a problem you need to solve. Nothing good will come out of it if it is wrongly classified.

If MFCCs are better than your ear at matching 256 samples, then there is hope there. Don't forget that the classification of the 1st item can be a datapoint in a second dataset…

I'll set it up so I can listen to the initial clips of both. I'll still feed 100ms into the real-time analysis, so it's not missing any frames at the end if the onset detection fires early (or something). But basically I'll have a bit where I can A/B the 256-sample nuggets.
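The A/B bit itself is trivial; as a sketch of what I mean in Python/numpy terms (file name hypothetical), it's just slicing the first 256 samples out of the 100ms onset snippet so I can audition exactly what the analysis sees:

```python
import soundfile as sf  # any audio I/O would do; this is just a stand-in

SR = 44100
onset_len = int(0.1 * SR)   # the 100ms "real" window (4410 samples)
analysis_len = 256          # the window the descriptors actually see

# hypothetical file containing one detected hit, starting at the onset
hit, sr = sf.read("hit_example.wav")
assert sr == SR

real_window = hit[:onset_len]          # what I've been listening to so far
analysis_window = hit[:analysis_len]   # what the matching is actually based on

# write both out so they can be A/B'd by ear
sf.write("ab_real_100ms.wav", real_window, SR)
sf.write("ab_analysis_256.wav", analysis_window, SR)
```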

None of these are classes as such; I just have only 10 entries in the KDTree, so it's finding the nearest one by default. In the end I'm going to have loads of different entries, where the exact match isn't as important as having a rough idea of the kind of morphology of the sound (even if "wrong"). As in, if the bit it matches could plausibly belong to another sample, that's, perceptually, ok.

I then plan on weighting things so the initial bit matters more in the query, and the predicted bit weighs less (not presently (easily) possible).
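There's no per-dimension weighting in the query at the moment, but as a sketch of the idea (Python, with scikit-learn's KDTree standing in for fluid.kdtree~ and invented data): scaling the columns before fitting, and scaling the query the same way, effectively weights those dimensions in the Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import KDTree

# hypothetical data: first 12 columns = descriptors of the real 256 samples,
# remaining 12 columns = descriptors of the predicted/composited portion
rng = np.random.default_rng(0)
entries = rng.normal(size=(10, 24))

weights = np.concatenate([np.full(12, 2.0),   # real bit counts double
                          np.full(12, 1.0)])  # predicted bit counts normally

tree = KDTree(entries * weights)  # scaling columns == weighting the distance

query = rng.normal(size=(1, 24))
dist, idx = tree.query(query * weights, k=1)  # apply the same weights to the query
print(int(idx[0][0]))
```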

Once I get that far I'll do some further A/B comparison, matching only with the 256 samples and then matching with 4410 samples (most of it being predicted), to see what sounds better (or more nuanced (or more interesting)).


After some growing pains in adapting the patch, I got it working with MFCCs. With no other fine-tuning, literally dropping in the same MFCCs/stats from the JIT-MFCC patch, it's already worlds more robust.

I get 67.05% accuracy out of the gate (out of a sample of 2000 tests), which is more than double what I was getting before.

Speed has gone down some though. It was around 0.58ms or so when running a KDTree with 12 dimensions in it, and it's up to 1.5ms with the 96-dimensional MFCC/stats thing.

I haven't optimized anything, so it's possible I'm messy in places with how I'm converting between data types. I do remember it being faster than this in the JIT-MFCC patch in context though, but I'll worry about that later.

edit:
I went and compared the original JIT-MFCC patch, and it is as "slow" as this, coming in at 1.5ms per query. I guess that seemed fast at the time, all things considered.

What I want to try next is taking more stats/descriptors and trying some fluid.mds~ on it, to bring it down to a manageable number of dimensions. For the purposes of what I'm trying to do here, I think not including a loudness descriptor is probably good, since I wouldn't have to worry about the limitations of variations in loudness in my initial training set.

The only thing I'm concerned about (which I'll do some testing for) is the difference in speed between querying a larger-dimensional space vs transforming incoming points through a pre-computed dimensionality reduction in real time. Like, is the latter (shrinking the dimensions of real-time data using a pre-computed fit, then querying the smaller space) actually faster than just querying the larger-dimensional space in the first place…
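A rough way to sanity-check that outside of Max (scikit-learn as a stand-in, with made-up data sizes) is to time both paths: querying the full-dimensional tree directly, versus pushing each incoming point through a pre-fit PCA and querying a smaller tree.

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KDTree

rng = np.random.default_rng(1)
train = rng.normal(size=(10, 96))    # e.g. the 96-d mfcc/stats entries
queries = rng.normal(size=(2000, 96))

# path A: query the full 96-d space directly
tree_full = KDTree(train)

# path B: pre-fit a reduction, query a smaller tree
pca = PCA(n_components=8).fit(train)
tree_small = KDTree(pca.transform(train))

t0 = time.perf_counter()
for q in queries:
    tree_full.query(q.reshape(1, -1), k=1)
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
for q in queries:
    tree_small.query(pca.transform(q.reshape(1, -1)), k=1)
t_small = time.perf_counter() - t0

print(f"full 96-d query: {t_full / len(queries) * 1000:.3f} ms/query")
print(f"PCA transform + 8-d query: {t_small / len(queries) * 1000:.3f} ms/query")
```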

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

So in effect, my “real world” 0-256 analysis would include:

  • (whatever stats work best for actual corpus querying(?))
  • (loudness-related analysis for loudness compensation)
  • (40melbands for spectral compensation)
  • (buttloads of mfccs/stats for kdtree prediction) <---- focus of this thread

So I would use the mfccs/stats to query a fluid.kdtree~, and then pull up the actual descriptors/stats that are good for querying from the relevant fluid.dataset~ to create a composite search entry that may or may not include the mfcc soup itself…
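Something like this sketch, in Python terms (scikit-learn's KDTree and plain dicts standing in for fluid.kdtree~ / fluid.dataset~, all names and sizes invented): the MFCC/stats vector finds the nearest entry, and that entry then pulls the "good for querying" descriptors out of a parallel dataset to build the composite search entry.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(2)
ids = [f"hit.{i}" for i in range(10)]                # hypothetical entry names

mfcc_stats = {i: rng.normal(size=76) for i in ids}   # prediction dataset (mfccs/stats)
query_descs = {i: rng.normal(size=12) for i in ids}  # descriptors used for corpus querying

tree = KDTree(np.vstack([mfcc_stats[i] for i in ids]))

def composite_entry(incoming_mfcc_stats, incoming_query_descs):
    # 1) match the 256-sample mfcc/stats against the prediction dataset
    _, idx = tree.query(incoming_mfcc_stats.reshape(1, -1), k=1)
    matched = ids[int(idx[0][0])]
    # 2) splice the matched entry's "good" descriptors onto the real ones
    #    to form the composite search entry
    return np.concatenate([incoming_query_descs, query_descs[matched]])

entry = composite_entry(rng.normal(size=76), rng.normal(size=12))
print(entry.shape)  # (24,) -> real + predicted descriptors
```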

I returned to this testing today, with the realization that I can't have > 9 dimensions (with my current training set) due to how PCA works: with only 10 training points, the centred data can't support more than 9 components.
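A quick check of that constraint outside of Max (scikit-learn standing in for the PCA, random data): with 10 points, anything past the 9th component carries essentially no variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
train = rng.normal(size=(10, 192))      # 10 training hits, lots of dimensions

pca = PCA().fit(train)                  # keeps min(n_samples, n_features) components
print(len(pca.explained_variance_))     # 10 components reported...
print(pca.explained_variance_[-1])      # ...but the 10th is ~0: centring leaves rank 9
```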

I've changed up the analysis now too and am taking 25 MFCCs (leaving out the 0th coefficient), as well as two derivatives. So that's an initial fluid.dataset~ with 196 dimensions in it. A nice chunky one…

I tested seeing how small I can get things and have OK matching.

196d → 2d = 28.9%
196d → 5d = 41.9%
196d → 8d = 44.1%
196d → 9d = 54.0%

Not quite the 67.05% I was getting with a 12d reduction (which is surprising as a couple of those dimensions would be dogshit based on the pseudo-bug of requesting more dimensions than points).

I also made another training set with 120 points in it, but I can't as easily verify the validity of the matching, since it's essentially a whole set of different-sounding attacks on the snare. So I'll go back and create a new set of training and testing data with something like 20 discrete hits in it, so I can test PCA going up to that many dimensions.

I'll also investigate what I did to get that 67% accuracy above, to see if I can build off that. But having more MFCCs and derivs seems promising at the moment, particularly if I can squeeze the shit out of it with PCA.

edit:
It turns out I got 67.05% when running the matching with no dimensionality reduction at all. If I do the same with the 196d variant (25MFCC + 2derivs) I get 73.9% matching out of the gate. So sans reduction, this is working better.

Well, at least without so much reduction. If you had more fitting data to try, you might find a sweet spot with a less drastic amount of reduction where the PCA is removing redundancy / noise but not stuff that’s useful for discrimination.


Yeah that’s what I’m aiming to do at the moment. I made a list of 30 repeatable sounds (as in, mallet in center, needle on rim, etc…), so I can train it on that, then I’ll create a larger testing set with 5 of each hit and run the same kinds of tests, to see if I get better results going up to 30d (if so, I’ll try higher).

I'm also going to isolate and test the raw matching power (unreduced) of 20 MFCCs + 1 deriv, 25 MFCCs + 1 deriv, 20 MFCCs + 2 derivs, and 25 MFCCs + 2 derivs, to see what along the way made it better, so as not to fill it with more shit if it's not helpful.

Ok, did some more testing (using my original dataset, to keep things consistent) and it's a bit weird.

So it looks like the 67% results I got were from using just 13 MFCCs, which I have since moved on from. And it also turns out I think I was compiling the stats from the 2nd derivs wrong (more on this below).

Results are a bit surprising too.

So this is run with a 10-point training set, which is then fed 50 random samples where there are 5 of each "type" present in the original training set (i.e. 5x snare center, 5x cross stick, etc…).

I run this 1000 times and see how often the matched entry is the same type as the test hit. I also ran this whole process three separate times, making sure to close and reopen everything between each run.
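For what it's worth, the test boils down to something like this sketch (Python/scikit-learn stand-in, labels and data invented): fit the tree on the 10 training hits, fire labelled test hits at it, and count how often the nearest neighbour is the right type.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(4)

# hypothetical: 10 training hits, one per "type", each a 152-d mfcc/stats vector
train = rng.normal(size=(10, 152))
train_labels = np.arange(10)            # snare center, cross stick, etc.

tree = KDTree(train)

# hypothetical: 1000 test hits, standing in for random draws from the
# 50 recorded samples (5 of each type), here just noisy copies of the training hits
test = train[np.repeat(train_labels, 100)] + rng.normal(scale=0.5, size=(1000, 152))
test_labels = np.repeat(train_labels, 100)

_, idx = tree.query(test, k=1)
predicted = train_labels[idx[:, 0]]
accuracy = np.mean(predicted == test_labels)
print(f"{accuracy * 100:.1f}% of matches agree with the hit type")
```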

20MFCCs with 1 derivative: 76.1% / 76.7% / 75.8% = 76.20%
25MFCCs with 1 derivative: 71.7% / 75.9% / 72.4% = 73.33%
25MFCCs with 2 derivatives(?): 73.1% / 72.1% / 74.0% = 73.07%

A couple of striking things here. It turns out my results got worse with the higher MFCC count. I think the matching with the 2nd derivative of 25 MFCCs is actually better, but the numbers got skewed by an outlier in the second test of 25 MFCCs with 1 deriv.

I haven’t run the 20MFCCs with 2derivs yet, because I think I’m fucking something up in terms of creating the dataset/entry.

Up to this point I've been using an adapted version of @tremblap's JIT-MFCCs code for unpacking and flattening the fluid.bufstats~ stuff. (I know we have fluid.buf.select and fluid.bufflatten~, but the js inside fluid.buf.select means it won't fare too well for fast/real-time use.)

So to adapt this code to 25 MFCCs (sans 0th coefficient), I changed the uzi count and that does it. I thought I understood the list bit, where it's taking the mean, standard deviation, min, and max for the original stats and then adding 7 to everything to do the same for the derivatives.

(screenshot of the stats unpacking/flattening patch fragment)

But something isn’t adding up right.

If I run this process with 20 MFCCs and 1 deriv I get 152 dimensions. If I adjust things to get 2 derivs, I then get… 156 dimensions. Which leads me to believe something's gone fucked.

Same goes for 25MFCCs. With a single deriv I get 192d, and if I change to 2derivs I get 196d.

Have I misread that bit of patch?

the problem is in the expr (you still have an offset of 8)

What you need to do is really understand that patch: put a dummy buffer in and check what you get at the output. The problem with the interface is that a solution would mean even more options (imagine bufstats with binary inputs, like the ones you laughed at me about in the other thread on the exploration patch); otherwise there is very little leeway in designing options for which stats you get out…

Aaah.

So would I need to offset by 8 and 16?

I'd be down with a bufstats where you only get what you ask for, if that's what you mean! (I don't doubt that I would have razzed you about something, but I was surprised that all the spectral stuff and stats returned everything no matter what.)

I've tried making sense of this patch fragment; it's just really gnarly and doesn't adapt well. Adding more MFCCs was easy enough, but derivs are a pain.

No. If you have 8 items in the list in the code you plunder, and there is an 8 in the expr, I reckon that if you put 12 items, you'll need to multiply by 12 in the expr.


Gotcha!

Duh, lol.

I figured out the other numbers (12 was 13-1, etc…), but I was looking at the 8 and thinking "what is this number related to?!".
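For my own sanity, the arithmetic works out like this (a Python paraphrase of what I think that patch fragment is doing, assuming fluid.bufstats~'s seven stats per channel per derivative, with the patch keeping mean, std, min, and max): the multiplier in the expr has to be the number of items you keep per channel, so 8 for one derivative and 12 for two.

```python
import numpy as np

N_STATS = 7                      # bufstats gives 7 stats per channel per derivative
KEEP = [0, 1, 4, 6]              # mean, std, low (=min) and high (=max) percentiles

def flatten(stats, n_derivs):
    """stats: (channels, N_STATS * (1 + n_derivs)) array of bufstats-style output."""
    kept_per_channel = len(KEEP) * (1 + n_derivs)   # 8 for 1 deriv, 12 for 2
    out = np.zeros(stats.shape[0] * kept_per_channel)
    for ch in range(stats.shape[0]):
        for d in range(1 + n_derivs):
            for j, s in enumerate(KEEP):
                # the "* 8" (or "* 12") in the expr is this per-channel stride
                out[ch * kept_per_channel + d * len(KEEP) + j] = stats[ch, d * N_STATS + s]
    return out

# 20 MFCCs minus the 0th coefficient = 19 channels
print(flatten(np.zeros((19, 14)), n_derivs=1).size)  # 152 dimensions
print(flatten(np.zeros((19, 21)), n_derivs=2).size)  # 228 dimensions
```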


Ok, thanks to @tremblap’s helpful comments about that patch nugget, I’ve done some proper testing and comparing between all the variables at hand (pre-reduction).

The corrected and updated stats are:

20MFCCs with 1 derivative: 76.1% / 76.7% / 75.8% = 76.20%
20MFCCs with 2 derivatives: 76.6% / 73.6% / 75.0% = 75.07%
25MFCCs with 1 derivative: 71.7% / 75.9% / 72.4% = 73.33%
25MFCCs with 2 derivatives: 69.7% / 69.1% / 67.9% = 68.90%

So, surprisingly, the 20 MFCCs without the additional derivative works out the best, with the 2-derivative version being only a touch behind. Also surprisingly, having 25 MFCCs wasn't better. Perhaps the extra resolution in this context isn't beneficial, and/or it's capturing more noise or something else.

@weefuzzy’s intuition about the reduced usefulness of 2nd derivatives for such short analysis frames is correct.

I’m now tempted to try min-maxing the raw MFCCs and/or other stats to see if I can get better raw matching here, but it just takes a bit of faffing to set up each permutation to test things out.

In all honesty, 76% is pretty good considering how similar some of these sounds are (snare center vs snare edge), so it will likely do the job I’m wanting it to do. I’ll retest things once I have a larger and more varied training/testing set (with much more different sounds), but my hunch is that the matching will improve there. We’ll see.

I also now need to test to see how small I can get the PCA reduction while still retaining the best matching.

Well, well, well. What’s a data-based post without a little completionism.

So I wanted to go back and test stuff just using melbands (like @weefuzzy had suggested ages ago for the JIT-regressor stuff).

So, using the same methodology as above (10 training hits, 1000 tries with testing data) but with some different stats. For all of these I'm using 40 melbands between 100Hz and 10kHz, with the only stats being mean and standard deviation.
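As a rough equivalent of that analysis outside Max (librosa standing in for fluid.melbands~/fluid.bufstats~, file name hypothetical, derivative stats left out here):

```python
import numpy as np
import librosa

# hypothetical onset snippet; in the patch this is the 256-sample analysis window
y, sr = librosa.load("hit_example.wav", sr=44100, duration=256 / 44100)

# 40 mel bands between 100Hz and 10kHz; with a window this short the lowest
# bands are barely resolved, which is part of what's being tested here
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40, fmin=100, fmax=10000,
                                     n_fft=256, hop_length=64)
mel_db = librosa.power_to_db(mel)

# only stats: mean and standard deviation per band -> 80 dimensions
features = np.concatenate([mel_db.mean(axis=1), mel_db.std(axis=1)])
print(features.shape)  # (80,)
```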

Here are the results:

40mel with 1 derivative: 70.9% / 73.2% / 76.1% / 73.3% = 73.38%
40mel with 0 derivatives: 68.9% / 67.0% / 66.8% = 67.57%

The results are pretty good, though not as good as the MFCC-based results.

Based on the fact that I got pretty decent results from taking only the mean and standard deviation (rather than also taking min and max), I reran some of the earlier tests with 20MFCCs.

The results are pretty good, though not quite as good as taking more comprehensive stats.

Here are 20MFCCs with only mean and standard deviation as the stats:

20MFCCs with 1 derivative: 73.7% / 71.0% / 73.8% = 72.83%
20MFCCs with 0 derivatives: 71.7% / 71.3% / 73.0% = 72.00%

Where this starts getting interesting is that although the accuracy is lower, I'm getting pretty decent results with far fewer dimensions overall. For example, the last test there gives me 72% matching using only 38 dimensions. As a point of reference, the best result I posted in my previous post was 76.2%, which took 152 dimensions to achieve.

So it will be interesting to see how these shape up with some PCA applied, as it will be a balance between accuracy and speed, and the initial number of dimensions when taking only mean/std is already 25% of the size of the full-stats version, before any compression.
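Just to keep track of where those dimension counts come from, the arithmetic is simply coefficients × stats kept × (original + derivatives); a tiny sketch:

```python
n_coeffs = 20 - 1            # 20 MFCCs minus the 0th coefficient

def dims(n_stats, n_derivs):
    # per-channel items = stats kept x (original + derivatives)
    return n_coeffs * n_stats * (1 + n_derivs)

print(dims(n_stats=2, n_derivs=0))   # mean/std only               -> 38
print(dims(n_stats=2, n_derivs=1))   # mean/std + 1 deriv          -> 76
print(dims(n_stats=4, n_derivs=1))   # mean/std/min/max + 1 deriv  -> 152
print(dims(n_stats=4, n_derivs=2))   # mean/std/min/max + 2 derivs -> 228
```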

And today’s experiments have been with dimensionality reduction, and using a larger training/testing data set.

Off the bat, I was surprised to discover that my overall matching accuracy went down with the larger training set. I also noticed that some of the hits (soft mallet) failed to trigger the onset detection algorithm for the comparisons, so after 1000 testing hits, I’d often only end up with like 960 tests, so I would just “top it up” until I got the right amount. So it’s possible that skewed the data a little bit, but this was consistent across the board. If nothing else, this should serve as a useful relative measure of accuracy between all the variables below.

I should mention at this point that even though the numerical accuracy has gone down, if I check and listen to the composite sounds it assembles, they are plausible, which is functionally the most important thing here.

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

So I did some vanilla matching as above, but with the larger training set, just to get a baseline. That gave me this:

20MFCCs with 1 derivative: 44.1% / 45.5% / 46.5% = 45.37% (152d)
20MFCCs with 2 derivatives: 42.5% / 47.1% / 44.0% = 44.53% (228d)

(also including, at the end of each line, the number of dimensions it takes to get this accuracy)

I also re-ran what gave me good (but not the best) results by only taking the mean and standard deviation (whereas the above one also includes min and max).

That gives me this:

20MFCCs with 0 derivatives: 52.2% / 53.0% / 53.5% = 52.90% (38d)
20MFCCs with 1 derivative: 54.5% / 53.0% / 52.5% = 53.33% (76d)

What's interesting here is that for raw matching power (sans dimensionality reduction), I actually get better results with only the mean and std. Before, this was close, but the larger set of statistics and dimensions was better.

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

What I did next was test these variations with different amounts of PCA-ification. This can go a lot of different ways, so I compared all of them with heavy reduction (8d) and medium reduction (20d) to see how they fared, relatively. (Granted, there are different numbers of dimensions to start with, but I wanted to get a semi-even comparison, and given my training set, I can only go up to 33d anyway.)

As before, here are the versions that take four stats per derivative (mean, std, min, max):

20MFCCs with 1 derivative: 22.5% (8d)
20MFCCs with 1 derivative: 23.1% (20d)
20MFCCs with 2 derivatives: 26.5% (8d)
20MFCCs with 2 derivatives: 27.7% (20d)

I then compared the versions with only mean and std:

20MFCCs with 0 derivatives: 28.5% (8d)
20MFCCs with 0 derivatives: 26.2% (20d)
20MFCCs with 1 derivative: 25.5% (8d)
20MFCCs with 1 derivative: 23.0% (20d)
20MFCCs with 1 derivative: 30.0% (33d)

Even in the best-case scenario of going from 38d down to 20d, I get a pretty significant drop in accuracy (52.9% to 26.2%).
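For reference, the comparison itself boils down to something like this (scikit-learn standing in for fluid.pca~ / fluid.kdtree~, data and class counts invented): fit the PCA on the training set only, transform both sets with it, and re-run the nearest-neighbour accuracy at each target dimensionality.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KDTree

rng = np.random.default_rng(5)

# hypothetical larger training set: 30 types x 38-d mean/std MFCC vectors
train = rng.normal(size=(30, 38))
labels = np.arange(30)
test = train[np.repeat(labels, 30)] + rng.normal(scale=0.6, size=(900, 38))
test_labels = np.repeat(labels, 30)

for n in (8, 20, 29, None):          # None = no reduction, as a baseline
    if n is None:
        tr, te = train, test
    else:
        pca = PCA(n_components=n).fit(train)   # fit on training data only
        tr, te = pca.transform(train), pca.transform(test)
    _, idx = KDTree(tr).query(te, k=1)
    acc = np.mean(labels[idx[:, 0]] == test_labels)
    print(f"{n or tr.shape[1]}d: {acc * 100:.1f}%")
```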

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

So with all of this in mind, the best overall accuracy, while taking speed into consideration, comes from taking only the mean and std of 20 MFCCs with zero derivatives, which gives me 72.0% with the smaller data set and 52.9% with the larger data set, using only 38 dimensions.

I was hoping to see if I could smoosh that a little, but it appears that the accuracy suffers greatly. I wonder if this says more about my tiny analysis windows, the sound sources + descriptors used, or the PCA algorithm in general.

My takeaway from all of this is to aim for the best accuracy (if that's important) with the lowest number of dimensions possible on the front end, rather than gobbling up everything and hoping dimensionality reduction (PCA at least) will make sense of it for you.

For a use case where it's more about refactoring data (e.g. for a 2d plot, or a navigable space à la what @spluta is doing with his joystick), it doesn't matter as much, but for straight-up point-for-point accuracy, the reduction stuff shits the bed.

(if/when we get other algorithms that let you transformpoint I’ll test those out, and same goes for fluid.mlpregressor~ if it becomes significantly faster, but for now, I will take my data how I take my chicken… raw and uncooked)

have you tried 20 Melbands? just for fun?

I haven’t.

I'll give it a spin. I used 40 for a couple of reasons. One was that I figured more bands would be better, but it's also the number of bands I'm already getting for the spectral compensation stuff, so that analysis comes "for free".

Ok, tested it with 20 melbands, only mean and std, no derivatives:

20mel with 0 derivatives: 48.6%

Which puts it smack in the middle of the 20MFCC tests, where 4 stats was slightly worse than this, and 2 stats was slightly better.

So “ok” results, but not great.

Wait, what freq range did you use to spread the 40? I presume you could test using the 20 melbands in the middle of the 40 you already have. If you focus on the range in which the spectrum changes, you'll have better segregation… or maybe using PCA to go from 40 to 20 here makes more sense.