Class "interpolation" (distance to classes)

you know that this is exactly what the mlpclassifier is doing, and that I shared a patch doing this here in April 2022 and pointed at it again in May last year? So I wonder what is different now?

thanks for sharing - you are in good company with @balintlaczko here. I look forward to seeing how (short) time series will work (with lstm), but in the meantime I wonder if you could not bake in some sort of spectral profile à la LPT instead of averaging… the sounds you compare have clear spectromorphologies after all, and you dismiss the morpho part :slight_smile:

I remember looking at this back then and although I don’t remember the specifics, I remember not finding it useful for what I was doing.

So I loaded it up into the current patch with the current sounds, encoded just as it was in your patch, using a dataset/labelset (as opposed to the mlp example above, which is dataset/dataset and so has the “one-hot” vectors). And I get this:

(in effect, the mlp-based stuff doesn’t seem to be able to actually communicate confidence (other than flip-flopping between 100% this or 100% that))
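For illustration, here's a minimal Python sketch of why the knn route can express graded confidence where the mlp snaps to extremes: averaging the one-hot targets of the k nearest points naturally yields in-between values. (Toy 1-D data, all names made up - not the actual patch or FluCoMa objects.)

```python
import numpy as np

# Toy 1-D "descriptor" values for two classes,
# with one-hot targets (class A = [1, 0], class B = [0, 1])
train_x = np.array([0.0, 0.1, 0.2, 0.8, 0.9, 1.0])
train_y = np.array([[1, 0], [1, 0], [1, 0],
                    [0, 1], [0, 1], [0, 1]], dtype=float)

def knn_regress(query, k=3):
    """Average the one-hot targets of the k nearest training points.
    The result is a graded 'confidence' vector rather than a hard
    100%-this-or-100%-that decision."""
    nearest = np.argsort(np.abs(train_x - query))[:k]
    return train_y[nearest].mean(axis=0)

print(knn_regress(0.05))  # -> [1. 0.]  (firmly class A)
print(knn_regress(0.45))  # -> roughly two-thirds class A, one-third class B
```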

When I did a ton of testing with this stuff in the past (LTEp) I found that it worked well for having a visual representation/spread for something where you mouse/browse around ala CataRT, but that it didn’t work nearly as well for classification. I never tried using just purely spectral moments for this, so that’s my next course of action.

My thinking, though, is that the “one-hot” approach is only as good as the core classification that can happen with it. e.g. if I don’t get good solid separation between the classes with the descriptors/recipe, it likely won’t work well for this at all.

So there’s still some refinement to be done here, I think(/hope), in terms of optimizing the descriptor recipe, but I’m also curious if there’s anything else that can be done with regards to how interpolation/confidence is computed here (e.g. the mlp not working well and flip-flopping, numneighbours being one of the biggest deciding factors, etc…).

I do look forward to experimenting with this, as in my case I just have 7 frames of analysis anyway, so I imagine I could just chuck all 7 frames in rather than doing statistics on them, and this would only slightly increase my dimension count while better representing morphology.


Ok here is some further testing this morning.

I’ve wanted to visualize the numneighbours and see if a radius was useful at all, so I set up a fluid.kdtree~ trained on the same dataset to try and visualize things.

I don’t know if this is a useful/direct analog to fluid.knnregressor~ trained on the same input data (with “one-hot” vectors as output). As in, are the @numneighbours in fluid.knnregressor~ doing (mathematically) the same thing as @numneighbours in fluid.kdtree~?

Either way, here’s the same chunk of audio as in the previous examples with a few different settings:

I talk through it in the video, but my first setup is @numneighbours 40 @radius 0, which more closely matches the settings in many of the examples above. Again, I don’t know if this is mathematically the same, but it’s kind of interesting to see that it jumps around a bit at the edges rather than being completely blobbed on one side.

Then I tried @numneighbours 0 @radius 50 to rely solely on radius. Not as good results as I would have thought, particularly since I have to crank the radius up, otherwise it misses some hits completely.

Finally I tried a hybrid, @numneighbours 10 @radius 60, which actually looks very promising. The center and edge classes seem decently defined and the interpolation looks a bit more legible. Some fine tuning of the parameters may be useful to tighten things up, but it definitely seems like an improvement.


And following up on a hunch, here is the same test but with the 31d PCA reduction (un-normalized) instead of the full 104d MFCC soup:

I tweaked the numbers slightly so they are comparable to the previous test:
-@numneighbours 30 @radius 0
-@numneighbours 0 @radius 50
-@numneighbours 10 @radius 70

To my eyes this looks quite a bit better actually. This leads me to believe that in this specific context it may be beneficial to do the PCA reduction.


That is all assuming the maths are the same for @numneighbours in fluid.kdtree~/fluid.knnregressor~.


So with that being said, firstly, are the maths the same in these? And secondly, is there a way to somehow combine these to be able to use radius+numneighbors in a “one-hot” vectors context?

it is exactly the same code - it is calling the same kdtree code as an instance of that object. The beauty of C++ when well done



Ok, so is there an elegant way to do what I’m doing above in reality? (as in, the moving slider and sonic results in the videos above are just the fluid.knnregressor~ in the background like in all the previous examples)

What comes to mind seems really clunky:
-getting the knearest numneighbours as a long list
-iterating through that to getpoint a parallel fluid.dataset~ with the “one-hot” vectors for each individual match from the knearest list
-manually doing the maths to turn the “one-hot” vectors into an interpolated result (just averaging the values? Euclidean funny business?)
-turning the results of that maths into a single vector that represents the interpolation

And all of this would need to happen per hit and it feels like iterating/uzi-ing through lists and datasets would be kind of slow in this context.
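For what it's worth, the whole pipeline is pretty short when sketched outside Max. A hypothetical Python stand-in (made-up toy points, not the actual fluid.dataset~ contents), where the "maths" is just a mean over the matched one-hot vectors and the radius acts as an optional filter on the k nearest:

```python
import numpy as np

# Stand-ins for the two parallel datasets: descriptor points
# and their matching "one-hot" class vectors
descriptors = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
one_hot = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

def interpolated_class(query, k=3, radius=None):
    """Find the k nearest points (optionally discarding any beyond
    the radius), fetch their one-hot vectors, and average them into
    a single interpolated 'confidence' vector."""
    d = np.linalg.norm(descriptors - query, axis=1)
    nearest = np.argsort(d)[:k]
    if radius is not None:
        nearest = nearest[d[nearest] <= radius]
    if len(nearest) == 0:
        return np.zeros(one_hot.shape[1])  # nothing within range
    return one_hot[nearest].mean(axis=0)

print(interpolated_class(np.array([0.15, 0.15])))  # roughly [0.67, 0.33]
```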

Is there a shorter and/or more elegant way to do something like this? (is there a technical reason why fluid.knnregressor~ doesn’t have radius in addition to numneighbours if it’s doing the same kind of thing under the hood?)

Ok, made some time to dig into this today, and got some interesting/promising results.


First some plots/comparisons.

On the left are 2d UMAP reductions of 104d MFCCs and on the right are 2d UMAP reductions of 56d spectral descriptors (all 7 spectral moments, min/mean/max/std, 1 deriv).

2 classes:
(screenshot)

3 classes:
(screenshot)

5 classes:
(screenshot)

8 classes:
(screenshot)

Overall the spectral moments look a bit tighter/tidier, I have to say. It’s pretty close overall though, with the clearest difference being in the 3-class version, where it’s closer to 3 clear stripes rather than a fuzzier middle area.


The combined vector confirms this. Here’s a vid showing the 2-class comparison (we’re listening to the spectral moments being sonified):

The spectral moments seem to use the overall range better (no scaling being used here), which is perhaps indicative of more clearly defined edge cases across fewer dimensions. It also looks to be slightly less jumpy overall, specifically in the transitions.

Here’s a multislider showing the plot over time:
(screenshot)

The top is my attempt at hand drawing what the output should be:
center / edge / center → edge → center / edge / center / edge / edge → center → edge

The “dynamic range” of the spectral moments stands out here, and you can see there is much less jump in the center → edge transition (less apparent in the edge → center transition, however).

I don’t know if this is a useful metric for superficially checking the salience of data, but when keeping 95% of the variance with PCA, the 104d MFCCs become 31d (~30%) and the 56d spectral descriptors become 12d (~21%), which would lead me to believe that the spectral moments are more redundant - but I’m not sure.
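As a sanity check on that metric, here's a rough Python sketch (toy data standing in for the real descriptors) of counting how many PCA dimensions are needed to keep 95% of the variance:

```python
import numpy as np

def dims_for_variance(data, keep=0.95):
    """Count how many PCA dimensions are needed to retain `keep`
    of the total variance (squared singular values of the centred
    data are proportional to per-component variance)."""
    centred = data - data.mean(axis=0)
    s = np.linalg.svd(centred, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, keep) + 1)

# Toy 10-d data with only 3 real degrees of freedom plus a little noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
data = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))
print(dims_for_variance(data))  # small: most of the 10 dims are redundant
```

The fewer dimensions that survive relative to the original count, the more redundant the descriptor set is, which is the comparison being made above.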


Here is the same being plotted with fluid.kdtree~ so we can visualize the neighbors/radius stuff.

As before, I tweaked the numbers slightly to be more in line with previous ones.
-@numneighbours 30 @radius 0
-@numneighbours 0 @radius 24
-@numneighbours 10 @radius 24

The radius is perhaps too tight here as in some of the transition hits you can see the @numneighbours disappear completely.

Again, this looks slightly better than the 104d MFCC and 31d PCA’d MFCCs above. I will next try this with other source material (different drum classes, voice, etc…) to see how generalizable the 56d spectral moments are, but so far this is looking the most effective.

I also plan on experimenting a bit with different spectral statistics (min/mean/max/std at the moment) and/or perhaps leaving out derivatives at this point.


At this point I think I’m also just focussing on making it work solely with 2 classes trained, as even though you do get a bit of improvement in some cases from interim classes, it becomes a lot faffier both practically and conceptually (e.g. not being able to use pre-trained classes as they are, having to manage tags for “real” classes vs “interpolation” classes, etc…).

Aaand a bit of classic number crunching.

I compared the MFCC and spectral descriptors in terms of raw/straight classification (ala all the experiments from this thread) and got the following results.

The labeled musical example I gave it had only 4 classes, and I ran the tests with a classifier trained on either just those 4 classes or all 10. Then I primarily experimented with including loudness compensation (this helped my MFCC accuracy) and with whether or not to include derivatives.

The base recipes are as follows:
MFCC baseline

13 mfccs / startcoeff 1
zero padding (256 64 512)
min 200 / max 12000
mean std low high (1 deriv)

Spectral baseline:

all moments / power 1 / unit 1
zero padding (256 64 512)
max freq 20000
mean std low high (1 deriv)


The results:
4 classes:
mfcc baseline - 95.8333% (my current “gold standard”)
spectral baseline - 87.5%
spectral no loudness - 87.5%
spectral no deriv - 88.88%

10 classes:
mfcc baseline - 86.11%
spectral baseline - 66.66%
spectral no loudness - 66.66%
spectral no deriv - 72.22%


So it looks like I’m probably better off without derivatives for these spectralshape things.
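For anyone curious about the shape of such an accuracy test, here's a bare-bones Python sketch of the idea (toy data and a simple majority-vote k-NN standing in for fluid.knnclassifier~ and the real descriptor recipes):

```python
import numpy as np

def knn_accuracy(train_x, train_y, test_x, test_y, k=3):
    """Classify each test point by majority vote among its k nearest
    training points and report the fraction classified correctly."""
    correct = 0
    for x, y in zip(test_x, test_y):
        d = np.linalg.norm(train_x - x, axis=1)
        votes = train_y[np.argsort(d)[:k]]
        if np.bincount(votes).argmax() == y:
            correct += 1
    return correct / len(test_x)

# Toy example: two well-separated 2-d classes
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(1, 0.1, (20, 2))])
train_y = np.array([0] * 20 + [1] * 20)
test_x = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(1, 0.1, (10, 2))])
test_y = np.array([0] * 10 + [1] * 10)
print(f"{knn_accuracy(train_x, train_y, test_x, test_y) * 100:.2f}%")
```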

And now putting these things in practice.

Firstly just running the same audio/tests as before (MFCC as the control, and spectralshape with no derivatives now, based on the info from the last test):

(this is both sets of tests combined so you can see the combined vector as well as the nearest neighbor/radius stuff)

Looks like this when plotted in time:
(screenshot: center/edge time plot)

Looks about on par with before, so not a huge difference in practical performance. But from the tests in the previous post it does seem like derivatives only muddy the water, so there’s no need to include them if they make no perceptual difference in this context.


And then I ran the same test/comparison with different sounds. Rather than using center → edge, I used rimtip → rimshoulder.

It just so happens I recorded a bunch of test data at the same time and had loads of examples like this.

Here is how both plot out with 2d UMAP-ing:
(screenshot)

So pretty similar in terms of differentiation/separation.

And here’s the video comparison:

With the morphological plot next:

Overall more binary, which makes sense given the separate islands, but you can see the difference here for the spectral shape one: a bit smoother, with more transitional stuff between the peaks.

I’m still torn on the effectiveness of (or ability of) the numneighbours/radius stuff to actually turn that into a vector, but it’s getting somewhere.

I guess I should try comparing PCA’d MFCCs (as those were more performant than vanilla MFCCs) against the spectralshape stuff, to see how those stack up. I’m hesitant to combine them though as they are pretty different in ranges and from previous tests MFCCs don’t like being transformed/rescaled very much.

And for the final bit of testing (or things I have to test for now), here is that comparison.


So this is 31d PCA’d MFCC (raw/unnormalized) on the top and 28d spectralshape on the bottom:

Pretty close, but I have to say I think the MFCC is doing a bit better here.

Here’s the time series comparison:
(screenshot: PCA vs spectral time plot)

The “dynamic range” of the spectralshape is a touch better, but the smoothness of the ramps looks better for the PCA’d MFCCs.

Given the reduced “dynamic range”, I wanted to try and see how these actually look if I scale/normalize the range a bit.

That gives me a time plot that looks like this:

Which I then paired with a more contextual assessment: which sounds better? Or rather, which has a smoother trajectory when sonified in this way?

Here are the results:

If I close my eyes and just listen, the PCA’d MFCCs sound a lot smoother in the transitions. The spectralshape one, although it sometimes looks a bit smoother, seems to jump to/stick on values more.


The effectiveness of these PCA’d MFCCs made me wonder how they stack up in terms of classification accuracy. This is something I had actually tried years ago and got absolutely dogshit results, but as outlined earlier in this thread, I think the normalization post-PCA was just breaking the relationship between the MFCC coefficients.
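A sketch of what "no normalization post-PCA" means in practice (a hypothetical Python stand-in with toy data, not the actual patch or FluCoMa objects): fit PCA, project, and classify directly in the projected space, with no rescaling applied after the projection.

```python
import numpy as np

def pca_fit(data, n_components):
    """Plain PCA: centre the data and keep the top components."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_project(x, mean, basis):
    # Deliberately no normalization after projection: rescaling the
    # PCA output is what seemed to break the relationships between
    # MFCC coefficients in the earlier tests
    return (x - mean) @ basis.T

def nn_classify(train_proj, train_labels, query_proj):
    """1-nearest-neighbour in the (un-normalized) PCA space."""
    d = np.linalg.norm(train_proj - query_proj, axis=1)
    return train_labels[d.argmin()]

# Toy example: two 5-d classes, classified in 2-d PCA space
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.1, (10, 5)), rng.normal(1, 0.1, (10, 5))])
labels = np.array([0] * 10 + [1] * 10)
mean, basis = pca_fit(data, 2)
proj = pca_project(data, mean, basis)
print(nn_classify(proj, labels, pca_project(np.ones(5), mean, basis)))  # -> 1
```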

So I plugged this PCA-ing into my accuracy test and got the following results (inserted into the data from earlier today):

4 classes:

pca’d mfcc - 97.22% (my best results so far!)
mfcc baseline - 95.8333% (my previous “gold standard”)
spectral baseline - 87.5%
spectral no loudness - 87.5%
spectral no deriv - 88.88%

10 classes:

pca’d mfcc - 87.5%
mfcc baseline - 86.11%
spectral baseline - 66.66%
spectral no loudness - 66.66%
spectral no deriv - 72.22%

A slight improvement, but an improvement nonetheless. And this is using fluid.knnclassifier~, whereas I got better results using fluid.mlpclassifier~ before.


So it seems that overall, PCA’d MFCCs capture the most variance here while still capturing good transitional/interpolation states. The spectralshape stuff was very promising (though I did dread having to add a whole new descriptor “type” to SP-Tools), but not quite as good as just (further) refined MFCCs.


catching up with this, and I still wonder if you used the log spectral shapes? Or if you compared log and lin - for me the log is much more perceptually correlated, but hey, maybe machines dream of electric sheep.

It was log (@power 1 @unit 1), which is what I use 100% of the time since it was added.


if the patch is handy, try with both at 0 ? I’m super curious :smiley:

Not very well as it turns out:

The plain clustering (on the right) is much much worse, and the overall interpolation is mega noisy as a result (on the left).


thanks for indulging me :slight_smile: