Ways to test the validity/usefulness/salience of your data

That definitely helps.

I guess with my question I meant more along the lines of “best practice in the field/literature”.

The assumption, it would seem, is that the rotations and relationships that PCA sets up (along with its caveats around non-linearity) are more valuable than using the algorithm to determine which of the original dimensions, in and of themselves, demonstrate the most variance.

So this would be useful for a “big descriptor soup” approach, rather than having a smaller amount of perceptual ones.

All of this really does make me miss being able to build a kdtree out of a subset of a dataset's dimensions, mixing brute-force and tree-based searches into one. You could then find the nearest match for "loudness", "pitch", and then "this huge pile of arbitrary numbers that roughly represent spectra and morphology" all in the same query, without needing to either scale the timbre-based descriptor space way way down, or somehow pump up the loudness/pitch-based descriptor spaces.

With some doing, it could be possible to ask fluid.kdtree~ to give you the 30 (or whatever) nearest neighbours based on pitch & loudness and then do a manual brute force search on just those 30 for the nearest based on some spectral description, maybe MFCCs. Not sure how fluid.kdtree~ would perform asking for 30 and not sure what this would mean in terms of latency, but could be worth a shot.
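Outside of Max, a minimal sketch of that two-stage idea (assuming Python with scikit-learn and numpy, and a made-up column layout) might look like this:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
# Hypothetical corpus layout: columns 0-1 = pitch & loudness, columns 2+ = MFCC stats.
corpus = rng.random((2000, 42))
target = rng.random(42)

# Stage 1: kd-tree over pitch & loudness only, ask for 30 neighbours.
tree = KDTree(corpus[:, :2])
_, idx = tree.query(target[:2].reshape(1, -1), k=30)
candidates = idx[0]

# Stage 2: brute force over just those 30, using the spectral/MFCC columns.
dists = np.linalg.norm(corpus[candidates, 2:] - target[2:], axis=1)
best = candidates[np.argmin(dists)]
print("nearest entry:", best)
```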

That’s a bit of the tradeoff/issue with something like fluid.datasetquery~ in that it’s not really built for realtime use.

The opposite of what you’re describing would be ideal (give you the nearest neighbour within those 30 dimensions, and ignore the rest), but that’s not how a kdtree works (in my understanding at least).

That’s right. A kdtree will consider all the dimensions in the dataset it is trained on. When it builds the tree, it does it using all the dimensions.

You could make a kdtree with just the n dimensions you want to search first, get 30 or so neighbours and then brute force that?

I’ll end up doing some testing when the time comes, but I think there will be a point of diminishing returns if I’m cascading kdtree into something like entrymatcher, as opposed to just doing it all in entrymatcher, which can also return values as well as query (as well as vary weights, distances, pruning, etc…).

The main blockage for straight kdtree usage for me would be conditional cases, like finding the nearest match when loudness is > x or duration is < y, and doing that on a per-query basis (as opposed to having a precooked/pre-fit set of datasets at the ready, which I believe is @tremblap’s approach). At the moment that requires a fluid.datasetquery~ step (which duplicates everything over) and then re-fitting each subsequent kdtree. And, again, doing all of this per query.
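For what it's worth, once you give up on a pre-fit tree, that kind of conditional case is just a mask plus a brute-force search. A rough numpy sketch (hypothetical column layout and thresholds):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical corpus: col 0 = loudness (dB), col 1 = duration (s), cols 2+ = timbre stuff.
corpus = np.column_stack([
    rng.uniform(-60, 0, 500),
    rng.uniform(0.05, 2.0, 500),
    rng.random((500, 8)),
])

def conditional_nearest(corpus, target_timbre, loud_min=-30.0, dur_max=0.5):
    """Per-query filter, then a plain brute-force nearest neighbour on what's left."""
    mask = (corpus[:, 0] > loud_min) & (corpus[:, 1] < dur_max)
    candidates = np.flatnonzero(mask)
    if candidates.size == 0:
        return None  # nothing satisfies the conditions
    dists = np.linalg.norm(corpus[candidates, 2:] - target_timbre, axis=1)
    return candidates[np.argmin(dists)]

print(conditional_nearest(corpus, rng.random(8)))
```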

Yes, there’s no point to doing this. Constructing a kd-tree index is only worth it if the same tree is going to be queried many times, otherwise you might as well just do an exhaustive search.

Moreover, beyond a fairly modest number of dimensions (in your terms), kd-trees are going to be no(t much) better than an exhaustive search and, again, not worth it (because of the curse of dimensionality).

(Even if that weren’t a concern, you ideally want the number of points (N) to be much greater than exponential in the number of dimensions (d), i.e. N >> 2^d. So, not practical for high dimension counts.)

This isn’t specific to kd-trees, but applies to exact nearest neighbour searches in high dimensions in general. Very much an open problem. For moderate amounts of data, especially if a given subset isn’t going to be repeatedly queried, an exhaustive search is often the most pragmatic path. For chunkier things, there exist approximate nearest neighbour algorithms, but the gist I get is that some of these also only really start to pay off when N is (very) large.
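If anyone wants to get a feel for this, a quick-and-dirty timing sketch (scikit-learn's KDTree, random data, a single query, tree build time excluded) shows the gap closing as d grows:

```python
import time
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(2)
N = 2000  # roughly the corpus sizes being discussed

for d in (3, 10, 50, 200):
    data = rng.random((N, d))
    query = rng.random((1, d))
    tree = KDTree(data)  # build cost not timed; it only pays off over many queries

    t0 = time.perf_counter()
    tree.query(query, k=1)
    t_tree = time.perf_counter() - t0

    t0 = time.perf_counter()
    np.argmin(((data - query) ** 2).sum(axis=1))
    t_brute = time.perf_counter() - t0

    print(f"d={d:>3}  kd-tree query {t_tree*1e3:.2f} ms  brute force {t_brute*1e3:.2f} ms")
```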

In sum: for what you describe, entrymatcher is probably optimal.

I was on the road for 3 days and this is on fire!

Our various uses are very badly documented in the literature. Even whether or not to standardize is controversial for small datasets, and the same goes for MFCC post-processing. That is why I keep suggesting to implement one approach and make music with it. If you get to a musical problem, we can brainstorm a creative coding solution around it to keep on musicking.

For the most part my corpora haven’t been more than 1-3k samples or so (for most of the use cases it’s been longer “one shot” style samples, rather than individual grains, which can get into the 100k+ territory in C-C-Combine). So I guess with high dimensionality there’d be dubious benefit (in terms of speed) to a pre-fit space.

A native (e.g. fluid.dataset~-based) brute force search object would be handy for use cases like that. I’d probably still harp on about having some kind of interface to bias/fork etc… though…

Just building each analysis/recipe saps most of the energy/life out of me due to the sheer number of objects and the error-prone-ness, and then, when mirroring offline/realtime analyses, making sure all the fit files are stored/line up etc…

It’s also kind of hard to gauge how effective something is if it’s returning a match each time and the sources/targets are relatively disparate. Perhaps the answer is just doing assessable matching (e.g. the “time travel” idea I was on about before), since there I can point to recipes and say “yes, this one works better”. That died off for similar reasons, as wanting to change any bit of the analysis (lower-order MFCCs in that case) meant revamping so, so much.

Ok, so I’m trying to build a @tedmoore-esque PCA→UMAP pipeline to see how that behaves in a somewhat measurable context.

At the moment I’m taking:

  • all loudness descriptors/stats (1 deriv)
  • 20 loudness-weighted MFCCs with all stats (1 deriv)
  • all loudness-weighted spectralshape descriptors with all stats (1 deriv)
  • loudness-weighted pitch descriptors with all stats (1 deriv)

I’m not 100% confident about some of these choices (e.g. loudness-weighted “confidence” in the mix with pitch, and there’s a kajillion MFCC dimensions now, etc…), but it’s a jumping-off point.
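(For readers outside Max: a very rough Python analogue of that feature list might look like the sketch below. librosa is an assumption, the stats are trimmed to mean/std, and the loudness weighting of the stats isn't reproduced here.)

```python
import numpy as np
import librosa

def descriptor_soup(path):
    """Per-slice descriptor vector: loudness (RMS), 20 MFCCs, spectral shape and
    pitch/confidence, summarised as mean/std of the frames and of their deltas."""
    y, sr = librosa.load(path, sr=None, mono=True)

    loudness = librosa.feature.rms(y=y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    shape = np.vstack([
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
    ])
    f0, _, conf = librosa.pyin(y, fmin=50, fmax=2000, sr=sr)
    pitch = np.vstack([np.nan_to_num(f0), conf])

    # Frame counts can differ slightly between analyses, so trim to the shortest.
    n = min(x.shape[1] for x in (loudness, mfcc, shape, pitch))
    frames = np.vstack([loudness[:, :n], mfcc[:, :n], shape[:, :n], pitch[:, :n]])

    def stats(x):
        return np.hstack([x.mean(axis=1), x.std(axis=1)])

    return np.hstack([stats(frames), stats(librosa.feature.delta(frames))])
```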

Now, looking at my old LTEp approach, I applied robust scaling to everything except MFCCs, which I standardized instead. My workflow was to flatten each branch of the analysis and then post-process (robust scale/standardize) the datasets individually before concatenating them into a single larger dataset (this step was actually really unpleasant to do, so let me know if this is easier now than cascading together a bunch of dummy datasets).
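In array terms, the per-branch scaling then concatenation amounts to something like this (hypothetical branch widths; scikit-learn scalers standing in for the FluCoMa scaling objects):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.default_rng(3)
# Stand-ins for the flattened per-slice branches (rows = slices, widths made up).
loudness = rng.random((1000, 14))
mfcc     = rng.random((1000, 160))
shape    = rng.random((1000, 56))
pitch    = rng.random((1000, 28))

# Post-process each branch separately, then concatenate column-wise.
soup = np.hstack([
    RobustScaler().fit_transform(loudness),
    StandardScaler().fit_transform(mfcc),
    RobustScaler().fit_transform(shape),
    RobustScaler().fit_transform(pitch),
])

# The simpler alternative discussed below: concatenate first, scale once.
soup_alt = StandardScaler().fit_transform(np.hstack([loudness, mfcc, shape, pitch]))
```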

Is this, more-or-less, your workflow (@tedmoore)?

From this step forward I plan on doing the PCA→UMAP thing to see what I get from the whole big mess of soup. Firstly just to browse and compare how this fares against the LTE approach with its more hand-picked/conceptual descriptor space, and then to try applying the same transformations to tiny analysis windows (256 samples) and larger ones (4410 samples) and see if I can regress between the two.

I do have to say that having native @unit attributes in places makes some of the coding here much easier than before (previously I was unpacking and manually massaging the spectralshape descriptors I wanted to be in the “correct” units), and not having to care about pulling individual columns out also helps, but it’s still not a very pleasant coding experience/workflow to put together an analysis chain like this.

Yes, this is more-or-less my workflow. For what it’s worth, in this project, I didn’t do any loudness scaling. I think maybe it wasn’t implemented when I did it, or maybe it just wasn’t yet on my radar. But at this point, I do think it’s a good idea for you to do the scaling.

You might also consider scaling the pitch descriptor by the pitch confidence descriptor. Maybe you’re already doing this?
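(A tiny numpy sketch of what that weighting amounts to, with made-up per-frame values:)

```python
import numpy as np

# Made-up per-frame analyses for one slice.
f0 = np.array([220.0, 221.5, 440.0, 219.8])    # Hz
confidence = np.array([0.9, 0.85, 0.1, 0.92])  # 0..1

# A confidence-weighted mean downplays frames where the pitch tracker is guessing.
weighted_mean_f0 = np.sum(f0 * confidence) / np.sum(confidence)
print(weighted_mean_f0)
```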

When you have lots of data like this, scaling is definitely a good idea. I’m curious why you are applying the robust scaler to some descriptors and standardization to others? Have you done some tests that show one scaler to be better or worse for certain descriptors? If it’s a lot of extra fuss, you could just put it all in one dataset and then scale that one dataset using one scaler.

I’ll be very curious to hear what comes of this!

Good to know.
Yeah, the loudness stuff I think is quite useful to add in, and isn’t too big a faff if I’m doing “all the descriptors”. My original patch got a lot messier as I was peeking/poking out individual stats and scaling them etc… So it’s much easier just to slap a @weights on a fluid.bufstats~ and call it a day.

That’s definitely the medium/long-term plan. In terms of the code I already had in this patch I was looking at some loudness scaling, but I have experimented with confidence scaling as well. I haven’t yet found an ideal implementation of that, as I suspect a combination of loudness and confidence will suit more of my use cases.

I literally have no idea, but I remember that being an important distinction at the time. I think I chatted with @tremblap about it in this thread a while ago. I think robust scaling had freshly been implemented so it was all the rage at the time.

That’s part of the question as I’m not entirely sure how to best go about it. If I was just standardizing everything, I could presumably flatten/concatenate everything together and then standardize it all at once?

Me too!

I’m still leaning towards a conceptually-relevant space, or at least something that isn’t bespoke to each corpus. I guess a medium-term solution would be to run the PCA->UMAP on a bunch of different corpora at the same time and take the columns/scalings it gives me as a “standard” I would then apply to everything as a swiss-army-knife of descriptor soups.

Yep!

I hope I’m understanding you right:

Keep in mind that doing PCA on different corpora (i.e., datasets) will give you totally different Principal Components that cannot be compared with each other. One point from one corpus having a “high PC1 and low PC2” is not comparable to a point from a different corpus (and its own PCA analysis) also having a “high PC1 and low PC2”.

(Also note that PCA is deterministic, so doing PCA on the same dataset will give you the same results.)

Even less comparable is UMAP, which is stochastic, so it can have different results each time it runs, even on the same dataset!

This is all to say that one should be careful “comparing” the PCA->UMAP pipeline on multiple corpora. You can certainly see which are perceptually valid or interesting to you, but where certain kinds of points might end up in space is not really predictable or repeatable, so trying to find or create a “standard” from these comparisons will not be successful.
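(If it's useful as a reference point outside FluCoMa: in the Python umap-learn package the stochasticity can at least be pinned per dataset with a seed, though that doesn't change the cross-corpus incomparability point at all.)

```python
import numpy as np
import umap  # the Python umap-learn package, an assumption here

data = np.random.default_rng(4).random((500, 20))

# Same data + same seed -> same embedding; different corpora still land in
# unrelated coordinate systems, so this doesn't buy cross-corpus comparability.
a = umap.UMAP(n_components=2, random_state=42).fit_transform(data)
b = umap.UMAP(n_components=2, random_state=42).fit_transform(data)
assert np.allclose(a, b)
```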

This would be following the workflow of my initial misunderstanding of your PCA->UMAP process. That is, I would run (I guess only the) PCA on all the corpora to have it tell me which x dimensions are most salient. And those would be, e.g., mean of loudness, deriv of std of MFCC3, mean of confidence of pitch, etc… Then I would take those specific descriptors to make up my “generic descriptor soup”.

From there I’m blurrier about my follow-up, but I guess I would find a reduction pipeline that works for one part of the process (perhaps the incoming percussion/snare source) and then apply those ‘fits’ to all the corpora.

In this case I would just use x principal components rather than a selection of x original descriptors. You can always use PCA’s transformPoint to get the x principal components for a new point (after the PCA has been fit to a dataset), such as the real-time snare analysis descriptors.

Once a UMAP has been fit, you can do a transform operation on as many datasets as you want and they will all be projected into the same lower dimensional space that UMAP determined from its initial fitting. I’d be quite curious to hear how this turns out. My instinct says that if the datasets are quite different (noisy snare sounds vs. pure bell sounds) it won’t work very well as UMAP will be trying to spread out points based on a descriptor that isn’t as useful in distinguishing difference.

However, if the datasets are similar enough, it might be a nice way of knowing where in lower dimensional space certain kinds of sounds will end up!
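A sketch of that fit-once-then-transform workflow, assuming Python with scikit-learn and umap-learn and made-up dimension counts:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import umap  # umap-learn

rng = np.random.default_rng(5)
corpus = rng.random((1000, 260))    # the corpus everything is fitted on
snare_hits = rng.random((30, 260))  # new points, e.g. real-time snare analyses

scaler = StandardScaler().fit(corpus)
pca = PCA(n_components=11).fit(scaler.transform(corpus))
reducer = umap.UMAP(n_components=2, random_state=0).fit(
    pca.transform(scaler.transform(corpus)))

# New material only ever goes through the already-fitted transforms.
projected = reducer.transform(pca.transform(scaler.transform(snare_hits)))
```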

@jamesbradbury pointed me at this. Might be useful!

That was quite useful.

I guess the idea of it being a “rotation” of the space, rather than simply a way of identifying which of the original descriptors are most important, is a handy one.

I’ve moved my tests along a bit and have new mappings/projections. The “timbre” (MFCC + stats/derivs) and “spectral shape” (spectral shape + stats/derivs) spaces are quite good. The new “pitch” one is dogshit compared to the old one. I could be doing something weird with the confidence weighting numbers, as it’s been ages since I looked at those, and my previous approach had quite tuned/selected parameters/stats.

I also just used the fittransform output of fluid.pca~ to get to around 90% variance and took those dimensions before moving on to a 3D UMAP. Using PCA pre-UMAP or not didn’t seem to make a big difference (at least going down to 3D), though I haven’t tested/prodded exhaustively.
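(Side note for anyone reproducing this outside Max: in scikit-learn terms, picking "enough components for ~90%" is just reading the cumulative explained variance ratio:)

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
soup = rng.random((1000, 260))  # stand-in for the standardized descriptor soup

pca = PCA().fit(StandardScaler().fit_transform(soup))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching ~90% of the variance.
n_keep = int(np.searchsorted(cumulative, 0.90)) + 1
print(n_keep, "components cover", round(cumulative[n_keep - 1], 3))
```

(scikit-learn’s PCA will also do the thresholding for you if you pass a fraction, e.g. PCA(n_components=0.9).)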

I take it you were going to a higher dimensional space with UMAP to then feed into another algorithm, rather than plotting (2D/3D)?

I’m also doing the separate processing where some descriptors are robust scaled and some are standardized, so I will flip the order of things around to just apply one of them. For PCA, is standardization typically better because it’s properly mean-centred, or is IQR better for handling outliers (even though it’s median-centred)? As mentioned above, I don’t remember why some were robust scaled and some standardized. For doing comparisons I want to just use one, to reduce the number of fits I need to keep track of for subsequent steps.

//////////////////////////////////////////////////////////////////

This last bit is me thinking out loud to try to figure out what I need to do next.

So beyond just checking the plotting/spread and vibe-ing out the difference using this workflow, part of what I want to assess is regression between a shorter analysis window and a longer analysis window. Towards that end I want to create a dataset that does all of what’s mentioned above but is applied to a 256-sample window, then apply the same fits (standardization/PCA-wise) to the same audio analysed with a 4410-sample window.

My hope is to then be able to create a regression between the two datasets and see how accurate that is (in terms of an error metric, but also qualitatively by listening/comparing). Like, if numerically the match isn’t exact (the same audio is going to be used for both steps) but both are of “crotale with sewing needle”, that’s likely good enough.
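Roughly, in Python terms (scikit-learn’s MLPRegressor standing in for the FluCoMa regressor, and random arrays standing in for the real descriptor sets), something like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
# Stand-ins for the reduced per-slice descriptors: X from the 256-sample window,
# y from the 4410-sample window of the same slices.
X = rng.random((1000, 10))
y = rng.random((1000, 10))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# Numerical check; the qualitative check is listening to what the predicted
# long-window descriptors actually retrieve from the corpus.
print("R^2 on held-out slices:", model.score(X_test, y_test))
```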

Does anything in that jump out as being a bad idea or super non-feasible?

If that works, I remember some loftier discussions with @weefuzzy a while back about creating a more generalizable model where, rather than comparing the specific hits on this specific drum with these specific implements, I could create a more archetypal data space which I could then use on other drums/setups/preparations. That’s way more pipe-dream-y.

In the project I referenced above, I was starting with ~700 dimensions. I standardized that, then used PCA on it. I kept the first 11 principal components, sent those 11 dimensions into UMAP, and asked for 2 dimensions as output to get the 2D plot that is seen in some of the documentation.

There isn’t a straightforward answer to this. Both Robust Scaler and Standardisation are going to get you roughly centred around 0 with a spread of 1, which is probably sufficient. If you know that you have some crazy outliers, Robust Scaler might be a better choice, as it will keep you in the centred-at-0, spread-of-1 vicinity. If not, Standardisation is the “standard” choice.
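A quick illustration of the outlier point, using scikit-learn’s scalers on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

rng = np.random.default_rng(8)
col = rng.normal(0.0, 1.0, size=(500, 1))
col[:5] = 50.0                          # a handful of extreme outliers

std = StandardScaler().fit_transform(col)
rob = RobustScaler().fit_transform(col)

# The outliers inflate the standard deviation and squash the "normal" points;
# median/IQR scaling leaves the bulk of the data with a sensible spread.
print("standardized bulk spread:", np.std(std[5:]))
print("robust-scaled bulk spread:", np.std(rob[5:]))
```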

No. It sounds cool. I don’t know what you’ll find so I’m very curious to hear!

Actually, if you know you have multimodal data rather than normally distributed data, Robust Scaler should be at least as good, if not better. The Khan Academy video I posted long ago showed that very beautifully.

Did a bit more testing yesterday and got some interesting results. Still dealing with plumbing and concatenation stuff, but I thought it may be useful to share some early findings.

Firstly I realized that I was concatenating and displaying the pre-scaled/standardized datasets when plotting them.

These are my initial (non-scaled/standardized) results:

And here are the scaled/standardized results*:

Not a huge difference (barring pitch) for the individual datasets, but the knock-on effect for the summed one is gigantic (obviously).

Both the spectral and MFCC spaces navigate quite well, though outliers seem to be more present in the MFCC (“Timbre”) one.

*I did revisit what I was doing for “pitch” here though, based on this:

I basically rolled back to my older approach, which only took a few descriptors (no derivs), loudness-weighted things and, seemingly importantly, scaled the pitch output to the same range as confidence before processing them down the line. My original thinking was that, for the purposes of the sounds I was using, whether something was “pitchy” or not was as important as “what pitch” it was. So I wanted confidence and pitch normalized before carrying on.
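In other words, something along these lines (made-up values, with pitch in MIDI and an assumed fixed range):

```python
import numpy as np

# Made-up per-slice values: pitch in MIDI, confidence already 0..1.
pitch_midi = np.array([36.0, 60.0, 84.0, 96.0])
confidence = np.array([0.2, 0.9, 0.7, 0.4])

# Rescale pitch into the same 0..1 range as confidence (the fixed musical range
# is an assumption) so neither dominates the distance measure downstream.
lo, hi = 24.0, 108.0
pitch_scaled = np.clip((pitch_midi - lo) / (hi - lo), 0.0, 1.0)

features = np.column_stack([pitch_scaled, confidence])
```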

Even though the results look similar above (in terms of overall shape/spread), the sound is worlds better in the second one. There’s an appreciable order to things in a way that, for this corpus, overlaps a lot with timbre/spectral. This makes sense for percussion, as the higher-pitched bits of metal are also brighter.

//////////////////////////////////////////////////////////

Lastly, I have a question about moving on to the regression step. I had a quick check on the learn.flucoma.org page for regression, as well as the help/reference files, but couldn’t find what I was looking for.

I remember, after previous failed attempts, there being some “rule of thumb” stuff about the number of entries vs dimensions vs nodes etc… I know that "it depends"™ but, from memory, in order for it to regress at all there needs to be some relationship between these numbers.

So firstly, what is that info / where can I find it, and secondly, that kind of info should be in the help/reference/learn material somewhere for quick access. Like, if I want to use a regressor, I don’t want to have to watch 10h of Khan Academy videos before knowing what numbers I should put in the boxes.

As a point of reference I will have around 1k entries (for the regression idea). I presently have 800 test samples, but will probably aim to have a few thousand for more rigorous testing. I have around 260 dimensions in my initial “descriptor soup”, and PCA seems to give me around 90-95% coverage with around 110-150 of those dimensions. I can then obviously UMAP the output of PCA.

So I guess I want a small number of input/output nodes for the regressor (3? 5? 10?) and then a number of @hidden nodes relative to the overall amount of data (dimensions × entries) being worked on?

I can obviously just test a bunch of random stuff, but from memory, there was a method/reason for choosing certain values here.
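One rough sanity check I can do in the meantime (not an official FluCoMa rule, just a common rule of thumb) is to count the trainable parameters for a candidate layout and keep that comfortably below the number of training examples:

```python
def mlp_param_count(layer_sizes):
    """layer_sizes = [inputs, hidden..., outputs]; counts weights + biases."""
    return sum(a * b + b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

n_examples = 1000                   # roughly the planned corpus size
for hidden in [(3,), (8,), (16, 8)]:
    layout = [10, *hidden, 10]      # e.g. 10-D in / 10-D out (assumed sizes)
    print(layout, "->", mlp_param_count(layout), "params for", n_examples, "examples")
```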

This is looking good. Can you remind me what exactly your inputs and outputs will be for the neural network? You’re hoping to predict the descriptor analysis of a larger window from a smaller window?