Getting the contents of a fluid.labelset~ into a fluid.dataset~

I know I’m quite late to the party, but based on the stuff that @tedmoore was showing in the thread about SVM stuff, I’m building a fluid.kmeans~ thing to automatically cluster a corpus to then further experiment/test on.

Other than manually making labels/classes before I’ve not done much at all with fluid.labelset~.

Now, because the visualizer I built takes a single fluid.dataset~ as its input, I want to make a 3D dataset containing the two dimensions of UMAP output, with the third dimension being the normalized cluster indices created by fluid.kmeans~.

Seems simple enough.

But unless I’m overlooking something, it seems I need to have fluid.kmeans~ populate a fluid.labelset~ (with integers stored as symbols), output that to a dict, uzi/iterate the contents out via get messages into a list, peek~ that into a buffer~, manually add it back into another fluid.dataset~ via addpoint, and finally concatenate the two datasets together. Is that right?
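For what it’s worth, the dict-munging step can be sketched outside of Max. Here’s a minimal, purely illustrative Python version of the merge, where the dict shapes are my assumption of roughly what fluid.dataset~ and fluid.labelset~ dump as JSON (identifiers mapped to value lists), not a verified FluCoMa format:

```python
# Illustrative sketch (Python, not Max): merging a dumped labelset into a
# dumped dataset as one extra normalized column. The dict shapes below are
# assumptions modelled on fluid.dataset~ / fluid.labelset~ JSON dumps.

# Assumed dump of a 2D dataset (e.g. UMAP output), keyed by identifier.
dataset = {
    "cols": 2,
    "data": {"slice-0": [0.12, 0.87], "slice-1": [0.45, 0.33]},
}

# Assumed dump of the kmeans labelset: one label (an int as a string) per id.
labelset = {
    "data": {"slice-0": ["3"], "slice-1": ["1"]},
}

def merge_label_column(dataset, labelset, num_clusters):
    """Append each point's cluster index, normalized to 0..1, as a new column."""
    merged = {"cols": dataset["cols"] + 1, "data": {}}
    for ident, point in dataset["data"].items():
        cluster = int(labelset["data"][ident][0])  # label stored as a symbol/string
        merged["data"][ident] = point + [cluster / max(num_clusters - 1, 1)]
    return merged

merged = merge_label_column(dataset, labelset, num_clusters=4)
print(merged["data"]["slice-0"])  # original 2D point plus normalized cluster
```

The same shape could then be loaded back into a fresh fluid.dataset~, which is conceptually what the addpoint/concatenation dance does in the patch.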

I understand that, fundamentally, labelsets are capable of holding symbols/names/whatever, but something like fluid.kmeans~ spits out ints anyway.

Perhaps this is just my greenness with pairing datasets/labelsets, but for the purposes of visualization at least, I kind of want all of those things in a single thing (a fluid.dataset~ in my case).

And a related follow-up question, not different enough to warrant another thread.

What is the intended (?) way to process/separate datasets based on information in labelsets?

More specifically:

  • I have 5 sounds I want to use to create classes with my drums. I train/create the classes.
  • I want to break a corpus into 5 clusters, such that each cluster corresponds with a class.
  • I want to put each cluster in a different dataset (or more specifically, a different kdtree) so that I can search for the nearest neighbor within each cluster.

If I’m understanding things correctly, I need to use the labelset generated by kmeans to break the corpus/dataset into 5 separate datasets, which will each then be fit to a kdtree. So on input/analysis, I figure out what class the sound is, then once that’s determined I pass the analysis off to the relevant kdtree to find the nearest match.
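That classify-then-search workflow can be sketched in Python using scikit-learn’s KMeans and KDTree as stand-ins for fluid.kmeans~ and fluid.kdtree~ (so this is the idea, not the FluCoMa API, and the 100×21 corpus is made up for illustration):

```python
# Sketch of the per-cluster nearest-neighbour idea, with scikit-learn as a
# stand-in for fluid.kmeans~ / fluid.kdtree~ (illustrative, not FluCoMa API).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
corpus = rng.random((100, 21))          # e.g. 100 slices x 21 descriptors

# 1. Cluster the corpus (this is what the labelset would hold).
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(corpus)

# 2. Break the corpus into one sub-dataset / kdtree per cluster.
trees = {}
indices = {}
for k in range(5):
    members = np.flatnonzero(km.labels_ == k)
    indices[k] = members                # map back into the full corpus
    trees[k] = KDTree(corpus[members])

# 3. On input: classify first, then search only inside that cluster's tree.
query = rng.random((1, 21))
k = int(km.predict(query)[0])
dist, idx = trees[k].query(query, k=1)
nearest = indices[k][idx[0, 0]]         # index of the match in the full corpus
```

The `indices` lookup is the part that corresponds to the labelset bookkeeping in Max: each sub-tree only knows its local indices, so you have to keep the mapping back to the original identifiers yourself.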

So that would require me using a labelset to break apart the dataset, which leaves me in a similar predicament as above (having to data munge both the labelset and dataset).

In an ideal world the cluster/label would just be another column in the dataset, used for filtering(/biasing) but not distance matching. Or perhaps for distance matching with an absolute condition (cluster==5, radius==0.1, etc.). But that’s not really the paradigm.
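To make that hypothetical paradigm concrete, here’s a small numpy-only sketch of what “one flat dataset, with the cluster as a filter column and an absolute radius condition” could look like; `nearest_in_cluster` is a made-up helper, not anything FluCoMa offers:

```python
# Sketch of the "label as a filter column" paradigm: one flat dataset, with
# nearest-neighbour search restricted by an absolute condition on the cluster
# column (illustrative numpy brute force; nearest_in_cluster is hypothetical).
import numpy as np

rng = np.random.default_rng(1)
features = rng.random((50, 21))          # descriptor columns
clusters = rng.integers(0, 5, size=50)   # pretend kmeans output column

def nearest_in_cluster(query, features, clusters, cluster, radius=None):
    """Nearest neighbour among points whose cluster matches, optionally within radius."""
    mask = clusters == cluster
    dists = np.linalg.norm(features[mask] - query, axis=1)
    if radius is not None:
        dists = np.where(dists <= radius, dists, np.inf)
    if not np.isfinite(dists).any():
        return None                      # nothing satisfies the condition
    return int(np.flatnonzero(mask)[np.argmin(dists)])

hit = nearest_in_cluster(rng.random(21), features, clusters, cluster=3)
```

Brute force is obviously not what a kdtree does, but it shows why the filter condition composes naturally when the label lives in the same structure as the features.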

For a small/finite amount of clusters it’s possible to create dataset forks, though this obviously gets complicated if you want an arbitrary or dynamic amount of clusters (and corresponding datasets), but that’s putting the cart before the horse.

Am I simply not understanding what/how a labelset is for?

Yes, this should be easier, as it seems like a perfectly reasonable thing to do.

You can munge the dicts together, but it’s still a bit of patching. Here’s something to test against the fluid.kmeans~ helpfile:


Hot yikes! Yeah that’s “a load of stuff” right there.

I’ll have a test with this and report back.


Ok, it was not as straightforward as I initially thought either. At first I just tried the example from the first post, adding the contents of a fluid.labelset~ as an additional column in a fluid.dataset~. I managed to get all of that into a single dict, but couldn’t figure out how to munge them together from there. Got it in the end though.

Quite handy for visualizing. This is definitely a situation where you don’t want to use the perceptually linear colour maps, as the differentiation between clusters is far weaker with those. I also noticed that if you use HSL, the hue edges wrap around more obviously, which helps for visualizing clusters. (I’ll update fluid.datasetplot~ in the relevant thread.)

4 clusters from kmeans (linear display):

4 clusters (previous hsl):

4 clusters (new hsl):

these are sexy clusters - are you clustering after the UMAP? This is on my radar to try, as it would allow me to shape the space in 2-ish parameters and then have the clustering be affected by that overall shape…


In case I am not clear:

  • doing the clustering on the high dimensions and then visualising in 2D with UMAP would give you a good way to see how the dim redux moves stuff together or apart relative to how KMeans saw it in its high dimensions… that is interesting in itself.

  • doing the clustering on the reduced dimensions (post UMAP) is interesting because you can distort the reduced space to get cluster shapes and content that you like. This could be done post-autoencoder too, or post-pca or post-mds. Just fun all round.
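The two orderings can be sketched side by side. Below is an illustrative Python comparison where scikit-learn’s PCA stands in for UMAP (the point is only the order of operations, not the specific dim redux; with umap-learn installed you’d swap `PCA` for `umap.UMAP`):

```python
# Sketch contrasting the two orderings (illustrative; PCA stands in for UMAP
# since only the order of operations matters here, not the specific redux).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
high_dim = rng.random((200, 21))        # made-up 21-descriptor corpus

# Option 1: cluster in the high-dimensional space, then reduce for viewing.
labels_high = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(high_dim)
embedding = PCA(n_components=2).fit_transform(high_dim)
# plotting `embedding` coloured by labels_high shows what the redux
# pulls together or apart relative to the high-dimensional clustering

# Option 2: reduce first, then cluster the distorted 2D space.
labels_low = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embedding)
# the cluster shapes now follow the geometry of the reduced space
```

Comparing `labels_high` against `labels_low` point by point is one quick way to see where the two approaches disagree.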

Both interest me in my next tune, to poke at material.


All the clusters above were done on the “raw” dimensions (21D), since I imagine using it in a context where those raw dimensions are what’s in the fluid.kdtree~ as well. In other words, in this case the UMAP is just for visualization.

I haven’t played with this much at all really, but I anticipate using kmeans and such independently of any reduction, or rather avoiding reduction in general for real-time use, since it’d be “slower” (the fitting of umap/etc…, but also all the pruning/peeking/composing around the process).

This sounds really interesting, but I fear by this you just mean scaling min/max-type things (or std), as opposed to more elastically “distorting” the space, which is of definite interest.

I am talking about distortions indeed. If you check the various shapes you get in UMAP you will see you are distorting the space…

Ah right. Yeah, UMAP itself distorts the space. I thought you meant taking the projection you get from UMAP and then distorting that.
