Hey,
So I have four large datasets of different dimensions (2, 6, 14 and 14) that I merge together into a larger fluid.dataset that will be used in UMAP. I would like to add weights to the different datasets so that some of them are more significant in the redistribution in the 2D space.
To be more specific, for my project I do corpus-based synthesis made from field recordings, and I have a lot of “non audio” data (GPS, weather, bird names, etc.) that I normalized so it’s readable by FluCoMa objects. I can get maps and play sound, but I would like to be able to assign more weight to certain datasets, to the GPS data for example, so the map is more representative of the field, and things like that. I will have to try things out to see what is most interesting.
Where should I start looking to do this? Does anyone have a clue?
Thank you
Here is an example of my merged_dataset where all the columns of my normalized datasets converge.
Now, I am wondering if I should weight the data directly before creating this merged dataset, which is going to be used in fluid.umap, OR should I do it before the kdtree~, or both… I am kind of confused on this one.
(the first 2 columns are geographical long and lat, the next 6 columns are weather data, the next 14 columns are descriptors and the last 14 are MFCCs)
…
1000: 0.959008 0.13974 0.720339 0.308571 0.352201 0. 0.72973 0.549784 0.772244 0.075517 0.051641 0.04406 0.149073 0.003717 0.031206 0.054492 0.080045 0.968205 0.545455 0.5125 0.559628 0.05164 0.692513 0.714963 0.460888 0.537223 0.443614 0.523205 0.76176 0.212396 0.478403 0.582105 0.470666 0.365984 0.568989 0.381974
1001: 0.959008 0.13974 0.720339 0.308571 0.352201 0. 0.72973 0.549784 0.754184 0.066674 0.076019 0.07385 0.140507 0.003717 0.023408 0.118929 0.153591 0.996205 0.545455 0.475 0.509599 0.076018 0.6955 0.793702 0.319563 0.541469 0.370154 0.493392 0.63683 0.364303 0.510633 0.422028 0.631149 0.343908 0.444035 0.420601
1002: 0.959008 0.13974 0.720339 0.308571 0.352201 0. 0.72973 0.549784 0.768993 0.079556 0.065911 0.061556 0.139839 0.003717 0.030787 0.101048 0.168084 0.976103 0.454545 0.4625 0.535195 0.065914 0.691033 0.756031 0.389757 0.506661 0.437192 0.490248 0.657868 0.33368 0.595917 0.505023 0.519414 0.355798 0.50891 0.497854
There are two considerations here.
The first is that because GPS data is 2/36 (or 1/18) of your dimensions, it will contribute roughly 1/18 of your distance measures between data points (computing those distances is the first step in the UMAP algorithm). That means that the GPS is already quite marginalized as a measure. One approach might be to do some dimensionality reduction on the larger datasets to bring them all down to 2D, so that each “category” of data contributes the same (or a proportion that you desire). You could use PCA on the datasets that have 6 and 14 dimensions, bring them each to 2 dimensions, then have a merged dataset of 2+2+2+2=8 dimensions. Then GPS would contribute 1/4 of your initial distance calculations.
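To get a feel for those proportions, here is a minimal Python sketch (run outside Max, and not FluCoMa code) that estimates how much of the squared Euclidean distance comes from the first two columns. It assumes every column is independent and uniformly distributed between 0 and 1, which real data (and PCA output in particular) will not be, and `column_share` is a made-up helper name.

```python
import random

def column_share(n_dims, n_gps=2, n_pairs=2000, seed=0):
    """Estimate the share of squared Euclidean distance due to the first n_gps columns.

    Assumption: every column is independent and uniform in [0, 1].
    """
    rng = random.Random(seed)
    gps_total = 0.0
    all_total = 0.0
    for _ in range(n_pairs):
        a = [rng.random() for _ in range(n_dims)]
        b = [rng.random() for _ in range(n_dims)]
        squared = [(x - y) ** 2 for x, y in zip(a, b)]
        gps_total += sum(squared[:n_gps])
        all_total += sum(squared)
    return gps_total / all_total

share_36 = column_share(36)  # 2 + 6 + 14 + 14 columns: close to 2/36
share_8 = column_share(8)    # 2 + 2 + 2 + 2 columns: close to 2/8
print(round(share_36, 3), round(share_8, 3))
```

The shares only come out this evenly when all columns have a similar spread, which is one reason to standardize after the reduction step.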
The second consideration is, as you say, weighting the dimensions. If you want GPS to be more important, then you want the distances in those dimensions to be greater than the distances in other dimensions before doing your UMAP. If you standardize your data (perhaps after the first consideration above) to have mean 0 and stddev 1, you could multiply all values in the GPS dimensions by 2 so that dataset would end up with mean 0 stddev of 2. That would weight the GPS data to have more impact on the distance measures.
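As a rough illustration of that second consideration, the sketch below standardizes one column and then multiplies it by a weight of 2. It is plain Python rather than a Max patch, the longitude values are made up, and using the population standard deviation is an assumption on my part.

```python
import statistics

def standardize(column):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def apply_weight(column, weight):
    """Multiply every value in a standardized column by the same weight."""
    return [v * weight for v in column]

# Made-up longitude values, for illustration only.
longitude = [0.12, 0.48, 0.95, 0.33, 0.71]
weighted = apply_weight(standardize(longitude), 2.0)

# Mean stays at 0 while the standard deviation becomes the weight.
print(statistics.fmean(weighted), statistics.pstdev(weighted))
```

Every pairwise difference in that column doubles as well, which is what makes it count for more in the distance calculation.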
If I were to attempt this in Max, I would probably do something like dump the dataset to a dict and iterate over the dict, modifying the values into a new dict and then load that dict back into a new dataset.
I would set up a pipeline that allows you to tweak the dim redux of consideration 1 and the weights of consideration 2, and then try some different choices and see what you like best. It will also be important to identify your own, probably subjective, assessment criteria so you have a plan for how to decide which sets of dim redux and weights you prefer!
//========================================
It might be possible to skip consideration 1 and do something more directly, like standardize your 36D dataset and then just scale the 2/36 GPS dimensions up much larger to have a greater impact, but my gut says you’re wading out into the mess of the curse of dimensionality and it’s probably not wise.
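For a sense of how large “much larger” would need to be, here is a small back-of-the-envelope sketch. It assumes all 36 columns have the same variance after standardizing, so the expected share of squared distance is simply proportional to the number of columns times the squared weight. The function names are hypothetical, and it says nothing about the dimensionality concern itself.

```python
import math

def weighted_gps_share(weight, n_gps=2, n_other=34):
    """Expected share of squared distance from the GPS columns.

    Assumption: all columns have the same variance before weighting.
    """
    gps = n_gps * weight ** 2
    return gps / (gps + n_other)

def weight_for_share(target, n_gps=2, n_other=34):
    """Weight that gives the GPS columns the requested share."""
    return math.sqrt(target * n_other / (n_gps * (1.0 - target)))

print(weighted_gps_share(1.0))   # unweighted: 2/36
print(weighted_gps_share(2.0))   # doubled GPS columns: 8/42
print(weight_for_share(0.25))    # weight needed for a one-quarter share
```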
Thank you so much Ted for the tips!
I will work on that right now and post the results here.
Ok, so the easiest and most efficient way I found to implement the second consideration is to do a second normalization (the first one brings everything back between 0. and 1.) where I change the @max parameter so it becomes the multiplier. It apparently does the exact same thing as multiplying by X. This is what I realized after trying to literally iterate over the 1000+ values of one column of my dict, which was, as you can imagine, pretty long and inefficient, even when imagining an automated and iterative pipeline…
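A quick way to check that equivalence outside Max is the Python sketch below. It models normalization as plain min-max rescaling, which is my assumption about what the object does, and the values are made up. The two results match only because the first pass already spans exactly 0 to 1 and the lower bound stays at 0.

```python
def normalize(column, out_min=0.0, out_max=1.0):
    """Min-max rescale a column so its range becomes [out_min, out_max]."""
    lo, hi = min(column), max(column)
    return [out_min + (v - lo) / (hi - lo) * (out_max - out_min) for v in column]

# Made-up values, for illustration only.
raw = [3.0, 7.5, 4.2, 9.0, 6.1]
weight = 2.0

first_pass = normalize(raw)                       # range 0..1
second_pass = normalize(first_pass, 0.0, weight)  # range 0..weight
multiplied = [v * weight for v in first_pass]     # plain multiplication

print(all(abs(a - b) < 1e-12 for a, b in zip(second_pass, multiplied)))
```

If the lower bound were changed too, the second pass would add an offset on top of the scaling. That shifts the values but leaves the distances between points the same as scaling alone.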
Thanks again for your help Ted
I think this means that all your data is weighted by @max, which means that none of the dimensions will be more influential than others (which is the goal of weighting). So I think it won’t achieve what you’re after. Or am I misunderstanding?
That would be true if I did the normalization on the whole merged dataset, but I can do it separately on my GPS dimensions, for example, because they exist in another dataset (as do the others, which are also separate at first) that I ultimately merge with fluid.datasetquery into the full dataset that groups all four “sub datasets”, which are: GPS, meteo, basic audio descriptors and MFCCs.
That’s how I can weight each type of data separately.
In short, my pipeline starts with data I collect from a JSON file created by a Python pipeline. I use dict objects to unpack it and create a unique dataset for each key, which represents a certain type of data. Then I normalize these datasets, reduce dimensions with PCA and merge all these single-type datasets into one last dataset with a couple of fluid.datasetquery~ operations. This last “whole” dataset is the one used to concatenate. So it’s pretty easy to rescale only one type of dataset with fluid.normalize, and it also gives me the opportunity to have control over it and see different results with different weights applied.
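For anyone who wants to prototype the same idea on the Python side before the data reaches Max, the weighting and merging steps could be sketched roughly like this. The block names follow the ones above, but `merge_weighted`, the identifiers and the values are all hypothetical, and this is not how fluid.datasetquery~ works internally.

```python
def merge_weighted(blocks, weights):
    """Scale each block by its weight, then join the columns of matching ids.

    blocks:  {block_name: {point_id: [values normalized to 0..1]}}
    weights: {block_name: multiplier}; blocks not listed keep weight 1.0
    """
    shared_ids = set.intersection(*(set(points) for points in blocks.values()))
    merged = {}
    for point_id in sorted(shared_ids):
        row = []
        for name, points in blocks.items():
            w = weights.get(name, 1.0)
            row.extend(v * w for v in points[point_id])
        merged[point_id] = row
    return merged

# Tiny made-up blocks; the real ones would come from the JSON file.
blocks = {
    "gps":   {"1000": [0.9, 0.1], "1001": [0.2, 0.4]},
    "meteo": {"1000": [0.7, 0.3], "1001": [0.5, 0.5]},
}
merged = merge_weighted(blocks, {"gps": 2.0})
print(merged["1000"])
```

Keeping the weights in one dictionary makes it easy to rerun the merge with different values and compare the resulting maps.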