Fluid.umap~ transform message

As I said, my answers from 8 days ago explain why it is not deterministic. The UMAP people explain it better than I do. I even posted a solution proposed by @weefuzzy, but it would mean putting an option on the object and baking in an MLP. See and “enjoy”

The results I get in Python are exactly the same. They use a trick: if a query point is near enough to a training point, they return that point’s stored embedding. We don’t do that cheating, since it only works for points already in the training dataset, in which case a KDTree lookup is more efficient anyway.
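To make the KDTree point concrete, here is a minimal sketch (assuming scikit-learn, with made-up data standing in for a dataset and its precomputed embedding): recovering the embedding of a known training point is just a nearest-neighbour lookup in the input space, no UMAP transform needed.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))   # training data, e.g. 13 MFCC coefficients
Y = rng.normal(size=(100, 2))    # their (precomputed) 2-D embedding

tree = KDTree(X)                 # index the *input* space once

def lookup(point, tol=1e-9):
    """Return the stored embedding if `point` is (numerically) a training point."""
    dist, idx = tree.query(point.reshape(1, -1), k=1)
    if dist[0, 0] < tol:
        return Y[idx[0, 0]]
    return None                  # genuinely unseen point: no stored answer

print(np.allclose(lookup(X[42]), Y[42]))   # a training point maps back exactly
```

This is exactly why the shortcut is “cheating”: it can only ever answer for points that were in the training set.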

Thanks to everyone for the contributions to this discussion. I am grateful for all the advice provided here on alternative approaches to dimensionality reduction / classification / regression. I am studying them.

I have also looked at the links provided by Pierre Alexandre on UMAP. My current understanding is that I was trying to do something the original UMAP is not designed for. UMAP is designed to perform dimensionality reduction on a dataset, but in its original version it is not really appropriate for mapping individual samples through a pre-learned mapping.

One of the possible solutions discussed on the UMAP website, and already suggested by Pierre Alexandre, is what is called “Parametric UMAP”. If my understanding is correct, the idea is to train the dimensionality reduction with UMAP on a dataset and then learn this mapping with a regressor, for example a neural network. Once the mapping is learned by the neural network, new unseen data can be mapped.
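That two-step idea can be sketched in a few lines. This is a hedged illustration, not Parametric UMAP itself: to keep it runnable without umap-learn, PCA stands in for the UMAP embedding (with umap-learn installed you would use `umap.UMAP().fit_transform(X)` in step 1), and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))   # stand-in for a descriptor dataset

# Step 1: compute a low-dimensional embedding of the training set.
# (PCA is a stand-in here; in practice this would be the UMAP embedding.)
embedding = PCA(n_components=2).fit_transform(X)

# Step 2: train a neural network regressor to reproduce that mapping.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                   random_state=0).fit(X, embedding)

# Step 3: the trained network maps unseen points, and does so deterministically.
new_point = rng.normal(size=(1, 10))
print(net.predict(new_point))    # a 2-D position for a point UMAP never saw
```

The payoff is step 3: unlike the stochastic UMAP transform, the learned regressor gives the same output for the same input every time.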

In the context of the FluCoMa ecosystem, I guess one possible roadmap would be to use an MLP regressor to perform this task. I wonder whether anyone has worked on this solution. If so, I would be very interested in learning from their experience.

Furthermore, when Pierre Alexandre wrote:

The eventual solution: Parametric (neural network) Embedding — umap 0.5 documentation

Does this suggest that this is something we may expect to see included in the FluCoMa toolkit? (I know the project funding is over, so resources are quite limited.)

Finally, I think I was misled by the documentation and its parallels with, for example, the PCA documentation. Here are the definitions of the main UMAP messages we are discussing:

  • fittransform: Fit the model to a fluid.dataset~ and write the new projected data to a destination FluidDataSet.
  • transform: Given a trained model, apply the reduction to a source fluid.dataset~ and write to a destination. Can be the same for both input and output (in-place).
  • transformpoint: Transform a new data point to the reduced number of dimensions using the projection learned from a previous fit call to fluid.umap~.

Those definitions closely mirror those of the equivalent messages for PCA. So I got the idea that transformpoint, applied to the data used for the training, was supposed to give me the same point cloud. Unfortunately, this does not seem to be the case. May I suggest updating the documentation to at least warn users that transform and transformpoint may not produce exactly what they expect (again, compared to PCA)?
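For what it’s worth, the expectation itself is correct for PCA. A quick scikit-learn check (with synthetic data, as a rough analogue of the fittransform/transform messages) shows that fitting and then re-transforming the same data reproduces the point cloud exactly, because the projection is a fixed linear map; this is the guarantee the stochastic UMAP transform does not make.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))    # stand-in for a descriptor dataset

pca = PCA(n_components=2)
fitted = pca.fit_transform(X)    # roughly: fittransform
projected = pca.transform(X)     # roughly: transform on the training data

# For PCA the two agree to numerical precision.
print(np.allclose(fitted, projected))   # True
```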

I hope so. But I don’t have the means to do it full-time because:

Indeed, all volunteers are welcome. There is a list of things to do; I need time to do it. The good news is that our codebase has all the components, so it is a question of macro implementation and, most importantly, interface (i.e. how one provides good ways in without doing scikit-learn-in-an-object).

Same here, and conceptually, they are. It is just distorted a little more. They get you into the ballpark of the new space, so on new material I have used it, like many others, with great success. It is just not completely deterministic.

PCA is very efficient and deterministic, but it has problems linked to those two characteristics.

Parametric UMAP is a good compromise, but it needs careful thinking to keep UMAP sane for simpler (and faster) uses. Parametric UMAP is complicated to train, as the authors say in the paper, and if people from ML say so, imagine how it will be for musicians. It might take a lot more time to converge than the current version, for instance, among other problems. We spent a lot of paid, full-time research time with musicians making sure we pitch things at the right level, so implementing it is far more trivial than doing it right for the toolset.

I hope you understand the tensions here. In the meantime:

  • for your problem, I think PCA → mlpclassifier is the way to try first. If your patch uses UMAP, it is a 10-minute job to replace it in your code (the dataset is already built and populated, so you just replace two objects)

  • if you have C++ and ML friends, the codebase is free, so they could hack a version and propose a PR (at least of the implementation) so we can share the load of adding to the ecosystem. We can then think about the interface, the presets, and the saving/loading of state (etc. etc. etc.)
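As a rough scikit-learn analogue of the first suggestion (hedged: synthetic two-class data, hypothetical parameter choices, and Python objects standing in for a fluid.pca~ → fluid.mlpclassifier~ chain in a patch):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
# Two synthetic classes of 12-D descriptors, clearly separated.
X = np.vstack([rng.normal(loc=0.0, size=(50, 12)),
               rng.normal(loc=3.0, size=(50, 12))])
y = np.array([0] * 50 + [1] * 50)

# Deterministic dimensionality reduction (PCA) feeding an MLP classifier,
# mirroring a fluid.pca~ -> fluid.mlpclassifier~ chain.
clf = make_pipeline(
    PCA(n_components=4),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
).fit(X, y)

print(clf.score(X, y))   # training accuracy on this easy data should be high
```

Because the reduction is PCA, the whole chain is deterministic, which sidesteps the transformpoint issue discussed above.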

I hope this helps

p