Thanks to everyone for the contributions to this discussion. I am grateful for all the advice provided here on alternative approaches to dimensionality reduction / classification / regression. I am studying them.
I have also looked at the links on UMAP provided by Pierre Alexandre. My current understanding is that I was trying to do something that the original UMAP is not designed for. UMAP is designed to perform dimensionality reduction on a dataset but, in its original version, it is not really appropriate for mapping new individual samples through a previously learned mapping.
One of the possible solutions discussed on the UMAP website, and already suggested by Pierre Alexandre, is what is called “Parametric UMAP”. If my understanding is correct, the idea is to train the dimensionality reduction with UMAP on a dataset and then learn this mapping with a regressor, for example a neural network. Once the mapping is learned by the neural network, new unseen data can be mapped.
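For reference, here is roughly what that looks like in the Python umap-learn package (outside FluCoMa), which ships a ParametricUMAP class built on TensorFlow. This is just a minimal sketch with made-up random data, to illustrate the idea:

```python
# Minimal sketch of Parametric UMAP with the Python umap-learn package
# (requires tensorflow). Data is random placeholder numbers.
import numpy as np
from umap.parametric_umap import ParametricUMAP

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 13))   # e.g. 13 descriptors per sample (made up)
X_new = rng.normal(size=(10, 13))      # unseen points

# fit_transform trains both the UMAP embedding and the neural network
# that parametrises the mapping from input space to embedding space
embedder = ParametricUMAP(n_components=2)
embedding = embedder.fit_transform(X_train)

# the learned network can now map unseen points through the same mapping
new_points = embedder.transform(X_new)
print(embedding.shape, new_points.shape)   # (500, 2) (10, 2)
```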
In the context of the FluCoMa ecosystem, I guess one possible roadmap would be to use an MLP regressor to perform this task. I wonder whether anyone has worked on this solution; see the sketch below for what I have in mind. If so, I would be very interested in learning from their experience.
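In Python terms, that roadmap would look something like the following: a plain UMAP fit, then a neural network trained to imitate the input-to-embedding mapping. The FluCoMa correspondence in the comments (fittransform output of fluid.umap~ used as the target of a fluid.mlpregressor~ fit) is my assumption of how one would patch it, not a tested recipe:

```python
# Sketch of "learn the UMAP mapping with a regressor":
# fit UMAP once, then train an MLP to imitate input -> embedding.
# In FluCoMa this would roughly correspond to fitting fluid.mlpregressor~
# on the source fluid.dataset~ and the fittransform output of fluid.umap~.
import numpy as np
import umap
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 13))   # placeholder training data

# 1. one-off (non-parametric) UMAP reduction of the training set
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X_train)

# 2. train a neural network regressor to reproduce input -> embedding
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
mlp.fit(X_train, embedding)

# 3. map unseen points through the learned regressor
X_new = rng.normal(size=(5, 13))
print(mlp.predict(X_new))
```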
Furthermore, when Pierre Alexandre wrote:
> The eventual solution: Parametric (neural network) Embedding — umap 0.5 documentation
Does this suggest that this is something we may expect to see included in the FluCoMa toolkit? (I know the project funding is over, so resources are quite limited.)
Finally, I think I was misled by the documentation and its parallelism with, for example, the PCA documentation. Here are the definitions of the main UMAP messages we are discussing:
- fittransform: Fit the model to a fluid.dataset~ and write the new projected data to a destination FluidDataSet.
- transform: Given a trained model, apply the reduction to a source fluid.dataset~ and write to a destination. Can be the same for both input and output (in-place).
- transformpoint: Transform a new data point to the reduced number of dimensions using the projection learned from a previous fit call to fluid.umap~.
Those definitions are closely related to those of the similar messages for PCA, so I got the idea that transformpoint, applied to the data used for the training, was supposed to give me the same point cloud. Unfortunately, this does not seem to be the case. May I suggest updating the documentation to at least warn users that transform and transformpoint may not produce exactly what you expect (again, compared with PCA)?
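To make the difference concrete, here is a quick check using the Python counterparts of the two algorithms (scikit-learn's PCA and umap-learn), assuming the Max objects behave analogously: PCA's transform exactly reproduces the fitted projection on the training data, while UMAP's transform generally does not.

```python
# PCA: transform on the training data matches fit_transform exactly.
# UMAP: transform re-runs an approximate embedding, so it typically
# does not land on the same point cloud, even for the training data.
import numpy as np
import umap
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 13))

pca = PCA(n_components=2)
print(np.allclose(pca.fit_transform(X), pca.transform(X)))   # True

um = umap.UMAP(n_components=2, random_state=42)
emb = um.fit_transform(X)
print(np.allclose(emb, um.transform(X)))                     # typically False
```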