Reducing UMAP processing time on large DataSets

jan · April 3, 2022, 9:04am

Hello all,

Suppose one has a large DataSet (say 200,000 points) and wants to run UMAP on it with a higher-than-average number of neighbours (ca. 100). Are there any options to reduce the processing time?
Mine has been running for 13 hours on a new machine, and I have no idea how much longer it might take.
Would processing smaller DataSets in parallel and merging them afterwards be possible, or even an option?
Just wondering about possibilities to increase efficiency with these new processes.
Thanks!
Jan

jamesbradbury · April 4, 2022, 4:12am

Iterations!

If you set the number of iterations lower you may see worse results, but it will give you a good early indication of how things will eventually turn out given more iterations to work with. It won’t be a panacea, but it will help cut down on compute time.
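As a rough back-of-the-envelope illustration of why the iteration count matters, here is a minimal Python sketch. It assumes a simplified cost model in which the layout phase updates each point/neighbour edge at most once per iteration. That model is an assumption for illustration, not a description of any particular UMAP implementation, and it ignores the cost of building the neighbour graph, which the iteration count does not change. The function name and the iteration counts (200 and 50) are hypothetical; only the 200,000 points and 100 neighbours come from the question above.

```python
def estimate_edge_updates(n_points: int, n_neighbours: int, iterations: int) -> int:
    """Upper-bound count of edge updates in the layout phase (assumed model)."""
    edges = n_points * n_neighbours  # one edge per point/neighbour pair
    return edges * iterations        # assumed: each edge visited once per iteration


# Hypothetical iteration counts; the point and neighbour counts are from the thread.
full = estimate_edge_updates(200_000, 100, 200)
quick = estimate_edge_updates(200_000, 100, 50)

print(f"full run:    {full:,} edge updates")
print(f"preview run: {quick:,} edge updates ({quick / full:.0%} of the work)")
```

Under this model the layout work scales linearly with the iteration count, so a preview with a quarter of the iterations costs about a quarter of the layout time.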

jan · April 4, 2022, 6:32am

Of course! Am I right to assume that a lower minimum distance will also make the clustering obvious more quickly, hence reducing the margin of error when using fewer iterations?
I also wondered whether these rather heavy processes could take advantage of more than one CPU core to speed things up?

jamesbradbury · April 6, 2022, 1:15pm

I think if you change the mindist to 0 it may be faster, though I’m sorta shooting from the hip on that one. UMAP uses a KDTree under the hood to check the proximity of nodes, so it might be that it just doesn’t need to query the tree while it’s figuring out the projection when mindist is effectively zero.

jan · April 6, 2022, 5:52pm

Thanks @jamesbradbury, it hadn’t occurred to me to reduce it to 0. Will try this out!