Speed comparison between different dataset processing chains

As is evident from some recent threads, I’ve been trying to optimize my data processing chains to avoid dimensionality reduction in general. Among other reasons (I seemed to get worse matching overall), one of the main ones was speed. I can’t seem to find examples/discussion on the forum as such, but I remember testing longer processing chains early on and not being very pleased with how long the additional processing and dimensionality reduction steps took.

So I finally got around to building a patch that compares the raw speed of three approaches:

  1. fluid.robustscale~ → fluid.pca~ → fluid.kdtree~
  2. fluid.robustscale~ → fluid.kdtree~
  3. fluid.robustscale~ → fluid.normalize~ → fluid.mlpregressor~ → fluid.kdtree~

I have to say that I’m quite pleased with the results!

The patch (attached below) creates datasets of 5000 entries, each with 100 (randomly generated) data points, and then does the associated processing/passing around.
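For anyone who wants to poke at this outside of Max, here’s a rough Python/scikit-learn analogue of the three chains (my own sketch, not the attached patch — RobustScaler/PCA/KDTree/MLPRegressor standing in for the fluid.* objects, with the MLP only as a timing stand-in since scikit-learn doesn’t expose the bottleneck layer the way fluid.mlpregressor~ does):

```python
import time

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KDTree
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler, RobustScaler

rng = np.random.default_rng(0)
X = rng.random((5000, 100))  # 5000 entries x 100 random data points, as in the patch


def timed(label, fn):
    t0 = time.perf_counter()
    out = fn()
    print(f"{label}: {time.perf_counter() - t0:.3f}s")
    return out


# 1. robustscale -> pca -> kdtree
Xs = timed("robustscale", lambda: RobustScaler().fit_transform(X))
Xp = timed("pca", lambda: PCA(n_components=10).fit_transform(Xs))
timed("kdtree (pca'd)", lambda: KDTree(Xp))

# 2. robustscale -> kdtree (no reduction)
timed("kdtree (100d)", lambda: KDTree(Xs))

# 3. robustscale -> normalize -> mlp -> kdtree
Xn = timed("normalize", lambda: MinMaxScaler().fit_transform(Xs))
mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=50)  # won't converge on noise
Xm = timed("mlp", lambda: mlp.fit(Xn, Xn).predict(Xn))
timed("kdtree (mlp'd)", lambda: KDTree(Xm))
```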

Since the data is random, the fluid.mlpregressor~ didn’t properly converge (I think I stopped training at a loss around 5.4xxxxx or so), but I figure for the purposes of a speed test, accuracy didn’t matter.

So it seems the PCA step doesn’t add much time at all, and the MLP isn’t far behind either (though I’ve yet to get a dataset to converge with it, but that’s a different story).

Granted, in a real-world scenario there would be 10000 more steps involving individual column/stats scaling and massaging, as well as enough pruning and fluid.bufcompose~-ing to make one’s head spin, but the core speed here is pretty comparable.

I’m still a bit terrified of what *actually* processing some analysis stuff entails (a boy can dream), but for now:

speedtest.zip (144.2 KB)


Well color me interested.

Based on @weefuzzy’s comments in the LTE thread, I benchmarked UMAP vs PCA and the results are really impressive.

And from my understanding, UMAP is “better” than PCA in just about every way, particularly with lumpy (i.e. musical) data(?).

It definitely takes a bit longer to do the initial fittransform, but I’m not bothered by that at all since you only really need to do that once (and it’s nowhere near as long as MLP).
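If anyone wants a quick sanity check outside Max, here’s a minimal sketch of the same comparison using scikit-learn and the umap-learn package (assuming those are installed; absolute timings will obviously differ from fluid.umap~/fluid.pca~):

```python
import time

import numpy as np
import umap  # pip install umap-learn
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((5000, 100))

t0 = time.perf_counter()
PCA(n_components=2).fit_transform(X)
print(f"pca fittransform: {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
umap.UMAP(n_components=2).fit_transform(X)
print(f"umap fittransform: {time.perf_counter() - t0:.3f}s")
```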

Archive.zip (151.0 KB)


in both examples there is a logical issue with the chain, namely:
robustscale → normalise → mlp

because normalise will take the full range (min/max) and undo the robustscaling.

so you should try
robustscale → mlp
and play with the IQR values (25/75 are the default, but 10/90 could be fun too)
(and you save one (cheap) step)
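for what it’s worth, a little sketch of what widening those bounds does, in scikit-learn terms (RobustScaler as a stand-in for fluid.robustscale~):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler  # analogue of fluid.robustscale~

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 100))
X[:50] *= 20  # plant some outlier rows

# 25/75 is the default IQR; 10/90 divides by a wider range,
# so everything (outliers included) ends up smaller in scale
default = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(X)
wide = RobustScaler(quantile_range=(10.0, 90.0)).fit_transform(X)
print(default.std(), wide.std())
```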

otherwise this is interesting as an optimisation thread. Looking forward to seeing/hearing the comparative results.

When I built that, I was still under the impression that the mlp needed the values to be within the range of whatever @activation you are using, which @weefuzzy let me know (in another thread) is not the case.

Either way, it was more about building plausible processing chains to compare “real world” usage.

it is true that that is optimal, but then you can use the neat non-linearity of the activation to softly distort these outliers… so most of your data fits within the linear part of the activation, and the distances between the outliers (the bandits) get increasingly ignored as they are pushed nearer together by the soft clipping…
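a tiny numeric illustration of that soft clipping (a sketch with tanh; the other activations behave similarly near their limits):

```python
import numpy as np

x = np.array([0.1, 0.5, 1.0, 3.0, 6.0])  # last two values are the bandits
print(np.tanh(x))                   # ~[0.0997 0.4621 0.7616 0.9951 1.0000]
print(np.tanh(6.0) - np.tanh(3.0))  # the two bandits end up only ~0.005 apart
```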
