The need for speed (fluid.dataset~ + fluid.datasetquery~)

rodrigo.constanzo · January 25, 2021, 3:27pm

Thanks for the detailed consideration and breakdown.

To have a fully dynamic/forking query system, it’s definitely great to be able to radically change around what you are searching for and what order you do it. The overall interface of having those different steps be different objects is a bit faffy, but powerful. I will definitely agree with that. In general I tend to prefer a tidier interface (“blackbox”), but I can see the overall design considerations that going a function-per-object affords. Again, faffy interface, but powerful/functional.

A lot of what is being discussed here (re: copying) centers around this. I’m trying to think of the analogs to a buffer~-based system, where you have dirty flags and whatnot, which is problematic, but I guess the general idea is to know that data will be static while you’re doing stuff with it.

My initial suggestion for this (a fluid.datasetfilter~, or similar object) was that there would be a separate object for “simply” filtering through data without necessarily building new datasets. My initial thoughts on this were to do with interface (overall syntax and “the buffer problem” of having oodles of datasets), but a lot of what I suggest in that thread could apply here in that fluid.datasetfilter~ would specifically create an internal copy of the dataset to sort/query/filter/whatever without having to worry about data elsewhere getting fucked around with.

That does put a lot of eggs in one basket, though: what happens if you then want to fork, rescale, sort, etc…, as @tremblap suggests above?

So a couple more spitballed ideas.

Having a “dirty” dataset type that is happy to be used as a “buffer” of sorts, for in-place/destructive edits where a user has to mind what they do. It can be cordoned off so as not to break/crash, but can obviously throw errors if you’re reading/writing at the same time in bad ways.
Having a @dirty (or whatever) flag for fluid.datasetquery~ where on loading/whatever, it creates an internal reference to the fluid.dataset~ that was loaded into it, and everything else is done on that internal version.
In general, I guess copying/sorting/filtering can be done via indices instead of complete(/large) datasets so that each step in the process does what it needs to do, but not by copying every single thing in order to do so.
Having some RT/offline distinction between dataset-based processes, just like there are fluid.buf*~ and fluid.*~ versions of most algorithms. The fluid.buf*~ version works as it does now, as an “offline” process that reads/writes datasets per step, and everything is safe and sound as you go. And a fluid.*~ version which is more bare metal and works destructively, but quickly, and “you get what you get” in the same way the RT versions of objects presently work.
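
To make the indices idea above a bit more concrete, here is a rough sketch in Python (not Max, and not a claim about how fluid.dataset~ or fluid.datasetquery~ work internally). The IndexView class and its filter/sort/materialize methods are names I made up for illustration. Each step only produces a new list of ids that points back at the same shared data, and the actual point data gets copied once, at the very end, if at all.

```python
from typing import Callable, Dict, List, Optional

Point = List[float]


class IndexView:
    """Hypothetical view over a dataset: holds a list of ids, not a copy of the data."""

    def __init__(self, data: Dict[str, Point], ids: Optional[List[str]] = None):
        self.data = data  # shared reference to the original dataset, never copied
        self.ids = list(data) if ids is None else ids

    def filter(self, column: int, predicate: Callable[[float], bool]) -> "IndexView":
        # Keep only the ids whose value in `column` passes the test.
        kept = [i for i in self.ids if predicate(self.data[i][column])]
        return IndexView(self.data, kept)

    def sort(self, column: int) -> "IndexView":
        # Reorder the ids by the value in `column`; the points themselves do not move.
        ordered = sorted(self.ids, key=lambda i: self.data[i][column])
        return IndexView(self.data, ordered)

    def materialize(self) -> Dict[str, Point]:
        # The only place where point data is actually copied.
        return {i: list(self.data[i]) for i in self.ids}


# Toy dataset: id -> [loudness, pitch] (made-up values).
dataset = {
    "a": [0.9, 440.0],
    "b": [0.2, 220.0],
    "c": [0.7, 110.0],
    "d": [0.5, 880.0],
}

# Two chained steps, each passing along ids only.
view = IndexView(dataset).filter(0, lambda x: x > 0.4).sort(1)
result = view.materialize()
print(view.ids)  # ['c', 'a', 'd']
```

The obvious catch is the same one as with the “dirty” ideas: since every view shares the underlying data, anything that edits the dataset mid-query changes what the views see, so the user (or the object) still has to guarantee the data stays put while a query is running.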

Lastly, I do imagine that things will speed up come proper optimization time (like fluid.kdtree~ has done), but I don’t think that at any point, copying massive datasets, multiple times, per query, will ever be remotely “fast”.