The need for speed (fluid.dataset~ + fluid.datasetquery~)

a.harker · October 10, 2020, 12:42am

I guess because threading is a mystery of the universe, the results are actually the same.

So much a mystery to you that it’s not even involved in this case. The results are similar but not the same, but the change of computer makes direct comparison void - I’d expect if you tested on the same machine that the ratio would go from 6.8ish to 5.8ish, which is noticeable

rodrigo.constanzo · October 10, 2020, 12:49am

I’ll test it tomorrow (in bed now, hence the laptop). I just assumed going from 6.8 to 6.3 was just a matter of random spread, though that’s not an insignificant amount of change either.

weefuzzy · October 10, 2020, 1:01am

I’m about to go to bed too, but FYI I’ve tweaked out some unneeded allocations (incl. the Eigen objects) and trimmed the smart pointers back to only where they’re useful, and things are now happier. Not yet tried the sqrt thing.

10k points, level pegging

100k points, starting to pay off

rodrigo.constanzo · October 10, 2020, 9:57am

On a semi-interesting note here, both actually speed up with the fully separated timing. So although the ratio is slightly better, it’s still only by a factor of 0.3.

This is the “bad” timing method for 10k/8d:
Screenshot 2020-10-10 at 10.50.21 am

This is the “good” timing method for 10k/8d:
Screenshot 2020-10-10 at 10.51.10 am

I guess we’ll see in due time, but I’m curious if this holds true as you go smaller. Like does it only start becoming equal around 10k?

Perhaps there’ll be an algorithmic sweet spot that below a certain amount of points brute force is faster, then it gets into KDTree land, with perhaps something above that (that maybe isn’t super useful for corpora-level numbers).

I’m specifically thinking of an immediate/obvious use case for me being the time-travel/prediction stuff where I want to find the nearest distance on a pre-trained set of data asap, before moving on to more wiggly/complex querying. At the moment my test corpus for this has been <1000, but that will likely change when I build the proper version of that. It would still be in the few thousand range though, at most.
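For anyone who wants to poke at the "sweet spot" question outside of Max, here is a minimal Python sketch that pits a linear scan against a toy KD-tree on the same points. It is not the FluCoMa or entrymatcher code, and every name in it (`brute_force_nearest`, `build_kdtree`, `kdtree_nearest`) is made up for illustration; it only shows the two search strategies being compared in this thread.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_nearest(points, target):
    """Linear scan: return the index of the point closest to target."""
    return min(range(len(points)), key=lambda i: sq_dist(points[i], target))

def build_kdtree(points, indices=None, depth=0):
    """Build a toy KD-tree as nested tuples: (index, axis, left, right)."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    indices.sort(key=lambda i: points[i][axis])
    mid = len(indices) // 2                # median point becomes the node
    return (indices[mid], axis,
            build_kdtree(points, indices[:mid], depth + 1),
            build_kdtree(points, indices[mid + 1:], depth + 1))

def kdtree_nearest(points, node, target, best=None):
    """Return (squared distance, index) of the nearest point in the tree."""
    if node is None:
        return best
    idx, axis, left, right = node
    d = sq_dist(points[idx], target)
    if best is None or d < best[0]:
        best = (d, idx)
    diff = target[axis] - points[idx][axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = kdtree_nearest(points, near, target, best)
    # Only cross the splitting plane if something closer could be there.
    if diff * diff < best[0]:
        best = kdtree_nearest(points, far, target, best)
    return best

if __name__ == "__main__":
    random.seed(0)
    pts = [[random.random() for _ in range(8)] for _ in range(1000)]
    tree = build_kdtree(pts)
    query = [random.random() for _ in range(8)]
    print(brute_force_nearest(pts, query), kdtree_nearest(pts, tree, query)[1])
```

Wrapping the two calls in `timeit` for different point counts would give a rough feel for where (or whether) a crossover happens on a given machine, though pure-Python numbers say nothing about the C++ objects discussed here.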

tremblap · October 10, 2020, 10:18am

This speedup is quite incredible @weefuzzy!

weefuzzy · October 10, 2020, 11:17am

People shouldn’t get too excited yet – this still has to be doofus-checked by the person who actually maintains this code and knows how the thing is meant to work.

That can often happen, but a quick check suggests that 10k points @ 8 dimensions seems to be a coincidental sweet spot where the KD tree and entry matcher are in a similar ballpark, and that for a smaller N (100) the tree is now much faster. But, anyway, your particular bottleneck is less to do with the raw query speed of the tree, so much as the desire to push hard on datasetquery and make lots of little baby trees.

rodrigo.constanzo · October 10, 2020, 11:40am

This reads a bit weird to me. Are you saying that for small datasets, the kdtree is faster, and it’s also faster at large(r) datasets, and the only place they are similar is around 10k?

For general purpose usage, totally. But there are some places where that’s not as relevant.

weefuzzy · October 10, 2020, 12:09pm

Yes, but take this with a sack of salt because it’s not the result of any really principled measurement, and as before, quite probably has little to do with what’s inherent to each approach to searching.

tremblap · October 10, 2020, 12:49pm

and we should remember the wisdom further up the thread: optimising now would be counter-productive since the team is in C++ sabbatical to protect time for divergent abuse of higher-level interface…

as the wise one said to me:

It is hard for me to wait, but at the same time, we are still poking at the interface and its design. As I presented 2 days ago, I barely have time to music (nice verb) with the demo ideas I had more than 2 years ago… so @rodrigo.constanzo keep poking and trying, it is good for us to see what people (try to) do with the toolset, but also it is quite good to see what ideas people don’t have. @weefuzzy has some sexy code coming that just opened ideas - ways of thinking about data I did not even consider…

rodrigo.constanzo · October 11, 2020, 12:52pm

Totally. The raw speed via a kdtree approach inquiry is merely there for context. Optimization should definitely come later. As I was quote-responding to @weefuzzy, it’s more the interface stuff of filtering/querying/etc… that is currently missing/difficult/slow.

Definitely looking to see more code from The Team™ on the forum!

rodrigo.constanzo · January 21, 2021, 7:09pm

Bumping this based on a…spirited…conversation with @tremblap (and @a.harker to a lesser extent) today at the FluCoMa chat thing.

Basically on the idea of “copying the dataset” each time vs “not copying the dataset each time” (which is apparently a “black box” somehow). From what I remembered, copying the dataset each time added significantly to the querying time, independent of the re-fit-ing process, which is also pretty time consuming.

To clarify the context here, the idea would be to have a dynamic query where every time you query it is a bit different (meaning having a handful of pre-fit fluid.kdtree~s isn’t really an option).

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

So looking back at some of the patches from this thread, but tweaking them a bit to compare fit-ing vs non-fit-ing, to mainly focus on how much time it takes to copy.

What I’m going to show here is the time it takes for fluid.datasetquery~ to transform a fluid.dataset~, and then I’ll show the same transform followed by a fit-ing of fluid.kdtree~.

(all of this is with @weefuzzy’s much improved alpha07/08 code)

Here is a dataset with 10k points with 8d each (no fit):
Screenshot 2021-01-21 at 6.45.12 pm

Here is the same dataset but with an additional fit-ing step:
Screenshot 2021-01-21 at 6.46.22 pm

Obviously slower, and the bigger hit comes from the fit in this case but just copying the dataset once adds a massive amount of delay.

This is already pushing at the edges of usability in a “realtime” context.

Rather than doing every step of the granularity as before, if I jump up to 100k/8d I get the following (no fit):
Screenshot 2021-01-21 at 6.48.03 pm

The same (100k/8d) but with a fit:
Screenshot 2021-01-21 at 6.49.12 pm

At this point, the copying itself adds (about) as much latency as the fit-ing process.

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////

So in general the numbers are in line with what I’ve posted above, with the main distinction I’m pointing out here being that the simple use of fluid.datasetquery~ in the mix slows things down considerably.
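To make the "time the copy separately from the fit" idea concrete outside of Max, here is a rough Python sketch. Everything in it is a stand-in and an assumption: `copy_rows` just duplicates a list of lists, `fit_index` just sorts indices along one dimension, and neither resembles what fluid.datasetquery~ or fluid.kdtree~ actually do internally. It only shows the measurement pattern of timing each stage on its own.

```python
import random
import time

def timed_ms(fn, *args):
    """Run fn(*args) once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

def copy_rows(rows):
    """Stand-in for the dataset copy: duplicate every row."""
    return [list(row) for row in rows]

def fit_index(rows):
    """Stand-in for fitting: sort row indices along the first dimension."""
    return sorted(range(len(rows)), key=lambda i: rows[i][0])

def measure(n_points, n_dims=8, seed=0):
    """Time the copy stage and the fit stage separately for one dataset size."""
    rng = random.Random(seed)
    rows = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    copied, copy_ms = timed_ms(copy_rows, rows)
    index, fit_ms = timed_ms(fit_index, copied)
    return {"points": n_points, "copy_ms": copy_ms, "fit_ms": fit_ms,
            "rows_copied": len(copied), "index_size": len(index)}

if __name__ == "__main__":
    for n in (10_000, 100_000):
        print(measure(n))
```

Running each stage several times and taking the median would be more trustworthy than a single pass, since one-off timings are noisy.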

Granted, these are fairly large datasets I’m demoing here (10k/100k) (though with a really small number of dimensions (8d)), but it can also serve as a decent analog if you have multiple steps of filtering in any given query with a more modest dataset size.

The main point I was trying to make with @tremblap is that I’m not interested in creating a “black box”, though I don’t see anything necessarily wrong with that, but rather that I was pointing out that the idea of duplicating/copying every bit of data, every time you make a query, for every step of the query you make is not only a lot of friction in terms of UI/UX, but also prohibitively slow with anything other than small datasets. And as much as I’m a big stickler for speed, the idea of “querying” = “go play with SQL for a bit and come back to me” seems like a quite non-musicianly way of going about things.

So yeah, giving this thread a nice little bump.

a.harker · January 21, 2021, 7:40pm

Can you provide the patch so others can confirm the results (and particularly that the time measurement is now corrected, as it was not being performed correctly before)?

rodrigo.constanzo · January 21, 2021, 7:53pm

It’s just a slightly tweaked version of the code above by adding fluid.datasetquery~.

Here’s the 10k/8d code (add/delete the fit bit as needed).


----------begin_max5_patcher----------
5439.3oc68r1iiaqced2eEBCZ91FW9lT8CE8ldayEnIsA2jKBJBJFHayYVkU
VxURd1cyE282d4CYYIaZKJYJOS5clEXVMRhhmygm27Px+5aeycKK9jr5tn+o
neI5Mu4u9127Fysz23MM+8ataSxmVkkTYds6xkerX4ud26rOpV9oZysyJRVu
LI+w8OHe2lz7LYsoQvlatMod06Sye79R4pZamBQ33EX.QDyDXHEhHT36hvXw
Bv6hnleiPK.Q+OG9tE6pO9CauU8m2Jse06LfRaiRWafQEb+0n6z25u812p+0
67Dm2HqpRdTdBRWJqj0uK5gzrZYYDH5eNBr.n9QgAIqWWpfAo5tBmzDzknIB
tFwITrA+AwSkL3jDfwSgFblw8Gx1ktdw5j5DEo3KQUoa1lIoU6Vp9yQyKP.D
MlxYrEPP79+AAX96hfLv.TA7EYFdmFVKRp0W3lrPBHYoNZYzxQi+X.TiiLlg
umSF.gQCgvmUJfDTwfGRq8XfGMLhyQFTNVDT9cBeJH6xc00E4iWcFEYDcANX
gsXm92AScFoU4x1jxjMRkdn6k4IKyLsDDNF5JYVTsRiV0CEkal53KCYzqI.W
KisaZQbHYpaQ18r1ZcbWIaNBXMqAQK3wc9gR0Z44AkqmClKs7+u6jke9KiWx
nA4EjELPmePXkxcHZdUtyfAjZ70QvILzSWHN1MGHlr.AEG9GQqkHdAcZ7Ao4
0t84AETKaoSvxlSzOlt.QIG9GOVohfN.1itD1eIadnP51yVo7CeIpVVUusvz
wNnG3wROPJo.XG5Ai.8Qz3r7CVQBmzhP5qyteKcB93pv4So.T1BTbGylPMA.
b0N9s++Oq.BMfji+QMDCVLdMDJOGNIPHHGunqBBDgniKXF3H3AjDnLcttXig
NDHxfKEk7IGVzY4ChCcH.SPUIv5lXSHfnPw6ed8hgzz3ukEUJebrC5Btgitw
KQrXptGdVy+PwycrcBAbAipDeEDHkSYHDyJciQ7C+ifz2jr.FSUgL.YBhhzP
hmon.gSRleUwlMxCl7ZIKIOIK63Hs+76.yfOSHVz0sPHTooyFLEBbVjG3feF
DRrpNUgFONYjBf6gTXsEcBb7HELNjHEdQTQYT01jMQIQK2ku58QEODo8qOUV
s+syRykqJ1kW2kIyWB.CBU91oUbu.fYc9WCCtV0FqCA3gh75GRVI69YcSTzu
YU5uYdSMMzAshdWm2MWEUr4Q+IY1Sx5zUIgjRhVDspTlTKiZr4YbGLRcsgb9
4nOlV+9nz5fRTU9MS5xUAXZEDwyKQkb6HpvVhZRTSLm5brlIWaImMjZEprTV
FF9UkxGMkkXc0zlpwYiVxBOobhYqBpwQkEWFdeloBb9of3PmeJkKrJfZrXZr
RQDt6Obn1jC+fmVS0eiyh5T1EQ8VVfxMIFfjcyHFLL7TWrcZpZtHMzWpjF27
I2PRCl+6KtFAZFEgvv4mXbNiPp6U94MZxfrLJsJBo3AnQJ2AkUQOjTo.kfP.
E1z9Bi4i2oPQHyqW01RkmKODoM1dLt+UODdDmy3NlgDTXmpSVbXSrynypiaT
GcHbWjffYT.Tf.wnPmTGRHCseajdN9mpKFDExhTXLsI4lTi6VvvNQef65BHJ
FUKR1fkJMEoYRUTpUoJWkN70eycIa2141uoSSzjlesv7gDuq8Vo41aAauUo7
oz8s+vKlTpHG0JZwtRqOdehs2MZ8moXsrLeWp4KYuoZPpAjLCGZeCUQoY85z
Lps+wGHvTiDCzNy4LqCrDQG8Hpg6GyJV8A45NpQTjzsx7z7s5RXHuNotA1ae
7Z4CI6xpuuq2rlLQ63468K14Ca8u8OTllj0h.OVlttHWCD8FIz2de2oXdLxI
5ee.YLuQdxVGMVwCnHKm4gUJjbW0xjR8.UiEDz9GVWTj0+QssKS9Pcyi2llm
eDUrtX64eXY5iu+BscYg5gatz217jp62kae58Jdh56qRdpO0tNIKqQls+m+S
I4oJs3RstaazHsOzZE88UqJKxx5gu1m7jimrVwiuR9wz00u2zQcYFTud518L
Q20NJuN8QYUc+6Um7XU+6TU+YKQuys1srQF99Z4lsYJrn+KnjNRqpqdewGqZ
dw8LZcI.GJsotxzcU.169GoHznfqUd8XkdfNOv4bdaDLYbqp98+tkQ9rJ67H
BqShxpKPt2gltjKy6muV9oN5ZZrKzn1Yhjnd1JtXEgcVqFmi.RAVEa.2UBVP
IfT6D.EHhR+RJnOUwnROpHWFU+whn52WJkQOTrqL5gzmjQUoeR4.1Sx7HoV2
wYIcHuHcDqyTj3qk1cV5lfER5lseuN9EjxgKaNw.7XANV4esmhe.2nXa.YSW
35fwKcNpNiJICFoetaBUkhGY09wiFY9n9niRKbcZdqo7eocHx7hdMNMVfPGH
+yOTr+iOHTnGKOCTzby8Nfcm1J656sdTbeRccY5xc01wttdTNJKeJ2vVljcj
4JWFFe6AfKL9tuQljGlTgP5mJDhtzkvwSM5+1h14byHKlFxZ5ZUlLobzSNIC
ziNPDzyjRHLMr0oKKrSL6ikE61NVrOFzDiliZ0gBC9D0hX2treGqBc5DFb0n
auYrRE19rjXb7MptM2FsZ6Nc..ilu2o7OEwV.6V5h5IziEp5Qwcvl87z08bm
zM1x+tJ3ealjf1rvtOIhX9rD8ObAMlBEr1w9WyEvq4B3uWxEv0EaB1X.QWtO
Z4S59eO9nQNaj9JGnjkOkjEklGsoRmz9R41hxZ45HE8P9blHfkQvoR2f1YGg
ai9+joH45yAfWJ3NYIjczG3H0Z2Fppxr9Jsh8qjz1j3XJ6Zos8MkOYhK9EAw
8qiZKf2Qm7kFYchQVGiLky9yOck7hftdEpBHhWdZBn++BMADtlG8klh.1si1
dc442pGkv81z9zSS8YSEotUQUJGPqhrErpoBDsd.nbFnw2.mNAfdoPHsB32D
5nX.5Xhlp0PHuHw6lksWjuY3bFSyJ1SXfLiv.wSX.NmI81SXXNS4MaDiEvYB
F3dBCn4L0+dBCzm0D+6WZTd9ldfesHMeroGDSEZ2FvLwoqYQNcpKgUm4JU.d
AjHbhMATXT7oKHuYHQ37PV4ZaUdBHUNDzTReUiNS4MKLKW3NDD1Y8fGxkl51
jUevh0ieIa2fzDfi5TbvcvfQhy7.u7zxl5HrKoYHdxqI6KteMvC4hxqWIqpb
oaaRYZUQtYMndtsfG3fZ2H5hWr65yCq43iGRbmbN5QVZkcOJ3cGt17+NKmw1
pj2SaRxMKOJswWnro69vMIa21atmll0kyMsq5pc4qhfKh9GflpdwdMpo5Wr+
E1VEL1+fXqFF6ePMUEi8ZVS0wX+KtsJYZ9KwTmZSli0VLkgB6b4JlA04Fd9o
xY6TsFNnHc7KW46Xfca544T9VP9cl7ccwiOlM5E3KUfbrUWPNccgN8YY+bUO
Bmdilkc0mtrHBM5c6.mzFkMXROZCuMsbgbmviGx8CkoTfQtWPQ23BLhIBYAF
8gbYRoJ7uA1YXtzBFA5X6u.yicrg3ABq1ZF+4dORfJfN1ijdd2iDXjaWAI4T
a.DvbrZuCeAIwBttxoQDXb9oB.TD4zxNZNHBnW7kkkSslTkOrmTNh2vxxhge
srrdsrrdsrrdsrrdsrrdsrrdsrrdsrrdsrrdsrrdsrrdsrrdsrrdsrrdsrrd
srrdsrrdsrrdsrrFWRBWt6gGjkc1l4i9WpR1nLlL5o6z59fIsTclWoXS86L3
TKM7jqb9M5u3Y5blQusfN5JVSOm9wbJEREDDAy4Dalk6cZTfzyCNKnyrBMjS
9eopOkQ6mnoQm8XWYW24ZdVPCdMsQIA+HI4CqqKkxQeXjvPNlkMmECvfGBS3
gpEfKevjPwgTBwrG9M5sbe.+rUDBNrkACgEzxf4bH6EF4Qw18SHWh8ChrNl4
CBI3x1SZPrAubJJyvAWTlfedOOcP.ro5icUQpg+.zA+re.5zfu5SHlSNfTXn
45HyACB9Qli098jNzb1Ol6rFUArveln.QO6mRNByXqyZffL6GKNP3y+whCpI
yCJ0amTjlyw4fi3kwwfydzlBOca2I7m6M2vCHwlHQfyz9MON9FU.GS6XPqYy
ZESVzu.+HlS5m45fOS7ruGZQYmcoTD3MMKdH0YsO.7NFstxPvc5eFjCmuPvg
fYXSBOY8ZK03qVOnE8gkGbxW.4g87vjFXa41BOGLMIAWmfKPDZtMnSl8C94y
lWFO3BbVK9zY9TeF97qajh87z3HzpJ6ELt4M5O2HVRww49sgfb5b.D2MoumK
468lwj6JJWaqHPnyQAe6ZjG8rdygO78Li3QWyAGRy8j6IL1GxKL.8jOTyifl
VpI3p5Yh3VQM8hYkflEjDYm7mK207.fjFtAzP8Tb.5IN3Vwb5EJgCQG4SOgA
gpmFhgfGhgIHdDJJutdh4iRjPvji7pmDgnmDdLNAggpmFBmnyhkTe5YAXNzK
5CuogsBdsVR8QYEFEBa1.ejrCQGA8QvNDrlXe7N.FDhG0Gg.VH5IuTgPCQOw
uY8jWN3DBOo7vLIRDnNZP2lBA6.wG8ChfP67QrkEBK+Dur7SlCSJd00Twr3p
sOB1gPqH8V0OXOj1ngf0jRFQz7WWO4CwiEBMHTluVIuVWLnd4.WHBSg4i8X8
xy7pwIlWxwgv1Eieynd9LNgCgjKyG6whPHO4CwKDRSb3MRoG2qLeEBFOtWYy
KHDOeT5wCAKNmcqTux42J+l3dkmPdn5I3MAmh8Mx6qUQtv2H3fg1UPuF1hmC
OAE93SeHbzV3SXJgPt1q36Bh0XevHVPR4J2mDrEhP7fdwIFBwZrWRah.HVC4
91SWc5I8g2CEjj65CNgHt0UccZLPTeRrQHX6QXOX6QwgpmFjbFBmPQHeG3tM
rHrYIY4TeF3Bg9JjWIJJDjSlONlhfgBmFT.iFpdZP6JmpuxVRDGsQDo6ji1.
hNZyG5zMdnyuoCc7FNjY8tY2sUNpLLNrNq2sNs3GMaUN2+8x7c15MY+F5SeR
wxGeHMKaUQVwI66R6qNj6rOscONZ+69KQfEnXBDJTjkEXHliXlqTWn2S9fGV
Psl1.22HBkDCP5WkvPbB0bk.goX8UfiZF5PeAfw1d.DK.D6UpaoKPttMKI+w
lsFHdm82lskE5UU89cLoE310uWxt5hGKSVm1rZh6uAX9tFNoR0SOdSt4tMJB
Z591zkyqcv361sJ4xC.8Hnw.BTWBufEhXJgyMWQPwbCUhxZ4Q8YfqEoNyXmm
zy9idh1wLAGPEB6UBB1YyZG8XTUmneWl9Gt8J6WZVF71upPkYJgJU65g5bDf
YgFHiirrcB..R5AL8IWDNEhM3MJFqFbLWQgpOT+FkrZkBT50TLxr8CptfrmJ
nkQHG0REnt43lxEJhkkZCfBg8J0WAg52VcElYZ38o4ZkQx1QYbLBxr.6dwNz
oCW81oFTJ79u1Jyi9wj7pneTtIcYQ1561q46JEH99j55KIPzcyH6W5nTt6Zw
UujzV.BD.8C+g56MMCLJ4Tqfhqq5wBoK3uNszTGupWUIZSshCJoHNtei98lr
s2MKvB28GQnZiQVsjDNvf7ZbGOr7oe5lNlCD4i3mYzd+utf7lcuQIPbz+7O8
uOBVYHCyHFjFxIr33lq.wuPXJ8S64yt2BcqeWsKm8Qcs6QF6LPDjxrniVTwn
ON1sEGn1kJqceMar4U46ct5xbzT.mYPY.ihg1AWUOi7vhCgAg.qOYwHFwPBo
wPk77EkFDmm29O840kEOJy+ICaRXXwMkZc5pQvlKvwbqlIbbihB8snSP2qCG
hgiqalL25wnTis8SuZR739vJeVMZ84hZz755pA3c8so8Y.iCo13uUlKeJIv7
peSmkCmaN1S8SPzxW.ZtBFL+dZ.q+X6N55KHf5aSRy+hmPkhwsU83oWEdPq5
ioJ4hKAbmvTCa32N7qYAlPun.p+iprz0629p8RE8YEOcGEmuJJF.mujsvg5g
PRu9N45wIG5hR4BKXm7xgDrUwyUl9oU0kYWAzGT.Ruld+xHX6FT8wKENj+yh
0xJu0J5o2ASCsNuM3fhvCqBY.EHAEZ1sY4kAFW5oN0n4L.Z+P5p5goUO6Lv+
fdequ9Em2F+3phsxwnwXBVd5lr6Xr9m4hW3mRV5s.iuHhKNmA0uDTjR4R+xQ
6.saK8Nh4XdUb7ScNVrBg6rWXPAqh6mCssffgl.IIMSqRHQo+xlAlXoanil+
bxSxGJJ27kwpZ9x7+fET2Bx2lvMVmT9guNWe5S70l.wu1Ywyct4fCv3D7bVD
77x4SNKHsJn3syAR6rgHLi2WR6fuoVtMuCdmX7fvo7qMa8n.PytjVH9TWLVx
imUiKljVP.ytlecyui4TOLUDmbw485ld1YUXYlTYxzNQdeSQwGtKXrbo4eXL
Ie0uYHxAi0gjfy.5Fz8p.j21V5tZbEHZ8B.a5hKOsDymByaxTRvrD01eOb9V
FlFMVZpyIia3V4LIGC2rSTOeo1DDwjOXiI6m0k1yysyRYZWIF1E5iyMBHlzT
qCwwXAu4Jtvi.DXslXos5n2K0DN7ZSxmXsGxZdfSGQf65kmCtQQ+3a3MBtvo
MaCACg6bfM3STdHLEE22XoVebvGG5ssGMatpJZmk9Su5Bpe4BcMl8NmWcC8B
vSk1AhqpuzgeihcOxz7Moacpoufm7H69K12jsSt+XTyyY3PWUh7lZa3fxn3f
qBpA.KK9X9ngvaiRRKD9u94jwCf.TL0VE.bNFirRaHnxurvCfeaoTNAHzP8Z
yoFjo81L7.22qzrkWmLZvSP.LlMaeLfB3LjRVLLlFdX7OKWOZ3i259JrMnJP
iEiPCe+2xrrhONdRHhZKhEtdq80.fTEW4f0mEMF.DyABf98HBXl9husHaBrH
so8fp7DhZqmntFACB.tcW41L4nbNk.r90fn.VrgGFInLjvSOz3s0tIARUpML
51zZOBpnYobkL8oKOyct3YZiT9.vA5VVhgA3rgnnYKFAkePFBmwCRT5Aiare
vapsckPPbPsjTIyWW8RyBxAh7kk8NpFi4P.2NgQDFHtYs.vgCW+e2FA15jsC
UI2cRF02kTWD8c5nPuKX8eUs7gcYY0CNAKGWyqBJEZyCLCyw15imojtXWf+U
uVW3lWk0FTfNCBsUFYPQpQiSHAWX8y.2xIiQw3Xw0fSO6w2W2dHF8BY8B7Qc
1b97k.I2IbCFGCrd+o3yHMSoMC.vv90.r6juAziSlXVwMCOJo7l4k379PfY9
XFbD0+dG44+XZY8mi92drXRhy1kNmYF7zeypsMCglMV4292d6+GPXqG87
-----------end_max5_patcher-----------

edit:
The last posts (before the bump) were correct in terms of timing. The orders of magnitude were such that it didn’t make much of a difference one way or another though.

tremblap · January 24, 2021, 11:16am

There are 2 things in this thread: what you want to reproduce in your context in a given manner, and the design consideration of the toolbox. The former is very clear so no need to re-explain, and seems well catered for by other solutions like entrymatcher. For the latter, I will attempt an explanation of what I meant by black-boxing vs a flexible data mining framework, and the affordances and limitations of what is a considered interface which might not be what you need. I even have many ideas of things you might try should you be interested to experiment with different approaches than what you get with a black-boxed solution, but that might not suit your investigations. More soon, I’m clearing the information with @groma and @weefuzzy to make sure nothing I say is wrong.

rodrigo.constanzo · January 24, 2021, 1:32pm

I still don’t understand the ‘black-box’ term here, particularly since, in this case, “the design consideration of the toolbox” limits what can be done. You can easily compare one fixed list of numbers to another in a (now slightly) faster way than brute force, without being (easily) able to vary, bias, or filter that search. So I would argue that even though it’s spread across multiple objects, the available paradigm itself is a “black box”. It can do this thing well, and beyond that it’s not so flexible.

I wouldn’t describe my intended use case here as so niche or peculiar as to be some kind of outlier that falls outside of the super wide scope of what the objects can do (as compared to some of my other investigations). I’m after fairly basic functions for any kind of database/dataset. A way of filtering and navigating a database. I can understand leveraging SQL terminology/queries as they already exist (although the syntax isn’t super “musicianly”) but even with that, SQL isn’t used in a real-time performance-centric context (I could be wrong here). It’s meant to filter and process large chunks of data, offline.

I’m curious as to what you have in mind. There’s a bunch of options we’ve discussed already, but from memory all of them side-step the “filter-per-query” problem.

p.s.
I want to clarify my tone here (since we’re reading text), as although I’m pushing back fairly firmly, it’s not meant to be dicky or sassy. I’m earnestly pushing for things that I think are missing and/or would be useful to (in my estimation) most use cases for a query-based way of navigating a dataset (in realtime).

tremblap · January 25, 2021, 9:53am

As stated above, I understand very well what you are trying to do, it is not far from where I come from… you saw and were inspired by Sandbox#3 after all. Take for granted that we hear your bottleneck in your use-case and that the copying of dataset might not even be the bottleneck, since optimisation requires thorough profiling which will happen in due time.

So for now, and for everyone’s benefit including yours I hope, as it opens a lot of other potentially exciting options, let me focus on the toolbox design decisions vs “black boxing” the process, and the advantages of both.

First, let me say that blackbox is a wrong metaphor. It is more about how open or closed an interface is. Both have advantages. For the rest of this message, the tl;dr is: if entrymatcher works for you, or any other querying object, keep on using it for now if speed is what you need most. If you want to understand its limits and why we do not do it that way and what a more granular interface offers, keep on reading.

The long version:

When we do query on a dataset, we have various consecutive processes to do on the said dataset that require it to be reliable in shape or form for the said sub-processes to be valid. Sorting, adding, pruning, querying are all destructive edits to the dataset integrity in the memory space. There are ways around this. An all-in-one approach can get away with doing them if all of them are within a single ‘blackbox’ object, but if one wants to explore the various subprocess impact, either by changing the type of process or their order (or many other options, see below) one needs to create a granularity of such processes.

To illustrate this, I will provide a clear example not far from your intention: a simple query that has many conditions.

Here is a dataset of 10000 items with loudness, pitch and centroid, give me the nearest item where loudness is within 6dB, pitch within a semitone, and centroid is within an octave.

This in effect needs to have
1 - a dataset at a given moment
2 - sieve it with absolute masks to reject the items not fulfilling the conditions
3 - sort the remaining items (if any) by calculating distance in a certain way (sorting or else)
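The three steps above can be sketched in a few lines of Python. This is only an illustration of the sieve-then-rank idea, not how any FluCoMa object or entrymatcher works: the function name `masked_nearest`, the data layout (a dict of id to loudness in dB, pitch in Hz, centroid in Hz), and the tolerance-normalised distance used for the final ranking are all assumptions made for the example.

```python
import math

def semitones(f_query, f_item):
    """Absolute pitch difference in semitones between two frequencies in Hz."""
    return abs(12.0 * math.log2(f_item / f_query))

def octaves(f_query, f_item):
    """Absolute difference in octaves between two frequencies in Hz."""
    return abs(math.log2(f_item / f_query))

def masked_nearest(dataset, query, max_db=6.0, max_semitones=1.0, max_octaves=1.0):
    """dataset: dict id -> (loudness_db, pitch_hz, centroid_hz); query: same tuple.

    Returns the id of the nearest item that passes all three masks, or None.
    """
    # Step 2: sieve with absolute masks, rejecting items outside any tolerance.
    survivors = [
        key for key, (loud, pitch, cent) in dataset.items()
        if abs(loud - query[0]) <= max_db
        and semitones(query[1], pitch) <= max_semitones
        and octaves(query[2], cent) <= max_octaves
    ]
    if not survivors:
        return None

    # Step 3: rank what is left. Each axis is scaled by its tolerance so that
    # dB, semitones and octaves are comparable (an assumption of this sketch).
    def distance(key):
        loud, pitch, cent = dataset[key]
        return math.sqrt(((loud - query[0]) / max_db) ** 2
                         + (semitones(query[1], pitch) / max_semitones) ** 2
                         + (octaves(query[2], cent) / max_octaves) ** 2)

    return min(survivors, key=distance)
```

Note that the dataset (step 1) is only read here, never modified, which is the "dataset at a given moment" condition in its simplest form.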

To allow changing stuff in there (the order, the conditions, the dataset itself, etc), one needs to know the state of each element. The all-in-one approach has that advantage. If you add an item to any of those, or renumber, or sort, the object will know and this can be made fast. This might be the desired behaviour when the order of tasks is known and curated: you can trust that the things you point at are yours only.

Now, if we separate the tasks, we lose such advantages, but we gain other advantages for musicianly investigations of data mining. For instance, in the same workflow above, here are a few ideas that are possible to try:

step 1 (aka dataset at a given moment)
1a. the same datasets can be used in many other tasks/processes (see the whole list of algorithms available)
1b. subdatasets can be pre-processed independently, removing outliers, scaling, extracting features, clustering, trending, etc

step 2 and step 3 can be swapped (you can do that in entrymatcher by changing the order of the query IIRC)

step 2 and step 3 can do conditional queries in programmatic ways. For instance:
2: Give me all items within a semi-tone. if none, make it a tone (or use centroid).
2: Give me all items with pitch confidence high, within a semitone. Depending on results, give me more or less tolerance
3: give me all items within a given Euclidean distance (after you have normalised/standardised/robust-scaled them) then give me the one that has the nearest pitch only
3: give me only the 10 nearest items, then check with binary masks (#2) if any are within the tolerance.
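As a rough illustration of the first conditional query in that list ("within a semitone; if none, make it a tone"), here is a small Python sketch. The names `within_semitones` and `widening_query`, and the layout of `items` as a dict of id to pitch in Hz, are hypothetical and chosen only for the example.

```python
import math

def within_semitones(items, query_hz, tolerance):
    """Return the ids whose pitch lies within `tolerance` semitones of query_hz."""
    return [key for key, hz in items.items()
            if abs(12.0 * math.log2(hz / query_hz)) <= tolerance]

def widening_query(items, query_hz, tolerances=(1.0, 2.0)):
    """Try each tolerance in turn (semitone, then tone by default).

    Returns (tolerance_used, matching_ids), or (None, []) if nothing matched.
    """
    for tol in tolerances:
        matches = within_semitones(items, query_hz, tol)
        if matches:
            return tol, matches
    return None, []
```

The same loop shape would work for the other variants (falling back to centroid, or loosening the tolerance depending on how many results come back) by swapping the predicate that is tried at each stage.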

Forking: if my query has high pitch confidence, give me material within a semitone (binary) then the nearest timbre. Then I will pitch- and loudness- correct the entry I’m given

Class: take that timbral space, make 100 classes. Sort them by their centroid. Then give me the nearest match where the timbral class is the same +/- 1, etc etc

MLP: find trends in the timbral space by using an autoencoder and shrinking it to 2d.

step 3: kdtree can be replaced by brute force sorting, or just sorting on one dimension, or whatever else.

I could go on for ages. These are possible in realtime at the moment with our interface, but not just-in-time at press-roll speed yet since we want to sort interface questions, then optimise. We are aware that there are improvements in how to make the workflow of how we move around these queries at the moment too.

Even more fun is possible now: one can experiment with all that data preprocessing and feed it to entrymatcher. Or one could try the order swapping in non-real-time with AudioGuide then code the workflow more dynamically to tweak in FluidLand. The granularity and json/dict interface allows for this. You can even do an entrymatcher sorter for more involved query down the line. The limits are quite endless (apart from my diary)

I hope this opens up possibilities of programmatic corpus manipulation/query for you. After all, I feel you are trying to find musical context for things and maybe there are other ways than single transient analysis.

p

rodrigo.constanzo · January 25, 2021, 3:27pm

Thanks for the detailed consideration and breakdown.

To have a fully dynamic/forking query system, it’s definitely great to be able to radically change around what you are searching for and what order you do it. The overall interface of having those different steps be different objects is a bit faffy, but powerful. I will definitely agree with that. In general I tend to prefer a tidier interface (“blackbox”), but I can see the overall design considerations that going function-per-object affords. Again, faffy interface, but powerful/functional.

A lot of what is being discussed here (re: copying) centers around this. I’m trying to think of the analogs to a buffer~-based system, where you have dirty flags and whatnot, which is problematic, but I guess the general idea is to know that data will be static while you’re doing stuff with it.

My initial suggestion for this (a fluid.datasetfilter~, or similar object) was that there would be a separate object for “simply” filtering through data without necessarily building new datasets. My initial thoughts on this were to do with interface (overall syntax and “the buffer problem” of having oodles of datasets), but a lot of what I suggest in that thread could apply here in that fluid.datasetfilter~ would specifically create an internal copy of the dataset to sort/query/filter/whatever without having to worry about data elsewhere getting fucked around with.

That does put a lot of eggs in one basket as what happens if you then want to fork, rescale, sort, etc… as @tremblap suggests above.

So a couple more spitballed ideas.

Having a “dirty” dataset type that is happy to be used as a “buffer” of sorts, for in-place/destructive edits where a user has to mind what they do. It can be cordoned off so as not to break/crash, but can obviously throw errors if you’re reading/writing at the same time in bad ways.
Having a @dirty (or whatever) flag for fluid.datasetquery~ where on loading/whatever, it creates an internal reference to the fluid.dataset~ that was loaded into it, and everything else is done on that internal version.
In general, I guess copying/sorting/filtering can be done via indices instead of complete(/large) datasets so that each step in the process does what it needs to do, but not by copying every single thing in order to do so.
Having some RT/offline distinction between dataset-based processes, just like there are fluid.bufversion~ and fluid.version~ of most algorithms. The fluid.buf~ version works as it does now, as an “offline” process that reads/writes datasets per step, and everything is safe and sound as you go. And a fluid.~ version which is more bare metal and works destructively, but quickly, and “you get what you get” in the same way the RT versions of objects presently work.
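The "indices instead of complete datasets" idea from the list above might look something like this in Python. It is a sketch under assumptions, not a proposal for how fluid.datasetquery~ should be implemented: the column-per-descriptor layout and the names `filter_indices` and `nearest_among` are invented for the example. Each step takes a list of row indices and hands a narrower list to the next step, so the underlying data is never copied or modified.

```python
def filter_indices(columns, indices, column, low, high):
    """Keep the row indices whose value in `column` lies in [low, high]."""
    values = columns[column]
    return [i for i in indices if low <= values[i] <= high]

def nearest_among(columns, indices, query):
    """query: dict of column name -> target value.

    Returns the row index (from `indices`) closest to the query, or None
    if `indices` is empty. Only the columns named in the query are compared.
    """
    def sq_dist(i):
        return sum((columns[name][i] - target) ** 2
                   for name, target in query.items())
    return min(indices, key=sq_dist, default=None)

if __name__ == "__main__":
    columns = {
        "loudness": [-30.0, -12.0, -10.0, -50.0],
        "centroid": [500.0, 1500.0, 3000.0, 1400.0],
    }
    loud_rows = filter_indices(columns, range(4), "loudness", -20.0, 0.0)
    print(nearest_among(columns, loud_rows, {"centroid": 2800.0}))
```

The trade-off mentioned earlier in the thread still applies: this is only safe if nothing else edits `columns` while a chain of index lists derived from it is in use.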

Lastly, I do imagine that things will speed up come proper optimization time (like the fluid.kdtree~ has done), but I don’t think that at any point, copying massive datasets, multiple times, per query, will ever be remotely “fast”.

rodrigo.constanzo · May 8, 2022, 8:31pm

Kind of random bump here, but I did a quick test/comparison today with the current code and I have to say that fluid.kdtree~ + coll is only a tiny amount slower than entrymatcher + lookup $1 when I want to know the data that corresponds with a given entry.

Screenshot 2022-05-08 at 9.28.54 pm

I still stand by all the stuff about being able to bias/weigh a query or something, but for use cases where I just want the “nearest match” and also need the relevant metadata/descriptors, I can stay in the fluid.verse~ by using fluid.kdtree~ + coll.
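In non-Max terms, the fluid.kdtree~ + coll pairing boils down to "find the id of the nearest point, then use that id as a key into a separate store of metadata". A minimal Python sketch of that shape follows; the linear scan in `nearest_id` is only a stand-in for the tree lookup, and all names and the example metadata fields are hypothetical.

```python
def nearest_id(points, query):
    """points: dict id -> list of floats. Return the id closest to query.

    A plain linear scan, standing in for the tree query.
    """
    return min(points,
               key=lambda k: sum((a - b) ** 2 for a, b in zip(points[k], query)))

def nearest_with_metadata(points, metadata, query):
    """Return (id, metadata[id]) for the nearest point: the 'tree + coll' shape."""
    key = nearest_id(points, query)
    return key, metadata[key]

if __name__ == "__main__":
    points = {"grain-1": [0.1, 0.2], "grain-2": [0.8, 0.9]}
    metadata = {"grain-1": {"start_ms": 0}, "grain-2": {"start_ms": 250}}
    print(nearest_with_metadata(points, metadata, [0.75, 0.95]))
```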

I’ll likely end up making some connective tissue for future bits to be able to go from fluid.dataset~s to entrymatcher more directly, but this will work for one of the use cases.

Here’s the test code as a point of reference:


----------begin_max5_patcher----------
3538.3oc6cssbihiF95zOEpRsycY7pyB1q18t8cXpoRgskSXZLvB3zGlZ6m8
UG.GvFhE1RNo2wcp1IVfQ5+S+5+rj+yOc28KK9pr9dv+.7af6t6O+zc2YZR2
vcsu+t62l70UYI0la69b4WJV9G2+f8RMxu1XZ96Yf5rzUR.p6R461llmIaLe
LbaioqM2s5I7q792YwtlCuUaSMeqTZGd2e+Cp+C981KWlzr54z7mdrRtpwdG
QhnEvG.rHwBXu+QYLUiL8kv3EPvuqe.+2O8I8KObYT8lphs0ea6xhrQoZzHT
MbbpFMIU6BECi0+RfBAMtpHKCzHqabkDwwiShz2bh8gYLASiFYBFtf0Q7sOg
5jWjqeTMjTOfGSZZpRWtqwxse2dL3t6kaWJMid3CssTp5xz5zhbcqbSiFrzS
Hp5wWJyWCxJJ97tRmgUpO4bPvX5BRu+QiTvGMht.R36+Ihfd.DGD1pFvRfyq
ZvyUVwxj7mNI6DQsfwPzviI5H9BAy+T8eTjl6p.RTjOmvaIVFmd0H1UE6xaj
UiRurQnWwLkajl2zWzg9stf.XxwH.+ChnyPI4rk14vnSL6+QQz4phsak4MGA
opwUUxSRmAzIT2BeCIiHJznMEOzLBLVnYdL7IvyjOYJppIUQFO4JQghOahhS
wGq5jhBBQQV.Jp.0kIaAIfk6xW8LnXC3+rSVkp3jZu6rzboQNQegfaJxa1jr
R1m9OInvdaPQ+LqS+t4YpI0dMmmr0z78+aY1KxlzUI2OMRJnrEXEnQLlbgDT
8unbOic3EfUUxjFInJIecwVPoRwQCP82F.7afuj17LHsw+vH85BiHLZQOARb
hfpEEEGDTEsGUS.qSZRpkMfMoYYx0V7rEqUT9RYk+YQ4WGnEAQVNTqKOb3Eg
kJ49ME4NKZh75.qRMnUp+eTlmrLS1mRczdFiQbuoQr3HinKB2JVl185YQoax
JTCLWoTFe9T5a5Z6PpcSQ01DCYxeqYZkmPGYDu.yGHiWI6WojWbQF3LSrg89
iMBNeAQgOwbkQNLDlxPSn96pBMDwGU1lO.XC5CJaiPojx2qnlRGkpspusUOv
jUfzZfpmo.k8gxZvlj5obmZDzLBOaqDGk3iniHNAEKtHaFmvgo5xJkYNa.ZE
yGBD+xlyBF3w9zK5wQHHeABF28CBoar0FFe6R4eWwP3Z.DnnYR5pkqIMym9E
5XJbHGBQYEBDGQIbFDEggwX+CFk.0.dsqbBTuF50NaOnP3BrhXYLp4GF9UOp
56QcokS9POnUleJUtz14wLpyO5jxxdMeWuOhFZ9iByCJ5g8MklaahruoJ4K6
cDGtu0jJEgznnhcUVKJ+Juyhe8ioXsrJeWpYnXaTMI0NjLSGZKQUtzYM70Lq
0c4WgFl00HH05wo09yH3qfgZ59orhUe9fXGTTJySyKqj0p09IMGL1uesbSxt
rlG6amLBuXzq2YZ9nWbu0z+qpzjr8DvSUoqKx0ChAyD5l65N0ztwnZ8quRLl
6HOobjOrMBJSbwZEQtqdYRkdhpUKGt6hMEEYCuz9OWlbSS6kKSyyO.EaJJm9
hUoO87a7YWVnt3125YatR8i6xsW8QEOQyi5nEM79RxxZWyN7w+0j7TkLcoVR
t0Om8Wzpo+45UUEYYCnW6UdYjqrVwiuR9kz0MOa5n9LCpaOsriI598yxqSeR
GPtAs0j7T8vVpa9lEz60ztksqgerQtsLSQECuA0piz5l5mK9Rc6M1wn0G.dM
aa8WS2W.3f1OPPnQ.290qGJzC16B8cJ6fOvwx9bv+qIkCZCDB2XO.A185fOV
msN8gKyXLes7q8j0zEqv65zMbNPz.cEC0Wn0VXyLvD.HZb.bu+LAAAYPq3Rn
M4kCsZvSfhRnc8qgMcHpXDoCJxkfluT.ZdtRJAaJ1UA1j9hDTm9Uk4XuHyAR
sriIgN73PWD+LgNWgMp0LKZbPvM6vZt7K6cxaZyueapBSZc3fCEwQj3HnvuK
tdU4kN5VSHRxL7zWebfpVwirpa5patFLbvoDC2jluWW9u8J9.bchZtiB8hUm
FEcC2fLJPW9fnswNCvbMmLyRymxLrkIYGntZLEidOI3akINGUQBalo+05HiC
4DbhPgDMLTHLdTW.+OaWXFJAtWNByjIUt5QGg6SWXhiYNFFHRPpdlumAdppX
hJfXDpGy8cECgfDtgxThcwsNvY+QoI9xRI77ha90Mr4iV6GL7vJCHlRu3PpO
oe6qJ2ocAvYFezIX7G2mvSFgiI8FzcYELNaAd3BF0hH1eo8+GacRNxjUltPE
RDAI..nErXFJhG28uagC3V3.9qR3.NO2SPms2IDpsnGHVklcuNo+Ho55+5kj
LPZNXasNJ9UxxhpF4ZfBOjumwBXI.MWbC6lqrNIdxwvELs7oIURYj1ZKTV7A
IoI7npRu9Jsf8Y6wbng1gJ9uHrsMv1L90Fb+UkDs4F+E5OC.aqXEpQrBAaJj
6qHtdFhBX+DHJfF8ynj.9OCLrTglGMrBBNqP8KBZn9sx+nBG09qenfZkAn0.
aAtZpeQqE.JiAZsMXTi.vu2.YTHAR6Jy4hiIZTqEHeSv65EvWGCxIKfAZU33
X.Gvwfqw8VoLFfBUXucbLDxfdScbLfB3XfLi4hPMFvtlGj20X+6VXTd+xPvb
1qTQv.rWoHbxBTufByo5hchsf89GJbQruCENEGYs9lt.E85OXcTfYAY2Q0sM
HaqwuZWiUtvyaDRaJ.HJ5+HBGACRhOJSV8YKY691kVHBAUSOnzEgDgtdFe22
vmhnPrgO2OWyEGuvFQBxJ6Akwpxptxjpz5hbc8d.gfHmKdQ1L2ZfYo062cjc
+s42uoP.HwriX00un.u+GJwTxuGWTim+1Db5zb0+haSJK2e4yWAyT4dUWyK+
B.s.72PlZew9231Zfw9NhsVXrugZqIF6aXlZiw9271Zjw9NgsVYZeWjyY2MJ
DKwwQwKv8krQ06iw1cLSfDoaX5ck01qEncGGLDOhbMRHnYOs.OhdMVfqABib
eJ6+qVg2T7zSYNuqfELOlt8SUhIBDeQzgkLeaYdEjrqqnmpBvfc97aaYA4dO
WfAiRxwzEz94IGx2G5LruA.2KsHdTnJsnw2.ZWwJK5y4xjJkWelSf.y9G14i
lGuZqofINFHF0vS0qgxxykNyPPOSKOOuUEHAdAA2WRL1tqxQJ2wgDDOhFIXz
3qVkHwYWyRQZbLgfGYOgeYxJmIHfulfPDBe7BDFmbbcE8Qobr3j2+xwZb4qZ
IHGV8h2pFqaUi0spw5V0XcqZrtUMV2pFqaUi0spw5V0XcqZrtUMV2pFqaUi0
spw5V0XcqZrtUMV2pFqSDjvk61rQV8iWSl.3eVmrUoLw4LbxhO2TsbxHnxMY
PgggGuCVQWXNllJ6JaRa.0oaKyjL8QIpqQNk40b7yItlZknfjosJ0vUB5Rzj
yX.02Eu13oXZrCy9nv78.Q1tz0K975lJo7GNudfbcfgwR0u8XK12vvu5d4rQ
maJFOY5ViEGmIoQq8EbbHpyqFPp6IYjN2rN2dT2e57oMJJLxoa.OHE6VoT94
ebh7MSFANh8+AU3X.w308GJHkB0tumBbtxuXST9xDmR57CmYQnfU1Yfi6evU
h1eNTEfysRHD57QWIy+GckQT5wBGGspeoAQGQ6gKtFFbFEvdVH4nXvnBIEgo
lTLmuqNqif62CrS3jUFHA+NqNj36YZrZAOahx4+CfBPBLPJ.aoaLTbrnMNNb
57rtBLKsdHr2kw0MqO1FWAA4u654Pn.qmCEarqAwiNtppneDTrE4+4b6WvCX
kmMGIE+CilLeaseGQS4GYY6Ep4ZlmaWwWyBEiu+qEkPUJXyRHNNTdwnKJRcJ
HUJpNpN.oWlPbubL7IPdMHN1Y0QsKIHqe6BkWOcVyNXdHXfCl2neq7gByWwk
cesCjrdsEM9k0mTg9XAzwqlqZWCLJaARHBkpb6tVwY45z.qO2xKDCOdqCfv3
vEIu1ubn9wIiv6XqL79gTYKmvnaJOF78+TIU3U+z3QQiUWuC22DDsEsSHcz7
vGlXUKTbXhiZAjiSfHxlYv1w3DYtqWJLOBzcsmzeU49ZtolpmPdnmzeuCexd
pa3bQ8Di5BMA8POo2xb.xI5IhOllbgjhn9hjNEqG1G8jKbdTlOHIW37FNZJp
VaqMe3k0yNzwh2sdV69euo588L5h5YgKyqloDzENup2h9mjU0KKJDNMQJ7QO
ItVKzEbWjRF4idh4B54CwjBWDSJ7BMQbA87gzKAxkkx9nibgyyGFB3j.BeXb
C2kksDuXbiKfGyGr3bWVLw8g.BtKVBvIiq8Bc4c8IgS30hHQ7Pnhl6hHDlOD
gvwtvyP7fJYVz0Zc.yE0WTjOnI10RQIyk4IlOjWwbQohtnH7u8gN003frhiA
cQhlHDRzLcM9T.tWXhbgHCD9dkrTg5hg+Lru5oStNwGF9SEtth7h6I90ZZxk
U5buPRvqkWStXt.kGBIHTGDff8g.DhSAlwGcjSdU6kdxkEWDeXpEge05IWj1
x39nmvtFLoKtmPtZT2E2StnkT3qN5jLDdA7bYVh6AChwQtFgOuaggScMKDhe
wtXyAxGrlXmDg3CI8Hm5IerJ.4ByI0GNzibI0YhXe0SmRiLwGzDyoHE3CEkX
mB1n3ZkTlCFN9ZorKrHnHOHoD4hcGHurT1EAi3qjAvGDEnQk8ayQ9AGqU5N8
fiypCNJqN9XrZ5ivpCO9pL6dxoN1p9z+8S+O.TIMA5.
-----------end_max5_patcher-----------