FluidKDTree Woes

I have a very large data set of 150,000 points. I am loading this data set into a FluidDataSet, then converting that to a KDTree, and then searching that KDTree with elements that I know are in the set. 90% of the time it gives me the same incorrect answer: 143100. The other 10% of the time it gives me the correct answer. So it is always either that one exact wrong answer or the correct answer, nothing else.

Here is the SC code:

~set = FluidDataSet(s, "synth", 4);
~set.read("/Users/spluta/Documents/SC/FluCoMA/synthAnalysis/synthAnalysisDataSet.json", {"done".postln});

~setKDTree = FluidKDTree.new();
~setKDTree.fit("synth", {"done".postln});

x = Buffer.read(s, "/Users/spluta/Documents/SC/FluCoMA/synthAnalysis/synthAnalysisVarious.aif");
x.loadToFloatArray(action:{arg vals; x = vals.clump(4)});

(
var rand;

{
    a = Buffer.alloc(s, 4, 1);
    b = Buffer.alloc(s, 4, 1);
    "index into analysis: ".post;
    rand = x.size.rand.postln;
    "array at analysis: ".post;
    a.setn(0, x[rand].postln);
    s.sync;
    ~setKDTree.kNearest(a, 5, {|thePoint|
        "nearest points in KDTree: ".post;
        thePoint.postln
    });
    ~set.getPoint(rand.asSymbol, b);
    s.sync;
    b.loadToFloatArray(action:{arg vals;
        "the same point in the FluidDataSet that the KDTree is made from: ".post;
        vals.postln
    });
}.fork
)

Here is a sampling of the output:

index into analysis: 50163
array at analysis: [ 0.26637315750122, 0.18424999713898, 0.12581014633179, 0.3000465631485 ]
nearest points in KDTree: [ 143100, 90600, 90530, 135525, 98030 ]
the same point in the FluidDataSet that the KDTree is made from: FloatArray[ 0.26637315750122, 0.18424999713898, 0.12581014633179, 0.3000465631485 ]
-> a Routine
index into analysis: 42947
array at analysis: [ 0.10542738437653, 0.12532222270966, 0.039602041244507, 0.062839388847351 ]
nearest points in KDTree: [ 42947, 44072, 54197, 56072, 111577 ]
the same point in the FluidDataSet that the KDTree is made from: FloatArray[ 0.10542738437653, 0.12532222270966, 0.039602041244507, 0.062839388847351 ]
-> a Routine
index into analysis: 51697
array at analysis: [ 0.15139389038086, 0.13631534576416, 0.037152528762817, 0.17713236808777 ]
nearest points in KDTree: [ 143100, 90600, 90530, 135525, 98030 ]
the same point in the FluidDataSet that the KDTree is made from: FloatArray[ 0.15139389038086, 0.13631534576416, 0.037152528762817, 0.17713236808777 ]
-> a Routine
index into analysis: 149183
array at analysis: [ 0.42514955997467, 0.092134237289429, 0.0041602849960327, 0.20018994808197 ]
nearest points in KDTree: [ 143100, 90600, 90530, 135525, 98030 ]
the same point in the FluidDataSet that the KDTree is made from: FloatArray[ 0.42514955997467, 0.092134237289429, 0.0041602849960327, 0.20018994808197 ]
-> a Routine
index into analysis: 6947
array at analysis: [ 0.27852559089661, 0.18011999130249, 0.14657354354858, 0.18381690979004 ]
nearest points in KDTree: [ 6947, 127493, 3577, 44988, 44243 ]
the same point in the FluidDataSet that the KDTree is made from: FloatArray[ 0.27852559089661, 0.18011999130249, 0.14657354354858, 0.18381690979004 ]

While I am at it, is there a time frame for the release of a server-side .kr version of FluidKDTree lookup? Do you maybe have one already that you could share?

I figured this out. The buffer I was using to look into the KDTree was not yet allocated when I was trying to set its values.
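
In other words, a minimal sketch of the fix inside the fork above: sync with the server after Buffer.alloc and before setn touches the buffers.

a = Buffer.alloc(s, 4, 1);
b = Buffer.alloc(s, 4, 1);
s.sync; // without this, setn writes into buffers the server hasn't created yet
a.setn(0, x[rand].postln);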


you beat me to it! Is it working with the 150k entries?

@weefuzzy and @groma were talking about similar ideas at the same time as you were writing this… we are still torn between 2 archetypal usages (batch processing and real-time), trying to find an interface that is sensible and allows interaction between the two… and that works in both the Max and SC paradigms!

It is working. It was just…out of sync…ohhhhhhhh! Basically, it was mostly trying to find an array of all 0’s, but every now and then would actually look for the correct array. Pretty dumb.

In SC, the .kr version could work for both batch processing and live, but the current version will only work for batch, even though it is clearly designed to work for live (otherwise why go through all this messy trouble with the buffers and such). I would love to try that as soon as you have it, as the thing I am working on won’t really work without it.

NearestN already does this in SC. It works exactly as this FluCoMa version could.

Sam

I agree about getting this working for live, but the other part of the answer is that one can’t yet write language-side extensions in C++.

Is that Dan Stowell’s one? IIRC, the way this works is to build the tree language-side and then transfer it to the server somehow, but the querying, yes, does happen on the server. However, we’re stuck with having the data structure on the server.


Right, but that is a good thing, isn’t it? The data structure is already on the server. So we just need a UGen that does what kNearest is already doing, but in real time. It shouldn’t spit out buffers; it should spit out a kr stream of indices.

Thought about this some more. Buffer output is actually great as long as there is a FluidBufToKr object that streams a buffer that is being written to out as a kr signal…which, now that I think of it, is basically a PlayBuf, haha…
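
Something in that direction, sketched with BufRd.kr (assuming a 2-slot result buffer called ~predictPoint that something else is writing to), would turn the buffer’s current contents into a kr stream:

(
fork {
    ~predictPoint = Buffer.alloc(s, 2, 1); // hypothetical 2-slot result buffer
    s.sync; // make sure it exists before the synth reads it
    {
        var stream = BufRd.kr(1, ~predictPoint, Array.iota(2)); // one kr channel per buffer slot
        Poll.kr(Impulse.kr(4), stream); // inspect the stream 4 times a second
        Silent.ar
    }.play;
}
)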

Good thinking there. TBH, the only real impediment to getting a kr-matching KDTree into your eager paws is to do with plumbing code. These new objects are a bit real-time and a bit non-real-time simultaneously, which requires some (hopefully small) internal tweaking.

We can experiment with outputs, I guess, although a straightforward k-channel kr stream does seem the obvious starting point.


Definitely in the air - we’re working hard on the SC interface right now, stay tuned!


Sorry if I missed this somewhere, but what is it that the KDTree returns when the query happens in real time?

I have the buffer filling, trig pitching and catching all working. Super slick. It seems to be returning the vector of the point that it found as being closest; what I really want, though, is the ID of the audio that that analysis vector came from. Is there a way to get at this right now?

Thanks! Hope you guys are getting some rest!

The third argument to the KDTree is the DataSet that it looks into and gives you results from. So, you need to have a dataset with the same labels as your KDTree, but those labels point to the data you want to populate the outBuffer with.

That sentence could have been taken from page 400 of Dianetics, so let me know if that doesn’t make sense.


For this one, ~ds holds all the indices into my big buffer, so when I get the NN from the KDTree, it gives me indices. But that DataSet could hold anything, as long as its labels are the same as the KDTree’s.

~ds = FluidDataSet.new(s, \randomName);
~tree = FluidKDTree.new(s, 1, ~ds); // third argument: the dataset whose points get returned
~ds.read("indices.json");
~tree.read("datasetTree.json");

~tree.size
~ds.size

(
~tree.inBus_(~pitchingBus);
~tree.outBus_(~catchingBus);
~tree.inBuffer_(~inputPoint);
~tree.outBuffer_(~predictPoint);
{
    var trig = Impulse.kr(4); // can go as fast as ControlRate.ir/2
    var point = 2.collect{ TRand.kr(0, 1, trig) };
    point.collect{ |p, i| BufWr.kr([p], ~inputPoint, i) }; // write the query point into the tree's input buffer
    // Poll.kr(trig, point);
    Out.kr(~pitchingBus.index, 1); // drive the tree's input bus
    Poll.kr(In.kr(~catchingBus.index), BufRd.kr(1, ~predictPoint, Array.iota(2))); // read the result buffer, triggered by the tree's output bus
    Silent.ar;
}.play(~tree.synth, addAction: \addBefore);
)

In the latest helpfile I made a deliberate example to show how a new (reference) dataset can be provided, which is dumb in this case (the label is an int which is also the returned value). It is pedagogically useful to see that, yet quite useless in practice, since having labels as numbers on the server is not very useful. What we usually want is the nearest neighbour’s values (or a subset thereof), hence the clever possibility, thought of by @groma and @weefuzzy, of providing another dataset holding whatever you want returned.

Does that help?


I see. This is clever. Thanks both @spluta and @tremblap

I’m still kind of confused as to how the KDTree is able to find the nearest neighbor if it no longer has the original points’ vector information. Or does it, in this case? Is that what is loaded in Sam’s ~tree.read?

So the ~tree.read loads the tree information (including the original points’ vectors?), but the ~ds which is passed contains the information that the tree will return?

Hi @tedmoore,

The tree still matches against the original data, but this is a question of what gets returned. Internally, the tree is mapping input vectors to labels. This mapping remains against the space the tree was fitted on (because the tree copies when it fits); all that happens with the lookup dataset for RT is that the label from the tree is then turned back into a vector by retrieving the point for that label from a (any) dataset. By default, this should be (is) the dataset that the tree was fitted against (but after reloading, it’s possible the object doesn’t know which dataset instance that was anymore…)
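
If it helps, here is a sketch of that two-step lookup using the message-style calls from earlier in the thread (kNearest, then getPoint); ~tree, ~lookupSet and the 4-d query values are placeholders for whatever your own setup uses:

(
fork {
    var query = Buffer.alloc(s, 4, 1);
    var result = Buffer.alloc(s, 1, 1);
    s.sync; // buffers must exist before we write to them
    query.setn(0, [0.26, 0.18, 0.12, 0.3]); // some 4-d point to search for
    s.sync;
    ~tree.kNearest(query, 1, { |labels| // step 1: vector -> nearest label(s)
        fork {
            ~lookupSet.getPoint(labels[0].asSymbol, result); // step 2: label -> stored point
            s.sync;
            result.loadToFloatArray(action: { |vals| vals.postln });
        }
    });
}
)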
