Descriptor-based sample navigation via dimensional reduction(ish) (proof-of-concept)

True!

:warning: Caveats:

  • @groma knows better than me
  • Per https://en.wikipedia.org/wiki/K-d_tree#High-dimensional_data, how well a KDTree performs at a given dimensionality depends on how much stuff is in there: you need many more data points than 2 to the power of your number of dimensions for it to beat a brute-force search.
    I’m not sure what ‘many more’ really means, but at least an order of magnitude, I’d guess. So for 10 dimensions, at least ~10k entries; for 6 dimensions, more than ~640 points; etc. (There’s a rough benchmark sketch after this list.)
  • I don’t know how well SC’s KDTree really scales up. I think development may have stopped before all the bits were done. Certainly, I started to make it wheeze trying even 3D radius searches at one point.
  • Beyond a certain point, Approximate Nearest Neighbours can be Good Enough :tm: at potentially much lower cost, but that’s not yet available for any creative coding platform, AFAIK.
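For a rough feel of those numbers, here’s a quick Python sketch, with scipy standing in for SC’s KDTree (the point counts and dimensions are arbitrary choices, just to see where the tree stops paying off):

```python
# Compare exact k-d tree queries against brute force at a few
# dimensionalities and dataset sizes (arbitrary choices).
import time
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

for dims in (3, 6, 10):
    for n_points in (1_000, 100_000):
        data = rng.random((n_points, dims))
        queries = rng.random((50, dims))

        tree = cKDTree(data)            # construction cost, paid once
        t0 = time.perf_counter()
        tree.query(queries, k=1)        # exact nearest neighbour
        tree_time = time.perf_counter() - t0

        t0 = time.perf_counter()
        cdist(queries, data).argmin(axis=1)   # brute-force comparison
        brute_time = time.perf_counter() - t0

        print(f"{dims}D / {n_points} points: "
              f"tree {tree_time:.4f}s vs brute {brute_time:.4f}s")
```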

Thanks. I was getting really nice and quick results using the NearestN UGen, which uses an altered KDTree data structure (and which looks like it might be an ANN implementation?). Maybe that is the best route.


Man, I didn’t even know it was there, such is my SC n00bness.

Looking at the source, I think any extra speed is because it’s compiled C running on the server; it uses structures generated by KDTree AFAICS, so it will still be exact nearest neighbours, but it saves time by only ever retrieving up to the three nearest things. As I understand it (which is very little), Approximate NN works by generating a special hash for each entry that encodes similarity somehow <waves hands, hopes for no follow-up questions>. (https://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf)
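To make the hand-waving slightly less hand-wavy: the Slaney tutorial is about locality-sensitive hashing, which you can sketch with random hyperplanes. A toy Python illustration (my own, nothing to do with NearestN’s internals):

```python
# Toy locality-sensitive hashing with random hyperplanes: each entry
# gets a short bit signature, and nearby vectors tend to land in the
# same bucket, so a query only searches its own bucket rather than
# everything. Purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
dims, n_bits = 10, 8                          # arbitrary sizes
planes = rng.standard_normal((n_bits, dims))  # random hyperplanes

def signature(v):
    # one bit per hyperplane: which side does v fall on?
    return tuple((planes @ v) > 0)

data = rng.standard_normal((5000, dims))
buckets = {}
for i, v in enumerate(data):
    buckets.setdefault(signature(v), []).append(i)

q = rng.standard_normal(dims)
candidates = buckets.get(signature(q), [])    # usually a small subset
if candidates:
    # approximate: the true nearest neighbour may sit in another bucket
    best = min(candidates, key=lambda i: np.linalg.norm(data[i] - q))
    print(best, np.linalg.norm(data[best] - q))
```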

Anyway, if NearestN is doing what you need, then This Is Good. (And it gives a useful precedent for us to look at how matching could work server-side for the next toolbox.)

I did not know either, although the magic here is the KDTree. Back to the yes/no question: fewer dimensions will always be faster, and in some cases it may help to discard useless information (say, a descriptor that is almost always zero), but not necessarily (e.g. if a descriptor is kind of random, it may still be harmful).
Yet the most important cost is in the construction of the tree, which in many cases you can do in advance.
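In sampler terms, that shape is: prune dead descriptor columns, pay the tree construction once during preparation, and keep only cheap lookups for performance time. A Python sketch (scipy again; the variance threshold is a made-up number for illustration):

```python
# Drop near-constant descriptor columns, build the tree once in
# advance, then query cheaply many times.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
descriptors = rng.random((20_000, 8))
descriptors[:, 3] = 0.0                 # an "almost always zero" descriptor

keep = descriptors.var(axis=0) > 1e-6   # discard useless dimensions
reduced = descriptors[:, keep]

tree = cKDTree(reduced)                 # the expensive part, done once

for _ in range(5):                      # at performance time: cheap
    query = rng.random(reduced.shape[1])
    dist, idx = tree.query(query, k=3)  # three nearest, NearestN-style
    print(idx, dist)
```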

So I’m starting to look at these and I have a few questions/clarifications for more experienced people, i.e. @weefuzzy and @groma, or anyone who has some insight. Please correct me where appropriate.

  • t-SNE is mostly concerned with distances between points. I don’t need to standardise the data, but normalisation is important if I want the features from each analysis to be equally weighted… correct?

  • should I standardise/normalise the input data for Isomap? Are the concerns basically the same as in point 1?

In our experience standardizing does not make a difference for the algorithms we tried. I guess the explanation would depend on the algorithm, but they all have to deal with the different distributions and scales of each dimension anyway.

Okay thanks Gerard! I will stick to normalising so that the weighting is uniform.
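To spell out what “normalising so that the weighting is uniform” looks like in practice, a Python sketch with made-up descriptors and sklearn’s t-SNE (an illustration, not the actual patch here):

```python
# Min-max normalise each descriptor column to [0, 1] so no one feature
# dominates the distances t-SNE works from. The descriptors are made up.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
features = np.column_stack([
    rng.uniform(-60, 0, 500),      # loudness-ish, in dB
    rng.uniform(0, 10_000, 500),   # centroid-ish, in Hz
    rng.uniform(0, 1, 500),        # flatness-ish, already 0..1
])

# Raw Euclidean distances here are dominated by the Hz column; after
# scaling, every column contributes on an equal footing. (Real data
# would want a guard against constant columns.)
mins, maxs = features.min(axis=0), features.max(axis=0)
normed = (features - mins) / (maxs - mins)

embedding = TSNE(n_components=2, random_state=0).fit_transform(normed)
print(embedding.shape)   # (500, 2) points to navigate
```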

Bit of a necro bump here, but now that I’m working through the Kaizo Snare blog post properly, I’m putting up and filming relevant bits of stuff. So I’ve cleaned up this code some and made a video demoing the audio-based version of this.


wooohoooo! Looking forward to reading it!

Man, it’s gonna be a long one. Not sure if it will be as long as the Improv Analysis or Cut Glove ones, but I have so much media and so many examples I filmed along the way, it feels endless…

Obvious idea: break it into chapters, on vision/tech/trials/gig etc.? It’s 2020, nobody likes long reads :slight_smile: Cheeky humour aside, I am looking forward to reading your blog as usual! I’m sure @jacob.hart is also looking forward to it…

Yeah, obviously done that.

Not doing formal subsections, since I don’t like the look of them, but each section gets fairly deep into details and stuff.
