Removing outliers while creating classifiers

So I’m revisiting the JIT-classifier thing from a while back (thread) and one of the improvements I’d like to make is to be able to disregard bad training examples.

Say I’m creating a classifier by recording loads of hits on the drum, and by mistake I hit something I shouldn’t, or one of the hits happens to be a whole lot louder (or brighter, etc…) than the others. It would be great to be able to take the labels/entries and remove the outer-most 5% of outliers, or whatever, if everything else is within a tighter cluster.
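To make the idea concrete, here’s roughly the logic I mean, sketched in Python rather than as a patch (the 5% cutoff and the distance-from-the-class-centroid measure are just placeholders for illustration, not anything FluCoMa currently exposes):

```python
import numpy as np

def trim_outliers(features, labels, keep=0.95):
    """Drop the training examples furthest from their class centroid.

    features: (n_examples, n_dims) array of descriptor values
    labels:   length-n_examples array/list of class labels
    keep:     fraction of each class to retain (0.95 = drop the outer 5%)
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    keep_mask = np.zeros(len(labels), dtype=bool)

    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        centroid = features[idx].mean(axis=0)
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        # keep everything closer than the `keep` percentile of distances
        cutoff = np.percentile(dists, keep * 100)
        keep_mask[idx[dists <= cutoff]] = True

    return features[keep_mask], labels[keep_mask]
```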

Now, I’m mainly thinking pragmatically here, where “wrong hits” would presently mean having to clear and start the training again, which, if you’re a load of hits in, can be a bit annoying. But I would imagine this may also be useful for improving the overall classifier.

I guess some of this may be possible presently with fluid.datasetquery~ (not sure), but we at least have a way to manipulate fluid.dataset~s, whereas fluid.labelset~ is a bit more monolithic in its interface: you can delete single points (cough), but as far as I know there is no conditional stuff you can do.

I guess you could do something to a fluid.dataset~ and then iterate through the corresponding fluid.labelset~ entries to remove them one-by-one.

But the interface gets a bit clunky on that.
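Conceptually the loop is just something like this (plain Python dicts standing in for fluid.dataset~ / fluid.labelset~ keyed by identifier, and is_outlier being a hypothetical test, not a real object):

```python
def prune_entries(dataset, labelset, is_outlier):
    """Remove flagged entries from the dataset and labelset together,
    so the two stay in sync."""
    bad_ids = [key for key, point in dataset.items() if is_outlier(point)]
    for key in bad_ids:
        del dataset[key]
        del labelset[key]
    return dataset, labelset
```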

Thoughts?

I was just dealing with this. I don’t think it quite answers your question, but what I did was create a 2D PCA, then looked at the plot and could easily see the outliers. Then I went through the set, looked for all points with x>0.7 or y<0.2, and removed them.
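In Python-ish terms the workflow was roughly the following, with sklearn standing in for the fluid objects, a 0–1 rescale assumed so the thresholds mean something, and the 0.7 / 0.2 values just being what happened to separate the outliers on my plot:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def find_outlier_ids(features, ids):
    """Project the descriptors down to 2D, scale each axis to 0-1,
    and flag whatever sits outside the main cluster."""
    projected = PCA(n_components=2).fit_transform(np.asarray(features, dtype=float))
    projected = MinMaxScaler().fit_transform(projected)
    # thresholds read off the plot by eye, not computed
    return [i for i, (x, y) in zip(ids, projected) if x > 0.7 or y < 0.2]
```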

Interesting.

A bit faffy if you’re doing a lot of classes, but it could be quite a powerful way to sift through things.

Oh, I can show you faffy code. I have piles of it. You know what they say about big data: it takes a billion lines of faff to get a turd diamond.


Did you use FluidDataSetQuery for this?


Oh maaaaaaaannnnn. When did that show up?

May 21st. Alpha02.

It’s not the fastest at all jobs, and it will be optimised as soon as we confirm its interface, but it’s quite powerful.

It would be good to have some native-ish way to do this that didn’t require manually finding the outliers for each dimension and then pruning them (or doing a reduction thing, which can also impact the perceptual clustering).
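A per-dimension IQR (interquartile range) test would be one way to make that automatic without eyeballing thresholds or going through a reduction first. Again, this is just a Python sketch of the statistics, not an existing FluCoMa object:

```python
import numpy as np

def iqr_outlier_mask(features, factor=1.5):
    """Flag an entry if any of its dimensions falls outside
    [Q1 - factor*IQR, Q3 + factor*IQR] for that dimension."""
    features = np.asarray(features, dtype=float)
    q1 = np.percentile(features, 25, axis=0)
    q3 = np.percentile(features, 75, axis=0)
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    return np.any((features < lo) | (features > hi), axis=1)
```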