Removing outliers while creating classifiers

Revisiting this now with some new goggles, prompted by more recent issues related to this (dealing with classes as a chunk inside a larger dataset/context).

Now that the interface is “done”, poking at individual classes seems like a pretty friction-ful endeavour, requiring a lot of dumping and processing of data in dicts and/or colls, nearly to the point that I think it may be more useful to treat fluid.dataset~ / fluid.labelset~ as storage containers where the data ends up at the end, rather than as the place you put data initially as you go.

That being said, I’ve been thinking that doing something like this (removing outliers from a data/labelset) would be beneficial both to the quality of the classification and to data hygiene overall (bad/stray hits messing things up).

So in my case I don’t want to transform/scale the data at all (I’ve gotten more accurate results with raw data), but I do want to remove outliers such that I keep just the central 90% (or 95%, probably chosen based on overall data size, so the smaller the training set, the less aggressively I remove outliers).
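As a rough sketch of that size-dependent trimming, something like this, where the thresholds are completely made-up placeholders just to illustrate the idea:

```python
def trim_fraction(n_examples: int) -> float:
    """Fraction to trim from each tail; shrinks as the class gets smaller.
    The cutoffs here are arbitrary placeholders, not anything principled."""
    if n_examples < 20:
        return 0.0      # too few hits to call anything an outlier
    if n_examples < 100:
        return 0.025    # keep the central 95%
    return 0.05         # keep the central 90%
```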

What would be the best way to go about doing this? As in, start with a fluid.dataset~ and fluid.labelset~ and then, for however many classes are in the labelset, completely remove the entries from both the dataset and labelset that aren’t within the central 90% of their respective individual label (not of the overall dataset).
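To make that concrete, here’s roughly the kind of offline number-crunching I have in mind, as Python operating on the JSON that write produces from fluid.dataset~ / fluid.labelset~ (I’m assuming the usual layout of a "data" dict mapping identifiers to value lists, with labelset values as single-element lists):

```python
import json
import numpy as np

def per_class_bounds(dataset_path, labelset_path, low=5.0, high=95.0):
    """Group dataset rows by label, then compute per-dimension
    percentile bounds for each class (not for the overall dataset)."""
    with open(dataset_path) as f:
        data = json.load(f)["data"]      # {identifier: [v0, v1, ...]}
    with open(labelset_path) as f:
        labels = json.load(f)["data"]    # {identifier: ["label"]} (assumed layout)

    by_class = {}                        # label -> list of (identifier, row)
    for ident, row in data.items():
        by_class.setdefault(labels[ident][0], []).append((ident, np.asarray(row)))

    bounds = {}
    for label, entries in by_class.items():
        rows = np.stack([row for _, row in entries])
        bounds[label] = (np.percentile(rows, low, axis=0),
                         np.percentile(rows, high, axis=0))
    return by_class, bounds
```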

Based on the discussion and limitations uncovered in the thread about interpolation, I now have a bit of a loop that will iterate through a dataset this way, but no way to crunch numbers on it. fluid.robustscale~ will kind of do what I want, but it transforms the data in the process.
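In the meantime, I figure the percentiles fluid.robustscale~ works from (its @low / @high attributes, if I understand it right) can be computed directly and used purely as a filter, leaving the data untouched. Continuing the sketch above, this flags any entry that falls outside its own class’s bounds in any dimension (worth noting that per-dimension trimming can remove more than 10% of a class overall):

```python
def offending_ids(by_class, bounds):
    """Identifiers whose row falls outside its class's percentile
    bounds in at least one dimension."""
    bad = []
    for label, entries in by_class.items():
        lo, hi = bounds[label]
        for ident, row in entries:
            if np.any(row < lo) or np.any(row > hi):
                bad.append(ident)
    return bad
```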

How would I find the identifiers of the entries where the criterion isn’t met?
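And here’s how I imagine the removal step going, continuing the same sketch (the file names are placeholders, and I’m assuming the dumps contain nothing beyond cols and data that would need updating when entries are dropped):

```python
def prune(path, out_path, bad_ids):
    """Remove offending entries from a dataset or labelset dump; passing
    the same bad_ids to both files keeps them in sync."""
    with open(path) as f:
        d = json.load(f)
    for ident in bad_ids:
        d["data"].pop(ident, None)
    with open(out_path, "w") as f:
        json.dump(d, f)

by_class, bounds = per_class_bounds("ds.json", "ls.json", low=5, high=95)
bad = offending_ids(by_class, bounds)
prune("ds.json", "ds_pruned.json", bad)
prune("ls.json", "ls_pruned.json", bad)
```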

Lastly, I’ve assumed it wouldn’t impact things since I’m just building a classifier from the data/labels, but will having gaps in the data/labels mess things up down the line?