Removing outliers while creating classifiers

It seems like I’m on a thread bumping roll at the moment…

/////////////////////////////////////////////////////////////////////////

Rather than burying the lede I’ll open with the TLDR questions:

  1. How do I determine what counts as an outlier in a higher-dimensional space, without transforming the space in the final result?
  2. How do I then go about actually removing those outliers? (if possible, short of manually dumping/iterating each row and repacking everything at the end, e.g. some kind of fluid.robustscale~ dump hack or something)

(unpacked questions/thinking/context below)

/////////////////////////////////////////////////////////////////////////

So with regards to doing this, I wonder if it’s possible to leverage some of the “hacks” that @tremblap initially shared in this thread about biasing queries.

I’m wondering if the output of fluid.robustscale~ in particular may be useful for this. Taking the example on the first tab of the fluid.robustscale~ helpfile, it dumps out a dict that looks like this:

{
"cols" : 2,
"data_high" : [ 3161.112060546875, 0.097521238029003 ],
"data_low" : [ 0.0, 0.0 ],
"high" : 75.0,
"low" : 25.0,
"median" : [ 1086.87158203125, 0.0 ],
"range" : [ 3161.112060546875, 0.097521238029003 ]
}

So in my case, if I want to keep something like 95% variance, I could change the attributes to @low 2.5 @high 97.5 or something like that, which would then report back the corresponding values.

Would it then be a matter of iterating through all the data and, if an entry’s value in a column falls outside those reported bounds (below data_low or above data_high), deleting that entry?

That feels like it would go funny for higher-dimensional (e.g. MFCC) data, as I’m specifically not trying to scale the data here, only remove outliers.
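To make that concrete, this is roughly the logic I have in mind, sketched in Python rather than Max just so it reads linearly (assuming a fluid.dataset~ dump in the usual { "cols": …, "data": { id : [values] } } shape, plus the data_low / data_high lists from a robustscale dump set to @low 2.5 @high 97.5; the function name is just made up for the example). Nothing gets scaled, entries just get kept or dropped:

def remove_outliers(dataset_dump, data_low, data_high):
    # keep an entry only if every column sits inside the percentile bounds
    kept = {}
    for entry_id, values in dataset_dump["data"].items():
        if all(lo <= v <= hi for v, lo, hi in zip(values, data_low, data_high)):
            kept[entry_id] = values
    return {"cols": dataset_dump["cols"], "data": kept}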

So with that said, perhaps “variance” isn’t the correct word here. Let me just state what I actually want, in case I’m using the wrong terms.

Intended use case:
-creating a classifier by giving it x amount of examples of a given class (typically 50+)
-taking the resultant dataset/labelset pair, and then removing outliers in case there were stray hits, or hits that were otherwise anomalous

So does that mean I want to remove entries that are more than some x distance away from the mean of each individual column?

Or does it necessitate something like what @spluta suggested last year, where I take it down to fewer dimensions (UMAP/PCA) and then remove things based on how far they sit from the lower-dimensional mean (while still keeping the original higher-dimensional data)?
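If it’s the latter, I picture the logic going something like this (again a rough Python/numpy sketch with made-up function and parameter names, not the actual patch): reduce to a couple of dimensions, measure how far each entry sits from the centre of the cloud down there, drop the furthest few percent, and keep the untouched high-dimensional rows for whatever survives.

import numpy as np

def remove_outliers_reduced(dataset_dump, n_dims=2, keep_fraction=0.95):
    ids = list(dataset_dump["data"].keys())
    X = np.array([dataset_dump["data"][i] for i in ids], dtype=float)

    # quick-and-dirty PCA via SVD on the centred data
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    low_d = Xc @ Vt[:n_dims].T

    # distance of each entry from the low-dimensional centroid
    dist = np.linalg.norm(low_d - low_d.mean(axis=0), axis=1)
    cutoff = np.quantile(dist, keep_fraction)

    # keep the original high-dimensional rows for the surviving ids
    kept = {i: dataset_dump["data"][i] for i, d in zip(ids, dist) if d <= cutoff}
    return {"cols": dataset_dump["cols"], "data": kept}

Presumably I’d want to run that per class/label rather than over the whole dataset at once, so each hit only gets judged against other examples of the same class.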

/////////////////////////////////////////////////////////////////////////

I was/am still a bit concerned about this, but something tells me that in order to do the stuff above I’ll have to dump/iterate through all the data outside of a fluid.dataset~, so I’ll probably just end up having to manually pack/label everything when putting things back together with the gaps closed up (e.g. if I remove entry 4 out of a dataset with 10 entries, entry 5 then gets renamed to entry 4, 6 to 5, etc…). It will be a bit annoying to do that to both the data and the labels, but with what I’ve had to do for other patches, it doesn’t seem as insurmountable as it once did.
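If it does come to that, the bookkeeping itself is maybe not too bad. Something like this (Python sketch again, assuming the labelset dump uses the same { "cols": …, "data": … } layout as the dataset dump, and that kept_ids is whatever survived the outlier pass) would renumber both so the pair stays in sync:

def repack(dataset_dump, labelset_dump, kept_ids):
    # rebuild both dumps with fresh contiguous ids so data and labels stay paired
    new_data, new_labels = {}, {}
    for new_index, old_id in enumerate(kept_ids):
        new_id = str(new_index)
        new_data[new_id] = dataset_dump["data"][old_id]
        new_labels[new_id] = labelset_dump["data"][old_id]
    return (
        {"cols": dataset_dump["cols"], "data": new_data},
        {"cols": labelset_dump["cols"], "data": new_labels},
    )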