Removing outliers while creating classifiers

So I’m revisiting the JIT-classifier thing from a while back (thread) and one of the improvements I’d like to make is to be able to disregard bad training examples.

Say I’m creating a classifier by creating loads of hits on the drum, and by mistake I hit something I shouldn’t, or one of the hits happens to be a whole lot louder (or brighter, etc…) than the others. It would be great to be able to take the labels/entries and remove the outer-most 5% of the outliers or whatever if everything else is within a tighter cluster.

Now, I’m mainly thinking pragmatically here, where “wrong hits” would presently mean having to clear and start the training again, which if you’re a load of hits in, can be a bit annoying. But I would imagine this may also be useful for improving the overall classifier.

I guess some of this may be possible presently with fluid.datasetquery~ (not sure), but we at least have a way to manipulate fluid.dataset~s, whereas fluid.labelset~ is a bit more monolithic in its interface. You can delete single points (cough), but as far as I know, there is no conditional stuff you can do.

I guess you could do something to a fluid.dataset~ and then iterate through the corresponding fluid.labelset~ entries to remove them one-by-one.

But the interface gets a bit clunky on that.

Thoughts?

I was just dealing with this. I don’t think it quite answers your question, but what I did was create a 2D PCA, then looked at the plot and could easily see the outliers. Then I went through the set, looked for all points with x > 0.7 or y < 0.2, and removed them.
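For anyone who wants to try the same trick outside of a Max patch, here’s a rough numpy/scikit-learn sketch of that workflow; the feature array is random placeholder data and the thresholds are just the hand-picked ones mentioned above, so treat it as an illustration rather than a recipe:

```python
import numpy as np
from sklearn.decomposition import PCA

# placeholder descriptors: one row per hit, one column per descriptor
features = np.random.rand(100, 13)

# project to 2D only to *find* the outliers; the data we keep stays untouched
xy = PCA(n_components=2).fit_transform(features)

# hand-picked thresholds after eyeballing the 2D plot (x > 0.7 / y < 0.2 above)
outliers = (xy[:, 0] > 0.7) | (xy[:, 1] < 0.2)

cleaned = features[~outliers]   # original, unscaled rows minus the flagged ones
print(f"removed {outliers.sum()} of {len(features)} points")
```

The nice part is that the PCA is only used to spot the outliers by eye; what you keep is still the original high-dimensional data.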

Interesting.

A bit faffy if you’re doing a lot of classes, but it could be quite a powerful way to sift through things.

Oh, I can show you faffy code. I have piles of it. You know what they say about big data - it takes a billion lines of faff to get a turd diamond.


Did you use FluidDataSetQuery for this?


Oh maaaaaaaannnnn. When did that show up?

May 21st. Alpha02.

Not the fastest in all jobs, and will be optimised as soon as we confirm its interface, but quite powerful.

It would be good to have some native-ish way to do this that didn’t require manually finding what the outliers are for each dimension and then pruning them (or doing a reduction thing, which can also impact the perceptual clustering).

Revisiting this now with some new goggles, given more recent issues related to this (dealing with classes as a chunk inside a larger dataset/context).

Now with the interface being “done”, it seems like poking at individual classes is a pretty friction-ful endeavour, requiring a lot of data dumping/processing in dicts and/or colls, nearly to the point that I think it may be more useful to treat fluid.dataset~ / fluid.labelset~ as storage containers where the data ends up at the end, rather than as the place you put data initially as you go.

That being said, I’ve been thinking that doing something like this (removing outliers from a data/labelset) would be beneficial both to the quality of the classification and to data hygiene overall (bad/stray hits messing things up).

So in my case I don’t want to transform/scale the data at all (I’ve gotten more accurate results with raw data), but I do want to remove outliers such that I keep just the central 90% (or 95%, probably based on overall data size, so the smaller the training set, the fewer outliers I remove).

What would be the best way to go about doing this? As in, start off with a fluid.dataset~ and fluid.labelset~ and then, based on the arbitrary number of classes in the labelset, completely remove entries from both the dataset/labelset that aren’t within the central 90% of each respective individual label (not the overall dataset).
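To make the goal concrete (and hedged: this is just one possible reading of “central 90% per label”, done in Python on dicts that only mimic what a dumped fluid.dataset~ / fluid.labelset~ pair roughly looks like, with placeholder values throughout):

```python
import numpy as np

# stand-ins for a dumped dataset/labelset pair (layout assumed, values random)
data = {f"hit-{i}": list(np.random.rand(13)) for i in range(60)}
labels = {k: "snare" if i < 30 else "kick" for i, k in enumerate(data)}

keep_fraction = 0.90                      # central 90% per class
tail = (1 - keep_fraction) / 2 * 100      # i.e. the 5th / 95th percentiles

keep = set()
for label in set(labels.values()):
    ids = [k for k, v in labels.items() if v == label]
    block = np.array([data[k] for k in ids])        # rows of this class only
    lo = np.percentile(block, tail, axis=0)         # per-column lower bound
    hi = np.percentile(block, 100 - tail, axis=0)   # per-column upper bound
    inside = np.all((block >= lo) & (block <= hi), axis=1)
    keep.update(k for k, ok in zip(ids, inside) if ok)

data = {k: v for k, v in data.items() if k in keep}      # prune both structures
labels = {k: v for k, v in labels.items() if k in keep}  # with the same ids
```

One caveat with this per-column reading: with lots of columns, requiring every column to sit inside its central 90% usually throws out a fair bit more than 10% of the hits.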

Based on the discussion and limitations found in the thread about interpolation I now have a bit of a loop that will iterate through a dataset this way, but don’t have a way to crunch numbers on it. fluid.robustscale~ will kind of do what I want, but it transforms the data in the process.

How would I find out the indices of entries where the criteria isn’t met?

Lastly, I have assumed that it wouldn’t impact things if I’m just building a classifier from the data/labels, but will having gaps in the data/labels mess with stuff down the line?

It seems like I’m on a thread bumping roll at the moment…

/////////////////////////////////////////////////////////////////////////

Rather than burying the lede I’ll open with the TLDR questions:

  1. What do I do to determine what an outlier is in a higher-dimensional space without transforming the space in the final result?
  2. How do I go about actually doing that removal? (if possible, short of manually dumping/iterating each row and repacking at the end (e.g. some kind of fluid.robustscale~ dump hack or something))

(unpacked questions/thinking/context below)

/////////////////////////////////////////////////////////////////////////

So with regards to doing this, I wonder if it’s possible to leverage some of the “hacks” that @tremblap initially shared in this thread about biasing queries.

I’m wondering if the output of fluid.robustscale~ in particular may be useful for this. Taking the example on the first tab of the fluid.robustscale~ helpfile, it dumps out a dict that looks like this:

{
  "cols" : 2,
  "data_high" : [ 3161.112060546875, 0.097521238029003 ],
  "data_low" : [ 0.0, 0.0 ],
  "high" : 75.0,
  "low" : 25.0,
  "median" : [ 1086.87158203125, 0.0 ],
  "range" : [ 3161.112060546875, 0.097521238029003 ]
}

So in my case if I want to keep something like 95% variance I could change the attributes to @low 2.5 @high 97.5 or something like that, which would then report back these values.

Would it then be a matter of iterating through all the data and, if the first column of an entry falls above or below median ± range, deleting it?
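Purely as a sketch, one way that check could look, reading data_low / data_high as the per-column values sitting at the @low / @high percentiles (which is what the dump above suggests, since range = data_high - data_low):

```python
import numpy as np

# per-column bounds copied from the dumped dict above
data_low = np.array([0.0, 0.0])
data_high = np.array([3161.112060546875, 0.097521238029003])

# unscaled entries keyed by identifier -- placeholder values in the same 2 columns
rows = {"0": [2500.0, 0.05], "1": [4000.0, 0.01], "2": [100.0, 0.2]}

# flag anything outside the bounds on *any* column
to_delete = [k for k, v in rows.items()
             if np.any((np.array(v) < data_low) | (np.array(v) > data_high))]
print(to_delete)   # here: "1" (first column too high) and "2" (second column too high)
```

That way the scaler is only used to find the bounds, and nothing ever gets rescaled.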

That feels like it would go funny for higher-dimensional (e.g. MFCC) data, as I’m specifically not trying to scale the data here, only remove outliers.

So with that said, perhaps “variance” isn’t the correct word here. I’ll just state what I want in case the terms are wrong.

Intended use case:
-creating a classifier by giving it x amount of examples of a given class (typically 50+)
-taking the resultant dataset/labelset pair, and then removing outliers in case there were stray hits, or hits that were otherwise anomalous

So does that mean I want things that are x distance away from the mean of each individual column?

Or does it necessitate something like what @spluta suggested last year where I take it down to fewer dimensions (UMAP/PCA) then remove things based on how far from the lower dimensional mean (still keeping the original higher-dimensional data)?
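For what it’s worth, a sketch of a middle ground between those two: collapse each hit to a single number, its distance from its own class’s centre, and trim on that, while keeping the original rows untouched (placeholder data, and the usual caveat that on unscaled descriptors the distance will be dominated by whichever columns have the biggest ranges):

```python
import numpy as np

features = np.random.rand(60, 13)                  # raw descriptors, one row per hit
labels = np.array(["snare"] * 30 + ["kick"] * 30)  # placeholder labels
keep_fraction = 0.90

keep = np.zeros(len(features), dtype=bool)
for label in np.unique(labels):
    idx = np.where(labels == label)[0]
    block = features[idx]
    centre = np.median(block, axis=0)               # per-class centre (median = robust)
    dist = np.linalg.norm(block - centre, axis=1)   # one distance per hit
    cutoff = np.percentile(dist, keep_fraction * 100)
    keep[idx[dist <= cutoff]] = True                # keep the closest ~90% of each class

cleaned, cleaned_labels = features[keep], labels[keep]
```

Because only the distance gets thresholded, this removes roughly the stated percentage per class regardless of how many dimensions there are, and nothing in the kept data is transformed.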

/////////////////////////////////////////////////////////////////////////

I was/am still a bit concerned about this, but something tells me that in order to do the stuff above, I will have to dump/iterate through all the data outside of a fluid.dataset~, so I will probably just end up having to manually pack/label everything when putting things back together, closing up the gaps (e.g. if I remove entry 4 out of a dataset with 10 entries, entry 5 will then be renamed as entry 4, then 6 to 5, etc…). It will be a bit annoying to do that to both the data and the labels, but with what I’ve had to do for other patches, it doesn’t seem as insurmountable as it once did.
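If the renumbering does turn out to be needed, it’s at least mechanical; a tiny sketch, again assuming the dumped-dict layout, with the surviving identifiers as placeholders:

```python
# filtered dicts keyed by the surviving original identifiers (placeholder values)
old_data = {"0": [0.1, 0.2], "2": [0.3, 0.4], "5": [0.5, 0.6]}
old_labels = {"0": "snare", "2": "snare", "5": "kick"}

new_data, new_labels = {}, {}
for new_id, old_id in enumerate(sorted(old_data, key=int)):
    new_data[str(new_id)] = old_data[old_id]       # e.g. entry 2 becomes entry 1
    new_labels[str(new_id)] = old_labels[old_id]   # labels renumbered in lockstep
```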

I’m just passing through, so I’ll restrict myself to a high-level tl;dr answer, the very abbreviated version of which is that (musician-friendly) tools for model evaluation are the biggest omission in the flucoma data stuff at the moment, in part because the musician-friendliness bit is hard – there’s some unavoidable technicality in model evaluation and comparison, and a great risk of creating the impression that certain magic numbers can do the job. Unfortunately, they can’t: there is always a degree of subjectivity and context sensitivity to this. (That said, some of your colleagues in PRISM have started rolling some evaluation stuff for themselves – maybe they can be persuaded to share?)

So, the thing with ‘outliers’ is that (barring an actual objective definition) the problem they create is generally an overfitting one: i.e. as atypical points in the training data (that you wouldn’t expect to see in test or deployment), they exert too great an influence on the model training, making it less generally useful. There are different ways to try and deal with this: cross-validation, especially leave-one-out cross-validation, can be used to try and diagnose problematic data like this. In principle, one could make an abstraction for doing this: it basically involves training and testing a bunch of models with different subsets of the data, but it will be fiddly in the absence of easy methods for generating splits of datasets algorithmically and getting some evaluation metrics on held-out data.
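To make that a bit more concrete (purely illustrative and nothing FluCoMa-specific: placeholder data, and a kNN standing in for whatever classifier is actually being trained), a leave-one-out loop is only a few lines in something like scikit-learn:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(60, 13)                     # placeholder descriptors
y = np.array(["snare"] * 30 + ["kick"] * 30)   # placeholder labels

# every point gets held out once; a 0 for a given fold means the model trained
# on everything else couldn't classify that held-out point
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                         X, y, cv=LeaveOneOut())
print(scores.mean(), "accuracy over", len(scores), "folds")
```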

Model regularisation*, meanwhile, just tries to make models more ‘robust’ to outliers (so the MLP objects could be augmented with some, limited, regularisation control). A hacky thing to try, though, might be to ‘augment’ your training data by adding noise to it, which, if you squint and are generous of spirit, can have some regularising effects.
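And the noise hack, for completeness, is about as simple as it sounds; a sketch where the noise scale is an arbitrary placeholder that would need tuning by ear/eye:

```python
import numpy as np

X = np.random.rand(60, 13)                     # placeholder training descriptors
y = np.array(["snare"] * 30 + ["kick"] * 30)

noise_scale = 0.01 * X.std(axis=0)             # small, relative to each column's spread
X_aug = np.vstack([X, X + np.random.randn(*X.shape) * noise_scale])
y_aug = np.concatenate([y, y])                 # noisy copies keep their original labels
```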

So, to the Qs specifically

  1. there is no such general method that doesn’t involve at least some thinking about what ‘outlier’ means in a given case. Rank-order statistical things like the median and IQR aren’t a magic bullet.
  2. there will be a certain amount of dumping and iterating whatever you do, and that’s probably unavoidable until a communal ‘we’ zero in on some approaches to this that make sense for musical workflows, and we can abstract stuff away…

* so called because it’s there to try and encourage models to be more sceptical of ‘irregular’ data


Mucho helpful response!

Yup yup. I think for the general classification stuff I’ve been doing it’s worked well enough, as the sounds are, generally speaking, distinct enough that the small amount of junk in there doesn’t really mess things up. Obviously cleaner and more accurate stuff would be better there, but it was mainly when working on interpolation (between classes) more recently that I felt something like this was necessary, since I’m basically “interpolating” by navigating nearest neighbors, so junk around the edges of the zones has a bigger impact.

Exciting! Now to find out what those magic numbers are…

I did think about this a bit after making this post, as there could be cases where the data is all nice and tidy with nothing out of line (very unlikely obviously), so the “outlier” becomes a bit more conceptual. I imagine this is comfortably in the "it depends"™ territory, but is throwing out 5-10% of the entries of every class based on a more generic metric (distance from mean/median or something) “bad”? (For context, I’m often giving around 50-75 examples per class, though this can be as low as 10-15 with quicker trainings.)

I’m certain there will be nuance in refining things past that point, but at the moment I’m doing nothing, and surely doing something is better than doing nothing…


I imagined this would be the case. Boy howdie is it faffy to do something to a dataset based on the information in a labelset! I held out some small hope that it would be possible to hijack robustscale’s output to prune things rather than rescaling things and call it a day.

Cool, have dropped Sam (Salem) a line to see what’s up. I am intrigued.