Removing outliers while creating classifiers

I’m just passing through, so I’ll restrict myself to a high-level tl;dr answer, the very abbreviated version of which is that (musician-friendly) tools for model evaluation are the biggest omission in the FluCoMa data stuff at the moment, in part because the musician-friendliness bit is hard – there’s some unavoidable technicality in model evaluation and comparison, and a great risk of creating the impression that certain magic numbers can do the job. Unfortunately, they can’t: there is always a degree of subjectivity and context sensitivity to this. (That said, some of your colleagues in PRISM have started rolling some evaluation stuff for themselves – maybe they can be persuaded to share?)

So, the thing with ‘outliers’ is that (barring an actual objective definition) the problem they create is generally an overfitting one: i.e. as atypical points in the training data (that you wouldn’t expect to see in test or deployment), they exert too great an influence on the model training, making it less generally useful. There are different ways to try and deal with this: cross-validation, especially leave-one-out cross-validation, can be used to try and diagnose problematic data like this. In principle, one could make an abstraction for doing this: it basically involves training and testing a bunch of models on different subsets of the data, but it will be fiddly in the absence of easy methods for generating splits of datasets algorithmically and getting some evaluation metrics on held-out data.
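In case it helps to see the shape of it, here’s a minimal leave-one-out sketch outside of FluCoMa, in plain Python/scikit-learn, assuming you’ve exported your features and labels into arrays (the data and the classifier here are stand-ins of my own, not any FluCoMa object):

```python
# Leave-one-out cross-validation sketch (plain Python/scikit-learn, not FluCoMa):
# hold out one point at a time, train on the rest, and note which points the
# model keeps getting wrong when held out – those are candidates for a closer look.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(40, 13)           # stand-in: 40 points of 13 features
y = np.random.randint(0, 3, 40)      # stand-in: labels for 3 classes

suspects = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])
    if model.predict(X[test_idx])[0] != y[test_idx][0]:
        suspects.append(int(test_idx[0]))

print("misclassified when held out:", suspects)
```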

Model regularisation*, meanwhile, tries to make models more ‘robust’ to outliers (so the MLP objects could be augmented with some, limited, regularisation control). A hacky thing to try, though, might be to ‘augment’ your training data by adding noise to it, which, if you squint and are generous of spirit, can have some regularising effects.
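The noise-augmentation hack is easy enough to sketch outside of FluCoMa too. Purely illustrative Python/numpy (the function name and the `scale` value are mine, not anything in the toolkit): make a few jittered copies of each training point and train on the lot.

```python
# Hacky data augmentation: jitter each training point with a little Gaussian
# noise, which loosely regularises whatever classifier you train afterwards.
import numpy as np

def augment_with_noise(X, y, copies=3, scale=0.01, seed=None):
    """Return the original data plus `copies` noisy duplicates of each point.
    `scale` is the noise standard deviation relative to each feature's spread."""
    rng = np.random.default_rng(seed)
    spread = X.std(axis=0, keepdims=True)
    X_aug = [X] + [X + rng.normal(0.0, scale * spread, X.shape) for _ in range(copies)]
    y_aug = [y] * (copies + 1)
    return np.vstack(X_aug), np.concatenate(y_aug)

# X_big, y_big = augment_with_noise(X, y)  # then train on the augmented set
```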

So, to the Qs specifically:

  1. there is no such general method that doesn’t involve at least some thinking about what ‘outlier’ means in a given case. Rank-order statistical things like median and IQR aren’t a magic bullet (there’s a rough sketch of the IQR idea after this list, for what it’s worth).
  2. there will be a certain amount of dumping data and iterating whatever you do, and that’s probably unavoidable until a communal ‘we’ zero in on some approaches to this that make sense for musical workflows, and we can abstract stuff away…
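To show why point 1 isn’t a magic bullet, here’s what a median/IQR-style flag looks like as a plain Python/numpy sketch (again, not a FluCoMa object, and the usual 1.5 multiplier is itself an arbitrary convention): it hands you candidates to inspect, not a verdict.

```python
# Rough IQR-based outlier flag, per feature/column.
import numpy as np

def iqr_outlier_mask(X, k=1.5):
    """Boolean mask of rows where any feature falls outside the whiskers
    [Q1 - k*IQR, Q3 + k*IQR]. `k` = 1.5 is convention, not gospel."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return np.any((X < low) | (X > high), axis=1)

# mask = iqr_outlier_mask(X); X[~mask] is the 'cleaned' data – but whether
# the flagged points are really junk is still a judgement call.
```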

* so called because it’s there to try and encourage models to be more sceptical of ‘irregular’ data
