Removing outliers while creating classifiers

So I’m revisiting the JIT-classifier thing from a while back (thread) and one of the improvements I’d like to make is to be able to disregard bad training examples.

Say I’m creating a classifier by recording loads of hits on the drum, and by mistake I hit something I shouldn’t, or one of the hits happens to be a whole lot louder (or brighter, etc…) than the others. It would be great to be able to take the labels/entries and remove the outermost 5% of outliers or whatever, if everything else is within a tighter cluster.

Now, I’m mainly thinking pragmatically here, where “wrong hits” would presently mean having to clear and start the training again, which if you’re a load of hits in, can be a bit annoying. But I would imagine this may also be useful for improving the overall classifier.

I guess some of this may be possible presently with fluid.datasetquery~ (not sure, but we at least have a way to manipulate fluid.dataset~s), whereas fluid.labelset~ is a bit more monolithic in its interface: you can delete single points (cough), but as far as I know there is no conditional stuff you can do.

I guess you could do something to a fluid.dataset~ and then iterate through the corresponding fluid.labelset~ entries to remove them one-by-one.

But the interface gets a bit clunky on that.

Thoughts?

I was just dealing with this. I don’t think it quite answers your question, but what I did was create a 2D PCA, then looked at the plot and could easily see the outliers. Then I went through the set, looked for all points with x>0.7 or y<0.2, and removed them.
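
Roughly, the logic was something like this, sketched here in Python/sklearn terms purely to illustrate (the 0.7 / 0.2 thresholds are just what looked right for that particular set, and the data/identifiers below are stand-ins):

# rough sketch of the 2D-PCA-then-threshold pruning idea
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

data = np.random.rand(100, 13)                    # stand-in for the dumped fluid.dataset~ rows
ids = [f"entry-{i}" for i in range(len(data))]    # stand-in identifiers

# project to 2D and scale to 0-1 so the eyeballed thresholds mean something
proj = MinMaxScaler().fit_transform(PCA(n_components=2).fit_transform(data))

# anything past the thresholds gets treated as an outlier and removed
outlier_ids = [ids[i] for i, (x, y) in enumerate(proj) if x > 0.7 or y < 0.2]
kept = data[[i for i in range(len(data)) if ids[i] not in outlier_ids]]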

Interesting.

A bit faffy if you’re doing a lot of classes, but could be a quite powerful way to sift through things.

Oh, I can show you faffy code. I have piles of it. You know what they say about big data - it takes a billion lines of faff to get a turd diamond.


Did you use FluidDataSetQuery for this?


Oh maaaaaaaannnnn. When did that show up?

May 21st. Alpha02.

Not the fastest in all jobs, and will be optimised as soon as we confirm its interface, but quite powerful.

It would be good to have some native-ish way to do this that didn’t require having to manually find what the outliers are for each dimension and then pruning them (or doing a reduction thing, which can also impact the perceptual clustering).

Revisiting this now with fresh goggles, and with more recent issues related to this (dealing with classes as a chunk inside a larger dataset/context).

Now, with the interface being “done”, it seems like poking at individual classes is a pretty friction-ful endeavour, requiring a lot of data dumping/processing in either dicts and/or colls, nearly to the point that I think it may be more useful to treat fluid.dataset~ / fluid.labelset~ as storage containers where the data ends up at the end, rather than as the place you put data into initially as you go.

That being said, I’ve been thinking that doing something like this (removing outliers from a data/labelset) would be beneficial both to the quality of the classification and to data hygiene overall (bad/stray hits messing things up).

So in my case I don’t want to transform/scale the data at all (I’ve gotten more accurate results with raw data), but I do want to remove outliers such that I keep just the central 90% (or 95%, probably scaled to the overall data size, so the smaller the training set, the fewer outliers I remove).

What would be the best way to go about doing this? As in, start off with a fluid.dataset~ and fluid.labelset~ and then, based on the arbitrary number of classes in the labelset, completely remove entries from both the dataset/labelset that aren’t within the central 90% of each respective individual class (not the overall dataset).
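
In rough Python terms, purely to illustrate the logic I’m imagining (assuming the dataset/labelset have been dumped out as dicts, and using distance-from-the-class-median as the criterion):

import numpy as np

# stand-ins for the dicts you'd get from dumping fluid.dataset~ / fluid.labelset~
data = {f"hit-{i}": np.random.rand(13).tolist() for i in range(150)}
labels = {k: ["kick", "snare", "hat"][i % 3] for i, k in enumerate(data)}

keep_fraction = 0.90                                  # keep the central 90% of each class

to_remove = []
for cls in set(labels.values()):
    ids = [k for k, v in labels.items() if v == cls]
    pts = np.array([data[k] for k in ids])
    dists = np.linalg.norm(pts - np.median(pts, axis=0), axis=1)   # distance from the class centre
    cutoff = np.quantile(dists, keep_fraction)                     # 90% of the class sits below this
    to_remove += [ids[i] for i in np.flatnonzero(dists > cutoff)]

for k in to_remove:                                   # prune dataset and labelset together
    del data[k]
    del labels[k]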

Based on the discussion and limitations found in the thread about interpolation I now have a bit of a loop that will iterate through a dataset this way, but don’t have a way to crunch numbers on it. fluid.robustscale~ will kind of do what I want, but it transforms the data in the process.

How would I find out the indices of entries where the criterion isn’t met?

Lastly, I have assumed that it wouldn’t impact things if I’m just building a classifier from the data/labels, but will having gaps in the data/labels mess things up down the line?

It seems like I’m on a thread bumping roll at the moment…

/////////////////////////////////////////////////////////////////////////

Rather than burying the lede I’ll open with the TLDR questions:

  1. What do I do to determine what an outlier is in a higher-dimensional space without transforming the space in the final result?
  2. How do I go about doing that transformation? (if possible short of manually dumping/iterating each row and repacking at the end (e.g. some kind of fluid.robustscale~ dump hack or something))

(unpacked questions/thinking/context below)

/////////////////////////////////////////////////////////////////////////

So with regards to doing this, I wonder if it’s possible to leverage some of the “hacks” that @tremblap initially shared in this thread about biasing queries.

I’m wondering if the output of fluid.robustscale~ in particular may be useful for this. Taking the example on the first tab of the fluid.robustscale~ helpfile, it dumps out a dict that looks like this:

{
  "cols" : 2,
  "data_high" : [ 3161.112060546875, 0.097521238029003 ],
  "data_low" : [ 0.0, 0.0 ],
  "high" : 75.0,
  "low" : 25.0,
  "median" : [ 1086.87158203125, 0.0 ],
  "range" : [ 3161.112060546875, 0.097521238029003 ]
}

So in my case if I want to keep something like 95% variance I could change the attributes to @low 2.5 @high 97.5 or something like that, which would then report back these values.

Would it then be a matter of iterating through all the data and if the first column for each entry is > or < than median + range, I would delete it?
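
To illustrate the kind of pruning I mean (a rough sketch, assuming the dumped data_low / data_high arrays are the per-column values at whatever @low / @high percentiles I set):

import numpy as np

# per-column bounds copied from a fluid.robustscale~ dump (here at the default 25/75 percentiles)
low = np.array([0.0, 0.0])
high = np.array([3161.112060546875, 0.097521238029003])

entries = {"hit-0": [1200.0, 0.01], "hit-1": [9999.0, 0.5]}   # stand-in dumped rows

# keep an entry only if every column sits inside its own [low, high] band; no scaling involved
kept = {k: v for k, v in entries.items()
        if np.all((np.array(v) >= low) & (np.array(v) <= high))}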

That feels like it would go funny for higher dimensional (e.g. MFCC) data, as I’m specifically not trying to scale the data here, only remove outliers.

So with that said, perhaps “variance” isn’t the correct word here. I’ll just state what I want plainly in case the terminology is wrong.

Intended use case:
-creating a classifier by giving it x amount of examples of a given class (typically 50+)
-taking the resultant dataset/labelset pair, and then removing outliers in case there were stray hits, or hits that were otherwise anomalous

So does that mean I want things that are x distance away from the mean of each individual column?

Or does it necessitate something like what @spluta suggested last year, where I take it down to fewer dimensions (UMAP/PCA), then remove things based on how far they are from the lower-dimensional mean (still keeping the original higher-dimensional data)?
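
i.e. something like this, if going the reduction route (a sketch: reduce only to measure distances, but keep and prune the original high-dimensional rows):

import numpy as np
from sklearn.decomposition import PCA

mfccs = np.random.rand(60, 13)               # stand-in for one class's high-dimensional data
proj = PCA(n_components=2).fit_transform(mfccs)

dists = np.linalg.norm(proj - proj.mean(axis=0), axis=1)   # distance from the 2D centre
cutoff = np.quantile(dists, 0.90)                          # keep the central 90%

kept_rows = mfccs[dists <= cutoff]           # the pruning happens on the *original* data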

/////////////////////////////////////////////////////////////////////////

I was/am still a bit concerned about this, but something tells me that in order to do the stuff above, I will have to dump/iterate through all the data outside of a fluid.dataset~ so will probably just end up having to manually pack/label everything when putting things back together with the gaps missing. (e.g. if I remove entry 4 out of a dataset with 10 entries, entry 5 will then be renamed as entry 4, then 6 to 5, etc…) Will be a bit annoying to do that to both the data and the labels, but with what I’ve had to do for other patches, doesn’t seem as insurmountable as it once did.
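
(e.g. something like this for closing the gaps, as a rough sketch assuming everything is sitting in plain dicts at that point):

# stand-in dicts with a gap where "entry-4" was removed
data = {"entry-3": [0.1, 0.2], "entry-5": [0.3, 0.4], "entry-6": [0.5, 0.6]}
labels = {"entry-3": "kick", "entry-5": "snare", "entry-6": "kick"}

# re-number the surviving entries sequentially, keeping data and labels in lockstep
new_data, new_labels = {}, {}
for i, old_id in enumerate(sorted(data, key=lambda k: int(k.split("-")[1]))):
    new_data[f"entry-{i}"] = data[old_id]
    new_labels[f"entry-{i}"] = labels[old_id]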

I’m just passing through, so I’ll restrict myself to a high level tl;dr answer, the very abbreviated version of which is that (musician friendly) tools for model evaluation are the biggest omission in the flucoma data stuff at the moment, in part because the musician friendliness bit is hard – there’s some unavoidable technicality in model evaluation and comparison, and a great risk of creating the impression that certain magic numbers can do the job. Unfortunately, they can’t: there is always a degree of subjectivity and context sensitivity to this. (That said, some of your colleagues in PRISM have started rolling some evaluation stuff for themselves – maybe they can be persuaded to share?)

So, thing with ‘outliers’ is that (barring an actual objective definition) the problem they create is generally an overfitting one: i.e. as atypical points in the training data (that you wouldn’t expect to see in test or deployment), they exert too great an influence on the model training, making it less generally useful. There’s different ways to try and deal with this: cross-validation, especially leave-one-out cross-validation, can be used to try and diagnose problematic data like this. In principle, one could make an abstraction for doing this: it basically involves training and testing a bunch of models with different subsets of the data, but it will be fiddly in the absence of easy methods for generating splits of datasets algorithmically, and getting some evaluation metrics on held-out data.
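
(The shape of that diagnostic, sketched in Python/sklearn rather than as a FluCoMa abstraction: train on all-but-one point, test on the held-out point, and note which points consistently come back wrong.)

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neural_network import MLPClassifier

X = np.random.rand(60, 13)                                 # stand-in features
y = np.array(["kick"] * 20 + ["snare"] * 20 + ["hat"] * 20)

suspects = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    if model.predict(X[test_idx])[0] != y[test_idx][0]:
        suspects.append(int(test_idx[0]))                  # points the model can't predict when held out

# 'suspects' are candidates for closer inspection (possible outliers / mislabels)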

Model regularisation*, meanwhile, just tries to make models more ‘robust’ to outliers (so the MLP objects could be augmented with some, limited, regularisation control). A hacky thing to try, though, might be to ‘augment’ your training data by adding noise to it which, if you squint and are generous of spirit, can have some regularising effects.

So, to the Qs specifically

  1. there is no such general method that doesn’t involve at least some thinking about what ‘outlier’ means in a given case. Rank-order statistical things like median and IQR aren’t a magic bullet.
  2. there will be a certain amount of dumping and iterating whatever you do, and that’s probably unavoidable until a communal ‘we’ zero in on some approaches to this that make sense for musical workflows, and we can abstract stuff away…

* so called because it’s there to try and encourage models to be more sceptical of ‘irregular’ data


Mucho helpful response!

Yup yup. I think for the general classification stuff I’ve been doing it’s worked well enough as the sounds are, generally-speaking, distinct enough that the small amount of junk in there doesn’t really mess things up. Obviously cleaner and more accurate stuff would be better there, but it was mainly when working on interpolation (between classes) more recently, I felt like something like this was necessary since I’m basically “interpolating” by navigating nearest neighbors, so junk around the edges of the zones has a bigger impact.

Exciting! Now to find out what those magic numbers are…

I did think about this a bit after making this post, as there could be cases where the data is all nice and tidy with nothing out of line (very unlikely, obviously), so the “outlier” becomes a bit more conceptual. I imagine this is comfortably in the "it depends"™ territory, but is throwing out 5-10% of the entries of every class based on a more generic metric (distance from mean/median or something) “bad”? (for context, I’m often giving around 50-75 examples per class, though this can be as low as 10-15 with quicker trainings)

I’m certain there will be nuance in refining things past that point, but at the moment I’m doing nothing, and surely doing something is better than doing nothing…


I imagined this would be the case. Boy howdie is it faffy to do something to a dataset based on the information in a labelset! I held out some small hope that it would be possible to hijack robustscale’s output to prune things rather than rescaling things and call it a day.

Cool, have dropped Sam (Salem) a line to see what’s up. I am intrigued.

So returning to this after some time but from a slightly different angle.

Been working on some refinement of the absolute position triangulation stuff with a buddy, and he’s gotten really good results using an NN (MLP) where the input is the time difference of arrival and the output is the known positions, using a bunch of dots marked on my drum to be as accurate as possible. Like so:

Now when I try to do the same thing using fluid.mlpregressor~ I can’t get it to converge at all. Or, if I normalize the input, I get loss values other than -1., but it doesn’t give sensible outputs.

Turns out that my buddy’s version (in Python) is using regularisation, specifically batchnorm.

I’ve done some googling, but everything I’ve come across seems quite a bit beyond me/my understanding, so unlikely I would be able to do anything like that within a FluCoMa context, but I was rereading this with regards to “adding noise” as a way of regularisation.

So when you say “adding noise”, do you mean creating more entries that have some +/- amount added to them? (5%? 10%?)

e.g.:
In my case I have 157 points on the drum, so I have a 4d dataset of 157 TDOAs (via cross correlation) and a 2d dataset of the cartesian coordinates. Would “adding noise” in this context mean duplicating (or more) the amount of entries with some +/- variation for both the TDOAs and positions?

I did also see some stuff about ‘early stopping’ as a way of intrinsically avoiding overfitting. Is that something that can be leveraged here?

Howdy,

To the specific question you end with:

Yes, make new training data. Noise added only to inputs. Determining the amount needs experimentation. 5% feels like a lot though.
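
i.e. something along these lines (a sketch; the noise scale and number of copies are made-up starting points to experiment with):

import numpy as np

lags = np.random.rand(157, 4)                 # stand-in for the 4D TDOA inputs
positions = np.random.rand(157, 2)            # matching cartesian targets

copies = 4                                    # noisy duplicates per original point
noise_scale = 0.01 * lags.std(axis=0)         # small fraction of each input column's spread

aug_in, aug_out = [lags], [positions]
for _ in range(copies):
    aug_in.append(lags + np.random.normal(0.0, noise_scale, lags.shape))   # noise on inputs only
    aug_out.append(positions)                                              # targets stay untouched

X, y = np.vstack(aug_in), np.vstack(aug_out)  # the augmented training set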

Meanwhile, a longer response. It’s not a given that regularisation[*] is the magic bullet here. If our MLP is failing to converge in training then it could be that the data is just too funky. Intuitively, for data like this, I feel like there’s a strong possibility that the ‘circle-ness’ of it could lead to challenges for the optimizer, such as sharp discontinuities in the gradients.

So, I’d be inclined to try to narrow down what’s happening. A few things to check:

The python network converged, but did it converge to something sensible?
Have you tried importing the data into max / fluid.mlp and playing with it?

Besides the batchnorm, how equivalent are the setups you and your buddy are trying?
Same network layouts, activations etc? Can the python be stripped back to the fluid equivalent? (including making sure to use Stochastic Gradient Descent for training, rather than something fancier like ADAM). If you can recreate the failure to converge in python, then that might be an easier platform to start experimenting with things like the learning rate (possibly to something much lower) to see if a vanilla setup can be made to converge at all. (Or it may reveal some inconsistency between ours and the python package’s…).
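
By ‘stripped back’ I mean something roughly like this on the python side (a sketch assuming scikit-learn; the point is plain SGD, one small hidden layer, no batchnorm, and a conservative learning rate):

from sklearn.neural_network import MLPRegressor

# deliberately vanilla: about as close as sklearn gets to the fluid.mlpregressor~ setup
model = MLPRegressor(
    hidden_layer_sizes=(10,),   # single hidden layer of ~10 neurons
    activation="tanh",
    solver="sgd",               # plain stochastic gradient descent, not ADAM
    learning_rate_init=0.001,   # try lowering this further if it still won't converge
    momentum=0.9,
    max_iter=5000,
)
# model.fit(lags, positions)    # lags in, cartesian positions out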

If you want to send me your dataset privately to have a look at, feel free

[*] Whilst batchnorm can have regularising effects, it also exists to make life easier for the gradient descent by mitigating situations where very sharp / skewed gradients can emerge.


Heya,

I’m the buddy!

We’re using comparable but different datasets (Rodrigo using sensor data and myself microphone data, but we both use TDoA/sample lags between sensors derived from that data), and I’m running a very simple MLP implementation, but with optional batch norm after every hidden layer (also some other optional bits, but they’re not important here).
I’m a little rusty with my NNs at the moment as I haven’t been deep in the theory in quite some time, so please take my observations with a grain of salt.

The network is super simple (one hidden layer of 10 or so neurons), and the training data is tiny.
I thought I might need something a little fancier because there’s some non-linearity I don’t understand in the target, hence the MLP, but I can imagine we could achieve some level of results comparable to the more physical model (optimizing equations for multilateration) using some form of linear regression.

Now since my dataset is so tiny (just 40 observations in this particular example) it’s not really 'batch’norm, but boils down to normalization over the entire dataset after it’s passed through the input layer.
To be honest, I haven’t thought deeply about it before, but now I imagine simple standardisation of the inputs should probably achieve the same effect - I didn’t do that, simply because doing it in the network worked for me and means I can pass integer lags directly without additional steps later, simplifying my pipeline at this stage.
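
i.e. roughly this, instead of the in-network normalisation (a sketch):

import numpy as np

lags = np.random.randint(-20, 20, size=(40, 2)).astype(float)   # stand-in integer lags

# z-score over the whole (tiny) training set, which is what the 'batch'norm collapses to here
mean, std = lags.mean(axis=0), lags.std(axis=0)
lags_standardised = (lags - mean) / std

# at inference time, apply the same mean/std to incoming lags before the network sees them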

It’s such a simple problem (2in [lags] → 2out [cartesian coordinates]) where really the added value of using an MLP or similar is just those non-linearities the standard trilateration doesn’t get.
Since the dataset is so small it’s quite sensitive to bad quality data, and I suspect there’s something going on there with Rodrigo’s data since he is normalizing the inputs. For the same reason I don’t think noise can help when mapping lags/tdoa to outputs directly. I visually checked that every/almost every one of my samples is plausible, i.e. to my eye the onsets are in the same spot across every input channel.

I’ll connect ‘offline’ with Rodrigo about the data once more - after thinking and writing this reply here I don’t really see why flucoma shouldn’t be able to learn this particular task. Perhaps just sending my data over for him to try.


Hi @timlod and welcome!

Yeah, I think if you and @rodrigo.constanzo are actually dealing with different (but related) data, then swapping and seeing if you can make each other’s data work would be informative. It could well be that 2 input dimensions vs 4 makes a difference, for example (depending on the character of the nonlinearity).

In fact, I think I’d be inclined to try using simulated data, given that the reverse mapping (from coordinate to lags) is so trivial to generate. That way it’s easier to play around with network parameters and training set sizes unburdened by measurement uncertainty.
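
For instance, something like this (a sketch with made-up geometry: four sensors on the rim of a unit circle, constant propagation speed, lags measured relative to sensor 0):

import numpy as np

n_points = 500
sensors = np.array([[np.cos(a), np.sin(a)]                    # sensors spaced around the rim
                    for a in np.linspace(0, 2 * np.pi, 4, endpoint=False)])

# random strike positions inside the unit circle
r, theta = np.sqrt(np.random.rand(n_points)), 2 * np.pi * np.random.rand(n_points)
positions = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

# idealised arrival times: distance / constant speed; lags taken relative to sensor 0
dists = np.linalg.norm(positions[:, None, :] - sensors[None, :, :], axis=2)
lags = dists - dists[:, [0]]       # column 0 is always zero, leaving 3 informative inputs

# 'lags' -> network inputs, 'positions' -> targets, with no measurement uncertainty in the way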


The non-linearity should be the same/similar in both cases (and, I expect, will depend on the tuning of the drum and type of drumhead).
I think Rodrigo should be using 3 inputs (differences of 4 sensors, one will always be redundant)? In practice, if the data is good, an additional sensor should just help convergence though, not make it harder.

Simulated data is easy to generate based on the physical model, but it won’t contain the non-linearities we’re actually using the MLP for. I’m not sure how well that would translate - then again, that could be a small step for the NN, so it might be worth a try indeed!


Gotcha, that makes sense. So in effect, it would be multiple (noise-varied) inputs pointing to the same output of the network, which I guess is where the regularisation comes from in that case.

Yeah I jumped the gun a bit as I was feeding it a much bigger set of data (the entire mesh) instead of just calibration data (center + hits a few cm in from each lug on the drum). I still feel like it should converge as you’d think there’s a relationship there, but I’m likely messing something up before getting the data there.

@timlod is also working with a 3 sensor array (three sensors along the top), rather than the 4 equally-spaced sensors I’ve been working with up to this point, so that data is a bit spicier/funkier than the 4-sensor stuff. I plan on running the 4 sensor data into the NN as well, as doing that with known locations as the outputs is easier/faster (and hopefully more accurate) than pre-computing a bunch of lag maps (or solving quadratics).

Though with this I need to figure out the best combination of numbers for input as we only ever use the 3 nearest sensors, and send that to the corresponding lag maps. To an NN those numbers would look the same (as in, no idea of cardinal orientation). So my ideas are:

  • send all 4 values and hope the NN makes sense of the furthest one being dogshit
  • send 4 values with the furthest one being zeroed out
  • send 4 values with the 4th one being the index of whatever onset arrived first (effectively the “orientation” of the data)

I’ll experiment to see what works best, but that’s what I have in mind for some additional testing.

That’s indeed a big part of the goal atm. The lag map version had better precision than the quadratic version, but does get a bit non-linear towards the edge of the drum.

The point about simulation is to try and compartmentalize different sources of uncertainty whilst getting things working. The idealised relationship between arrival times and cartesian coordinates is already nonlinear, and if one part of the puzzle is whether or not our MLP is being weird, then confirming that it can do something sensible (meaning ‘similar to python’) for a simulation helps narrow down where the weirdness might be happening. (FWIW, I’ve had a go and – yes – a single hidden layer with 10 neurons makes a reasonable stab at learning a simulated (and idealised) mapping between sensor lags and cartesian positions on a circular plane.)

Now, the extra value of that IMO is that you can do some of this experimentation about input representations etc. in this idealised space and have something to compare the measurements to (e.g. by eyeballing plots), which can help you reason about whether

  1. the additional non-linearities of the real system are significant (e.g. non-constant velocity in the medium)
  2. something screwy is going on with the measurements

With respect to the number of sensors to send in, it seems to me that zero-ing out the earliest in each case introduces another strong non-linearity and doesn’t make life easier. If instead you pick a sensor (doesn’t matter which) that is the ‘reference’ and make that always zero (and hence redundant) – so that you now have +ive and -ive lags – then you can use just 3 inputs.
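
In other words, something along these lines (a sketch; the sensor ordering is just for illustration):

import numpy as np

# stand-in arrival times (in samples) at the N, E, S, W sensors for one hit
arrivals = np.array([100.0, 113.1, 122.8, 110.7])

reference = 0                           # always the same sensor (here N), regardless of which fired first
lags = arrivals - arrivals[reference]   # signed lags, so both +ive and -ive values

inputs = np.delete(lags, reference)     # the reference lag is always 0, so drop it: 3 inputs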


On a positive note, I’ve gotten some stuff up and running using an apples-to-apples approach.

I tried training up the NN with the circular lug positions on a 3 sensor array, both as just 9 points (means) and as 72 points (8 of each pointing to the same output) and that didn’t lead to much joy.

The means version with a Sigmoid activation (all the other activations threw up a -1) and 10 hidden layers gave me a loss of ~2000 after running it for a few minutes. I think the lowest it got (before I stopped it) was ~1800.

With the 72 point version I got it to about ~9000 loss after a few minutes. So pretty dogshit.

Now if I standardize both datasets (like @timlod mentions up thread) the results are better, but still not “great” (particularly in terms of position results).

9 means got down to ~0.04 after a minute or so.
72 points got down to ~0.17 after a minute or so.

Both of these with Tanh now given the standardization range.

If I then run real input into it and apply the inverse transforms to get things back to the right scale, numbers are somewhat in believable ranges, but nowhere near as good as Tim’s results.

Hmm, not sure I follow here.

What’s been working out the best for the lag map version was computing all the individual pairs so you always have NW, WS, SE, EN lags, and then depending on which physical sensor received the onset first (i.e. was closest), we go forward with only the nearest sensors. So if the N sensor received the first onset, then only EN and NW were used, and the intersection of those two “hyperbolae” (or rather their lag map representation) was the strike position.

So I guess it’s just 2 values then? (EN and NW) (but being derived from 3 sensor inputs)

Have just ended up confusing myself…

Looking at a real world set of values from cross correlation (at 44.1k using sub-sample resolution from @a.harker’s FrameLib) I have:
-13.132803 -9.664434 10.661433 12.456226 (NE, ES, SW, WN)
And in this case the N sensor arrived first

Are you saying I pick a random sensor, and consistently offset everything so that it is at 0. and the rest are above/below it? I don’t see how that would deal with the fact that the reason for doing this is that the "furthest" sensors end up generating junk data (at least with the quadratic and lag map-based approaches).