Removing outliers while creating classifiers

:partying_face:

Glad the code is broadly parse-able too.

Yes – I'd expect it to matter more for more complex clumpings of data. If you've got several clear bunches with stray points scattered between them, and you only want to get rid of the scattered stuff, you'll need to tune the neighbour count.

More broadly, yes, there's a CPU hit associated with more neighbours too, especially in higher dimensions, and (according to the graphs in the paper) accuracy steadily goes down once you're past some sweet spot for the data in question (but not abruptly).

So, if you can live with the CPU cost and it still does what you need for a given job, don't sweat it; otherwise use it for finer tuning.
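To make the neighbour-count trade-off concrete, here's a minimal brute-force sketch of the kind of kNN-based scoring involved – not FluCoMa's actual implementation, just the general idea: score each point by its mean distance to its k nearest neighbours, so strays between bunches stand out. The O(n²) loop is also why more neighbours and more points cost more CPU.

```python
import math
import random

def knn_outlier_scores(points, k):
    """Score each point by its mean distance to its k nearest neighbours.
    Larger score = more outlier-ish. Brute force, O(n^2): fine for toy sets."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Two tight bunches plus one stray point between them.
random.seed(0)
bunch_a = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(20)]
bunch_b = [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(20)]
points = bunch_a + bunch_b + [(2.5, 2.5)]  # stray is the last entry

for k in (3, 10):
    scores = knn_outlier_scores(points, k)
    # For sensible k, the stray should have the highest score.
    print(k, scores.index(max(scores)) == len(points) - 1)
```

For data like this, a range of k values all pick out the stray; it's the messier in-between cases where the tuning starts to matter.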

  • Is there a benefit of adjusting both tolerance and fluid.datasetquery~ threshold independently?

They're not completely equivalent because of the non-linearity in the way the final scaling works (it clips at 0, meaning you have a class of points that are definitely inliers). For practical purposes, though, you can usually leave it at a value that gives intuitively sensible results for your purposes. As you show, when it's lower, you get some points marked as possible outliers within the main bunch; sometimes that might be useful for surgical stuff – so, again, perhaps good to twiddle for fine(r) tuning in difficult cases.
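Here's my guess at the shape of that clipped scaling, as a sketch (the exact formula in the implementation may differ): standardise the raw score, squash it through erf, and clip at zero, so anything at or below the typical score comes out as exactly 0 – a definite inlier. The `tolerance` divisor here is a stand-in for whatever the actual parameter is.

```python
import math

def outlier_probability(score, mean_score, std_score, tolerance=2.0):
    """Map a raw outlier score to [0, 1]: standardise, squash through erf,
    clip at zero. Everything at or below the mean is exactly 0 (definite
    inlier); `tolerance` controls how steeply scores above it get flagged."""
    z = (score - mean_score) / (tolerance * std_score)
    return max(0.0, math.erf(z / math.sqrt(2)))

# Same raw score, different tolerance: lower tolerance pushes it closer
# to 1, so more borderline points inside the main bunch get flagged.
print(outlier_probability(2.0, 1.0, 0.5, tolerance=2.0))
print(outlier_probability(2.0, 1.0, 0.5, tolerance=0.5))
```

This is why the two knobs aren't quite interchangeable: a downstream query threshold just cuts this curve at a different height, while the tolerance reshapes the curve itself (and the clip at 0 is unaffected by either).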

  • Should outlier rejection happen pre or post dimensionality reduction?

As a rule I’d say pre-, especially if the DR is just being used for visualisation and the actual classification will be done in higher dimensions. Nonlinear DR like UMAP, in particular, is quite liable to reduce the outlier-ness of points at some settings (which you showed above), because it’s trying to preserve the topology of whatever you give it. Because it also uses a kNN for part of its work, the two things could interact in surprising ways.

So, the "it depends" version: generally before, except when that doesn't work.

  • Does an approach like this have any useful implications for computing class “interpolation”?

Don’t know. Will need to remind myself of what you’re trying to do there and think more about it. But just have a play in the meantime and come back with Qs.

  • Can anything bad happen in the fluid.verse with disjointed entries in a fluid.dataset~?

Nope. They’re not actually disjointed (the order of entries isn’t coupled to the IDs).
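A rough mental model, if it helps (a Python analogy, not how fluid.dataset~ is actually implemented): entries behave like a mapping from ID to point, so deleting one leaves no "hole" that anything downstream depends on.

```python
# Entries as an ID -> point mapping: removing one leaves no gap,
# because nothing is coupled to insertion order.
dataset = {"a": [0.1, 0.2], "b": [0.3, 0.4], "c": [0.5, 0.6]}
del dataset["b"]
print(sorted(dataset))  # remaining IDs still resolve fine
```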
