Mucho helpful response!
Yup yup. I think for the general classification stuff I’ve been doing it’s worked well enough, as the sounds are, generally speaking, distinct enough that the small amount of junk in there doesn’t really mess things up. Obviously cleaner and more accurate stuff would be better there, but it was mainly when working on interpolation (between classes) more recently that I felt like something like this was necessary, since I’m basically “interpolating” by navigating nearest neighbors, so junk around the edges of the zones has a bigger impact.
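(Just to make the “navigating nearest neighbors” bit concrete, this is roughly the idea in a toy Python/scipy sketch — made-up 2D points standing in for the real descriptor data, nothing FluCoMa-specific, and the straight-line walk between class centres is just for illustration:)

```python
import numpy as np
from scipy.spatial import cKDTree

# Two toy "classes" of 2D points (stand-ins for descriptor vectors)
rng = np.random.default_rng(2)
class_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
class_b = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))
corpus = np.vstack([class_a, class_b])
tree = cKDTree(corpus)

# "Interpolate" by walking a line between the class centres
# and snapping each step to its nearest neighbour in the corpus.
start, end = class_a.mean(axis=0), class_b.mean(axis=0)
for t in np.linspace(0.0, 1.0, 7):
    target = (1 - t) * start + t * end
    _, idx = tree.query(target)
    print(f"t={t:.2f} -> corpus entry {idx}")
```

Any junk sitting between the two clumps gets snapped to first, which is exactly the problem.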
Exciting! Now to find out what those magic numbers are…
I did think about this a bit after making this post, as there could be cases where the data is all nice and tidy with nothing out of line (very unlikely, obviously), so the “outlier” becomes a bit more conceptual. I imagine this is comfortably in “it depends”™ territory, but is throwing out 5-10% of the entries of every class based on a more generic metric (distance from mean/median or something) “bad”? (For context, I’m often giving around 50-75 examples per class, though this can be as low as 10-15 with quicker trainings.)
I’m certain there will be nuance in refining things passed that point, but at the moment I’m doing nothing, and surely doing something is better than doing nothing…
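Concretely, the sort of pruning I have in mind is something like this (a rough Python sketch working on a plain numpy array rather than the actual dataset/labelset, with made-up numbers — the 90% keep ratio is just an example, not a recommendation):

```python
import numpy as np

def prune_class(X, keep=0.9):
    """Keep the `keep` fraction of entries closest to the class median.

    X: (n_examples, n_features) array for a single class.
    Returns the pruned array and the indices that were kept.
    """
    median = np.median(X, axis=0)
    dist = np.linalg.norm(X - median, axis=1)  # distance of each entry from the class median
    cutoff = np.quantile(dist, keep)           # e.g. keep the closest 90%
    kept = np.where(dist <= cutoff)[0]
    return X[kept], kept

# e.g. ~60 examples of 13 descriptor values per class, drop the furthest ~10%
rng = np.random.default_rng(0)
examples = rng.normal(size=(60, 13))
pruned, kept_idx = prune_class(examples, keep=0.9)
print(len(kept_idx), "of", len(examples), "entries kept")
```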
I imagined this would be the case. Boy howdy is it faffy to do something to a dataset based on the information in a labelset! I held out some small hope that it would be possible to hijack robustscale’s output to prune things rather than rescale them, and call it a day.
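What I was half-hoping for was something along these lines — a hand-rolled version of the median/IQR maths (as I understand robustscale to be doing) that flags entries to drop instead of rescaling them. All the names and the 2.5 threshold here are placeholders of mine, not anything from the actual object:

```python
import numpy as np

def robust_prune(X, threshold=2.5):
    """Scale each feature by (x - median) / IQR, then flag entries whose
    scaled value exceeds `threshold` in any dimension as prune candidates."""
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = np.where((q3 - q1) == 0, 1.0, q3 - q1)  # guard against zero IQR
    scaled = (X - median) / iqr                   # same maths as a median/IQR robust scaling
    outliers = np.any(np.abs(scaled) > threshold, axis=1)
    return X[~outliers], np.where(outliers)[0]

rng = np.random.default_rng(1)
data = rng.normal(size=(75, 13))
data[3] += 10                                     # plant an obvious outlier
kept, dropped = robust_prune(data, threshold=2.5)
print("dropped entries:", dropped)
```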
Cool, have dropped Sam (Salem) a line to see what’s up. I am intrigued.