IQR-ing corpora

weefuzzy · February 22, 2021, 2:20pm

If ‘wouldn’t be used’ = ‘ignored’ then this is no longer a straightforward nearest neighbour search. What would happen in a nearest neighbours search is that the results would all be pulled in the direction of G2.

If the dimensions of the data against which the tree is fitted don’t have more or less equal ranges, then queries will be weighted towards the dimensions with the greater range, as these will dominate the calculated distances between points. So, you don’t need to standardise for the same reasons it’s needed in PCA (where the assumption of zero-mean, ~~unit~~ comparable variance is quite important), but you do need something that will put the different dimensions on an equal footing (or unequal-by-design); so, that could just as well be normalisation as standardisation.
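To make the range point concrete, here is a minimal sketch in plain Python (not FluCoMa code). The points, the query, and the helper names `nearest`, `fit_minmax` and `apply_minmax` are all made up for illustration, and a brute-force search stands in for the tree. Dimension 0 spans roughly 0 to 1 and dimension 1 spans roughly 0 to 1000, so the unscaled search is effectively decided by dimension 1 alone.

```python
import math

def nearest(points, query):
    """Index of the point closest to query (brute-force Euclidean search)."""
    return min(range(len(points)), key=lambda i: math.dist(points[i], query))

def fit_minmax(points):
    """Learn a (min, range) pair per dimension from the fitting data."""
    columns = list(zip(*points))
    # Fall back to a range of 1.0 for a constant dimension to avoid dividing by zero.
    return [(min(c), (max(c) - min(c)) or 1.0) for c in columns]

def apply_minmax(point, params):
    """Map a point into the 0..1 range of each dimension of the fitting data."""
    return tuple((x - lo) / rng for x, (lo, rng) in zip(point, params))

# Hypothetical data: dimension 0 lives in 0..1, dimension 1 in 0..1000.
points = [
    (0.10, 500.0),
    (0.90, 510.0),
    (0.15, 600.0),
    (0.00, 0.0),
    (1.00, 1000.0),
]
query = (0.10, 520.0)

# Unscaled: the large-range dimension dominates the distance.
raw_idx = nearest(points, query)

# Normalised: both dimensions contribute on comparable terms.
params = fit_minmax(points)
scaled_points = [apply_minmax(p, params) for p in points]
scaled_idx = nearest(scaled_points, apply_minmax(query, params))

print("nearest (raw):   ", points[raw_idx])
print("nearest (scaled):", points[scaled_idx])
```

The raw search picks the point that is slightly closer in dimension 1 even though it sits at the other end of dimension 0; after normalising, the point that matches on dimension 0 wins. Note that the query is scaled with the parameters learned from the fitted data, not with its own.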

I think that’s simpler: standardise / scale the spaces independently of each other.

Not necessarily: it depends on the distribution of what you’re fitting. If the input really does have outliers that would yield an unhelpfully large variance when standardising, then robust scaling will help with that. But it’s predicated on the assumption that there are outliers, so (I think) if the input is actually closer to normally distributed, robust scaling would end up affecting it more.
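A rough sketch of the outlier case, again in plain Python with invented numbers: nine well-behaved values plus one extreme one. The functions `standardise` and `robust_scale` are hypothetical helpers, and median/IQR is only one common way of doing robust scaling, so treat this as an illustration rather than a description of any particular implementation.

```python
import statistics

def standardise(values):
    """Centre on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical input: nine inliers and a single large outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]

std_scaled = standardise(values)
rob_scaled = robust_scale(values)

# How much of the scaled axis do the nine inliers occupy?
inlier_spread_std = std_scaled[8] - std_scaled[0]
inlier_spread_rob = rob_scaled[8] - rob_scaled[0]

print("inlier spread, standardised: ", round(inlier_spread_std, 3))
print("inlier spread, robust scaled:", round(inlier_spread_rob, 3))
```

With the outlier present, standardising squeezes the nine inliers into a small slice of the scaled axis because the outlier inflates the standard deviation, whereas the median and IQR barely notice it. Whether that is what you want still depends on whether the extreme values really are outliers.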