It oughtn’t be that tricky to get something to converge (getting useful results is a different matter). Always (always) start small, with low learning rates, and embrace tweaking the LR (to try and find a sweet-ish spot between not-converging and converging-but-too-slowly) as part of the model design. Be wary of making the network bigger unless you’re sure you have enough training data to absorb the extra complexity (and can live with the extra computation).
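To make that concrete, here’s a rough Python / scikit-learn sketch (just for illustration, nothing to do with the toolkit itself; the data, network size and LR values are all made up) of treating the learning rate as something you sweep on a deliberately small network:

```python
# A rough sketch of treating the LR as part of the design: sweep a few small
# learning rates on a deliberately small network and compare how training went.
# Data, network size, and LR values are all made up for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                      # 200 points, 4 features
y = X @ np.array([0.5, -1.0, 2.0, 0.1]) + 0.05 * rng.normal(size=200)

for lr in (1e-4, 1e-3, 1e-2, 1e-1):
    net = MLPRegressor(hidden_layer_sizes=(8,),    # start small
                       learning_rate_init=lr,
                       max_iter=500, random_state=0)
    net.fit(X, y)
    print(f"lr={lr:g}  final loss={net.loss_:.4f}  iterations={net.n_iter_}")
```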
I think UMAP’s had it since Alpha 07.
No, it shouldn’t screw up convergence in and of itself: if the whole range of your feature is huge, then that might give odd results, but in general points outside -1:1 are fine (consider that the network is learning weights to apply to the inputs anyway, so these weights can just get smaller, within reason).
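A toy illustration of that last point, using a plain linear fit in Python as a stand-in for a network (so very much a sketch, not how the toolkit does it): scale a feature up by 100 and the learned coefficient just shrinks by 100.

```python
# A toy version of the "weights can just get smaller" point, with a plain linear
# fit standing in for a network: multiplying a feature by 100 just divides the
# learned coefficient by 100; the mapping being learned is the same.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(500, 1))
y = 2.0 * x[:, 0] + 0.3                            # the "true" mapping

small_range = LinearRegression().fit(x, y)
big_range = LinearRegression().fit(100 * x, y)     # same data, feature 100x bigger
print(small_range.coef_[0], big_range.coef_[0])    # ~2.0 vs ~0.02
```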
It doesn’t screw up the KD tree either. The tree doesn’t really care about the range of the data in absolute terms, but the Euclidean distance it uses implicitly assumes that features are comparably scaled (the Euclidean distance between two points is the square root of the sum of the squared differences in each feature).
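Here’s a quick made-up Python example of what that implies in practice: when one feature has a much bigger range than another it swamps the sum of squares, and decides which point counts as ‘nearest’, until you rescale the features to comparable ranges.

```python
# Made-up numbers showing how a feature with a much bigger range dominates the
# euclidean distance (and therefore which neighbour a KD tree would return)
# until the features are rescaled to comparable ranges.
import numpy as np

# Feature 1 lives in roughly 0..1, feature 2 in roughly 0..1000.
query   = np.array([0.10, 500.0])
point_a = np.array([0.12, 900.0])    # very close on feature 1, far on feature 2
point_b = np.array([0.90, 510.0])    # far on feature 1, close on feature 2

def euclid(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

# Raw distances: feature 2's range swamps feature 1, so point_b looks nearest.
print(euclid(query, point_a), euclid(query, point_b))    # ~400.0 vs ~10.0

# Rescale each feature by its rough span and the ranking flips.
span = np.array([1.0, 1000.0])
print(euclid(query / span, point_a / span),              # ~0.40
      euclid(query / span, point_b / span))              # ~0.80
```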
I think some of this will partly be down to it being a new way of working (and there still being unsanded UX edges to the toolkit, natch): to a very large extent, ‘programming’ with ML stuff is about data monkeying as much as, or more than, it’s about models. I’ve certainly seen the argument put forward that it constitutes a completely different programming paradigm.