Thanks for this. There’s so much knowledge here that falls into the “it depends” quadrant, yet at the same time counts as “things you should know, and not really deviate from”.
Well, I’ve been experimenting with using as few dimensions as possible, but coming at it from “the other side”: I take a load more features/stats and then run them through some kind of funnel to pull the dimensionality down from there. So 76D feels like a modest number of features to start off with.
I hadn’t thought of this, as it seemed either redundant or “bad practice”, particularly given how brutal PCA is. I’d be more inclined to do some of the SVM stuff from the other thread to prune down to the most salient features before running them through some kind of destructive/transformative process like PCA/UMAP/MLP/etc. I don’t know if that’s just my lack of knowledge, but stacking multiple layers of transformation this way seems akin to transcoding audio, where compound artefacts can creep in along the way.
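To make sure I’m describing the same pipeline: here’s a rough Python/scikit-learn sketch of what I’m picturing (the random data, the labels and the RFE/parameter choices are all placeholders for illustration, not tied to any particular toolkit or to the workflow from the other thread):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# placeholder stand-ins for the corpus: 76 descriptors per slice,
# plus some labels (e.g. class tags) to give the SVM something to rank against
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 76))
y = rng.integers(0, 4, size=1_000)

X_std = StandardScaler().fit_transform(X)

# 1) prune to the most salient features with a linear SVM + recursive elimination
selector = RFE(LinearSVC(max_iter=5000), n_features_to_select=20)
X_pruned = selector.fit_transform(X_std, y)

# 2) only then hand the survivors to a destructive/transformative step like PCA
X_2d = PCA(n_components=2).fit_transform(X_pruned)
print(X_2d.shape)  # (1000, 2)
```

So the selection step only drops whole descriptors (no mixing), and the lossy recombination only happens once, at the end.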
I guess, in general, if I have a “small” number of samples in my corpus (<40k) but a “large” number of descriptors (>70), would I have to add some additional steps to get the initial fitting of the network to behave?
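By “additional steps” I mean things like this toy sketch (Python, with scikit-learn’s MLPRegressor standing in for an autoencoder; the layer sizes and regularisation values are guesses for illustration, not recommendations):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

# placeholder corpus: ~40k samples x 76 descriptors
rng = np.random.default_rng(1)
X = rng.normal(size=(40_000, 76))

# standardise first so no descriptor dominates the loss
X_std = StandardScaler().fit_transform(X)

# autoencoder-style MLP (inputs regressed onto themselves) with the usual
# guards against a shaky fit: weight decay, early stopping, small bottleneck
ae = MLPRegressor(
    hidden_layer_sizes=(32, 8, 32),  # 8-D bottleneck in the middle
    activation="tanh",
    alpha=1e-3,                      # L2 regularisation
    early_stopping=True,             # hold out 10% and stop before overfitting
    max_iter=200,
    random_state=0,
)
ae.fit(X_std, X_std)
print(ae.loss_)
```

Is that the kind of thing you’d reach for, or is it more about curating the descriptor set before the network ever sees it?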
Lastly, is the faff and twiddling required to train an autoencoder something that (generally speaking) leads to a result that better reproduces/captures/clusters the information in a dataset, and, I suppose, in non-linear ways that wouldn’t otherwise be possible with PCA/UMAP/TSNE? Or is the autoencoder a whole load of work just to end up with something that’s merely “different” from PCA/UMAP/TSNE?
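I suppose I could try to answer part of that empirically on my own data by comparing reconstruction error at the same bottleneck size, something like this toy sketch (synthetic data and scikit-learn stand-ins, purely illustrative, and not claiming either will win in general):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

# toy data: 76 descriptors driven non-linearly by a few underlying controls
rng = np.random.default_rng(2)
z = rng.normal(size=(5_000, 3))
X = np.tanh(z @ rng.normal(size=(3, 76))) + 0.05 * rng.normal(size=(5_000, 76))
X = StandardScaler().fit_transform(X)

# linear baseline: PCA down to 3D and back
pca = PCA(n_components=3).fit(X)
X_pca = pca.inverse_transform(pca.transform(X))
print("PCA reconstruction MSE:", np.mean((X - X_pca) ** 2))

# non-linear alternative: autoencoder with a 3-D bottleneck
ae = MLPRegressor(hidden_layer_sizes=(32, 3, 32), activation="tanh",
                  max_iter=1000, random_state=0)
ae.fit(X, X)
print("AE reconstruction MSE:", np.mean((X - ae.predict(X)) ** 2))
```

But that only measures reconstruction, not whether the resulting space clusters in a musically useful way, which I suspect is the harder question.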
I guess there are specific flavors of, or reasons for, each activation type being there, but is this then just another parameter to “test and see what you get”?
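i.e. something like sweeping it as a hyperparameter (again a toy sketch with placeholder data; the activation names here are just scikit-learn’s, and “identity” effectively collapses the whole thing to a linear model, which at least gives a PCA-like baseline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

# stand-in for a standardised 76-D descriptor matrix with some low-D structure
rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(
    rng.normal(size=(2_000, 5)) @ rng.normal(size=(5, 76)))

# treat the activation like any other hyperparameter and sweep it
for act in ("identity", "logistic", "tanh", "relu"):
    ae = MLPRegressor(hidden_layer_sizes=(32, 8, 32), activation=act,
                      max_iter=500, random_state=0)
    ae.fit(X, X)
    mse = np.mean((X - ae.predict(X)) ** 2)
    print(f"{act:>8}: reconstruction MSE = {mse:.4f}")
```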