Intelligent Feature Selection (with SVM or PCA)

Meanwhile, I went down a little rabbit hole last night reading around feature selection, and unsupervised selection in particular, with an eye on what PCA-based approaches are in the literature and which of the available techniques might be feasible with the toolkit facilities as they stand. I’m trying some stuff out and will report in a separate topic later.

Some potted findings that I may well follow up on one day:

  1. Irrespective of technique, some visual inspection of PCA results can be useful to gauge how structured the input data are. Just plotting the explained variance of each component (a ‘scree plot’) and making sure that the characteristic ‘elbow’ is there will confirm that (at least) the data aren’t totally unstructured (within PCA’s operating assumptions).
    If the data have been standardised pre-PCA, you can also recover the pairwise correlations between input features – useful, on the basis that you generally don’t want lots of strongly correlated features downstream, as that adds more computation for no gain. Doing this involves some linear algebra fiddling, but it’s not too bad (see the first sketch after this list).

  2. Feature selection for supervised models has been researched much more than for unsupervised. Perhaps not surprising, given that the supervised case is conceptually simpler (you know what you’re aiming for). In the unsupervised case, the challenge is similar to that faced by dimension reduction algorithms, in balancing the global and local structures of the data.

  3. There seem to have been a number of stabs at using PCA for unsupervised feature selection, although the literature is generally quite sniffy about them. One of these [1] seems to work by doing repeated fittings against a sub-selection of features and comparing them to a fitting against all features. Another [2] uses K-Means on the PCA bases as a way to try and rank the input features. I’m trying this out in Python, using some stackexchange code, to see what happens (there’s a rough sketch of the idea after this list).

  4. Recent schemes get more and more fancy. There’s interesting-looking stuff using a modified autoencoder setup, but it made my eyes water [4]. More manageable-looking is a scheme called Laplacian Score [3], which makes a nearest-neighbours graph and uses this as a basis to figure out a ranking (which feels slightly similar to what UMAP does in its early stages, although I don’t know the details of that very well). That could be done with a KD-Tree and some patience, so I may well have a bash at that as well (see the last sketch after this list).
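
Re point 1: here’s a minimal sketch of the scree plot and correlation-recovery idea, using numpy/scikit-learn, with the iris data standing in for whatever feature matrix you actually have. The only real wrinkle is matching the divisor used for standardisation to the one PCA uses for its eigenvalues (n − 1), otherwise the recovered matrix is out by a factor of n/(n − 1).

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Stand-in data -- swap in your own feature matrix.
X = load_iris().data

# Standardise with ddof=1 so the sample covariance of X_std is exactly
# the correlation matrix of X (PCA's explained_variance_ also divides by n-1).
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA().fit(X_std)   # keep all components

# Scree plot: explained variance per component; look for the 'elbow'.
plt.plot(np.arange(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, marker="o")
plt.xlabel("Component")
plt.ylabel("Eigenvalue (explained variance)")
plt.title("Scree plot")
plt.show()

# For standardised data, PCA is an eigendecomposition of the correlation
# matrix, so we can rebuild it as V diag(lambda) V^T.
V = pca.components_.T                 # columns are eigenvectors
lam = pca.explained_variance_         # eigenvalues
corr_recovered = V @ np.diag(lam) @ V.T

print(np.allclose(corr_recovered, np.corrcoef(X, rowvar=False)))  # True
```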
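
Re point 3: this is roughly what I understand the K-Means-on-the-PCA-bases idea of [2] (principal feature analysis) to be – each feature is represented by its row of PCA loadings, the rows get clustered, and the feature nearest each cluster centre is kept. The paper has more to say about choosing the number of components and clusters, so treat this as a sketch rather than a faithful implementation; the dataset and parameter choices are just placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def principal_feature_analysis(X, n_select, n_components=None, random_state=0):
    """Select features by clustering their PCA loading vectors (rough reading of [2])."""
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components).fit(X_std)

    # Rows of A are per-feature loading vectors in the retained PC space.
    A = pca.components_.T          # shape (n_features, n_components_kept)

    km = KMeans(n_clusters=n_select, n_init=10, random_state=random_state).fit(A)

    # For each cluster, keep the feature whose loading vector is closest
    # to the cluster centre.
    selected = []
    for k in range(n_select):
        members = np.where(km.labels_ == k)[0]
        dists = np.linalg.norm(A[members] - km.cluster_centers_[k], axis=1)
        selected.append(members[np.argmin(dists)])
    return sorted(selected)

X = load_wine().data
print(principal_feature_analysis(X, n_select=5, n_components=5))
```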
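
Re point 4: and a rough cut of the Laplacian Score from [3], leaning on scikit-learn’s kneighbors_graph (which can use a KD-Tree under the hood) rather than anything hand-rolled. The heat-kernel bandwidth heuristic and the way I symmetrise the neighbour graph are my own guesses rather than anything prescribed by the paper.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

def laplacian_score(X, n_neighbors=5, t=None):
    """Per-feature Laplacian Score (lower = better at preserving local structure)."""
    n_features = X.shape[1]

    # k-NN graph with pairwise distances, symmetrised by taking the union
    # of neighbourhoods (my choice -- the paper leaves some latitude here).
    dist = kneighbors_graph(X, n_neighbors, mode="distance",
                            include_self=False).toarray()
    W = np.maximum(dist, dist.T)
    if t is None:
        t = np.mean(W[W > 0]) ** 2        # crude heat-kernel bandwidth guess
    S = np.where(W > 0, np.exp(-(W ** 2) / t), 0.0)

    d = S.sum(axis=1)                     # degrees
    L = np.diag(d) - S                    # graph Laplacian

    scores = np.empty(n_features)
    for r in range(n_features):
        f = X[:, r]
        # Remove the degree-weighted mean, as in the paper.
        f_tilde = f - (f @ d) / d.sum()
        num = f_tilde @ L @ f_tilde
        den = f_tilde @ (d * f_tilde)
        scores[r] = num / den if den > 0 else np.inf
    return scores

X = StandardScaler().fit_transform(load_wine().data)
scores = laplacian_score(X, n_neighbors=5)
print(np.argsort(scores))  # feature indices, best (lowest score) first
```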

[1] Krzanowski, W. J. (1987). Selection of Variables to Preserve Multivariate Data Structure, Using Principal Components. Applied Statistics, 36(1), 22. https://doi.org/10.2307/2347842
[2] Lu, Y., Cohen, I., Zhou, X. S., & Tian, Q. (2007). Feature selection using principal feature analysis. Proceedings of the 15th ACM International Conference on Multimedia.
[3] He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. Advances in Neural Information Processing Systems, 18, 507–514.
[4] Wu, X., & Cheng, Q. (2020). Fractal Autoencoders for Feature Selection. arXiv:2010.09430 [cs]. http://arxiv.org/abs/2010.09430
