I’ve started making a PCA-based utility to do some data inspection for correlation etc.
On the left there is the ‘scree’ plot, giving a visual indication of how much variance is accounted for by each of the PCs. In this one (which is the dataset you sent from your neural network adventures last week), we see PC1 is doing some good stuff, then a drop to PCs 2–4, before another drop (and then that very sudden drop about halfway along). So, according to this, with PCA you could probably get away with about half the dimensions.
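For reference, the scree plot comes from something like the sketch below. This is a minimal version, not the utility itself: `X` is a stand-in for the real feature matrix, and the standardisation step is an assumption on my part so that no single feature dominates the variance.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 76)          # placeholder for the actual dataset

# Standardise, then fit PCA and plot the fraction of variance per component
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Proportion of variance explained')
plt.title('Scree plot')
plt.show()
```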
On the right is a representation of how the features in the dataset are correlated with each other. If two features are correlated, they move in ‘phase’ and aren’t really adding new information. Likewise, if they’re anti-correlated, one inverts the other but, again, isn’t really adding any new information. Uncorrelated suggests that they move independently of each other, and each contributes its own information. Reading a matrix like this is a knack, but it’s much like the self-similarity plots for time series, except that here each axis isn’t time but the individual dimensions of the dataset.
I’ve coloured this as a heatmap: red = correlated, blue = anti-correlated, white = uncorrelated. You’d always expect to see a completely red stripe along the main diagonal, as this shows the correlation of a dimension with itself (which should be 1). The upper and lower triangles either side of the main diagonal should mirror each other. Ideally, then, what you want is lots of white and pale shades, indicating that your features are all doing useful work. Stronger colours indicate candidates for removal.
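The heatmap itself is nothing exotic; roughly the sketch below, again with `X` standing in for the real data and `coolwarm` as one colormap that happens to give the red/white/blue scheme described above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

X = np.random.rand(500, 76)          # placeholder for the actual dataset
corr = pd.DataFrame(X).corr()        # 76x76 Pearson correlation matrix

# Fix the colour scale to [-1, 1] so white really means 'uncorrelated'
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='correlation')
plt.xlabel('feature index')
plt.ylabel('feature index')
plt.show()
```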
In that plot there’s some interesting structure, and definitely the implication that not all of your 76 features are contributing usefully. You’ve got these stripes at regular intervals off the main diagonal where some features are very strongly (anti-)correlated with each other, and these seem to be spaced 1/4, 1/2 and 3/4 of the way through the features: presumably reflecting the start of blocks of particular stats or derivatives? Then you see this checker-board pattern, again dividing the space into four. The features in your 2nd chunk all seem to be highly correlated with each other, and quite anti-correlated with those in the 3rd chunk. The takeaway there is that maybe you don’t need all of those. Beyond that, it’s possibly a matter of looking over individual rows / cols more closely and considering whether certain strongly correlated dimensions can go or not. Automatic thresholding could help (a rough sketch of that follows below), but one still needs to choose what to keep.
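By automatic thresholding I mean something along these lines: for each pair of features whose absolute correlation exceeds some cutoff, flag one of the pair as a candidate for removal. This is only a sketch of that idea, and the 0.9 cutoff is an arbitrary number I’ve picked for illustration; choosing it (and choosing which of a correlated pair to keep) is still the manual part.

```python
import numpy as np
import pandas as pd

X = np.random.rand(500, 76)          # placeholder for the actual dataset
corr = pd.DataFrame(X).corr().abs()  # absolute correlations, 76x76

cutoff = 0.9                         # arbitrary example threshold

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair
# is considered once, then flag any column with a partner above the cutoff.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]

print(f'{len(to_drop)} of {corr.shape[1]} features exceed |r| > {cutoff}:', to_drop)
```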
Am very open to any other ideas on how to represent this information.