I finally took a little time to investigate this idea that I’ve thrown out enough times now…
Is it possible to use a Support Vector Machine to identify which features would be most useful? How can one do “dimensionality reduction” by just choosing some features and ignoring others (while trying to maintain as much of the “variance” / “predictive power” of the dataset as possible)?
And here’s the code I’m running (sorry it’s not really clean, but if anyone wants to poke at it, it’s available).
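The gist is roughly this (a minimal sketch, not the actual script; the cluster count, feature count, and function name are just placeholders): use KMeans to invent class labels, fit a linear SVM on them, and rank features by the magnitude of the SVM coefficients.

```python
# Rough sketch of the approach: KMeans pseudo-labels + linear SVM coefficient ranking.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def rank_features_with_svm(data, n_clusters=5, keep=10):
    scaled = StandardScaler().fit_transform(data)
    # the KMeans "hack": invent class labels so the SVM has something to separate
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(scaled)
    svm = LinearSVC(C=1.0, max_iter=10000).fit(scaled, labels)
    # one row of coefficients per class; sum the magnitudes across classes per feature
    importance = np.abs(svm.coef_).sum(axis=0)
    return np.argsort(importance)[::-1][:keep]  # indices of the "most useful" features

# Example: 200 points with 100 features, keep the 10 highest-ranked columns
X = np.random.rand(200, 100)
top = rank_features_with_svm(X, keep=10)
X_reduced = X[:, top]
```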
In short, yeah, it kinda works! Obviously accuracy drops as features are removed, but the strategy does seem to maintain decent accuracy with far fewer features.
- I noticed that sklearn has an SVM regressor (SVR), so I need to look into how that works and see whether it makes more sense to use, since (as you’ll see) I’m hacking a bit with the KMeans approach.
- It would be good to set this up so one can drop in a FluidDataSet in JSON format and tinker with it themselves.
- Test it on some actual sound stuff, as in through speakers and my ears.
- When testing the performance at the end of the script, try using an MLP Classifier so it’s not the same algorithm as the SVM doing the selecting (there’s a rough sketch of what that could look like below).
I’m planning to do these things at some point…
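For that last point, the performance check could look something like this (again a sketch, reusing `rank_features_with_svm` from the block further up; the MLP settings and dataset are arbitrary stand-ins): score the full and reduced feature sets with an MLPClassifier, so the judge isn’t the same SVM that picked the features.

```python
# Sketch of the evaluation idea: compare accuracy on all features vs the kept subset,
# using an MLP rather than the SVM that did the selecting. Labels are KMeans pseudo-labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X = np.random.rand(200, 100)                       # placeholder dataset
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
top = rank_features_with_svm(X, keep=10)           # function from the sketch above

for name, features in [("all 100 features", X), ("top 10 features", X[:, top])]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.3, random_state=0)
    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X_tr, y_tr)
    print(name, "accuracy:", round(mlp.score(X_te, y_te), 3))
```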
Also, I found this paper on the topic. It’s quite high level and abstract, but the bibliography is probably quite useful. If anything, reading over this makes me realize that trying to select 10 dimensions from 100 is, perhaps, just kind of trivial. 100 isn’t that many, and the predict function for most of our uses is quite fast (maybe it’s more relevant for a KDTree?). With just 100 features, a qualitative or intuition-based approach is probably fine. The strategies in this paper are more aimed at selecting from thousands of dimensions.
Those are some thoughts.