Intelligent Feature Selection (with SVM or PCA)

I finally took a little time to investigate this idea that I’ve thrown out enough times now…

Is it possible to use a Support Vector Machine to identify which features would be most useful? How can one do “dimensionality reduction” by just choosing some features and ignoring others (while trying to maintain as much of the “variance” / “predictive power” of the dataset as possible)?

I know that @rodrigo.constanzo has posed this a few times and I had suggested the Support Vector Machine (SVM) as a possible approach. Also, the other day @tremblap suggested the PCA strategy.

Here’s a little video I made (I’m not brave like @rodrigo.constanzo to toss everything on YouTube!).

And here’s the code I’m running (sorry it’s not really clean but if anyone wants to poke at it, it’s available).

In short, yeah, it kinda works! Obviously with fewer features it has less accuracy, but the strategy does seem to maintain decent accuracy with far fewer features.
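For anyone who'd rather not dig through the script, here's roughly the shape of the idea in sklearn. This is a minimal sketch, not the actual code from the video: KMeans invents some class labels (since the dataset has none), then a linear SVM is fit on them and its coefficients are one way of ranking the features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# X: (n_points, n_features) array, e.g. loaded from a FluidDataSet json.
# Placeholder random data here just so the sketch runs.
X = np.random.rand(500, 21)

# 1. standardize, then invent class labels with KMeans (the "hack")
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_std)

# 2. fit a linear SVM to predict those labels from the features
svm = LinearSVC(C=1.0, max_iter=10000).fit(X_std, labels)

# 3. rank features by the size of their coefficients across all classes
importance = np.abs(svm.coef_).sum(axis=0)
top10 = np.argsort(importance)[::-1][:10]
print("10 most useful feature indices:", top10)
```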

// TODO:

  1. I noticed that sklearn has an SVM Regressor so I need to look into how that works and see if that makes more sense to use, since (as you’ll see) I’m hacking a bit with the KMeans approach
  2. It would be good to make this so one can drop in a FluidDataSet in json format and tinker with it themselves.
  3. Test it on some sound stuff–as in through speakers and my ears.
  4. When testing the performance at the end of the script, try using an MLP Classifier so it’s not the same algorithm as the SVM.

I’m planning to do these things at some point…

Also, found this paper on the topic. It’s quite high-level and abstract but the bib is probably quite useful. If anything, reading over this makes me realize that trying to select 10 dimensions from 100 is, perhaps, just kind of trivial. 100 isn’t that many and the predict function for most of our uses is quite fast (maybe it’s more relevant for a KDTree?). With just 100 features a qualitative or intuition-based approach is probably fine. The strategies in this paper are more aimed at selecting from thousands of dimensions.

Those are some thoughts.

1 Like

Awesome!!

That’s super interesting in terms of how useful it appears to be (like, in a concrete/usable sense), but also in terms of what descriptors end up making the most sense (at the end). Some of it is somewhat intuitive, as you point out, like loudness/pitch, but then MFCC13 is a funky one to randomly be in there. I guess individual MFCCs could carry quite a bit of variance for a given corpus (and potentially not for others). It’s also potentially interesting for an LPT-type approach, where what carries the significance in each overall vector may not be intuitive.

Thanks for sharing the code too. I wonder how implementable something like that would be in a fully fluid.context~, although the PCA-based version approximates things quite well, and that would be manageable to implement given the current (native) tools.

Now that we’ve got a(n easy-to-use) visualizer for Max, some of these investigations will be easier. Like even testing the kmeans stuff, which I’ve not done much (any?) of on my own sounds.

Would also be interesting to run things on the classes I had made when trying to do this process here manually a while back (soo tedious…).

Yeah, the PCA version should be totally implementable with the current tools (perhaps I should take a swing at a SC version).

I was also thinking of finding a javascript implementation of an SVM and porting it to SC. Unfortunately I would be doing this in the language, so it would be slow; I don’t know if it’d be so slow as to be unusable, though.

What I really need is to get my head wrapped around how to interface this kind of stuff with the C++ scsynth… perhaps when the Fluid code base is all visible I can poke at it and see…

I guess with something like this, speed isn’t massively important as (presumably?) this is something you would deal with while creating the initial descriptor space, then ignore thereafter. Same goes for processing .json files, as that can happen in python or SC if it’s just a step in a dataset processing chain that’s independent of the language it’s then implemented in.

Oh yeah, in my excitement I forgot to respond to this bit.

I just chuck everything on as unlisted so it’s easier to post/share. I often struggle downloading longer vids off drive accounts (thanks for shrinking/handbrake-ing the shit out of your vid!), and youtube seems to be fairly ubiquitous/easy.

1 Like

It also occurred to me that this will be super useful for determining which of the more abstract descriptors are useful in these contexts. It’s semi-intuitive that loudness and MFCC13 may have a lot of “predictive power”, but when it gets to the standard deviation of the 2nd derivative of MFCC4, it gets a lot less sensible. And most of the testing I’ve done so far has been mashing together different combinations of these vectors to see what kind of worked, with no rhyme or reason as to which I chose.

1 Like

Yes, I agree. One thing that seems useful is the intuition that can hopefully come out of it around which descriptors keep popping up as important for navigating a multidimensional space.

Also, had a thought today about the PCA strategy if you were to go after it… instead of using only the first n principal components, maybe try using all of them but create a weighted sum, where the weights are the explained variance ratios, so that all are accounted for but the first few contribute more to the final summed curve. I know that the sklearn PCA has these values readily available, but it looks like FluidPCA doesn’t make them available…yet.
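With sklearn, that could look something like this (a minimal sketch with placeholder data; variable names are just for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 21)  # placeholder for the (points x features) dataset

# fit a full PCA on the standardized data, keeping every component
pca = PCA().fit(StandardScaler().fit_transform(X))

# weight each PC's (absolute) loadings by the proportion of variance it
# explains, then sum across PCs to get one importance value per feature
weights = pca.explained_variance_ratio_            # shape: (n_components,)
importance = np.abs(pca.components_).T @ weights   # shape: (n_features,)

print("features ranked by weighted PC contribution:", np.argsort(importance)[::-1])
```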

1 Like

Indeed. As it is all those numbers seem super duper arbitrary.

I tried building a thing with this now and it seems like I’ve run into a bug/problem, so will see what comes of that.

Hopefully this is not some native dict thing where you can’t view arrays or something awful like that.

And to make sure I understand, by this you mean taking each column from the array and summing them together (à la an iterative vexpr $f1 + $f2)?

Ok, I think I’ve built it in Max with @weefuzzy’s help in the other thread.

I think I’ve done the maths right (do correct me if I’m wrong). I was a bit confused as to what I was getting out of the dump-ing process (a 21x21 matrix in this case), but in the end I guess that’s how fluid.pca~ does its thing.

So I shove all the stuff into a coll then take out the number of vectors I’m interested in. By default it does 5 (the number of dimensions I asked for), but it can go higher as per @tedmoore’s suggestion. Interestingly, the results lean much more towards the MFCCs when I do that.

Once that process is done, I do some peak picking to find the most important features (in this particular dataset).

I’ve not yet built anything that verifies and/or plays back sounds based on this, and honestly didn’t think I’d get this far with this today, but wanted to post my results as it’s really interesting.

A next step will be to try this on a much larger descriptor space (150+) to see what it makes of stats/derivatives, as this is my “lean and mean” 21D space (20(19) MFCCs + loudness/pitch), as well as adding something to test the accuracy.

pca.zip (350.6 KB)

Yes, so this is basically what I was running into when I was summing (the abs of) all the principal component vectors–it started telling me things like MelBand 12 was the most important feature…and that seemed fishy to me. So, like you, I didn’t use all the PCs, just some of the most important ones to sum the vectors and then find the peaks.

Then I had the idea to do a weighted sum of the PC vectors where the weight is the explained variance ratio, so the first PC might have a weight of 0.2 or something but the last few PCs have weights of, like, 0.002, so basically nothing. This seems to give nice results. When I compared the first-few-PCs feature selection strategy with the SVM feature selection strategy, the SVM seemed to perform better, but now, with the weighted PCA approach, this one seems to perform better (mostly).

Also I’ll add that although the subvector that this strategy finds doesn’t perform as well as the original entire vector, it actually performs pretty similarly (slightly less well) when compared to using PCA as a dimensionality reduction method. The obvious benefit being that by retaining (some of) the original features there is some qualitative value there, whereas with PCA the dimensions it returns are not “human-readable”.

svm classifier score (whole vector): 0.7647875590113857
mlp classifier score (whole vector): 0.6986948069980561

svm classifier score (svm selected subvector): 0.52929741738406
svm classifier score (pca selected subvector): 0.5781727297972785

SCORES JUST DOING A PCA FOR DIM. RED. TO 10 FEATURES
svm classifier score (pca reduced space): 0.5398500416550958
svm classifier score (pca reduced space): 0.6712024437656207

TEST TO MAKE SURE IT IS DOING BETTER THAN RANDOMNESS
random subvector svm score: 0.19605665093029714
random subvector svm score: 0.22104970841432936
random subvector svm score: 0.07553457372951958
random subvector svm score: 0.1757845043043599
random subvector svm score: 0.11135795612329909
random subvector svm score: 0.22299361288530964
random subvector svm score: 0.10302693696195502
random subvector svm score: 0.2901971674534851
random subvector svm score: 0.2602054984726465
random subvector svm score: 0.15662316023326853

best random score: 0.2901971674534851
avg random score:  0.181282976950847
1 Like

Ok, I thought I had misunderstood what you meant by “weighted sum” here, and rewatched that bit of the video to help clarify, but then forgot about it as I ironed out other bugs I found.

Also had a quick look through the code, but my Python isn’t good enough to glean exactly what is happening.

So, in a verbose way, is what you mean here summing (the abs of) all the values in each PC, doing that for each PC, then figuring out the relative weight of those, and then using those weights on all the PCs again (so all the PCs will be there, but each will account for a scaled/weighted amount)?

So in my case with a 21d space, I have a 21x21 matrix like:

1 2 3...19 20 21
11 12 13...29 30 31
21 22 23...39 40 41

So I would sum the entire row (1 2 3...19 20 21) then divide each component by the total and rinse/repeat for each row?

Or do you mean something different?

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

On a more qualitative note, it’s interesting to see how practical/useful this approach is as a (potentially) more perceptually-relevant form of dimensionality reduction. I hope/anticipate that it will be really useful for picking the most useful/variant descriptors in the first place.

Probably the best way to answer your question is that we need to get the eigenvalues out of the PCA, but I believe that currently FluidPCA doesn’t make them available through ‘dump’. Is there a way to reverse engineer the eigenvalues from what does come out of ‘dump’? @groma @weefuzzy @tremblap

1 Like

Hiya, I’ve not read everything in this thread, but you have all the Eigen-doodads you need for this sort of thing, between the bases matrix (eigenvectors of the zero-meanified covariance) and the values array (square roots of the eigenvalues of same).

However, PCA is only of limited use for suggesting what ‘raw’ features might be useful in less linear algorithms, with very different assumptions. Just looking along the rows of the bases will give you a contribution of a feature to a PC, but that’s about all it can tell you. It’s quite common to plot these values for PC1 against PC2, and that gives you some indication of their relative contributions / directions against (presumably) the most variance-bearing components, but – again – there’s not a great deal you can infer from this: PCA’s assumption that the greatest variance in an input feature indicates greater importance is still just an assumption. Combined with the fact that it’s only a linear transform, it won’t be any use at all at indicating possible non-linear relationships that something else could pick up.

With all that said: looking at these numbers (especially the values) can give you some useful pointers about the structure of your data, and whether it’s worth pursuing other algorithms or just going and getting some better data. In particular, if you just visually inspect the values, they should have a distinctive shape if PCA is going to be much use: there should be a steepish gradient down, followed by an abrupt levelling out (called an ‘elbow’, I’m told), where the PCs from that point on contribute less to the overall variance. OTOH, if you see a linear slope, or even a straight exponential-ish one without the elbow, then the data isn’t showing (enough) linear structure for PCA to be able to help.
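For example, something like this (a quick sketch with made-up numbers; substitute whatever you parse out of the dump):

```python
import numpy as np

# made-up numbers, just to show the shape to look for; substitute the 'values'
# array from the dump (square roots of the eigenvalues, one per PC)
values = np.array([4.0, 2.6, 1.9, 0.5, 0.42, 0.38, 0.33, 0.3])

# square and normalize to get the proportion of variance carried by each PC
explained = values ** 2 / np.sum(values ** 2)

# crude text 'scree plot': a steep drop followed by a flat tail is the elbow
for i, e in enumerate(explained):
    print(f"PC{i + 1}: {e:.3f} " + "#" * int(round(e * 50)))
```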

1 Like

Hehe. I somehow understand this even less than when I didn’t understand it when @tedmoore mentioned eigenvalues in the previous post.

Although it’s all quite speculative, the results (as per @tedmoore’s video above) are really promising, and “better than random”, as someone mentioned in their Q&A this past weekend.

From the looks of it, an SVM gives better results than a PCA for this, though the general idea is the same: a kind of “figuring out which descriptors/stats may be worth using at all”, which is something I was struggling with in this thread, where I arbitrarily (and tediously) tested all kinds of permutations trying to arrive at a similar conclusion.

I just wanted to underline that all the ghastly entrails are there :smiley:

It’ll be better than random, yes, but probably prone to being pessimistic (if you use it as a basis to exclude features from other processes). What PCA is great for is telling you whether or not (and roughly where) your features are already correlated, and giving you a rotation of the space it was trained on that makes the most it can of those correlations in a linear setting.

SVMs are all about being able to tease out ways of getting round non-linear relationships. The neat trick they use is to project data into higher dimensions to try and untangle it. Say you want to classify something in 1D where Class A is all in the middle, and Class B is to either side of that. A straight threshold won’t do it, i.e. you can’t draw a straight line between the clumps that represents the classification you want.

By projecting into higher dimensions using a Cunning Trick, SVMs try to find a space where it is possible to draw a line (→ plane → hyperplane) between the classes in a way that works. It’s very cool. Unfortunately it scales quite badly with training set size, because the Cunning Trick involves taking the distance (or similar) between every single pair of training points.
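A toy version of that 1D example (just a sketch; in a real SVM the projection happens implicitly via the kernel rather than by hand):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# 1D data: Class A (label 0) in the middle, Class B (label 1) on either side
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = (np.abs(x.ravel()) > 1).astype(int)

# a single straight threshold in 1D tops out around 2/3 accuracy here
print("linear, 1D:", LinearSVC(max_iter=10000).fit(x, y).score(x, y))

# hand-made projection to (x, x^2): now a straight line separates the classes
x2 = np.hstack([x, x ** 2])
print("linear, projected:", LinearSVC(max_iter=10000).fit(x2, y).score(x2, y))

# an RBF-kernel SVC does this kind of projection implicitly (score ~1.0)
print("rbf kernel, 1D:", SVC(kernel="rbf").fit(x, y).score(x, y))
```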

1 Like

I’ll have a look with a clear head tomorrow.

Hmm, how poorly does it scale up? The use case here is being applied to descriptors (rows) in the dataset no? So like, 300ish tops?

Or does the SVM also get run across the whole dataset to do this cunning stuff?

I should (probably) revisit/retest this stuff, but in my initial experiments a while back, when optimizing for speed (obviously!), I found that having a smaller number of “natural” dimensions was faster (and not significantly less accurate) than having a higher-dimensional space which is then run through some other process on the way (i.e. PCA).

Granted, that may change if I end up implementing some kind of LPT structure, but given that I don’t really care about pitch (or more importantly, it’s not a very useful descriptor when being run on 256 samples), it may be useful to massage stuff down to equal numbers of descriptors per “thing” that I’m interested in (e.g. loudness, timbre, morphology).

All of that is to say that this general exploration is meant to whittle down which (natural) descriptors are the most significant, without transforming them first. There are also other knock-on/useful effects of this (e.g. having human-readable/perceptual dimensions left at the end), but that’s kind of gravy for my purposes.

Yes, it gets run across everything: it’s just another kind of supervised learning algorithm, like the MLP. As for numbers, it’s a bit how-long-is-a-piece-of-string, I’m afraid. All one can say is that (for training, particularly), it’s going to start huffing noticeably as the dataset gets bigger, and the benefits of its Cunning Trick start to dissipate. How much depends on All The Many Things (but, anyway, 3000 isn’t huge; and I’m presuming this validation work is offline in any case…)

Which isn’t to say don’t play with it (there’s one in Wekinator, if Python isn’t your bag).

Fewer things will always be quicker, but at the cost of generalisability, I suppose. Perhaps finding small families of very highly curated low level stuff that work well with known territories that you can flip between will work well for you (so a multitude of very small, very tailored models). There are no ‘natural’ descriptors though: even those ones that lay claim to representing some perceptual attribute directly (like pitch) are engaging in all manner of abstract shenanigans, and their truth claims are hugely contingent on being given amenable material to analyse (but I know what you’re getting at).

1 Like

256 samples obviously. Or occasionally 4410.

Yeah totally. This will be an offline, pre-processing step to figure out what descriptors to focus on in the first place.

Cool, thank you @weefuzzy, and thank you for the analysis!

Yes, like I said, it does seem to be better than random, which intuitively makes some sense. Also, I was testing it on just one task: classifying categories that were created by KMeans, so it would be good to test it on more varied tasks and more datasets to see if it holds up in a useful way.

Another point to make, which I made above somewhere, is that this would really only be useful if it’s necessary to keep raw features (which I know is what @rodrigo.constanzo is after, I’m just dropping the note here for posterity). Transforming just one point (incoming data) with PCA is really just a handful of calculations, so it should be quite fast! And in my tests, the first few PCs have always outperformed the same number of “selected raw features”, which also intuitively makes sense. So if one is already running a PCA on the data, maybe just use the PCA transform? But then again, the dimensions become non-human-readable…tradeoffs.
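To be concrete, the transform of one incoming point is just a subtraction and one small matrix-vector multiply. A sketch with placeholder arrays (FluidPCA’s exact conventions may differ):

```python
import numpy as np

# placeholder arrays standing in for a PCA fit on 21 features, keeping 5 PCs
mean = np.random.rand(21)      # per-feature means learned during the fit
bases = np.random.rand(5, 21)  # one row per principal component

def pca_transform_point(point, mean, bases):
    # centre the point, then project it onto the components
    return bases @ (point - mean)

incoming = np.random.rand(21)  # a single new descriptor frame
print(pca_transform_point(incoming, mean, bases))  # 5 reduced values
```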

@rodrigo.constanzo, would be curious to hear how much latency is added by having a PCA in the pipeline.

1 Like

I’m organizing and sifting through some older patches, but a while back I manually made 34 classes, along with 5 (different) examples of each, for testing purposes. I did it via random playback and counting stuff, so it would take a bit of time to process, but setting up a better way is obviously possible.

It’s just a matter of consolidating stuff. Also figuring out and creating labels for what is in each point of the vector (e.g. mean of 1st deriv of MFCC3, etc…), so I can parse the results that come back.

Similarly, I need to go back and test, but when I first set this up I remember coming to the conclusion that doing it “raw” was enough of an improvement to be worth embarking on this process of figuring out what’s best to keep. I should revisit that assessment though and see exactly what we’re talking about.

With PCA specifically, I always got significantly worse results afterwards though (like 20-30% worse, consistently), even if I kept a fairly high number of PCs (20ish for an 80+ point vector), which is another reason I ruled that approach out.

1 Like

This is the relevant bit from my testing before that I found in this thread:

As compared to this:

This was pretty aggressive PCA-ing as you can see. I don’t know why I stopped at 9d as it was going up from there (say vs 20d), but I think I was comparing 196d->9d vs “raw” 21d or something along those lines. So going from 21d->20d doesn’t make much sense.