Tips for dimension reduction (Statisticians hate this guy)

Apologies for the clickbait-y title of the post I’m going to link, but I found it really useful for learning more about how to choose a dimension reduction algorithm and curate its parameters. I found it particularly useful as I was interrogating what I thought was a sane number of components to reduce to.

@spluta and @rodrigo.constanzo, you seem to be heavily in this space right now, using PCA on big data sets to search through, so perhaps you will find it especially useful!

https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006907

Awesome, I’ll have a more detailed read through this.

Don’t know if this is useful across the board, but this chart:

Reminded me of fluid.bufnndsvd~, so I wonder if there is (or will be) a way to know how many dimensions are “sufficient” with any given dataset.
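
One common rule of thumb for "sufficient" with PCA is to keep the smallest number of components whose cumulative explained variance passes a threshold. A minimal sketch in Python, assuming you already have the per-component variances from somewhere. The numbers and the threshold here are made up for illustration, and `components_needed` is a hypothetical helper, not a FluCoMa function:

```python
from itertools import accumulate


def components_needed(variances, threshold=0.95):
    """Return the smallest k whose top-k variances reach `threshold` of the total."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("variances must sum to a positive number")
    # Largest components first, as PCA would order them.
    ordered = sorted(variances, reverse=True)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= threshold:
            return k
    # Fallback for rounding: keep everything.
    return len(ordered)


# Hypothetical variances for a 5-dimensional analysis.
example_variances = [5.0, 3.0, 1.0, 0.5, 0.5]
print(components_needed(example_variances, threshold=0.85))
```

The threshold is still a judgment call, so this only turns "how many dimensions" into "how much variance am I willing to lose".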

I naively take 10% of my points as my lower limit and work upwards if it’s garbage. Today I’ve had great success going from 273 values to 1 though. The one dimension that all the samples sit on is muddy, but 0.1 versus 0.9 are two distinct points, which is quite interesting.

What are you using to assess this?
Like matching, or looking at some kind of 2d representation, or some kind of difference-ness metric?

this could be good to talk about later today indeed.

How good whatever happens afterwards is :stuck_out_tongue:

For clustering it works quite well, but who knows for other applications not yet imagined or implemented.

Yes please :slight_smile: I remember hearing about some rules of thumb for DR that depend on how many vectors you have and how many samples per vector you’ve got too.

this kind of wisdom is from @groma and @weefuzzy though…

But by this, do you mean “listening” or whatever? Like a subjective measure of effectiveness, or is it getting x amount of clusters, or something numerical/“concrete” like that?

I’ve been wondering about this too, to see if/how DR stuff works on time series of events (based on all the recent AudioGuide discussions). Like, instead of putting a statistical summary into a fluid.dataset~ and then reducing that down to x amount of dimensions, putting every frame of the analysis into a fluid.dataset~ and seeing if it can/does somehow summarize change-over-time-ness, as a way of getting better morphological matches (in a kdtree context).

I had a suspicion that that’s where the wisdom came from :slight_smile: If only I could remember the details of the advice.

This Simple Trick is Driving Statisticians in Valparaiso, IN Crazy (BIG PAYDAY!!!)

I audition the queries in Max using a dict object to load my outputs. I generally loop the samples and just iterate over the clusters to see how homogeneous they are.
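
If a number alongside the listening would help, one simple option is the average distance of each cluster's points to that cluster's centroid (smaller means tighter). This is only a sketch of one possible metric, not what is described above: `cluster_spread` is a hypothetical helper, and the points and labels are invented.

```python
from collections import defaultdict
from math import dist
from statistics import fmean


def cluster_spread(points, labels):
    """Map each cluster label to the mean distance of its points from the centroid."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    spread = {}
    for label, members in groups.items():
        # Centroid = per-dimension mean of the cluster's members.
        centroid = [fmean(column) for column in zip(*members)]
        spread[label] = fmean([dist(member, centroid) for member in members])
    return spread


# Invented 2D points: cluster 0 is loose, cluster 1 is two identical points.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]
print(cluster_spread(points, labels))
```

A tight number does not guarantee the cluster sounds homogeneous, so this would complement auditioning rather than replace it.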

Have people compared using PCA, LDA and ICA? In my data world experience they do provide different results, which is expected since they’re maximizing different things.

I’d be curious to see how this translates to audio but not sure if anything else is implemented in flucoma or max besides pca.

There is multidimensional scaling also in the fluid.mds~ object

and there is a cool paper we published here where we discuss the various affordances of such differences in interface building :slight_smile:

Going to start my experiments with this stuff soon, as I’m going to do some MFCC testing, and the dimensions required balloon up really fast.

I’ll read through the paper, and do some playing, but I’m curious/wondering how to retain a perceptually meaningful set of dimensions at the end? Like, loudness (and its related statistics) is probably more significant than a single MFCC band, but if I understand things correctly, they would both have the same amount of “weight” in any dimensionality reduction context.
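
On the "weight" question: variance-based methods like PCA are sensitive to the scale of each column, so one way to bias the result is to standardize every descriptor and then multiply the ones you care about by a larger factor before reducing. A minimal sketch, assuming rows of descriptors. `standardize_and_weight` is a hypothetical helper, and the weights are arbitrary example values, not recommendations:

```python
from statistics import fmean, pstdev


def standardize_and_weight(rows, weights):
    """Z-score each column, then multiply it by its weight."""
    columns = list(zip(*rows))
    if len(columns) != len(weights):
        raise ValueError("need exactly one weight per column")

    scaled_columns = []
    for column, weight in zip(columns, weights):
        mean = fmean(column)
        std = pstdev(column)
        if std == 0:
            # A constant column carries no information; zero it out.
            scaled = [0.0] * len(column)
        else:
            scaled = [weight * (value - mean) / std for value in column]
        scaled_columns.append(scaled)

    # Back to one list per row.
    return [list(row) for row in zip(*scaled_columns)]


# Invented data: column 0 is loudness in dB, column 1 is one MFCC value.
rows = [[-30.0, 1.0], [-10.0, 3.0]]
print(standardize_and_weight(rows, weights=[2.0, 1.0]))
```

After this, loudness has twice the spread of the MFCC column, so a variance-maximizing reduction will favour it. How much extra weight is perceptually right is still something to find by ear.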

What I plan to do, when I have time to actually use the tools, is to take the APT patch, make a 1D redux of MFCC1-13 (12 to 1), and scale that into the same range (0-110) as my tolerance in dB and midifloats… but feel free to go there and share. I just noticed that this is in the wrong part of the forum, so sorry @bafonso, there are tools that are still not public… they should be in early 2021, or maybe a bit earlier if you want, when we get to the beta of what is currently alpha…
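
The rescaling step in that plan (mapping a 1D reduction into 0-110) could be sketched as a plain min-max scale. This is an assumption about what "scale into the same range" means here. `rescale` is a hypothetical helper and the input values are invented:

```python
def rescale(values, new_min=0.0, new_max=110.0):
    """Linearly map `values` so their minimum hits new_min and maximum hits new_max."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: put them in the middle of the target range.
        middle = (new_min + new_max) / 2
        return [middle] * len(values)
    span = new_max - new_min
    return [new_min + span * (value - low) / (high - low) for value in values]


# Invented output of a 1D reduction.
reduced = [-1.0, 0.0, 1.0]
print(rescale(reduced))
```

Min-max scaling is sensitive to outliers, since a single extreme point squashes everything else into a narrow band, so clipping or a percentile-based range may be worth trying if the reduced dimension has a long tail.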

I’d be happy to test/play with beta or alpha versions of things to come and give you feedback :slight_smile:

ok I’ll put you on the list for the beta. Thanks!
