Ways to test the validity/usefulness/salience of your data

Out of discussions with @weefuzzy, @tremblap, and @tedmoore, both on the forum and in geekouts, has come this idea of trying to boil the data down to a small number of useful and descriptive data points.

Among the techniques discussed so far are using PCA/SVD to determine which descriptors in a dataset account for the most variance, comparing standardized/normalized MFCCs against raw ones (as per @weefuzzy’s suggestion) to see if there is “noise” in the higher coefficients, or just qualitatively poking at / listening to the clustering after each step of plotting.

Some of these are quite useful, and others I’m going to play with a bit more, but I want to know if there’s a better (automatic/automagic/programmatic) way to go about verifying the data.

Up to this point, my understanding of the working paradigm is to “shove a bunch of stuff in” and then let The Algorithm™ (be it PCA, UMAP, MLP, etc…) “find the important stuff” for you. And that has worked up to a point. But now there are differences between the number of MFCC coefficients, including/grouping/scaling different descriptors, the amount of “noise” being introduced at various steps of the process, etc… that kind of complicate the approach of collecting a ton of stuff and letting it get sorted out.

So what are some other approaches and/or workflows for optimizing stuff like this? (short of manually testing every permutation, which can be tedious, slow, and ineffective).

I’ve started making a PCA-based utility to do some data inspection for correlation etc.

On the left there is the ‘scree’ plot, giving a visual indication of how much variance is accounted for by the PCs. In this case (which is the dataset you sent from your neural network adventures last week), we see PC1 is doing some good stuff, then a drop, then PCs 2–4, before another drop (and then that very sudden drop about halfway over). So, according to this, with PCA you could probably use half the dimensions.
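For anyone wanting to reproduce that kind of scree reading in code, here is a minimal sketch using scikit-learn. The data here is random stand-in data (the actual 76-D dataset from the thread isn’t reproduced), and the 95% variance cutoff is just one common rule of thumb, not the threshold used above:

```python
# Scree-style reading of PCA explained variance (synthetic stand-in data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # 500 points, 10 features (stand-in)
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=500)  # inject redundancy

pca = PCA().fit(X)
var = pca.explained_variance_ratio_       # this is what a scree plot shows
cumulative = np.cumsum(var)

# How many PCs cover, say, 95% of the variance?
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(k, "components cover 95% of the variance")
```

Plotting `var` against component index gives the scree plot itself; the “drops” described above show up as steps in that curve.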

On the right is a representation of how the features in the dataset are correlated with each other. If two features are correlated, then it means they move in ‘phase’ and aren’t really adding new information. Likewise, if they’re anti-correlated, one inverts the other but, again, there isn’t really any new information. Uncorrelated suggests that they move independently of each other, and each contributes its own information. Reading a matrix like this is a knack, but it is like self-similarity plots for time series, except that here the axes aren’t time but the individual dimensions of the dataset.

I’ve coloured this as a heatmap: red = correlated, blue = anti-correlated, white = uncorrelated. You’d always expect to see a completely red stripe along the main diagonal, as this shows the correlation of a dimension with itself (which should be 1). Then the upper and lower triangles either side of the main diagonal should mirror each other. Ideally, then, what you want is lots of white and pale colour, indicating that your features are all doing useful work. Stronger colours indicate candidates for removal.
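A sketch of how such a matrix can be computed: `np.corrcoef` gives the feature-by-feature correlation matrix (+1 = correlated, −1 = anti-correlated, 0 = uncorrelated), which is what the heatmap colours. The data here is synthetic, just to show the structure described above:

```python
# Feature-by-feature correlation matrix (the data behind such a heatmap).
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=300)
features = np.stack([a,                       # feature 0
                     -a,                      # feature 1: anti-correlated with 0
                     rng.normal(size=300)])   # feature 2: independent
corr = np.corrcoef(features)                  # 3x3 matrix, rows = features

print(np.round(corr, 2))
# the main diagonal is all 1.0 (each feature vs itself),
# and the matrix is symmetric about that diagonal
```

Passing `corr` to any image/heatmap plotter with a diverging red–white–blue colour map reproduces the kind of picture described above.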

In that plot there’s some interesting structure, and definitely the implication that not all your 76 features are contributing usefully. You’ve got these stripes at regular intervals off the main diagonal where some features are very strongly (anti-)correlated with each other, and these seem to be spaced 1/4, 1/2 and 3/4 of the way through the features: presumably reflecting the start of blocks of particular stats or derivatives? Then you see this checkerboard pattern, again dividing the space into four. The features in your 2nd chunk all seem to be highly correlated with each other, and quite anti-correlated with those in the 3rd chunk. The takeaway there is that maybe you don’t need all of those. Beyond that, it’s possibly a matter of then looking over individual rows/columns closer up and considering whether certain strongly correlated dimensions can go or not. Automatic thresholding could help, but one still needs to choose what to keep.
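One shape the “automatic thresholding” idea could take is a greedy pass that drops any feature whose absolute correlation with an already-kept feature exceeds a threshold. The threshold value and the left-to-right greedy order below are assumptions for illustration, not the thread’s actual tooling (and, as noted, they still don’t decide *which* of a correlated pair you’d rather keep):

```python
# Greedy removal of highly correlated features (one plausible sketch).
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Return the column indices of X to keep, scanning left to right."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # keep column j only if it isn't (anti-)correlated past the
        # threshold with any column we've already decided to keep
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(2)
base = rng.normal(size=(200, 3))
# column 3 duplicates column 0; column 4 is its negation
X = np.column_stack([base, base[:, 0], -base[:, 0]])
print(drop_correlated(X))  # the two redundant columns should go
```

Using `np.abs` means anti-correlated pairs are treated the same as correlated ones, matching the point above that neither adds new information.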

Am very open to any other ideas on how to represent this information.


Man, that is suuuper useful and interesting!

I don’t completely follow some of the explanation there, but it would be great to have some kind of frontloaded meta-analysis thing where, given a corpus, you can see (and more usefully, get a list of) the most salient descriptors and statistics, such that you can then choose to use only those. Like, point it at a corpus and say you want “50 dimensions”, and it gives you back a list of the most useful descriptors/stats given those constraints.
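A sketch of what that “ask for N dimensions, get back the most salient features” selector could look like: rank features by how strongly they load onto the leading principal components, weighted by each component’s share of variance. To be clear, everything here is hypothetical — the ranking criterion, the function name, and the workflow are assumptions, not existing FluCoMa functionality:

```python
# Hypothetical feature selector: rank features by variance-weighted
# |PCA loadings| and return the n_keep highest-scoring column indices.
import numpy as np
from sklearn.decomposition import PCA

def top_features(X, n_keep):
    pca = PCA().fit(X)
    # components_ is (n_components, n_features); weight each component's
    # loadings by its explained-variance ratio, sum per feature
    scores = np.abs(pca.components_).T @ pca.explained_variance_ratio_
    return sorted(np.argsort(scores)[::-1][:n_keep].tolist())

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6))
X[:, 5] *= 0.01            # feature 5 carries almost no variance
print(top_features(X, 3))  # feature 5 should never make the cut
```

As noted below, a ranking like this only reflects the corpus itself, not how you want to navigate it, so it would be a starting point rather than an answer.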

This wouldn’t be universally useful, as a corpus, and how I may want to navigate it, are not exactly the same thing. So a corpus of synth blips may have one set of descriptors/stats that best represents it, but those may have very little in common with the input (or general manner) in which I want to navigate it.

But for the use case where I have a finite/known “input” (in my case, prepared snare/drums/percussion), it could take the guesswork out of running various combinations of descriptors/stats manually to kind of guesstimate what is “working”.


As a follow-up/tangential question: given this kind of interrogation, how much pre-picking is worth doing vs amassing stuff and letting The Algorithm™ sort it out for you? As in, if I want 12D overall, should I go through all of this stuff and pick those perfect 12D, or should I get a load and then reduce it down to 12D, or should I get a load, then pick the 50 most salient features, then reduce those down to 12D, etc…

I was thinking about this today with regard to descriptors that are correlated but perhaps still somewhat useful. I’m specifically thinking of a min/mean/max combo, where they will always have some kind of relationship (at minimum, the mean will be between the other two), so they will always move somewhat in phase with each other, but the difference between them may be meaningful.

I suppose something like standard deviation may be a better statistic in that you can get a sense of min-ness and max-ness, and the std would likely/often be a number that moves independently of the mean.
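That intuition can be checked on toy data: per-segment min/mean/max of a signal tend to be strongly correlated with each other, while the standard deviation moves much more independently of the mean. This is purely synthetic data (random “segments” with independent level and spread), just to illustrate the statistics, not a claim about any real corpus:

```python
# Toy check: min/mean/max correlate strongly; std is ~independent of mean.
import numpy as np

rng = np.random.default_rng(4)
stats = []
for _ in range(500):
    level = rng.uniform(-2, 2)            # per-segment offset ("loudness")
    spread = rng.uniform(0.1, 0.5)        # independent per-segment variability
    frame = level + spread * rng.normal(size=64)
    stats.append([frame.min(), frame.mean(), frame.max(), frame.std()])
mn, mean, mx, std = np.array(stats).T

print("corr(min, mean):", round(np.corrcoef(mn, mean)[0, 1], 2))   # high
print("corr(mean, std):", round(np.corrcoef(mean, std)[0, 1], 2))  # near 0
```

On this construction the min/mean correlation comes out very high while mean/std sits near zero, which is in line with the suggestion that std carries more independent information than min or max alongside the mean.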

Additionally, although it’s not useful/meaningful for my tiny analysis frames, having the min/max for a longer sound file may be useful information to have in the number soup, even if it is correlated with the mean.

In general I guess it’s hard to know whether to be going for maximum statistical coverage or musically(/conceptually) meaningful descriptors. I suppose the sweet spot would be where those things overlap which circles back to the initial purpose of this thread.

Not necessarily. This would be true for a series that was completely stationary (has no amplitude or frequency modulation), and where the statistics are gathered across a uniform sampling interval, but I think it wouldn’t obtain once those two conditions are absent (e.g. for highly non-stationary sounds like drums, segmented and summarised across varying time-spans).

It’s also important to bear in mind that correlation is a matter of degree, not a yes/no thing. The point being that if two features are almost completely correlated or anti-correlated, then you can be pretty sure that they are adding nothing useful to any later efforts to discover structure in the data, irrespective of whether the quantities they came from made some perceptual sense to start with. Or it could be an indication that the process of generating the data in the first place isn’t working quite as expected.

I can’t remember how your 76-d points are structured, but the diagonal stripes in the above suggest that there’s a periodic pattern of very significant correlation every 19 dimensions, repeated three times, so that feature numbers 19, 38 and 57 are candidates for removal.
