Ways to test the validity/usefulness/salience of your data

Not necessarily. This would be true for a series that is completely stationary (no amplitude or frequency modulation), and where the statistics are gathered across a uniform sampling interval, but I don’t think it holds once either of those conditions is absent (e.g. for highly non-stationary sounds like drums, segmented and summarised across varying time-spans).

It’s also important to bear in mind that correlation is a matter of degree, not a yes/no thing. The point is that if two features are almost completely correlated or anti-correlated, then you can be pretty sure that one of them is adding nothing useful to any later efforts to discover structure in the data, irrespective of whether the quantities they came from made some perceptual sense to start with. Or it could be an indication that the process of generating the data in the first place isn’t working quite as expected.
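As a rough illustration of treating correlation as a matter of degree, here is a minimal sketch in plain Python that computes the Pearson correlation for every pair of feature columns and flags the pairs whose absolute value is above a cut-off. The function names, the toy `points` data and the 0.95 threshold are all assumptions of mine for the sake of the example, not anything from your patch or dataset; the threshold in particular is a judgment call.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        # A constant column has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / (sx * sy)

def redundant_pairs(points, threshold=0.95):
    """Return (i, j, r) for feature pairs with |r| >= threshold.

    `points` is a list of rows, one row per data point.
    The threshold is an arbitrary illustrative choice.
    """
    columns = list(zip(*points))  # transpose rows into feature columns
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = pearson(columns[i], columns[j])
            if abs(r) >= threshold:
                pairs.append((i, j, r))
    return pairs

# Toy data (hypothetical): feature 1 is feature 0 doubled,
# feature 2 runs in the opposite direction, feature 3 is unrelated.
points = [
    [1.0, 2.0, 5.0, 0.3],
    [2.0, 4.0, 4.0, 0.9],
    [3.0, 6.0, 3.0, 0.1],
    [4.0, 8.0, 2.0, 0.7],
]
flagged = redundant_pairs(points)
for i, j, r in flagged:
    print(f"features {i} and {j}: r = {r:+.3f}")
```

Anti-correlated pairs get flagged too, since the check uses the absolute value. Which member of a flagged pair to drop is still up to you.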

I can’t remember how your 76-d points are structured, but the diagonal stripes in the above suggest that there’s a periodic pattern of very significant correlation every 19 dimensions, repeated three times, so that feature numbers 19, 38 and 57 are candidates for removal.
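If you want to check a stripe pattern like that numerically rather than by eye, one option is to average the absolute correlation along each off-diagonal of the correlation matrix and see which offsets stand out. The sketch below does that in plain Python on a small made-up 8×8 matrix with a stripe every 4 dimensions; the function name and the toy matrix are hypothetical, and with your 76-d data you would expect the peaks at offsets 19, 38 and 57 if the reading of the stripes is right.

```python
def mean_abs_corr_by_offset(corr):
    """Mean |r| along each off-diagonal of a square correlation matrix.

    Returns a dict mapping offset k (distance from the main diagonal)
    to the average absolute correlation between features i and i + k.
    """
    n = len(corr)
    profile = {}
    for k in range(1, n):
        values = [abs(corr[i][i + k]) for i in range(n - k)]
        profile[k] = sum(values) / len(values)
    return profile

# Toy correlation matrix (hypothetical): strong correlation whenever
# two feature indices differ by a multiple of `period`, weak otherwise.
n, period = 8, 4
corr = [
    [1.0 if i == j else (0.9 if (j - i) % period == 0 else 0.1)
     for j in range(n)]
    for i in range(n)
]

profile = mean_abs_corr_by_offset(corr)
strongest = max(profile, key=profile.get)
print("offset with the strongest average correlation:", strongest)
```

A clear peak at a single offset (and its multiples) is what diagonal stripes look like in numbers, and it tells you which blocks of features are duplicating each other.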
