Questions on normalization order in a PCA -> KDTree pipeline (Audio Matching)

Hi everyone,

I am currently working on a project where I analyze a corpus of environmental recordings using various FluCoMa audio descriptors. Since I am combining different features (MFCCs, Spectral Descriptors, and Loudness), my initial data features have completely different scales.

My goal is to use Principal Component Analysis (PCA) to reduce the dimensionality of this dataset, and then use a KDTree (fluid.kdtree~) to perform a K-Nearest Neighbor query. I want to input a new audio file (a recording of a musical performance) and find the environmental soundscape from my corpus that is closest to it.

While studying the FluCoMa help files and examples, I noticed that the pipeline often feeds the dataset directly into fluid.pca~ and applies fluid.normalize~ only after the PCA object.

As far as I understand the theory behind PCA, the algorithm is highly sensitive to the scale of the input data. If features aren’t normalized beforehand, variables with larger numerical ranges (like Spectral Centroid in Hz) will dominate the variance calculation over smaller ones (like Loudness or linear amplitude).
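
A minimal sketch of this effect, in plain Python rather than FluCoMa. The descriptor values are made up for illustration, and the share of total variance is used only as a rough proxy for what PCA would pick up first, since its leading components follow the directions of greatest variance.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical descriptor values for five sound slices (made up for illustration).
centroid_hz = [850.0, 1200.0, 3100.0, 5400.0, 2300.0]
loudness_db = [-42.0, -18.0, -30.0, -25.0, -36.0]

def variance_share(columns):
    """Fraction of the total variance contributed by each column."""
    variances = [pvariance(col) for col in columns]
    total = sum(variances)
    return [v / total for v in variances]

def standardize(col):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    m, s = mean(col), pstdev(col)
    return [(x - m) / s for x in col]

raw_share = variance_share([centroid_hz, loudness_db])
std_share = variance_share([standardize(centroid_hz), standardize(loudness_db)])

print("raw share (centroid, loudness):", raw_share)
print("standardized share (centroid, loudness):", std_share)
```

With these numbers the centroid column holds almost all of the raw variance, while after standardization each column contributes half.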

Given my specific pipeline, I have a couple of questions:

  1. Why do the examples usually perform normalization only after the PCA? Is there an internal rescaling happening inside fluid.pca~?

  2. In my case, since I am mixing descriptors with drastically different units and scales, would it be a better practice to perform a double normalization? Specifically: standardizing/normalizing the dataset before PCA (to give equal weight to all features) and then normalizing the coordinates after PCA (to scale them nicely for the KDTree query)?

I would love to hear your insights on the best practice for this specific workflow.

Thanks in advance for your help!

These examples are probably using MFCCs which are “ok” to use without scaling. It is also ok to scale MFCCs before applying dimensionality reduction. (Sometimes it’s useful to try both ways and assess which sounds better!)

This is probably because it’s being sent to the fluid.plotter next and common practice is to normalize it into the plotter’s native range (zero to one on each axis).

Very true.

No, I don’t think there is.

For plotting purposes as described above.

If you have descriptors with drastically different scales, and you want all the descriptors to be equally relevant in the measurements, then yes, you should scale before PCA.

As described, normalizing afterwards is normally for plotting purposes.

If you’re using the KDTree in correspondence with the mouse, then you do need to fit the KDTree on the normalized dataset (that is being seen on the plotter) for it to make sense. But keep in mind that this is likely distorting the dimensionality reduction space created by PCA. For example the 2 PCs that it keeps might form a “cigar” shape, which could be squashed into a circle for the square plotter, so the distances might not reflect “similarity” as truly as one might assume.
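
A small sketch of the "cigar" distortion, in plain Python with made-up 2D coordinates standing in for two principal components. The brute-force `nearest` helper is a hypothetical stand-in for a KDTree lookup, and the min-max helpers are written by hand rather than taken from any FluCoMa object.

```python
import math

# Hypothetical points in PCA space: PC1 spans -10..10, PC2 only -1..1.
points = {
    "left": (-10.0, 0.0), "right": (10.0, 0.0),
    "low": (0.0, -1.0), "high": (0.0, 1.0),
    "a": (1.0, 0.0), "b": (0.0, 0.8),
}
query = (0.0, 0.0)

def nearest(pts, q):
    """Brute-force nearest neighbour by Euclidean distance."""
    return min(pts, key=lambda name: math.dist(pts[name], q))

def fit_minmax(pts):
    """Per-axis (min, max) pairs, as a normalizer would learn them."""
    return [(min(col), max(col)) for col in zip(*pts.values())]

def apply_minmax(p, ranges):
    """Map each axis independently into 0..1."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(p, ranges))

ranges = fit_minmax(points)
squashed = {name: apply_minmax(p, ranges) for name, p in points.items()}

raw_match = nearest(points, query)
squashed_match = nearest(squashed, apply_minmax(query, ranges))
print(raw_match, squashed_match)
```

Here the nearest neighbour of the query changes from "b" to "a" once each axis is stretched to the same 0..1 range, because distances along the narrow axis get inflated relative to the wide one.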

I usually try to not get too far in the weeds on this stuff. I try to set up my patch to be able to try a few different scalers, dimensionality reduction algorithms, etc. and see which combination feels most musically useful.

Hi Ted,

Thank you so much for this incredibly clear and insightful answer! Understanding that the post-PCA normalization in the examples is purely for fluid.plotter compatibility clears up a lot of my doubts.

In my project, the goal is not to reduce everything to just 2 dimensions for visual plotting. My real objective is to find the minimum number of significant dimensions needed to represent the dataset accurately, specifically to avoid the curse of dimensionality later in the process, while keeping the data as compact and clean as possible for the fluid.kdtree~. I am focusing on PCA because my priority is to keep the computation as lightweight and fast as possible. From what I understand, compared to non-linear reduction algorithms, PCA is computationally the most efficient for this task.

I will combine all my different descriptors (MFCCs, Spectral, Loudness, Pitch confidence) and normalize them before PCA to balance their weight. Given this scenario, I would like to double-check a few things with you:

  1. Just to confirm, since I don’t have to map the data to a 2D plotter/mouse grid, applying a further normalization after the PCA is not useful or necessary at all for the fluid.kdtree~ algorithm to work correctly?

  2. In general, do you think that inspecting the "values" array inside the PCA dictionary to evaluate the variance of each Principal Component, and then discarding the non-significant ones, is a solid and standard methodology for optimizing the subsequent KNN query?

As a further development I would also like to implement a sort of “novelty/difference detector” between consecutive snapshots of a live musical performance. The goal is to evaluate if there is perceptual coherence or a significant difference between snapshots, determining whether the performance is static or evolving over time. Could variance information from the PCA algorithm (ideally using the same pool of descriptors) be useful for achieving this? I am still clarifying my ideas on this specific implementation, so any advice on which direction to take would be extremely helpful.

Thanks again for your time and guidance!

This is wise and PCA is a good choice because the lower order PCs (which are the ones used for dimensionality reduction) store more of the variance, so by using PCA you’re able to marginalize some of the redundancy in the dataset and therefore focus the comparisons on more meaningful dimensions.

The question always ends up being, how many PCs should be kept? One approach is to aim to keep n% of the variance (90-95% is common). Sometimes this can reduce a huge number of dimensions, sometimes not.

Another approach is to look at the explained variance of the PCs on the scree plot and look for the “elbow” at which point the PCs become sort of “all the same”. After that point, the PCs get less valuable to “add in”.
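
The "keep n% of the variance" rule can be sketched in a few lines of plain Python. The list of values below is hypothetical, and the sketch assumes each entry is the variance explained by one component, sorted from largest to smallest. If the array you read from the PCA dump holds singular values instead, square them first.

```python
def components_to_keep(values, target=0.95):
    """Smallest number of leading components whose cumulative share
    of the total variance reaches the target fraction."""
    total = sum(values)
    running = 0.0
    for count, value in enumerate(values, start=1):
        running += value
        if running / total >= target:
            return count
    return len(values)

# Hypothetical per-component variances, largest first.
values = [5.2, 2.1, 1.0, 0.4, 0.2, 0.1]

keep_90 = components_to_keep(values, target=0.90)
keep_95 = components_to_keep(values, target=0.95)
print(keep_90, keep_95)
```

With these made-up values, 3 components cover 90% of the variance and 4 cover 95%. The elbow approach is a visual judgment on the same numbers, so it is not automated here.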

👍

Correct. Doing so is called “whitening” the PCA. It has some use in certain applications, but by default it’s not usually done.

Yes, as I described above.

(Do you mean KDTree?) Yes, I think it makes sense. I think in this case “optimizing” is more about removing redundancy in the dataset and focusing the distance comparisons, rather than “optimizing” as in CPU efficiency. Unless you have an insane number of data points, the FluCoMa KDTree is quite fast, especially with the < 20 or so dimensions you’re using.

Keep in mind you’ll need to take your incoming (real-time?) analysis vector, then scale it with the scaler’s transformpoint and then project it into PCA space with PCA’s transformpoint before sending it to the KDTree for NN lookup.
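
The order of operations for a query can be sketched like this, in plain Python. All fitted parameters (`scaler_mean`, `scaler_std`, `pca_mean`, `pca_bases`) are hypothetical numbers chosen for illustration, not the output of a real fit, and the brute-force `nearest` function only stands in for the KDTree lookup.

```python
import math

# Hypothetical fitted parameters (illustrative numbers, not from a real fit).
scaler_mean = [2500.0, -30.0, 0.5]
scaler_std = [1500.0, 10.0, 0.25]
pca_mean = [0.0, 0.0, 0.0]   # standardized data is already centred
pca_bases = [                # two components, one row per component
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]

def scale_point(raw):
    """Step 1: apply the scaler that was fitted on the corpus."""
    return [(x - m) / s for x, m, s in zip(raw, scaler_mean, scaler_std)]

def project_point(scaled):
    """Step 2: centre the scaled point and project it onto the kept components."""
    centred = [x - m for x, m in zip(scaled, pca_mean)]
    return [sum(c * b for c, b in zip(centred, basis)) for basis in pca_bases]

def nearest(corpus, point):
    """Step 3: nearest-neighbour lookup (brute force, standing in for a KDTree)."""
    return min(corpus, key=lambda name: math.dist(corpus[name], point))

# Corpus entries go through exactly the same two transforms as the query.
corpus_raw = {
    "forest": [1000.0, -40.0, 0.25],
    "street": [4000.0, -20.0, 0.75],
    "river": [2500.0, -30.0, 0.5],
}
corpus = {name: project_point(scale_point(v)) for name, v in corpus_raw.items()}

incoming = [3700.0, -22.0, 0.7]
match = nearest(corpus, project_point(scale_point(incoming)))
print(match)
```

The point of the sketch is that the incoming vector reuses the parameters learned from the corpus. Fitting a fresh scaler or PCA on the incoming data would place it in a different space from the one the tree was built on.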

My gut says that variance measured by PCA isn’t going to be as directly useful as it might seem. Maybe there’s a clever idea in there but…

I think essentially what you’re describing is a distance threshold: is the current point’s location in high dimensional space far enough from the previous point’s location to identify that the current moment is different? That’s a heuristic you might have to just decide!
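
A minimal sketch of such a distance threshold, in plain Python. The snapshot vectors and the threshold of 0.5 are made up, and choosing that threshold is exactly the heuristic decision described above.

```python
import math

def novelty_flags(snapshots, threshold):
    """For each consecutive pair of snapshots, report whether the
    Euclidean distance between them exceeds the threshold."""
    return [
        math.dist(previous, current) > threshold
        for previous, current in zip(snapshots, snapshots[1:])
    ]

# Hypothetical descriptor snapshots (already scaled and reduced).
snapshots = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [2.0, 1.5], [2.1, 1.5]]

flags = novelty_flags(snapshots, threshold=0.5)
print(flags)
```

Only the jump into the fourth snapshot is flagged here. In practice the distances would probably be smoothed or compared against a running average, which this sketch leaves out.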

If you know what the musical material is ahead of time (what the general sections of the performance are) you might be able to build a dataset to test different strategies, for example train a classifier to detect what section a piece is in, but if it’s improvised then it will be harder of course.

You might research “anomaly detection” algorithms since that’s essentially what you’re looking to do. (You might even research “anomaly detection with PCA”, I bet someone has considered this).

This is amazing! Thank you so much, you’ve helped me to understand so many core concepts at once.

My project indeed focuses on free improvisation, so I don’t think I’ll be able to rely on trained classifiers. Ideally I would like to use the same type of processes that I use for analyzing the environmental recordings “offline”.

I will definitely spend some time thinking about and investigating your suggestions, to see how I can develop a robust heuristic approach to detecting stability and novelty in an improvised context.

If you’re doing offline processing, then this might also be a good way of testing / validating your anomaly detection of whether a performance is evolving over time. You could build a dataset of different performances and annotate where you think the “changes” are, whatever that means to you, then try some different algorithms and heuristics via non-real-time, “offline”, processing and see which algorithms and heuristics end up matching your annotations the best.

Dear both

Fantastic thread, and let me bring a little complexity and nuance, and a trick.

Now, there is a conceptual danger in scaling everything without remembering what it actually means for non-uniform datasets like ours: numerical distances do matter a lot in the ‘metaphor’ of using them as a proxy for similarity.

The simple example I keep using is pitch and loudness. Just scaling is an oversimplification that most of the time won’t help, especially for a complex task like:

because it assumes that the range of your pitch descriptor, and its unit, are matching perceptually the range and unit of your loudness descriptor. For instance:

pitch between 200 and 16000 Hz

loudness between -80 and -20 LU

are not the same (one is linear, the other is exponential, and one has a wider perceptual range than the other)

MFCCs are even worse: scaling them has never given me better or more convincing results. There is a paper somewhere about a perceptual weighting of them, I’m trying to find it.


So my suggestion for preparing your data is to first check whether your similarity space works as you intend. Scale (and change units) away, in the high dimension, until your high-dimension nearest neighbour is convincing for your musical application. Then, and only then, apply PCA. Then test whether proximity in your new space is still roughly as accurate.

Does it make sense? Does it help? Have I made it too complicated?

This approach makes a lot of sense, and it will definitely be interesting to experiment with!

Dear P.A.,

Thank you for the methodological insight, it makes a lot of sense and helps me get into the right mindset for navigating this process. At the same time, it raises a few questions regarding the actual workflow.

To make sure I understand your advice correctly, my goal would be to establish a solid methodology that looks like this:

  1. Test and combine the descriptors one by one, manually tailoring how each is scaled based on its perceptual nature (for instance, to follow your example, converting Pitch to a logarithmic scale before scaling it so it can properly match Loudness), until all planned descriptors are joined.

  2. Find the right balance by experimenting with different scalers or manual feature weighting. (Regarding this, may I ask if you have any practical advice or a preferred strategy on how to handle this balancing act when designing such a space?)

  3. Apply PCA only at the very end, once the full high-dimensional space is already validated and convincing to the ear.

While this pipeline makes a lot of sense to me, it raises an additional doubt regarding the curse of dimensionality.

If I combine all my descriptors, especially since I plan to include 2 or 3 statistics (like mean, standard deviation) for each of them, I will easily reach a very high number of dimensions before applying PCA.

If I test this massive space with the KDTree to find a convincing configuration by ear, won’t the curse of dimensionality already distort the distances and blur the nearest neighbor results, making my perceptual validation unreliable before I even get the chance to compress it with PCA?

Are there any strategies or best practices you recommend to avoid this when testing large descriptor spaces?

hello

so for me your points 1 and 2 are related. I’m always trying to find a scale where a distance of x feels the same in every dimension. FluidPitch can output midicents (so 1 semitone can be matched to 1 LU, and the full range of pitch, about 80-90 semitones, is similar to the full range of usable loudness, about 80-90 LU).
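
To illustrate why a semitone scale behaves better than Hz here, a short plain-Python sketch using the standard Hz-to-MIDI conversion (A4 = 440 Hz = MIDI note 69). The function name `hz_to_midi` is just a label for this example.

```python
import math

def hz_to_midi(freq_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)

# The same musical interval (one octave) at two different registers.
low_octave_hz = 440.0 - 220.0        # 220 Hz apart
high_octave_hz = 8800.0 - 4400.0     # 4400 Hz apart

low_octave_midi = hz_to_midi(440.0) - hz_to_midi(220.0)
high_octave_midi = hz_to_midi(8800.0) - hz_to_midi(4400.0)

print(low_octave_hz, high_octave_hz)
print(low_octave_midi, high_octave_midi)
```

In Hz the two octaves differ by a factor of 20 in numerical distance, while in MIDI units both come out as 12 semitones, so a step of x means the same interval anywhere in the range.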

once I have that, timbre is more ‘fun’ because MFCCs are multivariate (the dimensions are needed together to mean something) and do not scale perceptually in a simple way (+/- 1, or +/- range, do not have the same perceptual impact on each dimension) - I have heard of a paper at IRCAM, and/or from McAdams, on this, and I think that this is what @naiv40 implemented in the software he presented yesterday here

My trick is to use the full count of dimensions to explore proximity. Then indeed, I try to simplify it. @rodrigo.constanzo had a more data-science approach at one point, throwing in all the dimensions in the world then using PCA to try to make sense of them… but that was not super successful IIRC. He will remember which of his (fantastically documented research) threads it was in, I hope.

There’s a couple threads where I explore some of these ideas:

I don’t think the patches work anymore as the interface(s) changed along the lines, but hopefully the thinking makes sense.

As an aside, I’ve had the best results from hand-picked “hybrid” descriptors where you narrow down to what you think is important, and just use those. In my case the main ones I use are loudness/centroid/flatness/pitch, with the derivative of the first three and the confidence of the fourth rounding it out to an 8d space. This is my “generic” set of descriptors that I use unscaled (though pitch/centroid are in MIDI and loudness/flatness are in dB).

I do also use MFCCs, though mainly for classification and because it doesn’t encode loudness (if you drop the 0th coefficient).

Although this is aimed more at the specific descriptors and use cases within Data Knot, I unpack how/where I use each descriptor type across the package here:

Thanks for the mention — yes, that’s exactly what I tried to implement.

The McAdams weights I used come from the 1995 study on timbral similarity (McAdams, Winsberg, Donnadieu, De Soete, Krimphoff). The perceptual dimensions are: spectral centroid (brightness), spectral flux, attack time, and roughness/irregularity. Each is weighted differently in the UMAP computation — spectral centroid carries the most perceptual weight, attack time less so for sustained sounds.

The key issue you raise — that +/- 1 does not have the same perceptual impact across MFCC dimensions — is exactly why I avoided raw MFCCs for the compositional space and used the McAdams descriptors instead. Each descriptor has a clearer perceptual correlate, and the weights make distances more interpretable: a step of x in brightness feels roughly comparable to a step of x in spectral flux, at least within the range of the ConTimbre corpus.

That said, the FluCoMa real-time analysis layer in the system does use FluidMFCC — but there it’s used as a tension estimator rather than for navigation. The variance across MFCC coefficients over time feeds a composite tension value, so the multivariate nature is actually useful: I’m looking for overall spectral instability, not a specific perceptual distance.

Would be very curious about the IRCAM paper you mentioned on perceptual scaling of timbral dimensions — do you have a reference?

I’m trying to find it - there was something about perceptual scaling of MFCCs. I’ll write to my contacts :slight_smile:

ok I found this:

https://www.academia.edu/111594101/In_Search_of_a_Perceptual_Metric_for_Timbre_Dissimilarity_Judgments_among_Synthetic_Sounds_with_MFCC_Derived_Spectral_Envelopes

I need to read it in detail, but it seems scaling is not a good idea. Then I checked the latest stuff: Siedenburg, K., Fujinaga, I., & McAdams, S. (2016). A Comparison of Approaches to Timbre Descriptors in Music Information Retrieval and Music Psychology. Journal of New Music Research, 45(1), 27–41.

and this:

So I need to go read more. It seems people are still poking at this. I need to spend a bit of time on this so if you jump the gun, update us :slight_smile:

Thanks for these — very useful. The Siedenburg et al. 2016 comparison is exactly the kind of overview I was missing. Will read both and report back. :face_with_monocle:

Hello all

I’m in need of real-time joint MFCC normalisation. With a bit of help from the internet, I found that the coefficients do indeed go down in range as their index goes up, and that for the built-in FluCoMa implementation we have a range of about plus/minus 112 at the default settings. What changes the extreme range a lot is the number of melbands (not the number of coeffs kept). The FFT size adds roughly 6 per doubling.

If anyone is playing with this, you can divide all coeffs by 120, and coefficients 1 and upwards will then be between -1 and 1. If you play with that, please report here to see if that works as well. I’m looking at you, @rodrigo.constanzo :wink:

One can also load the settings in FluidNormalize via a dict, for instance here for a 20 MFCC pipeline:

{
  "cols": 20,
  "data_max": [
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0,
    120.0
  ],
  "data_min": [
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0,
    -120.0
  ],
  "max": 1.0,
  "min": -1.0
}
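
For reference, a plain-Python sketch of what these settings amount to, assuming the usual min-max mapping from the data range to the output range. With a symmetric ±120 input range and a ±1 output range this reduces to dividing by 120, and because every column is scaled by the same factor, nearest-neighbour results stay the same as on the raw values. The MFCC frames are made up and cut down to four coefficients.

```python
import math

# Same ranges as the dict above, written once because every column shares them.
DATA_MIN, DATA_MAX = -120.0, 120.0
OUT_MIN, OUT_MAX = -1.0, 1.0

def normalize(value):
    """Usual min-max mapping from [DATA_MIN, DATA_MAX] to [OUT_MIN, OUT_MAX]."""
    return OUT_MIN + (value - DATA_MIN) * (OUT_MAX - OUT_MIN) / (DATA_MAX - DATA_MIN)

def nearest(corpus, query):
    """Brute-force nearest neighbour, standing in for a KDTree lookup."""
    return min(corpus, key=lambda name: math.dist(corpus[name], query))

# Hypothetical MFCC frames (4 coefficients only, made-up values).
corpus = {
    "a": [-96.0, 30.0, -12.0, 6.0],
    "b": [-60.0, -24.0, 18.0, 3.0],
    "c": [-100.0, 36.0, -6.0, 0.0],
}
query = [-90.0, 24.0, -12.0, 6.0]

scaled_corpus = {name: [normalize(v) for v in frame] for name, frame in corpus.items()}
scaled_query = [normalize(v) for v in query]

raw_match = nearest(corpus, query)
scaled_match = nearest(scaled_corpus, scaled_query)
print(raw_match, scaled_match)
```

A scaler fitted per column on the data would instead stretch each coefficient by a different amount, which can change which point comes out nearest.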

I don’t normalize/standardize for the actual querying or anything, but for UI/mapping purposes I normally treat it as -80. to 80. as that visually covers most range.

Hello, it works here. I compared nearest neighbour search with a KDTree after normalising, and the success rate was exactly the same as without normalising (which was not the case when using fluid.normalize~ with the “fittransform” message).
Cheers

same here, and in line with the literature. I got a network to converge and it was super fun :slight_smile:

In the end, I used FluidStandardize because it is even simpler. In SuperCollider, for 40 dims:

~scaler = FluidStandardize(s).load(Dictionary.newFrom([\std, 120.dup(40), \mean, 0.dup(40), \cols, 40]));
~scaler.transform(~corpusMFCCs, ~corpusMFCCs)