Ah, I misspoke slightly there. The variances of the features don’t need to be 1, but they do need to be sensibly comparable for the results to make much sense. That is, the ranges of the inputs need to be at least in the same ballpark; otherwise the features with larger ranges will dominate everything, even if they don’t really contribute much structure. **The zero-mean assumption is enforced by the algorithm irrespective of any prior conditioning: the mean of each feature is subtracted from the values before the gory part of the algorithm is done.**
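
To make that concrete, here is a rough sketch in plain Python of a two-feature PCA done by hand. Everything in it is made up for illustration: the feature names (`centroid_hz`, `loudness`), their ranges, and the helper functions are all hypothetical, and this is not how any particular PCA implementation is written internally. It only shows two things: the means get subtracted inside the covariance step, so a constant offset changes nothing, and the feature with the bigger range grabs the first component until you rescale.

```python
import math
import random

def covariance_2d(xs, ys):
    """Return (a, b, c) for the sample covariance matrix [[a, b], [b, c]].
    The per-feature means are subtracted here, as PCA does internally."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return a, b, c

def first_component(xs, ys):
    """Unit vector along the first principal axis of two features."""
    a, b, c = covariance_2d(xs, ys)
    # Closed-form major-axis angle of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - c)
    return math.cos(theta), math.sin(theta)

def standardise(values):
    """Scale to unit variance (mean removal is left to the PCA step)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [v / sd for v in values]

random.seed(0)
# Hypothetical descriptors that share a latent trend but live on very different scales.
latent = [random.gauss(0, 1) for _ in range(500)]
centroid_hz = [2000 + 800 * (t + 0.5 * random.gauss(0, 1)) for t in latent]
loudness = [0.5 + 0.1 * (t + 0.5 * random.gauss(0, 1)) for t in latent]

raw_dir = first_component(centroid_hz, loudness)
# Adding a constant offset moves the mean, which is removed anyway.
shifted_dir = first_component([v + 10000 for v in centroid_hz], loudness)
scaled_dir = first_component(standardise(centroid_hz), standardise(loudness))

print("raw:     ", raw_dir)      # almost entirely the large-range feature
print("shifted: ", shifted_dir)  # same direction as raw
print("scaled:  ", scaled_dir)   # both features weighted about equally
```

On the raw data the first component points almost exactly along the large-range feature; after scaling to comparable variances, both features get a say.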

Also, and importantly: Robust Scaling doesn’t necessarily get you zero mean, because it centres around the *median* instead. If your feature happens to be Gaussian distributed, the mean and the median will be the same thing (but the point of Robust Scaling is that the data probably isn’t Gaussian distributed). In any case, it doesn’t matter so much, for PCA, *which* normalisation scheme you use, but using one of them is a good idea unless your features all just happen to span similar ranges to start with (which is rare with audio descriptors).
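
A quick way to see the median-versus-mean point, again as a hedged sketch rather than anyone's actual implementation: `robust_scale` below is a hypothetical helper that centres on the median and divides by the interquartile range, and the two data sets are synthetic (one skewed, one Gaussian). The quartile method is an assumption too; real scalers may interpolate slightly differently.

```python
import random
import statistics

def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - med) / (q3 - q1) for v in values]

random.seed(1)
# Synthetic data: a skewed (exponential) feature and a symmetric (Gaussian) one.
skewed = [random.expovariate(1.0) for _ in range(2001)]
gaussian = [random.gauss(0.0, 1.0) for _ in range(2001)]

skewed_scaled = robust_scale(skewed)
gaussian_scaled = robust_scale(gaussian)

# The median lands on zero, but the mean of the skewed feature does not.
print("skewed median:  ", statistics.median(skewed_scaled))
print("skewed mean:    ", statistics.mean(skewed_scaled))
# For the Gaussian feature, mean and median nearly coincide.
print("gaussian mean:  ", statistics.mean(gaussian_scaled))

# What PCA's own centring step then does: remove whatever mean is left.
leftover_mean = statistics.mean(skewed_scaled)
recentred = [v - leftover_mean for v in skewed_scaled]
print("recentred mean: ", statistics.mean(recentred))
```

So the leftover mean after Robust Scaling is harmless for PCA, since it gets subtracted anyway; what the scaling buys you is comparable ranges.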