Many dimensions to one vector?

Since the beginning of FluCoMa I have been searching for a way to order short sounds along a one-dimensional vector by their sonic proximity. I’m well aware that this is a tricky task, as our perception is not 1:1 with MFCCs or any other analysis; we are very sensitive to the attack moment of a sound, for example. Anyhow, many iterations have led to different solutions, some of them more convincing than others.

The goal would be to browse the corpus with a slider and get ‘smooth’ timbre variations between neighboring slices.

I have tried combinations of:

many dimensions
→ brute-force MDS to one dimension
→ MDS to 20 dimensions, then MDS to 1 dimension
→ PCA → MDS
→ UMAP
→ PCA → UMAP → MDS

all with different explorations of selected descriptors, stats, metrics, learning rates, numbers of neighbors, etc.
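
For reference, here is a rough Python sketch of the kind of pipelines listed above (outside the FluCoMa tools, using scikit-learn and umap-learn, with a made-up feature matrix standing in for my real descriptor analysis):

```python
import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix: one row per slice, one column per descriptor stat.
rng = np.random.default_rng(0)
features = StandardScaler().fit_transform(rng.normal(size=(200, 120)))

# Brute-force MDS straight down to one dimension.
mds_1d = MDS(n_components=1, random_state=0).fit_transform(features)

# PCA to 20 dimensions, then MDS to one.
pca_20 = PCA(n_components=20).fit_transform(features)
pca_mds_1d = MDS(n_components=1, random_state=0).fit_transform(pca_20)

# PCA -> UMAP -> MDS.
umap_3d = umap.UMAP(n_components=3, n_neighbors=15, random_state=0).fit_transform(pca_20)
pca_umap_mds_1d = MDS(n_components=1, random_state=0).fit_transform(umap_3d)

# Any of these gives a (num_slices, 1) array whose sort order is the 1D browsing order.
order = np.argsort(pca_umap_mds_1d[:, 0])
```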

Nothing has been strikingly convincing across several sound types. Some work better for pitched material, etc.

I’m just wondering if anybody else has some thoughts or hints.

thanks, Hans

Hi Hans,

When I did this, it was PCA → UMAP. If I remember correctly, I chose to keep enough PCs to retain 99% of the variance (or maybe it was 95%?). In this case I remember it was 11 PCs. I then used UMAP to reduce those 11 dimensions down to 1.

The analysis I used was the whole gamut (SpectralShape, Pitch, MFCC, & Loudness), plus all the stats from FluidBufStats. Of course this is kind of a crazy amount of initial features, and the slices were only 100ms, so that’s why PCA could reduce many hundreds of dimensions down to 11 while retaining 99% of the variance… it was my early days in the FluCoMa-verse and I was throwing a lot of stuff at the wall. All that being said, I was pleasantly surprised by the results!
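
For illustration, a rough Python sketch of that reduction chain (scikit-learn and umap-learn rather than the FluCoMa objects, and the feature matrix is just a stand-in for the real per-slice stats):

```python
import numpy as np
import umap
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the per-slice descriptor statistics (hundreds of columns).
rng = np.random.default_rng(1)
features = StandardScaler().fit_transform(rng.normal(size=(400, 300)))

# Keep however many principal components are needed to retain 99% of the variance.
pca = PCA(n_components=0.99)
retained = pca.fit_transform(features)
print(retained.shape[1], "PCs retained")

# Squash the retained components down to a single dimension with UMAP.
embedding_1d = umap.UMAP(n_components=1, n_neighbors=15, random_state=0).fit_transform(retained)
```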

One takeaway I had from this was that there are big gaps in the embedding space (because some of the sounds are quite different), and when I play back all the grains in 1D (through time here) they’re all right up against each other, so we lose the spatial relationship that should separate out those differences better in this one dimension. But of course if there are no slices in that region of the embedding space (probably because there are no slices in that region of the initial feature space), there’s no way to create a smooth transition between the disparate sounds.

This is all to say that maybe the corpus doesn’t have the slices in it to actually create a smooth transition (I know your corpus is huge), at least in the way that you imagine. That is a separate issue from, as you say, the mapping not being 1:1.

Regarding this, have you tried using loudness as a weight for the statistical summaries? That way the attack (if that’s the loudest part) would be weighted as more salient in the analysis, similar to how it is more salient to us humans.
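
For what it’s worth, here is a small numpy sketch of what loudness-weighted summary statistics would look like (the per-frame MFCC and loudness arrays are placeholders; as far as I recall, this kind of weighting is what the weights input of FluidBufStats is for):

```python
import numpy as np

# Placeholders: per-frame MFCCs (frames x coeffs) and per-frame loudness in dB.
rng = np.random.default_rng(2)
mfcc = rng.normal(size=(100, 13))
loudness_db = rng.uniform(-60, 0, size=100)

# Convert dB to linear amplitude so louder frames count for more,
# then normalise so the weights sum to 1.
weights = 10 ** (loudness_db / 20)
weights /= weights.sum()

# Loudness-weighted mean and standard deviation of each MFCC coefficient.
w_mean = np.average(mfcc, axis=0, weights=weights)
w_std = np.sqrt(np.average((mfcc - w_mean) ** 2, axis=0, weights=weights))
```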

t

Thanks Ted for your useful answer. I will explore more.
I also used PCA (keeping 85% of the variance) and went from there to 3 dimensions for the 3D model.

Here is a strange observation, perhaps more for the bug section: PCA seems to produce all zeros in the reduced dataset if the number of requested dimensions is greater than the number of rows.
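
For what it’s worth, a small numpy sketch (outside FluCoMa, so it may not explain the exact behaviour you see) of the underlying rank limit: after centring, a matrix with n rows has at most n − 1 non-zero principal components, so any dimensions requested beyond that can only come out as zeros.

```python
import numpy as np

# Toy case: fewer rows (slices) than columns (descriptor dimensions).
rng = np.random.default_rng(3)
data = rng.normal(size=(5, 20))          # 5 slices, 20 descriptor dimensions

# PCA by hand: SVD of the mean-centred data.
centred = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)

# After centring, the matrix has rank at most (rows - 1), so only that many
# singular values are non-zero; projections onto any further axes are all zeros.
print(np.round(s, 6))                    # the last singular value is ~0
scores = centred @ vt.T
print(np.round(scores[:, 4], 6))         # the 5th "component" is already all ~0
```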

Adding a short video with my results so far.
Thanks again Ted, keeping 95% of the variance in the PCA makes a big difference.

This is cool. Another approach to explore could be to train a neural net post-PCA instead of having UMAP invent an order (since that order will always be arbitrary). I would train it first on a small set (manually placing 10 sounds from the 3D space you seem to like into a 1D dataset and regressing between the two). That is crazy small though, so maybe 20 sounds, but still, it would be fun to have a neural net that tries to model/fit my personal navigation through the net.
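
Something like this rough Python sketch of the idea (scikit-learn’s MLPRegressor standing in for FluidMLPRegressor, with made-up hand-placed points):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)

# 20 hand-chosen sounds: their 3D (post-PCA/UMAP) coordinates ...
train_3d = rng.normal(size=(20, 3))
# ... and where I decided each one should sit on a 0-1 slider.
train_1d = rng.uniform(0, 1, size=20)

# Small net, so that 20 examples can actually constrain it.
net = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(train_3d, train_1d)

# Predict a slider position for every slice in the corpus.
corpus_3d = rng.normal(size=(500, 3))
corpus_1d = net.predict(corpus_3d)
```

Once trained, the same net places every slice of the corpus on the slider, and one could keep adding hand-placed examples wherever the ordering feels wrong.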

In any case, thanks for the thorough demo @tutschku - it links well with what @tedmoore was doing. I love it!