(edit: no longer confidential) Sharing Experimental Results on a Descriptor Space Dimension Reduction Experiment

Dear all

In the lab, we just finished and submitted a paper. It has not yet been reviewed, so please do not share any of this, but I am really excited to show it to you as team members!

On this forum, and in the lab, we have had many discussions since the plenary on how to make sense of the segments once they were produced. I posted a few descriptor+dada-based visualizations, mostly of the CataRT type, but those, plus my automatic orchestration via nearest match, left me a bit short-changed, and some of you had similar feelings about similar tests.

Anyway, we sat and thought hard. As we slowly establish the foundations for the 2nd toolbox, ideas for visualization (2D or 3D) and dimension reduction (since we can get many descriptors on various time scales as well) were floated, some experiments were done, and we ended up comparing the different algorithms available for both tasks together. I’m quite encouraged by the early results, so I wanted to share them with you PRIVATELY.

This video was made with the first prototype (no colour mapping; size is length, so x/y mostly) from a bunch of modular synth sounds, segmented via their amplitude. The first part describes the sounds through MFCCs in time, reduced via isomap. I find the clusters significant and the shape inspiring. The second part, on the same sample set, describes them via an autoencoder on the spectrogram, reduced through t-SNE. The form is different, and the localities are differently convincing.
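As a rough illustration of the two reduction steps mentioned above (this is not the research code; the data, shapes, and parameters here are invented for the example), per-segment descriptor vectors can be flattened to 2D with scikit-learn’s isomap and t-SNE:

```python
import numpy as np
from sklearn.manifold import Isomap, TSNE

rng = np.random.default_rng(0)
# Hypothetical corpus: 60 segments, each summarised by 13 MFCC means
mfcc_stats = rng.normal(size=(60, 13))

# Two of the reducers discussed in the post, each projecting to 2D
iso_2d = Isomap(n_neighbors=5, n_components=2).fit_transform(mfcc_stats)
tsne_2d = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(mfcc_stats)

print(iso_2d.shape, tsne_2d.shape)  # (60, 2) (60, 2)
```

The two algorithms make very different trade-offs (isomap tries to preserve geodesic distances globally; t-SNE preserves local neighbourhoods at the expense of global layout), which is consistent with the two parts of the video looking so different on the same corpus.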

On this alpha test devised by @groma, I made a deliberately non-musical video: I click on neighbours so you can appreciate their proximity :wink: The playback engine simply plays back when I click, nothing exciting yet, but as research code, I find it inspiring. The idea for the paper was to compare different algorithms, with indications of where they excel (and where they fail). We hope to port the most potent one(s) to help creative coders make their own description, reduction, mapping and interface, relevant to their practice.

I’m sharing this with you all now because I thought it would be good for the alpha users to feel that we’re listening and are inspired by their proposals, questions, and queries. Feel free to give us feedback on anything, and send ideas around. As soon as the paper is reviewed, we should be able to share the code, should you want to try it and/or see the implementation details.



I think something that hasn’t been explored yet is using the shape of the node to display something. You could make them range from brittle/jagged edges -> perfectly circular. It would be hard to see zoomed out, but you might differentiate small differences between sounds, such as length, or be very verbose and map noise_ratio to it or something. Just my 2c


Awesome, I look forward to seeing where this kind of thing goes.

For me, I have to say that I don’t really hear too much of a correlation in the clumping (except for that “tail” in the bottom right of the first example). I could also just not be understanding what’s happening with the dimensionality reduction.

But in the bits where you’re clicking around, even in the zoomed-in views (which is a super awesome idea!), it seems like it’s playing back arbitrary samples with very little (to no) correlation, based on what we’re hearing.

Obviously it’s not going to be as clear/legible as something like a straight x/y(/z) mapping from something like CataRT, where “pitch” is one axis and “amplitude” is another, but the dimensional reduction seems kind of “noisy”.

Again, this is probably a naive question, but in this kind of mass-dimensional reduction, are there priorities that can be assigned to descriptors (or MFCCs, though I wouldn’t think that would make much sense there)? So that proximity in one perceptually meaningful dimension is prioritized over some other, less perceptually-bound descriptor.


Having the autoencoder produce a mapping that is then “supervised”(/assessed) by a human in terms of its efficacy. Or something like that.

Again, the math/science is way over my head, so just offering forward some naive/outsider thoughts on the process.
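One common way to realise the “priorities” idea raised above is to weight (scale) each descriptor column before the reduction runs, so that distances along the prioritised dimension dominate. This is just a sketch of that idea with invented data and weights, not FluCoMa functionality:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
feats = rng.normal(size=(40, 3))      # e.g. pitch, loudness, noisiness (made up)
weights = np.array([3.0, 1.0, 0.5])   # prioritise the pitch column

# Scaling a column stretches distances along it, so a distance-based
# reducer (here MDS) preserves proximity in that descriptor preferentially.
embedded = MDS(n_components=2, random_state=1).fit_transform(feats * weights)
print(embedded.shape)  # (40, 2)
```

The same trick works in front of isomap or t-SNE, since they too operate on inter-point distances.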

Now as far as the dimensional reduction itself, I personally find a 2d (or faux-3d) “visual” representation to be pretty uninspiring, and pretty far removed from how I would personally want to explore sounds. Plus, perceptual proximity, in and of itself, is pretty uninteresting to me too. (I’d probably prefer the opposite, really.) So some of this bit should be taken with a grain of salt.

But there’s definitely a ton to explore in terms of how to play with a 2d (or 3d, faux 3d) representation of sound. Color obviously, shape, connections (or not). I mentioned it during the plenary but D3 is amazing for this kind of shit.

Beyond that, it would be worth looking at more natively multi-dimensional media types (i.e. shortcuts/hotkeys from 3d design apps (Fusion360, etc…)), where you could have a 4d+ visual space represented that can be more easily navigated. And/or looking at video game paradigms for negotiating 3d worlds (virtual camera systems, where the motion of the player is decoupled from the camera representation).

So there you have some rambly and tangential thoughts!


Great thoughts and musings indeed, thanks!

I presume if you listen to sounds from ‘far’ regions, and those in the ‘surrounding’ region, you’ll find a lot more timbral/profile proximity in the latter… but I might know the corpus too well! The sounds were, for me, a lot more together when they had similar profiles. Maybe the fact that I’ve actually tried the interface biases my judgement!

Actually, with the same database, there was nothing legible in human-defined descriptors when I put it in CataRT. The sounds are so noisy (yet so fun) that they go all over the place, without any significant clustering… that is why I got excited about this… It is noisy indeed, but way less noisy than picking 2 descriptors.

That is the bit we are trying to get around. For most things, a low number of descriptors is not super useful in terms of creating a coherent space, so we try to get a ‘space’ from the data itself. It is still a very young idea though…

You might find that a few variations on similar sounds being near, but clusters being far, makes a potential win/win situation. That is what I was surprised by so far. Local sense/global eccentricities were actually quite inspiring.

The space we try to shrink has 104 dimensions, so 3/4/5 won’t cut it…

Feel free to continue musing, this is research in progress and that problem is so complex… I’m sure @weefuzzy and @groma will have much more mature and clever answers and questions to poke you more :wink:


As I said, I could be missing something, and have only watched the video a couple of times, but even when you zoomed in to a section and were picking sounds very close to each other, as far as I could tell (just listening wise), they were as different as sounds that were far apart in the entire space. (with a few very clear and obvious exceptions)

That, and it’s very unclear how duration impacts things. Obviously bigger points are longer, but what about those sounds is similar?

There’s definitely clustering going on, quite clearly. I’m just not hearing it (too well). I look forward to hearing/seeing other examples too.

Yeah, I can see that. I was more commenting on the fact that I don’t, personally, find this way of working with sounds interesting, as to contextualize some of my comments.

I mean that what you’ve made is a 2d representation. If you add color, you get 3, but if you add visual rotation (like in CAD programs) you could clearly see 3 dimensions, and use color and size for more, etc…

So I meant more reducing it to a more navigable set of dimensions, but having a think about how those dimensions are represented, and more importantly interacted with (with the mouse/trackpad being one of the single worst HI interfaces in terms of expressiveness).


For me their profile (spectromorphology) is more similar. It is true that the zoom is not too convincing in my examples though…

No, but you find relevant neighbouring interesting (in your audio-based corpus navigation), no? This is what excites me: the complex neighbours are more similar than with my own choices (for this complex corpus, that is).

Ah, I misread your OP. It is true that there are other visualisation paradigms to look at, thanks for the pointer! Ideally the processes are in steps that allow people to replace functions (description, reduction/clustering, visualisation) by what they want, so let’s hope we get the granularity right (I’m sure you’ll tell us if we don’t :wink:)


I think I would not say it “has not been explored” in general; it would depend on the field, but I have seen many visualizations where the shape corresponds to the audio. I was hoping to try squares or arrows where rotation is a fourth dimension (the third mapped to color), but still haven’t had time. On the other hand, we have developed a library for visualizing the sounds themselves that hopefully will play nicely with this.


Sorry, the language I used there is a bit general.

In some of the more popular musical applications, it seems that it hasn’t appeared as something you can control from the UI. Rotation of shapes sounds great.

Hi guys, exciting developments!
If it were me, I’d be thrilled to have some words on the canvas. You know, like in an atlas! Big words and small words characterizing the regions of sound (you know, like “EUROPE”, then “Italy”, then “Milan”, etc…).
Yep, I’ve said that already in the plenary, but that popped into my mind once again while seeing the video. (And I also know that that would require huge AI effort, although something is being done on that side already.)

keep up the great work!!!

(PS. not that it matters that much in the discussion, since I definitely understand the need for your own auto-organizing space reduction UI: it makes much more sense – but you can control the size of the grain in dada, it’s just that they are all going to be regular polygons/circles.)


These are good ideas! What is the algorithm you use in dada? I’m curious, since we compared 6 here… I also think that what we feed it (the 104-descriptor vectors) changes the distances we compute quite a lot… but again, it is all research code now, so I have to try more options at different levels… more soon!

Hi PA, in dada.distances I’m using this library: https://sites.google.com/site/simpmatrix/
Honestly I did not compare many others; it was lightweight and it seemed to work…
(It did not scale up that nicely with many dimensions, though.)


MDS is one that we compared indeed. I’m sure @groma will step in to let the world know what we did, but since the paper is under review now it might have to wait a bit… at least you know we tried this one, isomap and t-SNE :wink: What I can say (and you can see in the video) is that with the same dataset they yield very different maps, and that was super inspiring for me (strangely).

edit: I say strangely because I was expecting to be disappointed by the non-optimal use of space, for instance, but the localities (clusters) were immediately fun to play with, and the drawings were just beautiful!

Update: now the video is public - more demos will be done at a later point, but as we are about to go public, we need people to be able to see stuff :wink:


quick question: at 1:19 in the video (thus second part) you have a few grains clustered which all have upwards glissandi. How is this feature taken into account? I thought you take mean-values of every descriptor over the entire duration of a sound. Or do you actually keep track of evolving parameters as well?


Indeed, that is the beauty of the derivative. @groma told me it is quite typical of speech recognition: you keep the stats of the descriptors, but also the stats of the variation of the descriptors. I’ve tried to make the help files of bufstats~ with graphical examples to explain this, but in effect, if your pitch goes up continuously, the average of the variation of pitch will be positive.

In FluidCorpusMap, Gerard coded the min, max, mean and standard deviation of the MFCCs, but also the min, max, mean and stddev of their first derivative, aka how much each coefficient changes between frames. That allows us to keep statistics of the movement.

You can go crazy and do the 2nd derivative (how much that first derivative changes) too… but the results were fun with these stats, and you can get too much data at some point, I’m told…
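A quick sketch of the statistics described above (illustrative only, not FluidCorpusMap’s actual code): four stats over 13 MFCC coefficients, plus the same four stats over their frame-to-frame first derivative, gives 13 × 4 × 2 = 104 values per segment.

```python
import numpy as np

def segment_stats(mfccs):
    """mfccs: (n_frames, 13) array, hypothetical layout."""
    deriv = np.diff(mfccs, axis=0)  # change between consecutive frames
    feats = []
    for m in (mfccs, deriv):        # raw coefficients, then their derivative
        feats.extend([m.min(0), m.max(0), m.mean(0), m.std(0)])
    return np.concatenate(feats)    # 13 coeffs * 4 stats * 2 = 104 values

rng = np.random.default_rng(2)
vec = segment_stats(rng.normal(size=(100, 13)))
print(vec.shape)  # (104,)

# And the point about glissandi: a continuously rising contour
# has a positive mean first derivative.
rising = np.linspace(0.0, 1.0, 50)
print(np.diff(rising).mean() > 0)  # True
```

This is how a summary vector can still capture direction of movement: the mean of the coefficient hides the glissando, but the mean of its derivative keeps the sign of the trajectory.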
