Descriptor-based sample navigation via dimensional reduction(ish) (proof-of-concept)

rodrigo.constanzo · July 24, 2019, 4:03pm

There’s some interesting discussion on this topic in the thread on @groma’s talk, as well as @tutschku’s thread on Dimensionality Reduction, but rather than derailing those threads, I figured I’d make a new thread for the proof-of-concept for what will be a dimensionality reduced “automatic sample parser”.

This is basically a combination of batch analysis and “onset descriptors” bits of code discussed in this thread (and elsewhere). I have a corpus of 3094 samples which I’ve analyzed for around 50 data points, of which I’m only using two for now (loudness max, and spectral centroid mean) to browse the sample space.

A video explanation:

And short musical demo:

So at the moment it’s a direct 1-to-1 mapping of velocity->loudness and radius->centroid (well, I’m doing some clever normalization and min, mean, max scaling to make it more browse-able).
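For anyone following along outside Max, here is a minimal Python sketch of that kind of mapping. It assumes plain min-max normalization (not the min/mean/max scaling used in the actual patch), a radius in the range 0..1, and made-up descriptor values; the function names are hypothetical.

```python
import math

def normalize(values):
    """Scale a list of numbers linearly into the range 0..1."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero if all values match
    return [(v - lo) / span for v in values]

def build_index(loudness_max, centroid_mean):
    """Pair up normalized loudness and centroid values, one point per sample."""
    return list(zip(normalize(loudness_max), normalize(centroid_mean)))

def nearest_sample(index, velocity, radius):
    """Return the position of the sample closest to a pad hit.

    velocity is a MIDI value (0..127); radius is assumed to be 0..1.
    """
    target = (velocity / 127.0, radius)
    return min(range(len(index)), key=lambda i: math.dist(index[i], target))

# Hypothetical descriptor values for four samples (dB, Hz).
loudness = [-30.0, -12.0, -6.0, -20.0]
centroid = [500.0, 4000.0, 1500.0, 8000.0]
index = build_index(loudness, centroid)
hit = nearest_sample(index, velocity=127, radius=0.1)  # loud hit near the centre
```

With real data the two lists would come from the batch analysis, and the lookup would more likely live in something like entrymatcher than in a Python list.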

Among many other ideas, this is one of the intended use cases that I wanted to apply the descriptor space dimension reduction stuff to. Where I can take an arbitrary amount of descriptors (and their stats) and map them onto a dimensionally reduced space (in this case velocity/radius on a drum pad).

A bunch of the discussion at the 2nd plenary went over my head on this, but incorporating self-organizing maps to evenly space the available samples would also be great, to avoid the clustering from the videos/demo of @groma’s talk. That makes total sense when browsing/searching samples, but less so (for me) when playing sounds.

Either way, I just wanted to share an early proof-of-concept on that.

jamesbradbury · July 24, 2019, 5:24pm

This is cool - especially bell on bell action in number 2. With just two descriptors you get a lot of variety and control.

Some thoughts on dimensionality reduction which I’m sure others can weigh in on. I have recently been using PCA to try and ‘find’ things, using the computer to make sense of some information and throw me the results. There is a huge variety of timbres, lengths and behaviours in the samples, and there are clear perceptual groupings of how things might go together. I had a few exemplars from the data set that I serendipitously came across and wanted to find things close to them.

I initially did Principal Component Analysis on my samples, trying to reduce 120 data points for each down to 2, which are then converted into cartesian coordinates, similar to what you have. The clustering was okay. Groups of samples that perceptually went together weren’t always grouped with each other, but occasionally really tight clusters spread out from a main cluster would appear and they would be very agreeable (all sinusoidal, for example). What I didn’t like about this was that there was a huge blob of pretty much everything else in the middle of my plot, which is fairly incomprehensible in terms of navigation or what the clustering really means.
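A rough sketch of that reduction step, assuming NumPy and random stand-in data in place of the real 120 statistics per sample. This is PCA via the singular value decomposition, not the code from the repository mentioned below; `pca_2d` is a hypothetical name.

```python
import numpy as np

def pca_2d(features):
    """Project rows of `features` (samples x descriptors) onto 2 components."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)                  # centre each descriptor
    std = X.std(axis=0)
    X = X / np.where(std == 0, 1.0, std)    # put descriptors on a common scale
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 120))   # stand-in for 200 samples x 120 stats
coords = pca_2d(data)                # one (x, y) pair per sample
```

Scaling each descriptor before the projection is a choice made here so that descriptors with large numeric ranges (Hz, for instance) don't dominate; whether that suits a given corpus is a judgment call.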

After talking to @a.harker he said that the big learning people take many more data points than what I did and let the computer figure it out. So that’s what I did. Using a whole order of magnitude more points REALLY improves the mapping and perceptual grouping. Things like ‘iterated’ samples come together in the plot while sustained sounds are separated, for example; overall, the mapping is just that much better. For each file I was looking at maybe 5000 points. I used almost every descriptor available in the Essentia library and ran batch processing which took 49 hours to complete. It then takes some aggregated stats on each descriptor, much like fluid.bufstats~, giving me the mean, min, max, median, variance, moments, stddev, derivative mean and so on and so forth.
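To make the aggregation step concrete, here is a small sketch using only Python's standard library. It covers a subset of the statistics listed above for a single descriptor track; the values are invented, and it is not a reimplementation of fluid.bufstats~ or of the Essentia batch code.

```python
import statistics

def summarize(frames):
    """Collapse one descriptor's per-frame values into summary statistics."""
    diffs = [b - a for a, b in zip(frames, frames[1:])]  # frame-to-frame change
    return {
        "mean": statistics.fmean(frames),
        "min": min(frames),
        "max": max(frames),
        "median": statistics.median(frames),
        "variance": statistics.pvariance(frames),
        "stddev": statistics.pstdev(frames),
        "derivative_mean": statistics.fmean(diffs) if diffs else 0.0,
    }

# Hypothetical spectral centroid track (Hz) for one short sample.
centroid_track = [1000.0, 1200.0, 1400.0, 1100.0]
stats = summarize(centroid_track)
```

Running this over every descriptor and concatenating the results is how a few dozen descriptors turn into thousands of data points per file.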

If you do go down the dimensionality reduction route, go big - it really helped me achieve better results and I’m still tinkering with it now to see where it takes my understanding of this huge set I’ve been dealing with.

The repository with all the code for Essentia is in here.

The other code (a mixture of python and fluid~) is in its parent.

rodrigo.constanzo · July 24, 2019, 6:23pm

Yeah totally. I plan on actually sampling all of my own metal bits in great detail to add to the library, to further add source blurring. Maybe even including some robot/solenoid playback to have “acoustic samples” playing back too.

Very interesting and promising sounding.

I look forward to seeing where the fluid. objects go in this direction, particularly with regards to the clustering and clumping. There was some interesting discussion about being able to seed/massage the map and then rerunning the self-organization to fit that criteria (better).

I’d be curious to hear what you’ve gotten up to. Even a screencap video browsing the resulting maps.

jamesbradbury · July 24, 2019, 6:24pm

Cool, I’ll put something together when I feel like it’s more mature. Right now it’s me playing guessing games with the tech and the implementation.

weefuzzy · July 25, 2019, 9:13am

That’s lovely, thanks for sharing!

A note on self organising maps: they will collapse everything down into a set of single points, so – unless I misread you – for what you’d describe you’d want an SOM with as many nodes as you have samples, and I’m not sure how well that would work (@groma would know). If what you’re after in the future is a way of taking a set of dimension-reduced points and forcing them to be evenly distributed across a space, there may be other ways of doing it as post-processing to other techniques.

weefuzzy · July 25, 2019, 9:24am

Thanks James, great to hear what you’ve been up to.

I think this merits some qualification: probably, as a general rule, throwing more data at things can yield better results, but you might have needed so much in this case as a consequence of using principal component analysis, which can only provide a linear mapping between your spaces. You might find that some of the more contemporary techniques like ISOMap or t-SNE give you as good or better results without needing to throw everything at it.

Our NIME paper this year (well, @groma’s really) was about this (explaining what we showed folk in Huddersfield in November). Don’t know if you’ve seen the code yet, but it might be interesting for your purposes (it’s a mutant mixture of python and SC at the moment):

Paper here:
http://www.nime.org/proceedings/2019/nime2019_060.pdf

rodrigo.constanzo · July 25, 2019, 9:30am

Hmm. I thought a SOM did by default (unless “node” has a special meaning in the context of SOMs). Isn’t the whole idea that each sample (or object/whatever) would correspond with a point that is then organized in some manner (usually cluster-y(?)).

Do tell me more…

weefuzzy · July 25, 2019, 9:44am

A SOM has a given number of nodes (‘neurons’) which doesn’t have to be the same as the number of points in your original space. The way it works in training is that a point in your original space gets associated with only a single neuron in the map, but a single neuron can be the best match for multiple points from the original space. This means that, in principle, there could be neurons in the map for which there are no samples from the original space (I think), giving you gaps, and other neurons for which there are many points, all collapsed to a single point. What we did in FluidCorpusMap, to avoid a bunch of points sitting on top of each other, was to use some noise to distribute the samples around the points represented by each neuron.
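The behaviour described above can be sketched in a few lines of NumPy. This is a toy SOM written for illustration, not the FluidCorpusMap code: the grid size, learning-rate schedule, uniform jitter and all function names are assumptions.

```python
import numpy as np

def train_som(points, grid=(4, 4), epochs=20, seed=0):
    """Train a small SOM; returns node weights shaped (rows, cols, dims)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.random((rows, cols, points.shape[1]))
    # Grid coordinates of every node, used for the neighbourhood function.
    gy, gx = np.mgrid[0:rows, 0:cols]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = 0.5 * frac                                # learning rate decays
        sigma = max(0.5, max(rows, cols) / 2 * frac)   # so does the radius
        for p in rng.permutation(points):
            by, bx = best_node(weights, p)
            d2 = (gy - by) ** 2 + (gx - bx) ** 2
            pull = np.exp(-d2 / (2 * sigma ** 2))[..., None]
            weights += lr * pull * (p - weights)       # drag nodes toward p
    return weights

def best_node(weights, point):
    """Grid position (row, col) of the node closest to `point`."""
    dist = np.linalg.norm(weights - point, axis=2)
    return np.unravel_index(np.argmin(dist), dist.shape)

def layout(weights, points, jitter=0.3, seed=1):
    """Map each point to its node, then add noise so points don't overlap."""
    rng = np.random.default_rng(seed)
    nodes = np.array([best_node(weights, p) for p in points], dtype=float)
    return nodes, nodes + rng.uniform(-jitter, jitter, size=nodes.shape)

rng = np.random.default_rng(42)
points = rng.random((100, 5))          # stand-in for 100 samples x 5 features
weights = train_som(points)
nodes, positions = layout(weights, points)
```

With 100 points and only 16 nodes, several points necessarily share a node, which is the "collapsed to a single point" effect; `positions` is the jittered version that spreads them around each node.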

With the caveat that I may be talking rubbish, it seems to me that what you’re after here is a slight modification of what the visually orientated clustery approaches are shooting for. With these, the dimension reduction algorithms are focused, in part, on trying to preserve the original distances between points from the higher dimensional space. For your purposes, you’re interested in the layout but want to enforce equal distance so that everything is uniformly distributed across the drum, yes? Now, that may be possible by modifying an algorithm, or (perhaps more simply) by warping the reduced space after the fact to enforce this. That way, you might be able to explore the kinds of association that other reduction algorithms give, but still have things spread evenly.
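One simple way to warp a reduced space after the fact is a rank transform on each axis. The sketch below is an assumption about what such post-processing could look like, not something from FluidCorpusMap; it evens out each axis separately, which does not guarantee a perfectly uniform 2D layout.

```python
def spread_evenly(values):
    """Replace each value with its rank, rescaled to 0..1.

    Order along the axis is kept, but the gaps between neighbours
    become equal, so dense clumps are stretched out.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    step = 1.0 / max(len(values) - 1, 1)
    for rank, i in enumerate(order):
        out[i] = rank * step
    return out

def warp_2d(points):
    """Apply the rank transform to x and y independently."""
    xs, ys = zip(*points)
    return list(zip(spread_evenly(xs), spread_evenly(ys)))

# Hypothetical reduced coordinates: three clumped points and one outlier.
reduced = [(0.10, 0.50), (0.11, 0.52), (0.12, 0.49), (0.90, 0.10)]
warped = warp_2d(reduced)
```

The original coordinates stay untouched in `reduced`, so the warped layout can be regenerated or discarded without losing the real distances.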

jamesbradbury · July 25, 2019, 9:59am

Thanks for the further materials to investigate @weefuzzy. I will delve deeper this weekend and report back on what I find out.

jamesbradbury · July 25, 2019, 10:00am

The ml.som example is pretty good for explaining the way SOM maps. I too was a bit “eh” when I realised you have to map onto a fixed grid of a certain size.

jamesbradbury · July 25, 2019, 10:07am

Bless sklearn.manifold! Excited to try these techniques

rodrigo.constanzo · July 25, 2019, 10:27am

Oh I see. I guess that’s for computational efficiency/sanity purposes?

For the time being, the resolution on the BopPad is only 0-127 (although in reality it is more like 0-4 resolution…), so it would make sense that each MIDI value would correspond with several samples from the set, so there would be clusters at each “MIDI note node”. This isn’t the most effective use of controller space/expressivity though, but is the reality with a controller.

A more problematic version would be using the onset descriptors idea to navigate an arbitrary sample space, but trying to hyper-normalize the “playable space” via SOM-type funny business. So I could use only a couple of input features (loudness/centroid for example) and use that to navigate a large dimension (but reduced) space. In this use case the “nodes” would be possible input descriptor values, I guess(?).

Yes, exactly. Although it may be another X/Y controller (gamepad), or audio descriptor input (as mentioned above).

Is this a toolbox2-type of thing (or a wishlist of toolbox2 at minimum) or is this something doable (in Max) with existing tools (without a great deal of faff)?

jamesbradbury · July 25, 2019, 3:48pm

I think it’s perfectly reasonable to scale 2-dimensional coordinates over a space more evenly with just native max stuff but it depends on where you store it… entrymatcher might require some pre-processing before you store it in entries.

rodrigo.constanzo · July 25, 2019, 4:35pm

Hmm. What kind of process does that? Like figuring out the minimum and maximum values, and dividing the difference between them by the amount of entries to evenly space them out? What if it is multidimensional?

(also, is it possible to just do that on the incoming query data, so the data in the database remains “real”?)
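As a sketch of what warping the query (rather than the stored data) could look like for a single dimension: the controller position is treated as a quantile and turned into a "real" descriptor value before searching. This is one possible approach under that assumption, with invented values and a hypothetical function name.

```python
def quantile_lookup(sorted_values, position):
    """Return the stored value found `position` (0..1) of the way through
    the sorted data, interpolating between neighbours."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    scaled = min(max(position, 0.0), 1.0) * (len(sorted_values) - 1)
    lower = int(scaled)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = scaled - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac

# Hypothetical centroid values (Hz) stored untouched in the database.
centroids = sorted([400.0, 450.0, 500.0, 550.0, 6000.0])
query = quantile_lookup(centroids, 0.5)   # halfway across the controller
```

A plain linear scaling of 0.5 over the same range would ask for 3200 Hz, where nothing is stored; the quantile version lands on the middle entry instead.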

jamesbradbury · July 27, 2019, 5:23pm

At some point your multidimensional data is reduced and becomes 2d on the pad? That’s the point at which you do the spacing out I assume as this is the information you are interested in making more coherent across the playable space.

I guess you could, but to me it makes more sense to morph the space of your data and keep a reference to the original set. In my way of working I almost always do any transformations in memory and keep the ‘real’ database on my disk. If I need to use the morph again it’s stored on my disk and I create a new iteration. To me, morphing the input data seems wonky - but that’s just me.

One technique might be:

rodrigo.constanzo · July 28, 2019, 12:38am

Right, so it’s two distinct processes.

One is just the data reduction, with all of its variations/complications, then after that, the reduced data is then spaced out with some kind of SOM.

If I remember right you were using machine learning package/externals during the last plenary. Was this the ml.star package? Do you have any patches that you’ve done either one of these things that you could share?

(I also found an old patch I had gotten from @a.harker that has some SOM stuff in it. Sadly it looks like there are underlying externals that do the heavy lifting and either I don’t have them or they’re not 64bit cuz Max isn’t happy with the patch)

spluta · July 28, 2019, 10:17am

Where can I go to understand what the autoencoder is doing in the FluidCorpusMap?

weefuzzy · July 28, 2019, 11:09am

Depends a bit where you’re starting from, and how deep / mathsy you want to go. We don’t give any substantial detail in the NIME paper, because we were pressed for space unfortunately.

I’ll give you a very-potted version here, and a couple of links. If you’d like, I’ll write up something a bit longer and put it in the learning resources section as a trial for the KE site. @groma knows much more than I about all this stuff, so might also have some good resources or corrections to what follows.

Briefly

We use autoencoders here as a way of learning features directly from the data. That is, given some collection of sounds, we don’t know in advance what set of features out of all the possible options represents this particular collection well. Moreover, because in Fluid Corpus Map we’re going to squish these features down into just a couple of dimensions, we’re not all that interested in exactly what each feature individually represents about the signal, so long as the combined features capture the overall properties well across the collection. In this particular case, we were comparing this approach with MFCCs, so we’re using the autoencoder to take a spectral frame and yield the same number of features as we’re using MFCCs (12), to see whether this yields more musically interesting or perceptually robust spaces after dimensionality reduction (I think the answer was: sometimes, maybe).

Autoencoders are a neural network architecture (or, rather, a family of NN architectures) that are simply trained to try and reproduce their inputs. You have a layer of input neurons that are the same dimensionality as your input (say 513 magnitude spectrum coefficients) and an output layer the same size. In the middle you have one or more ‘hidden’ layers that are of progressively smaller size from the input to the centre of the network, and symmetrical structure out towards the output layer. The idea is that the smaller layer(s) provide a useful – albeit slightly abstract – representation of the data using fewer numbers, in such a way that the original can be retrieved. You then get at your learned features by reading directly from the hidden layer(s) in response to input on a trained network.

So, during training, we throw a bunch of spectral frames at it from the provided collection of sounds, and adjust the weights between points in the network so that error between the inputs and outputs is minimized. In this particular case, we use a very small network and not very many iterations of learning, to keep things quick. Then, once trained, we feed the sounds in frame by frame, and read the features from the hidden layer, and use these features as the input to the selected dimensionality reduction algorithm.
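A very small NumPy sketch of that train-then-read-the-hidden-layer idea. It is not the FluidCorpusMap implementation: it assumes a single 12-unit hidden layer with tanh, a linear output, plain gradient descent, and synthetic 64-bin frames standing in for real magnitude spectra.

```python
import numpy as np

def train_autoencoder(frames, n_hidden=12, epochs=300, lr=0.05, seed=0):
    """Train a one-hidden-layer autoencoder with plain gradient descent.

    Returns the weights and the mean squared reconstruction error
    measured at the start of each epoch.
    """
    rng = np.random.default_rng(seed)
    n_in = frames.shape[1]
    W1 = rng.normal(0, 0.1, (n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in))
    b2 = np.zeros(n_in)
    errors = []
    for _ in range(epochs):
        hidden = np.tanh(frames @ W1 + b1)       # encoder
        output = hidden @ W2 + b2                # decoder (linear)
        diff = output - frames
        errors.append(float(np.mean(diff ** 2)))
        # Gradient of the squared error, summed over bins and
        # averaged over frames (up to a constant factor of 1/2).
        g_out = diff / len(frames)
        g_hidden = (g_out @ W2.T) * (1 - hidden ** 2)
        W2 -= lr * (hidden.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (frames.T @ g_hidden)
        b1 -= lr * g_hidden.sum(axis=0)
    return (W1, b1, W2, b2), errors

def encode(frames, params):
    """Read the learned features from the hidden layer."""
    W1, b1, _, _ = params
    return np.tanh(frames @ W1 + b1)

# Synthetic stand-in for spectral frames: 300 frames x 64 bins with
# only a few underlying degrees of freedom.
rng = np.random.default_rng(1)
frames = 0.125 * rng.normal(size=(300, 4)) @ rng.normal(size=(4, 64))
params, errors = train_autoencoder(frames)
features = encode(frames, params)    # 12 numbers per frame
```

`features` is what would then be summarised per sound and handed to the dimensionality reduction stage, in place of (or alongside) MFCCs.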

In general, this would be an insufficient approach for a model that you could then throw any arbitrary sound at in the future, but because it’s so small and relatively quick to train and because we’re not interested specifically in how generalisable the features it learns are, this scheme works quite well for producing informative features for the moderately sized collections we were testing with. If you start using it with much bigger collections (or if you think it’s not delivering), you might need a bigger network (more layers) and / or more iterations of training.

Links

This article isn’t too unfriendly, but does take a certain amount of jargon for granted:

This is similar, but has slightly more concrete code examples

This chapter is much more technical
https://www.deeplearningbook.org/contents/autoencoders.html

rodrigo.constanzo · July 28, 2019, 11:23am

I have nothing meaningful to add here in response, other than excitement as to when (versions of) these things are implemented in Max…

spluta · July 28, 2019, 1:25pm

You are a gentleman and, as the previous post has also revealed, a scholar as well.

One dumb follow-up with a true or false answer:

True or False: Dimensionality reduction will improve the efficiency/speed of my search algorithm (NearestN search of a KDTree).

I am imagining reducing my high dimensional data down to maybe 6-10 for efficiency. I don’t want to plot.

Sam