Depends a bit on where you’re starting from, and how deep / mathsy you want to go. We don’t give any substantial detail in the NIME paper because, unfortunately, we were pressed for space.
I’ll give you a very potted version here, and a couple of links. If you’d like, I’ll write up something a bit longer and put it in the learning resources section as a trial for the KE site. @groma knows much more about all this stuff than I do, so might also have some good resources or corrections to what follows.
Briefly
We use autoencoders here as a way of learning features directly from the data. That is, given some collection of sounds, we don’t know in advance which set of features, out of all the possible options, represents this particular collection well. Moreover, because in Fluid Corpus Map we’re going to squish these features down into just a couple of dimensions, we’re not all that interested in exactly what each feature individually represents about the signal, so long as the combined features capture the collection’s overall properties well. In this particular case, we were comparing this approach with MFCCs, so we use the autoencoder to take a spectral frame and yield the same number of features as we get from the MFCCs (12), to see whether this produces more musically interesting or perceptually robust spaces after dimensionality reduction (I think the answer was: sometimes, maybe).
Autoencoders are a neural network architecture (or, rather, a family of NN architectures) that are simply trained to try and reproduce their inputs. You have a layer of input neurons with the same dimensionality as your input (say, 513 magnitude spectrum coefficients) and an output layer of the same size. In the middle you have one or more ‘hidden’ layers that get progressively smaller from the input towards the centre of the network, then mirror that structure back out towards the output layer. The idea is that the smallest layer(s) provide a useful – albeit slightly abstract – representation of the data using fewer numbers, in such a way that the original can be retrieved. You then get at your learned features by reading directly from the hidden layer(s) in response to input on a trained network.
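If it helps to see the shape in code, here’s a minimal sketch in Python / PyTorch (not the code from the paper; the layer sizes are just illustrative, going from a 513-bin magnitude spectrum down to 12 features and back):

```python
import torch
import torch.nn as nn

class SpectralAutoencoder(nn.Module):
    """Toy autoencoder: 513 magnitude-spectrum bins -> 12 features -> 513 bins."""
    def __init__(self, n_bins=513, n_features=12):
        super().__init__()
        # Encoder: progressively smaller layers down to the bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(n_bins, 128), nn.ReLU(),
            nn.Linear(128, n_features),
        )
        # Decoder: mirrors the encoder back up to the input size
        self.decoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, n_bins),
        )

    def forward(self, x):
        z = self.encoder(x)        # the 12 learned features live here
        return self.decoder(z), z  # reconstruction + bottleneck features
```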
So, during training, we throw a bunch of spectral frames at it from the provided collection of sounds, and adjust the weights between points in the network so that the error between the inputs and outputs is minimized. In this particular case, we use a very small network and not very many iterations of learning, to keep things quick. Then, once trained, we feed the sounds in frame by frame, read the features from the hidden layer, and use these as the input to the selected dimensionality reduction algorithm.
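In code, that train-then-extract step might look something like this, building on the `SpectralAutoencoder` sketch above (again just a sketch: `spectral_frames` is assumed to be an `(n_frames, 513)` array of magnitude spectra from your collection, and the epoch count and learning rate are placeholders rather than the values we actually used):

```python
import torch

def train_and_extract(spectral_frames, n_epochs=50, lr=1e-3):
    """Train the toy autoencoder on a collection's spectral frames,
    then return the 12-D bottleneck features for every frame."""
    x = torch.as_tensor(spectral_frames, dtype=torch.float32)
    model = SpectralAutoencoder(n_bins=x.shape[1], n_features=12)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()

    for _ in range(n_epochs):        # a modest number of passes keeps it quick
        opt.zero_grad()
        recon, _ = model(x)
        loss = loss_fn(recon, x)     # minimise error between inputs and outputs
        loss.backward()
        opt.step()

    with torch.no_grad():            # read the features from the hidden layer
        _, features = model(x)
    return features.numpy()          # (n_frames, 12): hand these to whatever
                                     # dimensionality reduction you've selected
```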
In general, this would be an insufficient approach for a model that you could then throw any arbitrary sound at in the future. But because the network is so small and relatively quick to train, and because we’re not especially interested in how generalisable the learned features are, this scheme works quite well for producing informative features for the moderately sized collections we were testing with. If you start using it with much bigger collections (or if you think it’s not delivering), you might need a bigger network (more layers) and / or more iterations of training.
Links
This article isn’t too unfriendly, but does take a certain amount of jargon for granted:
This is similar, but has slightly more concrete code examples:
This chapter is much more technical:
https://www.deeplearningbook.org/contents/autoencoders.html