I’m thinking about working on some descriptor dimensionality reduction for audioguide to create something like a 2D space for timbre. Anyone care to share any tips/algorithms/methods that work well? @jamesbradbury would you be willing to share which python module/params you’re using for MFCCs?
Ideally: I’d like to find a way to reduce N dimensions down to 2 dimensions, edit the data in 2D, and then map it back into N dimensions. If anyone has ideas about approaches that would enable this I’d be grateful.
librosa has a good implementation which is easy to reason about; I don't think it is especially fast, but it's probably fast enough. Processing the same file in FluCoMa was, I think, about three times faster. I wrapped up the CLI here, but that probably won't suit your use case, as you want AudioGuide to be fairly self-contained. In any case, librosa is a good place to start, and Brian McFee is a good maintainer; none of my code has broken between the versions I use. The thing with the librosa MFCC implementation is that it's not curated, but that gives you the power of applying liftering or choosing your own DCT type. This might be interesting to explore.
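For illustration, here is roughly what that looks like in librosa; the file path and parameter values below are placeholders rather than recommendations:

import librosa

# load a mono file at its native sample rate (path is a placeholder)
y, sr = librosa.load("some_sound.wav", sr=None, mono=True)

# 13 coefficients; dct_type and lifter are the knobs mentioned above
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, dct_type=2, lifter=22)
# mfccs has shape (n_mfcc, n_frames), one column per analysis frame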
As you are well aware, Python gives you access to many, many algorithms for rescaling/reducing/mapping data. I have had great success with umap-learn as a dimensionality reduction tool for MFCCs. I talk about my most recent application of those two things in this talk. It's relatively simple, and you can configure how it works with just two of its parameters (n_neighbors, min_dist). You might also try the t-SNE algorithm, which is another manifold technique (good at capturing non-linearities) and gives you some power over whether you want to preserve local or global topology, much like UMAP with its two main parameters. What will be useful for you is that you can dump the fitted embedding from either of these algorithms to disk and save it for later (in fact I think this is because they adhere to the scikit-learn API), allowing you to reduce future data using that previously calculated embedding. It would also allow you to map reduced data back out into that embedding to compare a pre-analysed corpus with some new stuff.
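As a rough sketch of that last point, assuming joblib for serialisation and stand-in arrays in place of real MFCC data, reusing a fitted UMAP reducer might look like this:

import joblib
import numpy as np
import umap

corpus = np.random.rand(500, 13)  # stand-in for per-segment MFCC vectors
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
corpus_2d = reducer.fit_transform(corpus)  # fit on the pre-analysed corpus

joblib.dump(reducer, "reducer.joblib")  # save the fitted reducer for later

# later, in another session: project new analyses into the same 2D space
reducer = joblib.load("reducer.joblib")
new_material = np.random.rand(50, 13)
new_2d = reducer.transform(new_material)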
You are no doubt able to reason around how the reduction works, but I thought I would link the simplest implementation that I’ve written which doesn’t include a bunch of ipynb jargon and matplotlib diagrams around it, as is custom for Python research code in the wild.
import numpy as np
import umap

def reduce(self):
    self.output = {}  # output dictionary: key -> reduced vector
    keys = list(self.input.keys())
    data = np.array(list(self.input.values()))  # shape (n_items, n_dimensions)
    reduction = umap.UMAP(
        n_components=self.components,
        n_neighbors=self.neighbours,
        min_dist=self.mindist,
    )
    embedding = reduction.fit_transform(data)
    # reassign each reduced vector to its original key
    for key, value in zip(keys, embedding):
        self.output[key] = value.tolist()
If you’d like someone to contribute code to AudioGuide I’d be more than enthusiastic to do so. Just drop me a line here or at my e-mail jamesbradbury93@gmail.com.
Thanks for the detailed reply James, I really appreciate it.
That’s funny, I’m actually already using librosa for NMF. Some colleagues have had trouble installing it on OS X with pip though.
Great, this is exactly the kind of info I’m looking for. Thanks so much. I just looked at/installed umap and it seems very straightforward. Will try it out in earnest in the coming weeks.
That’d be great. Some of the codebase is pretty primitive, as I started the project before I really knew how to code in Python. But I’m looking to revisit and expand the project starting sometime in the next year. It’d be great to have you contribute.
The idea is in its infancy, but, in addition to aiding browsing, I’d like to find a way to use multidimensional scaling to permit a time-varying morphology to be visually “edited” (in the context of concatenative synthesis). One possibility I’m exploring is to reduce a sound’s multidimensional descriptor data to 2D, let the user manipulate the 2D data (stretch, deform, rotate, but it’s all just moving points in the end), and then extrapolate the post-transformation multidimensional descriptor values. I don’t really know yet if this is a workable approach.
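For what it’s worth, umap-learn exposes an inverse_transform that attempts exactly this kind of extrapolation back into the original descriptor space. A minimal sketch, with made-up data and a deliberately crude “edit”; whether the result is musically meaningful is exactly the open question:

import numpy as np
import umap

descriptors = np.random.rand(200, 13)  # stand-in for N-dimensional descriptor data
reducer = umap.UMAP(n_components=2)
points_2d = reducer.fit_transform(descriptors)

# "edit" the 2D data: here, just shrink everything towards its centroid
centroid = points_2d.mean(axis=0)
edited_2d = centroid + 0.8 * (points_2d - centroid)

# extrapolate the edited points back into descriptor space
edited_descriptors = reducer.inverse_transform(edited_2d)

Note that inverse_transform is approximate and behaves best inside the region covered by the original embedding, so aggressive stretches may land in territory it can’t reconstruct sensibly.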
This sounds ideal for an autoencoder task. Soon you’ll be able to do that with our tools; for now you can only get the 2D, but if you train your MLP as a dimensionality reducer you can then have fun in 2D and get the corresponding Xd back.
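Not the FluCoMa tools themselves, but just to make the shape of the idea concrete, here is a toy autoencoder sketch in PyTorch (all sizes and data are placeholders): train it to reconstruct the descriptors through a 2D bottleneck, edit in 2D, then decode back out.

import torch
from torch import nn

data = torch.rand(200, 13)  # stand-in for 200 items x 13 descriptor dimensions

encoder = nn.Sequential(nn.Linear(13, 8), nn.Tanh(), nn.Linear(8, 2))
decoder = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 13))
model = nn.Sequential(encoder, decoder)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(2000):
    optimiser.zero_grad()
    loss = loss_fn(model(data), data)  # reconstruct the input through the bottleneck
    loss.backward()
    optimiser.step()

with torch.no_grad():
    points_2d = encoder(data)        # "tap out" at the 2D bottleneck
    edited_2d = points_2d * 0.9      # edit in 2D (placeholder manipulation)
    back_to_nd = decoder(edited_2d)  # decode the edited points back to descriptor space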
You are in Python so check these explanations which I found human friendly:
If that’s recent it might be because the maintainers of librosa haven’t pinned numba to 0.48. 0.50 has a breaking change and I think they are letting pip auto-update.
Morphology does seem to be one of the concerns on many minds right now. So I’m curious to see what comes of this!
One thing to keep in mind is that the dimensions that result from dimensionality reduction often lose their meaning (left vs. right no longer correlates to any known dimension), but they can still have some meaning (such as with PCA, where variance is salient when moving across the space). With things like t-SNE, autoencoders, and (I believe) UMAP, the resulting space is more akin to sonic “regions” that have local similarities but may have less similar global dimensional relations (I believe t-SNE will be the most guilty of this out of the three; is this right @jamesbradbury and @tremblap?).
This is all to say that as a user is manipulating the data in these lower dimensional spaces, they will probably first have to “learn the space” in order to understand “where” they may want to move a point or “where” they may want to look for new points, etc.
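A small illustration of the PCA point, with stand-in data: the PCA axes keep an interpretable, ranked share of the variance, whereas the axes of a UMAP or t-SNE embedding of the same data carry no comparable per-axis meaning.

import numpy as np
from sklearn.decomposition import PCA

descriptors = np.random.rand(200, 13)  # stand-in descriptor data
pca = PCA(n_components=2).fit(descriptors)

# each component accounts for a known, ordered share of the total variance,
# so moving along axis 0 "means" more than moving along axis 1
print(pca.explained_variance_ratio_)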
Similar to @jamesbradbury (I think), I’m a postgrad with some time on my hands, so if there’s some way I can help, let me know!
My hope is that what I am proposing might nonetheless be workable, since the mapping between target and corpus dataspaces is already a pretty abstract notion in concatenative synthesis, at least in the implementation that I use.
The goal of editing the target’s morphology is to make the mapping between the corpus and target a more fluid, creatively controllable process. To achieve this, the target and corpus data would be plotted in the same 2D reduced dimension space. The target data would be “editable” while the corpus wouldn’t be. My thinking is that the corpus points could be used as sonic guideposts for understanding the 2D space and what different edits of the target would portend.
This makes a lot of sense to me. I think it’s a really nice lens for concatenative synthesis. I have a plotter tool in SC that I use a lot – I might edit that tool to make it possible to move points around and see what that sounds/feels like.
If you pip install pynndescent it drastically increases the speed of umap. I can’t remember where I discovered that, but it helped me get over some longer training times. I believe in 0.5 it’s becoming an explicit dependency anyway, so it will probably be good to get ahead of the curve.
Thanks for a great talk and demo yesterday. It really got some ideas turning for me. Here’s a little video with some updates to a plotter I’ve been working with. I think it gets at some of the ideas you were discussing.
It isn’t yet able to edit the data points of the target. I need to think more about what that really means. It makes sense to me to just change the descriptor data for a given point, but that would also change its dimensionality reduction position and its nearest neighbors (which I understand to be part of the purpose). If/when to recompute all of that (driven by the why) is what I’m thinking about now. Any thoughts would be great. Thanks again for the inspiration yesterday!
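One way to frame the when-to-recompute question is to separate the (expensive) embedding from the (cheap) neighbour lookup. A rough sketch with made-up arrays, assuming the reduced corpus stays fixed while the target point is what gets edited:

import numpy as np
from sklearn.neighbors import NearestNeighbors

corpus_2d = np.random.rand(500, 2)  # stand-in for the reduced corpus points
nn = NearestNeighbors(n_neighbors=5).fit(corpus_2d)

target_point = np.array([[0.4, 0.6]])  # one (possibly hand-edited) target point
distances, indices = nn.kneighbors(target_point)
# indices gives the corpus segments closest to the edited position;
# only this query needs rerunning after each edit, while re-running the
# reduction itself can be deferred until the descriptor data really changes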
I very much like the last bit: my intuition is that your shared t-SNE space shows why your distances are far in the normalised independent space. I like that very much, as it shows your corpus is not ‘compatible’, aka it has a hole there where the target has its own coherence in its own space.
The ‘stretching’ idea is a transposition, which is inspiring and (because of that) messy. If one manages to keep salient contours of perceptually transparent descriptors (like you do) it could be fun as a variation maker, especially if both spaces are rich enough, but it gets very abstract in MFCC land…
Congrats on the video, it is very full of potential. The one thing that I keep wanting to do is to move the ‘result’ of the nearest match in the selected 2D space to hear that, and, if happier, retrain with that new (biased, distorted, powerful, curated) relationship of proximity… a semi-assisted ML approach with the time profile. I think you and @b.hackbarth are touching on something good there, and I wonder how it would link with the research questions of @groma
I look forward to hearing @b.hackbarth’s comments on it too.
I think the deeper we go into those interventionist strategies and semi-supervised approaches, the more we need to turn towards things that learn, like NNs and co, that can be trained on examples. My hunch is it will become computationally hefty territory (potentially!) but maybe not.
I have long wanted this kind of approach for segmentation, i.e. here are ten perfectly sliced files I did by hand, now go do the rest! And also the ability to transform a space as I hear it. Maybe one approach is to try and make a neural network learn how to do UMAP/t-SNE, and then we train it on our modifications to push it in a new direction. The model is stored so you can always revert to the original, but you can keep modifying the model by touching new points, moving them around, stretching the space. I feel like I want to finish my PhD now.
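A toy sketch of that “learn the reduction, then nudge it” idea, using scikit-learn’s MLPRegressor as a stand-in for a proper network and made-up data throughout:

import numpy as np
import umap
from sklearn.neural_network import MLPRegressor

descriptors = np.random.rand(300, 13)  # stand-in descriptor data
embedding = umap.UMAP(n_components=2).fit_transform(descriptors)

# teach a small network to reproduce the UMAP mapping
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000)
net.fit(descriptors, embedding)

# later: the user drags a few points to new 2D positions...
moved = [0, 1, 2]
new_positions = embedding[moved] + 0.5

# ...and the network is nudged towards the edited layout, while the
# original UMAP model is kept around so you can always revert
for _ in range(50):
    net.partial_fit(descriptors[moved], new_positions)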
I am finishing the AE example (we can now tap in and tap out of the network anywhere), but I think there might be a way to adapt it that way… I’m still fuzzy about all this, so we’ll release it as soon as it is stable, for you guys to explore it that way too, as data reduction…