Live audio mosaicing on the web

Hi!

Could anyone here recommend any resources for implementing real-time CBCS? I’m guessing some of Diemo Schwarz’s stuff? I want to code a web-based corpus-based concatenative synth using Freesound as a source for corpora. I think it could be fun, but I know implementation can be tricky, so any pointers to particularly didactic material on the topic would be great (books, articles, websites, videos… anything really).

Thanks!

Hey Jorge,

This is such a cool idea to take on for the web. Your question is specifically about CBCS, but it might be relevant to think about how the disparate network of web technologies can come together to meet your goal. I’ve seen some of your code on GitHub, so I’m going to assume you have some js-fu already :wink:

Audio

In my experience Tone.js is pretty good for dealing with the Web Audio API. If you’re more fluent you could roll your own playback engine, but I believe just having buffers of sound that you then point to with a Tone.Player will get you pretty far. With Tone.Loop and such you also get high-priority scheduling, meaning repetitive playback (something like CataRT’s “fence” and “bow” mechanisms) won’t be troubled by the timers of the main js thread. It’s at least a good place to start where you don’t have to write your own APIs and stuff :face_vomiting:
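
A minimal sketch of the kind of playback loop I mean, assuming a recent Tone.js (the sound URL and the #go button are placeholders):

```js
import * as Tone from "tone";

// one Player per corpus sound; toDestination() wires it straight to the output
const player = new Tone.Player("https://example.com/corpus-unit.wav").toDestination();

// repetitive triggering: the Loop callback gets a lookahead time from Tone's
// scheduler, so playback isn't at the mercy of setTimeout on the main thread
const loop = new Tone.Loop((time) => {
  player.start(time, 0.5, 0.1); // play a 100 ms slice from 0.5 s into the buffer
}, "8n");

// audio contexts have to be resumed from a user gesture
document.querySelector("#go").addEventListener("click", async () => {
  await Tone.start();
  await Tone.loaded(); // resolves once the Player's buffer has downloaded
  loop.start(0);
  Tone.Transport.start();
});
```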

Analysis

This will entirely depend on the interface that you develop for the user. Do they upload a sound client-side and then analysis is run on some local blobs? Does it get sent away to a server somewhere over a RESTful API and all the magic happens behind the scenes?

In any case, you’ll need some audio descriptors, and you’ll have to choose whether to run the analysis client- or server-side, and how to get the data in and out of the places it needs to be. If the targets are pre-analysed, then replicating that analysis on the source sounds is something you’d strongly want to consider.

Keeping parity with the source analysis via essentia.js will likely be your friend here. I know there is (or was) a FluCoMa compilation for js that @weefuzzy played with, so he has more info on whether that culminated in something usable.
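
If you go client-side, here’s a rough sketch of per-frame MFCC analysis with essentia.js, written from memory, so treat the exact signatures and defaults as assumptions to check against their docs:

```js
import { Essentia, EssentiaWASM } from "essentia.js";

const essentia = new Essentia(EssentiaWASM);

// audioBuffer is an AudioBuffer you've decoded elsewhere (decodeAudioData etc.)
function analyse(audioBuffer) {
  const signal = audioBuffer.getChannelData(0); // mono for simplicity
  const frames = essentia.FrameGenerator(signal, 2048, 1024);
  const descriptors = [];
  for (let i = 0; i < frames.size(); i++) {
    const windowed = essentia.Windowing(frames.get(i)).frame;
    const spectrum = essentia.Spectrum(windowed).spectrum;
    const { mfcc } = essentia.MFCC(spectrum);
    descriptors.push(essentia.vectorToArray(mfcc));
  }
  frames.delete(); // emscripten-bound vectors are manually managed
  return descriptors; // one MFCC vector per frame
}
```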

Matching

Matching is perhaps a more open-ended question that you’ll have to answer in your implementation, and there’s territory for further exploration in the distance metrics and structures you use to query your descriptor space. It could be interesting to implement different configurable distance metrics and see how this affects the closeness of target → source matching. Similarly, dimension reduction on the source and target could be novel and a playful form of control over the matching process, allowing the computer to mutate the space and relationships of your data. Some hints at structures for matching below, followed by a little sketch of swappable metrics:

quadtree
kdtree
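
For the metric side, a small sketch in plain js; brute force is fine for prototyping, and you’d swap in one of the structures above once the corpus grows:

```js
// interchangeable distance metrics over plain arrays of numbers
const metrics = {
  euclidean: (a, b) => Math.hypot(...a.map((x, i) => x - b[i])),
  manhattan: (a, b) => a.reduce((sum, x, i) => sum + Math.abs(x - b[i]), 0),
  cosine: (a, b) => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb)); // cosine *distance*
  },
};

// brute-force nearest neighbour with a pluggable metric
function nearest(target, corpus, metric = metrics.euclidean) {
  let best = null, bestDist = Infinity;
  for (const unit of corpus) {
    const d = metric(target, unit.descriptor);
    if (d < bestDist) { bestDist = d; best = unit; }
  }
  return best;
}

// hypothetical corpus entries, e.g. MFCC means per Freesound slice
const corpus = [
  { id: "freesound-123", descriptor: [0.2, 0.4, -0.1] },
  { id: "freesound-456", descriptor: [0.9, -0.3, 0.5] },
];
console.log(nearest([0.1, 0.5, -0.2], corpus, metrics.cosine).id);
```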

In terms of more general resources, here are some I’ve personally read and found useful in various ways.

good starter on CBCS

more windy explanation of audio mosaicking
mosaicking 2

dimension reduction and its effect on matching
dimension reduction paper for fluid corpus map

Does this help? I’m not sure where you are currently at in terms of your knowledge of the problem space so I’ve tried to just share what has been helpful to me, again, recognising that I think you’re a pretty proficient web developer already :slight_smile:


Thanks for such a detailed reply, @jamesbradbury!

Your breakdown of the problem is in line with what I assumed the parts involved would be: audio management and playback, analysis, and matching.

Matching is perhaps the part where I need most guidance, and also where I’m most lost regarding tooling (so your links to quadtree and kdtree js implementations are welcome; I assume there will be similar tools for distance metrics and dimension reduction). With dimension reduction, you’ve made me think of neural net embeddings as another possibly interesting descriptor that’s already a sort of reduction of the input representation. It could be fun to see what sorts of matches you get (mis)using embeddings from a music autotagging model, for instance.

In terms of tooling and implementation for the audio playback and analysis parts, I already have some things I want to use and experiment with.

  1. I want everything to happen client-side. I might try to grab analysis data from Freesound together with the sounds, but only if I can replicate it easily on the live input, which brings me to…
  2. Analysis: I’d like to use essentia.js, but that’s just my own bias :sweat_smile:
  3. For audio playback, I thought using AudioWorklets could work, precisely to steer clear of the main js thread. I’ve never used Tone.js, and I’m not sure whether the two tools will work together (as far as I understand, Tone.js is simply a wrapper library on top of the Web Audio API, right?). A rough sketch of what I have in mind is below.
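
Something like this minimal worklet is the shape I’m imagining for point 3 (untested, all names are placeholders):

```js
// grain-processor.js, runs on the audio rendering thread
class GrainProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.buffer = null; // samples of the currently matched unit
    this.pos = 0;
    // the main thread posts Float32Arrays of samples to play
    this.port.onmessage = (e) => { this.buffer = e.data; this.pos = 0; };
  }
  process(inputs, outputs) {
    const out = outputs[0][0];
    if (this.buffer) {
      for (let i = 0; i < out.length; i++) {
        out[i] = this.buffer[this.pos++ % this.buffer.length]; // loop the unit
      }
    }
    return true; // keep the processor alive
  }
}
registerProcessor("grain-processor", GrainProcessor);

// main thread (inside a user-gesture handler so the context can start)
const ctx = new AudioContext();
await ctx.audioWorklet.addModule("grain-processor.js");
const node = new AudioWorkletNode(ctx, "grain-processor");
node.connect(ctx.destination);
node.port.postMessage(matchedUnitSamples); // a Float32Array from the matcher
```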

One thing I’m unsure about yet is how to deal with storage of the corpus. I’m guessing that for quick selection and playback the whole thing needs to be kept in RAM. Otherwise there’s the IndexedDB API, which can be used with Web Workers so as not to block the main UI thread when reading from disk, but I don’t know if that would be fast enough for real-time audio matching and playback. Here they mention that the corpora are “stored in FTM data structures in memory” but they also provide persistent storage of the data using SQLite, so I guess both?

One last question regarding terminology: I’m a bit confused by your use of “target” and “source”. I always assumed “source” refers to the corpora used for resynthesis, with “target” in this case being the live input to match using “source” units; i.e. the “source” is what could be pre-analysed. Is that correct?


It will likely be worth your time to do a proof of concept of just the matching, with a few data structures. Either create some fake data, or make some static descriptor analysis and fetch() it into a dummy javascript page, then experiment with source → target matching of very simple data. I would envisage that the kd-tree is the best place to start, as it’s well documented and the implementation I linked looks super transparent in how you might use it. Once you get something that seems to behave properly, you can probably put that bit down before working on the more gnarly stuff, which is likely to be client-side analysis and data storage.
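
Something like this, where descriptors.json is a made-up file of pre-baked analysis served next to the page:

```js
// descriptors.json: [{ "id": "unit-0", "descriptor": [0.1, ...] }, ...]
const corpus = await (await fetch("descriptors.json")).json();

// stand-in for a descriptor frame from live input analysis
const target = [0.3, -0.1, 0.7];

function dist(a, b) {
  return Math.hypot(...a.map((x, i) => x - b[i])); // plain euclidean
}

// linear scan is fine for a proof of concept; swap in the kd-tree lib later
const best = corpus.reduce((a, b) =>
  dist(target, a.descriptor) <= dist(target, b.descriptor) ? a : b
);
console.log("closest unit:", best.id);
```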

This is a great idea and something that could be interesting to explore. There’s good potential here to let your user transform the mapping by rearranging the topology of the network or adjusting the input data. That said, it’s probably a lower-level use case that is prone to lots of “false starts”, if you catch my drift.

Client-side will be much easier to begin with, at least, as you won’t have to broker your analysis through an API, either in the retrieval of pre-analysed targets or in sending your source away to get some analysis back.

I had a look at the Emscripten transpiles and the API looks decent. Web Workers might be useful too, although they’re not a panacea: they can still block the main thread during the (de)serialisation of objects that are passed between the worker and the main thread. If you’re passing humongous objects like audio buffers in and out, it might get slow :slight_smile:
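
One mitigation: transferable objects. Listing an ArrayBuffer in postMessage’s transfer list moves ownership instead of cloning it. Sketch below; the worker filename is a placeholder:

```js
const worker = new Worker("analysis-worker.js");

// samples would be e.g. audioBuffer.getChannelData(0), a Float32Array
function sendForAnalysis(samples) {
  // the second argument transfers the backing ArrayBuffer instead of
  // structured-cloning it, so nothing big is serialised on the main thread
  worker.postMessage(samples.buffer, [samples.buffer]);
  // caveat: samples is now detached on this side and can't be read again
}
```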

You are correct. It’s an abstraction layer over the Web Audio API that gets rid of some of the complexity required to do certain things. It’s mainly aimed at musical apps IMO, so it privileges things like “pitch” and gridded time structures. Useful for some things and not for others, so your mileage may vary.

Yeah, this is something quite gnarly to think about. Depending on the browser, you have a limited amount of internal storage that you can use, which is not your RAM. Here is something to read about it: https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Browser_storage_limits_and_eviction_criteria
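
You can also ask the browser directly via the Storage API (the figures cover IndexedDB and friends):

```js
// rough, origin-wide figures; quota varies per browser and free disk space
const { usage, quota } = await navigator.storage.estimate();
console.log(`using ${(usage / 1e6).toFixed(1)} MB of ~${(quota / 1e6).toFixed(0)} MB`);
```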

I don’t really know anything about IndexedDB. But I think if you’re a good boy when it comes to managing your memory, and you prod the GC to take things away at the right time, you can likely get away with having everything operate in memory, at least to begin with. If you want persistence, then the localStorage API is dead simple to navigate and quite powerful; it behaves like a cookie but isn’t one. Window: localStorage property - Web APIs | MDN Take this with a grain of salt :salt: because I’m not sure where the edges of IndexedDB are, although I know it is prevalent and well liked.

As above, you will likely have some kind of in-memory operations that are then persisted to a local database. You can create a handler in your app on page close to quickly shove everything into local storage or a database; a minimal sketch below.
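
Keep in mind localStorage is small (a few MB), so it should hold descriptors and metadata rather than audio:

```js
// working corpus lives in memory: id -> { descriptor: number[], ... }
const corpus = new Map();

// snapshot to localStorage when the page goes away
// ('pagehide' fires more reliably than 'unload', especially on mobile)
window.addEventListener("pagehide", () => {
  localStorage.setItem("corpus-snapshot", JSON.stringify([...corpus.entries()]));
});

// restore on load
const saved = localStorage.getItem("corpus-snapshot");
if (saved) for (const [id, unit] of JSON.parse(saved)) corpus.set(id, unit);
```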

I think I’m confused a bit too lol :laughing: I’ve muddled source and target up a bit, but my understanding is that the source is the sound file you provide, which you want to match the targets to. Does that clarify?

BTW @jorgemf, I’d be more than happy to go through any prototype code you have and give feedback where my expertise is valuable. Feel free to share some WIP stuff, even if it’s rough.
