@AudioGuide

First, thanks for the super detailed and thoughtful response!

It was interesting watching the video and hearing the nuts and bolts of your specific take and perspective on this, as it varies quite a bit from the (current) FluCoMa paradigm.

Thanks for the additional comments on the normalization stuff. I’m still getting my head around this aspect of things as it can get complex, particularly when MFCCs are in the mix.

That’s quite interesting.

I guess this makes the most sense in a “one off” context where you have a fixed target and a set corpus, since you can just normalize it as part of the query, but I wonder how this would fare with a stream of targets pouring in in a real-time context, re-normalizing on a per-query basis.
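Thinking out loud, something like this is how I picture the “one off” version (a rough Python sketch with made-up names, not how AudioGuide actually does it): the corpus scaling gets fitted once, and each incoming target just gets scaled against those same corpus stats, so nothing has to be re-fitted per query.

```python
import numpy as np

# Hypothetical sketch of per-query normalization (not anyone's actual code):
# corpus statistics are computed once, and each incoming target frame is
# scaled with those same statistics so the query space stays consistent.

def fit_minmax(corpus):
    """corpus: (n_entries, n_descriptors) array of raw descriptor values."""
    lo = corpus.min(axis=0)
    hi = corpus.max(axis=0)
    span = np.where(hi - lo == 0, 1.0, hi - lo)  # avoid divide-by-zero
    return lo, span

def normalize(frame, lo, span):
    return (frame - lo) / span

# one-off setup
corpus = np.random.rand(500, 4)        # stand-in for duration/loudness/etc.
lo, span = fit_minmax(corpus)
corpus_n = normalize(corpus, lo, span)

# real-time-ish loop: each target gets scaled against the *corpus* stats,
# so nothing has to be re-fit per query
def nearest(target_frame):
    t = normalize(target_frame, lo, span)
    dists = np.linalg.norm(corpus_n - t, axis=1)
    return int(dists.argmin())
```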

One of the toughest things to wrap my head around when dipping into the machine-learning side of things is that penetrability evaporates almost instantly. Not a big deal when dealing with things like MFCCs or a high-dimensional space, but there are still individual numbers (i.e. duration, loudness, etc…) that probably still mean a lot.

At the moment I’m trying to square that circle since the tools are built around a “match everything to everything” paradigm.

I like this kind of conditional matching. @tremblap has done some conditional sanitizing where things that are below a certain loudness, or have a spectral spread above a certain value, are “dismissed” by the corpus creation process. But this could be very useful for querying varied input where things like pitch and/or confidence may be useless for certain targets, as a way to just skip that part of the query rather than finding a way to sanitize the results, which is not without its own problems.
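To make sure I’m picturing the two flavours correctly, here’s a rough sketch (hypothetical thresholds and descriptor names, not anyone’s actual code): filtering entries out at corpus-creation time versus just dropping a descriptor from the distance for a given target.

```python
# Hypothetical sketch of the two flavours of conditional matching
# (thresholds and descriptor names are made up).

LOUDNESS_FLOOR = -40.0      # dB
SPREAD_CEILING = 4000.0     # Hz

def sanitize_corpus(entries):
    """Sanitizing at corpus-creation time: drop entries outright."""
    return [e for e in entries
            if e["loudness"] >= LOUDNESS_FLOOR
            and e["spectral_spread"] <= SPREAD_CEILING]

def query_distance(target, entry, pitch_conf_threshold=0.8):
    """Per-query conditional matching: skip pitch when confidence is low."""
    d = 0.0
    d += (target["loudness"] - entry["loudness"]) ** 2
    d += (target["centroid"] - entry["centroid"]) ** 2
    if target["pitch_confidence"] >= pitch_conf_threshold:
        d += (target["pitch"] - entry["pitch"]) ** 2
    return d ** 0.5
```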

That’s great, and probably accounts for the sound you get from AudioGuide, where things sound whole/complete (as opposed to granular/mosaicked).

I can’t think of how to do that in the FluCoMa context, as on its face it would seem to require a query per analysis frame, or something like that. Or just dumping the whole time series into a machine-learning algorithm and letting it “sort itself out”. Presumably the time-series-ness would be reflected in the matching, but perhaps not explicitly, as it would be treated like any other distance relationship rather than a hierarchical “container” for the rest of the querying to fall inside of.

As far as I understand it, the closest we have at the moment is having derivatives for any given value, which contain some kind of time-varying information, though skewness/kurtosis can perhaps offer some idea as well. We don’t have vanilla linear regression (again, as far as I know).

At the moment, each statistic is an island. That is, you get seven stats (mean, standard deviation, skewness, kurtosis, and low/mid/high centiles), and then derivatives of these things. But each one is run on a single data stream (typically a descriptor of some type, though since it’s buffer-based it can happen on audio as well).
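Roughly, I picture it like this (a loose Python sketch, not the actual FluCoMa implementation, and the exact centile choices are assumed), with a plain linear-regression slope tacked on as the kind of extra time-varying summary I mean:

```python
import numpy as np

def seven_stats(x):
    """Rough stand-in for the per-stream stats: mean, std, skewness,
    kurtosis, and low/mid/high centiles (centile choices assumed)."""
    mean = x.mean()
    std = x.std()
    z = (x - mean) / (std if std else 1.0)
    skew = (z ** 3).mean()
    kurt = (z ** 4).mean()
    lo, mid, hi = np.percentile(x, [25, 50, 75])   # assumed centiles
    return [mean, std, skew, kurt, lo, mid, hi]

def summarize(stream):
    """Each descriptor stream is summarized on its own ("each statistic is
    an island"): stats of the values, stats of the frame-to-frame
    derivative, plus a linear-regression slope as a simple time summary."""
    x = np.asarray(stream, dtype=float)
    stats = seven_stats(x)
    stats += seven_stats(np.diff(x)) if len(x) > 1 else [0.0] * 7
    slope = np.polyfit(np.arange(len(x)), x, 1)[0] if len(x) > 1 else 0.0
    return stats + [slope]
```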

I suppose one could do this “manually”, but it would be quite tedious/messy, as it would involve multiplying every sample in a buffer by a value, given that all(ish) data types are buffers.
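In sketch form (numpy arrays standing in for buffers, names made up), the “manual” version would boil down to something like this:

```python
import numpy as np

# Sketch of the "manual" weighting: every frame of a (1D) descriptor stream
# gets multiplied by a weight derived from the linear amplitude of the same
# frame, which is what scaling buffer-by-buffer would amount to.

def weight_descriptor(descriptor_frames, loudness_db):
    amp = 10 ** (np.asarray(loudness_db) / 20.0)   # dB -> linear amplitude
    weights = amp / amp.sum()                      # weights sum to 1
    return np.asarray(descriptor_frames) * weights

def weighted_mean(descriptor_frames, loudness_db):
    return weight_descriptor(descriptor_frames, loudness_db).sum()
```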

Did you abandon this approach due to complexity, or because of the limited usability? (i.e. only mel-based descriptors)

I’ve been working on some real-time spectral compensation (e.g. using the mel-band-based spectral shape of the target to apply a corresponding filter to the match, so the two sound more alike), so an approach like this might make sense since I’m already doing mel-band analysis of both the source and target anyway.
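As a sketch of what I mean (my own approach, made-up names, assuming the mel analyses of target and match are already in hand): derive a per-band gain from the ratio between the two spectral shapes, clamp it, and use that to drive the compensation filter.

```python
import numpy as np

# Sketch of the spectral-compensation idea (my own approach, not a FluCoMa
# or AudioGuide feature): per-mel-band gain from the ratio between the
# target's spectral shape and the matched sample's, applied as a filter.

EPS = 1e-9

def compensation_gains(target_mel, match_mel, max_gain_db=12.0):
    """target_mel / match_mel: per-band (linear) magnitudes, same length."""
    gains = (np.asarray(target_mel) + EPS) / (np.asarray(match_mel) + EPS)
    limit = 10 ** (max_gain_db / 20.0)
    # clamp so quiet bands in the match don't explode
    return np.clip(gains, 1.0 / limit, limit)

# e.g. 24 mel bands for both target and match
target_mel = np.random.rand(24)
match_mel = np.random.rand(24)
gains = compensation_gains(target_mel, match_mel)
# 'gains' would then drive a filterbank applied to the matched sample's audio
```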

Presumably what follows below about the specifics of how to subtract and find remainders (based on loudness) would be the same when doing it per mel-band?

This makes more sense… And I understand what you mentioned above about weighting descriptors (in general) against their linear amplitude.

Aaand the devil is in the details. So taking the means of spectral descriptors wouldn’t play so nice with this approach.

Thankfully, for my most general use case I’m dealing with tiny analysis windows (256 samples with @fftsettings 256 64 512), so the amount of smearing across so few frames is probably much less than what would happen across a file or segment that’s 1000ms+.
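Back-of-envelope, assuming 44.1kHz:

```python
SR = 44100          # assuming 44.1kHz
WIN, HOP = 256, 64  # @fftsettings 256 64 512

def n_frames(n_samples, win=WIN, hop=HOP):
    return max(0, (n_samples - win) // hop + 1)

print(n_frames(256))            # a single frame for a 256-sample window
print(n_frames(int(1.0 * SR)))  # hundreds of frames for a 1000ms segment
```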

Either way, tons to think about, both in terms of things to test and apply, as well as some wish-list-y stuff for the FluCoMa tools.