AudioGuide (talk)

@tremblap sent out the link to @b.hackbarth’s talk on AudioGuide via email the other day, so I imagine most people got it, but I figured it was worth making a post as there are lots of interesting ideas here.

Here are some thoughts I had, somewhat unpacked.

It was interesting to hear the thinking and discussion on normalizing vs standardizing, and whether this is per dimension or across all dimensions. Definitely resonates with a lot of the sanitization stuff being discussed with the FluCoMa bits.

It was particularly interesting with regard to pitch: you can search for direct pitch matches, which can lead to potentially problematic things (for which @tremblap offers some interesting workarounds in his paper from a few years ago), OR just match the overall trajectory, which is useful in certain circumstances.


This was a passing comment, but it was interesting to hear him mention that having too many (similar) descriptors tends to give shittier results and drifts towards a median where you get an “ok” match for each descriptor type, as opposed to a good match for any. So kind of a (curated) less-is-more approach.

I wonder if that still holds true in the context of dimensionality reduction, or is part of the idea with the reduction stuff that the algorithm(s) iron out and throw away useless information, letting you take a more-is-more approach to descriptors/stats?


His multi-pass approach is also super interesting, and similar to the workflow(s) I’ve been using in entrymatcher and am trying to work out in here. Even putting aside the stuff I mention in that thread about metadata-esque searching on duration, or other conditional queries on “simple” descriptors (e.g. loudness > -20dB), the fact that queries can be chained seems like it could still be super useful.

Say you have like 150 dimensions, and like 40 of them relate to loudness-y things, and the rest are timbral ones. To have a query where you find the nearest neighbor(s) for loudness, and then from that small pool, search for the best timbral match etc…
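A minimal sketch of that two-pass idea, assuming plain numpy arrays (the function and argument names here are mine, not a FluCoMa or AudioGuide API): first prune to the nearest neighbours on the loudness-ish dimensions, then pick the best timbral match inside that pool.

```python
import numpy as np

def two_pass_match(target, corpus, loud_dims, timbre_dims, pool_size=10):
    """Pass 1: prune to nearest neighbours on the loudness dimensions.
    Pass 2: best timbral match within that reduced pool."""
    loud_dist = np.linalg.norm(corpus[:, loud_dims] - target[loud_dims], axis=1)
    pool = np.argsort(loud_dist)[:pool_size]  # indices of the pruned pool
    timbre_dist = np.linalg.norm(
        corpus[pool][:, timbre_dims] - target[timbre_dims], axis=1)
    return pool[np.argmin(timbre_dist)]  # corpus index of the best match
```

So with 150 dimensions you’d pass, say, the 40 loudness-related column indices as `loud_dims` and the rest as `timbre_dims`.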

I guess you can do stuff like that with fluid.datasetquery~, but the speed of that is pretty slow (for real-time use anyways), and it seems more like trying to shoehorn that way of thinking into the current tools/paradigm (a single long-thin stream with no tags).


I’m not entirely sure I understood what was happening with regards to time series, but unless I’m mistaken, I think he was saying that rather than (only) taking average statistics for each segment (ala fluid.bufstats~) he will take numbers per frame, to retain the temporal aspects of the sound.

Is that right? If not, how does he retain such a clear temporal shape in the matching?

Some of the LPT approach is a way to mitigate this, I guess, where you manually sub-segment each segment (both via fixed durations) and use the summed statistics for those windows to inform a temporal shape.

I’ve personally not tried this (yet), but is it a thing to take all the analysis frames for a given window/segment and query that way? I could see that being an insane amount of numbers very quickly.


Sadly, the briefest part of the video was the discussion of how layering works, which is something I was trying to figure out in this thread.

So my takeaway from this is that his implementation is (much!) simpler than I was initially thinking, where I imagined you would compare every possible combination ahead of time and then query based on that. From my understanding, AudioGuide does a normal query, finds the nearest match, subtracts that match’s amplitude from the target, and then searches again using the remainder.

It also seems that when doing layering, samples can be mosaicked and offset from the start, which circles me back to the matching-per-frame thing mentioned above.

The part I found especially interesting was the way timbral descriptors are treated, where a “weighted sum of the weighted average” (?) gives you an idea of how the timbral descriptors might sound when summed.

So does that mean that if I sum two samples:
centroid = 800Hz, loudness = -20dB
centroid = 1200Hz, loudness = -10dB

I would presume that the results of summing them would be:
centroid = 1000Hz, loudness = -15dB

(or maybe the centroid would be 979Hz instead, if the summing/averaging happens in the log domain).

And would something similar also apply to rolloff and flatness? (and perhaps spread and kurtosis too?)
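For reference, that 979Hz figure is what you get if the two centroids are averaged in the log domain, i.e. a geometric mean (quick check in plain Python):

```python
import math

centroids = [800.0, 1200.0]
linear_mean = sum(centroids) / len(centroids)  # plain mean
# log-domain average = geometric mean
log_mean = math.exp(sum(math.log(c) for c in centroids) / len(centroids))
print(linear_mean, round(log_mean, 1))  # 1000.0 979.8
```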

So in summary, my (poor) understanding of the layering is that it will make a first pass like it does for normal matching to find the nearest sample (via whatever descriptors/method/whatever).

That then leaves a remainder (based on amplitude only?).

It will then query all the potential matches again on how they would sum with the best match based on the maths above, and select that. Then this process would repeat for as many layers as are requested/desired.

Hello Rodrigo,

Yes, I think that normalization needs to be parametrizable per dimension. In my experience, some descriptors work better with min/max normalization routines while others lend themselves better to something like mean/std. Regardless, my $0.02: the most important choice is whether to standardize the corpus and target’s descriptors together or separately.

There is some text discussing normalization in the audioguide docs here.

My experience has been that matching 50 different dimensions of descriptors gives pretty bland results. However, I think this has to do with the nature of the “gap” between the sound worlds of the target and corpus. The more similar the corpus and target sound worlds (and/or the more comprehensive/variable the corpus), the better high-dimensional searches should work.

I have not (yet) tried matching in descriptor spaces which have been scaled. It is an interesting idea.

I think that, for real-time purposes, the hierarchical matching structure would be most useful, as you note, since you can first “prune” the size of the search pool based on lower-cost descriptor comparisons (power, duration, etc).

The thing that I think I like best about this approach is that it feels creatively purposeful. Rather than asking for the best match on 40 dimensions, which tends to be impenetrable to the user (ditto for dimensional scaling), you dictate what you want and the order in which you want those measurements to be considered. In my work I’ve found that there is no gold standard for measuring similarity, only what you’re interested in.

Yes, this is one scenario that I happen to use a lot. There are lots of other interesting possibilities for hierarchical search functions. For instance, if a target seg’s noisiness is greater than 0.5, calculate similarity with descriptorN, otherwise use descriptorM.

You’re correct - audioguide lets you match sounds using time-varying descriptor differences. And I do think that this is key to capturing morphological shape (alongside layering, which I discuss below). In the program, one has control over this on a descriptor-by-descriptor basis: asking for d(‘centroid’) matches time varying centroids, d(‘centroid-seg’) matches based on power-weighted averaged centroids, d(‘centroid-delta’) matches the first order difference of time varying centroids; d(‘centroid-delta-seg’)… well, you get the idea. It is possible to match target segments based on different descriptor modalities — one could match, for instance, time varying mfccs, averaged centroid, and the linear regression of amplitude.

The most important thing with averaging spectral descriptors is to weight averages with linear amplitude. Are you guys doing this in fluid.bufstats~? If not, it should certainly be an option, if not the default.

Of course, it is possible to represent time varying descriptor characteristics in other ways (fixed-length arrays, differences, linear regressions, etc) which can help circumvent the need for frame-by-frame calculations. I personally like how frame-wise matching sounds.

Yes, I think this method quickly approaches the limits of realtime, depending on the size of the corpus. But all of this will be moot in 10 years, when even the cheapest laptop will be able to churn out an excellent baguette.

Yes, you’re right, what audioguide does for layering is really quite simple compared to something like orchidée. It is a looped brute force approach. For each target segment:

1.) the best sound is selected.

2.) the time varying amplitude of the selected sound is subtracted from the target segment’s amplitude.

3.) the onset detection algorithm is then rerun on the subtracted target’s amplitude. another onset may be triggered at the same time, or later in the target segment depending on the strength of the residual amplitude. this permits sounds to be selected at different moments within a target segment.

4.) if another onset is found, a second sound is selected. this is done by comparing the target segment’s descriptors to all other corpus sound descriptors which have been algorithmically mixed with the descriptors of corpus sounds that have already been selected to fit the target segment in question. this is done frame by frame. So, if corpus segment A is selected to match a target segment, the next selection is made by comparing the target’s descriptors to a mixture of A + every other valid corpus sound. for each additional selection, the mix gets larger. e.g. selection three = A + B + every other valid corpus sound, etc.

5.) this process repeats as long as the target’s subtracted amplitude continues to trigger onsets (or unless the user supplies manual density restrictions).
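A toy, runnable sketch of that loop (my own simplifications, not AudioGuide’s actual code: centroid as the only descriptor, a bare amplitude-threshold test standing in for the real onset-detection rerun in step 3, and made-up data structures). The key bit is step 4: candidates are scored as amplitude-weighted mixtures with whatever has already been selected.

```python
import numpy as np

def mix_descriptor(desc_a, amp_a, desc_b, amp_b):
    # frame-wise mixture of a spectral descriptor, weighted by linear amplitude
    total = np.maximum(amp_a + amp_b, 1e-12)
    return (desc_a * amp_a + desc_b * amp_b) / total

def layer(target, corpus, threshold=0.05, max_layers=4):
    selected = []
    residual = target["amp"].copy()
    mix_desc = np.zeros_like(target["centroid"])
    mix_amp = np.zeros_like(residual)
    # toy stand-in for step 3/5: keep layering while the residual amplitude
    # still pokes above a threshold
    while len(selected) < max_layers and residual.max() > threshold:
        best_cost, best = None, None
        for s in corpus:
            # step 4: descriptors of (already-selected mix + this candidate),
            # compared frame by frame, weighted by the target's amplitude
            d = mix_descriptor(mix_desc, mix_amp, s["centroid"], s["amp"])
            cost = np.sum(target["amp"] * np.abs(target["centroid"] - d))
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, s
        selected.append(best)                                    # step 1
        mix_desc = mix_descriptor(mix_desc, mix_amp, best["centroid"], best["amp"])
        mix_amp = mix_amp + best["amp"]
        residual = np.maximum(residual - best["amp"], 0.0)       # step 2
    return selected
```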

way back when, I was originally doing this in a more computationally intense way with the mel spectrum. when the first segment was selected, its mel amplitudes were subtracted from the target’s amplitudes and target descriptors were recalculated on the residual mel spectrum. this only worked for mel-based descriptors like mel centroid, mel flatness, mel-FCCs, etc. you could also do this on FFT magnitudes, but that would be crazy.

Almost. I don’t think the log/lin domain of descriptors matters for mixtures, but you need to weight the average of the different sounds according to their respective linear amplitudes. So,

sound 1 frame 1 = centroid 1000, power = 0.01

sound 2 frame 1 = centroid 2000, power = 0.02

mixture frame 1: centroid 1666.66, power = 0.03
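Spelling that arithmetic out in plain Python (the 1666.66 above is the same value, just truncated rather than rounded):

```python
centroids = [1000.0, 2000.0]
powers = [0.01, 0.02]

# spectral descriptors: power-weighted average; power itself: plain sum
mix_power = sum(powers)
mix_centroid = sum(c * p for c, p in zip(centroids, powers)) / mix_power
print(round(mix_centroid, 2), round(mix_power, 2))  # 1666.67 0.03
```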

This algorithm comes from Damien Tardieu’s PhD thesis. IIRC, Tardieu found that this approach was 95% accurate for spectral centroid, and should work well for all spectral features.

My intuition tells me that this works best for approximating time varying descriptor mixtures, and will not work as well for sounds where descriptors have already been averaged into a single number. Of course, you could do this first on the time series, then average the result in a second step (which is what AG does internally for averaged descriptor mixtures).

Audioguide does this automatically when layering sounds for most descriptors, except those that are not “mixable” (f0) or not spectral. For power, I think it just adds the numbers (hence the 0.03 value, above), which is quite dubious if you have a corpus of detuned sine waves. For zero crossings, it takes the max.



First, thanks for the super detailed and thoughtful response!

It was interesting watching the video and hearing the nuts and bolts of your specific take and perspective on this, as it varies quite a bit from the (current) FluCoMa paradigm.

Thanks for the additional comments on the normalization stuff. I’m still getting my head around this aspect of things as it can get complex, particularly when MFCCs are in the mix.

That’s quite interesting.

I guess this makes the most sense in a “one off” context where you have a fixed target and a set corpus, since you can just normalize as part of the query, but I wonder how this would fare with a stream of targets pouring in, in a real-time context, re-normalizing on a per-query basis.

This is one of the toughest things to wrap my head around when dipping into the machine-learning side of things: penetrability evaporates almost instantly. Not a big deal when dealing with things like MFCCs or a high-dimensional space, but there are still individual numbers (i.e. duration, loudness, etc…) that probably still mean a lot.

At the moment I’m trying to square that circle since the tools are built around a “match everything to everything” paradigm.

I like this kind of conditional matching. @tremblap has done some conditional sanitizing, where things that are below a certain loudness, or have a spectral spread above a certain value, are “dismissed” by the corpus creation process. But this could be very useful for querying varied input, where things like pitch and/or confidence may be useless for certain targets: a way to just skip that part of the query, rather than finding a way to sanitize the results, which is not without its own problems.

That’s great, and probably accounts for the sound you get from AudioGuide, where things sound whole/complete (as opposed to granular/mosaicked).

I can’t think of how to do that in the FluCoMa context, as on its face it would seem to require a query per analysis frame or something like that. OR just dumping the whole time series into a machine learning algorithm and letting it “sort itself out”. Presumably the time-series-ness would be reflected in the matching, but perhaps not explicitly, as it would be treated as any other distance relationship, rather than as a hierarchical “container” for the rest of the querying to fall inside of.

As far as I understand it, the closest we have at the moment is having derivatives for any given value, which contains some kind of time varying information, though skewness/kurtosis can perhaps offer some idea as well. We don’t have vanilla linear regression (again, as far as I know).

At the moment, each statistic is an island. That is, you get seven stats (mean, standard deviation, skewness, kurtosis, and low/mid/high centiles), and then derivatives of these things. But each one is run on a single data stream (typically a descriptor of some type, but since it’s buffer-based it can happen on audio as well).

I suppose one could do this “manually”, but it would be quite tedious/messy since it would involve manually multiplying every sample in a buffer by a value, since all(ish) data types are buffers.

Did you abandon this approach due to complexity, or because of the limited usability? (i.e. only mel-based descriptors)

I’ve been working on some real-time spectral compensation (e.g. using the mel-band-based spectral shape of the target to apply a corresponding filter to the match, to make the two sound more alike), so an approach like this might make sense since I’m already doing mel-band analysis of both the source and target anyways.

Presumably what follows below about the specifics of how to subtract and find remainders (based on loudness) would be the same when doing it per mel-band?

This makes more sense… And I understand what you mentioned above about weighting descriptors (in general) against their linear amplitude.

Aaand the devil is in the details. So taking the means of spectral descriptors wouldn’t play so nice with this approach.

Thankfully, for my most general use case I’m dealing with tiny analysis windows (256 samples with @fftsettings 256 64 512), so the amount of smearing across so few frames is probably much less than what would happen across a file or segment that’s 1000ms+.

Either way, tons to think about, both in terms of things to test and apply, as well as some wish-list-y stuff for the FluCoMa tools.

@b.hackbarth thanks for that brilliantly detailed reply, and for raising this point. No, we don’t do weighting yet, but should. The key will be coming up with a sensible interface, and producing proper weighted versions of the other measures (standard deviation, centiles + median)


Hi Owen,

Right. I agree that it is complex, as different descriptors likely need different averaging routines running under the hood. For instance, I don’t think it makes sense to average F0 or peaks, since then you’ll likely end up with pitches that aren’t actually found in the original data. I believe that audioguide takes the median for f0.

In any case, averaging something like MFCCs by linear amplitude yields much better results than a simple average of the descriptor values in my experience, especially for finding ‘exact’ timbre matches. Not sure what this means for std and centiles. In audioguide I have a simple user-editable dict where one can change the default averaging routine for different descriptors. By default, pretty much everything is power weighted.
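As an illustration of that kind of table (the keys, routine names, and structure here are made up for the example, not AudioGuide’s actual config format):

```python
import numpy as np

def power_weighted_mean(values, powers):
    return np.average(values, weights=powers)

# hypothetical per-descriptor summary table
AVERAGING = {
    "centroid": power_weighted_mean,
    "mfccs": power_weighted_mean,
    "f0": lambda values, powers: np.median(values),  # median avoids inventing pitches
}

frames = np.array([100.0, 100.0, 400.0])
powers = np.array([1.0, 1.0, 0.0])
print(AVERAGING["centroid"](frames, powers))  # 100.0 (the silent frame is ignored)
print(AVERAGING["f0"](frames, powers))        # 100.0
```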


you could also take the pitch value with the highest pitch confidence… so many things to try!

I’ll put my head together with @tremblap and @groma. The idiom so far has been to try and allow maximum leeway to do whacky things, hence bufstats delivering a range of statistics to experiment with, rather than curating per-descriptor. That said, it’s great to build up a knowledge base of what typically works well for different things.

In the future, I’m hoping we find and experiment with some other approaches to summarizing morphologies as well. Meanwhile, perhaps adding a weighting option to bufstats would be viable.


Can you explain averaging linear amplitude for MFCCs a bit more? I’ve used them in the past to compare timbre with success but I’m curious what your approach is here and how I might learn from it.

Sure. Take the following example: you’ve got a sound segment 4 frames long. Below are this sound’s centroid and amplitude values for each frame. Like most acoustic sounds, as the sound gets softer, the centroid gets higher:

centroid = [500, 600, 1000, 2000]
power = [0.1, 0.05, 0.001, 0.0001]

We want to average the centroid values to get a single number for similarity measurements (or whatever). Averaging the centroids gives us 1025. However, should the frame that is only 0.0001 loud account for 25% of the average? What I am proposing is that whenever we average spectral descriptors, we should weight by amplitude such that a frame which is twice as loud is twice as important in the average. If we average the centroids above using the linear amplitude as weights, the averaged centroid is 537.

Same thing works (and makes sense, at least to me) for MFCCs, flatness, kurtosis, etc.
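In numpy terms, restating the arithmetic above:

```python
import numpy as np

centroid = np.array([500.0, 600.0, 1000.0, 2000.0])
power = np.array([0.1, 0.05, 0.001, 0.0001])

vanilla = centroid.mean()                       # plain mean: 1025.0
weighted = np.average(centroid, weights=power)  # amplitude-weighted: ~537.4
print(vanilla, round(weighted, 1))
```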

Oh right!

I think I initially misunderstood that. Or rather, interpreted it in another kind of useful context (weighting the centroid of an averaged set of frames by the (also averaged) linear amplitude).

That would be super useful for all sorts of things, across the board.

I guess this is possible now, but it would be a matter of manually iterating through samples in Max and then running the stats yourself (or re-peek~-ing the values into a buffer~ and then fluid.bufstats~-ing them after that).

This makes sense - I think like @rodrigo.constanzo I was caught up in thinking about it in an obtuse way rather than more directly. I think for MFCCs it could be problematic as MFCC values are meaningless and meant to be sturdy against changes in amplitude anyway. From a purely numerical point of view I’m not sure how attenuating those values when the measurement itself is already attempting to guard against that would influence how the matching works but I’m on board for that kind of thinking elsewhere. Have you had success with this particular example? I’m a little invested now!

In regards to what Rod said:

I think that this is where ML could be powerful in determining these types of relationships automatically. In the same way that NMF (in my experience) tends to capture things that move together, or in NMF speak have similar activations, it would be interesting to try and apply that kind of thinking to data which is musically significant. I suppose, although I’m still a noob on this front, this is exactly what some techniques for dimensionality reduction do when they are finding ways to wrap data around kernels or draw orthogonal lines.


I think the point is less to do with the invariance of (higher-quefrency) MFCCs to the energy of the frame, and more to do with the contribution they make to our perceptual impression of the whole. Consider that for, say, a percussive sound, you could have a lot of very low amplitude frames in the decay that are essentially mush, and don’t figure as much in our experience of that sound. So, by weighting the means, essentially dumb matching processes are more likely to pull out things that feel similar to our ears.

@tutschku’s Mfcc comparison thread demonstrates this, insofar as he’s getting audibly more satisfying matches from Orchidea (which uses a weighted mean) than from the vanilla mean via fluid.bufstats~.


Gotchya, I didn’t think about the process which happens after, where the frames that are super attenuated for being quiet would contribute very little to a distance metric compared to those that were still ‘doing something’ in their respective bands. I suppose a gate could be useful in that scenario too, or simply looking at the max rather than an average. Alex talked about this in his descriptor talk at the CCL some time ago, and it made a lot of sense in his example of a sample which had a long tail with mush at the end.

Hi James,

Perhaps using centroid was a bad example on my part. I didn’t mean to imply that centroid’s strong correlation with power is a reason to weight centroids by power. I agree with Owen that this is a perceptual question in the sense of getting averaged descriptor values to represent what we hear.

For a somewhat more direct example, consider the two sounds below. Both are made from the same recording of me saying the word ‘ash’. The only difference is that, for the first one, I turned the volume down for the ‘sh’, and for the second I turned the volume down for the ‘a’.

sound 1
sound 2

For almost all descriptor analysis routines, these two sounds will have virtually identical* descriptor values, which is disconcerting, since they sound quite different. If we do a “vanilla” average of MFCCs, for instance, we will get the same numbers out and the two sounds will be treated as timbrally equivalent.

Taking the descriptor values at the max amplitude frame is useful in some contexts I think, as would be gating. However, in this example, that would result in sound 1 being completely represented as ‘a’ while sound 2 is represented by ‘sh’. When weighting by linear amplitude, the numerical outcome is the closest to what we hear IMO: ‘a’ with a bit of ‘sh’, and then ‘sh’ with a bit of ‘a’ :slight_smile:

    * depending on whether or not you normalize FFT magnitudes before calculating spectral measurements, which most ppl do.
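A toy numerical version of the ‘ash’ example (made-up descriptor values, not the real analyses): two sounds with identical frames but opposite amplitude envelopes. A vanilla mean can’t tell them apart; an amplitude-weighted mean can.

```python
import numpy as np

# one descriptor value per frame: two 'a'-ish frames, two 'sh'-ish frames
frames = np.array([500.0, 500.0, 4000.0, 4000.0])
amp_1 = np.array([1.0, 1.0, 0.1, 0.1])  # sound 1: 'a' loud, 'sh' quiet
amp_2 = np.array([0.1, 0.1, 1.0, 1.0])  # sound 2: 'a' quiet, 'sh' loud

print(frames.mean())                      # identical for both sounds
print(np.average(frames, weights=amp_1))  # pulled towards the 'a' frames
print(np.average(frames, weights=amp_2))  # pulled towards the 'sh' frames
```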

Cool. I look forward to hearing about what you come up with and what ends up being useful!


This is a great example, thanks :slight_smile: I think the deliberation over so many scenarios in this thread with different answers and strategies shows how hard the issue is. I’ll need to investigate this weighting thing more in my own applications. Seeing as I’m mostly in Python the friction for these kinds of processes is very low so I’m eager to apply them to what I’m working on now.


Agreed, these kinds of things require careful listening and exploration. Python makes it very easy!

numpy.average(mfccs, weights=powers) :sunglasses: