
Hello Rodrigo,

Yes, I think that normalization needs to be parametrizable per dimension. In my experience, some descriptors work better with min/max normalization routines while others lend themselves better to something like mean/std. Regardless, my $0.02: the most important choice is whether to standardize the corpus and target’s descriptors together or separately.
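Just to illustrate what I mean, here is a numpy-ish sketch (not actual audioguide code; names are made up). In practice you’d also want to pick the method per dimension, but this shows the min/max vs mean/std choice and the together-vs-separately choice:

```python
import numpy as np

def normalize(corpus, target, method="minmax", together=True):
    # corpus/target: 2D arrays, rows = segments, columns = descriptor dimensions.
    # 'together' decides whether the scaling stats come from corpus + target
    # jointly, or from the corpus alone.
    ref = np.vstack([corpus, target]) if together else corpus
    if method == "minmax":
        lo, hi = ref.min(axis=0), ref.max(axis=0)
        scale = np.where(hi > lo, hi - lo, 1.0)
        return (corpus - lo) / scale, (target - lo) / scale
    if method == "meanstd":
        mu, sd = ref.mean(axis=0), ref.std(axis=0)
        sd = np.where(sd > 0.0, sd, 1.0)
        return (corpus - mu) / sd, (target - mu) / sd
    raise ValueError("unknown method: " + method)
```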

There is some text discussing normalization in the audioguide docs here.

My experience has been that matching 50 different dimensions of descriptors gives pretty bland results. However, I think this has to do with the nature of the “gap” between the sound world of the target and corpus. The more similar the corpus and target sound worlds (and/or the more comprehensive/variable the corpus), the better large dimensional searches should work.

I have not (yet) tried matching in descriptor spaces which have been scaled. It is an interesting idea.

I think that, for real-time purposes, the hierarchical matching structure would be most useful, as you note, since you can first “prune” the size of the search pool based on lower-cost descriptor comparisons (power, duration, etc).
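Conceptually, something like this (illustrative names only, and assuming the descriptors are already normalized):

```python
def prune_pool(corpus_segs, target_seg, max_pool=200):
    # cheap first pass on single-number descriptors; the expensive
    # (e.g. frame-wise) comparison then only runs on the survivors
    cheap = lambda c: (abs(c["duration"] - target_seg["duration"])
                       + abs(c["power"] - target_seg["power"]))
    return sorted(corpus_segs, key=cheap)[:max_pool]
```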

The thing that I think I like best about this approach is that it feels creatively purposeful. Rather than asking for the best match on 40 dimensions, which tends to be impenetrable to the user (ditto for dimensional scaling), you dictate what you want and the order that you want those measurements to be considered. In my work I’ve found that there is no gold standard for measuring similarity, only what you’re interested in.

Yes, this is one scenario that I happen to use a lot. There are lots of other interesting possibilities for hierarchical search functions. For instance, if a target seg’s noisiness is greater than 0.5, calculate similarity with descriptorN, otherwise use descriptorM.
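In code that can be as simple as a branch inside the search function. The descriptor names and the 0.5 threshold below are just placeholders for whatever the user dictates:

```python
def segment_distance(target_seg, corpus_seg):
    # noisy targets get compared on one descriptor, pitched ones on another
    if target_seg["noisiness"] > 0.5:
        return abs(target_seg["flatness"] - corpus_seg["flatness"])
    return abs(target_seg["centroid"] - corpus_seg["centroid"])
```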

You’re correct - audioguide lets you match sounds using time-varying descriptor differences. And I do think that this is key to capturing morphological shape (alongside layering, which I discuss below). In the program, one has control over this on a descriptor-by-descriptor basis: asking for d(‘centroid’) matches time-varying centroids; d(‘centroid-seg’) matches based on power-weighted averaged centroids; d(‘centroid-delta’) matches the first-order difference of time-varying centroids; d(‘centroid-delta-seg’)… well, you get the idea. It is possible to match target segments based on different descriptor modalities: one could match, for instance, time-varying MFCCs, averaged centroid, and the linear regression of amplitude.

The most important thing with averaging spectral descriptors is to weight averages with linear amplitude. Are you guys doing this in fluid.bufstats~? If not, it should certainly be an option, if not the default.
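By that I mean something like this (just a sketch, not the fluid.bufstats~ internals):

```python
import numpy as np

def power_weighted_mean(descriptor_frames, power_frames):
    # frames with more energy count for more; a plain mean lets
    # near-silent frames drag the average around
    d = np.asarray(descriptor_frames, dtype=float)
    w = np.asarray(power_frames, dtype=float)
    return float(np.sum(d * w) / np.sum(w))
```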

Of course, it is possible to represent time varying descriptor characteristics in other ways (fixed-length arrays, differences, linear regressions, etc) which can help circumvent the need for frame-by-frame calculations. I personally like how frame-wise matching sounds.
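For example, one cheap way to compress a descriptor’s shape into a few numbers (again just a sketch, not audioguide code):

```python
import numpy as np

def compact_shape(frames, n=10):
    # resample a descriptor time series to a fixed length, and also return
    # the slope of a linear fit -- two stand-ins for frame-by-frame data
    x = np.linspace(0.0, 1.0, num=len(frames))
    resampled = np.interp(np.linspace(0.0, 1.0, num=n), x, frames)
    slope = np.polyfit(x, frames, 1)[0]
    return resampled, slope
```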

Yes, I think this method quickly approaches the limits of realtime, depending on the size of the corpus. But all of this will be moot in 10 years, when even the cheapest laptop will be able to churn out an excellent baguette.

Yes, you’re right, what audioguide does for layering is really quite simple compared to something like orchidée. It is a looped brute-force approach (I’ve sketched it in code after the list). For each target segment:

1.) The best sound is selected.

2.) The time-varying amplitude of the selected sound is subtracted from the target segment’s amplitude.

3.) The onset detection algorithm is then rerun on the subtracted target’s amplitude. Another onset may be triggered at the same time or later in the target segment, depending on the strength of the residual amplitude. This permits sounds to be selected at different moments within a target segment.

4.) If another onset is found, a second sound is selected. This is done by comparing the target segment’s descriptors to all other corpus sound descriptors which have been algorithmically mixed with the descriptors of corpus sounds that have already been selected to fit the target segment in question. This is done frame by frame. So, if corpus segment A is selected to match a target segment, the next selection is made by comparing the target’s descriptors to a mixture of A + every other valid corpus sound. For each additional selection, the mix gets larger, e.g. selection three = A + B + every other valid corpus sound, etc.

5.) This process repeats as long as the target’s subtracted amplitude continues to trigger onsets (or until the user’s manual density restrictions are reached).
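Roughly, in code, it looks like this. This is a big simplification of what audioguide actually does (only one descriptor, and the onset-detection rerun on the residual reduced to a threshold test), so treat the names as illustrative:

```python
import numpy as np

def mix_centroid(segs):
    # amplitude-weighted, frame-wise mixture of the layered sounds' centroids
    powers = np.array([s["power"] for s in segs])      # shape (n_sounds, n_frames)
    cents = np.array([s["centroid"] for s in segs])
    return (cents * powers).sum(axis=0) / np.maximum(powers.sum(axis=0), 1e-12)

def layer_segment(target, corpus, onset_thresh=0.01, max_layers=8):
    # target and corpus segments are dicts of equal-length frame-wise
    # 'power' and 'centroid' arrays
    selected = []
    residual = np.asarray(target["power"], dtype=float)          # time-varying amplitude
    while len(selected) < max_layers and residual.max() > onset_thresh:  # steps 3/5
        best, best_cost = None, np.inf
        for cand in corpus:                                              # steps 1/4
            mix = mix_centroid(selected + [cand])   # already-selected sounds + candidate
            cost = float(np.abs(mix - target["centroid"]).sum())         # frame by frame
            if cost < best_cost:
                best, best_cost = cand, cost
        if best is None:
            break
        selected.append(best)
        residual = np.clip(residual - best["power"], 0.0, None)          # step 2
    return selected
```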

Way back when, I was originally doing this in a more computationally intense way with the mel spectrum. When the first segment was selected, its mel amplitudes were subtracted from the target’s amplitudes and the target’s descriptors were recalculated on the residual mel spectrum. This only worked for mel-based descriptors like mel centroid, mel flatness, mel-FCCs, etc. You could also do this on FFT magnitudes, but that would be crazy.

Almost. I don’t think the log/lin domain of descriptors matters for mixtures, but you need to weight the average of the different sounds according to their respective linear amplitudes. So,

sound 1, frame 1: centroid = 1000, power = 0.01

sound 2, frame 1: centroid = 2000, power = 0.02

mixture, frame 1: centroid ≈ 1666.67, power = 0.03
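That is, (1000 × 0.01 + 2000 × 0.02) / (0.01 + 0.02) ≈ 1666.67: the centroids are averaged using the linear powers as weights, and the powers themselves are simply summed.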

This algorithm comes from Damien Tardieu’s PhD thesis. IIRC, Tardieu found that this approach was 95% accurate for spectral centroid, and should work well for all spectral features.

My intuition tells me that this works best for approximating time varying descriptor mixtures, and will not work as well for sounds where descriptors have already been averaged into a single number. Of course, you could do this first on the time series, then average the result in a second step (which is what AG does internally for averaged descriptor mixtures).

Audioguide does this automatically when layering sounds for most descriptors, except those that are not “mixable” (f0) or not spectral. For power, I think it just adds the numbers (hence the 0.03 value, above), which is quite dubious if you have a corpus of detuned sine waves. For zero crossings, it takes the max.
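As a sketch of those rules (not the actual implementation, and only a couple of example descriptors):

```python
def mix_frame(frames):
    # one frame of the layered mixture, built from the corresponding frames
    # of the individual sounds
    total_power = sum(f["power"] for f in frames)
    w = max(total_power, 1e-12)  # avoid division by zero on silent frames
    return {
        "power": total_power,                                                # summed
        "centroid": sum(f["centroid"] * f["power"] for f in frames) / w,     # power-weighted average
        "zero_crossings": max(f["zero_crossings"] for f in frames),          # take the max
        "f0": None,                                                          # not mixable
    }
```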

Best,
Ben
