IQR-ing corpora

Revisiting this workflow now as I’m presently building some test/comparison patches.

At the moment the end results of this process will go into a fluid.kdtree~. In looking back at this there is an additional normalization step after the PCA-ing. Is this typically needed? I guess the output of PCA can be a bit “all over the place” (though it would be good to see the results of that numerically).

So far I’ve built a vanilla workflow that takes new incoming points and applies robust scaling, PCA, and normalization before being fed into knearest.

I’m going to try building similar versions with MLP instead, as well as one that does no PCA-ing at all (just using smaller initial dimensions, à la the SVM/PCA stuff).
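
For reference, here’s roughly what that vanilla workflow looks like sketched in Python/sklearn instead of the fluid.* objects (a minimal sketch: the array shapes, component count, and random data are placeholders, not the actual SP-Tools settings):

# Rough sketch of the robust scale -> PCA -> normalize -> knearest chain.
import numpy as np
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

corpus = np.random.randn(5000, 8)                   # stand-in for 8 descriptors per slice

robust = RobustScaler(quantile_range=(25.0, 75.0))  # robust scaling step (like fluid.robustscale~)
pca = PCA(n_components=4)                           # dimensionality reduction step
norm = MinMaxScaler()                               # final normalization step

reduced = norm.fit_transform(pca.fit_transform(robust.fit_transform(corpus)))
tree = NearestNeighbors(n_neighbors=1).fit(reduced) # nearest-neighbour lookup (like fluid.kdtree~)

# A new incoming point gets the *same* fits applied before the knearest query.
incoming = np.random.randn(1, 8)
query = norm.transform(pca.transform(robust.transform(incoming)))
dist, idx = tree.kneighbors(query)
print(idx[0][0], dist[0][0])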

No, it’s not typically needed. The output range of PCA is expressed in standard deviations, because PCA is about maximizing the variance captured by the early principal components. Ergo, it will tend to lie in a ±3 region, and this tendency will be stronger the more Gaussian the distribution of the feature in question: if the feature is Gaussian then 99.7% of the data will be in that zone, and 68% of it will be within ±1.
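
To see that spread numerically (as wished above), here’s a tiny sketch with synthetic data standing in for real descriptors; the actual spread will obviously depend on how Gaussian your features are:

# Look at the spread of PCA outputs per component: std, min, max.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = np.random.randn(10000, 8) * [1, 2, 5, 0.5, 3, 1, 4, 2]   # uneven feature scales
scores = PCA(n_components=4).fit_transform(StandardScaler().fit_transform(data))

for i, col in enumerate(scores.T):
    print(f"PC{i + 1}: std {col.std():.2f}, min {col.min():.2f}, max {col.max():.2f}")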

Pardon me if this is super common knowledge, but I just came to the (glaringly obvious) realization that every corpus comes with its own set of fits for scaling/reduction/etc… (as you need a dataset in place to compute many of those things). And I guess the way these spaces overlap (or not) has to do with whether you use the existing fits for your incoming audio.

Although presumably you always need to use the same exact PCA fit, as otherwise you’ll be comparing gibberish numbers to (unrelated) gibberish numbers.

Super obvious, I know, but it hadn’t really clicked for me. So pardon me “posting out loud”.

Bit of a bump here as I’ve been revamping the concat stuff in SP-Tools (with great success!) and am thinking of implementing some kind of normalization/standardization for disparate descriptor spaces.

The way I handle this elsewhere in SP-Tools is to store multiple versions of each dataset (normalized/robustscaled/etc…) along with their respective fits such that depending on what the user selects, the appropriate dataset is loaded and the fits are applied to the incoming descriptors.

For the way I’m approaching real-time concatenation I have significantly larger datasets (500k+) as I’ve gone with a silly overlap of 8x (I may revisit this actually). But this essentially makes saving multiple versions of the dataset a bit unwieldy (though not entirely out of the question).

What I’m wondering is if I can bypass the need for that if I’m planning on using IQR, by just storing the IQR of the corpus (computed at the time of analysis), then creating a training/snapshot of the incoming descriptor space, and using the IQR of that to scale it to the IQR of the corpus. Such that the incoming/realtime audio descriptors would be stretched to fit the offline/corpus descriptors, rather than scaling/normalizing both sets.

So taking a tinkertoy example of scaling pitch: say my corpus pitch has a mean of 60 with the IQR falling at 40 and 80, and my incoming pitch is centered around 80 with a much wider range of 40 to 120. So something like this:

corpus: 40 ← 60 → 80
input: 40 ← 80 → 120

So would it be as simple as just scaling/offsetting the input to fit the corpus distribution, then feeding that into a KDTree of the raw corpus descriptor range?
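
On the maths side, a minimal sketch of just that 1D pitch case (with the toy numbers above): stretch/offset the input distribution onto the corpus one, then query the kdtree that holds the raw corpus descriptors.

# Map input pitch (centred on 80, IQR 40..120) onto the corpus (centred on 60, IQR 40..80).
def map_to_corpus(x, in_centre=80.0, in_iqr=(40.0, 120.0),
                  corpus_centre=60.0, corpus_iqr=(40.0, 80.0)):
    in_width = in_iqr[1] - in_iqr[0]              # 80
    corpus_width = corpus_iqr[1] - corpus_iqr[0]  # 40
    return (x - in_centre) / in_width * corpus_width + corpus_centre

for pitch in (40.0, 80.0, 120.0):
    print(pitch, "->", map_to_corpus(pitch))      # 40 -> 40.0, 80 -> 60.0, 120 -> 80.0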

I’ve heard - it sounds amazing! I will reply to all your questions later today or tomorrow; I’m about 3 weeks behind on significant replies here.

In short for now: it depends. Imagine that a subset of your dataset (10%) is all in the extreme; it would be dismissed completely. But I reckon you read and tried my other thread on this matter, and in my current piece I do map 10-90 to 10-90, and tried with that and with standardization and with PCA. I didn’t try to eliminate components, and used a kdtree with my mere 2300 points - carefully curated synth noises though :)

I know this is not helping much, but I’ll reply more intelligently later.

There are obviously loads of caveats, but I was wondering more about the maths side of things here, since whenever I’ve done this in the past I’ve always scaled both spaces and compared them that way, whereas in this case I’d like to scale just one space (incoming descriptors) onto the other (corpus descriptors). So just taking the mean and std of each, scaling by the ratio of the stds, offsetting by the new mean, and going from there.

This is the classic option then, where scaling doesn’t change much - you take the universe of one and hope the other one will have stuff in that zone. I tried to explain that clearly here:

In other words: if your corpus has stuff between -100 and 100 and you scale it to 0-1, then you enter a target that wiggles between -200 and -100, it will scale to -0.5 to 0 and you will get no match other than 0 as the nearest neighbour. Does that make sense?
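
A quick numerical version of that example (plain numpy, just to make the collapse visible): everything in the out-of-range target ends up nearest to the corpus minimum.

import numpy as np

corpus = np.linspace(-100, 100, 201)
lo, hi = corpus.min(), corpus.max()
scaled_corpus = (corpus - lo) / (hi - lo)        # corpus normalized to 0..1

target = np.array([-200.0, -150.0, -100.0])
scaled_target = (target - lo) / (hi - lo)        # same fit -> -0.5, -0.25, 0.0
print(scaled_target)

# Nearest neighbour is the -100 corpus point every single time.
nearest = [corpus[np.abs(scaled_corpus - t).argmin()] for t in scaled_target]
print(nearest)                                   # [-100.0, -100.0, -100.0]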

Hence this:

In other words: if your corpus is not filling the total world (in which case it would be boringly reproducing the input), you will have to decide how you want to distort the matching of the 2 spaces, and there are no simple answers - it is a creative endeavour.

I’ll have to test all of that, but I guess I meant the question in a much more boring way: whether knowing the mean/std of an input space and an offline space can be used to transform just the input space onto the offline space (with robust scaling specifically). Like, mathematically/fluid.robustscale~-wise. It may very well produce bad/worse results and all that, but I was just curious whether I needed to compute/scale both spaces in order to map one onto the other.

Ok, had some time today to analyze/scale all the bits. So, to ask more concretely:

Here is the fluid.robustscale~ fit for the corpus I’m working/testing with:

{
  "cols": 8,
  "data_high": [
    -20.102773666381836,
    0.284235775470734,
    74.88372039794922,
    0.619443297386169,
    -28.400463104248047,
    0.693569839000702,
    73.83578491210938,
    0.819302082061768
  ],
  "data_low": [
    -41.67296600341797,
    -0.263175576925278,
    61.05841827392578,
    -0.63897043466568,
    -67.24171447753906,
    -0.872315585613251,
    60.706329345703125,
    0.185682728886604
  ],
  "high": 75.0,
  "low": 25.0,
  "median": [
    -31.107669830322266,
    -0.001829719520174,
    68.79005432128906,
    -0.005249743815511,
    -54.21134567260742,
    -0.088609740138054,
    65.84819030761719,
    0.32907697558403
  ],
  "range": [
    21.570192337036133,
    0.547411352396012,
    13.825302124023438,
    1.258413732051849,
    38.841251373291016,
    1.565885424613953,
    13.12945556640625,
    0.633619353175164
  ]
}

And here’s my input(/testing) corpus (jongly.aif analyzed as a single pass using the same settings/hop/etc…):

{
  "cols": 8,
  "data_high": [
    -18.309932708740234,
    0.7461849451065063,
    92.47969818115234,
    0.6848874688148499,
    -22.852352142333984,
    0.35558420419692993,
    116.43193817138672,
    0.2502046227455139
  ],
  "data_low": [
    -29.621623992919922,
    -0.7721195220947266,
    47.43507385253906,
    -1.7414435148239136,
    -41.677947998046875,
    -0.8485021591186523,
    70.7255630493164,
    0.045023296028375626
  ],
  "high": 75.0,
  "low": 25.0,
  "median": [
    -24.74309539794922,
    -0.1418006867170334,
    63.80019760131836,
    -0.4384874403476715,
    -32.07149887084961,
    -0.2410660982131958,
    83.63229370117188,
    0.20762591063976288
  ],
  "range": [
    11.311691284179688,
    1.518304467201233,
    45.04462432861328,
    2.4263309836387634,
    18.82559585571289,
    1.2040863633155823,
    45.70637512207031,
    0.2051813267171383
  ]
}

So in this case, would/could I create a “synthetic” fit by subtracting the input from the corpus, and offsetting the numbers in the input one by that difference?

So the first column of “high” in each is dB. In the corpus it’s -20.102773666381836, and the input is -18.309932708740234, so subtracting them would give me -1.792841, which would then create a new/synthetic “high” of -16.517092.

And I would repeat this for every value across the fits (except the "high": 75 and "low": 25 bit)? Then load that into a fluid.robustscale~ and transformpoint jongly.aif into it to “stretch” jongly to the corpus space?

Actually, looking at the numbers more closely, the sections look partly redundant (the range values are just data_high minus data_low), so it would presumably just be a matter of creating a synthetic version of the median/range numbers and then deriving the corresponding data_high/data_low values for the rest of the fit (assuming that fluid.robustscale~ expects all the numbers).

So in the example above my corpus median/ranges are:

corpus:   -31.107669830322266, 21.570192337036133
jongly: -24.74309539794922, 11.311691284179688

(typing out loud here)

I need to scale the input from jongly to be quieter, and to cover a wider range than it currently does. I’m not sure if this means I need to take the difference and compensate for that (by creating a synthetic median that is ~6dB higher than what jongly currently has?). So creating something like:

jongly: -18.37852096557617, 1.05319023132325

Aaaaand did some practical testing.

I forgot I need to somehow involve inversetransformpoint into the mix, as what I want is a space that is the scale of the corpus, and not a space that is the scale of the transformed corpus.

So is the solution to this something as easy as this then?:

edit:

I don’t think it is…

It’s possible the distribution of my test corpus is wonky, but putting the above in my descriptor analysis pipeline gives me significantly worse matching. And if I swap the fits (corpus on left, jongly on right) I get nearly identical matching (or perhaps identical, it’s hard to tell).

I’m re-reading this over and over, trying to understand what you are trying to do, so I can help.

If you have the mean/stddev of an input via fluid.standardize, your input space now has mean = 0 and stddev = 1.

If you run the same process on a target it will have the same values. You can put that in a kdtree and query it with the standardized input and they will be ‘overlapping’, with all the deficiencies of aligning 2 unrelated shapes by zooming them (the map of France and an M10 bolt can be aligned; they are both vaguely hexagonal).

==
If you are trying to do outlier rejection on one or more of them, the same principle applies with robustscaling.

Does this help?

Or if you want to actually remove the outliers, that is relatively easy with datasetquery post robustscaling, since you will know that ±1 will be ±IQR (if you set your values to the 25 and 75 centiles, that is).
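
As a rough sketch of that outlier-removal idea (numpy/sklearn standing in for the fluid.* objects, with synthetic data): after robust scaling on the 25/75 centiles, keeping only the rows that sit within ±1 on every column is the same as keeping everything within one IQR of the median.

import numpy as np
from sklearn.preprocessing import RobustScaler

data = np.random.randn(1000, 8)                  # stand-in descriptors
scaled = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(data)

keep = np.all(np.abs(scaled) <= 1.0, axis=1)     # within one IQR of the median on every column
trimmed = data[keep]
print(len(data), "->", len(trimmed))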

That’s basically what I’m trying to do (with all “actually useful or not” caveats) but without plugging the transformed version into a kdtree. I’d like to keep the kdtree in the original scale because it will be impractical to save my corpus in more than one state (500k+ points in some cases).

So I want to scale the input data (via robust scaling, for outlier rejection reasons) to the corpus data (relative to its own robust scaling), but keep the result in the scale of the original descriptors (e.g. dB, MIDI pitch, etc…).

Conceptually, I think this is what I’m trying to do. But this doesn’t actually work well. So that could be a function of me implementing something incorrectly or my input/corpus not overlapping in a useful way such that it actually sounds worse when doing this.

To do that, your solution is good: you transformPoint the input point into the ‘middle ground’, then inverseTransformPoint with the saved state of the target’s scaler. It will give you something, but definitely not dB - distorted dB - because the same problem of matching the map of France to an M10 bolt is there.
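
A minimal sketch of that transformPoint -> inverseTransformPoint chain, with sklearn’s RobustScaler standing in for fluid.robustscale~ and made-up numbers standing in for the real corpus/input: one fit on the corpus, one on a snapshot of the input, and live points go through the input fit forwards and the corpus fit backwards, which leaves them in the corpus’s own (distorted) units, so the kdtree can stay in the raw corpus scale.

import numpy as np
from sklearn.preprocessing import RobustScaler

corpus = np.random.randn(5000, 8) * 10 - 30         # stand-in corpus descriptors
input_snapshot = np.random.randn(2000, 8) * 5 - 20  # stand-in snapshot of the incoming space

corpus_fit = RobustScaler(quantile_range=(25.0, 75.0)).fit(corpus)
input_fit = RobustScaler(quantile_range=(25.0, 75.0)).fit(input_snapshot)

def to_corpus_scale(point):
    # transformPoint with the input fit, then inverseTransformPoint with the corpus fit
    middle = input_fit.transform(point.reshape(1, -1))
    return corpus_fit.inverse_transform(middle)[0]

live_point = np.array([-18.0, 0.1, 70.0, -0.2, -30.0, 0.3, 80.0, 0.2])
print(to_corpus_scale(live_point))                  # now in the corpus's descriptor range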

Ok, if that’s the case then it’s just a matter of the scaling (or robust scaling at least) just not working well with this corpus/input.

I’ll try it with another corpus to see if that’s also the case, and if so, just give it a miss for now.

This is something I built into C-C-Combine but didn’t find it useful across the board. In fact going back to watch the first plenary presentation, the normalization actually sounds bad when I do it during the demo.

That can be due to what I explained above - your exceptions might throw out the overlapping space. Think of a 1D example: throwing out the 10% edges of a flute’s pitch range would just offset the problem of throwing out the 10% edges of a bass’s pitch range… in fact you would just be further apart…

I think you will find out that our datasets are too small, too strange and too peculiar to behave properly. This is where Fiebrink rocks: we’re not doing convergent large data science, but divergent small beautiful data art :)
