Regression + Classification = Regressification?

This is a potentially weird “thinking out loud” thing, but is there a way to have a regression for a classifier? I’ll explain what I mean in a musical way.

I have a corpus, which I can analyze however I want offline. At the moment I’m analyzing three windows: 0-256, 0-4410, and 0-end (in samples).

I have real-time audio, which I do onset detection on and analyze 256 samples from that point (after waiting 256 samples of real-world time).

For my “apples to orange shaped apples” matching, I’m using that small window and querying any of the three analysis windows from the offline stuff. Not great, but it kind of works, and is about the most I can do without being able to tell the future…

…which led me to wonder about that part.

Given that with the offline files, I have multiple time segments (let’s call them A, B, C), with a bunch of descriptors and stats for each, I presume there is some kind of way to make a correlation between what kind of B might follow, given any particular A. So as opposed to feeding the system an A and getting back the nearest A, I would ask for what the nearest predicted B would be (regression?).
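Something like what I’m imagining, sketched in Python/scikit-learn rather than Max (all the data, shapes, and names here are invented, just to show the shape of the idea):

```python
# Hypothetical sketch: learn a mapping from A-segment stats to the B-segment stats
# that followed them in the offline corpus, then ask for a "predicted B" for a new A.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
A_feats = rng.normal(size=(200, 12))   # stats for each 0-256 sample segment (A)
B_feats = rng.normal(size=(200, 12))   # stats for the segment that followed it (B)

predict_B = KNeighborsRegressor(n_neighbors=3).fit(A_feats, B_feats)

new_A = rng.normal(size=(1, 12))        # a freshly analyzed A
estimated_B = predict_B.predict(new_A)  # "what would the B to this A look like?"
```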

NOW

What if I could also do the same for the real-time audio. Given my limitations for real-time/fast-ness etc… I don’t have a lot of time to wait around. I was originally doing 512 samples, but have cut that in half. At any stretch, I certainly don’t have 4410 samples, or 30s etc…

So, could I conceivably create a bunch of training data of the types of hits I might make with my snare/objects/etc…, train it in multiple time frames (probably only up to 100ms for “reasons” (more below)), and then do a similar process where I analyze 256 samples, have some kind of regression/prediction of what the B to that A would be, and then query based on both time series (à la @tremblap’s LPT patch)?

///////////////////////////////////////////////////////////////////////////////////////////////////

From my rudimentary understanding of ML stuff, this is kind of mixing regression and classification yes?

Would something like this work (quickly)?
As in, I would have no idea how to even begin a process like this.

There are some potentially big snags with this idea, in that for the real-time audio, even though I can train stuff on A -> B -> C sequences, in a performance context, those may get overlapped and mixed up. I suppose the training data can allow for something like that where I could give it “clean” examples, and then some more performance-y examples, and have it try to make sense of that, but there could be a limit to how well it could predict overlapping input from isolated training data.


Sounds like you want something like a Markov chain that can estimate the probability, based on the training data, that a particular ThingB follows a particular ThingA. Max’s [prob] object provides a simple and old-school interface for this. Or there are fancier things in the ml.* and MuBu packages.
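Very roughly, the Markov idea is just a transition table built from counts; a toy sketch in Python (the class names are invented):

```python
# Count which class tends to follow which in the training data,
# then turn the counts into transition probabilities.
from collections import Counter, defaultdict

training_sequence = ["rim", "centre", "centre", "rim", "edge", "centre", "rim"]

transitions = defaultdict(Counter)
for a, b in zip(training_sequence, training_sequence[1:]):
    transitions[a][b] += 1

def next_probs(a):
    total = sum(transitions[a].values())
    return {b: n / total for b, n in transitions[a].items()}

print(next_probs("rim"))  # e.g. {'centre': 0.5, 'edge': 0.5}
```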

For the real-time matching, it sounds also like a probabilistic approach to temporal modelling might suit you, insofar as you want an immediate estimation of what kind of A you might have started. The mubu.hhmm object does this, I think. You’d train it along the same lines as the classifier you’re using, except that because it’s all about modelling temporal relationships, you probably wouldn’t use summary statistics. Rather, your training inputs would be sequences of feature data mapped to classes. Then, at play time, it would output a (changing) likelihood estimate for each class.
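A very rough analogue of that, sketched with Python’s hmmlearn package rather than mubu.hhmm (all the data is invented, and no claim that this is how MuBu does it internally):

```python
# One HMM per class, each trained on sequences of per-frame features (not summary stats).
# At play time, score the frames seen so far and get an evolving likelihood per class.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(1)

models = {}
for label in ["rim", "centre"]:
    frames = rng.normal(size=(70, 4))   # 10 training sequences of 7 frames x 4 features
    lengths = [7] * 10                  # sequence boundaries within the stacked array
    models[label] = GaussianHMM(n_components=3, covariance_type="diag").fit(frames, lengths)

partial = rng.normal(size=(3, 4))       # only 3 frames heard so far
likelihoods = {label: m.score(partial) for label, m in models.items()}
print(max(likelihoods, key=likelihoods.get))
```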


I thought about that after posting: rather than waiting 256 samples and then having a summary of 7 frames, I could probably feed it a frame at a time (via a parallel real-time fluid.descriptors~ thing(?)) and then hope for the best.

I could see something like this being tricky in that I’d have to make a decision as to when to “take” the prediction, which would ideally line up nicely with the querying process so that I can query with A + B.

I was also wondering about the temporal aspect of it. Since I only have 7 frames to begin with, and since I’m ignoring the outer ones for fluid.bufspectralshape~ anyways, perhaps the derivative (c/w)ould encapsulate “enough” information about what was happening in those few frames.

So if I have loudness, centroid, flatness, rolloff, each with a derivative (and std for good measure), could I take those 12 static values and then run the classification on that?
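i.e. something like this, as a Python sketch of that 12-value summary feeding a nearest-neighbour classifier (the per-frame data is invented):

```python
# loudness, centroid, flatness, rolloff -> mean, derivative, deviation each = 12 numbers
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def summarise(frames):                 # frames: (n_frames, 4) of raw per-frame descriptors
    return np.concatenate([
        frames.mean(axis=0),
        np.diff(frames, axis=0).mean(axis=0),
        frames.std(axis=0),
    ])

rng = np.random.default_rng(2)
X = np.stack([summarise(rng.normal(size=(7, 4))) for _ in range(50)])
y = rng.integers(0, 5, size=50)        # made-up class labels

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict(summarise(rng.normal(size=(7, 4))).reshape(1, -1)))
```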

//////////////////////////////////////////////////////////////////////////////////////////////////////////////

So if I understand this part right, I would train loads of discrete A attacks by manually tagging them, then ask for the best match as I normally do now, but then use the known B section from the matched A to feed further down the chain?

Or more specifically, the classifier itself would have no idea about B, it would just tell me the nearest A and then I go off and pull up the corresponding data?

//////////////////////////////////////////////////////////////////////////////////////////////////////////////
I guess with this it would be better/sexier if it was more unsupervised(?), where I could just record a bunch of different attacks/hits, and not have to worry about individually training/tagging them, and then ask for matching based on that input.

These two things, I suppose, are not mutually exclusive in that I can just assign a random classification to each attack, and feed it a load of attacks, so each class would have a single training point, but that seems wrong.

Both approaches (or at least my likely interpretations of what you said) would be limited to a finite training set. That is, I’d have to play loads of different attacks and loads of different dynamics, and the precision of the system would be limited to the proximity of an incoming hit to a training point, which is why I was initially thinking of some kind of regression(?) thing, where I could also extrapolate to spaces where there is no/sparse training data (e.g. a quieter version of an attack I do, or hitting a different drum/object altogether).

Right, I’d lost sight of the whole lightning speed front end aspect of this. Probably this evolving likelihood setup isn’t much use at that timescale, because it involves waiting :scream_cat:

But, hang on: there’s a 1:1 mapping between your A's and the B's that go with them, is there not? So, if all you want to do is retrieve the B more quickly, then once you have the A from KDTree, can you not use the same label (or a predictable variation on it) to retrieve the corresponding B directly from its DataSet? Or have I missed the point somewhere?
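i.e. something like this, with scipy’s KD-tree standing in for fluid.kdtree~ (a rough sketch, all data invented):

```python
# Two datasets sharing the same labels: nearest A via the tree, then the matching B
# fetched directly by label, with no second search.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
labels = [f"hit-{i}" for i in range(100)]
A_data = {lab: rng.normal(size=12) for lab in labels}   # transient descriptors
B_data = {lab: rng.normal(size=12) for lab in labels}   # attack descriptors, same labels

tree = cKDTree(np.stack([A_data[lab] for lab in labels]))

incoming_A = rng.normal(size=12)
_, idx = tree.query(incoming_A)          # nearest A
predicted_B = B_data[labels[idx]]        # corresponding B, retrieved by its label
```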


You know me. It’s either the hyper-speed highway, or the highway!

The whole thing is a bit confusing to me as well, but I guess the more critical part of this would be the aspect where I’m dealing with variable incoming audio that corresponds with a known analysis frame 256 samples long (A), and a predicted subsequent analysis frame from samples 257-4410 (or whatever) (B).

My, potentially brutally naive, thinking is that I could create a set of training data with all sorts of arbitrary hits which I would ideally not be manually classifying in a “little bit of this, little bit of that” sort of way. Basically trying to create a corpus or training set that more-or-less covers the types of sounds I may use.

Then “train something” on that data in a way that I could then give it an analysis frame (A), which would be generated by a JIT real-time process, and it would hopefully give me what the likely B that would follow it is.

So there would be a 1:1(ish) relationship between those sounds where presumably there would be a best match for each one, regardless of whether or not I created sounds that were radically different.

(parallel to this, there would be pre-analyzed and corresponding static A and B frames in a corpus which the real A + predicted B would be used to query against)

Is that what you meant?

Ok, I’ll re-read that a few times, but I think I’m grasping the general gist. We might need better names than A and B. I’ll come back with thoughts in due course


Ok, here’s a bit of a visual example, and some clarification for terms.

For the sake of simplicity, I’ll refer to the corpus entries as source and the real-time input as target, and refer to the initial 0-256 samples (A) as a transient and the second bit from 257-4410 samples (B) as an attack.

So there will be loads of sources with known/finite transients and attacks, not terribly dissimilar to @tremblap’s LPT thing.

Something along the lines of this:

source transient:

source attack:

These will exist in a database as separate entities (along with another time scale for ‘entire file’, though that’s not relevant to this specific discussion) and will have corresponding descriptors/stats/etc…

The main bit of what I’m asking about has to do with the real-time side of things, how the target normally fits in.

So given the real-time limitations of my approach, I normally only have 256 samples to analyze before I need to ask for a query. So that means I’m limited to the transient of the target to work with.

So this is normally all I have:

So the idea, detailed in my previous post, is to have a separate training/matching process which would, hopefully, predict the attack of the target based on only the transient of the target.

So using the target transient:

To predict the target attack:

(based on some pre-computed training data)

So that in a real-time context, I can use the target transient plus the predicted target attack to query the corpus for a matching sample, by weighing those against the known source transients and source attacks.

Thinking about this some more, with my low-level, but ever increasing knowledge in this area.

Also, sorely disappointed that no one has commented on this gem:

///////////////////////////////////////////////////////////////////////////////////////////////////

The first thing that comes to mind is to try to create a training set that will be fed into fluid.knnclassifier~. Since the idea would be to capture a fairly solid representation of the range of sounds I can make with my snare (which, as wide as they are, are not infinite), I could give it a few hundred examples, making sure to cover different timbres and dynamics.

Since I don’t know what these sounds will be ahead of time, and I’m not terribly interested in creating a (supervised) taxonomy of these sounds, I was thinking of just assigning each individual attack a unique number which corresponds to its class. The data for each point would be something like 16-20 dimensions of (summary) descriptors/stats, and this would be trained only on the target transient, as defined above (i.e. the first 256 samples of a snare onset).

So something like this:

1, -35.584412 -1.791829 3.50294 123.374642 -284.869629 35.527923 -3.364674 0.773734 0.322499 132.203498 1097.339844 68.029001;
2, -33.469227 0.217978 2.396061 100.806667 1306.492188 71.573814 -11.920765 6.502602 1.149465 97.934497 527.226685 82.160345;
3, -23.90354 -0.69517 3.378333 116.119705 -735.833984 51.341982 -6.689898 -1.087609 0.252694 127.367311 -2514.203613 74.122596;
4, -28.433056 0.200565 3.09429 96.997334 614.809875 77.417631 -14.684086 4.913159 2.038212 93.661909 -87.01062 25.470938;
5, -18.193436 -0.639064 3.84877 116.842794 -309.310791 35.34252 -5.882843 0.371181 0.325729 121.526726 242.327148 42.387401;

Logically, it seems like I could then play in arbitrary snare hits, and it would tell me which individual hit it was nearest to, and with the unique number it returned, I could then pull up the corresponding target attack from a coll (or entrymatcher).

BUT

This seems like a bad idea. Having a classifier built out of hundreds of hits with only a single example of each. Not to mention that my “matching” resolution would be limited to the training set. So if I only have mezzoforte hits when I strike a crotale with the tip of my stick, it may return mf values if I strike it softly, skewing rather than improving the matching. (this could be mitigated if I just use MFCCs sans the 0th coefficient, since it wouldn’t so much reflect the loudness of any given training point).

This could also potentially work well in a vanilla entrymatcher setup, where there are discrete entries, and I’m trying to find the nearest “real” match.

The overarching approach would be the same in that context, matching the nearest, and then bringing up the “missing data” to fill in for the target attack, to then query again but with a complete set of data (target transient + target attack).

///////////////////////////////////////////////////////////////////////////////////////////////////

After @tremblap showed some of the features of the upcoming neural network stuff, I was thinking that a more vanilla regression approach may be interesting.

I’m definitely fuzzier on how this would work, so semi-thinking/typing out loud.

So I would create hundreds of training points where I use target transients as the inputs and give it target attacks as the outputs, with the hopes that giving it arbitrary target transients would then automagically give me predicted/corresponding target attacks.
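Roughly what I’m imagining, with scikit-learn’s MLPRegressor standing in for the upcoming neural network stuff (the data is invented, so this is purely to show the shape of the training):

```python
# Train a small network to map target transient descriptors to target attack descriptors,
# then ask it for a predicted attack given an unseen transient.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
transient_feats = rng.normal(size=(300, 12))   # inputs: target transients
attack_feats = rng.normal(size=(300, 12))      # outputs: the attacks that followed them

net = MLPRegressor(hidden_layer_sizes=(24,), max_iter=2000)
net.fit(transient_feats, attack_feats)

unseen_transient = rng.normal(size=(1, 12))
predicted_attack = net.predict(unseen_transient)   # hopefully a plausible target attack
```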

I guess since the algorithm is making crazy associations between its nodes, there’s nothing to say that the output would be a “valid” set of data (as opposed to some interpolated mishmash of numbers between the training points).

If it does work, it would be kind of nice, since the output could be more flexible and robust to unknown input sounds.

///////////////////////////////////////////////////////////////////////////////////////////////////

The last thing that occurred to me is that if I’m using “raw descriptors” (e.g. loudness, centroid, etc…) as my inputs and training data, I can apply some of the loudness (or even spectral) compensation methods discussed elsewhere here: if the nearest match for a target transient is, say, 5dB quieter than the incoming audio, compensate for that difference in both the target transient and target attack.
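Something like this, as a rough Python sketch (which columns count as “loudness-like” is an assumption for illustration):

```python
# Shift the matched transient and its attack by the loudness difference between
# the live input and the match, so the retrieved data sits at the right dynamic.
import numpy as np

def compensate(live_transient, matched_transient, matched_attack, loudness_cols=(0,)):
    cols = list(loudness_cols)
    offset = live_transient[cols] - matched_transient[cols]
    t, a = matched_transient.copy(), matched_attack.copy()
    t[cols] += offset      # e.g. match is 5dB quieter -> add 5dB back
    a[cols] += offset      # carry the same correction into the attack
    return t, a
```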

This, likely, wouldn’t work with the regression-based approach, but I suppose the idea there is that it wouldn’t need this kind of massaging as the algorithm would take care of it for you.

///////////////////////////////////////////////////////////////////////////////////////////////////

My gut tells me that the classifier approach is more what I’m after, though I’m not sure it’s altogether viable (having loads of classes with just a single point each). The regressor could be interesting, but only if the automagic aspect of it works.

For now, I’ll try creating a viable training set with hundreds of sounds, and with a bunch of different descriptor types.

I would imagine it would probably be easy enough to create a test patch that runs the input on itself and then stitches the input (target transient) and output (target attack) together to see if it sounds “real”. For the training points that would be a good way to test if the thing is working at all, and then I could give it a lot of hits that are similar to, but not in, the training set to see if it can create approximations that sound believable.

I do apologise. It’s a gem.

I might have gone the other way around re: source and target but as long as we’re talking about the same thing.

Stupid question: why don’t you have two datasets with identical labels for your source transients and source attacks? You match with KDTree against the source transients and then use the resulting label to retrieve from the source attacks?


I guess just thinking about it in terms of the spectral compensation stuff. Where the live snare is the “target” that I’m trying to get the sample to sound like. (Unless I flipped the terminology there too…)

You mean just as a way to avoid using a parallel coll or entrymatcher? Wouldn’t the second dataset here be functioning that way? (taking an exact index and returning the values associated with that index)

So if I understand you correctly, I have a dataset with target transients in it, I query that with my real-time incoming target transients and that returns a value like 15. As in, that target transient is most like entry 15 in the dataset. Then I just need to go fetch the corresponding target attack to further query with. Is that correct?

I’m in the middle of creating the samples for the dataset (with variations for descriptors, mfccs, etc…) so I haven’t yet built the bit for proper matching yet, but do you think that building a KDTree with loads of entries with just a single training point each will be alright?

Wait, I think I get it.

A KDTree is the vanilla entrymatcher, since it has single data points and you’re just finding the nearest match via Euclidean distance.

I was confusing it with building a classifier where I would have specific classes I’m labelling and then predictpoint-ing on.

It also occurs to me that this circles back to the (lack of) biasing a query problem.

So the best case scenario here is that, in terms of what I’m doing my final query with, I will have a “real” target transient and a “predicted” target attack that I can feed into the system. That’s great in that I can query a longer span of time for a better match, but if I weigh them equally, I can end up losing some of the nuance of the “real” input due to the quantization inherent in having discrete points in the KDTree representing the target attacks.

A more ideal solution would be to do the process outlined above, but then use the predicted target attack to only bias or influence the query (like 70/30, or 60/40, would have to test to see what works best).
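So something like this, as a rough Python sketch of the weighting (the 70/30 split is just a starting point to test against):

```python
# Weight the distance on the "real" transient more heavily than the distance
# on the "predicted" attack when picking the best corpus entry.
import numpy as np

def weighted_query(live_transient, predicted_attack, corpus_transients, corpus_attacks,
                   w_transient=0.7, w_attack=0.3):
    d_t = np.linalg.norm(corpus_transients - live_transient, axis=1)
    d_a = np.linalg.norm(corpus_attacks - predicted_attack, axis=1)
    return int(np.argmin(w_transient * d_t + w_attack * d_a))   # index of best entry
```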

Ok, I’ve built a thing and got it “working”!!

(actually had a gnarly crash, which I made a bug report for)

I’ve trained it on just 10 hits, and am feeding the training data back in to see how it fares. Surprisingly, it only gets the correct results 50% of the time, which is unexpected as I’m feeding it the exact same data it was trained on. The real-time version is running through an onset detection algorithm, so it’s not literally the same data as it may be +/- some samples, but I guess that has knock-on effects because of windowing etc…

(I should note that the training data was, itself, segmented off real-time playing and the same exact onset detection algorithm)

If I test it on the same literal buffer~ data it works perfectly, so the system/plumbing is working.

Here’s the patch as it stands:

At the moment it’s oriented around validation primarily, so I can play specific samples, see which ones are matched, as well as fluid.bufcompose~-ing a franken-sample so you can “hear” what the algorithm thinks is correct. Even though it was 50% wrong, all the composite versions still sound believable, so that’s good.

It will definitely be critical to weigh these segments differently, since the second half (at the moment) has pretty shitty accuracy.

Speaking of…

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

At the moment I’m using the following descriptors:

loudness_mean
loudness_derivative
loudness_deviation
centroid_mean
centroid_derivative
centroid_deviation
flatness_mean
flatness_derivative
flatness_deviation
rolloff_max
rolloff_derivative
rolloff_deviation

These have been my go-to descriptors for a bit now, but they definitely aren’t robust enough for this application, as an onset detection’s worth of slop in the windowing brings the matching accuracy down to 50%.

(it should be said that I’ve not standardized or normalized anything here, but the input is the same as the dataset, so it shouldn’t matter that much…(?))
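For reference, standardizing would look something like this as a Python sketch (scikit-learn’s StandardScaler standing in for whatever I’d do on the fluid.dataset~ side; the data is invented):

```python
# Fit the scaler on the training descriptors, then apply the same transform
# to every incoming query before hitting the tree.
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.spatial import cKDTree

rng = np.random.default_rng(5)
training = rng.normal(loc=[-30, 0, 3, 100], scale=[10, 2, 1, 20], size=(100, 4))

scaler = StandardScaler().fit(training)
tree = cKDTree(scaler.transform(training))

incoming = rng.normal(loc=[-30, 0, 3, 100], scale=[10, 2, 1, 20], size=(1, 4))
_, idx = tree.query(scaler.transform(incoming))
```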

Off the bat, I can take all the spectral moments, along with all the stats, and perhaps even two derivs. Maybe throwing in a manually-calculated linear regression for good measure. Basically a more-is-more approach, but primarily based on “oldschool” descriptors.

OR I can try throwing MFCCs into the mix, as I did have the best results with MFCCs and stats when I was trying to optimize the JIT-MFCC-Classifier.maxpat a while back.

That patch, notably, does not do any dimensionality reduction, nor does it take any loudness descriptors (not even the 0th coefficient, looking back on it now). So it just brute-force takes a 96 dimensional space and matches based on that.

So I’m thinking I might do something like what @jamesbradbury is doing for his macro-segmentation stuff in Reaper and take MFCCs with all stats and two derivatives, and do some dimensionality reduction on that to see how things go.
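A rough Python sketch of that route, with PCA standing in for whatever reduction ends up getting used (the numbers are invented; 96 dimensions mirrors the MFCC patch above):

```python
# Reduce a large MFCC+stats space down to something KD-tree friendly.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
mfcc_stats = rng.normal(size=(300, 96))     # e.g. MFCCs x (stats + derivatives) per hit

reducer = PCA(n_components=12).fit(mfcc_stats)
reduced = reducer.transform(mfcc_stats)     # 12-d space to build the tree from
print(reducer.explained_variance_ratio_.sum())
```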

This quickly ends up in the territory discussed in the (ir)rational dimensionality reduction thread: loudness is important as a perceptual entity, as is timbre, so if loudness only represents 1/13th of the total dimensional space, that’s a bit of a problem.

Great prototype. I wonder what is happening, as you should get your source back (you are querying from the same corpus). Maybe the red bufcompose error you get in the background should help us?

Yeah I would think so. Or at least do better than 50%. (It literally has half the files right)

In the p playback subpatch I’m triggering files from a polybuffer~ that contains the entries that are in the fluid.kdtree~. The only thing is that I’m doing it as a real-time process (in p audioAnalysis). Both were created by the same onset detection and descriptor settings, but it’s likely that doing it again misses a couple of samples off the start of the file, which makes a big enough difference with the windowing to not match.

As I said, I didn’t sanitize the fluid.dataset~s at all either, which may exaggerate certain things(?).

I don’t think the fluid.bufcompose~ error is related because it’s after the matching process. Basically the fluid.kdtree~ spits out the match, then I put together a new buffer~ (in the bottom left corner) which is what would subsequently query a second fluid.kdtree~ that contains the actual corpus of samples. So I’m just concatenating the descriptors from the real-time analysis with the predicted descriptors that follow.

Still: there is no distance at all to unity, so apart from very small fluctuations of attack/framing, I don’t see how you don’t get your source back. A way to test the patch would be to make sure you send exactly the same input that you analyse; that way you should get your source back 100% of the time. If not, that means trouble in paradise.

That’s what I’m doing here.

I’m using pre-recorded audio files on the input for this patch, and used the same exact files to create the training set.

The process:

  • I created a bunch of training examples using onset detection on real-time input and then fluid.bufcompose~ to create 100ms files.

  • I fed these files into a batch analysis thing that analyzes samples 0-256, 257-4410, 0-end and saves that to a file (using the descriptors/stats I mention above).

  • I load the analysis for samples 0-256 into one fluid.dataset~ (---transient) and the analysis for samples 257-4410 into another fluid.dataset~ (---attack).

  • I fit the 0-256 (---transient) one to a fluid.kdtree~.

  • Then I feed “real-time” audio from the original file playback into a JIT analysis thing. I’m playing back the pre-recorded 100ms files, but only analyzing the first 256 samples (from the point an onset is detected).

I’m thinking because the analysis window is so small (256 samples) that’s enough to throw things off (using these descriptors).

Here are the patches used and/or I can make a more detailed video showing the entire process.

Archive.zip (48.0 KB)

All that above being said, I did also test this using manual messaging to fluid.loudness~ and fluid.spectralshape~, so literally running it on the same audio (without it being re-onset-detected), and it matches 100% of the time.

So I think the underlying stuff works. It’s just something getting lost in translation with the re-onset-detection and analysis.

Indeed, this points at the curating of the descriptors. I’ll get thinking, but one thing I noticed when doing my drum classifiers in nmf and in kdtree was that there were much better results when I removed the first few ms: an attack is an attack, but what is much clearer is the 2nd part of the onset, just after the transient… Have you tried not compensating for latency and just taking 1 or 2ms after the attack?

In this case I only have a 6ms window anyways, so a few ms is pretty big all things considered!

In the JIT-MFCC testing I played around with moving things back by some samples, but that didn’t seem to make any meaningful difference.

My theory here is that, since I only have a finite amount of sounds I can make with my snare, the attack specifically will give me a reliable (enough) idea of what the next 94ms will sound like.

I did think about trying to compensate for the re-onset-detection stuff, where I +/- some samples on either end to compensate for the fact that the final audio will have gone through that process twice, but I don’t think that will generalize out.