Regression for time series/controller data?

rodrigo.constanzo · July 25, 2020, 4:46pm

Something I’ve been thinking about for years now, but haven’t had the knowledge (or tools) to do something about, has been using some kind of machine learning to model/predict controller/gestural data.

To give a more concrete example, I want to be able to take some gamepad controller data (with both analog/continuous and button/discrete inputs), and feed that into an algorithm in a time series, and then ask for “more” of that to be generated.

Generally speaking, I was imagining this wrapped into a “time scrubbing” metaphor where I can be recording gestural data into a buffer~, and rewind/play it back sampler/looper-style, but then being able to scrub into “the future” by transparently switching over to some algorithmic/predictive versions that would follow on from that training data.

From where my knowledge is now, this seems like regression. Where, rather than filling in the gaps of continuous (and/or discrete?) values, I would be filling in a “future gap”.

Am I in the right ballpark there?

Where this gets really confusing for me is, what that would mean in terms of code.

In having a simple think about this I pictured having some kind of regressor (say fluid.mlpregressor~) and giving it an input which is just a clock signal. Either from a count~ or cpuclock or whatever, while at the same time generating the kind of controller data that I’m after.

So to use a simple example, at time0, my x/y controller would be bottom left, at time50 my x/y controller would be in the center, and at time100 my x/y controller would be at top right.

Where this starts to seem a bit crazy is that, particularly if I want some good resolution, I will easily have thousands if not tens of thousands of points I’d be training on (say 1 point per ms).

So that seems a bit crazy in that that’s a huge amount of data to set up and train on (or maybe not? don’t know).

The other thing is that the structure of that seems weird to me. Like, having a multidimensional controller stream, but only feeding a time series as an input. So basically creating a 1->15 network (or something similar).

Lastly, is the idea that if I have something trained like this, I can then feed it a control stream that consists only of a time value, and it would recreate/predict what I wanted? And, quite importantly, would that still work if I send in time values that are greater (or smaller) than what the regressor was trained on?

Am I on the right track?
Would a NN or some other regressor type be well suited to this?
Will it be able to handle ms/fast inputs as well as querying?

weefuzzy · July 27, 2020, 9:12am

Ah, generated is different from mapped.

Very short answer:

Time series are tricky, because the dependencies between samples need to be modelled (and estimating how far back those dependencies need to be traced is non-trivial). An MLP on its own isn’t up to the job because it has (a) no way of accounting for these temporal dependencies and (b) no generative mechanism. There are things out there though: neural networks with feedback connections; hidden markov models; factor oracles; convolutional neural networks.
Generating supposes that there is some driveable model at work which will respond sensibly to new input. One could hope that an MLP-based autoencoder could provide such a thing. However, if you watched the 3Blue1Brown MLP videos I pointed everyone to, you might remember that a problem can be whether the network has ‘learned’ anything that makes sense to us as a representation. There are some autoencoder ideas that try to make this more likely, as well as others that aim for generative capabilities by learning the probability distributions of inputs (however, this isn’t always effective either).

Things you might look at:
– There are a couple of hidden markov models in Mubu
– Chris Kiefer’s Echo State Network stuff https://www.nime.org/proceedings/2014/nime2014_530.pdf (ESNs are useful because they have recurrent connections, which helps with time series, and can be induced to generate new stuff by applying feedback to the output layer.)

rodrigo.constanzo · July 27, 2020, 9:38am

Yeah I guess I hadn’t considered that specifically, and thought that the “input time” would kind of model “time time”, but I can see where that wouldn’t work.

I’ll have a look at those resources and see if I can get a better understanding of bits, though I’m hesitant to slide into MuBu-land… (though I can poke and test).

This wasn’t the original purpose of this post, but will there be some kind of time-aware regressor at some point, natively in the fluid.verse~?

rodrigo.constanzo · October 4, 2020, 11:02am

Throwing a bump at this to see, now that things are nearing feature-completeness, if there is any likelihood of a time-aware regressor in the pipeline.

I’ve not been in a hurry for stuff, so I’ve not explored MuBu or other options hoping that something will eventually be available here, but if that’s not going to be the case, I’ll start investigating more unpleasant alternatives.

weefuzzy · October 4, 2020, 11:22am

Still only ‘maybe one day’, so if this is something you want to start exploring, then it’s probably worth looking at some of the other options.

rodrigo.constanzo · October 4, 2020, 11:25am

That’s good enough for me (to avoid messing around with MuBu)!

balintlaczko · October 24, 2020, 11:36am

Hey all! Feels great to be an alpha tester. I had a very similar dream as you, @rodrigo.constanzo, and did some simple stuff in python (tensorflow/keras) with lstm networks. They are tricky to train (especially when it comes to overfitting), but when it’s done right, their output is far more interesting than markov chains and mlp-s. If you are up for it, you can try it with mlp-s though (now that I have access, I’ll try it too!). You need to train your regressor on the series, like: if input is “a b c” then output is “b c d”, and if “b c d”, then “c d e”, and so on. Here the length of the lists is your “memory”. (A larger memory will be like “if abcdefgh then bcdefghi”, etc.) It will have some results, but you need a big dataset with a lot of different scenarios, so that it doesn’t get “locked” into something. Also, even though it will have some minimal memory, it will never have a “good” temporal understanding, because it cannot see anything outside that fixed window. Recurrent networks can in principle remember further back, but they run into the vanishing gradient problem (and its counterpart, the exploding gradient problem; the two are related but not the same thing). That’s what lstm networks tried to solve, by training gates in every node that decide how much to remember or ignore momentary stuff. Generally (at least in tensorflow/keras) you need to organize your dataset in the shape of samples, time steps and features. It would be great to have a model like this in Flucoma, though I understand that it is not first priority.
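A minimal Python/NumPy sketch of the “a b c” → “b c d” pairing described above may help. The function name `make_windows` and the random x/y data are purely illustrative assumptions, not part of any FluCoMa or Keras API.

```python
import numpy as np

def make_windows(series, memory):
    """Pair each window of `memory` steps with the same window shifted one step ahead."""
    series = np.asarray(series)
    n = len(series) - memory  # number of complete (input, target) pairs
    if n <= 0:
        raise ValueError("series must be longer than the memory length")
    inputs = np.stack([series[i : i + memory] for i in range(n)])
    targets = np.stack([series[i + 1 : i + 1 + memory] for i in range(n)])
    return inputs, targets

# The letter example from the post: memory of 3 steps.
letters = np.array(list("abcde"))
inputs, targets = make_windows(letters, memory=3)
# inputs  -> [a b c], [b c d]
# targets -> [b c d], [c d e]

# With multi-dimensional controller frames the result has the
# (samples, time steps, features) layout mentioned above.
xy = np.random.default_rng(0).random((100, 2))  # 100 made-up x/y frames
xy_in, xy_out = make_windows(xy, memory=10)     # shape (90, 10, 2)

# An MLP takes flat vectors, so each window is flattened to one row.
flat_in = xy_in.reshape(len(xy_in), -1)         # shape (90, 20)
```

The flattening step is the only difference between feeding these windows to an MLP and feeding them to a recurrent layer, which would keep the three-dimensional shape.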

rodrigo.constanzo · October 24, 2020, 11:55am

Howdie and welcome!

So I guess what you’re describing (with mlp) is what I was sort of describing above in the first post?

My main concern with trying something like that with the kind of input data I have (mix of continuous and binary controller data) is that it may produce somewhat usable continuous data (presumption based on no knowledge), but without any temporal sense, I can’t imagine what it would do with on/off controller data where the when and for how long matters.

I haven’t come across lstm networks before, as most of my exposure to algorithms and such has been via the FluCoMa project and wekinator and the corresponding Kadenze course, so it’s good to know what other kinds of things are out there.

But yes, something that can handle this kind of regression/prediction would be super useful!

tremblap · October 24, 2020, 12:01pm

Now, I really look forward to seeing what you get to!

balintlaczko · October 25, 2020, 2:46pm

Hey all, so here is a quick first (and a bit stupid) example for time series prediction with fluid.mlpregressor~. I tried to package all dependencies in the project, hope everything will work. Let me know if not.

On the gif above, the red ball is the time-series we record and train our network on. The green is the result of the prediction feedback loop. This is an overly simple (and well, stupid) example, but it shows the gist of how it’s usually done. I excluded one usual step, which is shuffling the training dataset (after chunking), which in more novel cases helps the network to avoid overfitting. Nowadays it is also fashionable to use dropout layers/functions in the network, for the same purpose (maybe could be a future feature?).
But here is an annoying feature request: it would be super mindblowing awesome if the training could be a multicore process. Could that be possible? In this example I trained for around 8-10 minutes, but only one core was busy, so I imagine that would be at least half the time if all 4 cores would pitch in…
Anyway, I might have missed something, so feel free to criticize. This example still does not account for the binary data proposed by @rodrigo.constanzo above, but either the system can be extended to learn those things, or it could be circumvented with some algorithm outside/around the network…
Best,
B
ball_mlp_autoregression.zip (871.6 KB)
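For readers without Max to hand, the prediction feedback loop mentioned in this post can be sketched in Python. This is one common way to do autoregression and is only an assumption about how the attached patch works; `rollout` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for a trained regressor rather than a real model.

```python
import numpy as np

def rollout(predict, seed_window, steps):
    """Generate `steps` new frames by feeding each prediction back in as the next input."""
    window = np.array(seed_window, dtype=float)
    generated = []
    for _ in range(steps):
        next_window = predict(window)      # window shifted one step into the future
        generated.append(next_window[-1])  # only the last frame is actually new
        window = next_window               # feedback: the output becomes the input
    return np.array(generated)

def toy_predict(window):
    """Stand-in for a trained regressor: extrapolates the last two frames linearly."""
    next_frame = 2 * window[-1] - window[-2]
    return np.vstack([window[1:], next_frame])

seed = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]  # three recorded x/y frames
future = rollout(toy_predict, seed, steps=3)
# future -> [[3, 6], [4, 8], [5, 10]]
```

Because every generated frame depends on earlier generated frames, small prediction errors accumulate, which is one reason such loops can drift or get “locked” into a pattern.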

tremblap · October 25, 2020, 3:34pm

optimisation will come once we are convinced of the interface choices - this is the pleasure of being on alpha, dear @balintlaczko - patience is required but suggestions and sexy patches like this are more than welcome

rodrigo.constanzo · October 25, 2020, 6:19pm

Oooh awesome!!!

I tried training it up on a simplified version of my controller data (patch attached), so taking just a 9d controller vector (2 x/y pairs, 1 analog trigger, and 4 binary/buttons). I created 5000 data points as they came out of the controller (more-or-less), and included my simple visualizer patch to watch what’s happening.

I got through the training point of your patch (took me 20min to train, with a final loss of 33), but in the end I wasn’t sure how to adapt the output. Actually looking back I’m not sure I had the autoregressor set up correctly either, since I didn’t change anything in that subpatch. Also, I probably should have changed the @activations because my data is between 0. and 1.

A couple of (noobish ML) questions.

The chunking you do in the subpatch. Is that producing a chunk of 10 entries (out of 5000)? So you are then training on 10 entries at a time? So kind of a time series simplification? And the missing shuffling there would be in order to break things up a bit?

Next, with the settings for fluid.mlpregressor~, the hidden layers seem really big to me. So in the case of your data it goes from 2->128->64->32->2, if I’m reading things right. In my case it would be 9 at the start/end. Is the high number of nodes there just a technical thing, in terms of making an autoregressor work well?

And in a more vanilla technical sense, you’re limiting the data sampling to 50ms here. Is that just to reduce computation time, or is it about overfitting? Like, if I have really fast/wiggly gestures and would want that level of detail/granularity, would that be bypassed? (in my test example, I did just that, removing all the time clamps)

And lastly, in terms of speed and timescales (this one is kind of aimed at @tremblap). I guess the smaller the training set, the more erratic or “inaccurate”(?) the autoregressor will become, but I guess the faster it computes. I don’t know if it works this way at all, but can you sort of macro-chunk (not using this correctly here I’m sure), where I take 30sec chunks of controller data and compute them in a just-in-time manner, as opposed to feeding it a single longer string of data, with it somehow taking in the aggregate of what happened over that time?

And in terms of optimization, curious the scale of it as even if it’s 10x faster, that’s still 2min of pinwheeling.

Here’s my patch (which does nothing other than house the data and visualizer):
xboxdata.zip (115.6 KB)

balintlaczko · October 25, 2020, 7:50pm

Hey @rodrigo.constanzo,
First, here is the chunking part as a fluid.buf2trainingdatasets abstraction. (I am open to suggestions for a better name and for anything else)

Haven’t gone through your patch yet, but:

if the final loss is 33, that’s absolutely terrible. Abysmal. It should be less than 0.01 (I should have added that to the comment there). Maybe it would be wise to set @maxiterations to 1, and then if loss is more than 0.01 then retrigger the fitting. With hanging loops like that I often use a [deferlow], so it somehow still registers mouse clicks every now and then (in case you want to break the loop with a [gate]).
yes, I think the activations should definitely match the data. I am also an ML-noob, but I heard in several places that ReLU (more precisely the Leaky ReLU) is the gold standard nowadays for many use cases, but I don’t know why. So maybe it is a good idea to scale the datasets (or your input data) according to what ReLU likes there (don’t remember which range that is), and give it a try.
the chunking works like this: if your timeseries is “12345” and chunksize=3, then:

| step | input | target |
| --- | --- | --- |
| 1 | “123” | “234” |
| 2 | “234” | “345” |
| 3 | “345” | “450” |
| 4 | “450” | “500” |
| 5 | “500” | “000” |

In retrospect, it should probably have an option to @dropremainder à la tf.data.Dataset.batch(). A future feature for fluid.buf2trainingdatasets. But in general, the chunking creates the “fake” temporal memory for the mlp. If it were an LSTM (Long Short-Term Memory), which is designed for timeseries, I could just say that in this dataset all entries are subsequent steps in a series and it would act properly. So this chunking is a trick to make mlp-s work with timeseries. (Nevertheless often used for RNN-s too!) Shuffling these can make sure that your network can handle sudden changes “off the playbook” better, instead of just imploding at a specific spot.
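The chunking table above can be reproduced with a short Python/NumPy sketch. This is not the code of the fluid.buf2trainingdatasets abstraction; `chunk_series` and its `drop_remainder` flag are hypothetical names, with the flag modelled on the `drop_remainder` argument of `tf.data.Dataset.batch()`.

```python
import numpy as np

def chunk_series(series, chunk_size, drop_remainder=False):
    """Build (input, target) chunks; targets are the inputs shifted one step ahead.

    Without drop_remainder, chunks running past the end are zero-padded,
    as in the table. With it, only chunks made entirely of real data are kept.
    """
    series = np.asarray(series)
    n = len(series)
    pad = np.zeros((chunk_size,) + series.shape[1:], dtype=series.dtype)
    padded = np.concatenate([series, pad])
    count = max(n - chunk_size, 0) if drop_remainder else n
    inputs = [padded[i : i + chunk_size] for i in range(count)]
    targets = [padded[i + 1 : i + 1 + chunk_size] for i in range(count)]
    return np.array(inputs), np.array(targets)

series = np.array([1, 2, 3, 4, 5])

inputs, targets = chunk_series(series, chunk_size=3)
# inputs  -> 123, 234, 345, 450, 500
# targets -> 234, 345, 450, 500, 000

kept_in, kept_out = chunk_series(series, chunk_size=3, drop_remainder=True)
# kept_in  -> 123, 234
# kept_out -> 234, 345
```

Dropping the remainder avoids teaching the network that every series ends in a run of zeros, at the cost of a few training pairs.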

hidden layers: yeah, I might have been a bit over the top with 128 for this simple task. But 64 is a totally OK number AFAIK. I also have been working with LSTM layers of 512 nodes per layer, it certainly takes its time to train (aka. don’t even think about it without an Nvidia GPU), but the results were also better. But AFAIK the size is not everything. Generally they say you should scale your layer sizes to the size of your dataset (so for a dataset of 5000 entries, 128 may be an overkill actually, but for 500k, it would be totally OK, or even too small). Too many nodes will make training much slower/longer, unnecessarily. As for how many hidden layers you should have, they usually say (when it comes to LSTMs or CNNs but it probably goes as a general rule of thumb) that it should match how many meaningful levels of abstraction(? - probably not the best word) your source data has. For example pixels->lines->contours->shapes->facial_regions->facial_expression, or momentary_xy->little_bump->wiggle->arc_of_wiggles->large_section->movement. That is partly why it generally makes more sense to start with large layers, then go on with smaller and smaller ones AFAIK.
training time: brace yourself, 20 minutes of training is a small time on the scale of deep learning. Some models train for weeks, and that’s not so rare. Also, training on a CPU is infinitely less efficient than doing that on a GPU. Once I trained a model of one layer of LSTMs with 64 nodes for 2 weeks (not 24/7 but a lot) on my sh*tty little laptop. Then I borrowed a desktop PC with an Nvidia GTX 1660 Titan (if I remember correctly). With the same network, it took around 20 minutes to achieve the same accuracy. (It was humiliating!) So bottom line: 20 min is totally acceptable, 2 hours would be too, 4 hours is kind of standard (I think that’s how long they trained AlphaZero too before the battle with Stockfish, but I may mix things up now).
the 50ms rate limit was just so that the animation is not too fast for the human eye, without that it was so fast that the trajectories blended together (kudos for the devs, because inference is super efficient!) So yeah, this should also depend on what makes sense in terms of your input data.
dataset size: yes, in deep learning, the bigger the better, but generally everything below 100k samples is considered to be a “sparse” or “small” dataset, AFAIK.
lastly, as I mentioned in the beginning of our conversation, long-time memory is where mlp-s will for sure fail even if they are trained “perfectly”, because they only ever see the fixed chunk you give them (and the recurrent networks that look further back have the vanishing/exploding gradient problem to deal with). So don’t expect nice, high-level variations in the time structure.

After this long rant, will download and look at your patch now! Also a late disclaimer: I am a completely self-taught, half-dilettante ML enthusiast, so crosscheck everything I say with someone who really knows it.

rodrigo.constanzo · October 26, 2020, 12:02am

Awesome! I need to wrap my head around what’s actually happening here (as to all the other points below), but that looks suuuuuper useful!

The name isn’t great, but I guess it’s descriptive.

I’ll run this again now, but this could have been to do with the @activation stuff. @activation 3 (tanh) expects -1 to 1, and all my data is 0 to 1.
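As a rough illustration of matching the data range to the activation, here is a small Python sketch that maps 0..1 controller values onto the -1..1 range of tanh and back. The helper names are made up for this example, and the rescaling is only one option: the alternative is to keep the data as it is and pick an output activation whose range already covers 0..1.

```python
import numpy as np

def to_bipolar(x):
    """Map values in [0, 1] onto [-1, 1], the output range of tanh."""
    return np.asarray(x, dtype=float) * 2.0 - 1.0

def to_unipolar(y):
    """Inverse mapping: bring [-1, 1] predictions back to [0, 1]."""
    return (np.asarray(y, dtype=float) + 1.0) / 2.0

frame = np.array([0.0, 0.25, 0.5, 1.0])  # made-up controller values
scaled = to_bipolar(frame)                # [-1.0, -0.5, 0.0, 1.0]
restored = to_unipolar(scaled)            # back to the original values
```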

I think I follow this. This is the length of the “memory” here. Since there’s no actual memory/feedback in mlp, this kind of fakes it by giving it a time series, without letting it know it’s a time series.

Running this again in the background with @hidden 64 32 16 to see if it’s any faster.

It’s interesting to hear/read how things translate from “big” data in a music context, to big data in a fluid.context~. I remember @tremblap mentioning a bit ago that most of the algorithms have been chosen to be functional on a CPU without needing massive render times. So I guess that limits the scope of some algorithms or entire approaches.

For something like this, I would want something that’s “almost real-time-able”, where in the context of performing with the controller, at any given point I can have an amount of predicted data to draw on. Hence my question about having multiple passes of something like this going on at once. So in this context, 1 minute is almost an eternity… much less 20, hehe.

For other things, I totally wouldn’t mind leaving crazy long render times going, if I can then chime in on that.

Actually, this is kind of a tangential ask here, but are there algorithms where you can train a kind of overall hierarchy/structure and then add some more data to refine it?

So like, if I train it on a lot of performance/controller data which I aggregate and then compute for a long time (days/weeks or whatever). To have a general “shape” of the data. Can I then add some real-time performance data for a specific performance and then be able to leverage the pre-computed network? Like seeding it with a bunch of new data or something.

I mean, every time I use the controller, it’s a bit different, but it’s within a universe of possibilities given the controller, my physiology (only so many fingers), speed, and general performative language, etc… Then below all that is the specific syntax/gesture of that particular performance.

As I said, just spitballing.

Gotcha. Wasn’t sure if this was a downsampling-for-the-sake-of-the-algorithm thing.

Hehe, long rant welcome! It’s cool to have some different experiences and perspectives in the “secret” forum.

balintlaczko · October 26, 2020, 1:12am

Yeah, good points. Well, I think it is totally realtimable, if you don’t intend to train during the performance. From what I’ve seen, training is never something people do on the same occasion where they want to predict. So it can be that you train your network at home, and on the performance you just load your weights, and go predict.
In my experience when it comes to choosing layer sizes: maybe start with a smaller/shallower structure, and if the loss doesn’t want to go down much, then gradually extend (which will reset your weights though). It is probably very dependent on the results you want and the regularity/redundancy of the training data, but in my experience, it should go below 1, and preferably below 0.1 (and more preferably even lower). (That is, if this number corresponds to what I usually saw, which was computed using mean squared error; I don’t know how it is done in the case of fluid.mlpregressor~.)
Your idea about retraining to a specific “mood” makes perfect sense, and AFAIK it is called “transfer learning”. It is exactly what you say: first train the network on general data, then take that every time you need something, load the weights, and train on another dataset (without resetting), which should of course be related to the general dataset in principle. For example, take a network pretrained on ImageNet (which generally recognizes objects), and retrain it to recognize different brands/models of game controllers.
But, I don’t know how good simple mlp-s are for transfer learning, I heard that LSTMs are bad, CNNs are good, and Transformers are insanely good (they often say “the transformers are the new lstms” because they share the strengths but avoid most of the problems).
I thought a bit about your task here, and maybe it would be worth trying to break down your long performance into separate gesture archetypes (with kmeans or something), collect a lot of examples for each class, and train separate mlpregressors on them. Then you can train another mlpregressor to learn the sequence of the gestures on a higher level as they appear in the performance. Then, during the performance maybe you don’t need to retrain the gestural archetypes, just the sequencer-mlp from time to time, which I guess should be much more realtimable, while all your “gesture-imitator” nodes can just make something when they are asked for it. (something, something…)

rodrigo.constanzo · October 26, 2020, 1:12am

And to add, not sure why it’s going so much slower now, but I started the training while making that post and heading up to bed now with it still churning away over an hour later.

I also lowered @maxiter down to 500 thinking that would speed things up, but I guess not…

rodrigo.constanzo · October 26, 2020, 1:16am

That sounds pretty ideal!

I wouldn’t want to have a generic trained bank of stuff, as that would feel a bit like “pressing play on a recording”, or a generic auto-pilot, potentially not correlated to the kinds of stuff that happened during that specific performance.

I guess something like that would be possible, but the control scheme I’m using is fairly modal, meaning that specific gestures/movements may mean things in different contexts or different scales etc… So I would be wary of decoupling the context from the gesture by training classes based solely on gestures.

balintlaczko · October 26, 2020, 1:29am

Yeah, totally right. I think in this case the classes should not be too scalable in time, since then they mean something else. But this also leads to an interesting question: is the context solely the controller timeseries, or maybe the audio you hear is also part of it?
In the first case, if you have let’s say 100-200 classes (where the same gesture faster or slower are different classes), and a good understanding of their sequence and timing, it should be interesting. In the second case you may want to experiment with training the network on the controller timeseries together with some representation of what’s happening in the sound (eg. maybe when you hear a snare, you tend to do something fast and short)…
One more idea about the training/freezing: there is a trick called “early stopping”, which is like: if the loss has not dropped below its last minimum within the next X epochs, then stop training. So here you may want to try @maxiterations 1, and then implement this “callback” manually.
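The early stopping logic described above could look roughly like this in Python. Everything here is a sketch under assumptions: `fit_once` stands in for whatever runs a single training iteration and reports the loss (for instance retriggering the fit with one iteration at a time), and the canned list of loss values is invented for the demonstration.

```python
def fit_with_early_stopping(fit_once, patience=20, target_loss=0.01, max_rounds=10000):
    """Call fit_once() repeatedly until the loss stops improving.

    Stops when the best loss reaches target_loss, when `patience` rounds
    pass without a new minimum, or after max_rounds rounds.
    Returns (best_loss, rounds_run).
    """
    best = float("inf")
    since_best = 0
    rounds = 0
    while rounds < max_rounds:
        loss = fit_once()
        rounds += 1
        if loss < best:
            best = loss       # new minimum: reset the patience counter
            since_best = 0
        else:
            since_best += 1   # no improvement this round
        if best <= target_loss or since_best >= patience:
            break
    return best, rounds

# Demonstration with made-up loss values instead of a real network.
fake_losses = iter([33.0, 5.0, 1.0, 0.5, 0.6, 0.7, 0.55, 0.4])
best, rounds = fit_with_early_stopping(lambda: next(fake_losses), patience=3)
# best -> 0.5, rounds -> 7 (three rounds in a row failed to beat 0.5)
```

Note that the demonstration stops just before the loss would have improved to 0.4, which shows the trade-off in choosing the patience value: too small and training quits on a temporary plateau, too large and it keeps running long after it has stalled.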

rodrigo.constanzo · October 26, 2020, 2:04am

Correlating with audio could be interesting as well, though for this particular instrument the sounds are all live-sampled, so the sound itself may also be quite variable. There’s probably something with each of the states/effects/processes.

When I first played with an auto-encoder, I manually banged it until it got low enough, but given that last time it was 20 minutes, I figured I’d just let it run again. I’ve left it running on my studio computer, so we’ll see. Sadly I didn’t set up a cpuclock as I’m kind of curious as to how long it actually took (or will take, if it’s still running).

balintlaczko · October 26, 2020, 2:41am

Oh man, this Cut Glove is great!

One thing seems clear from all this discussion: fluid.mlpregressor~ would definitely benefit from those (sweet) @blocking options like the other externals. And maybe a stop method, and the possibility to query the loss while training, but now I am getting greedy. :))