Regression for time series/controller data?

Hey @rodrigo.constanzo,
First, here is the chunking part as a fluid.buf2trainingdatasets abstraction. (I am open to suggestions for a better name :slight_smile: and for anything else)

Haven’t gone through your patch yet, but:

  • if the final loss is 33, that’s absolutely terrible. Abysmal. It should be less than 0.01 (I should have added that to the comment there). Maybe it would be wise to set @maxiterations to 1, and then retrigger the fitting as long as the loss is above 0.01. With hanging loops like that I often use a [deferlow], so Max still registers mouse clicks every now and then :smiley: (in case you want to break the loop with a [gate]).
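The retrigger idea could be sketched like this in Python (everything here is hypothetical: `fit_once` stands in for one fit with @maxiterations 1, and the round cap plays the role of the [gate] escape hatch so the loop can’t hang forever):

```python
LOSS_TARGET = 0.01   # "good enough" threshold from the comment above
MAX_ROUNDS = 10000   # safety valve, like breaking the loop with a [gate]

def train_until(model, loss_target=LOSS_TARGET, max_rounds=MAX_ROUNDS):
    """Keep retriggering single fitting passes until the loss is low enough."""
    loss = float("inf")
    for _ in range(max_rounds):
        loss = model.fit_once()  # one fitting pass, returns the current loss
        if loss < loss_target:
            break
    return loss
```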
  • yes, I think the activations should definitely match the data. I am also an ML noob, but I have heard in several places that ReLU (more precisely, Leaky ReLU) is the gold standard nowadays for many use cases, though I don’t know why. :slight_smile: So maybe it is a good idea to scale the datasets (or your input data) to the range ReLU likes (I don’t remember which range that is), and give it a try.
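For what it’s worth, a common choice when feeding ReLU-family activations is min-max scaling into 0..1 (that range is my assumption, not something from the docs). A minimal sketch of such a normalizer:

```python
def normalize(values):
    """Min-max scale a flat list of numbers into the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant input: avoid division by zero, map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

The same scaling parameters (`lo`, `hi`) would of course have to be reused at inference time, or the network sees data on a different scale than it was trained on.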
  • the chunking works like this: if your timeseries is “12345” and chunksize=3, then:

| step | input | target |
|---|---|---|
| 1 | “123” | “234” |
| 2 | “234” | “345” |
| 3 | “345” | “450” |
| 4 | “450” | “500” |
| 5 | “500” | “000” |

In retrospect, it should probably have an option like @dropremainder, à la tf.data.Dataset.batch(). A future feature for fluid.buf2trainingdatasets. But in general, the chunking creates the “fake” temporal memory for the MLP. If it were an LSTM (Long Short-Term Memory), which is designed for time series, I could just declare that all entries of the dataset are subsequent steps in a series and it would act accordingly. So this chunking is a trick to make MLPs work with time series. (It is often used for RNNs too!) Shuffling these chunks can help your network handle sudden changes “off the playbook” better, instead of just imploding at a specific spot.
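The chunking scheme in the table could be sketched like this in Python (hypothetical helper name; it zero-pads past the end of the series, so the last windows run off into zeros exactly like the table does):

```python
def chunk_series(series, chunksize):
    """Slide a window of `chunksize` over the series; the target is the same
    window shifted one step ahead. Zero-pad so every step has a full window."""
    padded = list(series) + [0] * chunksize
    pairs = []
    for i in range(len(series)):
        window = padded[i:i + chunksize]
        target = padded[i + 1:i + 1 + chunksize]
        pairs.append((window, target))
    return pairs
```

Calling `chunk_series([1, 2, 3, 4, 5], 3)` reproduces the five input/target pairs of the table above; a @dropremainder-style option would simply stop before any window that contains padding.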

  • hidden layers: yeah, I might have been a bit over the top with 128 for this simple task. But 64 is a totally OK number AFAIK. I have also worked with LSTM layers of 512 nodes per layer; it certainly takes its time to train (i.e. don’t even think about it without an Nvidia GPU), but the results were also better. But AFAIK size is not everything. Generally they say you should scale your layer sizes to the size of your dataset (so for a dataset of 5000 entries, 128 may actually be overkill, but for 500k it would be totally OK, or even too small). Too many nodes will make training unnecessarily slow. As for how many hidden layers you should have, they usually say (when it comes to LSTMs or CNNs, but it probably holds as a general rule of thumb) that it should match how many meaningful levels of abstraction (probably not the best word) your source data has. For example pixels->lines->contours->shapes->facial_regions->facial_expression, or momentary_xy->little_bump->wiggle->arc_of_wiggles->large_section->movement. That is partly why it generally makes more sense to start with large layers, then continue with smaller and smaller ones, AFAIK.

  • training time: brace yourself, 20 minutes of training is a short time on the scale of deep learning. Some models train for weeks, and that’s not so rare. Also, training on a CPU is vastly less efficient than doing it on a GPU. Once I trained a model with one layer of 64 LSTM nodes for 2 weeks (not 24/7, but a lot) on my sh*tty little laptop. Then I borrowed a desktop PC with an Nvidia GTX 1660 Titan (if I remember correctly). With the same network, it took around 20 minutes to achieve the same accuracy. (It was humiliating!) So bottom line: 20 min is totally acceptable, 2 hours would be too, and 4 hours is kind of standard (I think that’s how long they trained AlphaZero before the match with Stockfish, but I may be mixing things up).

  • the 50ms rate limit was just so that the animation is not too fast for the human eye; without it, the trajectories blended together (kudos to the devs, because inference is super efficient!). So yeah, this should also depend on what makes sense in terms of your input data.

  • dataset size: yes, in deep learning, the bigger the better, but generally everything below 100k samples is considered to be a “sparse” or “small” dataset, AFAIK.

  • lastly, as I mentioned at the beginning of our conversation, long-term memory is where MLPs will fail for sure, even if they train “perfectly”, because of the vanishing/exploding gradient problem. So don’t expect nice, high-level variations in the time structure.

After this long rant, I will download and look at your patch now! :smiley: Also a late disclaimer: I am a completely self-taught, half-dilettante ML enthusiast, so cross-check everything I say with someone who really knows this stuff.
