Generative Rhythm (regressing a time series)

This has been on my radar of things to test for some time, and I’m finally getting around to it.

I’m basically taking the core idea(s) from this older thread on regressing controller data, but applying them to delta values of onset timing.

One of the main issues with the chunking approach in that thread is that it didn’t seem to be possible to find a balance between having a reasonable gesture window for capturing meaningful data from controller input, and the required chunking/memory to capture that.

So either you needed gigaaaantic chunks, or a super slow sample rate, neither of which is viable for realtime/fast controller regression.

Fast forward a couple years (!) and after having an unrelated chat with @weefuzzy about SP-Tools things, he was talking about time series and deltas and that got me thinking about this again.

So I’ve dusted off @balintlaczko’s patch from that thread and tried to create a version that you can train on chunks of rhythmic data instead (via onset timing deltas).

This sidesteps one of the main issues in that the “sample rate” is irrelevant as it is an onset-per-entry, so the chunk size is directly mapped on to how many notes you want the “rhythmic memory” to be. (as far as I can understand it)
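As a rough illustration of the chunking idea (not the actual patch), here is a minimal Python sketch. It assumes each training example maps a window of `chunk_size` deltas to the same window shifted forward by one onset; the function name `make_training_pairs` and the example onset times are made up for illustration.

```python
def make_training_pairs(deltas, chunk_size):
    """Slide a window over the deltas: input = window, target = window shifted by one onset."""
    pairs = []
    for start in range(len(deltas) - chunk_size):
        window = deltas[start:start + chunk_size]
        target = deltas[start + 1:start + chunk_size + 1]
        pairs.append((window, target))
    return pairs


# Hypothetical onset times in milliseconds, converted to inter-onset deltas.
onsets_ms = [0, 250, 500, 1000, 1250, 1500, 2000]
deltas = [b - a for a, b in zip(onsets_ms, onsets_ms[1:])]

# One entry per onset, so chunk_size is literally "how many notes of memory".
pairs = make_training_pairs(deltas, chunk_size=4)
for window, target in pairs:
    print(window, "->", target)
```

Note that the number of usable pairs is `len(deltas) - chunk_size`, so a bigger chunk eats into an already small training set.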


This first pass was just to get something (somewhat) working, and then to try and improve on it.

So here’s the patch as it stands: (12.0 KB)

The first glaring problem is that it’s not very good. It does produce some delta values, but they’re nothing like the source material here (which I purposefully kept very rigid/square). This could be down to the (relatively small) size of the network, the small amount of training hits (~80), the (potentially too small?) size of the chunks (16), or something as simple as me doing it wrong.

There’s also a funky thing going on where I have to reinstantiate @balintlaczko’s abstraction before running it or Max crashes. I plan on troubleshooting that separately, but just skipping past that to get things going.

So firstly, is this, theoretically, the right process here (for this particular approach (time series chunking into an MLP))? Or have I messed something up along the way?

I was(/am) still confused at step 5, but I guess the idea is you feed the results of the prediction back into itself with the idea being that the chunking will predict a new value for the last step of the chunk which represents the latest entry.
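The feedback loop described here can be sketched in Python like this. It is only an assumption about how the process is meant to work: `predict` stands for whatever trained model maps a chunk to the next chunk, and `toy_predict` is a made-up stand-in (it just appends the window mean) so the loop can run without a network.

```python
def generate(predict, seed, n_steps):
    """Feed each prediction back in: keep only the last predicted value as the new delta."""
    window = list(seed)
    generated = []
    for _ in range(n_steps):
        predicted_chunk = predict(window)
        new_delta = predicted_chunk[-1]      # last slot = the newest onset delta
        generated.append(new_delta)
        window = window[1:] + [new_delta]    # drop the oldest delta, append the new one
    return generated


def toy_predict(window):
    """Hypothetical stand-in for the trained MLP: shift left and append the window mean."""
    return window[1:] + [sum(window) / len(window)]


generated = generate(toy_predict, seed=[250, 250, 500, 250], n_steps=3)
print(generated)
```

Because every new input window is made partly of the model's own output, small errors compound, which is one plausible reason a loop like this can drift towards a single value.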

As mentioned above, this doesn’t sound/work too well here as I seem to be getting really erratic and somewhat random sounding rhythms, but that can have a number of explanations (listed above).


Now, presuming the above works and produces usable rhythms, the next step would be to create some kind of larger structure. In an email chat, @balintlaczko suggested having multiple networks where one encodes low-level patterns into “syllables”, “words”, “phrases”, which can then be zoomed out and extrapolated on. The technical specifics of doing something like that evade me, but I can imagine things like encoding x amount of deltas (as above), which represents a kind of low-level rhythmic grammar, then separately accounting for things like the amount of onsets at various time frames (3s, 5s, 10s, 20s, 30s, etc…) and encoding that separately, such that when generating material one network informs the other on how often to generate material etc…
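To make the "amount of onsets at various time frames" part concrete, here is a small Python sketch. The function name `onset_density`, the trailing-window definition, and the example onset times are all assumptions for illustration; how such counts would then steer a second network is left open.

```python
def onset_density(onset_times_s, now_s, windows_s=(3, 5, 10, 20, 30)):
    """Count the onsets that fall inside each trailing time window ending at now_s."""
    return {
        w: sum(1 for t in onset_times_s if now_s - w < t <= now_s)
        for w in windows_s
    }


# Hypothetical onset times in seconds.
onset_times = [0.5, 8.0, 9.0, 9.5]
density = onset_density(onset_times, now_s=10.0)
print(density)
```

The resulting counts form a small feature vector per moment in time, which could sit alongside the delta chunk as a slower-moving description of activity.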

I’m aware that the NNs available in FluCoMa aren’t really designed (or able?) to do this sort of thing as none have feedback or memory, but at the same time I’m not trying to make generative or “AI” music in any way. This is hopefully adding another dimension to some of the things that can be done with SP-Tools where you can generate some kind of rhythmic/self-similar material based on previous training.

If it works, it would also be fantastic to include multidimensional input where rather than just feeding in delta values, I feed in descriptors at each given point, with the hope that with sufficient training it could predict the time series, and corresponding descriptors, for new hits.


So, any thoughts/suggestions/comments welcome, both in terms of fixing whatever is wrong with this patch, and how to improve it (drastically larger training set and/or network structure, ways to encode multiple levels of memory/structure, etc…)


I quickly tried a much larger chunk size (32) with a corresponding network structure (25 15 10 15 25), and after training to a loss of 0.157 it produces results which… sound better. They’re still random/erratic, with a bit more push/pull, but after a while it seemed to regress to the mean, where it drifted to and stayed at a single rhythmic value (~Δ114).

I did get this happening with the version posted above every once in a while, but it seems like the larger chunking does that more quickly?

Ok, I came back to this some this morning and created a much longer initial training set (500 hits vs ~70 in the OP).

longrhythm.txt (8.2 KB)

After letting the network train for like 5min it got down to ~0.2 loss and stayed there for a long time.

This seemed to produce better sounding results? Depending on how I seeded it, it either regressed to a mean and played the same singular rhythm over and over again (~Δ240), or sometimes got stuck doing a rhythmic phrase over and over (4 eighth notes followed by 2 quarter notes), which is quite present in the training example to be fair.

It did take me a while to create this training data, and in this case it was fairly easy as it was just combinations of quarter/eighth notes I could riff on infinitely. But if I had a more peculiar rhythmic language I was working on, I would struggle to do it for >5min straight. So I was wondering if there’s a way to “cheat” the training data. Since the chunking represents a finite memory/history for the network, it should be possible to take the original training data, cut it up into larger chunks, and reshuffle them. This would obviously introduce some new rhythms at those junctures, but it would potentially significantly increase the surface area for the network to work on.
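A minimal Python sketch of that reshuffling idea, under the assumption that the segments are longer than the network's chunk size so most training windows still fall inside real material. The names `shuffle_segments` and `augment` and the example data are made up for illustration.

```python
import random


def shuffle_segments(deltas, segment_len, seed=None):
    """Cut the delta list into consecutive segments and return them in shuffled order."""
    segments = [deltas[i:i + segment_len] for i in range(0, len(deltas), segment_len)]
    rng = random.Random(seed)
    rng.shuffle(segments)
    return [d for segment in segments for d in segment]


def augment(deltas, segment_len, n_copies, seed=0):
    """Original recording followed by n_copies reshuffled versions of it."""
    out = list(deltas)
    for k in range(n_copies):
        out.extend(shuffle_segments(deltas, segment_len, seed=seed + k))
    return out


# Hypothetical recording: quarter (500 ms) and eighth (250 ms) notes.
original = [250, 250, 250, 250, 500, 500] * 8
augmented = augment(original, segment_len=12, n_copies=3)
print(len(original), "->", len(augmented))
```

Every reshuffled copy contains exactly the same deltas as the original, so only the orderings at the segment joins are new. Whether that helps or just teaches the network the joins is something to test.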

I vaguely remember reading about something like this when you have a small initial training set, but don’t know how big is needed/required here for this to work properly.

this is very interesting! I’ve been working on something related, but from a different angle so I’m not sure how helpful it’d be.

i made a Max object that listens to onset timings and translates them (using RNNs) to onset timings applied to a midi score. I use it to “spice up” midi drums in response to my guitar

short paper . very short paper w/ demo vid

for version 2 of the project i jumped on the Transformer bandwagon :) this should allow me not only to predict onsets, but to decouple the score and generate beats based on the live input + history of what it’s played before (sort of like GPT does continuation…)

the code is private right now pending double-blind review, but I should be able to share something soon

looking at the above, i too am interested in

  • working on multiple timescales. My understanding of Transformers is that Attention (when it’s large enough) makes this happen automagically, e.g. GPT maintaining coherence over sentences, paragraphs, pages etc. I have yet to study this in depth on time series of timing deltas
  • multidimensional inputs. I use a feature vector that also encodes stuff like time signature, and relative position in the beat. Of course if you want to stay free/agnostic then this kind of info constrains you, but it should aid with model stability. Again, something I haven’t thoroughly studied/tested yet. (don’t tell my reviewers)

finally, something that Transformers and standard MLPs have in common (as opposed to RNNs for time, or CNNs for spatial) is they’re unaware of the position of the input data in the sequence. My impression is that you don’t handle this explicitly in your system… you just expect the network to understand that the 16 inputs are chronological, but I think MLPs are not great at this. Transformers use positional encoding for this purpose.
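For reference, here is a short Python sketch of the standard sinusoidal positional encoding used in the original Transformer: even dimensions use a sine and odd dimensions a cosine, at wavelengths that grow with the dimension index. The choice of 16 positions and 4 dimensions is only an example, and how best to combine such a table with a flattened chunk of deltas going into an MLP is an open question, not something this sketch settles.

```python
import math


def positional_encoding(n_positions, d_model):
    """Sinusoidal table: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# One row per slot in a 16-delta chunk, 4 encoding dimensions per slot.
pe = positional_encoding(n_positions=16, d_model=4)
print(pe[0])
print(pe[1])
```

Each row is a fixed "address" for one slot in the chunk, so the same delta value looks different to the network depending on where in the chunk it sits.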

hope I haven’t confused matters even more with this post…


That is super cool!

Heh, when I first watched the video I thought what was happening was generative and was blown away by the stylistic interpretation!

I imagine Transformers are where it’s at here, and that’s actually something @balintlaczko mentioned too. Unfortunately that’s significantly beyond my understanding, and consequently, skillset, to implement. Especially since I’d like to implement something dependency free, which in this case limits me to FluCoMa stuff (MLP being the closest/best example (as far as I understand it)) or fully vanilla/native Max.

It’s a shame that most heavy/proper/chunky(/modern) ML stuff happens in Python, or specifically, outside of Max(/gen~).

That being said, I think that having some larger-scale structures emerge automagically would obviously be ideal, though I’m not entirely sure how to go about it when that’s not the case. Looking back at some notes I made when talking to @weefuzzy, he also suggested running a clustering algorithm on the deltas of onsets to reveal phrases and tendencies, which could be useful too.

I was remembering/realizing today that I’m already doing some stuff to derive secondary attributes from strings of deltas (which I use for mapping stuff elsewhere), which may also be good to bake in somehow.

I basically take a stream/history of onset deltas and compute a few derivative-like things from it:
[Image: Screenshot 2023-05-07 at 1.46.37 PM]

This is using a 7 value history (chosen arbitrarily) with “tempo” being a pseudo-tap-tempo algorithm where it’s the median of the series, “slope” is linear regression, and “variance” is standard deviation. So this encodes some kind of additional data about time/memory/history.
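Outside of Max, the three descriptors can be sketched in a few lines of Python. This is an approximation of the description above rather than the patch itself: the slope is a least-squares fit of delta against position in the history, and "variance" uses the population standard deviation, which is an assumption since the text doesn't say which form is used. The name `describe_history` and the example values are made up.

```python
import statistics


def describe_history(deltas):
    """Tempo (median), slope (least-squares fit over index) and spread (std dev) of a delta history."""
    n = len(deltas)  # needs at least 2 values for the slope
    mean_x = (n - 1) / 2
    mean_y = sum(deltas) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(deltas))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return {
        "tempo": statistics.median(deltas),
        "slope": numerator / denominator,
        "variance": statistics.pstdev(deltas),  # assumption: population std dev
    }


# Hypothetical 7-value history that slows down by 10 ms per onset.
history = [100, 110, 120, 130, 140, 150, 160]
features = describe_history(history)
print(features)
```

A positive slope means the deltas are getting longer (slowing down), a negative one means speeding up.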

I’m not entirely sure what I would do if I were to generate/regress new values from this though. But I could imagine regressing a series of delta values from an input of tempo/slope/variance or something like that (I may test this separately). I suppose that could then generate new series of deltas based on incoming tempo/slope/variance, and I can take the last entry (as above) as a delta value to generate new onsets with. More like an accompaniment-ish thing rather than generative-ish.

so I tried to run your patch and Max is spinning… as for regressing a time series, maybe we could find a way to implement @rvirmoors’s approach (or other time-aware NN) eventually. We just need motivated people to help implement it and find a good interface in the FluidVerse

Hmm, that’s odd.

Did you delete/reinstantiate the @balintlaczko abstraction? On mine it instacrashes if I don’t, but maybe it makes yours spin up.

In watching @brookt’s recent vids (this one specifically) it makes me think that some (simple) hidden markov stuff might be a better approach. I couldn’t find a native (e.g. non-external-based) implementation though, and I’m keen on not adding any additional dependencies to SP-Tools.

But yeah, having some proper time-aware network topologies would be ideal, as that’s a more modern interpretation of the idea/approach.

I do wonder how much of the “proper machine learning” stuff for rhythm generation leans towards the typical/awful models we hear from NSynth etc… Where it can generate all the generic 120bpm 4/4 dance music I can stomach, but is useless for anything else outside of that.

I did - it was written in great red and big letters :)

you know this doesn’t exist right? I am thinking of maybe doing the time warp that the wekinator does, but all of these are still quite hacky. @weefuzzy has a long long long term project of maybe implementing @Chriskiefer’s reservoir computing… but again, hard to control and definitely not a silver bullet…

and that leads to questions of datasets in 3D (how to write them and how to deal with them)

you know you can fake a ‘markov’-ish approach by regressing current state to next state, as I know you have tried. It is sadly dependent on timing, hence my slow readings on wekinator DTW…

as we saw in the example by @philippe.salembier in this thread there are other ways within the current codebase to think about time and tempo. I’m still trying to get my head around a clear use case for corpus manipulation and high-D gesture recognition and variation making… one day.

I am not sure about the markov model complexity you have to deal with, but if the number of states and the order are rather low, Max provides a solution based on a combination of “anal” and “prob”. This drawing is from the anal help patch.

But maybe you are already aware of this solution and it does not fit your needs.
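For anyone who wants to try the low-order Markov idea outside of Max first, here is a small Python sketch in the same spirit: count the transitions between states, then do a weighted random walk over the counts. It is not a description of how “anal” and “prob” work internally. The function names and the example deltas are made up, and real onset deltas would need quantising into a small set of states before this becomes useful.

```python
import random
from collections import Counter, defaultdict


def build_transitions(states):
    """First-order transition counts: how often each state is followed by each other state."""
    table = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        table[current][following] += 1
    return table


def walk(table, start, n_steps, seed=None):
    """Weighted random walk through the transition table."""
    rng = random.Random(seed)
    state = start
    out = []
    for _ in range(n_steps):
        choices = table.get(state)
        if not choices:  # dead end: this state was never followed by anything
            break
        state = rng.choices(list(choices.keys()), weights=list(choices.values()))[0]
        out.append(state)
    return out


# Hypothetical, already-quantised deltas in milliseconds.
deltas = [250, 250, 500, 250, 250, 500, 250, 250, 500]
table = build_transitions(deltas)
phrase = walk(table, start=250, n_steps=8, seed=1)
print(phrase)
```

With only two states and first-order memory this produces plausible note lengths but no phrase structure, which matches the concern that a low order may not be enough.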

I mean more like having some feedback in the network structure. Not so much that it would be (human) “time-aware”. Basically being able to encode temporal states in the network so it can hopefully do something.

The chunking above kind of forces that onto an MLP but the results don’t seem great (not that I’m certain I’ve implemented it well).

I was looking through some old patches I had saved for some 2nd order markov stuff, but I think that may not be enough to get meaningful (i.e. worth the effort to implement) results.

I guess something is better than nothing for testing purposes.

There was a potential intern that wanted to implement something this summer. We’ll see if that materialises and I’ll ask for volunteer testers :)
