Analysis "Pipeline"

Something else that came up during this week’s geek out is the idea of some kind of scriptable, or embeddable, “analysis pipeline”, where you have a sequence of analyses/stats/scalings → dataset processes/reductions which can be applied to any given corpus (file/folders/whatever), and that can be baked into a metadata-esque file such that it can be recalled and reused, with whatever relevant concessions for realtime use being made (threading modes etc…).
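Something like this, as a very rough sketch (none of these names are real FluCoMa API; it’s just an illustration of the bake-and-recall idea in Python/JSON):

```python
import json

# Hypothetical metadata file describing an analysis pipeline: a sequence of
# steps plus hints for realtime use. All names here are made up for
# illustration, not FluCoMa messages/attributes.
pipeline = {
    "source": "corpus/drums",              # file/folder the analysis ran over
    "steps": [
        {"process": "mfcc", "numcoeffs": 13},
        {"process": "stats", "select": ["mean", "std"]},
        {"process": "normalize"},
        {"process": "pca", "numdimensions": 2},
    ],
    "realtime": {"threading": "non-blocking"},  # concessions for realtime use
}

# Bake the recipe to disk...
with open("pipeline.json", "w") as f:
    json.dump(pipeline, f, indent=2)

# ...and recall it later to re-run the same analysis on another corpus.
with open("pipeline.json") as f:
    recalled = json.load(f)
```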

That would reduce a ton of headaches when building patches, moving between corpora, and/or changing your analysis/analyses. It would actually be super handy for exploring very niche analysis settings that suit and/or are optimized for a specific kind of material. It would mean that you could swap out (realtime) analysis/matching “pipelines” with each (offline) analysis/corpus, and not need to worry about what goes where.

There’s obviously some complex interface things here (some of which I’ll brainstorm below), but it would be cool to have some discussion about this as @tedmoore mentioned he was building (or thinking about building) a similar thing for SC.

///////////////////////////////////////////////////////////////////////

Where I see this becoming particularly complicated is when you have multiple timeframes on one or both ends, specifically if it’s asymmetrical, as would be the case for my main use case (e.g. multiple “offline” analysis windows with only one realtime analysis workflow). For my purposes I’m intending to use the first/fast offline pipeline, but that may not always be the case.

There’s also potential interface friction with regards to normalization/scaling/overlapping where you may want to normalize/standardize on one side, but not the other, or apply a subsequent bit of scaling to further transform a 2D/3D space.

That could potentially be a post-processing step once a given pipeline is set up, but that could end up back at square one if you have to prune/unpack a dataset to scale different bits differently, then put them back together, with a unique version of this for each pipeline, etc…
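(For what it’s worth, that prune/scale/reassemble dance is roughly what sklearn’s ColumnTransformer automates: scale some columns, pass others through untouched, all in one step. A minimal sketch, with arbitrary placeholder column indices and stand-in random data:)

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = np.random.rand(100, 5)  # stand-in for a flattened descriptor dataset

# Standardize some columns, leave the rest untouched -- no manual
# pruning/unpacking and re-merging of the dataset required.
split_scaling = ColumnTransformer([
    ("standardized", StandardScaler(), [0, 1, 2]),
    ("raw", "passthrough", [3, 4]),
])

scaled = split_scaling.fit_transform(data)

# A subsequent post-processing step: squash a 2D slice into 0..1
xy = MinMaxScaler().fit_transform(scaled[:, :2])
```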

So yeah, just wanted to make a thread about this as there’s some cool stuff to think about here.

Yup, @groma realised this almost immediately and it’s somewhere on my horizon list. I’m delighted if people start hacking at this in the meantime though, because – as you say – the interface questions are complex.

Our thinking was that pipelines are more useful once you’ve got a handle on what it is you’re doing, so they need to exist as well as (or on top of) discrete objects, rather than instead of them. Also, by their nature pipelines would abstract out the most common workflows, trading some flexibility for added convenience, so keeping discrete objects retains space for doing wonky stuff.

There are a few precedents to go on. pipo gets a lot right in Max, for my money, and sklearn has its own pipeline stuff that looks quite useful, especially once you start getting into heavier validation of learned models, parameter searches and what have you.
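For reference, the sklearn idiom looks like this: declare the chain once, fit it on a corpus, then reuse the learned chain on new points (real sklearn API, with stand-in random data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Chain of processors: standardize, then reduce to 2D
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=2)),
])

corpus_features = np.random.rand(200, 24)   # stand-in per-slice descriptors
embedding = pipe.fit_transform(corpus_features)

# Later: push a single new analysis frame through the same learned chain
new_point = pipe.transform(np.random.rand(1, 24))
```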


Bah.

I was hoping the final release would be a single object: fluid~, which can take an absolutely crushing number of messages, and does everything. Including spawning discrete CCEs, each running in optimized OSs.

I’m not too familiar with pipeline stuff, but I imagine it’s something that “people have worked on before”. I guess it’s just a matter of balancing flexibility and tailorability as well.

fluid~ seems a bit long, tbh. I think just f should do it.

I think it’s fair to say that pipeline architectures are an almost bottomless rabbit hole, because it all leads in the direction of something that starts to behave like an embedded language within the larger host environment. So the focus needs to be on identifying where it can help most (‘value added’ as the suity people might say) and avoiding ways in which it could get in the way.

What I like about pipo is the simplicity of specifying an analysis chain. However, it doesn’t (as far as I’m aware) allow for gnarly stuff like branching and merging. Where it makes me sad is that it can make exploratory tweaking within a process harder / more opaque because there can end up being so many attributes to comb through etc.

IAC, a good place to start for us might well be being able to capture settings from objects: our TB2 models dump their state (which is obviously useful), but being able to grab / bulk-set parameters from non-realtime processors as well would provide some important scaffolding for beginning to experiment with this.
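sklearn is again a useful reference point for what grab / bulk-set could look like; its get_params / set_params pair does exactly this (real sklearn API, not anything of ours):

```python
from sklearn.decomposition import PCA

model = PCA(n_components=2)
settings = model.get_params()   # capture: {'n_components': 2, ...}

clone = PCA()
clone.set_params(**settings)    # bulk-set: restore onto a fresh object
```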


what about a script that loads hidden unnamed pipelines in the background from what your Alexa has recorded you thinking about out loud? Very transparent interface, no?

indeed this is the thing we have been trying to avoid…

nor quick-ish, dirty, higher-level prototyping… or maybe it does but I don’t know how.

that sounds so promising!

ps I like your idea of ‘holidays’


I didn’t get very far…
https://github.com/tedmoore/FluCoMa-stuff/tree/master/FluidHandlers
because I quickly realized that I was facing this:

I don’t know that I’ll be coming back to this repo soon, but long term, yes, as everyone is saying, the pipeline paradigm will be awesome to have included.


This is essentially what ftis does and then some (by allowing you to plug any sklearn object in there too).

It supports

  1. Automatic metadata creation
  2. Programmatic alteration of pipelines
  3. Arbitrary branching and merging of processes
  4. Multiprocessor-optimised task management, sometimes giving you 50% speed-ups (especially on lots of small tasks)
  5. The chain is just a script, so it can be loaded that way, or you can load an external metadata object to recreate the chain from a previous process
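To illustrate point 5 in spirit only (a hypothetical sketch, not ftis’s actual API): the chain is plain Python, and a metadata dump from a previous run can rebuild it:

```python
import json

class Chain:
    """Hypothetical chain-as-script: steps live in code, not a config file."""
    def __init__(self, *steps):
        self.steps = list(steps)  # each step: (name, settings) pair

    def metadata(self):
        # Serialize the chain so a previous run can be recreated later
        return json.dumps(self.steps)

    @classmethod
    def from_metadata(cls, blob):
        # Rebuild the same chain from a metadata dump
        return cls(*[tuple(step) for step in json.loads(blob)])

chain = Chain(("mfcc", {"numcoeffs": 13}), ("umap", {"numdimensions": 2}))
rebuilt = Chain.from_metadata(chain.metadata())
```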

(Oops pressed enter…)

I think Owen’s interface concern of a DSL emerging is really something I grappled with. ftis began as a DSL much like YAML, with that file parsed into Python objects under the hood. What I learned was that the closer I made ftis to Python, using the constructs of the environment I was in, the more powerful it became at adapting in the moment while offering the framework I was wanting in the first place.


Yup, every bit of that sounds fantastic.

We’ve not spoken about this for ages, but do you transform between modalities here? As in, having a pipeline that does oodles of stuff (as above), which then also generates a realtime analysis equivalent?

There is no real-time equivalent for Python, and certainly not something I care about. It wouldn’t be hard to serialise a FTIS chain into some equivalent elsewhere, on the presumption an analogue exists (:


That’s part of the issue, though not a big one, as most of the backend needs to be swapped out for transformpoint-ing individual buffers as opposed to fittransform-ing entire datasets. Not a massive thing, but I guess a big divergence when it comes to the post-descriptor/stats step.
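In sklearn terms that split looks like this: fit (and transform) the whole dataset offline, then transform single points as they arrive, which is loosely analogous to fittransform vs. transformpoint (a sketch with stand-in data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
offline = scaler.fit_transform(np.random.rand(500, 13))  # whole corpus, once

incoming_frame = np.random.rand(1, 13)     # one realtime analysis frame
mapped = scaler.transform(incoming_frame)  # per-point, using learned state
```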