Continuous Training in the Cloud

Dear FluCoMa community,

I’m posting to describe a project I’m working on in collaboration with @amgum. We want to “productionize” an ML workflow using the FluCoMa libraries, i.e. create a continuous training pipeline that runs in the cloud. The goal is to provide a recipe, including the infrastructure, that others can share and re-use for research purposes. It’s early, but I think this workflow will involve analyzing a corpus of sound in Google Colab and then providing a proof-of-concept for “shipping it” on a platform like Vertex AI (uploading new sounds triggers the training workflow). The full pipeline is still somewhat speculative, and we’re reducing its scope to keep it realistic and feasible: a small-data version of something that could scale to big data. We would probably use @jamesbradbury’s Python bindings to bring FluCoMa into the more traditional Python-based data science ecosystem. We could use some help with the technical scoping.
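To make the corpus-analysis stage concrete, here is a minimal sketch of what the Colab step might look like. Everything in it is an assumption, not part of the actual project: it uses a numpy-based spectral centroid as a stand-in where the real pipeline would call FluCoMa descriptors through the Python bindings, and it writes the result in the JSON shape that, as I understand it, `fluid.dataset~` uses (`{"cols": n, "data": {id: [...]}}`).

```python
# Sketch of the corpus-analysis stage. Assumptions: in-memory mono signals
# stand in for audio files, and an FFT-based spectral centroid stands in
# for a FluCoMa descriptor computed via the python-flucoma bindings.
import json
import numpy as np

def spectral_centroid(signal, sr=44100):
    """Mean spectral centroid in Hz (stand-in for a FluCoMa descriptor)."""
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    if mags.sum() == 0:
        return 0.0
    return float((freqs * mags).sum() / mags.sum())

def analyse_corpus(sounds, sr=44100):
    """sounds: dict of {identifier: numpy array}. Returns a dataset dict
    in the JSON shape used by fluid.dataset~."""
    data = {name: [spectral_centroid(sig, sr)] for name, sig in sounds.items()}
    return {"cols": 1, "data": data}

# Example: two synthetic sine tones instead of real recordings.
sr = 44100
t = np.arange(sr) / sr
corpus = {
    "low.wav": np.sin(2 * np.pi * 220 * t),
    "high.wav": np.sin(2 * np.pi * 2200 * t),
}
dataset = analyse_corpus(corpus, sr)
print(json.dumps(dataset)[:60])
```

In a real run, the loop body would be replaced by FluCoMa analysis and the dict keys by file paths in a bucket; the JSON export is the hand-off point to the training stage.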

@amgum is a data scientist while I work in DevOps by day, and we’re collaborating within the structure of a professional peer mentorship. We want our work to serve as an example to others. In addition to sharing the notebooks and infra code, part of this project is to reflect on and showcase what makes, for us, a successful collaboration between Data Science and DevOps. @weefuzzy shared this video with me, which might serve as orientation.

This is a learning experiment for both of us. I’ve never done full-blown MLOps and @amgum has mainly worked with financial data and things like that, not spectral time series data. The specific inspiration for this was Alice Eldridge’s demonstration of her analysis of rainforest sounds.

I hope that’s enough context. I’m eager to share this with the FluCoMa community and hope we can get your support and encouragement.

Right now, specifically, we’re looking for public datasets (on Kaggle or elsewhere) of both sounds and derived spectral data from the natural environment, to get our bearings on the data itself. If anybody has pointers, would like to help, or wants to know more, please get in touch in the thread or by DM.

3 Likes

This is great news. @tedmoore has done some back-and-forth between Python and FluCoMa, and @rodrigo.constanzo and Jordi Shier have something very cool coming very soon: using PyTorch to optimise and train networks, from FluCoMa-made datasets and towards FluCoMa MLPs.

Looking forward to hearing/seeing what is going to happen and thanks for sharing!

2 Likes

I remember, towards the start of FluCoMa, that either you (@tremblap) or maybe Hans (@tutschku) was really keen on the idea of something like this that kept an up-to-date analysis of all your audio samples. Each time you added to it, everything would get re-analyzed automatically overnight and be available the next day.

I was reminded of the usefulness of that again with this post.

2 Likes

Would Node bindings be useful? Or would you rather keep it in Python? I ask because I am working on native bindings for both, and I am learning a lot about pybind / napi in the process, but it’s probably too much effort to maintain both. I wonder which would be most useful, because it would be cool to support this project.

1 Like

See my comment here:

Could be a fun “proof of concept” for making lower-level language bindings. A small daemon that runs and is pointed to various folders, constantly updating a database which you can export to a dataset at any time.
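The daemon idea above could be sketched in miniature: a polling indexer that keeps a SQLite table of per-file features and re-analyses only files whose modification time has changed since the last scan. The `analyse()` stub here is purely hypothetical (it just records file size); a real version would call FluCoMa through the CLI tools or the Python bindings, and the exported dict follows the `fluid.dataset~`-style JSON shape as I understand it.

```python
# Minimal sketch of the folder-watching daemon idea: index files in SQLite,
# re-analyse only new or modified ones, export a dataset at any time.
import json
import os
import sqlite3

def analyse(path):
    # Hypothetical placeholder feature (file size in bytes); a real daemon
    # would run FluCoMa analysis here instead.
    return [float(os.path.getsize(path))]

class CorpusIndex:
    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS files "
            "(path TEXT PRIMARY KEY, mtime REAL, features TEXT)")

    def scan(self, folder):
        """Analyse new or modified files; return how many were (re)analysed."""
        updated = 0
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            mtime = os.path.getmtime(path)
            row = self.db.execute(
                "SELECT mtime FROM files WHERE path = ?", (path,)).fetchone()
            if row is None or row[0] != mtime:
                self.db.execute(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                    (path, mtime, json.dumps(analyse(path))))
                updated += 1
        self.db.commit()
        return updated

    def export(self):
        """Export the index as a fluid.dataset~-style JSON dict."""
        rows = self.db.execute("SELECT path, features FROM files").fetchall()
        data = {os.path.basename(p): json.loads(f) for p, f in rows}
        cols = len(next(iter(data.values()))) if data else 0
        return {"cols": cols, "data": data}
```

Running `scan()` on a schedule (or from a filesystem watcher) gives the overnight re-analysis behaviour described earlier, with `export()` as the on-demand hand-off to training.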

2 Likes

At the moment I would only use the Python bindings, and I think there would be more users of those. What do you have in mind for the Node bindings? Embedding FluCoMa stuff in websites?

Is what you’re working on a rework of the CLI tools, or also FluCoMa core?

I really like your project suggestion.