A fluid.datasetfilter~ object for conditional querying

The (in)ability to do some kind of conditional, hierarchical, or biased querying once you get into a search space has been discussed a couple of times here (largely by me!, but also by others in relation to the recent AudioGuide discussions).

I had done some early testing with fluid.datasetquery~, which is definitely a useful object, but not for this specific application (since it duplicates the data, and is kind of “slow”).

So what I’m proposing is an object along the lines of fluid.datasetfilter~ which would let you apply conditional queries to columns with an interface similar to @a.harker’s entrymatcher (i.e. 2 > -35 5 == 76.5 etc…). So “generic” conditional filtering, along with some boolean stuff (stuff between this range AND this range OR this), which is very much in line with @tutschku’s interests as he explains in his great videos.
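To make the interface idea concrete, here's a rough Python sketch of that kind of triplet-based conditional filtering over columns (the numpy-array dataset, the `filter_rows` helper, and the operator table are all illustrative assumptions, not anything in FluCoMa or entrymatcher):

```python
import operator
import numpy as np

# Hypothetical sketch: apply entrymatcher-style "column op value"
# triplets as a boolean mask over a dataset held as a 2D numpy array.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq,
       ">=": operator.ge, "<=": operator.le, "!=": operator.ne}

def filter_rows(data, *triplets):
    """triplets: flat sequence like (2, ">", -35, 5, "==", 76.5).
    Conditions are ANDed, mirroring a chained filter message."""
    mask = np.ones(len(data), dtype=bool)
    for col, op, value in zip(triplets[0::3], triplets[1::3], triplets[2::3]):
        mask &= OPS[op](data[:, col], value)
    return np.flatnonzero(mask)  # indices of the surviving entries

data = np.array([[-20.0, 100.0],
                 [-40.0,  76.5],
                 [-30.0,  76.5]])
print(filter_rows(data, 0, ">", -35, 1, "==", 76.5))  # → [2]
```

The point being that the mask is cheap to compute per query, so no copy of the dataset is needed for each condition.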

Some of this is currently possible with fluid.datasetquery~ but the interface and paradigm there is more about creating new fluid.dataset~s for further use and manipulation elsewhere. Not directly as a part of a query of some sort.

What I’m proposing here would be different in that it wouldn’t produce a new fluid.dataset~ (as such) but would apply these as modifiers to a subsequent query (knearest etc…). I’m not 100% on what this would mean interface wise, but I would imagine that avoiding copying over a large database per conditional step would be ideal.

Where this starts getting a bit sticky is that if you have a large dimensional space (>100), knowing what is in any given column is… tricky, and super error-prone. @a.harker gets around this by having symbols for each column and you just query based on a symbol, without needing to know what column it corresponds to in the database.

Interface-wise, I think that ship has sailed for fluid.dataset~s and buffer~s BUT given some of the discussions at the last plenary about having a more robust data structure, perhaps this object could live nicely with a revamped fluid.labelset~ which would let you add an individual label per column (if so desired). (so any given label would be an arbitrary array of strings)

So for the sake of simplicity, you could potentially address indices (the @tremblap way) and produce a query message like 2 > -35 5 == 76.5 OR if you point fluid.datasetfilter~ to a fluid.labelset~ you could instead create a message like loudness > -35 centroid == 76.5, and then the links/reference would happen automatically for you.

This is already possible with fluid.datasetquery~, with the or and and messages to tweak the filter. Or maybe I don't understand your idea.

That duplicates the dataset every time, so it’s very slow, particularly if you want to chain queries or have complex queries. And this has to happen per query as well.

So it seems to be more about splitting datasets for other purposes, not as part of querying (pre-knn or whatever).

Essentially something like entrymatcher, but for fluid.dataset~s.

Make a single message; that will do it in one pass and should be much faster.

I’ll put a ticket up for speed though.



Though there should probably be some kind of differentiation between duplication via query vs filtering via query (which is more what I’m talking about here).

Fundamentally these are the same thing: you either touch the original data, or you don't. If you filter in place you don't copy; otherwise you have to copy. You can optimise a bit, though.

Don’t the existing objects copy stuff when you do things “in place”?

But that’s what I mean, just filtering as opposed to copying, which if it’s a big enough dataset will be slow.

But to a certain extent I’m also thinking interface here, where querying and all that would be part of the process itself, rather than tagged at the end.

This is what I say: you can't just filter in place without destroying your dataset; otherwise you have to copy your output somewhere. If your query is tight, then the number of items to copy will be small… that is, if we don't copy everything beforehand, or lock the source, either of which has to be done for multithreading, as you know. So it's not simple, and has to be pushed forward carefully.

Curious what @a.harker does in entrymatcher when you filter something because I can’t imagine it duplicates the dataset at each step of a query.

I don’t search the same way - I search in a slow way with some optimisations to make it fast with several search criteria - the approach in flucoma is totally different, in which data is structured for fast searching, but without the filtering options.


As in, the flucoma approach precludes filtering with its structure, or it’s not implemented in a way that presently allows it?

I'm not knowledgeable enough about exactly what happens in these cases, but basically for entrymatcher I make sure the data is stored in a fast-access format (consecutive in memory) and try to optimise the searching based on a fundamentally high-complexity algorithm (linear search).

I believe the structures for flucoma keep the data structured differently for low complexity searching. In all cases of data/searching etc. you can optimise for different things, but not everything:

random access/iteration/insertion + removal/filtering etc.

In reality there is always an interaction between algorithmic complexity (measured in how the speed responds to changes in the size of the input - https://en.wikipedia.org/wiki/Big_O_notation) and the constant-time multipliers for different elements of a task. ML prefers low algorithmic complexity, because if you have a lot of data that scales a lot better. For musical use, "big" is often "small" in a data science sense, so a higher-complexity algorithm may not actually win on speed. As always, testing is the best way to find out.
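For the sake of illustration, here's what that linear search amounts to in a few lines of numpy (the data layout and sizes are made up; the point is just that it's one contiguous O(n) pass, which for musically small datasets can be perfectly fast, and trivially admits arbitrary filtering):

```python
import numpy as np

# Sketch: entrymatcher-style linear search (O(n) per query) over data
# stored contiguously in memory. No tree, no fitting step: every query
# scans all rows, which also means filters can change freely per query.
rng = np.random.default_rng(0)
data = rng.random((1000, 8))   # 1000 entries, 8 descriptor columns
query = rng.random(8)

dists = np.linalg.norm(data - query, axis=1)  # one pass over all rows
nearest = int(np.argmin(dists))
```

A kd-tree instead pays an up-front fitting cost to make each subsequent query sub-linear, which is the trade-off discussed above.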


After the discussion today (great to re-geek after a long break), I got to thinking about some more use cases for something like this.

Or rather, highlighting ways I presently use entrymatcher and would like to use in fluid.stuff~, as well as ways that would be great to use in the future.

1 - “continuous” filtering of a dataset

This is the most immediate use case, as I'm presently doing stuff like this now: taking any descriptor (say loudness, or a combination of multiple time-scale envelope followers, etc…) and using that to scale or filter a query.

This is something @tremblap suggested ages ago, where depending on how long/busy/whatever I’m playing, I filter the query accordingly. So if I’m playing loudly (as in, either this individual analysis frame has a high loudness and/or a slower envelope follower is above a threshold), I want to select only entries that then meet some criteria. Say, duration > 500 or centroid < 50, etc…

The point here being that once you have more than a single parameter on the input and/or that you want to filter by, it becomes impractical to create individual fluid.dataset~s for each of these. Not to mention a reduction in resolution if you have a finite amount of combinations.
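A rough sketch of what I mean, in Python for concreteness (the column layout, thresholds, and the `playing_loud` flag are all illustrative assumptions; the idea is that the condition is just a mask computed per query, not a separate pre-built dataset per combination):

```python
import numpy as np

# Hypothetical sketch: a per-query filter driven by live playing state
# (e.g. an envelope follower above a threshold), applied as a boolean
# mask before the nearest-neighbour lookup, so no dataset copies are
# made per condition. Assumed columns: 0 = duration (ms), 1 = centroid.
def conditional_nearest(data, query, playing_loud):
    mask = data[:, 0] > 500 if playing_loud else data[:, 1] < 50
    candidates = np.flatnonzero(mask)
    if candidates.size == 0:
        return None                      # nothing meets the criteria
    d = np.linalg.norm(data[candidates] - query, axis=1)
    return int(candidates[np.argmin(d)])  # index into the full dataset

data = np.array([[600.0, 40.0],
                 [100.0, 30.0],
                 [700.0, 60.0]])
```

With this shape you can combine any number of input parameters into the mask without the combinatorial explosion of pre-split datasets.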

2 - manipulation of the queryable space

This may be solved with the radius feature for fluid.kdtree~, but based on some of the ideas in the corporeal morphology thread, being able to dynamically change the pool of available entries that are queried from would be a super powerful thing.

Radius could kind of tackle that, but for unevenly distributed data points it could mean the difference between getting no matches and getting all the matches, rather than using something like a percentage or whatever. Perhaps we will have some algorithms that can evenly space the data out, but as far as I know that's not presently the case(?).

So here, it would be something like what @spluta is doing with his joystick. Where he could potentially move around and scale the available points which the incoming audio will then query from.

3 - conditional querying based on non-descriptor-space data points

This is similar to #1 but I guess the main difference being the ability to have points in a fluid.dataset~ that can be excluded from direct querying.

So this could perhaps happen using fluid.datasetquery~, where you create versions of each fluid.dataset~ for kdtree-able data and others for meta-data purposes, but this starts to get really clunky really quickly, and you end up with "the buffer problem", where you now have dozens of fluid.dataset~s you need to manage.

A use case here would be the conditional pitch confidence biasing that @tremblap mentioned. If your confidence is above (or below) a certain point, use these data points, or those data points. Again, possible with two fluid.dataset~s (for that particular example), but if you start making a few more conditionals, or nested conditionals, the interface starts becoming a messy spiderweb of unpleasant data management.

For something like this, I picture something where at the querying stage, you can specify what columns (or whatever) you want to search from, as well as just having fields that are contextually queryable but not distance queryable (things like duration, or binary flags (e.g. “this is an attack”)), etc…).
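Again sketching in Python to be concrete (the column layout, the names, and the helper are all made up for illustration): metadata fields gate which rows are eligible, but only the descriptor columns feed the distance.

```python
import numpy as np

# Sketch of "contextually queryable but not distance queryable" fields:
# the distance uses only the descriptor columns, while metadata columns
# (duration, a binary is-attack flag) just gate which rows are eligible.
# The column layout here is an assumption for illustration.
LOUDNESS, CENTROID, DURATION, IS_ATTACK = 0, 1, 2, 3

def nearest_attack(data, query, min_dur):
    eligible = (data[:, IS_ATTACK] == 1) & (data[:, DURATION] > min_dur)
    rows = np.flatnonzero(eligible)
    if rows.size == 0:
        return None
    d = np.linalg.norm(data[rows][:, [LOUDNESS, CENTROID]] - query, axis=1)
    return int(rows[np.argmin(d)])

data = np.array([[-20.0, 60.0, 800.0, 1.0],
                 [-21.0, 61.0, 100.0, 1.0],
                 [-20.0, 60.0, 900.0, 0.0]])
```

This avoids the skewing problem of mixing un-normalized metadata (duration in ms, flags) into the distance itself.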

4 - variable weighting between time series

This relates to the LPT idea, as well as the “time travel” stuff I’ve been working on, where you may have a fluid.dataset~ that contains an attack, the-bit-after-the-attack, and the sustain (or whatever). In LPT all of these are weighted equally, but it’s not difficult to picture a situation where you may care more about the sharp attack and want to bias that in your query.

Or in the case of the time travel idea, I have a real set of descriptors for samples 0-256, and then a predicted set of descriptors for samples 257-4410. I want to take the latter into consideration, but I don’t want it to have the same weight as the real descriptors.

Now, from what @tremblap hinted at, it will soon be possible to create buffer~s which you can scale and transform, which is fantastic. So this could potentially solve the problem for a static version of this, where I can just apply a scaling to each buffer~ before it's passed to a fluid.dataset~, and can even do it dynamically for new incoming points. But what I can't do is apply this to a fixed fluid.dataset~ (as far as I was able to tell from @tremblap's description).

Say, I have a huge dataset of samples that are all preanalyzed and I want to be able to bias the query to put more emphasis on pitch, or MFCCs, or loudness.
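What I'm imagining amounts to weighting columns at query time rather than rescaling the stored data. A minimal Python sketch, assuming the data sits in a numpy array and using a hypothetical `weighted_nearest` helper:

```python
import numpy as np

# Sketch: bias a query by weighting descriptor columns per query,
# rather than rescaling (and re-storing) the dataset itself. The
# weights could emphasise e.g. pitch, MFCC, or loudness columns.
def weighted_nearest(data, query, weights):
    diff = (data - query) * weights       # per-column emphasis
    return int(np.argmin(np.linalg.norm(diff, axis=1)))

data = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
query = np.array([0.4, 0.3])
```

With equal weights the second row wins here, but weighting column 0 four-fold flips the match to the first row: same stored data, different bias per query.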


So all of that is to say that some of these “query” things that have been possible for ages with entrymatcher and coll even are super powerful, so it would be a shame to have to pick between doing complex querying or having complex structuring (via fluid.stuff~).

Conditional Querying
Maybe I’m missing something, but isn’t this doable by clearing and establishing a new query with fluid.datasetquery depending on external conditions? e.g.

[some condition test] 
[sel 0 1                                ] 
|                                       |
(clear, addFilter 0 > 500, and 1 < 50) (clear, addFilter 2 != 3.14, or 3 == 42)
| ______________________________________| 

Query-able Space

Not presently. I guess you could try using a Self Organising Map from another package (like ml.star).

Conditional querying based on non-descriptor-space data points
This one feels a bit more abstract to me, but I’m not completely grasping how you’d have multiple query-able spaces without multiple KDTrees (as these need to be fitted). Or are you thinking more about fluid.datasetquery? (in which case, there’s a new message added to the forthcoming release that might help).

If you’re really ending up with literally dozens of dataset objects, it would be interesting to see concretely to help us think about possible future interface enhancements.

Variable Weighting
We're adding a weighting mechanism to the stats object, so that could be applied as you build up a corpus, but we don't currently have any concrete plans to produce a set of dataset mutators. Again, I'm not clear on whether you're querying via a tree or via a query object in this scenario. In a tree (which is immutable once fitted), you'd need separate instances.

And, yes, there are two simple objects forthcoming to scale and gate buffer~s, to help with preparing weighting vectors.

The interface for this starts getting messy if you have queries that change a lot and want to pass on specific columns/ranges accordingly. Possible, but squirrelly. It also lacks some of entrymatcher's sexy distance-based matching.

The main issue for this approach for me, at the moment, is that this process is really slow. I imagine this will improve once optimizations take place, but I have a hunch that creating multiple copies of a fluid.dataset~, and fit-ing multiple fluid.kdtree~s will always be “slow” (as in >20ms).

I haven’t fully made sense of this yet as this kind of paradigm is new to me, but I’m mainly thinking for metadata-esque stuff (e.g. duration, amount-of-attacks), where I’d want to query using this, but not use it for any distance calculations. I guess this is possible now, but would fundamentally (if I understand correctly) require a fluid.kdtree~ to be filled and fit… per query.

I think most of the use cases would be covered by scaling pre-fluid.dataset~-ing, but there are some cases (one outlined above) where that wouldn’t be the case. I guess I could dump/iterate, scale each buffer~, then dump/iterate back in, then query/filter, then fluid.kdtree~…again, per query, which probably gets “too slow” for even non-Rod standards.

Actually, as a more specific sub-question. I take it this is a technical property of a kdtree, that once it is fit, it can’t be altered or filtered or anything, without recomputing a new fit?

If so, that kind of solves 90% of my problems, in a "you have to keep using entrymatcher" kind of way. My understanding of kdtree stuff is that the fit is “slow”, so that your querying can be fast. A benefit that is lost if you have to fit-per-query.

In that patch around half the time is taken up by the dump, so I guess it hinges on whether you really need that.

Yes. They trade off mutability for some simplicity, making them relatively simple to implement and good enough for many purposes.

I wonder if there’s a way of reformulating what you want to do in terms of first querying a fixed all-in kdtree, and then filtering the results of that based on more flexible criteria? That would avoid re-fitting ( expensive, always will be), but also yields a smaller set to filter.
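Sketching that reformulation in Python (brute-force knn stands in for a fitted kd-tree, and `meta` is an assumed per-entry metadata column such as duration; all names here are illustrative):

```python
import numpy as np

# Sketch of "query first, filter second": take the k nearest from the
# full, fixed search structure, then apply the flexible criteria to
# that small result set. No re-fitting is ever needed.
def knn(data, query, k):
    d = np.linalg.norm(data - query, axis=1)
    return np.argsort(d)[:k]

def query_then_filter(data, meta, query, k, min_dur):
    hits = knn(data, query, k)                  # fast: structure is fixed
    return hits[meta[hits] > min_dur].tolist()  # cheap: only k items

data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.2, 0.0]])
meta = np.array([100.0, 600.0, 700.0, 100.0])
```

The caveat is that you'd want k somewhat larger than the number of matches you actually need, since filtering can discard some of the hits.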

That's exactly where seeing some abstractions developed to cope with the mess is helpful. E.g., I could imagine storing prototype query messages in a coll.


One of Max’s greatest problems is that it tends towards mess very quickly, so any coping strategies are always interesting to see

It still ends up "slow", but yeah, the dump sure doesn't help there.

This is exactly it. In my head I’m in a “single big thing which I then filter down what I want” place, but that doesn’t seem to fit with the current tools/paradigm. So I guess, a good bit of understanding for me here is, that fit-ing a kdtree as part of a query is a non-starter. Meaning, any manipulations need to happen at a tree level. Meaning, an object like fluid.datasetfilter~ wouldn’t really solve the problem (or at least all of it).

I get confused very quickly with this kind of thing, due to my incomplete understanding of kdtree stuff, but if I have stuff in the kdtree that's metadata-esque (e.g. all the "real data" is normalized or standardized, and then duration is in ms, and it computes distances with that in the mix), wouldn't that skew things weirdly?

But if there’s a way to have a larger dataset/tree which can be whittled down, I’m all about it!

This isn’t the end of the world, as I kind of ‘list manage’ queries at the moment anyways. So it would just be part of that process.

Not really the main point of this thread, but I wanted to bump this aspect of this thread now that fluid.datasetquery~ is getting beefed up. In a lot of examples in the latest alpha (06) there are sections of patches that look as follows:
clear, addrange 0 24, addrange 84 24, transform example11.mfcc example11.mfcc.meanstd

or this:
clear, filter 11 > 0.7, addrange 0 4, transform example11.tmp example11.composite

Where individual columns are being moved around, filtered by, and/or queried, without any (built-in) way to know what’s in what. At every point you need to know what’s in these columns, what range they may be in, etc…

Before, I guess the idea was “flatten everything into a dataset and it doesn’t matter what is what anymore”, but now fluid.dataset~ appears to be a “working” data type where transformations are being applied to it.

So I’m bumping/suggesting a way to label everything in a fluid.dataset~ (with fluid.labelset~ or a variation thereof), which would then hopefully allow things like:
clear, filter loudness > 0.7, addrange timbre.0 timbre.4

So you don't need to know where in the fluid.dataset~ something is, but only need to know what it is, and can call it accordingly. I guess it gets trickier for ranges, but a subnotation like that (loudness.1, loudness.2, etc…) may be useful?
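The translation itself could be dead simple; a sketch of what a label map (what a revamped fluid.labelset~ might hold) could do, where the names and indices are purely illustrative:

```python
# Hypothetical label-to-column map; translate a labelled filter message
# into the index-based message fluid.datasetquery~ actually expects.
labels = {"loudness": 11, "timbre.0": 0, "timbre.4": 4}

def translate(msg):
    # Replace any token that is a known label with its column index
    return " ".join(str(labels.get(tok, tok)) for tok in msg.split())

print(translate("filter loudness > 0.7"))       # → "filter 11 > 0.7"
print(translate("addrange timbre.0 timbre.4"))  # → "addrange 0 4"
```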

This would also make everything more robust to change, since you can add/remove all the descriptors you want and my second message would still be valid, whereas the index/range messages presume everything is exactly where it was, making them very brittle to changes.

My suspicion is that this syntactic sugar would be much easier to do as an abstraction in Max than down in the C++, and that the effort for the latter would be out of proportion with the gains.


Not the world's most onerous patch; it could be inserted before any fluid.datasetquery~. Possibly a better design would use a settable non-local dict so column names could be managed centrally.