Can you use fluid.datasetquery~ to filter other datasets?

rodrigo.constanzo · August 5, 2022, 1:19pm

One of the new things I’d like to add to SP-Tools is the ability to filter datasets as part of the corpus curation process. This is what fluid.datasetquery~ is meant to be used for. My misgivings about its interface aside, there are times when I want to filter by criteria that I’m not including in a queryable data space, for example duration, time centroid, number of attacks, etc. Up to this point I have been keeping these in a separate/parallel coll and pulling up the (meta)data when needed, but now I’d like to filter a dataset based on some of these values.

Towards that end I’ve moved the contents of my coll into a dict and then into a fluid.dataset~ so I can use it as criteria to filter the data. This works OK, except the dataset that I’m running the query on is not the one that I actually want to filter.

I initially thought I would do this and then get the indices that were removed by dumping the dataset to a dict and then individually removing rows that way, but obviously what is left in the dataset is what was not removed.

I could do an ugly thing where I concatenate the column that I want to query with into the same dataset, then just not move that column over when I do the query itself, but that gets rough in that I may want to query with varied criteria. More importantly I want to do this to a whole bunch of datasets. In SP-Tools when I create a “corpus analysis file” it has close to 30 datasets in it (for different time scales, descriptor types, and pre-scaled/normalized versions of things) so having to manually concatenate, process, trim, and dump all of these each time I make a query would be brutal.

Is there a way to process a query with fluid.datasetquery~ but somehow get the results of that process in a way that I can then use to manually remove individual rows from a whole load of datasets?

Or is there a way to filter one dataset with the contents of another one being used as a filter?

weefuzzy · August 6, 2022, 11:12am

This is what fluid.datasetquery~’s transformjoin message is for, if I correctly understand what you want.

rodrigo.constanzo · August 7, 2022, 1:14pm

Hmm, it looks like it might.

Definitely a bit of a confusing message name and interface (and help/example), as I looked through the tabs and reference file for a while and couldn’t figure out what I was looking for.

weefuzzy · August 7, 2022, 9:02pm

Yeah, I agree it’s very much in that category of names that’s a bit jargony: join is the name of the SQL manoeuvre that would do the equivalent thing. I think we should relabel the help file tab to something better.
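
For readers unfamiliar with the SQL sense of "join", here is a minimal Python sketch of the idea: filter one table, then keep only the rows of a second table that share an identifier with the survivors. This is not the FluCoMa API. The dictionaries, identifiers, values, and the helper names `filter_rows` and `join_on_identifier` are all hypothetical, and what transformjoin actually does with the columns of each dataset may differ.

```python
# Plain-Python stand-ins for two datasets: identifier -> list of column values.
# All identifiers and numbers are made up for illustration.
metadata = {
    "sample-0": [250.0],    # hypothetical duration in ms
    "sample-1": [1800.0],
    "sample-2": [640.0],
}
descriptors = {
    "sample-0": [-8.0, 92.0],    # hypothetical loudness, centroid
    "sample-1": [-20.0, 60.0],
    "sample-2": [-12.0, 85.0],
}


def filter_rows(dataset, column, predicate):
    """Keep the rows whose value in `column` satisfies `predicate`."""
    return {key: row for key, row in dataset.items() if predicate(row[column])}


def join_on_identifier(filtered, other):
    """Keep the rows of `other` whose identifier also appears in `filtered`."""
    return {key: row for key, row in other.items() if key in filtered}


# Query the metadata, then use the surviving identifiers to filter the descriptors.
short = filter_rows(metadata, 0, lambda ms: ms < 1000)
result = join_on_identifier(short, descriptors)
print(sorted(result))
```

The same `short` result could be reused against any number of other datasets, which is the "filter one dataset by another" use case described at the top of the thread.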

rodrigo.constanzo · August 8, 2022, 1:00pm

Even join seems like an odd word to filter one dataset by another as I’m not trying to “join” anything. And obviously the sequence of contextless names/messages doesn’t help either (reminiscent of TB1’s glorious 1 0 0 1 1 -1 0 1 era).

rodrigo.constanzo · August 19, 2022, 12:40pm

Not worth creating a new thread over this question, but how does fluid.datasetquery~ handle chained filter requests? Logically speaking.

Like if I just chain ands, that makes sense. Or if I just chain ors too.

But what does it do if you mix them up? For example:

[query1] and [query2] or [query3] and [query4]

Is that treated as:
([query1] and [query2]) or ([query3] and [query4])

or is each individual query an island? Or more specifically, what happens after the first different/new conditional? (e.g. and->and->and->*or*->and, or->or->or->*and*->or).

I could make test data and probe, but since the logic isn’t exposed, it would be quite tedious to try to deduce how this is being handled, and figured it would be easier to just ask.

Basically I’m going to add multiple filter/conditions to the next update of SP-Tools and I want to make sense of how they are chained as it may be easier to just limit it to one conditional applied to two filters and call it a day.

tremblap · August 19, 2022, 12:56pm

Each element is independent, so there are no parentheses. In the case of your example, if query3 is true, the statement is true, or if query1, query2, and query4 are all true, then the statement is true.

rodrigo.constanzo · August 19, 2022, 3:53pm

I need to wrap my head around that, as I don’t know what “true” means with regards to filtering a dataset, since each condition comes with consequences.

I will think out loud in terms of a series of examples.

///////////////////////////////////////////////////////////////////////////////

loudness > -10
and
centroid > 80

In this circumstance it will pass all loud sounds that are also bright. This would ignore sounds that are loud but not bright, and sounds that are bright but not loud.

///////////////////////////////////////////////////////////////////////////////

loudness > -10
or
centroid > 80

In this circumstance it will pass all loud sounds, as well as all sounds that are bright. They end up as two “independent” filters(?). This would include sounds that are loud but not bright, and sounds that are bright but not loud.
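
The two cases above can be checked with a small Python sketch using ordinary boolean logic. The four sounds and their values are hypothetical, and this only illustrates what "and" and "or" mean for a filter, not how fluid.datasetquery~ is implemented.

```python
# Four hypothetical sounds covering every loud/bright combination.
sounds = {
    "loud-bright": {"loudness": -6.0, "centroid": 95.0},
    "loud-dark": {"loudness": -6.0, "centroid": 40.0},
    "quiet-bright": {"loudness": -30.0, "centroid": 95.0},
    "quiet-dark": {"loudness": -30.0, "centroid": 40.0},
}


def is_loud(sound):
    return sound["loudness"] > -10


def is_bright(sound):
    return sound["centroid"] > 80


# "and": a sound passes only if BOTH conditions hold (intersection).
passed_and = sorted(k for k, s in sounds.items() if is_loud(s) and is_bright(s))

# "or": a sound passes if AT LEAST ONE condition holds (union).
passed_or = sorted(k for k, s in sounds.items() if is_loud(s) or is_bright(s))

print(passed_and)
print(passed_or)
```

Here "and" keeps only the loud and bright sound, while "or" keeps everything except the sound that is neither loud nor bright.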

///////////////////////////////////////////////////////////////////////////////

So or functions kind of like an and (in a literal sense)?

Like if fluid.datasetquery~ was a restaurant and I walked in and placed an order:
“give me a hamburger or a hotdog”, I would then get both a hamburger and a hotdog?

///////////////////////////////////////////////////////////////////////////////

So if I chain some together:

loudness > -10
and
centroid > 80
or
duration < 1000

Would this give me all samples that are loud and bright and are shorter than 1000ms, ignoring sounds that are loud but not bright and bright but not loud, even if they are shorter than 1000ms? Or would “or” supersede the coupling of “and” (as in the first example)?

tremblap · August 19, 2022, 4:15pm

If you want… but that is convoluted. If you read the sentence of your query in plain English it will make more sense:

For your 2nd example: pass along all items whose loudness is greater than -10 or whose centroid is greater than 80.

For the last example: pass along all items whose loudness is greater than -10 and whose centroid is greater than 80, or whose duration is less than 1000.

OR is always ‘greedy’ in logic. There is even a lot of geek humour about this. For instance:

https://www.reddit.com/r/Jokes/comments/ebt4v5/three_logicians_enter_a_bar/

AND → accumulates conditions, making the test stricter, so it excludes more.
OR → widens the inclusion criteria by offering alternatives.
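
One reading of the explanation in this thread, sketched in Python: an item passes if it satisfies every AND condition, or if it satisfies any single OR condition. This is an assumption based on the replies above, not a confirmed description of the fluid.datasetquery~ implementation. The `matches` helper, the column names, and the sample values are hypothetical.

```python
import operator

# Comparison symbols mapped to Python functions.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def matches(row, and_conditions, or_conditions):
    """Assumed rule: all AND conditions hold, or any one OR condition holds."""
    def holds(condition):
        column, op, value = condition
        return OPS[op](row[column], value)

    all_and = all(holds(c) for c in and_conditions)
    any_or = any(holds(c) for c in or_conditions)
    return all_and or any_or


# loudness > -10  and  centroid > 80  or  duration < 1000
# (the first condition of a chain is treated as an AND condition here)
and_conditions = [("loudness", ">", -10), ("centroid", ">", 80)]
or_conditions = [("duration", "<", 1000)]

rows = {
    "loud-bright-long": {"loudness": -6.0, "centroid": 95.0, "duration": 3000.0},
    "loud-dark-long": {"loudness": -6.0, "centroid": 40.0, "duration": 3000.0},
    "loud-dark-short": {"loudness": -6.0, "centroid": 40.0, "duration": 500.0},
    "quiet-dark-short": {"loudness": -30.0, "centroid": 40.0, "duration": 500.0},
}

kept = sorted(k for k, r in rows.items() if matches(r, and_conditions, or_conditions))
print(kept)
```

Under this reading, every item shorter than 1000 ms passes regardless of loudness or centroid, and longer items pass only if they are both loud and bright.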

I hope this helps