One of the new things I’d like to add to SP-Tools is the ability to filter datasets as part of the corpus curation process. This is what fluid.datasetquery~ is meant to be used for. My misgivings about its interface aside, there are times when I want to filter by criteria that I’m not including in a queryable data space, for example duration, time centroid, number of attacks, etc. Up to this point I have been keeping these in a separate/parallel coll and pulling up the (meta)data when needed, but now I’d like to filter a dataset based on some of these values.
Towards that end I’ve moved the contents of my coll into a dict and then into a fluid.dataset~ so I can use it as criteria to filter the data. This works ok, except the dataset that I’m running the query on is not the one that I actually want to filter.
I initially thought I would do this and then get the indices that were removed by dumping the dataset to a dict and then individually removing rows that way, but obviously what is left in the dataset is what was not removed.
I could do an ugly thing where I concatenate the column that I want to query on into the same dataset, then just not copy that column over when I do the query itself, but that gets rough because I may want to query with varied criteria. More importantly, I want to do this to a whole bunch of datasets. In SP-Tools, when I create a “corpus analysis file” it has close to 30 datasets in it (for different time scales, descriptor types, and pre-scaled/normalized versions of things), so having to manually concatenate, process, trim, and dump all of these each time I make a query would be brutal.
Is there a way to process a query with fluid.datasetquery~ but somehow get the results of that process in a way that I can then use to manually remove individual rows from a whole load of datasets?
Or is there a way to filter one dataset with the contents of another one being used as a filter?
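To make the goal concrete, here is a rough sketch of the behaviour I’m after, written in Python rather than Max. It is only an illustration under some assumptions: each dataset is a plain dictionary shaped like a dataset dump (a `cols` count plus a `data` dictionary of id to values, which is my assumption about the layout), and the names `surviving_ids`, `filter_by_ids`, `meta`, and `corpus` are hypothetical, not part of FluCoMa. The idea is to run the query once on the metadata, keep the set of ids that pass, and then apply that same set to every other dataset.

```python
def surviving_ids(meta, column, predicate):
    """Return the set of ids whose value in `column` passes `predicate`."""
    return {key for key, row in meta["data"].items() if predicate(row[column])}


def filter_by_ids(dataset, keep):
    """Return a copy of `dataset` containing only the rows whose id is in `keep`."""
    data = {key: row for key, row in dataset["data"].items() if key in keep}
    return {"cols": dataset["cols"], "data": data}


# Hypothetical metadata: column 0 = duration (ms), column 1 = time centroid.
meta = {
    "cols": 2,
    "data": {"0": [120.0, 0.3], "1": [45.0, 0.7], "2": [300.0, 0.5]},
}

# Hypothetical descriptor datasets that share the same ids as the metadata.
corpus = {
    "mfcc": {"cols": 3, "data": {"0": [1.0, 2.0, 3.0], "1": [4.0, 5.0, 6.0], "2": [7.0, 8.0, 9.0]}},
    "loudness": {"cols": 1, "data": {"0": [-12.0], "1": [-30.0], "2": [-6.0]}},
}

# Query once on the metadata: keep entries longer than 100 ms.
keep = surviving_ids(meta, 0, lambda duration: duration > 100.0)

# The removed ids are whatever is in the original but not in the result.
removed = set(meta["data"]) - keep

# Apply the same selection to every dataset in the corpus.
filtered = {name: filter_by_ids(ds, keep) for name, ds in corpus.items()}
```

In this sketch the query only ever touches the metadata, and the descriptor datasets are filtered purely by id, so nothing needs to be concatenated or trimmed per query.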