A fluid.datasetstats~ object

In playing around with some of the IQR stuff (as well as encountering some bugs), it occurred to me that it would (sometimes) be super useful to be able to see the stats for the columns inside a dataset. This can be useful for seeing the state of the contents with regard to the need for normalization/standardization, as well as potentially “seeing” outliers. There are also knock-on effects, like knowing whether some data is malformed or non-changing.

This may also be useful when trying to place multiple corpora onto a single space.

At the moment I guess you can manually poke out each entry, store that somewhere, then transpose it so you can feed each “column” into fluid.bufstats~, but that seems like a bit of a PITA.

Oh, a perhaps more elegant solution would be to have a stats message for fluid.dataset~ which would dump out a dict (or multichannel buffer) with the info.
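
To make the idea concrete, here is a rough sketch of the kind of per-column summary such a dump could contain. It is plain Python rather than anything in FluCoMa: the `column_stats` function, the dict-of-lists layout for the dataset, and the field names are all assumptions made for illustration, not an existing or proposed interface.

```python
import statistics

def column_stats(dataset):
    """Summarise each column of a dataset.

    dataset: dict mapping entry id -> list of floats (one row per entry).
    This layout is an assumption for the sketch, not the FluCoMa format.
    """
    rows = list(dataset.values())
    columns = list(zip(*rows))  # transpose: one tuple per column
    stats = {}
    for index, column in enumerate(columns):
        # Quartiles for the IQR; needs at least two entries per column.
        q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
        stats[index] = {
            "mean": statistics.fmean(column),
            "std": statistics.pstdev(column),
            "min": min(column),
            "max": max(column),
            "median": median,
            "iqr": q3 - q1,
            # A non-changing column has identical min and max.
            "constant": min(column) == max(column),
        }
    return stats

# Hypothetical data: column 1 never changes, column 2 has one large value.
example = {
    "a": [0.0, 10.0, 1.0],
    "b": [1.0, 10.0, 2.0],
    "c": [2.0, 10.0, 3.0],
    "d": [3.0, 10.0, 50.0],
}
summary = column_stats(example)
```

Looking at `summary[1]["constant"]` or comparing `max` against `median` and `iqr` per column would be enough to spot the non-changing and outlier cases mentioned above.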

I agree that being able to generate summary statistics on a DataSet can be useful. When we’ve shipped the quick swapping between multichannel buffers and DataSets that we’re currently trialling here, that will at least be a slightly easier path to doing it via bufstats. Longer term, my inclination would be to make bufstats more omnivorous rather than adding to the interface of dataset.

FWIW, I disagree that you found a bug w/r/t 0-variance features, so much as a rough edge where NaN-producing manoeuvres can sneak into the json. Ultimately, it still has to be up to folk not to feed un-sensible things into processes. But we can certainly look at ways of providing some scaffolding for sensible-ness checks as part of data preparation.
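
As a rough idea of what such a check could look like during data preparation, here is a minimal Python sketch. It assumes the data is available as a list of rows; the `check_columns` function, its `tolerance` parameter, and the problem labels are hypothetical names for illustration and not part of FluCoMa. Standardizing divides by a column's standard deviation, so a zero-variance column is worth flagging before that step.

```python
import math
import statistics

def check_columns(rows, tolerance=1e-12):
    """Return (column index, problem) pairs for a list of equal-length rows.

    Hypothetical helper for illustration only.
    """
    problems = []
    for index, column in enumerate(zip(*rows)):
        if not all(math.isfinite(value) for value in column):
            # NaN or infinity already present in the data.
            problems.append((index, "non-finite value"))
        elif statistics.pstdev(column) <= tolerance:
            # Standardizing this column would divide by (near) zero.
            problems.append((index, "zero variance"))
    return problems

# Hypothetical data: column 1 is constant, column 2 contains a NaN.
rows = [
    [0.2, 0.0, 1.0],
    [0.4, 0.0, float("nan")],
    [0.6, 0.0, 3.0],
]
issues = check_columns(rows)
```

Here `issues` flags columns 1 and 2 and leaves column 0 alone, so the offending features can be dropped or fixed before any scaling is fitted.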

Yeah, that’d be fantastic, although I guess it would need some capacity for transposing buffers/datasets, since at the moment (unless I’m looking at this wrong) fluid.bufstats~ computes on rows (indices) of buffers, whereas what would be useful to know from a dataset would be the columns (channels).

Yeah, I guess not a bug, but definitely an unforeseen outcome of what otherwise looks like sensible data (samples of sounds that, I guess, fade to silence) which apparently has massive ramifications downstream.

Yes, that’s in there, but I think your description there is mixed up re: rows / columns.
