Biasing a query

So, jumping on the discussion from the LPT thread and the concept from the hybrid synthesis thread, I got to thinking about how to bias a query in the newschool ML context.

For example, I want to create a multi-stage analysis similar to the LPT and hybrid approach, but a bit simpler this time: an ‘initial’ stage, and ‘the rest’. I then want to do the apples-to-orange-shaped-apples thing and match incoming onset-based descriptor analysis against the corpus to find a sample. So far so good. But what I’m thinking will be useful now is a parameter that lets me bias the querying/matching towards being more accurate in the short term vs being more accurate in the long term. Or, more specifically, weighting the initial time window more, or the full sample more.

In the context of entrymatcher this would be a matter of adjusting the matching criteria by increasing/decreasing the distance weighting for each of the associated descriptors/statistics. But with the ML stuff, it seems to me (with my limited/poor understanding at least) that the paradigm is “give me the n closest matches”, and that’s basically it. No way to bias that query (other than generic normalization/standardization stuff).
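For what it’s worth, that kind of per-descriptor weighting maps neatly onto a weighted nearest-neighbour distance. Here’s a minimal Python/numpy sketch of the idea (the function name and toy data are mine, not any FluCoMa or entrymatcher API), assuming the descriptors are already normalized to comparable ranges:

```python
import numpy as np

def biased_knn(corpus, query, weights, k=5):
    """Return indices of the k nearest corpus entries to `query`,
    with each descriptor dimension scaled by `weights`, so heavily
    weighted dimensions dominate the distance."""
    diff = (corpus - query) * weights          # per-dimension bias
    dists = np.sqrt((diff ** 2).sum(axis=1))   # weighted Euclidean distance
    return np.argsort(dists)[:k]

# toy corpus: 4 entries, 2 descriptors (say, loudness and centroid)
corpus = np.array([[0.1, 0.9],
                   [0.5, 0.5],
                   [0.9, 0.1],
                   [0.2, 0.2]])
query = np.array([0.1, 0.1])

# weight dimension 0 much more than dimension 1
print(biased_knn(corpus, query, weights=np.array([10.0, 0.1]), k=2))  # → [0 3]
```

With all weights equal this degenerates to a plain k-NN query, so the bias parameter could just be a set of multipliers that defaults to 1.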

I guess I could do some of the logical database subset stuff, but that seems like it would only offer binary decision-making (including or excluding either time frame).

Is there a way to do this in the newschool stuff? Or is there another solution to a similar problem/intention?

So I went ahead and created a patch that does some of what we talked about in the last chat and made a quick video demo-ing it.

So I’ve analyzed a mix of a bunch of different samples for 4 descriptors with some statistics each, across three time scales per file.

The descriptors/stats are:


And the time scales analyzed are 0–512 samples, 0–4410 samples, and the entire file.

I chose these as they closely correspond with my “real-time” analysis window of 512 samples (so I can do like-for-like matching there), and, for the Kaizo Snare performance, I matched against the first 100ms of each file, which provided musically useful results.

The results for each matching window are quite different, not surprisingly. Given the material, I think I like the 512 and 4410 options the most. The 512 gives the most range and surprise, as it matches the initial 11ms very well and the rest is a “surprise”. The whole-file matching is dogshit, particularly with these samples, as there’s so much fading out during each file that you don’t get anything useful here.

For the variable part of the video I’m using the % matcher in entrymatcher and crossfading between putting all the weight on the 512 query and the 4410 query.
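That crossfade amounts to blending the distances from two parallel descriptor spaces before picking a winner. A hedged sketch of that idea in numpy (invented names; both spaces have to describe the same entries in the same row order):

```python
import numpy as np

def crossfaded_match(space_short, space_long, q_short, q_long, bias, k=1):
    """bias = 0.0 -> match only on the short-window descriptors,
    bias = 1.0 -> only on the long-window ones; in between, a crossfade."""
    d_short = np.linalg.norm(space_short - q_short, axis=1)
    d_long = np.linalg.norm(space_long - q_long, axis=1)
    blended = (1.0 - bias) * d_short + bias * d_long
    return np.argsort(blended)[:k]

# toy corpus: 3 entries described in two windows (1 descriptor each)
short_win = np.array([[0.0], [0.5], [1.0]])
long_win = np.array([[1.0], [0.5], [0.0]])

# entry 0 is closest in the short window, entry 2 in the long one
print(crossfaded_match(short_win, long_win, 0.0, 0.0, bias=0.0))  # → [0]
print(crossfaded_match(short_win, long_win, 0.0, 0.0, bias=1.0))  # → [2]
```

Because both spaces hold the same descriptors at different time scales, the blend stays in one consistent unit, which is exactly why this feels safer than weighting loudness against spectral shape.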

So, based on this, I think it would indeed be useful to be able to bias a query between multiple versions of the same n-dimensional space. Because everything is in the same scaling/dimensions/space, nothing gets fucked around like it would when trying to bias things towards “caring more about loudness” or “caring more about spectral shape”.

Musically, this makes a lot of sense to me: being able to nudge a query, or match, in a direction that is musically grounded, as opposed to selecting/massaging algorithms based on data science stuff.

I don’t really know what this would mean in practice, however. Simply exposing multipliers pre-matching would add some usefulness, like @danieleghisi mentioned here, though things get complicated really quickly once you have a lot of dimensions and/or data scaling/sanitizing.

I’m going to try to move my querying/matching setup into the ML world (gonna have a jitsi geekout with @jamesbradbury tomorrow), and in the lead up to that I’ve been trying to think of other use cases that I need to satisfy.

Beyond biasing a query, being able to single out individual parameters would still be incredibly useful. I’m specifically thinking of metadata-esque stuff like overall duration, or time centroid, or amount of onsets within a file etc… Things that wouldn’t really be useful to use in a ML context, but would still be incredibly useful in terms of nudging along queries in a certain direction.

I remember seeing some database-esque logical/boolean queries, but I think that was primarily for creating subsets, not for doing individual queries.

With the tools as they stand (or how they plan on standing for the bits that are still in motion), is there a solution or workflow for queries like this?

A naive way for me to conceive it would be where you would ask for the n-nearest neighbors && timeCentroid > 3000.0 or something like that.
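That naive version is easy to sketch as brute force: apply the boolean predicate to a metadata column first, then run k-NN on the surviving rows only. Python/numpy, with made-up names (`time_centroid` here is just an illustrative metadata vector, not a FluCoMa attribute):

```python
import numpy as np

def knn_where(features, meta, query, predicate, k=3):
    """k nearest neighbours among the entries whose metadata passes
    `predicate`, returning indices into the original, unfiltered dataset."""
    idx = np.flatnonzero(predicate(meta))        # e.g. timeCentroid > 3000.0
    d = np.linalg.norm(features[idx] - query, axis=1)
    return idx[np.argsort(d)[:k]]

# toy data: 5 entries, 2 features, plus a per-entry "time centroid"
features = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.9, 0.9], [1.0, 1.0]])
time_centroid = np.array([1000.0, 5000.0, 2000.0, 4000.0, 3500.0])

print(knn_where(features, time_centroid, np.zeros(2),
                lambda m: m > 3000.0, k=2))  # → [1 3]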

Found and played with the 7-making-subsets-of-datasets.maxpat example and this looks like it would kind of do what I want.

I don’t really understand the syntax (or the intended use case, actually).

Say I have a fluid.dataset~ that has 5000 points (rows) with 300 features (columns), and I want to filter and create a subset of those (something like filter 0 > 0.128). Meaning, I want to keep the number of columns intact.

My intended use case here would be to create a subset based on some metadata/criteria, which I would then query/match against. In this case I want all the actual features to stay intact so I can query them. I don’t want to filter out just a single column’s worth of stuff.

I don’t understand what addcolumn (or addrange) are supposed to do. I played with the messages a bit, but none of the examples show the dataset retaining its number of columns.

The process also isn’t terribly fast. Even with just 100 points, like in the example, it takes around 0.5ms to transform one dataset into another. If queries are chained together, that can start adding up.

Granted, this process wouldn’t be happening per grain/query, but it may need to happen often enough to be fluid (say, if I’m modulating the filter criteria with an envelope follower, where the louder I’m playing, the longer the samples I’m playing back are, etc.).

Ok, I set up a speed test with a “real world” amount of data (10k points, 300 columns) and I get around 29–32ms per dump of the process.

This is, perhaps, not the way to go about doing what I want to do, but given the current tools I don’t know how else to approach something like this. (As in, I don’t want or need a new dataset; I just want to filter the dataset as part of the query itself.)


You actually do, every time you change the query parameters. But I don’t see in your patch what you are trying to do… you’re making a 300-dim, 10000-entry dataset and you want to get a subset? Why do you time the dump, which has an overhead? I have here 30ms for the first query, then 18ms. If I keep all the columns (addrange 0 300) it gets to 77ms, which I presume is the overhead of copying more, although @weefuzzy and @groma will be able to confirm…

In my intended use case I’ve got a descriptor space with 10k samples (or however many), each with 300 descriptors (vanilla descriptors, stats, mfccs, stats of mfccs etc…).

I would like to use one column of that (timecentroid for the sake of simplicity here), and bias a query based on that single number. As in, return me the nearest match (along all the dimensions of the descriptors) that has a time centroid above a certain value. Or find me the nearest match where the loudness is below a certain value (this gets weirder).

But the overall idea is something along these lines. Where some columns in the thin flat buffer correspond to data that I’d like to use to query against in a more direct way.

That’s how addrange works! I kept trying stuff and couldn’t get the default example to return every column it started with. (I could have sworn I tried addrange 0 4.) What happens when addrange is backwards, like in the example (addrange 2 1)?

it’s not backwards, it is like all our other range interfaces: start and count
0 300: start at 0 and gimme 300
2 1: start at 2 and gimme 1

In your case, if you’re going to bias a query on a subset and then query in RT, you make the subset, then make a kdtree of that subset, and query that kdtree. The query should be faster than on the full dataset since you have fewer values. What is even better is to query only what you care about, so you could dismiss the columns you don’t care about and make a smaller subset in that dimension too (nb of dims).
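That subset-then-tree workflow can be sketched in Python, with scipy’s cKDTree standing in for a kdtree object (the shapes, threshold, and variable names are illustrative, not taken from any patch):

```python
import numpy as np
from scipy.spatial import cKDTree

# hypothetical corpus: 10k entries, 300 descriptor columns,
# plus a separate metadata vector (e.g. a time centroid per entry)
rng = np.random.default_rng(0)
features = rng.random((10_000, 300))
time_centroid = rng.random(10_000) * 5000.0

# 1. make the subset: rows passing the metadata filter, all columns kept
keep = np.flatnonzero(time_centroid > 3000.0)
subset = features[keep]

# 2. build a kd-tree on the subset (once per filter change, not per query)
tree = cKDTree(subset)

# 3. query in real time against the tree, mapping back to original row ids
_, nearest = tree.query(rng.random(300), k=5)
original_ids = keep[nearest]
```

The expensive steps (1 and 2) only rerun when the filter criteria change; each individual query then only touches the tree.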

Right. It’s just visually confusing because there is no associated message that goes with them, and addrange 0 5 returns the data I want, but for visually confusing reasons (as in, it looks like it goes from 0 to 5, inclusive). I’ve mentioned this already, but this <dataset> <dataset> <number> <number> syntax is a lot more confusing overall.

I tried to make a dataset with around the amount of numbers I’d be dealing with in a real-world context. I can easily get up to 10k samples/grains/bits, and 300 descriptors/stats is about par for the course these days. Granted, my real-world numbers may be slightly smaller, but I wanted to test with more than 100 entries and 5 features.

I perhaps wouldn’t need all the columns for each query type, specifically if each column corresponds with a different time scale and I’m choosing between the initial 256 samples or the initial 4410 samples, etc. But if I have a single value in a column that could potentially update on every query, I’d have to do this process per query.

To clarify further: it wouldn’t always need to update per query. But if I’m doing the thing you (@tremblap) suggested ages ago, where I have a slower envelope follower going and use its output, weighted against the current analysis frame, to decide how “long” (or high in ‘timeness’) a sample to play back, then I would need to update things per query.
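As a concrete (entirely hypothetical) version of that control mapping: the envelope follower’s smoothed level sets the duration threshold that gets re-applied before each match. The function name and the millisecond ranges are mine:

```python
def duration_threshold(env_level, min_ms=50.0, max_ms=2000.0):
    """Map a smoothed envelope-follower level (0..1) to a minimum sample
    duration: playing louder asks for longer samples back.
    The min/max range is illustrative, not from any patch."""
    env_level = min(max(env_level, 0.0), 1.0)  # clamp to 0..1
    return min_ms + env_level * (max_ms - min_ms)

print(duration_threshold(0.0))  # → 50.0
print(duration_threshold(0.5))  # → 1025.0
print(duration_threshold(1.0))  # → 2000.0
```

Since the threshold can change on every onset, the filter step would indeed have to be cheap enough to run per query, which is exactly the cost concern above.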