Biasing a query

So, jumping off the discussion from the LPT thread and the concept from the hybrid synthesis thread, I got to thinking about how to bias a query in the newschool ML context.

For example, I want to create a multi-stage analysis similar to the LPT and hybrid approach, but a bit simpler this time: an ‘initial’ stage, and ‘the rest’. I then want to do the apples-to-orange-shaped-apples thing and match the incoming onset’s descriptor-based analysis against the corpus to find a sample. So far so good. But what I’m thinking would be useful now is a parameter that lets me bias the querying/matching towards being more accurate in the short term vs being more accurate in the long term. Or more specifically, weighting the initial time window more, or the full sample more.

In the context of entrymatcher this would be a matter of adjusting the matching criteria by increasing/decreasing the distance weighting for each of the associated descriptors/statistics. But with the ML stuff, it seems to me (with my limited/poor understanding at least) that the paradigm is “give me the closest n matches”, and that’s basically it. There’s no way to bias that query (other than generic normalization/standardization stuff).
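To be clear about what I mean by “bias” here: in entrymatcher-land it’s basically per-dimension weighting of the distance before picking the nearest matches. A throwaway numpy sketch of the idea (not any actual entrymatcher or fluid.* API, and the data is random placeholder stuff):

```python
import numpy as np

corpus = np.random.randn(1000, 12)   # placeholder: 1000 corpus entries x 12 descriptors/stats (standardized)
target = np.random.randn(12)         # placeholder: incoming onset analysis in the same space

# Biasing the query = scaling how much each dimension contributes to the distance.
weights = np.ones(12)
weights[:3] *= 2.0                   # e.g. care twice as much about the loudness stats

dists = np.sqrt(((corpus - target) ** 2 * weights).sum(axis=1))
nearest = np.argsort(dists)[:5]      # "give me the closest n matches", but biased
```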

I guess I could do some of the logical database subset stuff, but that seems like it would only offer binary decision making (including or excluding either time frame).

Is there a way to do this in the newschool stuff? Or is there another solution to a similar problem/intention?

So I went ahead and created a patch that does some of what we talked about in the last chat and made a quick video demo-ing it.

So I’ve analyzed a mix of a bunch of different samples for 4 descriptors with some statistics each, and then analyzed three time scales of each file.

The descriptors/stats are:

loudness_mean
loudness_derivative
loudness_deviation
centroid_mean
centroid_derivative
centroid_deviation
flatness_mean
flatness_derivative
flatness_deviation
rolloff_max
rolloff_derivative
rolloff_deviation

And the time scales analyzed are 0–512 samples, 0–4410 samples, and the entire file.

I chose these as they correspond more closely with my “real-time” analysis window of 512 samples (so I can do like-for-like matching there), and because for the Kaizo Snare performance I matched against the first 100ms (4410 samples at 44.1kHz) of each file, which provided musically useful results.

The results for each matching window are quite different, not surprisingly. Given the material, I think I like the 512 and 4410 options the most. The 512 gives the most range and surprise, as it matches the initial 11ms very well and the rest is a “surprise”. The whole-file matching is dogshit, particularly with these samples, as there’s so much fading out over the course of each file that you don’t get anything useful there.

For the variable part of the video I’m using the % matcher in entrymatcher here and crossfading between putting all the weight on the 512 query, or the 4410 query.

So based on this, I think that it would indeed be useful to be able to bias a query between multiple versions of the same n-dimensional space. Because everything is in the same scaling/dimensions/space, nothing gets fucked around like what would happen when trying to bias things towards “caring more about loudness” or “caring more about spectral shape”.
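In other words, if the 512-window stats and the 4410-window stats sit side by side in one concatenated space, the bias parameter is just a crossfade of weights between those two blocks of columns. A rough numpy sketch (the column layout and data are made up for illustration):

```python
import numpy as np

n_stats = 12                                   # the 12 descriptors/stats listed above
corpus = np.random.randn(5000, 2 * n_stats)    # cols 0-11: 512-sample window, cols 12-23: 4410-sample window
target = np.random.randn(2 * n_stats)

def biased_match(bias, k=1):
    """bias = 0.0 -> all weight on the 512 window, 1.0 -> all weight on the 4410 window."""
    weights = np.concatenate([np.full(n_stats, 1.0 - bias),
                              np.full(n_stats, bias)])
    dists = np.sqrt(((corpus - target) ** 2 * weights).sum(axis=1))
    return np.argsort(dists)[:k]

print(biased_match(0.25))                      # mostly short-term accuracy, a little long-term
```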

Musically, this makes a lot of sense to me: being able to nudge a query or match in a direction that is musically grounded, as opposed to selecting/massaging algorithms based on data science stuff.

I don’t really know what this would mean in practice, however. Simply exposing multipliers pre-matching would add some usefulness, like @danieleghisi mentioned here, though things get complicated really quickly once you have a lot of dimensions and/or data scaling/sanitizing.

I’m going to try to move my querying/matching setup into the ML world (gonna have a jitsi geekout with @jamesbradbury tomorrow), and in the lead up to that I’ve been trying to think of other use cases that I need to satisfy.

Beyond biasing a query, being able to single out individual parameters would still be incredibly useful. I’m specifically thinking of metadata-esque stuff like overall duration, time centroid, or the number of onsets within a file etc… Things that wouldn’t really be useful in an ML context, but would still be incredibly useful in terms of nudging queries in a certain direction.

I remember seeing some database-esque logical/boolean queries, but I think that was primarily for creating subsets, not for doing individual queries.

With the tools as they stand (or how they plan on standing for the bits that are still in motion), is there a solution or workflow for queries like this?

A naive way for me to conceive of it would be asking for the n nearest neighbors && timeCentroid > 3000.0, or something like that.
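Something like this is what I have in my head (a brute-force numpy sketch, not an existing fluid.* message; the metadata column only gates which rows are allowed to win, and all the feature columns stay intact):

```python
import numpy as np

features = np.random.randn(10000, 300)        # placeholder: the actual matching space
time_centroid = np.random.rand(10000) * 6000  # placeholder metadata column (ms)
target = np.random.randn(300)

allowed = time_centroid > 3000.0              # the "&& timeCentroid > 3000.0" part
dists = np.linalg.norm(features[allowed] - target, axis=1)
nearest = np.flatnonzero(allowed)[np.argsort(dists)[:5]]   # n nearest, as indices into the full dataset
```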

Found and played with the 7-making-subsets-of-datasets.maxpat example and this looks like it would kind of do what I want.

I don’t really understand the syntax (or intended use case actually).

Say I have a fluid.dataset~ that has 5000 points (rows) with 300 features (columns), and I want to filter and create a subset of those (something like filter 0 > 0.128). Meaning, I want to keep the number of columns intact.

My intended use case here would be to create a subset based on some metadata/criteria which I would then query/match against. In this case I want all the actual features to stay intact so I can query them. I don’t want to end up with just a single column’s worth of stuff.

I don’t understand what addcolumn (or addrange) is supposed to do. I played with the messages a bit, but none of the examples show the dataset retaining its full number of columns.

The process also isn’t terribly fast. Even with just 100 points like in the example, it takes around 0.5ms to transform a dataset into another one. If queries are chained together, that can start adding up.

Granted, this process wouldn’t be happening per grain/query, but it may need to happen often enough that it has to stay fluid (say, if I’m modulating the filter criteria with an envelope follower or something, where the louder I’m playing, the longer the samples I’m playing back are, etc…).

Ok, I set up a speed test with a “real world” amount of data (10k points, 300 columns) and I get around 29–32ms per dump of the process.

This is, perhaps, not the way to go about doing what I want, but given the current tools I don’t know how else to approach something like this (as in, I don’t want or need a new dataset, I just want to filter through the dataset as part of the query itself).


----------begin_max5_patcher----------
2129.3oc2as0jiiZE9Y2+Jnbk7RhWUbQWyCo176XpsbgrvtYWYIEITO8Las6
u8.HjLxMxVtaod1J9AeADbNmON2.N92eZy1zxWYMaA+KvW.a176OsYitIUCa
L+dy1yzWOjSazO11B1WKS+0s655RvdUnaV.RAo8sV1JxYBw2pXcy71TZwos6
LeB9EyiwyzCUNc+TX+PKZOyKjCVSLzkF6lRcqXSqUTwgm4Em1WyNH5HjOIzC
tCDm34CseEtCfgwdnj.DjfBi8iiB7SjMh8ffeQMc+wSOoda2GCE9dN3TcYaE
3m+dt74a3emAPPr+j.iBTbBH9DmHB9QQDLLwKPJnnPuPaDAS1AP9IJzZoAg1
uyADHbdJCxO4EB2P.7Af.xzP.JwWAAnHeOrMDfh1A7i9PHvw7RIabyU2IU4I
CZEUzZ5YlfUumUPSy0iFdK6gik0moZoK7Q0GBwZKD+HnWh0q3XoIBBA+PnQZ
qPTVb60cWHAF+tQhwxM5FxcDVIaj3PoW.qWDrTf80hs+msXCWew13PzOnWB+
.x4Dl6UfCUsB9YV8jPfzLgZYierrPTHEX8v+O0bZ912hNH+Gv5GYMyJmt51T
K320N.G6EXqO.kQEBCtXFXOCRIrCcLvijd7b1Kr5Ftb8+BarYKspxp4MVCQg
o+Zodhh2MzDunqIzPS0rW38iOXnUZsTNDRgnstC8dMrGlTSSYFqtnkqmotFk
qtFVRuNpP8lJ5gtAqVt669BxDf0hOJVavfR57PRhtfFREkS4kG9MVlkhpbYu
hUvKppYMrBAUXX9gtyXGos4h8iViPRzO.EGNf+td5iFF14T4TQZy1S07rxBE
KMZgQ0bOwkQF5jz.aQS+DEzJGCVpVJQoI5rQJxsMozZ05lwFF22onrLebWCi
KmcTX5thWTbElJJqltyZ9omuwXSKkcd9Vystml8sEc8tWphH12PeYLZKn44F
a+wS+qzBtLNDSY7qEW3Pmc9wdt4PcYd9H4sqmWbzSlTk+.6q7LwyZB4YsdKe
bdUuJ01gU4L9IViXbaB5olwszH9VGna0TapwjdufctJWJEie.owBuQz7b4Wa
LOXuhlM.bI0YaSbaGoiZ+JGpcNvFreux+mc6tBB3vMHzpGG97HZ++nHh1hNn
+8AcesahxymYEcPpLwLV8Kzb.u.btAva.0rpxZAKCHwC1HFuHi8pkeGSvEiK
n2I9LJfy3fNofQ3ysC5d6fNuIr76.3QWQn65eapTUgZusQw50lNOuv9v0qOp
JCjeP4X+lP63f4OF1R9wisHn1JHH7yFb+IoGsUCX8mDXwqNvZbq3qcqPv5sa
9Ihqqnqffefpq9w+esmfvejPaj97.VUGAZ44A0KsfmnIgG3rfmo8+YNwi6G8
WwefFYBnM.YBdx4BPKxLY.HSFvjafyj.v+0AHiWdfryx7QwQpB0L.4MAOWYP
cY6I47hoR5Typp9cipMks0G5wLCz.Fy1x7rE7hgst8kAuff4Zb7n7PzL4A7J
xCgyjGjwXAnUhGBlIODth3f+L4AzJxCjGXsXs3A7L4Axj7fow9CeYqZK0Y66
N9f8Tgnlm1J5LisOMoGZaty6XTNkWlRyuZGrt1q7SWDgk4XAaX4.QMsnQcX0
28t.be.oKzcfDf6tmi.8Gwv03RONl2xy7xnBZCS7mfF94pbVPSaZyPzqok9t
DolBFPwOvkiQtKLPjIpGYejmApjgBweHT4Lqogdh8FXIq8b0zWOhSoM4cbtu
2PZgHOhszFot0GzOj64fj7Ybf+q848K.bPJfOo3qtNuYb6d3ERqF4qSfGQPd
witcOj5BsBWEs5C4LZ8ioVGgVH05.0dWbdMVjnUQXUmmuXG3HOWpuJC38uAP
ODNdGflkcnLu8bA.K43KN66c9obF51QnJX7gx1tDj8eLbDtL3n4hwbq1zchF
gqYHh+aKq9a+4GLxPHZYrgLfg6hD.gWi3k47Fww1hBV9jXf5Qb65HY6h3qzT
aDNCQDDuFR8KrWqpA+siXv+Pc+FHv+T9CD3maNPyo0p6ua3vrla3xvkwdnuN
Q5xQBuNK5osGOxp6yNppTFaPJ6zyUM2rfQFLBlLxJBtL0OjIXh6piH1eMvjJ
F62FgHy8VzsDe21Cj2q8fyDDgqhAQSUsTjOphkzoN72ylEV3DFBVF2BFk.mw
XQQQqUISgTdbAKPYSsLUMUWhjtpjNDF+4sCJ0u+n6eZQSzzYEifBf+EY6SjE
JOytZlyoMvD6dRyMiOexNQ+5yew..u8bWf1G7x8O.r2fwykPgyfNQvEhPn6P
Hj+BPH0NKuqHsDDRMG36PGb7BQn6AcDzBPnHzLftEgPyR8dInjdRt6pDYAnT
vmD14SlCgVBeCphN9tJd9jkhR3OCJQ9rTwwywMTxRPn3YPnfkhP2Uc.tTTBO
GUbzGkRyxXZQVlvygR9KfLglilGZIBLgly5DFuDTZN4ofWh0Ij+LnD4s4Ozk
x2UE6rhHWUjyWUfyus3lmtvlutnl02oVWEcdUZlC2tkLy5vgRatuVgGi.zCG
XEhCk4cb0W.Pu3cp2HIXTn5aQvjXnuTwzpHvUokqGyddgRhY8iEty5M6Qjdx
hFHSul2fdA8f4NypUsjottXU2dVx8ZoGd0p6HA1pLIbKxiDVeLI.mn3ARDV+
OgB5gi7CPF1ewXqQarwMi8dP0i777AAx9JT62dz1S0zLNa3DDLxOZX0NT9JZ
myugrKGb8vv8CKJNVhb6b9s2Nr9QgSfnDsVEwnUo9lrIzUiRtcVSgiGYU8yU
0kpZtoud58HC2tKsUTNHnlkhK2hq6kuwVGyaUztPkmg1kdMyWpVAw5uIkYRx
6VypyIyD+8Ed5Od5+AHvtypD
-----------end_max5_patcher-----------

You actually do need to redo it every time you change the query parameters. But I don’t see in your patch what you are trying to do… you’re making a 300-dim, 10000-entry dataset and you want to get a subset? Why do you time the dump, which has an overhead? I have 30ms here for the first query, then 18ms. If I keep all the columns (addrange 0 300) it goes up to 77ms, which I presume is the overhead of copying more, although @weefuzzy and @groma will be able to confirm…

In my intended use case I’ve got a descriptor space with 10k samples (or however many), each with 300 descriptors (vanilla descriptors, stats, mfccs, stats of mfccs etc…).

I would like to use one column of that (timecentroid for the sake of simplicity here), and bias a query based on that single number. As in, return me the nearest match (along all the dimensions of the descriptors) that has a time centroid above a certain value. Or find me the nearest match where the loudness is below a certain value (this gets weirder).

But the overall idea is something along these lines: some columns in the thin flat buffer correspond to data that I’d like to query against in a more direct way.

That’s how addrange works! I kept trying stuff and couldn’t get the default example to return every column it started with (I could have sworn I tried addrange 0 4). What happens when addrange is backwards, like in the example (addrange 2 1)?

It’s not backwards, it’s like all our other range interfaces: start and num.
0 300: start at 0 and gimme 300
2 1: start at 2 and gimme 1
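In slicing terms (a numpy analogy of the start-and-num idea, not what the object does internally):

```python
import numpy as np

data = np.arange(12).reshape(2, 6)    # a toy 2-row, 6-column dataset

def addrange(start, num):
    """Start-and-num column selection, analogous to the addrange message (illustration only)."""
    return data[:, start:start + num]

print(addrange(0, 6))   # all 6 columns -- like "addrange 0 300" on a 300-column dataset
print(addrange(2, 1))   # just column 2; "2 1" is not "from 2 back to 1"
```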

===
In your case, if you’re going to bias a query on a subset and then query in RT, you make the subset, then make a kdtree of that subset, and query that kdtree. The query should be faster than on the real dataset since you have fewer values. What is even better is to query only what you care about, so you could dismiss the columns you don’t care about and make the subset smaller in that dimension too (number of dims).
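A rough offline sketch of that workflow (numpy/scipy stand-ins for fluid.datasetquery~ and fluid.kdtree~, not the real objects): filter once, drop the columns you don’t need, build the tree once, and then the per-query cost is just the tree lookup.

```python
import numpy as np
from scipy.spatial import cKDTree

features = np.random.randn(10000, 300)          # stand-in for the 10k x 300 dataset
time_centroid = np.random.rand(10000) * 6000    # made-up metadata column

# 1. make the subset: only rows passing the criterion, only the columns you care about
keep_rows = time_centroid > 3000.0
keep_cols = slice(0, 24)                        # e.g. keep only the first 24 columns (made-up layout)
subset = features[keep_rows][:, keep_cols]
subset_ids = np.flatnonzero(keep_rows)          # remember which original entries these are

# 2. build a kd-tree of that subset (done once, outside the per-grain hot path)
tree = cKDTree(subset)

# 3. query that kd-tree in "real time"
target = np.random.randn(300)[keep_cols]
dist, idx = tree.query(target, k=1)
print(subset_ids[idx], dist)                    # nearest match, as an index into the full dataset
```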

Right. It’s just visually confusing because there is no associated message that goes with them, and addrange 0 5 returns the data I want, but for visually confusing reasons (as in, it looks like it goes from 0 to 5, inclusively). I’ve mentioned this already, but this <dataset> <dataset> <number> <number> syntax is a lot more confusing overall.

I tried to make a dataset with around the amount of data I’d be dealing with in a real-world context. I can easily get up to 10k samples/grains/bits, and 300 descriptors/stats is about par for the course these days. Granted, my real-world numbers may be slightly smaller, but I wanted to test with something bigger than 100 entries and 5 features.

I perhaps wouldn’t need all the columns for each query type, specifically if each column corresponds with a different time scale and I’m choosing between the initial 256 samples or the initial 4410 samples etc…, but if the value I’m filtering against could potentially update on every query, I’d have to do this process per query.

edit:
To further clarify: it wouldn’t always need to update ‘per query’, but if I’m doing the thing you (@tremblap) suggested ages ago, where I have a slower envelope follower going and use its output, weighted against the current analysis frame, to decide how “long” (or how high in ‘timeness’) a sample to play back, then I would need to update things per query.
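Just to make that concrete for myself, a tiny sketch on top of the kd-tree idea above (numpy/scipy again; the mapping from envelope level to threshold is completely made up):

```python
import numpy as np
from scipy.spatial import cKDTree

features = np.random.randn(10000, 300)          # as in the earlier sketch
time_centroid = np.random.rand(10000) * 6000    # made-up metadata column

def threshold_from_envelope(env, max_centroid=6000.0):
    """Louder playing (env nearer 1.0) -> demand samples with a later time centroid."""
    return env * 0.5 * max_centroid

def match(target, env):
    thresh = threshold_from_envelope(env)
    keep = time_centroid > thresh               # the criterion changes per query...
    tree = cKDTree(features[keep])              # ...so the subset/tree gets rebuilt per query too
    dist, idx = tree.query(target, k=1)
    return np.flatnonzero(keep)[idx]

print(match(np.random.randn(300), env=0.7))
```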