So - I have a problem that I’m not sure can be done with the tools - I’m not even sure how to formulate it.
I want to group sounds so that those with common frequencies (or near-common frequencies) end up together, and things that are further apart in frequency end up further apart. However, whilst for a single note/frequency this seems trivial, I’m struggling to figure out a way to do it for multiple notes, as I’ve no guarantee that the ordering of notes/frequencies matches.
So - I might have two sets of 5 frequencies:
200 220 300 550 900
300 1050 1200 1400 1900
That’s an artificial example, but we can see that the 300 appears in both but in different places in the list. I have no way of constructing a space that would bring those close together - any thoughts?
Do you mean to rotate the 2nd so the 3rd entry is 300 in both cases, or to shift the 2nd so 300 is near 300, then 900 (list 1, item 5) is near 1050 (list 2, item 2) and the other items are off-piste?
I guess I mean the best matching possible between the two. There are two parts to this - for a query I now see that I can set this up, but the original question (which I don’t have an answer to) is how to create a space in which things are arranged this way.
I think that once order goes out the window, having the computer make sense of stuff becomes more gnarly - or maybe you just need a neural network to figure it out. One approach might be to think of each list kind of like a word and to calculate the edit distances. So if 300 is at the 0th index for both lists they will be closer than a pair with 300 at different indices, which in turn is closer than one with unmatched indices. The only problem is that words have a granularity of 26 per character, whereas your values I assume go up to something like 20k, so you would have to bin them, which might nullify the usefulness of the data. I think it’s also a pretty approachable, non-ML way to do it, and the data could be put into a kNN tree after you retrieve the normalised or un-normalised edit distance.
Two popular algorithms are Levenshtein and Jaro-Winkler.
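Something like this rough Python sketch, just to illustrate the binning + Levenshtein idea (the 100 Hz bin size and the example lists are arbitrary, not real data):

```python
import numpy as np

def bin_freqs(freqs, bin_hz=100):
    # Quantise frequencies into coarse bins so they behave like "characters".
    return [int(round(f / bin_hz)) for f in freqs]

def levenshtein(a, b):
    # Standard dynamic-programming edit distance over two sequences.
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[-1, -1]

list1 = [200, 220, 300, 550, 900]
list2 = [300, 1050, 1200, 1400, 1900]
print(levenshtein(bin_freqs(list1), bin_freqs(list2)))  # distance between the binned lists
```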
What about if you normalise (what I assume are frequencies) and then UMAP them?
What if you summarize by encoding these as new vectors of the input’s rank statistics (min, median, max)? That seems like it gives you something for which a distance measure might make more sense.
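A minimal sketch of what I mean, in Python, assuming we just take min, median and max as the summary (the function name and the distance choice are mine, purely illustrative):

```python
import numpy as np

def order_stats(freqs):
    # Summarise an unordered set of frequencies as a fixed-length vector of
    # rank statistics, so an ordinary distance measure makes sense on it.
    f = np.asarray(freqs, dtype=float)
    return np.array([f.min(), np.median(f), f.max()])

a = order_stats([200, 220, 300, 550, 900])      # [ 200.  300.  900.]
b = order_stats([300, 1050, 1200, 1400, 1900])  # [ 300. 1200. 1900.]
print(np.linalg.norm(a - b))  # Euclidean distance between the two summaries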
I was also thinking about ML approaches to it. You have a training dataset, and Alexander Schubert (or rather his IRCAM team of engineers) seems to have got good results with autoencoding to latent spaces via mel spectrograms…
so I would think:
1. threshold on note level so as not to work on silence/background noise
2. if the frame is valid,
3. take a bandpassed, large count of mel bands (the instrument won’t generate anything under 200 Hz and you don’t care about anything above 5k, if even that, for classification) - so let’s say 200 bands, which would be a lot for that range
4. normalise the frame
5. then explore various reductions - an autoencoder would be good there I think… but again, just PCA on that, keeping 90% of the variance, could lower the dimension count a lot with the whole dataset I reckon (a rough sketch of this pipeline follows below)
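Something like this, as a very rough Python sketch of steps 3-5 with librosa and sklearn - the file name, energy threshold and band settings are placeholders rather than a recipe:

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

y, sr = librosa.load("corpus.wav", sr=44100, mono=True)  # hypothetical file

# steps 1/3: mel spectrogram restricted to the band of interest
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=4096, hop_length=1024,
                                     n_mels=200, fmin=200, fmax=5000)
frames = mel.T  # one row per analysis frame

# steps 1/2: drop frames below an energy threshold (silence / background noise)
energy = frames.sum(axis=1)
valid = frames[energy > np.percentile(energy, 10)]  # threshold is a guess

# step 4: normalise each frame
valid = valid / (valid.max(axis=1, keepdims=True) + 1e-9)

# step 5: reduce dimensionality, keeping ~90% of the variance
reduced = PCA(n_components=0.9).fit_transform(valid)
print(valid.shape, "->", reduced.shape)
```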
@weefuzzy and @groma might want to critique this idea, as it is quite fresh - the whole idea of latent space is getting more and more concrete in my head but is still rough around the edges.
Unless I misunderstand Alex’s problem, the challenge here is that he already has some frequencies but these vectors aren’t n-dimensional points that can be compared as they are, but collections of points from a 1D space (frequency).
I’m sort of assuming that the goal here is to get one of these vectors and find the closest equivalent from a set of stored training vectors?
Some slightly bleary googling suggests that one way of approaching this could be as a density estimation problem, by treating each vector as a set of samples from some unknown 1D probability distribution that you then try to model. This is sort of like a (much) better histogram, in that rather than being at the mercy of some more or less arbitrary bin choices, you fit some kernel (often Gaussian) around each sample value to give you an estimate of the overall distribution. Apparently (according to sklearn) there’s an efficient way of doing this with kd-trees, but having looked at the sklearn code, I don’t understand how it works and it doesn’t look like it would be easily achieved by dumping data out of our tree and doing stuff to it ‘outside’, because you’d end up having to reimplement all the hard bits.
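For what it’s worth, here’s a tiny sketch of that idea using sklearn’s KernelDensity, just to show the shape of it (the bandwidth and the evaluation grid are arbitrary choices of mine):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_vector(freqs, grid, bandwidth=50.0):
    # Treat the frequencies as 1D samples, fit a Gaussian kernel around each,
    # and evaluate the density on a fixed grid to get a comparable vector.
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(np.asarray(freqs, dtype=float).reshape(-1, 1))
    return np.exp(kde.score_samples(grid.reshape(-1, 1)))

grid = np.linspace(0, 2000, 200)  # evaluation points, spacing chosen arbitrarily
v1 = kde_vector([200, 220, 300, 550, 900], grid)
v2 = kde_vector([300, 1050, 1200, 1400, 1900], grid)
print(np.linalg.norm(v1 - v2))  # sets sharing frequencies end up closer
```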
I guess this could also be related to the earth mover’s distance used by the optimal transport algorithm in some of our objects (like AudioTransport). However, you’d still need to re-encode the input in terms of some PDF estimation.
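As an aside, for the 1D case scipy can compute the earth mover’s distance directly between two sets of samples, which might sidestep the PDF estimation step - a toy sketch of that, not what AudioTransport does internally:

```python
from scipy.stats import wasserstein_distance

a = [200, 220, 300, 550, 900]
b = [300, 1050, 1200, 1400, 1900]
c = [210, 230, 310, 560, 910]  # a slightly detuned copy of a

# earth mover's distance between the frequency sets, treated as 1D distributions
print(wasserstein_distance(a, b))  # large: the sets are far apart
print(wasserstein_distance(a, c))  # small: near-common frequencies stay close
```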
In terms of what’s actually do-able (besides my hacky idea above of just encoding these in terms of order statistics), I wonder if the various MuBu objects for gaussian mixtures would get the job done here? If you know the vectors are always going to be a fixed number of points, then perhaps mubu.gmm could be used as a density estimator? (the grimace of someone possibly talking rubbish)
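To make the idea concrete outside Max, sklearn’s GaussianMixture would be the rough equivalent of that density estimator - a hedged sketch, with the component count plucked out of the air:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(freqs, n_components=3):
    # Model the unordered frequency set as a 1D mixture of Gaussians.
    X = np.asarray(freqs, dtype=float).reshape(-1, 1)
    return GaussianMixture(n_components=n_components, random_state=0).fit(X)

gmm = fit_gmm([200, 220, 300, 550, 900])
query = np.array([[210.0], [890.0], [1500.0]])
print(gmm.score_samples(query))  # log-likelihood: higher means closer to the model
```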
In the spirit of putting my rubbish where my mouth is, I’ve had a crack at the GMM idea with MuBu.
It doesn’t seem like the most efficient way of going about it, because you have to generate quite a lot of redundancy as far as I can see, by treating the frequencies of interest as 1s in a grid of 0s, sampled as finely as you want your model to discriminate. That said, it seems to do what I was imagining @a.harker might have been after.
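Roughly, the encoding looks like this (a Python sketch of the idea rather than the actual MuBu patch; the 10 Hz step is arbitrary):

```python
import numpy as np

def freq_grid(freqs, f_max=2000, step=10):
    # Encode a frequency set as a vector of 0s with 1s at the nearest grid points,
    # sampled as finely as you want the model to discriminate (redundant but simple).
    grid = np.zeros(int(f_max / step) + 1)
    for f in freqs:
        grid[int(round(f / step))] = 1.0
    return grid

print(freq_grid([200, 220, 300, 550, 900]).shape)  # (201,) at 10 Hz resolution
print(freq_grid([200, 220, 300, 550, 900]).sum())  # 5 active bins
```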
The way queries are visualised is pretty cool:
Sometimes I get something where parts of the data seem shifted:
but overall the matching actually seems pretty clever in capturing the distribution. Thanks, science!
That is roughly what I was hoping for from the normalised bandpassed bands I described above, and I was hoping an autoencoder would remove the bands that are always small. Was I too hopeful and deluded? I will try it with your patch anyway later this week once I’m done with all the videos/presentations I have to do…