I’ve been looking for a tool that allows me to compare the similarity of musical ideas within my catalogue. I work as a music editor, so I’ll get a reference track and then have to go through a huge list of old musical snippets and songs to find something musically similar to it.
Sononym’s workflow is perfect for this, but the results aren’t great for this application. Something purely descriptor-based might not get there, since genre and instrumentation matter more, but anything trained on a dataset seems to be geared towards finding similar songs within a paid music library rather than your own local catalogue. I’m finding that tools for intelligently organizing and recalling a library of music (as opposed to sounds) are elusive, especially ones whose learning curve can scale to include collaborators who are less tech-savvy.
I figure this is the type of crowd who might know if there is a tool out there that can help!
I suspect that the exact thing you’re looking for doesn’t quite exist yet because, as you say, the descriptors that Sononym uses work better for short samples than tracks.
That said, have you tried any of the DJ organising tools like rekordbox? I know they do some analysis but not what it is.
Unfortunately, rekordbox isn’t organized by spectrogram per se. However, I have considered doing something similar, particularly for dance music. The way I’d go about it is to detect the highest-energy section within a track and train a model on 30 seconds of that section (and of the others) to look for similarities. In the case of dance music, where every track has a kick, that feature would become irrelevant and other descriptors would carry more weight.
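If I were going to prototype that first step, a minimal sketch might look like this (Python with librosa; the function name, window length and hop size are just illustrative assumptions, not anything rekordbox does):

import numpy as np
import librosa

def loudest_excerpt(path, excerpt_s=30.0, sr=22050, hop=512):
    # Slide a 30-second window over the frame-wise RMS energy and
    # keep the window whose total energy is highest.
    y, sr = librosa.load(path, sr=sr, mono=True)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    frames = max(1, int(excerpt_s * sr / hop))
    if len(rms) <= frames:
        return y  # track is shorter than the excerpt
    window_energy = np.convolve(rms, np.ones(frames), mode='valid')
    start = int(np.argmax(window_energy)) * hop
    return y[start:start + int(excerpt_s * sr)]

# excerpt = loudest_excerpt('some_track.wav')  # feed this into the descriptor/model stage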
For this kind of thing, the numbers you generate (spectral descriptors, rekordbox analysis, other magic numbers) are super important. For unstructured similarity searching like this, there are some very powerful models that do it quite well:
The main point of this specific model is that semantic labels and audio samples are trained together, so “dog” and the sounds of dogs become meaningfully associated. At the core of it, though, we can re-use the embedding model and ignore the text information entirely. They provide an example here:
import librosa
import laion_clap

# Load the pretrained CLAP model (downloads the default checkpoint on first use)
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()
# Get audio embeddings from audio data (sample rate should be 48000)
audio_data, _ = librosa.load('/home/data/test_clap_short.wav', sr=48000)
audio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)
audio_embed = model.get_audio_embedding_from_data(x = audio_data, use_tensor=False)
print(audio_embed[:,-20:])
print(audio_embed.shape)
You could store those embeddings in a fluid.dataset~, compute the embedding for the sound you want to search with, and find other sounds that are, in theory, similar by their distance.
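Outside of Max, a rough sketch of that lookup might be the following (pure numpy; ref_embed, all_embeds and all_paths are hypothetical variables holding the CLAP output for your reference and your catalogue):

import numpy as np

def rank_by_similarity(reference_embed, catalogue_embeds, paths):
    # Normalise everything and rank the catalogue by cosine similarity
    # to the reference embedding (most similar first).
    ref = reference_embed / np.linalg.norm(reference_embed)
    cat = catalogue_embeds / np.linalg.norm(catalogue_embeds, axis=1, keepdims=True)
    sims = cat @ ref
    order = np.argsort(sims)[::-1]
    return [(paths[i], float(sims[i])) for i in order]

# results = rank_by_similarity(ref_embed[0], np.vstack(all_embeds), all_paths)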
Now that doesn’t really help you on your specific quest, but perhaps it broadens your horizons a little as to how other people are approaching this problem.