Sam and I want to do some work on the musical use of descriptors. We are interested in gathering stories about issues, interesting usages, points of technique, the usefulness of various descriptors, and so on, that arise when attempting to use these things for musical/compositional/performance purposes (rather than MIR).
I had some interesting discussions at the plenary with people about these things and we’d love to start a larger conversation here with the hope that we can put something together to publish. For now the shape of that is very open, so if you are interested we’d love to have your thoughts. PA and team have agreed that it is of interest to them to see the conversation, hence I’m keeping it here.
This is of great interest to me as well.
In my big sound collection, my familiarity with certain subfolders varies from ‘well known’ to ‘somewhat familiar’ to ‘unknown’. The methods of exploring descriptors on these different categories obviously vary.
All analyses were calculated with Alex’s descriptors~ in non-real time.
For example, I analyzed a collection of 250 soprano saxophone multiphonics. All values are mean over total duration of sound.
The range of inharmonicity is only from 0.005 to 0.321, while to my ears they are all pretty ‘inharmonic’.
Harmonic ratio on the same dataset ranges from 0.319 to 0.537; sfm is 0.000 to 0.003.
All this just to say that the output is much less intuitive than I would have imagined.
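For anyone replicating this outside Max: “mean over total duration” just means the per-frame descriptor averaged across every analysis frame of the file. A minimal Python/NumPy sketch of that, using spectral centroid as the example (frame and hop sizes here are my own assumptions, not descriptors~’ defaults):

```python
import numpy as np

def frame_signal(x, frame_size=1024, hop=512):
    """Slice a mono signal into overlapping frames."""
    n_frames = 1 + max(0, (len(x) - frame_size) // hop)
    return np.stack([x[i * hop : i * hop + frame_size] for i in range(n_frames)])

def mean_centroid(x, sr=44100, frame_size=1024, hop=512):
    """Per-frame spectral centroid (Hz), averaged over the whole file."""
    frames = frame_signal(x, frame_size, hop) * np.hanning(frame_size)
    mags = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_size, 1 / sr)
    centroids = (mags * freqs).sum(axis=1) / np.maximum(mags.sum(axis=1), 1e-12)
    return float(centroids.mean())

# Sanity check: a pure 1 kHz sine should give a centroid near 1 kHz.
sr = 44100
t = np.arange(sr) / sr
print(mean_centroid(np.sin(2 * np.pi * 1000 * t), sr))
```

The same averaging applies to any per-frame value; the single summary number is exactly why ranges can look so compressed compared to what we hear.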
Here are a couple of useful combinations:
durations above 3 seconds (to get sustained sounds)
energyMax below -25dB
this yields soft, sustained multiphonics
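That kind of query is just a conjunction of range filters over the pre-computed values; a Python sketch with hypothetical field names and data:

```python
# Hypothetical per-sound analysis records: duration in s, energyMax in dB.
sounds = [
    {"name": "mp_001", "duration": 5.2, "energyMax": -31.0},
    {"name": "mp_002", "duration": 1.1, "energyMax": -12.0},
    {"name": "mp_003", "duration": 4.0, "energyMax": -18.0},
]

def soft_sustained(db_ceiling=-25.0, min_dur=3.0):
    """Sustained (long) and soft (quiet peak) sounds only."""
    return [s["name"] for s in sounds
            if s["duration"] > min_dur and s["energyMax"] < db_ceiling]

print(soft_sustained())  # only mp_001 passes both filters
```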
AttackLoudness is one I added to the game - this is just the energyAbs over the first 200ms of the sound.
Very handy on percussive or any ‘attack’ related sound.
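A sketch of the AttackLoudness idea as I read it, done as plain RMS-in-dB over the first 200 ms (the actual energyAbs computation in descriptors~ may differ):

```python
import numpy as np

def attack_loudness(x, sr=44100, window_ms=200):
    """RMS energy (dB) over the first window_ms of the sound:
    a rough stand-in for the AttackLoudness idea described above."""
    n = int(sr * window_ms / 1000)
    head = x[:n]
    rms = np.sqrt(np.mean(head ** 2)) if len(head) else 0.0
    return 20 * np.log10(max(rms, 1e-12))

# A full-scale sine has RMS 1/sqrt(2), i.e. roughly -3 dB.
sr = 44100
t = np.arange(sr) / sr
print(attack_loudness(np.sin(2 * np.pi * 440 * t), sr))
```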
Pitchdeviation comes in handy to exclude or search for glissandi and other pitch contours.
Either I’ve hit a bug or I did not understand linear spread. My values range from 124732.805 to 920360.997. No idea what I’m measuring here. Sorry for my ignorance.
lin_brightness is very useful in general.
Just as a start of diving more into this subject.
Thanks. I think Lin spread is broken in that build. At the time I was writing the object I didn’t realise that sfm is much more manageable in dB form, so I’d suggest converting it with atodb to yield more useful values.
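For anyone outside Max: atodb is just amplitude to decibels (20·log10). Applied to the 0.000–0.003 sfm range reported above, it spreads the values into a much more readable scale. A quick Python illustration (the -120 dB floor is my own choice to avoid log(0)):

```python
import math

def atodb(a, floor=-120.0):
    """Amplitude -> dB, clipped at a floor to avoid log(0)."""
    return max(20 * math.log10(a), floor) if a > 0 else floor

# Near-zero sfm values become well-separated in dB:
for sfm in (0.0005, 0.001, 0.003):
    print(round(atodb(sfm), 1))
```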
As some of you may have already picked up, I’m currently interested in “automatic symbolic transcription” of some sort (that’s why I was asking).
What I use quite a lot is searching for melodic intervals in datasets, and what I have now is relatively clumsy and unreliable. It works, but only because I know it doesn’t and I know how to trick it. But before delving further into anything, I’ll need to test the inputs you have already given me over the last few days.
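One simple way to frame that interval search, assuming you already have a cleaned-up note list in MIDI numbers (which is of course the hard part): reduce the pitch track to semitone intervals and look for the pattern as a subsequence. A naive Python sketch:

```python
def intervals(notes):
    """Successive melodic intervals in semitones from a MIDI note list."""
    return [b - a for a, b in zip(notes, notes[1:])]

def find_pattern(notes, pattern):
    """Start indices where the interval pattern occurs."""
    ivs = intervals(notes)
    n = len(pattern)
    return [i for i in range(len(ivs) - n + 1) if ivs[i:i + n] == pattern]

melody = [60, 62, 64, 60, 67, 65, 64, 62]   # C D E C G F E D
print(find_pattern(melody, [2, 2]))          # ascending whole tones
```

Being interval-based makes the search transposition-invariant for free; the unreliability in practice comes from the transcription step, not the matching.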
Another two cents: I’ve used your foote descriptor over and over; I find it extremely good at distinguishing between “fluid, steady stuff” and “uptempo, highly moving things”. That was a huge time saver. “Ballad versus up-tempo”, ahah
Interesting topic indeed. I suppose I use descriptors in three different yet similar ways:
for composing granulation in real time. I have a circular buffer that I analyze as it comes (50 ms grain size, 10 ms hop size) for 3 values (pitch, energy, and a timbral descriptor that changes with my current tastes, mostly sfm or centroid), and after the calibration I showed you all, I’m certain of the ballpark of the values I get, so I can compose real-time granulation beyond simple on-off. For instance:
the clouds of quiet pitched material I showed you all,
the beginning of my soprano piece, which does random stuttering of mezzo-forte noises from the singer.
I also do some cool alignment in my last piece (super-long grains as a looper with sync’d loudest points) to make a chamber music ensemble tight.
And I’ve also made a granular looper that skips some grains, which makes auto-edited loops.
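The “beyond on-off” gating described in #1 boils down to testing each analysed grain against calibrated descriptor ranges before letting it through. A schematic Python version (thresholds and descriptor choices are made up, not the calibration from the workshop):

```python
def allow_grain(pitch_conf, energy_db, sfm_db,
                min_conf=0.8, lo_db=-60.0, hi_db=-30.0, max_sfm_db=-20.0):
    """Gate a grain on three descriptor values:
    only confidently pitched, quiet, low-flatness grains pass."""
    return (pitch_conf >= min_conf
            and lo_db <= energy_db <= hi_db
            and sfm_db <= max_sfm_db)

# A quiet pitched grain passes; a loud noisy one does not.
print(allow_grain(0.95, -45.0, -35.0), allow_grain(0.4, -10.0, -5.0))
```

Swapping the comparison directions gives the opposite behaviours (e.g. the mezzo-forte noise stutter would invert the pitch-confidence and flatness tests).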
Fun stuff over the last 8 years thanks to Alex’s objects. The other usages are in the same vein:
I’ve pre-analysed some very dirty modular synth files, and I’ve used the real-time stream of descriptors from #1 above, offset and scaled to allow rich overlapping descriptor spaces, to make some cool synth variations of a live gesture. It can be heard in the quiet sections of my piece mono no aware.
I’ve done some sampling of my modular, as I explained around the table: the patch controls one value through my Expert Sleepers ES-3+6, and I collect the same 3 descriptors above for each ‘control’ value. I can then query via descriptors again, or via fixed desired values. So far I’ve only used this as a pitch tracker on complex patches, but I had loads of fun making very synthetic birds… I was using the query for the pitch, the stream of spectral centroid to open a filter (with some mapping), and the stream of power to open the VCA. It was great fun!
I hope some of these simple uses will inspire. If anyone wants one of the patches above, I’m happy to share.
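For what it’s worth, querying “via descriptors again, or via fixed desired values” is essentially a nearest-neighbour lookup in descriptor space. A minimal Python sketch with invented data (pitch in MIDI, energy in dB, centroid in Hz, crudely normalised per descriptor before measuring distance):

```python
import math

# Hypothetical analysed corpus: control value -> (pitch, energy_db, centroid)
corpus = {
    0.1: (48.0, -20.0, 400.0),
    0.5: (60.0, -18.0, 900.0),
    0.9: (72.0, -25.0, 2200.0),
}

SCALES = (12.0, 10.0, 1000.0)  # rough per-descriptor normalisation

def nearest_control(target):
    """Control value whose stored descriptors are closest to the target triple."""
    def dist(entry):
        return math.sqrt(sum(((a - b) / s) ** 2
                             for a, b, s in zip(entry, target, SCALES)))
    return min(corpus, key=lambda cv: dist(corpus[cv]))

# Ask for a pitch near middle C at moderate level and brightness:
print(nearest_control((61.0, -19.0, 1000.0)))
```

The normalisation constants matter a lot: without them, the centroid axis (hundreds of Hz) would dominate the pitch axis (tens of semitones) entirely.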
I made a patch/system/piece that was played by both a vocalist and a trumpeter. It used descriptors to chunk up the preceding 16 seconds of sound and recall grains from that buffer which were most similar to the ‘now’ input of the musician. The caveat was that the 16-second memory was a bit fuzzy, and would have grains/sections removed based on how similar they were to the last 4–5 seconds of incoming audio. The idea was that instead of thinking in terms of a CataRT-like instrument, the system would have some interesting corners for the musician to respond to and compromise with as they improvise.
If anyone is interested in the patch I’d be happy to refactor it a bit and pass it on.
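The fuzzy memory could be sketched roughly like this: keep a pool of analysed grains, and before each recall cull the ones whose descriptor values fall too close to the recent input. A deliberately one-dimensional Python toy (the real patch is of course multidimensional and time-aware):

```python
def cull_memory(memory, recent, radius):
    """Drop grains whose descriptor value sits within `radius` of the mean
    of the recent input: the memory 'forgets' what it just heard."""
    if not recent:
        return memory
    target = sum(recent) / len(recent)
    return [g for g in memory if abs(g - target) > radius]

def best_match(memory, now):
    """Most similar remaining grain to the current input."""
    return min(memory, key=lambda g: abs(g - now))

memory = [0.1, 0.4, 0.45, 0.8, 0.95]        # e.g. normalised centroids
memory = cull_memory(memory, [0.42, 0.44], 0.05)
print(memory, best_match(memory, 0.5))
```

Note the interesting side effect this reproduces: after culling, the “most similar” grain returned for an input of 0.5 is not the perceptually obvious one, because the obvious ones were just forgotten.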
As was said in the plenary, I think that descriptors are fairly weak in what they can tell us about sounds, especially when the acoustic model is not that close to what we hear. The salient descriptors seem to be centroid, amplitude, and duration, and as you increase the complexity beyond these it becomes harder to milk them for a compositional purpose. The system I spoke of just above this paragraph used MFCCs, and it was quite proficient at picking similar sounds, but by no means is it a perceptual truth that those sounds were the closest - just that the MFCC data happened to be numerically similar. There were definitely moments where I was confused by the descriptor matching algorithm, and felt that another ‘area’ of sound I had heard previously would’ve been a better pick. In my philosophy for that system, I was really just using the descriptor paradigm to shape the output in a semi-logical way. Maybe I could’ve done this with another method of processing/analysis/synthesis and produced similar results; however, what seemed important creatively was working in this way and hinging the system’s behaviour on the differences between computational and perceptual similarity.
As I was going to sleep, dreaming of a better described world, I remembered that Diemo had made this list of people using CataRT and therefore descriptors musically: http://imtr.ircam.fr/imtr/CataRT_Music
With the sounds being made by the instrumentalists here, it would make sense that MFCCs would be the best descriptor. Pitch isn’t going to get you very far with this material, I don’t think, unless combined with a kind of gate that ignores the noisier sounds and includes the less noisy ones. I imagine spectral centroid would be useful as well.
I think the grain size/envelope might be distorting your results, however. With the grains, I am perceiving the envelope as much as, if not more than, the timbre of the grain, which blurs my perception of the correlation between computer and live signal. This makes me think that we could use descriptors to control the envelope time, overlap, and duration so that the grain shapes more “correctly” match the material.
I used descriptors last year to control real-time synthesis: oscillators and such. The piece used silent brass mutes on trombones, so basically the trombones were used as controllers for synthesis arrays. You could not hear their acoustic sound.
I found that pitch, amplitude, onset, and centroids were the things that worked. Because you couldn’t hear the trombone, I really didn’t have to worry about things like matching pitch or sounds. Centroids and pitches were basically used like sliders that I could map to any range, and I just manipulated and clipped the ranges until the sound and control I wanted emerged.
I found that the best things happened when multiple descriptors were used concurrently and a single descriptor was mapped to multiple, possibly unrelated, variables. For instance, in one case I used the centroid to control the index of modulation of an FM patch and also to control the attack envelope and the pitch of a percussive oscillator sweep. Centroid is a weirdly good controller of attack, as it will change drastically over the course of the attack envelope (see Hans’s AttackLoudness above).
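The “slider” usage is plain range mapping: clip the descriptor stream to a useful input range, then scale it linearly to each parameter’s range. A Python sketch (all ranges invented; note the attack mapping is inverted simply by swapping the output bounds):

```python
def scale_clip(x, in_lo, in_hi, out_lo, out_hi):
    """Clip x to [in_lo, in_hi], then map linearly to [out_lo, out_hi]."""
    x = min(max(x, in_lo), in_hi)
    norm = (x - in_lo) / (in_hi - in_lo)
    return out_lo + norm * (out_hi - out_lo)

# One descriptor feeding several, possibly unrelated, parameters:
centroid = 1800.0  # Hz, hypothetical live value
fm_index = scale_clip(centroid, 500.0, 4000.0, 0.0, 10.0)
attack_s = scale_clip(centroid, 500.0, 4000.0, 0.2, 0.005)  # brighter = snappier
print(fm_index, attack_s)
```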
This is only semi-related as I have nothing concrete to add, but something I’ve been wanting to figure out is a way to get meaningful descriptor data out of onset/transient-based music, where there is little to no sustain after the initial attack.
I spent a bit of time brainstorming/patching with Alex to get something that reliably spat out loudness/centroid/sfm (and perhaps pitch even) for a given drum attack. Obviously there are a lot of problems and compromises there, with the biggest concern for me being latency.
The dream, in this context, would be to be able to have some sense of a couple descriptors (loudness/centroid specifically, but sfm/pitch would be nice) in the amount of time it takes for normal-ish onset detection (<10ms).
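To make the latency trade-off concrete: a centroid computed on a tiny window right after the onset is possible within roughly 6 ms of signal, but the frequency resolution is correspondingly coarse. A Python/NumPy sketch (the window size is my assumption, not anything from the brainstorming with Alex):

```python
import numpy as np

def fast_onset_centroid(x, onset, sr=44100, n=256):
    """Spectral centroid (Hz) of a tiny window right after an onset.
    n=256 at 44.1 kHz is ~5.8 ms of signal, i.e. ~172 Hz bins,
    so only a coarse bright/dark estimate is possible this quickly."""
    seg = x[onset:onset + n] * np.hanning(n)
    mags = np.abs(np.fft.rfft(seg))
    freqs = np.fft.rfftfreq(n, 1 / sr)
    return float((mags * freqs).sum() / max(mags.sum(), 1e-12))

sr = 44100
t = np.arange(sr // 10) / sr
bright = fast_onset_centroid(np.sin(2 * np.pi * 5000 * t), 0, sr)
dark = fast_onset_centroid(np.sin(2 * np.pi * 300 * t), 0, sr)
print(bright, dark)
```

Loudness is cheap at this timescale; pitch is the hard one, since a 300 Hz fundamental barely completes two cycles in the window.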
This AttackLoudness idea is great, for offline (or delayed realtime) use. I started building something, but never finished it, that kept track of some ‘gestural’ data for each attack. Not useful for immediate/onset use, but my thinking was that it could be useful to create some kind of descriptor gesture/vectors which could be used elsewhere in the patch. (My exact use case for this was going to be to create “long” sounds concatenated together from a pool of samples based on the stretched micro gestural information from a short attack).
@spluta At times the duration and period parameters of the granulator hover around values that cause the grain to be significantly distorted, as you put it. Also, when the duration is lower than the period you lose the continuity of the sound, which can disrupt that timbral fidelity.
In regards to your silent trombone piece, I like the use of the centroid as a kind of dirty envelope follower!
It’s a stereo file with the L channel being the audio recorded directly out of a Sensory Percussion drum trigger (you put a metal dimple on the drum and a Hall sensor picks up the “audio”), and the R channel being a DPA 4060 recorded at the same time.
The software for the Sensory Percussion sensor does some cool machine learning stuff and is crazy fast, so I’m trying to replicate some of what it does, but without being inside their sandboxed “app” (or using the 7-bit MIDI output from it).
I will probably tweak the net version of descriptors~ (or similar) to never use info from incomplete analysis windows: the ambiguity of that data and the risk of skewing the result are just too high.
I think the comment about zl stream refers to zl stream into zl median, which is a simple median filter: that adds latency but can remove outliers in a convincing manner.
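For reference, a running median over the last N descriptor frames is easy to sketch outside Max too; it removes single-frame outliers at the cost of roughly N/2 frames of latency:

```python
from collections import deque
from statistics import median

class RunningMedian:
    """Median of the last `size` values; smooths out single-frame spikes,
    the same idea as zl stream -> zl median in Max."""
    def __init__(self, size=5):
        self.win = deque(maxlen=size)

    def __call__(self, x):
        self.win.append(x)
        return median(self.win)

smooth = RunningMedian(5)
noisy = [100, 101, 99, 5000, 100, 102, 101]   # one absurd outlier frame
print([smooth(v) for v in noisy])              # the 5000 never gets through
```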
Alex gave a talk on descriptors. He spoke of many wise things. It made me think that I forgot to add to my list in this thread that I’ve used muted electric bass guitar as a controller for corpus navigation through descriptors since 2009, as in this paper (http://eprints.hud.ac.uk/id/eprint/7421/). An even more naive version was implemented in Sandbox#2 in 2007.