The magnitude spectrum isn’t the only option besides the time domain for representing a signal, but because it’s versatile and often good enough, it tends to get reached for first, and so gets used a lot, and so is well understood, and so gets used a lot. Etc. etc.
As you probably know, with STFTs there’s always a tradeoff between temporal acuity and frequency resolution. In the specific case of the STFT, this tradeoff is applied uniformly, so that there’s an even division of frequencies, and the same granularity of time at every frequency. But there are representations that don’t do this: for instance, we can have a finer frequency grid with coarser time at LF, and vice versa at the top. The upside is that it’s easier to match these to aspects of our hearing; the downside is that these representations are harder to turn back into audio that doesn’t suck, and harder to understand + process.
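To make that uniform tradeoff concrete, here’s a minimal numpy sketch (signal and parameters invented for illustration) comparing a long-window and a short-window STFT of the same sine: the long window gives many fine frequency bins but few frames; the short window, the reverse.

```python
import numpy as np

def stft_mag(x, win, hop):
    """Magnitude STFT with a Hann window; returns (frames, bins)."""
    w = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * w for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 44100
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)          # one second of A440

# Long window: fine frequency grid (sr/4096 ~ 10.8 Hz per bin), coarse time.
long_spec = stft_mag(x, 4096, 1024)
# Short window: coarse frequency grid (sr/256 ~ 172 Hz per bin), fine time.
short_spec = stft_mag(x, 256, 64)

print(long_spec.shape)    # few frames, many bins
print(short_spec.shape)   # many frames, few bins
```

Both views are of the same signal; the STFT just forces one choice of window size everywhere, whereas the multi-resolution schemes above vary it across frequency.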
That doesn’t matter so much if what you’re making is a classifier: indeed, for things that you don’t need to hear the results of directly, you can get pretty baroque with the models you use to classify things. For instance, by using quite abstract statistical models on top of something else; or some other aspect of the signal, like trying to find patterns in how the spectrum changes over time, without being interested in particular frequencies per se (e.g. the ‘modulation spectrum’).
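A crude version of that modulation-spectrum idea is just: take the magnitude spectrogram, then FFT each bin’s trajectory over time. A rough numpy sketch (the test signal and parameters here are made up for illustration):

```python
import numpy as np

def modulation_spectrum(x, win=512, hop=128):
    """Crude modulation spectrum: FFT the time trajectory of each STFT bin."""
    w = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * w for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))   # (time frames, freq bins)
    spec -= spec.mean(axis=0)                    # drop the static part of each bin
    return np.abs(np.fft.rfft(spec, axis=0))     # modulation freqs, per freq bin

sr = 8000
t = np.arange(2 * sr) / sr
# A 1 kHz carrier, amplitude-modulated at 4 Hz
x = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)

ms = modulation_spectrum(x)
carrier_bin = 1000 * 512 // sr                   # STFT bin of the carrier
mod_bin = int(np.argmax(ms[1:, carrier_bin])) + 1
n_frames = 1 + (len(x) - 512) // 128
est_hz = mod_bin * (sr / 128) / n_frames         # should land near 4 Hz
print(round(est_hz, 2))
```

The interesting bit is that the peak tells you how fast the spectrum is wobbling (here, 4 Hz), separately from which frequency is doing the wobbling.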
It could be that you’re washing out too much detail in that case. To keep it simple, you could try just making a single dictionary entry from a single hit, and combining a bunch of these into a dictionary. Would be interesting to see if the performance changes drastically, one way or the other – and which things it struggles to discriminate on: those might give us a clue for a better approach.
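To sketch what ‘one dictionary entry per hit’ might look like (everything here is invented for illustration, not taken from any actual patch): average each hit’s magnitude spectrum into a single template, then match incoming hits to the nearest template by cosine similarity.

```python
import numpy as np

def template(hit, n_fft=1024):
    """One dictionary entry per hit: the average magnitude spectrum."""
    n = (len(hit) // n_fft) * n_fft
    frames = hit[:n].reshape(-1, n_fft) * np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

def match(hit, dictionary):
    """Return the label of the nearest template by cosine similarity."""
    q = template(hit)
    q = q / (np.linalg.norm(q) + 1e-12)
    best, best_sim = None, -1.0
    for label, entry in dictionary.items():
        e = entry / (np.linalg.norm(entry) + 1e-12)
        sim = float(q @ e)
        if sim > best_sim:
            best, best_sim = label, sim
    return best

sr = 44100
t = np.arange(sr // 4) / sr
rng = np.random.default_rng(0)
# Two toy 'hits': a low thump and a noisy tick (placeholders for real samples)
thump = np.sin(2 * np.pi * 80 * t) * np.exp(-t * 30)
tick = rng.standard_normal(len(t)) * np.exp(-t * 60)

dictionary = {"thump": template(thump), "tick": template(tick)}
print(match(np.sin(2 * np.pi * 85 * t) * np.exp(-t * 25), dictionary))
```

Where it mis-matches with real material would be the informative part: the confusions tell you which detail the averaging threw away.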
Turns out I have the binaries already. However, opening the patch in PD vanilla, it seems like there’s a whole bunch of old pd-extended goodies the example wants. I’ll try and get it working. It also turns out he was training on 32-point FFTs, rather than raw audio (so it may be no better).
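For reference, preprocessing into 32-point FFTs rather than raw audio would look something like this (a hedged sketch; the hop size and lack of windowing are my assumptions, not details from the patch):

```python
import numpy as np

def fft32_frames(x, hop=32):
    """Very coarse features: 32-point magnitude FFTs (17 bins per frame)."""
    n_frames = 1 + (len(x) - 32) // hop
    frames = np.stack([x[i * hop:i * hop + 32] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

x = np.sin(2 * np.pi * np.arange(1024) / 8)   # period-8 sine: energy lands in bin 4
feats = fft32_frames(x)
print(feats.shape)                            # (frames, 17 bins)
```

That’s only 17 magnitude values per frame, so the spectral detail the model sees is very coarse – which is why it may be no better than other cheap features.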
Rebecca Fiebrink’s Kadenze thing is great, I think. It’s relatively gentle, but gives a useful general glimpse of what the basics of ML can do. There’s also Parag Mital’s creative applications of deep learning course, which is closer to the sorts of technique that get @groma out of bed in the morning.