Audio Classification with NN

Came across this today. Very Python focused, sorry! But it looks at classifying audio with neural networks. Could be a useful resource for any Pythonistas to refer to and translate to the FluCoMa world.


I did a workshop with them at CCRMA last summer. I believe this uses a convolutional neural network on an image of a spectrogram (I say “image” only because we did actually save the audio to a spectrogram image format, then did the training on that).

One thought I have is that it doesn’t (or at least we didn’t in the workshop) move in the direction of audio descriptors, since the “description” they use is just the spectrogram.

Another thought is that because it takes in a spectrogram, there is some aspect of time baked in, on a truly frame-by-frame basis (not stats of the “past”), so that’s kind of interesting.
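For anyone curious what that time-frequency “image” looks like in code, here’s a minimal numpy sketch (my own, not from the tutorial): each STFT frame becomes one column of the image, which is exactly why a frame-by-frame sense of time is baked into the representation. Window size and hop are arbitrary choices here.

```python
import numpy as np

def spectrogram(audio, n_fft=512, hop=256):
    """Magnitude spectrogram: rows = frequency bins, columns = time frames."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft of each frame -> one column of the "image" per analysis frame
    return np.abs(np.fft.rfft(frames, axis=1)).T

# 1 second of a 440 Hz sine at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (freq_bins, time_frames) = (257, 61)
```

Each column is one analysis frame, so the horizontal axis of the “image” is literally time, frame by frame.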

And my last thought is that I didn’t find the results to be very compelling (i.e., accurate or useful), and it took quite a while to train. But perhaps that was a problem with my training data!

In @groma’s NIME paper last year, he proposed using an autoencoder directly on the spectrogram. You can take that bit of code (I think it is in C++ and already bridged to SC) :slight_smile:
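If anyone wants to play with the autoencoder-on-spectrogram idea before digging into the C++, here is a toy numpy sketch of just the concept (my own, not @groma’s code): a linear autoencoder squeezing spectrogram-like frames through a small bottleneck and learning to reconstruct them. The frame count, bin count, and latent size are all arbitrary, and random data stands in for real STFT magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for spectrogram frames: 200 frames x 64 bins. Real input
# would be STFT magnitudes; random data just shows the mechanics.
X = rng.random((200, 64))
X -= X.mean(axis=0)                           # center the "frames"

n_latent = 8                                  # bottleneck size (arbitrary)
W_enc = rng.normal(0, 0.1, (64, n_latent))    # encoder weights
W_dec = rng.normal(0, 0.1, (n_latent, 64))    # decoder weights

mse_init = float(np.mean((X @ W_enc @ W_dec - X) ** 2))

lr = 0.1
for _ in range(1000):
    Z = X @ W_enc                  # encode: 64 bins -> 8 latent values
    err = Z @ W_dec - X            # decode and measure reconstruction error
    # gradient-descent steps on mean squared reconstruction error
    W_dec -= lr * Z.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

mse_final = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
```

The 8-number code `Z` per frame is the kind of compact “description” an autoencoder gives you in place of hand-picked descriptors.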


Glancing at the code, it seems like it uses transfer learning, i.e., you start off with a ResNet, load the model already trained on the UrbanSound dataset, then chop off the end of the network and define your new classification layer, i.e., how many categories you want to classify. I’m not sure what the target of this workshop is, but by lesson 3 you’re deep in writing PyTorch code to do all this, so unless you are familiar with all these concepts it would be very hard to learn much from it!
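To illustrate the transfer-learning idea without dragging in PyTorch, here’s a toy numpy sketch: a frozen random matrix stands in for the pretrained ResNet body, and only a new softmax “head” (the replacement classification layer) gets trained. All the names, sizes, and the fake dataset are made up for illustration; none of this is from the workshop code.

```python
import numpy as np

rng = np.random.default_rng(1)

n_features, n_hidden, n_classes = 32, 16, 3

# "Pretrained" body: frozen, never updated (in the real thing this
# would be the ResNet with its final layer chopped off)
W_body = rng.normal(size=(n_features, n_hidden)) / np.sqrt(n_features)

# Fake dataset where the class is recoverable from the features
X = rng.normal(size=(300, n_features))
y = rng.integers(0, n_classes, 300)
X[np.arange(300), y] += 4.0            # plant a class-dependent signal

H = np.maximum(X @ W_body, 0.0)        # frozen ReLU features

# New classification head: the only part we train
W_head = np.zeros((n_hidden, n_classes))
Y = np.eye(n_classes)[y]               # one-hot targets
for _ in range(800):
    logits = H @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)  # softmax probabilities
    W_head -= 0.2 * H.T @ (p - Y) / len(X)   # cross-entropy gradient step

acc = float(np.mean((H @ W_head).argmax(axis=1) == y))
```

Because the body stays frozen, you only fit one small matrix, which is why transfer learning is so much cheaper than training the whole network.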

I’ve been thinking quite a bit about using methods like CNNs adapted from images. Basically, when you create a spectrogram you are converting your audio into a 2D representation like an image. One of the reasons why CNNs appear to work so well with images is that neighbouring pixels are relevant to what each pixel is doing. In a spectrogram I also feel they are relevant, but perhaps not in the same way.
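To make the “neighbouring pixels” point concrete, here’s a small numpy sketch of a 2D convolution (strictly, cross-correlation, as in most deep-learning libraries) applied to a toy spectrogram I made up. A simple difference kernel along the time axis fires exactly where a “note” begins, which hints at how local neighbourhoods can mean something different in a spectrogram (onsets, harmonics) than in a photo (edges, textures).

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation: each output pixel is a
    weighted sum of its local neighbourhood in the input."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Toy "spectrogram": 6 frequency bins x 8 time frames, with a
# sustained note starting at frame 4
spec = np.zeros((6, 8))
spec[:, 4:] = 1.0

# Difference kernel along the time axis: an onset detector of sorts
onset_kernel = np.array([[-1.0, 1.0]])
response = conv2d(spec, onset_kernel)
print(response)  # nonzero only in the column where the note starts
```

The same kernel oriented along the frequency axis would instead pick out spectral edges, so the two axes of a spectrogram really aren’t interchangeable the way image axes (roughly) are.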

Training a decently sized network without using a GPU seems tricky, and as of now it seems like you’d need to create an external for Max that could actually define a network and train it on a GPU. Here is a project Jazer Giles did using neural networks. I have not yet opened the patch though!