Extracting syllables

I’m looking to slice spoken word into individual syllables. I have not had a lot of success using the various segmentation objects thus far. Any suggestions on settings?

1 Like

hey @brookt.

Would you be able to upload the samples you are trying to slice as a starting point? There are also a number of factors to think about here that will affect the choice of algorithm and the settings:

  1. How clean is the recording?
  2. How refined do you need the separation to be?

If it is a clean recording, something like fluid.ampgate~ would be an excellent choice, but if it's noisy or mixed with other sounds, perhaps not so much. fluid.noveltyslice~ is another good choice for generic slicing tasks.
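For intuition only, here is a rough Python sketch of the general idea behind an amplitude gate (an "on" threshold opens a slice, a lower "off" threshold closes it). This is not the fluid.ampgate~ implementation; the thresholds and window length are made-up values.

```python
# Conceptual sketch of an amplitude gate, NOT fluid.ampgate~ itself.
import numpy as np

def amp_gate_onsets(signal, sr, on_db=-30, off_db=-40, win_ms=10):
    win = max(1, int(sr * win_ms / 1000))
    # crude RMS envelope converted to dB
    env = np.sqrt(np.convolve(signal ** 2, np.ones(win) / win, mode="same"))
    env_db = 20 * np.log10(np.maximum(env, 1e-10))
    onsets, gate_open = [], False
    for i, level in enumerate(env_db):
        if not gate_open and level > on_db:
            onsets.append(i)        # gate opens: candidate slice point
            gate_open = True
        elif gate_open and level < off_db:
            gate_open = False       # gate closes
    return onsets                   # sample indices of detected onsets
```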

1 Like

As @jamesbradbury suggested, using fluid.bufnoveltyslice~ would be a good place to start. In particular, setting the algorithm parameter (previously called the feature parameter) to “mfcc” would be a good first try since MFCCs are often used for speech.

Also, one parameter that is often not given enough attention is minSliceLength, which can really help clean up some of the results. Note that this parameter is measured in FFT frames, so essentially in hops. I usually manage this by figuring out how many milliseconds I want the minimum slice length to be and then converting that to a number of hops.
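For example, assuming a 512-sample hop at 44.1 kHz (just example values), the conversion is:

```python
# Convert a minimum slice length in milliseconds to a number of hops (FFT frames).
# Sample rate and hop size here are example values.
def ms_to_hops(ms, sr=44100, hop_size=512):
    return max(1, round((ms / 1000) * sr / hop_size))

print(ms_to_hops(100))  # -> 9 hops for a 100 ms minimum slice at 44.1 kHz, hop 512
```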

Also, one good way of approaching the slicers (maybe you’re already doing this) is to use the real-time version (fluid.noveltyslice~, for example), which has all the same essential parameters. Send in the signal of the buffer you want to slice as it plays and watch and listen to the results in real time (maybe just sending the impulse output of the slicer to the speakers so you can hear a click wherever it detects a slice). That way you can break out a bunch of attrui objects and see how the parameters affect the slicing in real time, tweaking until it’s closer to what you want. Things like threshold (obviously) and kernel size will be good to fiddle with. And of course the FFT settings can also have a lot of effect.

Lastly, you might also check out some of the spectral decomposition objects, such as HPSS and NMF. It could be that first processing the buffer through one of these does some of the separation spectrally, and then sending that output through the slicers gets you a more perceptually relevant separation (or even just a perceptually different/interesting one). Or vice versa: slicer first, then spectral decomposer. If you have any questions about those objects, please ask!
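As a rough illustration of "decompose first, then slice", here is a minimal Python sketch with librosa standing in for the FluCoMa HPSS object; the file name is hypothetical.

```python
# Split into harmonic and percussive components, then run the slicing step
# on whichever component separates better. librosa stands in for fluid.bufhpss~.
import librosa

y, sr = librosa.load("reading.wav", sr=None, mono=True)   # hypothetical file
harmonic, percussive = librosa.effects.hpss(y)
# ...then feed `harmonic` (or `percussive`) into the MFCC/novelty steps
```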

2 Likes

thanks - I will be recording the audio from the performer next week (mezzo soprano voice). It’ll be a clean recording in English and a little French. This is me reading them:

https://drive.google.com/drive/folders/15H6ShzxMqB3PmbxAreXnmQeyBekAgWQ0?usp=sharing

TB, what’s happening? This might be a lot of work, but could one train an mlpclassifier to detect “vowels” and “onsets”, and then run the whole audio file through the classifier, using the classified “onset” points as the slice points? It might be more labor than doing it manually, but my brain thinks it could work.

Sam

2 Likes

If your brain thinks it could work then it probably will!

The vowel training is clear, but I’m not sure how I would train for the onsets without already being able to detect them by other means in order to add the data points.

If I have learned anything with this stuff, it is that “probably working” is rarely true, haha. But here is the idea: train the classifier on the short sections of audio where you hear the vowel changes, by doing an MFCC analysis over each of those chunks and getting the stats on the MFCCs. So the training would be on the stats of an MFCC chunk. Then incrementally go through the original file to find spans of audio that classify as “onsets.” Maybe a lot of work for something that might not actually do the job, but then again…it might.
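For illustration, here is a rough Python sketch of that workflow, with librosa and scikit-learn standing in for the Max objects. The file name, the hand-labelled chunks, the window/hop lengths and the network size are all made-up placeholders.

```python
# Sketch of the idea: train a classifier on MFCC summary stats of hand-labelled
# chunks, then scan the file for spans that classify as "onset".
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

def mfcc_stats(chunk, sr):
    m = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])   # 26 values per chunk

y, sr = librosa.load("reading.wav", sr=None, mono=True)       # hypothetical file

# hand-labelled training chunks: (start_sec, end_sec, label) -- marked by ear
labelled = [(0.10, 0.25, "vowel"), (0.25, 0.31, "onset"), (0.31, 0.50, "vowel")]
X = [mfcc_stats(y[int(s * sr):int(e * sr)], sr) for s, e, _ in labelled]
labels = [lab for _, _, lab in labelled]

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X, labels)

# slide a short window over the whole file and keep spans that classify as "onset"
hop, win = int(0.02 * sr), int(0.06 * sr)
onset_times = [i / sr for i in range(0, len(y) - win, hop)
               if clf.predict([mfcc_stats(y[i:i + win], sr)])[0] == "onset"]
```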

Maybe this is how an MFCC-based novelty slice already kind of works, but it would be more specific to the audio you are providing.

Sam

1 Like

Re: the MFCC novelty slice thing, check out some of the links below. The paper is not too technical, kind of fun. It may just help get a sense of what an “onset” is (in terms of “novelty” onsets).

The tough thing about onsets is that one needs to look for changes between adjacent MFCC analyses (which are divided by FFT frames), so I am skeptical that one could find a single class of FFT frame that represents an onset.
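To make "changes between adjacent MFCC analyses" concrete, here is a tiny Python sketch of a frame-to-frame distance curve whose peaks are candidate onsets. fluid.noveltyslice~ uses a self-similarity kernel rather than this bare adjacent-frame difference, so treat it as intuition only; file name, hop size and threshold are placeholders.

```python
# Frame-to-frame MFCC distance as a crude "change" curve, peaks = candidate onsets.
import librosa
import numpy as np

y, sr = librosa.load("reading.wav", sr=None, mono=True)      # hypothetical file
hop = 512
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)

diff = np.linalg.norm(np.diff(mfcc, axis=1), axis=0)         # distance between consecutive frames
novelty = diff / (diff.max() + 1e-12)                        # normalise to 0..1

thresh = 0.5                                                 # to be tuned by ear
peaks = [i for i in range(1, len(novelty) - 1)
         if novelty[i] > thresh
         and novelty[i] >= novelty[i - 1] and novelty[i] >= novelty[i + 1]]
onset_times = [(i + 1) * hop / sr for i in peaks]            # +1: diff is shifted by one frame
```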

Another version of @spluta’s suggestion is to first classify all of the MFCC analyses (which, again, are split by FFT frames) in a sound file as a particular phoneme (which ones are “ah”, which are “oo”, which are “fff”) and then run over that sequence of classifications and look for where the class changes. That could be an onset!
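In code terms, that last step is just a change-point scan over the label sequence; here is a minimal sketch (the labels, hop size and sample rate are placeholders for whatever analysis produced them):

```python
# Once every frame has a class label, onsets are simply where the label changes.
def class_change_onsets(labels, hop_size=512, sr=44100):
    return [i * hop_size / sr                 # time (seconds) of each label change
            for i in range(1, len(labels))
            if labels[i] != labels[i - 1]]

print(class_change_onsets(["ah", "ah", "ah", "oo", "oo", "fff", "fff"]))
```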

As @spluta said, this would involve manually labelling a bunch of examples first and then training a fluid.mlpclassifier~ to then classify the rest (a la supervised learning). This is a super good approach.

Another approach, using unsupervised learning, would be to use a clustering algorithm, such as fluid.kmeans~, on the analyses in the hope that it will cluster together the different phonemes into classes (no manual labelling involved). Then do the same thing: run over that sequence of classifications and see where something changes. This is very similar to @jamesbradbury’s FTIS tool.

If I were to take the unsupervised approach using fluid.kmeans~, I would probably do the MFCC analyses and get one data point for each FFT frame (getting rid of the silence first, though!), then use fluid.umap~ to reduce the number of dimensions to 2 and plot it. That way I could see the general lay of the land of my sound slices (and listen to the dots as well to hear it). Depending on the parameters you use for UMAP, you might be able to get some clean-ish looking clusters in your 2D space that are totally perceptually relevant! And then you could do a clustering in this 2D space (using fluid.kmeans~), even seeding the centres of the clusters where you deem them to be!
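For reference, here is a rough Python sketch of that unsupervised pipeline, with librosa, umap-learn and scikit-learn standing in for fluid.bufmfcc~, fluid.umap~ and fluid.kmeans~. The file name, silence threshold, neighbour count and cluster count are all guesses to be tuned by eye and ear.

```python
# Unsupervised route: MFCC per frame -> drop silence -> UMAP to 2D -> k-means -> change points.
import librosa
import numpy as np
import umap                                   # pip install umap-learn
from sklearn.cluster import KMeans

y, sr = librosa.load("reading.wav", sr=None, mono=True)     # hypothetical file
hop = 512
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop).T   # frames x 13

rms = librosa.feature.rms(y=y, hop_length=hop)[0]
n = min(len(mfcc), len(rms))
keep = rms[:n] > 0.01                        # crude silence gate; tune the threshold
frames = np.flatnonzero(keep)                # original frame indices of kept frames
X = mfcc[:n][keep]

embedding = umap.UMAP(n_components=2, n_neighbors=15).fit_transform(X)
labels = KMeans(n_clusters=8, n_init=10).fit_predict(embedding)

# plot `embedding` coloured by `labels` to see the lay of the land, then:
onset_times = [frames[i] * hop / sr for i in range(1, len(labels))
               if labels[i] != labels[i - 1]]
```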

That’s kind of a big list of ideas right there, so let me know what questions there are. I’ll be happy to help!

https://learn.flucoma.org/reference/noveltyslice/

https://www.audiolabs-erlangen.de/resources/MIR/FMP/C4/C4S4_NoveltySegmentation.html

1 Like

Thanks Ted and Sam - this is all making sense and now it’s just a matter of me doing the work! Thanks again.

hey Ted - I’ve already run into a problem on the first step: I’m trying to analyse each frame of a sample using fluid.bufmfcc~ but I am getting 26 columns instead of 13. Here’s the patch:

analysis by frame attempt.maxpat (12.0 KB)

Is your source buffer stereo? That would explain it. Try adding @numchans 1 to the fluid.bufmfcc~.

1 Like

that was it! Thank you!

I’m stuck at a part that should be fairly straightforward: I want to standardize the live signal so that I can compare it using a kdtree, but I keep getting a “no data fitted” message when I use the fluid.standardize~ object. Here’s the patch:
KmeansForSyllables.maxpat (66.2 KB)

And a dependency:
Brook_Playback.maxpat (13.5 KB)

Hello @brookt

If you are comfortable reading SuperCollider code, there is an example in the example folder of that package that I coded which does exactly that (clumping consecutive similar segments via clustering). It is not super well documented, but I think it can help.

Oh wait, it has been deleted from the example folder. Here it is:
12-windowed-clustered-segmentation.scd (8.2 KB)

1 Like

@brookt, if the two channels are basically the same, this is a good solution. If they’re not, I usually sum the stereo file to mono and do the analysis with that. Then, when I’m ready to do playback (because I’ve figured out how I want to use the analysis), I trigger playback on the stereo buffer.
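Summing to mono is just averaging the channels; a trivial sketch for clarity (in Max you might instead keep the buffer stereo and restrict the analysis to one channel with @numchans 1, as above):

```python
# Average a stereo array down to mono for analysis; keep the stereo original for playback.
import numpy as np

def to_mono(stereo):              # stereo: array of shape (num_samples, 2)
    return stereo.mean(axis=1)    # per-sample average of the two channels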

In Step 3 and Step “this is the broken part” you should give both fluid.standardize~ objects the same name. The name is set as the first argument. Make both look like this:

[Screenshot: two fluid.standardize~ objects sharing the same name as their first argument]

This way, behind the scenes FluCoMa knows that you want to use the same fit data to standardize the incoming signal that you got from the fittransform in Step 3.
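The same idea in scikit-learn terms, purely as an analogy: the scaler is fitted once on the analysis data, and that same fitted object is reused to transform the live input. Fitting a second, separate scaler (or transforming before any fit) is exactly the “no data fitted” situation.

```python
# Analogy only: one fit, reused for both the dataset and the live frames.
import numpy as np
from sklearn.preprocessing import StandardScaler

dataset = np.random.randn(200, 13) * 5 + 3      # stand-in for the MFCC dataset
scaler = StandardScaler().fit(dataset)          # Step 3: "fittransform"
dataset_std = scaler.transform(dataset)

live_frame = np.random.randn(1, 13) * 5 + 3     # stand-in for a live analysis frame
live_std = scaler.transform(live_frame)         # reuse the SAME fit
```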

Let me know if that does it!

1 Like

Thanks Ted - yes, that works now. There were all sorts of other issues with the earlier version of the patch that I was able to work out as well and it’s now “working.” I’m not sure how great it is with the onset detection, but the results are still very interesting. Here’s a video of it in action: AI has a long history... (testing out MAX with Flucoma) - YouTube

Thanks again - and to the others who replied and helped!

And here is the patch, in case it’s of interest or you have any suggestions on how to refine it.
KmeansForSyllables.maxpat (75.5 KB)
Dependency:
Brook_Playback.maxpat (13.5 KB)

2 Likes

Nice. Sounds great. I always love the dancing dot on a plot!

2 Likes

If you like that soundworld, you might like the voice treatment in my last FluCoMa-related piece :) There is a bootleg online where you can hear a little of this treatment here, and I also give a little explanation of the patch I sent above here; that might be of some help.

1 Like