Segmentation by clustering

After the discussion at the tail end of the previous Friday meeting, I decided to implement one of the theories that @tremblap suggested for dealing with some difficult segmentation scenarios I had.

The gist of the problem is that I have sounds with areas of rapidly changing spectral material, while other areas are more static and immobile. I want to separate the areas of high change from the areas of low change to give me back ‘gestures’ and ‘textures’, put crudely. The segmentation parameters I am able to find (with noveltyslice~) are quite brittle, so a small change can easily produce too many or too few segments. Under-segmenting a file is a huge loss of useful information, while over-segmenting makes it really hard to know what is meaningful, even though it often still gives you the right slice somewhere in the mess of things. Over-segmentation is the better situation to be in, so to narrow down its error we could cluster together contiguous slices that are too similar with a classification/clustering algorithm. That way it resolves the ticklishness of the novelty slicing while keeping its general mechanism.
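A minimal sketch of the clumping step described above, assuming the slice boundaries and one cluster label per slice are already available. The function name `clump` and the example numbers are made up for illustration; any clusterer could supply the labels.

```python
def clump(boundaries, labels):
    """Merge neighbouring slices that share a cluster label.

    boundaries: N+1 sample positions delimiting N slices.
    labels:     N cluster labels, one per slice.
    Returns the reduced list of boundaries.
    """
    if len(labels) != len(boundaries) - 1:
        raise ValueError("need exactly one label per slice")
    if not labels:
        return list(boundaries)
    kept = [boundaries[0]]
    for i in range(1, len(labels)):
        # keep a boundary only where the label changes
        if labels[i] != labels[i - 1]:
            kept.append(boundaries[i])
    kept.append(boundaries[-1])  # close the final segment
    return kept


# hypothetical over-segmented file: 6 slices and their cluster labels
boundaries = [0, 1000, 2500, 4000, 7000, 9000, 12000]
labels = [0, 0, 1, 1, 1, 0]
print(clump(boundaries, labels))
```

Only contiguous runs are merged, so the first and last slices stay separate here even though they share a label.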

I’d love some feedback or ideas on how to push this further. Some initial ideas:

  • Challenge the assumptions that I’m making about contiguity being important. There might be something more nuanced in looking at the distances between neighbouring slices when they fall in different clusters, as those two clusters could border each other.
  • Over-segmentation will generally produce contiguous chunks that are similar, so it’s likely you’ll end up with some slices after the clustering which mean nothing.
  • I’m using MFCCs to analyse the slices, but there might be something else worth using.
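One possible reading of the first idea, sketched under assumptions: instead of comparing cluster labels, compare the feature vectors of neighbouring slices directly and drop a boundary when they are close. `merge_if_close` and `max_dist` are hypothetical names, and the feature vectors are toy values rather than real MFCC statistics.

```python
import math


def adjacent_distances(features):
    """Euclidean distance between each slice and the next one."""
    return [math.dist(a, b) for a, b in zip(features, features[1:])]


def merge_if_close(boundaries, features, max_dist):
    """Drop the boundary between two neighbouring slices when their
    original feature vectors are closer than max_dist."""
    kept = [boundaries[0]]
    for i, d in enumerate(adjacent_distances(features)):
        # dists[i] compares slice i with slice i + 1,
        # whose shared boundary is boundaries[i + 1]
        if d >= max_dist:
            kept.append(boundaries[i + 1])
    kept.append(boundaries[-1])
    return kept


# toy data: 4 slices, the first two and last two are near-identical
boundaries = [0, 100, 200, 300, 400]
features = [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0), (3.0, 4.1)]
print(merge_if_close(boundaries, features, max_dist=1.0))
```

This uses a distance threshold in place of cluster membership, so two slices that sit either side of a cluster border can still be merged if they are close enough.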

I’ll show the results in REAPER, which I think speak for themselves, and then link the Python code below so anyone who is so inclined can play with it.

The top track is the pre-clustering
The bottom track is post-clustering

And the code is here:

import os
import subprocess
import jinja2
import numpy as np
from umap import UMAP
import umap.plot
from utils import bufspill, write_json
from slicing import segment
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

COMPONENTS = 2 # UMAP Components
NEIGHBOURS = 10 # UMAP neighbours
MINDIST = 0.05 # UMAP minimum distance
CLUSTERS = 3 # number of clusters to classify

source = Path("audio/untitled.wav").resolve()
indices = Path("/tmp/slices.wav")
mfcc = Path("/tmp/mfcc.wav")
stats = Path("/tmp/stats.wav")
output = Path("slices").resolve()

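# containers for data and labels
data = []
labels = []

# segment the audio with novelty slicing (the "fluid-noveltyslice" command name is assumed)
subprocess.call([
	"fluid-noveltyslice",
	"-source", str(source),
	"-indices", str(indices),
	"-threshold", "0.3",
	"-fftsettings", "1024", "512", "1024"
])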

# Analyse contiguous slice pairs
slices = bufspill(str(indices)).tolist()
slices = [int(x) for x in slices]
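for i, (start, end) in enumerate(zip(slices, slices[1:])):
	# MFCCs, then their statistics, per slice (the "fluid-mfcc" and "fluid-stats" command names are assumed)
	subprocess.call([
		"fluid-mfcc",
		"-maxnumcoeffs", "13",
		"-source", str(source),
		"-features", str(mfcc),
		"-startframe", str(start),
		"-numframes", str(end-start)
	])
	subprocess.call([
		"fluid-stats",
		"-source", str(mfcc),
		"-stats", str(stats),
		"-numderivs", "2"
	])
	data.append(bufspill(str(stats)).flatten())  # one feature vector per slice (assumes a NumPy array)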


# standardise data
standardise = StandardScaler()
data = np.array(data)
data = standardise.fit_transform(data)

# dimension reduction
redux = UMAP(n_components=COMPONENTS, n_neighbors=NEIGHBOURS, min_dist=MINDIST, random_state=42)

embedding = redux.fit(data)
reduced = embedding.transform(data)
p = umap.plot.interactive(embedding, point_size=2)
# umap.plot.points(mapper, labels=)

# clustering
cluster = AgglomerativeClustering(n_clusters=CLUSTERS).fit(reduced)

clumped = [] # clumped slices

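cur = -2
for i, c in enumerate(cluster.labels_):
	prev = cur
	cur = c
	if cur != prev:
		# a new cluster starts here, so keep this slice boundary
		clumped.append(slices[i])
# close the final clump with the last boundary
clumped.append(slices[-1])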

# Create reaper files to look at the results
tracks = {}
pos = 0		
for i, (start, end) in enumerate(zip(slices, slices[1:])):
	start = (start / 44100)
	end = (end / 44100)

	item = {
		"file": source,
		"length": end - start,
		"start": start,
		"position": pos
	}
	pos += end-start

	if source.stem in tracks:
		tracks[source.stem].append(item)
	else:
		tracks[source.stem] = [item]

pos = 0
for i, (start, end) in enumerate(zip(clumped, clumped[1:])):
	start = (start / 44100)
	end = (end / 44100)

	item = {
		"file": source,
		"length": end - start,
		"start": start,
		"position": pos
	}
	pos += end-start

	if "clumped" in tracks:
		tracks["clumped"].append(item)
	else:
		tracks["clumped"] = [item]

# write the reaper file out
env = jinja2.Environment(loader=jinja2.FileSystemLoader(['.']))
template = env.get_template("minimal_reaper_template.rpp_t")

with open("slicing.rpp", "w") as f:
	f.write(template.render(tracks=tracks))

# open the generated project (macOS "open" command)
subprocess.call(["open", "slicing.rpp"])

Oh I feel all squishy that this was helpful :heart:

This was an intuition, and because my memory is bad, it might be an idea that @weefuzzy talked about. He will confirm if it was his from long ago, or mine, or maybe a combination with the meta-clustering ideas @groma set me to code a few months back and which I have not managed to do yet… oh well, I presume it is a team effort anyway, as usual.

My hunch would be to try a larger number of clusters - 3 seems small… but is it on a rolling window of 10 slices? If not, that is another idea to try (not clustering the whole file but just the 10 neighbours).

What would be good is to also know what you are looking to do ‘further’ - how is it not working as much as you want?

Finally, I know it is in Reaper so Python is your way to go, but I wonder what it would look like in Max or SC with fluid.dataset and co. That will be good homework for me on the pile of things to try.

thanks again for sharing


This seems more generally like using a classifier as a basis for segmenting, which is all fine and good. @tremblap: yes, I remember @groma talking to you about this.

@jamesbradbury: If you’re primarily interested in distinguishing between the amount of spectral change, then perhaps only including the first derivative in your data points will give you less redundancy. Also, you don’t need to reduce it all the way down to 2D for clustering (although this does allow you to inspect the results). YMMV.
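A small sketch of keeping only the first-derivative statistics. It assumes each MFCC coefficient has a flat row of statistics laid out as seven values for the signal followed by seven values per derivative order; check that layout against your own stats output before relying on it. `derivative_block` and `slice_features` are hypothetical helpers, and the rows are fake numbers.

```python
STATS_PER_ORDER = 7  # assumed layout: 7 statistics per derivative order


def derivative_block(channel_stats, order=1):
    """Return the statistics of one derivative order from a flat row
    laid out as [order 0 stats, order 1 stats, ...]."""
    start = order * STATS_PER_ORDER
    end = start + STATS_PER_ORDER
    if end > len(channel_stats):
        raise ValueError("row does not contain that derivative order")
    return list(channel_stats[start:end])


def slice_features(stats_rows, order=1):
    """Concatenate the chosen derivative block of every coefficient."""
    features = []
    for row in stats_rows:
        features.extend(derivative_block(row, order))
    return features


# fake stats for 2 coefficients with one derivative: 14 values each
rows = [list(range(14)), list(range(100, 114))]
print(slice_features(rows))
```

With 13 coefficients this would give 91 values per slice instead of 182 for the signal plus one derivative.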

Ah yes, I did experiment with some other numbers but I wanted to visualise the output at first to help me navigate what was happening at the reduction stage. Perhaps you have some strategies for digging deeper into evaluating the embeddings?

I’ll give this a shot.

I have a mockup using fluid.mds~ that I started with, which I can share after I tidy and annotate it.

Yes, more things to try! I’m not sure what you mean by the clustering. Do you mean run the whole process -> reduction -> clustering or just the clustering on a sliding window?

I am at the stage of tweaking now as I think that this gives me a much better interface than twisting knobs trying to find sweet spots. In my initial experiments testing between two files the generalisation of the approach seems better, while still rooted in some kind of musical/listening practice that started me off on this whole thing, rather than an approach like which means nothing to me personally.



that is what I meant. you have less ‘diversity’ to choose from so that might make it sturdier. or not.

n=2… looking forward when it is your whole hard drive :slightly_smiling_face:
seriously, this is great that you can experiment with this and sharing your results. thanks!

I tend to operate in either n=1 or n=n+1 mode :wink:


First, this is really cool!

I, too, have had a hard time finding meaningful segments “automatically” by running longer bits of audio through any of the slicers. The friction in testing changes also slows things down as it’s hard to see/hear the results without building a load of plumbing around the process.

This got me thinking that it would be super awesome to have a tweakable interface (like Reaper or Simpler) where you can adjust parameters and see the slices adjust in real-time. Obviously for some processes that’s not possible, but you could pre-render a load of settings. Somewhat like @tremblap’s automatic threshold finder, you can select a rough ballpark of slices you want, or a rough starting threshold, and then it would iterate through stuff (like in @tremblap’s patch), but what I’m suggesting would then cache all the results along the way, so you can scroll through some of the settings to find what you need. Or if you have time to spare, have it run in a way that generates a ton of outputs where you can then tweak the thresh in “realtime” and see the results of the slices in the material.
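A rough sketch of the caching idea, assuming the slicer can be wrapped in a function that takes a threshold and returns slice positions. `sweep`, `nearest` and `fake_slicer` are hypothetical names; the stand-in slicer just filters a fixed list so the example runs without any audio tools.

```python
def sweep(slicer, thresholds):
    """Run the slicer once per threshold and cache every result."""
    return {t: slicer(t) for t in thresholds}


def nearest(cache, threshold):
    """Look up the cached result whose threshold is closest."""
    key = min(cache, key=lambda t: abs(t - threshold))
    return key, cache[key]


def fake_slicer(threshold):
    # stand-in for a real slicer: (position, novelty strength) pairs
    onsets = [(0, 0.9), (4410, 0.2), (8820, 0.5), (13230, 0.35)]
    return [pos for pos, strength in onsets if strength >= threshold]


# pre-render a handful of settings, then "scroll" through them
cache = sweep(fake_slicer, [0.1, 0.3, 0.5, 0.7])
print(nearest(cache, 0.45))
```

Once the cache is built, moving a slider only costs a dictionary lookup, which is what would make the tweaking feel real-time.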

A super complex (and useful) variation of this would be doing the same kind of thing, but with intermediary processes (i.e. what you’re doing in Python), so you can tweak things and see/hear results to assess how well it’s working, rather than rendering, checking, rendering, checking, rendering, checking etc…

max version


And some more experimentation today!

This is really starting to show some potential.

Although the waveform looks like it has maybe 2/3 distinct gestures in it, it’s quite static on the surface and the clustering keeps these together!

This is using transientslice with a larger window and blocksize.


Ok, finally got around to testing this today.

Quite handy!

I tried it on some longer/weirder audio (jongly.aif has some fairly clear segments, so clumping here doesn’t seem musically intuitive (for me)) and the results were interesting. Wish there was an easier way to visualize the difference in slices (in Max-land) as I’m getting nearly 50% reduction, and short of playing all the segments and guessing, it’s hard to actually make sense of what that means, for each algorithm.

A small/silly thought is that the minimum slice length could be significantly bigger. I put a file in that was around 1min and one of the segments (even after reduction) was around 70ms, which at a “gestural” level is microscopic. Perhaps this could be set as a % of the overall file length, or related to the novelty kernel size.
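A sketch of one way to enforce a minimum slice length as a percentage of the file, assuming slice boundaries in samples. `enforce_min_length` is a hypothetical helper, and the 1% figure and the boundary values are only illustrative.

```python
def enforce_min_length(boundaries, min_len):
    """Drop boundaries that would leave a slice shorter than min_len
    samples, so a short slice is absorbed by the slice that follows it
    (or by the previous one if it is the last slice)."""
    kept = [boundaries[0]]
    for b in boundaries[1:-1]:
        if b - kept[-1] >= min_len:
            kept.append(b)
    end = boundaries[-1]
    # if the final slice is too short, fold it into its predecessor
    if len(kept) > 1 and end - kept[-1] < min_len:
        kept.pop()
    kept.append(end)
    return kept


# one minute at 44.1 kHz; the first slice is about 70 ms (3087 samples)
total = 60 * 44100
boundaries = [0, 3087, 500000, 510000, 2640000, total]
min_len = total // 100  # 1% of the file length
print(enforce_min_length(boundaries, min_len))
```

Tying `min_len` to the novelty kernel size instead would only change how that one number is computed.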

It also got me thinking that this could be a great way to segment into higher-level ‘gestures’, which would then allow each individual subsection to be (automagically, of course) segmented with bespoke values. This gets tricky, obviously, but I was picturing some kind of interface where you do everything that’s happening now (perhaps with the faux “live” thresholding like I suggested above) and then each individual top-level section gets a tiny UI slider or threshold where you can manually tune the settings for each (via fluid.bufampslice~) to get the right sensitivity… per-section/gesture.

Just a small update - working with the garage door up and all that.

Here is a screenshot showing the same approach but clustering over small windows (aka the @tremblap approach) of the whole set of segments (in this case 15 segments per window with a hop of 1 slice). This screenshot from REAPER shows that process working with different n_clusters values for the clustering algorithm; the equivalent for something like knn is specifying k.

Really interesting to see how the level of discrimination changes between clustering levels too. The next step, tomorrow morning, is to run the same process exhaustively over several different windows, hops and cluster sizes.

Where I think this windowed approach might be stronger is that it clusters slices together mid-processing rather than waiting for the whole job to be done before deciding which clusters to squish together, à la my first post. In the script I’m running now I’m using a recursive loop that squashes until there is no more squashing to be done, and I think that this makes the future decision making more intuitive.
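A stripped-down sketch of "squash until there is nothing left to squash", written as a plain loop rather than recursion and without the windowing. `label_fn` is a hypothetical stand-in for analysis plus clustering; here it just labels slices as short or long so the example runs on its own.

```python
def squash_until_stable(boundaries, label_fn):
    """Repeatedly merge neighbouring slices with equal labels until a
    pass removes nothing. label_fn maps a boundary list to one label
    per slice and is re-run after every merge."""
    current = list(boundaries)
    while len(current) > 2:
        labels = label_fn(current)
        merged = [current[0]]
        for i in range(1, len(labels)):
            if labels[i] != labels[i - 1]:
                merged.append(current[i])
        merged.append(current[-1])
        if len(merged) == len(current):
            break  # nothing was squashed in this pass
        current = merged
    return current


def short_or_long(bounds):
    # toy labeller: 1 for slices of 1000 samples or more, else 0
    return [int(end - start >= 1000) for start, end in zip(bounds, bounds[1:])]


boundaries = [0, 400, 900, 1300, 3000, 3200, 3500]
print(squash_until_stable(boundaries, short_or_long))
```

Because the labels are recomputed on the merged slices, a second pass can squash slices that looked different the first time round.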

import numpy as np
import os, subprocess, jinja2, random, hdbscan
from umap import UMAP
from flucoma.utils import get_buffer
from flucoma import fluid
from pathlib import Path
from uuid import uuid4
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from datetime import date

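COMPONENTS = 4  # UMAP Components
NEIGHBOURS = 7  # UMAP neighbours
MINDIST = 0.1  # UMAP minimum distance
WINDOWSIZE = 15  # clustering window size (15 per window, as described above)
THRESHOLD = 0.3  # novelty slice threshold (assumed; the value used for this run is not shown)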

media = Path("../reaper/source/media/")
source = media / "02-200420_0928.wav"
source = source.resolve()
output = Path("slices").resolve()

slices = get_buffer(
	fluid.noveltyslice(source,
		threshold = THRESHOLD,
		fftsettings = [2048, -1, -1]
	)
)

slices = [int(x) for x in slices]

# clustering
standardise = StandardScaler()
original_slices = list(slices) # make a templated copy
tracks = {}

for nclusters in range(2, WINDOWSIZE):
    model = AgglomerativeClustering(n_clusters=nclusters)
    count = 0
    slices = original_slices[:] # recopy the original so we start fresh
    while (count + WINDOWSIZE) <= len(slices):
        indices = slices[count:count+WINDOWSIZE] #create a section of the indices in question
        data = []
        for i, (start, end) in enumerate(zip(indices, indices[1:])):

            mfcc = fluid.mfcc(source, 
                fftsettings = [2048, -1, -1],
                startframe = start,
                numframes = end-start)

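            # statistics of the MFCCs with one derivative (fluid.stats call reconstructed)
            stats = get_buffer(
                fluid.stats(mfcc,
                    numderivs = 1
                ), "numpy")

            data.append(stats.flatten())  # one feature vector per slice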

        data = standardise.fit_transform(data)

        # might not be necessary to reduce as the dimensions are already quite low
        # redux = UMAP(n_components=COMPONENTS, n_neighbors=NEIGHBOURS, min_dist=MINDIST, random_state=42).fit_transform(data)

        cluster = model.fit(data)
        print(f"num slices {len(slices)}")
        cur = -2
        for j, c in enumerate(cluster.labels_):
            prev = cur
            cur = c
            if cur == prev:
                try:
                    slices.pop(j + count)
                except IndexError:
                    print(f"Error at {j}")
                    print(f"Count {count}")

        count += 1

    # Create reaper files to look at the results
    pos = 0
    track_id = str(nclusters)
    for i, (start, end) in enumerate(zip(slices, slices[1:])):
        start = (start / 44100)
        end = (end / 44100)

        item = {
            "file": source.resolve(),
            "length": end - start,
            "start": start,
            "position": pos
        }
        pos += end-start

        if nclusters in tracks:
            tracks[nclusters].append(item)
        else:
            tracks[nclusters] = [item]

# make the necessary folders
today = date.today()
now = today.strftime("%d-%m-%Y")
session_id = str(uuid4().hex)[:5]
session = Path(f"{now}-{session_id}")
if not session.exists(): session.mkdir()
reaper_session = session / "session.rpp"

# create a dictionary of metadata
metadata = {
    "components" : COMPONENTS,
    "mindist" : MINDIST,
    "neighbours" : NEIGHBOURS,
    "threshold" : THRESHOLD,
    "note" : "Windowed clustering",
    "window" : WINDOWSIZE
}

print('Generating REAPER file')
# now create the reaper project
env = jinja2.Environment(loader=jinja2.FileSystemLoader(['../RPRTemplates']))
template = env.get_template("SegmentationTemplate.rprtemplate")

with open(reaper_session, "w") as f:
    f.write(template.render(tracks=tracks, metadata=metadata))

This looks very promising! I really look forward to playing with it (as soon as my video-talk slides/edits are done!)
