Observations on long file analysis in SC

Well, it says “startpoint and endpoint”. I would have thought that meant the first and last sample, but this is saying it is the first sample and the sample after the last sample?

BTW - some observations:

Loading enormous 5 GB sets of files doesn’t seem to work. It seems the larger the audio buffer data set, the slower things go, seemingly exponentially, and the more likely everything is to crash. This is most certainly a memory issue. Probably even loading all those buffers is a no-no.

The more files I load, the greater the chance of a crash. If I load ten 95 MB files, it is sometimes able to process them all without crashing, sometimes not.

For analyzing lots of big files, say fifty 95 MB files with 718 points each, I settled on the code below. While it doesn’t use all the slick stuff, due to the file size issue, it is still WAAAAYYYY better than before.

The FluidEqualSlicer class is at the bottom. I would love to know if there is a better way to do this with the tools. It is a bit clunky…

There does seem to be a memory leak or something. The code will sometimes run for a while, sometimes not. I found that adding some waits made the crashing happen less often. But maybe the dataSet isn’t always being freed or something?


~mfccbuf = 4.collect{Buffer.new};
~statsbuf = 4.collect{Buffer.new};
~mean = 4.collect{Buffer.new};
~flatbuf = 4.collect{Buffer.new};

~path = "/Volumes/Samsung_T5/Adaggietto_441_24/Stretch_12_0.25_0.1/";
~paths = PathName(~path ++ "Chans/Chan0/4/").files;


~extractor2 = FluidProcessSlices({|src, start, num, data|
	var mfcc, stats, writer, flatten, label, voice;
	label = data.key;
	voice = data.value[\voice];
	mfcc = FluidBufMFCC.kr(src, startFrame: start, numFrames: num, numChans: 1, features: ~mfccbuf[voice], trig: 1);
	stats = FluidBufStats.kr(~mfccbuf[voice], stats: ~statsbuf[voice], trig: Done.kr(mfcc));
	flatten = FluidBufFlatten.kr(~statsbuf[voice], ~flatbuf[voice], trig: Done.kr(stats));
	writer = FluidDataSetWr.kr(~ds, label, -1, ~flatbuf[voice], Done.kr(flatten));
});

~extractor = {|counter, action|
	var path;

	path = ~paths[counter];

	Buffer.read(s, path.fullPath, action: {|buf|
		var slicer, index;

		index = IdentityDictionary[(path.fileName.asSymbol -> IdentityDictionary[ ('bounds' -> [ 0, buf.numFrames ]), ('numchans' -> buf.numChannels), ('sr' -> buf.sampleRate) ])];

		slicer = FluidEqualSlicer();
		slicer.slice(buf, index);

		~extractor2.play(s, buf.postln, slicer.index.postln, action: {"Features done".postln; action.value(buf)});
	});
};


var counter, netAddr, func;

netAddr = NetAddr("", NetAddr.langPort);
counter = 0;

~ds = FluidDataSet(s, \mfcc);

~extractor.value(0, action: {|buffer|

	func = OSCFunc({|msg|
		"write json".postln;
		"counter ".post;
		counter = counter + 1;
		~ds = FluidDataSet(s, \mfcc);

		"do it again".postln;
		"free osc".postln;

	}, 'fluidCount');
});

FluidEqualSlicer {
	var <>index;

	slice {|buffer, indexIn, chunkSize = 44100|
		index = ();
		[buffer, indexIn].postln;
		indexIn.keysDo{|key|
			var parent = indexIn[key], frames, numChunks;
			[key, parent].postln;
			frames = parent['bounds'][1] - parent['bounds'][0];
			numChunks = (frames / chunkSize).ceil.asInteger;
			numChunks.do{|i|
				var lilDict, chunkPoint;
				lilDict = parent.deepCopy;
				chunkPoint = (i * chunkSize) + parent['bounds'][0];
				lilDict.put('bounds', [chunkPoint, min(chunkPoint + chunkSize - 1, parent['bounds'][1])]);
				index.put((key ++ "-" ++ (i + 1)).asSymbol, lilDict);
			};
		};
	}
}
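For context, here is a usage sketch of the class above (assumes a running server `s`; the file path is a placeholder):

```supercollider
// Hypothetical usage of FluidEqualSlicer: build the index dictionary that
// slice expects, then split the loaded buffer into 1-second (44100-frame) chunks.
~buf = Buffer.read(s, "/path/to/long_file.wav", action: {|buf|
	var indexIn = IdentityDictionary[
		(buf.path.basename.asSymbol -> IdentityDictionary[
			('bounds' -> [0, buf.numFrames]),
			('numchans' -> buf.numChannels),
			('sr' -> buf.sampleRate)
		])
	];
	var slicer = FluidEqualSlicer();
	slicer.slice(buf, indexIn, 44100);
	slicer.index.postln; // one entry per chunk, labelled name-1, name-2, ...
});
```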


Memory leak aside (which is not clear yet), loading that much in one go might not be a good idea in general, as it will clog your RAM. Samplers, as you know, do not load everything: they preload just the amount they need for instant response from RAM while the disk loads the rest. So an iterative load-analyse-free loop is probably the way to go.
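A minimal sketch of such an iterative load-analyse-free loop (the folder path, the `~analyse` function, and its `action` callback are hypothetical placeholders for whatever per-file processing you run):

```supercollider
(
// Hypothetical sketch: read one file at a time, analyse it, free the
// buffer, then recurse to the next file. Only one audio file is ever
// held in server memory at once.
~paths = PathName("/my/audio/folder/").files;
~next = {|i|
	if(i < ~paths.size) {
		Buffer.read(s, ~paths[i].fullPath, action: {|buf|
			~analyse.value(buf, action: {
				buf.free;           // free the audio before loading the next file
				~next.value(i + 1); // move on only once analysis is done
			});
		});
	} {
		"all files processed".postln;
	};
};
~next.value(0);
)
```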

The loading mechanism in SC uses the built-in one, but a first experiment could be to try the load-in-2-pass example I left in the example folder, which uses bufcompose instead. It is slower, yet still fast. If that behaves better or worse at load time, that could give an indication of where the problem is.

As far as I am aware, all the other processes are not copying the full buffer, only the slice they are processing, via numFrames and startFrame. @weefuzzy will confirm, but I’m 99% certain. So the next bottleneck is more likely the fact that scsynth is in constant memory-swapping mode: too much RAM is used, so a lot goes to the hard disk. Again, I’m sure Owen will have more insightful answers, but I thought I would try to help first…

I am only loading 1 file at a time and analyzing that into a data set. I am making sure to kill my buffer once the DataSet is made.

I am then iterating through 50 files, each of which is 10 minutes long. SC will inevitably crash at some point. I once made it through all 50 files, with 350 points per file. Usually it gets through about 10-20 files before crashing.

You all probably tested this on big data sets of small files (lots of data, not many points), but maybe not as much on data sets of big files (lots of points, not too much data)?


OK, I don’t understand. Looking at your code above, I wonder where it crashes. Since you need the audio to run the segmentation and the extractor, I wonder when you delete it? The workflow for me would be:

load the long file
split further
analyse to dataset (appending)
reset the buffer

You could even save the created dataset to a file per long file, just for now, so you don’t have to restart every time; you can then concatenate them as one big JSON file… again, just for now, just to see where the shit hits the fan…
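A hedged sketch of the save-per-file idea, assuming FluidDataSet’s `write` method (which stores the set as a JSON file; the path and counter variable are placeholders):

```supercollider
// Hypothetical: after each long file is analysed, dump its dataset to
// disk so a crash doesn't lose everything already computed.
~ds.write("/tmp/mfcc_%.json".format(~counter).standardizePath);
```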

btw I’ve moved the bug-chasing bit to a new thread to simplify reading for people who just want to download Alpha03

Thanks for persevering with this @spluta. Do you have crash logs? They would really help narrow down what’s making it die. I started doing a run with lots of points against a Debug build, but that’s the long way around :laughing:

The logs should be available via Console.app under ‘User Reports’. If you right-click on an item in the list, the context menu gives you a ‘Reveal in Finder’ option, and you can post / e-mail them as attachments.

Unfortunately it doesn’t seem to produce reports for the server just quitting. In SC it just says:

Server 'localhost' exited with exit code 0.
server 'localhost' disconnected shared memory interface

But no report in the console. Anyhow, the following (hacking the FluidStandardize help file) works better. Got through 5 gigs of files with no pooping. It throws a “Buffer UGen: no buffer data” error, but I don’t think this actually means anything.

I imagine there are some messages crossing paths with the FluidProcessSlices and it is tripping over itself.


//~sets = List.newClear(0);
~audioDir = File.realpath("path of a bunch o files");
~files = PathName(~audioDir).files;
~sets.do{|item| item.free};
~sets = Array.fill(~files.size, {|i| FluidDataSet(s, i.asSymbol)});
~doit = {|file, i|
	~leBeuf = Buffer.read(s, file.fullPath.postln, action: {|audioBuf|

		var mfccBuf = Buffer.new(s);
		var statsBuf = Buffer.new(s);

		{
			var trig, buf, count, mfcc, stats, rd, wr1, dsWr, endTrig;
			var chunkLen = 44100;
			trig = LocalIn.kr(1, 1);
			buf = LocalBuf(19, 1);
			count = PulseCount.kr(trig) - 1;
			mfcc = FluidBufMFCC.kr(audioBuf, count * chunkLen, chunkLen, features: mfccBuf, numCoeffs: 20, trig: trig);
			stats = FluidBufStats.kr(mfccBuf, 0, -1, 1, 19, statsBuf, trig: Done.kr(mfcc));
			rd = BufRd.kr(19, statsBuf, DC.kr(0), 0, 1); // frame 0 of the stats buffer holds the 19 means
			wr1 = Array.fill(19, {|i| BufWr.kr(rd[i], buf, DC.kr(i))});
			dsWr = FluidDataSetWr.kr(~sets[i], buf: buf, trig: Done.kr(stats));
			endTrig = count - (BufDur.kr(audioBuf) - 1).floor;
			LocalOut.kr(Done.kr(dsWr)); // retrigger to process the next chunk
			SendReply.kr(endTrig, '/fileDone'); // OSC path assumed: the original post was cut off here
			FreeSelf.kr(endTrig);
		}.play(s);
		//a.play(s, [\buffer, audioBuf, \mfccBuf, mfccBuf, \statsBuf, statsBuf, \set, ~sets[i]]);
	});
};

~counter = 0;
~doit.value(~files[~counter], ~counter);

~oscy = OSCFunc({
	~leBeuf.free; // free the previous file's audio once its DataSet is made
	~counter = ~counter + 1;
	if(~files.size > ~counter) {~doit.value(~files[~counter], ~counter)} {"we're done here".postln; ~oscy.free;};
}, '/fileDone'); // OSC path assumed: the original post was cut off here