Dict from an uuge dataset

I have a dataset with 1.25 million points. It loaded and PCA’d like a boss, but now I need to access the dictionary, and it just doesn’t want to give it to me. Seems to just freeze when I use .dump. Is there a good way to get this? Should I just save the json file and load it myself?

Sam

Give it a shot – this is essentially the same process through which you’d get the Dictionary anyway (i.e. via disk), but there are enough moving parts that I guess something is barfing.

Yeah. Loading the json takes about 10 seconds. Dumping becomes too much like the action being performed. I’m going to send you the json file I am using.

But I think the problem may be the efficiency of SC’s Dictionary. I can load the dict super fast, but traversing it is a whole different story.

When you say traversing, do you mean iterating? Pulling values out of a dict should be uber fast.

The problem seems to be with the language-side performance of our parseJSON method. I can load the JSON file you sent me ok using String.parseJSONFile – it’s not quick, but it gets there. However, via FluidManipulationClient.dump it seems to be taking its sweet time loading the file language-side: SC has been at 100% CPU for about half an hour now :cry:

Or it could be that there’s a problem in the recursion that FluidManipulationClient.parseJSON uses, as this still hasn’t completed after more than two hours.

Probably both. I can load the JSON like this:

~data = File.readAllString("/Volumes/Samsung_T5/SynthAnalysis/3_soundsAnalysis/normalizedPCA0.json").parseYAML;

But then stepping through the data set:

~keys = ~data["data"].keys.do{|key| ~data["data"][key].postln;}

prints about three entries per second and hangs the interpreter.

But as you say, it may not even be getting to that point.

That’s using the built-in JSON parser though. I didn’t have problems doing

d = "normalizedPCA0.json".parseJSONFile; 
d["data"].keysValuesDo{|k,v| (k->v).postln}; 

So, I wonder what’s choking SC in your example above (I’m not sure what the result of the assignment to ~keys will be there: was this originally a collect?).

Huh. keysValuesDo actually just traverses the internal array of the Dictionary (which I had no idea it had) instead of looking each value up by key. Crazy. Still, a Dictionary with that many elements is not wise.
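A rough analogy in Python rather than SC, just to illustrate the shape of the difference (function names are mine): iterating a dict's items walks the internal table directly, whereas fetching every value by key pays one hash lookup per element — the same distinction as keysValuesDo versus keys-then-index above.

```python
d = {i: i * i for i in range(100_000)}

def via_items():
    # keysValuesDo-style: walk the internal table directly,
    # no per-key lookup at all.
    total = 0
    for k, v in d.items():
        total += v
    return total

def via_keys():
    # keys-then-index style: one hash lookup per key.
    total = 0
    for k in d.keys():
        total += d[k]
    return total
```

Both return the same sum; on CPython the items() version is merely somewhat faster, while the SC gap reported above is far more dramatic (and the slow SC example also posts every value, which adds its own overhead).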

Where I am at with this is:

KDTree with 1.25 million elements actually works! But it is a bit pokey some of the time and great some of the time. As in, some queries come back quickly and others take a hiccup moment.

For me, exactness is not the issue; speed is. So, I have randomly placed the 1.25 million elements into 100 KDTrees. Then I just ping one of the KDTrees at random for my nearest neighbour, which will hopefully be “good enough” for timbre-mapping inputs to complex synths.
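A minimal sketch of that sharding idea, in Python for brevity rather than SC (all names here are mine, and each shard is brute-forced where the thread uses a KDTree per shard): randomly split the points into buckets, then answer each query from one randomly chosen bucket, trading exactness for speed.

```python
import random

def build_shards(points, n_shards=100, seed=0):
    # Randomly partition the dataset into n_shards buckets.
    # In the thread each bucket becomes its own KDTree; here
    # it stays a plain list for illustration.
    rng = random.Random(seed)
    shards = [[] for _ in range(n_shards)]
    for p in points:
        shards[rng.randrange(n_shards)].append(p)
    return shards

def approx_nearest(shards, query, rng=random):
    # Search only ONE randomly chosen non-empty shard (~1% of
    # the data with 100 shards), so the answer is approximate
    # by design.
    shard = rng.choice([s for s in shards if s])
    return min(shard,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
```

Each query touches roughly a hundredth of the points, so even a sluggish per-shard structure responds quickly; the price is that the true nearest neighbour may live in one of the 99 shards you didn't look at.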