Optimizing a neural network classifier (fluid.mlpclassifier)

Bumping this as I’m curious why I’m getting worse results from MLP classifier than I was before.

As a TL;DR: I’ve been doing a bunch of work/testing on creating a good interpolator between classes (outlined in this thread), and where that ended up going was that using PCA to reduce the number of MFCC dimensions from 104d to ~20d (i.e. retaining 95% of the variance) gave the best results when “interpolating” between the classes.
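For anyone wanting to poke at the same idea outside of Max, here’s a rough Python/scikit-learn sketch of that variance-retention step. It’s just a stand-in to show the idea (placeholder data, not the actual fluid.pca workflow):

```python
# Rough sketch of the PCA step in scikit-learn terms, assuming the
# 104d MFCC stats live in a (463 x 104) array. Placeholder data only.
import numpy as np
from sklearn.decomposition import PCA

mfccs = np.random.rand(463, 104)            # stand-in for the real 104d MFCC stats

pca = PCA(n_components=0.95)                # keep enough components for 95% of the variance
reduced = pca.fit_transform(mfccs)

print(reduced.shape)                        # lands around (463, ~20) on the real data
print(pca.explained_variance_ratio_.sum())  # >= 0.95
```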

That led me back to testing the overall accuracy of PCA’d MFCC in the context of a vanilla classifier and the results were better than what I had before!

I then assumed that the MLP classifier I was experimenting with back during the tests in this thread would work even better, only to be sorely disappointed with the results…

//////////////////////////////////////////////////////////////////////////////////

After experimenting with the network structure a ton, the best results I got ranged from “as good as KNN” to “only slightly worse than KNN”.

A range of tests/comparisons:

86.11% - KNN robust’d mfcc (104d)
86.11% - MLP robust’d mfcc (104d)

97.22% - KNN pca’d mfcc (20d)
95.83% - MLP pca’d mfcc (20d)

87.50% - KNN pca’d mfcc (24d)
80.55% - MLP pca’d mfcc (24d)
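(Those percentages all come from the Max patch, but in scikit-learn terms the comparison is roughly along these lines — a sketch only, where the split size, neighbour count, and data are placeholders rather than my actual settings:)

```python
# Ballpark of how the accuracy comparison works, as a scikit-learn sketch
# (stand-ins for the KNN classifier and fluid.mlpclassifier). Placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

features = np.random.rand(463, 20)             # placeholder for the PCA'd MFCCs
labels = np.random.randint(0, 4, size=463)     # placeholder class labels

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.15, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
mlp = MLPClassifier(max_iter=2000).fit(X_train, y_train)

print("KNN accuracy:", knn.score(X_test, y_test))
print("MLP accuracy:", mlp.score(X_test, y_test))
```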

Here’s what some of the underlying data looks like as a point of reference.

The raw 104d MFCCs:

fluid.dataset~: DataSet #0classifier:
rows: 463 cols: 104
0 31.557 -10.553 13.9 … 7.7752 3.2047 2.76
1 30.688 -5.6341 9.6338 … 6.4456 6.956 1.6969
10 32.626 -9.4391 13.778 … 2.7359 6.6298 2.4849

97 36.217 -0.37663 10.068 … 7.0134 4.2217 3.6786
98 31.006 2.657 2.4032 … 5.9144 10.212 2.5373
99 37.892 -6.7152 7.5238 … 6.9213 7.0426 3.8227

The PCA’d MFCCs:
fluid.dataset~: DataSet #0classifier:
rows: 463 cols: 24

0 -42.224 3.9132 -5.0367 … -2.6143 0.00069366 -3.7621
1 -51.06 -1.564 7.2985 … -2.6545 0.67618 -4.5783
10 -44.454 5.2509 -6.4867 … 4.7932 1.2728 -11.723

97 -65.193 -0.25055 -5.8954 … 0.74348 -2.6644 5.0952
98 -52.332 -8.788 2.3775 … 1.0905 -1.8463 -6.4464
99 -65.682 2.4534 -12.136 … -2.8918 3.4572 0.55504

So the ranges are somewhat different, but roughly similar (±50, and fairly bipolar), so I thought it would behave the same.

I also experimented with robustscaling, with varying results (never “good”, sometimes “almost as good”).
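In scikit-learn terms, that robustscaling step would be something like this (again just an illustrative stand-in for the FluCoMa object, with placeholder data):

```python
# Illustrative stand-in for the robustscaling step, applied to the
# PCA'd data before classification. Placeholder data only.
import numpy as np
from sklearn.preprocessing import RobustScaler

reduced = np.random.rand(463, 20)       # placeholder for the PCA'd MFCCs

scaler = RobustScaler()                 # centres on the median, scales by the IQR
scaled = scaler.fit_transform(reduced)
```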

So based on the discussion with @weefuzzy here a while back, I was aiming for a funnel shape, going from ~40d → ~10d, or ~20d → ~10d. Here are some of the structures I tested out:
@hiddenlayers 89 74 59 44 29 (the structure I found worked the best last year)
@hiddenlayers 19 17 15 13 11 (trying to copy the same movement with lower dimensions)
@hiddenlayers 35 30 25 20 15 (this seemed to work the best)

I also tried much smaller networks (@hiddenlayers 15), ones that go up/down/up or down/up/down, and different @activations, etc., and got no joy.
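For reference, the 35 30 25 20 15 funnel written out as a scikit-learn MLP looks roughly like this, just to make the shape concrete. The parameter names/values here are sklearn’s and are placeholders, not the fluid.mlpclassifier settings I actually used:

```python
# The 35 30 25 20 15 funnel as a scikit-learn MLP (sketch only).
import numpy as np
from sklearn.neural_network import MLPClassifier

scaled = np.random.rand(463, 20)               # placeholder for scaled/PCA'd MFCCs
labels = np.random.randint(0, 4, size=463)     # placeholder class labels

mlp = MLPClassifier(hidden_layer_sizes=(35, 30, 25, 20, 15),
                    activation="relu",         # analogous to the activation attribute
                    max_iter=5000)
mlp.fit(scaled, labels)
```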

//////////////////////////////////////////////////////////////////////////////////

-Is there something about PCA’d data that is less friendly to MLP-ing?
-Are these reasonable enough structures? (e.g. stepping down from input dimensions to output dimensions)
-Should I train the loss for a really long time? (in all the cases above, it goes down to <0.0001 pretty quickly and I let it run for like 2-3 minutes just to be safe)
-Should I experiment more with robustscaling and/or some other normalization before MLP-ing?