Optimizing a neural network classifier (fluid.mlpclassifier)

So after putting out SP-Tools v0.8 I’ve got a bunch of things I want to put in place towards (eventually) getting ready for 1.0. Some of those things are improving/upgrading/shoring up core functionalities.

Towards that I want to add the option in the classification objects to use the (fancy pants) MLP classifier, instead of the (stupid and boring) KNN classifier.

Now I spent some time this evening messing around with different @activations and, more importantly, @hiddenlayers, and have gotten some pretty good results. What I came to realize, however, is that I’m not entirely sure what I’m looking at when trying to optimize things.

(results of tests so far below)

I know that when using a regressor it’s (typically) good to have a “funnel” or “arrow” shape, where you go something like:

[20d input] → @hiddenlayers 15 10 5 10 15 → [20d output]
or
[20d input] → @hiddenlayers 15 10 5 → [2d output]

Now in both of these cases there are separate input and output dimension counts that I’m building the network structure around. When doing this with a classifier, is your number of dimensions “the number” on both sides? For example, if I have 104d of data, would I create a “funnel” from/to that number? Or is there something else going on? And/or should I try something else?

I also remember @weefuzzy mentioning in a chat that going “deeper” (more layers) is better than going “wider” (more neurons).

So with that in mind, I created some typical recordings, trained up a few networks, and compared the results. There were some nice surprises in there!

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////

So for all of these I’m doing the SP-Tools recipe for classification (which was largely hashed out in this thread a couple years ago), which consists of 13 loudness-weighted MFCC coefficients (skipping the 0th coeff) along with min/mean/max/std and their 1st derivatives, giving me 104 dimensions of raw/un-normalized/un-standardized MFCC goodness.
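(If it helps to see the arithmetic, here’s a rough Python sketch of how that 104 comes about, with librosa standing in for the fluid.* objects and the loudness weighting skipped, so treat the details as approximate: 13 coefficients × 4 stats × 2 (raw + 1st derivative) = 104.)

```python
import numpy as np
import librosa

# Rough stand-in for the SP-Tools descriptor recipe (librosa is an assumption;
# the real patch uses fluid.bufmfcc~ etc., and loudness weighting is omitted).
def describe(audio, sr):
    # Ask for 14 coefficients so we can drop the 0th one,
    # leaving 13 "spectral shape" coefficients per frame.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=14)[1:]  # (13, frames)
    deriv = np.diff(mfcc, axis=1)                               # 1st derivative

    def stats(x):
        # min / mean / max / std per coefficient -> (13, 4)
        return np.stack([x.min(1), x.mean(1), x.max(1), x.std(1)], axis=1)

    # 13 coeffs * 4 stats * 2 (raw + derivative) = 104 dimensions
    return np.concatenate([stats(mfcc), stats(deriv)], axis=1).flatten()

sr = 22050
noise = np.random.default_rng(0).normal(size=sr * 2)  # 2s of noise as dummy audio
print(describe(noise, sr).shape)                      # (104,)
```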

I then created 461 entries and manually labelled them with 10 classes. Again, this would be the core use case for SP-Tools. I then created a separate recording and labelled those to test the classifier against.

Here are the results:

[Four screenshots of test results, 2023-04-18]

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////

As you can see, with @activation 3 it converges super fast and gets really low in terms of loss. I left it running on a loop for about a minute and stopped all of them at the same point (loss of 0.0004). I tried other @activations, but those either did nothing at all or converged very poorly (or rather, not as well).

One thing that jumped out at me right away is that the MLP classifier is faster. Typically twice as fast, and also way more consistent, with the min/max durations being very close to the average. You can see the KNN one sometimes has “slow” spikes up to 0.5ms. I don’t know if that’s to do with the k-d tree structure, with those specific queries taking “the long route”, but it’s interesting to see how fast this can be.

Secondly, the MLP performed as well as the KNN (at the worst end of things) and most of the time loads better, with @hiddenlayers 73 42 10 42 73 coming in as a quite magical recipe here.

/////////////////////////////////////////////////////////////////////////////////////////////////////////////////

SO

Is this “funnel shape” a good assumption here? Should I try some other shapes/scales/numbers? You can see from my tests above that I tried a few different directions/approaches, but I wanted to check in here as I’m literally shooting in the dark with these tests.

Oh, somewhat related: I could’ve sworn there was a thing where you could get the weights or distance of the class that you matched? Like, say you have trained two classes (A, B) and then give it new hits, where you could get the matched class (“A”) but also a weight of “how A” it was?

So if I’ve trained a “dark sound” and a “bright sound” as A and B respectively, then as I move from dark to bright sounds it would give me:

A 1.0 / B 0.0 = class matched: A
A 0.75 / B 0.25 = class matched: A
A 0.51 / B 0.49 = class matched: A
A 0.49 / B 0.51 = class matched: B
A 0.25 / B 0.75 = class matched: B
A 0.0 / B 1.0 = class matched: B

I had a look through the help/reference files and couldn’t find anything. Am I misremembering and/or thinking of a different process/object?

Tried a few more oddball shapes here, with much worse results. So I guess the “funnel” is the way to go?

[Three screenshots of test results, 2023-04-19]

Perhaps worth mentioning that when trying to fit this last one (with the super long network structure) I got an insta-crash (Max just disappeared, no pinwheel that I remember).

crash report:
classifier crash.zip (17.9 KB)

relevant bit (I think, it’s from the crashed thread at least):

12  fluid.libmanipulation         	       0x1320628dc Eigen::DenseStorage<double, -1, -1, -1, 1>::resize(long, long, long) + 80
13  fluid.libmanipulation         	       0x1322763d8 Eigen::internal::product_evaluator<Eigen::Product<Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> const>, Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> >, 0>, 8, Eigen::DenseShape, Eigen::DenseShape, double, double>::product_evaluator(Eigen::Product<Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> const>, Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> >, 0> const&) + 108
14  fluid.libmanipulation         	       0x132276048 void Eigen::internal::call_dense_assignment_loop<Eigen::Matrix<double, -1, -1, 0, -1, -1>, Eigen::Transpose<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Product<Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> const>, Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> >, 0> const, Eigen::Replicate<Eigen::Matrix<double, -1, 1, 0, -1, 1>, 1, -1> const> >, Eigen::internal::assign_op<double, double> >(Eigen::Matrix<double, -1, -1, 0, -1, -1>&, Eigen::Transpose<Eigen::CwiseBinaryOp<Eigen::internal::scalar_sum_op<double, double>, Eigen::Product<Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> const>, Eigen::Transpose<Eigen::Matrix<double, -1, -1, 0, -1, -1> >, 0> const, Eigen::Replicate<Eigen::Matrix<double, -1, 1, 0, -1, 1>, 1, -1> const> > const&, Eigen::internal::assign_op<double, double> const&) + 40
15  fluid.libmanipulation         	       0x132275b34 fluid::algorithm::NNLayer::forward(Eigen::Ref<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >, Eigen::Ref<Eigen::Matrix<double, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >) const + 136
16  fluid.libmanipulation         	       0x132275880 fluid::algorithm::MLP::forward(Eigen::Ref<Eigen::Array<double, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >, Eigen::Ref<Eigen::Array<double, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1> >, long, long) const + 344
17  fluid.libmanipulation         	       0x1322743bc fluid::algorithm::SGD::train(fluid::algorithm::MLP&, fluid::FluidTensorView<double, 2ul>, fluid::FluidTensorView<double, 2ul>, long, long, double, double, double) + 2060
18  fluid.libmanipulation         	       0x13229ec8c fluid::client::mlpclassifier::MLPClassifierClient::fit(fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>) + 1748
19  fluid.libmanipulation         	       0x1322b29f8 auto fluid::client::makeMessage<fluid::client::MessageResult<double>, fluid::client::mlpclassifier::MLPClassifierClient, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const> >(char const*, fluid::client::MessageResult<double> (fluid::client::mlpclassifier::MLPClassifierClient::*)(fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>))::'lambda'(fluid::client::mlpclassifier::MLPClassifierClient&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>)::operator()('lambda'(fluid::client::mlpclassifier::MLPClassifierClient&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>), fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>) const + 96
20  fluid.libmanipulation         	       0x1322b27e0 fluid::client::Message<auto fluid::client::makeMessage<fluid::client::MessageResult<double>, fluid::client::mlpclassifier::MLPClassifierClient, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const> >(char const*, fluid::client::MessageResult<double> (fluid::client::mlpclassifier::MLPClassifierClient::*)(fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>))::'lambda'(fluid::client::mlpclassifier::MLPClassifierClient&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>), fluid::client::MessageResult<double>, fluid::client::mlpclassifier::MLPClassifierClient, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const> >::operator()(auto fluid::client::makeMessage<fluid::client::MessageResult<double>, fluid::client::mlpclassifier::MLPClassifierClient, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const> >(char const*, fluid::client::MessageResult<double> (fluid::client::mlpclassifier::MLPClassifierClient::*)(fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>))::'lambda'(fluid::client::mlpclassifier::MLPClassifierClient&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>), fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>) const + 80
21  fluid.libmanipulation         	       0x1322b255c _ZNK5fluid6client10MessageSetINSt3__15tupleIJNS0_7MessageIZNS0_11makeMessageINS0_13MessageResultIdEENS0_13mlpclassifier19MLPClassifierClientEJNS0_15SharedClientRefIKNS0_7dataset13DataSetClientEEENSA_IKNS0_8labelset14LabelSetClientEEEEEEDaPKcMT0_FT_DpT1_EEUlRS9_SE_SI_E_S7_S9_JSE_SI_EEENS4_IZNS5_INS6_IvEES9_JSE_NSA_ISG_EEEEESJ_SL_SR_EUlSS_SE_SW_E_SV_S9_JSE_SW_EEENS4_IZNS5_INS6_INS2_12basic_stringIcNS2_11char_traitsIcEENS2_9allocatorIcEEEEEES9_JNS2_10shared_ptrIKNS0_13BufferAdaptorEEEEEESJ_SL_SR_EUlSS_S19_E_S15_S9_JS19_EEENS4_IZNS5_ISV_NS0_10DataClientINS8_17MLPClassifierDataEEEJEEESJ_SL_SR_EUlRS1E_E_SV_S1E_JEEENS4_IZNS0_11makeMessageINS6_IlEES1E_JEEESJ_SL_MSM_KFSN_SP_EEUlS1F_E_S1J_S1E_JEEES1N_NS4_IZNS5_INS6_INS3_IJNSZ_IcS11_N9foonathan6memory13std_allocatorIcNS_17FallbackAllocatorEEEEENS_11FluidTensorIlLm1EEEllddldEEEEES9_JS14_EEESJ_SL_SR_EUlSS_S14_E_S1X_S9_JS14_EEENS4_IZNS5_IS15_S1E_JEEESJ_SL_SR_EUlS1F_E_S15_S1E_JEEENS4_IZNS5_ISV_S1E_JS14_EEESJ_SL_SR_EUlS1F_S14_E_SV_S1E_JS14_EEES1Z_EEEE6invokeILm0EJRNS0_24NRTSharedInstanceAdaptorIS9_E12SharedClientERSE_RSI_EEEDcDpOT0_ + 144
22  fluid.libmanipulation         	       0x1322b1ef8 decltype(auto) fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> >::invoke<0ul, fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> >, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>&, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>&>(fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> >&, fluid::client::SharedClientRef<fluid::client::dataset::DataSetClient const>&, fluid::client::SharedClientRef<fluid::client::labelset::LabelSetClient const>&) + 360
23  fluid.libmanipulation         	       0x1322b17a0 void fluid::client::FluidMaxWrapper<fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> > >::invokeMessageImpl<0ul, 0ul, 1ul>(fluid::client::FluidMaxWrapper<fluid::client::NRTThreadingAdaptor<fluid::client::NRTSharedInstanceAdaptor<fluid::client::mlpclassifier::MLPClassifierClient> > >*, symbol*, long, atom*, std::__1::integer_sequence<unsigned long, 0ul, 1ul>) + 172

Funnel is good for error reduction. I will interact more when I’m out of the admin + premiere woods, but funnel is the autoencoder principle: you choke the dataset so it has to find the best overall explanation. Check the learn platform for examples and further articles on this concept.
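(A quick toy sketch of that idea, using sklearn’s MLPRegressor as a stand-in autoencoder, just to show the shape: the network has to reproduce its 20d input through a 5-neuron bottleneck, so it is forced to find a compact explanation of the data.)

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy autoencoder on random data: input == target, with a narrow middle layer.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

ae = MLPRegressor(hidden_layer_sizes=(15, 10, 5, 10, 15),  # the "funnel"
                  activation="tanh", max_iter=2000)
ae.fit(X, X)      # reconstruct the input: the 5-neuron choke point does the work
print(ae.loss_)   # reconstruction loss after training
```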


I get that the funnel is good, but I wasn’t sure what the “from” and “to” are inside a classifier (rather than a regressor). So I wasn’t sure if I was funnelling between my total number of dimensions in fluid.mlpclassifier or something different.

Internally there’s nothing to distinguish the MLP classifier from the regressor until you get to the very end: the effective number of output dimensions is the number of classes in your output set, and the difference between the two is that the classifier has a final encoding stage that just selects the most activated output dimension as its guess for the class. So in this case, you’re going from 104 dimensions to 10.
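To make that concrete, here is a toy sketch of just that final stage (numpy only; the class names and activation values are made up): the classifier is, in effect, the regressor’s 10 output activations plus a pick-the-most-activated step.

```python
import numpy as np

# Ten hypothetical class labels (invented for the example).
classes = ["kick", "snare", "hat", "tom", "ride",
           "crash", "rim", "clap", "shaker", "cowbell"]

# Pretend these are the 10 output-layer activations for one 104d input point,
# i.e. what the underlying regressor hands back before any decision is made.
activations = np.array([0.02, 0.05, 0.81, 0.01, 0.02,
                        0.03, 0.01, 0.02, 0.02, 0.01])

# The classifier's extra encoding stage is just: pick the most activated output.
print(classes[int(np.argmax(activations))])  # -> "hat"
```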

As for the shape of the network, it has some bearing on how you’re asking the problem to be solved, but this is really hard to reason about in the abstract. Often a ‘funnel’ can be profitable, as it amounts to enforcing some dimension reduction such that the network might ‘learn’ some ‘higher-level’ regularities in the input-output mapping. Not always though: a neural network training algorithm is going to try and reproduce the (supposedly) ground-truth labels by whatever means necessary, so a given network shape could yield perfectly wonderful performance against previously unseen data but still have ‘learned’ this mapping that has no bearing on how the problem ‘ought’ to have been solved.

In contexts where the NN only needs to work within the limited contexts of someone’s current needs, that’s not such a problem, but in other contexts people need to be able to try and interrogate the network and make sense of what it seems to have latched onto (which is hard). For instance, where it misclassifies, one can inspect what the NN is doing internally in those cases, and whether it’s nearly right, or vastly wrong, or whether the ‘features’ in the hidden layers are amenable to any human interpretation.

Meanwhile, bear in mind once more that all of these MLP hyperparameters interact (unpredictably), so a change in topology could well demand a very different learning rate in order to converge (and so on). And, furthermore, that the optimal hyperparameters are also a function of the training data as well as observed performance against an unseen test set (I can’t emphasise that enough if you’re concerned about portability of performance).

The belt and braces way to do this is (a) to automate the search of the hyperparameter space using something like grid search, and (b) to try and increase the robustness with respect to overfitting by ‘cross-validating’ against different partitions of the training data. Currently, to do this, you’d need to step into python (or do some really tedious patching).
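If you do step into Python, a minimal sketch of that belt-and-braces approach might look like the following (sklearn assumed; the grid values are placeholders and the arrays are random stand-ins for your exported dataset and labels):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# X: (461, 104) descriptors exported from the fluid.dataset~,
# y: (461,) class labels. Random stand-ins here just so the sketch runs.
rng = np.random.default_rng(0)
X = rng.normal(size=(461, 104))
y = rng.integers(0, 10, size=461)

param_grid = {
    "hidden_layer_sizes": [(73, 42, 10, 42, 73),
                           (90, 70, 50, 30, 10),
                           (35, 30, 25, 20, 15)],
    "activation": ["tanh", "relu"],
    "learning_rate_init": [1e-2, 1e-3],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```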

Having this stuff available in the CCE directly would be very useful, albeit unglamorous.

Re the performance difference you see between the MLP and KNN: it’s not cast in stone, not least because the computational load of the MLP depends in part on the size of the network, but: the MLP is (a) deterministic at query time and (b) partly as a consequence, much (much) easier for a compiler to optimise, especially as it hinges on standard operations like matrix multiplies that will have been very heavily optimised in the implementing library. What’s more, the MLP is likely to be much more cache-friendly (i.e., data are nearby in memory relative to when they’re needed). KD-trees aren’t ‘naturally’ cache-friendly, although I did experiment with some attempted improvements on ours at one point (the question mark being, as ever, whether the performance gains were significant enough to justify making the code weirder).
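To make the ‘mostly matrix multiplies’ point concrete, here is a toy forward pass in numpy (random weights, shapes matching a 104 → 73 42 10 42 73 → 10 network, output nonlinearity glossed over): every query runs exactly the same fixed sequence of operations over the same contiguous arrays, which is why the per-query timing is so consistent.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [104, 73, 42, 10, 42, 73, 10]  # input, hidden layers, classes
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=b) for b in sizes[1:]]

def forward(x):
    # The same sequence of matrix multiplies for every query,
    # regardless of what the input point looks like.
    for W, b in zip(weights, biases):
        x = np.tanh(x @ W + b)
    return x

point = rng.normal(size=104)
print(int(np.argmax(forward(point))))  # index of the most activated output
```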

Re the crash: cheers, I’ll take a gander when I’m back on the coding horse one day, unless someone beats me to it. If you file a GH issue, all the better.


Super useful to know! That was the main thing I was unsure about. I’m surprised I got the best performance (with the given material) with the 73 42 10 42 73 funnel shape, rather than the 90 70 50 30 10 arrow, but I’ll experiment more with that directional shape.

Short of the answer likely being “it depends”, would it be a stupid idea to have the @hiddenlayers be dynamically generated, by doing something like ([input dimensions] - [number of trained classes]) / 6(?) to generate incremental steps in the network structure that point towards the total number of classes? (Sketched below.)

In this use case the input dimensions will always be the 104d of MFCC “stuff”, but the number of classes is unknown, as someone may train just 3, or 8, or 10, etc. (or even more, I suppose).
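Something like this is what I’m imagining (Python just to sketch it; the number of layers and the rounding are totally arbitrary):

```python
import numpy as np

def make_hiddenlayers(n_input=104, n_classes=10, n_layers=5):
    # Step linearly from the input dimensions down towards the class count,
    # dropping the endpoints (those are the input/output layers themselves).
    steps = np.linspace(n_input, n_classes, n_layers + 2)[1:-1]
    return [int(round(s)) for s in steps]

print(make_hiddenlayers(104, 10))  # -> [88, 73, 57, 41, 26]
print(make_hiddenlayers(104, 8))   # -> [88, 72, 56, 40, 24]
```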

Yup, doesn’t sound appetizing! We’ve talked about this tons, but tweaking hyperparameters, particularly when you don’t really understand them, feels like being a monkey at a typewriter, with presumably similar amounts of confidence (and efficacy). I’ve really just followed the (scattered) “rules of thumb” I’ve found here and there, and tend to go with those, or do a few tests to optimize between the options, but beyond that I have no idea what I’m doing or trying to do.

Yup, will make a proper GH issue with the crash and steps.

Ran a few more tests with this in mind and got mixed results:

[Three screenshots of test results, 2023-04-22]

It seems like the less “deep” option performed well here, though I don’t know whether that can be extrapolated out.

I did get another crash when trying to run a really big network (added the new crash report to the git issue) but I guess I’ll stay away from really big network structures for the time being to be safe.

edit:

a few more:
[Three more screenshots of test results, 2023-04-22]

Bumping this as I’m curious why I’m getting worse results from the MLP classifier than I was before.

As a TL;DR, I’ve been doing a bunch of work/testing on creating a good interpolator between classes (outlined in this thread), and where that ended up was that using PCA to reduce the number of MFCC dimensions from 104d to ~20d (i.e. retaining 95% of the variance) gave the best results when “interpolating” between the classes.
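(For reference, the PCA step is the “keep enough components to retain 95% of the variance” thing, which in Python/sklearn terms looks something like the sketch below, with random data standing in for the actual MFCC dataset.)

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the (463, 104) MFCC dataset (mixed so the columns correlate).
rng = np.random.default_rng(0)
X = rng.normal(size=(463, 104)) @ rng.normal(size=(104, 104))

pca = PCA(n_components=0.95)  # keep as many components as needed for 95% variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```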

That led me back to testing the overall accuracy of PCA’d MFCCs in the context of a vanilla classifier, and the results were better than what I had before!

I then assumed that the MLP classifier I was experimenting with back during the tests in this thread would work even better, only to be sorely disappointed with the results…

//////////////////////////////////////////////////////////////////////////////////

After experimenting with the network structure a ton the best results I got varied from “as good as KNN” to “only slightly worse than KNN”.

A range of tests/comparisons:

86.11% - KNN robust’d mfcc (104d)
86.11% - MLP robust’d mfcc (104d)

97.22% - KNN pca’d mfcc (20d)
95.83% - MLP pca’d mfcc (20d)

87.50% - KNN pca’d mfcc (24d)
80.55% - MLP pca’d mfcc (24d)

Here’s what some of the underlying data looks like as a point of reference.

The raw 104d MFCCs:

fluid.dataset~: DataSet #0classifier:
rows: 463 cols: 104
0 31.557 -10.553 13.9 … 7.7752 3.2047 2.76
1 30.688 -5.6341 9.6338 … 6.4456 6.956 1.6969
10 32.626 -9.4391 13.778 … 2.7359 6.6298 2.4849

97 36.217 -0.37663 10.068 … 7.0134 4.2217 3.6786
98 31.006 2.657 2.4032 … 5.9144 10.212 2.5373
99 37.892 -6.7152 7.5238 … 6.9213 7.0426 3.8227

The PCA’d MFCCs:
fluid.dataset~: DataSet #0classifier:
rows: 463 cols: 24

0 -42.224 3.9132 -5.0367 … -2.6143 0.00069366 -3.7621
1 -51.06 -1.564 7.2985 … -2.6545 0.67618 -4.5783
10 -44.454 5.2509 -6.4867 … 4.7932 1.2728 -11.723

97 -65.193 -0.25055 -5.8954 … 0.74348 -2.6644 5.0952
98 -52.332 -8.788 2.3775 … 1.0905 -1.8463 -6.4464
99 -65.682 2.4534 -12.136 … -2.8918 3.4572 0.55504

So somewhat different ranges, but roughly similar (+/- 50, and fairly bipolar), so I thought it would work the same.

I also experimented with robustscaling, with varying results (never “good”, sometimes “almost as good”).
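(By robustscaling I mean centering each dimension on its median and scaling by the IQR. A quick sklearn sketch of that idea before the MLP, with random stand-ins for the PCA’d data and labels, in case it’s relevant:)

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.neural_network import MLPClassifier

# Random stand-ins for the PCA'd dataset (roughly +/- 50, like the dump above).
rng = np.random.default_rng(0)
X = rng.normal(size=(463, 20)) * 50
y = rng.integers(0, 10, size=463)

# Centre each dimension on its median and scale by its IQR before fitting.
X_scaled = RobustScaler().fit_transform(X)

clf = MLPClassifier(hidden_layer_sizes=(35, 30, 25, 20, 15), max_iter=2000)
clf.fit(X_scaled, y)
print(clf.score(X_scaled, y))  # training accuracy on the dummy data
```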

So based on the discussion with @weefuzzy here a while back, I was aiming for a funnel shape, going from ~40d → ~10d, or ~20d → ~10d. Here are some of the structures I tested out:
@hiddenlayers 89 74 59 44 29 (the structure I found worked the best last year)
@hiddenlayers 19 17 15 13 11 (trying to copy the same movement with lower dimensions)
@hiddenlayers 35 30 25 20 15 (this seemed to work the best)

I also tried much smaller networks (@hiddenlayers 15), ones that go up/down/up, or down/up/down, tried different @activations etc…, and got no joy.

//////////////////////////////////////////////////////////////////////////////////

- Is there something about PCA’d data that is less friendly to MLP-ing?
- Are these reasonable enough structures? (e.g. stepping down from the input dimensions to the output dimensions)
- Should I train the loss for a really long time? (in all the cases above it goes down to <0.0001 pretty quickly, and I let it run for 2-3 minutes just to be safe)
- Should I experiment more with robustscaling and/or some other normalization before MLP-ing?