Well, I wanted to look at the numbers to see if it was “correct”, and how often that was the case, rather than just trying an arbitrary set of descriptors and going “hmm, I guess that’s slightly better?”. Mainly cuz each new analysis pipeline still takes me ages to set up, so I can’t easily pivot between radically different sets of descriptors/stats to compare via plotting.
So in your case here, if the inputs are the stats of the raw descriptors and the outputs are those plus the stats of the derivatives, that mapping might be very hard for the neural network to learn.
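To make that concrete, here’s a minimal sketch of that setup as a regression problem, assuming the corpus has already been analyzed offline (the array names, shapes, and network size are all hypothetical stand-ins, not anyone’s actual pipeline):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical pre-computed analyses, one row per corpus entry:
# short_stats: stats of the raw descriptors over the short window
# long_stats:  those plus stats of the derivatives over the long window
short_stats = np.random.rand(500, 10)   # stand-in for real analyses
long_stats = np.random.rand(500, 20)

# Standardize both ends so no single descriptor dominates the loss
in_scaler = StandardScaler().fit(short_stats)
out_scaler = StandardScaler().fit(long_stats)

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
net.fit(in_scaler.transform(short_stats), out_scaler.transform(long_stats))

# Short-window stats in, guessed long-window stats out
guess = out_scaler.inverse_transform(
    net.predict(in_scaler.transform(short_stats[:1])))
```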
In this case they’d be completely different descriptors/stats. What I’m thinking, at the moment, is running a fresh PCA/UMAP on each pool of descriptors. So it may be that the dimensions have no overlap whatsoever between the two ends. In addition, the longer analysis may include more initial dimensions too.
So the short window may have mean of loudness, std of MFCC3, pitch confidence, etc., and the long window may have mean of deriv of peak, time centroid, skewness of pitch, etc.
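In code, that plan would look something like this: a fresh reduction per pool, where the output axes of the two reductions are unrelated (random arrays stand in for the real descriptor pools here; UMAP from umap-learn would slot in the same way):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical descriptor pools, one row per corpus entry
short_pool = np.random.rand(500, 12)  # mean(loudness), std(MFCC3), pitch conf, ...
long_pool = np.random.rand(500, 30)   # mean(deriv(peak)), time centroid, skew(pitch), ...

# A fresh reduction per pool: the output axes of the two PCAs are
# unrelated, so nothing guarantees any overlap between the two ends
short_pca = PCA(n_components=4).fit(short_pool)
long_pca = PCA(n_components=8).fit(long_pool)  # longer analysis keeps more dims

short_reduced = short_pca.transform(short_pool)
long_reduced = long_pca.transform(long_pool)
```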
The problem is that as soon as you feed it sounds outside the training corpus, the neural network will give poor predictions.
My original idea was to just have a classifier rather than a regressor, such that I would have an accurate analysis of the input sound, and then “some kind of idea” of what the rest of the sound might be, based on the nearest match in the pre-analyzed corpus.
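A plain nearest-neighbour lookup already gets you that behaviour: the accurate short analysis goes in, and the matched corpus entry’s longer analysis comes back out as the “some kind of idea”. A minimal sketch, again with hypothetical stand-in arrays:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical pre-analyzed corpus: short-window stats (measurable in
# realtime) and long-window stats (what we'd like to know about)
short_stats = np.random.rand(500, 10)
long_stats = np.random.rand(500, 20)

index = NearestNeighbors(n_neighbors=1).fit(short_stats)

def rest_of_sound(short_analysis):
    # Accurate analysis of the input in, nearest corpus match's
    # long-window stats out: "some kind of idea" of what follows
    _, idx = index.kneighbors(short_analysis.reshape(1, -1))
    return long_stats[idx[0, 0]]

guess = rest_of_sound(np.random.rand(10))
```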
If you have the longer analysis window, there’s no need to include the short analysis window! (Since you were trying to predict the longer one anyway!?) However, what could be useful is to use the two in coordination, as in the attack portion plus a longer portion, as you mention above (see the sketch below).
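One simple way to coordinate them, assuming both analyses exist per corpus entry, would be to index the corpus on a weighted concatenation of the two, so a query can favour the attack or the overall morphology (the weight and shapes here are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

attack_stats = np.random.rand(500, 10)  # short attack-window analyses
long_stats = np.random.rand(500, 20)    # longer-window analyses

# Scaling one block up makes it count for more in the distance metric
attack_weight = 2.0
combined = np.hstack([attack_stats * attack_weight, long_stats])
index = NearestNeighbors(n_neighbors=1).fit(combined)
```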
The whole reason for this is that I want to do it in realtime, and don’t have 100ms to wait around for the longer analysis window. So the goal is to take a meager 256-sample analysis and get some better info out of it than I can presently.