@tedmoore and @jamesbradbury are hard at work developing considerably better docs, but I think you know that.
Some important aspects of lore that we’ve discussed at various points:
- absolute loss numbers aren’t all that important, and they don’t by themselves signify whether a network has converged. They’re especially meaningless for regressors without reference to the number of output dimensions and the range of each of those dimensions (see the loss-scale sketch after this list)
- accordingly, the way to work when adjusting the network for a task is to use a small number of iterations and repeated calls to `fit`, watching the loss curve over time (see the fit-in-bursts sketch below)
- even then, the litmus test of whether a network is working as desired is not how well it does in training but how well it does on unseen test data (and even then, you might just have a ‘clever horse’)
- if things aren’t converging then the learning rate is probably too big, though it could also be too small: an important initial step is to push the rate up until the loss curve becomes noisy, then back off from there (see the learning-rate sweep below)
- a batch size of 1 will likely result in noisier convergence (see the batch-size comparison below)
- the number of in / out dimensions isn’t the only important thing when working out whether you have enough data: what you need to think about is the number of unknown parameters the network is learning, which is a function of the dimensionality at each layer (see the parameter-count helper below)
- the quantity of data needs to be matched by its quality. Starting with a small number of raw descriptors is a perfectly sensible thing to do. Whether they exhibit patterns that match what you want the network to be sensitive to needs to be established empirically, i.e. look at the descriptors over time and see if some of them wiggle at the points where you need to see wiggling (see the descriptor plot at the end). Being parsimonious is also a good idea: stuffing the learning process with loads of derived stats is unlikely to help if those stats don’t make the things you want to capture any more obvious.
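
To make the first point concrete, a minimal numpy sketch (toy numbers, no particular library implied): a predictor that is off by the same 5% everywhere produces absolute losses that differ by orders of magnitude once the number of output dimensions and their ranges change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same 5% relative error in every case; only the output setup changes.
# Loss here sums squared error across output dimensions and averages
# across examples (implementations vary on this convention).
for n_dims, hi in [(1, 1.0), (10, 1.0), (10, 100.0)]:
    target = rng.uniform(0, hi, size=(200, n_dims))
    pred = target + 0.05 * hi * rng.standard_normal(target.shape)
    loss = np.mean(np.sum((pred - target) ** 2, axis=1))
    print(f"{n_dims} output dim(s), range 0..{hi:g}: loss = {loss:.4f}")
```

The predictor is equally ‘good’ in all three cases, which is why a raw loss number on its own tells you nothing.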
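On the fit-in-bursts workflow: the `fit` under discussion isn’t scikit-learn’s, but the same idea can be sketched with sklearn’s MLPRegressor as a stand-in (an assumption on my part, purely for illustration). Call `fit` repeatedly with a small `max_iter`, log the loss after each burst, and judge the result on held-out data rather than the training loss.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(500, 8))     # toy input descriptors
y = np.sin(X).sum(axis=1)                 # toy regression target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# warm_start=True makes each fit() call resume from the current weights.
mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=50,
                   warm_start=True, random_state=1)

losses = []
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)  # small max_iter warns
    for burst in range(20):               # 20 bursts of up to 50 iterations
        mlp.fit(X_train, y_train)
        losses.append(mlp.loss_)          # this is the curve to watch over time

print("loss after each burst:", [round(l, 4) for l in losses])
# The litmus test is unseen data, not the training loss:
print("R^2 on held-out test set:", round(mlp.score(X_test, y_test), 4))
```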
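On finding the learning rate: a sketch of the sweep described above, using the same sklearn stand-in. Push the rate up until the loss curve turns noisy or diverges, then back off.

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(500, 8))
y = np.sin(X).sum(axis=1)

for lr in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:  # 1.0 will very likely be too big
    mlp = MLPRegressor(hidden_layer_sizes=(16,), solver="sgd",
                       learning_rate_init=lr, max_iter=200, random_state=2)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")   # tiny/huge rates trigger warnings
        mlp.fit(X, y)
    # loss_curve_ holds the loss at every iteration: eyeball its tail for noise.
    tail = [round(l, 4) for l in mlp.loss_curve_[-5:]]
    print(f"lr={lr:g}: last losses {tail}")
```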
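On batch size: the same stand-in again, this time measuring how jittery the loss curve gets as the batch size shrinks.

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(500, 8))
y = np.sin(X).sum(axis=1)

for bs in [1, 32, 500]:                   # 500 = full batch for this toy set
    mlp = MLPRegressor(hidden_layer_sizes=(16,), solver="sgd",
                       learning_rate_init=1e-2, batch_size=bs,
                       max_iter=50, random_state=3)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        mlp.fit(X, y)
    # Rough proxy for noisiness: mean absolute change between iterations.
    curve = np.array(mlp.loss_curve_)
    jitter = np.abs(np.diff(curve)).mean()
    print(f"batch_size={bs:>3}: mean |step change| in loss = {jitter:.5f}")
```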
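On counting what the network actually has to learn: a small helper (hypothetical, not from any library) that counts the weights and biases of a fully connected network from its layer sizes. This, not the in/out dimensionality alone, is the number to weigh against how much data you have.

```python
def mlp_param_count(layer_sizes):
    """Learnable parameters in a fully connected network.

    Each consecutive layer pair contributes a weight matrix of
    n_in * n_out entries plus one bias per output unit.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 13 inputs -> 2 outputs sounds tiny, but a couple of modest hidden
# layers multiply the unknowns well past the in/out dimensionality:
print(mlp_param_count([13, 2]))          # 28
print(mlp_param_count([13, 64, 2]))      # 1026
print(mlp_param_count([13, 64, 64, 2]))  # 5186
```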
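Finally, on checking whether descriptors wiggle where you need them to: one way to do that empirical look, sketched with librosa and matplotlib (my choice of tools; ‘audio.wav’ is a placeholder path). Plot a few raw descriptors over time and check for movement at the moments you care about.

```python
import librosa
import matplotlib.pyplot as plt

# "audio.wav" is a placeholder: substitute your own material.
y, sr = librosa.load("audio.wav", sr=None)

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]
flatness = librosa.feature.spectral_flatness(y=y)[0]
times = librosa.times_like(centroid, sr=sr)

fig, axes = plt.subplots(2, 1, sharex=True)
axes[0].plot(times, centroid)
axes[0].set_ylabel("centroid (Hz)")
axes[1].plot(times, flatness)
axes[1].set_ylabel("flatness")
axes[1].set_xlabel("time (s)")
plt.show()

# If a descriptor stays flat across the events you care about, it is
# unlikely to help the network tell those events apart.
```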