In this post I’ll briefly introduce some update tricks for training your ML models. Then I will present my empirical findings in a linked NOTEBOOK that uses a 2-layer Neural Network on the CIFAR dataset.
I assume you know at least what Stochastic Gradient Descent (SGD) is. If you don’t, you can follow this tutorial. Besides plain SGD, I’ll cover some improvements of the SGD rule that result in better performance and faster convergence.
SGD is basically a way of optimizing your model parameters based on the gradient information of your loss function (Mean Squared Error, Cross-Entropy Error, ...). We can formulate it as follows:
\[w(t) = w(t-1) - \epsilon \, \Delta w(t)\]

where $w$ is the model parameter, $\epsilon$ is the learning rate, and $\Delta w(t)$ is the gradient at time $t$.
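To make the notation concrete, here is a minimal NumPy sketch of this update rule (the function name, learning rate, and toy values are my own, not from the notebook):

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """Vanilla SGD: step against the gradient, scaled by the learning rate."""
    return w - lr * grad

# Toy usage with a two-dimensional parameter vector.
w = np.array([0.5, -0.3])
grad = np.array([0.1, -0.2])
w = sgd_update(w, grad)
```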
SGD by itself depends solely on the given instance (or mini-batch of instances) of the present iteration. Therefore, it tends to take unstable update steps from iteration to iteration, so convergence takes more time, or your model may even get stuck in a poor local minimum.
To mitigate this problem, we can use the Momentum idea (a closely related variant, Nesterov momentum, also appears in the literature). Intuitively, what momentum does is keep a history of the previous update steps and combine this information with the next gradient step, so that the resulting updates stay stable and consistent with the optimization history. It basically prevents chaotic jumps. We can formulate the Momentum technique as follows:
\[v(t) = \alpha \, v(t-1) - \epsilon \frac{\partial E}{\partial w}(t)\]

(update the velocity history with the new gradient)

\[\Delta w(t) = v(t)\]

(the weight change is equal to the current velocity)

Here $\alpha$ is the momentum coefficient (0.9 is a reasonable value to start with), and $\frac{\partial E}{\partial w}(t)$ is the derivative of the loss $E$ with respect to $w$ at time $t$.
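A minimal sketch of this momentum rule, again with hypothetical names and toy values:

```python
import numpy as np

def momentum_update(w, v, grad, lr=0.01, alpha=0.9):
    """Classical momentum: blend the previous velocity with the new gradient."""
    v = alpha * v - lr * grad  # update the velocity history with the new gradient
    w = w + v                  # the weight change equals the current velocity
    return w, v

w = np.array([0.5, -0.3])
v = np.zeros_like(w)           # velocity starts at zero
for grad in [np.array([0.1, -0.2]), np.array([0.12, -0.18])]:
    w, v = momentum_update(w, v, grad)
```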
Okay, we have now soothed the wild SGD updates with the moderation of Momentum. But the nature of SGD still poses another potential problem. The idea behind SGD is to approximate the true gradient step by averaging over the given instances (or mini-batches). Now think about a case where a model parameter gets a gradient of +0.001 for each instance, then suddenly gets -0.009 for a particular instance that is possibly an outlier. That single instance wipes out much of the accumulated gradient information. A solution to this problem was suggested by G. Hinton in lecture 6 of his Coursera course; it is unpublished work, even though I believe it is worthy of publication. It is called RMSprop. It keeps a running average of the recent gradient magnitudes and divides the next gradient by this average, so that gradient values are loosely normalized. RMSprop is performed as below:
\[MeanSquare(w,t) = 0.9 \, MeanSquare(w, t-1) + 0.1 \left(\frac{\partial E}{\partial w}(t)\right)^2\]

\[\Delta w(t) = \epsilon \, \frac{\partial E}{\partial w}(t) \Big/ \left(\sqrt{MeanSquare(w,t)} + \mu\right)\]

where $\mu$ is a small smoothing term for numerical stability.
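A rough NumPy sketch of the RMSprop rule above (the 0.9/0.1 averaging follows the formula; the function name and other constants are my own choices):

```python
import numpy as np

def rmsprop_update(w, mean_square, grad, lr=0.001, decay=0.9, mu=1e-8):
    """RMSprop: divide the gradient by a running average of its recent magnitude."""
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + mu)
    return w, mean_square

w = np.array([0.5, -0.3])
mean_square = np.zeros_like(w)
for grad in [np.array([0.001, 0.001]), np.array([-0.009, 0.001])]:
    w, mean_square = rmsprop_update(w, mean_square, grad)
```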
You can also combine Momentum and RMSprop by applying them successively and aggregating their update values, as sketched below.
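Just to make that idea concrete, here is one possible sketch of such a combination, where the RMSprop-normalized gradient is fed into the momentum velocity instead of the raw gradient; this is only my interpretation, not the notebook's implementation:

```python
import numpy as np

def rmsprop_momentum_update(w, v, mean_square, grad,
                            lr=0.001, alpha=0.9, decay=0.9, mu=1e-8):
    """One way to combine the two tricks: keep a running mean of squared
    gradients (RMSprop) and accumulate the normalized step in a velocity (Momentum)."""
    mean_square = decay * mean_square + (1 - decay) * grad ** 2
    v = alpha * v - lr * grad / (np.sqrt(mean_square) + mu)
    return w + v, v, mean_square
```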
Let's add AdaGrad before we finish. AdaGrad is an adaptive gradient method that assigns a different adaptive learning rate to each feature dimension. Hence it is particularly well suited to sparse problems, and it is likely to find more discriminative features and filters for your Convolutional NN. Although you provide an initial learning rate, AdaGrad tunes it for each feature dimension based on the history of its gradients. The formulation of AdaGrad is as below:
\[w_i(t) = w_i(t-1) - \frac{\epsilon}{\sqrt{\sum_{k=1}^{t} g_{ki}^2}} \, g_{ti}\]

where

\[g_{ki} = \frac{\partial E}{\partial w_i}(k)\]

is the gradient of the loss with respect to $w_i$ at step $k$. So the formula above states that, for each feature dimension, the learning rate is divided by the square root of the sum of all squared gradients accumulated so far for that dimension.
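And a minimal sketch of the AdaGrad rule, again with hypothetical names and toy values:

```python
import numpy as np

def adagrad_update(w, grad_sq_sum, grad, lr=0.01, eps=1e-8):
    """AdaGrad: scale each dimension's step by its accumulated squared gradients."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    w = w - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum

w = np.array([0.5, -0.3])
grad_sq_sum = np.zeros_like(w)
for grad in [np.array([0.1, 0.0]), np.array([0.0, -0.2])]:
    w, grad_sq_sum = adagrad_update(w, grad_sq_sum, grad)
```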
This completes my intro to the ideas applied in this NOTEBOOK, where you can see their practical results on the CIFAR dataset. Of course, this intro is not complete by itself; if you need more, refer to other resources. I really suggest the Coursera NN course by G. Hinton for the RMSprop idea and these notes for AdaGrad.
For more information you can look at these great lecture slides from the Toronto group.
Lately, I found this great visualization of optimization methods. I really suggest you take a look at it.