- Teacher – Student paradigm:
- To the best of my knowledge, the idea was sparked by Caruana et al. (2006). The idea is to train an ensemble of networks and use their outputs on a held-out set to distill their knowledge into a smaller network. It was recently revisited by G. Hinton’s work, which trains a larger network and then uses its outputs, mixed with the original training data, to train a smaller network. One important trick is to use a higher temperature in the softmax layer of the teacher network so that the class probabilities are spread more smoothly over the classes. The student network is then able to learn the class relations induced by the teacher network in addition to the true classes of the instances. Eventually, we are able to compress the knowledge of the teacher net into a smaller network with fewer parameters and faster execution time. Bengio also has a similar work called FitNets, which applies the same idea from a wider perspective. They do not only use the outputs of the teacher net; they also carry the representational power of the teacher’s hidden layers over to the student net with a regression loss that approximates the teacher’s hidden-layer activations from the student’s.
- Bayesian Breezes:
- We are finally able to see some Bayesian arguments on Deep Models. One of the prevailing works is Max Welling’s “Bayesian Dark Knowledge”. Again we have the previous idea, but with a very simple trick in mind. Basically, we add Gaussian noise, scaled by the decaying learning rate, to the gradient signals. This noise induces MCMC dynamics in the network, so it implicitly learns an ensemble of networks. The teacher trained in that fashion is then used to train student nets with an approach similar to the one proposed by G. Hinton. I won’t go into mathematical details here. I guess this is one of the rare Bayesian approaches that is close to being applicable to real-time problems, since one simple trick is enough to do all the Bayesian magic.
- The Variational Auto Encoder (VAE) is not a new work, but it recently drew my attention. The difference between a VAE and a conventional AE is that, given a probability distribution, the VAE learns the best possible representation parametrized by that distribution. Let’s say we want to fit a Gaussian distribution to the data. Then it is able to learn the mean and standard deviation of multiple Gaussians (corresponding to the VAE latent units) with backpropagation, using a simple reparametrization trick. Eventually, you obtain multiple Gaussians with different means and standard deviations on the latent units of the VAE, and you can sample new instances from them. You can learn more from this great tutorial.
- Recurrent Models for Visual Recognition:
- ReNet is a paper from the Montreal group. They present an alternative to convolutional neural networks for learning spatial structure in visual data. Their idea relies on recurrent neural networks that scan the image as a sequence, first in the horizontal and then in the vertical direction. In the end, the RNN is able to learn the structure over the whole image (or image patch). Although their results are not better than the state of the art, spotting a new alternative to old-fashioned convolution is an exciting effort.
- Model Acceleration and Compression Methods:
- We already talked about the dark knowledge approach, which is able to compress larger models into smaller ones. Besides that, there are some structural approaches to compressing larger models. One instance of these works is “Learning both Weights and Connections for Efficient Neural Networks”. You can reach my personal note on this work through this link.
- “Neural Networks with Few Multiplications” by Bengio’s team introduces yet another algorithmic solution for faster and less memory-hungry training.
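The temperature trick from the Teacher – Student item above can be sketched as follows. This is a minimal, pure-Python illustration and not the authors’ implementation: the function names, the temperature of 4.0 and the mixing weight `alpha` are assumptions made for the example, and refinements such as rescaling the soft loss by the squared temperature are left out.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Mix cross-entropy on the teacher's soft targets with cross-entropy on the true label.

    `temperature` and `alpha` are illustrative values, not taken from the paper.
    """
    soft_targets = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    student_hard = softmax(student_logits, 1.0)
    soft_loss = -sum(t * math.log(s) for t, s in zip(soft_targets, student_soft))
    hard_loss = -math.log(student_hard[true_label])
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

teacher_logits = [6.0, 2.0, 1.0]
print(softmax(teacher_logits, temperature=1.0))  # sharply peaked on class 0
print(softmax(teacher_logits, temperature=4.0))  # smoother, exposes class relations
```

With the higher temperature, the two non-winning classes receive visibly more probability mass, which is the extra signal the student learns from.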
In this text, I would like to talk about some of the recent advances in Deep Learning models; the list is by no means complete. (Click heading for the reference)
- Parametric Rectified Linear Unit (PReLU)
- The idea is to allow negative activations in the well-known ReLU units and to control them with a learnable parameter. In other words, you learn how much negative activation you need for each unit to discriminate classes. The work proposes that the PReLU unit is especially useful for very deep models, which suffer from poor gradient propagation to the initial layers because of their depth. What is different is that PReLU lets more gradient flow back by allowing negative activations.
- A new initialization method (MSRA for Caffe users)
- Xavier initialization was proposed by Bengio’s team, and it considers the fan-in and fan-out of a given unit to define the initial weights. However, the work says that the Xavier method and its variants assume linear activation functions in their derivation. Hence, they propose changes tailored to ReLU activations and show empirically that these give a better convergence rate in practice.
- Batch Normalization
- This work makes data normalization a structural part of the model. They say that the distribution of the inputs to each layer changes as the model evolves during training, which makes training sensitive to the initialization scheme and the learning schedule we use. Each mini-batch is normalized with the described scheme as it propagates through the network, which allows faster convergence with larger learning rates and makes models more robust to the initialization scheme we choose. Each mini-batch is normalized by its mean and variance, then scaled and shifted by a learned coefficient and offset.
- Inception Layers
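Going back to the Batch Normalization item above, the normalize, scale and shift steps can be sketched in a few lines. This is only a training-time sketch under stated assumptions: `gamma` and `beta` stand for the learned scale and shift, `eps` is a small constant added for numerical stability, and the running statistics used at inference time are not covered.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale by gamma and shift by beta.

    `batch` is a list of rows (one row per instance); `gamma` and `beta` hold one
    value per feature. All names here are illustrative.
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # biased mini-batch variance
        for i in range(n):
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normalized)  # both features now have roughly zero mean and unit variance
```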
In this post I’ll briefly introduce some update tricks for training your ML model. Then I will present my empirical findings with a linked NOTEBOOK that uses a 2-layer Neural Network on the CIFAR dataset.
I assume you at least know what Stochastic Gradient Descent (SGD) is. If you don’t, you can follow this tutorial. Besides, I’ll consider some improvements to the SGD rule that result in better performance and faster convergence.
SGD is a way of optimizing your model parameters based on the gradient of your loss function (Mean Squared Error, Cross-Entropy Error, …). We can formulate it as
$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$
where $\theta$ is the model parameter, $\eta$ is the learning rate and $\nabla L(\theta_t)$ is the gradient at time $t$.
SGD by itself depends solely on the given instance (or batch of instances) of the present iteration. Therefore, it tends to take unstable update steps from one iteration to the next; as a consequence, convergence takes more time, or your model may even get stuck in a poor local minimum.
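As a small illustration of the plain update rule, here is a pure-Python sketch. The toy quadratic loss and the names `sgd_step` and `grad_quadratic` are assumptions made for the example; the gradient here is exact rather than estimated from a mini-batch, so it shows only the update itself, not the noise discussed above.

```python
def sgd_step(params, grads, learning_rate=0.1):
    """One update: move each parameter against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

def grad_quadratic(params):
    """Gradient of the toy loss L(theta) = sum(theta_i ** 2)."""
    return [2.0 * p for p in params]

theta = [1.0, -2.0]
for _ in range(50):
    theta = sgd_step(theta, grad_quadratic(theta))
print(theta)  # both parameters end up close to the minimum at zero
```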
To solve this problem, we can use the Momentum idea (a closely related variant is known as Nesterov Momentum in the literature). Intuitively, momentum keeps a history of the previous update steps and combines it with the next gradient step, so that the resulting updates stay stable and consistent with the optimization history. It basically prevents chaotic jumps. We can formulate the Momentum technique as follows:
$$v_{t+1} = \mu v_t - \eta \nabla L(\theta_t)$$ (update the velocity history with the new gradient)
$$\theta_{t+1} = \theta_t + v_{t+1}$$ (the weight change is equal to the current velocity)
where $\mu$ is the momentum coefficient (0.9 is a good value to start with), $\eta$ is the learning rate and $\nabla L(\theta_t)$ is the derivative of the loss with respect to $\theta_t$.
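The velocity and weight-change updates described above can be sketched as follows. This is classical momentum in pure Python, written as an illustration: the function name, the toy quadratic loss and the learning rate of 0.1 are assumptions, while 0.9 is the starting momentum coefficient suggested in the text.

```python
def momentum_step(params, velocity, grads, learning_rate=0.1, momentum=0.9):
    """Update the velocity with the new gradient, then move the weights by the velocity."""
    new_velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

# Toy loss L(theta) = sum(theta_i ** 2), whose gradient is 2 * theta.
theta = [1.0, -2.0]
velocity = [0.0, 0.0]
for _ in range(200):
    grads = [2.0 * p for p in theta]
    theta, velocity = momentum_step(theta, velocity, grads)
print(theta)  # oscillates on the way, but settles near zero
```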
Okay, we can now soothe wild SGD updates with the moderation of Momentum. But the nature of SGD still poses another potential problem. The idea behind SGD is to approximate the real update step by averaging over the given instances (or mini-batches). Now think about a case where a model parameter gets a gradient of +0.001 for each instance, and then suddenly gets -0.009 for one particular instance, which is possibly an outlier. That single instance wipes out all the gradient information accumulated before it. A solution to this problem was suggested by G. Hinton in lecture 6 of his Coursera course; it is unpublished work, even though I believe it deserves publication. It is called RMSprop. It keeps a running average of the recent squared gradient magnitudes and divides the next gradient by the square root of this average, so that gradient values are loosely normalized. RMSprop is performed as below:
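$$r_t = \rho\, r_{t-1} + (1 - \rho)\, g_t^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t$$
where $g_t$ is the current gradient, $r_t$ is the running average of squared gradients, $\rho$ is its decay rate, $\eta$ is the learning rate and $\epsilon$ is a small smoothing value for numerical stability.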
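A minimal pure-Python sketch of one RMSprop step is given here for illustration. The function name and the default values for the learning rate, the decay rate and `eps` are assumptions for the example, not values taken from the lecture.

```python
import math

def rmsprop_step(params, avg_sq, grads, learning_rate=0.01, decay=0.9, eps=1e-8):
    """Keep a running average of squared gradients and divide each gradient by its root."""
    new_avg_sq = [decay * a + (1.0 - decay) * g * g for a, g in zip(avg_sq, grads)]
    new_params = [p - learning_rate * g / (math.sqrt(a) + eps)
                  for p, g, a in zip(params, grads, new_avg_sq)]
    return new_params, new_avg_sq

# Two parameters whose gradients differ by a factor of 100.
params, avg_sq = rmsprop_step([1.0, 1.0], [0.0, 0.0], [2.0, 200.0])
print(params)  # both parameters move by about the same amount
```

Because each gradient is divided by its own running magnitude, a single unusually large gradient no longer dominates the update.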
You can also combine Momentum and RMSprop by applying them successively and aggregating their update values.
Let’s add AdaGrad before we finish. AdaGrad is an Adaptive Gradient method that applies a different adaptive learning rate to each feature. Hence it is a more natural fit for sparse problems in particular, and it is likely to find more discriminative features and filters for your Convolutional NN. Although you provide an initial learning rate, AdaGrad tunes it according to the history of the gradients for each feature dimension. The formulation of AdaGrad is as below:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i}$$
So the formula above states that, for each feature dimension $i$, the learning rate $\eta$ is divided by the square root of the sum of all squared past gradients $g_{\tau,i}$ in that dimension.
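The per-dimension scaling can be sketched in pure Python as follows. The function name and default values are assumptions for the example, and the small `eps` added to the denominator is a common safeguard against division by zero rather than part of the description above.

```python
import math

def adagrad_step(params, sum_sq, grads, learning_rate=0.1, eps=1e-8):
    """Accumulate squared gradients per dimension and scale each step by their root."""
    new_sum_sq = [s + g * g for s, g in zip(sum_sq, grads)]
    new_params = [p - learning_rate * g / (math.sqrt(s) + eps)
                  for p, g, s in zip(params, grads, new_sum_sq)]
    return new_params, new_sum_sq

# Feeding the same gradient twice: the second step is smaller than the first.
params, sum_sq = [1.0], [0.0]
params_1, sum_sq = adagrad_step(params, sum_sq, [1.0])
params_2, sum_sq = adagrad_step(params_1, sum_sq, [1.0])
print(params[0] - params_1[0], params_1[0] - params_2[0])
```

Dimensions that rarely receive a gradient keep a larger effective learning rate, which is why the method suits sparse features.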
You have now completed my intro to the ideas applied in this NOTEBOOK, and you can see the practical results of these ideas on the CIFAR dataset. Of course, this intro is not complete by itself. If you need more, refer to other resources. I really suggest the Coursera NN course by G. Hinton for the RMSprop idea and these notes for AdaGrad.
For more information, you can look at this great lecture slide from the Toronto Group.
Lately, I found this great visualization of optimization methods. I really suggest you take a look at it.
MS researchers recently introduced a new deep (indeed very deep 🙂) NN model (PReLU Net), and they pushed the state of the art on the ImageNet 2012 dataset from 6.66% (GoogLeNet) to 4.94% top-5 error rate.
In this work, they introduce a variation of the well-known ReLU activation function. They call it PReLU (Parametric Rectified Linear Unit). The idea is to allow negative activations in the ReLU function through a control parameter $a$, which is also learned during the training phase. In the paper they argue, and empirically show, that PReLU is better at resolving the diminishing gradient problem in very deep neural networks (> 13 layers) because it allows negative activations. That means more activations per layer, hence more gradient feedback at the backpropagation stage.
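To make the role of the learned control parameter concrete, here is a minimal pure-Python sketch of the activation and its gradients for a single input. The function names and the slope value 0.25 used in the example are assumptions for illustration.

```python
def prelu(x, a):
    """PReLU: identity for positive inputs, slope `a` for negative inputs."""
    return x if x > 0 else a * x

def prelu_grads(x, a):
    """Return (d out / d x, d out / d a) for a single input."""
    if x > 0:
        return 1.0, 0.0
    return a, x

print(prelu(2.0, 0.25), prelu(-2.0, 0.25))
print(prelu_grads(-2.0, 0.25))
```

For a negative input, plain ReLU would pass back a zero gradient, whereas here both the input and the slope `a` receive a non-zero gradient.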
Here is G. Hinton’s talk at MIT about the shortcomings of Convolutional Neural Networks and 4 basic arguments for resolving them.
I watched it while slightly distracted, so I need to go over it again. However, these are the basic arguments that G. Hinton proposed during the talk.
1. CNN + Max Pooling is not the way the human brain handles visual information. Yes, it works in practice for the current state of the art, but viewpoint changes of the target objects in particular are still unsolved.
2. Apply Equivariance instead of Invariance. Instead of learning representations that are invariant to viewpoint changes, learn representations that change in correlation with the viewpoint changes.
3. In the space of CNN weight matrices, viewpoint changes are totally non-linear and therefore hard to learn. However, if we transfer instances into a space where viewpoint changes are globally linear, we can ease the problem. (Use a graphics-style representation with explicit pose coordinates.)
4. Route information to the right set of neurons instead of running unguided forward and backward passes. Define certain neuron groups (called capsules) that are receptive to particular clusters of data in the instance space; each capsule contributes to the whole model in proportion to the given instance’s membership in that capsule’s cluster.
Ever since the beginnings of science, technology and AI, scientists following Blaise Pascal and von Leibniz have pondered a machine that is as intellectually capable as humans. Famous writers like Jules …
Using Stochastic Gradient instead of Batch Gradient
- Stochastic learning is more suitable for tracking changes at each step.
- It often results in a better solution: because of the fluctuation in the weights, it may find its way to different local minima of the cost function.
- It is the most common way to implement NN learning.
- Batch learning, in contrast, is analytically more tractable in terms of its convergence.
- Many acceleration techniques are suited to batch learning.
- Batch learning converges more accurately to a local minimum, again because of the weight fluctuations in the stochastic method.
- Present the more informative instances to the algorithm as learning proceeds; a more informative instance is one that causes more cost or has not been seen yet.
- Do not present instances from the same class successively.
Transformation of Inputs
- Normalize the input variables so that each has zero mean.
- Scale the input variables so that their covariances are about the same.
- Diminish correlations between features as much as possible, since two correlated inputs may lead different units to learn the same function, which is redundant.
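The first two transformations in the list above can be sketched in pure Python as below. This is only an illustration: the function name and the small `eps` guard are assumptions, and the decorrelation step from the last item (for example via PCA) is not covered here.

```python
import math

def standardize(data, eps=1e-8):
    """Shift each input variable to zero mean and scale it to roughly unit variance.

    `data` is a list of rows, one row per instance.
    """
    n = len(data)
    d = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n)
            for j in range(d)]
    return [[(row[j] - means[j]) / (stds[j] + eps) for j in range(d)]
            for row in data]

raw = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize(raw)
print(scaled)  # both columns are now on the same scale around zero
```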