**Weight Normalization**, a new data depended initialization method and

**Mean Only Batch Normalization**.

Posts tagged with: deep learning

Deep Learning is defined as (Goodfellow et al., 2016) a sub-field of machine learning consists in learning models that are wholly or partially specified by a class of flexible differentiable functions.

In this study there are three main methods which are **Weight Normalization**, a new data depended initialization method and **Mean Only Batch Normalization**.

Weight normalization id formalized as below. Weight values w are decoupled by their norms g and the direction v / ||v||. In this way they propose that SGD gives faster convergence.

They compare Weight Normalization with Batch Normalization. The main disadvantage they posit that BN has stochasticity due to varying data batches and one additional difference is that WN has lower computational burden compared to BN.

the second perk is data depended initialization of the network. They first give a initial minibatch to network and compute mean activation and std per layer. Then given the initial weight values sampled from mean 0 and std 0.05, they set g = 1 / std and b = - mean / std

One downside is that since this scheme is batch depended, it might suffer for the forthcoming batches with possible different data statistics. However, they say that this scheme works well in practice.

The third perk is Mean Only Batch Normalization.

This is a lighter operation due to the avoidance of variance normalization. We might easily skip variance normalization because of the initialization scheme already applied it. One another upside is that avodiance of variance normalization provides less distracted gradient feedbacks and therefore better learning.

At the experiments side, they note that batch normalization is 16% slower than weight normalization whereas BN yields better progress especially for initial iterations. As a final remark they note 7.31% CIFAR-10 performance which is the state of art up to my knowledge (not better then my best network :)) in terms of published works. they also experiment with different architectures like RNNs , reinforcement learning and others but please refer to the paper for more.

Maxout [1] units are well-known and frequently used tools for Deep Neural Networks. For whom does not know, with a basic explanation, a Maxout unit is a set of internal activation units competing with each other for each instance and activation of the winner is propagated as output and the loosers are kept silent. At the backpropagation phase, it means we update only the winner unit. That also means, implicitly, we always prefer to back-propagate gradient signal through the strongest path. It is an important aspect of Maxout units, especially for very deep models which are prone to gradient instability.

Although Maxout units have very good properties like which I told (please refer to the paper for more details), I am a proactive sceptic of its ability to encode underlying information and pass it to next layer. Here is a very simple example. Suppose we have two competing functions (filters) in a Maxout unit. One of these functions is receptive of edge structures whereas the other is receptive of corners. For an instance, we might have the first filter as the winner with a value, let’s say, ~3 which means Maxout output is also ~3. For another instance, we have the other function as the winner with approximately same value ~3. If we assume that each NN layer is a classifier which takes the previous layer output as a feature vector (I guess not very wrong assumption), then basically we give the same value for different detections for a particular feature dimension (which is corresponded to our Maxout unit). Eventually, we cannot expect from the next layer to be able to discern this signal.

One can argue that we should evaluate Maxout unit as a whole and it is reminiscent of OR function on top of multiple filters. This is a valid argument which I cannot refuse directly but the problem that I indicated above is still floating on air. Beside, why we would waste our expensive NN parameters, if we could come up with a better encoding scheme for Maxout units

Here is one alternative approach for better encoding of competing functions, which we call NegOut. Let's assume we have a ordering of two competing functions by heart as 1st and 2nd. If the winner is the 1st function, NegOut outputs the 1st function's value and otherwise it outputs the 2nd function but by taking its negative. NegOut yields two assumptions. The first, competing functions are always positive (like ReLU functions ). The second, we have 2 competing functions.

If we consider the backpropagation signal, the only difference from Maxout unit is to take negative of the gradient signal for the 2nd competing unit, if it is the winner.

As you can see from the figure, the inherent property here is to output different values for different winner detectors in which the value captures both the structural difference and the strength of the winner activation.

I performed some experiments on CIFAR-10 and MNIST comparing Maxout Network with NegOut Network with exact same architectures explained in the Maxout Paper [1]. The table below summarizes results that I observe by the initial runs without any finetunning or hyper-parameter optimization yet. More comparisons on larger datasets are still in progress.

NegOut give better results on CIFAR, although it is slightly lower on MNIST. Again notice that no tunning has been took a place for our NegOut network where as Maout Network is optimized as described in the paper [1]. In addition, NegOut network uses 2 competing set of units (as it is constrained by its nature) for the last FC layer in comparison to Maxout net which uses 5 competing units. My expectation is to have more difference as we go through larger models and datasets since as we scale up, representational power takes more place for better results.

Here, I tried to give a basic sketch of my recent work by no means complete. Different observations and experiments are still running. I also need to include LWTA [2] for being more fair and grasp more wider aspect of competing units. Please feel free to share your thoughts as well. Any contribution is appreciated.

PS: Lately, I devote myself to analyze the internal dynamics of Neural Networks with different architectures, layers and activation functions. The aim is checking under the hood and analysing any intuitionally well-functioning ideas applied to Deep Neural Networks. I also expect to share more of my findings at my blog.

[1] Maxout networks IJ Goodfellow, D Warde-Farley, M Mirza, A Courville, Y Bengio arXiv preprint arXiv:1302.4389

[2] Understanding Locally Competitive Networks Rupesh Kumar Srivastava, Jonathan Masci, Faustino Gomez, Jürgen Schmidhuber. http://arxiv.org/abs/1410.1165

**Teacher - Student paradigm:**- The idea is flickered by (up to my best knowledge) Caruana et. al. 2006. Basically, the idea is to train an ensemble of networks and use their outputs on a held-out set to distill the knowledge to a smaller network. Then this idea is recently hashed by G. Hinton's work which trains larger network then use this network output with a mixture of the original train data to train a smaller network. One important trick is to using higher temperature values on softmax layer of the teacher network so class probabilities are smoothly distributed over classes . Student networks is then able to learn class relations induced by the teacher network beside the true classes of the instances as it is suppose to. Eventually, we are able to compress the knowledge of the teacher net by a smaller network with less number of parameters and faster execution time. Bengio has also one similar work called Fitnets which is the beneficiary of the same idea from a wider aspect. They do not only use the outputs of the teacher net, but they carry representation power of hidden layers of the teacher to the student net by a regression loss that approximates the teacher hidden layer weights from the student weights.

**Bayesian Breezes :**- We are finally able to see some Bayesian arguments on Deep Models. One of the prevailing works belongs to Maxwelling "Bayesian Dark Knowledge". Again we have the previous idea but with a very simple trick in mind. Basically, we introduces a Gaussian noise, which is scaled by the decaying learning rate, to the gradient signals. This noise indices a MCMC dynamics to the network and it implicitly learns ensemble networks. The teacher trained in that fashion, is then used to train student nets with a similar approach proposed by G. Hinton. I won't go into mathematical details here. I guess this is one of the rare Bayesian approaches which is close to be applicable for real-time problems with its a simple trick which is enough to do all the Bayesian magic.
- Variational Auto Encoder is not a new work but it recently draw my attention. The difference between VAE and conventional AE is, given a probability distribution, VAE learns the best possible representation that is parametrized by defined distribution. Let's say we want to fit gaussian distribution to the data. Then, It is able to learn mean and standard deviation of the multiple gaussian functions ( corressponding VAE latent units) with backpropagation with a simple parametrization trick. Eventually, you obtain multiple gaussians with different mean and std on the latent units of VAE and you can sample new instances out of these. You can learn more from this great tutorial.

**Recurrent Models for Visual Recognition:**- ReNet is a paper from Montreal group. They explain an alternative approach to convolutional neural networks in order to learn spatial structures over visual data. Their idea relies on recurrent neural network which scans the image in a sequence of horizontal and then vertical direction. At the end, RNN is able to learn the structure over the whole image (or image patch). Although, their results are not better than state of art, spotting an new alternative to old fashion convolution is exciting effort.

**Model Accelerator and Compression Methods:**- We already talked about dark knowledge approach that is able to compress larger modes into a small ones. Beside, there are some structural approaches so as to compress larger models. One instance to these works is "Learning both Weights and Connections for Efficient Neural Networks". You can reach my personal note relating to this work by this link.
- "Neural Networks with Few Multiplication" by Bengio's team introduces a yet another algorithmic solution for faster and less memory bloating training.

**Adversarial instances**and robust models- Generative Adversarial Network http://arxiv.org/abs/1406.2661 - Train classifier net as oppose to another net creating possible adversarial instances as the training evolves.
- Apply genetic algorithms per N training iteration of net and create some adversarial instances.
- Apply fast gradient approach to image pixels to generate intruding images.
- Goodfellow states that DAE or CAE are not full solutions to this problem. (verify why ? )

**Blind training of nets**- We train huge networks in a very brute force fashion. What I mean is, we are using larger and larger models since we do not know how to learn concise and effective models. Instead we rely on redundancy and expect to have at least some units are receptive to discriminating features.

**Optimization (as always)**- It seems inefficient to me to use back-propagation after all these work in the field. Another interesting fact, all the effort in the research community goes to find some new tricks that ease back-propagation flaws. I thing we should replace back-propagation all together instead of daily fleeting solutions.
- Still use SGD ? Still ?

**Sparsity ?**- After a year of hot discussion for sparse representations and it is similarity to human brain activity, it seems like it's been shelved. I still believe, sparsity is very important part of good data representations. It should be integrated to state of art learning models, not only AutoEncoders.

**DISCLAIMER**: If you are reading this, this is only captain's note and intended to my own research make up. So many missing references and novice arguments.

Today, I spent some time on two new papers proposing a new way of training very deep neural networks (Highway-Networks) and a new activation function for Auto-Encoders (ZERO-BIAS AUTOENCODERS AND THE BENEFITS OF

CO-ADAPTING FEATURES) which evades the use of any regularization methods such as Contraction or Denoising.

Lets start with the first one. Highway-Networks proposes a new activation type similar to LTSM networks and they claim that this peculiar activation is robust to any choice of initialization scheme and learning problems occurred at very deep NNs. It is also incentive to see that they trained models with >100 number of layers. The basic intuition here is to learn a gating function attached to a real activation function that decides to pass the activation or the input itself. Here is the formulation

is the gating function and is the real activation. They use Sigmoid activation for gating and Rectifier for the normal activation in the paper. I also implemented it with Lasagne and tried to replicate the results (I aim to release the code later). It is really impressive to see its ability to learn for 50 layers (this is the most I can for my PC).

The other paper ZERO-BIAS AUTOENCODERS AND THE BENEFITS OF CO-ADAPTING FEATURES suggests the use of non-biased rectifier units for the inference of AEs. You can train your model with a biased Rectifier Unit but at the inference time (test time), you should extract features by ignoring bias term. They show that doing so gives better recognition at CIFAR dataset. They also device a new activation function which has the similar intuition to Highway Networks. Again, there is a gating unit which thresholds the normal activation function.

The first equation is the threshold function with a predefined threshold (they use 1 for their experiments). The second equation shows the reconstruction of the proposed model. Pay attention that, in this equation they use square of a linear activation for thresholding and they call this model TLin but they also use normal linear function which is called TRec. What this activation does here is to diminish the small activations so that the model is implicitly regularized without any additional regularizer. This is actually good for learning over-complete representation for the given data.

For more than this silly into, please refer to papers 🙂 and warn me for any mistake.

These two papers shows a new coming trend to Deep Learning community which is using complex activation functions . We can call it controlling each unit behavior in a smart way instead of letting them fire naively. My notion also agrees with this idea. I believe even more complication we need for smart units in our deep models like Spike and Slap networks.

I recently attended Plankton Classification Challenge on Kaggle. Even tough I used simpler (stupidly simpler compared to the winner) Deep NN model for my submissions and ended up at 192th position among 1046 participants. However, this was very good experiment area for me to test new comer ideas to Deep Learning community and try some couple of novel things which I plan to explain later in my blog.

In this post, I share my notes about the winner's approach (which is explained here extensively).

In this text, I would like to talk about some of the recent advances of Deep Learning models by no means complete. (Click heading for the reference)

- Parametric Rectifier Linear Unit (PReLU)
- The idea is to allow negative activation in well-known ReLU units by controlling it with a learnable parameter. In other words, you learn how much negative activationsyou need for each unit to discriminate classes. In the work, it is proposed that PReLU unit is very useful for especially very deep models that lacks for gradient propagation to initial layers due to its depth. What is different is PReLU allows more gradient return by allowing negative activation.

- A new initialization method (MSRA for Caffe users)
- Xavier initialization was proposed by Bengio's team and it considers number of fan-in and fan-out to a certain unit to define the initial weights. However, the work says that Xavier method and its alternations considers linear activation functions for the formulation of the method. Hence, they propose some changes related to ReLU activation that they empirically proved its effect in practice with better convergence rate.

- Batch Normalization
- This work serves data normalization as a structural part of the model. They say that the distribution of the training data changes as the model evolves and it priorities the initialization scheme and the learning schedule we use for the learning. Each mini-batch of the data is normalized with the described scheme just before its propagation through the network and it allows faster convergence with larger learning rates and robust models to initialization scheme that we choose. Each mini-batch is normalized by its mean and variance, then it is scaled and shifted by a learned coefficient and residual.

- Inception Layers

In this post I'll briefly introduce some update tricks for training of your ML model. Then, I will present my empirical findings with a linked NOTEBOOK that uses 2 layer Neural Network on CIFAR dataset.

I assume at least you know what is Stochastic Gradient Descent (SGD). If you don't, you can follow this tutorial . Beside, I'll consider some improvements of SGD rule that result better performance and faster convergence.

SGD is basically a way of optimizing your model parameters based on the gradient information of your loss function (Means Square Error, Cross-Entropy Error ... ). We can formulate this;

is the model parameter, is learning rate and is the gradient at the time .

SGD as itself is solely depending on the given instance (or the batch of instances) of the present iteration. Therefore, it tends to have unstable update steps per iteration and corollary convergence takes more time or even your model is akin to stuck into a poor local minima.

To solve this problem, we can use Momentum idea (Nesterov Momentum in literature). Intuitively, what momentum does is to keep the history of the previous update steps and combine this information with the next gradient step to keep the resulting updates stable and conforming the optimization history. It basically, prevents chaotic jumps. We can formulate Momentum technique as follows;

(update velocity history with the new gradient)

(The weight change is equal to the current velocity)

is the momentum coefficient and 0.9 is a value to start. is the derivative of wrt. the loss.

Okay we now soothe wild SGD updates with the moderation of Momentum lookup. But still nature of SGD proposes another potential problem. The idea behind SGD is to approximate the real update step by taking the average of the all given instances (or mini batches). Now think about a case where a model parameter gets a gradient of +0.001 for each instances then suddenly it gets -0.009 for a particular instance and this instance is possibly a outlier. Then it destroys all the gradient information before. The solution to such problem is suggested by G. Hinton in the Coursera course lecture 6 and this is an unpublished work even I believe it is worthy of. This is called RMSprop. It keeps running average of its recent gradient magnitudes and divides the next gradient by this average so that loosely gradient values are normalized. RMSprop is performed as below;

is a smoothing value for numerical convention.

You can also combine Momentum and RMSprop by applying successively and aggregating their update values.

Lets add AdaGrad before finish. AdaGrad is an Adaptive Gradient Method that implies different adaptive learning rates for each feature. Hence it is more intuitive for especially sparse problems and it is likely to find more discriminative features and filters for your Convolutional NN. Although you provide an initial learning rate, AdaGrad tunes it regarding the history of the gradients for each feature dimension. The formulation of AdaGrad is as below;

where

So tihe upper formula states that, for each feature dimension, learning rate is divided by the all the squared root gradient history.

Now you completed my intro to the applied ideas in this NOTEBOOK and you can see the practical results of these applied ideas on CIFAR dataset. Of course this into does not mean complete by itself. If you need more refer to other resources. I really suggest the Coursera NN course by G. Hinton for RMSprop idea and this notes for AdaGrad.

For more information you can look this great lecture slide from Toronto Group.

Lately, I found this great visualization of optimization methods. I really suggest you to take a look at it.