Thanks to http://research.microsoft.com/pubs/192769/tricks-2012.pdf
I was hassling with interesting problem lately. I trained a custom deep neural network model with ImageNet and ended up very good results at least on training logs. I used Caffe for all these. Then, I ported my model to python interface and give some objects to it. Boommm!not working and even raised random prob values like it is not even trained for 4 days. It was really frustrating. After a dozens of hours I discovered that “Devil is in the details” .
I was using one of the Batch Normalization (“what is it ? “little intro here ) PR that is not merged to master branch but seems fine. Then I found that interesting problem. The code in the branch computes each batch’s mean by only looking at that batch. When we give only one example at test time, then the mean values are exactly the values of this particular image. This disables everything and the net starts to behave strangely. After a small search I found the solution which uses moving average instead of exact batch average. Now, I am at the stage of implementation. The puchcard is, do not use any PR which is not merged to master branch, that simple 🙂
- 0.15 – Supervision (AlexNet) – ~ 60954656 params
- 0.26 – ISI (ensemble of features)
- 0.27 – LEAR (Fisher Vectors)
- 0.06 – GoogLeNet (Inception Modules) – ~ 11176896 params
- 0.07 – VGGnet (Go deeper and deeper)
- 0.08 – SPPnet (A retrospective addition from early vision)
- Adversarial instances and robust models
- Generative Adversarial Network http://arxiv.org/abs/1406.2661 – Train classifier net as oppose to another net creating possible adversarial instances as the training evolves.
- Apply genetic algorithms per N training iteration of net and create some adversarial instances.
- Apply fast gradient approach to image pixels to generate intruding images.
- Goodfellow states that DAE or CAE are not full solutions to this problem. (verify why ? )
- Blind training of nets
- We train huge networks in a very brute force fashion. What I mean is, we are using larger and larger models since we do not know how to learn concise and effective models. Instead we rely on redundancy and expect to have at least some units are receptive to discriminating features.
- Optimization (as always)
- It seems inefficient to me to use back-propagation after all these work in the field. Another interesting fact, all the effort in the research community goes to find some new tricks that ease back-propagation flaws. I thing we should replace back-propagation all together instead of daily fleeting solutions.
- Still use SGD ? Still ?
- Sparsity ?
- After a year of hot discussion for sparse representations and it is similarity to human brain activity, it seems like it’s been shelved. I still believe, sparsity is very important part of good data representations. It should be integrated to state of art learning models, not only AutoEncoders.
DISCLAIMER: If you are reading this, this is only captain’s note and intended to my own research make up. So many missing references and novice arguments.
Today, I spent some time on two new papers proposing a new way of training very deep neural networks (Highway-Networks) and a new activation function for Auto-Encoders (ZERO-BIAS AUTOENCODERS AND THE BENEFITS OF
CO-ADAPTING FEATURES) which evades the use of any regularization methods such as Contraction or Denoising.
Lets start with the first one. Highway-Networks proposes a new activation type similar to LTSM networks and they claim that this peculiar activation is robust to any choice of initialization scheme and learning problems occurred at very deep NNs. It is also incentive to see that they trained models with >100 number of layers. The basic intuition here is to learn a gating function attached to a real activation function that decides to pass the activation or the input itself. Here is the formulation
is the gating function and
is the real activation. They use Sigmoid activation for gating and Rectifier for the normal activation in the paper. I also implemented it with Lasagne and tried to replicate the results (I aim to release the code later). It is really impressive to see its ability to learn for 50 layers (this is the most I can for my PC).
The other paper ZERO-BIAS AUTOENCODERS AND THE BENEFITS OF CO-ADAPTING FEATURES suggests the use of non-biased rectifier units for the inference of AEs. You can train your model with a biased Rectifier Unit but at the inference time (test time), you should extract features by ignoring bias term. They show that doing so gives better recognition at CIFAR dataset. They also device a new activation function which has the similar intuition to Highway Networks. Again, there is a gating unit which thresholds the normal activation function.
The first equation is the threshold function with a predefined threshold (they use 1 for their experiments). The second equation shows the reconstruction of the proposed model. Pay attention that, in this equation they use square of a linear activation for thresholding and they call this model TLin but they also use normal linear function which is called TRec. What this activation does here is to diminish the small activations so that the model is implicitly regularized without any additional regularizer. This is actually good for learning over-complete representation for the given data.
For more than this silly into, please refer to papers 🙂 and warn me for any mistake.
These two papers shows a new coming trend to Deep Learning community which is using complex activation functions . We can call it controlling each unit behavior in a smart way instead of letting them fire naively. My notion also agrees with this idea. I believe even more complication we need for smart units in our deep models like Spike and Slap networks.
Here we have a very good slide summarizing performance measures, statistical tests and sampling for model comparison and evaluation. You can refer it when you have some couple of classifiers on different datasets and you want to see which one is better and why?
Gradient Boosted Trees (GBT) is an ensemble mechanism which learns incrementally new trees optimizing the present ensemble’s residual error. This residual error is resemblance to a gradient step of a linear model. A GBT tries to estimate gradient steps by a new tree and update the present ensemble with this new tree so that whole model is updated in the optimizing direction. This is not very formal explanation but it gives my intuition.
One formal way to think about GBT is, there are all possible tree constructions and our algorithms is just selects the useful ones for the given data. Hence, compared to all possible trees, number of tress constructed in the model is very small. This is similar to constructing all these infinite number of trees and averaging them with the weights estimated by LASSO.
GBT includes different hyper parameters mostly for regularization.
- Early Stopping : How many rounds your GBT continue.
- Shrinkage : Limit the update of each tree with the coefficient
- Data subsampling: Do not use whole the data for each tree, instead sample instances. In general sample ration
but it can be lower for larger datasets.
- One side note: Subsampling without shrinkage performs poorly.
Then my initial setting is:
- Run pretty long with many many round observing a validation data loss.
- Use small shrinkage value
- Sample 0.5 of the data
- Sample 0.9 of the features as well or do the reverse.