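Sigmoid unit: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Rectified linear unit (ReLU):
as stepped sigmoid: $f(x) = \sum_{i=1}^{\infty} \sigma(x - i + 0.5)$
as softplus function: $f(x) \approx \log(1 + e^{x})$
The softplus function can in turn be approximated by the max function (or hard max), i.e. $\max(0, x)$. The max function is commonly known as the Rectified Linear function (ReL).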
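To make the chain of approximations concrete, here is a minimal NumPy sketch comparing the three forms named above. It assumes the stepped sigmoid is a sum of sigmoid units sharing one input with biases shifted by 0.5, 1.5, 2.5, and so on; the function names and the truncation at `n_steps` terms are illustrative choices, not part of the original text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid(x, n_steps=100):
    # Assumed form: sum of sigmoids with biases -0.5, -1.5, -2.5, ...
    # The infinite sum is truncated at n_steps terms.
    x = np.asarray(x, dtype=float)
    i = np.arange(1, n_steps + 1)
    return sigmoid(x[..., None] - i + 0.5).sum(axis=-1)

def softplus(x):
    # log(1 + e^x), computed in a numerically stable way
    return np.logaddexp(0.0, x)

def relu(x):
    # Hard max: max(0, x)
    return np.maximum(0.0, x)

x = np.linspace(-5.0, 5.0, 11)
for xi, a, b, c in zip(x, stepped_sigmoid(x), softplus(x), relu(x)):
    print(f"x={xi:+.1f}  stepped={a:.4f}  softplus={b:.4f}  relu={c:.4f}")
```

On this grid the stepped sigmoid and softplus stay close everywhere, while softplus and the hard max differ most around x = 0 and converge as |x| grows.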
The figure below plots the different activation functions.
The major differences between the sigmoid and ReL functions are:
- The sigmoid function has range (0, 1), whereas the ReL function has range $[0, \infty)$. Because of its range, sigmoid can be used to model a probability; hence it is commonly used for logistic regression or probability estimation at the last layer, even when you use ReL for the previous layers. NERD NOTE: The view of the softplus function as an approximation of stepped sigmoid units relates to the binomial hidden units as discussed in
- The gradient of the sigmoid function vanishes as x moves away from 0 in either direction; the unit is said to be “saturated” in that regime. The ReL function does not have this problem for positive inputs, where it is linear and unbounded and its gradient is constant at 1.
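The saturation argument can be checked numerically. The sketch below evaluates both derivatives at a few sample points; the helper names are illustrative, and the ReL derivative at exactly 0 is set to 0 by convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 for x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (value at x = 0 is a convention)
    return (np.asarray(x) > 0).astype(float)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print("sigmoid'(x):", sigmoid_grad(x))
print("relu'(x):   ", relu_grad(x))

# Backpropagating through a stack of layers multiplies these factors.
# Even at its peak, the sigmoid derivative shrinks the signal quickly.
depth = 10
print("sigmoid, best case over", depth, "layers:", 0.25 ** depth)
print("relu, active path over", depth, "layers:", 1.0 ** depth)
```

Note that the ReL derivative is exactly zero for negative inputs, so the benefit applies to units that are active for a given input.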
The advantages of using Rectified Linear Units in neural networks are:
- If hard max is used, it induces sparsity in the layer activations, since every negative input is mapped to exactly zero.
- As discussed above, ReLU does not suffer from the vanishing gradient problem for positive inputs. It therefore allows training deeper networks without pre-training.
- ReLU can be used in Restricted Boltzmann Machines to model real/integer-valued inputs.
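The sparsity claim is easy to see on synthetic data. This sketch assumes zero-mean Gaussian pre-activations, which is a stand-in for a real layer and not something measured from a trained network; the array shape and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations for a batch of 1000 inputs and 256 units
pre = rng.standard_normal((1000, 256))

relu_out = np.maximum(0.0, pre)            # hard max
softplus_out = np.logaddexp(0.0, pre)      # smooth approximation
sigmoid_out = 1.0 / (1.0 + np.exp(-pre))

def zero_fraction(a):
    # Fraction of activations that are exactly zero
    return float(np.mean(a == 0.0))

print("relu     zero fraction:", zero_fraction(relu_out))
print("softplus zero fraction:", zero_fraction(softplus_out))
print("sigmoid  zero fraction:", zero_fraction(sigmoid_out))
```

Under this assumption roughly half of the hard-max outputs are exactly zero, while softplus and sigmoid outputs are small but never exactly zero, which is why the sparsity argument is specific to the hard max.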
Further reading:
- On Rectified Linear Units for Speech Processing
- Rectifier Nonlinearities Improve Neural Network Acoustic Models
- Deep Sparse Rectifier Neural Networks