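Sigmoid unit: $f(x) = \frac{1}{1 + e^{-x}}$
Tanh unit: $f(x) = \tanh(x)$
Rectified linear unit (ReLU): $f(x) = \sum_{i=1}^{\infty} \sigma(x - i + 0.5) \approx \log(1 + e^{x})$
where $\sigma$ is the sigmoid defined above. We call

$\sum_{i=1}^{\infty} \sigma(x - i + 0.5)$ the stepped sigmoid and

$\log(1 + e^{x})$ the softplus function.
The softplus function can be approximated by the max function (or hard max), i.e. $\max(0, x)$.
The max function is commonly known as the Rectified Linear function (ReL).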
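A quick numerical check can make the relationship between these three functions concrete. The sketch below is only illustrative: it assumes the stepped sigmoid is the sum of shifted sigmoids sigmoid(x - i + 0.5) for i = 1, 2, ..., truncates that infinite sum at a hypothetical `n_terms`, and uses function names chosen here for the example.

```python
import math

def sigmoid(x):
    # Logistic sigmoid, written so that exp() never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def stepped_sigmoid(x, n_terms=1000):
    # Truncated sum of shifted sigmoids (assumed form of the stepped sigmoid).
    return sum(sigmoid(x - i + 0.5) for i in range(1, n_terms + 1))

def softplus(x):
    # log(1 + e^x), rearranged for numerical stability.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def relu(x):
    # Hard max: max(0, x).
    return max(0.0, x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0, 20.0):
    print(f"x={x:6.1f}  stepped={stepped_sigmoid(x):8.4f}  "
          f"softplus={softplus(x):8.4f}  relu={relu(x):8.4f}")
```

The stepped sigmoid and softplus columns should stay close for every x, while softplus and the hard max differ most near x = 0 (by at most log 2) and agree more closely as |x| grows.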
The figure below plots the different activation functions.
The major differences between the sigmoid and ReL functions are:
 The sigmoid function has range $(0, 1)$, whereas the ReL function has range $[0, \infty)$. Because of its range, the sigmoid can be used to model a probability, so it is commonly used for logistic regression or probability estimation at the last layer, even when ReL is used for the previous layers. NERD NOTE: Viewing the softplus function as an approximation of stepped sigmoid units relates it to the binomial hidden units discussed in http://machinelearning.wustl.edu…
 The gradient of the sigmoid function vanishes as x moves away from 0 in either direction; the unit is said to be "saturated" there. The gradient of the ReL function does not have this problem for positive inputs, because its positive part is linear and unbounded.
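The saturation argument can be checked by evaluating both gradients directly. This is a minimal sketch with names chosen for illustration; the ReL gradient at exactly x = 0 is set to 0 by convention, and the depth loop ignores the weights entirely, which is a simplifying assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); the maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Gradient of max(0, x); the value at x == 0 is a convention (assumed 0 here).
    return 1.0 if x > 0 else 0.0

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")

# Activation-derivative factor accumulated across stacked layers,
# ignoring the weights (a simplifying assumption).
for depth in (1, 5, 10):
    print(f"depth={depth:2d}  best-case sigmoid factor={0.25 ** depth:.2e}  "
          f"relu factor (active units)={1.0 ** depth:.1f}")
```

Even in the best case the sigmoid contributes a factor of at most 0.25 per layer, while an active ReL unit contributes exactly 1. For negative inputs the ReL gradient is 0, so the benefit applies to the positive part only.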
The advantages of using Rectified Linear Units in neural networks are:
 If the hard max is used, it induces sparsity in the layer activations.
 As discussed earlier, ReLU does not suffer from the vanishing gradient problem on its positive part. It therefore allows deeper networks to be trained without pretraining.
 ReLU can be used in a Restricted Boltzmann Machine to model real- or integer-valued inputs.
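The sparsity point can be illustrated with a small experiment. The sketch below assumes pre-activations drawn from a standard normal distribution, which is purely an illustrative choice rather than anything a real network guarantees, and compares how many outputs are exactly zero under the hard max versus softplus.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical pre-activations of one layer, symmetric around zero.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Hard max (ReLU) versus its smooth counterpart, softplus.
relu_out = [max(0.0, z) for z in pre_activations]
softplus_out = [max(z, 0.0) + math.log1p(math.exp(-abs(z)))
                for z in pre_activations]

def fraction_zero(values):
    # Share of activations that are exactly zero.
    return sum(1 for v in values if v == 0.0) / len(values)

print(f"ReLU zeros:     {fraction_zero(relu_out):.3f}")
print(f"softplus zeros: {fraction_zero(softplus_out):.3f}")
```

Under this assumption roughly half of the hard-max outputs are exactly zero, whereas softplus outputs are small but never exactly zero, which is why the sparsity claim is tied to the hard max specifically.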
References:
 On Rectified Linear Units for Speech Processing http://www.cs.toronto.edu/~hinto…
 Rectifier Nonlinearities Improve Neural Network Acoustic Models http://ai.stanford.edu/~amaas/pa…
 Deep Sparse Rectifier Neural Networks http://eprints.pascalnetwork.or…