Activation functions
- Linear activation function
- Sigmoid function
- Hyperbolic tangent (tanh)
- Rectified Linear Unit (ReLU)
- Softplus (a smooth version of ReLU)
- Leaky ReLU
- Parametric ReLU
- Exponential Linear Unit (ELU)
- Gaussian Error Linear Unit (GELU)
Neural networks combine layers of perceptrons, making them more powerful. However, without non-linear activation functions, all the additional layers can be compressed back down to a single linear layer, so there is no additional benefit.
We need non-linear activation functions!
Therefore the sigmoid and hyperbolic tangent (tanh) activation functions came into use to provide non-linearity. At the time we were limited to these because we need a differentiable activation function, since we back-propagate gradients to update the model weights.
The effectiveness of the models was constrained by the amount of data, the available computational resources, and so on. Once the trick of using Rectified Linear Units (ReLU) was developed, we could train models faster.
The linear activation function is essentially an identity function.
The problem with using a linear activation is that all the layers can be compressed back to a single layer. For example, in a neural network with 1000 layers, all using a linear activation function, the output at the end is a linear combination of the input features. This reduces to the input features multiplied by some constants, which is simply linear regression, as the sketch below shows. Therefore non-linear activation functions are needed for neural networks to learn the data distribution.
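As a minimal sketch of this collapse (not from the original notebook; the weight matrices W1, W2 and the input x below are made up for illustration), two stacked linear layers give exactly the same output as a single layer whose weights are the product of the two:

import numpy as np

# two "layers" with a linear (identity) activation
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [2.0, 1.0]])
x = np.array([1.0, -2.0])

h = W1 @ x               # layer 1: identity activation, so output = W1 x
y = W2 @ h               # layer 2

W_combined = W2 @ W1     # collapse both layers into one weight matrix
y_single = W_combined @ x

print(np.allclose(y, y_single))   # True: two linear layers behave like one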
Sigmoid Activation function
The sigmoid function asymptotes to zero at negative infinity and to one at positive infinity, with intermediate values everywhere in between.
Hyperbolic tangent function ( tanH )
The hyperbolic tangent function is a scaled and shifted sigmoid, tanh(x) = 2 * sigmoid(2x) - 1, and ranges from negative one to positive one.
Back then, both sigmoid and hyperbolic tangent were great choices because they are differentiable, monotonic and smooth. However, problems such as saturation occur when either high or low input values end up in the asymptotic plateaus of the function. Since the curve is almost flat at these points, the derivatives are almost zero, so the weight updates become very slow or even halt.
If the gradients are close to zero, the weight updates become slow due to small step sizes down the hill during gradient descent. This prevents us from building the complex chain of functions that we need to capture the relations in the data.
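A minimal numerical sketch of this saturation (not from the original notebook): the derivative of the sigmoid, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), collapses towards zero as |x| grows.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:>4}: sigmoid'(x) = {sigmoid_grad(x):.6f}")
# x =  0.0: sigmoid'(x) = 0.250000
# x = 10.0: sigmoid'(x) = 0.000045  -> the gradient has effectively vanished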
Rectified Linear Unit ( ReLU )
ReLU is non-linear, so we can get the complex modelling we need, and it does not saturate in the non-negative portion of the input space. However, because the negative portion of the input space translates to zero activation, ReLU units can end up dying, which can cause training to stop.
Softplus Activation Function
The softplus activation function, softplus(x) = log(1 + exp(x)), is a smooth version of ReLU.
The name 'softplus' is used because it is a smoothed or softened version of ReLU.
Its derivative is the sigmoid function, a smooth approximation of the derivative of ReLU (the step function).
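A minimal sketch comparing the two (not from the original notebook), using tf.nn.relu and tf.math.softplus on a small made-up tensor:

import tensorflow as tf

z = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])

relu_out = tf.nn.relu(z)            # max(x, 0): hard corner at zero
softplus_out = tf.math.softplus(z)  # log(1 + exp(x)): smooth everywhere

print(relu_out.numpy())      # [0. 0. 0. 1. 3.]
print(softplus_out.numpy())  # ~[0.049 0.313 0.693 1.313 3.049]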
Leaky ReLU
This modified version allows small negative values when the input is less than zero. When ReLU is saturated and not active, the Leaky ReLU rectifier allows a small non-zero gradient.
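A minimal sketch (not from the original notebook) using tf.nn.leaky_relu; the slope 0.1 for the negative part is an arbitrary choice for illustration:

import tensorflow as tf

z = tf.constant([-5.0, -1.0, 0.0, 1.0, 5.0])

# negative inputs are scaled by alpha instead of being zeroed out
leaky_out = tf.nn.leaky_relu(z, alpha=0.1)

print(leaky_out.numpy())   # [-0.5 -0.1  0.   1.   5. ]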
Parametric ReLU
Parametric ReLU learns the parameter that controls the leakiness (the slope of the negative portion) and hence the shape of the function; it adaptively learns the parameters of the rectifier.
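A minimal sketch (not from the original notebook) using the Keras PReLU layer, whose negative-slope parameter alpha is a trainable weight; the initial value 0.25 is an assumption for illustration:

import tensorflow as tf

z = tf.constant([[-2.0, -1.0, 0.0, 1.0, 2.0]])

# alpha (the negative slope) is learned during training via back-propagation
prelu = tf.keras.layers.PReLU(alpha_initializer=tf.keras.initializers.Constant(0.25))

print(prelu(z).numpy())          # [[-0.5  -0.25  0.    1.    2.  ]]
print(prelu.trainable_weights)   # the per-channel alpha values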
Exponential Linear Unit ( ELU )
ELU is a generalization of ReLU that uses a parameterized exponential function to transform from positive to small negative values: ELU(x) = x for x >= 0 and alpha * (exp(x) - 1) for x < 0.
Its negative values push the mean of the activations closer to zero; activations that are closer to zero enable faster learning, as they bring the gradient closer to the natural gradient.
ELU was developed as a solution to the vanishing-gradient problem. ELU is linear (identity) in the non-negative portion of the input space, smooth, and monotonic.
The main drawback of ELU is that it is more computationally expensive than ReLU.
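A minimal sketch (not from the original notebook) using tf.keras.activations.elu with the default alpha of 1.0:

import tensorflow as tf

z = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])

# ELU: x for x >= 0, alpha * (exp(x) - 1) for x < 0
elu_out = tf.keras.activations.elu(z, alpha=1.0)

print(elu_out.numpy())   # ~[-0.95  -0.632  0.     1.     3.   ]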
Gaussian Error Linear Unit ( GELU )
This is another high-performing neural network activation like ReLU, but its non-linearity equals the expected transformation of a stochastic regularizer that randomly applies the identity or zero map to a neuron's input. GELU weights its input by the standard Gaussian CDF: GELU(x) = x * P(X <= x), where X ~ N(0, 1).
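A minimal sketch (not from the original notebook) using tf.nn.gelu, which also offers the tanh approximation used in some transformer implementations:

import tensorflow as tf

z = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])

gelu_exact = tf.nn.gelu(z)                     # exact, erf-based form
gelu_approx = tf.nn.gelu(z, approximate=True)  # tanh approximation

print(gelu_exact.numpy())   # ~[-0.004 -0.159  0.     0.841  2.996]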
Let's create some dummy data for demonstration.
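A minimal sketch of what this dummy data might be (the range and number of points are assumptions, not from the original notebook); the tensor x is reused by the activation examples that follow:

import tensorflow as tf

# 100 evenly spaced points between -10 and 10
x = tf.linspace(-10.0, 10.0, 100)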
Let's create a plotting function to display the input x and the output activation(x).
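A minimal sketch of such a plotting function; the signature matches the call plot_input_and_output(x, out, title) used below, while the styling choices are assumptions:

import matplotlib.pyplot as plt

def plot_input_and_output(x, y, title):
    """Plot the raw input x against the activated output y."""
    plt.figure(figsize=(6, 4))
    plt.plot(x, y)
    plt.title(title)
    plt.xlabel('input x')
    plt.ylabel('activation(x)')
    plt.grid(True)
    plt.show()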
Threshold function
The function outputs zero if x is less than the threshold value; otherwise, if x is greater than or equal to the threshold, the function outputs one.
Threshold function is a boolean function
Threshold is the simplest activation
We can use the threshold function for solving simple linearly separable problems like the AND operation, OR operation, etc.
The threshold function is not differentiable, but tuning a model with gradient descent requires differentiation; training a model with a threshold activation therefore leads to problems.
# code
def threshold_fn(inputs, threshold):
    """Return 1.0 where the input is >= the threshold, else 0.0."""
    output = []
    for i in range(len(inputs)):
        if inputs[i] >= threshold:
            output.append(1)
        else:
            output.append(0)
    return tf.constant(output, dtype=tf.dtypes.float32)

# set threshold
thres = 0
out = threshold_fn(x, thres)

# plot the input and output
plot_input_and_output(x, out, 'Threshold Activation Function')
Sigmoid function
Sigmoid activation function, sigmoid(x) = 1 / (1 + exp(-x))
The threshold function is not differentiable, and training a model using a threshold activation leads to problems, so we use the sigmoid as the activation function instead.
Sigmoid Activation Function is Differentiable
Output of Sigmoid ranges from 0 to 1
The function is monotonic while its derivative is not monotonic.
Used in Feed-Forward-Neural-Nets.
# code
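# A minimal sketch of what this cell might contain (assumes the x and
# plot_input_and_output defined above):
out = tf.keras.activations.sigmoid(x)   # sigmoid(x) = 1 / (1 + exp(-x))

plot_input_and_output(x, out, 'Sigmoid Activation Function')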
Hyperbolic tangent (tanh)
Hyperbolic tangent activation function. tanh(x) = sinh(x)/cosh(x) = ((exp(x) - exp(-x))/(exp(x) + exp(-x)))
Differentiable
Output ranges from -1 to 1
The function is monotonic while its derivative is not monotonic.
Negative inputs are mapped to negative outputs, and zero inputs are mapped near zero.
Used in Feed-Forward-Neural-Nets.
# code
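# A minimal sketch of what this cell might contain (assumes the x and
# plot_input_and_output defined above):
out = tf.keras.activations.tanh(x)   # tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

plot_input_and_output(x, out, 'Tanh Activation Function')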
Rectified Linear Unit (ReLU)
In modern neural nets, using sigmoid or tanh can lead to the vanishing-gradient problem; this can be remedied by using the ReLU activation function.
With default values, it returns the standard ReLU activation, max(x, 0), the element-wise maximum of 0 and the input tensor.
The function and its derivative both are monotonic.
Modifying default parameters allows you to use non-zero thresholds, change the max value of the activation, and to use a non-zero multiple of the input for values below the threshold.
Range: [0, infinity)
ReLU is more advantageous than sigmoid in modern deep nets.
# code
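# A minimal sketch of what this cell might contain (assumes the x and
# plot_input_and_output defined above):
out = tf.keras.activations.relu(x)   # relu(x) = max(x, 0)

plot_input_and_output(x, out, 'ReLU Activation Function')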
# Resources
Github : https://github.com/Ajay-user/DataScience
Learn TensorFlow and Deep Learning fundamentals using Python, by Daniel Bourke