Activation functions

Ajay krishnan
6 min read · Apr 11, 2021


  • Linear activation function
  • Sigmoid-function
  • Tan Hyperbolic ( tanH )
  • Rectified Linear Unit ( ReLU )
  • SoftPlus Activation Function (smooth version of ReLU)
  • Leaky ReLU
  • Parametric ReLU
  • Exponential Linear Unit ( ELU )
  • Gaussian Error Linear Unit ( GELU )

Neural networks combine layers of perceptrons, which makes them more powerful. Without non-linear activation functions, however, all the additional layers can be compressed back down to a single linear layer, and the extra depth brings no benefit.
We need non-linear activation functions!
Sigmoid and hyperbolic tangent (tanh) activation functions were therefore the first to be used for non-linearity. At the time we were limited to these because we need a differentiable activation function: gradients are back-propagated to update the model weights.
The effectiveness of the model was constrained by the amount of data, the available computational resources and so on. Once the trick of using Rectified Linear Units (ReLU) was developed, we could train models faster.

The linear activation function is essentially an identity function.

Linear Activation Function

The problem with a linear activation is that all the layers can be compressed back into a single layer. For example, in a neural network with 1000 layers, all using a linear activation function, the output at the end is still a linear combination of the input features. It can be reduced to the input features multiplied by a single set of weights, which is simply a linear regression. Non-linear activation functions are therefore needed for neural networks to learn the data distributions.
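To make the collapse concrete, here is a minimal pure-Python sketch. The weight matrices `W1` and `W2`, the sample `x` and the helper `matmul` are made up for illustration, and biases are left out to keep it short. It shows that two stacked linear layers give the same output as one layer whose weights are the product of the two.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical weights for two layers with a linear (identity) activation.
W1 = [[1.0, 2.0], [0.5, -1.0]]   # layer 1: 2 inputs -> 2 units
W2 = [[3.0], [-2.0]]             # layer 2: 2 units -> 1 output

x = [[4.0, 5.0]]                 # one sample with two features

# Pass the sample through both layers, one after the other.
two_layers = matmul(matmul(x, W1), W2)

# Collapse both layers into a single weight matrix first.
W_single = matmul(W1, W2)
one_layer = matmul(x, W_single)

print(two_layers, one_layer)
```

The same argument applies to any number of linear layers, since a product of matrices is itself just one matrix.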

Sigmoid Activation function

Sigmoid function

The sigmoid function asymptotes to zero at negative infinity and to one at positive infinity, taking every intermediate value in between.

Hyperbolic tangent function ( tanH )

Hyperbolic tangent function

The hyperbolic tangent function is a scaled and shifted sigmoid function; it ranges from negative one to positive one.

Back then, both sigmoid and hyperbolic tangent were great choices because they are differentiable, monotonic and smooth. However, problems such as saturation occur when very high or very low input values end up on the asymptotic plateaus of the function. Since the curve is almost flat at these points, the derivatives are almost zero, so the weight updates become very slow or even halt.
If the gradients are close to zero, gradient descent takes tiny steps down the hill and the weights barely change. This prevents us from building the complex chain of functions we need to capture the relations in the data.
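The saturation effect is easy to see numerically. The sketch below uses only Python's `math` module; the helper names `sigmoid_grad` and `tanh_grad` are my own. It evaluates both derivatives at a few inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# The gradients are largest at 0 and shrink quickly on the plateaus.
for x in (0.0, 2.0, 10.0):
    print(x, sigmoid_grad(x), tanh_grad(x))
```

At an input of 10 both derivatives are already tiny, which is why a unit stuck on a plateau hardly updates its weights.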

Rectified Linear Unit ( ReLU )

Rectified Linear Units

ReLU is non-linear, so we get the complex modelling we need, and it does not saturate in the non-negative portion of the input space. However, because the negative portion of the input space maps to zero activation, ReLU units can end up dying: their output and gradient stay at zero, which can cause training to stop.
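As a rough illustration of a dying unit, here is a small pure-Python sketch. The pre-activation values are invented, and taking the gradient at exactly zero to be 0 is a convention I am assuming here. When every input a unit sees is negative, both its output and its gradient are zero, so nothing flows back to update its weights.

```python
def relu(x):
    return max(x, 0.0)

def relu_grad(x):
    # 1 for positive inputs, 0 for negative inputs.
    # At exactly 0 the derivative is undefined; 0 is assumed here.
    return 1.0 if x > 0 else 0.0

# Hypothetical pre-activations of one unit across a batch, all negative.
pre_activations = [-3.0, -1.5, -0.2]

outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

print(outputs)  # every output is zero
print(grads)    # every gradient is zero, so the unit cannot recover
```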

Softplus Activation Function

The softplus activation function, softplus(x) = log(1 + exp(x)), is a smooth approximation of ReLU; the name ‘softplus’ refers to this smoothed, or softened, shape.
Its derivative is the sigmoid function, which is a smooth approximation of the step-shaped derivative of ReLU.

SoftPlus, a variant of ReLU
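A minimal sketch of softplus in pure Python, written in a numerically stable form (the rearrangement with `log1p` is my choice, not something from the original post). Far from zero it behaves like ReLU, and its slope matches the sigmoid.

```python
import math

def softplus(x):
    # log(1 + exp(x)), rearranged so that exp never overflows
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-20.0, 0.0, 20.0):
    print(x, softplus(x), max(x, 0.0))

# Numerical slope of softplus at x = 1, compared with sigmoid(1)
h = 1e-6
slope = (softplus(1.0 + h) - softplus(1.0 - h)) / (2 * h)
print(slope, sigmoid(1.0))
```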

Leaky ReLU

This modified version allows small negative values when the input is less than zero. Where ReLU is inactive and outputs zero, the leaky ReLU rectifier keeps a small non-zero gradient.

Parametric ReLU

Parametric ReLU learns the parameters that control the leakiness and shape of the function; the rectifier's parameters are learned adaptively during training, along with the other weights.

Leaky ReLU, a variant of ReLU
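The two variants differ only in where the negative slope comes from. In the sketch below (pure Python; the default `alpha=0.01` and the helper names are assumptions for illustration) Leaky ReLU uses a fixed slope. Parametric ReLU treats the same slope as a trainable value, so it also needs a gradient with respect to `alpha`.

```python
def leaky_relu(x, alpha=0.01):
    # Fixed small slope for negative inputs
    return x if x > 0 else alpha * x

def prelu(x, alpha):
    # Same formula, but alpha is a parameter the network learns
    return x if x > 0 else alpha * x

def prelu_grad_alpha(x):
    # d prelu / d alpha: x for negative inputs, 0 for positive inputs
    return 0.0 if x > 0 else x

print(leaky_relu(-2.0), leaky_relu(3.0))
print(prelu(-2.0, alpha=0.25), prelu_grad_alpha(-2.0))
```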

Exponential Linear Unit ( ELU )

ELU is a variant of ReLU that uses a parameterized exponential function for negative inputs, so the output moves smoothly from positive values to small negative values.
Its negative values push the mean of the activations closer to zero, which enables faster learning because it brings the gradient closer to the natural gradient.

Exponential Linear Unit, a variant of ReLU

ELU was developed as a solution to the vanishing gradient problem. It is linear in the non-negative portion of the input space, smooth and monotonic.
The main drawback of ELU is that it is more computationally expensive than ReLU.
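A small pure-Python sketch of ELU, assuming the usual default of `alpha = 1.0`. Positive inputs pass through unchanged, while negative inputs decay exponentially towards `-alpha`; the call to `expm1` is the extra cost compared with ReLU.

```python
import math

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (exp(x) - 1) otherwise
    return x if x > 0 else alpha * math.expm1(x)

for x in (-100.0, -1.0, 0.0, 2.0):
    print(x, elu(x))
```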

Gaussian Error Linear Unit ( GELU )

GELU is another high-performing neural network activation, like ReLU. Its non-linearity is the expected transformation of a stochastic regularizer that randomly applies the identity or the zero map to a neuron's input.

GELU, a variant of ReLU
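One way to read that description: keep the input with probability Φ(x), the standard normal CDF, and zero it otherwise, so the expected output is x · Φ(x). A minimal sketch of this exact form using `math.erf` follows. Many libraries also offer a faster tanh-based approximation, which is not shown here.

```python
import math

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, gelu(x))
```

For large positive inputs GELU behaves like the identity, and for large negative inputs it goes to zero, much like ReLU.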

Let's create some dummy data for demonstration

code for generating dummy data
output dummy data

Let's create a plotting function to display the input x and the output activation(x)

utility function for plotting

Threshold function

The function outputs zero if x is less than the threshold value.
Otherwise, if x is greater than or equal to the threshold, the function outputs one.

The threshold function is a Boolean function.

The threshold function is the simplest activation.

We can use the threshold function to solve simple linear problems such as an AND operation, an OR operation, etc.
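For instance, a single unit with a threshold activation can compute AND or OR with hand-picked weights. The weights and biases below are one possible choice, not learned values, and `threshold_unit` is a hypothetical helper written in plain Python.

```python
def threshold_unit(inputs, weights, bias, threshold=0.0):
    # Weighted sum followed by the threshold activation
    z = sum(w * v for w, v in zip(weights, inputs)) + bias
    return 1 if z >= threshold else 0

def logical_and(a, b):
    # Fires only when both inputs are 1: 1 + 1 - 1.5 >= 0
    return threshold_unit([a, b], weights=[1.0, 1.0], bias=-1.5)

def logical_or(a, b):
    # Fires when at least one input is 1: 1 - 0.5 >= 0
    return threshold_unit([a, b], weights=[1.0, 1.0], bias=-0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```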

The threshold function is not differentiable at the threshold, and its gradient is zero everywhere else, while tuning a model with gradient descent requires useful derivatives. Training a model with a threshold activation therefore leads to problems.

# code

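import tensorflow as tf

def threshold_fn(inputs, threshold):
    """Return 1.0 where the input is >= threshold, otherwise 0.0."""
    output = []
    for i in range(len(inputs)):
        # At or above the threshold gives one, as defined above
        if inputs[i] >= threshold:
            output.append(1)
        else:
            output.append(0)
    return tf.constant(output, dtype=tf.dtypes.float32)


# set threshold
thres = 0
# x is the dummy data created earlier
out = threshold_fn(x, thres)
# plot the input and output
plot_input_and_output(x, out, 'Threshold Activation Function')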
Output after applying the threshold non-linearity

Sigmoid-function

Sigmoid activation function, sigmoid(x) = 1 / (1 + exp(-x)).

The threshold function is not differentiable, and training a model with a threshold activation leads to problems, so we use sigmoid as the activation function.

Sigmoid Activation Function is Differentiable

Output of Sigmoid ranges from 0 to 1

The function is monotonic while its derivative is not monotonic.

Used in Feed-Forward-Neural-Nets.

# code

Applying Sigmoid Activation Function
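A minimal pure-Python sketch of the sigmoid formula above (this is not the TensorFlow code behind the plot; the sample points `xs` are made up). It also checks the two properties listed: the outputs keep rising with the input, while the derivative rises and then falls, peaking at 0.25 when x = 0.

```python
import math

def sigmoid(x):
    # Split by sign so that exp never overflows for large |x|
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

xs = [-6.0, -3.0, 0.0, 3.0, 6.0]
ys = [sigmoid(x) for x in xs]

# Derivative of sigmoid expressed through its own output: y * (1 - y)
dys = [y * (1.0 - y) for y in ys]

print(ys)
print(dys)
```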

Tan Hyperbolic (tanh)

Hyperbolic tangent activation function.
tanh(x) = sinh(x)/cosh(x) = ((exp(x) - exp(-x))/(exp(x) + exp(-x))).

Differentiable

Output ranges from -1 to 1

The function is monotonic while its derivative is not monotonic.

Negative inputs are mapped to negative outputs, and inputs near zero are mapped to outputs near zero.

Used in Feed-Forward-Neural-Nets.

# code

Applying tanh Activation function
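The formula above can be checked directly in plain Python. The sketch below (the function name `tanh_from_exp` is my own) compares it with `math.tanh` and with the identity tanh(x) = 2 · sigmoid(2x) − 1, which is what "scaled and shifted sigmoid" means in practice.

```python
import math

def tanh_from_exp(x):
    # (exp(x) - exp(-x)) / (exp(x) + exp(-x)); fine for moderate |x|
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_from_exp(x), math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
```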

Rectified Linear Unit (ReLU)

In modern neural nets, using sigmoid or tanh can lead to the vanishing gradient problem; this can be remedied by using the ReLU activation function.

With default values, this returns the standard ReLU activation: max(x, 0), the element-wise maximum of 0 and the input tensor.

The function and its derivative both are monotonic.

Modifying default parameters allows you to use non-zero thresholds, change the max value of the activation, and to use a non-zero multiple of the input for values below the threshold.

Range: [0, ∞)

ReLU is more advantageous than sigmoid in modern deep nets.

#code

Applying ReLU activation
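To illustrate the non-default parameters described above, here is a scalar pure-Python sketch. The parameter names `negative_slope`, `max_value` and `threshold` follow my understanding of recent Keras naming (older releases call the slope `alpha`), so treat them as assumptions and check the documentation of your version.

```python
def relu(x, negative_slope=0.0, max_value=None, threshold=0.0):
    # Cap the activation if a maximum value is given
    if max_value is not None and x >= max_value:
        return max_value
    # At or above the threshold the input passes through unchanged
    if x >= threshold:
        return x
    # Below the threshold: a (possibly zero) multiple of the distance to it
    return negative_slope * (x - threshold)

print(relu(3.0))                        # standard ReLU on a positive input
print(relu(8.0, max_value=6.0))         # capped at 6
print(relu(0.5, threshold=1.0))         # below the threshold
print(relu(-2.0, negative_slope=0.1))   # leaky behaviour
```

With the defaults this reduces to max(x, 0), which matches the standard ReLU described above.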

# Resources

Github : https://github.com/Ajay-user/DataScience

Learn TensorFlow and Deep Learning fundamentals using Python :
Daniel Bourke
