Getting the MNIST

Ajay krishnan
Feb 18, 2022 · 13 min read


Today we are going to build a model using TensorFlow that recognizes handwritten digits.
We will use the MNIST digits dataset and work our way up to a model with an accuracy of around 99%.

MNIST digit recognition is a fairly easy image classification problem, so let's start with a simple model and add complexity as we go.

The MNIST dataset is a very popular machine learning dataset consisting of 70,000 grayscale images of handwritten digits, each 28x28 pixels.

import tensorflow as tf
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print('training image set shape', x_train.shape)
print('training label set shape', y_train.shape)
print('testing image set shape', x_test.shape)
print('testing label set shape', y_test.shape)
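# For the standard Keras MNIST split this prints:
# training image set shape (60000, 28, 28)
# training label set shape (60000,)
# testing image set shape (10000, 28, 28)
# testing label set shape (10000,)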
MNIST dataset

Pre-processing

The images in the dataset have dimensions of 28x28 and pixel values in the range 0 to 255. We need to scale the values down, and we will also build input pipelines using tf.data.Dataset.from_tensor_slices().

# scaling
X_train = x_train / 255.0
X_test = x_test / 255.0

# create a utility for the training dataset
def create_training_ds(X, y, batch_size=128):
    ds = tf.data.Dataset.from_tensor_slices((X, y))
    # this small dataset can be entirely cached in RAM
    ds = ds.cache()
    # the training set should be well shuffled
    ds = ds.shuffle(buffer_size=5000)
    # repeat the dataset
    ds = ds.repeat()
    # create batches
    ds = ds.batch(batch_size=batch_size, drop_remainder=True)
    # fetch the next batches while training on the current one
    ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return ds

# create a utility for the evaluation dataset
def create_eval_ds(X, y, batch_size=128):
    ds = tf.data.Dataset.from_tensor_slices((X, y))
    # this small dataset can be entirely cached in RAM
    ds = ds.cache()
    # repeat the dataset
    ds = ds.repeat()
    # create batches
    ds = ds.batch(batch_size=batch_size, drop_remainder=True)
    return ds
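The fit() calls later in the post refer to train_ds, eval_ds and EPOCHS without showing where they come from; assuming they are built from the two utilities above, a minimal sketch:

train_ds = create_training_ds(X_train, y_train)
eval_ds = create_eval_ds(X_test, y_test)
EPOCHS = 10  # assumption: the training results shown below are reported after 10 epochs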

Now, without wasting any time, let's jump straight in and build some models.

Single Layer Perceptron

Let's build a model with a single (hidden) dense layer of 10 units, one for each class. A softmax activation on this layer gives us the probabilities that a given input belongs to each of the classes.

Our inputs are 2D images of shape [28, 28]. We can't feed these into the dense layer directly; we have to flatten them first.
So [28 x 28] => 784 flattened pixels
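As a quick sanity check (a one-off snippet, not part of the models below), Flatten maps each 28x28 image to a length-784 vector:

tf.keras.layers.Flatten()(tf.zeros((1, 28, 28))).shape  # TensorShape([1, 784])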

Steps:

  • Flatten the input image: we get a vector where each component corresponds to a pixel in the image
  • Pass the flattened image into the dense layer: here we compute a weighted sum of the input pixels and add a bias
  • Apply softmax to the layer's output: this gives us output probabilities that sum to one
Single layer perceptron
model_1 = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=10, activation='softmax')
])

model_1.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

Let's train and evaluate the model.
During training the model reaches about 92% accuracy, and evaluation also gives us about 92%.

BATCH_SIZE = 128
STEP_PER_EPOCH = 60000 // BATCH_SIZE  # 60,000 items in the training set

history = model_1.fit(train_ds,
                      steps_per_epoch=STEP_PER_EPOCH,
                      epochs=EPOCHS,
                      validation_data=eval_ds,
                      validation_steps=100)

This gives us a good baseline to start with. Let's try to beat it as we go forward.

loss, accuracy = model_1.evaluate(eval_ds, steps=100)
print('Loss',loss)
print('Accuracy',accuracy)

Before we build more complex models, let's take a look at the steps_per_epoch parameter of model.fit(). Since we are using dataset.repeat(), we have to provide a value for this parameter; otherwise an epoch would never end, and TensorFlow raises an error rather than let training run forever.
We use .repeat() on both the training and the validation dataset, so be mindful of this (hence validation_steps as well).

Performance of Single layer perceptron

Multi layer perceptron

Let's try to beat the performance of the single layer perceptron with a multi layer perceptron that has four dense layers.
Adding more hidden layers makes the network more powerful: it can learn the nooks and crannies of the dataset. But there is a problem, and it's called overfitting. If the model overfits the training data, it fails to generalize to unseen data, so we have to be mindful of this when adding more layers to our model.

Steps:

  • Flatten the input image: we get a vector where each component corresponds to a pixel in the image
  • Pass the flattened image into the first dense layer: 128 units computing the weighted sum W(1)X + b(1), followed by a ReLU activation
  • Pass the output of the first layer into the second dense layer: 64 units computing W(2)X + b(2), followed by a ReLU activation
  • Pass the output of the second layer into the third dense layer: 32 units computing W(3)X + b(3), followed by a ReLU activation
  • Pass the output of the third layer into the fourth dense layer: 10 units computing W(4)X + b(4), followed by a softmax activation

Each layer is characterized by a set of weights and biases; these weights get updated via back-propagation when we train the model.
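To make "a set of weights and biases" concrete, the parameter count for this architecture works out as follows:

# 784 inputs -> Dense(128) -> Dense(64) -> Dense(32) -> Dense(10)
# layer 1: 784*128 + 128 = 100,480
# layer 2: 128*64  + 64  =   8,256
# layer 3:  64*32  + 32  =   2,080
# layer 4:  32*10  + 10  =     330
# total trainable parameters = 111,146 (this matches model_2.summary() for the model below)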

Multi layer perceptron
model_2 = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=128, activation='relu'),
    tf.keras.layers.Dense(units=64, activation='relu'),
    tf.keras.layers.Dense(units=32, activation='relu'),
    tf.keras.layers.Dense(units=10, activation='softmax')
])

model_2.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
Training accuracy and loss after 10 epochs
# let's check the performance on eval ds
loss,accuracy = model_2.evaluate(eval_ds, steps=100)
print('Loss',loss)
print('Accuracy',accuracy)
Model Evaluation
Model overfits the training data

How can we make our model more general?
How can we make our model more robust to unseen data?

  • use Dropout layers
  • use weight regularization techniques like L1 and L2
  • use a simpler model (sparse representation)
  • Reduce the learning rate
  • Data augmentation
  • Early stopping

In our case we don’t need all of these, dropout layers and a decaying learning rate will do the job for us. Lets make use of some of these techniques:

Dropout layers

Dropout is a form of regularization where we drop out units in the hidden layers according to a dropout probability. Each batch of data effectively sees a different neural network, because units are dropped at random and the data flows through a different path on every iteration; essentially we are simulating ensemble learning. With units dropped, the network is forced to learn patterns that generalize instead of the noise in the training data.

How does dropout help?

  • During training, the weights of individual neurons specialize for specific features
  • Neighboring neurons start relying on these specializations (co-adaptation)
  • This leads to a model that is too specialized to the training data
  • As neurons are randomly dropped, other neurons have to step in and compensate
  • Thus the network learns multiple independent representations
  • This makes the network less sensitive to any specific weights
  • It enhances the generalization capability of the network
  • The network becomes less vulnerable to overfitting
  • The whole network is used during testing; there is no dropout at test time
  • Dropout increases the number of iterations needed to converge, but helps avoid overfitting

In TensorFlow we can achieve this using tf.keras.layers.Dropout(rate=0.2).
We have to choose the dropout rate carefully: if the rate is very low there will be little effect, and if it is too high too few activations get through.
A generally good range is 10% to 40%.
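A quick standalone illustration (not part of the article's models) of the point that dropout is only active during training:

drop = tf.keras.layers.Dropout(rate=0.2)
x = tf.ones((1, 10))
print(drop(x, training=True))   # roughly 20% of the values are zeroed, the rest scaled by 1/0.8
print(drop(x, training=False))  # identity: dropout is disabled at inference time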

Weight Regularization

Minimize: loss(data | model) + penalty for model complexity

We can regularize the weights using L1 or L2 regularization.
Weight regularization defines model complexity as the magnitude of the weight vector, and the magnitude of the weight vector is measured by a norm function.

The L2 norm is simply the Euclidean length:
Suppose our weight vector is [w(1), w(2), w(3), ..., w(n)]
L2 norm = sqrt(w(1)² + w(2)² + w(3)² + ... + w(n)²)

The L1 norm is simply the sum of the absolute values of the weights:
Suppose our weight vector is [w(1), w(2), w(3), ..., w(n)]
L1 norm = |w(1)| + |w(2)| + |w(3)| + ... + |w(n)|

Weight regularization

The idea is to keep the weight vector within a certain size.

L1 regularization gives us sparsity by zeroing out poor predictors, while
L2 regularization keeps all the variables but gives poor predictors very low weights (close to zero).

Weight Regularization
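In Keras, weight regularization is typically attached per layer through the kernel_regularizer argument; a minimal sketch (not used in the models below):

tf.keras.layers.Dense(
    units=128,
    activation='relu',
    kernel_regularizer=tf.keras.regularizers.l2(0.01))  # adds 0.01 * sum(w^2) to the loss
# use tf.keras.regularizers.l1(0.01) instead for the sparsity-inducing L1 penalty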

Data augmentation

As we add more layers to our model, we increase the size of the weight matrices and bias vectors. The more parameters we have, the more parameters we need to update, and updating them well requires more data. The more data we have, the more confidence we can place in the model's predictions.
If the number of model parameters is greater than the amount of training data we have, there are two common solutions:

  • Data augmentation
    Add noise to the data we already have to create similar replicas, and train the model on the augmented data (see the sketch after this list)
  • Transfer learning
    Make use of a model trained on a similar problem and reuse the weights that model learned to solve the problem at hand
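For image data, augmentation is often done with the Keras preprocessing layers available in recent TF 2.x releases; a rough sketch (the layers and factors here are illustrative and are not used in this article's models):

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),         # small random rotations
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # shift up to 10% vertically/horizontally
    tf.keras.layers.RandomZoom(0.1),
])
# expects a channel dimension, e.g. images of shape (batch, 28, 28, 1)
# augmented = data_augmentation(images, training=True)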

Early Stopping

Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. Early stopping lets you specify an arbitrarily large number of training epochs and stop training once the model's performance stops improving on a held-out validation dataset.

Stop training when a monitored metric has stopped improving.

tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0,
    patience=0,
    verbose=0,
    mode="auto",
    baseline=None,
    restore_best_weights=False,
)
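A typical configuration (the values here are just an example) and how it plugs into training:

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=3,                  # stop after 3 epochs with no improvement
    restore_best_weights=True)   # roll the weights back to the best epoch
# model.fit(..., callbacks=[early_stop])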

Learning Rate

The learning rate controls the size of each weight update:
W(t+1) = W(t) - learning_rate * gradient

  • The learning rate controls the size of the step we take in weight space.
  • If the learning rate is too small, it will take too long to converge.
  • If the learning rate is too large, we may overshoot the optimum.

In TensorFlow we can use tf.keras.callbacks.LearningRateScheduler to control the learning rate at each epoch.

tf.keras.callbacks.LearningRateScheduler(
    schedule, verbose=0
)

At the beginning of every epoch, this callback gets an updated learning rate value from the schedule function we provide.
Let's create a schedule function:

import math

# decaying learning rate
def decay_lr(epoch):
    return 0.01 * math.pow(0.6, epoch)

# let's view the decay
decay = []
for epoch in range(EPOCHS):
    decay.append(decay_lr(epoch))
Learning rate decay
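For reference, 0.01 * 0.6**epoch comes out to roughly 0.01, 0.006, 0.0036, 0.00216, 0.0013, ..., so the learning rate shrinks by 40% each epoch: large steps early on, fine adjustments later.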

Now let's create a new model with these modifications to combat overfitting.

model_3 = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=128, activation='relu'),
    tf.keras.layers.Dropout(rate=0.2),

    tf.keras.layers.Dense(units=64, activation='relu'),
    tf.keras.layers.Dropout(rate=0.2),

    tf.keras.layers.Dense(units=32, activation='relu'),
    tf.keras.layers.Dropout(rate=0.2),

    tf.keras.layers.Dense(units=10, activation='softmax')
])

model_3.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Let's make use of the schedule function to decay the learning rate with each epoch.

lr = tf.keras.callbacks.LearningRateScheduler(decay_lr)
history_3 = model_3.fit(train_ds,
                        epochs=EPOCHS,
                        steps_per_epoch=STEP_PER_EPOCH,
                        validation_data=eval_ds,
                        validation_steps=100,
                        callbacks=[lr])
Training performance
# lets evaluate
loss, accuracy = model_3.evaluate(eval_ds, steps=100)
print('Loss',loss)
print('Accuracy', accuracy)
model performance during Evaluation

Our model now gives us an accuracy of around 97% during both training and evaluation.

Model learning

Convolutional Neural Networks (CNN)

Further reading: Stanford's deep learning lectures and tutorial.

The traditional machine learning approach consists of fully connected dense layers: we flatten the inputs and feed them to this fully connected network to get an output, as in the Multi Layer Perceptron (MLP) above.

Traditional methods do not handle spatial transformations well. Comparing feature vectors with straight-line distance works for structured data, but for unstructured data like images it does not.

Suppose someone showed us an image of a cat and a rotated version of the same image: humans recognize both as the same cat.

For computers, the traditional method of comparing pixel to pixel is not robust to these transformations. A computer sees different pixel values at the corresponding locations when we give it an image and a shifted or flipped version of the same image, so it thinks these are two different images.

IMAGE → FLATTEN → MLP → output class

The idea, then, is to make our model robust to transformations like rotation, shifting, zooming and so on, since the object in an image stays the same under such spatial transformations.

CNNs were the solution: a powerful model family that works great in the image domain. The key idea is to decouple the filters from specific locations in the image.

A CNN learns filters by sliding them across the image; wherever a filter aligns with a pattern in the image, it produces a strong response. In the early layers the filters learn very basic patterns, and as we go deeper they learn more and more complex ones. Essentially, a CNN learns a hierarchy of features.

Let's build a CNN model and see if we can outperform the MLP.
Conv2D layers expect a 3D input (height, width, channels), so we need to reshape the images to give them a channel dimension.

model_4 = tf.keras.Sequential([
    tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(filters=12,
                           kernel_size=3,
                           strides=1,
                           padding='same'),
    tf.keras.layers.Conv2D(filters=24,
                           kernel_size=6,
                           strides=2,
                           padding='same'),
    tf.keras.layers.Conv2D(filters=32,
                           kernel_size=6,
                           strides=2,
                           padding='same'),

    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=32, activation='relu'),
    tf.keras.layers.Dense(units=10, activation='softmax')
])
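# Shape check (all convolutions use 'same' padding):
# (28, 28, 1) -> stride-1 conv -> (28, 28, 12) -> stride-2 conv -> (14, 14, 24)
# -> stride-2 conv -> (7, 7, 32) -> Flatten -> 7*7*32 = 1568 values -> Dense(32) -> Dense(10)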

model_4.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# MODEL TRAINING
history_4 = model_4.fit(train_ds,
                        epochs=EPOCHS,
                        steps_per_epoch=STEP_PER_EPOCH,
                        validation_data=eval_ds,
                        validation_steps=100)
# EVALUATION
loss, accuracy = model_4.evaluate(eval_ds, steps=100)
print('Loss',loss)
print('Accuracy',accuracy)
Evaluation of CNN model

AlexNet has 8 layers, VGG Net has 19 layers, and GoogLeNet has 22 layers.
As we go deeper, we face many problems, such as:

  • Long training time
  • Internal Co-variate shift
  • Vanishing gradient
  • Exploding gradient
  • Sensitivity to parameter initialization

Internal Covariate Shift

Training deep neural nets is complicated by the fact that the distribution of each layer's inputs changes during training as the parameters of the previous layers change. This slows down training by requiring lower learning rates and careful parameter initialization.
It also makes training models with saturating non-linearities extremely difficult.

We can think of a neural network simply as nested functions, function(function(fn(...))), where the input to each function is the output of the previous one.

e.g. three layers of a neural network → h(g(f(input)))

During every update the weights change, so the output distribution of each function changes; that output is fed as the input to the next function, and so on.
These shifting input distributions can cause dramatic weight updates, the activations end up in the saturating regions of the activation function, and ultimately learning suffers.

We could try to control the dramatic weight updates by using a very low learning rate, but then the model takes too long to converge, or may never converge. This is one reason people often say training deep nets takes too much time.

We could use dropout layers, but that is of limited help here. Dropout is a regularization technique that keeps the model from fitting the training data too closely; the whole point of going deep is to capture more complex patterns, so dropout alone cannot fix this problem.

We refer to this problem as internal covariate shift, and the solution is batch normalization.

Batch normalization mitigates internal covariate shift by re-scaling and re-centering the inputs to each layer of the network.
It makes normalization part of the model architecture and performs the normalization on each mini-batch during training.

Without batch normalization the inputs to a layer can become skewed. Batch normalization rests on three ideas:

  • Compute the average and variance on the mini-batch: since we train with mini-batches of data, we can compute statistics like the mean and standard deviation on each one
  • Re-scale and re-center the data: use the computed statistics to normalize the logits
  • Restore expressiveness with two learnable parameters: add a learnable scale and offset for each logit so the layer can recover its expressive power

avg = average of the batch
std_dev = standard deviation of the batch

x_hat = (x - avg) / std_dev

BN(x) = alpha * x_hat + beta

These learnable parameters, alpha and beta, can give us back the original logit if that turns out to be the right thing to do.
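A minimal NumPy sketch of these three steps (illustrative only; tf.keras.layers.BatchNormalization additionally keeps moving averages of the statistics for use at inference time):

import numpy as np

def batch_norm(x, alpha, beta, eps=1e-5):
    # x: the logits of one mini-batch, shape (batch_size, units)
    avg = x.mean(axis=0)                  # per-unit mean over the batch
    std_dev = x.std(axis=0)               # per-unit standard deviation over the batch
    x_hat = (x - avg) / (std_dev + eps)   # re-scale and re-center
    return alpha * x_hat + beta           # learnable scale and offset restore expressiveness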

Steps:

In each layer, the weighted sum plus bias (the logits) is re-scaled and re-centered
→ the normalized logits are then passed to the activation function

So the batch normalization layer sits between the weighted sum and the activation function.

Where to put the batch normalization layer

CNN model with Batch normalization

Changes :

  • Use Batch normalization
  • Use ‘relu’ activation
  • Use dropout layers
  • Use a decaying learning rate
model_5 = tf.keras.Sequential([
    tf.keras.layers.Reshape(target_shape=(28, 28, 1)),

    tf.keras.layers.Conv2D(filters=12, kernel_size=3,
                           strides=1, padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),

    tf.keras.layers.Conv2D(filters=24, kernel_size=6,
                           strides=2, padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),

    tf.keras.layers.Conv2D(filters=32, kernel_size=6,
                           strides=2, padding='same'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),

    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=128),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),

    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(units=10, activation='softmax')
])

model_5.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# lets use a decaying learning rate
lr = tf.keras.callbacks.LearningRateScheduler(decay_lr)

history_5 = model_5.fit(train_ds,
                        epochs=EPOCHS,
                        steps_per_epoch=STEP_PER_EPOCH,
                        validation_data=eval_ds,
                        validation_steps=100,
                        callbacks=[lr])
Training loss and accuracy
# lets evaluate
loss, accuracy = model_5.evaluate(eval_ds, steps=100)
print('Loss',loss)
print('Accuracy',accuracy)
Evaluation performance
Model performance

Our model has learned the MNIST digits well enough to produce an accuracy of around 99% during both training and evaluation.

That’s it for now… Check out the resources for notebook github link

Resources

Notebook link: GitHub
