Machine Learning from Disaster: Titanic Classification Problem

Ajay krishnan
8 min read · Nov 16, 2021


This is the legendary Titanic ML competition — the best, first challenge for you to dive into ML

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Your score is the percentage of passengers you correctly predict. This is known as accuracy.
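In other words, accuracy is just the number of correct predictions divided by the total number of predictions. A tiny sketch with toy labels (not competition data):

# Accuracy = correct predictions / total predictions
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 0, 1, 0])   # toy ground-truth survival labels
y_pred = np.array([1, 0, 1, 1, 0])   # toy predictions
print(accuracy_score(y_true, y_pred))  # 4 out of 5 correct -> 0.8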

Link to Kaggle
Link to Kaggle Notebook
Link to Github repo: https://github.com/Ajay-user/ML-DL-RL-repo/blob/master/Classification%20Problems/Titanic_classification_problem.ipynb

Today I’m going to show you how to solve this classification problem using both scikit-learn and TensorFlow. Without wasting any time, let’s start.

Solving the Titanic classification problem using scikit-learn models

Step 1 : Explore the data

# Read the csv ('url' points to the Titanic training CSV)
import pandas as pd

titanic_df = pd.read_csv(url)
titanic_df.describe().iloc[:3, :]

Check for missing values

# check for null/missing values
titanic_df.isna().sum()

Check for interesting patterns in the data

By simply plotting ‘Sex’ against ‘Survived’ we can see that female passengers survived at a much higher rate than male passengers.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.barplot(x='Sex', y='Survived', data=titanic_df, palette='coolwarm')
plt.subplot(1,2,2)
sns.countplot(x='Sex', data=titanic_df, hue='Survived', palette='inferno')
plt.show()
Survived and Sex
# Check for any trends or patterns in the data
total_passengers = len(titanic_df)
print('Total number of rows in the dataframe', total_passengers)

percentage_not_survived, percentage_survived = titanic_df['Survived'].value_counts(normalize=True)
print(f'Percentage survived {percentage_survived:.2%}  Percentage not survived {percentage_not_survived:.2%}')

male,female = titanic_df['Sex'].value_counts()
print('Male passenger',male)
print('Female passenger',female)

male_survived = titanic_df[(titanic_df['Sex']=='male')&(titanic_df['Survived']==1)]['Sex'].value_counts()['male']
female_survived = titanic_df[(titanic_df['Sex']=='female')&(titanic_df['Survived']==1)]['Sex'].value_counts()['female']
print(f'Male survived {male_survived} male survival rate {(male_survived/male)*100 :.2f}%')
print(f'Female survived {female_survived} female survival rate {(female_survived/female)*100 :.2f}%')
output

By studying the data we can see that when ‘Sex’ = female the survival rate was 74%: of the 314 female passengers, 233 survived.
When ‘Sex’ = male the survival rate was only 18%.

So we can think of a model that predicts “survived” whenever the sex field is ‘female’ and “not survived” whenever it is ‘male’.
Such a naive model can serve as our baseline; let’s see if we can beat it. A quick check of this baseline is shown below.
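Here is a minimal sketch of that gender-only baseline, using the DataFrame we already loaded:

# Naive baseline: predict survived (1) for females, not survived (0) for males
baseline_pred = (titanic_df['Sex'] == 'female').astype(int)
baseline_acc = (baseline_pred == titanic_df['Survived']).mean()
print(f'Gender-only baseline accuracy: {baseline_acc:.2%}')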

Step 2 : Data preprocessing

Our dataframe has numerical data (both continuous and discrete) and string data (categorical, unordered), plus missing fields that we need to take care of.
Models understand numbers only, so what can we do about it? Data preprocessing!

Handling missing values: check for null values and, if there are any, impute them using a SimpleImputer.

Handling Categorical feature columns :

we can encode the categories using a OneHotEncoder. Here we have Integer categories and String categories

Numerical categorical Features

  • ‘Pclass’
  • ‘SibSp’
  • ‘Parch’

String Categorical Features

  • ‘Sex’
  • ‘Cabin’
  • ‘Embarked’

Let’s first impute the missing values using a SimpleImputer, then encode them using a OneHotEncoder.

How do we handle numerical data in the dataset ?

Numerical Features

  • Fare
  • Age

Let’s first impute the missing values using a SimpleImputer, then standardize the features by removing the mean and scaling to unit variance.

There is a lot of preprocessing to do. How can we do this neatly and efficiently?

We can perform the above preprocessing by building a pipeline with sklearn.pipeline.Pipeline, which sequentially applies a list of transforms and a final estimator. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.

For continuous numerical columns like [‘Age’, ‘Fare’], we first impute the missing values using a SimpleImputer, then we scale the values by removing the mean and scaling to unit variance using a StandardScaler.

# So let's do some preprocessing!
# Select interesting features
# Use the Pipeline class to apply transformations sequentially
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# FLOAT COLUMNS / CONTINUOUS NUMERIC
numeric_cols = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[('impute', SimpleImputer(strategy='mean')),
                                      ('scale', StandardScaler())])
# Here SimpleImputer uses the mean value for imputing the missing values,
# hence strategy='mean'

For string categorical columns like [‘Sex’, ‘Cabin’, ‘Embarked’], we first impute the missing values using a SimpleImputer, then we encode the categories using a OneHotEncoder.

# Preprocessing string categories!
# Select interesting features
# Use the Pipeline class to apply transformations sequentially
# STRING CATEGORY
str_cat = ['Sex', 'Cabin', 'Embarked']
str_cat_transformer = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                                      ('one_hot', OneHotEncoder(handle_unknown='ignore'))])
# Here SimpleImputer uses the most frequent value for imputing,
# hence strategy='most_frequent'

For integer categorical columns like [‘Pclass’, ‘SibSp’, ‘Parch’], we first impute the missing values using a SimpleImputer, then we scale the values by removing the mean and scaling to unit variance using a StandardScaler.

# Preprocessing integer categories!
# Select interesting features
# Use the Pipeline class to apply transformations sequentially
# INTEGER CATEGORY
int_cat = ['Pclass', 'SibSp', 'Parch']
int_cat_transformer = Pipeline(steps=[('impute', SimpleImputer(strategy='median')),
                                      ('scale', StandardScaler())])
# Here SimpleImputer uses the median value for imputing the missing values,
# hence strategy='median'

Data preprocessing pipeline

ColumnTransformer is an estimator that allows different columns or column subsets of the input to be transformed separately; the features generated by each transformer are then concatenated to form a single feature space. This is useful since our data is heterogeneous.
We can use the ColumnTransformer to combine several feature extraction mechanisms or transformations into a single transformer.

# Use ColumnTransformer to combine all the preprocessing steps
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_cols),
    ('string', str_cat_transformer, str_cat),
    ('integer', int_cat_transformer, int_cat)
])
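As an optional sanity check (a small sketch, assuming the DataFrame loaded in Step 1), we can fit the combined preprocessor on its own and look at the shape of the transformed feature matrix:

# Fit the preprocessing-only transformer and inspect the encoded output shape
encoded = preprocessing.fit_transform(titanic_df)
print(encoded.shape)  # (number of rows, number of encoded feature columns)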

Step 3 : Building a Model Pipeline

Let’s use the different classification models available in scikit-learn and compare their performance.
First let’s build some utility functions for:
1. Combining the preprocessing pipeline and the model
2. Building a model pipeline for training and cross-validation (cross_val_score defaults to 5-fold cross-validation and, for classifiers, scores by accuracy)

# combine preprocessing and modeling
from sklearn.model_selection import cross_val_score

def build_model(model, preprocess=preprocessing):
    pipe = Pipeline(steps=[('preprocessing', preprocess),
                           ('modeling', model)])
    return pipe

# build a model pipeline for training and cross-validation
def model_training(model, X, y):
    pipe = build_model(model)
    return cross_val_score(pipe, X, y)

Now let’s list all the classification models we want to compare.
Model list = [ (name, model), … ]

# list of classification models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = [('RandomForest', RandomForestClassifier()),
          ('LogisticRegression', LogisticRegression()),
          ('GradientBoosting', GradientBoostingClassifier()),
          ('SVC', SVC()),
          ('SGDClassifier', SGDClassifier()),
          ('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss'))]

Step 4 : Training

# TRAINING

# Feature variables
X = titanic_df.drop(columns=['Survived'])
# labels
y = titanic_df['Survived']

for name, model in models:
    cv_scores = model_training(model, X, y)
    print(f'Model {name :20s} score: {cv_scores.mean()}')
Model performance

Can we Improve the model performance ?

Step 5 : Hyperparameter tuning

For tuning the models we can use GridSearchCV.

GridSearchCV performs an exhaustive search over specified parameter values for an estimator.

Let’s update the model list by adding tuning parameters to the tuples. Note that the parameter names are prefixed with modeling__ because, inside our pipeline, the estimator lives in the step named 'modeling'.
New model list = [ (name, model, hyperparameters), … ]

# Model list with tuning parameters
models = [('RandomForest',
           RandomForestClassifier(),
           {'modeling__max_depth': [i for i in range(4, 12)]}),

          ('LogisticRegression',
           LogisticRegression(),
           {'modeling__C': [i * 0.1 for i in range(10, 15)]}),

          ('GradientBoosting',
           GradientBoostingClassifier(),
           {'modeling__n_estimators': [i for i in range(100, 300, 50)]}),

          ('SVC',
           SVC(),
           {'modeling__C': [i for i in range(1, 10)]}),

          ('SGDClassifier',
           SGDClassifier(),
           {'modeling__warm_start': [True, False],
            'modeling__early_stopping': [True, False],
            'modeling__average': [True, False]}),

          ('XGBClassifier',
           XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
           {'modeling__colsample_bytree': [0.7],
            'modeling__colsample_bylevel': [0.5],
            'modeling__colsample_bynode': [0.7],
            'modeling__subsample': [0.6, 0.7]})]
# Tuning the models using GridSearchCV
from sklearn.model_selection import GridSearchCV

for name, model, param_grid in models:
    pipe = build_model(model)
    gs = GridSearchCV(pipe, param_grid)
    gs.fit(X, y)
    print(f'{name :20} {gs.best_score_}')
Improved performance
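If you also want to see which hyperparameter values won for each model, GridSearchCV exposes them through best_params_ after fitting. An optional variant of the loop above:

# Same tuning loop, also reporting the winning hyperparameters
for name, model, param_grid in models:
    pipe = build_model(model)
    gs = GridSearchCV(pipe, param_grid)
    gs.fit(X, y)
    print(f'{name :20} score: {gs.best_score_:.4f}  best params: {gs.best_params_}')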

That’s it. We explored the data, found interesting patterns, set a baseline, processed the data by building a data preprocessing pipeline, combined the preprocessing and modeling steps in a single pipeline, compared the performance of different models, and improved that performance with hyperparameter tuning.

Can we do the same using TensorFlow Neural Nets?

Yes! Let’s build a TensorFlow model and wrap up this tutorial.

If you’re passing a heterogeneous DataFrame to Keras, each column may need unique preprocessing. You could do this preprocessing directly in the DataFrame, but for a model to work correctly, inputs always need to be preprocessed the same way. So the best approach is to build the preprocessing into the model.

Note: If you have many features that need identical preprocessing it’s more efficient to concatenate them together before applying the preprocessing.

# Features and feature types
import numpy as np
import tensorflow as tf

features = [*numeric_cols, *int_cat, *str_cat]
feature_types = {feat: titanic_df[feat].dtype for feat in features}

# Input features
inputs = {}
for k, v in feature_types.items():
    inputs[k] = tf.keras.Input(shape=(1,), dtype=v, name=k)

Feature preprocessing with Keras layers

For continuous numeric features we will use a Normalization() layer to make sure the mean of each feature is 0 and its standard deviation is 1.

For categorical features encoded as integers, we will build a lookup table using an IntegerLookup() layer, then encode these features using CategoryEncoding().

For categorical features encoded as strings, we will build a lookup table using a StringLookup() layer, then encode these features using CategoryEncoding().

Two options we have here:

  • Use CategoryEncoding(), which requires knowing the range of input values and will error on input outside the range.
  • Use IntegerLookup() or StringLookup(), which will build a lookup table for inputs and reserve an output index for unknown input values.
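The helpers numerical_encoding, integer_cat_encoding and string_cat_encoding used in the next block are defined in the linked notebook. Below is a minimal sketch of how they could be implemented with the layers just described; sharing one lookup vocabulary per feature group and using multi-hot CategoryEncoding are simplifying assumptions of this sketch, not necessarily the notebook's exact code:

# Hedged sketches of the three encoding helpers used below
import numpy as np
import tensorflow as tf

def numerical_encoding(features, df, cols=('Age', 'Fare')):
    # Normalize each continuous column to mean 0 / std 1.
    # fillna(-1) mirrors the imputation used for the training dictionary later on.
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(np.array(df[sorted(cols)].fillna(-1), dtype='float32'))
    return normalizer(features)

def integer_cat_encoding(features, df, cols=('Pclass', 'SibSp', 'Parch')):
    # Build a lookup table over the integer categories, then multi-hot encode.
    lookup = tf.keras.layers.IntegerLookup()
    lookup.adapt(df[sorted(cols)].to_numpy())
    encoder = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(),
                                               output_mode='multi_hot')
    return encoder(lookup(features))

def string_cat_encoding(features, df, cols=('Sex', 'Cabin', 'Embarked')):
    # Build a lookup table over the string categories, then multi-hot encode.
    lookup = tf.keras.layers.StringLookup()
    lookup.adapt(df[sorted(cols)].fillna('missing').astype(str).to_numpy())
    encoder = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(),
                                               output_mode='multi_hot')
    return encoder(lookup(features))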
all_encoded=[]

# Numerical encoding
feature_input = [inputs[feat] for feat in sorted(numeric_cols)]
numerical_features = tf.keras.layers.concatenate(feature_input)
encoding = numerical_encoding(numerical_features, titanic_df)
all_encoded.append(encoding)
# Category encoding # integer category

feature_input = [inputs[feat] for feat in sorted(int_cat)]
integer_features = tf.keras.layers.concatenate(feature_input)
encoding=integer_cat_encoding(integer_features,titanic_df)
all_encoded.append(encoding)

# Category encoding # string category
feature_input = [inputs[feat] for feat in sorted(str_cat)]
string_features = tf.keras.layers.concatenate(feature_input)
encoding=string_cat_encoding(string_features,titanic_df)
all_encoded.append(encoding)

Assemble the preprocessing head
At this point “all_encoded” is just a Python list of all the preprocessing results; each result has a shape of (batch_size, depth).

Concatenate all the preprocessed features along the depth axis, so each example (a dictionary of features) is converted into a single vector.

# keras preprocessing head

feature_prep = tf.keras.layers.concatenate(all_encoded)
titanic_preprocessor = tf.keras.Model(inputs, feature_prep)

tf.keras.utils.plot_model(titanic_preprocessor,
show_shapes=True,
rankdir='LR')
Preprocessing head

Create and train a model

# TensorFlow Sequential model (body of the model)
fully_connected = tf.keras.Sequential([
    tf.keras.layers.Dense(32, 'relu'),
    tf.keras.layers.Dense(16, 'relu'),
    tf.keras.layers.Dense(1)
])

# utility function that builds and compiles the model
def build_tf_model(inputs, preprocessor, sequential):
    prep = preprocessor(inputs)
    results = sequential(prep)
    model = tf.keras.Model(inputs, results)

    model.compile(
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        optimizer='adam',
        metrics=['accuracy'])
    return model

# tensorflow model
tf_model = build_tf_model(inputs,
titanic_preprocessor,
fully_connected)

Keras models don’t automatically convert pandas DataFrames because it’s not clear whether a DataFrame should be converted to one tensor or to a dictionary of tensors. So convert it to a dictionary of tensors:

# TRAINING DATA
training_data = {}
for k, v in feature_types.items():
    na_impute = 'missing' if v == 'object' else -1
    vals = titanic_df[k].fillna(na_impute)
    training_data[k] = np.array(vals)

training_label = titanic_df['Survived']

Train the TensorFlow model

history = tf_model.fit(training_data, training_label, epochs=10)
Model training — Loss going down and accuracy increasing

Plot the model performance

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(history.history['accuracy'],color='green')
plt.title("MODEL ACCURACY")
plt.xlabel('Epochs')
plt.subplot(1,2,2)
plt.plot(history.history['loss'],color='salmon')
plt.title("MODEL LOSS")
plt.xlabel('Epochs')
plt.show()

You can see that accuracy reaches about 80%, while the loss bottoms out at about 0.40 after ten epochs.
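Since the final Dense layer outputs logits (we compiled the loss with from_logits=True), here is a small sketch of how to turn the model’s outputs into survival probabilities and 0/1 predictions, reusing the training_data dictionary built above:

# Convert logits to probabilities, then threshold at 0.5 for class predictions
logits = tf_model.predict(training_data)
probs = tf.nn.sigmoid(logits)
preds = tf.cast(probs > 0.5, tf.int32)
print(probs[:5], preds[:5])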

Key point: You will typically see the best results from deep learning with much larger and more complex datasets. When working with a small dataset like this one, it’s recommended to use a decision tree or random forest as a strong baseline. The goal of this tutorial is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so you have code to use as a starting point when working with your own datasets in the future.
