Machine Learning from Disaster: Titanic Classification Problem
This is the legendary Titanic ML competition — the best, first challenge for you to dive into ML
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Your score is the percentage of passengers you correctly predict. This is known as accuracy.
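As a quick illustration (with made-up labels), accuracy is simply the fraction of predictions that match the true outcomes:
import numpy as np

# hypothetical labels, just to illustrate the metric
y_true = np.array([0, 1, 1, 0, 1])     # actual outcomes
y_pred = np.array([0, 1, 0, 0, 1])     # model predictions
accuracy = (y_true == y_pred).mean()   # fraction of correct predictions
print(accuracy)                        # 0.8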
Link to Kaggle
Link to Kaggle Notebook
Link to Github repo
Today I’m going to show you how to solve this classification problem using both scikit-learn and TensorFlow. Without wasting any time, let’s start.
Solving the Titanic classification problem using scikit-learn models
Step 1: Explore the data
import pandas as pd

# Read the csv ('url' should point to the Titanic train.csv)
titanic_df = pd.read_csv(url)
titanic_df.describe().iloc[:3, :]
Check for missing values
# check for null/missing values
titanic_df.isna().sum()
Check for interesting patterns in the data
By simply plotting ‘Sex’ against ‘Survived’ we can see that females survived at a much higher rate than males.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.barplot(x='Sex', y='Survived', data=titanic_df, palette='coolwarm')
plt.subplot(1,2,2)
sns.countplot(x='Sex', data=titanic_df, hue='Survived', palette='inferno')
plt.show()
# Check for any trends or patterns in the data
total_passengers = len(titanic_df)
print('Total number of rows in the dataframe',total_passengers)
percentage_not_survived, percentage_survived = titanic_df['Survived'].value_counts(normalize=True)
print(f'Percentage survived {percentage_survived*100 :.2f}%  Percentage not survived {percentage_not_survived*100 :.2f}%')
male,female = titanic_df['Sex'].value_counts()
print('Male passenger',male)
print('Female passenger',female)
male_survived = titanic_df[(titanic_df['Sex']=='male')&(titanic_df['Survived']==1)]['Sex'].value_counts()['male']
female_survived = titanic_df[(titanic_df['Sex']=='female')&(titanic_df['Survived']==1)]['Sex'].value_counts()['female']
print(f'Male survived {male_survived} male survival rate {(male_survived/male)*100 :.2f}%')
print(f'Female survived {female_survived} female survival rate {(female_survived/female)*100 :.2f}%')
Studying the data, we can see that when ‘Sex’ = female the survival rate was 74%: of the 314 female passengers, 233 survived.
When ‘Sex’ = male, the survival rate was only about 18%.
So we can imagine a model that predicts “survived” whenever the Sex field is ‘female’ and “not survived” whenever it is ‘male’.
Such a naive model can serve as our baseline. Let’s see if we can beat it.
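A minimal sketch of that baseline (column names as in the Kaggle train.csv), scored against the training labels:
# gender-only baseline: predict survived (1) for females, not survived (0) for males
print(titanic_df.groupby('Sex')['Survived'].mean())    # survival rate per sex
baseline_pred = (titanic_df['Sex'] == 'female').astype(int)
baseline_accuracy = (baseline_pred == titanic_df['Survived']).mean()
print(f'Baseline accuracy: {baseline_accuracy:.2%}')   # roughly 79% on the training data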
Step 2: Data preprocessing
Our DataFrame has numerical data (both continuous and discrete) and string data (categorical, unordered), plus missing values.
Some fields are missing in the dataset, so we need to take care of those as well. Models understand only numbers, so what can we do about it? Data preprocessing!
Handling missing values: check for null values and, if there are any, impute them using a SimpleImputer.
Handling categorical feature columns:
We can encode the categories using a OneHotEncoder. Here we have integer categories and string categories.
Numerical Categorical Features
- ‘Pclass’
- ‘SibSp’
- ‘Parch’
String Categorical Features
- ‘Sex’
- ‘Cabin’
- ‘Embarked’
Let’s first impute the missing values using a SimpleImputer, then encode the categories using a OneHotEncoder.
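As a quick standalone look at what the encoder does (a sketch assuming the ‘Sex’ column and scikit-learn ≥ 1.0 for get_feature_names_out), each category becomes its own 0/1 indicator column:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(titanic_df[['Sex']]).toarray()   # sparse matrix -> dense array
print(encoder.get_feature_names_out())   # ['Sex_female' 'Sex_male']
print(encoded[:3])                       # one 0/1 indicator column per category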
How do we handle numerical data in the dataset?
Numerical Features
- Fare
- Age
Let’s first impute the missing values using a SimpleImputer, then standardize the features by removing the mean and scaling to unit variance with a StandardScaler.
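In other words, each value x is replaced by z = (x − mean) / std. A small sanity check of this on the ‘Age’ column (after mean-imputing the missing values) might look like:
from sklearn.preprocessing import StandardScaler

age = titanic_df['Age'].fillna(titanic_df['Age'].mean())
manual_z = (age - age.mean()) / age.std(ddof=0)        # ddof=0 matches StandardScaler's population std
scaled = StandardScaler().fit_transform(age.to_frame())
print(manual_z.head().values)
print(scaled[:5].ravel())                              # same values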
There is a lot of preprocessing to do. How can we do this neatly and efficiently?
We can perform the above preprocessing by building a sklearn.pipeline.Pipeline, which sequentially applies a list of transforms followed by a final estimator. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
For continuous numerical columns like [‘Age’, ‘Fare’], we first impute the missing values using a SimpleImputer, then scale the values by removing the mean and scaling to unit variance using a StandardScaler.
# So let's do some preprocessing!
# Select the interesting features and use the Pipeline class to apply transformations sequentially
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# FLOAT COLUMNS / CONTINUOUS NUMERIC
numeric_cols = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])
# SimpleImputer fills the missing values with the column mean, hence strategy='mean'
For string categorical columns like [‘Sex’, ‘Cabin’, ‘Embarked’], we first impute the missing values using a SimpleImputer, then encode the categories using a OneHotEncoder.
# Preprocessing the string categories
# Select the interesting features and apply transformations sequentially

# STRING CATEGORY
str_cat = ['Sex', 'Cabin', 'Embarked']
str_cat_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])
# SimpleImputer fills the missing values with the most frequent value, hence strategy='most_frequent'
For integer categorical columns like [‘Pclass’, ‘SibSp’, ‘Parch’], we first impute the missing values (with the column median) using a SimpleImputer, then scale the values by removing the mean and scaling to unit variance using a StandardScaler.
# Preprocessing the integer categories
# Select the interesting features and apply transformations sequentially

# INTEGER CATEGORY
int_cat = ['Pclass', 'SibSp', 'Parch']
int_cat_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
# SimpleImputer fills the missing values with the column median, hence strategy='median'
Data preprocessing pipeline
ColumnTransformer is an estimator that allows different columns or column subsets of the input to be transformed separately; the features generated by each transformer are concatenated to form a single feature space. This is useful since our data is heterogeneous.
We can use the ColumnTransformer to combine several feature extraction mechanisms or transformations into a single transformer.
# Use ColumnTransformer to combine all the preprocessing steps
from sklearn.compose import ColumnTransformer

preprocessing = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_cols),
    ('string', str_cat_transformer, str_cat),
    ('integer', int_cat_transformer, int_cat)
])
Step 3: Building a Model Pipeline
Let’s use the different classification models available in scikit-learn and compare their performance.
First, let’s build some utility functions for:
1. Combining the preprocessing pipeline and modeling
2. Building a model pipeline for training and cross-validation
from sklearn.model_selection import cross_val_score

# combine preprocessing and modeling
def build_model(model, preprocess=preprocessing):
    pipe = Pipeline(steps=[('preprocessing', preprocess),
                           ('modeling', model)])
    return pipe

# build a model pipeline for training and cross-validation
def model_training(model, X, y):
    pipe = build_model(model)
    return cross_val_score(pipe, X, y)
Now let’s list all the classification models we want to compare.
Model list = [(name, model), …]
# list of classification models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = [('RandomForest', RandomForestClassifier()),
          ('LogisticRegression', LogisticRegression()),
          ('GradientBoosting', GradientBoostingClassifier()),
          ('SVC', SVC()),
          ('SGDClassifier', SGDClassifier()),
          ('XGBClassifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss'))]
Step 4: Training
# TRAINING
# Feature variables
X = titanic_df.drop(columns=['Survived'])
# labels
y = titanic_df['Survived']

for name, model in models:
    cv_scores = model_training(model, X, y)
    print(f'Model {name :20s} score: {cv_scores.mean()}')
Can we improve the model performance?
Step 5: Hyperparameter tuning
For tuning the models we can use GridSearchCV, which performs an exhaustive search over specified parameter values.
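Note that the parameter names in the grids below are prefixed with modeling__ because that is the name we gave the model step inside the pipeline (the double underscore routes the parameter to that step). If in doubt, you can list the valid names, as in this quick check:
# discover the tunable parameter names exposed through the pipeline's 'modeling' step
pipe = build_model(RandomForestClassifier())
print([p for p in pipe.get_params() if p.startswith('modeling__')][:5])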
Let’s update the model list by adding tuning parameters to the tuples.
New model list = [(name, model, hyperparameters), …]
# Model list with tuning parameters
models = [('RandomForest',
           RandomForestClassifier(),
           {'modeling__max_depth': [i for i in range(4, 12)]}),
          ('LogisticRegression',
           LogisticRegression(),
           {'modeling__C': [i*0.1 for i in range(10, 15)]}),
          ('GradientBoosting',
           GradientBoostingClassifier(),
           {'modeling__n_estimators': [i for i in range(100, 300, 50)]}),
          ('SVC',
           SVC(),
           {'modeling__C': [i for i in range(1, 10)]}),
          ('SGDClassifier',
           SGDClassifier(),
           {'modeling__warm_start': [True, False],
            'modeling__early_stopping': [True, False],
            'modeling__average': [True, False]}),
          ('XGBClassifier',
           XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
           {'modeling__colsample_bytree': [0.7],
            'modeling__colsample_bylevel': [0.5],
            'modeling__colsample_bynode': [0.7],
            'modeling__subsample': [0.6, 0.7]})]

# Tuning the models using GridSearchCV
from sklearn.model_selection import GridSearchCV

for name, model, param_grid in models:
    pipe = build_model(model)
    gs = GridSearchCV(pipe, param_grid)
    gs.fit(X, y)
    print(f'{name :20} {gs.best_score_}')
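Each fitted GridSearchCV object keeps the winning configuration and a refit pipeline, so you could go on to generate predictions for Kaggle's test set roughly like this (the test-file path is an assumption):
# inspect the best hyperparameters found by the last search
print(gs.best_params_)

# best_estimator_ is the winning pipeline, refit on all of X, y (refit=True by default)
test_df = pd.read_csv('test.csv')                  # path to Kaggle's test set (assumed)
predictions = gs.best_estimator_.predict(test_df)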
That’s it. We explored the data, found interesting patterns, set a baseline, processed the data by creating a preprocessing pipeline, combined the preprocessing and modeling steps in a single pipeline, compared the performance of different models, and improved it with hyperparameter tuning.
Can we do the same using TensorFlow neural nets?
Yes! Let’s build a TensorFlow model and wrap up this tutorial.
If you’re passing a heterogeneous DataFrame to Keras, each column may need unique preprocessing. You could do this preprocessing directly in the DataFrame, but for a model to work correctly, inputs always need to be preprocessed the same way. So the best approach is to build the preprocessing into the model.
Note: If you have many features that need identical preprocessing it’s more efficient to concatenate them together before applying the preprocessing.
import tensorflow as tf

# Features and feature types
features = [*numeric_cols, *int_cat, *str_cat]
feature_types = {feat: titanic_df[feat].dtype for feat in features}

# Input features
inputs = {}
for k, v in feature_types.items():
    inputs[k] = tf.keras.Input(shape=(1,), dtype=v, name=k)
Feature preprocessing with Keras layers
For continuous numeric features we will use a Normalization() layer to make sure the mean of each feature is 0 and its standard deviation is 1.
For categorical features encoded as integers, we will build a lookup table using an IntegerLookup() layer and then encode these features using CategoryEncoding().
We also have categorical features encoded as strings; for these we will build a lookup table using a StringLookup() layer and then encode them using CategoryEncoding().
We have two options here:
- Use CategoryEncoding(), which requires knowing the range of input values and will error on input outside that range.
- Use IntegerLookup() or StringLookup(), which will build a lookup table for the inputs and reserve an output index for unknown input values.
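The three helpers used in the next block (numerical_encoding, integer_cat_encoding, string_cat_encoding) are defined in the accompanying notebook. The sketch below is one possible implementation following the recipe above; the multi-hot CategoryEncoding and the 'missing' placeholder (the same one used later for training) are my assumptions:
# Hypothetical implementations of the three encoding helpers, following the
# Normalization / Lookup + CategoryEncoding recipe described above.
import numpy as np
import tensorflow as tf

def numerical_encoding(feature_tensor, df):
    # zero-mean, unit-variance scaling for the concatenated continuous columns
    norm = tf.keras.layers.Normalization(axis=-1)
    norm.adapt(np.stack([df[c].fillna(df[c].mean()) for c in sorted(numeric_cols)], axis=-1))
    return norm(feature_tensor)

def integer_cat_encoding(feature_tensor, df):
    # lookup table over the integer categories, then a multi-hot category encoding
    lookup = tf.keras.layers.IntegerLookup()
    lookup.adapt(df[sorted(int_cat)].to_numpy())
    encode = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(),
                                              output_mode='multi_hot')
    return encode(lookup(feature_tensor))

def string_cat_encoding(feature_tensor, df):
    # lookup table over the string categories, then a multi-hot category encoding
    vocab = np.unique(df[sorted(str_cat)].fillna('missing').astype(str).to_numpy())
    lookup = tf.keras.layers.StringLookup(vocabulary=vocab)
    encode = tf.keras.layers.CategoryEncoding(num_tokens=lookup.vocabulary_size(),
                                              output_mode='multi_hot')
    return encode(lookup(feature_tensor))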
all_encoded=[]
# Numerical encoding
feature_input = [inputs[feat] for feat in sorted(numeric_cols)]
numerical_features = tf.keras.layers.concatenate(feature_input)
encoding = numerical_encoding(numerical_features, titanic_df)
all_encoded.append(encoding)

# Category encoding: integer categories
feature_input = [inputs[feat] for feat in sorted(int_cat)]
integer_features = tf.keras.layers.concatenate(feature_input)
encoding=integer_cat_encoding(integer_features,titanic_df)
all_encoded.append(encoding)
# Category encoding: string categories
feature_input = [inputs[feat] for feat in sorted(str_cat)]
string_features = tf.keras.layers.concatenate(feature_input)
encoding=string_cat_encoding(string_features,titanic_df)
all_encoded.append(encoding)
Assemble the preprocessing head
At this point all_encoded is just a Python list of all the preprocessing results; each result has a shape of (batch_size, depth).
Concatenate all the preprocessed features along the depth axis, so each dictionary-example is converted into a single vector.
# keras preprocessing head
feature_prep = tf.keras.layers.concatenate(all_encoded)
titanic_preprocessor = tf.keras.Model(inputs, feature_prep)
tf.keras.utils.plot_model(titanic_preprocessor,
show_shapes=True,
rankdir='LR')
Create and train a model
# TensorFlow Sequential model (body of the model)
fully_connected = tf.keras.Sequential([
    tf.keras.layers.Dense(32, 'relu'),
    tf.keras.layers.Dense(16, 'relu'),
    tf.keras.layers.Dense(1)
])

# utility function that builds and compiles the model
def build_tf_model(inputs, preprocessor, sequential):
    prep = preprocessor(inputs)
    results = sequential(prep)
    model = tf.keras.Model(inputs, results)
    model.compile(
        loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
        optimizer='adam',
        metrics=['accuracy'])
    return model

# tensorflow model
tf_model = build_tf_model(inputs,
                          titanic_preprocessor,
                          fully_connected)
Keras models don’t automatically convert pandas DataFrames because it’s not clear whether the DataFrame should be converted to one tensor or to a dictionary of tensors. So convert it to a dictionary of tensors:
import numpy as np

# TRAINING DATA
training_data = {}
for k, v in feature_types.items():
    na_impute = 'missing' if v == 'object' else -1
    vals = titanic_df[k].fillna(na_impute)
    training_data[k] = np.array(vals)

training_label = titanic_df['Survived']
Train the TensorFlow model
history = tf_model.fit(training_data, training_label, epochs=10)
Plot the model performance
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(history.history['accuracy'],color='green')
plt.title("MODEL ACCURACY")
plt.xlabel('Epochs')
plt.subplot(1,2,2)
plt.plot(history.history['loss'],color='salmon')
plt.title("MODEL LOSS")
plt.xlabel('Epochs')
plt.show()
You can see that accuracy reaches about 80%, while the loss bottoms out at about 0.4016 after ten epochs.
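Because the final Dense(1) layer outputs logits (we compiled with from_logits=True), you need a sigmoid to turn the raw outputs into survival probabilities. A quick sanity check on the training data itself might look like:
logits = tf_model.predict(training_data)
probs = tf.sigmoid(logits)                        # logits -> survival probabilities
predicted = (probs.numpy() > 0.5).astype(int)     # threshold at 0.5 for the class label
print(predicted[:10].ravel())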
Key point: You will typically see the best results from deep learning with much larger and more complex datasets. When working with a small dataset like this one, it’s recommended to use a decision tree or random forest as a strong baseline. The goal of this tutorial is not to train an accurate model, but to demonstrate the mechanics of working with structured data, so you have code to use as a starting point when working with your own datasets in the future.