Logistic Regression Quick Start Using sklearn

Ajay krishnan
Oct 8, 2021


Logistic Regression, despite its name, is a classification model rather than a regression model. It is a simple and efficient method for binary, linearly separable classification problems.

Logistic Regression is very easy to implement and achieves very good performance with linearly separable classes. It is one of the most widely used classification algorithms in industry.

Logistic Regression vs Linear Regression

Linear Regression is used to predict output on a continuous spectrum, for example predicting salary based on years of experience, or predicting fuel economy based on the horsepower of a vehicle. Linear Regression output is not bounded; it can range from negative infinity to positive infinity.

Logistic Regression is the go-to method for binary classification problems (problems with two class values). Logistic Regression outputs a probability.

Logistic Regression is borrowed from statistics: the logistic model is used to model the probability of a certain class or event, such as pass/fail, win/lose, alive/dead, healthy/sick, yes/no, or true/false.
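As a quick illustration (not part of the original post; the numbers below are made-up toy values), here is a minimal sketch contrasting the two models: Linear Regression produces an unbounded continuous output, while Logistic Regression's predict_proba returns a probability between 0 and 1.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data: one feature, e.g. "years of experience"-style numbers
X = np.array([[1], [2], [3], [4], [5]])
y_continuous = np.array([30, 42, 55, 61, 78])   # e.g. salary in thousands
y_binary = np.array([0, 0, 0, 1, 1])            # e.g. clicked / not clicked

# Linear Regression: unbounded continuous output
lin = LinearRegression().fit(X, y_continuous)
print(lin.predict([[10]]))        # can be any real number

# Logistic Regression: output is a probability between 0 and 1
log = LogisticRegression().fit(X, y_binary)
print(log.predict_proba([[10]]))  # [[P(class 0), P(class 1)]]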

Linear Regression

Dependent variable = Y-intercept + slope * Independent variable

Multiple Linear Regression

Let the dependent variable be y, the independent variables be Xi, and the coefficients be Mi:
y = Y-intercept + M1*X1 + M2*X2 + M3*X3 + … + Mn*Xn

Logistic Regression

Let the dependent variable be y and the independent variables be Xi:
y = Y-intercept + M1*X1 + M2*X2 + M3*X3 + … + Mn*Xn
Apply the sigmoid function to get a probability:
output = 1 / (1 + e^(-y))

Sigmoid Function — It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
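A minimal NumPy sketch (not from the original post) showing how the sigmoid squashes any real-valued linear output into the open interval (0, 1):

import numpy as np

def sigmoid(z):
    # Map any real-valued number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# The linear part y = Y-intercept + M1*X1 + ... can be any real number;
# the sigmoid turns it into something we can read as a probability.
print(sigmoid(-5), sigmoid(0), sigmoid(5))  # ~0.0067, 0.5, ~0.9933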

Logistic Regression using scikit-learn

One of the most amazing things about the scikit-learn library is that it has a simple four-step modeling pattern (import, instantiate, fit, predict) that makes it easy to code a machine learning classifier. While this quick-start tutorial uses Logistic Regression, the coding process applies to other classifiers in sklearn too.
Import the classifier: sklearn.linear_model.LogisticRegression
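A minimal sketch of the imports used throughout this tutorial (the exact import list in the original notebook may differ slightly):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# plot_confusion_matrix is available in scikit-learn 0.22 through 1.1;
# it was removed in 1.2 in favour of ConfusionMatrixDisplay.from_estimator.
from sklearn.metrics import plot_confusion_matrix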

Dataset: we are using a toy dataset to quick-start logistic regression.
Features / columns in the dataset:
‘Names’, ‘emails’, ‘Country’, ‘Time Spent on Site’, ‘Salary’
Target label: ‘Clicked’

Load the data into a pandas DataFrame.

Let's explore the columns:
* check the mean and standard deviation
* check for null values
etc…
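A minimal sketch of this step; the CSV filename below is a placeholder assumption, not the original file name (use the file from the linked notebook):

# Load the toy dataset into a pandas DataFrame
df = pd.read_csv('clicked_ads.csv')  # placeholder filename

# Quick look at the first rows
print(df.head())

# Mean, standard deviation, min/max of the numeric columns
print(df.describe())

# Check for null values in each column
print(df.isnull().sum())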

Visualize the dataset

Let's use Seaborn, a data visualization library built on top of Matplotlib, to visualize the dataset:

sns.scatterplot(x='Time Spent on Site',
                y='Salary',
                hue='Clicked',
                data=df)
Scatter plot
sns.boxplot(x = 'Clicked', y = 'Time Spent on Site', data = df);
box plot
sns.boxplot(x = 'Clicked', y = 'Salary', data = df);
box plot

Extracting features and labels from the data-set
For this tutorial, let's use the columns [‘Time Spent on Site’, ‘Salary’] to predict [‘Clicked’], a binary classification problem.

# Keep only the numeric feature columns; drop identifiers and the target
X = df.drop(columns=['Names', 'emails', 'Country', 'Clicked'])
y = df['Clicked']
print('Shape of Independent variable', X.shape)
print('Shape of Dependent variable', y.shape)
# Split into training and test sets (default 75% train / 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print('Shape of Training data X', X_train.shape)
print('Shape of Training data y', y_train.shape)
print('Shape of Testing data X', X_test.shape)
print('Shape of Testing data y', y_test.shape)
Output

Modeling and making predictions

# model instantiation
log_reg = LogisticRegression()
# training
log_reg.fit(X_train, y_train)
# making predictions on test set
y_preds = log_reg.predict(X_test)

Model Evaluation

print(classification_report(y_test, y_preds))
Classification report

Model performance is not satisfactory.
Can we improve the model?

plot_confusion_matrix(log_reg, X_test, y_test, cmap=plt.cm.Blues);
Confusion matrix

Model Improvement

df.describe()

It's good practice to normalize features that have different scales and ranges.
This is important because the features are multiplied by the model weights, so the scale of the inputs affects the scale of the output and the scale of the gradient.

Let's use StandardScaler to scale the features to zero mean and unit variance.

  1. Data standardization is the process of re-scaling the attributes so that they have a mean of 0 and a variance of 1 (zero mean and unit variance).
  2. The ultimate goal of standardization is to bring all the features to a common scale without distorting the differences in the ranges of the values.

In sklearn.preprocessing.StandardScaler(), centering and scaling happen independently on each feature.

scaled_feature = (feature - feature_mean) / feature_standard_deviation
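A minimal sketch (with made-up numbers, not from the original post) showing that StandardScaler applies exactly this formula, independently per column:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[10.0, 2000.0],
                 [20.0, 3000.0],
                 [30.0, 4000.0]])

# Manual standardization: (x - mean) / std, computed per column
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# StandardScaler does the same thing
scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, scaled))  # True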

It's not strictly necessary to use a ColumnTransformer here, but in this tutorial we are using one to keep things neat.

features = ['Time Spent on Site', 'Salary']
scaler = StandardScaler()
transformer = ColumnTransformer(
    transformers=[('scale', scaler, features)]
)
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

fit_transform() and transform() are used while scaling or standardizing our training and test data.

When transforming X_train, use fit_transform(); when transforming X_test, use transform().

Why do we use fit_transform() on the training data but transform() on the test data?

Data leakage is a big problem when developing predictive machine learning models.
Data leakage is when information from outside the training dataset is used to create the model.

When we call fit_transform(), the fit step calculates the mean and variance of each feature in our training data, and the transform step standardizes all the features using those statistics.
Calling transform() on the test data then reuses the same mean and variance that were calculated from the training data.
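A minimal sketch (illustrative toy values only) of why this matters: the scaler's statistics come from the training split and are reused, unchanged, on the test split:

from sklearn.preprocessing import StandardScaler
import numpy as np

X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[10.0]])

scaler = StandardScaler()
scaler.fit(X_train_toy)              # learns mean_ = 2.0 and scale_ from the training data only
print(scaler.mean_, scaler.scale_)

# The test point is standardized with the *training* mean and std,
# so no information from the test set leaks into the preprocessing.
print(scaler.transform(X_test_toy))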

We want to keep the training set and the test set separate. After a model has been trained on the training set, we test it by making predictions on the test set. Because the test set already contains known values for the attribute we want to predict, it is easy to determine whether the model's guesses are correct.
We must estimate the performance of the model on unseen data.

Data leakage, where information from outside the training dataset is used to create the model, can allow the model to learn or know something it otherwise would not, and in turn invalidates the estimated performance of the model being constructed.

Modeling and making predictions

# model instance
model = LogisticRegression()
# training using scaled data
model.fit(X_train_transformed, y_train)
# making prediction
predictions = model.predict(X_test_transformed)
# evaluation
print(classification_report(y_test, predictions))

Model Evaluation

Classification Report

Model performance is satisfactory.

plot_confusion_matrix(model, X_test_transformed, y_test);
Confusion matrix

Evaluation metrics for classification problems

Classification Accuracy = (TP+TN) / (TP+TN+FP+FN)
Precision = TP / (TP+FP)
Recall = TP/ (TP+FN)

Precision: when the model predicts the positive class, how often is it correct?
Recall: when the class is actually positive, how often does the model predict it correctly?

conf_mat = confusion_matrix(y_test, predictions)
# sklearn's binary confusion matrix is laid out as [[TN, FP], [FN, TP]]
[[TN, FP], [FN, TP]] = conf_mat
# ACCURACY
accuracy = (TP+TN) / (TP+TN+FP+FN)
print('Model Accuracy', accuracy)
# PRECISION
precision_1 = TP / (TP + FP)
precision_0 = TN / (TN + FN)
print('Model precision predicting class 1 :',precision_1)
print('Model precision predicting class 0 :',precision_0)
# RECALL
recall_1 = TP / (TP + FN)
recall_0 = TN / (TN + FP)
print('Recall predicting class 1 :',recall_1)
print('Recall predicting class 0 :',recall_0)
Model evaluation
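As a sanity check (not in the original post), the same numbers can be obtained from sklearn's built-in metric functions:

from sklearn.metrics import accuracy_score, precision_score, recall_score

print('Accuracy :', accuracy_score(y_test, predictions))
print('Precision (class 1):', precision_score(y_test, predictions, pos_label=1))
print('Precision (class 0):', precision_score(y_test, predictions, pos_label=0))
print('Recall (class 1):', recall_score(y_test, predictions, pos_label=1))
print('Recall (class 0):', recall_score(y_test, predictions, pos_label=0))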

Prediction Visualization

# get a collection of points covering the test data
# use data from column 0 (Time Spent on Site)
x = np.arange(start=X_test_transformed[:, 0].min() - 1,
              stop=X_test_transformed[:, 0].max() + 1,
              step=0.01)
# use data from column 1 (Salary)
y = np.arange(start=X_test_transformed[:, 1].min() - 1,
              stop=X_test_transformed[:, 1].max() + 1,
              step=0.01)
# Create a meshgrid (X-axis and Y-axis)
xx, yy = np.meshgrid(x, y)
# Put the grid points into a pandas DataFrame with the feature names
data_points = pd.DataFrame(data={'Time Spent on Site': xx.ravel(),
                                 'Salary': yy.ravel()})
# Make predictions for every grid point
preds = model.predict(data_points)
# Z-axis: predicted class for each grid point
zz = preds.reshape(xx.shape)
# Use the axes (xx, yy, zz) to draw the decision regions
plt.contourf(xx, yy, zz)
# Scatter plot of the test data on top of the decision regions
plt.scatter(X_test_transformed[y_test == 0, 0],
            X_test_transformed[y_test == 0, 1],
            color='steelblue', label='0')
plt.scatter(X_test_transformed[y_test == 1, 0],
            X_test_transformed[y_test == 1, 1],
            color='salmon', label='1')
plt.legend()
plt.show()
Decision surface

The complete Jupyter notebook is available via the GitHub link.
