Logistic Regression


> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns

The Data

The Titanic data set from Kaggle.

https://www.kaggle.com/c/titanic

> train = pd.read_csv('titanic_train.csv')
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
> train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Exploratory Data Analysis


Missing Data

> plt.figure(figsize=(8,6))
+ sns.heatmap(train.isnull(),
+             yticklabels=False,cbar=False,
+             cmap='viridis');
+ plt.tight_layout()
+ plt.show()

Roughly 20 percent of the Age data is missing (shown in yellow). That proportion is probably small enough to fill in with some reasonable form of imputation. The Cabin column, on the other hand, is missing far too much data to be useful at a basic level.
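
To put numbers on this, we can compute the share of missing values per column directly (a quick sketch using isnull().mean()):

> # fraction of NaN per column; from train.info() this is roughly
+ # Cabin ~0.77, Age ~0.20, Embarked ~0.002, everything else 0
+ train.isnull().mean().sort_values(ascending=False).head()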

Visualizations

> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',
+               data=train,palette='RdBu_r');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',hue='Sex',
+               data=train,palette='RdBu_r');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',hue='Pclass',
+               data=train,palette='rainbow');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.distplot(train['Age'].dropna(),
+              kde=False,color='darkred',bins=30);
+ plt.show()

> #siblings or spouses
+ plt.figure(figsize=(8,6))
+ sns.countplot(x='SibSp',data=train, 
+               palette='pastel');
+ plt.show()

> train['Fare'].hist(color='orange',
+               bins=40,figsize=(8,6));
+ plt.show()

> import plotly.express as px
> fig = px.histogram(train,x="Fare", nbins=40,
+   color_discrete_sequence=['orange'])
+ fig.write_html("T_hist.html")
> htmltools::includeHTML('T_hist.html')

Data Cleaning

We want to fill in the missing Age values instead of simply dropping those rows. One basic approach is imputation: fill every missing Age with the mean or median age of all the passengers. We can do better, though, by first checking how Age varies with passenger class.
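
For reference, the single-value version would look like the minimal sketch below; it is not used here, since it ignores any relationship between Age and the other columns:

> # simplest imputation: one overall median for every missing Age (not applied to train)
+ overall = train['Age'].fillna(train['Age'].median())
+ overall.isnull().sum()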

> # age by passenger class
+ plt.figure(figsize=(10,6))
+ sns.boxplot(x='Pclass',y='Age',
+             data=train,palette='winter');
+ plt.show()

> train[train['Pclass']== 3]['Age'].median()
24.0
> train[train['Pclass']== 2]['Age'].median()
29.0
> train[train['Pclass']== 1]['Age'].median()
37.0
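
The same per-class medians can also be computed in a single pass with a groupby, as a quick check of the three calls above:

> # median Age by passenger class: roughly 37, 29 and 24 for classes 1, 2 and 3
+ train.groupby('Pclass')['Age'].median()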

We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these per-class median ages to impute the missing Age values based on Pclass.

> def impute_age(cols):
+     """Fill a missing Age with the median age of the passenger's Pclass."""
+     Age = cols['Age']
+     Pclass = cols['Pclass']
+ 
+     if pd.isnull(Age):
+         if Pclass == 1:
+             return 37
+         elif Pclass == 2:
+             return 29
+         else:
+             return 24
+     else:
+         return Age

Now apply that function.

> train['Age'] = train[['Age','Pclass']].apply(
+                             impute_age,axis=1)
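
As an aside, the same result can be obtained without a custom function by filling Age with the per-class median via groupby/transform. This alternative sketch is equivalent in effect to impute_age (and a no-op here, since Age has already been filled above):

> # vectorized alternative: fill missing Age with that row's Pclass median
+ train['Age'] = train['Age'].fillna(
+     train.groupby('Pclass')['Age'].transform('median'))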

Now let’s check the heat map again.

> sns.heatmap(train.isnull(),yticklabels=False,
+             cbar=False,cmap='viridis');
+ plt.show()

We can drop the Cabin column entirely and the two rows where Embarked is NaN.

> train.drop('Cabin',axis=1,inplace=True)
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James male 24 0 0 330877 8.4583 Q
> train.dropna(inplace=True)

Categorical Features

You need to convert categorical features to dummy variables using pandas. Otherwise the machine learning algorithm won’t be able to directly take in those features as inputs.

> train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Name           889 non-null object
Sex            889 non-null object
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Ticket         889 non-null object
Fare           889 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB
> pd.get_dummies(train['Sex']).head()
   female  male
0       0     1
1       1     0
2       1     0
3       1     0
4       0     1
> pd.get_dummies(train['Embarked']).head()
   C  Q  S
0  0  0  1
1  1  0  0
2  0  0  1
3  0  0  1
4  0  0  1

You need to drop the first dummy column: its value can always be predicted from the remaining dummy columns, so keeping it would introduce perfect multicollinearity.

> sex = pd.get_dummies(train['Sex'],
+                      drop_first=True)
+ embark = pd.get_dummies(train['Embarked'],
+                         drop_first=True)
> # drop the columns we won't use
+ train.drop(['Sex','Embarked','Name','Ticket','PassengerId'],
+            axis=1,inplace=True)
> # add in dummy variables
+ train = pd.concat([train,sex,embark],
+                   axis=1)
train.head()
Survived Pclass Age SibSp Parch Fare male Q S
0 0 3 22 1 0 7.2500 1 0 1
1 1 1 38 1 0 71.2833 0 0 0
2 1 3 26 0 0 7.9250 0 0 1
3 1 1 35 1 0 53.1000 0 0 1
4 0 3 35 0 0 8.0500 1 0 1
5 0 3 24 0 0 8.4583 1 1 0

Building the Model


Train Test Split

> from sklearn.model_selection import train_test_split
> X_train, X_test, y_train, y_test = train_test_split(
+   train.drop('Survived',axis=1), 
+   train['Survived'], test_size=0.30, 
+   random_state=101)

Training and Predicting

> from sklearn.linear_model import LogisticRegression
> logmodel = LogisticRegression(solver='liblinear')
+ logmodel.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
> predictions = logmodel.predict(X_test)

Evaluation

You can check precision, recall, and F1-score using a classification report.

> from sklearn.metrics import classification_report
> print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.80      0.91      0.85       163
           1       0.82      0.65      0.73       104

    accuracy                           0.81       267
   macro avg       0.81      0.78      0.79       267
weighted avg       0.81      0.81      0.80       267
> from sklearn.metrics import confusion_matrix
> confusion_matrix(y_test,predictions)
array([[148,  15],
       [ 36,  68]], dtype=int64)
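
As a sanity check, the accuracy in the report can be recovered directly from the confusion matrix: the correct predictions sit on the diagonal, so (148 + 68) / 267 is roughly 0.81.

> # diagonal (correct predictions) divided by total predictions
+ tn, fp, fn, tp = confusion_matrix(y_test,predictions).ravel()
+ (tn + tp) / (tn + fp + fn + tp)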

Logistic Regression Example


This is an artificial (fake) advertising data set indicating whether or not a particular internet user clicked on an advertisement on a company website. We will try to build a model that predicts whether or not a user will click on an ad based on the features of that user.

This data set contains the following features:

  • Daily Time Spent on Site: consumer time on site in minutes
  • Age: customer age in years
  • Area Income: Avg. Income of geographical area of consumer
  • Daily Internet Usage: Avg. minutes a day consumer is on the internet
  • Ad Topic Line: Headline of the advertisement
  • City: City of consumer
  • Male: Whether or not consumer was male
  • Country: Country of consumer
  • Timestamp: Time at which consumer clicked on Ad or closed window
  • Clicked on Ad: 0 or 1 indicated clicking on Ad

The Data

> ad_data = pd.read_csv('advertising.csv')
ad_data.head()
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
68.95 35 61833.90 256.09 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia 2016-03-27 00:53:11 0
80.23 31 68441.85 193.77 Monitored national standardization West Jodi 1 Nauru 2016-04-04 01:39:02 0
69.47 26 59785.94 236.50 Organic bottom-line service-desk Davidton 0 San Marino 2016-03-13 20:35:42 0
74.15 29 54806.18 245.89 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy 2016-01-10 02:31:19 0
68.37 35 73889.99 225.58 Robust logistical utilization South Manuel 0 Iceland 2016-06-03 03:36:18 0
59.99 23 59761.56 226.74 Sharable client-driven software Jamieberg 1 Norway 2016-05-19 14:30:17 0
> ad_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
> adinfo = ad_data.describe()
ad_data.describe()
Daily Time Spent on Site Age Area Income Daily Internet Usage Male Clicked on Ad
count 1000.00000 1000.000000 1000.00 1000.00000 1000.0000000 1000.0000000
mean 65.00020 36.009000 55000.00 180.00010 0.4810000 0.5000000
std 15.85361 8.785562 13414.63 43.90234 0.4998889 0.5002502
min 32.60000 19.000000 13996.50 104.78000 0.0000000 0.0000000
25% 51.36000 29.000000 47031.80 138.83000 0.0000000 0.0000000
50% 68.21500 35.000000 57012.30 183.13000 0.0000000 0.5000000

Exploratory Data Analysis

A histogram of the Age column.

> sns.set_style('whitegrid')
+ ad_data['Age'].hist(bins=30,figsize=(8,6));
+ plt.xlabel('Age');
+ plt.show()

Jointplot showing Area Income versus Age.

> sns.jointplot(x='Age',y='Area Income',data=ad_data);
+ plt.show()

Jointplot showing the KDE distribution of Daily Time Spent on Site vs. Age.

> sns.jointplot(x='Age',y='Daily Time Spent on Site',
+               data=ad_data,color='red',kind='kde');
+ plt.show()

Jointplot of Daily Time Spent on Site vs. Daily Internet Usage.

> sns.jointplot(x='Daily Time Spent on Site',
+               y='Daily Internet Usage',
+               data=ad_data,color='green');
+ plt.show()

Pairplot with the hue defined by the Clicked on Ad column.

> sns.pairplot(ad_data,hue='Clicked on Ad',palette='bwr');
+ plt.show()

Logistic Regression

Split the data into a training set and a testing set using train_test_split.

> from sklearn.model_selection import train_test_split
> X = ad_data[['Daily Time Spent on Site', 'Age', 
+              'Area Income','Daily Internet Usage', 
+              'Male']]
+ y = ad_data['Clicked on Ad']
> X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.33, random_state=42)

Train and fit a logistic regression model on the training set.

> from sklearn.linear_model import LogisticRegression
> logmodel = LogisticRegression(solver='liblinear')
+ logmodel.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Predictions and Evaluations

Now predict values for the testing data.

> predictions = logmodel.predict(X_test)

Create a classification report for the model.

> from sklearn.metrics import classification_report
+ print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.87      0.96      0.91       162
           1       0.96      0.86      0.91       168

    accuracy                           0.91       330
   macro avg       0.91      0.91      0.91       330
weighted avg       0.91      0.91      0.91       330