Logistic Regression


> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns

The Data

The Titanic data set from Kaggle.

https://www.kaggle.com/c/titanic

> train = pd.read_csv('titanic_train.csv')
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
> train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Exploratory Data Analysis


Missing Data

> plt.figure(figsize=(8,6))
+ sns.heatmap(train.isnull(),
+             yticklabels=False,cbar=False,
+             cmap='viridis');
+ plt.tight_layout()
+ plt.show()

Roughly 20 percent of the Age data is missing (shown in yellow). That proportion is probably small enough to fill in with some reasonable form of imputation. The Cabin column, on the other hand, is missing far too much data to be useful at a basic level.
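
To put numbers on this, we can compute the share of missing values per column directly (a quick sketch using isnull().mean()):

> # fraction of NaN per column; from train.info() this is roughly
+ # Cabin ~0.77, Age ~0.20, Embarked ~0.002, everything else 0
+ train.isnull().mean().sort_values(ascending=False).head()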

Visualizations

> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',
+               data=train,palette='RdBu_r');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',hue='Sex',
+               data=train,palette='RdBu_r');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',hue='Pclass',
+               data=train,palette='rainbow');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.distplot(train['Age'].dropna(),
+              kde=False,color='darkred',bins=30);
+ plt.show()

> #siblings or spouses
+ plt.figure(figsize=(8,6))
+ sns.countplot(x='SibSp',data=train, 
+               palette='pastel');
+ plt.show()

> train['Fare'].hist(color='orange',
+               bins=40,figsize=(8,6));
+ plt.show()

> import plotly.express as px
> fig = px.histogram(train,x="Fare", nbins=40,
+   color_discrete_sequence=['orange'])
+ fig.write_html("T_hist.html")
> htmltools::includeHTML('T_hist.html')

Data Cleaning

We want to fill in the missing Age values instead of simply dropping those rows. One basic approach is imputation: fill every missing Age with the mean or median age of all the passengers. We can do better, though, by first checking how Age varies with passenger class.
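
For reference, the single-value version would look like the minimal sketch below; it is not used here, since it ignores any relationship between Age and the other columns:

> # simplest imputation: one overall median for every missing Age (not applied to train)
+ overall = train['Age'].fillna(train['Age'].median())
+ overall.isnull().sum()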

> # age by passenger class
+ plt.figure(figsize=(10,6))
+ sns.boxplot(x='Pclass',y='Age',
+             data=train,palette='winter');
+ plt.show()

> train[train['Pclass']== 3]['Age'].median()
24.0
> train[train['Pclass']== 2]['Age'].median()
29.0
> train[train['Pclass']== 1]['Age'].median()
37.0
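
The same per-class medians can also be computed in a single pass with a groupby, as a quick check of the three calls above:

> # median Age by passenger class: roughly 37, 29 and 24 for classes 1, 2 and 3
+ train.groupby('Pclass')['Age'].median()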

We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use these per-class median ages to impute the missing Age values based on Pclass.

> def impute_age(cols):
+     """Fill a missing Age with the median age of the passenger's Pclass."""
+     Age = cols['Age']
+     Pclass = cols['Pclass']
+ 
+     if pd.isnull(Age):
+         if Pclass == 1:
+             return 37
+         elif Pclass == 2:
+             return 29
+         else:
+             return 24
+     else:
+         return Age

Now apply that function.

> train['Age'] = train[['Age','Pclass']].apply(
+                             impute_age,axis=1)
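
As an aside, the same result can be obtained without a custom function by filling Age with the per-class median via groupby/transform. This alternative sketch is equivalent in effect to impute_age (and a no-op here, since Age has already been filled above):

> # vectorized alternative: fill missing Age with that row's Pclass median
+ train['Age'] = train['Age'].fillna(
+     train.groupby('Pclass')['Age'].transform('median'))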

Now let’s check the heat map again.

> sns.heatmap(train.isnull(),yticklabels=False,
+             cbar=False,cmap='viridis');
+ plt.show()

We can drop the Cabin column entirely and the two rows where Embarked is NaN.

> train.drop('Cabin',axis=1,inplace=True)
train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 S
6 0 3 Moran, Mr. James male 24 0 0 330877 8.4583 Q
> train.dropna(inplace=True)

Categorical Features

You need to convert categorical features to dummy variables using pandas. Otherwise the machine learning algorithm won’t be able to directly take in those features as inputs.

> train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    889 non-null int64
Survived       889 non-null int64
Pclass         889 non-null int64
Name           889 non-null object
Sex            889 non-null object
Age            889 non-null float64
SibSp          889 non-null int64
Parch          889 non-null int64
Ticket         889 non-null object
Fare           889 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB
> pd.get_dummies(train['Sex']).head()
   female  male
0       0     1
1       1     0
2       1     0
3       1     0
4       0     1
> pd.get_dummies(train['Embarked']).head()
   C  Q  S
0  0  0  1
1  1  0  0
2  0  0  1
3  0  0  1
4  0  0  1

You need to drop the first dummy column: its value can always be predicted from the remaining dummy columns, so keeping it would introduce perfect multicollinearity.

> sex = pd.get_dummies(train['Sex'],
+                      drop_first=True)
+ embark = pd.get_dummies(train['Embarked'],
+                         drop_first=True)
> # drop the columns we won't use
+ train.drop(['Sex','Embarked','Name','Ticket','PassengerId'],
+            axis=1,inplace=True)
> # add in dummy variables
+ train = pd.concat([train,sex,embark],
+                   axis=1)
train.head()
Survived Pclass Age SibSp Parch Fare male Q S
0 0 3 22 1 0 7.2500 1 0 1
1 1 1 38 1 0 71.2833 0 0 0
2 1 3 26 0 0 7.9250 0 0 1
3 1 1 35 1 0 53.1000 0 0 1
4 0 3 35 0 0 8.0500 1 0 1
5 0 3 24 0 0 8.4583 1 1 0

Building the Model


Train Test Split

> from sklearn.model_selection import train_test_split
> X_train, X_test, y_train, y_test = train_test_split(
+   train.drop('Survived',axis=1), 
+   train['Survived'], test_size=0.30, 
+   random_state=101)

Training and Predicting

> from sklearn.linear_model import LogisticRegression
> logmodel = LogisticRegression(solver='liblinear')
+ logmodel.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
> predictions = logmodel.predict(X_test)

Evaluation

You can check precision, recall, and F1-score using a classification report.

> from sklearn.metrics import classification_report
> print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.80      0.91      0.85       163
           1       0.82      0.65      0.73       104

    accuracy                           0.81       267
   macro avg       0.81      0.78      0.79       267
weighted avg       0.81      0.81      0.80       267
> from sklearn.metrics import confusion_matrix
> confusion_matrix(y_test,predictions)
array([[148,  15],
       [ 36,  68]], dtype=int64)
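
As a sanity check, the accuracy in the report can be recovered directly from the confusion matrix: the correct predictions sit on the diagonal, so (148 + 68) / 267 is roughly 0.81.

> # diagonal (correct predictions) divided by total predictions
+ tn, fp, fn, tp = confusion_matrix(y_test,predictions).ravel()
+ (tn + tp) / (tn + fp + fn + tp)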

Logistic Regression Example


This is an artificial (fake) advertising data set indicating whether or not a particular internet user clicked on an advertisement on a company website. We will try to build a model that predicts whether or not a user will click on an ad based on the features of that user.

This data set contains the following features:

  • Daily Time Spent on Site: consumer time on site in minutes
  • Age: customer age in years
  • Area Income: Avg. Income of geographical area of consumer
  • Daily Internet Usage: Avg. minutes a day consumer is on the internet
  • Ad Topic Line: Headline of the advertisement
  • City: City of consumer
  • Male: Whether or not consumer was male
  • Country: Country of consumer
  • Timestamp: Time at which consumer clicked on Ad or closed window
  • Clicked on Ad: 0 or 1 indicated clicking on Ad

The Data

> ad_data = pd.read_csv('advertising.csv')
ad_data.head()
Daily Time Spent on Site Age Area Income Daily Internet Usage Ad Topic Line City Male Country Timestamp Clicked on Ad
68.95 35 61833.90 256.09 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia 2016-03-27 00:53:11 0
80.23 31 68441.85 193.77 Monitored national standardization West Jodi 1 Nauru 2016-04-04 01:39:02 0
69.47 26 59785.94 236.50 Organic bottom-line service-desk Davidton 0 San Marino 2016-03-13 20:35:42 0
74.15 29 54806.18 245.89 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy 2016-01-10 02:31:19 0
68.37 35 73889.99 225.58 Robust logistical utilization South Manuel 0 Iceland 2016-06-03 03:36:18 0
59.99 23 59761.56 226.74 Sharable client-driven software Jamieberg 1 Norway 2016-05-19 14:30:17 0
> ad_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site    1000 non-null float64
Age                         1000 non-null int64
Area Income                 1000 non-null float64
Daily Internet Usage        1000 non-null float64
Ad Topic Line               1000 non-null object
City                        1000 non-null object
Male                        1000 non-null int64
Country                     1000 non-null object
Timestamp                   1000 non-null object
Clicked on Ad               1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
> adinfo = ad_data.describe()
ad_data.describe()
Daily Time Spent on Site Age Area Income Daily Internet Usage Male Clicked on Ad
count 1000.00000 1000.000000 1000.00 1000.00000 1000.0000000 1000.0000000
mean 65.00020 36.009000 55000.00 180.00010 0.4810000 0.5000000
std 15.85361 8.785562 13414.63 43.90234 0.4998889 0.5002502
min 32.60000 19.000000 13996.50 104.78000 0.0000000 0.0000000
25% 51.36000 29.000000 47031.80 138.83000 0.0000000 0.0000000
50% 68.21500 35.000000 57012.30 183.13000 0.0000000 0.5000000

Exploratory Data Analysis

A histogram of the Age column.

> sns.set_style('whitegrid')
+ ad_data['Age'].hist(bins=30,figsize=(8,6));
+ plt.xlabel('Age');
+ plt.show()

Jointplot showing Area Income versus Age.

> sns.jointplot(x='Age',y='Area Income',data=ad_data);
+ plt.show()

Jointplot showing the KDE distribution of Daily Time Spent on Site vs. Age.

> sns.jointplot(x='Age',y='Daily Time Spent on Site',
+               data=ad_data,color='red',kind='kde');
+ plt.show()

Jointplot of Daily Time Spent on Site vs. Daily Internet Usage.

> sns.jointplot(x='Daily Time Spent on Site',
+               y='Daily Internet Usage',
+               data=ad_data,color='green');
+ plt.show()

Pairplot with the hue defined by the Clicked on Ad column.

> sns.pairplot(ad_data,hue='Clicked on Ad',palette='bwr');
+ plt.show()

Logistic Regression

Split the data into a training set and a testing set using train_test_split.

> from sklearn.model_selection import train_test_split
> X = ad_data[['Daily Time Spent on Site', 'Age', 
+              'Area Income','Daily Internet Usage', 
+              'Male']]
+ y = ad_data['Clicked on Ad']
> X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.33, random_state=42)

Train and fit a logistic regression model on the training set.

> from sklearn.linear_model import LogisticRegression
> logmodel = LogisticRegression(solver='liblinear')
+ logmodel.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Predictions and Evaluations

Now predict values for the testing data.

> predictions = logmodel.predict(X_test)

Create a classification report for the model.

> from sklearn.metrics import classification_report
+ print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.87      0.96      0.91       162
           1       0.96      0.86      0.91       168

    accuracy                           0.91       330
   macro avg       0.91      0.91      0.91       330
weighted avg       0.91      0.91      0.91       330