Logistic Regression

Subhalaxmi Rout

12/20/2020


Logistic Regression

Logistic regression is a predictive algorithm that, like linear regression, uses independent variables to predict a dependent variable; the difference is that the dependent variable must be categorical.

What is difference between linear and logistic regression?

Linear and logistic regression are the most basic and most commonly used forms of regression. The essential difference between the two is that logistic regression is used when the dependent variable is binary, whereas linear regression is used when the dependent variable is continuous and the regression line is assumed to be linear.
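In R, the distinction shows up in the fitting function. A minimal sketch on made-up data (the data frame df and its columns are invented for illustration):

set.seed(1)
df <- data.frame(x = rnorm(100))
df$y      <- 2 * df$x + rnorm(100)         # continuous outcome
df$yes_no <- rbinom(100, 1, plogis(df$x))  # binary outcome

fit_lm  <- lm(y ~ x, data = df)                          # linear regression
fit_glm <- glm(yes_no ~ x, data = df, family = binomial) # logistic regression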

Types of logistic regression

There are three types of logistic regression, depending on the nature of the categorical response variable (each maps to a different R fitting function, as sketched after this list):

  • Binary Logistic Regression : Used when the response is binary (i.e., it has two possible outcomes)
  • Nominal Logistic Regression : Used when there are three or more categories with no natural ordering to the levels
  • Ordinal Logistic Regression : Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels does not necessarily mean the intervals between them are equal
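A hedged sketch of the three fitting functions on a small synthetic data frame (all names here are invented; nnet and MASS ship with standard R distributions):

set.seed(2)
dat <- data.frame(x = rnorm(150))
dat$bin <- rbinom(150, 1, plogis(dat$x))               # two outcomes
dat$nom <- factor(sample(c("a", "b", "c"), 150, TRUE)) # three unordered levels
dat$ord <- cut(dat$x, 3, labels = c("low", "mid", "high"),
               ordered_result = TRUE)                  # three ordered levels

glm(bin ~ x, family = binomial, data = dat) # binary logistic regression
nnet::multinom(nom ~ x, data = dat)         # nominal (multinomial) logistic regression
MASS::polr(ord ~ x, data = dat)             # ordinal (proportional-odds) logistic regression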

Here we will discuss binary logistic regression.

The multiple binary logistic regression model is the following:

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\]

In the above equation, p = the probability that an observation is in the specified category of the binary Y variable (always between 0 and 1).

X = the explanatory variables, which can be discrete, continuous, or a combination.
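Because the model is linear in the log-odds, a fitted value must be passed through the inverse logit to become a probability. A quick worked example (the coefficients are made up for illustration):

log_odds <- -1 + 0.8 * 2            # made-up model: log-odds = -1 + 0.8x, at x = 2
exp(log_odds) / (1 + exp(log_odds)) # inverse logit by hand
## [1] 0.6456563
plogis(log_odds)                    # R's built-in inverse logit agrees
## [1] 0.6456563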

Performance of Logistic Regression Model

To evaluate the performance of a logistic regression model, we should consider a few metrics:

  • AIC: measures model fit while penalizing the number of model coefficients; always prefer the model with the minimum AIC value
  • Confusion Matrix: a tabular representation of actual vs. predicted values; it helps to find the accuracy of the model
  • ROC curve: summarizes the model’s performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity); a short sketch of all three metrics follows this list
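A hedged sketch of these metrics in R, on synthetic data (the pROC package is assumed here; it is not used elsewhere in this post):

set.seed(3)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(d$x))

fit   <- glm(y ~ x, family = binomial, data = d)
probs <- predict(fit, type = "response")

AIC(fit)                                    # lower AIC = better fit/complexity trade-off
table(Predicted = as.integer(probs >= 0.5), # confusion matrix at a 0.5 cutoff
      Actual = d$y)
pROC::auc(pROC::roc(d$y, probs))            # area under the ROC curve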

Using the Titanic dataset, we will perform binary logistic regression. Download the data from Kaggle using the links below.

Titanic test dataset

Titanic train dataset

Feature description

The goal of this analysis is to predict whether a passenger survived the sinking of the Titanic.

Data load

The training data has 891 rows and 12 columns; the test data has 418 rows and 11 columns. The ‘Survived’ column is not present in the test data, so we use the fitted model to predict survival for the test passengers.

library(ggplot2)
library(kableExtra)
library(visdat)
library(dplyr)
library(caret)
library(e1071)
train <- read.csv("/Users/subhalaxmirout/DATA 621/train.csv")
test <- read.csv("/Users/subhalaxmirout/DATA 621/test.csv")

str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : chr  "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
##  $ Sex        : chr  "male" "female" "male" "male" ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : chr  "330911" "363272" "240276" "315154" ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : chr  "" "" "" "" ...
##  $ Embarked   : chr  "Q" "S" "Q" "S" ...
summary(train)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##                                                                     
##      Sex                 Age            SibSp           Parch       
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816  
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##                     NA's   :177                                     
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33                                        
## 
visdat::vis_miss(train)

visdat::vis_miss(test)

There are many missing/NA values present in the dataset, so let’s clean the data: replace missing Age values with the median age (and, in the test set, the missing Fare values with the median fare).

train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)
visdat::vis_miss(train)

test$Age[is.na(test$Age)] <- median(test$Age, na.rm = TRUE)
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)
visdat::vis_miss(test)

Let’s look at the distributions of Age and Fare.

hist(train$Age, col = 'steelblue', main = "Distribution of Age", xlab = "Age")

hist(train$Fare, col = 'steelblue', main = "Distribution of Fare", xlab = "Fare")
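Fare is heavily right-skewed (a few very large fares). As an optional sketch, viewing it on a log scale makes the bulk of the distribution easier to see; log1p is used because some fares are exactly 0:

hist(log1p(train$Fare), col = 'steelblue',
     main = "Distribution of log(1 + Fare)", xlab = "log(1 + Fare)")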

Features assumed to be insignificant for predicting survival, such as ‘PassengerId’, ‘Name’, ‘Ticket’, ‘Cabin’, and ‘Embarked’, will be excluded. The features left for analysis are: Pclass, Sex, Age, SibSp, Parch, Fare, and Survived.

train <- train %>% dplyr::select(-PassengerId, -Name, -Ticket, -Embarked, -Cabin)
test <- test %>% dplyr::select(-PassengerId, -Name, -Ticket, -Embarked, -Cabin)

Convert the Sex column to numeric: 2 if female, else 1.

train$Sex <- ifelse(train$Sex == "female", 2, 1)
test$Sex <- ifelse(test$Sex == "female", 2, 1)

Build Model

Apply binomial logistic regression.

set.seed(12)
model <- glm(Survived ~ ., family = binomial, data = train)
summary(model)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7091  -0.6000  -0.4237   0.6221   2.4039  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.580705   0.505500  -1.149  0.25065    
## Pclass      -1.087436   0.139402  -7.801 6.15e-15 ***
## Sex          2.760875   0.198793  13.888  < 2e-16 ***
## Age         -0.039398   0.007801  -5.051 4.40e-07 ***
## SibSp       -0.348785   0.108990  -3.200  0.00137 ** 
## Parch       -0.106709   0.117230  -0.910  0.36269    
## Fare         0.002846   0.002359   1.207  0.22759    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  789.18  on 884  degrees of freedom
## AIC: 803.18
## 
## Number of Fisher Scoring iterations: 5
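The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. For example, exp(2.76) ≈ 15.8 means the odds of survival for females (Sex = 2) are roughly 15.8 times those for males, holding the other predictors fixed:

round(exp(coef(model)), 3) # odds ratios for the fitted model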

Create another model, excluding Parch and Fare due to their high p-values.

set.seed(14)
model2 <- glm(Survived ~ . - Parch - Fare, family = binomial, data = train)
summary(model2)
## 
## Call:
## glm(formula = Survived ~ . - Parch - Fare, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6817  -0.6029  -0.4159   0.6161   2.4327  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.301929   0.454477  -0.664 0.506471    
## Pclass      -1.175654   0.120073  -9.791  < 2e-16 ***
## Sex          2.739477   0.193984  14.122  < 2e-16 ***
## Age         -0.039553   0.007761  -5.096 3.47e-07 ***
## SibSp       -0.354433   0.103392  -3.428 0.000608 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  791.23  on 886  degrees of freedom
## AIC: 801.23
## 
## Number of Fisher Scoring iterations: 5
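Since model2 is nested within model, dropping Parch and Fare can also be checked with a likelihood-ratio test (a sketch using the two fits above; a large p-value supports the smaller model):

anova(model2, model, test = "Chisq") # chi-squared test on the deviance difference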

Training-set prediction and accuracy

train$prediction <- predict(model2, newdata = train, type = "response")
train$prediction <- ifelse(train$prediction >= 0.6, 1, 0)
table(factor(train$prediction), factor(train$Survived))
##    
##       0   1
##   0 504 123
##   1  45 219

Accuracy = (TP+TN)/(TP+TN+FP+FN)

  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative

Accuracy <- round((504 + 219) / (504 + 219 + 45 + 123) * 100, 2)
Accuracy
## [1] 81.14
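The same numbers can be cross-checked with caret’s confusionMatrix(), already loaded above, which also reports sensitivity, specificity, and related metrics (positive = "1" treats survival as the positive class):

caret::confusionMatrix(factor(train$prediction),
                       factor(train$Survived),
                       positive = "1")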

Test model prediction

d <- predict(model2, newdata = test, type = "response")
test$Survived <- ifelse(d >= 0.6, 1, 0)
kable(test[1:20,]) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
Pclass  Sex  Age   SibSp  Parch  Fare     Survived
3       1    34.5  0      0       7.8292  0
3       2    47.0  1      0       7.0000  0
2       1    62.0  0      0       9.6875  0
3       1    27.0  0      0       8.6625  0
3       2    22.0  1      1      12.2875  0
3       1    14.0  0      0       9.2250  0
3       2    30.0  0      0       7.6292  0
2       1    26.0  1      1      29.0000  0
3       2    18.0  0      0       7.2292  1
3       1    21.0  2      0      24.1500  0
3       1    27.0  0      0       7.8958  0
1       1    46.0  0      0      26.0000  0
1       2    23.0  1      0      82.2667  1
2       1    63.0  1      0      26.0000  0
1       2    47.0  1      0      61.1750  1
2       2    24.0  1      0      27.7208  1
2       1    35.0  0      0      12.3500  0
3       1    21.0  0      0       7.2250  0
3       2    27.0  1      0       7.9250  0
3       2    45.0  0      0       7.2250  0
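As an optional final step, the test predictions could be written out in Kaggle’s submission format. PassengerId was dropped from test earlier, so it is re-read here from the raw CSV (same path as above; the output file name is hypothetical):

ids <- read.csv("/Users/subhalaxmirout/DATA 621/test.csv")$PassengerId
submission <- data.frame(PassengerId = ids, Survived = test$Survived)
write.csv(submission, "titanic_submission.csv", row.names = FALSE)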

From this logistic regression analysis, we learned:

  • Types of logistic regression
  • Binomial logistic regression
  • Logistic regression analysis using R