Logistic Regression

Logistic regression is the regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one binary dependent variable and one or more continuous-level (interval or ratio scale) independent variables.

Standard linear regression requires the dependent variable to be of continuous-level (interval or ratio) scale.
Logistic regression assumes that the dependent variable is a stochastic event. For instance, if we analyze a pesticide's kill rate, the outcome for each bug is either killed or alive. Since even the most resistant bug can only be in one of these two states, logistic regression works with the likelihood of the bug being killed: if that likelihood is greater than 0.5 the bug is classified as dead, and if it is less than 0.5 it is classified as alive.

We could instead try to fit a multiple linear regression with a dummy dependent variable (a linear probability model). This approach, however, has two major shortcomings. Firstly, it can lead to predicted probabilities outside of the (0,1) interval, and secondly the residuals will not all have the same variance (they are heteroscedastic).
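
As a quick illustration of the first shortcoming, here is a minimal sketch; it borrows the MASS::birthwt data introduced later in this article, and the covariate values used for prediction are made up.

library(MASS)   # for the birthwt data, also used later in this article
lpm <- lm(low ~ age + lwt, data = birthwt)
# For sufficiently extreme covariate values the fitted line can drop below 0
# (or rise above 1), which is impossible for a probability.
predict(lpm, newdata = data.frame(age = 45, lwt = 250))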

In logistic regression, the input features are combined linearly just as in linear regression; however, the result is then fed as an input to the logistic function. This function applies a nonlinear transformation to its input and ensures that the output, which is interpreted as the probability of the input belonging to class 1, lies in the interval [0,1].

The logistic function is:
p(x) = 1 / (1 + exp(-x))

When x = 0, the logistic function takes the value 0.5. As x tends to +infinity, the exponential in the denominator vanishes and the function approaches the value 1. As x tends to -infinity, the exponential, and hence the denominator, tends to infinity and the function approaches the value 0. Thus the output is guaranteed to lie between 0 and 1, as required for a probability.
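
To make this concrete, here is a small sketch; R's built-in plogis() implements exactly this function.

# Evaluate the logistic function at a few points.
logistic <- function(x) 1 / (1 + exp(-x))
logistic(c(-10, 0, 10))   # approximately 0.0000454, 0.5, 0.9999546
plogis(c(-10, 0, 10))     # built-in equivalent, same values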

Generalized Linear Models

Generalized Linear Models are an extension of the linear model framework that allows the dependent variable to have a non-normal distribution. In general, they possess three characteristics:

  1. These models comprise a linear combination of input features.
  2. The mean of the response variable is related to the linear combination of input features via a link function.
  3. The response variable is assumed to follow a probability distribution from the exponential family, such as the binomial, Poisson, or Gaussian distribution. In practice, the binomial distribution is used when the response variable is binary, the Poisson distribution when the response variable is a count, and the Gaussian distribution when the response variable is continuous (see the sketch below for how these map onto R's glm()).
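
The family argument of R's glm() corresponds directly to these three cases. The sketch below is purely illustrative: the data are simulated placeholders, not part of this article's datasets.

# Simulated placeholder data for the three response types.
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y_bin   <- rbinom(100, 1, plogis(0.5 * df$x))   # binary response
df$y_count <- rpois(100, exp(0.5 * df$x))          # count response
df$y_cont  <- 1 + 0.5 * df$x + rnorm(100)          # continuous response
glm(y_bin   ~ x, data = df, family = binomial)     # logistic regression
glm(y_count ~ x, data = df, family = poisson)      # Poisson regression
glm(y_cont  ~ x, data = df, family = gaussian)     # equivalent to lm()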

Logistic Regression belongs to the family of generalized linear models. It is a binary classification algorithm used when the response variable is dichotomous (1 or 0). By default it returns the probability of the target class; response labels can then be obtained by applying a probability threshold. Logistic Regression makes the following assumptions:

  1. The response variable must follow a binomial distribution.
  2. Logistic Regression assumes a linear relationship between the independent variables and the link function (logit).
  3. The dependent variable should have mutually exclusive and exhaustive categories.

Demystifying Logistic Regression through a Simple Example

Interpretation of:
> p-value,
> coefficient estimates,
> odds ratio,
> logit score, and
> final probability (from logit score)

Load data

library(MASS)
DataFrame <- birthwt
help("birthwt")
## starting httpd help server ... done
head(DataFrame,3)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557

Model fitting & Model Summary

#Build a logistic regression model using only continuous variables as independent variables, i.e., age and lwt.
LogisticModel <- glm(low~age + lwt, data = DataFrame, family = binomial (link = "logit"))
summary(LogisticModel)
## 
## Call:
## glm(formula = low ~ age + lwt, family = binomial(link = "logit"), 
##     data = DataFrame)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1352  -0.9088  -0.7480   1.3392   2.0595  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.748773   0.997097   1.754   0.0795 .
## age         -0.039788   0.032287  -1.232   0.2178  
## lwt         -0.012775   0.006211  -2.057   0.0397 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 227.12  on 186  degrees of freedom
## AIC: 233.12
## 
## Number of Fisher Scoring iterations: 4

Basic Maths of Logistic Regression

Odds Ratio:
The odds ratio represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure. It is built from the odds, defined below.
Odds Formula:
Odds
= probability of success (p) / probability of failure (1 - p)
= probability of (target variable = 1) / probability of (target variable = 0)
= p/(1-p)

Logit Formula
The logit score can be defined as:
logit(p) = log(p/(1-p))

Logit score:
logit(p) = b0 + b1*x1 + … + bk*xk

Probability Calculation
To find the probability of getting “low = 1” (i.e., the probability of success):

Coefficients:
Intercept Coefficient (b0) = 1.748773
age Coefficient (b1) = -0.039788 (the change in logit score per unit increase in age is -0.039788)
lwt Coefficient (b2) = -0.012775 (the change in logit score per unit increase in weight is -0.012775)

p-value:
p-value for age = 0.2178
(According to the z-test, the p-value is 0.2178, which is comparatively high; this implies that “age” is not a significant variable in predicting the target variable “low”.)
p-value for lwt = 0.0397
(According to the z-test, the p-value is 0.0397, which is comparatively low; this implies that “lwt” is a significant variable in predicting the target variable “low”.)
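
These p-values can also be extracted from the fitted model programmatically; a small sketch:

# The last column of the coefficient table holds the p-values reported above.
coef(summary(LogisticModel))[, "Pr(>|z|)"]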

Logit score calculation

head(DataFrame,1)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523

Let's consider the first observation in the data: a mother with age = 19 and lwt = 182.
Logit score = b0 + b1*x1 + b2*x2

1.748773 - (0.039788*19) - (0.012775*182)
## [1] -1.332249

The logit score for this observation = -1.332249

Odds Calculation: exponentiating the logit score gives the odds that birth weight < 2.5 kg (i.e., low = 1).

OddsValue = exp(-1.332249)
OddsValue
## [1] 0.2638831

Probability Calculation
probability (p) = odds value / (odds value + 1)

p = OddsValue/(OddsValue+1)
p
## [1] 0.2087876
pred <- predict.glm(LogisticModel, newdata=DataFrame, type="response")
head(pred,1)
##        85 
## 0.2087754

Interpretation:
0.2087, or 20.87%, is the probability of a birth weight below 2.5 kg when the mother's age = 19 and the mother's weight = 182.
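
The same probability can be obtained in one step, since the logistic function is the inverse of the logit; a small sketch using R's built-in plogis():

# plogis() converts a logit score directly into a probability.
plogis(-1.332249)   # approximately 0.2088, matching the value computed above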

Deviance
Deviance is a measure of goodness of fit of a generalized linear model: the higher the deviance, the poorer the model fits.
Null deviance: 234.67 on 188 degrees of freedom
(When the model includes only the intercept term, its performance is measured by the null deviance.)
Residual deviance: 227.12 on 186 degrees of freedom
(When the model includes the age and lwt variables, the deviance is the residual deviance, which is lower (227.12) than the null deviance (234.67). The lower residual deviance indicates that the model improved once the two variables age and lwt were included.)

Degrees of Freedom
Null deviance: 234.67 on 188 degrees of freedom
(The degrees of freedom for the null deviance equal N - 1, where N is the number of observations in the data sample. Here N = 189, therefore N - 1 = 189 - 1 = 188.)
Residual deviance: 227.12 on 186 degrees of freedom
(The degrees of freedom for the residual deviance equal N - k - 1, where k is the number of variables and N is the number of observations. Here N = 189 and k = 2, therefore N - k - 1 = 189 - 2 - 1 = 186.)
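
Both values are stored on the fitted model object; a quick check:

LogisticModel$df.null       # N - 1 = 188
LogisticModel$df.residual   # N - k - 1 = 186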

AIC
AIC: 233.12
(AIC stands for Akaike Information Criterion. It is useful when we have more than one model and want to compare their goodness of fit. It is based on the maximized likelihood, with a penalty on the number of parameters to discourage overfitting. It is analogous to adjusted R² in multiple linear regression, in that it discourages including irrelevant predictor variables. A model with a lower AIC is preferred over a model with a higher AIC.)
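
As a small sketch of how AIC is used in practice, we can refit an intercept-only baseline model and compare it with the model above; the model with the lower AIC is preferred.

# Intercept-only baseline for comparison.
NullModel <- glm(low ~ 1, data = DataFrame, family = binomial(link = "logit"))
AIC(NullModel, LogisticModel)   # lower AIC = better fit/complexity trade-off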

Case Study: Titanic Survival Analysis

Load the dataset

training.data.raw = read.table("D:/Analytics/RPubs/Datasets/Titanic/train.csv", sep = ",", header = T, na.strings = (""))
head(training.data.raw,3)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500  <NA>        S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250  <NA>        S
str(training.data.raw)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

This data set comprises a mix of categorical (factor) and numeric variables. Among these variables, Survived is the dependent variable.

Data Cleansing
Now we need to check for missing values and see how many unique values there are for each variable, using the sapply() function, which applies the function passed as an argument to each column of the data frame.

sapply(training.data.raw,function(x) sum(is.na(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
sapply(training.data.raw, function(x) length(unique(x)))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##         891           2           3         891           2          89 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         681         248         148           4

A visual take on the missing values might be helpful: the Amelia package has a special plotting function missmap() that will plot your dataset and highlight missing values.

#install.packages("Amelia, repos = http://cran.us.r-project.org")
library(Amelia)
missmap(training.data.raw, main = "Missing Values vs Observed")

The variable Cabin has too many missing values, so we will not use it. We will also drop PassengerId, since it is only an index, and Ticket.
Using the subset() function, we select only the relevant columns from the original dataset.

data <- subset(training.data.raw,select=c(2,3,5,6,7,8,10,12))  # Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked

Missing Values Treatment
Now we need to account for the other missing values. R can easily deal with them when fitting a generalized linear model by setting a parameter inside the fitting function. However, there are different ways to handle them; a typical approach is to replace the missing values with the average, the median, or the mode of the existing ones.

data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = T)
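
Median imputation, mentioned above as an alternative, would look like this (shown commented out, since we keep the mean-imputed values):

# Alternative (not run): replace missing ages with the median instead.
# data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE)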

As for categorical variables, read.table() and read.csv() encode them as factors by default.

is.factor(data$Sex)
## [1] TRUE
is.factor(data$Embarked)
## [1] TRUE

Embarked has only two missing values, so we will discard those two rows (we could also have replaced the missing values with the mode and kept the data points).

data <- data[!is.na(data$Embarked),]
rownames(data) <- NULL
head(data)
##   Survived Pclass    Sex      Age SibSp Parch    Fare Embarked
## 1        0      3   male 22.00000     1     0  7.2500        S
## 2        1      1 female 38.00000     1     0 71.2833        C
## 3        1      3 female 26.00000     0     0  7.9250        S
## 4        1      1 female 35.00000     1     0 53.1000        S
## 5        0      3   male 35.00000     0     0  8.0500        S
## 6        0      3   male 29.69912     0     0  8.4583        Q

Model Fitting

We split the data into two chunks: a training set and a testing set. The training set will be used to fit our model, which we will then test on the testing set.

train <- data[1:800,]
test <- data[801:889,]
model <- glm(Survived ~.,family=binomial(link='logit'),data=train)
summary(model)
## 
## Call:
## glm(formula = Survived ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6064  -0.5954  -0.4254   0.6220   2.4165  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.137627   0.594998   8.635  < 2e-16 ***
## Pclass      -1.087156   0.151168  -7.192 6.40e-13 ***
## Sexmale     -2.756819   0.212026 -13.002  < 2e-16 ***
## Age         -0.037267   0.008195  -4.547 5.43e-06 ***
## SibSp       -0.292920   0.114642  -2.555   0.0106 *  
## Parch       -0.116576   0.128127  -0.910   0.3629    
## Fare         0.001528   0.002353   0.649   0.5160    
## EmbarkedQ   -0.002656   0.400882  -0.007   0.9947    
## EmbarkedS   -0.318786   0.252960  -1.260   0.2076    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1065.39  on 799  degrees of freedom
## Residual deviance:  709.39  on 791  degrees of freedom
## AIC: 727.39
## 
## Number of Fisher Scoring iterations: 5

Interpreting the results of our logistic regression model

Parch, Fare and Embarked are not statistically significant, while SibSp is significant at the 0.05 level.
Among the statistically significant variables, Sex has the lowest p-value, suggesting a strong association between the sex of the passenger and the probability of having survived.
The negative coefficient for this predictor suggests that, all other variables being equal, a male passenger is less likely to have survived.

Remember that in the logit model the response variable is the log odds: ln(odds) = ln(p/(1-p)) = b0 + b1*x1 + b2*x2 + … + bk*xk.
Since male is a dummy variable, being male reduces the log odds by 2.75 while a unit increase in age reduces the log odds by 0.037.
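
Exponentiating the coefficients converts these log-odds effects into odds ratios; a small sketch:

# Odds ratios: e.g. exp(-2.7568) is roughly 0.064, i.e. the odds of survival
# for a male passenger are about 6% of those for a female, all else equal.
exp(coef(model))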

Now we can run the anova() function on the model to analyze the table of deviance

anova(model, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Survived
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       799    1065.39              
## Pclass    1   83.607       798     981.79 < 2.2e-16 ***
## Sex       1  240.014       797     741.77 < 2.2e-16 ***
## Age       1   17.495       796     724.28 2.881e-05 ***
## SibSp     1   10.842       795     713.43  0.000992 ***
## Parch     1    0.863       794     712.57  0.352873    
## Fare      1    0.994       793     711.58  0.318717    
## Embarked  2    2.187       791     709.39  0.334990    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. Analyzing the table we can see the drop in deviance when adding each variable one at a time. Again, adding Pclass, Sex and Age significantly reduces the residual deviance. The other variables seem to improve the model less even though SibSp has a low p-value. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. Ultimately what you would like to see is a significant drop in deviance and the AIC.
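
The gap between the null and residual deviance can itself be turned into a formal test; a brief sketch of the overall model chi-squared test:

# p-value for the overall model: the drop in deviance is approximately
# chi-squared distributed, with df equal to the number of parameters added.
with(model, pchisq(null.deviance - deviance, df.null - df.residual,
                   lower.tail = FALSE))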

Assessing the predictive ability of the model
Now we would like to see how the model is doing when predicting y on a new set of data. By setting the parameter type=‘response’, R will output probabilities in the form of P(y=1|X). Our decision boundary will be 0.5. If P(y=1|X) > 0.5 then y = 1 otherwise y=0.

fitted.results <- predict(model,newdata=subset(test,select=c(2,3,4,5,6,7,8)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$Survived)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.842696629213483"

The 0.84 accuracy on the test set is quite a good result. However, keep in mind that this result is somewhat dependent on the manual split of the data that was made earlier.
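
A more robust check would repeat the exercise with a random split (or with k-fold cross-validation). A minimal sketch of a random split, using hypothetical object names so the objects above are left untouched:

# Hypothetical alternative to the fixed 1:800 / 801:889 split used above.
set.seed(123)
idx    <- sample(seq_len(nrow(data)), size = 800)
train2 <- data[idx, ]
test2  <- data[-idx, ]
# The model would then be refit on train2 and evaluated on test2 as above.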

As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier.

The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.

#install.packages("ROCR, repos = http://cran.us.r-project.org")
library(ROCR)
p <- predict(model, newdata=subset(test,select=c(2,3,4,5,6,7,8)), type="response")
pr <- prediction(p, test$Survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8647186

Assessing the assumption of no multicollinearity
To check for multicollinearity, use the VIF values.
. If the largest VIF is greater than 10 then there is a cause for concern.
. If the average VIF is substantially greater than 1 then the regression may be biased.

Tolerance = 1/VIF
. Tolerance below 0.1 indicates a serious problem.
. Tolerance below 0.2 indicates a potential problem.

#install.packages("car")
library(car)
vif(model)
##              GVIF Df GVIF^(1/(2*Df))
## Pclass   1.847686  1        1.359296
## Sex      1.218879  1        1.104029
## Age      1.279425  1        1.131116
## SibSp    1.261077  1        1.122977
## Parch    1.305816  1        1.142723
## Fare     1.536985  1        1.239752
## Embarked 1.230697  2        1.053265
1/vif(model)
##               GVIF  Df GVIF^(1/(2*Df))
## Pclass   0.5412174 1.0       0.7356748
## Sex      0.8204260 1.0       0.9057737
## Age      0.7816014 1.0       0.8840822
## SibSp    0.7929728 1.0       0.8904902
## Parch    0.7658045 1.0       0.8751025
## Fare     0.6506243 1.0       0.8066128
## Embarked 0.8125478 0.5       0.9494284
mean(vif(model))
## [1] 1.230177

For our current model the VIF values are all well below 10 and the tolerance statistics are all well above 0.2. Also, the average VIF is very close to 1. Based on these measures, we can safely conclude that there is no collinearity in our data.