Introduction.

This part has been taken from the AV website.
Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome, we use dummy variables. You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
Logistic Regression is part of a larger class of algorithms known as Generalized Linear Model (glm).
Although most logisitc regression should be called binomial logistic regression, since the variable to predict is binary, however, logistic regression can also be used to predict a dependent variable which can assume more than 2 values. In this second case we call the model multinomial logistic regression. A typical example for instance, would be classifying films between “Entertaining”, “borderline” or “boring”.

Model Assumptions

GLM does not assume a linear relationship between dependent and independent variables. However, it assumes a linear relationship between link function and independent variables in logit model.
The dependent variable need not to be normally distributed.
It does not uses OLS (Ordinary Least Square) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE).
Errors need to be independent but not normally distributed.

The logistic equation.

The general equation of the logit model
\[\mathbf{Y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n\] where Y is the variable to predict. \(\beta\) are the coefficients of the predictors and \(x_i\) are the predictors (aka independent variables).
In logistic regression, we are only concerned about the probability of outcome dependent variable ( success or failure). We should then rewrite our function \[p = e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}\] This however does not garantee to have p between 0 and 1.
Let’s then have \[p = \dfrac{e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}} {e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)} + 1}\] or \[p = \dfrac {e^Y} {e^Y + 1}\] where p is the probability of success. With little further manipulations, we have \[\dfrac {p} {1-p} = e^Y\] and \[\log{\dfrac{p} {1-p}} = Y\] If we remember what was Y, we get \[\log{\dfrac{p} {1-p}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n\]
This is the equation used in Logistic Regression. Here (p/1-p) is the odd ratio. Whenever the log of odd ratio is found to be positive, the probability of success is always more than 50%.

Performance of Logistic Regression Model.

To evaluate the performance of a logistic regression model, we must consider few metrics.

AIC (Akaike Information Criteria) The analogous metric of adjusted R² in logistic regression is AIC. AIC is the measure of fit which penalizes model for the number of model coefficients. Therefore, we always prefer model with minimum AIC value.
Null Deviance and Residual Deviance Null Deviance indicates the response predicted by a model with nothing but an intercept. Lower the value, better the model. Residual deviance indicates the response predicted by a model on adding independent variables. Lower the value, better the model.
Confusion Matrix It is nothing but a tabular representation of Actual vs Predicted values. This helps us to find the accuracy of the model and avoid overfitting.
We can calcualate the accuracy of our model by \[\dfrac {True Positives + True Negatives} {True Positives + True Negatives + False Positives + False Negatives}\]
From confusion matrix, Specificity and Sensitivity can be derived as
\[Specificity = \dfrac {True Negatives} {True Negative + False Positive}\]
\[Sensitivity = \dfrac{True Positive}{True Positive + False Negative}\]
ROC Curve Receiver Operating Characteristic(ROC) summarizes the model’s performance by evaluating the trade offs between true positive rate (sensitivity) and false positive rate(1- specificity). For plotting ROC, it is advisable to assume p > 0.5 since we are more concerned about success rate. ROC summarizes the predictive power for all possible values of p > 0.5. The area under curve (AUC), referred to as index of accuracy(A) or concordance index, is a perfect performance metric for ROC curve. Higher the area under curve, better the prediction power of the model. The ROC of a perfect predictive model has TP equals 1 and FP equals 0. This curve will touch the top left corner of the graph.

Libraries used in this article.

library(readr)
library(purrr)
library(tibble)
library(dplyr)
library(ggmissing)
library(visdat)
library(caTools)

Example 1.

We use a dataset about factors influencing graduate admission that can be downloaded from the UCLA Institute for Digital Research and Education
Let’s have a quick look at the data and their summary. The goal is to get familiar with the data, type of predictors (continuous, discrete, categorical, etc.)

mydata <- read_csv("~/Google Drive/Software/R projects/datasets/grad_admission.csv")
glimpse(mydata)

## Observations: 400
## Variables: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ gre   <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 4...
## $ gpa   <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3....
## $ rank  <int> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2,...

map_dbl(mydata, sd)

##       admit         gre         gpa        rank 
##   0.4660867 115.5165364   0.3805668   0.9444602

## Two-way contingency table of categorical outcome and predictors
table(mydata$admit, mydata$rank)

##    
##      1  2  3  4
##   0 28 97 93 55
##   1 33 54 28 12

round(prop.table(table(mydata$admit, mydata$rank), 2), 2)

##    
##        1    2    3    4
##   0 0.46 0.64 0.77 0.82
##   1 0.54 0.36 0.23 0.18

Let’s run our model.

## we need to make the rank variable as factor
mydata$rank <- factor(mydata$rank)
glimpse(mydata)

## Observations: 400
## Variables: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ gre   <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 4...
## $ gpa   <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3....
## $ rank  <fctr> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2...

model1 <- glm(admit ~ ., data = mydata, family = "binomial")
summary(model1)

## 
## Call:
## glm(formula = admit ~ ., family = "binomial", data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

anova(model1)

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: admit
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev
## NULL                   399     499.98
## gre   1  13.9204       398     486.06
## gpa   1   5.7122       397     480.34
## rank  3  21.8265       394     458.52

The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.

We can test for an overall effect of rank using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. We use the wald.test function. b supplies the coefficients, while Sigma supplies the variance covariance matrix of the error terms, finally Terms tells R which terms in the model are to be tested, in this case, terms 4, 5, and 6, are the three terms for the levels of rank.

library(aod)
wald.test(Sigma = vcov(model1), b = coef(model1), Terms = 4:6)

## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011

The chi-squared test statistic of 20.9, with three degrees of freedom is associated with a p-value of 0.00011 indicating that the overall effect of rank is statistically significant.

Example 2.

mydata <- read_csv("~/Google Drive/Software/R projects/datasets/school_math_data.csv")
glimpse(mydata)

## Observations: 394
## Variables: 5
## $ hsgpa  <dbl> 78.0, 66.0, 80.2, 81.7, 86.8, 76.7, 85.8, 73.0, 72.3, 9...
## $ hsengl <int> 80, 75, 70, 67, 80, 75, 81, 77, 60, 88, 91, 74, 85, 81,...
## $ hscalc <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
## $ course <chr> "Mainstrm", "Mainstrm", "Mainstrm", "Mainstrm", "Mainst...
## $ passed <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "...

mydata$hscalc <- factor(mydata$hscalc)
mydata$course <- factor(mydata$course)
mydata$passed <- factor(mydata$passed)
glimpse(mydata)

## Observations: 394
## Variables: 5
## $ hsgpa  <dbl> 78.0, 66.0, 80.2, 81.7, 86.8, 76.7, 85.8, 73.0, 72.3, 9...
## $ hsengl <int> 80, 75, 70, 67, 80, 75, 81, 77, 60, 88, 91, 74, 85, 81,...
## $ hscalc <fctr> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
## $ course <fctr> Mainstrm, Mainstrm, Mainstrm, Mainstrm, Mainstrm, Main...
## $ passed <fctr> No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, No...

model2 <- glm(passed ~ hsgpa + hsengl, data = mydata, family = "binomial")
summary(model2)

## 
## Call:
## glm(formula = passed ~ hsgpa + hsengl, family = "binomial", data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5577  -0.9833   0.4340   0.9126   2.2883  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -14.69568    2.00683  -7.323 2.43e-13 ***
## hsgpa         0.22982    0.02955   7.776 7.47e-15 ***
## hsengl       -0.04020    0.01709  -2.352   0.0187 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 530.66  on 393  degrees of freedom
## Residual deviance: 437.69  on 391  degrees of freedom
## AIC: 443.69
## 
## Number of Fisher Scoring iterations: 4

For a constant value of mark in HS English, for every one-point increase in HS GPA, estimated odds of passing are multiplied by …

exp(model2$coefficients[2])

##    hsgpa 
## 1.258378

And similarly, for a constant value of mark in GPA, for every one-point increased in HS English, estimated odds of passing are multiplied by …

exp(model2$coefficients[3])

##    hsengl 
## 0.9605967

We can enhance this model and add the coursepredictor.

model2b <- glm(passed ~ hsengl + hsgpa + course, data = mydata, family = "binomial")
summary(model2b)

## 
## Call:
## glm(formula = passed ~ hsengl + hsgpa + course, family = "binomial", 
##     data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5404  -0.9852   0.4110   0.8820   2.2109  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -15.47401    2.07664  -7.451 9.23e-14 ***
## hsengl          -0.03534    0.01766  -2.001  0.04539 *  
## hsgpa            0.21939    0.02988   7.342 2.10e-13 ***
## courseElite      2.04983    0.64108   3.197  0.00139 ** 
## courseMainstrm   1.29137    0.45190   2.858  0.00427 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 530.66  on 393  degrees of freedom
## Residual deviance: 424.76  on 389  degrees of freedom
## AIC: 434.76
## 
## Number of Fisher Scoring iterations: 4

anova(model2b, test = "Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: passed
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                     393     530.66              
## hsengl  1    8.286       392     522.37  0.003994 ** 
## hsgpa   1   84.684       391     437.69 < 2.2e-16 ***
## course  2   12.921       389     424.76  0.001564 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model2b$coefficients

##    (Intercept)         hsengl          hsgpa    courseElite courseMainstrm 
##   -15.47401114    -0.03533871     0.21939002     2.04983360     1.29136575

exp(model2b$coefficients)

##    (Intercept)         hsengl          hsgpa    courseElite courseMainstrm 
##   1.904243e-07   9.652784e-01   1.245317e+00   7.766609e+00   3.637751e+00

Controlling for High School english mark and High School GPA, the estimated odds of passing are 7.76 times as great for students in the Elite course, compared to students in the Catch-up course.

Model Selection.

Using the functions drop1 and add1 one can see if the addition or dropping of variable help our models.

drop1(model2, test = "Chisq")

## Single term deletions
## 
## Model:
## passed ~ hsgpa + hsengl
##        Df Deviance    AIC    LRT Pr(>Chi)    
## <none>      437.69 443.69                    
## hsgpa   1   522.37 526.37 84.684   <2e-16 ***
## hsengl  1   443.43 447.43  5.749   0.0165 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

add1(model2b, ~.^2, test = "Chisq")

## Single term additions
## 
## Model:
## passed ~ hsengl + hsgpa + course
##               Df Deviance    AIC     LRT Pr(>Chi)
## <none>             424.76 434.76                 
## hsengl:hsgpa   1   423.97 435.97 0.78921   0.3743
## hsengl:course  2   423.51 437.51 1.25820   0.5331
## hsgpa:course   2   424.13 438.13 0.63625   0.7275

Example 3. The Kaggle Titanic dataset.

We downloaded the data from the Kaggle website and saved on our dataset folder. We follow a tutorial from the datascience + website.

Checking and cleaning the data.

mydata <- read_csv("~/Google Drive/Software/R projects/datasets/Kaggle_Titanic_train.csv")
glimpse(mydata)

## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex         <chr> "male", "female", "female", "female", "male", "mal...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, ...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...

vis_dat(mydata)

# Let's check for missing and distincts values.  
# 2 ways to do this using either summarize_each or map
fun1 <- function(x){sum(is.na(x))}
mydata %>% summarize_each(funs(fun1))

## # A tibble: 1 × 12
##   PassengerId Survived Pclass  Name   Sex   Age SibSp Parch Ticket  Fare
##         <int>    <int>  <int> <int> <int> <int> <int> <int>  <int> <int>
## 1           0        0      0     0     0   177     0     0      0     0
## # ... with 2 more variables: Cabin <int>, Embarked <int>

# or in a more visual way.  
gg_missing_var(mydata)

vis_miss(mydata)

# Let's check for unique value.  
map_dbl(mydata, n_distinct)

## PassengerId    Survived      Pclass        Name         Sex         Age 
##         891           2           3         891           2          89 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         681         248         148           4

The variable Cabin has too many missing values, we will not use it. We will also drop PassengerId since it is only an index and Ticket. As the Age variable has too many missing data, we’ll replace them by the mean of the available data. As there is only 2 missing data for the Embarked variable, we’ll take away these 2 rows.

mydata <- mydata %>% select(-PassengerId, -Ticket, -Cabin, -Name)
mydata$Sex <- factor(mydata$Sex)
mydata$Survived <- factor(mydata$Survived)
mydata$Embarked <- factor(mydata$Embarked)
glimpse(mydata)

## Observations: 891
## Variables: 8
## $ Survived <fctr> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1...
## $ Pclass   <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2,...
## $ Sex      <fctr> male, female, female, female, male, male, male, male...
## $ Age      <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39,...
## $ SibSp    <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0,...
## $ Parch    <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0,...
## $ Fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51....
## $ Embarked <fctr> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S...

mydata$Age[is.na(mydata$Age)] <- mean(mydata$Age, na.rm = TRUE)
mydata <- mydata[!is.na(mydata$Embarked), ]

This preprocessing step often is crucial for obtaining a good fit of the model and better predictive ability.

Fitting the model.

We first split the data into a training and testing set.

sample <- sample.split(mydata$Survived, SplitRatio = 0.8)
train_set <- subset(mydata, sample == TRUE)
test_set <- subset(mydata, sample == FALSE)
model3 <- glm(Survived ~ ., data = train_set, family = "binomial")
summary(model3)

## 
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2317  -0.6341  -0.4296   0.6358   2.4579  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.927930   0.631712   7.801 6.15e-15 ***
## Pclass      -1.070134   0.163069  -6.562 5.29e-11 ***
## Sexmale     -2.706571   0.222208 -12.180  < 2e-16 ***
## Age         -0.031783   0.008719  -3.645 0.000267 ***
## SibSp       -0.331013   0.120636  -2.744 0.006071 ** 
## Parch       -0.139626   0.127228  -1.097 0.272444    
## Fare         0.001947   0.002843   0.685 0.493493    
## EmbarkedQ    0.114279   0.418823   0.273 0.784962    
## EmbarkedS   -0.344180   0.265811  -1.295 0.195379    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 946.06  on 710  degrees of freedom
## Residual deviance: 638.88  on 702  degrees of freedom
## AIC: 656.88
## 
## Number of Fisher Scoring iterations: 5

Now we can analyze the fitting and interpret what the model is telling us. First of all, we can see that SibSp, Fare and Embarked are not statistically significant. As for the statistically significant variables, sex has the lowest p-value suggesting a strong association of the sex of the passenger with the probability of having survived. The negative coefficient for this predictor suggests that all other variables being equal, the male passenger is less likely to have survived. Remember that in the logit model the response variable is log odds. Since male is a dummy variable, being male reduces the log odds by 2.75 while a unit increase in age reduces the log odds by 0.037.

Checking the model predicting ability.

fitted_results <- predict(model3, newdata = test_set, type = "response")
fitted_results <- if_else(fitted_results > 0.5, 1, 0)
misClasificError <- mean(fitted_results != test_set$Survived)
print(paste0("Accuracy of ", round((1 - misClasificError), 3)))

## [1] "Accuracy of 0.826"

prop.table(table(test_set$Survived, fitted_results))

##    fitted_results
##              0          1
##   0 0.55056180 0.06741573
##   1 0.10674157 0.27528090

And now the ROC curve and the AUC.

library(ROCR)
fitted_results <- predict(model3, newdata = test_set, type = "response")
pr <- prediction(fitted_results, test_set$Survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.8825535

References.

The AV web page dedicated on logistic regression. here
The dataset and page from our first example. here
The dataset and tutorial from our second example. here
The inspiration for our third example. here
The ggmissing package. here
The visdat package. here

Classification - Logistics

FdR

1 October 2016