This part has been taken from the AV website.
Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome, we use dummy variables. You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we are using log of odds as dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.
Logistic Regression is part of a larger class of algorithms known as Generalized Linear Model (glm).
Although most logisitc regression should be called binomial logistic regression, since the variable to predict is binary, however, logistic regression can also be used to predict a dependent variable which can assume more than 2 values. In this second case we call the model multinomial logistic regression. A typical example for instance, would be classifying films between “Entertaining”, “borderline” or “boring”.
The general equation of the logit model
\[\mathbf{Y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n\] where Y is the variable to predict. \(\beta\) are the coefficients of the predictors and \(x_i\) are the predictors (aka independent variables).
In logistic regression, we are only concerned about the probability of outcome dependent variable ( success or failure). We should then rewrite our function \[p = e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}\] This however does not garantee to have p between 0 and 1.
Let’s then have \[p = \dfrac{e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}} {e^{(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)} + 1}\] or \[p = \dfrac {e^Y} {e^Y + 1}\] where p is the probability of success. With little further manipulations, we have \[\dfrac {p} {1-p} = e^Y\] and \[\log{\dfrac{p} {1-p}} = Y\] If we remember what was Y, we get \[\log{\dfrac{p} {1-p}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n\]
This is the equation used in Logistic Regression. Here (p/1-p) is the odd ratio. Whenever the log of odd ratio is found to be positive, the probability of success is always more than 50%.
To evaluate the performance of a logistic regression model, we must consider few metrics.
Libraries used in this article.
library(readr)
library(purrr)
library(tibble)
library(dplyr)
library(ggmissing)
library(visdat)
library(caTools)
We use a dataset about factors influencing graduate admission that can be downloaded from the UCLA Institute for Digital Research and Education
Let’s have a quick look at the data and their summary. The goal is to get familiar with the data, type of predictors (continuous, discrete, categorical, etc.)
mydata <- read_csv("~/Google Drive/Software/R projects/datasets/grad_admission.csv")
glimpse(mydata)
## Observations: 400
## Variables: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ gre <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 4...
## $ gpa <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3....
## $ rank <int> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2,...
map_dbl(mydata, sd)
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
## Two-way contingency table of categorical outcome and predictors
table(mydata$admit, mydata$rank)
##
## 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
round(prop.table(table(mydata$admit, mydata$rank), 2), 2)
##
## 1 2 3 4
## 0 0.46 0.64 0.77 0.82
## 1 0.54 0.36 0.23 0.18
Let’s run our model.
## we need to make the rank variable as factor
mydata$rank <- factor(mydata$rank)
glimpse(mydata)
## Observations: 400
## Variables: 4
## $ admit <int> 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ gre <int> 380, 660, 800, 640, 520, 760, 560, 400, 540, 700, 800, 4...
## $ gpa <dbl> 3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08, 3.39, 3....
## $ rank <fctr> 3, 3, 1, 4, 4, 2, 1, 2, 3, 2, 4, 1, 1, 2, 1, 3, 4, 3, 2...
model1 <- glm(admit ~ ., data = mydata, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = admit ~ ., family = "binomial", data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
anova(model1)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: admit
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 399 499.98
## gre 1 13.9204 398 486.06
## gpa 1 5.7122 397 480.34
## rank 3 21.8265 394 458.52
The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
We can test for an overall effect of rank using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. We use the wald.test function. b supplies the coefficients, while Sigma supplies the variance covariance matrix of the error terms, finally Terms tells R which terms in the model are to be tested, in this case, terms 4, 5, and 6, are the three terms for the levels of rank.
library(aod)
wald.test(Sigma = vcov(model1), b = coef(model1), Terms = 4:6)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 20.9, df = 3, P(> X2) = 0.00011
The chi-squared test statistic of 20.9, with three degrees of freedom is associated with a p-value of 0.00011 indicating that the overall effect of rank is statistically significant.
mydata <- read_csv("~/Google Drive/Software/R projects/datasets/school_math_data.csv")
glimpse(mydata)
## Observations: 394
## Variables: 5
## $ hsgpa <dbl> 78.0, 66.0, 80.2, 81.7, 86.8, 76.7, 85.8, 73.0, 72.3, 9...
## $ hsengl <int> 80, 75, 70, 67, 80, 75, 81, 77, 60, 88, 91, 74, 85, 81,...
## $ hscalc <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
## $ course <chr> "Mainstrm", "Mainstrm", "Mainstrm", "Mainstrm", "Mainst...
## $ passed <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "...
mydata$hscalc <- factor(mydata$hscalc)
mydata$course <- factor(mydata$course)
mydata$passed <- factor(mydata$passed)
glimpse(mydata)
## Observations: 394
## Variables: 5
## $ hsgpa <dbl> 78.0, 66.0, 80.2, 81.7, 86.8, 76.7, 85.8, 73.0, 72.3, 9...
## $ hsengl <int> 80, 75, 70, 67, 80, 75, 81, 77, 60, 88, 91, 74, 85, 81,...
## $ hscalc <fctr> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
## $ course <fctr> Mainstrm, Mainstrm, Mainstrm, Mainstrm, Mainstrm, Main...
## $ passed <fctr> No, Yes, Yes, Yes, Yes, Yes, Yes, No, No, Yes, Yes, No...
model2 <- glm(passed ~ hsgpa + hsengl, data = mydata, family = "binomial")
summary(model2)
##
## Call:
## glm(formula = passed ~ hsgpa + hsengl, family = "binomial", data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5577 -0.9833 0.4340 0.9126 2.2883
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.69568 2.00683 -7.323 2.43e-13 ***
## hsgpa 0.22982 0.02955 7.776 7.47e-15 ***
## hsengl -0.04020 0.01709 -2.352 0.0187 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 530.66 on 393 degrees of freedom
## Residual deviance: 437.69 on 391 degrees of freedom
## AIC: 443.69
##
## Number of Fisher Scoring iterations: 4
For a constant value of mark in HS English, for every one-point increase in HS GPA, estimated odds of passing are multiplied by …
exp(model2$coefficients[2])
## hsgpa
## 1.258378
And similarly, for a constant value of mark in GPA, for every one-point increased in HS English, estimated odds of passing are multiplied by …
exp(model2$coefficients[3])
## hsengl
## 0.9605967
We can enhance this model and add the coursepredictor.
model2b <- glm(passed ~ hsengl + hsgpa + course, data = mydata, family = "binomial")
summary(model2b)
##
## Call:
## glm(formula = passed ~ hsengl + hsgpa + course, family = "binomial",
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5404 -0.9852 0.4110 0.8820 2.2109
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.47401 2.07664 -7.451 9.23e-14 ***
## hsengl -0.03534 0.01766 -2.001 0.04539 *
## hsgpa 0.21939 0.02988 7.342 2.10e-13 ***
## courseElite 2.04983 0.64108 3.197 0.00139 **
## courseMainstrm 1.29137 0.45190 2.858 0.00427 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 530.66 on 393 degrees of freedom
## Residual deviance: 424.76 on 389 degrees of freedom
## AIC: 434.76
##
## Number of Fisher Scoring iterations: 4
anova(model2b, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: passed
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 393 530.66
## hsengl 1 8.286 392 522.37 0.003994 **
## hsgpa 1 84.684 391 437.69 < 2.2e-16 ***
## course 2 12.921 389 424.76 0.001564 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model2b$coefficients
## (Intercept) hsengl hsgpa courseElite courseMainstrm
## -15.47401114 -0.03533871 0.21939002 2.04983360 1.29136575
exp(model2b$coefficients)
## (Intercept) hsengl hsgpa courseElite courseMainstrm
## 1.904243e-07 9.652784e-01 1.245317e+00 7.766609e+00 3.637751e+00
Controlling for High School english mark and High School GPA, the estimated odds of passing are 7.76 times as great for students in the Elite course, compared to students in the Catch-up course.
Using the functions drop1 and add1 one can see if the addition or dropping of variable help our models.
drop1(model2, test = "Chisq")
## Single term deletions
##
## Model:
## passed ~ hsgpa + hsengl
## Df Deviance AIC LRT Pr(>Chi)
## <none> 437.69 443.69
## hsgpa 1 522.37 526.37 84.684 <2e-16 ***
## hsengl 1 443.43 447.43 5.749 0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(model2b, ~.^2, test = "Chisq")
## Single term additions
##
## Model:
## passed ~ hsengl + hsgpa + course
## Df Deviance AIC LRT Pr(>Chi)
## <none> 424.76 434.76
## hsengl:hsgpa 1 423.97 435.97 0.78921 0.3743
## hsengl:course 2 423.51 437.51 1.25820 0.5331
## hsgpa:course 2 424.13 438.13 0.63625 0.7275
We downloaded the data from the Kaggle website and saved on our dataset folder. We follow a tutorial from the datascience + website.
mydata <- read_csv("~/Google Drive/Software/R projects/datasets/Kaggle_Titanic_train.csv")
glimpse(mydata)
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex <chr> "male", "female", "female", "female", "male", "mal...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, ...
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...
vis_dat(mydata)
# Let's check for missing and distincts values.
# 2 ways to do this using either summarize_each or map
fun1 <- function(x){sum(is.na(x))}
mydata %>% summarize_each(funs(fun1))
## # A tibble: 1 × 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 177 0 0 0 0
## # ... with 2 more variables: Cabin <int>, Embarked <int>
# or in a more visual way.
gg_missing_var(mydata)
vis_miss(mydata)
# Let's check for unique value.
map_dbl(mydata, n_distinct)
## PassengerId Survived Pclass Name Sex Age
## 891 2 3 891 2 89
## SibSp Parch Ticket Fare Cabin Embarked
## 7 7 681 248 148 4
The variable Cabin has too many missing values, we will not use it. We will also drop PassengerId since it is only an index and Ticket. As the Age variable has too many missing data, we’ll replace them by the mean of the available data. As there is only 2 missing data for the Embarked variable, we’ll take away these 2 rows.
mydata <- mydata %>% select(-PassengerId, -Ticket, -Cabin, -Name)
mydata$Sex <- factor(mydata$Sex)
mydata$Survived <- factor(mydata$Survived)
mydata$Embarked <- factor(mydata$Embarked)
glimpse(mydata)
## Observations: 891
## Variables: 8
## $ Survived <fctr> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2,...
## $ Sex <fctr> male, female, female, female, male, male, male, male...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39,...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0,...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0,...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51....
## $ Embarked <fctr> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S...
mydata$Age[is.na(mydata$Age)] <- mean(mydata$Age, na.rm = TRUE)
mydata <- mydata[!is.na(mydata$Embarked), ]
This preprocessing step often is crucial for obtaining a good fit of the model and better predictive ability.
We first split the data into a training and testing set.
sample <- sample.split(mydata$Survived, SplitRatio = 0.8)
train_set <- subset(mydata, sample == TRUE)
test_set <- subset(mydata, sample == FALSE)
model3 <- glm(Survived ~ ., data = train_set, family = "binomial")
summary(model3)
##
## Call:
## glm(formula = Survived ~ ., family = "binomial", data = train_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2317 -0.6341 -0.4296 0.6358 2.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.927930 0.631712 7.801 6.15e-15 ***
## Pclass -1.070134 0.163069 -6.562 5.29e-11 ***
## Sexmale -2.706571 0.222208 -12.180 < 2e-16 ***
## Age -0.031783 0.008719 -3.645 0.000267 ***
## SibSp -0.331013 0.120636 -2.744 0.006071 **
## Parch -0.139626 0.127228 -1.097 0.272444
## Fare 0.001947 0.002843 0.685 0.493493
## EmbarkedQ 0.114279 0.418823 0.273 0.784962
## EmbarkedS -0.344180 0.265811 -1.295 0.195379
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 946.06 on 710 degrees of freedom
## Residual deviance: 638.88 on 702 degrees of freedom
## AIC: 656.88
##
## Number of Fisher Scoring iterations: 5
Now we can analyze the fitting and interpret what the model is telling us. First of all, we can see that SibSp, Fare and Embarked are not statistically significant. As for the statistically significant variables, sex has the lowest p-value suggesting a strong association of the sex of the passenger with the probability of having survived. The negative coefficient for this predictor suggests that all other variables being equal, the male passenger is less likely to have survived. Remember that in the logit model the response variable is log odds. Since male is a dummy variable, being male reduces the log odds by 2.75 while a unit increase in age reduces the log odds by 0.037.
fitted_results <- predict(model3, newdata = test_set, type = "response")
fitted_results <- if_else(fitted_results > 0.5, 1, 0)
misClasificError <- mean(fitted_results != test_set$Survived)
print(paste0("Accuracy of ", round((1 - misClasificError), 3)))
## [1] "Accuracy of 0.826"
prop.table(table(test_set$Survived, fitted_results))
## fitted_results
## 0 1
## 0 0.55056180 0.06741573
## 1 0.10674157 0.27528090
And now the ROC curve and the AUC.
library(ROCR)
fitted_results <- predict(model3, newdata = test_set, type = "response")
pr <- prediction(fitted_results, test_set$Survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8825535