Logistic regression, also called a logit model, is used to model a binary outcome variable. In the logit model, the log odds of the outcome are modeled as a linear combination of the predictor variables. Logistic regression is a useful analysis method for classification problems.
Logistic regression models give us predicted probabilities, which is a big help when interpreting the data. Even though we would classify both a 51% and a 99% probability as a success, there is a large difference between the two. The probabilities also let us see how strong the relationship between each predictor variable and the response variable is.
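As a quick, purely illustrative sketch (the log-odds values below are hypothetical, not taken from the admission data), the logistic link maps any log-odds value to a probability between 0 and 1:
# Illustration only: converting hypothetical log-odds values to probabilities
log_odds <- c(-2, 0, 0.04, 4.6)   # hypothetical linear-predictor values
round(plogis(log_odds), 3)        # plogis(x) = 1 / (1 + exp(-x)) -> 0.119 0.500 0.510 0.990
A log-odds of 0.04 and a log-odds of 4.6 would both be classified as a success at a 0.5 cutoff, yet they correspond to very different probabilities (about 0.51 versus 0.99).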
Data:
The admission data set has 400 observations of 4 variables. The admit variable is binary: 1 means admitted and 0 means not admitted. gre is the student's GRE score and gpa is the student's GPA. rank represents the rank of the school the student graduated from; schools with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
Objective:
Carry out a logistic regression and make predictions on a test data set.
Data Preparation
#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(caret) # confusion matrix
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#Preparing the workspace
# Clear the workspace:
rm(list = ls())
# Load data
df <- read.csv("Admission.csv", header = TRUE)
head(df)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
nrow(df)
## [1] 400
colnames(df)
## [1] "admit" "gre" "gpa" "rank"
#Summary Statistics
summary(df)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
sapply(df, sd)
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
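It is also worth confirming that there are no missing values before fitting the model (the summary above would list NA counts if there were any); a quick check:
# Count missing values in each column; all zeros for this data set
colSums(is.na(df))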
#Data Structure
str(df)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
admit, gre, and rank are stored as integers, while gpa is numeric because it has decimal values. We need to convert admit and rank to categorical (factor) variables.
#Convert to Categorical Variable
newdf <- df %>%
mutate_at(vars(admit,rank), funs(factor))
str(newdf)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
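Note that funs() is deprecated in the version of dplyr loaded above; an equivalent conversion using across() (available in dplyr >= 1.0.0) would be:
# Equivalent, non-deprecated factor conversion with across()
newdf_alt <- df %>% mutate(across(c(admit, rank), factor))
# str(newdf_alt)   # identical structure to newdf above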
#Pairs plot
ggpairs(newdf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can notice that GPA and GRE scores decrease slightly from schools with a rank of 1 to schools with a rank of 4. There are also a few outliers, which can be disregarded. The number of students who were rejected is considerably higher than the number of students who were admitted to the college.
# GPA and GRE average by Rank
newdf %>%
group_by(rank) %>%
summarise_at(vars(c(gpa,gre)), list(name = mean))
## # A tibble: 4 x 3
## rank gpa_name gre_name
## <fct> <dbl> <dbl>
## 1 1 3.45 612.
## 2 2 3.36 596.
## 3 3 3.43 575.
## 4 4 3.32 570.
This table indicates that students who attended Rank 1 schools have the highest average GRE and GPA, although the GPA differences across ranks are small.
#Two-way table of factor variables
xtabs(~admit + rank, data=newdf)
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
The data may be somewhat imbalanced, since there are relatively few students from Rank 1 schools compared with the other ranks.
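Converting the two-way table to row proportions makes the pattern clearer: computed from the counts above, the admission rate falls from about 54% for Rank 1 schools to about 18% for Rank 4 schools.
# Admission rate by rank: row proportions of the two-way table above
round(prop.table(xtabs(~rank + admit, data = newdf), margin = 1), 2)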
Data Partition
# Randomly assign each row to group 1 (~80%, training) or group 2 (~20%, test)
ind <- sample(2, nrow(newdf), replace = TRUE, prob = c(0.8, 0.2))
train <- newdf[ind == 1, ]
test <- newdf[ind == 2, ]
head(train)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 6 1 760 3.00 2
## 7 1 560 2.98 1
nrow(train)
## [1] 323
head(test)
## admit gre gpa rank
## 5 0 520 2.93 4
## 8 0 400 3.08 2
## 12 0 440 3.22 1
## 14 0 700 3.08 2
## 15 1 700 4.00 1
## 23 0 600 2.82 4
nrow(test)
## [1] 77
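Because sample() is not seeded above, the exact 323/77 split shown here will differ from run to run. For a reproducible and stratified split, one option is caret::createDataPartition(), already loaded above (a sketch; the seed value is arbitrary):
set.seed(123)                                       # arbitrary seed, for reproducibility only
idx <- createDataPartition(newdf$admit, p = 0.8, list = FALSE)
train_alt <- newdf[idx, ]                           # ~80% of rows, stratified by admit
test_alt <- newdf[-idx, ]                           # remaining ~20%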
Fit Model
In R, logistic regression is fitted with glm() (generalized linear models) using a binomial family with the logit link.
set.seed(111)
fit_glm <- glm(admit~gre+gpa+rank, data=train, family= binomial(link= logit))
summary(fit_glm)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial(link = logit),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4980 -0.8818 -0.6588 1.2005 2.0331
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.950995 1.252203 -2.357 0.018441 *
## gre 0.001559 0.001194 1.306 0.191722
## gpa 0.612632 0.366752 1.670 0.094835 .
## rank2 -0.554743 0.350989 -1.581 0.113989
## rank3 -1.432777 0.391457 -3.660 0.000252 ***
## rank4 -1.404152 0.465106 -3.019 0.002536 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 404.41 on 322 degrees of freedom
## Residual deviance: 376.30 on 317 degrees of freedom
## AIC: 388.3
##
## Number of Fisher Scoring iterations: 4
pchisq(fit_glm$deviance, df = fit_glm$df.residual, lower.tail = F)
## [1] 0.01227363
The summary() output shows the model call and the deviance residuals, followed by the coefficients. Based on the p-values, GPA is marginally significant (p < 0.10), GRE is not significant at conventional levels (p ≈ 0.19), and the Rank 3 and Rank 4 coefficients are highly significant (Rank 2 is not). Since rank is a factor (categorical variable), it enters the model as dummy variables, which is why three rank coefficients appear. The deviance goodness-of-fit test above (pchisq(), p ≈ 0.012) hints at some lack of fit, though this test should be read cautiously for ungrouped binary data.
Based on the coefficients, for every one-unit increase in GRE the log odds of admission increase by about 0.0016, and for every one-unit increase in GPA the log odds of admission increase by about 0.61.
Rank 1 is the reference level, so compared with Rank 1 the log odds of admission decrease by about 0.55 for Rank 2, by about 1.43 for Rank 3, and by about 1.40 for Rank 4. It is clear that the school's rank has a strong effect on admission.
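Because coefficients on the log-odds scale are hard to read directly, it often helps to exponentiate them into odds ratios:
# Coefficients as odds ratios; e.g. exp(0.61) ~ 1.85 means a one-point increase in GPA
# multiplies the odds of admission by about 1.85, holding the other predictors fixed
round(exp(coef(fit_glm)), 3)
# round(exp(confint(fit_glm)), 3)   # optional: profile-likelihood 95% CIs on the odds-ratio scale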
Prediction by Rank
newdata <- with(df, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata$rankP <- predict(fit_glm, newdata = newdata, type = "response")
newdata <- newdata %>%
mutate_at(vars(gre,gpa,rankP), funs(round(.,2)))
newdata
## gre gpa rank rankP
## 1 587.7 3.39 1 0.51
## 2 587.7 3.39 2 0.37
## 3 587.7 3.39 3 0.20
## 4 587.7 3.39 4 0.20
The goal of this table is to show the effect of school rank on a student's chance of admission, holding GRE and GPA at their means. As shown, the predicted probability of being accepted into a graduate program is 0.51 for students from the highest-prestige schools, whereas it is 0.20 for students from the lowest-ranked schools.
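To see how the predicted probability changes with GRE as well, the same idea can be applied over a grid of GRE values (a sketch; the grid limits are chosen to match the observed GRE range, and the ggplot() call is optional):
# Predicted admission probability over a grid of GRE values, by rank, at the mean GPA
grid <- expand.grid(gre = seq(220, 800, by = 20),
                    gpa = mean(df$gpa),
                    rank = factor(1:4))
grid$prob <- predict(fit_glm, newdata = grid, type = "response")
# ggplot(grid, aes(gre, prob, colour = rank)) + geom_line()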
Prediction for Train data set
# Predicted probabilities on the training data (note: predict() takes newdata, not data)
p1 <- predict(fit_glm, newdata = train, type = 'response')
#Confusion Matrix::
#By hand
pred1 <- ifelse(p1>0.5,1,0)
table1 <- table(Predicted=pred1, Actual=train$admit)
table1
## Actual
## Predicted 0 1
## 0 204 83
## 1 16 20
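The accuracy can also be computed directly from this table, since the diagonal cells are the correct classifications:
# Overall accuracy from the confusion table: (204 + 20) / 323
sum(diag(table1)) / sum(table1)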
#By Caret library
pred1 <- as.factor(pred1)
admit1<- as.factor(train$admit)
result_train <-confusionMatrix(data=pred1,reference=admit1)
print(result_train)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 204 83
## 1 16 20
##
## Accuracy : 0.6935
## 95% CI : (0.6401, 0.7433)
## No Information Rate : 0.6811
## P-Value [Acc > NIR] : 0.3402
##
## Kappa : 0.1468
##
## Mcnemar's Test P-Value : 3.284e-11
##
## Sensitivity : 0.9273
## Specificity : 0.1942
## Pos Pred Value : 0.7108
## Neg Pred Value : 0.5556
## Prevalence : 0.6811
## Detection Rate : 0.6316
## Detection Prevalence : 0.8885
## Balanced Accuracy : 0.5607
##
## 'Positive' Class : 0
##
We can use the caret library for more detailed output. Our accuracy on the training data is about 69.4%, barely above the no-information rate of 68.1%, and the specificity of 19.4% shows that the model correctly identifies only a small share of the admitted students (note that the 'positive' class here is 0, not admitted).
Prediction for Test data set
ptest <- predict(fit_glm, test, type='response')
#Confusion Matrix::
pred_test <- ifelse(ptest>0.5,1,0)
pred_test<- as.factor(pred_test)
admit_test<- as.factor(test$admit)
result_test <-confusionMatrix(data=pred_test,reference=admit_test)
print(result_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52 17
## 1 1 7
##
## Accuracy : 0.7662
## 95% CI : (0.6559, 0.8552)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.085555
##
## Kappa : 0.3337
##
## Mcnemar's Test P-Value : 0.000407
##
## Sensitivity : 0.9811
## Specificity : 0.2917
## Pos Pred Value : 0.7536
## Neg Pred Value : 0.8750
## Prevalence : 0.6883
## Detection Rate : 0.6753
## Detection Prevalence : 0.8961
## Balanced Accuracy : 0.6364
##
## 'Positive' Class : 0
##
On the test data we get an accuracy of about 76.6%, higher than the training accuracy, although with only 77 test observations (and a no-information rate of 68.8%) the difference should be interpreted with caution. As on the training data, the specificity is low (about 29%), so the model is much better at identifying rejected applicants than admitted ones.
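Since the 0.5 cutoff gives low specificity, a natural next step would be to look at the ROC curve and consider other cutoffs. A sketch, assuming the pROC package is installed (it is not loaded above):
# ROC curve and AUC for the test-set predicted probabilities (requires pROC)
# install.packages("pROC")    # if not already installed
library(pROC)
roc_test <- roc(response = test$admit, predictor = ptest)
auc(roc_test)                 # area under the ROC curve
# plot(roc_test)              # sensitivity/specificity trade-off across cutoffs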