Logistic regression, also called a logit model, is used to model a binary outcome variable. In the logit model, the log odds of the outcome are modeled as a linear combination of the predictor variables. Logistic regression is a useful analysis method for classification problems.
Logistic regression models give us predicted probabilities, which is a big help when interpreting the data. Even though we would classify both a 51% and a 99% probability as a success, there is a large difference between the two. The probabilities also let us see how strong the relationship between each predictor variable and the response variable is.
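As a quick, purely illustrative sketch (the log-odds values below are hypothetical, not taken from the admission data), the logistic link maps any log-odds value to a probability between 0 and 1:
# Illustration only: converting hypothetical log-odds values to probabilities
log_odds <- c(-2, 0, 0.04, 4.6)   # hypothetical linear-predictor values
round(plogis(log_odds), 3)        # plogis(x) = 1 / (1 + exp(-x)) -> 0.119 0.500 0.510 0.990
A log-odds of 0.04 and a log-odds of 4.6 would both be classified as a success at a 0.5 cutoff, yet they correspond to very different probabilities (about 0.51 versus 0.99).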
Data:
The admission data set has 400 observations of 4 variables. The admit variable is binary: 1 means admitted and 0 means not admitted. gre is the student's GRE score and gpa is the student's GPA. rank represents the rank of the school the student graduated from; schools with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
Objective:
Carry out a logistic regression and make predictions on a test data set.
Data Preparation
#Loading necessary libraries
library(tidyverse) # For data manipulation
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(caret) # confusion matrix
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(e1071)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#Preparing the workspace
# Clear the workspace:
rm(list = ls())
# Load data
df <- read.csv("Admission.csv", header = TRUE)
head(df)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
nrow(df)
## [1] 400
colnames(df)
## [1] "admit" "gre" "gpa" "rank"
#Summary Statistics
summary(df)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
sapply(df, sd)
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
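It is also worth confirming that there are no missing values before fitting the model (the summary above would list NA counts if there were any); a quick check:
# Count missing values in each column; all zeros for this data set
colSums(is.na(df))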
#Data Structure
str(df)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
admit, gre, and rank are stored as integers, while gpa is numeric because it has decimal values. We need to convert admit and rank to categorical (factor) variables.
#Convert to Categorical Variable
newdf <- df %>%
mutate_at(vars(admit,rank), funs(factor))
str(newdf)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
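Note that funs() is deprecated in the version of dplyr loaded above; an equivalent conversion using across() (available in dplyr >= 1.0.0) would be:
# Equivalent, non-deprecated factor conversion with across()
newdf_alt <- df %>% mutate(across(c(admit, rank), factor))
# str(newdf_alt)   # identical structure to newdf above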
#Pairs plot
ggpairs(newdf)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can notice that GPA and GRE scores decrease slightly from schools with a rank of 1 to schools with a rank of 4. There are also a few outliers, which can be disregarded. The number of students who were rejected is considerably higher than the number of students who were admitted to the college.
# GPA and GRE average by Rank
newdf %>%
group_by(rank) %>%
summarise_at(vars(c(gpa,gre)), list(name = mean))
## # A tibble: 4 x 3
## rank gpa_name gre_name
## <fct> <dbl> <dbl>
## 1 1 3.45 612.
## 2 2 3.36 596.
## 3 3 3.43 575.
## 4 4 3.32 570.
This table indicates that students who attended Rank 1 schools have the highest average GRE and GPA, although the GPA differences across ranks are small.
#Two-way table of factor variables
xtabs(~admit + rank, data=newdf)
## rank
## admit 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
The data may be somewhat imbalanced, since there are relatively few students from Rank 1 schools compared with the other ranks.
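Converting the two-way table to row proportions makes the pattern clearer: computed from the counts above, the admission rate falls from about 54% for Rank 1 schools to about 18% for Rank 4 schools.
# Admission rate by rank: row proportions of the two-way table above
round(prop.table(xtabs(~rank + admit, data = newdf), margin = 1), 2)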
Data Partition
# Randomly assign each row to group 1 (~80%, training) or group 2 (~20%, test)
ind <- sample(2, nrow(newdf), replace = TRUE, prob = c(0.8, 0.2))
train <- newdf[ind == 1, ]
test <- newdf[ind == 2, ]
head(train)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 6 1 760 3.00 2
## 7 1 560 2.98 1
nrow(train)
## [1] 323
head(test)
## admit gre gpa rank
## 5 0 520 2.93 4
## 8 0 400 3.08 2
## 12 0 440 3.22 1
## 14 0 700 3.08 2
## 15 1 700 4.00 1
## 23 0 600 2.82 4
nrow(test)
## [1] 77
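Because sample() is not seeded above, the exact 323/77 split shown here will differ from run to run. For a reproducible and stratified split, one option is caret::createDataPartition(), already loaded above (a sketch; the seed value is arbitrary):
set.seed(123)                                       # arbitrary seed, for reproducibility only
idx <- createDataPartition(newdf$admit, p = 0.8, list = FALSE)
train_alt <- newdf[idx, ]                           # ~80% of rows, stratified by admit
test_alt <- newdf[-idx, ]                           # remaining ~20%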
Fit Model
In R, logistic regression is fitted with glm() (generalized linear models) using a binomial family with the logit link.
set.seed(111)
fit_glm <- glm(admit~gre+gpa+rank, data=train, family= binomial(link= logit))
summary(fit_glm)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial(link = logit),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4980 -0.8818 -0.6588 1.2005 2.0331
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.950995 1.252203 -2.357 0.018441 *
## gre 0.001559 0.001194 1.306 0.191722
## gpa 0.612632 0.366752 1.670 0.094835 .
## rank2 -0.554743 0.350989 -1.581 0.113989
## rank3 -1.432777 0.391457 -3.660 0.000252 ***
## rank4 -1.404152 0.465106 -3.019 0.002536 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 404.41 on 322 degrees of freedom
## Residual deviance: 376.30 on 317 degrees of freedom
## AIC: 388.3
##
## Number of Fisher Scoring iterations: 4
pchisq(fit_glm$deviance, df = fit_glm$df.residual, lower.tail = F)
## [1] 0.01227363
The summary() output shows the model call and the deviance residuals, followed by the coefficients. Based on the p-values, GPA is marginally significant (p < 0.10), GRE is not significant at conventional levels (p ≈ 0.19), and the Rank 3 and Rank 4 coefficients are highly significant (Rank 2 is not). Since rank is a factor (categorical variable), it enters the model as dummy variables, which is why three rank coefficients appear. The deviance goodness-of-fit test above (pchisq(), p ≈ 0.012) hints at some lack of fit, though this test should be read cautiously for ungrouped binary data.
Based on the coefficients, for every one-unit increase in GRE the log odds of admission increase by about 0.0016, and for every one-unit increase in GPA the log odds of admission increase by about 0.61.
Rank 1 is the reference level, so compared with Rank 1 the log odds of admission decrease by about 0.55 for Rank 2, by about 1.43 for Rank 3, and by about 1.40 for Rank 4. It is clear that the school's rank has a strong effect on admission.
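Because coefficients on the log-odds scale are hard to read directly, it often helps to exponentiate them into odds ratios:
# Coefficients as odds ratios; e.g. exp(0.61) ~ 1.85 means a one-point increase in GPA
# multiplies the odds of admission by about 1.85, holding the other predictors fixed
round(exp(coef(fit_glm)), 3)
# round(exp(confint(fit_glm)), 3)   # optional: profile-likelihood 95% CIs on the odds-ratio scale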
Prediction by Rank
newdata <- with(df, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
newdata$rankP <- predict(fit_glm, newdata = newdata, type = "response")
newdata <- newdata %>%
mutate_at(vars(gre,gpa,rankP), funs(round(.,2)))
newdata
## gre gpa rank rankP
## 1 587.7 3.39 1 0.51
## 2 587.7 3.39 2 0.37
## 3 587.7 3.39 3 0.20
## 4 587.7 3.39 4 0.20
The goal of this table is to show the effect of school rank on a student's chance of admission, holding GRE and GPA at their means. As shown, the predicted probability of being accepted into a graduate program is 0.51 for students from the highest-prestige schools, whereas it is 0.20 for students from the lowest-ranked schools.
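To see how the predicted probability changes with GRE as well, the same idea can be applied over a grid of GRE values (a sketch; the grid limits are chosen to match the observed GRE range, and the ggplot() call is optional):
# Predicted admission probability over a grid of GRE values, by rank, at the mean GPA
grid <- expand.grid(gre = seq(220, 800, by = 20),
                    gpa = mean(df$gpa),
                    rank = factor(1:4))
grid$prob <- predict(fit_glm, newdata = grid, type = "response")
# ggplot(grid, aes(gre, prob, colour = rank)) + geom_line()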
Prediction for Train data set
# Predicted probabilities on the training data (note: predict() takes newdata, not data)
p1 <- predict(fit_glm, newdata = train, type = 'response')
#Confusion Matrix::
#By hand
pred1 <- ifelse(p1>0.5,1,0)
table1 <- table(Predicted=pred1, Actual=train$admit)
table1
## Actual
## Predicted 0 1
## 0 204 83
## 1 16 20
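The accuracy can also be computed directly from this table, since the diagonal cells are the correct classifications:
# Overall accuracy from the confusion table: (204 + 20) / 323
sum(diag(table1)) / sum(table1)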
#By Caret library
pred1 <- as.factor(pred1)
admit1<- as.factor(train$admit)
result_train <-confusionMatrix(data=pred1,reference=admit1)
print(result_train)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 204 83
## 1 16 20
##
## Accuracy : 0.6935
## 95% CI : (0.6401, 0.7433)
## No Information Rate : 0.6811
## P-Value [Acc > NIR] : 0.3402
##
## Kappa : 0.1468
##
## Mcnemar's Test P-Value : 3.284e-11
##
## Sensitivity : 0.9273
## Specificity : 0.1942
## Pos Pred Value : 0.7108
## Neg Pred Value : 0.5556
## Prevalence : 0.6811
## Detection Rate : 0.6316
## Detection Prevalence : 0.8885
## Balanced Accuracy : 0.5607
##
## 'Positive' Class : 0
##
We can use the caret library for more detailed output. Our accuracy on the training data is about 69.4%, barely above the no-information rate of 68.1%, and the specificity of 19.4% shows that the model correctly identifies only a small share of the admitted students (note that the 'positive' class here is 0, not admitted).
Prediction for Test data set
ptest <- predict(fit_glm, test, type='response')
#Confusion Matrix::
pred_test <- ifelse(ptest>0.5,1,0)
pred_test<- as.factor(pred_test)
admit_test<- as.factor(test$admit)
result_test <-confusionMatrix(data=pred_test,reference=admit_test)
print(result_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 52 17
## 1 1 7
##
## Accuracy : 0.7662
## 95% CI : (0.6559, 0.8552)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.085555
##
## Kappa : 0.3337
##
## Mcnemar's Test P-Value : 0.000407
##
## Sensitivity : 0.9811
## Specificity : 0.2917
## Pos Pred Value : 0.7536
## Neg Pred Value : 0.8750
## Prevalence : 0.6883
## Detection Rate : 0.6753
## Detection Prevalence : 0.8961
## Balanced Accuracy : 0.6364
##
## 'Positive' Class : 0
##
On the test data we get an accuracy of about 76.6%, higher than the training accuracy, although with only 77 test observations (and a no-information rate of 68.8%) the difference should be interpreted with caution. As on the training data, the specificity is low (about 29%), so the model is much better at identifying rejected applicants than admitted ones.
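Since the 0.5 cutoff gives low specificity, a natural next step would be to look at the ROC curve and consider other cutoffs. A sketch, assuming the pROC package is installed (it is not loaded above):
# ROC curve and AUC for the test-set predicted probabilities (requires pROC)
# install.packages("pROC")    # if not already installed
library(pROC)
roc_test <- roc(response = test$admit, predictor = ptest)
auc(roc_test)                 # area under the ROC curve
# plot(roc_test)              # sensitivity/specificity trade-off across cutoffs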