Logistic Regression in R

Problem Statement

A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, affect admission into graduate school. The outcome variable, admit/don’t admit, is binary.
This data set has a binary response (outcome, dependent) variable called admit, which is equal to 1 if the individual was admitted to graduate school, and 0 otherwise. There are three predictor variables: gre, gpa, and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest. [Source : UCLA]

Reading Data into R

I have already downloaded the dataset into my laptop.
You can access the dataset here https://stats.idre.ucla.edu/wp-content/uploads/2016/02/binary.sas7bdat

library(rio)  #for data reading
data <- import("binary.sas7bdat")

Data Cleaning

Looking at the structure of the dataset

str(data)

## 'data.frame':    400 obs. of  4 variables:
##  $ ADMIT: num  0 1 1 1 0 1 1 0 1 0 ...
##  $ GRE  : num  380 660 800 640 520 760 560 400 540 700 ...
##  $ GPA  : num  3.61 3.67 4 3.19 2.93 ...
##  $ RANK : num  3 3 1 4 4 2 1 2 3 2 ...
##  - attr(*, "label")= chr "LOGIT"

Variables ADMIT and RANK are of type numeric but they should be factor variables since were are not going to perform any mathematical operations on them.

data$ADMIT <- as.factor(data$ADMIT)
data$RANK <- as.factor(data$RANK)
str(data)

## 'data.frame':    400 obs. of  4 variables:
##  $ ADMIT: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
##  $ GRE  : num  380 660 800 640 520 760 560 400 540 700 ...
##  $ GPA  : num  3.61 3.67 4 3.19 2.93 ...
##  $ RANK : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
##  - attr(*, "label")= chr "LOGIT"

Looking at the summary of the dataset

summary(data)

##  ADMIT        GRE             GPA        RANK   
##  0:273   Min.   :220.0   Min.   :2.260   1: 61  
##  1:127   1st Qu.:520.0   1st Qu.:3.130   2:151  
##          Median :580.0   Median :3.395   3:121  
##          Mean   :587.7   Mean   :3.390   4: 67  
##          3rd Qu.:660.0   3rd Qu.:3.670          
##          Max.   :800.0   Max.   :4.000

From the summary statistics we observe

Most of students did not get admitted
There are no missing data values(NAs).

Checking for multicollineality

plot(data$GPA,data$GRE,col="red")

cor(data$GRE,data$GPA)

## [1] 0.3842659

From the plot we can infer that the two numeric variables are not correlated which is confirmed by low correlation value of 0.3842659.

Exploratoty Data Analysis.

We will explore the relationship between dependent and independent variables by way of visualization.

GRE

Since GRE is numeric variable and dependent variable is factor variable, we plot a box plot.

library(ggplot2) #For plotting
ggplot(data,aes(ADMIT,GRE,fill=ADMIT))+
  geom_boxplot()+
  theme_bw()+
  xlab("Admit")+
  ylab("GRE")+
  ggtitle("ADMIT BY GRE")

The two box plots are differents in terms of displacement, and hence GRE is significant variable.

GPA

ggplot(data,aes(ADMIT,GPA,fill=ADMIT))+
  geom_boxplot()+
  theme_bw()+
  xlab("Admit")+
  ylab("GPA")+
  ggtitle("ADMIT BY GPA")

There is clear difference in displacement between the two box plots, hence GPA is an important predictor.

RANK

RANK is a factor variable and since the dependent variable is a factor variable we plot a bar plot.

ggplot(data,aes(RANK,ADMIT,fill=ADMIT))+
  geom_col()+
  xlab("RANK")+
  ylab("COUNT-ADMIT")+
  ggtitle("ADMIT BY RANK")

There is a clear pattern; as RANK increases the possibility of a student being admitted decreases.

Modelling

Data Splitting

Before we fit a model, we need to split the dataset into training and test dataset to be able to assess the performance of the model with the unseen test dataset.

library(caret)  #For data spliting
set.seed(125)   #For reproducibiity
ind <- createDataPartition(data$ADMIT,p=0.80,list = FALSE)
training <- data[ind,] #training data set
testing <- data[-ind,] #Testing data set

Fitting a logistic regression model

We will exclude the insignificant variable GRE we saw during EDA.

set.seed(123)
mymodel <- glm(ADMIT~GPA + RANK,data=training,family=binomial(link = "logit"))
summary(mymodel)

## 
## Call:
## glm(formula = ADMIT ~ GPA + RANK, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4258  -0.8728  -0.6686   1.1715   2.0585  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.7638     1.1877  -2.327 0.019965 *  
## GPA           0.8328     0.3358   2.480 0.013148 *  
## RANK2        -0.5532     0.3531  -1.567 0.117152    
## RANK3        -1.3937     0.3867  -3.604 0.000313 ***
## RANK4        -1.6086     0.4757  -3.381 0.000721 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 401.36  on 320  degrees of freedom
## Residual deviance: 374.80  on 316  degrees of freedom
## AIC: 384.8
## 
## Number of Fisher Scoring iterations: 4

From the model summary, we can see that all the predictor variables are significant as we expected.

Making Predictions on Testing dataset

We evaluate the accurancy of the model by making predictions on the testing dataset.

pred <- predict(mymodel,testing,type = "response")
pred <- ifelse(pred>=0.5,1,0)

Creating a confusion matrix

pred <- as.factor(pred)
confusionMatrix(pred,testing$ADMIT)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 52 17
##          1  2  8
##                                           
##                Accuracy : 0.7595          
##                  95% CI : (0.6502, 0.8486)
##     No Information Rate : 0.6835          
##     P-Value [Acc > NIR] : 0.089451        
##                                           
##                   Kappa : 0.3373          
##  Mcnemar's Test P-Value : 0.001319        
##                                           
##             Sensitivity : 0.9630          
##             Specificity : 0.3200          
##          Pos Pred Value : 0.7536          
##          Neg Pred Value : 0.8000          
##              Prevalence : 0.6835          
##          Detection Rate : 0.6582          
##    Detection Prevalence : 0.8734          
##       Balanced Accuracy : 0.6415          
##                                           
##        'Positive' Class : 0               
##

From the confusion matrix we have overall accurracy of 76%.

Calculating Area under the curve

library(lares)
tag.1 <- as.numeric(testing$ADMIT)
score.1 <- as.numeric(pred)
mplot_roc(tag=tag.1,score=score.1)

We have an AUC of about 64%, we which considerably good bearing in mind that our data set was small and imbalanced response variable

Results and Communication

The following equation fits our model. \[log(p/1-p)=y=-2.7638 +(0.8328*GPA)+(-0.5532*RANK2)+(-1.3937*RANK3)+(-1.6086*RANK4)\] Where p is the probability of a student being admitted.

Goodness-of-fit test

We test the fitness of our model by calculating the p-value

with(mymodel,pchisq(null.deviance-deviance,df.null-df.residual,lower.tail = F))

## [1] 2.447434e-05

With a p-value of 0.00002447434 which is less than 0.05, we conclude that our model is statistically significant.

Thank You