The goal of this analysis is to build a model that predicts whether a person is admitted to graduate school based on his/her GRE (Graduate Record Examination) score, GPA (Grade Point Average), and the rank of his/her undergraduate alma mater. There are four ranks, and rank 1 is the highest. The dependent variable is admit, and the predictors are gre, gpa, and rank.
library(readxl)
logData <- read_excel("D:/KARIR/R Portofolio/logisticregression/gre.xlsx")
head(logData)
## # A tibble: 6 x 4
## admit gre gpa rank
## <dbl> <dbl> <dbl> <dbl>
## 1 0 600 3.4 3
## 2 0 340 2.9 1
## 3 1 520 3.19 3
## 4 0 500 2.98 3
## 5 1 620 3.45 2
## 6 1 580 2.86 4
Inspect the structure of the data.
str(logData)
## tibble [300 x 4] (S3: tbl_df/tbl/data.frame)
## $ admit: num [1:300] 0 0 1 0 1 1 0 0 1 0 ...
## $ gre : num [1:300] 600 340 520 500 620 580 520 460 700 580 ...
## $ gpa : num [1:300] 3.4 2.9 3.19 2.98 3.45 2.86 2.85 3.07 3.52 3.46 ...
## $ rank : num [1:300] 3 1 3 3 2 4 3 2 4 4 ...
As we can see, the admit and rank variables are stored as numeric rather than as factors. Because both are categorical, we recast them as factors.
logData$admit <- as.factor(logData$admit)
logData$rank <- as.factor(logData$rank)
str(logData)
## tibble [300 x 4] (S3: tbl_df/tbl/data.frame)
## $ admit: Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 1 2 1 ...
## $ gre : num [1:300] 600 340 520 500 620 580 520 460 700 580 ...
## $ gpa : num [1:300] 3.4 2.9 3.19 2.98 3.45 2.86 2.85 3.07 3.52 3.46 ...
## $ rank : Factor w/ 4 levels "1","2","3","4": 3 1 3 3 2 4 3 2 4 4 ...
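Recasting rank as a factor matters because R then expands it into indicator (dummy) columns, with the first level as the reference. The following is a minimal sketch on a hypothetical toy vector (not the original data) showing the design matrix that a formula such as `~ rank` produces; this is why the model summary reports separate rank2, rank3, and rank4 coefficients.

```r
# Hypothetical toy data: one observation per rank (not the gre.xlsx data)
toy <- data.frame(rank = factor(c(1, 2, 3, 4)))

# model.matrix() shows how a factor is expanded into indicator columns.
# With the default treatment contrasts, level "1" is the reference,
# so only rank2, rank3, and rank4 get their own columns.
X <- model.matrix(~ rank, data = toy)
print(X)
```

If rank were left numeric, the model would instead fit a single slope and assume the same change in log-odds between every pair of adjacent ranks.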
Admit is a binary response variable with two classes: class 0 corresponds to not admitted and class 1 to admitted. Let us check the class distribution of the response variable.
table(logData$admit)
##
## 0 1
## 173 127
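The counts above are easier to judge as proportions. This sketch rebuilds a vector with the same class counts (173 and 127) so that it runs without the Excel file; with the real data, `prop.table(table(logData$admit))` should give the same result.

```r
# Rebuild a response vector with the class counts reported above
# (stand-in for logData$admit so the example is self-contained)
admit <- factor(rep(c(0, 1), times = c(173, 127)))

# Convert counts to proportions
props <- prop.table(table(admit))
print(round(props, 3))
```

Roughly 58% of applicants were not admitted and 42% were admitted, so the classes are only mildly imbalanced.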
Next, we fit a logistic regression model of admit on gre, gpa, and rank.
logModel <- glm(admit ~ gre + gpa + rank, data = logData, family = "binomial")
summary(logModel)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = logData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8847 -0.7765 -0.3580 0.7054 2.8140
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.376416 1.656450 -6.264 3.75e-10 ***
## gre 0.012637 0.001937 6.525 6.80e-11 ***
## gpa 1.159424 0.416520 2.784 0.00538 **
## rank2 -0.647369 0.429061 -1.509 0.13135
## rank3 -1.275603 0.457073 -2.791 0.00526 **
## rank4 -1.737568 0.520840 -3.336 0.00085 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 408.81 on 299 degrees of freedom
## Residual deviance: 288.33 on 294 degrees of freedom
## AIC: 300.33
##
## Number of Fisher Scoring iterations: 5
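For reference, the fitted model in the summary above can be written on the logit scale as follows. Coefficients are rounded to three decimals (four for gre), p denotes the probability of admission, and rank2, rank3, and rank4 are 0/1 indicators for the alma mater's rank, with rank 1 as the reference level.

```latex
\log\!\left(\frac{p}{1-p}\right)
  = -10.376
  + 0.0126\,\mathrm{gre}
  + 1.159\,\mathrm{gpa}
  - 0.647\,\mathrm{rank2}
  - 1.276\,\mathrm{rank3}
  - 1.738\,\mathrm{rank4}
```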
We have to be careful in interpreting the output of the logistic model because it is on the link scale (logit); the coefficients are therefore changes in log-odds, not in probability. The coefficients of gre and gpa are positive, which indicates that the chance of admission to graduate school increases with gre and gpa. On the other hand, the rank coefficients are negative, so applicants from schools of rank 2 to 4 have lower odds of admission than applicants from first-rank schools. By taking the exponent of the coefficient values, we get the odds ratios.
exp(coefficients(logModel)[2:6])
## gre gpa rank2 rank3 rank4
## 1.0127168 3.1880974 0.5234212 0.2792626 0.1759478
According to the odds ratio of gre, a 1-unit increase in gre multiplies the odds of admission to graduate school by 1.013, holding the other predictors constant. For gpa, a 1-unit increase multiplies the odds of admission by 3.188. Note that these are multipliers of the odds, not of the probability. For rank, the first rank is the reference level for the others. So, a person whose alma mater is second rank has odds of admission 0.523 times those of a person from a first-rank school, and so on for rank 3 (0.279) and rank 4 (0.176). The rank2 coefficient, however, is not significant at the 5% level (p = 0.131).
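To make the difference between odds and probability concrete, here is a small sketch that uses the coefficients copied from the summary output. The 100-point GRE step and the 30% baseline probability are hypothetical values chosen only for illustration.

```r
# Coefficients copied from the summary(logModel) output above
b_gre <- 0.012637
b_gpa <- 1.159424

# Odds ratios: per 1 GRE point and per 100 GRE points (illustrative step)
or_per_point <- exp(b_gre)
or_per_100 <- exp(100 * b_gre)

# An odds ratio multiplies the odds, not the probability.
# Assume a hypothetical baseline admission probability of 30%.
baseline_p <- 0.30
baseline_odds <- baseline_p / (1 - baseline_p)

# Effect of +1 GPA point on the odds, then converted back to a probability
new_odds <- baseline_odds * exp(b_gpa)
new_p <- new_odds / (1 + new_odds)

print(c(or_per_point = or_per_point, or_per_100 = or_per_100, new_p = new_p))
```

Under these assumptions the odds roughly triple for one extra GPA point, but the probability rises from 0.30 to about 0.58 rather than tripling, because the change in probability depends on the baseline.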
In this example, we predict the probability of admission for a person with a GRE score of 700, a GPA of 3.50, and a first-rank alma mater.
predict(logModel, newdata = list(gre=700, gpa=3.50, rank="1"), type = "response")
## 1
## 0.9260248
So, his/her predicted probability of admission to graduate school is 92.6%.
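As a check, the same prediction can be reproduced by hand from the coefficients in the summary output: compute the linear predictor (log-odds) and apply the inverse logit, which is `plogis()` in base R. This is a sketch using the rounded printed coefficients, so the result may differ from `predict()` in the last decimals; the rank 4 comparison is an added illustration, not part of the original analysis.

```r
# Coefficients copied from the summary(logModel) output above
b <- c(intercept = -10.376416, gre = 0.012637, gpa = 1.159424,
       rank2 = -0.647369, rank3 = -1.275603, rank4 = -1.737568)

# Linear predictor (log-odds) for GRE = 700, GPA = 3.50, rank 1.
# Rank 1 is the reference level, so no rank term is added.
eta_rank1 <- b[["intercept"]] + b[["gre"]] * 700 + b[["gpa"]] * 3.50

# Inverse logit turns log-odds into a probability
p_rank1 <- plogis(eta_rank1)

# Same applicant from a fourth-rank school (illustrative comparison)
p_rank4 <- plogis(eta_rank1 + b[["rank4"]])

print(round(c(rank1 = p_rank1, rank4 = p_rank4), 3))
```

The rank 1 value agrees with the 0.926 returned by `predict()`, and the comparison shows how much the predicted probability drops for the same scores at a fourth-rank school.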