Most of the following text is from the Wikipedia page on Logistic Regression, (https://en.wikipedia.org/wiki/Logistic_regression). The R code in this document was authored by John Akwei, Data Scientist, in order to demonstrate R programming for the Logistic Regression results, and graphs.
Example: Probability of passing an exam versus hours of study
A group of 20 students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam?
Hours <- c(0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25,
2.50, 2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75,
5.00, 5.50)
Pass <- c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1)
HrsStudying <- data.frame(Hours, Pass)
The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).
HrsStudying_Table <- t(HrsStudying); HrsStudying_Table
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## Hours 0.5 0.75 1 1.25 1.5 1.75 1.75 2 2.25 2.5 2.75 3 3.25
## Pass 0.0 0.00 0 0.00 0.0 0.00 1.00 0 1.00 0.0 1.00 0 1.00
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## Hours 3.5 4 4.25 4.5 4.75 5 5.5
## Pass 0.0 1 1.00 1.0 1.00 1 1.0
The graph shows the probability of passing the exam versus the number of hours studying, with the logistic regression curve fitted to the data.
library(ggplot2)
ggplot(HrsStudying, aes(Hours, Pass)) +
geom_point(aes()) +
geom_smooth(method='glm', family="binomial", se=FALSE) +
labs (x="Hours Studying", y="Probability of Passing Exam",
title="Probability of Passing Exam vs Hours Studying")
Graph of a logistic regression curve showing probability of passing an exam versus hours studying
The logistic regression analysis gives the following output.
model <- glm(Pass ~.,family=binomial(link='logit'),data=HrsStudying)
model$coefficients
## (Intercept) Hours
## -4.077713 1.504645
Coefficient Std.Error z-value P-value (Wald)
Intercept -4.0777 1.7610 -2.316 0.0206
Hours 1.5046 0.6287 2.393 0.0167
The output indicates that hours studying is significantly associated with the probability of passing the exam (p=0.0167, Wald test). The output also provides the coefficients for Intercept = -4.0777 and Hours = 1.5046. These coefficients are entered in the logistic regression equation to estimate the probability of passing the exam:
Probability of passing exam =1/(1+exp(-(-4.0777+1.5046* Hours)))
For example, for a student who studies 2 hours, entering the value Hours =2 in the equation gives the estimated probability of passing the exam of p=0.26:
StudentHours <- 2
ProbabilityOfPassingExam <- 1/(1+exp(-(-4.0777+1.5046*StudentHours)))
ProbabilityOfPassingExam
## [1] 0.2556884
Probability of passing exam =1/(1+exp(-(-4.0777+1.5046*2))) = 0.26.
Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is p=0.87:
Probability of passing exam =1/(1+exp(-(-4.0777+1.5046*4))) = 0.87.
StudentHours <- 4
ProbabilityOfPassingExam <- 1/(1+exp(-(-4.0777+1.5046*StudentHours)))
ProbabilityOfPassingExam
## [1] 0.874429
This table shows the probability of passing the exam for several values of hours studying.
ExamPassTable <- data.frame(column1=c(1, 2, 3, 4, 5),
column2=c(1/(1+exp(-(-4.0777+1.5046*1))),
1/(1+exp(-(-4.0777+1.5046*2))),
1/(1+exp(-(-4.0777+1.5046*3))),
1/(1+exp(-(-4.0777+1.5046*4))),
1/(1+exp(-(-4.0777+1.5046*5)))))
names(ExamPassTable) <- c("Hours of study", "Probability of passing exam")
ExamPassTable
## Hours of study Probability of passing exam
## 1 1 0.07088985
## 2 2 0.25568845
## 3 3 0.60732935
## 4 4 0.87442903
## 5 5 0.96909067