Show how derivative = B/4
at the steepest point on the curve, P is often very close to 0.5
If p = 0.5, we know log(odds) = a + bx = 1, so odds = exp(a + bx) = 1, which means odds = exp(0)
Plugging this into the derivative equation gives slope = B/4
Read in the dataset wells.csv and do some data exploration
wells <- read.table(file = "wells.csv", header = TRUE, sep = ",")
Get familiar with the data
str(wells) #gives no. of observations (rows) and variables (columns), and the data type of each variable
## 'data.frame': 3020 obs. of 5 variables:
## $ switch : int 1 1 0 1 1 1 1 1 1 1 ...
## $ dist100: num 0.168 0.473 0.21 0.215 0.409 ...
## $ arsenic: num 2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
## $ assoc : int 0 0 0 0 1 1 1 0 1 1 ...
## $ educ : int 0 0 10 12 14 9 4 10 0 0 ...
head(wells)
## switch dist100 arsenic assoc educ
## 1 1 0.16826 2.36 0 0
## 2 1 0.47322 0.71 0 0
## 3 0 0.20967 2.07 0 10
## 4 1 0.21486 1.15 0 12
## 5 1 0.40874 1.10 1 14
## 6 1 0.69518 3.90 1 9
tail(wells)
## switch dist100 arsenic assoc educ
## 3015 0 0.26320 0.96 0 0
## 3016 0 0.19347 0.52 1 5
## 3017 0 0.21386 1.08 1 3
## 3018 0 0.07708 0.51 0 4
## 3019 0 0.22842 0.64 0 3
## 3020 1 0.20844 0.66 1 5
summary(wells) #provides summary statistics for each variable
## switch dist100 arsenic assoc
## Min. :0.0000 Min. :0.00387 Min. :0.510 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.21117 1st Qu.:0.820 1st Qu.:0.0000
## Median :1.0000 Median :0.36762 Median :1.300 Median :0.0000
## Mean :0.5752 Mean :0.48332 Mean :1.657 Mean :0.4228
## 3rd Qu.:1.0000 3rd Qu.:0.64041 3rd Qu.:2.200 3rd Qu.:1.0000
## Max. :1.0000 Max. :3.39531 Max. :9.650 Max. :1.0000
## educ
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 5.000
## Mean : 4.828
## 3rd Qu.: 8.000
## Max. :17.000
sum(wells$switch) #To know how many households switched
## [1] 1737
Use function attach(wells) so we don’t need to write
wells$ every time we want to use a variable (e.g.,
wells$swicth becomes switch)
attach(wells)
Create a histogram for every covariate and get familiar with the
dataset. Note: the argument breaks allow you to define the
increment of the x-axis
hist (dist100,
xlab = "Distance (in 100 meters) to the nearest safe well",
ylab = "Frequency",
main = "Histogram of dist100",
breaks = seq(0, 0.25+max(dist100[!is.na(dist100)]), 0.25)) #x-axis starts at 0 and goes to just above the maximum value in dist100, in increments of 0.25.
#Note: max(dist100[!is.na(dist100)])calculates the maximum value in the dist100 column, but it excludes any missing values
hist (arsenic,
xlab = "Arsenic concentration in well water",
ylab = "Frequency",
main = "Histogram of arsenic",
breaks = seq(0, 0.25+max(arsenic[!is.na(arsenic)]), 0.25))
#Note: for arsenic using breaks helps visualising that we don't have values of 0:
hist (assoc,
xlab = "Part of a community association",
ylab = "Frequency",
main = "Histogram of Community Association")
hist (educ,
xlab = "Education years",
ylab = "Frequency",
main = "Histogram of Education")
fit <- glm (switch ~ dist100, data = wells, family = binomial(link="logit"))
summary(fit)
##
## Call:
## glm(formula = switch ~ dist100, family = binomial(link = "logit"),
## data = wells)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.60596 0.06031 10.047 < 2e-16 ***
## dist100 -0.62188 0.09743 -6.383 1.74e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 4076.2 on 3018 degrees of freedom
## AIC: 4080.2
##
## Number of Fisher Scoring iterations: 4
log(Odds):
log(odds) = α + β * x
log(odds of switch) = 0.61 - 0.62 * dist100
Odds:
odds of switch = exp(0.61 - 0.62 * dist100)
Probability:
P(switch) = odds/1+odds = exp(0.61 - 0.62 * Dist100) / (1 + exp(0.61 - 0.62 * Dist100))
OR
invlogit(0.61 - 0.62 * dist100)
Extra interpretation:
For every one unit increase in dist100, the log(odds of switch) decreases by 0.62. The coefficient estimate for dist100 is significant as p < 0.05 (p = 1.74e-10 ).
Therefore, for every one unit change in dist100, the odds of switch will have a multiplicative change of exp(-0.62) = 0.54.
Install package to use function invlogit()
if(!require(arm)) install.packages("arm",repos = "http://cran.us.r-project.org")
## Loading required package: arm
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
## Warning in check_dep_version(): ABI version mismatch:
## lme4 was built with Matrix ABI version 1
## Current Matrix ABI version is 0
## Please re-install lme4 from source or restore original 'Matrix' package
##
## arm (Version 1.14-4, built: 2024-4-1)
## Working directory is /Users/zarafatima/BIOL2001 Week 10
1) α (i.e., the “intercept”): The intercept is to be interpreted when dist100 = 0
invlogit(0.61 - 0.62 * 0)
## [1] 0.6479408
The probability of switching wells when living right next to the safe well (i.e. when dist100 = 0) is 65%
2) β (i.e., “slope” / rate of change):
The slope is not constant in logistic regression, so we use the derivative at the midpoint which we know is β/4:
-0.62/4
## [1] -0.155
The probability of switching wells when adding a unit (i.e, 100m) to the distance to a safe well at the MIDPOINT of the curve, decreases by 15% (i.e. the further you live from a safe well, the less probability of switching to it)
Remember: β/4 is the maximum slope when P ~ 0.5, and represents the maximum difference in probability corresponding to a unit change in 𝑥
3) Slope at the average (mean) distance (and indicate the probability at that value of the covariate):
To calcualte the slope at any value of our covariate, we calculate the derivative at that value
Calculate mean distance
mean(dist100)
## [1] 0.4833186
Calculate the slope/derivate at the mean distance
(-0.62 * exp(0.61 - 0.62 * mean(dist100))) / (1 + exp(0.61 - 0.62 * mean(dist100)))^2
## [1] -0.151327
The probability of switching wells when adding a unit (i.e, 100m) to the distance to a safe well at x = mean(dist100), decreases by 15% (i.e. the further you live from a safe well, the less probability of switching to it)
Interpretation: the slope obtained at the midpoint of the logistic curve and at x=mean(dist100) have the same values to 2 decimal places, meaning the midpoint of the logistic curve (i.e., where P is expected to be 0.5), is “close” to the point where x = mean(dist100)
Calculate the probability at the mean distance
invlogit(0.61 - 0.62 * mean(dist100))
## [1] 0.5769688
When living a mean distance of 0.48 (i.e. 48m?) from the safe well, the probability of switching to that well is 58%
Remember: this is the exact probability when x = 0.48
plot(dist100,
jitter(switch, 0.05), #adds a small amount of jitter to the switch variable
xlab = "Distance (in 100 meters) to nearest safe well",
ylab = "Probability of Switch")
curve(invlogit(coef(fit)[1] + coef(fit)[2] * x), add=TRUE)
fit2 <- glm (switch ~ dist100 + arsenic, family=binomial(link="logit"))
summary(fit2)
##
## Call:
## glm(formula = switch ~ dist100 + arsenic, family = binomial(link = "logit"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.002749 0.079448 0.035 0.972
## dist100 -0.896644 0.104347 -8.593 <2e-16 ***
## arsenic 0.460775 0.041385 11.134 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3930.7 on 3017 degrees of freedom
## AIC: 3936.7
##
## Number of Fisher Scoring iterations: 4
Note: residual deviance decreases meaning arsenic is a good predictor variable to add
P(switch) = invlogit(0.002 - 0.90 * dist100 + 0.46 * arsenic)
α (i.e., the “intercept”)
The intercept is to be interpreted when dist100 = 0 and arsenic = 0. BUT Arsenic is never = 0, so we can’t interpret the intercept in this case
β (i.e., “slope” / rate of change) for each covariate
Slope/derivative at the midpoint/steepest point:
βdist100
-0.90/4
## [1] -0.225
With arsenic kept constant, the probability of switching wells when adding a unit (i.e, 100m) to the distance to a safe well decreases by 22.5% (at the midpoint of the curve)
βarsenic
0.46/4
## [1] 0.115
With dis100 kept constant, the probability of switching wells when adding a unit to the arsenic concentration increases by 11.5% (at the midpoint of the curve)
par(mfrow = c(1,2))
#Plot for dist100 (with arsenic constant)
plot(dist100,
jitter(switch, 0.05),
xlab = "Distance (in 100 meters) to nearest safe well",
ylab="Switch")
curve (invlogit(coef(fit2)[1] + coef(fit2)[2]*x + coef(fit2)[3]*mean(arsenic)), lwd=.5, add=TRUE)
#Plot for arsenic (with dist100 constant)
plot(arsenic,
jitter(switch, 0.05),
xlab="Arsenic concentration in well water",
ylab="Switch")
curve (invlogit(coef(fit2)[1] + coef(fit2)[2]*mean(dist100) + coef(fit2)[3]*x), lwd=.5, add=TRUE)
par(mfrow = c(1,1))
Interpretation for dist100: if arsenic is constant, as the distance from the safe well increases, the probability of switching decreases
Interpretation for arsenic: if distance is constant, as the level of arsenic increases, the probability of switching increases ________________________________________________________________________________
Note: From the lecture, you already know that we should centre the predictors before adding an interaction
Centre the predictors and add them to the table
wells$c.dist100 <- dist100 - mean(dist100)
wells$c.arsenic <- arsenic - mean(arsenic)
Re-attach data
attach(wells)
## The following objects are masked from wells (pos = 7):
##
## arsenic, assoc, dist100, educ, switch
fit3 <- glm (switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic, family=binomial(link="logit"))
summary(fit3)
##
## Call:
## glm(formula = switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic,
## family = binomial(link = "logit"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.35109 0.03985 8.810 <2e-16 ***
## c.dist100 -0.87365 0.10480 -8.337 <2e-16 ***
## c.arsenic 0.46951 0.04207 11.159 <2e-16 ***
## c.dist100:c.arsenic -0.17891 0.10233 -1.748 0.0804 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3927.6 on 3016 degrees of freedom
## AIC: 3935.6
##
## Number of Fisher Scoring iterations: 4
P(switch) = invlogit(0.35 - 0.88 * c.dist100 + 0.47 * c.arsenic - 0.18 * c.dist100:c.arsenic)
α (i.e., the “intercept”)
The intercept is to be interpreted when c.dist100 = 0 and c.arsenic = 0
Because we centred the predictors, c.dist100 = 0 and c.arsenic = 0 now correspond to the mean values of the predictors, and therefore we can interpret the intercept.
invlogit(0.35)
## [1] 0.5866176
At average values of predictors, the probability of switching to a safe well is 59%.
β (i.e., “slope” / rate of change) for each covariate
Slope/derivative at the midpoint/steepest point for each covaraite:
βc.dist100
- 0.88/4
## [1] -0.22
At the mean levels of arsenic, each one unit increase in dist100 (i.e., adding 100 m) corresponds to 22% decrease in the probability of switching wells.
βc.arsenic
0.47/4
## [1] 0.1175
At the mean levels of dist100, each one unit increase in arsenic concentration corresponds to 12% increase in the probability of switching wells
βc.dist100:arsenic
We can interpret it in two ways:
For each additional unit of arsenic, the coefficient -0.18 is added to the coefficient of dist100
For each additional 100m in dist100, the coefficient -0.18 is added to the coefficient of arsenic
par(mfrow = c(1,2))
#Plot for dist100 (with arsenic constant)
plot(dist100, jitter(switch,0.05),
xlab = "Distance (in 100 meters) to nearest safe well",
ylab = "Switch")
#If arsenic = 0.5
curve (invlogit(coef(fit3)[1] + coef(fit3)[2]*x + coef(fit3)[3]*0.50 + coef(fit3)[4]*(x)*0.50), lwd=.5, add=TRUE)
#If arsenic = 1.0
curve (invlogit(coef(fit3)[1] + coef(fit3)[2]*x + coef(fit3)[3]*1 + coef(fit3)[4]*(x)*1), lwd=.5, add=TRUE)
#Add text
text (x = 0.50, y = 0.37, "if As = 0.5", adj=0, cex=.8)
text (x = 0.75, y = 0.55, "if As = 1.0", adj=0, cex=.8)
#Plot for arsenic (with distance constant)
plot(arsenic, jitter(switch,0.05),
xlab="Arsenic concentration in well water",
ylab="Switch")
#If distance = 0
curve (invlogit(coef(fit3)[1] + coef(fit3)[2]*0 + coef(fit3)[3]*x + coef(fit3)[4]*0*(x)), lwd=.5, add=TRUE)
#If distance = 0.5
curve (invlogit(coef(fit3)[1]+coef(fit3)[2]*0.5+coef(fit3)[3]*x+coef(fit3)[4]*0.5*(x)), lwd=.5, add=TRUE)
#Add text
text (x = 1.5, y = 0.8, "if dist = 0", adj=0, cex=.8)
text (x = 2.2, y = 0.6, "if dist = 0.5", adj=0, cex=.8)
par(mfrow = c(1,1))
Note: when you split the levels of arsenic/distance, you see the change in the slope differs for each level - this indicates you may have interaction between the two covariates
Transform educ, divinging it by 4
wells$educ4 <- educ/4
Re-attach data
attach(wells)
## The following objects are masked from wells (pos = 3):
##
## arsenic, assoc, c.arsenic, c.dist100, dist100, educ, switch
## The following objects are masked from wells (pos = 8):
##
## arsenic, assoc, dist100, educ, switch
fit4 <- glm (switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic + assoc + educ4,
family=binomial(link="logit"))
summary(fit4)
##
## Call:
## glm(formula = switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic +
## assoc + educ4, family = binomial(link = "logit"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20252 0.06930 2.922 0.00347 **
## c.dist100 -0.87528 0.10507 -8.330 < 2e-16 ***
## c.arsenic 0.47531 0.04229 11.238 < 2e-16 ***
## assoc -0.12319 0.07698 -1.600 0.10953
## educ4 0.16779 0.03838 4.372 1.23e-05 ***
## c.dist100:c.arsenic -0.16123 0.10225 -1.577 0.11482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3905.4 on 3014 degrees of freedom
## AIC: 3917.4
##
## Number of Fisher Scoring iterations: 4
The predictor “assoc” resulted in a non-significant predictor, and can therefore be removed from the model.
However, note that the deviance decreased from 3927.6 to 3905.4 when we added two new predictors:
fit3$deviance
## [1] 3927.628
fit4$deviance
## [1] 3905.351
This change is > 2, which would be expected from noise added to the model.
Remember: When k predictors are added to a model, we expect deviance to decrease by more than k
So there is some effect, but small compared to the addition of other predictors, e.g, from fit to fit2
fit$deviance
## [1] 4076.238
fit2$deviance
## [1] 3930.668
We will see what changes when removing assoc from the model
fit5 <- glm (switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic + educ4,
family=binomial(link="logit"))
Get model output
summary(fit5)
##
## Call:
## glm(formula = switch ~ c.dist100 + c.arsenic + c.dist100:c.arsenic +
## educ4, family = binomial(link = "logit"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.14844 0.06044 2.456 0.0141 *
## c.dist100 -0.87462 0.10510 -8.322 < 2e-16 ***
## c.arsenic 0.47663 0.04228 11.273 < 2e-16 ***
## educ4 0.16922 0.03833 4.415 1.01e-05 ***
## c.dist100:c.arsenic -0.16291 0.10235 -1.592 0.1115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4118.1 on 3019 degrees of freedom
## Residual deviance: 3907.9 on 3015 degrees of freedom
## AIC: 3917.9
##
## Number of Fisher Scoring iterations: 4
Note: deviance only went up 2 units when we removed “assoc”, so we will stick with removing assoc from model
fit4$deviance
## [1] 3905.351
fit5$deviance
## [1] 3907.911
P(switch) = invlogit(0.15 - 0.88 * c.dist100 + 0.47 * c.arsenic + 0.17 * educ4 - 0.16 * c.dist100:c.arsenic)
α (i.e., the “intercept”)
The intercept is to be interpreted when c.dist100 = 0, c.arsenic = 0 and educ4 = 0
Because we centred the predictors, c.dist100 = 0 and c.arsenic = 0 now correspond to the mean values of the predictors
In this case, educ4 can be equal to 0. If you want to know how many times educ4 is equal to zero you can do this:
length(which(educ4 ==0))
## [1] 889
Therefore we can interpret the intercept after replacing the predictor values in the equation:
invlogit(0.15)
## [1] 0.5374298
Meaning that at average values of dist100 and arsenic, and at educ4 = 0, the probability of switching to a safe well is 54 %
β (i.e., “slope” / rate of change) for each covariate
Slope/derivative at the midpoint/steepest point for each covaraite:
βc.dist100
- 0.88/4
## [1] -0.22
Keeping all other predictors constant (i.e., at mean(arsenic), and zero educ), each one unit increase in dist100 (i.e., adding 100 m) corresponds to 22% decrease in the probability of switching wells.
βc.arsenic
0.47/4
## [1] 0.1175
Keeping all other predictors constant (i.e., at mean(distance), and zero educ), each one unit increase in arsenic corresponds to 12% increase in the probability of switching wells
βeduc4
0.17/4
## [1] 0.0425
Keeping all other predictors constant (i.e., at mean(arsenic) and mean(dist100)), each one unit increase in education (i.e. adding 4 years of education) corresponds to 4% increase in the probability of switching wells
βc.dist100:arsenic
The interaction term can be interpeted in one of two ways:
For each additional unit of arsenic, the coefficient -0.16 is added to the coefficient of dist100
For each additional 100m in dist100, the coefficient -0.16 is added to the coefficient of arsenic
Now we have 3 different predictors, so we do 3 plots
par(mfrow = c(1,3))
#Plot for dist100 (with arsenic and educ4 constant)
plot(dist100, jitter(switch,0.05),
xlab = "Distance (in 100 meters) to nearest safe well",
ylab = "Switch")
#If arsenic = 0.50 and educ4 = 1
curve(invlogit(coef(fit5)[1] + coef(fit5)[2]*x + coef(fit5)[3]*0.50 + coef(fit5)[4]*1 + coef(fit5)[5]*(x)*0.50), lwd=.5, add=TRUE)
#If arsenic = 0.50 and educ4 = 0
curve(invlogit(coef(fit5)[1] + coef(fit5)[2]*x + coef(fit5)[3]*0.50 + coef(fit5)[4]*0 + coef(fit5)[5]*(x)*0.50), lwd=.5, add=TRUE)
#If arsenic = 1 and educ4 = 0
curve(invlogit(coef(fit5)[1] + coef(fit5)[2]*x + coef(fit5)[3]*1 + coef(fit5)[4]*0 + coef(fit5)[5]*(x)*1), lwd=.5, add=TRUE)
#Add text
text(x = 0.65, y = 0.45, "if As = 0.5 & Educ4 = 1", adj=0, cex=.8)
text(x = 0.50, y = 0.37, "if As = 0.5 & Educ4 = 0", adj=0, cex=.8)
text(x = 0.75, y = 0.55, "if As = 1.0 & Educ4 = 0", adj=0, cex=.8)
#Plot for arsenic (with dist100 and educ4 constant)
plot(arsenic, jitter(switch,0.05),
xlab = "Arsenic concentration in well water",
ylab = "Switch")
#If dist100 = 0 and educ4 = 1
curve(invlogit(coef(fit5)[1] + coef(fit5)[2]*0 + coef(fit5)[3]*x + coef(fit5)[4]*1 + coef(fit5)[5]*0*(x)), lwd=.5, add=TRUE)
#If dist100 = 0 and educ4 = 0
curve (invlogit(coef(fit5)[1] + coef(fit5)[2]*0 + coef(fit5)[3]*x + coef(fit5)[4]*0 + coef(fit5)[5]*0*(x)), lwd=.5, add=TRUE)
#If dist100 = 0.5 and educ4 = 0
curve(invlogit(coef(fit5)[1] + coef(fit5)[2]*0.5 + coef(fit5)[3]*x + coef(fit5)[4]*0 + coef(fit5)[5]*0.5*(x)), lwd=.5, add=TRUE)
text (1.7, .7, "if dist = 0 & Educ4 = 1", adj=0, cex=.8)
text (1.5, .8, "if dist = 0 & Educ4 = 0", adj=0, cex=.8)
text (2.2, .6, "if dist = 0.5 & Educ4 = 0", adj=0, cex=.8)
#Plot for educ4 (with dist100 and arsenic constant)
plot(educ4, jitter(switch,0.05),
xlab = "Education years (every 4)",
ylab = "Switch")
#If dist100 = 0.5 and arsenic = 0.5
curve(invlogit(coef(fit5)[1] + coef(fit5)[2]*0.5 + coef(fit5)[3]*0.5 + coef(fit5)[4]*x + coef(fit5)[5]*0.5*0.5), lwd=.5, add=TRUE)
#If dist100 = 0 and arsenic = 0.5
curve(invlogit(coef(fit5)[1] + coef(fit5)[2]*0 + coef(fit5)[3]*0.5 + coef(fit5)[4]*x + coef(fit5)[5]*0*0.5), lwd=.5, add=TRUE)
#If dist100 = 0.5 and arsenic = 1
curve(invlogit(coef(fit5)[1] + coef(fit5)[2]*0.5 + coef(fit5)[3]*1 + coef(fit5)[4]*x + coef(fit5)[5]*0.5*1), lwd=.5, add=TRUE)
#Add text
text (1.7, .7, "if dist = 0.5 & As = 0.5", adj=0, cex=.8)
text (1.5, .8, "if dist = 0 & As = 0.5", adj=0, cex=.8)
text (2.2, .6, "if dist = 0.5 & As = 1", adj=0, cex=.8)
par(mfrow = c(1,1))