#An appendix to logistic regression
###Update: Yesterday, when using the Bolger data, you were able to compare the fit of one model fit to the original data frame and one with the outlier removed using base graphics. I am going to recreate that comparison here using ggplot.
First, load the data. I will first load the original dataframe and then make an updated dataframe with the outlier removed.
library(ggplot2)
bolger <- read.csv("bolger.csv")
summary(bolger)
## PERSHRUB DISTX AGE RODENTSP
## Min. : 5.00 Min. : 40.0 Min. : 2.00 Min. :0.00
## 1st Qu.:23.00 1st Qu.: 304.0 1st Qu.:18.00 1st Qu.:0.00
## Median :50.00 Median : 662.0 Median :32.00 Median :0.00
## Mean :49.12 Mean : 878.1 Mean :36.84 Mean :0.48
## 3rd Qu.:75.00 3rd Qu.:1219.0 3rd Qu.:56.00 3rd Qu.:1.00
## Max. :95.00 Max. :2865.0 Max. :86.00 Max. :1.00
#Now I will make another dataset that is identical except that I have
#taken out observation 19.
bolger.updated <- bolger[-19,]
summary(bolger.updated)
## PERSHRUB DISTX AGE RODENTSP
## Min. : 5.00 Min. : 40.0 Min. : 2.00 Min. :0.0000
## 1st Qu.:28.25 1st Qu.: 288.8 1st Qu.:17.50 1st Qu.:0.0000
## Median :50.00 Median : 635.5 Median :31.50 Median :0.0000
## Mean :50.75 Mean : 841.0 Mean :34.79 Mean :0.4583
## 3rd Qu.:75.00 3rd Qu.:1054.8 3rd Qu.:51.50 3rd Qu.:1.0000
## Max. :95.00 Max. :2865.0 Max. :79.00 Max. :1.0000
Next, re-create the best-fit model (using the original data):
rodent.mod <- glm(formula = RODENTSP ~ PERSHRUB, data = bolger, family = "binomial")
summary(rodent.mod)
##
## Call:
## glm(formula = RODENTSP ~ PERSHRUB, family = "binomial", data = bolger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4827 -0.6052 -0.2415 0.5421 2.5218
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.90342 1.55775 -2.506 0.01222 *
## PERSHRUB 0.07662 0.02878 2.663 0.00775 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34.617 on 24 degrees of freedom
## Residual deviance: 20.049 on 23 degrees of freedom
## AIC: 24.049
##
## Number of Fisher Scoring iterations: 5
Now, we should be able to follow our normal procedure of 1) plotting the data points and 2) adding the curves representing our model fit. Let’s try. Again, I will use the original data so that we get all of the points on the plot.
rodent.plot <- ggplot(bolger, aes(x = PERSHRUB, y = RODENTSP)) +
geom_point() +
theme_minimal()
rodent.plot
Adding the glm for the rodent model should be easy because it uses only one predictor variable.
rodent.plot +
stat_smooth(method = "glm", se = FALSE, method.args = list(family="binomial"))
But how do we add a line from a different dataset? We should be able
to do this by predicting the outcome variable based on the model using
the predict function. Here we will use the dataset with the
outlier removed. This means we will also need to update our rodent.mod
to include the updated dataset rather than the original dataset.
#first make the updated model
rodent.mod.updated <- glm(formula = RODENTSP ~ PERSHRUB, data = bolger.updated,
family = "binomial")
#next predict the response
bolger.updated$response <- (predict(rodent.mod.updated,
newdata = bolger.updated, type = "response"))
#make sure it was added to your bolger.updated dataframe
summary(bolger.updated)
## PERSHRUB DISTX AGE RODENTSP
## Min. : 5.00 Min. : 40.0 Min. : 2.00 Min. :0.0000
## 1st Qu.:28.25 1st Qu.: 288.8 1st Qu.:17.50 1st Qu.:0.0000
## Median :50.00 Median : 635.5 Median :31.50 Median :0.0000
## Mean :50.75 Mean : 841.0 Mean :34.79 Mean :0.4583
## 3rd Qu.:75.00 3rd Qu.:1054.8 3rd Qu.:51.50 3rd Qu.:1.0000
## Max. :95.00 Max. :2865.0 Max. :79.00 Max. :1.0000
## response
## Min. :0.000205
## 1st Qu.:0.012236
## Median :0.318446
## Mean :0.458333
## 3rd Qu.:0.971650
## Max. :0.999062
Now add the line represented the response variable to your rodent.plot object using geom_line().
rodent.plot +
stat_smooth(method = "glm", se = FALSE, method.args = list(family="binomial")) +
geom_line(data = bolger.updated, aes(y=response), color = "red")
And that’s the basic idea.
###NOTE: One thing you will notice right away about the new curve we made (the red one) is that it is much rougher than our stat_smooth() line. This is because when ggplot calculates our curve for us it pretends that the predictor variable has no gaps in observations, whereas when we made our own prediction line we only had 24 observations to work with.
So - if you want to make the curve smoother so it matches what you
have in your base graphics curves, you’ll have to make a new dataframe
that simulates the actual spread of your data, specify
length.out = as some big number, and then generate the
predictions of your response variable using the predict()
function. Then you can add that line to your plot rather than the one I
made here.