logistic-appendix

#An appendix to logistic regression

###Update: Yesterday, when using the Bolger data, you were able to compare the fit of one model fit to the original data frame and one with the outlier removed using base graphics. I am going to recreate that comparison here using ggplot.

First, load the data. I will first load the original dataframe and then make an updated dataframe with the outlier removed.

library(ggplot2)
bolger <- read.csv("bolger.csv")
summary(bolger)

##     PERSHRUB         DISTX             AGE           RODENTSP   
##  Min.   : 5.00   Min.   :  40.0   Min.   : 2.00   Min.   :0.00  
##  1st Qu.:23.00   1st Qu.: 304.0   1st Qu.:18.00   1st Qu.:0.00  
##  Median :50.00   Median : 662.0   Median :32.00   Median :0.00  
##  Mean   :49.12   Mean   : 878.1   Mean   :36.84   Mean   :0.48  
##  3rd Qu.:75.00   3rd Qu.:1219.0   3rd Qu.:56.00   3rd Qu.:1.00  
##  Max.   :95.00   Max.   :2865.0   Max.   :86.00   Max.   :1.00

#Now I will make another dataset that is identical except that I have
#taken out observation 19.
bolger.updated <- bolger[-19,]
summary(bolger.updated)

##     PERSHRUB         DISTX             AGE           RODENTSP     
##  Min.   : 5.00   Min.   :  40.0   Min.   : 2.00   Min.   :0.0000  
##  1st Qu.:28.25   1st Qu.: 288.8   1st Qu.:17.50   1st Qu.:0.0000  
##  Median :50.00   Median : 635.5   Median :31.50   Median :0.0000  
##  Mean   :50.75   Mean   : 841.0   Mean   :34.79   Mean   :0.4583  
##  3rd Qu.:75.00   3rd Qu.:1054.8   3rd Qu.:51.50   3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :2865.0   Max.   :79.00   Max.   :1.0000

Next, re-create the best-fit model (using the original data):

rodent.mod <- glm(formula = RODENTSP ~ PERSHRUB, data = bolger, family = "binomial")
summary(rodent.mod)

## 
## Call:
## glm(formula = RODENTSP ~ PERSHRUB, family = "binomial", data = bolger)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4827  -0.6052  -0.2415   0.5421   2.5218  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -3.90342    1.55775  -2.506  0.01222 * 
## PERSHRUB     0.07662    0.02878   2.663  0.00775 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.617  on 24  degrees of freedom
## Residual deviance: 20.049  on 23  degrees of freedom
## AIC: 24.049
## 
## Number of Fisher Scoring iterations: 5

Now, we should be able to follow our normal procedure of 1) plotting the data points and 2) adding the curves representing our model fit. Let’s try. Again, I will use the original data so that we get all of the points on the plot.

rodent.plot <- ggplot(bolger, aes(x = PERSHRUB, y = RODENTSP)) +
                geom_point() +
                theme_minimal()

rodent.plot

Adding the glm for the rodent model should be easy because it uses only one predictor variable.

rodent.plot + 
 stat_smooth(method = "glm", se = FALSE, method.args = list(family="binomial"))

But how do we add a line from a different dataset? We should be able to do this by predicting the outcome variable based on the model using the predict function. Here we will use the dataset with the outlier removed. This means we will also need to update our rodent.mod to include the updated dataset rather than the original dataset.

#first make the updated model
rodent.mod.updated <- glm(formula = RODENTSP ~ PERSHRUB, data = bolger.updated,
                          family = "binomial") 
#next predict the response
bolger.updated$response <- (predict(rodent.mod.updated, 
                           newdata = bolger.updated, type = "response"))
#make sure it was added to your bolger.updated dataframe
summary(bolger.updated)

##     PERSHRUB         DISTX             AGE           RODENTSP     
##  Min.   : 5.00   Min.   :  40.0   Min.   : 2.00   Min.   :0.0000  
##  1st Qu.:28.25   1st Qu.: 288.8   1st Qu.:17.50   1st Qu.:0.0000  
##  Median :50.00   Median : 635.5   Median :31.50   Median :0.0000  
##  Mean   :50.75   Mean   : 841.0   Mean   :34.79   Mean   :0.4583  
##  3rd Qu.:75.00   3rd Qu.:1054.8   3rd Qu.:51.50   3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :2865.0   Max.   :79.00   Max.   :1.0000  
##     response       
##  Min.   :0.000205  
##  1st Qu.:0.012236  
##  Median :0.318446  
##  Mean   :0.458333  
##  3rd Qu.:0.971650  
##  Max.   :0.999062

Now add the line represented the response variable to your rodent.plot object using geom_line().

rodent.plot +
  stat_smooth(method = "glm", se = FALSE, method.args = list(family="binomial")) +
  geom_line(data = bolger.updated, aes(y=response), color = "red")

And that’s the basic idea.

###NOTE: One thing you will notice right away about the new curve we made (the red one) is that it is much rougher than our stat_smooth() line. This is because when ggplot calculates our curve for us it pretends that the predictor variable has no gaps in observations, whereas when we made our own prediction line we only had 24 observations to work with.

So - if you want to make the curve smoother so it matches what you have in your base graphics curves, you’ll have to make a new dataframe that simulates the actual spread of your data, specify length.out = as some big number, and then generate the predictions of your response variable using the predict() function. Then you can add that line to your plot rather than the one I made here.

logistic-appendix

Becca

5 oktober 2018