Possion Regression with Bikeshare Data

options(warn = -1)
library(ISLR2)

Abstract

This report discusses the use of the Poisson regression model on Poisson data and demonstrates why Poisson model is preferred over a linear model using the Bikeshare data from the ISLR2 package. To perform this analysis, first we fit a least squares linear regression model to the data and plot the coefficient estimates for the mnth and hr variables. We then fit a Poisson model to the Bikeshare data to reproduce the plots and use the predict function to obtain the fitted values. From this analysis, we find that the prediction from the Linear model and the Poisson model are almost the same but the Poisson model produces non-negative values. This finding helps us conclude that Poisson Regression is best used for rare events, and it is only utilized for numerical, persistent data.

Introduction

The Bikeshare data set measures the number of bike rentals (bikers) per hour in Washington, DC. The goal of this analysis is to determine how the month and hour of the bike rentals affect the number of riders. Even though the data set has 14 predictor variables, we are only concerned about mnth and hr. To perform this analysis, we first have to choose the appropriate model to be used. First we use a least squares linear regression model to predict and plot the estimates for mnth and hr The goal of this analysis is to compare the outputs from a linear model and Poisson model and show why Poisson model is preferred for data sets with response variables of count type. The first step of this analysis is attaching the Bikeshare data and viewing its variables.

attach(Bikeshare)
names(Bikeshare)

##  [1] "season"     "mnth"       "day"        "hr"         "holiday"   
##  [6] "weekday"    "workingday" "weathersit" "temp"       "atemp"     
## [11] "hum"        "windspeed"  "casual"     "registered" "bikers"

We then fit a least squares linear regression model to the data then observe the coeffecients for the months and hours.

mod.lm <- lm(
bikers ∼ mnth + hr + workingday + temp + weathersit ,
data = Bikeshare)
summary(mod.lm)

## 
## Call:
## lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     data = Bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.00  -45.70   -6.23   41.08  425.29 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -68.632      5.307 -12.932  < 2e-16 ***
## mnthFeb                      6.845      4.287   1.597 0.110398    
## mnthMarch                   16.551      4.301   3.848 0.000120 ***
## mnthApril                   41.425      4.972   8.331  < 2e-16 ***
## mnthMay                     72.557      5.641  12.862  < 2e-16 ***
## mnthJune                    67.819      6.544  10.364  < 2e-16 ***
## mnthJuly                    45.324      7.081   6.401 1.63e-10 ***
## mnthAug                     53.243      6.640   8.019 1.21e-15 ***
## mnthSept                    66.678      5.925  11.254  < 2e-16 ***
## mnthOct                     75.834      4.950  15.319  < 2e-16 ***
## mnthNov                     60.310      4.610  13.083  < 2e-16 ***
## mnthDec                     46.458      4.271  10.878  < 2e-16 ***
...

In mod.lm, the first level of hr (0) and mnth (Jan) are treated as the baseline values, and so no coefficient estimates are provided for them: implicitly, their coefficient estimates are zero, and all other levels are measured relative to these baselines.

Methods

To demonstrate that the coding of the mnth and hr variables does not affect the predicted estimates, we created another second model with numbered values for mnth and hr.

contrasts (Bikeshare$hr) = contr.sum (24)
contrasts (Bikeshare$mnth) = contr.sum (12)
mod.lm2 <- lm(
bikers ∼ mnth + hr + workingday + temp + weathersit ,
data = Bikeshare
)
summary (mod.lm2)

## 
## Call:
## lm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     data = Bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -299.00  -45.70   -6.23   41.08  425.29 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 73.5974     5.1322  14.340  < 2e-16 ***
## mnth1                      -46.0871     4.0855 -11.281  < 2e-16 ***
## mnth2                      -39.2419     3.5391 -11.088  < 2e-16 ***
## mnth3                      -29.5357     3.1552  -9.361  < 2e-16 ***
...

In mod.lm2, the coefficient estimate for the last level of mnth is not zero: instead, it equals the negative of the sum of the coefficient estimates for all of the other levels. Similarly, in mod.lm2, the coefficient estimate for the last level of hr is the negative of the sum of the coefficient estimates for all of the other levels. This indicates that the difference between the mean level and the coefficients of hr and mnth in mod.lm2 will always total to zero. For instance, the January coefficient of 46.087 implies that there are normally 46 variables when all other factors are held constant. In January, there were fewer cyclists than in the previous year.

To show that this different style of coding doesnt matter provided that we interpret the model output correctly in light of the coding used, we compute the sum of squared differences.

sum (( predict (mod.lm) - predict (mod.lm2))^2)

## [1] 1.426274e-18

The sum of squared differences is almost zero, which means that both models produce the same predictions despite differences in coding.

Now, we consider instead fitting a Poisson regression model to the Bikeshare data. This is done by using the glm() function and setting the argument for family as poisson.

mod.pois <- glm (
bikers ∼ mnth + hr + workingday + temp + weathersit ,
data = Bikeshare , family = poisson
)
summary(mod.pois)

## 
## Call:
## glm(formula = bikers ~ mnth + hr + workingday + temp + weathersit, 
##     family = poisson, data = Bikeshare)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -20.7574   -3.3441   -0.6549    2.6999   21.9628  
## 
## Coefficients:
##                            Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                4.118245   0.006021  683.964  < 2e-16 ***
## mnth1                     -0.670170   0.005907 -113.445  < 2e-16 ***
## mnth2                     -0.444124   0.004860  -91.379  < 2e-16 ***
## mnth3                     -0.293733   0.004144  -70.886  < 2e-16 ***
## mnth4                      0.021523   0.003125    6.888 5.66e-12 ***
## mnth5                      0.240471   0.002916   82.462  < 2e-16 ***
...

Results

In order to visualize the outputs of the models, we plot the response estimates against the coefficients for both the linear model and the Poisson model. But first, it is important to obtain the coefficient estimates associated with mnth. The coefficients for January through November can be obtained directly from the mod.lm2 object. The coefficient for December must be explicitly computed as the negative sum of all the other months.

coef.months <- c( coef (mod.lm2)[2:12],
-sum ( coef (mod.lm2)[2:12]))
plot (coef.months , xlab = " Month ", ylab = " Coefficient ",
xaxt = "n", col = " blue ", pch = 19, type = "o")
axis (side = 1, at = 1:12, labels = c("J", "F", "M", "A",
"M", "J", "J", "A", "S", "O", "N", "D"))

The next step is to reproduce the right-hand side of the plot (hr).

coef.hours <- c(coef(mod.lm2)[13:35],
-sum (coef(mod.lm2)[13:35]))
plot(coef.hours , xlab = " Hour ", ylab = " Coefficient ",
col = " blue ", pch = 19, type = "o")

We then plot plot the coefficients associated with mnth and hr, in order to reproduce the above plots but using our Poisson model.

Plot for Poisson Model

coef.mnth <- c(coef (mod.pois)[2:12],
-sum ( coef (mod.pois)[2:12]))
plot(coef.mnth , xlab = " Month ", ylab = " Coefficient ",
xaxt = "n", col = " blue ", pch = 19, type = "o")
axis(side = 1, at = 1:12, labels = c("J", "F", "M", "A", "M",
"J", "J", "A", "S", "O", "N", "D"))

coef.hours <- c(coef(mod.pois)[13:35],
-sum (coef(mod.pois)[13:35]))
plot(coef.hours , xlab = " Hour ", ylab = " Coefficient ",
col = " blue ", pch = 19, type = "o")

Finally, we once again use the predict() function to obtain the fitted values (predictions) from this Poisson regression model. However, we must use the argument type = “response” which tells R to output probabilities of the form \(P(Y = 1|X)\) , as opposed to other information such as the logit

plot (predict(mod.lm2), predict(mod.pois, type = "response"))
abline (0, 1, col = 2, lwd = 3)

Discussion

From the plots above, we can clearly observe that the predictions from the Poisson regression model are correlated with those from the linear model; however, our Poisson model produces non-negative outpts. For Poisson regression the responses at each level of \(X\) become more variable with increasing means, where \(variance=mean\). In addition, the mean values of \(Y\) at each level of \(X\), \(λ_{Y|X}\), fall on a curve, not a line, although the logs of the means should follow a line.

As a result, at either very low or very high levels of ridership, the Poisson regression predictions tend to be larger than those from the linear model. Based on the goal of this analysis, we can draw the conclusion that a Poisson random variable is often used to model counts. Since a Poisson random variable is a count, its minimum value is zero and, in theory, the maximum is unbounded. And therefore, a Poisson model is best to use when our response is the count of something, which in this case is the count of bikers.

Appendix

Code used to truncate outputs.

hook_output_default <- knitr::knit_hooks$get('output')

truncate_to_lines <- function(x, n) {
   if (!is.null(n)) {
      x = unlist(stringr::str_split(x, '\n'))
      if (length(x) > n) {
         # truncate the output
         x = c(head(x, n), '...\n')
      }
      x = paste(x, collapse = '\n') # paste first n lines together
   }
   x
}

knitr::knit_hooks$set(output = function(x, options) {
   max.lines <- options$max.lines
   x <- truncate_to_lines(x, max.lines)

   hook_output_default(x, options)
})