1 Bike Sharing Analysis - Poisson Regression

I will be analyzing a bike sharing dataset consist of two CSV files to model urban mobility patterns. I’ll use a Poisson regression model to predict daily bike demand and tap into that data to forecast exactly how many bikes are needed each day.

library(readr)
library(dplyr)
library(corrplot)
library(car)
library(MASS)

First thing, lets load our data into R

Now I m going to start with day dataset that shows bike rental traffic. Here is data dictionary for our first dataset

  • instant : record index
  • dteday : date
  • season : season (1:spring, 2:summer, 3:fall, 4:winter)
  • yr : year (0: 2011, 1:2012)
  • mnth : month ( 1 to 12)
  • hr : hour (0 to 23)
  • holiday : weather day is holiday or not
  • weekday : day of the week
  • workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
  • weathersit :
    • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
  • temp : Normalized temperature in Celsius. The values are divided to 41 (max)
  • atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
  • hum: Normalized humidity. The values are divided to 100 (max)
  • windspeed: Normalized wind speed. The values are divided to 67 (max)
  • casual: count of casual users
  • registered: count of registered users
  • cnt: count of total rental bikes including both casual and registered

cnt will be our response variable.

1.1 Exploratory Data Analysis

str(bike_day)
## spc_tbl_ [731 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ instant   : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Date[1:731], format: "2011-01-01" "2011-01-02" ...
##  $ season    : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: num [1:731] 654 670 1229 1454 1518 ...
##  $ cnt       : num [1:731] 985 801 1349 1562 1600 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   instant = col_double(),
##   ..   dteday = col_date(format = ""),
##   ..   season = col_double(),
##   ..   yr = col_double(),
##   ..   mnth = col_double(),
##   ..   holiday = col_double(),
##   ..   weekday = col_double(),
##   ..   workingday = col_double(),
##   ..   weathersit = col_double(),
##   ..   temp = col_double(),
##   ..   atemp = col_double(),
##   ..   hum = col_double(),
##   ..   windspeed = col_double(),
##   ..   casual = col_double(),
##   ..   registered = col_double(),
##   ..   cnt = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Besides date variable all the variables are numeric form, there are categorical variables we need to transform into factor format to analyze outcome more precisely.

bike_day$season <- as.factor(bike_day$season)
bike_day$holiday <- as.factor(bike_day$holiday)
bike_day$workingday <- as.factor(bike_day$workingday)
bike_day$weekday <- as.factor(bike_day$weekday)
bike_day$yr <- as.factor(bike_day$yr)
library(GGally)

ggpairs(bike_day, 
        columns = c("season", "workingday", "temp", "cnt"),
        aes(color = season, alpha = 0.4))

ggpairs(bike_day, 
        columns = c("temp", "atemp", "hum", "windspeed", "cnt"),
        aes(color = season, alpha = 0.4))

ggpairs(bike_day, 
        columns = c("casual", "registered", "cnt", "temp"),
        aes(color = workingday, alpha = 0.4))

1.2 Positive and Negative Correlations

  • Temperature (temp) and Feeling Temperature (atemp) show a very strong positive relationship with a correlation of 0.99. This indicates that actual temperature and perceived temperature move almost identically.

  • Temperature (temp) and Total Rentals (cnt) show a strong positive relationship (Cor ≈ 0.63), and Atemp and cnt are similarly strongly correlated (Cor ≈ 0.63). This suggests that warmer weather is associated with higher bike rental demand.

  • Casual and Registered Users with Total Rentals (cnt) show very strong positive relationships (Cor ≈ 0.67 and 0.95, respectively). This indicates that both user types contribute heavily to total demand, with registered users having the strongest influence.

  • Casual and Registered Users show a moderate positive relationship (Cor ≈ 0.40). This suggests that when casual usage increases, registered usage also tends to increase, though not as strongly.

  • Humidity (hum) and Total Rentals (cnt) show a weak negative relationship (Cor ≈ -0.10). This implies that higher humidity slightly reduces bike rental activity.

  • Wind Speed (windspeed)and Total Rentals(cnt) show a moderate negative relationship (Cor ≈ -0.24). This suggests that higher wind speeds are associated with lower bike usage.

  • Temperature (temp) and Humidity (hum) show a weak positive relationship (Cor ≈ 0.13–0.25 depending on season), indicating a slight tendency for warmer days to also be more humid.

  • Temperature (temp) and Wind Speed (windspeed) show a weak negative relationship (Cor ≈ -0.16 to -0.24), suggesting that warmer days tend to have slightly lower wind speeds.

  • Registered Users and Total Rentals (cnt) show a very strong positive relationship with a correlation of 0.95. This indicates that as the number of registered users increases, total bike rentals increase almost proportionally, making it the strongest driver of demand.

  • Casual Users and Total Rentals (cnt) also show a strong positive relationship (Cor ≈ 0.67). This suggests that increases in casual riders are associated with higher overall bike usage, though not as strongly as registered users.

  • Temperature (temp) and Total Rentals (cnt) show a strong positive relationship (Cor ≈ 0.63), and Atemp and cnt are similarly correlated (Cor ≈ 0.66). This indicates that warmer weather significantly increases bike rental activity.

  • Humidity (hum) and Total Rentals (cnt) show a weak negative relationship (Cor ≈ -0.10). This suggests that higher humidity slightly reduces bike rentals, but the effect is relatively small.

  • Wind Speed (windspeed) and Total Rentals (cnt) show a moderate negative relationship (Cor ≈ -0.24). This implies that windier conditions tend to discourage bike usage and reduce total rentals.

1.3 Fitting the model

poisson_full <- glm(cnt ~ .-dteday,
                   family = poisson(link = "log"),
                   data = bike_day)
summary(poisson_full)
## 
## Call:
## glm(formula = cnt ~ . - dteday, family = poisson(link = "log"), 
##     data = bike_day)
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  7.234e+00  4.949e-03 1461.792  < 2e-16 ***
## instant     -1.220e-03  6.407e-05  -19.039  < 2e-16 ***
## season2      7.926e-02  2.530e-03   31.323  < 2e-16 ***
## season3      4.604e-02  3.477e-03   13.242  < 2e-16 ***
## season4      1.119e-01  3.735e-03   29.967  < 2e-16 ***
## yr1          4.393e-01  2.357e-02   18.638  < 2e-16 ***
## mnth         3.432e-02  1.941e-03   17.681  < 2e-16 ***
## holiday1    -1.958e-02  4.153e-03   -4.714 2.43e-06 ***
## weekday1    -2.263e-02  2.935e-03   -7.712 1.24e-14 ***
## weekday2    -3.416e-02  3.016e-03  -11.326  < 2e-16 ***
## weekday3    -5.652e-02  3.080e-03  -18.349  < 2e-16 ***
## weekday4    -4.510e-02  3.050e-03  -14.789  < 2e-16 ***
## weekday5    -2.838e-02  2.780e-03  -10.210  < 2e-16 ***
## weekday6    -1.662e-02  2.126e-03   -7.818 5.36e-15 ***
## workingday1         NA         NA       NA       NA    
## weathersit  -2.850e-02  1.477e-03  -19.296  < 2e-16 ***
## temp        -1.116e-02  2.266e-02   -0.492    0.622    
## atemp        3.078e-01  2.487e-02   12.375  < 2e-16 ***
## hum         -8.762e-03  5.760e-03   -1.521    0.128    
## windspeed   -5.205e-02  8.383e-03   -6.209 5.33e-10 ***
## casual       1.785e-04  1.577e-06  113.178  < 2e-16 ***
## registered   2.228e-04  9.432e-07  236.189  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 668801  on 730  degrees of freedom
## Residual deviance:  40903  on 710  degrees of freedom
## AIC: 48343
## 
## Number of Fisher Scoring iterations: 4

The null deviance is 668,801 and the residual deviance is 40,903, meaning the model explains the vast majority of the deviance in daily bike counts — a strong overall fit. A rough pseudo-R² (1 − 40903/668801) ≈ 0.939, suggesting the model accounts for about 93.9% of the deviance in cnt.

The AIC of 48,343 provides a baseline for comparing against our future models with adjustments.

workingday1 was dropped (shown as NA), because it is perfectly collinear with the combination of weekday and holiday. It carries no additional information once those variables are included.

1.3.1 Key Significant Predictors

casual and registered are overwhelmingly strong predictors (z values 113 and 236), which makes sense since cnt is literally their sum.

yr1 (coeff ≈ 0.439), indicating that Year 1 (2012) had higher rentals compared to Year 0 (2011) as the reference category.

season coefficients show that Season 4 has the highest boost relative to Season 1 (the reference), while Season 3 is the weakest gainer.

atemp (coeff ≈ 0.308) has a strong positive effect, while temp alone is not significant (p = 0.622), suggesting perceived temperature matters more than actual temperature.

weathersit (coeff ≈ −0.029) confirms that worse weather conditions reduce rentals, as expected.windspeed (coeff ≈ −0.052) has a significant negative effect, with higher winds discouraging riders.

weekday coefficients are all negative relative to Sunday (the reference 0), with Wednesday (weekday3) showing the largest dip, suggesting mid-week casual ridership is lowest.

holiday1 (coeff ≈ −0.020) slightly reduces ridership, possibly because commuter trips drop on holidays.

instant (coeff ≈ −0.00122) has a small but significant negative trend over the index of days, which may reflect seasonality or other time-based patterns not fully captured elsewhere.

1.3.2 Non-Significant Predictors

temp (p = 0.622) and hum (p = 0.128) are not statistically significant, suggesting that once atemp and other variables are controlled for, raw temperature and humidity doesn’t add unique explanatory power.

1.4 Reduced Model

poisson_reduced <- glm(cnt ~ instant + season +
                      yr + mnth + holiday +
                      weekday + weathersit + atemp +
                      windspeed + casual + registered,
                   family = poisson(link = "log"),
                   data = bike_day)

summary(poisson_reduced)
## 
## Call:
## glm(formula = cnt ~ instant + season + yr + mnth + holiday + 
##     weekday + weathersit + atemp + windspeed + casual + registered, 
##     family = poisson(link = "log"), data = bike_day)
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  7.230e+00  4.228e-03 1710.239  < 2e-16 ***
## instant     -1.230e-03  6.372e-05  -19.311  < 2e-16 ***
## season2      7.932e-02  2.523e-03   31.439  < 2e-16 ***
## season3      4.630e-02  3.401e-03   13.611  < 2e-16 ***
## season4      1.121e-01  3.734e-03   30.038  < 2e-16 ***
## yr1          4.432e-01  2.344e-02   18.911  < 2e-16 ***
## mnth         3.457e-02  1.934e-03   17.871  < 2e-16 ***
## holiday1    -1.985e-02  4.149e-03   -4.784 1.72e-06 ***
## weekday1    -2.249e-02  2.932e-03   -7.670 1.71e-14 ***
## weekday2    -3.401e-02  3.010e-03  -11.296  < 2e-16 ***
## weekday3    -5.640e-02  3.072e-03  -18.359  < 2e-16 ***
## weekday4    -4.483e-02  3.038e-03  -14.753  < 2e-16 ***
## weekday5    -2.818e-02  2.761e-03  -10.206  < 2e-16 ***
## weekday6    -1.653e-02  2.126e-03   -7.777 7.44e-15 ***
## weathersit  -2.977e-02  1.230e-03  -24.199  < 2e-16 ***
## atemp        2.934e-01  7.082e-03   41.423  < 2e-16 ***
## windspeed   -4.881e-02  7.960e-03   -6.132 8.69e-10 ***
## casual       1.787e-04  1.564e-06  114.272  < 2e-16 ***
## registered   2.228e-04  9.414e-07  236.707  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 668801  on 730  degrees of freedom
## Residual deviance:  40905  on 712  degrees of freedom
## AIC: 48342
## 
## Number of Fisher Scoring iterations: 4

The AIC is 1 point lower (48,342 vs 48,343) in the reduced model, which means it achieves essentially the same fit with fewer predictors, so removing non-significant variables (temp, hum) actually improved our model slightly.

The residual deviance is nearly identical (40,905 vs 40,903), meaning almost no explanatory power was lost by dropping temp and hum, confirming they were not contributing meaningfully.

The refined model gains 2 extra degrees of freedom (712 vs 710), reflecting the removal of those two predictors, which makes the model more parsimonious.

1.5 Model Remedy & Diagnostics

vif(poisson_reduced)
##                  GVIF Df GVIF^(1/(2*Df))
## instant    528.750548  1       22.994576
## season      14.820284  3        1.567266
## yr         425.020842  1       20.616034
## mnth       113.369841  1       10.647527
## holiday      1.318175  1        1.148118
## weekday      4.684057  6        1.137327
## weathersit   1.238158  1        1.112726
## atemp        3.558116  1        1.886297
## windspeed    1.120140  1        1.058367
## casual       4.270711  1        2.066570
## registered   6.455198  1        2.540708

instant (22.99) and yr (20.62) are extremely highly collinear with each other and other time-related variables. Both essentially capture the same time progression. therefore we will only utilize yr and drop instant.

mnth (10.65) is also severely collinear, likely because it overlaps heavily with both instant and season.Therefore we can drop mnth.

poisson_reduced2 <- glm(cnt ~ yr + season + holiday +
                      weekday + weathersit + atemp +
                      windspeed + casual + registered,
                   family = poisson(link = "log"),
                   data = bike_day)

summary(poisson_reduced2)
## 
## Call:
## glm(formula = cnt ~ yr + season + holiday + weekday + weathersit + 
##     atemp + windspeed + casual + registered, family = poisson(link = "log"), 
##     data = bike_day)
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  7.239e+00  3.929e-03 1842.671  < 2e-16 ***
## yr1         -7.273e-03  2.045e-03   -3.556 0.000377 ***
## season2      7.417e-02  2.402e-03   30.875  < 2e-16 ***
## season3      3.167e-02  2.817e-03   11.241  < 2e-16 ***
## season4      8.951e-02  2.343e-03   38.199  < 2e-16 ***
## holiday1    -1.802e-02  4.142e-03   -4.352 1.35e-05 ***
## weekday1    -2.217e-02  2.928e-03   -7.571 3.70e-14 ***
## weekday2    -3.342e-02  3.005e-03  -11.120  < 2e-16 ***
## weekday3    -5.613e-02  3.068e-03  -18.296  < 2e-16 ***
## weekday4    -4.437e-02  3.034e-03  -14.621  < 2e-16 ***
## weekday5    -2.832e-02  2.759e-03  -10.265  < 2e-16 ***
## weekday6    -1.713e-02  2.125e-03   -8.061 7.57e-16 ***
## weathersit  -2.877e-02  1.228e-03  -23.424  < 2e-16 ***
## atemp        2.924e-01  7.075e-03   41.333  < 2e-16 ***
## windspeed   -4.831e-02  7.945e-03   -6.080 1.20e-09 ***
## casual       1.796e-04  1.560e-06  115.113  < 2e-16 ***
## registered   2.230e-04  9.389e-07  237.451  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 668801  on 730  degrees of freedom
## Residual deviance:  41308  on 714  degrees of freedom
## AIC: 48740
## 
## Number of Fisher Scoring iterations: 4

Our AIC scored increased by ~400. This means that dropped variables provided predictive information even though they were highly multicollinear.

At this point it would be though for me to find the way to resolve collinearity and AIC dilemma, therefore I will do forward selection method and see what is computer decide.

null_model <- glm(cnt ~ 1, family = poisson(link = "log"), data = bike_day)

full_model <- glm(cnt ~ . - dteday, family = poisson(link = "log"), data = bike_day)

forward_model <- step(null_model,
                      scope = list(lower = null_model, upper = full_model),
                      direction = "forward",
                      trace = 1) 
## Start:  AIC=676201.4
## cnt ~ 1
## 
##              Df Deviance    AIC
## + registered  1   122821 130224
## + casual      1   421442 428845
## + atemp       1   422111 429513
## + temp        1   426458 433860
## + instant     1   426519 433921
## + season      3   435960 443366
## + yr          1   471448 478851
## + weathersit  1   612585 619988
## + mnth        1   621040 628443
## + windspeed   1   634555 641958
## + hum         1   662645 670047
## + weekday     6   664853 672266
## + holiday     1   665788 673190
## + workingday  1   666510 673913
## <none>            668801 676201
## 
## Step:  AIC=130223.5
## cnt ~ registered
## 
##              Df Deviance    AIC
## + casual      1    48209  55613
## + weekday     6    78946  86361
## + workingday  1    79576  86981
## + atemp       1    99493 106897
## + temp        1   100443 107847
## + season      3   101283 108692
## + weathersit  1   119615 127019
## + windspeed   1   121845 129250
## + holiday     1   122299 129703
## + yr          1   122359 129764
## + mnth        1   122569 129974
## + instant     1   122691 130096
## + hum         1   122734 130138
## <none>            122821 130224
## 
## Step:  AIC=55613.39
## cnt ~ registered + casual
## 
##              Df Deviance   AIC
## + season      3    44354 51765
## + atemp       1    45468 52875
## + temp        1    45820 53227
## + yr          1    46113 53520
## + instant     1    47416 54823
## + mnth        1    47558 54965
## + weekday     6    47712 55129
## + weathersit  1    47981 55388
## + windspeed   1    48086 55493
## + hum         1    48123 55530
## + holiday     1    48183 55589
## + workingday  1    48191 55598
## <none>             48209 55613
## 
## Step:  AIC=51764.56
## cnt ~ registered + casual + season
## 
##              Df Deviance   AIC
## + atemp       1    42522 49934
## + temp        1    42748 50160
## + weathersit  1    43802 51214
## + instant     1    43863 51276
## + yr          1    43962 51375
## + weekday     6    43966 51388
## + windspeed   1    44290 51702
## + mnth        1    44306 51719
## + holiday     1    44326 51738
## + hum         1    44336 51749
## <none>             44354 51765
## + workingday  1    44354 51766
## 
## Step:  AIC=49934.26
## cnt ~ registered + casual + season + atemp
## 
##              Df Deviance   AIC
## + weathersit  1    41882 49297
## + weekday     6    42027 49452
## + hum         1    42294 49709
## + instant     1    42428 49843
## + workingday  1    42465 49879
## + yr          1    42475 49889
## + holiday     1    42476 49891
## + mnth        1    42494 49908
## + windspeed   1    42499 49914
## + temp        1    42506 49920
## <none>             42522 49934
## 
## Step:  AIC=49296.68
## cnt ~ registered + casual + season + atemp + weathersit
## 
##              Df Deviance   AIC
## + weekday     6    41391 48818
## + holiday     1    41821 49237
## + workingday  1    41833 49249
## + windspeed   1    41845 49262
## + mnth        1    41861 49277
## + temp        1    41871 49287
## + instant     1    41871 49288
## <none>             41882 49297
## + hum         1    41881 49297
## + yr          1    41882 49299
## 
## Step:  AIC=48817.89
## cnt ~ registered + casual + season + atemp + weathersit + weekday
## 
##              Df Deviance   AIC
## + instant     1    41319 48747
## + windspeed   1    41347 48776
## + yr          1    41362 48791
## + holiday     1    41364 48793
## + workingday  1    41364 48793
## + mnth        1    41365 48794
## + temp        1    41387 48816
## + hum         1    41387 48816
## <none>             41391 48818
## 
## Step:  AIC=48747.28
## cnt ~ registered + casual + season + atemp + weathersit + weekday + 
##     instant
## 
##              Df Deviance   AIC
## + yr          1    41283 48713
## + windspeed   1    41285 48715
## + holiday     1    41306 48737
## + workingday  1    41306 48737
## + temp        1    41314 48744
## + mnth        1    41314 48745
## + hum         1    41316 48746
## <none>             41319 48747
## 
## Step:  AIC=48713.2
## cnt ~ registered + casual + season + atemp + weathersit + weekday + 
##     instant + yr
## 
##              Df Deviance   AIC
## + mnth        1    40965 48397
## + windspeed   1    41241 48673
## + holiday     1    41268 48700
## + workingday  1    41268 48700
## + temp        1    41278 48710
## <none>             41283 48713
## + hum         1    41282 48714
## 
## Step:  AIC=48397.36
## cnt ~ registered + casual + season + atemp + weathersit + weekday + 
##     instant + yr + mnth
## 
##              Df Deviance   AIC
## + windspeed   1    40928 48363
## + holiday     1    40943 48378
## + workingday  1    40943 48378
## <none>             40965 48397
## + temp        1    40963 48398
## + hum         1    40965 48399
## 
## Step:  AIC=48362.94
## cnt ~ registered + casual + season + atemp + weathersit + weekday + 
##     instant + yr + mnth + windspeed
## 
##              Df Deviance   AIC
## + holiday     1    40905 48342
## + workingday  1    40905 48342
## + hum         1    40925 48362
## <none>             40928 48363
## + temp        1    40928 48365
## 
## Step:  AIC=48341.97
## cnt ~ registered + casual + season + atemp + weathersit + weekday + 
##     instant + yr + mnth + windspeed + holiday
## 
##        Df Deviance   AIC
## + hum   1    40903 48342
## <none>       40905 48342
## + temp  1    40905 48344
## 
## Step:  AIC=48341.65
## cnt ~ registered + casual + season + atemp + weathersit + weekday + 
##     instant + yr + mnth + windspeed + holiday + hum
## 
##        Df Deviance   AIC
## <none>       40903 48342
## + temp  1    40903 48343
summary(forward_model)
## 
## Call:
## glm(formula = cnt ~ registered + casual + season + atemp + weathersit + 
##     weekday + instant + yr + mnth + windspeed + holiday + hum, 
##     family = poisson(link = "log"), data = bike_day)
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  7.234e+00  4.894e-03 1478.052  < 2e-16 ***
## registered   2.228e-04  9.428e-07  236.259  < 2e-16 ***
## casual       1.784e-04  1.571e-06  113.552  < 2e-16 ***
## season2      7.918e-02  2.525e-03   31.359  < 2e-16 ***
## season3      4.574e-02  3.421e-03   13.370  < 2e-16 ***
## season4      1.119e-01  3.735e-03   29.971  < 2e-16 ***
## atemp        2.961e-01  7.305e-03   40.531  < 2e-16 ***
## weathersit  -2.853e-02  1.476e-03  -19.325  < 2e-16 ***
## weekday1    -2.267e-02  2.934e-03   -7.725 1.12e-14 ***
## weekday2    -3.422e-02  3.013e-03  -11.355  < 2e-16 ***
## weekday3    -5.661e-02  3.075e-03  -18.409  < 2e-16 ***
## weekday4    -4.517e-02  3.047e-03  -14.826  < 2e-16 ***
## weekday5    -2.850e-02  2.769e-03  -10.293  < 2e-16 ***
## weekday6    -1.662e-02  2.126e-03   -7.817 5.43e-15 ***
## instant     -1.221e-03  6.400e-05  -19.083  < 2e-16 ***
## yr1          4.398e-01  2.354e-02   18.685  < 2e-16 ***
## mnth         3.437e-02  1.939e-03   17.724  < 2e-16 ***
## windspeed   -5.253e-02  8.327e-03   -6.308 2.83e-10 ***
## holiday1    -1.962e-02  4.152e-03   -4.726 2.29e-06 ***
## hum         -8.764e-03  5.760e-03   -1.521    0.128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 668801  on 730  degrees of freedom
## Residual deviance:  40903  on 711  degrees of freedom
## AIC: 48342
## 
## Number of Fisher Scoring iterations: 4

Forward selection started with our null model and then selected the variable that results in the lowest AIC among all remaining candidates. Looking at the selection steps, registered was added first as it produced the largest AIC drop (from 676,201 to 130,224), followed by casual, season, atemp, weathersit, weekday, instant, yr, mnth, windspeed, holiday, and finally hum. Notably, temp was never added at any step because it consistently failed to improve AIC meaningfully. At the final step it would have actually increased AIC from 48,342 to 48,343, confirming it adds no value. Similarly, workingday was never selected due to its perfect collinearity with other variables. The final forward-selected model ended up with an AIC of 48,342 and residual deviance of 40,903, which is identical to the my previous poisson_reduced model from earlier.

Now lets look at model diagnostic to furher examine our model.

par(mfrow = c(2,2))
plot(poisson_reduced)

  • Residuals vs Fitted: Ideally, the red line should be flat and hover around zero. In our plot, there is a clear curved (upside down U-shape) pattern, which is a strong indication of non-linearity. Observations 668, 2, and 765 are flagged as potential outliers worth investigating further.

  • Q-Q Residuals: The points deviate severely and consistently from the diagonal reference line, particularly in the upper tail where observation 668 stands far out. This is not a minor tail deviation. The pattern suggests the residuals are heavily right-skewed and far from normally distributed, which is a strong confirmation of higher dispersion in the our model.

  • Scale-Location: The red line shows a distinct curved pattern rather than a flat horizontal line, and the spread of the standardized residuals is uneven across fitted values — wider in the lower and upper ends and narrower in the middle. This confirms heteroscedasticity, meaning the variance of errors is not constant across predicted values, which directly violates the Poisson assumption that mean equals variance.

  • Residuals vs Leverage: Observations 668, 726, and 644 show notably large residuals combined with moderate leverage. Importantly, observation 668 appears to fall beyond or near the Cook’s Distance threshold of 1, suggesting it may be an influential observation that is meaningfully affecting the model estimates and should be examined closely or considered for further investigation.

1.6 Next step: more research.

Our residual plots show a lot of variance in the errors. In a Poisson regression, we assume the variance is equal to the mean, so that’s a problem. Let’s take another look at our response variable and if the variance is larger than the mean, we may need to consider a different model.

summary(bike_day$cnt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      22    3152    4548    4504    5956    8714
var(bike_day$cnt)
## [1] 3752788
var(bike_day$cnt) / mean(bike_day$cnt)
## [1] 833.1478

The mean (4,504) and median (4,548) are very close, which suggests the distribution of cnt is fairly symmetric without extreme skewness. However, the variance is around 3.7 million, much larger than the mean. This gives a dispersion ratio of about 833, meaning the variance is 833 times the mean. That’s a major violation of the Poisson assumption, where the mean and variance are supposed to be equal.

Now, lets visualize our bike rental count dispersion.

hist(bike_day$cnt, 
     breaks = 30,
     main = "Distribution of Daily Bike Rentals",
     xlab = "cnt",
     col = "steelblue")


abline(v = mean(bike_day$cnt), col = "red", lwd = 2, lty = 1)
abline(v = median(bike_day$cnt), col = "green", lwd = 2, lty = 2)


legend("topright", legend = c("Mean", "Median"),
       col = c("red", "green"), lty = c(1, 2), lwd = 2)

The histogram of daily bike rentals also supports the overdispersion we observed. You can clearly see the spread in the data, which confirms that a Poisson model isn’t appropriate here. Instead, a Negative Binomial model is a better choice because it can handle this extra variability.

nb_model <- glm.nb(cnt ~ registered + casual + season + atemp +
                   weathersit + weekday + instant + yr + mnth +
                   windspeed + holiday + hum, data = bike_day)
summary(nb_model)
## 
## Call:
## glm.nb(formula = cnt ~ registered + casual + season + atemp + 
##     weathersit + weekday + instant + yr + mnth + windspeed + 
##     holiday + hum, data = bike_day, init.theta = 36.35894071, 
##     link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.107e+00  4.944e-02 143.752  < 2e-16 ***
## registered   2.681e-04  9.945e-06  26.962  < 2e-16 ***
## casual       1.871e-04  1.760e-05  10.630  < 2e-16 ***
## season2      5.494e-02  2.509e-02   2.190  0.02854 *  
## season3      1.782e-03  3.391e-02   0.053  0.95809    
## season4      8.214e-02  3.430e-02   2.395  0.01662 *  
## atemp        3.248e-01  8.003e-02   4.059 4.93e-05 ***
## weathersit  -3.977e-02  1.585e-02  -2.509  0.01212 *  
## weekday1    -4.523e-02  2.985e-02  -1.515  0.12973    
## weekday2    -6.000e-02  3.058e-02  -1.962  0.04977 *  
## weekday3    -8.117e-02  3.089e-02  -2.628  0.00859 ** 
## weekday4    -6.745e-02  3.094e-02  -2.180  0.02922 *  
## weekday5    -3.544e-02  2.881e-02  -1.230  0.21870    
## weekday6    -2.415e-02  2.333e-02  -1.035  0.30055    
## instant     -1.614e-03  7.056e-04  -2.288  0.02217 *  
## yr1          5.541e-01  2.604e-01   2.127  0.03339 *  
## mnth         4.756e-02  2.177e-02   2.184  0.02893 *  
## windspeed   -8.949e-02  8.888e-02  -1.007  0.31400    
## holiday1    -1.522e-02  4.205e-02  -0.362  0.71731    
## hum         -2.235e-02  6.141e-02  -0.364  0.71590    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(36.3589) family taken to be 1)
## 
##     Null deviance: 6763.49  on 730  degrees of freedom
## Residual deviance:  765.88  on 711  degrees of freedom
## AIC: 11647
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  36.36 
##           Std. Err.:  1.98 
## 
##  2 x log-likelihood:  -11605.11

The null deviance dropped from about 6,763 to 766, which gives a pseudo R² of roughly 0.89. In simple terms, the model explains about 89% of the variation in cnt, which is a strong fit. The theta value is around 36, indicating that the Negative Binomial model is doing a good job accounting for over dispersion in the data. We also see a major improvement in AIC from about 48,000 in the Poisson model down to 11,600—which clearly shows that the Negative Binomial model fits the data much better. Finally, the dispersion ratio is close to 1, meaning the over dispersion issue has been effectively resolved.

nb_model_reduced <- glm.nb(cnt ~ registered + casual + season + 
                           atemp + weathersit + weekday + 
                           instant + yr + mnth, 
                           data = bike_day)
summary(nb_model_reduced)
## 
## Call:
## glm.nb(formula = cnt ~ registered + casual + season + atemp + 
##     weathersit + weekday + instant + yr + mnth, data = bike_day, 
##     init.theta = 36.29935948, link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.077e+00  3.597e-02 196.731  < 2e-16 ***
## registered   2.709e-04  9.265e-06  29.235  < 2e-16 ***
## casual       1.873e-04  1.666e-05  11.241  < 2e-16 ***
## season2      5.250e-02  2.496e-02   2.103  0.03545 *  
## season3      1.522e-03  3.382e-02   0.045  0.96410    
## season4      8.085e-02  3.424e-02   2.361  0.01821 *  
## atemp        3.166e-01  7.751e-02   4.085 4.41e-05 ***
## weathersit  -4.160e-02  1.295e-02  -3.212  0.00132 ** 
## weekday1    -4.958e-02  2.730e-02  -1.816  0.06936 .  
## weekday2    -6.327e-02  2.948e-02  -2.147  0.03183 *  
## weekday3    -8.411e-02  2.974e-02  -2.828  0.00469 ** 
## weekday4    -7.067e-02  2.962e-02  -2.386  0.01703 *  
## weekday5    -3.794e-02  2.779e-02  -1.366  0.17209    
## weekday6    -2.511e-02  2.330e-02  -1.078  0.28107    
## instant     -1.629e-03  7.039e-04  -2.315  0.02064 *  
## yr1          5.554e-01  2.597e-01   2.139  0.03247 *  
## mnth         4.797e-02  2.173e-02   2.207  0.02729 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(36.2994) family taken to be 1)
## 
##     Null deviance: 6752.58  on 730  degrees of freedom
## Residual deviance:  765.82  on 714  degrees of freedom
## AIC: 11642
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  36.30 
##           Std. Err.:  1.98 
## 
##  2 x log-likelihood:  -11606.25

After removing windspeed, holiday, and humidity, the model actually improved slightly. AIC dropped from 11,647 to 11,642, and the residual deviance stayed essentially the same at 765.82. This means we didn’t really lose any explanatory power by dropping those variables, so they weren’t contributing much to the model. The dispersion ratio is still about 1.07 (765.82/714), which is very close to 1, so over dispersion remains well controlled.

1.7 Model Selection and Reviewing parameters

models <- list( forward_model = forward_model,
                full_model = full_model,
                nb_model = nb_model,
                nb_model_reduced = nb_model_reduced,
                poisson_reduced = poisson_reduced,
                poisson_reduced2 = poisson_reduced2)

target_order <- c("full_model", "poisson_reduced", "poisson_reduced2", 
                  "forward_model", "nb_model", "nb_model_reduced")

model_comparison <- data.frame(
  Model = names(models),
  AIC = sapply(models, AIC),
  BIC = sapply(models, BIC),
  LogLik = sapply(models, logLik),
  Deviance = sapply(models, deviance)
)

model_comparison$Model <- factor(model_comparison$Model, levels = target_order)

model_comparison <- model_comparison[order(model_comparison$Model), ]

print(model_comparison)
##                             Model      AIC      BIC     LogLik   Deviance
## full_model             full_model 48343.41 48439.89 -24150.706 40902.7507
## poisson_reduced   poisson_reduced 48341.97 48429.26 -24151.984 40905.3078
## poisson_reduced2 poisson_reduced2 48740.20 48818.30 -24353.098 41307.5360
## forward_model       forward_model 48341.65 48433.54 -24150.827 40902.9935
## nb_model                 nb_model 11647.11 11743.60  -5802.557   765.8773
## nb_model_reduced nb_model_reduced 11642.25 11724.95  -5803.123   765.8204

Our best model is nb_Model_reduced so far and now we are going to examine model diagnostics.

par(mfrow = c(2,2))
plot(nb_model_reduced)

As shown in our diagnostic plots, there are some violations of homoscedasticity that we can further investigate. It’s possible that certain holiday effects combined with good weather conditions are creating spikes in the data. Overall, though, the model explains a large portion of the variation. This will be my final model, and next I’ll move on to making predictions.

1.8 Making Predictions and Interpretation of Results

predictions <- predict(nb_model_reduced, type = "response")
comparison <- data.frame(
  Actual = bike_day$cnt,
  Predicted = round(predictions, 0))
head(comparison, 10)
mae <- mean(abs(comparison$Actual - comparison$Predicted))
cat("MAE:", round(mae, 2), "\n")
## MAE: 442.71
rmse <- sqrt(mean((comparison$Actual - comparison$Predicted)^2))
cat("RMSE:", round(rmse, 2), "\n")
## RMSE: 620.4
mape <- mean(abs((comparison$Actual - comparison$Predicted) / 
             comparison$Actual)) * 100
cat("MAPE:", round(mape, 2), "%\n")
## MAPE: 18.94 %
  • Mean Absolute Error (MAE) is 442.71. This is the average absolute difference between our predicted and actual values. on average, our model is off by 443 units.

  • Root Mean Squared Error (RMSE) is 620.4. this is similar to MAE but penalizes large errors more heavily. Typical error magnitude is around 620 units, but larger mistakes influence it more.

  • Mean Absolute Percentage Error (MAPE) is 18.94% means that on average, our predictions are off by about 19%

new_day <- data.frame(
  registered = 3000,
  casual     = 500,
  season     = factor(2, levels = levels(bike_day$season)),
  atemp      = 0.5,
  weathersit = 1,
  weekday    = factor(3, levels = levels(bike_day$weekday)),
  instant    = 400,
  yr         = factor(1, levels = levels(bike_day$yr)),
  mnth       = 6
)


new_prediction <- predict(nb_model_reduced, 
                          newdata = new_day, 
                          type = "response")

cat("Predicted cnt:", round(new_prediction, 0), "\n")
## Predicted cnt: 3865