I will be analyzing a bike sharing dataset consist of two CSV files to model urban mobility patterns. I’ll use a Poisson regression model to predict daily bike demand and tap into that data to forecast exactly how many bikes are needed each day.
First thing, lets load our data into R
Now I m going to start with day dataset that shows bike rental traffic. Here is data dictionary for our first dataset
cntwill be our response variable.
## spc_tbl_ [731 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ instant : num [1:731] 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : Date[1:731], format: "2011-01-01" "2011-01-02" ...
## $ season : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : num [1:731] 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : num [1:731] 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : num [1:731] 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: num [1:731] 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: num [1:731] 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num [1:731] 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num [1:731] 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num [1:731] 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num [1:731] 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : num [1:731] 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: num [1:731] 654 670 1229 1454 1518 ...
## $ cnt : num [1:731] 985 801 1349 1562 1600 ...
## - attr(*, "spec")=
## .. cols(
## .. instant = col_double(),
## .. dteday = col_date(format = ""),
## .. season = col_double(),
## .. yr = col_double(),
## .. mnth = col_double(),
## .. holiday = col_double(),
## .. weekday = col_double(),
## .. workingday = col_double(),
## .. weathersit = col_double(),
## .. temp = col_double(),
## .. atemp = col_double(),
## .. hum = col_double(),
## .. windspeed = col_double(),
## .. casual = col_double(),
## .. registered = col_double(),
## .. cnt = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Besides date variable all the variables are numeric
form, there are categorical variables we need to transform into factor
format to analyze outcome more precisely.
bike_day$season <- as.factor(bike_day$season)
bike_day$holiday <- as.factor(bike_day$holiday)
bike_day$workingday <- as.factor(bike_day$workingday)
bike_day$weekday <- as.factor(bike_day$weekday)
bike_day$yr <- as.factor(bike_day$yr)library(GGally)
ggpairs(bike_day,
columns = c("season", "workingday", "temp", "cnt"),
aes(color = season, alpha = 0.4))ggpairs(bike_day,
columns = c("temp", "atemp", "hum", "windspeed", "cnt"),
aes(color = season, alpha = 0.4))ggpairs(bike_day,
columns = c("casual", "registered", "cnt", "temp"),
aes(color = workingday, alpha = 0.4))Temperature (temp) and Feeling Temperature
(atemp) show a very strong positive relationship with a
correlation of 0.99. This indicates that actual
temperature and perceived temperature move almost identically.
Temperature (temp) and Total Rentals
(cnt) show a strong positive relationship (Cor ≈
0.63), and Atemp and cnt are
similarly strongly correlated (Cor ≈ 0.63). This
suggests that warmer weather is associated with higher bike rental
demand.
Casual and Registered Users with Total
Rentals (cnt) show very strong positive relationships
(Cor ≈ 0.67 and 0.95, respectively). This indicates
that both user types contribute heavily to total demand, with registered
users having the strongest influence.
Casual and Registered Users show a
moderate positive relationship (Cor ≈ 0.40). This
suggests that when casual usage increases, registered usage also tends
to increase, though not as strongly.
Humidity (hum) and Total Rentals (cnt)
show a weak negative relationship (Cor ≈ -0.10). This
implies that higher humidity slightly reduces bike rental
activity.
Wind Speed (windspeed)and Total
Rentals(cnt) show a moderate negative relationship
(Cor ≈ -0.24). This suggests that higher wind speeds
are associated with lower bike usage.
Temperature (temp) and Humidity (hum)
show a weak positive relationship (Cor ≈ 0.13–0.25 depending on
season), indicating a slight tendency for warmer days to also
be more humid.
Temperature (temp) and Wind Speed
(windspeed) show a weak negative relationship (Cor
≈ -0.16 to -0.24), suggesting that warmer days tend to have
slightly lower wind speeds.
Registered Users and Total Rentals
(cnt) show a very strong positive relationship with a
correlation of 0.95. This indicates that as the number
of registered users increases, total bike rentals increase almost
proportionally, making it the strongest driver of demand.
Casual Users and Total Rentals (cnt)
also show a strong positive relationship (Cor ≈ 0.67).
This suggests that increases in casual riders are associated with higher
overall bike usage, though not as strongly as registered users.
Temperature (temp) and Total Rentals
(cnt) show a strong positive relationship (Cor ≈
0.63), and Atemp and cnt are
similarly correlated (Cor ≈ 0.66). This indicates that
warmer weather significantly increases bike rental activity.
Humidity (hum) and Total Rentals (cnt)
show a weak negative relationship (Cor ≈ -0.10). This
suggests that higher humidity slightly reduces bike rentals, but the
effect is relatively small.
Wind Speed (windspeed) and Total Rentals
(cnt) show a moderate negative relationship (Cor ≈
-0.24). This implies that windier conditions tend to discourage
bike usage and reduce total rentals.
poisson_full <- glm(cnt ~ .-dteday,
family = poisson(link = "log"),
data = bike_day)
summary(poisson_full)##
## Call:
## glm(formula = cnt ~ . - dteday, family = poisson(link = "log"),
## data = bike_day)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.234e+00 4.949e-03 1461.792 < 2e-16 ***
## instant -1.220e-03 6.407e-05 -19.039 < 2e-16 ***
## season2 7.926e-02 2.530e-03 31.323 < 2e-16 ***
## season3 4.604e-02 3.477e-03 13.242 < 2e-16 ***
## season4 1.119e-01 3.735e-03 29.967 < 2e-16 ***
## yr1 4.393e-01 2.357e-02 18.638 < 2e-16 ***
## mnth 3.432e-02 1.941e-03 17.681 < 2e-16 ***
## holiday1 -1.958e-02 4.153e-03 -4.714 2.43e-06 ***
## weekday1 -2.263e-02 2.935e-03 -7.712 1.24e-14 ***
## weekday2 -3.416e-02 3.016e-03 -11.326 < 2e-16 ***
## weekday3 -5.652e-02 3.080e-03 -18.349 < 2e-16 ***
## weekday4 -4.510e-02 3.050e-03 -14.789 < 2e-16 ***
## weekday5 -2.838e-02 2.780e-03 -10.210 < 2e-16 ***
## weekday6 -1.662e-02 2.126e-03 -7.818 5.36e-15 ***
## workingday1 NA NA NA NA
## weathersit -2.850e-02 1.477e-03 -19.296 < 2e-16 ***
## temp -1.116e-02 2.266e-02 -0.492 0.622
## atemp 3.078e-01 2.487e-02 12.375 < 2e-16 ***
## hum -8.762e-03 5.760e-03 -1.521 0.128
## windspeed -5.205e-02 8.383e-03 -6.209 5.33e-10 ***
## casual 1.785e-04 1.577e-06 113.178 < 2e-16 ***
## registered 2.228e-04 9.432e-07 236.189 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 668801 on 730 degrees of freedom
## Residual deviance: 40903 on 710 degrees of freedom
## AIC: 48343
##
## Number of Fisher Scoring iterations: 4
The null deviance is 668,801 and the residual deviance is 40,903, meaning the model explains the vast majority of the deviance in daily bike counts — a strong overall fit. A rough pseudo-R² (1 − 40903/668801) ≈ 0.939, suggesting the model accounts for about 93.9% of the deviance in cnt.
The AIC of 48,343 provides a baseline for comparing against our future models with adjustments.
workingday1was dropped (shown as NA), because it is perfectly collinear with the combination of weekday and holiday. It carries no additional information once those variables are included.
casualandregisteredare overwhelmingly strong predictors (z values 113 and 236), which makes sense sincecntis literally their sum.
yr1(coeff ≈ 0.439), indicating that Year 1 (2012) had higher rentals compared to Year 0 (2011) as the reference category.
seasoncoefficients show that Season 4 has the highest boost relative to Season 1 (the reference), while Season 3 is the weakest gainer.
atemp(coeff ≈ 0.308) has a strong positive effect, whiletempalone is not significant (p = 0.622), suggesting perceived temperature matters more than actual temperature.
weathersit(coeff ≈ −0.029) confirms that worse weather conditions reduce rentals, as expected.windspeed(coeff ≈ −0.052) has a significant negative effect, with higher winds discouraging riders.
weekdaycoefficients are all negative relative to Sunday (the reference 0), with Wednesday (weekday3) showing the largest dip, suggesting mid-week casual ridership is lowest.
holiday1(coeff ≈ −0.020) slightly reduces ridership, possibly because commuter trips drop on holidays.
instant(coeff ≈ −0.00122) has a small but significant negative trend over the index of days, which may reflect seasonality or other time-based patterns not fully captured elsewhere.
temp(p = 0.622) andhum(p = 0.128) are not statistically significant, suggesting that once atemp and other variables are controlled for, raw temperature and humidity doesn’t add unique explanatory power.
poisson_reduced <- glm(cnt ~ instant + season +
yr + mnth + holiday +
weekday + weathersit + atemp +
windspeed + casual + registered,
family = poisson(link = "log"),
data = bike_day)
summary(poisson_reduced)##
## Call:
## glm(formula = cnt ~ instant + season + yr + mnth + holiday +
## weekday + weathersit + atemp + windspeed + casual + registered,
## family = poisson(link = "log"), data = bike_day)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.230e+00 4.228e-03 1710.239 < 2e-16 ***
## instant -1.230e-03 6.372e-05 -19.311 < 2e-16 ***
## season2 7.932e-02 2.523e-03 31.439 < 2e-16 ***
## season3 4.630e-02 3.401e-03 13.611 < 2e-16 ***
## season4 1.121e-01 3.734e-03 30.038 < 2e-16 ***
## yr1 4.432e-01 2.344e-02 18.911 < 2e-16 ***
## mnth 3.457e-02 1.934e-03 17.871 < 2e-16 ***
## holiday1 -1.985e-02 4.149e-03 -4.784 1.72e-06 ***
## weekday1 -2.249e-02 2.932e-03 -7.670 1.71e-14 ***
## weekday2 -3.401e-02 3.010e-03 -11.296 < 2e-16 ***
## weekday3 -5.640e-02 3.072e-03 -18.359 < 2e-16 ***
## weekday4 -4.483e-02 3.038e-03 -14.753 < 2e-16 ***
## weekday5 -2.818e-02 2.761e-03 -10.206 < 2e-16 ***
## weekday6 -1.653e-02 2.126e-03 -7.777 7.44e-15 ***
## weathersit -2.977e-02 1.230e-03 -24.199 < 2e-16 ***
## atemp 2.934e-01 7.082e-03 41.423 < 2e-16 ***
## windspeed -4.881e-02 7.960e-03 -6.132 8.69e-10 ***
## casual 1.787e-04 1.564e-06 114.272 < 2e-16 ***
## registered 2.228e-04 9.414e-07 236.707 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 668801 on 730 degrees of freedom
## Residual deviance: 40905 on 712 degrees of freedom
## AIC: 48342
##
## Number of Fisher Scoring iterations: 4
The AIC is 1 point lower (48,342 vs 48,343) in the reduced model, which means it achieves essentially the same fit with fewer predictors, so removing non-significant variables (temp, hum) actually improved our model slightly.
The residual deviance is nearly identical (40,905 vs 40,903), meaning almost no explanatory power was lost by dropping temp and hum, confirming they were not contributing meaningfully.
The refined model gains 2 extra degrees of freedom (712 vs 710), reflecting the removal of those two predictors, which makes the model more parsimonious.
## GVIF Df GVIF^(1/(2*Df))
## instant 528.750548 1 22.994576
## season 14.820284 3 1.567266
## yr 425.020842 1 20.616034
## mnth 113.369841 1 10.647527
## holiday 1.318175 1 1.148118
## weekday 4.684057 6 1.137327
## weathersit 1.238158 1 1.112726
## atemp 3.558116 1 1.886297
## windspeed 1.120140 1 1.058367
## casual 4.270711 1 2.066570
## registered 6.455198 1 2.540708
instant(22.99) andyr(20.62) are extremely highly collinear with each other and other time-related variables. Both essentially capture the same time progression. therefore we will only utilizeyrand dropinstant.
mnth(10.65) is also severely collinear, likely because it overlaps heavily with bothinstantandseason.Therefore we can dropmnth.
poisson_reduced2 <- glm(cnt ~ yr + season + holiday +
weekday + weathersit + atemp +
windspeed + casual + registered,
family = poisson(link = "log"),
data = bike_day)
summary(poisson_reduced2)##
## Call:
## glm(formula = cnt ~ yr + season + holiday + weekday + weathersit +
## atemp + windspeed + casual + registered, family = poisson(link = "log"),
## data = bike_day)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.239e+00 3.929e-03 1842.671 < 2e-16 ***
## yr1 -7.273e-03 2.045e-03 -3.556 0.000377 ***
## season2 7.417e-02 2.402e-03 30.875 < 2e-16 ***
## season3 3.167e-02 2.817e-03 11.241 < 2e-16 ***
## season4 8.951e-02 2.343e-03 38.199 < 2e-16 ***
## holiday1 -1.802e-02 4.142e-03 -4.352 1.35e-05 ***
## weekday1 -2.217e-02 2.928e-03 -7.571 3.70e-14 ***
## weekday2 -3.342e-02 3.005e-03 -11.120 < 2e-16 ***
## weekday3 -5.613e-02 3.068e-03 -18.296 < 2e-16 ***
## weekday4 -4.437e-02 3.034e-03 -14.621 < 2e-16 ***
## weekday5 -2.832e-02 2.759e-03 -10.265 < 2e-16 ***
## weekday6 -1.713e-02 2.125e-03 -8.061 7.57e-16 ***
## weathersit -2.877e-02 1.228e-03 -23.424 < 2e-16 ***
## atemp 2.924e-01 7.075e-03 41.333 < 2e-16 ***
## windspeed -4.831e-02 7.945e-03 -6.080 1.20e-09 ***
## casual 1.796e-04 1.560e-06 115.113 < 2e-16 ***
## registered 2.230e-04 9.389e-07 237.451 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 668801 on 730 degrees of freedom
## Residual deviance: 41308 on 714 degrees of freedom
## AIC: 48740
##
## Number of Fisher Scoring iterations: 4
Our AIC scored increased by ~400. This means that dropped variables provided predictive information even though they were highly multicollinear.
At this point it would be though for me to find the way to resolve collinearity and AIC dilemma, therefore I will do forward selection method and see what is computer decide.
null_model <- glm(cnt ~ 1, family = poisson(link = "log"), data = bike_day)
full_model <- glm(cnt ~ . - dteday, family = poisson(link = "log"), data = bike_day)
forward_model <- step(null_model,
scope = list(lower = null_model, upper = full_model),
direction = "forward",
trace = 1) ## Start: AIC=676201.4
## cnt ~ 1
##
## Df Deviance AIC
## + registered 1 122821 130224
## + casual 1 421442 428845
## + atemp 1 422111 429513
## + temp 1 426458 433860
## + instant 1 426519 433921
## + season 3 435960 443366
## + yr 1 471448 478851
## + weathersit 1 612585 619988
## + mnth 1 621040 628443
## + windspeed 1 634555 641958
## + hum 1 662645 670047
## + weekday 6 664853 672266
## + holiday 1 665788 673190
## + workingday 1 666510 673913
## <none> 668801 676201
##
## Step: AIC=130223.5
## cnt ~ registered
##
## Df Deviance AIC
## + casual 1 48209 55613
## + weekday 6 78946 86361
## + workingday 1 79576 86981
## + atemp 1 99493 106897
## + temp 1 100443 107847
## + season 3 101283 108692
## + weathersit 1 119615 127019
## + windspeed 1 121845 129250
## + holiday 1 122299 129703
## + yr 1 122359 129764
## + mnth 1 122569 129974
## + instant 1 122691 130096
## + hum 1 122734 130138
## <none> 122821 130224
##
## Step: AIC=55613.39
## cnt ~ registered + casual
##
## Df Deviance AIC
## + season 3 44354 51765
## + atemp 1 45468 52875
## + temp 1 45820 53227
## + yr 1 46113 53520
## + instant 1 47416 54823
## + mnth 1 47558 54965
## + weekday 6 47712 55129
## + weathersit 1 47981 55388
## + windspeed 1 48086 55493
## + hum 1 48123 55530
## + holiday 1 48183 55589
## + workingday 1 48191 55598
## <none> 48209 55613
##
## Step: AIC=51764.56
## cnt ~ registered + casual + season
##
## Df Deviance AIC
## + atemp 1 42522 49934
## + temp 1 42748 50160
## + weathersit 1 43802 51214
## + instant 1 43863 51276
## + yr 1 43962 51375
## + weekday 6 43966 51388
## + windspeed 1 44290 51702
## + mnth 1 44306 51719
## + holiday 1 44326 51738
## + hum 1 44336 51749
## <none> 44354 51765
## + workingday 1 44354 51766
##
## Step: AIC=49934.26
## cnt ~ registered + casual + season + atemp
##
## Df Deviance AIC
## + weathersit 1 41882 49297
## + weekday 6 42027 49452
## + hum 1 42294 49709
## + instant 1 42428 49843
## + workingday 1 42465 49879
## + yr 1 42475 49889
## + holiday 1 42476 49891
## + mnth 1 42494 49908
## + windspeed 1 42499 49914
## + temp 1 42506 49920
## <none> 42522 49934
##
## Step: AIC=49296.68
## cnt ~ registered + casual + season + atemp + weathersit
##
## Df Deviance AIC
## + weekday 6 41391 48818
## + holiday 1 41821 49237
## + workingday 1 41833 49249
## + windspeed 1 41845 49262
## + mnth 1 41861 49277
## + temp 1 41871 49287
## + instant 1 41871 49288
## <none> 41882 49297
## + hum 1 41881 49297
## + yr 1 41882 49299
##
## Step: AIC=48817.89
## cnt ~ registered + casual + season + atemp + weathersit + weekday
##
## Df Deviance AIC
## + instant 1 41319 48747
## + windspeed 1 41347 48776
## + yr 1 41362 48791
## + holiday 1 41364 48793
## + workingday 1 41364 48793
## + mnth 1 41365 48794
## + temp 1 41387 48816
## + hum 1 41387 48816
## <none> 41391 48818
##
## Step: AIC=48747.28
## cnt ~ registered + casual + season + atemp + weathersit + weekday +
## instant
##
## Df Deviance AIC
## + yr 1 41283 48713
## + windspeed 1 41285 48715
## + holiday 1 41306 48737
## + workingday 1 41306 48737
## + temp 1 41314 48744
## + mnth 1 41314 48745
## + hum 1 41316 48746
## <none> 41319 48747
##
## Step: AIC=48713.2
## cnt ~ registered + casual + season + atemp + weathersit + weekday +
## instant + yr
##
## Df Deviance AIC
## + mnth 1 40965 48397
## + windspeed 1 41241 48673
## + holiday 1 41268 48700
## + workingday 1 41268 48700
## + temp 1 41278 48710
## <none> 41283 48713
## + hum 1 41282 48714
##
## Step: AIC=48397.36
## cnt ~ registered + casual + season + atemp + weathersit + weekday +
## instant + yr + mnth
##
## Df Deviance AIC
## + windspeed 1 40928 48363
## + holiday 1 40943 48378
## + workingday 1 40943 48378
## <none> 40965 48397
## + temp 1 40963 48398
## + hum 1 40965 48399
##
## Step: AIC=48362.94
## cnt ~ registered + casual + season + atemp + weathersit + weekday +
## instant + yr + mnth + windspeed
##
## Df Deviance AIC
## + holiday 1 40905 48342
## + workingday 1 40905 48342
## + hum 1 40925 48362
## <none> 40928 48363
## + temp 1 40928 48365
##
## Step: AIC=48341.97
## cnt ~ registered + casual + season + atemp + weathersit + weekday +
## instant + yr + mnth + windspeed + holiday
##
## Df Deviance AIC
## + hum 1 40903 48342
## <none> 40905 48342
## + temp 1 40905 48344
##
## Step: AIC=48341.65
## cnt ~ registered + casual + season + atemp + weathersit + weekday +
## instant + yr + mnth + windspeed + holiday + hum
##
## Df Deviance AIC
## <none> 40903 48342
## + temp 1 40903 48343
##
## Call:
## glm(formula = cnt ~ registered + casual + season + atemp + weathersit +
## weekday + instant + yr + mnth + windspeed + holiday + hum,
## family = poisson(link = "log"), data = bike_day)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.234e+00 4.894e-03 1478.052 < 2e-16 ***
## registered 2.228e-04 9.428e-07 236.259 < 2e-16 ***
## casual 1.784e-04 1.571e-06 113.552 < 2e-16 ***
## season2 7.918e-02 2.525e-03 31.359 < 2e-16 ***
## season3 4.574e-02 3.421e-03 13.370 < 2e-16 ***
## season4 1.119e-01 3.735e-03 29.971 < 2e-16 ***
## atemp 2.961e-01 7.305e-03 40.531 < 2e-16 ***
## weathersit -2.853e-02 1.476e-03 -19.325 < 2e-16 ***
## weekday1 -2.267e-02 2.934e-03 -7.725 1.12e-14 ***
## weekday2 -3.422e-02 3.013e-03 -11.355 < 2e-16 ***
## weekday3 -5.661e-02 3.075e-03 -18.409 < 2e-16 ***
## weekday4 -4.517e-02 3.047e-03 -14.826 < 2e-16 ***
## weekday5 -2.850e-02 2.769e-03 -10.293 < 2e-16 ***
## weekday6 -1.662e-02 2.126e-03 -7.817 5.43e-15 ***
## instant -1.221e-03 6.400e-05 -19.083 < 2e-16 ***
## yr1 4.398e-01 2.354e-02 18.685 < 2e-16 ***
## mnth 3.437e-02 1.939e-03 17.724 < 2e-16 ***
## windspeed -5.253e-02 8.327e-03 -6.308 2.83e-10 ***
## holiday1 -1.962e-02 4.152e-03 -4.726 2.29e-06 ***
## hum -8.764e-03 5.760e-03 -1.521 0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 668801 on 730 degrees of freedom
## Residual deviance: 40903 on 711 degrees of freedom
## AIC: 48342
##
## Number of Fisher Scoring iterations: 4
Forward selection started with our null model and then selected the variable that results in the lowest AIC among all remaining candidates. Looking at the selection steps,
registeredwas added first as it produced the largest AIC drop (from 676,201 to 130,224), followed bycasual,season,atemp,weathersit,weekday,instant,yr,mnth,windspeed,holiday, and finallyhum. Notably,tempwas never added at any step because it consistently failed to improve AIC meaningfully. At the final step it would have actually increased AIC from 48,342 to 48,343, confirming it adds no value. Similarly,workingdaywas never selected due to its perfect collinearity with other variables. The final forward-selected model ended up with an AIC of 48,342 and residual deviance of 40,903, which is identical to the my previous poisson_reduced model from earlier.
Now lets look at model diagnostic to furher examine our model.
Residuals vs Fitted: Ideally, the red line should be flat and hover around zero. In our plot, there is a clear curved (upside down U-shape) pattern, which is a strong indication of non-linearity. Observations 668, 2, and 765 are flagged as potential outliers worth investigating further.
Q-Q Residuals: The points deviate severely and consistently from the diagonal reference line, particularly in the upper tail where observation 668 stands far out. This is not a minor tail deviation. The pattern suggests the residuals are heavily right-skewed and far from normally distributed, which is a strong confirmation of higher dispersion in the our model.
Scale-Location: The red line shows a distinct curved pattern rather than a flat horizontal line, and the spread of the standardized residuals is uneven across fitted values — wider in the lower and upper ends and narrower in the middle. This confirms heteroscedasticity, meaning the variance of errors is not constant across predicted values, which directly violates the Poisson assumption that mean equals variance.
Residuals vs Leverage: Observations 668, 726, and 644 show notably large residuals combined with moderate leverage. Importantly, observation 668 appears to fall beyond or near the Cook’s Distance threshold of 1, suggesting it may be an influential observation that is meaningfully affecting the model estimates and should be examined closely or considered for further investigation.
Our residual plots show a lot of variance in the errors. In a Poisson regression, we assume the variance is equal to the mean, so that’s a problem. Let’s take another look at our response variable and if the variance is larger than the mean, we may need to consider a different model.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22 3152 4548 4504 5956 8714
## [1] 3752788
## [1] 833.1478
The mean (4,504) and median (4,548) are very close, which suggests the distribution of
cntis fairly symmetric without extreme skewness. However, the variance is around 3.7 million, much larger than the mean. This gives a dispersion ratio of about 833, meaning the variance is 833 times the mean. That’s a major violation of the Poisson assumption, where the mean and variance are supposed to be equal.
Now, lets visualize our bike rental count dispersion.
hist(bike_day$cnt,
breaks = 30,
main = "Distribution of Daily Bike Rentals",
xlab = "cnt",
col = "steelblue")
abline(v = mean(bike_day$cnt), col = "red", lwd = 2, lty = 1)
abline(v = median(bike_day$cnt), col = "green", lwd = 2, lty = 2)
legend("topright", legend = c("Mean", "Median"),
col = c("red", "green"), lty = c(1, 2), lwd = 2)The histogram of daily bike rentals also supports the overdispersion we observed. You can clearly see the spread in the data, which confirms that a Poisson model isn’t appropriate here. Instead, a Negative Binomial model is a better choice because it can handle this extra variability.
nb_model <- glm.nb(cnt ~ registered + casual + season + atemp +
weathersit + weekday + instant + yr + mnth +
windspeed + holiday + hum, data = bike_day)
summary(nb_model)##
## Call:
## glm.nb(formula = cnt ~ registered + casual + season + atemp +
## weathersit + weekday + instant + yr + mnth + windspeed +
## holiday + hum, data = bike_day, init.theta = 36.35894071,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.107e+00 4.944e-02 143.752 < 2e-16 ***
## registered 2.681e-04 9.945e-06 26.962 < 2e-16 ***
## casual 1.871e-04 1.760e-05 10.630 < 2e-16 ***
## season2 5.494e-02 2.509e-02 2.190 0.02854 *
## season3 1.782e-03 3.391e-02 0.053 0.95809
## season4 8.214e-02 3.430e-02 2.395 0.01662 *
## atemp 3.248e-01 8.003e-02 4.059 4.93e-05 ***
## weathersit -3.977e-02 1.585e-02 -2.509 0.01212 *
## weekday1 -4.523e-02 2.985e-02 -1.515 0.12973
## weekday2 -6.000e-02 3.058e-02 -1.962 0.04977 *
## weekday3 -8.117e-02 3.089e-02 -2.628 0.00859 **
## weekday4 -6.745e-02 3.094e-02 -2.180 0.02922 *
## weekday5 -3.544e-02 2.881e-02 -1.230 0.21870
## weekday6 -2.415e-02 2.333e-02 -1.035 0.30055
## instant -1.614e-03 7.056e-04 -2.288 0.02217 *
## yr1 5.541e-01 2.604e-01 2.127 0.03339 *
## mnth 4.756e-02 2.177e-02 2.184 0.02893 *
## windspeed -8.949e-02 8.888e-02 -1.007 0.31400
## holiday1 -1.522e-02 4.205e-02 -0.362 0.71731
## hum -2.235e-02 6.141e-02 -0.364 0.71590
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(36.3589) family taken to be 1)
##
## Null deviance: 6763.49 on 730 degrees of freedom
## Residual deviance: 765.88 on 711 degrees of freedom
## AIC: 11647
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 36.36
## Std. Err.: 1.98
##
## 2 x log-likelihood: -11605.11
The null deviance dropped from about 6,763 to 766, which gives a pseudo R² of roughly 0.89. In simple terms, the model explains about 89% of the variation in
cnt, which is a strong fit. The theta value is around 36, indicating that the Negative Binomial model is doing a good job accounting for over dispersion in the data. We also see a major improvement in AIC from about 48,000 in the Poisson model down to 11,600—which clearly shows that the Negative Binomial model fits the data much better. Finally, the dispersion ratio is close to 1, meaning the over dispersion issue has been effectively resolved.
nb_model_reduced <- glm.nb(cnt ~ registered + casual + season +
atemp + weathersit + weekday +
instant + yr + mnth,
data = bike_day)
summary(nb_model_reduced)##
## Call:
## glm.nb(formula = cnt ~ registered + casual + season + atemp +
## weathersit + weekday + instant + yr + mnth, data = bike_day,
## init.theta = 36.29935948, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.077e+00 3.597e-02 196.731 < 2e-16 ***
## registered 2.709e-04 9.265e-06 29.235 < 2e-16 ***
## casual 1.873e-04 1.666e-05 11.241 < 2e-16 ***
## season2 5.250e-02 2.496e-02 2.103 0.03545 *
## season3 1.522e-03 3.382e-02 0.045 0.96410
## season4 8.085e-02 3.424e-02 2.361 0.01821 *
## atemp 3.166e-01 7.751e-02 4.085 4.41e-05 ***
## weathersit -4.160e-02 1.295e-02 -3.212 0.00132 **
## weekday1 -4.958e-02 2.730e-02 -1.816 0.06936 .
## weekday2 -6.327e-02 2.948e-02 -2.147 0.03183 *
## weekday3 -8.411e-02 2.974e-02 -2.828 0.00469 **
## weekday4 -7.067e-02 2.962e-02 -2.386 0.01703 *
## weekday5 -3.794e-02 2.779e-02 -1.366 0.17209
## weekday6 -2.511e-02 2.330e-02 -1.078 0.28107
## instant -1.629e-03 7.039e-04 -2.315 0.02064 *
## yr1 5.554e-01 2.597e-01 2.139 0.03247 *
## mnth 4.797e-02 2.173e-02 2.207 0.02729 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(36.2994) family taken to be 1)
##
## Null deviance: 6752.58 on 730 degrees of freedom
## Residual deviance: 765.82 on 714 degrees of freedom
## AIC: 11642
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 36.30
## Std. Err.: 1.98
##
## 2 x log-likelihood: -11606.25
After removing
windspeed,holiday, andhumidity, the model actually improved slightly. AIC dropped from 11,647 to 11,642, and the residual deviance stayed essentially the same at 765.82. This means we didn’t really lose any explanatory power by dropping those variables, so they weren’t contributing much to the model. The dispersion ratio is still about 1.07 (765.82/714), which is very close to 1, so over dispersion remains well controlled.
models <- list( forward_model = forward_model,
full_model = full_model,
nb_model = nb_model,
nb_model_reduced = nb_model_reduced,
poisson_reduced = poisson_reduced,
poisson_reduced2 = poisson_reduced2)
target_order <- c("full_model", "poisson_reduced", "poisson_reduced2",
"forward_model", "nb_model", "nb_model_reduced")
model_comparison <- data.frame(
Model = names(models),
AIC = sapply(models, AIC),
BIC = sapply(models, BIC),
LogLik = sapply(models, logLik),
Deviance = sapply(models, deviance)
)
model_comparison$Model <- factor(model_comparison$Model, levels = target_order)
model_comparison <- model_comparison[order(model_comparison$Model), ]
print(model_comparison)## Model AIC BIC LogLik Deviance
## full_model full_model 48343.41 48439.89 -24150.706 40902.7507
## poisson_reduced poisson_reduced 48341.97 48429.26 -24151.984 40905.3078
## poisson_reduced2 poisson_reduced2 48740.20 48818.30 -24353.098 41307.5360
## forward_model forward_model 48341.65 48433.54 -24150.827 40902.9935
## nb_model nb_model 11647.11 11743.60 -5802.557 765.8773
## nb_model_reduced nb_model_reduced 11642.25 11724.95 -5803.123 765.8204
Our best model is nb_Model_reduced so far and now we are going to examine model diagnostics.
As shown in our diagnostic plots, there are some violations of homoscedasticity that we can further investigate. It’s possible that certain holiday effects combined with good weather conditions are creating spikes in the data. Overall, though, the model explains a large portion of the variation. This will be my final model, and next I’ll move on to making predictions.
predictions <- predict(nb_model_reduced, type = "response")
comparison <- data.frame(
Actual = bike_day$cnt,
Predicted = round(predictions, 0))
head(comparison, 10)## MAE: 442.71
## RMSE: 620.4
mape <- mean(abs((comparison$Actual - comparison$Predicted) /
comparison$Actual)) * 100
cat("MAPE:", round(mape, 2), "%\n")## MAPE: 18.94 %
Mean Absolute Error (MAE) is 442.71. This is the average absolute difference between our predicted and actual values. on average, our model is off by 443 units.
Root Mean Squared Error (RMSE) is 620.4. this is similar to MAE but penalizes large errors more heavily. Typical error magnitude is around 620 units, but larger mistakes influence it more.
Mean Absolute Percentage Error (MAPE) is 18.94% means that on average, our predictions are off by about 19%
new_day <- data.frame(
registered = 3000,
casual = 500,
season = factor(2, levels = levels(bike_day$season)),
atemp = 0.5,
weathersit = 1,
weekday = factor(3, levels = levels(bike_day$weekday)),
instant = 400,
yr = factor(1, levels = levels(bike_day$yr)),
mnth = 6
)
new_prediction <- predict(nb_model_reduced,
newdata = new_day,
type = "response")
cat("Predicted cnt:", round(new_prediction, 0), "\n")## Predicted cnt: 3865