set.seed(14)
ind<- sample(2,nrow(bikedata),replace = T,prob = c(0.8,0.2))
train<- bikedata[ind==1,]
test<- bikedata[ind==2,]
############################################model
str(train)
## 'data.frame': 927 obs. of 4 variables:
## $ ntrips : int 292 9487 2810 10244 3029 392 9950 1767 453 2328 ...
## $ weekdays: Factor w/ 7 levels "Sun","Mon","Tues",..: 4 4 4 5 5 6 6 6 7 7 ...
## $ season : Factor w/ 4 levels "Autumn","Spring",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Daytime : Factor w/ 4 levels "Night","Morning",..: 1 2 4 2 4 1 2 4 1 4 ...
Creating a model with the training data
model1<- glm(train,formula = ntrips~ weekdays+season+Daytime, family = "poisson")
summary(model1)
##
## Call:
## glm(formula = ntrips ~ weekdays + season + Daytime, family = "poisson",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -125.813 -13.923 -2.466 14.263 101.975
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.483214 0.002672 2426.37 <2e-16 ***
## weekdaysMon 0.205529 0.001453 141.46 <2e-16 ***
## weekdaysTues 0.266160 0.001452 183.25 <2e-16 ***
## weekdaysWed 0.223736 0.001462 153.07 <2e-16 ***
## weekdaysThurs 0.251817 0.001458 172.77 <2e-16 ***
## weekdaysFri 0.209494 0.001495 140.09 <2e-16 ***
## weekdaysSat 0.020438 0.001530 13.36 <2e-16 ***
## seasonSpring 0.144620 0.001103 131.15 <2e-16 ***
## seasonSummer 0.202801 0.001142 177.61 <2e-16 ***
## seasonWinter -0.199948 0.001251 -159.78 <2e-16 ***
## DaytimeMorning 2.586476 0.002424 1067.11 <2e-16 ***
## DaytimeAfternoon 2.754482 0.002414 1141.22 <2e-16 ***
## DaytimeEvening 1.733262 0.002531 684.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4261890 on 926 degrees of freedom
## Residual deviance: 505553 on 914 degrees of freedom
## AIC: 515089
##
## Number of Fisher Scoring iterations: 4
All covariates are significant according to their P-values,relatively speaking, and as we expected from Exploratory analysis, the periods during night, winter, and weekends have the least number of trips within their factors respectively as a level.
One assumption of a poisson model is that the mean and variance need to be equal, otherwise the model will be dispersed. we can run a dispersiontest for testing dispersion
dispersiontest(model1)
##
## Overdispersion test
##
## data: model1
## z = 14.779, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 543.7392
The Dispersion parameter is massively greater than 1, indicating Poisson may not be a valid choice.
A way to deal with over dispertion is the Negative binomial model unlike poisson, Negative binomail model requires a dispersion parameter the capture the inequality of means and variance. Therefore, the Poisson model is actually nested in the negative binomial model. Which can be compared by a likelihood ratio test.
model3<-glm.nb(train,formula = ntrips~weekdays+season+Daytime)
pchisq(2 * (logLik(model3) - logLik(model1)), df = 1, lower.tail = FALSE)
## 'log Lik.' 0 (df=14)
summary(model3)
##
## Call:
## glm.nb(formula = ntrips ~ weekdays + season + Daytime, data = train,
## init.theta = 7.849953965, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7297 -0.7754 -0.0064 0.5151 4.1681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.622952 0.044039 150.388 < 2e-16 ***
## weekdaysMon -0.041204 0.043054 -0.957 0.339
## weekdaysTues 0.009389 0.043978 0.214 0.831
## weekdaysWed -0.041277 0.043205 -0.955 0.339
## weekdaysThurs 0.006764 0.043541 0.155 0.877
## weekdaysFri -0.013931 0.044357 -0.314 0.753
## weekdaysSat -0.025730 0.043451 -0.592 0.554
## seasonSpring 0.197566 0.034181 5.780 7.47e-09 ***
## seasonSummer 0.282257 0.035737 7.898 2.83e-15 ***
## seasonWinter -0.198777 0.035720 -5.565 2.62e-08 ***
## DaytimeMorning 2.611803 0.033411 78.172 < 2e-16 ***
## DaytimeAfternoon 2.763227 0.033661 82.091 < 2e-16 ***
## DaytimeEvening 1.721694 0.033504 51.387 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(7.85) family taken to be 1)
##
## Null deviance: 7112.3 on 926 degrees of freedom
## Residual deviance: 947.9 on 914 degrees of freedom
## AIC: 16406
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 7.850
## Std. Err.: 0.359
##
## 2 x log-likelihood: -16377.660
We have the evidence to choose Nb than poisson!
model32<-glm.nb(train,formula = ntrips~weekdays+season+Daytime+
weekdays:season+season:Daytime+Daytime:weekdays)
model34<-glm.nb(train,formula = ntrips~weekdays+season+Daytime+
weekdays:Daytime+Daytime:season)
model33<-glm.nb(train,formula = ntrips~weekdays+season+Daytime+
weekdays:season)
model35<-glm.nb(train,formula = ntrips~weekdays+season+Daytime+
Daytime:season)
model36<- glm.nb(train,formula = ntrips~weekdays+season+Daytime+
Daytime:season+weekdays:season)
anova(model36)
## Warning in anova.negbin(model36): tests made without re-estimating 'theta'
## Analysis of Deviance Table
##
## Model: Negative Binomial(8.6078), link: log
##
## Response: ntrips
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 926 7796.8
## weekdays 6 86.7 920 7710.2 < 2.2e-16 ***
## season 3 234.5 917 7475.6 < 2.2e-16 ***
## Daytime 3 6436.7 914 1038.9 < 2.2e-16 ***
## season:Daytime 9 44.4 905 994.5 1.197e-06 ***
## weekdays:season 18 47.6 887 947.0 0.0001748 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All the covariates and intercepts are significant according to their P-values
list(c(AIC(model3),"model3"),
c(AIC(model32),"model32"),
c(AIC(model34),"model34"),
c(AIC(model33),"model33"),
c(AIC(model35),"model35"),
c(AIC(model36),"model36"))
## [[1]]
## [1] "16405.6604346304" "model3"
##
## [[2]]
## [1] "15774.3554135836" "model32"
##
## [[3]]
## [1] "15842.591408677" "model34"
##
## [[4]]
## [1] "16391.4737054561" "model33"
##
## [[5]]
## [1] "16382.2935830502" "model35"
##
## [[6]]
## [1] "16371.8922275975" "model36"
AIC balances between the complexity of the model and the goodnesss of fit, since increasing the number of parameters in the model will mostly increase fit, but often it may not worth duo to complexity. Follow the equation of AIC, we want the lower value among all the models. thus model32 is lying in our favour.
In order to test how well the model fits, we can use a chi-square goodness of fit test.
#pchisq(model3$deviance, df=model3$df.residual, lower.tail=FALSE)
#pchisq(model34$deviance, df=model34$df.residual, lower.tail=FALSE)
pchisq(model32$deviance, df=model32$df.residual, lower.tail=FALSE)
## [1] 0.05227276
#pchisq(model35$deviance, df=model35$df.residual, lower.tail=FALSE)
#pchisq(model36$deviance, df=model36$df.residual, lower.tail=FALSE)
list(c(logLik(model32),"model32"),
c(logLik(model35),"model35"),
c(logLik(model36),"model36"),
c(logLik(model3),"model3"),
c(logLik(model34),"model34"))
## [[1]]
## [1] "-7828.17770679181" "model32"
##
## [[2]]
## [1] "-8168.14679152512" "model35"
##
## [[3]]
## [1] "-8144.94611379875" "model36"
##
## [[4]]
## [1] "-8188.8302173152" "model3"
##
## [[5]]
## [1] "-7880.29570433852" "model34"
Model32 has the highest loglikehood.
Once we have build our model in train data, we may as well to use the model on test data to see how will the model preforms with an unseen dateset.
sum((test$ntrips-exp(predict(model32, test)))^2)
## [1] 665566192
mean((test$ntrips-exp(predict(model32, test)))^2)
## [1] 3011612
One may argue that the model is over complex, since it has more interaction terms.
summary(model32)
##
## Call:
## glm.nb(formula = ntrips ~ weekdays + season + Daytime + weekdays:season +
## season:Daytime + Daytime:weekdays, data = train, init.theta = 16.78870559,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0246 -0.5274 -0.0025 0.4867 3.7225
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.08977 0.06781 104.553 < 2e-16 ***
## weekdaysMon -0.83530 0.08514 -9.810 < 2e-16 ***
## weekdaysTues -0.89651 0.08687 -10.321 < 2e-16 ***
## weekdaysWed -0.85524 0.08513 -10.047 < 2e-16 ***
## weekdaysThurs -0.80831 0.08853 -9.130 < 2e-16 ***
## weekdaysFri -0.62974 0.08818 -7.141 9.26e-13 ***
## weekdaysSat -0.16940 0.08821 -1.920 0.054803 .
## seasonSpring 0.47296 0.07410 6.383 1.74e-10 ***
## seasonSummer 0.60053 0.07672 7.827 4.99e-15 ***
## seasonWinter -0.40563 0.07677 -5.284 1.27e-07 ***
## DaytimeMorning 1.62067 0.07658 21.163 < 2e-16 ***
## DaytimeAfternoon 2.17613 0.07562 28.778 < 2e-16 ***
## DaytimeEvening 0.81204 0.07626 10.649 < 2e-16 ***
## weekdaysMon:seasonSpring -0.29991 0.08648 -3.468 0.000524 ***
## weekdaysTues:seasonSpring -0.24766 0.08677 -2.854 0.004313 **
## weekdaysWed:seasonSpring -0.34141 0.08595 -3.972 7.12e-05 ***
## weekdaysThurs:seasonSpring -0.27838 0.08679 -3.207 0.001339 **
## weekdaysFri:seasonSpring -0.29608 0.08774 -3.374 0.000740 ***
## weekdaysSat:seasonSpring -0.03577 0.08746 -0.409 0.682559
## weekdaysMon:seasonSummer -0.14667 0.08863 -1.655 0.097955 .
## weekdaysTues:seasonSummer -0.19269 0.09128 -2.111 0.034784 *
## weekdaysWed:seasonSummer -0.36409 0.08874 -4.103 4.08e-05 ***
## weekdaysThurs:seasonSummer -0.25517 0.08969 -2.845 0.004439 **
## weekdaysFri:seasonSummer -0.19962 0.09212 -2.167 0.030245 *
## weekdaysSat:seasonSummer -0.18425 0.09016 -2.043 0.041003 *
## weekdaysMon:seasonWinter 0.28363 0.08879 3.195 0.001400 **
## weekdaysTues:seasonWinter 0.30074 0.09088 3.309 0.000936 ***
## weekdaysWed:seasonWinter 0.18756 0.09028 2.078 0.037749 *
## weekdaysThurs:seasonWinter 0.21563 0.09249 2.331 0.019734 *
## weekdaysFri:seasonWinter 0.18457 0.09145 2.018 0.043562 *
## weekdaysSat:seasonWinter 0.19720 0.09006 2.190 0.028541 *
## seasonSpring:DaytimeMorning -0.20621 0.06733 -3.063 0.002194 **
## seasonSummer:DaytimeMorning -0.29476 0.07052 -4.180 2.92e-05 ***
## seasonWinter:DaytimeMorning 0.03148 0.07027 0.448 0.654200
## seasonSpring:DaytimeAfternoon -0.07323 0.06643 -1.102 0.270319
## seasonSummer:DaytimeAfternoon -0.20237 0.07007 -2.888 0.003875 **
## seasonWinter:DaytimeAfternoon -0.04958 0.06994 -0.709 0.478382
## seasonSpring:DaytimeEvening 0.04776 0.06835 0.699 0.484643
## seasonSummer:DaytimeEvening 0.05352 0.07074 0.757 0.449325
## seasonWinter:DaytimeEvening -0.04442 0.07055 -0.630 0.528919
## weekdaysMon:DaytimeMorning 1.47912 0.08378 17.654 < 2e-16 ***
## weekdaysTues:DaytimeMorning 1.65351 0.08703 18.999 < 2e-16 ***
## weekdaysWed:DaytimeMorning 1.66956 0.08477 19.695 < 2e-16 ***
## weekdaysThurs:DaytimeMorning 1.60494 0.08662 18.529 < 2e-16 ***
## weekdaysFri:DaytimeMorning 1.34504 0.08614 15.614 < 2e-16 ***
## weekdaysSat:DaytimeMorning 0.29962 0.08572 3.495 0.000474 ***
## weekdaysMon:DaytimeAfternoon 1.00052 0.08434 11.862 < 2e-16 ***
## weekdaysTues:DaytimeAfternoon 1.06581 0.08701 12.249 < 2e-16 ***
## weekdaysWed:DaytimeAfternoon 1.07165 0.08470 12.652 < 2e-16 ***
## weekdaysThurs:DaytimeAfternoon 0.97483 0.08717 11.183 < 2e-16 ***
## weekdaysFri:DaytimeAfternoon 0.80392 0.08636 9.309 < 2e-16 ***
## weekdaysSat:DaytimeAfternoon 0.22391 0.08628 2.595 0.009456 **
## weekdaysMon:DaytimeEvening 1.22383 0.08443 14.495 < 2e-16 ***
## weekdaysTues:DaytimeEvening 1.32001 0.08722 15.134 < 2e-16 ***
## weekdaysWed:DaytimeEvening 1.38238 0.08339 16.578 < 2e-16 ***
## weekdaysThurs:DaytimeEvening 1.30995 0.08435 15.530 < 2e-16 ***
## weekdaysFri:DaytimeEvening 1.03882 0.08669 11.983 < 2e-16 ***
## weekdaysSat:DaytimeEvening 0.35250 0.08485 4.154 3.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(16.7887) family taken to be 1)
##
## Null deviance: 15161.85 on 926 degrees of freedom
## Residual deviance: 937.74 on 869 degrees of freedom
## AIC: 15774
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 16.789
## Std. Err.: 0.780
##
## 2 x log-likelihood: -15656.355
Overall,The summary output of our final model illustrates that while keeping other covariates constant, A change number of trips in Summer will have exp(0.60053) more than in Autumn. One thing worth to metion that is when interactions are added, the outputs of the cofficients of weekdays are quite different comparing with the model with out interactions (model3)