set.seed(14)
ind<- sample(2,nrow(bikedata),replace = T,prob = c(0.8,0.2))
train<- bikedata[ind==1,]
test<- bikedata[ind==2,]
############################################model
str(train)

## 'data.frame':    927 obs. of  4 variables:
##  $ ntrips  : int  292 9487 2810 10244 3029 392 9950 1767 453 2328 ...
##  $ weekdays: Factor w/ 7 levels "Sun","Mon","Tues",..: 4 4 4 5 5 6 6 6 7 7 ...
##  $ season  : Factor w/ 4 levels "Autumn","Spring",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Daytime : Factor w/ 4 levels "Night","Morning",..: 1 2 4 2 4 1 2 4 1 4 ...

Judging by the outcome of Exploratory analysis that each factors do have an impact to another,interactions are a valid choice. However we are ensure if the consequence of the interactions are strong enough since the influence can be negligible, which can be answered by regression.

Creating a model with the training data

model1<- glm(train,formula =  ntrips~ weekdays+season+Daytime, family = "poisson")
summary(model1)

## 
## Call:
## glm(formula = ntrips ~ weekdays + season + Daytime, family = "poisson", 
##     data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -125.813   -13.923    -2.466    14.263   101.975  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       6.483214   0.002672 2426.37   <2e-16 ***
## weekdaysMon       0.205529   0.001453  141.46   <2e-16 ***
## weekdaysTues      0.266160   0.001452  183.25   <2e-16 ***
## weekdaysWed       0.223736   0.001462  153.07   <2e-16 ***
## weekdaysThurs     0.251817   0.001458  172.77   <2e-16 ***
## weekdaysFri       0.209494   0.001495  140.09   <2e-16 ***
## weekdaysSat       0.020438   0.001530   13.36   <2e-16 ***
## seasonSpring      0.144620   0.001103  131.15   <2e-16 ***
## seasonSummer      0.202801   0.001142  177.61   <2e-16 ***
## seasonWinter     -0.199948   0.001251 -159.78   <2e-16 ***
## DaytimeMorning    2.586476   0.002424 1067.11   <2e-16 ***
## DaytimeAfternoon  2.754482   0.002414 1141.22   <2e-16 ***
## DaytimeEvening    1.733262   0.002531  684.91   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4261890  on 926  degrees of freedom
## Residual deviance:  505553  on 914  degrees of freedom
## AIC: 515089
## 
## Number of Fisher Scoring iterations: 4

All covariates are significant according to their P-values,relatively speaking, and as we expected from Exploratory analysis, the periods during night, winter, and weekends have the least number of trips within their factors respectively as a level.

One assumption of a poisson model is that the mean and variance need to be equal, otherwise the model will be dispersed. we can run a dispersiontest for testing dispersion

dispersiontest(model1)

## 
##  Overdispersion test
## 
## data:  model1
## z = 14.779, p-value < 2.2e-16
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   543.7392

The Dispersion parameter is massively greater than 1, indicating Poisson may not be a valid choice.

A way to deal with over dispertion is the Negative binomial model unlike poisson, Negative binomail model requires a dispersion parameter the capture the inequality of means and variance. Therefore, the Poisson model is actually nested in the negative binomial model. Which can be compared by a likelihood ratio test.

model3<-glm.nb(train,formula =  ntrips~weekdays+season+Daytime)
pchisq(2 * (logLik(model3) - logLik(model1)), df = 1, lower.tail = FALSE)

## 'log Lik.' 0 (df=14)

summary(model3)

## 
## Call:
## glm.nb(formula = ntrips ~ weekdays + season + Daytime, data = train, 
##     init.theta = 7.849953965, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7297  -0.7754  -0.0064   0.5151   4.1681  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       6.622952   0.044039 150.388  < 2e-16 ***
## weekdaysMon      -0.041204   0.043054  -0.957    0.339    
## weekdaysTues      0.009389   0.043978   0.214    0.831    
## weekdaysWed      -0.041277   0.043205  -0.955    0.339    
## weekdaysThurs     0.006764   0.043541   0.155    0.877    
## weekdaysFri      -0.013931   0.044357  -0.314    0.753    
## weekdaysSat      -0.025730   0.043451  -0.592    0.554    
## seasonSpring      0.197566   0.034181   5.780 7.47e-09 ***
## seasonSummer      0.282257   0.035737   7.898 2.83e-15 ***
## seasonWinter     -0.198777   0.035720  -5.565 2.62e-08 ***
## DaytimeMorning    2.611803   0.033411  78.172  < 2e-16 ***
## DaytimeAfternoon  2.763227   0.033661  82.091  < 2e-16 ***
## DaytimeEvening    1.721694   0.033504  51.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(7.85) family taken to be 1)
## 
##     Null deviance: 7112.3  on 926  degrees of freedom
## Residual deviance:  947.9  on 914  degrees of freedom
## AIC: 16406
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  7.850 
##           Std. Err.:  0.359 
## 
##  2 x log-likelihood:  -16377.660

We have the evidence to choose Nb than poisson!

model32<-glm.nb(train,formula =  ntrips~weekdays+season+Daytime+
    weekdays:season+season:Daytime+Daytime:weekdays)

model34<-glm.nb(train,formula =  ntrips~weekdays+season+Daytime+
                  weekdays:Daytime+Daytime:season)

model33<-glm.nb(train,formula =  ntrips~weekdays+season+Daytime+
                  weekdays:season)

model35<-glm.nb(train,formula =  ntrips~weekdays+season+Daytime+
                  Daytime:season)
model36<- glm.nb(train,formula =  ntrips~weekdays+season+Daytime+
                   Daytime:season+weekdays:season)

anova(model36)

## Warning in anova.negbin(model36): tests made without re-estimating 'theta'

## Analysis of Deviance Table
## 
## Model: Negative Binomial(8.6078), link: log
## 
## Response: ntrips
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                              926     7796.8              
## weekdays         6     86.7       920     7710.2 < 2.2e-16 ***
## season           3    234.5       917     7475.6 < 2.2e-16 ***
## Daytime          3   6436.7       914     1038.9 < 2.2e-16 ***
## season:Daytime   9     44.4       905      994.5 1.197e-06 ***
## weekdays:season 18     47.6       887      947.0 0.0001748 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All the covariates and intercepts are significant according to their P-values

list(c(AIC(model3),"model3"),
     c(AIC(model32),"model32"),
     c(AIC(model34),"model34"),
     c(AIC(model33),"model33"),
     c(AIC(model35),"model35"),
     c(AIC(model36),"model36"))

## [[1]]
## [1] "16405.6604346304" "model3"          
## 
## [[2]]
## [1] "15774.3554135836" "model32"         
## 
## [[3]]
## [1] "15842.591408677" "model34"        
## 
## [[4]]
## [1] "16391.4737054561" "model33"         
## 
## [[5]]
## [1] "16382.2935830502" "model35"         
## 
## [[6]]
## [1] "16371.8922275975" "model36"

AIC balances between the complexity of the model and the goodnesss of fit, since increasing the number of parameters in the model will mostly increase fit, but often it may not worth duo to complexity. Follow the equation of AIC, we want the lower value among all the models. thus model32 is lying in our favour.

In order to test how well the model fits, we can use a chi-square goodness of fit test.

#pchisq(model3$deviance, df=model3$df.residual, lower.tail=FALSE)

#pchisq(model34$deviance, df=model34$df.residual, lower.tail=FALSE)

pchisq(model32$deviance, df=model32$df.residual, lower.tail=FALSE)

## [1] 0.05227276

#pchisq(model35$deviance, df=model35$df.residual, lower.tail=FALSE)

#pchisq(model36$deviance, df=model36$df.residual, lower.tail=FALSE)

list(c(logLik(model32),"model32"),
c(logLik(model35),"model35"),
c(logLik(model36),"model36"),
c(logLik(model3),"model3"),
c(logLik(model34),"model34"))

## [[1]]
## [1] "-7828.17770679181" "model32"          
## 
## [[2]]
## [1] "-8168.14679152512" "model35"          
## 
## [[3]]
## [1] "-8144.94611379875" "model36"          
## 
## [[4]]
## [1] "-8188.8302173152" "model3"          
## 
## [[5]]
## [1] "-7880.29570433852" "model34"

Model32 has the highest loglikehood.

Testing our final model

Once we have build our model in train data, we may as well to use the model on test data to see how will the model preforms with an unseen dateset.

sum((test$ntrips-exp(predict(model32, test)))^2)

## [1] 665566192

mean((test$ntrips-exp(predict(model32, test)))^2)

## [1] 3011612

One may argue that the model is over complex, since it has more interaction terms.

summary(model32)

## 
## Call:
## glm.nb(formula = ntrips ~ weekdays + season + Daytime + weekdays:season + 
##     season:Daytime + Daytime:weekdays, data = train, init.theta = 16.78870559, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.0246  -0.5274  -0.0025   0.4867   3.7225  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     7.08977    0.06781 104.553  < 2e-16 ***
## weekdaysMon                    -0.83530    0.08514  -9.810  < 2e-16 ***
## weekdaysTues                   -0.89651    0.08687 -10.321  < 2e-16 ***
## weekdaysWed                    -0.85524    0.08513 -10.047  < 2e-16 ***
## weekdaysThurs                  -0.80831    0.08853  -9.130  < 2e-16 ***
## weekdaysFri                    -0.62974    0.08818  -7.141 9.26e-13 ***
## weekdaysSat                    -0.16940    0.08821  -1.920 0.054803 .  
## seasonSpring                    0.47296    0.07410   6.383 1.74e-10 ***
## seasonSummer                    0.60053    0.07672   7.827 4.99e-15 ***
## seasonWinter                   -0.40563    0.07677  -5.284 1.27e-07 ***
## DaytimeMorning                  1.62067    0.07658  21.163  < 2e-16 ***
## DaytimeAfternoon                2.17613    0.07562  28.778  < 2e-16 ***
## DaytimeEvening                  0.81204    0.07626  10.649  < 2e-16 ***
## weekdaysMon:seasonSpring       -0.29991    0.08648  -3.468 0.000524 ***
## weekdaysTues:seasonSpring      -0.24766    0.08677  -2.854 0.004313 ** 
## weekdaysWed:seasonSpring       -0.34141    0.08595  -3.972 7.12e-05 ***
## weekdaysThurs:seasonSpring     -0.27838    0.08679  -3.207 0.001339 ** 
## weekdaysFri:seasonSpring       -0.29608    0.08774  -3.374 0.000740 ***
## weekdaysSat:seasonSpring       -0.03577    0.08746  -0.409 0.682559    
## weekdaysMon:seasonSummer       -0.14667    0.08863  -1.655 0.097955 .  
## weekdaysTues:seasonSummer      -0.19269    0.09128  -2.111 0.034784 *  
## weekdaysWed:seasonSummer       -0.36409    0.08874  -4.103 4.08e-05 ***
## weekdaysThurs:seasonSummer     -0.25517    0.08969  -2.845 0.004439 ** 
## weekdaysFri:seasonSummer       -0.19962    0.09212  -2.167 0.030245 *  
## weekdaysSat:seasonSummer       -0.18425    0.09016  -2.043 0.041003 *  
## weekdaysMon:seasonWinter        0.28363    0.08879   3.195 0.001400 ** 
## weekdaysTues:seasonWinter       0.30074    0.09088   3.309 0.000936 ***
## weekdaysWed:seasonWinter        0.18756    0.09028   2.078 0.037749 *  
## weekdaysThurs:seasonWinter      0.21563    0.09249   2.331 0.019734 *  
## weekdaysFri:seasonWinter        0.18457    0.09145   2.018 0.043562 *  
## weekdaysSat:seasonWinter        0.19720    0.09006   2.190 0.028541 *  
## seasonSpring:DaytimeMorning    -0.20621    0.06733  -3.063 0.002194 ** 
## seasonSummer:DaytimeMorning    -0.29476    0.07052  -4.180 2.92e-05 ***
## seasonWinter:DaytimeMorning     0.03148    0.07027   0.448 0.654200    
## seasonSpring:DaytimeAfternoon  -0.07323    0.06643  -1.102 0.270319    
## seasonSummer:DaytimeAfternoon  -0.20237    0.07007  -2.888 0.003875 ** 
## seasonWinter:DaytimeAfternoon  -0.04958    0.06994  -0.709 0.478382    
## seasonSpring:DaytimeEvening     0.04776    0.06835   0.699 0.484643    
## seasonSummer:DaytimeEvening     0.05352    0.07074   0.757 0.449325    
## seasonWinter:DaytimeEvening    -0.04442    0.07055  -0.630 0.528919    
## weekdaysMon:DaytimeMorning      1.47912    0.08378  17.654  < 2e-16 ***
## weekdaysTues:DaytimeMorning     1.65351    0.08703  18.999  < 2e-16 ***
## weekdaysWed:DaytimeMorning      1.66956    0.08477  19.695  < 2e-16 ***
## weekdaysThurs:DaytimeMorning    1.60494    0.08662  18.529  < 2e-16 ***
## weekdaysFri:DaytimeMorning      1.34504    0.08614  15.614  < 2e-16 ***
## weekdaysSat:DaytimeMorning      0.29962    0.08572   3.495 0.000474 ***
## weekdaysMon:DaytimeAfternoon    1.00052    0.08434  11.862  < 2e-16 ***
## weekdaysTues:DaytimeAfternoon   1.06581    0.08701  12.249  < 2e-16 ***
## weekdaysWed:DaytimeAfternoon    1.07165    0.08470  12.652  < 2e-16 ***
## weekdaysThurs:DaytimeAfternoon  0.97483    0.08717  11.183  < 2e-16 ***
## weekdaysFri:DaytimeAfternoon    0.80392    0.08636   9.309  < 2e-16 ***
## weekdaysSat:DaytimeAfternoon    0.22391    0.08628   2.595 0.009456 ** 
## weekdaysMon:DaytimeEvening      1.22383    0.08443  14.495  < 2e-16 ***
## weekdaysTues:DaytimeEvening     1.32001    0.08722  15.134  < 2e-16 ***
## weekdaysWed:DaytimeEvening      1.38238    0.08339  16.578  < 2e-16 ***
## weekdaysThurs:DaytimeEvening    1.30995    0.08435  15.530  < 2e-16 ***
## weekdaysFri:DaytimeEvening      1.03882    0.08669  11.983  < 2e-16 ***
## weekdaysSat:DaytimeEvening      0.35250    0.08485   4.154 3.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(16.7887) family taken to be 1)
## 
##     Null deviance: 15161.85  on 926  degrees of freedom
## Residual deviance:   937.74  on 869  degrees of freedom
## AIC: 15774
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  16.789 
##           Std. Err.:  0.780 
## 
##  2 x log-likelihood:  -15656.355

Overall,The summary output of our final model illustrates that while keeping other covariates constant, A change number of trips in Summer will have exp(0.60053) more than in Autumn. One thing worth to metion that is when interactions are added, the outputs of the cofficients of weekdays are quite different comparing with the model with out interactions (model3)

Regression start week3 part1

Judging by the outcome of Exploratory analysis that each factors do have an impact to another,interactions are a valid choice. However we are ensure if the consequence of the interactions are strong enough since the influence can be negligible, which can be answered by regression.

Testing our final model