THIS IS A DRAFT

Extract the following variables from the data set:weathersit, workingday, temp, hum, windspeed, season and cnt.
Analysis

row_count <- nrow(data.selected)
shuffled_rows <- sample(row_count)
train <- data.selected[head(shuffled_rows,floor(row_count*0.75)),]
test <- data.selected[tail(shuffled_rows,floor(row_count*0.25)),]

Is there Overdispersion?
Yes, there is over dispersion since the variance is larger than the mean.

data.frame( mean = mean(cnt), variance = var(cnt), Overdispersion = ifelse(mean(cnt)<var(cnt), "Yes", "Not Overdispersed"))

Note

Modeling

Here are all the models that we would use. Now. We first need to look at the poisson model and see how well its fits the data. Next, we will try to improve this model using stepwise.

poisson.full <- glm(cnt ~ ., data = train, family='poisson')
quasi<- glm(cnt ~ ., data = train, family= quasi(variance = "mu^2") )
Quasip <- glm(cnt ~ ., data = train, family= quasipoisson(link = "log"))
# negbin <- glm(cnt ~ ., data = train, family= nb (link="log"))   

NB.GLM <- glm.nb(cnt ~ ., data = train)
Poisson Quasi Model Quasi Poisson Negative binomial StepWise selection
(Intercept) 7 . 491*** 1797 . 943*** 7 . 491*** 7 . 394*** 7 . 491***
(0 . 007) (178 . 445) (0 . 100) (0 . 103) (0 . 007)
weathersit −0 . 157*** −380 . 651*** −0 . 157*** −0 . 171*** −0 . 157***
(0 . 002) (74 . 903) (0 . 033) (0 . 037) (0 . 002)
workingday 0 . 018*** 303 . 915*** 0 . 018 0 . 042 0 . 018***
(0 . 002) (72 . 861) (0 . 031) (0 . 036) (0 . 002)
temp 1 . 348*** 5241 . 650*** 1 . 348*** 1 . 587*** 1 . 348***
(0 . 006) (257 . 138) (0 . 085) (0 . 099) (0 . 006)
hum −0 . 144*** −1371 . 111*** −0 . 144 −0 . 208 −0 . 144***
(0 . 010) (276 . 242) (0 . 131) (0 . 143) (0 . 010)
windspeed −0 . 635*** −3346 . 494*** −0 . 635** −0 . 719** −0 . 635***
(0 . 016) (448 . 275) (0 . 212) (0 . 233) (0 . 016)
season 0 . 141*** 372 . 719*** 0 . 141*** 0 . 152*** 0 . 141***
(0 . 001) (39 . 161) (0 . 016) (0 . 017) (0 . 001)
N 273 273 273 273 273
Likelihood-ratio 116359 . 013 44 . 781 116359 . 013 569 . 638 116359 . 013
p 0 . 000 0 . 000 0 . 000 0 . 000 0 . 000
AIC 55453 . 700 4469 . 349 55453 . 700


gam.train <- gam(cnt ~ weathersit +
                   s(temp,k = 15) +
                   s(hum, k = 15) +
                   s(windspeed, k = 15) +
                   s(season,k = 2),
                 family = poisson(link = "log"),
                 data = train)

par(mfrow=c(1,5)) #to partition the Plotting Window
plot(gam.train,se = TRUE) 

## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 11 iterations.
## Gradient range [8.707399e-06,4.082464e-05]
## (score 76.72169 & scale 1).
## Hessian positive definite, eigenvalue range [6.46379e-05,0.0003888078].
## Model rank =  46 / 46 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(temp)      14.0 14.0    0.96    0.22
## s(hum)       14.0 14.0    1.05    0.80
## s(windspeed) 14.0 13.9    1.15    1.00
## s(season)     2.0  2.0    1.05    0.76

The deviance residuals look terribe. So I try again. This time I will increase the number of knots for temp since hum and windspeed are really non significant.

gam.train.2 <- gam(cnt ~ s(season,k = 2) +
                         s(temp,k = 20) +
                         weathersit + hum + windspeed,
                         family = poisson(link = "log"),
                         data = train)
gam.check(gam.train.2)

## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [1.763566e-05,0.0002787183]
## (score 109.9051 & scale 1).
## Hessian positive definite, eigenvalue range [5.364215e-05,0.001513168].
## Model rank =  25 / 25 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##             k'  edf k-index p-value
## s(season)  2.0  2.0    0.98    0.30
## s(temp)   19.0 18.9    1.00    0.49
par(mfrow=c(1,5)) #to partition the Plotting Window
plot(gam.train,se = TRUE) 
plot(train$temp, train$cnt)

The residuals of the second model look a bit better. Lets try a third model where season is the only one with with a spline.

plot(fitted(gam.train),residuals(gam.train))

plot(fitted(gam.train.2),residuals(gam.train.2))

Gam Model 3, only season has splines:

## 
## Method: UBRE   Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [0.0003855256,0.0003855256]
## (score 179.7907 & scale 1).
## Hessian positive definite, eigenvalue range [0.0007958144,0.0007958144].
## Model rank =  7 / 7 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k' edf k-index p-value
## s(season)  2   2    0.94    0.17

looks bad.

gam.train.2.nb <- gam(cnt ~ s(season,k = 2) +
                         s(temp,k = 20) +
                         weathersit + hum + windspeed,
                         family = nb(link = "log"),
                         data = train)
## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible
gam.check(gam.train.2.nb)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [2.352533e-09,3.285471e-07]
## (score 2201.777 & scale 1).
## Hessian positive definite, eigenvalue range [0.4124818,130.3207].
## Model rank =  25 / 25 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'   edf k-index p-value
## s(season)  2.00  1.91    0.97    0.27
## s(temp)   19.00  5.11    0.96    0.21

Deviance seems to be not too bad for the nb. model. In the beginning I noted how negative bin would not make sence. I didnt know why or how to do it in GLM however using the mgcv package its quite easy.

gam.train.X <- gam(cnt ~ s(season, k = 2) +
                         s(temp,k = 20) +
                         weathersit + s(hum) + s(windspeed),
                         family = nb(link = "log"),
                         data = train)
## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible
gam.check(gam.train.X)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-4.75598e-05,7.788441e-05]
## (score 2186.079 & scale 1).
## Hessian positive definite, eigenvalue range [0.4027102,122.7165].
## Model rank =  41 / 41 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                 k'   edf k-index p-value
## s(season)     2.00  1.90    1.02    0.61
## s(temp)      19.00  5.48    0.95    0.22
## s(hum)        9.00  6.09    1.05    0.81
## s(windspeed)  9.00  4.87    1.09    0.95

Prediction

Here is some code to make a prediction and create a df. Knitter had probles compiling this but it does work. I think the problem is in showing the Df.
gam.2 is used to make predictions.

 b <- as.data.frame(predict(gam.train.2, newdata = test, se = TRUE, type = "response"))

 b$TestCount <- test$cnt
 b <- b[,c(3,1,2)]
 # head(b, n = 10)
 b <- b[1:10,]
 # kableExtra::kable(b) %>% kable_styling(bootstrap_options = "striped", full_width = F)
 d <- as.data.frame(predict(gam.train.2.nb, newdata = test, se = TRUE, type = "response"))

 d$TestCount <- test$cnt
 d <- d[,c(3,1,2)]
 # head(d, n = 10)
 d <- d[1:10,]
 # kableExtra::kable(d) %>% kable_styling(bootstrap_options = "striped", full_width = F)
#Train Set

plot(train$cnt,xlab="day",ylab="count",
     cex.lab=1.5,cex.axis=1.3,col="grey",cex=1.3,
     main="Poisson models fit (training data, n = 255)")
gam1 <- fitted(gam.train)
gam2 <- fitted(gam.train.2)
lines(gam1,col="blue",lwd=.8)
lines(gam2,col="red",lwd=2)
lines(fitted.NB,col="black",lwd=.75)

Interesting, The blue and red lines are GAM models the black line was our best fitting glm model. That would be the negative binominal model. The Gam models seems to fot much better around day 200.

AIC Gam’s

aic.m <- data.frame(AIC(gam.train,gam.train.2,gam.train.3, gam.train.2.nb, poisson.full, step.poisson.full, NB.GLM),BIC(gam.train,gam.train.2,gam.train.3, gam.train.2.nb, poisson.full, step.poisson.full, NB.GLM))

aic.m <- aic.m [, c(1,2,4)]
aic.m

This model #gam(cnt ~ s(season,k = 2) + s(temp,k = 20) + weathersit + hum + windspeed,family = nb(link = "log"), data = train) seems to be the best.

Model AIC
GAM 18329.71
GAM2 22763.04
GAM2.nb 4008.941
GAM.season 38028.52

Gam with negative Binoninal link looks like the best model based on AIC.

What we still need to do: