Extract the following variables from the data set: weathersit, workingday, temp, hum, windspeed, season, and cnt.
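A minimal sketch of this step, assuming the raw day-level data frame is called day.data (the name is an assumption; the result is the data.selected frame used throughout):

# Keep only the variables needed for the analysis
data.selected <- day.data[, c("weathersit", "workingday", "temp",
                              "hum", "windspeed", "season", "cnt")]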
Analysis
Select the most predictive regressors in a classical GLM manner and interpret your results.
Fit a negative binomial model and a quasi-Poisson model in a frequentist manner if there is overdispersion.
set.seed(1)  # assumption: fix the RNG so the split is reproducible
row_count <- nrow(data.selected)
train_idx <- sample(row_count, floor(row_count * 0.75))
train <- data.selected[train_idx, ]
test  <- data.selected[-train_idx, ]  # the remaining ~25% (no row is dropped)
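A quick sanity check that the two parts cover the data:

# Train and test together should account for every row exactly once
c(train = nrow(train), test = nrow(test), total = nrow(data.selected))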
Is there overdispersion?
Yes, there is overdispersion, since the variance of cnt is much larger than its mean.
data.frame(mean = mean(data.selected$cnt),
           variance = var(data.selected$cnt),
           overdispersed = ifelse(var(data.selected$cnt) > mean(data.selected$cnt),
                                  "Yes", "No"))
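A more formal check, once poisson.full has been fitted in the Modeling section below, is the Pearson dispersion statistic, which should be close to 1 for a well-specified Poisson model:

# Pearson chi-square divided by residual df; values well above 1
# indicate overdispersion (run after fitting poisson.full below)
sum(residuals(poisson.full, type = "pearson")^2) / df.residual(poisson.full)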
Modeling
Here are all the models we will use. First we look at the Poisson model and see how well it fits the data; then we try to improve on it with stepwise selection.
poisson.full <- glm(cnt ~ ., data = train, family = "poisson")
quasi <- glm(cnt ~ ., data = train, family = quasi(variance = "mu^2"))
Quasip <- glm(cnt ~ ., data = train, family = quasipoisson(link = "log"))
# nb() is an mgcv family and cannot be passed to glm(); use MASS::glm.nb() instead
NB.GLM <- MASS::glm.nb(cnt ~ ., data = train)
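The stepwise model in the comparison table (and the step.poisson.full object used later) is not fitted above; a minimal sketch, assuming AIC-based selection with step():

# Both-direction stepwise selection starting from the full Poisson model
step.poisson.full <- step(poisson.full, direction = "both", trace = FALSE)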
| | Poisson | Quasi | Quasi-Poisson | Negative binomial | Stepwise selection |
|---|---|---|---|---|---|
| (Intercept) | 7.491*** | 1797.943*** | 7.491*** | 7.394*** | 7.491*** |
| | (0.007) | (178.445) | (0.100) | (0.103) | (0.007) |
| weathersit | −0.157*** | −380.651*** | −0.157*** | −0.171*** | −0.157*** |
| | (0.002) | (74.903) | (0.033) | (0.037) | (0.002) |
| workingday | 0.018*** | 303.915*** | 0.018 | 0.042 | 0.018*** |
| | (0.002) | (72.861) | (0.031) | (0.036) | (0.002) |
| temp | 1.348*** | 5241.650*** | 1.348*** | 1.587*** | 1.348*** |
| | (0.006) | (257.138) | (0.085) | (0.099) | (0.006) |
| hum | −0.144*** | −1371.111*** | −0.144 | −0.208 | −0.144*** |
| | (0.010) | (276.242) | (0.131) | (0.143) | (0.010) |
| windspeed | −0.635*** | −3346.494*** | −0.635** | −0.719** | −0.635*** |
| | (0.016) | (448.275) | (0.212) | (0.233) | (0.016) |
| season | 0.141*** | 372.719*** | 0.141*** | 0.152*** | 0.141*** |
| | (0.001) | (39.161) | (0.016) | (0.017) | (0.001) |
| N | 273 | 273 | 273 | 273 | 273 |
| Likelihood ratio | 116359.013 | 44.781 | 116359.013 | 569.638 | 116359.013 |
| p | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| AIC | 55453.700 | | | 4469.349 | 55453.700 |

Standard errors in parentheses; the quasi-likelihood models report no AIC.
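Since the task asks for interpretation: the Poisson and negative binomial coefficients are on the log scale, so exponentiating gives rate ratios. For example, exp(1.348) ≈ 3.85 means the expected count is multiplied by roughly 3.85 per unit increase in temp (the full range, if temp is normalized as in the Bike Sharing data):

# Rate ratios: multiplicative effect on the expected count per unit
# increase of each regressor (e.g. exp(1.348) ~ 3.85 for temp)
exp(coef(poisson.full))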
library(mgcv)
gam.train <- gam(cnt ~ weathersit +
                   s(temp, k = 15) +
                   s(hum, k = 15) +
                   s(windspeed, k = 15) +
                   s(season, k = 2),
                 family = poisson(link = "log"),
                 data = train)
par(mfrow = c(1, 4))  # one panel per smooth term
plot(gam.train, se = TRUE)
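The diagnostics below come from gam.check():

gam.check(gam.train)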
##
## Method: UBRE Optimizer: outer newton
## full convergence after 11 iterations.
## Gradient range [8.707399e-06,4.082464e-05]
## (score 76.72169 & scale 1).
## Hessian positive definite, eigenvalue range [6.46379e-05,0.0003888078].
## Model rank = 46 / 46
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(temp) 14.0 14.0 0.96 0.22
## s(hum) 14.0 14.0 1.05 0.80
## s(windspeed) 14.0 13.9 1.15 1.00
## s(season) 2.0 2.0 1.05 0.76
The deviance residuals look terrible, so I try again. This time I increase the number of knots for temp and drop the smooths on hum and windspeed, since those were not significant.
gam.train.2 <- gam(cnt ~ s(season,k = 2) +
s(temp,k = 20) +
weathersit + hum + windspeed,
family = poisson(link = "log"),
data = train)
gam.check(gam.train.2)
##
## Method: UBRE Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [1.763566e-05,0.0002787183]
## (score 109.9051 & scale 1).
## Hessian positive definite, eigenvalue range [5.364215e-05,0.001513168].
## Model rank = 25 / 25
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(season) 2.0 2.0 0.98 0.30
## s(temp) 19.0 18.9 1.00 0.49
par(mfrow = c(1, 2))  # two smooth terms in this model
plot(gam.train.2, se = TRUE)
plot(train$temp, train$cnt)
The residuals of the second model look a bit better. Let's try a third model where season is the only term with a spline.
plot(fitted(gam.train), residuals(gam.train))      # model 1 residuals
plot(fitted(gam.train.2), residuals(gam.train.2))  # model 2 residuals
GAM model 3, where only season has a spline:
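The model-3 fit is not shown in the original; here is a reconstruction consistent with the reported model rank of 7 (intercept, four linear terms, and a two-parameter smooth):

# Reconstruction (an assumption, not the original code): only season is
# smoothed, all other regressors enter linearly
gam.train.3 <- gam(cnt ~ s(season, k = 2) +
                     weathersit + temp + hum + windspeed,
                   family = poisson(link = "log"),
                   data = train)
gam.check(gam.train.3)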
##
## Method: UBRE Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [0.0003855256,0.0003855256]
## (score 179.7907 & scale 1).
## Hessian positive definite, eigenvalue range [0.0007958144,0.0007958144].
## Model rank = 7 / 7
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(season) 2 2 0.94 0.17
This looks bad.
gam.train.2.nb <- gam(cnt ~ s(season,k = 2) +
s(temp,k = 20) +
weathersit + hum + windspeed,
family = nb(link = "log"),
data = train)
## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible
gam.check(gam.train.2.nb)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [2.352533e-09,3.285471e-07]
## (score 2201.777 & scale 1).
## Hessian positive definite, eigenvalue range [0.4124818,130.3207].
## Model rank = 25 / 25
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(season) 2.00 1.91 0.97 0.27
## s(temp) 19.00 5.11 0.96 0.21
The deviance does not look too bad for the negative binomial model. In the beginning I noted that the negative binomial would not make sense; I did not know why, or how to fit it as a GLM, but with the mgcv package it is quite easy: nb() can be passed as the family to gam().
gam.train.X <- gam(cnt ~ s(season, k = 2) +
s(temp,k = 20) +
weathersit + s(hum) + s(windspeed),
family = nb(link = "log"),
data = train)
## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible
gam.check(gam.train.X)
##
## Method: REML Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-4.75598e-05,7.788441e-05]
## (score 2186.079 & scale 1).
## Hessian positive definite, eigenvalue range [0.4027102,122.7165].
## Model rank = 41 / 41
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(season) 2.00 1.90 1.02 0.61
## s(temp) 19.00 5.48 0.95 0.22
## s(hum) 9.00 6.09 1.05 0.81
## s(windspeed) 9.00 4.87 1.09 0.95
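gam.train.X also smooths hum and windspeed; whether that improves on gam.train.2.nb can be checked directly (this comparison is not in the original output):

AIC(gam.train.2.nb, gam.train.X)  # lower AIC indicates the better trade-off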
Prediction
Here is some code to make predictions and collect them in a data frame. knitr had problems compiling this, but the code itself works; I think the problem is in displaying the data frame.
gam.train.2 is used to make the predictions.
b <- as.data.frame(predict(gam.train.2, newdata = test, se.fit = TRUE, type = "response"))
b$TestCount <- test$cnt
b <- b[, c(3, 1, 2)]  # order: observed count, fitted value, standard error
b <- b[1:10, ]        # show the first ten test observations
# kableExtra::kable(b) %>% kable_styling(bootstrap_options = "striped", full_width = F)
d <- as.data.frame(predict(gam.train.2.nb, newdata = test, se.fit = TRUE, type = "response"))
d$TestCount <- test$cnt
d <- d[, c(3, 1, 2)]  # order: observed count, fitted value, standard error
d <- d[1:10, ]        # show the first ten test observations
# kableExtra::kable(d) %>% kable_styling(bootstrap_options = "striped", full_width = F)
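As a rough out-of-sample accuracy check (not part of the original analysis), the test-set RMSE of the two models can be compared:

# Hypothetical helper: root mean squared error on the held-out test set
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(test$cnt, predict(gam.train.2, newdata = test, type = "response"))
rmse(test$cnt, predict(gam.train.2.nb, newdata = test, type = "response"))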
# Train set: observed counts with fitted values overlaid
plot(train$cnt, xlab = "day", ylab = "count",
     cex.lab = 1.5, cex.axis = 1.3, col = "grey", cex = 1.3,
     main = "Poisson models fit (training data, n = 273)")
gam1 <- fitted(gam.train)
gam2 <- fitted(gam.train.2)
lines(gam1, col = "blue", lwd = .8)
lines(gam2, col = "red", lwd = 2)
lines(fitted(NB.GLM), col = "black", lwd = .75)  # best-fitting GLM
Interesting: the blue and red lines are the GAM models; the black line is our best-fitting GLM, the negative binomial model. The GAM models seem to fit much better around day 200.
AIC of the GAMs
aic.m <- data.frame(
  AIC(gam.train, gam.train.2, gam.train.3, gam.train.2.nb,
      poisson.full, step.poisson.full, NB.GLM),
  BIC(gam.train, gam.train.2, gam.train.3, gam.train.2.nb,
      poisson.full, step.poisson.full, NB.GLM)
)
aic.m <- aic.m[, c(1, 2, 4)]  # keep df, AIC, BIC (drop the duplicated df column)
aic.m
This model, gam(cnt ~ s(season, k = 2) + s(temp, k = 20) + weathersit + hum + windspeed, family = nb(link = "log"), data = train) (i.e. gam.train.2.nb), seems to be the best.
| Model | AIC |
|---|---|
| gam.train (GAM 1) | 18329.71 |
| gam.train.2 (GAM 2) | 22763.04 |
| gam.train.2.nb (GAM 2, negative binomial) | 4008.941 |
| gam.train.3 (season spline only) | 38028.52 |
The GAM with the negative binomial family looks like the best model based on AIC.
What we still need to do: