Extract the following variables from the data set: weathersit, workingday, temp, hum, windspeed, season, and cnt.
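A minimal sketch of this step, assuming the raw day-level data frame is called day.data (the name is an assumption; the result is the data.selected frame used throughout):

# Keep only the variables needed for the analysis
data.selected <- day.data[, c("weathersit", "workingday", "temp",
                              "hum", "windspeed", "season", "cnt")]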
Analysis
Select the most predictive regressors in a classical GLM manner and interpret your results.
Fit a negative binomial model and a quasi-Poisson model in a frequentist manner if there is overdispersion.
set.seed(1)  # assumption: fix the RNG so the split is reproducible
row_count <- nrow(data.selected)
train_idx <- sample(row_count, floor(row_count * 0.75))
train <- data.selected[train_idx, ]
test  <- data.selected[-train_idx, ]  # the remaining ~25% (no row is dropped)
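A quick sanity check that the two parts cover the data:

# Train and test together should account for every row exactly once
c(train = nrow(train), test = nrow(test), total = nrow(data.selected))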
Is there overdispersion?
Yes, there is overdispersion, since the variance of cnt is much larger than its mean.
data.frame(mean = mean(data.selected$cnt),
           variance = var(data.selected$cnt),
           overdispersed = ifelse(var(data.selected$cnt) > mean(data.selected$cnt),
                                  "Yes", "No"))
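A more formal check, once poisson.full has been fitted in the Modeling section below, is the Pearson dispersion statistic, which should be close to 1 for a well-specified Poisson model:

# Pearson chi-square divided by residual df; values well above 1
# indicate overdispersion (run after fitting poisson.full below)
sum(residuals(poisson.full, type = "pearson")^2) / df.residual(poisson.full)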
Modeling
Here are all the models we will use. First we look at the Poisson model and see how well it fits the data; then we try to improve on it with stepwise selection.
poisson.full <- glm(cnt ~ ., data = train, family = "poisson")
quasi <- glm(cnt ~ ., data = train, family = quasi(variance = "mu^2"))
Quasip <- glm(cnt ~ ., data = train, family = quasipoisson(link = "log"))
# nb() is an mgcv family and cannot be passed to glm(); use MASS::glm.nb() instead
NB.GLM <- MASS::glm.nb(cnt ~ ., data = train)
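The stepwise model in the comparison table (and the step.poisson.full object used later) is not fitted above; a minimal sketch, assuming AIC-based selection with step():

# Both-direction stepwise selection starting from the full Poisson model
step.poisson.full <- step(poisson.full, direction = "both", trace = FALSE)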
| | Poisson | Quasi | Quasi-Poisson | Negative binomial | Stepwise selection |
|---|---|---|---|---|---|
| (Intercept) | 7.491*** | 1797.943*** | 7.491*** | 7.394*** | 7.491*** |
| | (0.007) | (178.445) | (0.100) | (0.103) | (0.007) |
| weathersit | −0.157*** | −380.651*** | −0.157*** | −0.171*** | −0.157*** |
| | (0.002) | (74.903) | (0.033) | (0.037) | (0.002) |
| workingday | 0.018*** | 303.915*** | 0.018 | 0.042 | 0.018*** |
| | (0.002) | (72.861) | (0.031) | (0.036) | (0.002) |
| temp | 1.348*** | 5241.650*** | 1.348*** | 1.587*** | 1.348*** |
| | (0.006) | (257.138) | (0.085) | (0.099) | (0.006) |
| hum | −0.144*** | −1371.111*** | −0.144 | −0.208 | −0.144*** |
| | (0.010) | (276.242) | (0.131) | (0.143) | (0.010) |
| windspeed | −0.635*** | −3346.494*** | −0.635** | −0.719** | −0.635*** |
| | (0.016) | (448.275) | (0.212) | (0.233) | (0.016) |
| season | 0.141*** | 372.719*** | 0.141*** | 0.152*** | 0.141*** |
| | (0.001) | (39.161) | (0.016) | (0.017) | (0.001) |
| N | 273 | 273 | 273 | 273 | 273 |
| Likelihood ratio | 116359.013 | 44.781 | 116359.013 | 569.638 | 116359.013 |
| p | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| AIC | 55453.700 | | | 4469.349 | 55453.700 |

Standard errors in parentheses; the quasi-likelihood models report no AIC.
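Since the task asks for interpretation: the Poisson and negative binomial coefficients are on the log scale, so exponentiating gives rate ratios. For example, exp(1.348) ≈ 3.85 means the expected count is multiplied by roughly 3.85 per unit increase in temp (the full range, if temp is normalized as in the Bike Sharing data):

# Rate ratios: multiplicative effect on the expected count per unit
# increase of each regressor (e.g. exp(1.348) ~ 3.85 for temp)
exp(coef(poisson.full))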
library(mgcv)
gam.train <- gam(cnt ~ weathersit +
                   s(temp, k = 15) +
                   s(hum, k = 15) +
                   s(windspeed, k = 15) +
                   s(season, k = 2),
                 family = poisson(link = "log"),
                 data = train)
par(mfrow = c(1, 4))  # one panel per smooth term
plot(gam.train, se = TRUE)
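The diagnostics below come from gam.check():

gam.check(gam.train)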
##
## Method: UBRE Optimizer: outer newton
## full convergence after 11 iterations.
## Gradient range [8.707399e-06,4.082464e-05]
## (score 76.72169 & scale 1).
## Hessian positive definite, eigenvalue range [6.46379e-05,0.0003888078].
## Model rank = 46 / 46
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(temp) 14.0 14.0 0.96 0.22
## s(hum) 14.0 14.0 1.05 0.80
## s(windspeed) 14.0 13.9 1.15 1.00
## s(season) 2.0 2.0 1.05 0.76
The deviance residuals look terrible, so I try again. This time I increase the number of knots for temp and drop the smooths on hum and windspeed, since those were not significant.
gam.train.2 <- gam(cnt ~ s(season,k = 2) +
s(temp,k = 20) +
weathersit + hum + windspeed,
family = poisson(link = "log"),
data = train)
gam.check(gam.train.2)
##
## Method: UBRE Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [1.763566e-05,0.0002787183]
## (score 109.9051 & scale 1).
## Hessian positive definite, eigenvalue range [5.364215e-05,0.001513168].
## Model rank = 25 / 25
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(season) 2.0 2.0 0.98 0.30
## s(temp) 19.0 18.9 1.00 0.49
par(mfrow = c(1, 2))  # two smooth terms in this model
plot(gam.train.2, se = TRUE)
plot(train$temp, train$cnt)
The residuals of the second model look a bit better. Let's try a third model where season is the only term with a spline.
plot(fitted(gam.train), residuals(gam.train))      # model 1 residuals
plot(fitted(gam.train.2), residuals(gam.train.2))  # model 2 residuals
GAM model 3, where only season has a spline:
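The model-3 fit is not shown in the original; here is a reconstruction consistent with the reported model rank of 7 (intercept, four linear terms, and a two-parameter smooth):

# Reconstruction (an assumption, not the original code): only season is
# smoothed, all other regressors enter linearly
gam.train.3 <- gam(cnt ~ s(season, k = 2) +
                     weathersit + temp + hum + windspeed,
                   family = poisson(link = "log"),
                   data = train)
gam.check(gam.train.3)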
##
## Method: UBRE Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [0.0003855256,0.0003855256]
## (score 179.7907 & scale 1).
## Hessian positive definite, eigenvalue range [0.0007958144,0.0007958144].
## Model rank = 7 / 7
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(season) 2 2 0.94 0.17
This looks bad.
gam.train.2.nb <- gam(cnt ~ s(season,k = 2) +
s(temp,k = 20) +
weathersit + hum + windspeed,
family = nb(link = "log"),
data = train)
## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible
gam.check(gam.train.2.nb)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [2.352533e-09,3.285471e-07]
## (score 2201.777 & scale 1).
## Hessian positive definite, eigenvalue range [0.4124818,130.3207].
## Model rank = 25 / 25
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(season) 2.00 1.91 0.97 0.27
## s(temp) 19.00 5.11 0.96 0.21
The deviance does not look too bad for the negative binomial model. In the beginning I noted that the negative binomial would not make sense; I did not know why, or how to fit it as a GLM, but with the mgcv package it is quite easy: nb() can be passed as the family to gam().
gam.train.X <- gam(cnt ~ s(season, k = 2) +
s(temp,k = 20) +
weathersit + s(hum) + s(windspeed),
family = nb(link = "log"),
data = train)
## Warning in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): basis dimension, k, increased to minimum possible
gam.check(gam.train.X)
##
## Method: REML Optimizer: outer newton
## full convergence after 7 iterations.
## Gradient range [-4.75598e-05,7.788441e-05]
## (score 2186.079 & scale 1).
## Hessian positive definite, eigenvalue range [0.4027102,122.7165].
## Model rank = 41 / 41
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(season) 2.00 1.90 1.02 0.61
## s(temp) 19.00 5.48 0.95 0.22
## s(hum) 9.00 6.09 1.05 0.81
## s(windspeed) 9.00 4.87 1.09 0.95
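gam.train.X also smooths hum and windspeed; whether that improves on gam.train.2.nb can be checked directly (this comparison is not in the original output):

AIC(gam.train.2.nb, gam.train.X)  # lower AIC indicates the better trade-off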
Prediction
Here is some code to make predictions and collect them in a data frame. knitr had problems compiling this, but the code itself works; I think the problem is in displaying the data frame.
gam.train.2 is used to make the predictions.
b <- as.data.frame(predict(gam.train.2, newdata = test, se.fit = TRUE, type = "response"))
b$TestCount <- test$cnt
b <- b[, c(3, 1, 2)]  # order: observed count, fitted value, standard error
b <- b[1:10, ]        # show the first ten test observations
# kableExtra::kable(b) %>% kable_styling(bootstrap_options = "striped", full_width = F)
d <- as.data.frame(predict(gam.train.2.nb, newdata = test, se.fit = TRUE, type = "response"))
d$TestCount <- test$cnt
d <- d[, c(3, 1, 2)]  # order: observed count, fitted value, standard error
d <- d[1:10, ]        # show the first ten test observations
# kableExtra::kable(d) %>% kable_styling(bootstrap_options = "striped", full_width = F)
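As a rough out-of-sample accuracy check (not part of the original analysis), the test-set RMSE of the two models can be compared:

# Hypothetical helper: root mean squared error on the held-out test set
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(test$cnt, predict(gam.train.2, newdata = test, type = "response"))
rmse(test$cnt, predict(gam.train.2.nb, newdata = test, type = "response"))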
# Train set: observed counts with fitted values overlaid
plot(train$cnt, xlab = "day", ylab = "count",
     cex.lab = 1.5, cex.axis = 1.3, col = "grey", cex = 1.3,
     main = "Poisson models fit (training data, n = 273)")
gam1 <- fitted(gam.train)
gam2 <- fitted(gam.train.2)
lines(gam1, col = "blue", lwd = .8)
lines(gam2, col = "red", lwd = 2)
lines(fitted(NB.GLM), col = "black", lwd = .75)  # best-fitting GLM
Interesting: the blue and red lines are the GAM models; the black line is our best-fitting GLM, the negative binomial model. The GAM models seem to fit much better around day 200.
AIC of the GAMs
aic.m <- data.frame(
  AIC(gam.train, gam.train.2, gam.train.3, gam.train.2.nb,
      poisson.full, step.poisson.full, NB.GLM),
  BIC(gam.train, gam.train.2, gam.train.3, gam.train.2.nb,
      poisson.full, step.poisson.full, NB.GLM)
)
aic.m <- aic.m[, c(1, 2, 4)]  # keep df, AIC, BIC (drop the duplicated df column)
aic.m
This model, gam(cnt ~ s(season, k = 2) + s(temp, k = 20) + weathersit + hum + windspeed, family = nb(link = "log"), data = train) (i.e. gam.train.2.nb), seems to be the best.
| Model | AIC |
|---|---|
| gam.train (GAM 1) | 18329.71 |
| gam.train.2 (GAM 2) | 22763.04 |
| gam.train.2.nb (GAM 2, negative binomial) | 4008.941 |
| gam.train.3 (season spline only) | 38028.52 |
The GAM with the negative binomial family looks like the best model based on AIC.
What we still need to do: