The regression problem this investigation aims to explore is multilinear regression. Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season and other factors can affect the rental behaviours. We will attempt to fit a multiple linear regression model, then evaluate the adequacy of the fitted model and rebuild for a better model. We expect to see multilinear relationship between the total number of bike rental users and the variables of interest. The results of the investigation could be used to further predict bike rental daily users.
The research dataset bike was obtained from the UCI website http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. This data folder contains information on bike sharing on both daily and hourly basis. For the purpose of this investigation, we chose to focus on the dataset containing daily bike sharing information. The daily bike sharing data contains 731 observations of the following variables:
instant: record index
dteday: date
season: season (1: spring, 2: summer, 3: fall, 4: winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
holiday: weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday: day of the week
workingday: if day is neither weekend nor holiday - 1, otherwise - 0.
weathersit:
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp: Normalised temperature in Celsius. The values are divided to 41 (max)
atemp: Normalised feeling temperature in Celsius. The values are divided to 50 (max)
hum: Normalised humidity. The values are divided to 100 (max)
windspeed: Normalised wind speed. The values are divided to 67 (max)
casual: Count of casual users
registered: Count of registered users
cnt: Count of total rental bikes including both casual and registered
The decision was made to keep season, workingday, weathersit, temp, hum, windspeed as the predictor variables, cnt - is the response variable of total number of bike rental users. season, workingday and weathersit are treated as indicator variables. The following code creates a bike data frame and displays its first observations.
bike <- read_csv("day.csv")
## Parsed with column specification:
## cols(
## instant = col_double(),
## dteday = col_date(format = ""),
## season = col_double(),
## yr = col_double(),
## mnth = col_double(),
## holiday = col_double(),
## weekday = col_double(),
## workingday = col_double(),
## weathersit = col_double(),
## temp = col_double(),
## atemp = col_double(),
## hum = col_double(),
## windspeed = col_double(),
## casual = col_double(),
## registered = col_double(),
## cnt = col_double()
## )
bike <- bike[,c("season", "workingday", "weathersit", "temp", "hum", "windspeed", "cnt")]
head(bike)
## # A tibble: 6 x 7
## season workingday weathersit temp hum windspeed cnt
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 2 0.344 0.806 0.160 985
## 2 1 0 2 0.363 0.696 0.249 801
## 3 1 1 1 0.196 0.437 0.248 1349
## 4 1 1 1 0.2 0.590 0.160 1562
## 5 1 1 1 0.227 0.437 0.187 1600
## 6 1 1 1 0.204 0.518 0.0896 1606
The following code fits the full model with all the predictors.
model <- lm(cnt ~ factor(season) + factor(workingday) + factor(weathersit) + temp + hum + windspeed, data = bike)
summary(model)
##
## Call:
## lm(formula = cnt ~ factor(season) + factor(workingday) + factor(weathersit) +
## temp + hum + windspeed, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3714.8 -918.5 -243.6 1059.9 4214.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3024.4 349.3 8.659 < 2e-16 ***
## factor(season)2 932.3 178.6 5.220 2.34e-07 ***
## factor(season)3 483.2 235.6 2.051 0.0406 *
## factor(season)4 1499.6 152.4 9.841 < 2e-16 ***
## factor(workingday)1 155.0 103.7 1.496 0.1352
## factor(weathersit)2 -232.3 127.7 -1.818 0.0694 .
## factor(weathersit)3 -1929.7 326.8 -5.905 5.43e-09 ***
## temp 6159.1 481.1 12.802 < 2e-16 ***
## hum -2608.2 461.1 -5.656 2.23e-08 ***
## windspeed -3306.0 674.5 -4.901 1.18e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1296 on 721 degrees of freedom
## Multiple R-squared: 0.5577, Adjusted R-squared: 0.5522
## F-statistic: 101 on 9 and 721 DF, p-value: < 2.2e-16
anova(model)
## Analysis of Variance Table
##
## Response: cnt
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(season) 3 950595868 316865289 188.5553 < 2.2e-16 ***
## factor(workingday) 1 5469478 5469478 3.2547 0.07164 .
## factor(weathersit) 2 248464999 124232499 73.9263 < 2.2e-16 ***
## temp 1 249905725 249905725 148.7100 < 2.2e-16 ***
## hum 1 33093919 33093919 19.6930 1.052e-05 ***
## windspeed 1 40371975 40371975 24.0239 1.176e-06 ***
## Residuals 721 1211633428 1680490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The hypothesis tested for significance of regression:
\(H_0 : β_1 = β_2 =...= β_k = 0\)
\(H_A: β_j ≠0\) for at least one \(j\)
The output of the ANOVA table splits the \(MS_{reg}\) between the regressors, the F-statistic can either be calculated from the ANOVA table or taken from the model summary. The reported F-statistic is 101 with a p-value < 0.05, therefore, we reject \(H_0\), the overall regression using all predictors is statistically significant at 5% level. Agjusted \(R^2\) = 0.55, so the model explains about 55% of variability in the number of bike rental users.
The hypothesis for significance of individual coefficients:
\(H_0 : \beta_j = 0\)
\(H_A: \beta_j ≠0\)
The individual t-test on the coefficients is obtained from model summary. We can conclude that the coefficients of workingday = 1 and for weathersit = 2 are not statistically significant at 5% level. All coefficients of other predictors were reported to be statistically significant at 5% level.
From the output of the ANOVA table, we can conclude that the addition of workingday to the regression is not statistically significant at 5% level, so it is suggested to be left out of the model.
The following code checks the multicollinearity assumption
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## factor(season) 3.504848 3 1.232475
## factor(workingday) 1.010191 1 1.005082
## factor(weathersit) 1.780069 2 1.155072
## temp 3.369325 1 1.835572
## hum 1.873643 1 1.368811
## windspeed 1.186928 1 1.089462
Based on GVIF^(1/(2*Df)), we can conclude that there is no multicollinearity present between the predictor variables because none of the values is greater than 2.
The function residual.analysis performs plots and tests to check the assumptions behind the estimated model.
residual.analysis <- function(model){
res.model = rstandard(model)
par(mfrow=c(2,2))
plot(model, which = c(1,1))
hist(res.model,main="Histogram of standardised residuals")
qqnorm(res.model,main="QQ plot of standardised residuals")
qqline(res.model, col = 2, lty = 2)
acf(res.model, main="ACF of standardised residuals")
print(shapiro.test(res.model))
print(durbinWatsonTest(model))
print(ncvTest(model))
}
residual.analysis(model)
Figure 1 - Residual Analysis plots
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.98312, p-value = 1.845e-07
##
## lag Autocorrelation D-W Statistic p-value
## 1 0.7807316 0.4369705 0
## Alternative hypothesis: rho != 0
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 32.65956, Df = 1, p = 1.098e-08
From the diagnostic checks in Figure 1 and tests reports, we can assume:
Errors seem to be spread nonrandomly. The plot of residuals against fitted values shows that errors are mostly associated with larger values of fitted values and tend to gather on the right side of the plot. We conclude that the assumption of random errors is violated.
The histogram shows the distribution of standardised residuals, it does not appear normal. Normal QQ Plot: Residuals do not fall close to the line at the both ends. There could be possible deviations from normality.
ACF plot: All the lags are highly significant which indicates that our model does not capture the autocorrelation structure in the research data.
The tests performed on the residuals support the existence of autocorrelation in the residuals, we reject \(H_0\) that the errors are uncorrelated (p-value < 0.05). The normality assumption is violated with p-value < 0.05 reported by Shapiro-Wilk test. The assumption of constant variance does not hold, we reject H0 that errors have constant variance with reported p-value < 0.05. Based on the residual analysis, we can conclude that the model is not an adequate fit for our data.
The following code performs backward elimination to come up with smaller parsimonious model
drop1(model, test="F")
## Single term deletions
##
## Model:
## cnt ~ factor(season) + factor(workingday) + factor(weathersit) +
## temp + hum + windspeed
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1211633428 10488
## factor(season) 3 208219020 1419852448 10598 41.3012 < 2.2e-16 ***
## factor(workingday) 1 3759531 1215392958 10489 2.2372 0.1352
## factor(weathersit) 2 58714379 1270347807 10519 17.4694 3.901e-08 ***
## temp 1 275398673 1487032100 10636 163.8800 < 2.2e-16 ***
## hum 1 53768096 1265401523 10518 31.9955 2.229e-08 ***
## windspeed 1 40371975 1252005403 10510 24.0239 1.176e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(update(model, ~ . -factor(workingday)), test = "F")
## Single term deletions
##
## Model:
## cnt ~ factor(season) + factor(weathersit) + temp + hum + windspeed
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1215392958 10489
## factor(season) 3 208414000 1423806958 10598 41.269 < 2.2e-16 ***
## factor(weathersit) 2 57331974 1272724932 10518 17.029 5.937e-08 ***
## temp 1 280714291 1496107250 10639 166.757 < 2.2e-16 ***
## hum 1 54968383 1270361341 10519 32.654 1.611e-08 ***
## windspeed 1 41004780 1256397738 10511 24.359 9.937e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Partial F test indicates the variable workingday can be left out of the model with reported p-value > 0.05.
r <- regsubsets(cnt ~ factor(season) + factor(workingday) + factor(weathersit) + temp + hum + windspeed,
data = bike)
plot(r, scale="Cp")
plot(r, scale="adjr2")
Both Mallows’s Cp and Adjusted \(R^2\) criteria of best subsets regression agree to leave out workingday
The following code fits a new model with season, weathersit, temp, hum and windspeed variables
submodel.1 <- lm(cnt ~ factor(season) + factor(weathersit) + temp + hum + windspeed, data = bike)
summary(submodel.1)
##
## Call:
## lm(formula = cnt ~ factor(season) + factor(weathersit) + temp +
## hum + windspeed, data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3830.0 -926.4 -253.8 1097.0 4095.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3128.9 342.5 9.135 < 2e-16 ***
## factor(season)2 927.0 178.7 5.187 2.78e-07 ***
## factor(season)3 471.3 235.7 2.000 0.0459 *
## factor(season)4 1496.8 152.5 9.814 < 2e-16 ***
## factor(weathersit)2 -218.8 127.5 -1.715 0.0867 .
## factor(weathersit)3 -1902.9 326.6 -5.827 8.51e-09 ***
## temp 6205.4 480.5 12.913 < 2e-16 ***
## hum -2635.2 461.1 -5.714 1.61e-08 ***
## windspeed -3330.8 674.9 -4.935 9.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1297 on 722 degrees of freedom
## Multiple R-squared: 0.5564, Adjusted R-squared: 0.5514
## F-statistic: 113.2 on 8 and 722 DF, p-value: < 2.2e-16
anova(submodel.1)
## Analysis of Variance Table
##
## Response: cnt
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(season) 3 950595868 316865289 188.233 < 2.2e-16 ***
## factor(weathersit) 2 243512285 121756142 72.329 < 2.2e-16 ***
## temp 1 255086767 255086767 151.533 < 2.2e-16 ***
## hum 1 33942733 33942733 20.164 8.274e-06 ***
## windspeed 1 41004780 41004780 24.359 9.937e-07 ***
## Residuals 722 1215392958 1683370
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The hypothesis tested for significance of regression:
\(H_0 : β_1 = β_2 =...= β_k = 0\)
\(H_A: β_j ≠0\) for at least one \(j\)
The reported F-statistic from model summary output is 109.7 with a p-value < 0.05, therefore, we reject \(H_0\), the overall regression using all predictors is statistically significant at 5% level. Agjusted \(R^2\) = 0.54, so the model explains about 54% of variability in the number of bike rental users.
The hypothesis for significance of individual coefficients:
\(H_0 : \beta_j = 0\)
\(H_A: \beta_j ≠0\)
The individual t-test on the coefficients is obtained from model summary. All predictors were reported to be statistically significant at 5% level.
residual.analysis <- function(model){
res.model = rstandard(model)
par(mfrow=c(2,2))
plot(model, which = c(1,1))
hist(res.model,main="Histogram of standardised residuals")
qqnorm(res.model,main="QQ plot of standardised residuals")
qqline(res.model, col = 2, lty = 2)
acf(res.model, main="ACF of standardised residuals")
print(shapiro.test(res.model))
print(durbinWatsonTest(model))
print(ncvTest(model))
}
residual.analysis(submodel.1)
Figure 2 - Residual Analysis plots
##
## Shapiro-Wilk normality test
##
## data: res.model
## W = 0.98202, p-value = 8.069e-08
##
## lag Autocorrelation D-W Statistic p-value
## 1 0.7807629 0.4366348 0
## Alternative hypothesis: rho != 0
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 33.03884, Df = 1, p = 9.0336e-09
From the diagnostic checks in Figure 2 and tests reports, we can conclude that the assumptions of randomness, normality and homogeneity of errors are violated. The ACF and Durbin-Watson test suggest that the errors are highly correlated, so the updated model is not adequate for capturing the autocorrelation structure of the research data.
Assuming that the response variable of number of bike rental users follows a Poisson distribution, we attempt to fit a regression model using the log link.
model.full<-glm(cnt ~ factor(season) + factor(workingday) + factor(weathersit) + temp + hum + windspeed, family=poisson(link="log"), data = bike)
summary.glm(model.full)
##
## Call:
## glm(formula = cnt ~ factor(season) + factor(workingday) + factor(weathersit) +
## temp + hum + windspeed, family = poisson(link = "log"), data = bike)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -59.115 -15.771 -3.352 15.691 68.686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.933928 0.004272 1856.98 <2e-16 ***
## factor(season)2 0.313370 0.002270 138.02 <2e-16 ***
## factor(season)3 0.197897 0.002831 69.89 <2e-16 ***
## factor(season)4 0.458829 0.001978 231.93 <2e-16 ***
## factor(workingday)1 0.039978 0.001203 33.23 <2e-16 ***
## factor(weathersit)2 -0.055284 0.001488 -37.14 <2e-16 ***
## factor(weathersit)3 -0.745276 0.005472 -136.20 <2e-16 ***
## temp 1.381391 0.005599 246.74 <2e-16 ***
## hum -0.572890 0.005445 -105.21 <2e-16 ***
## windspeed -0.723194 0.008116 -89.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 668801 on 730 degrees of freedom
## Residual deviance: 298057 on 721 degrees of freedom
## AIC: 305476
##
## Number of Fisher Scoring iterations: 4
deviance(model.full)
## [1] 298057.5
pchisq(model.full$deviance, df=model.full$df.residual, lower.tail=FALSE)
## [1] 0
The output of the poisson model summary suggest that all the predictors have a statistically significant effect on the response with all p-values < 0.05.
The hypothesis behind testing model adequacy:
\(H_0:\) the model fits the data well
\(H_A:\) the model does not fit the data well
The chi-square test statistic of 298057 with 721 degrees of freedom gives a p-value of 0, indicating that we should reject the \(H_0\) hypothesis, and we can conclude that the poisson model is not adequate for our data.
anova(model.full,test="Chi")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: cnt
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 730 668801
## factor(season) 3 232841 727 435960 < 2.2e-16 ***
## factor(workingday) 1 1227 726 434732 < 2.2e-16 ***
## factor(weathersit) 2 65261 724 369472 < 2.2e-16 ***
## temp 1 56481 723 312990 < 2.2e-16 ***
## hum 1 6916 722 306074 < 2.2e-16 ***
## windspeed 1 8017 721 298057 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(model.full, test="Chi")
## Single term deletions
##
## Model:
## cnt ~ factor(season) + factor(workingday) + factor(weathersit) +
## temp + hum + windspeed
## Df Deviance AIC LRT Pr(>Chi)
## <none> 298057 305476
## factor(season) 3 367366 374779 69309 < 2.2e-16 ***
## factor(workingday) 1 299167 306584 1110 < 2.2e-16 ***
## factor(weathersit) 2 320701 328116 22644 < 2.2e-16 ***
## temp 1 359006 366422 60948 < 2.2e-16 ***
## hum 1 309055 316472 10997 < 2.2e-16 ***
## windspeed 1 306074 313491 8017 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table and backward elimination using partial Chisq test with p-values < 0.05 indicate that all the predictors are important and do not suggest a smaller parsimonious model.
par(mfrow=c(1,2))
plot(residuals(model.full, type="deviance"))
qqnorm(residuals(model.full, type="deviance"))
qqline(residuals(model.full, type="deviance"), lty = 2, col = 2)
Figure 3 - Residual Analysis plots
shapiro.test(residuals(model.full, type="deviance"))
##
## Shapiro-Wilk normality test
##
## data: residuals(model.full, type = "deviance")
## W = 0.98468, p-value = 6.256e-07
Figure 3 shows the plots of deviance residuals from the poisson model. The errors are not spread randomly, the Q-Q plot and Shapiro-Wilk test indicate that the normality does not hold. So, we can conclude that the model does not appear satisfactory from the residual analysis viewpoint.
The goal of the research project was to fit a multiple linear regression model that can be suitable to predict the number of bike rental users. We expected to see a multilinear relationship between the response variable and predictors chosen for this task. However, the results of our study showed that none of the models fitted were adequate in terms of residual checks.
Overall, we conclude that multiple linear regression was not suitable for the research data. Highly significant autocorrelation structure in the residuals from regression models suggests that future investigation could use Time Series analysis methods to predict the number of bike rental users.