MATH1312 Final Project

Introduction

The regression problem this investigation aims to explore is multilinear regression. Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season and other factors can affect the rental behaviours. We will attempt to fit a multiple linear regression model, then evaluate the adequacy of the fitted model and rebuild for a better model. We expect to see multilinear relationship between the total number of bike rental users and the variables of interest. The results of the investigation could be used to further predict bike rental daily users.

The research dataset bike was obtained from the UCI website http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. This data folder contains information on bike sharing on both daily and hourly basis. For the purpose of this investigation, we chose to focus on the dataset containing daily bike sharing information. The daily bike sharing data contains 731 observations of the following variables:

instant: record index
dteday: date
season: season (1: spring, 2: summer, 3: fall, 4: winter)
yr: year (0: 2011, 1: 2012)
mnth: month (1 to 12)
holiday: weather day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
weekday: day of the week
workingday: if day is neither weekend nor holiday - 1, otherwise - 0.
weathersit:
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
temp: Normalised temperature in Celsius. The values are divided to 41 (max)
atemp: Normalised feeling temperature in Celsius. The values are divided to 50 (max)
hum: Normalised humidity. The values are divided to 100 (max)
windspeed: Normalised wind speed. The values are divided to 67 (max)
casual: Count of casual users
registered: Count of registered users
cnt: Count of total rental bikes including both casual and registered

The decision was made to keep season, workingday, weathersit, temp, hum, windspeed as the predictor variables, cnt - is the response variable of total number of bike rental users. season, workingday and weathersit are treated as indicator variables. The following code creates a bike data frame and displays its first observations.

bike <- read_csv("day.csv")

## Parsed with column specification:
## cols(
##   instant = col_double(),
##   dteday = col_date(format = ""),
##   season = col_double(),
##   yr = col_double(),
##   mnth = col_double(),
##   holiday = col_double(),
##   weekday = col_double(),
##   workingday = col_double(),
##   weathersit = col_double(),
##   temp = col_double(),
##   atemp = col_double(),
##   hum = col_double(),
##   windspeed = col_double(),
##   casual = col_double(),
##   registered = col_double(),
##   cnt = col_double()
## )

bike <- bike[,c("season", "workingday", "weathersit", "temp", "hum", "windspeed", "cnt")]
head(bike)

## # A tibble: 6 x 7
##   season workingday weathersit  temp   hum windspeed   cnt
##    <dbl>      <dbl>      <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1      1          0          2 0.344 0.806    0.160    985
## 2      1          0          2 0.363 0.696    0.249    801
## 3      1          1          1 0.196 0.437    0.248   1349
## 4      1          1          1 0.2   0.590    0.160   1562
## 5      1          1          1 0.227 0.437    0.187   1600
## 6      1          1          1 0.204 0.518    0.0896  1606

Regression Analysis

The following code fits the full model with all the predictors.

model <- lm(cnt ~ factor(season) + factor(workingday) + factor(weathersit) + temp + hum + windspeed, data = bike)
summary(model)

## 
## Call:
## lm(formula = cnt ~ factor(season) + factor(workingday) + factor(weathersit) + 
##     temp + hum + windspeed, data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3714.8  -918.5  -243.6  1059.9  4214.4 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3024.4      349.3   8.659  < 2e-16 ***
## factor(season)2        932.3      178.6   5.220 2.34e-07 ***
## factor(season)3        483.2      235.6   2.051   0.0406 *  
## factor(season)4       1499.6      152.4   9.841  < 2e-16 ***
## factor(workingday)1    155.0      103.7   1.496   0.1352    
## factor(weathersit)2   -232.3      127.7  -1.818   0.0694 .  
## factor(weathersit)3  -1929.7      326.8  -5.905 5.43e-09 ***
## temp                  6159.1      481.1  12.802  < 2e-16 ***
## hum                  -2608.2      461.1  -5.656 2.23e-08 ***
## windspeed            -3306.0      674.5  -4.901 1.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1296 on 721 degrees of freedom
## Multiple R-squared:  0.5577, Adjusted R-squared:  0.5522 
## F-statistic:   101 on 9 and 721 DF,  p-value: < 2.2e-16

anova(model)

## Analysis of Variance Table
## 
## Response: cnt
##                     Df     Sum Sq   Mean Sq  F value    Pr(>F)    
## factor(season)       3  950595868 316865289 188.5553 < 2.2e-16 ***
## factor(workingday)   1    5469478   5469478   3.2547   0.07164 .  
## factor(weathersit)   2  248464999 124232499  73.9263 < 2.2e-16 ***
## temp                 1  249905725 249905725 148.7100 < 2.2e-16 ***
## hum                  1   33093919  33093919  19.6930 1.052e-05 ***
## windspeed            1   40371975  40371975  24.0239 1.176e-06 ***
## Residuals          721 1211633428   1680490                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hypothesis tested for significance of regression:

\(H_0 : β_1 = β_2 =...= β_k = 0\)
\(H_A: β_j ≠ 0\) for at least one \(j\)

The output of the ANOVA table splits the \(MS_{reg}\) between the regressors, the F-statistic can either be calculated from the ANOVA table or taken from the model summary. The reported F-statistic is 101 with a p-value < 0.05, therefore, we reject \(H_0\), the overall regression using all predictors is statistically significant at 5% level. Agjusted \(R^2\) = 0.55, so the model explains about 55% of variability in the number of bike rental users.

The hypothesis for significance of individual coefficients:

\(H_0 : \beta_j = 0\)
\(H_A: \beta_j ≠ 0\)

The individual t-test on the coefficients is obtained from model summary. We can conclude that the coefficients of workingday = 1 and for weathersit = 2 are not statistically significant at 5% level. All coefficients of other predictors were reported to be statistically significant at 5% level.

From the output of the ANOVA table, we can conclude that the addition of workingday to the regression is not statistically significant at 5% level, so it is suggested to be left out of the model.

The following code checks the multicollinearity assumption

vif(model)

##                        GVIF Df GVIF^(1/(2*Df))
## factor(season)     3.504848  3        1.232475
## factor(workingday) 1.010191  1        1.005082
## factor(weathersit) 1.780069  2        1.155072
## temp               3.369325  1        1.835572
## hum                1.873643  1        1.368811
## windspeed          1.186928  1        1.089462

Based on GVIF^(1/(2*Df)), we can conclude that there is no multicollinearity present between the predictor variables because none of the values is greater than 2.

Residual Analysis

The function residual.analysis performs plots and tests to check the assumptions behind the estimated model.

residual.analysis <- function(model){
  res.model = rstandard(model)
  par(mfrow=c(2,2))
  plot(model, which = c(1,1))
  hist(res.model,main="Histogram of standardised residuals")
  qqnorm(res.model,main="QQ plot of standardised residuals")
  qqline(res.model, col = 2, lty = 2)
  acf(res.model, main="ACF of standardised residuals")
  print(shapiro.test(res.model))
  print(durbinWatsonTest(model))
  print(ncvTest(model))
}
residual.analysis(model)

Figure 1 - Residual Analysis plots

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.98312, p-value = 1.845e-07
## 
##  lag Autocorrelation D-W Statistic p-value
##    1       0.7807316     0.4369705       0
##  Alternative hypothesis: rho != 0
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 32.65956, Df = 1, p = 1.098e-08

From the diagnostic checks in Figure 1 and tests reports, we can assume:

Errors seem to be spread nonrandomly. The plot of residuals against fitted values shows that errors are mostly associated with larger values of fitted values and tend to gather on the right side of the plot. We conclude that the assumption of random errors is violated.

The histogram shows the distribution of standardised residuals, it does not appear normal. Normal QQ Plot: Residuals do not fall close to the line at the both ends. There could be possible deviations from normality.

ACF plot: All the lags are highly significant which indicates that our model does not capture the autocorrelation structure in the research data.

The tests performed on the residuals support the existence of autocorrelation in the residuals, we reject \(H_0\) that the errors are uncorrelated (p-value < 0.05). The normality assumption is violated with p-value < 0.05 reported by Shapiro-Wilk test. The assumption of constant variance does not hold, we reject H0 that errors have constant variance with reported p-value < 0.05. Based on the residual analysis, we can conclude that the model is not an adequate fit for our data.

Model Building

The following code performs backward elimination to come up with smaller parsimonious model

drop1(model, test="F")

## Single term deletions
## 
## Model:
## cnt ~ factor(season) + factor(workingday) + factor(weathersit) + 
##     temp + hum + windspeed
##                    Df Sum of Sq        RSS   AIC  F value    Pr(>F)    
## <none>                          1211633428 10488                       
## factor(season)      3 208219020 1419852448 10598  41.3012 < 2.2e-16 ***
## factor(workingday)  1   3759531 1215392958 10489   2.2372    0.1352    
## factor(weathersit)  2  58714379 1270347807 10519  17.4694 3.901e-08 ***
## temp                1 275398673 1487032100 10636 163.8800 < 2.2e-16 ***
## hum                 1  53768096 1265401523 10518  31.9955 2.229e-08 ***
## windspeed           1  40371975 1252005403 10510  24.0239 1.176e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop1(update(model, ~ . -factor(workingday)), test = "F")

## Single term deletions
## 
## Model:
## cnt ~ factor(season) + factor(weathersit) + temp + hum + windspeed
##                    Df Sum of Sq        RSS   AIC F value    Pr(>F)    
## <none>                          1215392958 10489                      
## factor(season)      3 208414000 1423806958 10598  41.269 < 2.2e-16 ***
## factor(weathersit)  2  57331974 1272724932 10518  17.029 5.937e-08 ***
## temp                1 280714291 1496107250 10639 166.757 < 2.2e-16 ***
## hum                 1  54968383 1270361341 10519  32.654 1.611e-08 ***
## windspeed           1  41004780 1256397738 10511  24.359 9.937e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Partial F test indicates the variable workingday can be left out of the model with reported p-value > 0.05.

r <- regsubsets(cnt ~ factor(season) + factor(workingday) + factor(weathersit) + temp + hum + windspeed,
                data = bike)
plot(r, scale="Cp")

plot(r, scale="adjr2")

Both Mallows’s Cp and Adjusted \(R^2\) criteria of best subsets regression agree to leave out workingday

The following code fits a new model with season, weathersit, temp, hum and windspeed variables

submodel.1 <- lm(cnt ~ factor(season) + factor(weathersit) + temp + hum + windspeed, data = bike)
summary(submodel.1)

## 
## Call:
## lm(formula = cnt ~ factor(season) + factor(weathersit) + temp + 
##     hum + windspeed, data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3830.0  -926.4  -253.8  1097.0  4095.7 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3128.9      342.5   9.135  < 2e-16 ***
## factor(season)2        927.0      178.7   5.187 2.78e-07 ***
## factor(season)3        471.3      235.7   2.000   0.0459 *  
## factor(season)4       1496.8      152.5   9.814  < 2e-16 ***
## factor(weathersit)2   -218.8      127.5  -1.715   0.0867 .  
## factor(weathersit)3  -1902.9      326.6  -5.827 8.51e-09 ***
## temp                  6205.4      480.5  12.913  < 2e-16 ***
## hum                  -2635.2      461.1  -5.714 1.61e-08 ***
## windspeed            -3330.8      674.9  -4.935 9.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1297 on 722 degrees of freedom
## Multiple R-squared:  0.5564, Adjusted R-squared:  0.5514 
## F-statistic: 113.2 on 8 and 722 DF,  p-value: < 2.2e-16

anova(submodel.1)

## Analysis of Variance Table
## 
## Response: cnt
##                     Df     Sum Sq   Mean Sq F value    Pr(>F)    
## factor(season)       3  950595868 316865289 188.233 < 2.2e-16 ***
## factor(weathersit)   2  243512285 121756142  72.329 < 2.2e-16 ***
## temp                 1  255086767 255086767 151.533 < 2.2e-16 ***
## hum                  1   33942733  33942733  20.164 8.274e-06 ***
## windspeed            1   41004780  41004780  24.359 9.937e-07 ***
## Residuals          722 1215392958   1683370                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hypothesis tested for significance of regression:

\(H_0 : β_1 = β_2 =...= β_k = 0\)
\(H_A: β_j ≠ 0\) for at least one \(j\)

The reported F-statistic from model summary output is 109.7 with a p-value < 0.05, therefore, we reject \(H_0\), the overall regression using all predictors is statistically significant at 5% level. Agjusted \(R^2\) = 0.54, so the model explains about 54% of variability in the number of bike rental users.

The hypothesis for significance of individual coefficients:

\(H_0 : \beta_j = 0\)
\(H_A: \beta_j ≠ 0\)

The individual t-test on the coefficients is obtained from model summary. All predictors were reported to be statistically significant at 5% level.

Residual Analysis

residual.analysis <- function(model){
  res.model = rstandard(model)
  par(mfrow=c(2,2))
  plot(model, which = c(1,1))
  hist(res.model,main="Histogram of standardised residuals")
  qqnorm(res.model,main="QQ plot of standardised residuals")
  qqline(res.model, col = 2, lty = 2)
  acf(res.model, main="ACF of standardised residuals")
  print(shapiro.test(res.model))
  print(durbinWatsonTest(model))
  print(ncvTest(model))
}
residual.analysis(submodel.1)

Figure 2 - Residual Analysis plots

## 
##  Shapiro-Wilk normality test
## 
## data:  res.model
## W = 0.98202, p-value = 8.069e-08
## 
##  lag Autocorrelation D-W Statistic p-value
##    1       0.7807629     0.4366348       0
##  Alternative hypothesis: rho != 0
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 33.03884, Df = 1, p = 9.0336e-09

From the diagnostic checks in Figure 2 and tests reports, we can conclude that the assumptions of randomness, normality and homogeneity of errors are violated. The ACF and Durbin-Watson test suggest that the errors are highly correlated, so the updated model is not adequate for capturing the autocorrelation structure of the research data.

Generalised Linear Model

Assuming that the response variable of number of bike rental users follows a Poisson distribution, we attempt to fit a regression model using the log link.

model.full<-glm(cnt ~ factor(season) + factor(workingday) + factor(weathersit) + temp + hum + windspeed, family=poisson(link="log"), data = bike)
summary.glm(model.full)

## 
## Call:
## glm(formula = cnt ~ factor(season) + factor(workingday) + factor(weathersit) + 
##     temp + hum + windspeed, family = poisson(link = "log"), data = bike)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -59.115  -15.771   -3.352   15.691   68.686  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          7.933928   0.004272 1856.98   <2e-16 ***
## factor(season)2      0.313370   0.002270  138.02   <2e-16 ***
## factor(season)3      0.197897   0.002831   69.89   <2e-16 ***
## factor(season)4      0.458829   0.001978  231.93   <2e-16 ***
## factor(workingday)1  0.039978   0.001203   33.23   <2e-16 ***
## factor(weathersit)2 -0.055284   0.001488  -37.14   <2e-16 ***
## factor(weathersit)3 -0.745276   0.005472 -136.20   <2e-16 ***
## temp                 1.381391   0.005599  246.74   <2e-16 ***
## hum                 -0.572890   0.005445 -105.21   <2e-16 ***
## windspeed           -0.723194   0.008116  -89.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 668801  on 730  degrees of freedom
## Residual deviance: 298057  on 721  degrees of freedom
## AIC: 305476
## 
## Number of Fisher Scoring iterations: 4

deviance(model.full)

## [1] 298057.5

pchisq(model.full$deviance, df=model.full$df.residual, lower.tail=FALSE)

## [1] 0

The output of the poisson model summary suggest that all the predictors have a statistically significant effect on the response with all p-values < 0.05.

The hypothesis behind testing model adequacy:

\(H_0:\) the model fits the data well
\(H_A:\) the model does not fit the data well

The chi-square test statistic of 298057 with 721 degrees of freedom gives a p-value of 0, indicating that we should reject the \(H_0\) hypothesis, and we can conclude that the poisson model is not adequate for our data.

anova(model.full,test="Chi")

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: cnt
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                 730     668801              
## factor(season)      3   232841       727     435960 < 2.2e-16 ***
## factor(workingday)  1     1227       726     434732 < 2.2e-16 ***
## factor(weathersit)  2    65261       724     369472 < 2.2e-16 ***
## temp                1    56481       723     312990 < 2.2e-16 ***
## hum                 1     6916       722     306074 < 2.2e-16 ***
## windspeed           1     8017       721     298057 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drop1(model.full, test="Chi")

## Single term deletions
## 
## Model:
## cnt ~ factor(season) + factor(workingday) + factor(weathersit) + 
##     temp + hum + windspeed
##                    Df Deviance    AIC   LRT  Pr(>Chi)    
## <none>                  298057 305476                    
## factor(season)      3   367366 374779 69309 < 2.2e-16 ***
## factor(workingday)  1   299167 306584  1110 < 2.2e-16 ***
## factor(weathersit)  2   320701 328116 22644 < 2.2e-16 ***
## temp                1   359006 366422 60948 < 2.2e-16 ***
## hum                 1   309055 316472 10997 < 2.2e-16 ***
## windspeed           1   306074 313491  8017 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table and backward elimination using partial Chisq test with p-values < 0.05 indicate that all the predictors are important and do not suggest a smaller parsimonious model.

Residual Analysis

par(mfrow=c(1,2))
plot(residuals(model.full, type="deviance"))
qqnorm(residuals(model.full, type="deviance"))
qqline(residuals(model.full, type="deviance"), lty = 2, col = 2)

Figure 3 - Residual Analysis plots

shapiro.test(residuals(model.full, type="deviance"))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model.full, type = "deviance")
## W = 0.98468, p-value = 6.256e-07

Figure 3 shows the plots of deviance residuals from the poisson model. The errors are not spread randomly, the Q-Q plot and Shapiro-Wilk test indicate that the normality does not hold. So, we can conclude that the model does not appear satisfactory from the residual analysis viewpoint.

Conclusion

The goal of the research project was to fit a multiple linear regression model that can be suitable to predict the number of bike rental users. We expected to see a multilinear relationship between the response variable and predictors chosen for this task. However, the results of our study showed that none of the models fitted were adequate in terms of residual checks.

Overall, we conclude that multiple linear regression was not suitable for the research data. Highly significant autocorrelation structure in the residuals from regression models suggests that future investigation could use Time Series analysis methods to predict the number of bike rental users.

MATH1312 Final Project

Multiple linear regression analysis of bike sharing data

Anna Krinochkina s3712761 Xiyue Shu s3705474 Shan Jiang s3592369

Introduction

Regression Analysis

Residual Analysis

Model Building

Residual Analysis

Generalised Linear Model

Residual Analysis

Conclusion