Introduction

This project applies a series of linear model selection and regularization methods to build a predictive model for the bike share data of Capital bikeshare system, Washington D.C.

Background

Bikeshare system has been widely used in more than 500 cities in the world as a way of advocating healthy, affordable and environmental-friendly transportation mode. Users can check out a public bike for use at a low fare at any of the stations in the city’s bike sharing network, and return it at the same or another station. Capital bikeshare is one of such systems in Washington D.C., and has been running for seven years providing bike sharing services in its five jurisdictions. Today it has 440 stations offering 3,700 bikes for rent , and has more than 29,000 registered members.

Dataset

The dataset was originally provided by Fanaee-T, Hadi, and Gama, Joao , and is now stored in UCI Machine Learning Repository Access link. It integrates Capital Bikeshare trip open data in the year 2011 and 2012, weather information and holiday schedule for easy research and study use.

This project will analyze the daily count dataset. This original dataset 731 observations and 16 variables. The following variables will be included in model analysis, including 7 continuous variables:

and 4 categorical variables:

Procedure

Data Cleaning

Read in the .csv file, and extracted the variables needed for analysis.

bikeshare<-read.csv("bikesharingday.csv")
bikedata<-subset(bikeshare,select = c(season, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered, cnt))
bikedata$workingday=as.factor(bikedata$workingday)
bikedata$season=as.factor(bikedata$season)
bikedata$weathersit=as.factor(bikedata$weathersit)
str(bikedata)
## 'data.frame':    731 obs. of  10 variables:
##  $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $ workingday: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
##  $ weathersit: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

Variables Overview

a) Plotting

Use pair() plot to see variables that may contribute to linear relationship.

pairs(bikedata)

b) Is there significant linear relationship

Bike rental is expected to be related with weather conditions including temperature, humidity, and wind speed. Here bike counts cnt,registered and casual show distringishable linear relationship with temp and atemp, two highly-correlated varibles discribing temperture. First fit a simple linear regression model with cnt as response and atemp,hum,windspeed as predictors.

bkmod<-lm(cnt~atemp+hum+windspeed+season+workingday+weathersit,data=bikedata)
summary(bkmod)
## 
## Call:
## lm(formula = cnt ~ atemp + hum + windspeed + season + workingday + 
##     weathersit, data = bikedata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3881.0  -925.0  -255.7  1045.5  4216.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2844.0      355.8   7.993 5.21e-15 ***
## atemp         6595.5      528.1  12.489  < 2e-16 ***
## hum          -2643.0      464.1  -5.695 1.79e-08 ***
## windspeed    -2973.5      678.2  -4.384 1.34e-05 ***
## season2        983.3      178.3   5.515 4.86e-08 ***
## season3        647.9      229.2   2.827  0.00484 ** 
## season4       1504.1      153.3   9.809  < 2e-16 ***
## workingday1    159.7      104.1   1.534  0.12537    
## weathersit2   -229.1      128.4  -1.783  0.07494 .  
## weathersit3  -1899.8      328.7  -5.779 1.12e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1302 on 721 degrees of freedom
## Multiple R-squared:  0.5537, Adjusted R-squared:  0.5482 
## F-statistic: 99.41 on 9 and 721 DF,  p-value: < 2.2e-16
  • \(H_0\): \(\beta_(atemp)\)=\(\beta_(hum)\)=\(\beta_(widsp)\)=0
  • \(H_1\): At least one of \(\beta\) \(\neq\) 0
  • p-value less than 2.2e-16
  • Decision: Reject \(H_0\)
  • Conclusion: There is significant linear relationship between bike count and feeled temperture, humidity and windspeed.

b) Real temperture or feeled temperature

There are two temperature variables, the real temperture and feeled temperature. Try to see which one works better in a simple linear regression model.

realtemp=bikedata$temp
feeltemp=bikedata$atemp
t.test(x = realtemp, y = feeltemp, data=bikedata, alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  realtemp and feeltemp
## t = 2.3201, df = 1440.7, p-value = 0.02047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.003249579 0.038812021
## sample estimates:
## mean of x mean of y 
## 0.4953848 0.4743540
library(faraway)
## Warning: package 'faraway' was built under R version 3.5.3
bkmod_temp<-lm(cnt~temp,data=bikedata)
bkmod_atemp<-lm(cnt~atemp,data=bikedata)
summary(bkmod_temp)$r.square
## [1] 0.3937487
summary(bkmod_atemp)$r.square
## [1] 0.3982439

The models with real temperature and feeled temperature have the same lavel of \(R^2\), though the mean values of these two set of temperture are different. For easier modeling will use atempin the following model fit.

c) Interactions with categorical variables

Categoriacal variables like season, workingday and weathersit may influence the model. For example, bike rental numbers may be greatly different on working days and non-work days. The following picture shows casual bike users have significantly different trip pattern on holiday/weekends then on working days.

bkmod_int_c = lm(casual ~ atemp * workingday, data = bikedata)
int_coef = summary(bkmod_int_c)$coef[,1]

int_nwd     = int_coef[1]
int_wd   = int_coef[1] + int_coef[3]

slope_nwd   = int_coef[2]
slope_wd = int_coef[2] + int_coef[4]

plot(casual ~ atemp, data = bikedata, col = workingday)
legend("topleft",legend= c("nonworkday","workday"),fill = c(1:2), horiz=TRUE)
abline(int_nwd, slope_nwd, lwd = 3, col = "black")
abline(int_wd, slope_wd, lwd = 3, col = "red")

Compare an addition model and an interaction model using the same categorical variable. Results show models using interaction with categorical variables all have more variation explained.

workingday

bkmod_wdadd<-lm(cnt~atemp+hum+windspeed+workingday,data=bikedata)
bkmod_wdint<-lm(cnt~(atemp+hum+windspeed)*workingday,data=bikedata)
summary(bkmod_wdadd)$adj.r.squared
## [1] 0.4611843
summary(bkmod_wdint)$adj.r.squared
## [1] 0.4618009

season

bkmod_snadd<-lm(cnt~atemp+hum+windspeed+season,data=bikedata)
bkmod_snint<-lm(cnt~(atemp+hum+windspeed)*season,data=bikedata)
summary(bkmod_snadd)$adj.r.squared
## [1] 0.5282048
summary(bkmod_snint)$adj.r.squared
## [1] 0.5894463

weathersit

bkmod_wsadd<-lm(cnt~atemp+hum+windspeed+weathersit,data=bikedata)
bkmod_wsint<-lm(cnt~(atemp+hum+windspeed)*weathersit,data=bikedata)
summary(bkmod_wsadd)$adj.r.squared
## [1] 0.4776649
summary(bkmod_wsint)$adj.r.squared
## [1] 0.4841737

Model Selection

Assumptions Check

Check the simple linear regression model with three variables atemp,hum and windspeed.

par(mfrow=c(1,2))
plot(bkmod, which = c(1))
plot(bkmod, which = c(2))

The assumption of constant variation seems violated. We do see more outliers for higher fitted values. Check with numbers:

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(bkmod)
## 
##  studentized Breusch-Pagan test
## 
## data:  bkmod
## BP = 50.393, df = 9, p-value = 9.088e-08

\(H_0\):Homoscedasticity. The errors have constant variance about the true model. \(H_1\):Heteroscedasticity. The errors have non-constant variance about the true model. p-value is very small, we reject the null hypothesis The constant variance assumption is violated.

Check the normality assumption:

shapiro.test(resid(bkmod))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(bkmod)
## W = 0.98357, p-value = 2.607e-07

\(H_0\):Data were sampled from a normal distribution \(H_1\):Data were not sampled from a normal distribution The normality assumption is also violated.

So we will not use linear model with only first order terms, but try to find a model with higher order terms.

Higher order terms

We will first fit a big quadratic model. Referencing the pairs() plot, we will use atemp to reflect its potential quadratic relationship with the response. We’ll also consider all possible two-way interactions.

plot(cnt ~atemp, data = bikedata, col = "grey", pch = 20, cex = 1.5)

bkmod_big<-lm(log(cnt)~.^2+I(atemp)^2, data=bikedata)
summary(bkmod_big)$adj.r.squared
## [1] 0.9622415
length(coef(bkmod_big))
## [1] 76

Use both backwards AIC and BIC to reduce the number of parameters, while still fitting well.

bkmod_back_aic<-step(bkmod_big, direction = "backward", trace = 0)

n = length(resid(bkmod_big))
bkmod_back_bic = step(bkmod_big, direction = "backward", k = log(n), trace = 0)

Statistics of the big model

summary(bkmod_big)$adj.r.squared
## [1] 0.9622415
length(coef(bkmod_big))
## [1] 76

Statistics of the fitted AIC model

summary(bkmod_back_aic)$adj.r.squared
## [1] 0.962238
length(coef(bkmod_back_aic))
## [1] 53

Statistics of the fitted BIC model

summary(bkmod_back_bic)$adj.r.squared
## [1] 0.9604971
length(coef(bkmod_back_bic))
## [1] 37

The three models have almost the same Adjusted \(R^2\). However, the model chosen by BIC significantly reduces coefficient numbers, while still maintains a Adjusted \(R^2\) over 96%.

Calculate LOOCV RMSE for each model:

calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
calc_loocv_rmse(bkmod_big)
## [1] 0.5846633
calc_loocv_rmse(bkmod_back_aic)
## [1] 0.5518765
calc_loocv_rmse(bkmod_back_bic)
## [1] 0.2533907

Again, the model chosen by BIC has a much smaller LOOCV RMSE. We would prefer this BIC model.

Result

Final Model Selected

summary(bkmod_back_bic)
## 
## Call:
## lm(formula = log(cnt) ~ season + workingday + weathersit + temp + 
##     hum + windspeed + casual + registered + season:weathersit + 
##     season:registered + workingday:weathersit + workingday:casual + 
##     workingday:registered + weathersit:temp + weathersit:windspeed + 
##     weathersit:casual + weathersit:registered + temp:registered + 
##     hum:registered + windspeed:registered + casual:registered, 
##     data = bikedata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91010 -0.02598  0.00609  0.03410  0.69703 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              6.692e+00  8.125e-02  82.364  < 2e-16 ***
## season2                  1.545e-01  5.135e-02   3.009  0.00272 ** 
## season3                  6.457e-02  7.122e-02   0.907  0.36492    
## season4                  2.521e-01  4.592e-02   5.490 5.64e-08 ***
## workingday1              1.873e-01  3.381e-02   5.540 4.30e-08 ***
## weathersit2              2.690e-02  4.841e-02   0.556  0.57858    
## weathersit3              1.074e+00  2.074e-01   5.176 2.96e-07 ***
## temp                     7.972e-01  1.494e-01   5.338 1.28e-07 ***
## hum                     -4.884e-01  1.015e-01  -4.810 1.86e-06 ***
## windspeed               -4.086e-01  1.675e-01  -2.439  0.01498 *  
## casual                   5.401e-04  3.972e-05  13.596  < 2e-16 ***
## registered               4.553e-04  2.480e-05  18.357  < 2e-16 ***
## season2:weathersit2     -5.665e-03  3.595e-02  -0.158  0.87484    
## season3:weathersit2     -1.024e-02  4.639e-02  -0.221  0.82529    
## season4:weathersit2     -2.417e-02  3.197e-02  -0.756  0.44988    
## season2:weathersit3      3.455e-01  1.153e-01   2.997  0.00283 ** 
## season3:weathersit3      4.517e-01  1.588e-01   2.844  0.00458 ** 
## season4:weathersit3     -2.499e-01  1.087e-01  -2.299  0.02178 *  
## season2:registered      -5.261e-05  1.291e-05  -4.074 5.16e-05 ***
## season3:registered      -3.017e-05  1.631e-05  -1.849  0.06483 .  
## season4:registered      -7.855e-05  1.189e-05  -6.605 7.92e-11 ***
## workingday1:weathersit2  1.968e-02  3.300e-02   0.596  0.55114    
## workingday1:weathersit3 -5.035e-01  9.157e-02  -5.499 5.37e-08 ***
## workingday1:casual       1.297e-04  2.888e-05   4.492 8.27e-06 ***
## workingday1:registered  -9.675e-05  1.329e-05  -7.279 9.11e-13 ***
## weathersit2:temp        -4.249e-02  1.059e-01  -0.401  0.68831    
## weathersit3:temp        -2.334e+00  4.205e-01  -5.551 4.05e-08 ***
## weathersit2:windspeed   -5.878e-02  1.293e-01  -0.455  0.64960    
## weathersit3:windspeed   -3.396e+00  4.446e-01  -7.639 7.29e-14 ***
## weathersit2:casual      -1.930e-06  2.711e-05  -0.071  0.94327    
## weathersit3:casual      -9.702e-04  1.610e-04  -6.025 2.75e-09 ***
## weathersit2:registered   4.323e-06  1.101e-05   0.393  0.69471    
## weathersit3:registered   6.047e-04  4.187e-05  14.441  < 2e-16 ***
## temp:registered         -2.030e-04  3.468e-05  -5.855 7.37e-09 ***
## hum:registered           1.176e-04  2.696e-05   4.363 1.48e-05 ***
## windspeed:registered     1.189e-04  4.109e-05   2.893  0.00394 ** 
## casual:registered       -1.017e-07  9.577e-09 -10.616  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1159 on 694 degrees of freedom
## Multiple R-squared:  0.9624, Adjusted R-squared:  0.9605 
## F-statistic:   494 on 36 and 694 DF,  p-value: < 2.2e-16

Conclusion

Bike rental can be explained by its linear relationship with temperture and weather situation. However, for the given set of predictor values, errors producted in a linear model do not follow a normal distribution. A model good for prediction with smaller errors needs additional interaction terms or polynomial terms. In this problem, we used BIC to find the best fit model with a Multiple R-squared above 0.96 using 37 parameters.