Regression Assumptions

Introduction

The marketing department for the city of Toronto, Canada is very pleased with itself. It recently concluded a promotion intended to encourage use of the city’s new downtown bike sharing program called BikeShare. Since the conclusion of the promotion, the marketing department has emphatically declared the promotion to be a huge success—a claim based entirely on anecdotal responses from a few individuals on social media.

As the primary data analyst to the city’s chief administrative officer, you have been given the task of assessing the real impact of the promotion on the use of the bike sharing program. The bike sharing program operates on the following model:

Bikes may be rented only on a daily basis and only for the entire day (no partial days.)
There are two ways a person can rent a bike.
- 1. casual users can walk up to any available bike rental terminal, swipe a credit card and rent a bike by paying a daily fee (assuming bikes are available.)
- 1. registered BikeShare members pay a monthly fee and are guaranteed bike rental availability whenever they want. They pay half the daily fee of the casual users.
The marketing department’s promotion was run over the course of two years and would randomly assign half of the year’s days to be declared “promotional days” and the other half “non-promotional days.”
On promotional days, the daily rental fee paid by the individual is discounted by 30% of the standard daily rate. As a result, on promotional days casual users pay 70% of the normal daily fee, and registered members pay 20% of the normal daily fee (because they are already receiving a 50% discount on rentals by being members).

Data source: http://asayanalytics.com/bikeshare_csv

Data Dictionary

Below is a summary table explaining each variable as well as the meaning of numbers represented in the BikeShare data set.

Variable Name	Explanation
Season	The season of the year (1: Spring, 2: Summer, 3: Fall, 4: Winter)
Promotion	Dummy variable indicating whether the promotion was active on the day
Mnth	Month (1-12 representing Jan – Dec respectively)
Holiday	Dummy variable indicating whether the day was a holiday (0 = False, 1=True)
Weekday	Day of the week (0-6 representing Sunday through Saturday respectively)
Workingday	Dummy variable indicating whether the day was a working day. (Working days are weekdays that are also not holidays; 0 = False, 1=True)
Weathersit	Variable indicating the weather on the day (1: Clear, few clouds; 2: Mist, cloudy; 3: Light snow, light rain; 4: Heavy snow or rain)
Temp	Temperature (in degrees Celsius)
Humidity	Humidity (in percent)
Windspeed	Wind speed (in knots)
Casual	Number of Casual Riders for the day
Registered	Number of Registered riders for the day

1) Model Improvement

Generalized Regression Equation

Regression Equation for Model 2 (most improved)

total_riders = b_0 + b_1(temp^3) + b_2(Promotion) + b_3(season) + b_4(mnth) + b_5(weathersit) + b_6(windspeed) + b_7(humidity^2)

Regression Equation for Model 1 (initial)

total_riders = b_0 + b_1(temp^3) + b_2(Promotion) + b_3(season) + b_4(weathersit) + b_5(windspeed) + b_6(humidity^2)

Below is a table showing correlations of each dependent variable against total_riders, our independent variable.

	total_riders
season	0.4061
Promotion	0.5667
mnth	0.28
holiday	-0.06835
weekday	0.06744
workingday	0.06116
weathersit	-0.2974
temp	0.6275
humidity	-0.1007
windspeed	-0.2345

Tables & Plots Summary

This sections provides though process, and tables and graphs of regression output (with coefficients, standard errors, etc.). Below are a summary table indicating the level of significance in which each dependent variable has on total_riders, a VIF summary table showing how highly correlated each dependent variable is to the independent variable (i.e. total_riders), an ANOVA table showing whether the improved model is actually better than the initial model, and a break down of each variable’s graph.

Explanation

Before deciding which variables to include into the model, here’s my thought process:

cor(): I first looked at cor() function to see the correlations between variables (as shown in the other tab). I chose to exclude holiday, weekday, workingday, and humidity variables since their corr values are less than 0.2.
vif(): Next, I used vif() function to check for multicollinearity in the model. It turns out that season and month variables are highly correlated since season already indicates months. I chose to leave out month as its corr value to total_riders is weaker than the one of season. Also, using season instead of mnth also improved r-squared value.
Then, I played around with the humidity variable as its corr value is more highly correlated to total_riders than the other three variables I left out. It turns out that including the variable improved the model. However, when looking at crPlot(), there is a linearity problem with humidity so I increased its degree to be a polynomial, and thus, improve its linearity.

To this point, I got the first model, the one without variable mnth

Summary & Tables for Model 1

Below are summary tables and graphs for model 1 (w/o mnth)

## 
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) + 
##     as.factor(season) + as.factor(weathersit) + windspeed + poly(humidity, 
##     2, raw = TRUE), data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3465.6  -343.9    52.8   416.3  2662.8 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.926e+03  5.511e+02   3.495 0.000502 ***
## poly(temp, 3, raw = TRUE)1     -3.380e+02  6.679e+01  -5.061 5.30e-07 ***
## poly(temp, 3, raw = TRUE)2      3.569e+01  3.567e+00  10.007  < 2e-16 ***
## poly(temp, 3, raw = TRUE)3     -7.413e-01  5.959e-02 -12.440  < 2e-16 ***
## as.factor(Promotion)1           1.935e+03  5.168e+01  37.440  < 2e-16 ***
## as.factor(season)2              7.465e+02  9.674e+01   7.717 4.01e-14 ***
## as.factor(season)3              9.961e+02  1.255e+02   7.940 7.79e-15 ***
## as.factor(season)4              1.259e+03  8.448e+01  14.907  < 2e-16 ***
## as.factor(weathersit)2         -3.039e+02  6.865e+01  -4.426 1.11e-05 ***
## as.factor(weathersit)3         -1.332e+03  1.934e+02  -6.888 1.24e-11 ***
## windspeed                      -4.993e+01  5.344e+00  -9.342  < 2e-16 ***
## poly(humidity, 2, raw = TRUE)1  4.574e+01  1.278e+01   3.580 0.000367 ***
## poly(humidity, 2, raw = TRUE)2 -5.712e-01  1.048e-01  -5.450 6.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 681.4 on 718 degrees of freedom
## Multiple R-squared:  0.8783, Adjusted R-squared:  0.8763 
## F-statistic: 431.9 on 12 and 718 DF,  p-value: < 2.2e-16

##                                   GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE)     5.466470  3        1.327245
## as.factor(Promotion)          1.051463  1        1.025408
## as.factor(season)             5.143654  3        1.313848
## as.factor(weathersit)         2.266273  2        1.226953
## windspeed                     1.210655  1        1.100298
## poly(humidity, 2, raw = TRUE) 2.709878  2        1.283032

## Analysis of Variance Table
## 
## Model 1: total_riders ~ poly(temp, 3, raw = TRUE)
## Model 2: total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) + 
##     as.factor(season) + as.factor(weathersit) + windspeed + poly(humidity, 
##     2, raw = TRUE)
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    727 1472082143                                   
## 2    718  333358327  9 1138723816 272.51 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary & Tables for Model 2

Below are summary tables and graphs for model 2 (mnth included)

## 
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) + 
##     as.factor(season) + as.factor(mnth) + as.factor(weathersit) + 
##     windspeed + poly(humidity, 2, raw = TRUE), data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3415.1  -336.5    63.8   393.9  2518.8 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.528e+03  5.679e+02   2.690  0.00731 ** 
## poly(temp, 3, raw = TRUE)1     -3.218e+02  7.618e+01  -4.225 2.71e-05 ***
## poly(temp, 3, raw = TRUE)2      3.419e+01  4.132e+00   8.276 6.35e-16 ***
## poly(temp, 3, raw = TRUE)3     -7.160e-01  6.909e-02 -10.363  < 2e-16 ***
## as.factor(Promotion)1           1.940e+03  5.138e+01  37.762  < 2e-16 ***
## as.factor(season)2              8.374e+02  1.571e+02   5.331 1.32e-07 ***
## as.factor(season)3              1.207e+03  1.875e+02   6.436 2.27e-10 ***
## as.factor(season)4              1.697e+03  1.584e+02  10.707  < 2e-16 ***
## as.factor(mnth)2                5.605e+01  1.307e+02   0.429  0.66814    
## as.factor(mnth)3                3.022e+02  1.524e+02   1.982  0.04782 *  
## as.factor(mnth)4               -7.826e+01  2.221e+02  -0.352  0.72467    
## as.factor(mnth)5                1.102e+02  2.364e+02   0.466  0.64135    
## as.factor(mnth)6                1.131e+02  2.469e+02   0.458  0.64694    
## as.factor(mnth)7                9.354e+01  2.744e+02   0.341  0.73330    
## as.factor(mnth)8               -1.272e+02  2.653e+02  -0.479  0.63181    
## as.factor(mnth)9                5.949e+01  2.374e+02   0.251  0.80216    
## as.factor(mnth)10              -2.807e+02  2.186e+02  -1.284  0.19958    
## as.factor(mnth)11              -5.491e+02  2.098e+02  -2.617  0.00905 ** 
## as.factor(mnth)12              -2.606e+02  1.662e+02  -1.568  0.11734    
## as.factor(weathersit)2         -3.165e+02  6.855e+01  -4.617 4.62e-06 ***
## as.factor(weathersit)3         -1.296e+03  1.934e+02  -6.700 4.26e-11 ***
## windspeed                      -4.834e+01  5.314e+00  -9.095  < 2e-16 ***
## poly(humidity, 2, raw = TRUE)1  5.617e+01  1.289e+01   4.359 1.50e-05 ***
## poly(humidity, 2, raw = TRUE)2 -6.524e-01  1.058e-01  -6.167 1.17e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 672.8 on 707 degrees of freedom
## Multiple R-squared:  0.8832, Adjusted R-squared:  0.8794 
## F-statistic: 232.4 on 23 and 707 DF,  p-value: < 2.2e-16

##                                     GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE)      22.071017  3        1.674829
## as.factor(Promotion)            1.065614  1        1.032286
## as.factor(season)             175.743240  3        2.366718
## as.factor(mnth)               786.942276 11        1.354047
## as.factor(weathersit)           2.362922  2        1.239831
## windspeed                       1.227831  1        1.108076
## poly(humidity, 2, raw = TRUE)   3.056050  2        1.322179

## Analysis of Variance Table
## 
## Model 1: total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) + 
##     as.factor(season) + as.factor(weathersit) + windspeed + poly(humidity, 
##     2, raw = TRUE)
## Model 2: total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) + 
##     as.factor(season) + as.factor(mnth) + as.factor(weathersit) + 
##     windspeed + poly(humidity, 2, raw = TRUE)
##   Res.Df       RSS Df Sum of Sq      F   Pr(>F)   
## 1    718 333358327                                
## 2    707 320065379 11  13292948 2.6694 0.002309 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2) Problems with the Model

In this section, I explain the problems I encountered while improving the model and how I solve them. I broke them down into three sections: 1) Multicollinearity Assumptions Problems, 2) Linearity Assumptions Problems, and 3) Homoscedasticity Assumptions Problems.

Multicollinearity

The first problem I ran into was the multicollinearity of variables month and season and deciding which one to keep or whether to keep both. As mentioned in 1), the two variables are highly correlated as season already indicates which months they are in. I chose to leave out month also because it has to much variability and it does not appear to have much significance on the IV, total_riders. This also reduced the value of VIF by about 0.34, as shown below.

Model #1: no mnth

##                                   GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE)     5.466470  3        1.327245
## as.factor(Promotion)          1.051463  1        1.025408
## as.factor(season)             5.143654  3        1.313848
## as.factor(weathersit)         2.266273  2        1.226953
## windspeed                     1.210655  1        1.100298
## poly(humidity, 2, raw = TRUE) 2.709878  2        1.283032

## [1] 8.21799

Model #2: include mnth

##                                     GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE)      22.071017  3        1.674829
## as.factor(Promotion)            1.065614  1        1.032286
## as.factor(season)             175.743240  3        2.366718
## as.factor(mnth)               786.942276 11        1.354047
## as.factor(weathersit)           2.362922  2        1.239831
## windspeed                       1.227831  1        1.108076
## poly(humidity, 2, raw = TRUE)   3.056050  2        1.322179

## [1] 8.559299

Linearity

The second problem I encountered was with the linearity assumption of the humidity variable. As shown in the crPlots below, humidity has about the same behavior as with temp except inversely correlated to total_riders. I solved this problem by make humidity as polynomial just like we did with the temp variable by making it a 3rd-degree polynomial. The differences are as shown.

Homoscedasticity

This section is not from Q1 but it is an extra observation and a possibility for improvement.

Despite the problems with the assumptions of multicollinearity, when comparing the model with month variable included with the model without it, the ANOVA test shows that the one with month included is better in terms of homoscedasticity assumption improvement, as shown below. More than that, when plotting out the first model (with mnth), there is clearly a problem with homoscedasticity assumption. I solved this problem by logging the DV, total_riders to flatten out the data to improve linearity, and thus, improve homoscedasticity. As shown in the plots below, specifically in the first two graphs, the line improved to be more fit in the second model when using log on DV (“Residuals vs. Fitted”), plus, the normality is improved (more flattened) in the “Normal Q-Q” graph. In addition, by logging the DV, the VIF for multicollinearity also improved significantly, specifically reduced by about 3.3, from 8.22 to 4.93, which is less than 5.

Note: the equation with log transformation on the model ends up not working when used to predict values.

## Analysis of Variance Table
## 
## Response: total_riders
##                                Df     Sum Sq   Mean Sq  F value    Pr(>F)    
## poly(temp, 3, raw = TRUE)       3 1267453249 422484416  909.963 < 2.2e-16 ***
## as.factor(Promotion)            1  759651907 759651907 1636.168 < 2.2e-16 ***
## as.factor(season)               3  115781949  38593983   83.125 < 2.2e-16 ***
## as.factor(weathersit)           2  187438472  93719236  201.856 < 2.2e-16 ***
## windspeed                       1   24095498  24095498   51.898 1.481e-12 ***
## poly(humidity, 2, raw = TRUE)   2   51755989  25877995   55.737 < 2.2e-16 ***
## Residuals                     718  333358327    464287                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## [1] 8.21799

## [1] 4.926672

Plots of Model #1: no mnth

Plots of Model #2: include mnth & log transformation

3) Month vs. Total Riders

Using the regression output from Question 1), the highest number of riders, holding everything else constant is in month 10th or October.

Regarding the second question, if October became unseasonably cold and rainy, it would not change the coefficient on this month. This is mainly because the correlation between mnth & weathersit is far weaker (0.04) than the correlation between temp vs. weathersit which is, -0.12. This shows that weather condition should only predict the temperature or weather-related variables, and month variable should be impacted by date-related variable like holiday.

	weathersit
bikeshare.mnth	0.04353
bikeshare.temp	-0.1206

4) Promotion vs. Total Riders

The coefficient of Promotion is a positive coefficient of 1940.16. This simply means when the Promotion is active (Promotion=1), a number of total riders increases by approximately 1940 people. Based on this analysis, it shows that the marketing department has done a great job having the promotion to attract riders.

Below is a box plot showing the difference between the total riders when there is a promotion and when there is not a promotion.

5) Casual vs. Registered Riders

Below are box plots and tables showing the difference between average numbers of casual riders when there is promotion and when there is not a promotion, and the difference between average numbers of registered riders when there is a promotion and when there is no promotion. Based on the analysis, the difference between average numbers of casual riders when there is promotion and when there is not a promotion is about 341 people difference while the one of registered riders is at 1853 people. This shows that registered riders are more responsive than casual riders even when there is a promotion. We can conclude that the Promotion program have a more substantial impact on the registered riders than the casual riders.

Casual Riders

Average Difference of Casual Riders
341.1

Registered Riders

Average Difference of Registered Riders
1853

6) Promotion Influence on Ridership

Before we can report whether to continue with the Promotion program, we need to know at least the profit margin and operating costs for renting bikes from both casual and registered riders. We need to know whether BikeShare is gaining or losing money from the promotional period. We may be able to compare those information with the gain/loss of number of riders after the promotional period ends.