Regression Assumptions
Introduction
The marketing department for the city of Toronto, Canada is very pleased with itself. It recently concluded a promotion intended to encourage use of the city’s new downtown bike sharing program called BikeShare. Since the conclusion of the promotion, the marketing department has emphatically declared the promotion to be a huge success—a claim based entirely on anecdotal responses from a few individuals on social media.
As the primary data analyst to the city’s chief administrative officer, you have been given the task of assessing the real impact of the promotion on the use of the bike sharing program. The bike sharing program operates on the following model:
Bikes may be rented only on a daily basis and only for the entire day (no partial days).
There are two ways a person can rent a bike:
- Casual users can walk up to any available bike rental terminal, swipe a credit card, and rent a bike by paying a daily fee (assuming bikes are available).
- Registered BikeShare members pay a monthly fee and are guaranteed bike rental availability whenever they want. They pay half the daily fee of the casual users.
The marketing department’s promotion ran for two years. In each year, half of the days were randomly assigned as “promotional days” and the other half as “non-promotional days.”
On promotional days, the daily rental fee paid by the individual is discounted by 30% of the standard daily rate. As a result, on promotional days casual users pay 70% of the normal daily fee, and registered members pay 20% of the normal daily fee (because they are already receiving a 50% discount on rentals by being members).
Data source: http://asayanalytics.com/bikeshare_csv
Data Dictionary
Below is a summary table explaining each variable as well as the meaning of numbers represented in the BikeShare data set.
| Variable Name | Explanation |
|---|---|
| Season | The season of the year (1: Spring, 2: Summer, 3: Fall, 4: Winter) |
| Promotion | Dummy variable indicating whether the promotion was active on the day (0 = False, 1 = True) |
| Mnth | Month (1-12 representing Jan – Dec respectively) |
| Holiday | Dummy variable indicating whether the day was a holiday (0 = False, 1=True) |
| Weekday | Day of the week (0-6 representing Sunday through Saturday respectively) |
| Workingday | Dummy variable indicating whether the day was a working day. (Working days are weekdays that are also not holidays; 0 = False, 1=True) |
| Weathersit | Variable indicating the weather on the day (1: Clear, few clouds; 2: Mist, cloudy; 3: Light snow, light rain; 4: Heavy snow or rain) |
| Temp | Temperature (in degrees Celsius) |
| Humidity | Humidity (in percent) |
| Windspeed | Wind speed (in knots) |
| Casual | Number of Casual Riders for the day |
| Registered | Number of Registered riders for the day |
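A minimal sketch of preparing the data in R. The data dictionary does not define `total_riders`, so the definition used here (casual plus registered riders) is an assumption, as are the lowercase column names `casual` and `registered`; the real CSV headers may differ. The small data frame is made up so the snippet runs without network access.

```r
# Hypothetical helper: adds total_riders, assumed to be casual + registered.
prepare_bikeshare <- function(df) {
  df$total_riders <- df$casual + df$registered
  df
}

# With network access, the real data could be read with something like:
# bikeshare <- prepare_bikeshare(read.csv("http://asayanalytics.com/bikeshare_csv"))

# Made-up rows for illustration only (not taken from the real data set).
toy <- data.frame(
  Promotion  = c(0, 1, 0, 1),
  casual     = c(120, 300, 80, 410),
  registered = c(900, 2500, 1100, 3100)
)
toy <- prepare_bikeshare(toy)
print(toy)
```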
1) Model Improvement
Generalized Regression Equation
Regression Equation for Model 2 (most improved)
total_riders = b_0 + b_1(temp^3) + b_2(Promotion) + b_3(season) + b_4(mnth) + b_5(weathersit) + b_6(windspeed) + b_7(humidity^2)
Regression Equation for Model 1 (initial)
total_riders = b_0 + b_1(temp^3) + b_2(Promotion) + b_3(season) + b_4(weathersit) + b_5(windspeed) + b_6(humidity^2)
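The compact equations above abbreviate the polynomial and categorical terms. Written out to match the `lm()` calls in the output that follows, Model 1 could be expressed as below. The coefficient symbols are relabelled for this expansion and do not map one-for-one onto the `b` subscripts above; the bracket notation is an indicator that equals 1 when the condition holds and 0 otherwise.

```latex
\begin{aligned}
\text{total\_riders} ={}& \beta_0
  + \beta_1\,\text{temp} + \beta_2\,\text{temp}^2 + \beta_3\,\text{temp}^3
  + \beta_4\,\text{Promotion} \\
 &+ \sum_{s=2}^{4} \gamma_s\,\mathbf{1}[\text{season}=s]
  + \sum_{w=2}^{3} \delta_w\,\mathbf{1}[\text{weathersit}=w] \\
 &+ \beta_5\,\text{windspeed}
  + \beta_6\,\text{humidity} + \beta_7\,\text{humidity}^2 + \varepsilon
\end{aligned}
```

Model 2 adds the month dummies, $\sum_{m=2}^{12} \theta_m\,\mathbf{1}[\text{mnth}=m]$, to the same expression.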
Below is a table showing the correlation of each independent variable with total_riders, our dependent variable.
| Variable | Correlation with total_riders |
|---|---|
| season | 0.4061 |
| Promotion | 0.5667 |
| mnth | 0.28 |
| holiday | -0.06835 |
| weekday | 0.06744 |
| workingday | 0.06116 |
| weathersit | -0.2974 |
| temp | 0.6275 |
| humidity | -0.1007 |
| windspeed | -0.2345 |
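A table like the one above can be produced with `cor()`. The sketch below uses synthetic data (every value is made up) and then flags predictors whose absolute correlation with the response falls below a 0.2 cutoff; the column names and the data-generating formula are illustrative assumptions.

```r
set.seed(1)
n <- 200
# Synthetic stand-in for the bikeshare data; values are made up.
toy <- data.frame(
  temp      = runif(n, 0, 35),
  windspeed = runif(n, 0, 30),
  holiday   = rep(c(0, 0, 0, 0, 1), length.out = n)
)
toy$total_riders <- 2000 + 100 * toy$temp - 40 * toy$windspeed +
  rnorm(n, sd = 500)

# Correlation of each predictor with the response.
predictors <- setdiff(names(toy), "total_riders")
cors <- sapply(toy[predictors], function(x) cor(x, toy$total_riders))
print(round(cors, 4))

# Predictors whose absolute correlation is below the 0.2 cutoff.
weak <- names(cors)[abs(cors) < 0.2]
print(weak)
```

Using the absolute value matters here: a strongly negative correlation is as informative as a strongly positive one.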
Tables & Plots Summary
This section provides my thought process along with tables and graphs of the regression output (coefficients, standard errors, etc.). It includes a summary table indicating how significant each independent variable is in explaining total_riders, a VIF summary table showing how strongly each independent variable is correlated with the other independent variables, an ANOVA table showing whether the improved model is actually better than the initial model, and a breakdown of each variable’s graph.
Explanation
Before deciding which variables to include in the model, here is my thought process:
- cor(): I first used the cor() function to look at the correlations between variables (shown in the correlation table above). I chose to exclude the holiday, weekday, workingday, and humidity variables since the absolute values of their correlations with total_riders are less than 0.2.
- vif(): Next, I used the vif() function to check for multicollinearity in the model. It turns out that the season and month variables are highly correlated, since season already indicates the months. I chose to leave out month because its correlation with total_riders (0.28) is weaker than that of season (0.41). Using season instead of mnth also improved the r-squared value.
- Then I experimented with the humidity variable, since its correlation with total_riders (-0.10) is stronger than those of the other three variables I had left out. Including it improved the model. However, crPlot() showed a linearity problem with humidity, so I entered it as a second-degree polynomial to improve its linearity.
At this point I had the first model, the one without the mnth variable.
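The steps above lead to a model of the following shape. This is a sketch on synthetic data (all values and the data-generating formula are made up); only the model formula is taken from the Model 1 call shown in the output below.

```r
set.seed(42)
n <- 400
# Synthetic stand-in data; every value below is made up for illustration.
toy <- data.frame(
  temp       = runif(n, 0, 35),
  Promotion  = rbinom(n, 1, 0.5),
  season     = sample(1:4, n, replace = TRUE),
  weathersit = sample(1:3, n, replace = TRUE, prob = c(0.6, 0.3, 0.1)),
  windspeed  = runif(n, 0, 30),
  humidity   = runif(n, 20, 100)
)
toy$total_riders <- 1500 + 150 * toy$temp - 2 * toy$temp^2 +
  1900 * toy$Promotion - 50 * toy$windspeed +
  40 * toy$humidity - 0.5 * toy$humidity^2 + rnorm(n, sd = 600)

# Same formula as the Model 1 call in the report's output.
model1 <- lm(
  total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) +
    as.factor(season) + as.factor(weathersit) + windspeed +
    poly(humidity, 2, raw = TRUE),
  data = toy
)
print(summary(model1)$coefficients)
```

Wrapping the coded variables in `as.factor()` makes R estimate one coefficient per level instead of treating the codes as a numeric scale.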
Summary & Tables for Model 1
Below are summary tables and graphs for model 1 (w/o mnth)
##
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) +
## as.factor(season) + as.factor(weathersit) + windspeed + poly(humidity,
## 2, raw = TRUE), data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3465.6 -343.9 52.8 416.3 2662.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.926e+03 5.511e+02 3.495 0.000502 ***
## poly(temp, 3, raw = TRUE)1 -3.380e+02 6.679e+01 -5.061 5.30e-07 ***
## poly(temp, 3, raw = TRUE)2 3.569e+01 3.567e+00 10.007 < 2e-16 ***
## poly(temp, 3, raw = TRUE)3 -7.413e-01 5.959e-02 -12.440 < 2e-16 ***
## as.factor(Promotion)1 1.935e+03 5.168e+01 37.440 < 2e-16 ***
## as.factor(season)2 7.465e+02 9.674e+01 7.717 4.01e-14 ***
## as.factor(season)3 9.961e+02 1.255e+02 7.940 7.79e-15 ***
## as.factor(season)4 1.259e+03 8.448e+01 14.907 < 2e-16 ***
## as.factor(weathersit)2 -3.039e+02 6.865e+01 -4.426 1.11e-05 ***
## as.factor(weathersit)3 -1.332e+03 1.934e+02 -6.888 1.24e-11 ***
## windspeed -4.993e+01 5.344e+00 -9.342 < 2e-16 ***
## poly(humidity, 2, raw = TRUE)1 4.574e+01 1.278e+01 3.580 0.000367 ***
## poly(humidity, 2, raw = TRUE)2 -5.712e-01 1.048e-01 -5.450 6.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 681.4 on 718 degrees of freedom
## Multiple R-squared: 0.8783, Adjusted R-squared: 0.8763
## F-statistic: 431.9 on 12 and 718 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 5.466470 3 1.327245
## as.factor(Promotion) 1.051463 1 1.025408
## as.factor(season) 5.143654 3 1.313848
## as.factor(weathersit) 2.266273 2 1.226953
## windspeed 1.210655 1 1.100298
## poly(humidity, 2, raw = TRUE) 2.709878 2 1.283032
## Analysis of Variance Table
##
## Model 1: total_riders ~ poly(temp, 3, raw = TRUE)
## Model 2: total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) +
## as.factor(season) + as.factor(weathersit) + windspeed + poly(humidity,
## 2, raw = TRUE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 727 1472082143
## 2 718 333358327 9 1138723816 272.51 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary & Tables for Model 2
Below are summary tables and graphs for model 2 (mnth included)
##
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) +
## as.factor(season) + as.factor(mnth) + as.factor(weathersit) +
## windspeed + poly(humidity, 2, raw = TRUE), data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3415.1 -336.5 63.8 393.9 2518.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.528e+03 5.679e+02 2.690 0.00731 **
## poly(temp, 3, raw = TRUE)1 -3.218e+02 7.618e+01 -4.225 2.71e-05 ***
## poly(temp, 3, raw = TRUE)2 3.419e+01 4.132e+00 8.276 6.35e-16 ***
## poly(temp, 3, raw = TRUE)3 -7.160e-01 6.909e-02 -10.363 < 2e-16 ***
## as.factor(Promotion)1 1.940e+03 5.138e+01 37.762 < 2e-16 ***
## as.factor(season)2 8.374e+02 1.571e+02 5.331 1.32e-07 ***
## as.factor(season)3 1.207e+03 1.875e+02 6.436 2.27e-10 ***
## as.factor(season)4 1.697e+03 1.584e+02 10.707 < 2e-16 ***
## as.factor(mnth)2 5.605e+01 1.307e+02 0.429 0.66814
## as.factor(mnth)3 3.022e+02 1.524e+02 1.982 0.04782 *
## as.factor(mnth)4 -7.826e+01 2.221e+02 -0.352 0.72467
## as.factor(mnth)5 1.102e+02 2.364e+02 0.466 0.64135
## as.factor(mnth)6 1.131e+02 2.469e+02 0.458 0.64694
## as.factor(mnth)7 9.354e+01 2.744e+02 0.341 0.73330
## as.factor(mnth)8 -1.272e+02 2.653e+02 -0.479 0.63181
## as.factor(mnth)9 5.949e+01 2.374e+02 0.251 0.80216
## as.factor(mnth)10 -2.807e+02 2.186e+02 -1.284 0.19958
## as.factor(mnth)11 -5.491e+02 2.098e+02 -2.617 0.00905 **
## as.factor(mnth)12 -2.606e+02 1.662e+02 -1.568 0.11734
## as.factor(weathersit)2 -3.165e+02 6.855e+01 -4.617 4.62e-06 ***
## as.factor(weathersit)3 -1.296e+03 1.934e+02 -6.700 4.26e-11 ***
## windspeed -4.834e+01 5.314e+00 -9.095 < 2e-16 ***
## poly(humidity, 2, raw = TRUE)1 5.617e+01 1.289e+01 4.359 1.50e-05 ***
## poly(humidity, 2, raw = TRUE)2 -6.524e-01 1.058e-01 -6.167 1.17e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 672.8 on 707 degrees of freedom
## Multiple R-squared: 0.8832, Adjusted R-squared: 0.8794
## F-statistic: 232.4 on 23 and 707 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 22.071017 3 1.674829
## as.factor(Promotion) 1.065614 1 1.032286
## as.factor(season) 175.743240 3 2.366718
## as.factor(mnth) 786.942276 11 1.354047
## as.factor(weathersit) 2.362922 2 1.239831
## windspeed 1.227831 1 1.108076
## poly(humidity, 2, raw = TRUE) 3.056050 2 1.322179
## Analysis of Variance Table
##
## Model 1: total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) +
## as.factor(season) + as.factor(weathersit) + windspeed + poly(humidity,
## 2, raw = TRUE)
## Model 2: total_riders ~ poly(temp, 3, raw = TRUE) + as.factor(Promotion) +
## as.factor(season) + as.factor(mnth) + as.factor(weathersit) +
## windspeed + poly(humidity, 2, raw = TRUE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 718 333358327
## 2 707 320065379 11 13292948 2.6694 0.002309 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
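The F statistic in the comparison of Model 1 and Model 2 can be reproduced by hand from the residual sums of squares in the table. The sketch below copies those numbers from the output above and should give values close to the reported F of 2.6694 and p-value of 0.002309.

```r
# Values copied from the ANOVA table comparing Model 1 and Model 2.
rss1 <- 333358327; df1 <- 718   # Model 1 (without mnth)
rss2 <- 320065379; df2 <- 707   # Model 2 (with mnth)

# Partial F statistic: drop in RSS per added parameter, relative to
# the residual mean square of the larger model.
f_stat  <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

print(round(f_stat, 4))
print(signif(p_value, 4))
```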
2) Problems with the Model
In this section, I explain the problems I encountered while improving the model and how I solved them. I broke them down into three sections: 1) Multicollinearity Assumption Problems, 2) Linearity Assumption Problems, and 3) Homoscedasticity Assumption Problems.
Multicollinearity
The first problem I ran into was multicollinearity between the month and season variables, and deciding which one to keep or whether to keep both. As mentioned in 1), the two variables are highly correlated because season already indicates which months a day falls in. I chose to leave out month because it has too much variability and most of its levels are not significant predictors of the DV, total_riders (only months 3 and 11 are significant at the 5% level in Model 2). Dropping it also reduced the overall VIF figure by about 0.34, from 8.56 to 8.22, as shown below.
Model #1: no mnth
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 5.466470 3 1.327245
## as.factor(Promotion) 1.051463 1 1.025408
## as.factor(season) 5.143654 3 1.313848
## as.factor(weathersit) 2.266273 2 1.226953
## windspeed 1.210655 1 1.100298
## poly(humidity, 2, raw = TRUE) 2.709878 2 1.283032
## [1] 8.21799
Model #2: include mnth
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 22.071017 3 1.674829
## as.factor(Promotion) 1.065614 1 1.032286
## as.factor(season) 175.743240 3 2.366718
## as.factor(mnth) 786.942276 11 1.354047
## as.factor(weathersit) 2.362922 2 1.239831
## windspeed 1.227831 1 1.108076
## poly(humidity, 2, raw = TRUE) 3.056050 2 1.322179
## [1] 8.559299
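The last column of the GVIF tables can be checked by hand, since it is GVIF raised to the power 1/(2 × Df). The sketch below does this for three rows copied from the Model 2 table. It also shows that the single numbers printed under each table are consistent with 1/(1 − R²) computed from each model's reported R-squared. That second point is an inference from the reported values, not something the output states.

```r
# GVIF values and degrees of freedom copied from the Model 2 table above.
gvif <- c(temp = 22.071017, season = 175.743240, mnth = 786.942276)
df   <- c(temp = 3,         season = 3,          mnth = 11)

# Adjusted value reported in the last column: GVIF^(1 / (2 * Df)).
adj <- gvif^(1 / (2 * df))
print(round(adj, 6))

# Assumption: the single printed figures equal 1 / (1 - R^2) for each model.
# R-squared values are the rounded ones from the model summaries.
overall <- 1 / (1 - c(model1 = 0.8783, model2 = 0.8832))
print(round(overall, 2))
```

Because the R-squared inputs are rounded to four digits, the second result only matches the printed figures to about two decimal places.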
Linearity
The second problem I encountered was with the linearity assumption for the humidity variable. As shown in the crPlots below, humidity behaves much like temp, except that it is inversely correlated with total_riders. I solved this problem by making humidity a polynomial, as we did with temp: humidity enters the model as a 2nd-degree polynomial, while temp is a 3rd-degree polynomial. The differences are as shown.
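A numerical companion to the component+residual plots, sketched on synthetic data with a deliberately curved humidity effect (all values are made up). It compares a linear and a quadratic humidity term with a nested-model F test and extracts the partial residuals that a component+residual plot displays. Base R is used here so the snippet runs without the car package.

```r
set.seed(7)
n <- 300
# Synthetic data with a concave humidity effect; values are made up.
toy <- data.frame(humidity = runif(n, 20, 100))
toy$riders <- 3000 + 60 * toy$humidity - 0.6 * toy$humidity^2 +
  rnorm(n, sd = 300)

fit_linear    <- lm(riders ~ humidity, data = toy)
fit_quadratic <- lm(riders ~ poly(humidity, 2, raw = TRUE), data = toy)

# Nested-model F test: does the squared term explain additional variation?
cmp <- anova(fit_linear, fit_quadratic)
print(cmp)

# Partial residuals for humidity in the linear fit; plotting them against
# humidity and seeing a curve is the signal for non-linearity.
pr <- residuals(fit_linear, type = "partial")
print(head(pr))
# plot(toy$humidity, pr[, "humidity"])  # uncomment to view the pattern
```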
Homoscedasticity
This section is not from Q1 but it is an extra observation and a possibility for improvement.
Despite the multicollinearity problem, the ANOVA comparison of the two models shows that the model with month included fits significantly better than the model without it (F = 2.67, p = 0.0023, as reported in the Model 2 tables in section 1). Separately, the diagnostic plots of the first model (no mnth) show a clear problem with the homoscedasticity assumption. I addressed this by taking the log of the DV, total_riders, which compresses the large values and evens out the spread of the residuals. As the plots below show, specifically the first two graphs, the trend line in “Residuals vs. Fitted” is flatter in the second model, which uses the log of the DV, and the points in the “Normal Q-Q” graph follow the reference line more closely. In addition, after logging the DV, the overall VIF figure fell by about 3.3, from 8.22 to 4.93, which is less than 5. This figure appears to track the model's R-squared (1/(1 − 0.8783) ≈ 8.22), so the drop likely reflects the change in the response rather than a change in the correlations among the predictors.
Note: the log-transformed model did not work when used to predict values.
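One possible reason predictions from a log-transformed model look wrong is that `predict()` returns values on the log scale, which need to be passed through `exp()` to get rider counts. Whether this was the issue in the report is not known; the sketch below only illustrates the back-transformation on synthetic data (all values are made up).

```r
set.seed(3)
n <- 300
# Synthetic data; values are made up for illustration.
toy <- data.frame(
  temp      = runif(n, 0, 35),
  Promotion = rbinom(n, 1, 0.5)
)
toy$riders <- round(exp(7 + 0.03 * toy$temp + 0.5 * toy$Promotion +
                          rnorm(n, sd = 0.3)))

log_fit <- lm(log(riders) ~ temp + as.factor(Promotion), data = toy)

new_day  <- data.frame(temp = 20, Promotion = 1)
pred_log <- predict(log_fit, newdata = new_day)  # on the log scale
pred     <- exp(pred_log)                        # back on the rider-count scale
print(c(log_scale = unname(pred_log), riders = unname(pred)))
```

Exponentiating a log-scale prediction gives a value closer to a typical (median) day than to the mean, so it will tend to sit below the average of the raw counts.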
## Analysis of Variance Table
##
## Response: total_riders
## Df Sum Sq Mean Sq F value Pr(>F)
## poly(temp, 3, raw = TRUE) 3 1267453249 422484416 909.963 < 2.2e-16 ***
## as.factor(Promotion) 1 759651907 759651907 1636.168 < 2.2e-16 ***
## as.factor(season) 3 115781949 38593983 83.125 < 2.2e-16 ***
## as.factor(weathersit) 2 187438472 93719236 201.856 < 2.2e-16 ***
## windspeed 1 24095498 24095498 51.898 1.481e-12 ***
## poly(humidity, 2, raw = TRUE) 2 51755989 25877995 55.737 < 2.2e-16 ***
## Residuals 718 333358327 464287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] 8.21799
## [1] 4.926672
Plots of Model #1: no mnth
Plots of Model #2: include mnth & log transformation
3) Month vs. Total Riders
Using the regression output from Question 1), the month with the highest number of riders, holding everything else constant, is month 10 (October).
Regarding the second question, if October became unseasonably cold and rainy, it would not change the coefficient on this month. The model already includes temp and weathersit, so the effect of cold, rainy days is captured by those terms rather than by the month coefficient. Consistent with this, the correlation between mnth and weathersit (0.04) is far weaker than the correlation between temp and weathersit (-0.12). Weather conditions should therefore mainly relate to temperature and other weather-related variables, while the month variable should be affected by date-related variables such as holiday.
| Variable | Correlation with weathersit |
|---|---|
| bikeshare.mnth | 0.04353 |
| bikeshare.temp | -0.1206 |
4) Promotion vs. Total Riders
The coefficient on Promotion is positive, at 1940.16. This means that when the promotion is active (Promotion = 1), the number of total riders is higher by approximately 1,940 per day, holding the other variables in the model constant. Based on this analysis, the marketing department's promotion did succeed in attracting riders.
Below is a box plot showing the difference between the total riders when there is a promotion and when there is not a promotion.
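To convey the uncertainty around the Promotion coefficient, a 95% confidence interval can be computed from the estimate and standard error reported in the Model 2 summary. The sketch below copies those values from the output and assumes the usual t-based interval.

```r
# Estimate and standard error for as.factor(Promotion)1, copied from Model 2.
estimate  <- 1940.16
std_error <- 51.38
resid_df  <- 707

# Two-sided 95% interval based on the t distribution.
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 1))
```

With the fitted model object available, `confint()` on the model would give the same kind of interval directly.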
5) Casual vs. Registered Riders
Below are box plots and tables showing, for casual and for registered riders, the difference in the average number of riders between promotional and non-promotional days. For casual riders the difference is about 341 riders per day, while for registered riders it is about 1,853. This shows that registered riders responded more strongly to the promotion than casual riders, in absolute terms. We can conclude that the Promotion program has a more substantial impact on registered riders than on casual riders.
Casual Riders
| Average Difference of Casual Riders |
|---|
| 341.1 |
Registered Riders
| Average Difference of Registered Riders |
|---|
| 1853 |
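Differences like the two above can be computed by averaging each rider type within promotional and non-promotional days. The sketch below uses made-up counts, not the real data, and reports the relative change alongside the absolute one because the two groups have very different baseline sizes.

```r
# Made-up daily counts for illustration; not taken from the real data.
toy <- data.frame(
  Promotion  = c(0, 0, 0, 1, 1, 1),
  casual     = c(500, 700, 600, 900, 1000, 950),
  registered = c(2500, 2700, 2600, 4300, 4600, 4450)
)

# Mean riders per day on promotional vs non-promotional days.
means <- aggregate(cbind(casual, registered) ~ Promotion, data = toy, FUN = mean)
print(means)

# Absolute and relative change from non-promotional to promotional days.
base  <- unlist(means[means$Promotion == 0, c("casual", "registered")])
promo <- unlist(means[means$Promotion == 1, c("casual", "registered")])
change <- rbind(absolute = promo - base, relative = (promo - base) / base)
print(change)
```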
6) Promotion Influence on Ridership
Before we can report whether to continue with the Promotion program, we need to know at least the profit margin and operating costs for renting bikes to both casual and registered riders. We need to know whether BikeShare gained or lost money during the promotional period. We may be able to compare that information with the gain or loss in the number of riders after the promotional period ends.
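As a rough illustration of why ridership alone does not settle the question, the sketch below applies the fee structure from the introduction (members pay half the daily fee; promotional days take 30% of the standard rate off) to made-up rider counts. The daily fee and the counts are hypothetical, and monthly membership fees and operating costs are ignored, so the result says nothing about the real program.

```r
# Fee structure from the introduction; the daily fee itself is hypothetical.
daily_fee <- 10  # assumed value

fee_per_ride <- function(rider_type, promo_day) {
  base     <- if (rider_type == "casual") 1.0 else 0.5  # members pay half
  discount <- if (promo_day) 0.3 else 0                 # 30% of standard rate
  (base - discount) * daily_fee
}

# Made-up average daily rider counts (not from the data set).
riders <- list(
  regular = c(casual = 600, registered = 2600),
  promo   = c(casual = 950, registered = 4450)
)

daily_revenue <- function(counts, promo_day) {
  unname(counts["casual"] * fee_per_ride("casual", promo_day) +
           counts["registered"] * fee_per_ride("registered", promo_day))
}

revenue <- c(regular = daily_revenue(riders$regular, FALSE),
             promo   = daily_revenue(riders$promo, TRUE))
print(revenue)
```

With these invented inputs, rental-fee revenue is lower on promotional days even though ridership is higher, which is the kind of trade-off the cost and margin data would need to resolve.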