GLM Assignment 2
Introduction
The marketing department for the city of Toronto, Canada, is very pleased with itself. It recently concluded a promotion intended to encourage use of the city’s new downtown bike sharing program, BikeShare™. Since the promotion ended, the marketing department has emphatically declared it a huge success, a claim based entirely on anecdotal responses from a few individuals on social media.
As the primary data analyst to the city’s chief administrative officer, you have been given the task of assessing the real impact of the promotion on the use of the bike sharing program. The bike sharing program operates on the following model:
• Bikes may be rented only on a daily basis and only for the entire day (no partial days).
• There are two ways a person can rent a bike:
    • Casual users can walk up to any available bike rental terminal, swipe a credit card, and rent a bike by paying a daily fee (assuming bikes are available).
    • Registered BikeShare™ members pay a monthly fee and are guaranteed bike rental availability whenever they want. They pay half the daily fee of the casual users.
• The marketing department’s promotion ran for two years. Within each year, half of the days were randomly assigned to be “promotional days” and the other half “non-promotional days.”
• On promotional days, the daily rental fee paid by the individual is discounted by 30% of the standard daily rate. As a result, on promotional days casual users pay 70% of the normal daily fee, and registered members pay 20% of the normal daily fee (their existing 50% member discount plus the 30% promotional discount).
Setup
Data Fields
| Field | Description |
|---|---|
| Season | The season of the year (1: Spring, 2: Summer, 3: Fall, 4: Winter) |
| Promotion | Dummy variable indicating whether the promotion was active on the day (0 = False, 1 = True) |
| Mnth | Month (1-12 representing Jan – Dec respectively) |
| Holiday | Dummy variable indicating whether the day was a holiday |
| Weekday | Day of the week (0-6 representing Sunday through Saturday respectively) |
| Workingday | Dummy variable indicating whether the day was a working day |
| Weathersit | Variable indicating the weather on the day (1: Clear, few clouds; 2: Mist, cloudy; 3: Light snow, light rain; 4: Heavy snow or rain) |
| Temp | Temperature (in degrees Celsius) |
| Humidity | Humidity (in percent) |
| Windspeed | Windspeed (in knots) |
| Casual | Number of Casual Riders for the day |
| Registered | Number of Registered riders for the day |
Data Cleaning
| Field | Action Taken |
|---|---|
| total_riders | New field created as Casual + Registered riders |
| Season | Changed to Factor Variable |
| Promotion | Changed to Factor Variable |
| Mnth | Changed to Factor Variable |
| Holiday | Changed to Factor Variable |
| Weekday | Changed to Factor Variable |
| Workingday | Changed to Factor Variable |
| Weathersit | Changed to Factor Variable |
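The cleaning steps in the table can be sketched in a few lines of base R. This is only an illustration: the small `bikeshare` data frame below is made-up stand-in data, the column names follow the spelling used in the regression output later in the report (`workingday` is an assumed spelling), and the original analysis may well have used tidyverse functions instead.

```r
# Tiny stand-in for the real data; values are invented for illustration only.
bikeshare <- data.frame(
  season     = c(1, 2, 3, 4),
  Promotion  = c(0, 1, 0, 1),
  mnth       = c(1, 4, 7, 10),
  holiday    = c(0, 0, 1, 0),
  weekday    = c(0, 3, 5, 6),
  workingday = c(0, 1, 0, 0),   # assumed column name
  weathersit = c(1, 2, 1, 3),
  casual     = c(120, 640, 910, 300),
  registered = c(1500, 3100, 3900, 2700)
)

# total_riders = Casual + Registered
bikeshare$total_riders <- bikeshare$casual + bikeshare$registered

# Convert the categorical fields to factors so that lm() estimates a
# separate coefficient for each level instead of a single slope.
factor_cols <- c("season", "Promotion", "mnth", "holiday",
                 "weekday", "workingday", "weathersit")
bikeshare[factor_cols] <- lapply(bikeshare[factor_cols], factor)

str(bikeshare)
```

Treating these fields as factors is what makes the first level of each (Spring, Sunday, January, clear weather) the baseline against which the other coefficients are read.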
Packages Used
| Package | Summary |
|---|---|
| tidyverse | The tidyverse collection of packages |
| skimr | Quick data check tool |
| car | Used for testing linearity of regression models |
| fmsb | Used for testing collinearity |
| rmdformats | Formatting for RMD output |
Question 1
You want to build the best regression model possible for the dependent variable, total riders.
To build this regression model, I started from a model that fits total riders as a function of temperature using a third-degree polynomial, then added and removed independent variables. Because the goal was only to find the best-fitting model for total ridership, I began with every variable included. The results showed that Season and Month are collinear, so I tested models using one or the other and found that Season gave a slightly higher R-squared. Next, I found that Workingday, Holiday, and Weekday are redundant with one another: every day is a working day, a holiday, or a weekend day, so Workingday adds nothing once Holiday and Weekday are in the model, and I dropped it. Finally, humidity behaves like temperature: ridership rises as humidity increases, but past a certain point it becomes uncomfortable and riders switch to a different, cooler form of transportation. To capture this curvature, I fit humidity with a second-degree polynomial.
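The turning point described for humidity can be located from the fitted quadratic. As a worked sketch, take the two humidity coefficients reported in the regression summary later in this section (46.00 and -0.5611, with humidity measured in percent) and hold every other term fixed. The humidity contribution f(h) then peaks where its derivative is zero:

```latex
f(h) = 46.00\,h - 0.5611\,h^{2}, \qquad
f'(h) = 46.00 - 2(0.5611)\,h = 0
\;\Longrightarrow\;
h^{*} = \frac{46.00}{2 \times 0.5611} \approx 41.0
```

Below roughly 41% humidity, additional humidity is associated with more riders; above it, with fewer.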
The final result is summarized below. The only coefficient that is not statistically significant is the one for Monday (weekday1, p = 0.19), meaning Monday ridership is not statistically different from Sunday, the baseline day. The residual standard error is 659.7 on 711 degrees of freedom, meaning a typical prediction from this model falls roughly 660 riders away from the actual total. Finally, the R-squared is 0.887, meaning the model explains 88.7% of the variance in total daily riders.
The final regression equation (Spring, Sunday, and clear weather are the baseline categories) is:
Total Riders = 1,735 - 356(temp) + 36.5(temp)^2 - 0.753(temp)^3 + 742.7(Summer) + 985.8(Fall) + 1,263(Winter) + 1,943(Promotion) - 530.2(Holiday) + 122.9(Monday) + 266.7(Tuesday) + 330.1(Wednesday) + 357.2(Thursday) + 417.2(Friday) + 446(Saturday) - 345.4(Mist, Cloudy) - 1,439(Light Snow, Light Rain) + 46(Humidity) - 0.5611(Humidity)^2 - 49.1(Windspeed)
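To make the equation concrete, here is a small worked prediction in R. It is a sketch only: the coefficients are copied from the regression summary that follows (note that the cubic temperature and squared humidity coefficients there are -0.7529 and -0.5611), and the day being predicted is hypothetical, with values chosen purely for illustration.

```r
# Coefficients copied from the Q1 regression summary.
# Baselines (coefficient 0): Spring, Sunday, clear weather, no promotion, no holiday.
b <- c(
  intercept = 1735,
  temp1     = -356.0,    # temp
  temp2     = 36.50,     # temp^2
  temp3     = -0.7529,   # temp^3
  summer    = 742.7,
  promotion = 1943,
  wednesday = 330.1,
  hum1      = 46.00,     # humidity
  hum2      = -0.5611,   # humidity^2
  wind      = -49.09     # windspeed
)

# Hypothetical day: a clear summer Wednesday with the promotion active.
temp      <- 20   # degrees Celsius
humidity  <- 50   # percent
windspeed <- 10   # knots

predicted_riders <- b[["intercept"]] +
  b[["temp1"]] * temp + b[["temp2"]] * temp^2 + b[["temp3"]] * temp^3 +
  b[["summer"]] + b[["promotion"]] + b[["wednesday"]] +
  b[["hum1"]] * humidity + b[["hum2"]] * humidity^2 +
  b[["wind"]] * windspeed

print(round(predicted_riders))   # 6614
```

The same day without the promotion would be predicted 1,943 riders lower, which is the "holding everything else constant" reading of the Promotion coefficient.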
##
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + season +
## Promotion + holiday + weekday + weathersit + poly(humidity,
## 2, raw = TRUE) + windspeed, data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3164.0 -290.8 46.4 400.1 2487.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.735e+03 5.371e+02 3.230 0.001296 **
## poly(temp, 3, raw = TRUE)1 -3.560e+02 6.488e+01 -5.487 5.68e-08 ***
## poly(temp, 3, raw = TRUE)2 3.650e+01 3.466e+00 10.532 < 2e-16 ***
## poly(temp, 3, raw = TRUE)3 -7.529e-01 5.791e-02 -13.001 < 2e-16 ***
## season2 7.427e+02 9.375e+01 7.922 9.03e-15 ***
## season3 9.858e+02 1.217e+02 8.102 2.36e-15 ***
## season4 1.263e+03 8.184e+01 15.435 < 2e-16 ***
## Promotion1 1.943e+03 5.006e+01 38.806 < 2e-16 ***
## holiday1 -5.302e+02 1.529e+02 -3.467 0.000559 ***
## weekday1 1.229e+02 9.393e+01 1.308 0.191205
## weekday2 2.667e+02 9.169e+01 2.909 0.003741 **
## weekday3 3.301e+02 9.198e+01 3.589 0.000355 ***
## weekday4 3.572e+02 9.185e+01 3.888 0.000110 ***
## weekday5 4.172e+02 9.208e+01 4.530 6.91e-06 ***
## weekday6 4.460e+02 9.133e+01 4.883 1.29e-06 ***
## weathersit2 -3.454e+02 6.693e+01 -5.161 3.20e-07 ***
## weathersit3 -1.439e+03 1.885e+02 -7.633 7.41e-14 ***
## poly(humidity, 2, raw = TRUE)1 4.600e+01 1.241e+01 3.707 0.000226 ***
## poly(humidity, 2, raw = TRUE)2 -5.611e-01 1.018e-01 -5.510 5.01e-08 ***
## windspeed -4.909e+01 5.179e+00 -9.477 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 659.7 on 711 degrees of freedom
## Multiple R-squared: 0.887, Adjusted R-squared: 0.884
## F-statistic: 293.9 on 19 and 711 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 5.550799 3 1.330636
## season 5.166089 3 1.314802
## Promotion 1.052329 1 1.025831
## holiday 1.096070 1 1.046933
## weekday 1.147851 6 1.011557
## weathersit 2.327299 2 1.235131
## poly(humidity, 2, raw = TRUE) 2.763279 2 1.289307
## windspeed 1.213123 1 1.101419
## [1] 8.852869
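The last line of the output above, 8.852869, is unlabeled. It is consistent with an overall variance inflation factor computed from the model's R-squared, which is how the VIF function in the fmsb package is understood to work; this attribution is an assumption, since the call itself is not shown. Using the rounded R-squared from the summary:

```latex
\mathrm{VIF} = \frac{1}{1 - R^{2}} = \frac{1}{1 - 0.887} \approx 8.85
```

The GVIF table above it, by contrast, reports collinearity term by term. Its GVIF^(1/(2*Df)) column is the one that can be compared across terms with different degrees of freedom.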
Question 2
Explain any problems you encountered with the assumptions of multicollinearity, linearity or homoscedasticity in this regression and how you solved them.
As stated above, Season and Month are collinear, as are Weekday, Workingday, and Holiday. I also noted that Humidity did not meet the linearity assumption. Below are the plots showing the linearity of both Temp and Humidity after using polynomial regression for both variables.
I did not make any changes to address heteroscedasticity.
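For reference, one common way to check the constant-variance assumption is a Breusch-Pagan style test. The sketch below is not taken from the original analysis: it uses simulated stand-in data (`x`, `y`) whose spread deliberately grows with the predictor, and it computes Koenker's studentized version of the test by hand in base R rather than relying on a package.

```r
set.seed(42)

# Simulated stand-in data (hypothetical): the spread of y grows with x,
# so the constant-variance assumption is deliberately violated.
n <- 500
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 + 0.5 * x)

fit <- lm(y ~ x)

# Studentized Breusch-Pagan test by hand: regress the squared residuals on
# the predictors; n * R^2 of that auxiliary regression is approximately
# chi-squared with one degree of freedom per predictor.
aux     <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 2), " p-value:", signif(bp_p, 3), "\n")
```

A small p-value suggests the residual variance changes with the predictors. For a fitted model such as the one in Q1, `plot(model, which = 1)` draws the residuals against the fitted values, where a funnel shape points to the same problem.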
Question 3
Your model from Q1 should include some means of assessing the impact the month of the year has on total ridership. Using your regression output, which month has the highest number of riders, holding everything else constant? If this month became unseasonably cold and rainy, would it change the coefficient on this month in any way?
The model I created in Q1 used season instead of month. To determine which month has the largest effect on total ridership, I removed season and included month. This substitution is reasonable because the two variables are collinear, and it only slightly decreased the R-squared, to 87.26%, so the model still fits well. In this updated model, October has the largest coefficient of any month, at 1,327.3. Holding all else constant, an October day has about 1,327 more riders than a January day; January is the baseline month in the model.
If October became unseasonably cold or rainy, ridership would decrease, but the October coefficient itself should not change in any meaningful way. The coefficient measures October's effect holding everything else constant, so the drop would be captured by the weather and temperature terms instead. For example, “Light Snow, Light Rain” has a coefficient of -1,478.8, meaning about 1,479 fewer riders in those conditions than in “Clear, Few Clouds.” The same applies to temperature: on an unseasonably cold day, predicted ridership is lower than on a warm (but not too hot) day.
##
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + mnth +
## Promotion + holiday + weekday + weathersit + poly(humidity,
## 2, raw = TRUE) + windspeed, data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3178.0 -321.2 45.7 417.7 2240.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1588.9269 598.3961 2.655 0.008103 **
## poly(temp, 3, raw = TRUE)1 -312.1385 79.9719 -3.903 0.000104 ***
## poly(temp, 3, raw = TRUE)2 34.0704 4.3402 7.850 1.55e-14 ***
## poly(temp, 3, raw = TRUE)3 -0.7159 0.0725 -9.875 < 2e-16 ***
## mnth2 22.1268 136.8906 0.162 0.871637
## mnth3 505.1967 149.0747 3.389 0.000741 ***
## mnth4 687.2603 165.5147 4.152 3.70e-05 ***
## mnth5 825.5255 186.8663 4.418 1.15e-05 ***
## mnth6 927.9040 211.6647 4.384 1.34e-05 ***
## mnth7 1173.1587 233.6638 5.021 6.53e-07 ***
## mnth8 926.9299 216.4969 4.281 2.11e-05 ***
## mnth9 1256.6820 193.1499 6.506 1.46e-10 ***
## mnth10 1327.3031 166.9448 7.951 7.41e-15 ***
## mnth11 1098.6560 150.7521 7.288 8.49e-13 ***
## mnth12 772.4838 141.2255 5.470 6.27e-08 ***
## Promotion1 1945.0511 53.8355 36.130 < 2e-16 ***
## holiday1 -594.1798 164.7902 -3.606 0.000333 ***
## weekday1 146.5532 100.4921 1.458 0.145189
## weekday2 287.1648 98.1813 2.925 0.003557 **
## weekday3 346.0437 98.5348 3.512 0.000473 ***
## weekday4 375.8489 98.5123 3.815 0.000148 ***
## weekday5 417.1138 98.4703 4.236 2.58e-05 ***
## weekday6 444.6662 97.5945 4.556 6.14e-06 ***
## weathersit2 -364.4568 72.3289 -5.039 5.96e-07 ***
## weathersit3 -1478.8201 203.8588 -7.254 1.07e-12 ***
## poly(humidity, 2, raw = TRUE)1 43.7864 13.4787 3.249 0.001215 **
## poly(humidity, 2, raw = TRUE)2 -0.5367 0.1108 -4.846 1.55e-06 ***
## windspeed -53.3188 5.5408 -9.623 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 704.7 on 703 degrees of freedom
## Multiple R-squared: 0.8726, Adjusted R-squared: 0.8677
## F-statistic: 178.3 on 27 and 703 DF, p-value: < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 21.489316 3 1.667390
## mnth 23.959487 11 1.155323
## Promotion 1.066555 1 1.032742
## holiday 1.115353 1 1.056103
## weekday 1.169467 6 1.013131
## weathersit 2.421475 2 1.247441
## poly(humidity, 2, raw = TRUE) 3.090540 2 1.325893
## windspeed 1.216703 1 1.103043
## [1] 7.847131
Question 4
Interpret (in simple terms) the coefficient on your “promotion” variable and make an initial judgement on the claims of the marketing department based on your analysis.
In both models, the Promotion coefficient is approximately 1,945 (1,943 in the season model and 1,945 in the month model): holding everything else constant, a promotional day has about 1,945 more riders than a non-promotional day. The marketing department would undoubtedly consider this a success, since ridership increases significantly on promotional days compared to non-promotional days.
Question 5
Perform some type of analysis that allows you to assess if the program had a more substantial impact on the casual riders or the registered riders.
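Log-transforming the dependent variable (casual riders or registered riders) lets us read the Promotion coefficient as an approximate percentage change in riders between promotional and non-promotional days. The casual rider model is summarized below. The Promotion coefficient is 0.481 on the log scale, with a 95% confidence interval of 0.337 to 0.625 (the estimate +/- about 2 standard errors). Read as an approximation, that is a 33.7% to 62.5% increase in casual riders on promotional days. Because the coefficient is large, the exact conversion exp(b) - 1 gives a larger figure, about 62% (interval roughly 40% to 87%). However, the R-squared of the model is only 0.05585, meaning promotional status alone explains only 5.6% of the variance in log casual riders.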
##
## Call:
## lm(formula = log(casual) ~ Promotion, data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8948 -0.6318 0.2526 0.6182 1.9208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.10698 0.05183 117.831 < 2e-16 ***
## Promotion1 0.48099 0.07325 6.567 9.77e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9902 on 729 degrees of freedom
## Multiple R-squared: 0.05585, Adjusted R-squared: 0.05455
## F-statistic: 43.12 on 1 and 729 DF, p-value: 9.775e-11
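The registered rider model is summarized below. The Promotion coefficient is 0.543 on the log scale, with a 95% confidence interval of 0.471 to 0.615 (the estimate +/- about 2 standard errors), or approximately a 47.1% to 61.5% increase in registered riders on promotional days. The exact conversion exp(b) - 1 gives about 72% (interval roughly 60% to 85%). This confidence interval is much tighter than the one for casual riders and, not surprisingly, the R-squared is much higher at 0.2306, meaning promotional status explains 23% of the variance in log registered riders.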
##
## Call:
## lm(formula = log(registered) ~ Promotion, data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3560 -0.2211 0.1331 0.3412 0.6277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.80913 0.02597 300.65 <2e-16 ***
## Promotion1 0.54262 0.03671 14.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4962 on 729 degrees of freedom
## Multiple R-squared: 0.2306, Adjusted R-squared: 0.2296
## F-statistic: 218.5 on 1 and 729 DF, p-value: < 2.2e-16
These results make sense: registered riders are already paying a monthly fee and need to pay only 20% of the normal daily fee on promotional days, so they have much more incentive to use the bikes. Casual riders, on the other hand, pay nothing if they do not ride that day, and if they do ride, they still pay 70% of the normal daily rate.
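Because both models use a log-transformed dependent variable, the Promotion coefficients are on the log scale, and the exact percentage change they imply is 100 * (exp(b) - 1). The sketch below redoes that conversion in base R using only the estimates, standard errors, and residual degrees of freedom printed in the two summaries above; the variable names are illustrative.

```r
# Estimates and standard errors copied from the two regression summaries.
coefs <- data.frame(
  group    = c("casual", "registered"),
  estimate = c(0.48099, 0.54262),
  se       = c(0.07325, 0.03671)
)
df_resid <- 729                     # residual degrees of freedom in both models
t_crit   <- qt(0.975, df_resid)     # about 1.96

# 95% confidence interval on the log scale.
coefs$lower <- coefs$estimate - t_crit * coefs$se
coefs$upper <- coefs$estimate + t_crit * coefs$se

# Exact percentage change implied by a log-scale coefficient.
coefs$pct       <- 100 * (exp(coefs$estimate) - 1)
coefs$pct_lower <- 100 * (exp(coefs$lower) - 1)
coefs$pct_upper <- 100 * (exp(coefs$upper) - 1)

print(cbind(coefs["group"], round(coefs[-1], 3)))
```

The log-scale intervals come out at about 0.337 to 0.625 for casual riders and 0.471 to 0.615 for registered riders. On either scale the registered estimate is somewhat larger, although the two intervals overlap.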
Question 6
You lack some information required to make a meaningful report on whether the promotion was a financial success or a failure. What additional information (from a business perspective) do you need to accurately make such a conclusion?
The first thing I would need to know is the monthly rate that registered riders pay, which would establish the breakeven point for registered riders compared with casual riders. I would also want to know the program's operating costs, specifically maintenance costs. Finally, I would want a monetary estimate of the damage riders do to the bikes and how that damage is split between casual and registered riders. If damage could be attributed to a specific day, the share of casual and registered riders on that day could be used to estimate how much each type of rider contributes. In short, many factors still need to be explored before deciding whether the program is a success.