Introduction

The marketing department for the city of Toronto, Canada, is very pleased with itself. It recently concluded a promotion intended to encourage use of the city’s new downtown bike sharing program, called BikeShare™. Since the conclusion of the promotion, the marketing department has emphatically declared the promotion a huge success, a claim based entirely on anecdotal responses from a few individuals on social media.

As the primary data analyst to the city’s chief administrative officer, you have been given the task of assessing the real impact of the promotion on the use of the bike sharing program. The bike sharing program operates on the following model:

• Bikes may be rented only on a daily basis and only for the entire day (no partial days).

• There are two ways a person can rent a bike.

  1. Casual users can walk up to any available bike rental terminal, swipe a credit card, and rent a bike by paying a daily fee (assuming bikes are available).

  2. Registered BikeShare™ members pay a monthly fee and are guaranteed bike rental availability whenever they want. They pay half the daily fee of the casual users.

• The marketing department’s promotion ran over the course of two years and randomly assigned half of each year’s days to be “promotional days” and the other half to be “non-promotional days.”

• On promotional days, the daily rental fee paid by the individual is discounted by 30% of the standard daily rate. As a result, on promotional days casual users pay 70% of the normal daily fee, and registered members pay 20% of the normal daily fee (because they are already receiving a 50% discount on rentals by being members).
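The fee arithmetic above can be checked with a short sketch. It assumes, as the text states, that the two discounts are both taken off the standard daily rate and therefore add up (stacking them multiplicatively would instead give registered members 35%). The names `member_discount`, `promo_discount`, and `fee_fraction` are illustrative only.

```r
# Discounts are both expressed as a share of the standard daily fee.
member_discount <- 0.50  # registered members pay half the daily fee
promo_discount  <- 0.30  # promotional days take 30% off the standard rate

# Fraction of the standard daily fee actually paid by a rider.
fee_fraction <- function(registered, promo) {
  1 - member_discount * registered - promo_discount * promo
}

# Rows: rider type; columns: day type.
rates <- outer(c(casual = FALSE, registered = TRUE),
               c(regular = FALSE, promo = TRUE),
               fee_fraction)
print(rates)
```

The resulting table reproduces the 70% and 20% figures quoted for promotional days.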

Setup

Data Fields

Field         Description
Season        The season of the year (1: Spring, 2: Summer, 3: Fall, 4: Winter)
Promotion     Dummy variable indicating whether the promotion was active on the day (0 = False, 1 = True)
Mnth          Month (1-12 representing Jan – Dec respectively)
Holiday       Dummy variable indicating whether the day was a holiday
Weekday       Day of the week (0-6 representing Sunday through Saturday respectively)
Workingday    Dummy variable indicating whether the day was a working day
Weathersit    Variable indicating the weather on the day (1: Clear, few clouds; 2: Mist, cloudy; 3: Light snow, light rain; 4: Heavy snow or rain)
Temp          Temperature (in degrees Celsius)
Humidity      Humidity (in percent)
Windspeed     Windspeed (in knots)
Casual        Number of casual riders for the day
Registered    Number of registered riders for the day

Data Cleaning

Field           Action Taken
total_riders    Created as Casual + Registered riders
Season          Changed to factor variable
Promotion       Changed to factor variable
Mnth            Changed to factor variable
Holiday         Changed to factor variable
Weekday         Changed to factor variable
Workingday      Changed to factor variable
Weathersit      Changed to factor variable
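The cleaning steps in the table could be sketched as follows. This is not the original cleaning code: `clean_bikeshare` is a hypothetical helper, the column names are taken from the Data Fields table (their exact spelling and case in the real file are an assumption), and the two-row data frame is made up purely to show the call. Base R is used here so the sketch runs without extra packages.

```r
# Hypothetical helper: adds total_riders and converts the categorical
# columns listed in the Data Cleaning table to factors.
clean_bikeshare <- function(df) {
  df$total_riders <- df$Casual + df$Registered
  factor_cols <- c("Season", "Promotion", "Mnth", "Holiday",
                   "Weekday", "Workingday", "Weathersit")
  df[factor_cols] <- lapply(df[factor_cols], factor)
  df
}

# Two made-up rows, only to demonstrate the function.
toy <- data.frame(Season = c(1, 2), Promotion = c(0, 1), Mnth = c(1, 6),
                  Holiday = c(0, 0), Weekday = c(0, 3), Workingday = c(0, 1),
                  Weathersit = c(1, 2), Casual = c(100, 250),
                  Registered = c(900, 1500))
cleaned <- clean_bikeshare(toy)
str(cleaned)
```

Converting the coded columns to factors matters because `lm()` would otherwise treat codes such as 1 to 4 for season as a numeric scale rather than as separate categories.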

Packages Used

Package       Summary
tidyverse     The tidyverse collection of packages
skimr         Quick data check tool
car           Used for testing linearity of regression models
fmsb          Used for testing collinearity
rmdformats    Formatting for RMD output

Question 1

You want to build the best regression model possible for the dependent variable, total riders.

To build this regression model, I started with the model where we fit total riders as a function of temperature using a third-degree polynomial. From there, I added and removed independent variables. Since we were only looking for the best-fitting model to explain total ridership, I started with all variables included. After viewing the results, I realized that Season and Month are collinear, so I tested models using only one or the other and found that Season provided a slightly higher R-squared. Next I realized that the Workingday, Holiday, and Weekday variables were redundant: in this dataset every day is a working day, a holiday, or a weekend, so Workingday is fully determined by Holiday and Weekday. I therefore dropped Workingday and kept the other two. Finally, humidity behaves like temperature in that ridership increases with it up to a point, beyond which conditions become uncomfortable and riders find a different, cooler form of transportation. To capture this, I used a second-degree polynomial for humidity.

The final result is summarized below. Based on the p-values, the only coefficient that is not statistically significant is Monday (weekday1), meaning Monday ridership is not statistically different from Sunday, the baseline day. The residual standard error is 659.7 on 711 degrees of freedom, meaning a typical prediction from this model falls about 660 riders away from the actual total riders. Finally, the R-squared is 0.887, meaning that 88.7% of the variance in total daily riders is explained by the model.

The final regression equation is:

Total Riders = 1,735 - 356(temp) + 36.5(temp)^2 - 0.753(temp)^3 + 742.7(Summer) + 985.8(Fall) + 1,263(Winter) + 1,943(Promotion) - 530.2(Holiday) + 122.9(Monday) + 266.7(Tuesday) + 330.1(Wednesday) + 357.2(Thursday) + 417.2(Friday) + 446(Saturday) - 345.4(Mist, Cloudy) - 1,439(Light Snow, Light Rain) + 46(Humidity) - 0.561(Humidity)^2 - 49.1(Windspeed)
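As a worked example of what the cubic temperature term implies, the sketch below finds the temperature at which the temperature contribution to predicted ridership peaks, holding everything else constant. The three coefficients are copied from the regression output for this model; the variable names are illustrative, and the result is only meaningful within the range of temperatures actually observed in the data.

```r
# Temperature coefficients from the regression output (raw polynomial terms).
b1 <- -356.0    # temp
b2 <- 36.50     # temp^2
b3 <- -0.7529   # temp^3

# Contribution of temperature to predicted riders, all else held constant.
temp_effect <- function(t) b1 * t + b2 * t^2 + b3 * t^3

# Stationary points solve b1 + 2*b2*t + 3*b3*t^2 = 0.
# polyroot() expects coefficients in increasing order of power.
roots <- Re(polyroot(c(b1, 2 * b2, 3 * b3)))

# Keep the root where the second derivative is negative (a local maximum).
peak_temp <- roots[2 * b2 + 6 * b3 * roots < 0]
print(peak_temp)
```

Under these coefficients the peak sits at roughly 26 degrees Celsius, which fits the idea that ridership rises with temperature until it becomes too hot.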

## 
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + season + 
##     Promotion + holiday + weekday + weathersit + poly(humidity, 
##     2, raw = TRUE) + windspeed, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3164.0  -290.8    46.4   400.1  2487.1 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.735e+03  5.371e+02   3.230 0.001296 ** 
## poly(temp, 3, raw = TRUE)1     -3.560e+02  6.488e+01  -5.487 5.68e-08 ***
## poly(temp, 3, raw = TRUE)2      3.650e+01  3.466e+00  10.532  < 2e-16 ***
## poly(temp, 3, raw = TRUE)3     -7.529e-01  5.791e-02 -13.001  < 2e-16 ***
## season2                         7.427e+02  9.375e+01   7.922 9.03e-15 ***
## season3                         9.858e+02  1.217e+02   8.102 2.36e-15 ***
## season4                         1.263e+03  8.184e+01  15.435  < 2e-16 ***
## Promotion1                      1.943e+03  5.006e+01  38.806  < 2e-16 ***
## holiday1                       -5.302e+02  1.529e+02  -3.467 0.000559 ***
## weekday1                        1.229e+02  9.393e+01   1.308 0.191205    
## weekday2                        2.667e+02  9.169e+01   2.909 0.003741 ** 
## weekday3                        3.301e+02  9.198e+01   3.589 0.000355 ***
## weekday4                        3.572e+02  9.185e+01   3.888 0.000110 ***
## weekday5                        4.172e+02  9.208e+01   4.530 6.91e-06 ***
## weekday6                        4.460e+02  9.133e+01   4.883 1.29e-06 ***
## weathersit2                    -3.454e+02  6.693e+01  -5.161 3.20e-07 ***
## weathersit3                    -1.439e+03  1.885e+02  -7.633 7.41e-14 ***
## poly(humidity, 2, raw = TRUE)1  4.600e+01  1.241e+01   3.707 0.000226 ***
## poly(humidity, 2, raw = TRUE)2 -5.611e-01  1.018e-01  -5.510 5.01e-08 ***
## windspeed                      -4.909e+01  5.179e+00  -9.477  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 659.7 on 711 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.884 
## F-statistic: 293.9 on 19 and 711 DF,  p-value: < 2.2e-16
##                                   GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE)     5.550799  3        1.330636
## season                        5.166089  3        1.314802
## Promotion                     1.052329  1        1.025831
## holiday                       1.096070  1        1.046933
## weekday                       1.147851  6        1.011557
## weathersit                    2.327299  2        1.235131
## poly(humidity, 2, raw = TRUE) 2.763279  2        1.289307
## windspeed                     1.213123  1        1.101419
## [1] 8.852869
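The last lines of the output above are collinearity diagnostics. The final value is consistent with 1 / (1 - R-squared) for the whole model (1 / (1 - 0.887) is about 8.85). The same idea applies per predictor: regress one predictor on the others and inflate by how well it is explained. The sketch below illustrates that calculation in base R on simulated data; the variables `x1`, `x2`, `x3` and the helper `vif_one` are made up for illustration and are not part of the bike share data.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # unrelated to the other two

# VIF for one predictor: regress it on the rest and use 1 / (1 - R^2).
vif_one <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(x1 = vif_one(x1, data.frame(x2, x3)),
          x2 = vif_one(x2, data.frame(x1, x3)),
          x3 = vif_one(x3, data.frame(x1, x2)))
print(round(vifs, 2))
```

In the simulation the two near-duplicate predictors get very large values while the independent one stays close to 1, which is the pattern that flagged Season and Month as redundant.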

Question 2

Explain any problems you encountered with the assumptions of multicollinearity, linearity or homoscedasticity in this regression and how you solved them.

As stated above, Season and Month are collinear, as are Weekday, Workingday, and Holiday. I addressed this by keeping Season rather than Month and by dropping Workingday. I also mentioned that Humidity did not meet the linearity assumption. Below are the plots showing the linearity of both Temp and Humidity after using polynomial terms for both variables.

I did not make any changes to address heteroscedasticity.
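For reference, one common way to check the constant-variance assumption is a Breusch-Pagan style test. The sketch below is a minimal base R version of the studentized (Koenker) form using the fitted values as the auxiliary regressor, run on simulated data whose noise grows with the predictor. It was not part of the analysis above; `bp_fitted` and the simulated `x` and `y` are hypothetical.

```r
set.seed(42)
n <- 300
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 + 0.5 * x)  # noise grows with x
fit <- lm(y ~ x)

# Regress squared residuals on the fitted values; under constant variance,
# n * R^2 from this auxiliary regression is approximately chi-squared(1).
bp_fitted <- function(model) {
  aux  <- lm(resid(model)^2 ~ fitted(model))
  stat <- length(resid(model)) * summary(aux)$r.squared
  c(statistic = stat,
    p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

print(bp_fitted(fit))
```

A small p-value suggests the residual spread changes with the fitted values, in which case robust standard errors or a transformed response would be typical remedies.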

Question 3

Your model from Q1 should include some means of assessing the impact the month of the year has on total ridership. Using your regression output, which month has the highest number of riders, holding everything else constant? If this month became unseasonably cold and rainy, would it change the coefficient on this month in any way?

The model I created in Q1 used season instead of month. To address which month has the largest impact on total ridership, I removed season and included month. This is possible because the two are collinear, and the swap only slightly decreased the R-squared to 87.26%, which is still a strong fit. In the updated model, October has the largest coefficient of any month at 1,327.3. This means that, holding all else constant, an October day has about 1,327 more riders than a January day, January being the baseline month in the model.

If October became unseasonably cold or rainy, predicted ridership would decrease, but the October coefficient itself would not be expected to change, because it is estimated holding everything else constant, including weather and temperature. The drop would come through the other terms: “Light Snow, Light Rain” has a coefficient of -1,478.8, meaning that in those weather conditions there are about 1,479 fewer riders than under “Clear, Few Clouds”. The same goes for temperature: when it is unseasonably cold, ridership will be lower than when it is warm, but not too hot.
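The month comparison can be reproduced from the reported coefficients alone. In the sketch below the values are copied from the regression output for the month model, with January set to 0 because it is the baseline; the vector name `month_coef` and the month labels are illustrative.

```r
# Month coefficients from the month model (January is the baseline).
month_coef <- c(Jan = 0, Feb = 22.1268, Mar = 505.1967, Apr = 687.2603,
                May = 825.5255, Jun = 927.9040, Jul = 1173.1587,
                Aug = 926.9299, Sep = 1256.6820, Oct = 1327.3031,
                Nov = 1098.6560, Dec = 772.4838)
top_month <- names(which.max(month_coef))

# Predicted change for a single day that moves from "Clear, few clouds"
# to "Light snow, light rain": only the weather term moves.
weathersit3 <- -1478.8201

cat(top_month, "coefficient:", month_coef[[top_month]], "\n")
cat("Change from weather alone:", weathersit3, "\n")
```

The month effect and the weather effect enter the prediction as separate additive terms, which is why a rainy October day lowers the prediction without altering the October term.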

## 
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + mnth + 
##     Promotion + holiday + weekday + weathersit + poly(humidity, 
##     2, raw = TRUE) + windspeed, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3178.0  -321.2    45.7   417.7  2240.2 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1588.9269   598.3961   2.655 0.008103 ** 
## poly(temp, 3, raw = TRUE)1      -312.1385    79.9719  -3.903 0.000104 ***
## poly(temp, 3, raw = TRUE)2        34.0704     4.3402   7.850 1.55e-14 ***
## poly(temp, 3, raw = TRUE)3        -0.7159     0.0725  -9.875  < 2e-16 ***
## mnth2                             22.1268   136.8906   0.162 0.871637    
## mnth3                            505.1967   149.0747   3.389 0.000741 ***
## mnth4                            687.2603   165.5147   4.152 3.70e-05 ***
## mnth5                            825.5255   186.8663   4.418 1.15e-05 ***
## mnth6                            927.9040   211.6647   4.384 1.34e-05 ***
## mnth7                           1173.1587   233.6638   5.021 6.53e-07 ***
## mnth8                            926.9299   216.4969   4.281 2.11e-05 ***
## mnth9                           1256.6820   193.1499   6.506 1.46e-10 ***
## mnth10                          1327.3031   166.9448   7.951 7.41e-15 ***
## mnth11                          1098.6560   150.7521   7.288 8.49e-13 ***
## mnth12                           772.4838   141.2255   5.470 6.27e-08 ***
## Promotion1                      1945.0511    53.8355  36.130  < 2e-16 ***
## holiday1                        -594.1798   164.7902  -3.606 0.000333 ***
## weekday1                         146.5532   100.4921   1.458 0.145189    
## weekday2                         287.1648    98.1813   2.925 0.003557 ** 
## weekday3                         346.0437    98.5348   3.512 0.000473 ***
## weekday4                         375.8489    98.5123   3.815 0.000148 ***
## weekday5                         417.1138    98.4703   4.236 2.58e-05 ***
## weekday6                         444.6662    97.5945   4.556 6.14e-06 ***
## weathersit2                     -364.4568    72.3289  -5.039 5.96e-07 ***
## weathersit3                    -1478.8201   203.8588  -7.254 1.07e-12 ***
## poly(humidity, 2, raw = TRUE)1    43.7864    13.4787   3.249 0.001215 ** 
## poly(humidity, 2, raw = TRUE)2    -0.5367     0.1108  -4.846 1.55e-06 ***
## windspeed                        -53.3188     5.5408  -9.623  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 704.7 on 703 degrees of freedom
## Multiple R-squared:  0.8726, Adjusted R-squared:  0.8677 
## F-statistic: 178.3 on 27 and 703 DF,  p-value: < 2.2e-16
##                                    GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE)     21.489316  3        1.667390
## mnth                          23.959487 11        1.155323
## Promotion                      1.066555  1        1.032742
## holiday                        1.115353  1        1.056103
## weekday                        1.169467  6        1.013131
## weathersit                     2.421475  2        1.247441
## poly(humidity, 2, raw = TRUE)  3.090540  2        1.325893
## windspeed                      1.216703  1        1.103043
## [1] 7.847131

Question 4

Interpret (in simple terms) the coefficient on your “promotion” variable and make an initial judgement on the claims of the marketing department based on your analysis.

Based on both models, promotional days have approximately 1,945 more total riders than non-promotional days, holding everything else constant (the coefficients are 1,943 and 1,945). The marketing department would undoubtedly consider this a success, as ridership is significantly higher on promotional days than on non-promotional days.

Question 5

Perform some type of analysis that allows you to assess if the program had a more substantial impact on the casual riders or the registered riders.

By log-transforming the dependent variable (casual riders or registered riders), the Promotion coefficient can be read as an approximate percentage change in riders between promotional and non-promotional days. The casual rider model is summarized below. The 95% confidence interval is roughly a 33.7% to 62.5% increase in casual riders on promotional days (48.1% +/- 1.96 standard errors, reading the log-scale coefficient as an approximate percentage). However, the R-squared of the model is only 0.05585, meaning only 5.6% of the variance in log casual riders is explained by the promotional status of the day.

## 
## Call:
## lm(formula = log(casual) ~ Promotion, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8948 -0.6318  0.2526  0.6182  1.9208 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.10698    0.05183 117.831  < 2e-16 ***
## Promotion1   0.48099    0.07325   6.567 9.77e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9902 on 729 degrees of freedom
## Multiple R-squared:  0.05585,    Adjusted R-squared:  0.05455 
## F-statistic: 43.12 on 1 and 729 DF,  p-value: 9.775e-11

The registered rider model is summarized below. The 95% confidence interval is roughly a 47.1% to 61.5% increase in registered riders on promotional days (54.3% +/- 1.96 standard errors). This confidence interval is much tighter and, not surprisingly, the R-squared of the model is much better at 0.2306, meaning 23% of the variance in log registered riders is explained by the promotional status of the day.

## 
## Call:
## lm(formula = log(registered) ~ Promotion, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3560 -0.2211  0.1331  0.3412  0.6277 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.80913    0.02597  300.65   <2e-16 ***
## Promotion1   0.54262    0.03671   14.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4962 on 729 degrees of freedom
## Multiple R-squared:  0.2306, Adjusted R-squared:  0.2296 
## F-statistic: 218.5 on 1 and 729 DF,  p-value: < 2.2e-16

These results make sense, as registered riders are already paying a monthly fee and only need to pay 20% of the normal daily fee on promotional days, so they have much more incentive to use the bikes. The casual rider, on the other hand, does not pay anything on days they do not ride, and when they do ride on a promotional day they still have to pay 70% of the normal daily rate.
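Reading a log-scale coefficient directly as a percentage is an approximation that gets looser as the coefficient grows. The sketch below converts the two Promotion estimates and their standard errors, copied from the outputs above, into exact percentage changes using exp(b) - 1. It assumes a normal-approximation 95% interval; the object names are illustrative.

```r
# Estimates and standard errors from the two log-linear model outputs.
est <- c(casual = 0.48099, registered = 0.54262)
se  <- c(casual = 0.07325, registered = 0.03671)

z <- qnorm(0.975)  # about 1.96 for a two-sided 95% interval
log_ci <- cbind(lower = est - z * se, estimate = est, upper = est + z * se)

# exp(b) - 1 is the exact proportional change implied by a log-scale
# coefficient; multiply by 100 to express it as a percentage.
pct_change <- 100 * (exp(log_ci) - 1)
print(round(pct_change, 1))
```

On the exact scale both increases are larger than the log-point figures suggest, but the ordering is unchanged: the point estimate for registered riders remains above the one for casual riders.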

Question 6

You lack some information required to make a meaningful report on whether the promotion was a financial success or a failure. What additional information (from a business perspective) do you need to accurately make such a conclusion?

The first thing I would need to know is the monthly rate that registered riders pay. This would help establish the breakeven point for registered riders compared to casual riders. I would also want to know what operating costs the program has, specifically maintenance costs. Finally, I would like to know the monetary amount of damage riders do to the bikes and how it is split between casual and registered riders. If damage could be attributed to a specific day, the share of casual and registered riders on that day could be used to estimate how much damage each type of rider causes. All of this is to say that there are still many factors that should be explored before deciding whether the program is a success.