The regression model I found to be the best fit for total riders included a cubic polynomial in temp, plus month (mnth), windspeed, humidity, weathersit, and Promotion.
First I built the simple model of riders as a function of temperature with a regression line:
Then I built the regression model with just the cubic polynomial in temp (temp, temp squared, and temp cubed), which gave these results:
##
## Call:
## lm(formula = total ~ temp + I(temp * temp) + I(temp * temp *
## temp), data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4724.0 -1034.4 -99.6 1130.1 3160.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 518.9929 775.3459 0.669 0.503472
## temp 63.1408 134.5298 0.469 0.638964
## I(temp * temp) 16.6342 7.2173 2.305 0.021461 *
## I(temp * temp * temp) -0.4324 0.1208 -3.580 0.000366 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1423 on 727 degrees of freedom
## Multiple R-squared: 0.4627, Adjusted R-squared: 0.4604
## F-statistic: 208.6 on 3 and 727 DF, p-value: < 0.00000000000000022
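As a minimal sketch of the cubic specification used above, the following fits the same formula shape to synthetic data. The `bikeshare` data set is not reproduced here, so the data frame `demo` and its generating coefficients are hypothetical. It also shows that `poly(temp, 3, raw = TRUE)` is an equivalent, more compact way to write the three `I()` terms.

```r
# Synthetic stand-in for the bikeshare data (hypothetical values).
set.seed(123)
demo <- data.frame(temp = runif(200, min = 2, max = 35))
demo$total <- 500 + 60 * demo$temp + 16 * demo$temp^2 -
  0.4 * demo$temp^3 + rnorm(200, sd = 500)

# Same formula shape as in the report: explicit I() terms.
fit_explicit <- lm(total ~ temp + I(temp * temp) + I(temp * temp * temp),
                   data = demo)

# Equivalent specification using raw (non-orthogonal) polynomial terms.
fit_poly <- lm(total ~ poly(temp, 3, raw = TRUE), data = demo)

# Both parameterizations give the same coefficients and fitted values.
same_coefs <- isTRUE(all.equal(unname(coef(fit_explicit)),
                               unname(coef(fit_poly))))
same_fit <- isTRUE(all.equal(unname(fitted(fit_explicit)),
                             unname(fitted(fit_poly))))
print(c(same_coefs = same_coefs, same_fit = same_fit))
```

Only the coefficient names differ between the two fits, so the choice between them is a matter of readability.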
I then plotted this polynomial fit as a baseline:
I then built the regression model with the variables mentioned above, which gave these results:
##
## Call:
## lm(formula = total ~ temp + I(temp * temp) + I(temp * temp *
## temp) + mnth + windspeed + humidity + weathersit + Promotion,
## data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3308.2 -344.5 64.3 452.6 2315.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3655.55759 476.93890 7.665 0.00000000000005895 ***
## temp -290.60944 83.34955 -3.487 0.000519 ***
## I(temp * temp) 32.60304 4.51908 7.215 0.00000000000139040 ***
## I(temp * temp * temp) -0.68868 0.07543 -9.130 < 0.0000000000000002 ***
## mnth2 43.40561 143.12868 0.303 0.761778
## mnth3 510.11506 155.60011 3.278 0.001095 **
## mnth4 707.99181 172.71289 4.099 0.00004623311183325 ***
## mnth5 885.32358 194.56963 4.550 0.00000630285601288 ***
## mnth6 1041.86721 219.17299 4.754 0.00000241897831399 ***
## mnth7 1254.92258 241.99134 5.186 0.00000028070973731 ***
## mnth8 1053.89186 224.01881 4.704 0.00000305872677145 ***
## mnth9 1301.82875 201.00980 6.476 0.00000000017521292 ***
## mnth10 1409.04014 173.34785 8.128 0.00000000000000193 ***
## mnth11 1148.87931 157.08176 7.314 0.00000000000070144 ***
## mnth12 812.94672 147.48171 5.512 0.00000004960769791 ***
## windspeed -55.11091 5.78616 -9.525 < 0.0000000000000002 ***
## humidity -21.65540 2.80124 -7.731 0.00000000000003660 ***
## weathersit2 -398.68533 73.44189 -5.429 0.00000007800758143 ***
## weathersit3 -1835.55211 186.82033 -9.825 < 0.0000000000000002 ***
## Promotion 1961.96028 56.05143 35.003 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 737.1 on 711 degrees of freedom
## Multiple R-squared: 0.859, Adjusted R-squared: 0.8552
## F-statistic: 227.9 on 19 and 711 DF, p-value: < 0.00000000000000022
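Nearly all of the coefficients were statistically significant, as indicated by their low p-values. The one exception is mnth2 (February), which does not differ significantly from January; January is the baseline month and so has no coefficient of its own. The overall R-squared of 0.8589755 means the model explains about 86% of the variance in total riders, which indicates a good fit.
I then plotted the model's residuals, followed by a component plot showing the residuals for the individual variables.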
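The summary statistics reported in the output can be cross-checked by hand. The sketch below recomputes the adjusted R-squared and the F-statistic from the R-squared value, assuming n = 731 observations (inferred from 711 residual degrees of freedom plus 20 estimated coefficients) and p = 19 predictors.

```r
# Values taken from the regression output; n is inferred, not reported.
r2 <- 0.8589755
p  <- 19            # predictors, excluding the intercept
n  <- 711 + p + 1   # residual df + number of estimated coefficients

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic for the null hypothesis that all slopes are zero.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 1))
```

Both values agree with the reported output (0.8552 and 227.9), which supports the inferred sample size.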
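Although I expected that at least humidity and temperature might be collinear, this did not turn out to be a concern.
A variance inflation factor (VIF) check shows low values for mnth, windspeed, humidity, weathersit, and Promotion, so none of these is highly correlated with the other predictors. The three temp terms do have high values, but that is expected: temp, its square, and its cube are strongly correlated with one another by construction.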
## GVIF Df GVIF^(1/(2*Df))
## temp 525.701581 1 22.928183
## I(temp * temp) 2575.348359 1 50.747890
## I(temp * temp * temp) 824.325500 1 28.711069
## mnth 22.066947 11 1.151010
## windspeed 1.212639 1 1.101199
## humidity 2.138553 1 1.462379
## weathersit 1.834181 2 1.163752
## Promotion 1.056642 1 1.027931
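The large values for the three temp terms in the table above are typical when raw powers of one variable enter a model together. The sketch below illustrates this on synthetic data (the variables and the helper `vif_of` are hypothetical, and the VIF is computed by hand as 1 / (1 - R²) rather than with the function used in the report). It also shows that orthogonal polynomial terms from `poly()` remove this structural collinearity without changing the fitted curve.

```r
# Synthetic predictors (hypothetical values, not the bikeshare data).
set.seed(1)
temp <- runif(300, min = 2, max = 35)
humidity <- runif(300, min = 20, max = 100)

# VIF of one predictor: regress it on the others, then 1 / (1 - R^2).
vif_of <- function(target, others) {
  d <- cbind(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

# Raw powers of temp are almost perfectly predictable from each other.
raw <- data.frame(t1 = temp, t3 = temp^3, humidity = humidity)
vif_raw <- vif_of(temp^2, raw)

# Orthogonal polynomial columns are uncorrelated with each other.
op <- poly(temp, 3)
orth <- data.frame(p1 = op[, 1], p3 = op[, 3], humidity = humidity)
vif_orth <- vif_of(op[, 2], orth)

print(c(raw_square = vif_raw, orthogonal_square = vif_orth))
```

A high VIF among the polynomial terms inflates the standard errors of those individual terms but does not affect the estimates for the other predictors, which is why the low values for humidity and the remaining variables are the relevant result here.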
From the regression output, the month with the most riders, holding the other variables constant, is the one with the highest coefficient: mnth10, or October, with about 1,409 more riders per day than the January baseline. Because the model is additive and includes no interaction between month and weather, a change in weather would not change the coefficient on the month.
The residual plot isolated for month confirms the estimate of more riders in October:
The coefficient for the Promotion variable, 1961.9602777, indicates that the Promotion does have an effect on the number of riders in this model. On a day when the Promotion was active, the model estimates about 1,962 more total riders, holding the other variables constant.
This is helpful in understanding the overall impact of the Promotion. However, it does not necessarily imply that the Promotion is good business. For example, there may be a greater influence on registered riders than on casual riders, which would effectively give already-captured customers an unnecessary discount.
Using the same variables, but separating the rider population into casual and registered riders, there is an apparent difference in the influence of the promotion on the two types of riders.
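To give a sense of the uncertainty around this estimate, the sketch below computes a 95% confidence interval for the Promotion coefficient directly from the estimate, standard error, and residual degrees of freedom reported in the regression output. It assumes the usual t-based interval for a linear model coefficient.

```r
# Values copied from the regression output for the Promotion term.
estimate  <- 1961.96028
std_error <- 56.05143
resid_df  <- 711

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(ci))
```

The interval runs from roughly 1,852 to 2,072 additional riders per day, so the estimated effect is well away from zero.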
Comparing the standardized coefficients of the variables shows this difference:
For casual riders, the standardized coefficient is 0.1670051, and for registered riders it is 0.4083562. Measured in standard deviations of each group's ridership, registered ridership therefore increases considerably more than casual ridership when the Promotion is in effect.
While the above model and statistics help to show the effect of the promotion, key information is still needed to understand the business impact of these results: namely, the net revenue of the promotion per rider and, more importantly, the net income of the promotion for casual riders versus registered riders. Casual riders likely provide more net revenue and are the more desirable riders. If the goal of the promotion is to maximize net revenue by attracting new or more frequent casual riders, these results suggest it is less effective for that group than it is for registered riders.
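The report does not show how the standardized coefficients were computed. One common definition, assumed here, rescales the raw coefficient by the ratio of the predictor's standard deviation to the response's standard deviation. The sketch below applies that definition to synthetic data (the data frame `d` and its generating values are hypothetical) and confirms that it matches a regression on z-scored variables.

```r
# Synthetic daily data (hypothetical values, not the bikeshare data).
set.seed(42)
n <- 500
d <- data.frame(promo = rbinom(n, 1, 0.5),
                temp  = runif(n, min = 2, max = 35))
d$riders <- 1000 + 800 * d$promo + 40 * d$temp + rnorm(n, sd = 300)

# Raw coefficient rescaled by sd(predictor) / sd(response).
fit <- lm(riders ~ promo + temp, data = d)
std_manual <- coef(fit)["promo"] * sd(d$promo) / sd(d$riders)

# Same quantity from a regression on z-scored variables.
z <- as.data.frame(lapply(d, function(col) as.numeric(scale(col))))
fit_z <- lm(riders ~ promo + temp, data = z)
std_scaled <- coef(fit_z)["promo"]

print(c(manual = unname(std_manual), scaled = unname(std_scaled)))
```

Because each standardized coefficient is expressed in standard deviations of its own response, the casual and registered values can be compared even though the two groups have very different ridership levels.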