Bikeshare Data

Question 1

You want to build the best regression model possible for the dependent variable, total riders. Begin with the example from class where we fit total riders as a function of temperature using a third-degree polynomial. Add as many additional variables to your model as feasible to improve fit. Remember, your goal is to build the best fitting regression model explaining total ridership, using the tools we have covered regarding the linearity and multicollinearity assumptions.

Specify your generalized regression equation, include the regression output (with coefficients, standard errors, etc.), and provide a summary of your work.

The regression equation with the best fit for total riders is:

Total Riders = 3644.94 - 338.95(temp) + 34.66(temp^2) - 0.72(temp^3) + 540.68(season) + weekday (factor; days 2-6 are significant, see output below) - 611.50(weathersit) - 19.47(humidity) - 52.67(windspeed) + month (factor; months 3, 5, 6, and 11 are significant, see output below) + 1994.75(promotion)

The adjusted R^2 is 87.34% (multiple R^2 = 87.77%), and the overall F-test is significant (p-value < 2.2e-16).

## 
## Call:
## lm(formula = total ~ temp + I(temp * temp) + I(temp * temp * 
##     temp) + day + season + weathersit + humidity + windspeed + 
##     month + Promotion, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3818.3  -335.9    50.3   417.1  2441.5 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3644.93628  445.28728   8.186 1.27e-15 ***
## temp                  -338.94660   78.26837  -4.331 1.70e-05 ***
## I(temp * temp)          34.65984    4.24451   8.166 1.47e-15 ***
## I(temp * temp * temp)   -0.72245    0.07085 -10.197  < 2e-16 ***
## day1                    81.12511   95.50016   0.849 0.395905    
## day2                   279.31134   95.97137   2.910 0.003724 ** 
## day3                   331.71820   96.01766   3.455 0.000584 ***
## day4                   369.81681   96.26663   3.842 0.000133 ***
## day5                   477.34206   95.96753   4.974 8.25e-07 ***
## day6                   443.56789   95.43177   4.648 4.00e-06 ***
## season                 540.67582   53.86374  10.038  < 2e-16 ***
## weathersit            -611.50152   62.73238  -9.748  < 2e-16 ***
## humidity               -19.47324    2.63654  -7.386 4.29e-13 ***
## windspeed              -52.66633    5.40726  -9.740  < 2e-16 ***
## month2                  61.67992  133.86831   0.461 0.645120    
## month3                 366.66496  146.14056   2.509 0.012331 *  
## month4                 247.05468  168.23641   1.468 0.142415    
## month5                 486.84951  187.16874   2.601 0.009487 ** 
## month6                 476.84542  213.85786   2.230 0.026080 *  
## month7                 361.22341  245.12285   1.474 0.141024    
## month8                 148.11948  230.01652   0.644 0.519816    
## month9                 183.07284  218.06218   0.840 0.401449    
## month10               -154.73461  221.94557  -0.697 0.485924    
## month11               -466.36323  213.10511  -2.188 0.028967 *  
## month12               -211.56817  169.58855  -1.248 0.212614    
## Promotion             1994.74885   52.34482  38.108  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 689.2 on 705 degrees of freedom
## Multiple R-squared:  0.8777, Adjusted R-squared:  0.8734 
## F-statistic: 202.5 on 25 and 705 DF,  p-value: < 2.2e-16
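
The three temperature terms are easier to interpret together than one at a time. The sketch below evaluates the fitted cubic using the coefficients copied from the output above and locates its turning points. The helper name `temp_effect` is illustrative, and the temperature units are whatever the `temp` column uses, which the output does not state.

```r
# Temperature coefficients copied from the regression output above.
b1 <- -338.94660   # temp
b2 <-   34.65984   # temp^2
b3 <-   -0.72245   # temp^3

# Partial contribution of temperature to predicted total riders
# (hypothetical helper; all other predictors held fixed).
temp_effect <- function(temp) b1 * temp + b2 * temp^2 + b3 * temp^3

# Turning points: roots of the derivative b1 + 2*b2*t + 3*b3*t^2.
# polyroot() takes coefficients in increasing order of power.
turning_points <- sort(Re(polyroot(c(b1, 2 * b2, 3 * b3))))

# The second derivative is negative at the local maximum.
second_deriv <- 2 * b2 + 6 * b3 * turning_points
peak_temp <- turning_points[second_deriv < 0]

print(turning_points)
print(peak_temp)
```

Under these coefficients the temperature contribution peaks at a `temp` value of about 26 and falls off on either side, which is the curved shape the third-degree polynomial was added to capture.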

Question 2

Regarding your model from Q1, explain any problems you encountered with the assumptions of multicollinearity, linearity or homoscedasticity in this regression and how you solved them.

Multicollinearity occurs when two or more explanatory variables in a multiple regression model are highly linearly related. Month and temperature were highly correlated. I tried adding mnth + I(mnth*mnth), which improved the model slightly, but I realized that the temperature polynomial already captures that parabolic pattern. I also left out working day because it was similar to weekday, and holiday because it was highly correlated with month. For linearity, I plotted each variable against total riders to confirm that a linear relationship was a reasonable assumption. In those plots I did not see heteroscedasticity (non-constant variance), so I did not take the log of the dependent variable.
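
One way to put a number on multicollinearity is the variance inflation factor (VIF). The sketch below computes it with base R on synthetic data; it is not the analysis that was run for this report. The helper `vif_manual`, the data frame `demo`, and its columns are all made up for illustration, with temperature constructed to follow the season so that the two variables are correlated by design.

```r
# Variance inflation factor for each predictor, using base R only:
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Synthetic illustration only (NOT the bikeshare data): temperature
# follows a seasonal curve over the months, windspeed is unrelated.
set.seed(1)
month_num <- rep(1:12, each = 20)
demo <- data.frame(
  month_dist = abs(month_num - 7),   # distance in months from July
  temp       = 25 - 3 * abs(month_num - 7) + rnorm(240, sd = 2),
  windspeed  = rnorm(240, mean = 12, sd = 4)
)

vifs <- vif_manual(demo)
print(round(vifs, 2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. In this synthetic example the seasonal variable and temperature inflate each other, while windspeed stays close to 1.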

Question 3

Your model from Q1 should include some means of assessing the impact the month of the year has on total ridership. Using your regression output, which month has the highest number of riders, holding everything else constant? If this month became unseasonably cold and rainy, would it change the coefficient on this month in any way?

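May (month5) has the highest number of riders, holding everything else constant: its coefficient, 486.85, is the largest among the statistically significant months. If May became unseasonably cold and rainy, the coefficient would not change, because the weather is captured by other variables in the model (temperature, weathersit, humidity); the predicted number of total riders would decrease through those variables instead.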
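
The distinction between the coefficient and the prediction can be illustrated with the fitted values from Q1. In the sketch below, the May coefficient is the same in both scenarios and only the weather inputs change. The input values (a `temp` of 20 versus 10, and `weathersit` of 1 versus 3) are hypothetical, and the sketch assumes that higher `weathersit` codes mean worse weather; humidity is left out for simplicity.

```r
# Coefficients copied from the Q1 regression output.
coefs <- c(temp = -338.94660, temp2 = 34.65984, temp3 = -0.72245,
           weathersit = -611.50152, month5 = 486.84951)

# Predicted contribution of the weather-related terms plus the May dummy.
# Other predictors are held fixed, so they cancel in the comparison.
partial_pred <- function(temp, weathersit) {
  unname(coefs["temp"] * temp + coefs["temp2"] * temp^2 +
         coefs["temp3"] * temp^3 + coefs["weathersit"] * weathersit +
         coefs["month5"])
}

# Hypothetical inputs: a typical May day vs. a cold, rainy May day.
typical    <- partial_pred(temp = 20, weathersit = 1)
cold_rainy <- partial_pred(temp = 10, weathersit = 3)

change <- cold_rainy - typical
print(change)
```

The `month5` term contributes the same 486.85 to both predictions and cancels out of `change`; the whole drop comes from the temperature and `weathersit` terms.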

The graph below shows Month vs. Total Riders.

Question 4

Interpret (in simple terms) the coefficient on your “promotion” variable and make an initial judgement on the claims of the marketing department based on your analysis.

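In simple terms, the promotion coefficient means that, holding the other variables constant, total riders increase by about 1,995 when there is a promotion. The effect is highly statistically significant (p < 2e-16), so as an initial judgement I would agree that the promotion was a success.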
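
To show how precisely the promotion effect is estimated, a 95% confidence interval can be worked out from the estimate, standard error, and residual degrees of freedom reported in Q1. This is a minimal sketch using the usual t-based interval for a regression coefficient; the variable names are illustrative.

```r
# Values copied from the Promotion row of the Q1 output.
estimate  <- 1994.74885
std_error <- 52.34482
resid_df  <- 705   # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = resid_df)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(round(ci, 1))
```

The interval runs from roughly 1,890 to 2,100 additional riders, well away from zero, which is consistent with the very small p-value.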

Question 5

You suspect the promotion might have influenced casual riders differently than the registered riders. Perform some type of analysis that allows you to assess if the program had a more substantial impact on the casual riders or the registered riders. What is your conclusion, and why? Include any data or screenshots to back up your claim.

My conclusion is that the promotion had more of an impact on registered riders. The boxplots below show the total riders for casual and registered riders with and without a promotion. Registered riders see a larger increase in the mean than casual riders.
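
A numeric companion to the boxplots is to tabulate mean riders with and without the promotion for each rider type. The sketch below is one way this could be done; the helper `promo_change` is hypothetical, the column names `casual`, `registered`, and `Promotion` are assumed, and the `toy` rows are made up and are not the bikeshare data.

```r
# Mean riders by promotion status, with absolute and percent change.
# Column names (casual, registered, Promotion) are assumed.
promo_change <- function(data, rider_col) {
  means <- tapply(data[[rider_col]], data$Promotion, mean)
  c(no_promo   = means[["0"]],
    promo      = means[["1"]],
    abs_change = means[["1"]] - means[["0"]],
    pct_change = 100 * (means[["1"]] - means[["0"]]) / means[["0"]])
}

# Made-up toy rows for illustration only (NOT the bikeshare data).
toy <- data.frame(
  Promotion  = c(0, 0, 0, 1, 1, 1),
  casual     = c(400, 500, 600, 700, 800, 900),
  registered = c(2500, 3000, 3500, 4000, 4500, 5000)
)

result <- sapply(c("casual", "registered"),
                 function(col) promo_change(toy, col))
print(round(result, 1))
```

Reporting both the absolute and the percent change is a deliberate choice: the two groups start from very different baselines, so it helps to state which measure a conclusion about "more substantial" impact is based on.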

Question 6

With your analysis from questions 4 and 5 now in-hand, you are prepared to report on the promotion’s influence on ridership. However, you lack some information required to make a meaningful report on whether the promotion was a financial success or a failure. What additional information (from a business perspective) do you need to accurately make such a conclusion?

I would need total sales on promotional and non-promotional days for the registered and casual riders. We know that registered riders get more of a discount than casual riders. Given the increase in registered riders on promotional days, the promotion could still turn out to be financially beneficial even with the discounted pricing.