GLM Assignment 2
Katie Nowicki
2020-11-18
Bikeshare Data
Question 1
You want to build the best regression model possible for the dependent variable, total riders. Begin with the example from class where we fit total riders as a function of temperature using a third-degree polynomial. Add as many additional variables to your model as feasible to improve fit. Remember, your goal is to build the best fitting regression model explaining total ridership, using the tools we have covered regarding the linearity and multicollinearity assumptions.
Specify your generalized regression equation, include the regression output (with coefficients, standard errors, etc.), and provide a summary of your work.
The best-fitting regression equation for total riders is:
Total Riders = 3644.94 - 338.95(temp) + 34.66(temp^2) - 0.72(temp^3) + 540.68(season) + weekday effect (factor; days 2-6 are significant, see output below) - 611.50(weathersit) - 19.47(humidity) - 52.67(windspeed) + month effect (factor; months 3, 5, 6, and 11 are significant, see output below) + 1994.75(promotion)
The multiple R^2 is 87.77% (adjusted R^2 = 87.34%), and the overall F-test is significant (p-value < 2.2e-16).
##
## Call:
## lm(formula = total ~ temp + I(temp * temp) + I(temp * temp *
##     temp) + day + season + weathersit + humidity + windspeed +
##     month + Promotion, data = bikeshare)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3818.3  -335.9    50.3   417.1  2441.5
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)           3644.93628  445.28728   8.186 1.27e-15 ***
## temp                  -338.94660   78.26837  -4.331 1.70e-05 ***
## I(temp * temp)          34.65984    4.24451   8.166 1.47e-15 ***
## I(temp * temp * temp)   -0.72245    0.07085 -10.197  < 2e-16 ***
## day1                    81.12511   95.50016   0.849 0.395905
## day2                   279.31134   95.97137   2.910 0.003724 **
## day3                   331.71820   96.01766   3.455 0.000584 ***
## day4                   369.81681   96.26663   3.842 0.000133 ***
## day5                   477.34206   95.96753   4.974 8.25e-07 ***
## day6                   443.56789   95.43177   4.648 4.00e-06 ***
## season                 540.67582   53.86374  10.038  < 2e-16 ***
## weathersit            -611.50152   62.73238  -9.748  < 2e-16 ***
## humidity               -19.47324    2.63654  -7.386 4.29e-13 ***
## windspeed              -52.66633    5.40726  -9.740  < 2e-16 ***
## month2                  61.67992  133.86831   0.461 0.645120
## month3                 366.66496  146.14056   2.509 0.012331 *
## month4                 247.05468  168.23641   1.468 0.142415
## month5                 486.84951  187.16874   2.601 0.009487 **
## month6                 476.84542  213.85786   2.230 0.026080 *
## month7                 361.22341  245.12285   1.474 0.141024
## month8                 148.11948  230.01652   0.644 0.519816
## month9                 183.07284  218.06218   0.840 0.401449
## month10               -154.73461  221.94557  -0.697 0.485924
## month11               -466.36323  213.10511  -2.188 0.028967 *
## month12               -211.56817  169.58855  -1.248 0.212614
## Promotion             1994.74885   52.34482  38.108  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 689.2 on 705 degrees of freedom
## Multiple R-squared:  0.8777, Adjusted R-squared:  0.8734
## F-statistic: 202.5 on 25 and 705 DF,  p-value: < 2.2e-16
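As a rough illustration of how the fitted equation is read, the sketch below plugs the reported coefficients into a small R function. The helper name `predict_riders`, the input values, and the baseline levels (day 0 and month 1, inferred only from their absence in the output) are assumptions made for illustration and are not part of the original analysis.

```r
# Coefficients copied from the regression output above.
b <- c(intercept = 3644.93628, temp = -338.94660, temp2 = 34.65984,
       temp3 = -0.72245, season = 540.67582, weathersit = -611.50152,
       humidity = -19.47324, windspeed = -52.66633, promotion = 1994.74885)

# Factor effects; the first entry is the omitted baseline level (assumed
# to be day 0 and month 1 because they do not appear in the output).
day_eff <- c(0, 81.12511, 279.31134, 331.71820, 369.81681, 477.34206,
             443.56789)
month_eff <- c(0, 61.67992, 366.66496, 247.05468, 486.84951, 476.84542,
               361.22341, 148.11948, 183.07284, -154.73461, -466.36323,
               -211.56817)

# Hypothetical helper: predicted total riders for a single day.
predict_riders <- function(temp, season, weathersit, humidity, windspeed,
                           day, month, promotion) {
  unname(b["intercept"] + b["temp"] * temp + b["temp2"] * temp^2 +
           b["temp3"] * temp^3 + b["season"] * season +
           b["weathersit"] * weathersit + b["humidity"] * humidity +
           b["windspeed"] * windspeed + day_eff[day + 1] +
           month_eff[month] + b["promotion"] * promotion)
}

# Illustrative inputs only (not taken from the data set).
no_promo <- predict_riders(temp = 20, season = 2, weathersit = 1,
                           humidity = 60, windspeed = 10, day = 3,
                           month = 5, promotion = 0)
with_promo <- predict_riders(temp = 20, season = 2, weathersit = 1,
                             humidity = 60, windspeed = 10, day = 3,
                             month = 5, promotion = 1)

# The two predictions differ only by the Promotion coefficient.
with_promo - no_promo
```

Because every other input is held fixed, the difference between the two predictions is the Promotion coefficient, which is what "holding everything else constant" means in the answers that follow.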
Question 2
Regarding your model from Q1, explain any problems you encountered with the assumptions of multicollinearity, linearity or homoscedasticity in this regression and how you solved them.
Multicollinearity occurs when two or more explanatory variables in a multiple regression model are highly linearly related. Month and temperature were highly correlated. I tried adding mnth + I(mnth*mnth), which improved the model slightly, but the temperature polynomial already captures that curved seasonal pattern. I also left out working day because it was similar to weekday, and I left out holiday because it was highly correlated with month. For linearity, I plotted each variable against total riders to check that a linear relationship made sense. In those plots I did not see heteroscedasticity, so I did not take the log of the dependent variable.
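The checks described above were done by inspecting plots. A numeric version could look like the following sketch, which uses simulated data rather than the bikeshare file. The variable names, the simulated seasonal pattern, and the helper `bp_pvalue` (a hand-rolled, studentized Breusch-Pagan style check) are all assumptions for illustration.

```r
set.seed(1)
n <- 500
month <- sample(1:12, n, replace = TRUE)
# Simulated temperature with a seasonal pattern, so month explains most of it.
temp <- 15 - 12 * cos(2 * pi * (month - 1) / 12) + rnorm(n, sd = 3)

# Multicollinearity: R^2 from regressing one predictor on the other, and
# the corresponding variance inflation factor, VIF = 1 / (1 - R^2).
r2_temp <- summary(lm(temp ~ factor(month)))$r.squared
vif_temp <- 1 / (1 - r2_temp)

# Heteroscedasticity: regress the squared residuals on the predictor and
# compare n * R^2 to a chi-square distribution with 1 degree of freedom.
bp_pvalue <- function(fit, x) {
  aux <- lm(residuals(fit)^2 ~ x)
  stat <- length(x) * summary(aux)$r.squared
  pchisq(stat, df = 1, lower.tail = FALSE)
}

# Simulated response whose noise grows with temperature.
y_het <- 1000 + 50 * temp + rnorm(n, sd = 20 + 10 * pmax(temp, 0))
p_het <- bp_pvalue(lm(y_het ~ temp), temp)

# Simulated response with constant noise, for comparison.
y_hom <- 1000 + 50 * temp + rnorm(n, sd = 100)
p_hom <- bp_pvalue(lm(y_hom ~ temp), temp)

c(r2_temp = r2_temp, vif_temp = vif_temp, p_het = p_het, p_hom = p_hom)
```

A large VIF signals that a predictor is mostly explained by the others, and a small p-value from the residual check suggests non-constant variance, in which case a log transform of the dependent variable is one common remedy.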
Question 3
Your model from Q1 should include some means of assessing the impact the month of the year has on total ridership. Using your regression output, which month has the highest number of riders, holding everything else constant? If this month became unseasonably cold and rainy, would it change the coefficient on this month in any way?
May has the highest number of riders, holding everything else constant: month5 has the largest coefficient (486.85) among the statistically significant months. If May became unseasonably cold and rainy, the coefficient would not change, because temperature and weather enter the model as separate variables; however, the predicted number of total riders would decrease.
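This reasoning can be written as a short worked equation. It assumes January (month 1) is the baseline level, since it does not appear in the output, and it uses x as shorthand for the values of all other predictors. The second line assumes that only temperature and the weather situation differ between the mild and the cold, rainy day.

```latex
\begin{align*}
\hat{y}_{\text{May}}(x) - \hat{y}_{\text{Jan}}(x)
  &= \hat{\beta}_{\text{month5}} = 486.85 \\
\hat{y}_{\text{May}}(x_{\text{cold}}) - \hat{y}_{\text{May}}(x_{\text{mild}})
  &= \hat{\beta}_{\text{temp}}\,\Delta(\text{temp})
   + \hat{\beta}_{\text{temp}^2}\,\Delta(\text{temp}^2)
   + \hat{\beta}_{\text{temp}^3}\,\Delta(\text{temp}^3)
   + \hat{\beta}_{\text{weathersit}}\,\Delta(\text{weathersit})
\end{align*}
```

The month coefficient cancels out of the second line, which is why bad weather lowers predicted ridership through the temperature and weather terms without touching the May coefficient.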
The graph below shows Month vs. Total Riders.
Question 4
Interpret (in simple terms) the coefficient on your “promotion” variable and make an initial judgement on the claims of the marketing department based on your analysis.
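In simple terms, the promotion coefficient means that a day with a promotion has about 1,995 more total riders than a day without one, holding the other variables constant. The coefficient is highly significant (p-value < 2e-16), so I would agree that the promotion was a success at increasing ridership.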
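To attach some uncertainty to that estimate, a 95% confidence interval can be computed from the estimate, standard error, and residual degrees of freedom reported in the Q1 output. This is a minimal sketch using the standard t-based interval; the variable names are chosen here for illustration.

```r
# Estimate and standard error for Promotion, copied from the Q1 output.
est <- 1994.74885
se <- 52.34482
df_resid <- 705  # residual degrees of freedom from the Q1 output

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci)
```

The interval runs from roughly 1,890 to 2,100 riders and stays far from zero, which is consistent with the very small p-value on the promotion coefficient.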
Question 5
You suspect the promotion might have influenced casual riders differently than the registered riders. Perform some type of analysis that allows you to assess if the program had a more substantial impact on the casual riders or the registered riders. What is your conclusion, and why? Include any data or screenshots to back up your claim.
My conclusion is that the promotion had more of an impact on registered riders. The boxplots below show the number of casual and registered riders with and without a promotion. The registered riders show a larger increase in typical ridership than the casual riders.
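A regression-based comparison is one way to back up the boxplots. The sketch below uses simulated data, not the bikeshare file: the column names `casual`, `registered`, and `promo`, and the effect sizes built into the simulation, are assumptions for illustration only. It fits the same model to each rider type and reports both the absolute lift (the promotion coefficient) and the lift relative to non-promotion days.

```r
set.seed(42)
n <- 400
promo <- rbinom(n, 1, 0.5)
temp <- runif(n, 0, 35)

# Simulated (hypothetical) rider counts; promotion effects are set by hand.
casual <- 600 + 20 * temp + 300 * promo + rnorm(n, sd = 150)
registered <- 2500 + 40 * temp + 1500 * promo + rnorm(n, sd = 400)
sim <- data.frame(promo, temp, casual, registered)

# Same specification for both rider types so the coefficients are comparable.
fit_casual <- lm(casual ~ temp + promo, data = sim)
fit_registered <- lm(registered ~ temp + promo, data = sim)

# Absolute lift: estimated extra riders on promotion days.
abs_lift <- c(casual = unname(coef(fit_casual)["promo"]),
              registered = unname(coef(fit_registered)["promo"]))

# Relative lift: absolute lift divided by mean ridership without a promotion.
base_mean <- c(casual = mean(sim$casual[sim$promo == 0]),
               registered = mean(sim$registered[sim$promo == 0]))
rel_lift <- abs_lift / base_mean

round(rbind(abs_lift, base_mean, rel_lift), 2)
```

Reporting both measures matters because registered riders usually start from a larger base, so a larger absolute increase does not by itself imply a larger proportional response.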
Question 6
With your analysis from questions 4 and 5 now in-hand, you are prepared to report on the promotion’s influence on ridership. However, you lack some information required to make a meaningful report on whether the promotion was a financial success or a failure. What additional information (from a business perspective) do you need to accurately make such a conclusion?
I would need total sales on promotional and non-promotional days for registered and casual riders. We know that registered riders get more of a discount than casual riders. Given the increase in registered riders on promotional days, the promotion could still turn out to be financially beneficial even with the discounted pricing.