GLM Assignment 2
Question 1
You want to build the best regression model possible for the dependent variable, total riders. Begin with the example from class where we fit total riders as a function of temperature using a third-degree polynomial. Add as many additional variables to your model as feasible to improve fit. Remember, your goal is to build the best fitting regression model explaining total ridership, using the tools we have covered regarding the linearity and multicollinearity assumptions.
Specify your generalized regression equation, provide the regression output (with coefficients, standard errors, etc.), and summarize your work.
##
## Call:
## lm(formula = total_riders ~ temp + I(temp * temp) + I(temp *
## temp * temp) + as.factor(mnth) + as.factor(Promotion) + as.factor(weekday),
## data = bikeshare)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5740.3  -351.9   200.8   534.8  2020.0
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           1637.5482   569.8485   2.874 0.004179 **
## temp                  -344.9528   105.8137  -3.260 0.001167 **
## I(temp * temp)          32.6561     5.7340   5.695 1.81e-08 ***
## I(temp * temp * temp)   -0.6524     0.0956  -6.825 1.89e-11 ***
## as.factor(mnth)2       153.4621   180.6001   0.850 0.395760
## as.factor(mnth)3       591.7590   195.4559   3.028 0.002555 **
## as.factor(mnth)4       836.2914   215.9554   3.873 0.000118 ***
## as.factor(mnth)5      1034.2289   245.9167   4.206 2.94e-05 ***
## as.factor(mnth)6      1396.3066   272.8509   5.117 3.99e-07 ***
## as.factor(mnth)7      1525.5024   303.6465   5.024 6.41e-07 ***
## as.factor(mnth)8      1274.6463   281.3667   4.530 6.91e-06 ***
## as.factor(mnth)9      1348.9092   253.8449   5.314 1.44e-07 ***
## as.factor(mnth)10     1435.8801   218.3471   6.576 9.37e-11 ***
## as.factor(mnth)11     1319.7979   197.7406   6.674 5.01e-11 ***
## as.factor(mnth)12      795.0041   185.9374   4.276 2.17e-05 ***
## as.factor(Promotion)1 2104.3227    69.8239  30.138  < 2e-16 ***
## as.factor(weekday)1     10.3400   129.1038   0.080 0.936188
## as.factor(weekday)2    171.9128   129.4487   1.328 0.184593
## as.factor(weekday)3    209.5634   129.4818   1.618 0.106004
## as.factor(weekday)4    349.8536   129.5265   2.701 0.007078 **
## as.factor(weekday)5    439.8778   129.2835   3.402 0.000705 ***
## as.factor(weekday)6    362.2104   128.8048   2.812 0.005058 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 932.9 on 709 degrees of freedom
## Multiple R-squared: 0.7747, Adjusted R-squared: 0.7681
## F-statistic: 116.1 on 21 and 709 DF, p-value: < 2.2e-16
Regression Equation:
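total_riders = 1637.55 - 344.95(temp) + 32.66(temp)^2 - 0.65(temp)^3 + 592(mnth3) + 836(mnth4) + 1034(mnth5) + 1396(mnth6) + 1526(mnth7) + 1275(mnth8) + 1349(mnth9) + 1436(mnth10) + 1320(mnth11) + 795(mnth12) + 2104(Promotion1) + 350(weekday4) + 440(weekday5) + 362(weekday6)
Terms that are not significant at the 5% level (mnth2, weekday1, weekday2, weekday3) are left out of the equation. January (mnth1) and weekday 0 are the reference categories.
R-squared: 0.7747 (adjusted R-squared: 0.7681)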
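As a rough illustration of how the fitted equation is read, the sketch below evaluates it for a single day using the coefficients printed in the output above. The helper name `predict_riders` and the example inputs are hypothetical, `temp` is assumed to be on whatever scale the dataset uses, and, unlike the written equation, the sketch keeps every estimated coefficient, including the non-significant ones.

```r
# Coefficients copied from the total_riders regression output above.
# January (mnth 1) and weekday 0 are the reference levels, so their effect is 0.
b_intercept <- 1637.5482
b_temp      <- c(-344.9528, 32.6561, -0.6524)   # temp, temp^2, temp^3
b_mnth      <- c(0, 153.4621, 591.7590, 836.2914, 1034.2289, 1396.3066,
                 1525.5024, 1274.6463, 1348.9092, 1435.8801, 1319.7979, 795.0041)
b_promotion <- 2104.3227
b_weekday   <- c(0, 10.3400, 171.9128, 209.5634, 349.8536, 439.8778, 362.2104)

# Hypothetical helper: predicted total riders for one day.
predict_riders <- function(temp, mnth, promotion, weekday) {
  b_intercept +
    sum(b_temp * temp^(1:3)) +   # cubic polynomial in temperature
    b_mnth[mnth] +               # mnth is coded 1-12
    b_promotion * promotion +    # promotion is 0 or 1
    b_weekday[weekday + 1]       # weekday is coded 0-6
}

# Example inputs are illustrative only.
example_day <- predict_riders(temp = 20, mnth = 7, promotion = 1, weekday = 5)
print(round(example_day))
```

Calling the helper twice with `promotion = 1` and `promotion = 0`, all else equal, returns values that differ by exactly the promotion coefficient, which is what "holding everything else constant" means here.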
Question 2
Regarding your model from Q1, explain any problems you encountered with the assumptions of multicollinearity, linearity or homoscedasticity in this regression and how you solved them.
## season Promotion mnth holiday weekday
## season 1.000000000 -0.001844343 0.831440114 -0.010536659 -0.0030798813
## Promotion -0.001844343 1.000000000 -0.001792434 0.007954311 -0.0054607652
## mnth 0.831440114 -0.001792434 1.000000000 0.019190895 0.0095093129
## holiday -0.010536659 0.007954311 0.019190895 1.000000000 -0.1019602689
## weekday -0.003079881 -0.005460765 0.009509313 -0.101960269 1.0000000000
## workingday 0.012484963 -0.002012621 -0.005900951 -0.253022700 0.0357896736
## weathersit 0.019211028 -0.048726541 0.043528098 -0.034626841 0.0310874694
## temp 0.334314856 0.047603572 0.220205335 -0.028555535 -0.0001699624
## humidity 0.205444765 -0.110651045 0.222203691 -0.015937479 -0.0522321004
## windspeed -0.229046337 -0.011817060 -0.207501752 0.006291507 0.0142821241
## casual 0.210399165 0.248545664 0.123005889 0.054274203 0.0599226375
## registered 0.411623051 0.594248168 0.293487830 -0.108744863 0.0573674440
## total_riders 0.406100371 0.566709708 0.279977112 -0.068347716 0.0674434124
## workingday weathersit temp humidity windspeed
## season 0.012484963 0.01921103 0.3343148564 0.20544476 -0.229046337
## Promotion -0.002012621 -0.04872654 0.0476035719 -0.11065104 -0.011817060
## mnth -0.005900951 0.04352810 0.2202053352 0.22220369 -0.207501752
## holiday -0.253022700 -0.03462684 -0.0285555350 -0.01593748 0.006291507
## weekday 0.035789674 0.03108747 -0.0001699624 -0.05223210 0.014282124
## workingday 1.000000000 0.06120043 0.0526598102 0.02432705 -0.018796487
## weathersit 0.061200430 1.00000000 -0.1206022365 0.59104460 0.039511059
## temp 0.052659810 -0.12060224 1.0000000000 0.12696294 -0.157944120
## humidity 0.024327046 0.59104460 0.1269629390 1.00000000 -0.248489099
## windspeed -0.018796487 0.03951106 -0.1579441204 -0.24848910 1.000000000
## casual -0.518044191 -0.24735300 0.5432846617 -0.07700788 -0.167613349
## registered 0.303907117 -0.26038771 0.5400119662 -0.09108860 -0.217448981
## total_riders 0.061156063 -0.29739124 0.6274940090 -0.10065856 -0.234544997
## casual registered total_riders
## season 0.21039916 0.41162305 0.40610037
## Promotion 0.24854566 0.59424817 0.56670971
## mnth 0.12300589 0.29348783 0.27997711
## holiday 0.05427420 -0.10874486 -0.06834772
## weekday 0.05992264 0.05736744 0.06744341
## workingday -0.51804419 0.30390712 0.06115606
## weathersit -0.24735300 -0.26038771 -0.29739124
## temp 0.54328466 0.54001197 0.62749401
## humidity -0.07700788 -0.09108860 -0.10065856
## windspeed -0.16761335 -0.21744898 -0.23454500
## casual 1.00000000 0.39528245 0.67280443
## registered 0.39528245 1.00000000 0.94551692
## total_riders 0.67280443 0.94551692 1.00000000
##
## Call:
## lm(formula = total_riders ~ temp + I(temp * temp) + I(temp *
## temp * temp), data = bikeshare)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4724.0 -1034.4   -99.6  1130.1  3160.1
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)           518.9929   775.3459   0.669 0.503472
## temp                   63.1408   134.5298   0.469 0.638964
## I(temp * temp)         16.6342     7.2173   2.305 0.021461 *
## I(temp * temp * temp)  -0.4324     0.1208  -3.580 0.000366 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1423 on 727 degrees of freedom
## Multiple R-squared: 0.4627, Adjusted R-squared: 0.4604
## F-statistic: 208.6 on 3 and 727 DF, p-value: < 2.2e-16
The correlation matrix shows two pairs of variables with a correlation above 0.7: month and season (0.83), and registered riders and total riders (0.95). Season is excluded from the regression model because it is highly correlated with month. Casual and registered riders are components of total riders, so they are also excluded from the model. Linearity in temperature is handled with the third-degree polynomial from class, shown on its own above (R-squared 0.4627).
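The screening step described above can be sketched in a few lines of base R. The helper name `high_cor_pairs` is hypothetical, and the data frame is a small synthetic stand-in (not the bikeshare data) in which season is derived directly from month, so the flagged correlation is higher than the 0.83 reported in the matrix.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# exceeds a threshold (0.7 is the cutoff used in the write-up).
high_cor_pairs <- function(df, threshold = 0.7) {
  cm  <- cor(df)
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(var1 = rownames(cm)[idx[, "row"]],
             var2 = colnames(cm)[idx[, "col"]],
             r    = cm[idx],
             stringsAsFactors = FALSE)
}

# Synthetic stand-in for the real data (NOT the bikeshare file).
set.seed(1)
demo <- data.frame(mnth = rep(1:12, each = 10))
demo$season <- ceiling(demo$mnth / 3)   # assumed coding: season follows month
demo$noise  <- rnorm(nrow(demo))        # an unrelated variable for contrast

flagged <- high_cor_pairs(demo)
print(flagged)
```

On the real data the same call would be made on the numeric columns of `bikeshare`. Any pair it flags is a candidate for dropping one of the two variables, as was done with season.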
Question 3
Your model from Q1 should include some means of assessing the impact the month of the year has on total ridership. Using your regression output, which month has the highest number of riders, holding everything else constant? If this month became unseasonably cold and rainy, would it change the coefficient on this month in any way?
According to the regression output, July (mnth7) has the highest month coefficient (about 1,526 more riders than January, the reference month) and therefore the highest number of riders, holding everything else constant. If it became unseasonably cold and rainy, the number of riders would decrease.
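The month comparison can be checked directly from the printed coefficients. This is a minimal sketch: the vector below is typed in from the total_riders output, with January set to 0 because it is the reference level, and the variable names are illustrative.

```r
# Month coefficients from the total_riders output; January is the baseline.
mnth_coef <- c(mnth1 = 0,          mnth2 = 153.4621,  mnth3 = 591.7590,
               mnth4 = 836.2914,   mnth5 = 1034.2289, mnth6 = 1396.3066,
               mnth7 = 1525.5024,  mnth8 = 1274.6463, mnth9 = 1348.9092,
               mnth10 = 1435.8801, mnth11 = 1319.7979, mnth12 = 795.0041)

# Month with the largest estimated effect, all else held constant.
top_month <- names(which.max(mnth_coef))
print(top_month)

# Full ranking, largest first.
print(sort(mnth_coef, decreasing = TRUE))
```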
Question 4
Interpret (in simple terms) the coefficient on your “promotion” variable and make an initial judgement on the claims of the marketing department based on your analysis.
“The marketing department has emphatically declared the promotion to be a huge success.” Holding temperature, month, and weekday constant, promotional days had about 2,104 more riders than non-promotional days (p < 2e-16). On this initial evidence, the marketing department’s claim appears valid.
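To give a sense of how precisely the promotion effect is estimated, the sketch below builds an approximate 95% confidence interval from the estimate, standard error, and residual degrees of freedom printed in the output. It relies on the usual t-based interval for a regression coefficient; with the fitted model object available, `confint()` would be the more direct route.

```r
# Values copied from the total_riders regression output.
est      <- 2104.3227   # coefficient on as.factor(Promotion)1
se       <- 69.8239     # its standard error
df_resid <- 709         # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df_resid)
ci     <- est + c(-1, 1) * t_crit * se
print(round(ci, 1))
```

The whole interval lies far above zero, which is consistent with the very small p-value on the promotion coefficient.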
Question 5
You suspect the promotion might have influenced casual riders differently than the registered riders. Perform some type of analysis that allows you to assess if the program had a more substantial impact on the casual riders or the registered riders. What is your conclusion, and why? Include any data or screenshots to back up your claim.
##
## Call:
## lm(formula = casual ~ temp + I(temp * temp) + I(temp * temp *
## temp) + as.factor(mnth) + as.factor(Promotion) + as.factor(weekday),
## data = bikeshare)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1478.32  -202.08   -21.51   174.38  2086.17
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            979.74873  231.77675   4.227 2.68e-05 ***
## temp                  -147.87346   43.03804  -3.436 0.000625 ***
## I(temp * temp)          12.01964    2.33221   5.154 3.31e-07 ***
## I(temp * temp * temp)   -0.22265    0.03888  -5.726 1.52e-08 ***
## as.factor(mnth)2        14.31782   73.45619   0.195 0.845514
## as.factor(mnth)3       282.65001   79.49856   3.555 0.000402 ***
## as.factor(mnth)4       366.56291   87.83639   4.173 3.38e-05 ***
## as.factor(mnth)5       329.10174  100.02268   3.290 0.001050 **
## as.factor(mnth)6       260.67970  110.97774   2.349 0.019101 *
## as.factor(mnth)7       321.68547  123.50334   2.605 0.009389 **
## as.factor(mnth)8       191.45415  114.44140   1.673 0.094779 .
## as.factor(mnth)9       218.44682  103.24735   2.116 0.034713 *
## as.factor(mnth)10      285.73139   88.80919   3.217 0.001353 **
## as.factor(mnth)11      247.49206   80.42782   3.077 0.002170 **
## as.factor(mnth)12       48.25551   75.62705   0.638 0.523633
## as.factor(Promotion)1  313.92391   28.39977  11.054  < 2e-16 ***
## as.factor(weekday)1   -697.98225   52.51090 -13.292  < 2e-16 ***
## as.factor(weekday)2   -823.27293   52.65118 -15.636  < 2e-16 ***
## as.factor(weekday)3   -826.65100   52.66463 -15.697  < 2e-16 ***
## as.factor(weekday)4   -781.76166   52.68284 -14.839  < 2e-16 ***
## as.factor(weekday)5   -594.17262   52.58401 -11.299  < 2e-16 ***
## as.factor(weekday)6    138.04292   52.38930   2.635 0.008599 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 379.5 on 709 degrees of freedom
## Multiple R-squared: 0.7034, Adjusted R-squared: 0.6946
## F-statistic: 80.06 on 21 and 709 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = registered ~ temp + I(temp * temp) + I(temp * temp *
## temp) + as.factor(mnth) + as.factor(Promotion) + as.factor(weekday),
## data = bikeshare)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4924.0  -288.6   157.5   462.3  1410.5
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            657.79949  466.83592   1.409 0.159256
## temp                  -197.07937   86.68558  -2.273 0.023295 *
## I(temp * temp)          20.63646    4.69745   4.393 1.29e-05 ***
## I(temp * temp * temp)   -0.42980    0.07832  -5.488 5.67e-08 ***
## as.factor(mnth)2       139.14423  147.95266   0.940 0.347300
## as.factor(mnth)3       309.10897  160.12298   1.930 0.053950 .
## as.factor(mnth)4       469.72850  176.91672   2.655 0.008107 **
## as.factor(mnth)5       705.12714  201.46189   3.500 0.000494 ***
## as.factor(mnth)6      1135.62691  223.52715   5.080 4.82e-07 ***
## as.factor(mnth)7      1203.81696  248.75573   4.839 1.60e-06 ***
## as.factor(mnth)8      1083.19211  230.50352   4.699 3.14e-06 ***
## as.factor(mnth)9      1130.46241  207.95688   5.436 7.50e-08 ***
## as.factor(mnth)10     1150.14875  178.87609   6.430 2.35e-10 ***
## as.factor(mnth)11     1072.30582  161.99465   6.619 7.12e-11 ***
## as.factor(mnth)12      746.74854  152.32513   4.902 1.17e-06 ***
## as.factor(Promotion)1 1790.39885   57.20174  31.300  < 2e-16 ***
## as.factor(weekday)1    708.32228  105.76547   6.697 4.33e-11 ***
## as.factor(weekday)2    995.18575  106.04801   9.384  < 2e-16 ***
## as.factor(weekday)3   1036.21440  106.07510   9.769  < 2e-16 ***
## as.factor(weekday)4   1131.61529  106.11178  10.664  < 2e-16 ***
## as.factor(weekday)5   1034.05045  105.91272   9.763  < 2e-16 ***
## as.factor(weekday)6    224.16748  105.52054   2.124 0.033982 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 764.3 on 709 degrees of freedom
## Multiple R-squared: 0.767, Adjusted R-squared: 0.76
## F-statistic: 111.1 on 21 and 709 DF, p-value: < 2.2e-16
The promotion had a more substantial impact on registered riders than on casual riders. On promotional days, casual riders increased by about 314, whereas registered riders increased by about 1,790, holding the other variables constant. The promotion therefore influenced the two groups differently.
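A quick arithmetic check ties the two regressions back to the Q1 model. The sketch assumes that total riders equal casual plus registered riders, in which case the two promotion coefficients should add up to the total_riders coefficient; the printed values are consistent with that. It also shows what share of the estimated increase comes from registered riders. Variable names are illustrative.

```r
# Promotion coefficients copied from the three regression outputs.
promo_casual     <- 313.92391
promo_registered <- 1790.39885
promo_total      <- 2104.3227

# If total = casual + registered, the coefficients should add up
# (up to rounding in the printed output).
gap <- promo_casual + promo_registered - promo_total
print(abs(gap) < 0.01)

# Share of the estimated promotion effect attributable to each group.
share_registered <- promo_registered / promo_total
share_casual     <- promo_casual / promo_total
print(round(c(registered = share_registered, casual = share_casual), 3))
```

These are absolute increases. Comparing the groups in percentage terms would also require each group's average ridership on non-promotional days, which the regression output does not report.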
Question 6
With your analysis from questions 4 and 5 now in-hand, you are prepared to report on the promotion’s influence on ridership. However, you lack some information required to make a meaningful report on whether the promotion was a financial success or a failure. What additional information (from a business perspective) do you need to accurately make such a conclusion?
I would need to compare the revenue earned on promotional and non-promotional days for both groups. Ridership may have increased on promotional days while the revenue earned was lower than on non-promotional days.