GLM Assignment 2

Question 1

You want to build the best regression model possible for the dependent variable, total riders. Begin with the example from class where we fit total riders as a function of temperature using a third-degree polynomial. Add as many additional variables to your model as feasible to improve fit. Remember, your goal is to build the best fitting regression model explaining total ridership, using the tools we have covered regarding the linearity and multicollinearity assumptions.

Provide your general regression equation, the regression output (with coefficients, standard errors, etc.), and a summary of your work.

## 
## Call:
## lm(formula = total_riders ~ temp + I(temp * temp) + I(temp * 
##     temp * temp) + as.factor(mnth) + as.factor(Promotion) + as.factor(weekday), 
##     data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5740.3  -351.9   200.8   534.8  2020.0 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1637.5482   569.8485   2.874 0.004179 ** 
## temp                  -344.9528   105.8137  -3.260 0.001167 ** 
## I(temp * temp)          32.6561     5.7340   5.695 1.81e-08 ***
## I(temp * temp * temp)   -0.6524     0.0956  -6.825 1.89e-11 ***
## as.factor(mnth)2       153.4621   180.6001   0.850 0.395760    
## as.factor(mnth)3       591.7590   195.4559   3.028 0.002555 ** 
## as.factor(mnth)4       836.2914   215.9554   3.873 0.000118 ***
## as.factor(mnth)5      1034.2289   245.9167   4.206 2.94e-05 ***
## as.factor(mnth)6      1396.3066   272.8509   5.117 3.99e-07 ***
## as.factor(mnth)7      1525.5024   303.6465   5.024 6.41e-07 ***
## as.factor(mnth)8      1274.6463   281.3667   4.530 6.91e-06 ***
## as.factor(mnth)9      1348.9092   253.8449   5.314 1.44e-07 ***
## as.factor(mnth)10     1435.8801   218.3471   6.576 9.37e-11 ***
## as.factor(mnth)11     1319.7979   197.7406   6.674 5.01e-11 ***
## as.factor(mnth)12      795.0041   185.9374   4.276 2.17e-05 ***
## as.factor(Promotion)1 2104.3227    69.8239  30.138  < 2e-16 ***
## as.factor(weekday)1     10.3400   129.1038   0.080 0.936188    
## as.factor(weekday)2    171.9128   129.4487   1.328 0.184593    
## as.factor(weekday)3    209.5634   129.4818   1.618 0.106004    
## as.factor(weekday)4    349.8536   129.5265   2.701 0.007078 ** 
## as.factor(weekday)5    439.8778   129.2835   3.402 0.000705 ***
## as.factor(weekday)6    362.2104   128.8048   2.812 0.005058 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 932.9 on 709 degrees of freedom
## Multiple R-squared:  0.7747, Adjusted R-squared:  0.7681 
## F-statistic: 116.1 on 21 and 709 DF,  p-value: < 2.2e-16

Regression Equation:

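total_riders = 1637.55 - 344.95(temp) + 32.66(temp)^2 - 0.65(temp)^3 + 153(mnth2) + 592(mnth3) + 836(mnth4) + 1034(mnth5) + 1396(mnth6) + 1526(mnth7) + 1275(mnth8) + 1349(mnth9) + 1436(mnth10) + 1320(mnth11) + 795(mnth12) + 2104(Promotion1) + 10(weekday1) + 172(weekday2) + 210(weekday3) + 350(weekday4) + 440(weekday5) + 362(weekday6)

January (mnth1), weekday 0, and non-promotional days are the baseline categories. The mnth2, weekday1, weekday2, and weekday3 terms are not statistically significant at the 5% level, but they remain part of the fitted model and are therefore kept in the equation.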
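As a rough check on the equation, the sketch below evaluates it for one day and finds the temperature at which the cubic temperature curve peaks. The coefficients are copied from the regression output; the helper name `predict_riders` and the example day (temp = 20, July, weekday 5, no promotion) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the regression output above.
# Baselines (effect 0): mnth 1, weekday 0, Promotion 0.
b_temp <- c(intercept = 1637.5482, t1 = -344.9528, t2 = 32.6561, t3 = -0.6524)
month_effect <- c(0, 153.4621, 591.7590, 836.2914, 1034.2289, 1396.3066,
                  1525.5024, 1274.6463, 1348.9092, 1435.8801, 1319.7979, 795.0041)
weekday_effect <- c(0, 10.3400, 171.9128, 209.5634, 349.8536, 439.8778, 362.2104)
promo_effect <- 2104.3227

# Hypothetical helper: plug one day's values into the fitted equation.
predict_riders <- function(temp, mnth, weekday, promotion) {
  unname(b_temp["intercept"] + b_temp["t1"] * temp + b_temp["t2"] * temp^2 +
    b_temp["t3"] * temp^3 +
    month_effect[mnth] +            # mnth is coded 1..12
    weekday_effect[weekday + 1] +   # weekday is coded 0..6
    promo_effect * promotion)       # promotion is 0 or 1
}

# Hypothetical example day.
example_day <- predict_riders(temp = 20, mnth = 7, weekday = 5, promotion = 0)
print(example_day)

# Temperature where the cubic peaks: solve t1 + 2*t2*temp + 3*t3*temp^2 = 0.
# Because the cubic coefficient is negative, the larger root is the local maximum.
roots <- polyroot(unname(c(b_temp["t1"], 2 * b_temp["t2"], 3 * b_temp["t3"])))
peak_temp <- max(Re(roots))
print(peak_temp)
```

With the `bikeshare` data frame and the fitted model object available, `predict()` on the model would give the same prediction without copying coefficients by hand.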

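R-squared: 0.7747 (adjusted R-squared: 0.7681). The model explains about 77% of the variation in total riders, compared with about 46% for the temperature-only cubic model.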
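To judge whether the added month, promotion, and weekday terms improve fit beyond what chance would give, one option is a partial F-test against the temperature-only cubic model. The sketch below reconstructs the test from the residual standard errors and degrees of freedom printed in the two regression outputs in this document; because those inputs are rounded, the result is approximate.

```r
# Values copied from the two summary() outputs in this document.
rse_full <- 932.9; df_full <- 709   # full model (temp cubic + month + promotion + weekday)
rse_base <- 1423;  df_base <- 727   # temperature-only cubic model

# Residual sum of squares = (residual standard error)^2 * residual df.
rss_full <- rse_full^2 * df_full
rss_base <- rse_base^2 * df_base

extra_terms <- df_base - df_full    # 18 coefficients added to the base model

# Partial F statistic for the added terms.
f_stat <- ((rss_base - rss_full) / extra_terms) / (rss_full / df_full)
p_value <- pf(f_stat, extra_terms, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

If both fitted model objects are available, `anova()` called on the smaller and larger model performs the same comparison directly.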

Question 2

Regarding your model from Q1, explain any problems you encountered with the assumptions of multicollinearity, linearity or homoscedasticity in this regression and how you solved them.

##                    season    Promotion         mnth      holiday       weekday
## season        1.000000000 -0.001844343  0.831440114 -0.010536659 -0.0030798813
## Promotion    -0.001844343  1.000000000 -0.001792434  0.007954311 -0.0054607652
## mnth          0.831440114 -0.001792434  1.000000000  0.019190895  0.0095093129
## holiday      -0.010536659  0.007954311  0.019190895  1.000000000 -0.1019602689
## weekday      -0.003079881 -0.005460765  0.009509313 -0.101960269  1.0000000000
## workingday    0.012484963 -0.002012621 -0.005900951 -0.253022700  0.0357896736
## weathersit    0.019211028 -0.048726541  0.043528098 -0.034626841  0.0310874694
## temp          0.334314856  0.047603572  0.220205335 -0.028555535 -0.0001699624
## humidity      0.205444765 -0.110651045  0.222203691 -0.015937479 -0.0522321004
## windspeed    -0.229046337 -0.011817060 -0.207501752  0.006291507  0.0142821241
## casual        0.210399165  0.248545664  0.123005889  0.054274203  0.0599226375
## registered    0.411623051  0.594248168  0.293487830 -0.108744863  0.0573674440
## total_riders  0.406100371  0.566709708  0.279977112 -0.068347716  0.0674434124
##                workingday  weathersit          temp    humidity    windspeed
## season        0.012484963  0.01921103  0.3343148564  0.20544476 -0.229046337
## Promotion    -0.002012621 -0.04872654  0.0476035719 -0.11065104 -0.011817060
## mnth         -0.005900951  0.04352810  0.2202053352  0.22220369 -0.207501752
## holiday      -0.253022700 -0.03462684 -0.0285555350 -0.01593748  0.006291507
## weekday       0.035789674  0.03108747 -0.0001699624 -0.05223210  0.014282124
## workingday    1.000000000  0.06120043  0.0526598102  0.02432705 -0.018796487
## weathersit    0.061200430  1.00000000 -0.1206022365  0.59104460  0.039511059
## temp          0.052659810 -0.12060224  1.0000000000  0.12696294 -0.157944120
## humidity      0.024327046  0.59104460  0.1269629390  1.00000000 -0.248489099
## windspeed    -0.018796487  0.03951106 -0.1579441204 -0.24848910  1.000000000
## casual       -0.518044191 -0.24735300  0.5432846617 -0.07700788 -0.167613349
## registered    0.303907117 -0.26038771  0.5400119662 -0.09108860 -0.217448981
## total_riders  0.061156063 -0.29739124  0.6274940090 -0.10065856 -0.234544997
##                   casual  registered total_riders
## season        0.21039916  0.41162305   0.40610037
## Promotion     0.24854566  0.59424817   0.56670971
## mnth          0.12300589  0.29348783   0.27997711
## holiday       0.05427420 -0.10874486  -0.06834772
## weekday       0.05992264  0.05736744   0.06744341
## workingday   -0.51804419  0.30390712   0.06115606
## weathersit   -0.24735300 -0.26038771  -0.29739124
## temp          0.54328466  0.54001197   0.62749401
## humidity     -0.07700788 -0.09108860  -0.10065856
## windspeed    -0.16761335 -0.21744898  -0.23454500
## casual        1.00000000  0.39528245   0.67280443
## registered    0.39528245  1.00000000   0.94551692
## total_riders  0.67280443  0.94551692   1.00000000

## 
## Call:
## lm(formula = total_riders ~ temp + I(temp * temp) + I(temp * 
##     temp * temp), data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4724.0 -1034.4   -99.6  1130.1  3160.1 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           518.9929   775.3459   0.669 0.503472    
## temp                   63.1408   134.5298   0.469 0.638964    
## I(temp * temp)         16.6342     7.2173   2.305 0.021461 *  
## I(temp * temp * temp)  -0.4324     0.1208  -3.580 0.000366 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1423 on 727 degrees of freedom
## Multiple R-squared:  0.4627, Adjusted R-squared:  0.4604 
## F-statistic: 208.6 on 3 and 727 DF,  p-value: < 2.2e-16

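The correlation matrix shows two pairs of variables with correlations above 0.70: month and season (0.83), and registered riders and total riders (0.95). Season is excluded from the regression model because it is highly correlated with month, and including both would introduce multicollinearity. Casual and registered riders are also excluded because together they make up total riders, the dependent variable. For linearity, temperature enters as a third-degree polynomial, following the example from class, and month, weekday, and Promotion enter as factors rather than as numeric values.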
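The screening step described above can be sketched as a small function that lists every pair of variables whose correlation exceeds a threshold. The function name `high_cor_pairs` is hypothetical, and the matrix below is only a subset of the printed correlation matrix, with values rounded to four decimals.

```r
# Subset of the printed correlation matrix (values rounded to 4 decimals).
vars <- c("season", "mnth", "temp", "registered", "total_riders")
cors <- matrix(c(
  1.0000, 0.8314, 0.3343, 0.4116, 0.4061,
  0.8314, 1.0000, 0.2202, 0.2935, 0.2800,
  0.3343, 0.2202, 1.0000, 0.5400, 0.6275,
  0.4116, 0.2935, 0.5400, 1.0000, 0.9455,
  0.4061, 0.2800, 0.6275, 0.9455, 1.0000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: report each pair with |r| above the threshold once.
high_cor_pairs <- function(cor_matrix, threshold = 0.7) {
  # Restrict to the upper triangle so (a, b) and (b, a) are not both listed.
  idx <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_matrix)[idx[, "row"]],
    var2 = colnames(cor_matrix)[idx[, "col"]],
    r = cor_matrix[idx],
    stringsAsFactors = FALSE
  )
}

flagged <- high_cor_pairs(cors)
print(flagged)
```

On the full data, the same function could be applied to the output of `cor()` on the numeric columns of `bikeshare`. The 0.7 cutoff is the rule of thumb used in this write-up, not a fixed standard.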

Question 3

Your model from Q1 should include some means of assessing the impact the month of the year has on total ridership. Using your regression output, which month has the highest number of riders, holding everything else constant? If this month became unseasonably cold and rainy, would it change the coefficient on this month in any way?

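According to the regression output, July (mnth7) has the largest month coefficient, 1525.50, so it has the highest number of riders relative to the January baseline, holding everything else constant. If July became unseasonably cold and rainy, the number of riders would decrease. The effect of the cold would be picked up by the temperature terms rather than by the month coefficient. Rain, however, is not in the model, so if the model were refit on data that included such a month, the July coefficient would likely fall.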
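The ranking of the months can be read off the output programmatically. The sketch below copies the month coefficients from the Q1 regression output and assumes that mnth 1 through 12 correspond to January through December, with January as the baseline at 0.

```r
# Month coefficients from the Q1 output; January is the baseline (0).
# Assumption: mnth 1..12 map to Jan..Dec.
month_coef <- setNames(
  c(0, 153.4621, 591.7590, 836.2914, 1034.2289, 1396.3066,
    1525.5024, 1274.6463, 1348.9092, 1435.8801, 1319.7979, 795.0041),
  month.abb
)

# Month with the largest coefficient, holding the other predictors constant.
top_month <- names(which.max(month_coef))
print(top_month)

# Three largest month effects, for context on how close the runners-up are.
ranked <- sort(month_coef, decreasing = TRUE)
print(head(ranked, 3))
```

The gap between the top two months is fairly small next to their standard errors in the output, so the ranking should be read as descriptive rather than as a tested difference.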

Question 4

Interpret (in simple terms) the coefficient on your “promotion” variable and make an initial judgement on the claims of the marketing department based on your analysis.

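“The marketing department has emphatically declared the promotion to be a huge success.” Holding temperature, month, and weekday constant, promotional days had about 2,104 more total riders on average than non-promotional days, and the coefficient is highly significant (p < 2e-16). As an initial judgement, this supports the marketing department's claim that the promotion increased ridership.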
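To show how precisely the promotion effect is estimated, a 95% confidence interval can be built from the coefficient and standard error in the Q1 output. This is a sketch using the printed values only, so it relies on the usual t-based interval for a linear model.

```r
# Promotion coefficient, standard error, and residual df from the Q1 output.
est <- 2104.3227
se <- 69.8239
df_resid <- 709

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se
names(ci) <- c("lower", "upper")
print(round(ci, 1))
```

With the fitted model object available, `confint()` returns the same kind of interval for every coefficient.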

Question 5

You suspect the promotion might have influenced casual riders differently than the registered riders. Perform some type of analysis that allows you to assess if the program had a more substantial impact on the casual riders or the registered riders. What is your conclusion, and why? Include any data or screenshots to back up your claim.

## 
## Call:
## lm(formula = casual ~ temp + I(temp * temp) + I(temp * temp * 
##     temp) + as.factor(mnth) + as.factor(Promotion) + as.factor(weekday), 
##     data = bikeshare)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1478.32  -202.08   -21.51   174.38  2086.17 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            979.74873  231.77675   4.227 2.68e-05 ***
## temp                  -147.87346   43.03804  -3.436 0.000625 ***
## I(temp * temp)          12.01964    2.33221   5.154 3.31e-07 ***
## I(temp * temp * temp)   -0.22265    0.03888  -5.726 1.52e-08 ***
## as.factor(mnth)2        14.31782   73.45619   0.195 0.845514    
## as.factor(mnth)3       282.65001   79.49856   3.555 0.000402 ***
## as.factor(mnth)4       366.56291   87.83639   4.173 3.38e-05 ***
## as.factor(mnth)5       329.10174  100.02268   3.290 0.001050 ** 
## as.factor(mnth)6       260.67970  110.97774   2.349 0.019101 *  
## as.factor(mnth)7       321.68547  123.50334   2.605 0.009389 ** 
## as.factor(mnth)8       191.45415  114.44140   1.673 0.094779 .  
## as.factor(mnth)9       218.44682  103.24735   2.116 0.034713 *  
## as.factor(mnth)10      285.73139   88.80919   3.217 0.001353 ** 
## as.factor(mnth)11      247.49206   80.42782   3.077 0.002170 ** 
## as.factor(mnth)12       48.25551   75.62705   0.638 0.523633    
## as.factor(Promotion)1  313.92391   28.39977  11.054  < 2e-16 ***
## as.factor(weekday)1   -697.98225   52.51090 -13.292  < 2e-16 ***
## as.factor(weekday)2   -823.27293   52.65118 -15.636  < 2e-16 ***
## as.factor(weekday)3   -826.65100   52.66463 -15.697  < 2e-16 ***
## as.factor(weekday)4   -781.76166   52.68284 -14.839  < 2e-16 ***
## as.factor(weekday)5   -594.17262   52.58401 -11.299  < 2e-16 ***
## as.factor(weekday)6    138.04292   52.38930   2.635 0.008599 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 379.5 on 709 degrees of freedom
## Multiple R-squared:  0.7034, Adjusted R-squared:  0.6946 
## F-statistic: 80.06 on 21 and 709 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = registered ~ temp + I(temp * temp) + I(temp * temp * 
##     temp) + as.factor(mnth) + as.factor(Promotion) + as.factor(weekday), 
##     data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4924.0  -288.6   157.5   462.3  1410.5 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            657.79949  466.83592   1.409 0.159256    
## temp                  -197.07937   86.68558  -2.273 0.023295 *  
## I(temp * temp)          20.63646    4.69745   4.393 1.29e-05 ***
## I(temp * temp * temp)   -0.42980    0.07832  -5.488 5.67e-08 ***
## as.factor(mnth)2       139.14423  147.95266   0.940 0.347300    
## as.factor(mnth)3       309.10897  160.12298   1.930 0.053950 .  
## as.factor(mnth)4       469.72850  176.91672   2.655 0.008107 ** 
## as.factor(mnth)5       705.12714  201.46189   3.500 0.000494 ***
## as.factor(mnth)6      1135.62691  223.52715   5.080 4.82e-07 ***
## as.factor(mnth)7      1203.81696  248.75573   4.839 1.60e-06 ***
## as.factor(mnth)8      1083.19211  230.50352   4.699 3.14e-06 ***
## as.factor(mnth)9      1130.46241  207.95688   5.436 7.50e-08 ***
## as.factor(mnth)10     1150.14875  178.87609   6.430 2.35e-10 ***
## as.factor(mnth)11     1072.30582  161.99465   6.619 7.12e-11 ***
## as.factor(mnth)12      746.74854  152.32513   4.902 1.17e-06 ***
## as.factor(Promotion)1 1790.39885   57.20174  31.300  < 2e-16 ***
## as.factor(weekday)1    708.32228  105.76547   6.697 4.33e-11 ***
## as.factor(weekday)2    995.18575  106.04801   9.384  < 2e-16 ***
## as.factor(weekday)3   1036.21440  106.07510   9.769  < 2e-16 ***
## as.factor(weekday)4   1131.61529  106.11178  10.664  < 2e-16 ***
## as.factor(weekday)5   1034.05045  105.91272   9.763  < 2e-16 ***
## as.factor(weekday)6    224.16748  105.52054   2.124 0.033982 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 764.3 on 709 degrees of freedom
## Multiple R-squared:  0.767,  Adjusted R-squared:   0.76 
## F-statistic: 111.1 on 21 and 709 DF,  p-value: < 2.2e-16

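The promotion had a more substantial impact, in absolute terms, on registered riders than on casual riders. Holding temperature, month, and weekday constant, casual riders increased by about 314 on promotional days, whereas registered riders increased by about 1,790. Both coefficients are highly significant (p < 2e-16), and registered riders account for most of the roughly 2,104 increase in total riders. The two groups are therefore influenced differently by the promotion.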
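The comparison can be made more concrete by splitting the total promotion effect into its two parts and putting an interval around each. The sketch below uses only the coefficients and standard errors printed in the two outputs above. It compares absolute increases; a comparison relative to each group's usual ridership would need the group means, which are not in the output.

```r
# Promotion coefficients and standard errors from the two outputs above.
promo <- data.frame(
  group = c("casual", "registered"),
  estimate = c(313.92391, 1790.39885),
  se = c(28.39977, 57.20174),
  stringsAsFactors = FALSE
)

# Share of the combined promotion effect attributable to each group.
promo$share <- promo$estimate / sum(promo$estimate)

# 95% t-based intervals (both models have 709 residual df).
t_crit <- qt(0.975, 709)
promo$lower <- promo$estimate - t_crit * promo$se
promo$upper <- promo$estimate + t_crit * promo$se

print(promo)

# The two effects add up to the total_riders promotion coefficient from Q1.
total_effect <- sum(promo$estimate)
print(total_effect)
```

The two intervals do not overlap, which is consistent with the conclusion that the promotion added more registered riders than casual riders per day.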

Question 6

With your analysis from questions 4 and 5 now in-hand, you are prepared to report on the promotion’s influence on ridership. However, you lack some information required to make a meaningful report on whether the promotion was a financial success or a failure. What additional information (from a business perspective) do you need to accurately make such a conclusion?

I would need to compare the revenue earned on promotional and non-promotional days for both groups. Ridership may have increased on promotional days while revenue was lower than on non-promotional days.