Below is my best regression model for the dependent variable total riders, which is the sum of casual and registered riders. The independent variables temp, weekday, promotion, weather situation, humidity, holiday, wind speed, and month were used to explain total ridership. I chose these variables because, while I am trying to understand what other riders do, I can relate to what would affect me as a rider. After deciding which variables were most applicable, I checked for collinear variables and inflated variances using the cor function and VIF function. I found some collinearity issues, which are discussed later, but the overall regression was a good fit. In the output below, most terms are significant at the .05 threshold; the exceptions are individual factor levels or polynomial terms (weekday 1, month 2, and the squared wind speed term). Finally, the model has a Multiple R-Squared of .8745 (87.45%) and an Adjusted R-Squared of .8693 (86.93%). After reviewing this summary, I feel comfortable moving forward with this model to represent total ridership.
##
## Call:
## lm(formula = (total_riders) ~ poly(temp, 3, raw = TRUE) + as.factor(weekday) +
## as.factor(Promotion) + as.factor(weathersit) + poly(humidity,
## 2, raw = TRUE) + as.factor(holiday) + poly(windspeed, 3,
## raw = TRUE) + as.factor(mnth), data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3102.40 -329.13 44.27 413.98 2239.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.002e+03 6.503e+02 3.078 0.002162 **
## poly(temp, 3, raw = TRUE)1 -3.105e+02 7.955e+01 -3.903 0.000104 ***
## poly(temp, 3, raw = TRUE)2 3.412e+01 4.319e+00 7.900 1.08e-14 ***
## poly(temp, 3, raw = TRUE)3 -7.173e-01 7.217e-02 -9.938 < 2e-16 ***
## as.factor(weekday)1 1.515e+02 9.999e+01 1.516 0.130080
## as.factor(weekday)2 2.896e+02 9.758e+01 2.968 0.003104 **
## as.factor(weekday)3 3.471e+02 9.793e+01 3.545 0.000419 ***
## as.factor(weekday)4 3.785e+02 9.791e+01 3.865 0.000121 ***
## as.factor(weekday)5 4.231e+02 9.788e+01 4.322 1.77e-05 ***
## as.factor(weekday)6 4.613e+02 9.716e+01 4.748 2.49e-06 ***
## as.factor(Promotion)1 1.954e+03 5.359e+01 36.464 < 2e-16 ***
## as.factor(weathersit)2 -3.715e+02 7.195e+01 -5.164 3.15e-07 ***
## as.factor(weathersit)3 -1.538e+03 2.034e+02 -7.561 1.25e-13 ***
## poly(humidity, 2, raw = TRUE)1 3.656e+01 1.358e+01 2.693 0.007249 **
## poly(humidity, 2, raw = TRUE)2 -4.797e-01 1.114e-01 -4.305 1.91e-05 ***
## as.factor(holiday)1 -6.186e+02 1.640e+02 -3.773 0.000175 ***
## poly(windspeed, 3, raw = TRUE)1 -1.368e+02 6.011e+01 -2.275 0.023183 *
## poly(windspeed, 3, raw = TRUE)2 7.902e+00 4.153e+00 1.903 0.057505 .
## poly(windspeed, 3, raw = TRUE)3 -2.045e-01 8.765e-02 -2.333 0.019927 *
## as.factor(mnth)2 4.211e+01 1.362e+02 0.309 0.757265
## as.factor(mnth)3 4.812e+02 1.483e+02 3.244 0.001234 **
## as.factor(mnth)4 6.413e+02 1.651e+02 3.883 0.000113 ***
## as.factor(mnth)5 7.920e+02 1.861e+02 4.255 2.37e-05 ***
## as.factor(mnth)6 8.881e+02 2.108e+02 4.213 2.85e-05 ***
## as.factor(mnth)7 1.143e+03 2.325e+02 4.918 1.09e-06 ***
## as.factor(mnth)8 8.923e+02 2.154e+02 4.142 3.86e-05 ***
## as.factor(mnth)9 1.230e+03 1.921e+02 6.404 2.77e-10 ***
## as.factor(mnth)10 1.319e+03 1.661e+02 7.940 8.01e-15 ***
## as.factor(mnth)11 1.082e+03 1.503e+02 7.198 1.58e-12 ***
## as.factor(mnth)12 7.713e+02 1.408e+02 5.476 6.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 700.3 on 701 degrees of freedom
## Multiple R-squared: 0.8745, Adjusted R-squared: 0.8693
## F-statistic: 168.4 on 29 and 701 DF, p-value: < 2.2e-16
## total_riders temp weekday Promotion
## total_riders 1.00000000 0.6274940090 0.0674434124 0.566709708
## temp 0.62749401 1.0000000000 -0.0001699624 0.047603572
## weekday 0.06744341 -0.0001699624 1.0000000000 -0.005460765
## Promotion 0.56670971 0.0476035719 -0.0054607652 1.000000000
## weathersit -0.29739124 -0.1206022365 0.0310874694 -0.048726541
## humidity -0.10065856 0.1269629390 -0.0522321004 -0.110651045
## holiday -0.06834772 -0.0285555350 -0.1019602689 0.007954311
## windspeed -0.23454500 -0.1579441204 0.0142821241 -0.011817060
## mnth 0.27997711 0.2202053352 0.0095093129 -0.001792434
## weathersit humidity holiday windspeed
## total_riders -0.29739124 -0.10065856 -0.068347716 -0.234544997
## temp -0.12060224 0.12696294 -0.028555535 -0.157944120
## weekday 0.03108747 -0.05223210 -0.101960269 0.014282124
## Promotion -0.04872654 -0.11065104 0.007954311 -0.011817060
## weathersit 1.00000000 0.59104460 -0.034626841 0.039511059
## humidity 0.59104460 1.00000000 -0.015937479 -0.248489099
## holiday -0.03462684 -0.01593748 1.000000000 0.006291507
## windspeed 0.03951106 -0.24848910 0.006291507 1.000000000
## mnth 0.04352810 0.22220369 0.019190895 -0.207501752
## mnth
## total_riders 0.279977112
## temp 0.220205335
## weekday 0.009509313
## Promotion -0.001792434
## weathersit 0.043528098
## humidity 0.222203691
## holiday 0.019190895
## windspeed -0.207501752
## mnth 1.000000000
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 21.694747 3 1.670036
## as.factor(weekday) 1.180943 6 1.013956
## as.factor(Promotion) 1.070005 1 1.034410
## as.factor(weathersit) 2.444317 2 1.250372
## poly(humidity, 2, raw = TRUE) 3.178286 2 1.335206
## as.factor(holiday) 1.117926 1 1.057320
## poly(windspeed, 3, raw = TRUE) 1.402216 3 1.057960
## as.factor(mnth) 25.285956 11 1.158156
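As a quick cross-check of the fit statistics in the summary above, the Adjusted R-Squared can be recomputed from the Multiple R-Squared and the degrees of freedom. This is a minimal sketch that uses only numbers printed in the output; the variable names are my own and are not part of the original analysis.

```r
# Values read from the summary output for the total_riders model
r2          <- 0.8745   # Multiple R-squared
df_model    <- 29       # numerator DF of the F-statistic (number of slope terms)
df_residual <- 701      # residual degrees of freedom

# Number of observations implied by the degrees of freedom (plus the intercept)
n <- df_model + df_residual + 1

# Adjusted R-squared penalizes R-squared for the number of estimated terms
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_residual

round(adj_r2, 4)
```

The result agrees with the reported .8693, so the small gap between the two R-Squared values is what we would expect from 29 estimated terms on 731 days of data.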
The main multicollinearity issues in my regression arose when I was trying to maximize the Multiple and Adjusted R-Squared values and found a high correlation among the month, temp, and season independent variables. I expected inflated variances when combining these three; however, a seasonality effect may need to be tested, so it was important to include a time-related variable even though all three are correlated. I chose month rather than season because it gives 12 levels, and so 11 slope coefficients relative to the baseline month, rather than 4 seasons, which allows more opportunity to understand the data. It also makes sense given the relationship between temp and month: we can infer the season from those two independent variables. Keeping this in mind, we can better understand total riders without including too many collinear variables with highly inflated variance factors, which can be problematic. In the VIF output above, temp and month have the largest GVIF values (21.69 and 25.29), but their degrees-of-freedom-adjusted values, GVIF^(1/(2*Df)), are only 1.67 and 1.16. It can be naive to use one variable rather than both; however, for this model, this will suffice.
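To make the VIF discussion concrete, the adjusted column of the VIF output can be reproduced from the GVIF and Df columns. This is only a sketch based on the printed values; the vector names are hypothetical.

```r
# GVIF and Df values copied from the VIF output for the total_riders model
gvif <- c(temp = 21.694747, mnth = 25.285956, humidity = 3.178286)
df   <- c(temp = 3,         mnth = 11,        humidity = 2)

# Degrees-of-freedom adjustment: puts terms with several coefficients on a
# comparable scale (analogous to the square root of an ordinary VIF)
adj_gvif <- gvif^(1 / (2 * df))

round(adj_gvif, 3)
```

The adjusted values match the GVIF^(1/(2*Df)) column. It is also worth noting that the GVIF for temp is 21.69 in this model but only about 1.29 in the casual and registered models further down, which leave month out. That is consistent with month being the main source of the inflation.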
In my regression, months 10 and 9 have the largest coefficients, at roughly 1,319 and 1,230 more riders per day than the January baseline, followed by month 7 at 1,143; months 6 and 8 are both near 890. This is reasonable because temp and the other independent variables are already in the model, so each month coefficient measures what is left over with everything else held constant. The month with the most total riders is June. If a month became unseasonably cold and rainy, it would not change the slope coefficient for that month, because the fitted model stays the same. The model itself is not changing; the weather would instead change the prediction through the temp, weather situation, and humidity terms.
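The month effects can be read directly off the coefficient table for the total riders model. The sketch below copies the printed estimates into a vector (the vector name is my own) and ranks them. Each value is the estimated difference from January with temp, weather, and the other terms held constant, not the raw monthly total.

```r
# Month coefficients copied from the total_riders summary output.
# Month 1 (January) is the baseline level, so its effect is 0 by construction.
month_coef <- c(
  `1` = 0,     `2` = 42.11,  `3` = 481.2, `4` = 641.3,
  `5` = 792.0, `6` = 888.1,  `7` = 1143,  `8` = 892.3,
  `9` = 1230,  `10` = 1319,  `11` = 1082, `12` = 771.3
)

# Sort from the largest to the smallest estimated difference versus January
ranked <- sort(month_coef, decreasing = TRUE)

head(ranked, 3)
```

Because temp is a separate term in the model, a warm month does not need a large month coefficient to have high predicted ridership; much of the summer effect is carried by the temp polynomial.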
From my regression model, we can expect about 1,954 more riders on any given day when a promotion is used, with every other independent variable held constant. My initial advice to the marketing department would be to use promotions in off-season months with lower average ridership, such as January, February, March, November, and December. I would not put all my faith in the data, however; I would also consider weather situation, temp, and so on. Even though promotion has a large effect on ridership, we cannot infer that riders will take a promotion and ride in the snow. The decision should pair judgment with the data so that the return is maximized. Models such as this one are inherently naive on the surface, so the ultimate decision should rest on both analysis and common sense.
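To give the marketing department a range rather than a single number, an approximate 95% confidence interval for the promotion coefficient can be built from the estimate and standard error in the summary output. This is a sketch under the usual linear model assumptions; the variable names are my own.

```r
# Estimate and standard error for as.factor(Promotion)1, from the summary output
estimate    <- 1954
std_error   <- 53.59
df_residual <- 701

# Two-sided 95% interval based on the t distribution
t_crit <- qt(0.975, df = df_residual)
ci <- estimate + c(-1, 1) * t_crit * std_error

round(ci)
```

The interval runs from roughly 1,850 to 2,060 riders per day. It describes the uncertainty in the estimated coefficient only; it does not tell us whether a promotion run in January would produce the same lift as the promotions in the data.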
I cannot explain much beyond the fact that a promotion influenced both groups to ride on a given day. The percent difference is not a good explanation, because there are too many unknowns. The model is naive, though a good fit for what it brings to the surface, and those unknowns keep it from fully describing the effect of promotion on the two types of riders. The output below shows that both R-Squared values are greater for casual riders, which suggests a better fit than for registered riders; however, there are more registered riders, which can mean more variability in the data and make it harder to explain. Note also that the registered model is fit on log(registered) while the casual model is fit on rider counts, so the two R-Squared values are not on the same scale. The point is that not enough information is given to tell the entire story, but we can get a decent overview of how the two types of riders are affected by promotions.
##
## Call:
## lm(formula = (casual) ~ poly(temp, 3, raw = TRUE) + as.factor(weekday) +
## as.factor(Promotion) + as.factor(weathersit) + poly(humidity,
## 2, raw = TRUE) + as.factor(holiday) + poly(windspeed, 3,
## raw = TRUE), data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1084.7 -203.0 -29.2 177.2 1594.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1218.60837 303.97667 4.009 6.74e-05 ***
## poly(temp, 3, raw = TRUE)1 -100.26532 32.38442 -3.096 0.00204 **
## poly(temp, 3, raw = TRUE)2 11.83308 1.73955 6.802 2.18e-11 ***
## poly(temp, 3, raw = TRUE)3 -0.24645 0.02920 -8.440 < 2e-16 ***
## as.factor(weekday)1 -768.71679 48.49251 -15.852 < 2e-16 ***
## as.factor(weekday)2 -810.57388 47.25673 -17.153 < 2e-16 ***
## as.factor(weekday)3 -816.65755 47.40345 -17.228 < 2e-16 ***
## as.factor(weekday)4 -801.42404 47.34977 -16.926 < 2e-16 ***
## as.factor(weekday)5 -617.26920 47.48188 -13.000 < 2e-16 ***
## as.factor(weekday)6 162.16213 47.16914 3.438 0.00062 ***
## as.factor(Promotion)1 262.68796 25.79248 10.185 < 2e-16 ***
## as.factor(weathersit)2 -70.13307 34.43610 -2.037 0.04206 *
## as.factor(weathersit)3 -197.89175 97.13492 -2.037 0.04199 *
## poly(humidity, 2, raw = TRUE)1 4.05732 6.44084 0.630 0.52894
## poly(humidity, 2, raw = TRUE)2 -0.09890 0.05294 -1.868 0.06213 .
## as.factor(holiday)1 534.44998 78.89814 6.774 2.63e-11 ***
## poly(windspeed, 3, raw = TRUE)1 -80.33994 28.76781 -2.793 0.00537 **
## poly(windspeed, 3, raw = TRUE)2 5.42641 1.99096 2.726 0.00658 **
## poly(windspeed, 3, raw = TRUE)3 -0.13217 0.04205 -3.143 0.00174 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 340.2 on 712 degrees of freedom
## Multiple R-squared: 0.7606, Adjusted R-squared: 0.7546
## F-statistic: 125.7 on 18 and 712 DF, p-value: < 2.2e-16
## casual temp weekday Promotion
## casual 1.00000000 0.5432846617 0.0599226375 0.248545664
## temp 0.54328466 1.0000000000 -0.0001699624 0.047603572
## weekday 0.05992264 -0.0001699624 1.0000000000 -0.005460765
## Promotion 0.24854566 0.0476035719 -0.0054607652 1.000000000
## weathersit -0.24735300 -0.1206022365 0.0310874694 -0.048726541
## humidity -0.07700788 0.1269629390 -0.0522321004 -0.110651045
## holiday 0.05427420 -0.0285555350 -0.1019602689 0.007954311
## windspeed -0.16761335 -0.1579441204 0.0142821241 -0.011817060
## weathersit humidity holiday windspeed
## casual -0.24735300 -0.07700788 0.054274203 -0.167613349
## temp -0.12060224 0.12696294 -0.028555535 -0.157944120
## weekday 0.03108747 -0.05223210 -0.101960269 0.014282124
## Promotion -0.04872654 -0.11065104 0.007954311 -0.011817060
## weathersit 1.00000000 0.59104460 -0.034626841 0.039511059
## humidity 0.59104460 1.00000000 -0.015937479 -0.248489099
## holiday -0.03462684 -0.01593748 1.000000000 0.006291507
## windspeed 0.03951106 -0.24848910 0.006291507 1.000000000
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 1.285583 3 1.042758
## as.factor(weekday) 1.155038 6 1.012084
## as.factor(Promotion) 1.050715 1 1.025044
## as.factor(weathersit) 2.302449 2 1.231821
## poly(humidity, 2, raw = TRUE) 2.765578 2 1.289575
## as.factor(holiday) 1.097326 1 1.047533
## poly(windspeed, 3, raw = TRUE) 1.282491 3 1.042339
##
## Call:
## lm(formula = I(log(registered)) ~ poly(temp, 3, raw = TRUE) +
## as.factor(weekday) + as.factor(Promotion) + as.factor(weathersit) +
## poly(humidity, 2, raw = TRUE) + as.factor(holiday) + poly(windspeed,
## 3, raw = TRUE), data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3084 -0.0996 0.0132 0.1367 0.7830
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.312e+00 2.666e-01 23.677 < 2e-16 ***
## poly(temp, 3, raw = TRUE)1 7.132e-02 2.840e-02 2.511 0.012259 *
## poly(temp, 3, raw = TRUE)2 1.693e-03 1.526e-03 1.109 0.267596
## poly(temp, 3, raw = TRUE)3 -7.918e-05 2.561e-05 -3.091 0.002070 **
## as.factor(weekday)1 2.571e-01 4.253e-02 6.045 2.41e-09 ***
## as.factor(weekday)2 3.218e-01 4.145e-02 7.763 2.89e-14 ***
## as.factor(weekday)3 3.268e-01 4.158e-02 7.860 1.42e-14 ***
## as.factor(weekday)4 3.328e-01 4.153e-02 8.013 4.59e-15 ***
## as.factor(weekday)5 3.119e-01 4.164e-02 7.490 2.05e-13 ***
## as.factor(weekday)6 9.423e-02 4.137e-02 2.278 0.023040 *
## as.factor(Promotion)1 4.563e-01 2.262e-02 20.173 < 2e-16 ***
## as.factor(weathersit)2 -7.973e-02 3.020e-02 -2.640 0.008475 **
## as.factor(weathersit)3 -7.155e-01 8.519e-02 -8.398 2.44e-16 ***
## poly(humidity, 2, raw = TRUE)1 1.698e-02 5.649e-03 3.005 0.002747 **
## poly(humidity, 2, raw = TRUE)2 -1.800e-04 4.643e-05 -3.876 0.000116 ***
## as.factor(holiday)1 -4.089e-01 6.920e-02 -5.909 5.35e-09 ***
## poly(windspeed, 3, raw = TRUE)1 -3.703e-02 2.523e-02 -1.468 0.142591
## poly(windspeed, 3, raw = TRUE)2 2.012e-03 1.746e-03 1.152 0.249585
## poly(windspeed, 3, raw = TRUE)3 -5.421e-05 3.688e-05 -1.470 0.142020
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2983 on 712 degrees of freedom
## Multiple R-squared: 0.7284, Adjusted R-squared: 0.7215
## F-statistic: 106.1 on 18 and 712 DF, p-value: < 2.2e-16
## registered temp weekday Promotion
## registered 1.00000000 0.5400119662 0.0573674440 0.594248168
## temp 0.54001197 1.0000000000 -0.0001699624 0.047603572
## weekday 0.05736744 -0.0001699624 1.0000000000 -0.005460765
## Promotion 0.59424817 0.0476035719 -0.0054607652 1.000000000
## weathersit -0.26038771 -0.1206022365 0.0310874694 -0.048726541
## humidity -0.09108860 0.1269629390 -0.0522321004 -0.110651045
## holiday -0.10874486 -0.0285555350 -0.1019602689 0.007954311
## windspeed -0.21744898 -0.1579441204 0.0142821241 -0.011817060
## weathersit humidity holiday windspeed
## registered -0.26038771 -0.09108860 -0.108744863 -0.217448981
## temp -0.12060224 0.12696294 -0.028555535 -0.157944120
## weekday 0.03108747 -0.05223210 -0.101960269 0.014282124
## Promotion -0.04872654 -0.11065104 0.007954311 -0.011817060
## weathersit 1.00000000 0.59104460 -0.034626841 0.039511059
## humidity 0.59104460 1.00000000 -0.015937479 -0.248489099
## holiday -0.03462684 -0.01593748 1.000000000 0.006291507
## windspeed 0.03951106 -0.24848910 0.006291507 1.000000000
## GVIF Df GVIF^(1/(2*Df))
## poly(temp, 3, raw = TRUE) 1.285583 3 1.042758
## as.factor(weekday) 1.155038 6 1.012084
## as.factor(Promotion) 1.050715 1 1.025044
## as.factor(weathersit) 2.302449 2 1.231821
## poly(humidity, 2, raw = TRUE) 2.765578 2 1.289575
## as.factor(holiday) 1.097326 1 1.047533
## poly(windspeed, 3, raw = TRUE) 1.282491 3 1.042339
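The two promotion coefficients above are in different units, because the casual model uses rider counts and the registered model uses log(registered). A minimal sketch of how the log coefficient can be turned into an approximate percent change is shown below; the variable names are hypothetical.

```r
# Promotion coefficients copied from the two summary outputs
casual_coef     <- 262.68796  # riders per day (response is casual)
registered_coef <- 0.4563     # log points (response is log(registered))

# For a log response, exp(b) - 1 is the approximate proportional change
registered_pct <- (exp(registered_coef) - 1) * 100

round(registered_pct, 1)
```

Read this way, a promotion is associated with about 263 more casual riders per day and with registered ridership that is roughly 58% higher, all else held constant. The two figures should not be compared directly without putting them on a common scale.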
In my final report, overall, I feel comfortable describing the relationship for total riders, but I cannot say the same when splitting riders into the two categories, as more information is needed to tell the entire story. Also, for the model as a whole, I would like to know whether this data is a sample or the entire population of riders. Another important point is possible market basket attachments; in other words, I would like to know what else encourages riders other than promotion. Lastly, I would like to know about any liability issues in case a bike is damaged, how that affects riders, and how it relates to profit margin and long-term sustainability.