GLM Assignment 2
Introduction
The purpose of this analysis is to assess the real impact of the promotion on the BikeShare program and its effect on total riders, as measured by casual and registered users in the Toronto area.
Sections
The document is structured with the following sections:
- Regression Model Construction
- Explanation of Problems Encountered and Solutions
- Assessment of the impact the month of the year has on ridership
- Interpretation of the coefficient on the promotion variable and initial judgement
- Presentation of findings and request for additional information
Required Packages
The packages required for this markdown are:
| Package | Summary |
|---|---|
| tidyverse | The tidyverse collection of packages |
| DT | JavaScript-enabled data tables |
| stargazer | Fancy regression tables |
| corrplot | Simple correlation plots |
| PerformanceAnalytics | Detailed plots and tables for analytics |
| pander | Pandoc markdown rendering of R objects |
| car | Regression diagnostics (VIF, component-plus-residual plots) |
| Metrics | Computing RMSE |
Regression Model Construction
Data Prep
Download the .csv from the following URL to your working directory, then import it: http://asayanalytics.com/bikeshare_csv
Convert columns 1 to 7 to factors from numeric
Create a total_riders variable to aggregate casual and registered riders
# Download the file from the web source (run once, then comment this line out)
# download.file("http://asayanalytics.com/bikeshare_csv", "bikeshare.csv")
# Import the file from the working directory
library(tidyverse)  # provides read_csv()
bikeshare <- read_csv("bikeshare.csv")
# Create total_riders variable summing casual and registered riders
bikeshare$total_riders <- bikeshare$casual + bikeshare$registered
# Make duplicate for the correlation matrix that takes numeric variables
bikeshare2 <- bikeshare
# Change multiple column types to factor
i <- 1:7
bikeshare[i] <- lapply(bikeshare[i], factor)
Random Sample View of Data
Create a 10% random sample of the bikeshare data for columns of interest. View the result as a javascript data table (DT)
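The sampling step described above can be sketched as follows. This is a minimal illustration, not the report's own code: the `toy` data frame, its values, the seed, and the choice of columns are all assumptions standing in for the real bikeshare file. The DT call is guarded so the sketch still runs where DT is not installed.

```r
# Toy stand-in for the bikeshare data (hypothetical values, not the real file)
set.seed(42)  # arbitrary seed so the sample is reproducible
toy <- data.frame(
  mnth       = factor(sample(1:12, 200, replace = TRUE)),
  temp       = round(runif(200, 0, 35), 1),
  casual     = rpois(200, 800),
  registered = rpois(200, 3500)
)
toy$total_riders <- toy$casual + toy$registered

# Columns of interest (an assumed selection for illustration)
cols <- c("mnth", "temp", "casual", "registered", "total_riders")

# Draw 10% of the rows without replacement
n_sample   <- round(0.10 * nrow(toy))
toy_sample <- toy[sample(nrow(toy), n_sample), cols]

# Build an interactive table if DT is available; printing `widget` displays it
if (requireNamespace("DT", quietly = TRUE)) {
  widget <- DT::datatable(toy_sample)
}
```

With the real data, `toy` would be replaced by `bikeshare` and `cols` by whichever columns the analysis focuses on.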
Model
##
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + poly(humidity,
## 2, raw = TRUE) + weathersit + mnth + Promotion, data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3841.0 -351.5 64.5 450.1 2502.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.544e+02 6.354e+02 0.873 0.383219
## poly(temp, 3, raw = TRUE)1 -2.974e+02 8.707e+01 -3.415 0.000674 ***
## poly(temp, 3, raw = TRUE)2 3.270e+01 4.722e+00 6.924 9.83e-12 ***
## poly(temp, 3, raw = TRUE)3 -6.872e-01 7.883e-02 -8.718 < 2e-16 ***
## poly(humidity, 2, raw = TRUE)1 5.581e+01 1.463e+01 3.816 0.000147 ***
## poly(humidity, 2, raw = TRUE)2 -5.854e-01 1.205e-01 -4.858 1.46e-06 ***
## weathersit2 -4.156e+02 7.778e+01 -5.343 1.23e-07 ***
## weathersit3 -1.731e+03 2.172e+02 -7.971 6.26e-15 ***
## mnth2 2.787e+01 1.496e+02 0.186 0.852213
## mnth3 5.348e+02 1.626e+02 3.288 0.001057 **
## mnth4 6.588e+02 1.805e+02 3.650 0.000281 ***
## mnth5 9.264e+02 2.033e+02 4.557 6.11e-06 ***
## mnth6 1.100e+03 2.291e+02 4.801 1.93e-06 ***
## mnth7 1.343e+03 2.527e+02 5.313 1.45e-07 ***
## mnth8 1.102e+03 2.344e+02 4.702 3.10e-06 ***
## mnth9 1.426e+03 2.094e+02 6.809 2.09e-11 ***
## mnth10 1.458e+03 1.814e+02 8.036 3.86e-15 ***
## mnth11 1.185e+03 1.643e+02 7.212 1.41e-12 ***
## mnth12 8.770e+02 1.539e+02 5.700 1.76e-08 ***
## Promotion1 1.962e+03 5.874e+01 33.403 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 770.1 on 711 degrees of freedom
## Multiple R-squared: 0.8461, Adjusted R-squared: 0.842
## F-statistic: 205.7 on 19 and 711 DF, p-value: < 2.2e-16
The regression model above is statistically significant overall (F-statistic p-value < 2.2e-16) and has an adjusted R-squared of 0.842, which suggests that the predictors together explain about 84% of the variance in total riders, the dependent variable.
The generalized regression equation can be computed from the coefficients listed in the summary output above, which also reports standard errors, p-values, and the adjusted R-squared. The residual standard error is 770.1.
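For reference, the regression equation implied by the coefficient table above can be written out as follows. The coefficients are rounded from the summary output; the symbol $\hat{\gamma}_m$ for the month coefficients (mnth2 through mnth12, with January as the baseline) and the indicator notation for the factor levels are notational assumptions made here for compactness.

```latex
\begin{aligned}
\widehat{\text{total\_riders}} ={}& 554.4 - 297.4\,\text{temp} + 32.70\,\text{temp}^2 - 0.6872\,\text{temp}^3 \\
&+ 55.81\,\text{humidity} - 0.5854\,\text{humidity}^2 \\
&- 415.6\,\text{weathersit}_2 - 1731\,\text{weathersit}_3 \\
&+ \sum_{m=2}^{12} \hat{\gamma}_m\,\text{mnth}_m + 1962\,\text{Promotion}
\end{aligned}
```

Here $\text{weathersit}_2$, $\text{weathersit}_3$, $\text{mnth}_m$, and $\text{Promotion}$ are 0/1 indicators for the corresponding factor levels.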
Explanation of Problems Encountered
- To improve the linearity of the model, as exhibited in the component-plus-residual plot (crPlots), a third-degree polynomial in temp was used.
- A second-degree polynomial in humidity was used for the same reason.
- Since each month falls within a season, the season variable is largely redundant once mnth is included, so season was removed as an independent variable.
- The variable Workingday was removed due to collinearity issues, as it is an aggregation of the Weekday and Holiday variables.
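The effect of the polynomial terms in the list above can be illustrated on simulated data. This is a hedged sketch only: the data-generating formula, the noise level, and the names `sim`, `fit_linear`, and `fit_cubic` are assumptions chosen to mimic a curved temperature effect, not the report's actual data or models.

```r
# Simulated data with a curved temperature effect (hypothetical, for illustration)
set.seed(1)
temp   <- runif(300, 0, 35)
riders <- 500 - 300 * temp + 33 * temp^2 - 0.7 * temp^3 + rnorm(300, sd = 400)
sim    <- data.frame(temp = temp, riders = riders)

# Straight-line fit versus a cubic polynomial in raw powers of temp
fit_linear <- lm(riders ~ temp, data = sim)
fit_cubic  <- lm(riders ~ poly(temp, 3, raw = TRUE), data = sim)

# Compare adjusted R-squared between the two fits
r2_linear <- summary(fit_linear)$adj.r.squared
r2_cubic  <- summary(fit_cubic)$adj.r.squared

# Nested-model F-test: do temp^2 and temp^3 improve the fit?
cmp <- anova(fit_linear, fit_cubic)
```

On real data, a component-plus-residual plot of the fitted model (for example with `car::crPlots`) is the visual counterpart of this comparison.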
Multicollinearity
Some collinearity may exist between the independent variables. As seen in the VIF table below, the temp and mnth variables have the highest GVIF scores (6.70 and 7.54), which could indicate redundancy between predictor variables. In the presence of multicollinearity, the coefficient estimates of the regression model become unstable. However, the degrees-of-freedom-adjusted values, GVIF^(1/(2*Df)), are all below 2.6, and further analysis, as illustrated in the correlation matrix, suggests that multicollinearity is not a significant concern.
## GVIF Df GVIF^(1/(2*Df))
## temp 6.704780 1 2.589359
## humidity 1.925023 1 1.387452
## weathersit 1.726303 2 1.146250
## mnth 7.537079 11 1.096157
## Promotion 1.042738 1 1.021146
## [1] 5.119165
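To make the VIF idea concrete, the sketch below computes variance inflation factors by hand for numeric predictors, using VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. The simulated variables (`seasonal`, `temp`, `humidity`) are assumptions for illustration. The GVIF columns in the table above appear to come from a packaged routine such as `car::vif`, which generalizes this quantity to multi-level factors like mnth, so the manual values here are not directly comparable to the GVIF column.

```r
# Simulated predictors where temp is strongly tied to a seasonal index (hypothetical)
set.seed(7)
n        <- 500
seasonal <- rnorm(n)
temp     <- 15 + 8 * seasonal + rnorm(n, sd = 3)  # correlated with seasonal
humidity <- 60 + rnorm(n, sd = 10)                # roughly independent
X <- data.frame(temp = temp, seasonal = seasonal, humidity = humidity)

# VIF_j = 1 / (1 - R^2_j): regress each predictor on all of the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
```

A predictor that is nearly independent of the others, like `humidity` here, gets a VIF close to 1, while the two correlated predictors get noticeably larger values.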
Impact of Month of the Year
Holding the other predictors constant, October has the highest expected ridership: its coefficient of 1.458e+03 is the largest of the monthly coefficients, each of which is measured relative to the January baseline. Because the coefficient on the cold and rainy level of weathersit is negative, an unseasonably cold and rainy October would lower predicted total riders through the temp and weathersit terms, even though the month coefficient itself would not change.
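The comparison of month coefficients can be reproduced directly from the summary output. The sketch below copies the rounded mnth and weathersit3 estimates from the total_riders model; setting January to 0 reflects its role as the baseline level, and the variable names are chosen here for illustration.

```r
# Month coefficients from the total_riders model summary (January is the baseline)
month_coef <- c(mnth1 = 0,     mnth2 = 27.87, mnth3 = 534.8, mnth4 = 658.8,
                mnth5 = 926.4, mnth6 = 1100,  mnth7 = 1343,  mnth8 = 1102,
                mnth9 = 1426,  mnth10 = 1458, mnth11 = 1185, mnth12 = 877.0)

# Month with the largest coefficient, holding the other predictors constant
top_month <- names(which.max(month_coef))

# weathersit3 coefficient from the same summary
weathersit3 <- -1731

# October effect combined with the worst weather level,
# relative to a January day in the best weather category
oct_bad_weather <- month_coef[["mnth10"]] + weathersit3
```

The combined value is negative, which illustrates that a bad-weather day in October is predicted to draw fewer riders than a clear January day at the same temperature and humidity.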
Interpretation of the promotion coefficient
The promotion variable has a coefficient of 1.962e+03, or approximately 1,962 riders, with a p-value < 0.05, which indicates a statistically significant positive effect on total ridership in the bike sharing program at the 95% confidence level. However, the promotion had differing impacts on the casual and registered riders.
##
## Call:
## lm(formula = log(casual) ~ poly(temp, 3, raw = TRUE) + poly(humidity,
## 2, raw = TRUE) + weathersit + Promotion, data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3332 -0.3764 -0.1002 0.4032 1.6375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.583e+00 4.607e-01 5.607 2.94e-08 ***
## poly(temp, 3, raw = TRUE)1 2.539e-01 5.686e-02 4.466 9.26e-06 ***
## poly(temp, 3, raw = TRUE)2 -1.765e-03 3.053e-03 -0.578 0.5633
## poly(temp, 3, raw = TRUE)3 -7.245e-05 5.123e-05 -1.414 0.1577
## poly(humidity, 2, raw = TRUE)1 1.786e-02 1.109e-02 1.610 0.1078
## poly(humidity, 2, raw = TRUE)2 -2.088e-04 9.169e-05 -2.277 0.0231 *
## weathersit2 -2.341e-01 5.960e-02 -3.927 9.42e-05 ***
## weathersit3 -1.396e+00 1.667e-01 -8.373 2.92e-16 ***
## Promotion1 3.035e-01 4.529e-02 6.701 4.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5994 on 722 degrees of freedom
## Multiple R-squared: 0.6573, Adjusted R-squared: 0.6535
## F-statistic: 173.1 on 8 and 722 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = registered ~ poly(temp, 3, raw = TRUE) + poly(humidity,
## 2, raw = TRUE) + weathersit + Promotion, data = bikeshare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3076.4 -525.1 101.2 528.3 1952.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.258e+03 6.316e+02 -1.992 0.046721 *
## poly(temp, 3, raw = TRUE)1 -4.201e+01 7.795e+01 -0.539 0.590137
## poly(temp, 3, raw = TRUE)2 1.721e+01 4.186e+00 4.112 4.37e-05 ***
## poly(temp, 3, raw = TRUE)3 -4.153e-01 7.023e-02 -5.913 5.17e-09 ***
## poly(humidity, 2, raw = TRUE)1 7.526e+01 1.521e+01 4.949 9.30e-07 ***
## poly(humidity, 2, raw = TRUE)2 -6.839e-01 1.257e-01 -5.440 7.29e-08 ***
## weathersit2 -2.874e+02 8.171e+01 -3.517 0.000463 ***
## weathersit3 -1.031e+03 2.286e+02 -4.511 7.52e-06 ***
## Promotion1 1.653e+03 6.209e+01 26.618 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 821.8 on 722 degrees of freedom
## Multiple R-squared: 0.7256, Adjusted R-squared: 0.7226
## F-statistic: 238.7 on 8 and 722 DF, p-value: < 2.2e-16
Holding all else constant, when evaluating the promotion variable for casual riders only, the log(casual) model above gives a statistically significant coefficient of 0.3035, which corresponds to an increase of roughly 35% in the number of casual riders during promotions. For registered riders, the coefficient is about 1,653, which suggests that the impact of the promotion was largely due to the increase in registered riders. From the boxplots pictured below, we can see clear differences in the number of riders with and without promotions for casual and registered riders.
To correct for nonlinearity, a log transformation of both casual and registered riders could be employed, as was done for casual riders in the model above.
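Because the casual-rider model shown above uses log(casual) as its response, its Promotion1 coefficient is on the log scale and is read as a percentage change rather than a rider count. The short sketch below does that conversion and compares the registered and total coefficients; the variable names are illustrative, and the share is only indicative because the two models use different predictor sets (the total_riders model includes mnth, the registered model does not).

```r
# Promotion1 coefficient from the log(casual) model summary
b_promo_log <- 0.3035

# Percentage change in casual riders implied by a log-scale coefficient
pct_change <- (exp(b_promo_log) - 1) * 100   # roughly 35.5

# Promotion1 coefficients from the total_riders and registered models
b_total      <- 1962
b_registered <- 1653

# Indicative share of the total promotion effect matched by registered riders
share_registered <- b_registered / b_total   # roughly 0.84
```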
Presentation of Findings
Based on the linear regression model, we can conclude that promotions had a significant effect on the BikeShare program, as measured by increased casual and registered riders in the Toronto area. However, to determine whether this increased ridership translated into greater revenues or profits, additional financial information on both revenues and costs would be required to assess the overall profitability of the promotional activities.