Introduction

The purpose of this analysis is to assess the real impact of the promotion on the BikeShare program and its effect on total riders, as measured by casual and registered users in the Toronto area.

Sections

The document is structured with the following sections:

  • Regression Model Construction
  • Explanation of Problems Encountered and Solutions
  • Assessment of impact the month of the year has on ridership
  • Interpretation of the coefficient on the promotion variable and initial judgement
  • Presentation of Findings and request for additional information

Required Packages

The packages required for this markdown are:

Package               Summary
tidyverse             The tidyverse collection of packages
DT                    JavaScript-enabled data tables
stargazer             Fancy regression tables
corrplot              Simple correlation plots
PerformanceAnalytics  Detailed plots and tables for analytics
pander                Pandoc markdown rendering of R objects
car                   Regression diagnostics (VIF, component-plus-residual plots)
Metrics               Computing RMSE

Regression Model Construction

Data Prep

Download the .csv to your working directory, then import it. The file is available at the following URL: http://asayanalytics.com/bikeshare_csv

Convert columns 1 to 7 to factors from numeric

Create a total_riders variable to aggregate casual and registered riders

# Load the tidyverse, which provides read_csv() via readr
library(tidyverse)

# Download the file from the web source (run once, then comment out)
# download.file("http://asayanalytics.com/bikeshare_csv", "bikeshare.csv")

# Import the file from the working directory
bikeshare <- read_csv("bikeshare.csv")

# Create total_riders variable summing casual and registered riders
bikeshare$total_riders <- bikeshare$casual + bikeshare$registered

# Make duplicate for the correlation matrix that takes numeric variables
bikeshare2 <- bikeshare

# Change multiple column types to factor
i <- c(1:7)
bikeshare[i] <- lapply(bikeshare[i], factor) # as.factor()

Random Sample View of Data

Create a 10% random sample of the bikeshare data for the columns of interest. View the result as a JavaScript data table (DT).
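The chunk that produced the sample is not shown in this rendering. The following is a minimal sketch of one way to do it, assuming the dplyr and DT packages are installed. The data frame `bikeshare_demo` is a synthetic stand-in created only for illustration, with column names taken from the models later in this document; the choice of columns of interest is also an assumption.

```r
library(dplyr)

set.seed(42)  # make the random sample reproducible

# Synthetic stand-in for the bikeshare data (NOT the real data set);
# column names follow the ones used elsewhere in this document.
bikeshare_demo <- tibble(
  mnth       = factor(rep(1:12, length.out = 730)),
  Promotion  = factor(rep(0:1, each = 365)),
  casual     = rpois(730, 800),
  registered = rpois(730, 3500)
) %>%
  mutate(total_riders = casual + registered)

# Draw a 10% random sample of the (assumed) columns of interest
bikeshare_sample <- bikeshare_demo %>%
  select(mnth, Promotion, casual, registered, total_riders) %>%
  slice_sample(prop = 0.1)

# Build the interactive table only when DT is available
sample_table <- NULL
if (requireNamespace("DT", quietly = TRUE)) {
  sample_table <- DT::datatable(bikeshare_sample)
}
```

Printing `sample_table` in an R Markdown chunk would display the interactive table. With the real data, `bikeshare` would replace `bikeshare_demo`.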

Model

## 
## Call:
## lm(formula = total_riders ~ poly(temp, 3, raw = TRUE) + poly(humidity, 
##     2, raw = TRUE) + weathersit + mnth + Promotion, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3841.0  -351.5    64.5   450.1  2502.8 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.544e+02  6.354e+02   0.873 0.383219    
## poly(temp, 3, raw = TRUE)1     -2.974e+02  8.707e+01  -3.415 0.000674 ***
## poly(temp, 3, raw = TRUE)2      3.270e+01  4.722e+00   6.924 9.83e-12 ***
## poly(temp, 3, raw = TRUE)3     -6.872e-01  7.883e-02  -8.718  < 2e-16 ***
## poly(humidity, 2, raw = TRUE)1  5.581e+01  1.463e+01   3.816 0.000147 ***
## poly(humidity, 2, raw = TRUE)2 -5.854e-01  1.205e-01  -4.858 1.46e-06 ***
## weathersit2                    -4.156e+02  7.778e+01  -5.343 1.23e-07 ***
## weathersit3                    -1.731e+03  2.172e+02  -7.971 6.26e-15 ***
## mnth2                           2.787e+01  1.496e+02   0.186 0.852213    
## mnth3                           5.348e+02  1.626e+02   3.288 0.001057 ** 
## mnth4                           6.588e+02  1.805e+02   3.650 0.000281 ***
## mnth5                           9.264e+02  2.033e+02   4.557 6.11e-06 ***
## mnth6                           1.100e+03  2.291e+02   4.801 1.93e-06 ***
## mnth7                           1.343e+03  2.527e+02   5.313 1.45e-07 ***
## mnth8                           1.102e+03  2.344e+02   4.702 3.10e-06 ***
## mnth9                           1.426e+03  2.094e+02   6.809 2.09e-11 ***
## mnth10                          1.458e+03  1.814e+02   8.036 3.86e-15 ***
## mnth11                          1.185e+03  1.643e+02   7.212 1.41e-12 ***
## mnth12                          8.770e+02  1.539e+02   5.700 1.76e-08 ***
## Promotion1                      1.962e+03  5.874e+01  33.403  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 770.1 on 711 degrees of freedom
## Multiple R-squared:  0.8461, Adjusted R-squared:  0.842 
## F-statistic: 205.7 on 19 and 711 DF,  p-value: < 2.2e-16

The regression model above is statistically significant (F-statistic p-value < 2.2e-16, well below 0.05) and has an adjusted R-squared of 0.842, which suggests that the independent variables together explain over 84% of the variance in total riders, the dependent variable.

The generalized regression equation can be computed from the coefficients listed in the summary output above, which also reports standard errors, p-values, and the adjusted R-squared. The residual standard error is 770.1.
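As an illustration of how the regression equation is applied, the sketch below plugs the printed coefficients into the fitted equation for one hypothetical day. The inputs (temp = 20, humidity = 60, clear weather, month 10, promotion active) are assumptions chosen for illustration and are taken to be on the same scale as the data. Because the coefficients are rounded as printed, the result is approximate.

```r
# Coefficients copied from the total_riders summary output (rounded as printed)
b <- c(
  intercept  = 554.4,
  temp1      = -297.4,
  temp2      = 32.70,
  temp3      = -0.6872,
  hum1       = 55.81,
  hum2       = -0.5854,
  mnth10     = 1458,
  promotion1 = 1962
)

# Hypothetical inputs (assumptions for illustration only)
temp     <- 20
humidity <- 60

# weathersit is at its baseline level, so it contributes nothing here
predicted_riders <- unname(
  b["intercept"] +
    b["temp1"] * temp + b["temp2"] * temp^2 + b["temp3"] * temp^3 +
    b["hum1"] * humidity + b["hum2"] * humidity^2 +
    b["mnth10"] + b["promotion1"]
)

predicted_riders
```

With the fitted model object available, `predict()` on a one-row data frame would give the same kind of result without manual arithmetic.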

Explanation of Problems Encountered

  • To improve linearity of the model, temp was entered as a third-degree polynomial, as suggested by the component-plus-residual plot (crPlots).
  • Humidity was entered as a second-degree polynomial.
  • Because season is fully determined by month, season was removed as an independent variable.
  • The variable Workingday was removed due to collinearity, as it is an aggregation of the Weekday and Holiday variables.
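The polynomial choice described above can be checked along the following lines. This is a minimal sketch on synthetic data with a deliberately curved temperature effect, not the bikeshare data. The names `demo`, `fit_linear`, and `fit_cubic` are hypothetical, and the plot is drawn only if the car package is installed.

```r
set.seed(1)

# Synthetic stand-in data with a curved temperature effect (NOT the real data)
demo <- data.frame(temp = runif(300, 0, 35))
demo$riders <- 500 + 30 * demo$temp^2 - 0.7 * demo$temp^3 +
  rnorm(300, sd = 300)

# Straight-line fit versus third-degree raw polynomial fit
fit_linear <- lm(riders ~ temp, data = demo)
fit_cubic  <- lm(riders ~ poly(temp, 3, raw = TRUE), data = demo)

# Component-plus-residual plot for the linear fit; visible curvature
# suggests that a polynomial term is worth trying
if (requireNamespace("car", quietly = TRUE)) {
  car::crPlots(fit_linear, terms = ~ temp)
}

# Nested-model F test and adjusted R-squared comparison
poly_test <- anova(fit_linear, fit_cubic)
adj_r2 <- c(
  linear = summary(fit_linear)$adj.r.squared,
  cubic  = summary(fit_cubic)$adj.r.squared
)
```

A small p-value in `poly_test` and a higher adjusted R-squared for the cubic fit would support keeping the higher-order terms.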

Multicollinearity

Some collinearity may exist between the independent variables. In the VIF table below, the temp and mnth variables have the highest GVIF scores (6.70 and 7.54), which could indicate redundancy between these predictors. In the presence of multicollinearity, the coefficient estimates of the regression model become unstable. However, the degrees-of-freedom-adjusted values, GVIF^(1/(2*Df)), are modest (2.59 for temp and 1.10 for mnth), and further analysis, as illustrated in the correlation matrix, suggests that multicollinearity is not a significant concern.

##                GVIF Df GVIF^(1/(2*Df))
## temp       6.704780  1        2.589359
## humidity   1.925023  1        1.387452
## weathersit 1.726303  2        1.146250
## mnth       7.537079 11        1.096157
## Promotion  1.042738  1        1.021146
## [1] 5.119165
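A table like the one above can be produced with `car::vif()`. The sketch below uses synthetic data in which temperature follows a seasonal pattern, so it is correlated with month by construction; it is not the bikeshare data, and the variable names are illustrative. It also computes the VIF for temp by hand, which for a single numeric predictor is 1 / (1 - R^2) from regressing that predictor on the others.

```r
set.seed(7)

# Synthetic stand-in data (NOT the real data): temperature follows a
# seasonal pattern, so it is correlated with month by construction
n <- 720
demo <- data.frame(mnth = factor(rep(1:12, each = 60)))
seasonal_mean <- 15 - 12 * cos(2 * pi * (as.integer(demo$mnth) - 1) / 12)
demo$temp     <- seasonal_mean + rnorm(n, sd = 3)
demo$humidity <- runif(n, 30, 90)
demo$riders   <- 2000 + 100 * demo$temp - 5 * demo$humidity +
  rnorm(n, sd = 500)

fit <- lm(riders ~ temp + humidity + mnth, data = demo)

# VIF for one numeric predictor: regress it on all other predictors
r2_temp  <- summary(lm(temp ~ humidity + mnth, data = demo))$r.squared
vif_temp <- 1 / (1 - r2_temp)

# car::vif() reports generalized VIFs when the model contains factors
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(fit))
}
```

For a term with one degree of freedom, the GVIF reported by `car::vif()` equals the ordinary VIF, so `vif_temp` should match the temp row of that table.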

Impact of Month of the Year

October has the highest monthly coefficient (1.458e+03, or about 1,458 riders above the baseline month, mnth1), so it is the month with the highest expected ridership when temperature, humidity, weather, and promotion are held constant. Because the cold and rainy level of weathersit has a negative coefficient, an unseasonably cold and rainy October would be expected to reduce total riders for that month.
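The ranking of months can be read directly from the coefficients. The short sketch below copies the month coefficients from the total_riders summary output and finds the largest one. The baseline month (mnth1) is set to 0 because R's default treatment contrasts measure every other month relative to it.

```r
# Month coefficients copied from the total_riders model output;
# mnth1 is the baseline level and therefore has an implied coefficient of 0
month_coef <- c(
  mnth1 = 0,     mnth2 = 27.87, mnth3 = 534.8, mnth4 = 658.8,
  mnth5 = 926.4, mnth6 = 1100,  mnth7 = 1343,  mnth8 = 1102,
  mnth9 = 1426,  mnth10 = 1458, mnth11 = 1185, mnth12 = 877.0
)

# Month with the largest coefficient, and months ranked from high to low
peak_month   <- names(which.max(month_coef))
month_ranked <- sort(month_coef, decreasing = TRUE)
```

These values are adjusted effects, so they describe expected ridership with the other predictors held fixed rather than raw monthly totals.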

Interpretation of the promotion coefficient

The promotion variable has a coefficient of 1.962e+03, or approximately 1,962, with a p-value < 0.05, so the promotion appears to have a significant positive effect on total ridership in the bike sharing program at the 95% confidence level. However, the promotion had differing impacts on casual and registered riders.

## 
## Call:
## lm(formula = log(casual) ~ poly(temp, 3, raw = TRUE) + poly(humidity, 
##     2, raw = TRUE) + weathersit + Promotion, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3332 -0.3764 -0.1002  0.4032  1.6375 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.583e+00  4.607e-01   5.607 2.94e-08 ***
## poly(temp, 3, raw = TRUE)1      2.539e-01  5.686e-02   4.466 9.26e-06 ***
## poly(temp, 3, raw = TRUE)2     -1.765e-03  3.053e-03  -0.578   0.5633    
## poly(temp, 3, raw = TRUE)3     -7.245e-05  5.123e-05  -1.414   0.1577    
## poly(humidity, 2, raw = TRUE)1  1.786e-02  1.109e-02   1.610   0.1078    
## poly(humidity, 2, raw = TRUE)2 -2.088e-04  9.169e-05  -2.277   0.0231 *  
## weathersit2                    -2.341e-01  5.960e-02  -3.927 9.42e-05 ***
## weathersit3                    -1.396e+00  1.667e-01  -8.373 2.92e-16 ***
## Promotion1                      3.035e-01  4.529e-02   6.701 4.16e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5994 on 722 degrees of freedom
## Multiple R-squared:  0.6573, Adjusted R-squared:  0.6535 
## F-statistic: 173.1 on 8 and 722 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = registered ~ poly(temp, 3, raw = TRUE) + poly(humidity, 
##     2, raw = TRUE) + weathersit + Promotion, data = bikeshare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3076.4  -525.1   101.2   528.3  1952.2 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.258e+03  6.316e+02  -1.992 0.046721 *  
## poly(temp, 3, raw = TRUE)1     -4.201e+01  7.795e+01  -0.539 0.590137    
## poly(temp, 3, raw = TRUE)2      1.721e+01  4.186e+00   4.112 4.37e-05 ***
## poly(temp, 3, raw = TRUE)3     -4.153e-01  7.023e-02  -5.913 5.17e-09 ***
## poly(humidity, 2, raw = TRUE)1  7.526e+01  1.521e+01   4.949 9.30e-07 ***
## poly(humidity, 2, raw = TRUE)2 -6.839e-01  1.257e-01  -5.440 7.29e-08 ***
## weathersit2                    -2.874e+02  8.171e+01  -3.517 0.000463 ***
## weathersit3                    -1.031e+03  2.286e+02  -4.511 7.52e-06 ***
## Promotion1                      1.653e+03  6.209e+01  26.618  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 821.8 on 722 degrees of freedom
## Multiple R-squared:  0.7256, Adjusted R-squared:  0.7226 
## F-statistic: 238.7 on 8 and 722 DF,  p-value: < 2.2e-16

Holding all else constant, the promotion coefficient for casual riders is statistically significant at 0.3035 on the log scale, which corresponds to an increase of roughly 35% in the number of casual riders during promotions. For registered riders, the coefficient is about 1,653, which suggests that the impact of the promotion was largely due to the increase in registered riders. From the boxplots pictured below, we can see clear differences in the number of riders with and without promotions for both casual and registered riders.

To correct for nonlinearity, a log transformation of the response can be employed, as in the casual-rider model above; the same transformation could also be applied to registered riders.
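Because the casual-rider model uses log(casual) as the response, its promotion coefficient is a change on the log scale, not a count of riders. The sketch below converts the printed coefficient from the log(casual) output into a multiplicative effect and a percent change. The variable names are illustrative.

```r
# Promotion coefficient copied from the log(casual) model output
beta_promotion_log <- 0.3035

# Multiplicative effect on casual riders, holding other predictors fixed
multiplier <- exp(beta_promotion_log)

# Equivalent percent change in casual riders during a promotion
pct_change <- (multiplier - 1) * 100

round(pct_change, 1)
```

This percent change is not directly comparable to the registered-rider coefficient, which is expressed in riders per day because that model uses the untransformed response.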

Presentation of Findings

Based on the linear regression model, we can conclude that promotions had a significant effect on the BikeShare program, as measured by increased casual and registered riders in the Toronto area. However, to determine whether this increased ridership translated into greater revenues or profits, additional financial information on both revenues and costs would be required to assess the overall profitability of the promotional activities.