1/12/2022

Context

A US bike-sharing provider BoomBikes has recently suffered considerable dips in their revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue as soon as the ongoing lockdown comes to an end, and the economy restores to a healthy state.

In such an attempt, BoomBikes aspires to understand the demand for shared bikes among the people after this ongoing quarantine situation ends across the nation due to Covid-19. They have planned this to prepare themselves to cater to the people’s needs once the situation gets better all around and stand out from other service providers and make huge profits.

They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

Which variables are significant in predicting the demand for shared bikes. How well those variables describe the bike demand Based on various meteorological surveys and people’s styles, the service provider firm has gathered a large dataset on daily bike demands across the American market based on some factors.

Content

  • instant: ID
  • dteday: Date
  • season: 1-4
  • yr: year
  • mnth: month:1-14
  • holiday: 0-No,1-Yes
  • workingday:0-No,1-Yes
  • weathersit: weather situation 1-sunny 2-cloudy 3-rainny
  • temp: temperature on that day
  • atemp: average temperature on that day
  • hum: humidity
  • windspeed: speed of the wind on that day
  • causal: casual usage
  • registered: registered usage
  • cnt: total of casual and registered

Part One_Data Analysis

1. Load the libraries

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.3     ✓ dplyr   1.0.7
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   2.0.0     ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

2. Read the dataset

file <-'Shared_Bike_Demand.csv'
shared_bike <- read_csv(file,show_col_types = FALSE)
shared_bike %>% glimpse
Rows: 730
Columns: 16
$ instant    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ dteday     <chr> "01-01-2018", "02-01-2018", "03-01-2018", "04-01-2018", "05…
$ season     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ yr         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ mnth       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ holiday    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ weekday    <dbl> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4,…
$ workingday <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
$ weathersit <dbl> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
$ temp       <dbl> 14.110847, 14.902598, 8.050924, 8.200000, 9.305237, 8.37826…
$ atemp      <dbl> 18.181250, 17.686950, 9.470250, 10.606100, 11.463500, 11.66…
$ hum        <dbl> 80.5833, 69.6087, 43.7273, 59.0435, 43.6957, 51.8261, 49.86…
$ windspeed  <dbl> 10.749882, 16.652113, 16.636703, 10.739832, 12.522300, 6.00…
$ casual     <dbl> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
$ registered <dbl> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
$ cnt        <dbl> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…

3. Format the date for easy computation

shared_bike$date<- as.Date(shared_bike$dteday,format='%d-%m-%Y')
shared_bike <- shared_bike %>% select(-dteday)
shared_bike %>% str
tibble [730 × 16] (S3: tbl_df/tbl/data.frame)
 $ instant   : num [1:730] 1 2 3 4 5 6 7 8 9 10 ...
 $ season    : num [1:730] 1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : num [1:730] 0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : num [1:730] 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : num [1:730] 0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : num [1:730] 6 0 1 2 3 4 5 6 0 1 ...
 $ workingday: num [1:730] 0 0 1 1 1 1 1 0 0 1 ...
 $ weathersit: num [1:730] 2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num [1:730] 14.11 14.9 8.05 8.2 9.31 ...
 $ atemp     : num [1:730] 18.18 17.69 9.47 10.61 11.46 ...
 $ hum       : num [1:730] 80.6 69.6 43.7 59 43.7 ...
 $ windspeed : num [1:730] 10.7 16.7 16.6 10.7 12.5 ...
 $ casual    : num [1:730] 331 131 120 108 82 88 148 68 54 41 ...
 $ registered: num [1:730] 654 670 1229 1454 1518 ...
 $ cnt       : num [1:730] 985 801 1349 1562 1600 ...
 $ date      : Date[1:730], format: "2018-01-01" "2018-01-02" ...

4. Check the correlation between variables,here we focus on demand column

cor(shared_bike[-16])[,14:15]
            registered         cnt
instant     0.66054384  0.62989572
season      0.41031023  0.40458378
yr          0.59691062  0.56972847
mnth        0.29195163  0.27819093
holiday    -0.10914179 -0.06876375
weekday     0.05742736  0.06753406
workingday  0.30543728  0.06254175
weathersit -0.25902527 -0.29592862
temp        0.53943622  0.62704403
atemp       0.54367823  0.63068535
hum        -0.08921192 -0.09854288
windspeed  -0.21791436 -0.23513250
casual      0.39413716  0.67212341
registered  1.00000000  0.94541061
cnt         0.94541061  1.00000000

5.1 Plot01_Demand and Weekday

Rplot01

5.2 Plot02_Holiday and Demand by Season

Rplot02

5.3 Plot03_Temperature and Demand by Season

Rplot03

5.4 Plot04_Temperature and Demand by Season and Weather Situation

Rplot04

Part Two_Linear Regression Modelling

1. Split the dataset with train and test by 80/20

set.seed(20)
trainIndex <- sample(nrow(shared_bike),dim(shared_bike)[1]*0.8)
trainset <- shared_bike[trainIndex,]
testset <- shared_bike[-trainIndex,]
print(cat('trainset dimension:',dim(trainset)))
trainset dimension: 584 16NULL
print(cat('testset dimension:',dim(testset)))
testset dimension: 146 16NULL
print(names(shared_bike))
 [1] "instant"    "season"     "yr"         "mnth"       "holiday"   
 [6] "weekday"    "workingday" "weathersit" "temp"       "atemp"     
[11] "hum"        "windspeed"  "casual"     "registered" "cnt"       
[16] "date"      

2.1 Scenario One: All factors involved in predicting the demand for shared bike

Warning in summary.lm(mod1): essentially perfect fit: summary may be unreliable
Call:
lm(formula = cnt ~ ., data = shared_bike[trainIndex, ])

Residuals:
       Min         1Q     Median         3Q        Max 
-6.438e-12 -1.696e-13  9.200e-15  1.767e-13  3.258e-12 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -1.405e-12  1.671e-13 -8.408e+00 3.36e-16 ***
instant     -1.598e-15  2.344e-15 -6.820e-01 0.495531    
season       9.907e-14  3.644e-14  2.719e+00 0.006750 ** 
yr           4.589e-13  8.622e-13  5.320e-01 0.594803    
mnth        -6.567e-15  7.239e-14 -9.100e-02 0.927759    
holiday     -3.056e-13  1.250e-13 -2.445e+00 0.014786 *  
weekday      5.133e-15  1.036e-14  4.950e-01 0.620455    
workingday  -1.713e-13  7.831e-14 -2.188e+00 0.029096 *  
weathersit   3.541e-14  5.474e-14  6.470e-01 0.517977    
temp        -6.564e-14  3.474e-14 -1.890e+00 0.059328 .  
atemp        6.593e-14  3.279e-14  2.010e+00 0.044854 *  
hum          7.218e-15  2.159e-15  3.343e+00 0.000883 ***
windspeed    2.623e-14  4.485e-15  5.848e+00 8.38e-09 ***
casual       1.000e+00  5.696e-17  1.756e+16  < 2e-16 ***
registered   1.000e+00  3.229e-17  3.097e+16  < 2e-16 ***
date                NA         NA         NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.961e-13 on 569 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 6.408e+32 on 14 and 569 DF,  p-value: < 2.2e-16

2.2 Scenario Two: Select five factors to predict demand include: temperature,month,weather situation,humility,windspeed.

mod2 <- lm(cnt~atemp+mnth+weathersit+hum+windspeed,data=shared_bike[trainIndex,])
pred2 <- predict(mod1,data=testset)
error2 <- pred2 - testset$cnt
rmse2 <- sqrt(mean(error2^2))
summary(mod2)
Call:
lm(formula = cnt ~ atemp + mnth + weathersit + hum + windspeed, 
    data = shared_bike[trainIndex, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-4359.7  -961.9  -177.9  1082.5  3672.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3317.873    380.964   8.709  < 2e-16 ***
atemp        145.198      7.416  19.579  < 2e-16 ***
mnth          84.205     17.349   4.854 1.56e-06 ***
weathersit  -408.428    142.609  -2.864  0.00434 ** 
hum          -24.785      5.737  -4.320 1.84e-05 ***
windspeed    -51.638     11.802  -4.375 1.44e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1365 on 578 degrees of freedom
Multiple R-squared:  0.5122,    Adjusted R-squared:  0.5079 
F-statistic: 121.4 on 5 and 578 DF,  p-value: < 2.2e-16

2.3 Scenario Three: Select three factors to predict demand include: temperature,month,weather situation

mod3 <- lm(cnt~atemp+mnth+weathersit,data=shared_bike[trainIndex,])
pred3 <- predict(mod3,data=testset)
error3 <- pred3 - testset$cnt
rmse3 <- sqrt(mean(error3^2))
summary(mod3)
Call:
lm(formula = cnt ~ atemp + mnth + weathersit, data = shared_bike[trainIndex, 
    ])

Residuals:
    Min      1Q  Median      3Q     Max 
-4187.7 -1016.2  -178.2  1089.2  3873.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1738.987    254.564   6.831 2.13e-11 ***
atemp        142.799      7.363  19.395  < 2e-16 ***
mnth          81.664     17.339   4.710 3.11e-06 ***
weathersit  -813.724    109.552  -7.428 3.97e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1398 on 580 degrees of freedom
Multiple R-squared:  0.4868,    Adjusted R-squared:  0.4842 
F-statistic: 183.4 on 3 and 580 DF,  p-value: < 2.2e-16

2.4 Scenario Four: Based on the scenario three,adding extra 2 factors:weekday and holiday to predict demand.

mod4 <- lm(cnt~atemp+mnth+weathersit+weekday+holiday,data=shared_bike[trainIndex,])
pred4 <- predict(mod4,data=testset)
error4 <- pred4 - testset$cnt
rmse4 <- sqrt(mean(error4^2))
summary(mod4)
Call:
lm(formula = cnt ~ atemp + mnth + weathersit + weekday + holiday, 
    data = shared_bike[trainIndex, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-4324.1 -1038.4  -172.8  1094.5  3734.5 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1668.831    267.007   6.250 7.97e-10 ***
atemp        141.342      7.354  19.219  < 2e-16 ***
mnth          83.049     17.276   4.807 1.95e-06 ***
weathersit  -822.939    109.132  -7.541 1.82e-13 ***
weekday       43.365     28.697   1.511   0.1313    
holiday     -684.745    335.705  -2.040   0.0418 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1392 on 578 degrees of freedom
Multiple R-squared:  0.493, Adjusted R-squared:  0.4886 
F-statistic: 112.4 on 5 and 578 DF,  p-value: < 2.2e-16

2.5 Conclusion: We prefer the model3 to predict the demand for shared bike as its accuracy is highest.

print(cat(rmse1,rmse2,rmse3,rmse4))
2730.24 2730.24 2359.951 2367.975NULL
print(cat('The minimal error:',min(rmse1,rmse2,rmse3,rmse4)))
The minimal error: 2359.951NULL