Bike-sharing systems are a new generation of traditional bike rentals
in which the whole process, from membership to rental and return, has
become automatic. Through these systems, a user can easily rent
a bike at one location and return it at another.
Today, there is great interest in these systems due to their
important role in traffic, environmental, and health issues.
Apart from the interesting real-world applications of bike-sharing
systems, the characteristics of the data they generate
make them attractive for research. Features such as the duration
of travel, the departure and arrival positions, and the total number of bikes rented
turn a bike-sharing system into a virtual sensor network that can be
used for sensing mobility in the city. Hence, it is expected that the
most important events in the city could be detected by monitoring these
data.
Capital Bikeshare has more than 4300 bikes available at 500 stations
across 7 jurisdictions. With that number, Capital Bikeshare provides
residents and visitors with a convenient, fun, and affordable
transportation option for getting from point A to point B. People use
Capital Bikeshare to commute to work or school, run errands, get to
appointments or social engagements and more.
We aggregated the data on a daily basis, limited the scope to a single system (Capital Bikeshare), and focused only on the number of bikes rented.
instant: record index
dteday: date of transaction
season: number representing season (1:Spring, 2:Summer, 3:Fall, 4:Winter)
yr: number representing year (0:2011, 1:2012)
mnth: number representing month (1:January to 12:December)
hr: number representing hour (0 to 23)
holiday: number representing holiday status (0:Not Holiday, 1:Holiday)
weekday: number representing day of the week (0:Sunday, 1:Monday, 2:Tuesday, 3:Wednesday, 4:Thursday, 5:Friday, 6:Saturday)
workingday: whether the day is a working day or a weekend/holiday (0:Weekend/Holiday, 1:Working Day)
weathersit: weather condition (1:Clear, Few clouds, Partly cloudy; 2:Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3:Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4:Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog)
temp: normalized temperature in Celsius. The values are divided by 41 (max)
atemp: normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum: normalized humidity. The values are divided by 100 (max)
windspeed: normalized wind speed. The values are divided by 67 (max)
casual: count of casual users (non-member users)
registered: count of registered users (member users)
cnt: count of total rental bikes, including both casual and registered
## [1] 0
## [1] 0
As the output above shows, this data frame has no
missing values and no duplicated rows.
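The chunk that produced the two zeros is not shown in the source. A minimal sketch of one common way to get such counts, assuming base R's `is.na()` and `duplicated()` and using a small toy data frame (hypothetical values, not the real `bike` data):

```r
# Toy data frame standing in for `bike`; the values are made up for illustration.
toy <- data.frame(
  temp = c(0.34, 0.36, 0.20, 0.20),
  cnt  = c(985, 801, 1349, 1349)
)

n_missing   <- sum(is.na(toy))       # total number of NA cells in the data frame
n_duplicate <- sum(duplicated(toy))  # rows that repeat an earlier row exactly

n_missing
n_duplicate
```

In this toy example the last row repeats the third one, so the duplicate count is non-zero. On the real data both counts are reported as 0.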
Our target column is cnt, which we want to predict using
the other columns as predictors. There are several columns we might
consider dropping. One is atemp, because it has a
similar meaning to temp and the values are almost
alike. We might also drop registered
and casual, since cnt is their sum and we only focus on the number of bikes rented as a
whole. For this model, we also do not need instant
and dteday as predictors.
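The column selection described above can be sketched in base R. This is only an illustration on a toy data frame with made-up rows. The name `drop_cols` is hypothetical, and the original report may have used a different approach such as `dplyr::select()`:

```r
# Toy rows standing in for `bike`; values are illustrative only.
toy <- data.frame(
  instant    = 1:3,
  dteday     = as.Date("2011-01-01") + 0:2,
  temp       = c(0.34, 0.36, 0.20),
  atemp      = c(0.36, 0.35, 0.19),
  casual     = c(331, 131, 120),
  registered = c(654, 670, 1229),
  cnt        = c(985, 801, 1349)
)

# casual + registered reproduces cnt, so keeping them would leak the target.
target_is_sum <- all(toy$cnt == toy$casual + toy$registered)

# Drop the identifier, the date, the near-duplicate of temp, and the two components of cnt.
drop_cols <- c("instant", "dteday", "atemp", "casual", "registered")
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy_clean)
```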
Next, we explore the data by looking at the
correlation between columns.
We found that the columns
temp and yr have a high
positive correlation with cnt.
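The correlation plot itself is not reproduced here. As a hedged sketch of the same idea, the base R function `cor()` returns a correlation matrix from which the column for the target can be read. The data frame below is a toy example with made-up values, not the real `bike` data:

```r
# Toy numeric data; values are invented so that temp rises with cnt and hum falls.
toy <- data.frame(
  temp = c(0.2, 0.3, 0.5, 0.6, 0.8),
  hum  = c(0.80, 0.70, 0.60, 0.65, 0.50),
  cnt  = c(1000, 1500, 3000, 3800, 5200)
)

cor_matrix   <- cor(toy)            # Pearson correlation between every pair of columns
cor_with_cnt <- cor_matrix[, "cnt"] # correlation of each column with the target

sort(cor_with_cnt, decreasing = TRUE)
```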
For the base model, we fit a model that includes all the predictors.
##
## Call:
## lm(formula = cnt ~ ., data = bike)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4072.3 -440.0 34.9 551.7 2992.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1626.11 230.47 7.055 0.00000000000404772 ***
## season 515.24 54.86 9.392 < 0.0000000000000002 ***
## yr 2040.88 65.37 31.221 < 0.0000000000000002 ***
## mnth -39.81 17.12 -2.325 0.02037 *
## holiday -536.84 201.45 -2.665 0.00787 **
## weekday 67.21 16.32 4.117 0.00004276871007799 ***
## workingday 119.21 72.21 1.651 0.09919 .
## weathersit -622.29 78.42 -7.935 0.00000000000000805 ***
## temp 5154.22 195.03 26.428 < 0.0000000000000002 ***
## hum -960.15 313.79 -3.060 0.00230 **
## windspeed -2730.42 451.02 -6.054 0.00000000227467350 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 874.9 on 720 degrees of freedom
## Multiple R-squared: 0.7988, Adjusted R-squared: 0.796
## F-statistic: 285.9 on 10 and 720 DF, p-value: < 0.00000000000000022
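💡 Insight:
season, yr, weekday, weathersit, temp and windspeed are the most significant predictors (p < 0.001).
With an adjusted R-squared of 0.796, about 79.6% of the variance in cnt can be explained by the predictor columns.
library(dplyr) # provides %>% and mutate()
# actual values, fitted values and the prediction error of model_base
res <- data.frame(aktual = bike$cnt,
                  prediksi = model_base$fitted.values) %>%
  mutate(error = prediksi - aktual)
range(bike$cnt)
## [1] 22 8714
## [1] 648.7803
## [1] 44.87518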
💡 Insight:
The Mean Absolute Error (MAE) of model_base is 648.7803,
and our data range is 22 to 8714.
The MAE lies within the data range but is far larger
than our minimum value, so we can assume that
model_base has a fairly high level of error and we might
consider building a new model.
The Mean Absolute Percentage Error (MAPE) of model_base
is 44.87518%.
This value is above the commonly accepted threshold
of 20%, which supports the MAE finding
that model_base has a fairly high level of error and that
we might consider building a new model.
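For reference, both metrics can be computed by hand from the actual and predicted values. The sketch below uses base R only and four made-up observations. The original report may have used helper functions from a package, which are not shown in the source:

```r
# Made-up actual and predicted values, named as in the `res` data frame above.
aktual   <- c(100, 200, 400, 800)
prediksi <- c(110, 180, 400, 1000)

error <- prediksi - aktual               # 10, -20, 0, 200

mae  <- mean(abs(error))                 # average absolute error, in the units of the target
mape <- mean(abs(error / aktual)) * 100  # average absolute error relative to the actual value, in percent

mae
mape
```

Because MAPE divides by the actual value, days with very small counts (the minimum here is 22) can contribute very large percentage errors, which is one possible reason the MAPE looks much worse than the MAE suggests.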
Model Backward
## Start: AIC=9914.61
## cnt ~ season + yr + mnth + holiday + weekday + workingday + weathersit +
## temp + hum + windspeed
##
## Df Sum of Sq RSS AIC
## <none> 551085312 9914.6
## - workingday 1 2086187 553171498 9915.4
## - mnth 1 4136241 555221552 9918.1
## - holiday 1 5435628 556520940 9919.8
## - hum 1 7166299 558251611 9922.1
## - weekday 1 12975371 564060682 9929.6
## - windspeed 1 28051752 579137063 9948.9
## - weathersit 1 48194967 599280278 9973.9
## - season 1 67520352 618605663 9997.1
## - temp 1 534571941 1085657252 10408.3
## - yr 1 746080084 1297165396 10538.4
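In the trace above, the `<none>` row has the lowest AIC, so backward elimination stops without dropping any predictor. The call that produced it is not shown in the source. A minimal sketch of backward selection with `step()`, using the built-in `mtcars` data as a stand-in (the names `full_demo` and `backward_demo` are hypothetical):

```r
# Full model on a built-in data set; this is a stand-in, not the bike data.
full_demo <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination: repeatedly drop the term whose removal lowers AIC the most,
# and stop when no removal improves AIC. trace = 0 suppresses the step-by-step log.
backward_demo <- step(full_demo, direction = "backward", trace = 0)

formula(backward_demo)           # predictors that survived
extractAIC(full_demo)[2]         # AIC of the starting model (same scale step() uses)
extractAIC(backward_demo)[2]     # AIC of the selected model
```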
Model Forward
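# intercept-only model, used as the starting point for forward selection
model_zero <- lm(formula = cnt ~ 1,
                 data = bike)
# add one predictor at a time, up to the full model_base
model_forward <- step(object = model_zero,
                      direction = "forward",
                      scope = list(lower = model_zero, upper = model_base))
## Start: AIC=11066.88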
## cnt ~ 1
##
## Df Sum of Sq RSS AIC
## + temp 1 1078688585 1660846807 10703
## + yr 1 879828893 1859706499 10786
## + season 1 451797359 2287738033 10937
## + weathersit 1 242288753 2497246639 11001
## + mnth 1 214744463 2524790929 11009
## + windspeed 1 150705556 2588829836 11028
## + hum 1 27757373 2711778019 11061
## + holiday 1 12797494 2726737898 11066
## + weekday 1 12461089 2727074303 11066
## + workingday 1 10246038 2729289354 11066
## <none> 2739535392 11067
##
## Step: AIC=10703.05
## cnt ~ temp
##
## Df Sum of Sq RSS AIC
## + yr 1 791315919 869530888 10232
## + weathersit 1 136655317 1524191489 10642
## + season 1 118871554 1541975253 10651
## + hum 1 90543304 1570303503 10664
## + mnth 1 57891346 1602955461 10679
## + windspeed 1 51536710 1609310097 10682
## + weekday 1 12500530 1648346276 10700
## + holiday 1 6972634 1653874173 10702
## <none> 1660846807 10703
## + workingday 1 2171087 1658675720 10704
##
## Step: AIC=10232
## cnt ~ temp + yr
##
## Df Sum of Sq RSS AIC
## + season 1 130771196 738759692 10115
## + weathersit 1 109837745 759693143 10135
## + mnth 1 63419864 806111024 10179
## + windspeed 1 49792557 819738331 10191
## + hum 1 39008653 830522235 10200
## + weekday 1 13610370 855920517 10222
## + holiday 1 8427999 861102889 10227
## + workingday 1 2562980 866967908 10232
## <none> 869530888 10232
##
## Step: AIC=10114.86
## cnt ~ temp + yr + season
##
## Df Sum of Sq RSS AIC
## + weathersit 1 125925220 612834472 9980.2
## + hum 1 69803991 668955701 10044.3
## + windspeed 1 24796419 713963273 10091.9
## + weekday 1 13891347 724868345 10103.0
## + holiday 1 8369993 730389699 10108.5
## + mnth 1 6673916 732085776 10110.2
## + workingday 1 2769417 735990275 10114.1
## <none> 738759692 10114.9
##
## Step: AIC=9980.25
## cnt ~ temp + yr + season + weathersit
##
## Df Sum of Sq RSS AIC
## + windspeed 1 21239104 591595368 9956.5
## + weekday 1 16654684 596179789 9962.1
## + holiday 1 11036594 601797878 9969.0
## + workingday 1 5946323 606888149 9975.1
## + mnth 1 4779267 608055205 9976.5
## + hum 1 3390584 609443888 9978.2
## <none> 612834472 9980.2
##
## Step: AIC=9956.47
## cnt ~ temp + yr + season + weathersit + windspeed
##
## Df Sum of Sq RSS AIC
## + weekday 1 17150548 574444820 9937.0
## + hum 1 11095661 580499707 9944.6
## + holiday 1 10945629 580649739 9944.8
## + mnth 1 5694272 585901096 9951.4
## + workingday 1 5631889 585963479 9951.5
## <none> 591595368 9956.5
##
## Step: AIC=9936.96
## cnt ~ temp + yr + season + weathersit + windspeed + weekday
##
## Df Sum of Sq RSS AIC
## + hum 1 8750655 565694165 9927.7
## + holiday 1 8440294 566004526 9928.1
## + mnth 1 6129510 568315310 9931.1
## + workingday 1 4990152 569454668 9932.6
## <none> 574444820 9937.0
##
## Step: AIC=9927.74
## cnt ~ temp + yr + season + weathersit + windspeed + weekday +
## hum
##
## Df Sum of Sq RSS AIC
## + holiday 1 8313014 557381152 9918.9
## + mnth 1 4825298 560868867 9923.5
## + workingday 1 4573748 561120417 9923.8
## <none> 565694165 9927.7
##
## Step: AIC=9918.92
## cnt ~ temp + yr + season + weathersit + windspeed + weekday +
## hum + holiday
##
## Df Sum of Sq RSS AIC
## + mnth 1 4209653 553171498 9915.4
## + workingday 1 2159599 555221552 9918.1
## <none> 557381152 9918.9
##
## Step: AIC=9915.38
## cnt ~ temp + yr + season + weathersit + windspeed + weekday +
## hum + holiday + mnth
##
## Df Sum of Sq RSS AIC
## + workingday 1 2086187 551085312 9914.6
## <none> 553171498 9915.4
##
## Step: AIC=9914.61
## cnt ~ temp + yr + season + weathersit + windspeed + weekday +
## hum + holiday + mnth + workingday
Model Both
## Start: AIC=11066.88
## cnt ~ 1
##
## Df Sum of Sq RSS AIC
## + temp 1 1078688585 1660846807 10703
## + yr 1 879828893 1859706499 10786
## + season 1 451797359 2287738033 10937
## + weathersit 1 242288753 2497246639 11001
## + mnth 1 214744463 2524790929 11009
## + windspeed 1 150705556 2588829836 11028
## + hum 1 27757373 2711778019 11061
## + holiday 1 12797494 2726737898 11066
## + weekday 1 12461089 2727074303 11066
## + workingday 1 10246038 2729289354 11066
## <none> 2739535392 11067
##
## Step: AIC=10703.05
## cnt ~ temp
##
## Df Sum of Sq RSS AIC
## + yr 1 791315919 869530888 10232
## + weathersit 1 136655317 1524191489 10642
## + season 1 118871554 1541975253 10651
## + hum 1 90543304 1570303503 10664
## + mnth 1 57891346 1602955461 10679
## + windspeed 1 51536710 1609310097 10682
## + weekday 1 12500530 1648346276 10700
## + holiday 1 6972634 1653874173 10702
## <none> 1660846807 10703
## + workingday 1 2171087 1658675720 10704
## - temp 1 1078688585 2739535392 11067
##
## Step: AIC=10232
## cnt ~ temp + yr
##
## Df Sum of Sq RSS AIC
## + season 1 130771196 738759692 10115
## + weathersit 1 109837745 759693143 10135
## + mnth 1 63419864 806111024 10179
## + windspeed 1 49792557 819738331 10191
## + hum 1 39008653 830522235 10200
## + weekday 1 13610370 855920517 10222
## + holiday 1 8427999 861102889 10227
## + workingday 1 2562980 866967908 10232
## <none> 869530888 10232
## - yr 1 791315919 1660846807 10703
## - temp 1 990175611 1859706499 10786
##
## Step: AIC=10114.86
## cnt ~ temp + yr + season
##
## Df Sum of Sq RSS AIC
## + weathersit 1 125925220 612834472 9980.2
## + hum 1 69803991 668955701 10044.3
## + windspeed 1 24796419 713963273 10091.9
## + weekday 1 13891347 724868345 10103.0
## + holiday 1 8369993 730389699 10108.5
## + mnth 1 6673916 732085776 10110.2
## + workingday 1 2769417 735990275 10114.1
## <none> 738759692 10114.9
## - season 1 130771196 869530888 10232.0
## - temp 1 666819270 1405578962 10583.1
## - yr 1 803215561 1541975253 10650.8
##
## Step: AIC=9980.25
## cnt ~ temp + yr + season + weathersit
##
## Df Sum of Sq RSS AIC
## + windspeed 1 21239104 591595368 9956.5
## + weekday 1 16654684 596179789 9962.1
## + holiday 1 11036594 601797878 9969.0
## + workingday 1 5946323 606888149 9975.1
## + mnth 1 4779267 608055205 9976.5
## + hum 1 3390584 609443888 9978.2
## <none> 612834472 9980.2
## - weathersit 1 125925220 738759692 10114.9
## - season 1 146858671 759693143 10135.3
## - temp 1 581108156 1193942628 10465.8
## - yr 1 775161296 1387995768 10575.9
##
## Step: AIC=9956.47
## cnt ~ temp + yr + season + weathersit + windspeed
##
## Df Sum of Sq RSS AIC
## + weekday 1 17150548 574444820 9937.0
## + hum 1 11095661 580499707 9944.6
## + holiday 1 10945629 580649739 9944.8
## + mnth 1 5694272 585901096 9951.4
## + workingday 1 5631889 585963479 9951.5
## <none> 591595368 9956.5
## - windspeed 1 21239104 612834472 9980.2
## - season 1 121300601 712895969 10090.8
## - weathersit 1 122367905 713963273 10091.9
## - temp 1 558847870 1150443238 10440.6
## - yr 1 773416543 1365011911 10565.7
##
## Step: AIC=9936.96
## cnt ~ temp + yr + season + weathersit + windspeed + weekday
##
## Df Sum of Sq RSS AIC
## + hum 1 8750655 565694165 9927.7
## + holiday 1 8440294 566004526 9928.1
## + mnth 1 6129510 568315310 9931.1
## + workingday 1 4990152 569454668 9932.6
## <none> 574444820 9937.0
## - weekday 1 17150548 591595368 9956.5
## - windspeed 1 21734968 596179789 9962.1
## - season 1 121545623 695990443 10075.3
## - weathersit 1 125098222 699543042 10079.0
## - temp 1 557570981 1132015802 10430.8
## - yr 1 774349655 1348794476 10558.9
##
## Step: AIC=9927.74
## cnt ~ temp + yr + season + weathersit + windspeed + weekday +
## hum
##
## Df Sum of Sq RSS AIC
## + holiday 1 8313014 557381152 9918.9
## + mnth 1 4825298 560868867 9923.5
## + workingday 1 4573748 561120417 9923.8
## <none> 565694165 9927.7
## - hum 1 8750655 574444820 9937.0
## - weekday 1 14805542 580499707 9944.6
## - windspeed 1 28328617 594022782 9961.5
## - weathersit 1 45002552 610696718 9981.7
## - season 1 127670461 693364627 10074.5
## - temp 1 564980031 1130674196 10432.0
## - yr 1 742780054 1308474220 10538.7
##
## Step: AIC=9918.92
## cnt ~ temp + yr + season + weathersit + windspeed + weekday +
## hum + holiday
##
## Df Sum of Sq RSS AIC
## + mnth 1 4209653 553171498 9915.4
## + workingday 1 2159599 555221552 9918.1
## <none> 557381152 9918.9
## - holiday 1 8313014 565694165 9927.7
## - hum 1 8623374 566004526 9928.1
## - weekday 1 12536921 569918073 9933.2
## - windspeed 1 28134323 585515475 9952.9
## - weathersit 1 46194745 603575897 9975.1
## - season 1 127728592 685109743 10067.7
## - temp 1 560060226 1117441377 10425.4
## - yr 1 744019894 1301401046 10536.8
##
## Step: AIC=9915.38
## cnt ~ temp + yr + season + weathersit + windspeed + weekday +
## hum + holiday + mnth
##
## Df Sum of Sq RSS AIC
## + workingday 1 2086187 551085312 9914.6
## <none> 553171498 9915.4
## - mnth 1 4209653 557381152 9918.9
## - hum 1 7409883 560581381 9923.1
## - holiday 1 7697369 560868867 9923.5
## - weekday 1 13043833 566215332 9930.4
## - windspeed 1 28417536 581589035 9950.0
## - weathersit 1 47074882 600246381 9973.1
## - season 1 67645571 620817070 9997.7
## - temp 1 539950234 1093121733 10411.3
## - yr 1 745809916 1298981414 10537.4
##
## Step: AIC=9914.61
## cnt ~ temp + yr + season + weathersit + windspeed + weekday +
## hum + holiday + mnth + workingday
##
## Df Sum of Sq RSS AIC
## <none> 551085312 9914.6
## - workingday 1 2086187 553171498 9915.4
## - mnth 1 4136241 555221552 9918.1
## - holiday 1 5435628 556520940 9919.8
## - hum 1 7166299 558251611 9922.1
## - weekday 1 12975371 564060682 9929.6
## - windspeed 1 28051752 579137063 9948.9
## - weathersit 1 48194967 599280278 9973.9
## - season 1 67520352 618605663 9997.1
## - temp 1 534571941 1085657252 10408.3
## - yr 1 746080084 1297165396 10538.4
library(performance) # provides compare_performance()
comparison <- compare_performance(model_base,
                                  model_zero,
                                  model_backward,
                                  model_forward,
                                  model_both)
comparison
Criteria for the best model:
💡 Insight:
💡 Insight:
model_base fulfils the linearity
assumption.
To check whether the residuals are normally distributed, we
can use a histogram.
Besides the histogram, we can use another method
called the Shapiro-Wilk normality test, a
statistical test of whether data follow a normal distribution.
In this case, we test whether the residuals are normally
distributed.
We use shapiro.test().
The hypotheses to be tested:
H0: the residuals are normally distributed
H1: the residuals are not normally distributed
With the Shapiro-Wilk normality test, we determine the p-value.
##
## Shapiro-Wilk normality test
##
## data: model_base$residuals
## W = 0.97144, p-value = 0.00000000009473
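As a small, self-contained illustration of how such a result is obtained and read, the sketch below runs `shapiro.test()` on the residuals of a model fitted to the built-in `mtcars` data (a stand-in, not the bike data). The 0.05 threshold is the usual convention and is an assumption here:

```r
# Stand-in model on built-in data.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test on the residuals; H0 is that they are normally distributed.
sw <- shapiro.test(residuals(fit_demo))

sw$statistic   # W statistic, close to 1 when the data look normal
sw$p.value     # p-value of the test

# Reject normality when the p-value falls below the chosen significance level.
reject_normality <- sw$p.value < 0.05
reject_normality
```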
💡 Insight:
Because the p-value is below 0.05, the residuals of model_base are not normally distributed.
One possible remedy is to transform the target cnt using log() or sqrt().
To check whether the residuals are scattered randomly with constant variance, the first method is a plot:
# scatter plot of fitted values against residuals
plot(model_base$fitted.values, y = model_base$residuals)
abline(h = 0, col = "red") # horizontal line at 0
The second method is the Breusch-Pagan statistical test, which checks whether the residuals are homoscedastic (constant variance) or not.
Hypotheses to be tested:
H0: the residual variance is constant (homoscedasticity)
H1: the residual variance is not constant (heteroscedasticity)
With the Breusch-Pagan statistical test, we determine the p-value:
##
## studentized Breusch-Pagan test
##
## data: model_base
## BP = 67.403, df = 10, p-value = 0.0000000001403
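The output header above matches what `bptest()` from the lmtest package prints, although the original call is not shown. To make the idea behind the statistic concrete, here is a hedged base R sketch of the studentized (Koenker) version: regress the squared residuals on the same predictors and use n times the R-squared of that auxiliary regression. It runs on the built-in `mtcars` data, and all object names are hypothetical:

```r
# Stand-in model on built-in data.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals explained by the same predictors.
sq_res <- residuals(fit_demo)^2
aux    <- lm(sq_res ~ wt + hp, data = mtcars)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with one df per predictor.
n_obs   <- nrow(mtcars)
bp_stat <- n_obs * summary(aux)$r.squared
bp_df   <- 2
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

bp_stat
p_value
```

A small p-value means the predictors help explain the size of the residuals, which is evidence of heteroscedasticity.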
💡 Insight:
Because the p-value is below 0.05, the residuals of model_base do not have constant variance
(heteroscedasticity).
To test for multicollinearity among our predictors, we can use the VIF (variance inflation factor) test.
## season yr mnth holiday weekday workingday weathersit
## 3.541429 1.020251 3.332130 1.081437 1.021463 1.076338 1.741541
## temp hum windspeed
## 1.215589 1.905040 1.165206
From our VIF test, all of our predictors have VIF values < 10, so no multicollinearity is found.
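The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The source does not show which function produced the values above. The sketch below computes the same quantity by hand in base R on the built-in `mtcars` data, with hypothetical object names:

```r
# Predictors of a stand-in model mpg ~ wt + hp + disp on built-in data.
predictors <- c("wt", "hp", "disp")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress this predictor on the remaining ones and take the R-squared.
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)  # grows as the predictor becomes more collinear with the others
})

vif_manual
```

A VIF of 1 means the predictor is uncorrelated with the others. The cutoff of 10 used in this report is a common rule of thumb.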
# transform the target variable using sqrt()
bike_sqrt <- bike %>%
  mutate(cnt_sqrt = sqrt(cnt)) %>%
  select(-cnt) # drop the untransformed target so it is not used as a predictor
model_sqrt <- lm(formula = cnt_sqrt ~ .,
                 data = bike_sqrt)
summary(model_sqrt)
##
## Call:
## lm(formula = cnt_sqrt ~ ., data = bike_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.402 -3.485 0.852 4.565 22.058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.2754 1.9233 21.981 < 0.0000000000000002 ***
## season 4.4570 0.4578 9.736 < 0.0000000000000002 ***
## yr 15.4213 0.5455 28.271 < 0.0000000000000002 ***
## mnth -0.3468 0.1429 -2.427 0.015465 *
## holiday -4.7078 1.6810 -2.801 0.005239 **
## weekday 0.4712 0.1362 3.459 0.000575 ***
## workingday 1.2896 0.6026 2.140 0.032677 *
## weathersit -5.5435 0.6544 -8.471 < 0.0000000000000002 ***
## temp 42.3617 1.6275 26.029 < 0.0000000000000002 ***
## hum -7.4772 2.6185 -2.856 0.004420 **
## windspeed -22.7620 3.7636 -6.048 0.00000000236 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.301 on 720 degrees of freedom
## Multiple R-squared: 0.7883, Adjusted R-squared: 0.7853
## F-statistic: 268.1 on 10 and 720 DF, p-value: < 0.00000000000000022
res2 <- data.frame(aktual2 = bike_sqrt$cnt_sqrt,
                   prediksi2 = model_sqrt$fitted.values) %>%
  mutate(error2 = prediksi2 - aktual2)
range(bike_sqrt$cnt_sqrt)
## [1] 4.690416 93.348808
## [1] 5.290285
## [1] 10.78321
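These two values are measured on the square-root scale, so they are not in the same units as the MAE and MAPE of model_base. One option for a like-for-like comparison, sketched here as an assumption rather than something done in the original analysis, is to square the fitted values and compute the errors against the untransformed target. The example uses the built-in `mtcars` data and hypothetical object names:

```r
# Stand-in model with a square-root-transformed target on built-in data.
fit_sqrt_demo <- lm(sqrt(mpg) ~ wt + hp, data = mtcars)

# Errors on the transformed scale (what the model was fitted on).
mae_sqrt_scale <- mean(abs(fitted(fit_sqrt_demo) - sqrt(mtcars$mpg)))

# Back-transform the predictions by squaring, then compare with the original target.
# Simple squaring ignores retransformation bias; it is only a rough comparison.
pred_original <- fitted(fit_sqrt_demo)^2
mae_original  <- mean(abs(pred_original - mtcars$mpg))
mape_original <- mean(abs((pred_original - mtcars$mpg) / mtcars$mpg)) * 100

mae_sqrt_scale
mae_original
mape_original
```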
💡 Insight:
The Mean Absolute Error of model_sqrt is 5.290285 (on the square-root scale) and
our data range is 4.690416 to 93.348808.
model_sqrt has improved on model_base in
terms of error, reaching about 6% of error.
💡 Insight:
model_base fulfils the linearity
assumption.
To check whether the residuals are normally distributed, we
can use a histogram.
Besides the histogram, we can use another method
called the Shapiro-Wilk normality test, a
statistical test of whether data follow a normal distribution.
In this case, we test whether the residuals are normally
distributed.
We use shapiro.test().
The hypotheses to be tested:
H0: the residuals are normally distributed
H1: the residuals are not normally distributed
With the Shapiro-Wilk normality test, we determine the p-value.
##
## Shapiro-Wilk normality test
##
## data: model_sqrt$residuals
## W = 0.93279, p-value < 0.00000000000000022
💡 Insight:
Because the p-value is below 0.05, the residuals of model_sqrt are not normally distributed.
To check whether the residuals are scattered randomly with constant variance, the first method is a plot:
# scatter plot of fitted values against residuals
plot(model_sqrt$fitted.values, y = model_sqrt$residuals)
abline(h = 0, col = "red") # horizontal line at 0
The second method is the Breusch-Pagan statistical test, which checks whether the residuals are homoscedastic (constant variance) or not.
Hypotheses to be tested:
H0: the residual variance is constant (homoscedasticity)
H1: the residual variance is not constant (heteroscedasticity)
With the Breusch-Pagan statistical test, we determine the p-value:
##
## studentized Breusch-Pagan test
##
## data: model_sqrt
## BP = 36.453, df = 10, p-value = 0.00007034
💡 Insight:
Because the p-value is below 0.05, the residuals of model_sqrt do not have constant variance
(heteroscedasticity).
To test for multicollinearity among our predictors, we can use the VIF (variance inflation factor) test.
## season yr mnth holiday weekday workingday weathersit
## 3.541429 1.020251 3.332130 1.081437 1.021463 1.076338 1.741541
## temp hum windspeed
## 1.215589 1.905040 1.165206
From our VIF test, all of our predictors have VIF values < 10, so no multicollinearity is found.
A work by Taufan Anggoro Adhi
tf.anggoro@gmail.com