INTRODUCTION

Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Today, there is great interest in these systems because of their important role in traffic, environmental, and health issues.

Apart from their interesting real-world applications, bike-sharing systems generate data whose characteristics make them attractive for research. Features such as trip duration, departure and arrival positions, and the total number of bikes rented turn a bike-sharing system into a virtual sensor network that can be used to sense mobility in the city. Hence, it is expected that the most important events in the city could be detected by monitoring these data.

Capital Bikeshare has more than 4300 bikes available at 500 stations across 7 jurisdictions. At that scale, Capital Bikeshare provides residents and visitors with a convenient, fun, and affordable transportation option for getting from point A to point B. People use Capital Bikeshare to commute to work or school, run errands, get to appointments or social engagements, and more.

DATA UNDERSTANDING

We aggregated the data on a daily basis, limited the scope to a single system, Capital Bikeshare, and focused only on the number of bikes rented.

dataset <- read.csv("data_input/day.csv")
options(scipen = 9999)
head(dataset)

COLUMNS EXPLANATION

instant: recorded index
dteday: date of transaction
season: number representing season (1:Spring, 2:Summer, 3:Fall, 4:Winter)
yr: number representing year (0:2011, 1:2012)
mnth: number representing Month (1:January to 12:December)
hr: number representing Hour (0 to 23); not present in the daily file day.csv used here
holiday: number representing (0:Not Holiday ; 1:Holiday)
weekday: number representing Day of the week (0:Sunday, 1:Monday, 2:Tuesday, 3:Wednesday, 4:Thursday, 5:Friday, 6:Saturday)
workingday: Whether Working day or Weekend (0:Weekend/Holiday, 1:Working Day)
weathersit: Weather Condition (1: Clear, Few clouds, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog)
temp: Normalized temperature in Celsius. The values are divided by 41 (max)
atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
hum: Normalized humidity. The values are divided by 100 (max)
windspeed: Normalized wind speed. The values are divided by 67 (max)
casual: count of casual users (non-member user)
registered: count of registered users (member user)
cnt: count of total rental bikes including both casual and registered
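
To read the normalized weather columns in physical units, they can be multiplied back by the divisors listed above. This is a minimal sketch, assuming those divisors are exact. The three rows are made-up illustrative values, not rows from day.csv, and the unit of wind speed is not stated in the column list, so the name `windspeed_raw` is only a placeholder.

```r
# Illustrative rows only (hypothetical values, not taken from day.csv)
weather <- data.frame(
  temp      = c(0.25, 0.50, 0.75),
  hum       = c(0.40, 0.60, 0.80),
  windspeed = c(0.10, 0.20, 0.30)
)

# Undo the normalization described in the column list
weather$temp_celsius  <- weather$temp * 41
weather$hum_percent   <- weather$hum * 100
weather$windspeed_raw <- weather$windspeed * 67

weather
```

The same three multiplications could be applied to `dataset` before plotting, while the model itself can keep the normalized columns.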

CHECKING NA AND DUPLICATE VALUES

sum(is.na(dataset))
## [1] 0
sum(duplicated(dataset))
## [1] 0

As the code above shows, this data frame has no missing values and no duplicated rows.

DEFINING PREDICTORS AND TARGET

Our target column is cnt, which we want to predict using the other columns as predictors. Several columns are candidates for removal. atemp has a meaning similar to temp and nearly identical values. registered and casual can be dropped because we focus only on the total number of bikes rented; since cnt includes both, keeping them would also hand the target to the model. Finally, instant and dteday are not needed as predictors for this model.

library(dplyr)  # provides %>%, select() and mutate()

bike <- dataset %>% 
  select(-instant, -dteday, -casual, -registered, -atemp)

EXPLORATORY DATA ANALYSIS

In this section, we explore the data by looking at the correlation between columns.

library(GGally)  # provides ggcorr()
ggcorr(bike, label = TRUE, label_size = 3, hjust = 1)

We found that temp and yr have the strongest positive correlation with the target cnt.
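
The correlations behind the plot can also be read as numbers. This is a minimal sketch with base R's `cor()`. The data frame `toy` and its coefficients are simulated stand-ins (assumptions for illustration), to be replaced by `bike` in the actual analysis.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the bike data (hypothetical coefficients)
toy <- data.frame(
  temp = runif(n),
  yr   = rbinom(n, 1, 0.5),
  hum  = runif(n)
)
toy$cnt <- 1500 + 5000 * toy$temp + 2000 * toy$yr - 900 * toy$hum +
  rnorm(n, sd = 800)

# Correlation of every predictor with the target, strongest first
cor_with_cnt <- cor(toy)[, "cnt"]
cor_with_cnt <- cor_with_cnt[names(cor_with_cnt) != "cnt"]
sort(cor_with_cnt, decreasing = TRUE)
```

With the real data, `cor(bike)[, "cnt"]` would give the same column of values that the plot shows as labels.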

MODELING

BASE MODEL

For the base model, we create a model that includes all the predictors.

model_base <- lm(formula = cnt~.,
                 data = bike)
summary(model_base)
## 
## Call:
## lm(formula = cnt ~ ., data = bike)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4072.3  -440.0    34.9   551.7  2992.4 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  1626.11     230.47   7.055  0.00000000000404772 ***
## season        515.24      54.86   9.392 < 0.0000000000000002 ***
## yr           2040.88      65.37  31.221 < 0.0000000000000002 ***
## mnth          -39.81      17.12  -2.325              0.02037 *  
## holiday      -536.84     201.45  -2.665              0.00787 ** 
## weekday        67.21      16.32   4.117  0.00004276871007799 ***
## workingday    119.21      72.21   1.651              0.09919 .  
## weathersit   -622.29      78.42  -7.935  0.00000000000000805 ***
## temp         5154.22     195.03  26.428 < 0.0000000000000002 ***
## hum          -960.15     313.79  -3.060              0.00230 ** 
## windspeed   -2730.42     451.02  -6.054  0.00000000227467350 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 874.9 on 720 degrees of freedom
## Multiple R-squared:  0.7988, Adjusted R-squared:  0.796 
## F-statistic: 285.9 on 10 and 720 DF,  p-value: < 0.00000000000000022

💡 Insight:

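  • Intercept = 1626.11
  • season, yr, weekday, weathersit, temp, and windspeed are highly significant (p < 0.001); holiday and hum are significant at the 0.01 level and mnth at the 0.05 level; workingday is not significant at the 0.05 level
  • The adjusted R-squared of this model is 0.796, which means that 79.6% of the variance of cnt can be explained by the predictor columns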
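
The adjusted R-squared in the summary can be reproduced from the multiple R-squared and the degrees of freedom. The following is a worked check using only numbers printed above; n = 731 follows from the 720 residual degrees of freedom plus 11 estimated coefficients.

```r
r2 <- 0.7988        # Multiple R-squared from summary(model_base)
p  <- 10            # number of predictors
n  <- 720 + p + 1   # residual df + predictors + intercept = 731 observations

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)    # 0.796, as reported in the summary
```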

EVALUATION

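library(MLmetrics)  # provides MAE() and MAPE()

# aktual = actual values, prediksi = predicted values
res <- data.frame(aktual = bike$cnt,
                  prediksi = model_base$fitted.values) %>% 
  mutate(error = prediksi - aktual)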

range(bike$cnt)
## [1]   22 8714
# MAE value
MAE(y_pred = res$prediksi,
    y_true = res$aktual)
## [1] 648.7803
# MAPE value
MAPE(y_pred = res$prediksi,
    y_true = res$aktual)*100
## [1] 44.87518

💡 Insight:

  • The Mean Absolute Error (MAE) of model_base is 648.7803, while cnt ranges from 22 to 8714

  • The MAE is small relative to the full range but far larger than the minimum value, so on low-demand days the error can exceed the actual count many times over. We can assume that model_base has a fairly high level of error and consider building a new model

  • The Mean Absolute Percentage Error (MAPE) of model_base is 44.87518%.
    This is beyond the commonly accepted threshold of 20%. The result supports the MAE finding that model_base has a fairly high level of error and that a new model is worth considering
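
The gap between a moderate MAE and a very high MAPE is easier to see with the two metrics written out. This is a minimal sketch with hand-written helpers `mae()` and `mape()` (names chosen here, following the usual definitions of the two metrics). The two days are hypothetical and only borrow the minimum and maximum of cnt.

```r
# Hand-written versions of the two metrics
mae  <- function(y_pred, y_true) mean(abs(y_true - y_pred))
mape <- function(y_pred, y_true) mean(abs((y_true - y_pred) / y_true))

# Hypothetical example: the same absolute error on a quiet and a busy day
y_true <- c(22, 8714)
y_pred <- c(622, 8114)   # both predictions are off by 600 bikes

mae(y_pred, y_true)                      # 600
abs((y_true - y_pred) / y_true) * 100    # per-day percentage errors
mape(y_pred, y_true) * 100               # dominated by the quiet day
```

An error of 600 bikes is under 7% of the busy day but more than 2700% of the quiet day, which is why a few low-count days can push the MAPE far above what the MAE suggests.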

STEP-WISE REGRESSION

Model Backward

model_backward <- step(object = model_base,
                       direction = "backward")
## Start:  AIC=9914.61
## cnt ~ season + yr + mnth + holiday + weekday + workingday + weathersit + 
##     temp + hum + windspeed
## 
##              Df Sum of Sq        RSS     AIC
## <none>                     551085312  9914.6
## - workingday  1   2086187  553171498  9915.4
## - mnth        1   4136241  555221552  9918.1
## - holiday     1   5435628  556520940  9919.8
## - hum         1   7166299  558251611  9922.1
## - weekday     1  12975371  564060682  9929.6
## - windspeed   1  28051752  579137063  9948.9
## - weathersit  1  48194967  599280278  9973.9
## - season      1  67520352  618605663  9997.1
## - temp        1 534571941 1085657252 10408.3
## - yr          1 746080084 1297165396 10538.4

Model Forward

model_zero <- lm(formula = cnt~1,
                 data = bike)
model_forward <- step(object = model_zero,
                      direction = "forward",
                      scope = list(lower = model_zero, upper = model_base))
## Start:  AIC=11066.88
## cnt ~ 1
## 
##              Df  Sum of Sq        RSS   AIC
## + temp        1 1078688585 1660846807 10703
## + yr          1  879828893 1859706499 10786
## + season      1  451797359 2287738033 10937
## + weathersit  1  242288753 2497246639 11001
## + mnth        1  214744463 2524790929 11009
## + windspeed   1  150705556 2588829836 11028
## + hum         1   27757373 2711778019 11061
## + holiday     1   12797494 2726737898 11066
## + weekday     1   12461089 2727074303 11066
## + workingday  1   10246038 2729289354 11066
## <none>                     2739535392 11067
## 
## Step:  AIC=10703.05
## cnt ~ temp
## 
##              Df Sum of Sq        RSS   AIC
## + yr          1 791315919  869530888 10232
## + weathersit  1 136655317 1524191489 10642
## + season      1 118871554 1541975253 10651
## + hum         1  90543304 1570303503 10664
## + mnth        1  57891346 1602955461 10679
## + windspeed   1  51536710 1609310097 10682
## + weekday     1  12500530 1648346276 10700
## + holiday     1   6972634 1653874173 10702
## <none>                    1660846807 10703
## + workingday  1   2171087 1658675720 10704
## 
## Step:  AIC=10232
## cnt ~ temp + yr
## 
##              Df Sum of Sq       RSS   AIC
## + season      1 130771196 738759692 10115
## + weathersit  1 109837745 759693143 10135
## + mnth        1  63419864 806111024 10179
## + windspeed   1  49792557 819738331 10191
## + hum         1  39008653 830522235 10200
## + weekday     1  13610370 855920517 10222
## + holiday     1   8427999 861102889 10227
## + workingday  1   2562980 866967908 10232
## <none>                    869530888 10232
## 
## Step:  AIC=10114.86
## cnt ~ temp + yr + season
## 
##              Df Sum of Sq       RSS     AIC
## + weathersit  1 125925220 612834472  9980.2
## + hum         1  69803991 668955701 10044.3
## + windspeed   1  24796419 713963273 10091.9
## + weekday     1  13891347 724868345 10103.0
## + holiday     1   8369993 730389699 10108.5
## + mnth        1   6673916 732085776 10110.2
## + workingday  1   2769417 735990275 10114.1
## <none>                    738759692 10114.9
## 
## Step:  AIC=9980.25
## cnt ~ temp + yr + season + weathersit
## 
##              Df Sum of Sq       RSS    AIC
## + windspeed   1  21239104 591595368 9956.5
## + weekday     1  16654684 596179789 9962.1
## + holiday     1  11036594 601797878 9969.0
## + workingday  1   5946323 606888149 9975.1
## + mnth        1   4779267 608055205 9976.5
## + hum         1   3390584 609443888 9978.2
## <none>                    612834472 9980.2
## 
## Step:  AIC=9956.47
## cnt ~ temp + yr + season + weathersit + windspeed
## 
##              Df Sum of Sq       RSS    AIC
## + weekday     1  17150548 574444820 9937.0
## + hum         1  11095661 580499707 9944.6
## + holiday     1  10945629 580649739 9944.8
## + mnth        1   5694272 585901096 9951.4
## + workingday  1   5631889 585963479 9951.5
## <none>                    591595368 9956.5
## 
## Step:  AIC=9936.96
## cnt ~ temp + yr + season + weathersit + windspeed + weekday
## 
##              Df Sum of Sq       RSS    AIC
## + hum         1   8750655 565694165 9927.7
## + holiday     1   8440294 566004526 9928.1
## + mnth        1   6129510 568315310 9931.1
## + workingday  1   4990152 569454668 9932.6
## <none>                    574444820 9937.0
## 
## Step:  AIC=9927.74
## cnt ~ temp + yr + season + weathersit + windspeed + weekday + 
##     hum
## 
##              Df Sum of Sq       RSS    AIC
## + holiday     1   8313014 557381152 9918.9
## + mnth        1   4825298 560868867 9923.5
## + workingday  1   4573748 561120417 9923.8
## <none>                    565694165 9927.7
## 
## Step:  AIC=9918.92
## cnt ~ temp + yr + season + weathersit + windspeed + weekday + 
##     hum + holiday
## 
##              Df Sum of Sq       RSS    AIC
## + mnth        1   4209653 553171498 9915.4
## + workingday  1   2159599 555221552 9918.1
## <none>                    557381152 9918.9
## 
## Step:  AIC=9915.38
## cnt ~ temp + yr + season + weathersit + windspeed + weekday + 
##     hum + holiday + mnth
## 
##              Df Sum of Sq       RSS    AIC
## + workingday  1   2086187 551085312 9914.6
## <none>                    553171498 9915.4
## 
## Step:  AIC=9914.61
## cnt ~ temp + yr + season + weathersit + windspeed + weekday + 
##     hum + holiday + mnth + workingday

Model Both

model_both <- step(object = model_zero,
                   direction = "both",
                   scope = list(upper = model_base))
## Start:  AIC=11066.88
## cnt ~ 1
## 
##              Df  Sum of Sq        RSS   AIC
## + temp        1 1078688585 1660846807 10703
## + yr          1  879828893 1859706499 10786
## + season      1  451797359 2287738033 10937
## + weathersit  1  242288753 2497246639 11001
## + mnth        1  214744463 2524790929 11009
## + windspeed   1  150705556 2588829836 11028
## + hum         1   27757373 2711778019 11061
## + holiday     1   12797494 2726737898 11066
## + weekday     1   12461089 2727074303 11066
## + workingday  1   10246038 2729289354 11066
## <none>                     2739535392 11067
## 
## Step:  AIC=10703.05
## cnt ~ temp
## 
##              Df  Sum of Sq        RSS   AIC
## + yr          1  791315919  869530888 10232
## + weathersit  1  136655317 1524191489 10642
## + season      1  118871554 1541975253 10651
## + hum         1   90543304 1570303503 10664
## + mnth        1   57891346 1602955461 10679
## + windspeed   1   51536710 1609310097 10682
## + weekday     1   12500530 1648346276 10700
## + holiday     1    6972634 1653874173 10702
## <none>                     1660846807 10703
## + workingday  1    2171087 1658675720 10704
## - temp        1 1078688585 2739535392 11067
## 
## Step:  AIC=10232
## cnt ~ temp + yr
## 
##              Df Sum of Sq        RSS   AIC
## + season      1 130771196  738759692 10115
## + weathersit  1 109837745  759693143 10135
## + mnth        1  63419864  806111024 10179
## + windspeed   1  49792557  819738331 10191
## + hum         1  39008653  830522235 10200
## + weekday     1  13610370  855920517 10222
## + holiday     1   8427999  861102889 10227
## + workingday  1   2562980  866967908 10232
## <none>                     869530888 10232
## - yr          1 791315919 1660846807 10703
## - temp        1 990175611 1859706499 10786
## 
## Step:  AIC=10114.86
## cnt ~ temp + yr + season
## 
##              Df Sum of Sq        RSS     AIC
## + weathersit  1 125925220  612834472  9980.2
## + hum         1  69803991  668955701 10044.3
## + windspeed   1  24796419  713963273 10091.9
## + weekday     1  13891347  724868345 10103.0
## + holiday     1   8369993  730389699 10108.5
## + mnth        1   6673916  732085776 10110.2
## + workingday  1   2769417  735990275 10114.1
## <none>                     738759692 10114.9
## - season      1 130771196  869530888 10232.0
## - temp        1 666819270 1405578962 10583.1
## - yr          1 803215561 1541975253 10650.8
## 
## Step:  AIC=9980.25
## cnt ~ temp + yr + season + weathersit
## 
##              Df Sum of Sq        RSS     AIC
## + windspeed   1  21239104  591595368  9956.5
## + weekday     1  16654684  596179789  9962.1
## + holiday     1  11036594  601797878  9969.0
## + workingday  1   5946323  606888149  9975.1
## + mnth        1   4779267  608055205  9976.5
## + hum         1   3390584  609443888  9978.2
## <none>                     612834472  9980.2
## - weathersit  1 125925220  738759692 10114.9
## - season      1 146858671  759693143 10135.3
## - temp        1 581108156 1193942628 10465.8
## - yr          1 775161296 1387995768 10575.9
## 
## Step:  AIC=9956.47
## cnt ~ temp + yr + season + weathersit + windspeed
## 
##              Df Sum of Sq        RSS     AIC
## + weekday     1  17150548  574444820  9937.0
## + hum         1  11095661  580499707  9944.6
## + holiday     1  10945629  580649739  9944.8
## + mnth        1   5694272  585901096  9951.4
## + workingday  1   5631889  585963479  9951.5
## <none>                     591595368  9956.5
## - windspeed   1  21239104  612834472  9980.2
## - season      1 121300601  712895969 10090.8
## - weathersit  1 122367905  713963273 10091.9
## - temp        1 558847870 1150443238 10440.6
## - yr          1 773416543 1365011911 10565.7
## 
## Step:  AIC=9936.96
## cnt ~ temp + yr + season + weathersit + windspeed + weekday
## 
##              Df Sum of Sq        RSS     AIC
## + hum         1   8750655  565694165  9927.7
## + holiday     1   8440294  566004526  9928.1
## + mnth        1   6129510  568315310  9931.1
## + workingday  1   4990152  569454668  9932.6
## <none>                     574444820  9937.0
## - weekday     1  17150548  591595368  9956.5
## - windspeed   1  21734968  596179789  9962.1
## - season      1 121545623  695990443 10075.3
## - weathersit  1 125098222  699543042 10079.0
## - temp        1 557570981 1132015802 10430.8
## - yr          1 774349655 1348794476 10558.9
## 
## Step:  AIC=9927.74
## cnt ~ temp + yr + season + weathersit + windspeed + weekday + 
##     hum
## 
##              Df Sum of Sq        RSS     AIC
## + holiday     1   8313014  557381152  9918.9
## + mnth        1   4825298  560868867  9923.5
## + workingday  1   4573748  561120417  9923.8
## <none>                     565694165  9927.7
## - hum         1   8750655  574444820  9937.0
## - weekday     1  14805542  580499707  9944.6
## - windspeed   1  28328617  594022782  9961.5
## - weathersit  1  45002552  610696718  9981.7
## - season      1 127670461  693364627 10074.5
## - temp        1 564980031 1130674196 10432.0
## - yr          1 742780054 1308474220 10538.7
## 
## Step:  AIC=9918.92
## cnt ~ temp + yr + season + weathersit + windspeed + weekday + 
##     hum + holiday
## 
##              Df Sum of Sq        RSS     AIC
## + mnth        1   4209653  553171498  9915.4
## + workingday  1   2159599  555221552  9918.1
## <none>                     557381152  9918.9
## - holiday     1   8313014  565694165  9927.7
## - hum         1   8623374  566004526  9928.1
## - weekday     1  12536921  569918073  9933.2
## - windspeed   1  28134323  585515475  9952.9
## - weathersit  1  46194745  603575897  9975.1
## - season      1 127728592  685109743 10067.7
## - temp        1 560060226 1117441377 10425.4
## - yr          1 744019894 1301401046 10536.8
## 
## Step:  AIC=9915.38
## cnt ~ temp + yr + season + weathersit + windspeed + weekday + 
##     hum + holiday + mnth
## 
##              Df Sum of Sq        RSS     AIC
## + workingday  1   2086187  551085312  9914.6
## <none>                     553171498  9915.4
## - mnth        1   4209653  557381152  9918.9
## - hum         1   7409883  560581381  9923.1
## - holiday     1   7697369  560868867  9923.5
## - weekday     1  13043833  566215332  9930.4
## - windspeed   1  28417536  581589035  9950.0
## - weathersit  1  47074882  600246381  9973.1
## - season      1  67645571  620817070  9997.7
## - temp        1 539950234 1093121733 10411.3
## - yr          1 745809916 1298981414 10537.4
## 
## Step:  AIC=9914.61
## cnt ~ temp + yr + season + weathersit + windspeed + weekday + 
##     hum + holiday + mnth + workingday
## 
##              Df Sum of Sq        RSS     AIC
## <none>                     551085312  9914.6
## - workingday  1   2086187  553171498  9915.4
## - mnth        1   4136241  555221552  9918.1
## - holiday     1   5435628  556520940  9919.8
## - hum         1   7166299  558251611  9922.1
## - weekday     1  12975371  564060682  9929.6
## - windspeed   1  28051752  579137063  9948.9
## - weathersit  1  48194967  599280278  9973.9
## - season      1  67520352  618605663  9997.1
## - temp        1 534571941 1085657252 10408.3
## - yr          1 746080084 1297165396 10538.4

EVALUATION

library(performance)  # provides compare_performance()

comparison <- compare_performance(model_base,
                                  model_zero,
                                  model_backward,
                                  model_forward,
                                  model_both)

comparison

Criteria for the best model:

  • Based on adjusted R-squared: the highest.
  • Based on AIC: the lowest.
  • Based on RMSE: the lowest.

💡 Insight:

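  • model_base, model_backward, model_forward, and model_both have the same adjusted R-squared, AIC, and RMSE because step-wise selection kept all ten predictors in every case. Only model_zero, the intercept-only baseline, differs.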
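
Whether step-wise selection really ended with the same predictors can be checked directly instead of being read off the long traces. This is a minimal sketch on simulated data; `toy`, its coefficients, and the helper `terms_of()` are assumptions for illustration. With the real models, the same helper could be applied to model_backward, model_forward, model_both, and model_base.

```r
set.seed(42)
n <- 200

# Simulated data in which every predictor has a strong effect
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 3 * toy$x2 + 1.5 * toy$x3 + rnorm(n)

full <- lm(y ~ ., data = toy)
null <- lm(y ~ 1, data = toy)

backward <- step(full, direction = "backward", trace = 0)
forward  <- step(null, direction = "forward",
                 scope = list(lower = null, upper = full), trace = 0)

# Predictor names kept by a model, sorted so that order does not matter
terms_of <- function(m) sort(attr(terms(m), "term.labels"))

same_terms <- identical(terms_of(backward), terms_of(full)) &&
  identical(terms_of(forward), terms_of(full))
same_terms
```

When the term sets are identical, the fitted models are identical too, so equal adjusted R-squared, AIC, and RMSE are expected rather than a coincidence.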

LINEAR MODEL ASSUMPTION

1. LINEARITY

plot(model_base, which = 1)

💡 Insight :

  • The plot of fitted values vs. residuals shows a random pattern
  • The red line is almost straight
  • Hence, we can assume that model_base fulfils the linearity assumption

2. NORMALITY OF RESIDUALS

To check whether the residuals are normally distributed, we can use a histogram:

hist(model_base$residuals)

Besides the histogram, we can use the Shapiro-Wilk normality test, a statistical test of whether data follow a normal distribution. In this case, we test whether the residuals are normally distributed.
We use shapiro.test().

The hypothesis to be tested:

  • H0: Residuals are normally distributed.
  • H1: Residuals are not normally distributed.

With the Shapiro-Wilk normality test, we obtain a p-value.

  • If p-value > \(\alpha\) (usually \(\alpha = 0.05\)), then we fail to reject H0 (the condition we want).

# we set alpha to 5% or 0.05
shapiro.test(model_base$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_base$residuals
## W = 0.97144, p-value = 0.00000000009473

💡 Insight :

  • p-value = 0.00000000009473 < 0.05, so we reject H0. The residuals of model_base are not normally distributed
  • Given this conclusion, we can re-model after transforming the target variable cnt with log() or sqrt()
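
To see how the decision rule behaves, the test can be run on a sample that is clearly not normal. This is a small sketch on simulated data; the exponential sample is an assumption chosen for illustration and is unrelated to the bike data.

```r
set.seed(123)
skewed <- rexp(500)   # strongly right-skewed, so normality should be rejected

result <- shapiro.test(skewed)
alpha  <- 0.05

# TRUE means we reject H0 (the sample is not normally distributed)
result$p.value < alpha
```

The residuals of model_base lead to the same decision: the p-value is far below alpha, so H0 is rejected.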

3. HOMOSCEDASTICITY OF RESIDUALS

To check whether the residuals are scattered randomly with constant variance, the first method is a plot:

# scatter plot
plot(model_base$fitted.values, y = model_base$residuals)
abline(h = 0, col = "red") # horizontal line at 0

The second method is the Breusch-Pagan statistical test, which checks whether the residuals are homoscedastic (constant variance) or not.

Hypothesis to be tested:

  • H0: Residuals have constant variance (homoscedasticity).
  • H1: Residuals do not have constant variance (heteroscedasticity).

With the Breusch-Pagan statistical test, we obtain a p-value:

  • If p-value > \(\alpha\) (usually \(\alpha = 0.05\)), then we fail to reject H0 (the condition we want).

library(lmtest)  # provides bptest()
bptest(model_base)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_base
## BP = 67.403, df = 10, p-value = 0.0000000001403

💡 Insight :

  • p-value = 0.0000000001403 < 0.05, so we reject H0. The residuals of model_base do not have constant variance (heteroscedasticity)
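
The idea behind the test can be sketched in base R: regress the squared residuals on the predictors and check whether that auxiliary regression explains anything. The statistic n * R-squared used below is the studentized (Koenker) form, which should correspond to what bptest() reports by default. The data are simulated with noise that grows with x, an assumption made only to produce a clear rejection.

```r
set.seed(7)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # noise grows with x: heteroscedastic by design

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictors
res_sq <- residuals(fit)^2
aux    <- lm(res_sq ~ x)

bp_stat <- n * summary(aux)$r.squared      # studentized (Koenker) statistic
dof     <- length(coef(aux)) - 1           # number of predictors
p_value <- pchisq(bp_stat, df = dof, lower.tail = FALSE)

c(BP = bp_stat, df = dof, p.value = p_value)
```

For model_base the output has df = 10 because the auxiliary regression uses all ten predictors.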

4. MULTICOLLINEARITY

To test for multicollinearity among our predictors, we can use the VIF (variance inflation factor).

library(car)  # provides vif()
vif(model_base)
##     season         yr       mnth    holiday    weekday workingday weathersit 
##   3.541429   1.020251   3.332130   1.081437   1.021463   1.076338   1.741541 
##       temp        hum  windspeed 
##   1.215589   1.905040   1.165206

From the VIF results, every predictor has a VIF below 10, so no multicollinearity is found.
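
For numeric predictors, the VIF of a predictor is 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on all the others. This is a minimal base-R sketch on simulated predictors; `x1`, `x2`, `x3` and the name `vif_by_hand` are assumptions for illustration.

```r
set.seed(99)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing predictor j on the others
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

round(vif_by_hand, 3)
```

Read in reverse, the largest value above, 3.54 for season, means that about 72% of the variance of season is explained by the other predictors (1 - 1/3.54), which is still well short of the threshold of 10.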

RE-MODEL

# we transform our target variable using sqrt()
bike_sqrt <- bike %>% 
  mutate(cnt_sqrt = sqrt(cnt)) %>% 
  select(-cnt)

model_sqrt <- lm(formula = cnt_sqrt ~ .,
                 data = bike_sqrt)
summary(model_sqrt)
## 
## Call:
## lm(formula = cnt_sqrt ~ ., data = bike_sqrt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.402  -3.485   0.852   4.565  22.058 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  42.2754     1.9233  21.981 < 0.0000000000000002 ***
## season        4.4570     0.4578   9.736 < 0.0000000000000002 ***
## yr           15.4213     0.5455  28.271 < 0.0000000000000002 ***
## mnth         -0.3468     0.1429  -2.427             0.015465 *  
## holiday      -4.7078     1.6810  -2.801             0.005239 ** 
## weekday       0.4712     0.1362   3.459             0.000575 ***
## workingday    1.2896     0.6026   2.140             0.032677 *  
## weathersit   -5.5435     0.6544  -8.471 < 0.0000000000000002 ***
## temp         42.3617     1.6275  26.029 < 0.0000000000000002 ***
## hum          -7.4772     2.6185  -2.856             0.004420 ** 
## windspeed   -22.7620     3.7636  -6.048        0.00000000236 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.301 on 720 degrees of freedom
## Multiple R-squared:  0.7883, Adjusted R-squared:  0.7853 
## F-statistic: 268.1 on 10 and 720 DF,  p-value: < 0.00000000000000022

EVALUATION

res2 <- data.frame(aktual2 = bike_sqrt$cnt_sqrt,
                  prediksi2 = model_sqrt$fitted.values) %>% 
  mutate(error2 = prediksi2 - aktual2)

range(bike_sqrt$cnt_sqrt)
## [1]  4.690416 93.348808
# MAE value
MAE(y_pred = res2$prediksi2,
    y_true = res2$aktual2)
## [1] 5.290285
# MAPE value
MAPE(y_pred = res2$prediksi2,
    y_true = res2$aktual2)*100
## [1] 10.78321

💡 Insight:

  • The Mean Absolute Error (MAE) of model_sqrt is 5.290285, while cnt_sqrt ranges from 4.690416 to 93.348808
  • The MAE is slightly above the minimum value and about 6% of the data range (5.290285 / (93.348808 - 4.690416)), and the MAPE is 10.78321%, below the 20% threshold. model_sqrt therefore appears to improve on model_base in terms of error, although these metrics are on the square-root scale and are not directly comparable with those of model_base
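
For a like-for-like comparison with model_base, the predictions of a square-root model can be squared back to the original unit before computing the metrics. This is a minimal sketch on simulated data; `toy`, its coefficients, and the variable names are assumptions for illustration. With the real objects, the same steps would use model_sqrt$fitted.values^2 and bike$cnt.

```r
set.seed(2024)
n    <- 200
temp <- runif(n)

# Simulated daily counts (hypothetical relationship, always positive)
cnt <- round((30 + 50 * temp + rnorm(n, sd = 5))^2)
toy <- data.frame(temp = temp, cnt_sqrt = sqrt(cnt))

fit_sqrt <- lm(cnt_sqrt ~ temp, data = toy)

# Square the fitted values to return to the original unit (bikes per day)
pred_cnt <- fitted(fit_sqrt)^2

mae_original  <- mean(abs(cnt - pred_cnt))
mape_original <- mean(abs((cnt - pred_cnt) / cnt)) * 100

c(MAE = mae_original, MAPE = mape_original)
```

Metrics computed this way are in bikes per day and percent of the actual count, so they can be set next to the 648.7803 and 44.87518% obtained for model_base.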

LINEAR MODEL ASSUMPTION RE-MODEL

1. LINEARITY

plot(model_sqrt, which = 1)

💡 Insight :

  • The plot of fitted values vs. residuals shows a random pattern
  • The red line is almost straight
  • Hence, we can assume that model_sqrt fulfils the linearity assumption

2. NORMALITY OF RESIDUALS

To check whether the residuals are normally distributed, we can use a histogram:

hist(model_sqrt$residuals)

Besides the histogram, we can use the Shapiro-Wilk normality test, a statistical test of whether data follow a normal distribution. In this case, we test whether the residuals are normally distributed.
We use shapiro.test().

The hypothesis to be tested:

  • H0: Residuals are normally distributed.
  • H1: Residuals are not normally distributed.

With the Shapiro-Wilk normality test, we obtain a p-value.

  • If p-value > \(\alpha\) (usually \(\alpha = 0.05\)), then we fail to reject H0 (the condition we want).

# we set alpha to 5% or 0.05
shapiro.test(model_sqrt$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_sqrt$residuals
## W = 0.93279, p-value < 0.00000000000000022

💡 Insight :

  • p-value < 0.00000000000000022, which is below 0.05, so we reject H0. The residuals of model_sqrt are not normally distributed
  • Consider using a more complex model

3. HOMOSCEDASTICITY OF RESIDUALS

To check whether the residuals are scattered randomly with constant variance, the first method is a plot:

# scatter plot
plot(model_sqrt$fitted.values, y = model_sqrt$residuals)
abline(h = 0, col = "red") # horizontal line at 0

The second method is the Breusch-Pagan statistical test, which checks whether the residuals are homoscedastic (constant variance) or not.

Hypothesis to be tested:

  • H0: Residuals have constant variance (homoscedasticity).
  • H1: Residuals do not have constant variance (heteroscedasticity).

With the Breusch-Pagan statistical test, we obtain a p-value:

  • If p-value > \(\alpha\) (usually \(\alpha = 0.05\)), then we fail to reject H0 (the condition we want).

bptest(model_sqrt)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_sqrt
## BP = 36.453, df = 10, p-value = 0.00007034

💡 Insight :

  • p-value = 0.00007034 < 0.05, so we reject H0. The residuals of model_sqrt do not have constant variance (heteroscedasticity)

4. MULTICOLLINEARITY

To test for multicollinearity among our predictors, we can use the VIF (variance inflation factor).

vif(model_sqrt)
##     season         yr       mnth    holiday    weekday workingday weathersit 
##   3.541429   1.020251   3.332130   1.081437   1.021463   1.076338   1.741541 
##       temp        hum  windspeed 
##   1.215589   1.905040   1.165206

From the VIF results, every predictor has a VIF below 10, so no multicollinearity is found. The values are identical to those of model_base because VIF depends only on the predictors, not on the target.

A work by Taufan Anggoro Adhi

tf.anggoro@gmail.com