library(tidyverse)
library(readxl) # load data with xlsx type
library(lubridate) # for data pre-processing (date type data, etc)
library(hms) # to convert lubridate type into hour, minute, second type
library(GGally) # your usually statistic package
library(MLmetrics)
library(lmtest)
library(car)knitr::include_graphics("inhale-exhale.jpg")Fresh air! Who doesn’t love it? Imagine the times when it was raining and you inhale fresh air and the smell of rain. But, are we inhale the good quality air? Can we predict the air that we inhale has good quality or not? This repositories aims to do prediction of air quality in Italian City.
First, we need to load the data set. Because the data set is in .xlsx, I’ll use read_excel() function from readxl packages. After load the data, let’s take a peek with head() function.
air <- read_excel("AirQualityUCI.xlsx")
head(air)After that, check the data type, are they correct?
glimpse(air)## Rows: 9,357
## Columns: 15
## $ Date <dttm> 2004-03-10, 2004-03-10, 2004-03-10, 2004-03-10, 20...
## $ Time <dttm> 1899-12-31 18:00:00, 1899-12-31 19:00:00, 1899-12-...
## $ `CO(GT)` <dbl> 2.6, 2.0, 2.2, 2.2, 1.6, 1.2, 1.2, 1.0, 0.9, 0.6, -...
## $ `PT08.S1(CO)` <dbl> 1360.00, 1292.25, 1402.00, 1375.50, 1272.25, 1197.0...
## $ `NMHC(GT)` <dbl> 150, 112, 88, 80, 51, 38, 31, 31, 24, 19, 14, 8, 16...
## $ `C6H6(GT)` <dbl> 11.881723, 9.397165, 8.997817, 9.228796, 6.518224, ...
## $ `PT08.S2(NMHC)` <dbl> 1045.50, 954.75, 939.25, 948.25, 835.50, 750.25, 68...
## $ `NOx(GT)` <dbl> 166, 103, 131, 172, 131, 89, 62, 62, 45, -200, 21, ...
## $ `PT08.S3(NOx)` <dbl> 1056.25, 1173.75, 1140.00, 1092.00, 1205.00, 1336.5...
## $ `NO2(GT)` <dbl> 113, 92, 114, 122, 116, 96, 77, 76, 60, -200, 34, 2...
## $ `PT08.S4(NO2)` <dbl> 1692.00, 1558.75, 1554.50, 1583.75, 1490.00, 1393.0...
## $ `PT08.S5(O3)` <dbl> 1267.50, 972.25, 1074.00, 1203.25, 1110.00, 949.25,...
## $ T <dbl> 13.600, 13.300, 11.900, 11.000, 11.150, 11.175, 11....
## $ RH <dbl> 48.875, 47.700, 53.975, 60.000, 59.575, 59.175, 56....
## $ AH <dbl> 0.7577538, 0.7254874, 0.7502391, 0.7867125, 0.78879...
The column Time should shows time only. Thus, we need to do data cleaning before we move to the next step.
unique(air$Time)## [1] "1899-12-31 18:00:00 UTC" "1899-12-31 19:00:00 UTC"
## [3] "1899-12-31 20:00:00 UTC" "1899-12-31 21:00:00 UTC"
## [5] "1899-12-31 22:00:00 UTC" "1899-12-31 23:00:00 UTC"
## [7] "1899-12-31 00:00:00 UTC" "1899-12-31 01:00:00 UTC"
## [9] "1899-12-31 02:00:00 UTC" "1899-12-31 03:00:00 UTC"
## [11] "1899-12-31 04:00:00 UTC" "1899-12-31 05:00:00 UTC"
## [13] "1899-12-31 06:00:00 UTC" "1899-12-31 07:00:00 UTC"
## [15] "1899-12-31 08:00:00 UTC" "1899-12-31 09:00:00 UTC"
## [17] "1899-12-31 10:00:00 UTC" "1899-12-31 11:00:00 UTC"
## [19] "1899-12-31 12:00:00 UTC" "1899-12-31 13:00:00 UTC"
## [21] "1899-12-31 14:00:00 UTC" "1899-12-31 15:00:00 UTC"
## [23] "1899-12-31 16:00:00 UTC" "1899-12-31 17:00:00 UTC"
This mean the date on column Time is a dummy date. The content of Time will be ok if only represent the hour. Thus, we need to change it by using hour() function by lubridate packages.
air <- air %>%
mutate(Time = hour(Time))
glimpse(air)## Rows: 9,357
## Columns: 15
## $ Date <dttm> 2004-03-10, 2004-03-10, 2004-03-10, 2004-03-10, 20...
## $ Time <int> 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ `CO(GT)` <dbl> 2.6, 2.0, 2.2, 2.2, 1.6, 1.2, 1.2, 1.0, 0.9, 0.6, -...
## $ `PT08.S1(CO)` <dbl> 1360.00, 1292.25, 1402.00, 1375.50, 1272.25, 1197.0...
## $ `NMHC(GT)` <dbl> 150, 112, 88, 80, 51, 38, 31, 31, 24, 19, 14, 8, 16...
## $ `C6H6(GT)` <dbl> 11.881723, 9.397165, 8.997817, 9.228796, 6.518224, ...
## $ `PT08.S2(NMHC)` <dbl> 1045.50, 954.75, 939.25, 948.25, 835.50, 750.25, 68...
## $ `NOx(GT)` <dbl> 166, 103, 131, 172, 131, 89, 62, 62, 45, -200, 21, ...
## $ `PT08.S3(NOx)` <dbl> 1056.25, 1173.75, 1140.00, 1092.00, 1205.00, 1336.5...
## $ `NO2(GT)` <dbl> 113, 92, 114, 122, 116, 96, 77, 76, 60, -200, 34, 2...
## $ `PT08.S4(NO2)` <dbl> 1692.00, 1558.75, 1554.50, 1583.75, 1490.00, 1393.0...
## $ `PT08.S5(O3)` <dbl> 1267.50, 972.25, 1074.00, 1203.25, 1110.00, 949.25,...
## $ T <dbl> 13.600, 13.300, 11.900, 11.000, 11.150, 11.175, 11....
## $ RH <dbl> 48.875, 47.700, 53.975, 60.000, 59.575, 59.175, 56....
## $ AH <dbl> 0.7577538, 0.7254874, 0.7502391, 0.7867125, 0.78879...
Ok, the data type is correct. But what does columns represent for?
colSums(is.na(air))## Date Time CO(GT) PT08.S1(CO) NMHC(GT)
## 0 0 0 0 0
## C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT)
## 0 0 0 0 0
## PT08.S4(NO2) PT08.S5(O3) T RH AH
## 0 0 0 0 0
ggcorr(air, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)## Warning in ggcorr(air, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2):
## data in column(s) 'Date' are not numeric and were ignored
Correlation graph above shows that there are positive correlation between CO(GT) and NOx(GT) and NO2(GT). Other residual almost has no correlation to CO(GT).
To do prediction, we will create three model of regression model; Simple Linear Regression, Linear Regression (one predictor), and Multiple Linear Regression (Stepwise Regression).
Simple Linear Regression here means we will create a model regression without any predictor. It will be represent as lm1.
lm1 <- lm(`CO(GT)` ~ 1, air)
summary(lm1)##
## Call:
## lm(formula = `CO(GT)` ~ 1, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -165.79 34.81 35.71 36.81 46.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.2075 0.8028 -42.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.66 on 9356 degrees of freedom
Next, I will create Linear Regression Model with PT08.S3(NOx) as predictor and assign it into lm2
lm2 <- lm(`CO(GT)` ~ `NO2(GT)`, air)
summary(lm2)##
## Call:
## lm(formula = `CO(GT)` ~ `NO2(GT)`, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -226.34 -13.50 11.22 25.82 146.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.078469 0.654643 -88.72 <2e-16 ***
## `NO2(GT)` 0.410606 0.004689 87.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.57 on 9355 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.4504
## F-statistic: 7667 on 1 and 9355 DF, p-value: < 2.2e-16
In this section, we will create 2 variant of multiple linear regression models. First, create multiple regression model with 2 predictor and assign it as lm3.
# create multiple linear regression with 2 predictor
lm3 <- lm(`CO(GT)` ~ `NMHC(GT)` + `NOx(GT)`, data = air)
summary(lm3)##
## Call:
## lm(formula = `CO(GT)` ~ `NMHC(GT)` + `NOx(GT)`, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -303.27 -28.13 24.33 45.78 102.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -49.464420 1.111996 -44.48 <2e-16 ***
## `NMHC(GT)` 0.072595 0.004826 15.04 <2e-16 ***
## `NOx(GT)` 0.158988 0.002620 60.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65.25 on 9354 degrees of freedom
## Multiple R-squared: 0.2942, Adjusted R-squared: 0.2941
## F-statistic: 1950 on 2 and 9354 DF, p-value: < 2.2e-16
Result above shows that: The predictors (NMHC(GT) and NOx(GT)) has correlation to the target (CO(GT)).
In this section, I will try to use Stepwise Regression - Backward method. We will create a multiple linear regression that use all predictor and assign it into lm4.
# create multiple linear regression with all predictor
lm4 <- lm(`CO(GT)` ~ ., air)
# stepwise - backward
backward <- step(object = lm4, direction = "backward")## Start: AIC=75453.18
## `CO(GT)` ~ Date + Time + `PT08.S1(CO)` + `NMHC(GT)` + `C6H6(GT)` +
## `PT08.S2(NMHC)` + `NOx(GT)` + `PT08.S3(NOx)` + `NO2(GT)` +
## `PT08.S4(NO2)` + `PT08.S5(O3)` + T + RH + AH
##
## Df Sum of Sq RSS AIC
## - AH 1 36 29635898 75451
## - `PT08.S2(NMHC)` 1 73 29635934 75451
## - `PT08.S1(CO)` 1 122 29635983 75451
## - Date 1 1202 29637063 75452
## - `C6H6(GT)` 1 3956 29639817 75452
## - `PT08.S4(NO2)` 1 4186 29640047 75453
## <none> 29635861 75453
## - `PT08.S3(NOx)` 1 8696 29644557 75454
## - T 1 21511 29657372 75458
## - RH 1 25832 29661693 75459
## - `PT08.S5(O3)` 1 33828 29669689 75462
## - Time 1 59483 29695345 75470
## - `NOx(GT)` 1 112402 29748263 75487
## - `NMHC(GT)` 1 230075 29865936 75524
## - `NO2(GT)` 1 4841655 34477516 76867
##
## Step: AIC=75451.19
## `CO(GT)` ~ Date + Time + `PT08.S1(CO)` + `NMHC(GT)` + `C6H6(GT)` +
## `PT08.S2(NMHC)` + `NOx(GT)` + `PT08.S3(NOx)` + `NO2(GT)` +
## `PT08.S4(NO2)` + `PT08.S5(O3)` + T + RH
##
## Df Sum of Sq RSS AIC
## - `PT08.S1(CO)` 1 122 29636020 75449
## - `PT08.S2(NMHC)` 1 277 29636175 75449
## - Date 1 1321 29637218 75450
## - `PT08.S4(NO2)` 1 6125 29642023 75451
## <none> 29635898 75451
## - `PT08.S3(NOx)` 1 10442 29646340 75452
## - `C6H6(GT)` 1 25420 29661318 75457
## - `PT08.S5(O3)` 1 34340 29670238 75460
## - T 1 42764 29678662 75463
## - RH 1 42982 29678879 75463
## - Time 1 60707 29696604 75468
## - `NOx(GT)` 1 120771 29756669 75487
## - `NMHC(GT)` 1 230235 29866132 75522
## - `NO2(GT)` 1 5281829 34917727 76984
##
## Step: AIC=75449.23
## `CO(GT)` ~ Date + Time + `NMHC(GT)` + `C6H6(GT)` + `PT08.S2(NMHC)` +
## `NOx(GT)` + `PT08.S3(NOx)` + `NO2(GT)` + `PT08.S4(NO2)` +
## `PT08.S5(O3)` + T + RH
##
## Df Sum of Sq RSS AIC
## - `PT08.S2(NMHC)` 1 289 29636309 75447
## - Date 1 1238 29637258 75448
## - `PT08.S4(NO2)` 1 6038 29642058 75449
## <none> 29636020 75449
## - `PT08.S3(NOx)` 1 10321 29646341 75450
## - `C6H6(GT)` 1 25604 29661624 75455
## - `PT08.S5(O3)` 1 38851 29674871 75459
## - T 1 42837 29678857 75461
## - RH 1 43461 29679481 75461
## - Time 1 61723 29697743 75467
## - `NOx(GT)` 1 121401 29757421 75485
## - `NMHC(GT)` 1 269162 29905182 75532
## - `NO2(GT)` 1 5283128 34919148 76982
##
## Step: AIC=75447.32
## `CO(GT)` ~ Date + Time + `NMHC(GT)` + `C6H6(GT)` + `NOx(GT)` +
## `PT08.S3(NOx)` + `NO2(GT)` + `PT08.S4(NO2)` + `PT08.S5(O3)` +
## T + RH
##
## Df Sum of Sq RSS AIC
## - Date 1 1299 29637607 75446
## <none> 29636309 75447
## - `PT08.S4(NO2)` 1 8302 29644611 75448
## - `PT08.S3(NOx)` 1 12051 29648360 75449
## - `PT08.S5(O3)` 1 39741 29676050 75458
## - Time 1 63297 29699606 75465
## - `C6H6(GT)` 1 70898 29707207 75468
## - T 1 72176 29708485 75468
## - RH 1 100075 29736384 75477
## - `NOx(GT)` 1 135979 29772288 75488
## - `NMHC(GT)` 1 270331 29906640 75530
## - `NO2(GT)` 1 5575237 35211546 77058
##
## Step: AIC=75445.73
## `CO(GT)` ~ Time + `NMHC(GT)` + `C6H6(GT)` + `NOx(GT)` + `PT08.S3(NOx)` +
## `NO2(GT)` + `PT08.S4(NO2)` + `PT08.S5(O3)` + T + RH
##
## Df Sum of Sq RSS AIC
## <none> 29637607 75446
## - `PT08.S4(NO2)` 1 9665 29647273 75447
## - `PT08.S3(NOx)` 1 25077 29662685 75452
## - `PT08.S5(O3)` 1 39025 29676633 75456
## - Time 1 62639 29700246 75463
## - T 1 72058 29709665 75466
## - `C6H6(GT)` 1 78757 29716365 75469
## - RH 1 99904 29737511 75475
## - `NOx(GT)` 1 136808 29774416 75487
## - `NMHC(GT)` 1 333117 29970724 75548
## - `NO2(GT)` 1 5657173 35294781 77078
# summary the backward
summary(backward)##
## Call:
## lm(formula = `CO(GT)` ~ Time + `NMHC(GT)` + `C6H6(GT)` + `NOx(GT)` +
## `PT08.S3(NOx)` + `NO2(GT)` + `PT08.S4(NO2)` + `PT08.S5(O3)` +
## T + RH, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -220.984 -8.926 10.349 23.820 187.541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.421055 7.913484 -6.372 1.96e-10 ***
## Time -0.426023 0.095856 -4.444 8.92e-06 ***
## `NMHC(GT)` 0.049016 0.004782 10.249 < 2e-16 ***
## `C6H6(GT)` -0.670241 0.134491 -4.984 6.36e-07 ***
## `NOx(GT)` 0.036325 0.005530 6.568 5.36e-11 ***
## `PT08.S3(NOx)` 0.011369 0.004043 2.812 0.004932 **
## `NO2(GT)` 0.387472 0.009174 42.237 < 2e-16 ***
## `PT08.S4(NO2)` -0.004892 0.002802 -1.746 0.080874 .
## `PT08.S5(O3)` -0.013820 0.003940 -3.508 0.000454 ***
## T 0.475280 0.099705 4.767 1.90e-06 ***
## RH 0.205332 0.036583 5.613 2.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56.31 on 9346 degrees of freedom
## Multiple R-squared: 0.4747, Adjusted R-squared: 0.4742
## F-statistic: 844.6 on 10 and 9346 DF, p-value: < 2.2e-16
We have create several model of linear regression, let’s we compare each models to know which model is the best one by adj.r.squared!
summary(lm1)$adj.r.squared # model without predictor## [1] 0
summary(lm2)$adj.r.squared # model with 1 preductor## [1] 0.4503696
summary(lm3)$adj.r.squared # model with 2 predictor## [1] 0.2940751
summary(backward)$adj.r.squared # model using stepwise - backward## [1] 0.4741592
Result above shows that it is recommended to use the model with backward regression, but we will predict all model that we have made.
Let’s forecast the model we had and add new column that represent the forecast result of each model in air.
air$predict1 <- predict(lm1, air) # prediction with lm1
air$predict2 <- predict(lm2, air) # prediction with lm2
air$predict3 <- predict(lm3, air) # prediction with lm3
air$predict4 <- predict(backward, air) # prediction with stepwise-backward
head(air)After add the forecast result into new column of each model, we will evaluate each model by calculate the **root mean square error using RMSE() function by MLmetric package, to see how well the models that we have created.
# model evaluation by RMSE
RMSE(y_pred = air$predict1, y_true = air$`CO(GT)`) # 0 predictor## [1] 77.65302
RMSE(y_pred = air$predict2, y_true = air$`CO(GT)`) # 1 predictor## [1] 57.56659
RMSE(y_pred = air$predict3, y_true = air$`CO(GT)`) # 2 predictor## [1] 65.23658
RMSE(y_pred = air$predict4, y_true = air$`CO(GT)`) # 10 predictor (backward)## [1] 56.27989
The result of RMSE shows that to forecast the value of CO(GT), it will be better to use model from Stepwise Regression - Backward method. The Backward model has smallest error amongst of other model with RMSE of 56.28. After knowing which model we’ll be use to forecast, we need to check the assumption of the model. Next section will do assumption checking of the backward model.
In this section we need to check assumption on models we have created. There are 4 assumption that we need to check;
There are 2 ways to check the linearity of the model. Visualize it or by using cor.test() function. We will do both of them to do check and re-check of both method.
cor.test() Function:# linearity of `Time` to `CO(GT)`
cor.test(air$`CO(GT)`, air$Time)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$Time
## t = 7.3705, df = 9355, p-value = 1.844e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.05580707 0.09609791
## sample estimates:
## cor
## 0.0759835
# linearity of `NMHC(GT)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`NMHC(GT)`)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$`NMHC(GT)`
## t = 12.518, df = 9355, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1083706 0.1482280
## sample estimates:
## cor
## 0.1283512
# linearity of `C6H6(GT)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`C6H6(GT)`)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$`C6H6(GT)`
## t = -3.0363, df = 9355, p-value = 0.002401
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.05160682 -0.01112199
## sample estimates:
## cor
## -0.03137728
# linearity of `NOX(GT)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`NOx(GT)`)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$`NOx(GT)`
## t = 59.89, df = 9355, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5116457 0.5409424
## sample estimates:
## cor
## 0.5264503
# linearity of `PT08.S3(NOx)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`PT08.S3(NOx)`)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$`PT08.S3(NOx)`
## t = -8.7385, df = 9355, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.11004232 -0.06984558
## sample estimates:
## cor
## -0.08998059
# linearity of `NO2(GT)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`NO2(GT)`)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$`NO2(GT)`
## t = 87.563, df = 9355, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6598504 0.6821258
## sample estimates:
## cor
## 0.6711396
# linearity of `PT08.S4(NO2)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`PT08.S4(NO2)`)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$`PT08.S4(NO2)`
## t = -7.1498, df = 9355, p-value = 9.342e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09384273 -0.05353816
## sample estimates:
## cor
## -0.07372055
# linearity of `PT08.S5(03)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`PT08.S5(O3)`)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$`PT08.S5(O3)`
## t = 7.7934, df = 9355, p-value = 7.219e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06015113 0.10041454
## sample estimates:
## cor
## 0.0803156
# linearity of `T` to `CO(GT)`
cor.test(air$`CO(GT)`, air$T)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$T
## t = -6.685, df = 9355, p-value = 2.441e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08908964 -0.04875751
## sample estimates:
## cor
## -0.06895175
# linearity of `RH` to `CO(GT)`
cor.test(air$`CO(GT)`, air$RH)##
## Pearson's product-moment correlation
##
## data: air$`CO(GT)` and air$RH
## t = -4.6704, df = 9355, p-value = 3.049e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06842607 -0.02799558
## sample estimates:
## cor
## -0.04823058
The result of cor.test() shows that almost of the residual is not linear, because most of the residual p-value > 0.05. Only RH and C6H6(GT) has negative correlation. Then let’s we visualize it.
# prepare the data (simplify the model into a data frame)
refit <- data.frame(residuals = backward$residuals, fitted = backward$fitted.values)
# create the plot
refit %>%
ggplot(aes(fitted, residuals)) +
geom_point() +
geom_smooth() +
geom_hline(aes(yintercept = 0)) +
theme_minimal()## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Both of method shows that each residual of the model Backward is not linear.
The seconf assupmtion check is the residual of the model is distributed normal or not.
hist(backward$residuals)Because this data set has more than 5,000 rows, to check normality of residual we will use the hist() function. Histogram above shows that the distribution residual of the model is not normal.
Homoscedascity assumption checking will provide us information if the residual has pattern (heteroscedascity) or not (homoscedascity).
The hypothesis of homoscedascity are;
To check the homoscedascity assumption, we will use bptest() function from lmtest package.
bptest(backward)##
## studentized Breusch-Pagan test
##
## data: backward
## BP = 871.89, df = 10, p-value < 2.2e-16
plot(backward$fitted.values, backward$residuals)
abline(h = 0, col = "red")Result above shows that the residual of model has pattern that the residuals of the model has pattern (Heteroscedascity).
This assumption aims to check is there any residual that will overlap other residual (has correlation between each residuals). Practically, it will be better to remove the residuals that has vif result exceeding 5 or 10, because it is more possible that the residual overlap other residual.
vif(backward)## Time `NMHC(GT)` `C6H6(GT)` `NOx(GT)` `PT08.S3(NOx)`
## 1.299347 1.318610 91.379023 5.979800 4.999218
## `NO2(GT)` `PT08.S4(NO2)` `PT08.S5(O3)` T RH
## 4.000454 5.056979 9.559672 54.745166 10.356872
Result above shows that there are residual C6H6(GT) and T has the highest vif value and only 4 residual has vif value less than 5. Thus, it will be better to create new regression model that removed all the residual except Time, NMHC(GT), PT08.S3(NOx), and NO2(GT).
1. After compare the adj.r.squared and RMSE of four models that we have created, model Backward from Stepwise Regression has the smallest value amongst of all. But still we will forecast the data by all model.
2. After checking assumption of the model Backward, we know that there is no assumptions has been fulfilled. This mean although Backward model has smallest RMSE and adj.r.squared, this model is not good enough. There are several ways to improve the model.
Because all assumption is not fulfilled, there are several ways to improve our model:
Multi-colinearity => remove all residual that has vif value exceeding 5 or 10.
Heteroscedascity => there’s possibilities that affect the model has not been included, thus add several predictor that estimated affect the model into a better model than before.
Normality of Residuals & Linearity => transform the data.
We could improve our model to use the Decision Tree.
3. In the end, all we need to do is trial and error. If one method failed, we should try to use other method. After seeing the data set, there’s possibilities we could forecast the data set by other method, such as Time Series (because there are Date in the data set and the data set has frequency hourly) and Decision Tree.