library(tidyverse)
library(readxl) # load data with xlsx type
library(lubridate) # for data pre-processing (date type data, etc)
library(hms) # to convert lubridate type into hour, minute, second type
library(GGally) # your usually statistic package
library(MLmetrics)
library(lmtest)
library(car)

Introduction

knitr::include_graphics("inhale-exhale.jpg")

Fresh air! Who doesn’t love it? Imagine the times when it was raining and you inhale fresh air and the smell of rain. But, are we inhale the good quality air? Can we predict the air that we inhale has good quality or not? This repositories aims to do prediction of air quality in Italian City.

First, we need to load the data set. Because the data set is in .xlsx, I’ll use read_excel() function from readxl packages. After load the data, let’s take a peek with head() function.

air <- read_excel("AirQualityUCI.xlsx")
head(air)

After that, check the data type, are they correct?

glimpse(air)
## Rows: 9,357
## Columns: 15
## $ Date            <dttm> 2004-03-10, 2004-03-10, 2004-03-10, 2004-03-10, 20...
## $ Time            <dttm> 1899-12-31 18:00:00, 1899-12-31 19:00:00, 1899-12-...
## $ `CO(GT)`        <dbl> 2.6, 2.0, 2.2, 2.2, 1.6, 1.2, 1.2, 1.0, 0.9, 0.6, -...
## $ `PT08.S1(CO)`   <dbl> 1360.00, 1292.25, 1402.00, 1375.50, 1272.25, 1197.0...
## $ `NMHC(GT)`      <dbl> 150, 112, 88, 80, 51, 38, 31, 31, 24, 19, 14, 8, 16...
## $ `C6H6(GT)`      <dbl> 11.881723, 9.397165, 8.997817, 9.228796, 6.518224, ...
## $ `PT08.S2(NMHC)` <dbl> 1045.50, 954.75, 939.25, 948.25, 835.50, 750.25, 68...
## $ `NOx(GT)`       <dbl> 166, 103, 131, 172, 131, 89, 62, 62, 45, -200, 21, ...
## $ `PT08.S3(NOx)`  <dbl> 1056.25, 1173.75, 1140.00, 1092.00, 1205.00, 1336.5...
## $ `NO2(GT)`       <dbl> 113, 92, 114, 122, 116, 96, 77, 76, 60, -200, 34, 2...
## $ `PT08.S4(NO2)`  <dbl> 1692.00, 1558.75, 1554.50, 1583.75, 1490.00, 1393.0...
## $ `PT08.S5(O3)`   <dbl> 1267.50, 972.25, 1074.00, 1203.25, 1110.00, 949.25,...
## $ T               <dbl> 13.600, 13.300, 11.900, 11.000, 11.150, 11.175, 11....
## $ RH              <dbl> 48.875, 47.700, 53.975, 60.000, 59.575, 59.175, 56....
## $ AH              <dbl> 0.7577538, 0.7254874, 0.7502391, 0.7867125, 0.78879...

Data Cleaning

The column Time should shows time only. Thus, we need to do data cleaning before we move to the next step.

unique(air$Time)
##  [1] "1899-12-31 18:00:00 UTC" "1899-12-31 19:00:00 UTC"
##  [3] "1899-12-31 20:00:00 UTC" "1899-12-31 21:00:00 UTC"
##  [5] "1899-12-31 22:00:00 UTC" "1899-12-31 23:00:00 UTC"
##  [7] "1899-12-31 00:00:00 UTC" "1899-12-31 01:00:00 UTC"
##  [9] "1899-12-31 02:00:00 UTC" "1899-12-31 03:00:00 UTC"
## [11] "1899-12-31 04:00:00 UTC" "1899-12-31 05:00:00 UTC"
## [13] "1899-12-31 06:00:00 UTC" "1899-12-31 07:00:00 UTC"
## [15] "1899-12-31 08:00:00 UTC" "1899-12-31 09:00:00 UTC"
## [17] "1899-12-31 10:00:00 UTC" "1899-12-31 11:00:00 UTC"
## [19] "1899-12-31 12:00:00 UTC" "1899-12-31 13:00:00 UTC"
## [21] "1899-12-31 14:00:00 UTC" "1899-12-31 15:00:00 UTC"
## [23] "1899-12-31 16:00:00 UTC" "1899-12-31 17:00:00 UTC"

This mean the date on column Time is a dummy date. The content of Time will be ok if only represent the hour. Thus, we need to change it by using hour() function by lubridate packages.

air <- air %>% 
  mutate(Time = hour(Time))

glimpse(air)
## Rows: 9,357
## Columns: 15
## $ Date            <dttm> 2004-03-10, 2004-03-10, 2004-03-10, 2004-03-10, 20...
## $ Time            <int> 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, ...
## $ `CO(GT)`        <dbl> 2.6, 2.0, 2.2, 2.2, 1.6, 1.2, 1.2, 1.0, 0.9, 0.6, -...
## $ `PT08.S1(CO)`   <dbl> 1360.00, 1292.25, 1402.00, 1375.50, 1272.25, 1197.0...
## $ `NMHC(GT)`      <dbl> 150, 112, 88, 80, 51, 38, 31, 31, 24, 19, 14, 8, 16...
## $ `C6H6(GT)`      <dbl> 11.881723, 9.397165, 8.997817, 9.228796, 6.518224, ...
## $ `PT08.S2(NMHC)` <dbl> 1045.50, 954.75, 939.25, 948.25, 835.50, 750.25, 68...
## $ `NOx(GT)`       <dbl> 166, 103, 131, 172, 131, 89, 62, 62, 45, -200, 21, ...
## $ `PT08.S3(NOx)`  <dbl> 1056.25, 1173.75, 1140.00, 1092.00, 1205.00, 1336.5...
## $ `NO2(GT)`       <dbl> 113, 92, 114, 122, 116, 96, 77, 76, 60, -200, 34, 2...
## $ `PT08.S4(NO2)`  <dbl> 1692.00, 1558.75, 1554.50, 1583.75, 1490.00, 1393.0...
## $ `PT08.S5(O3)`   <dbl> 1267.50, 972.25, 1074.00, 1203.25, 1110.00, 949.25,...
## $ T               <dbl> 13.600, 13.300, 11.900, 11.000, 11.150, 11.175, 11....
## $ RH              <dbl> 48.875, 47.700, 53.975, 60.000, 59.575, 59.175, 56....
## $ AH              <dbl> 0.7577538, 0.7254874, 0.7502391, 0.7867125, 0.78879...

Ok, the data type is correct. But what does columns represent for?

  • Date: represent date (DD/MM/YYYY)
  • Time: represent hour only.
  • CO (GT): True hourly averaged concentration CO in mg/m^3 (reference analyzer)
  • PT08.S1 (CO): (tin oxide) hourly averaged sensor response (nominally CO targeted)
  • NMHC (GT): True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)
  • C6H6 (GT): True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
  • PT08.S2(NMHC): (titania) hourly averaged sensor response (nominally NMHC targeted)
  • NOx (GT): True hourly averaged NOx concentration in ppb (reference analyzer)
  • PT08.S3(NOx): (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
  • NO2 (GT): True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
  • PT08.S4 (NO2): (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
  • PT08.S5(O3): (indium oxide) hourly averaged sensor response (nominally O3 targeted)
  • T: Temperature in °C
  • RH: Relative Humidity (%)
  • AH: Absolute Humidity
colSums(is.na(air))
##          Date          Time        CO(GT)   PT08.S1(CO)      NMHC(GT) 
##             0             0             0             0             0 
##      C6H6(GT) PT08.S2(NMHC)       NOx(GT)  PT08.S3(NOx)       NO2(GT) 
##             0             0             0             0             0 
##  PT08.S4(NO2)   PT08.S5(O3)             T            RH            AH 
##             0             0             0             0             0
ggcorr(air, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)
## Warning in ggcorr(air, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2):
## data in column(s) 'Date' are not numeric and were ignored

Correlation graph above shows that there are positive correlation between CO(GT) and NOx(GT) and NO2(GT). Other residual almost has no correlation to CO(GT).

Linear Regression Model

To do prediction, we will create three model of regression model; Simple Linear Regression, Linear Regression (one predictor), and Multiple Linear Regression (Stepwise Regression).

Simple Linear Regression

Simple Linear Regression here means we will create a model regression without any predictor. It will be represent as lm1.

lm1 <- lm(`CO(GT)` ~ 1, air)

summary(lm1)
## 
## Call:
## lm(formula = `CO(GT)` ~ 1, data = air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -165.79   34.81   35.71   36.81   46.11 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -34.2075     0.8028  -42.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.66 on 9356 degrees of freedom

Linear Regression

Next, I will create Linear Regression Model with PT08.S3(NOx) as predictor and assign it into lm2

lm2 <- lm(`CO(GT)` ~ `NO2(GT)`, air)

summary(lm2)
## 
## Call:
## lm(formula = `CO(GT)` ~ `NO2(GT)`, data = air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -226.34  -13.50   11.22   25.82  146.80 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -58.078469   0.654643  -88.72   <2e-16 ***
## `NO2(GT)`     0.410606   0.004689   87.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.57 on 9355 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4504 
## F-statistic:  7667 on 1 and 9355 DF,  p-value: < 2.2e-16

Multiple Linear Regression

In this section, we will create 2 variant of multiple linear regression models. First, create multiple regression model with 2 predictor and assign it as lm3.

# create multiple linear regression with 2 predictor
lm3 <- lm(`CO(GT)` ~ `NMHC(GT)` + `NOx(GT)`, data = air)

summary(lm3)
## 
## Call:
## lm(formula = `CO(GT)` ~ `NMHC(GT)` + `NOx(GT)`, data = air)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -303.27  -28.13   24.33   45.78  102.38 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -49.464420   1.111996  -44.48   <2e-16 ***
## `NMHC(GT)`    0.072595   0.004826   15.04   <2e-16 ***
## `NOx(GT)`     0.158988   0.002620   60.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.25 on 9354 degrees of freedom
## Multiple R-squared:  0.2942, Adjusted R-squared:  0.2941 
## F-statistic:  1950 on 2 and 9354 DF,  p-value: < 2.2e-16

Result above shows that: The predictors (NMHC(GT) and NOx(GT)) has correlation to the target (CO(GT)).

Multiple Regression - Stepwise Regression

In this section, I will try to use Stepwise Regression - Backward method. We will create a multiple linear regression that use all predictor and assign it into lm4.

# create multiple linear regression with all predictor 
lm4 <- lm(`CO(GT)` ~ ., air)

# stepwise - backward
backward <- step(object = lm4, direction = "backward")
## Start:  AIC=75453.18
## `CO(GT)` ~ Date + Time + `PT08.S1(CO)` + `NMHC(GT)` + `C6H6(GT)` + 
##     `PT08.S2(NMHC)` + `NOx(GT)` + `PT08.S3(NOx)` + `NO2(GT)` + 
##     `PT08.S4(NO2)` + `PT08.S5(O3)` + T + RH + AH
## 
##                   Df Sum of Sq      RSS   AIC
## - AH               1        36 29635898 75451
## - `PT08.S2(NMHC)`  1        73 29635934 75451
## - `PT08.S1(CO)`    1       122 29635983 75451
## - Date             1      1202 29637063 75452
## - `C6H6(GT)`       1      3956 29639817 75452
## - `PT08.S4(NO2)`   1      4186 29640047 75453
## <none>                         29635861 75453
## - `PT08.S3(NOx)`   1      8696 29644557 75454
## - T                1     21511 29657372 75458
## - RH               1     25832 29661693 75459
## - `PT08.S5(O3)`    1     33828 29669689 75462
## - Time             1     59483 29695345 75470
## - `NOx(GT)`        1    112402 29748263 75487
## - `NMHC(GT)`       1    230075 29865936 75524
## - `NO2(GT)`        1   4841655 34477516 76867
## 
## Step:  AIC=75451.19
## `CO(GT)` ~ Date + Time + `PT08.S1(CO)` + `NMHC(GT)` + `C6H6(GT)` + 
##     `PT08.S2(NMHC)` + `NOx(GT)` + `PT08.S3(NOx)` + `NO2(GT)` + 
##     `PT08.S4(NO2)` + `PT08.S5(O3)` + T + RH
## 
##                   Df Sum of Sq      RSS   AIC
## - `PT08.S1(CO)`    1       122 29636020 75449
## - `PT08.S2(NMHC)`  1       277 29636175 75449
## - Date             1      1321 29637218 75450
## - `PT08.S4(NO2)`   1      6125 29642023 75451
## <none>                         29635898 75451
## - `PT08.S3(NOx)`   1     10442 29646340 75452
## - `C6H6(GT)`       1     25420 29661318 75457
## - `PT08.S5(O3)`    1     34340 29670238 75460
## - T                1     42764 29678662 75463
## - RH               1     42982 29678879 75463
## - Time             1     60707 29696604 75468
## - `NOx(GT)`        1    120771 29756669 75487
## - `NMHC(GT)`       1    230235 29866132 75522
## - `NO2(GT)`        1   5281829 34917727 76984
## 
## Step:  AIC=75449.23
## `CO(GT)` ~ Date + Time + `NMHC(GT)` + `C6H6(GT)` + `PT08.S2(NMHC)` + 
##     `NOx(GT)` + `PT08.S3(NOx)` + `NO2(GT)` + `PT08.S4(NO2)` + 
##     `PT08.S5(O3)` + T + RH
## 
##                   Df Sum of Sq      RSS   AIC
## - `PT08.S2(NMHC)`  1       289 29636309 75447
## - Date             1      1238 29637258 75448
## - `PT08.S4(NO2)`   1      6038 29642058 75449
## <none>                         29636020 75449
## - `PT08.S3(NOx)`   1     10321 29646341 75450
## - `C6H6(GT)`       1     25604 29661624 75455
## - `PT08.S5(O3)`    1     38851 29674871 75459
## - T                1     42837 29678857 75461
## - RH               1     43461 29679481 75461
## - Time             1     61723 29697743 75467
## - `NOx(GT)`        1    121401 29757421 75485
## - `NMHC(GT)`       1    269162 29905182 75532
## - `NO2(GT)`        1   5283128 34919148 76982
## 
## Step:  AIC=75447.32
## `CO(GT)` ~ Date + Time + `NMHC(GT)` + `C6H6(GT)` + `NOx(GT)` + 
##     `PT08.S3(NOx)` + `NO2(GT)` + `PT08.S4(NO2)` + `PT08.S5(O3)` + 
##     T + RH
## 
##                  Df Sum of Sq      RSS   AIC
## - Date            1      1299 29637607 75446
## <none>                        29636309 75447
## - `PT08.S4(NO2)`  1      8302 29644611 75448
## - `PT08.S3(NOx)`  1     12051 29648360 75449
## - `PT08.S5(O3)`   1     39741 29676050 75458
## - Time            1     63297 29699606 75465
## - `C6H6(GT)`      1     70898 29707207 75468
## - T               1     72176 29708485 75468
## - RH              1    100075 29736384 75477
## - `NOx(GT)`       1    135979 29772288 75488
## - `NMHC(GT)`      1    270331 29906640 75530
## - `NO2(GT)`       1   5575237 35211546 77058
## 
## Step:  AIC=75445.73
## `CO(GT)` ~ Time + `NMHC(GT)` + `C6H6(GT)` + `NOx(GT)` + `PT08.S3(NOx)` + 
##     `NO2(GT)` + `PT08.S4(NO2)` + `PT08.S5(O3)` + T + RH
## 
##                  Df Sum of Sq      RSS   AIC
## <none>                        29637607 75446
## - `PT08.S4(NO2)`  1      9665 29647273 75447
## - `PT08.S3(NOx)`  1     25077 29662685 75452
## - `PT08.S5(O3)`   1     39025 29676633 75456
## - Time            1     62639 29700246 75463
## - T               1     72058 29709665 75466
## - `C6H6(GT)`      1     78757 29716365 75469
## - RH              1     99904 29737511 75475
## - `NOx(GT)`       1    136808 29774416 75487
## - `NMHC(GT)`      1    333117 29970724 75548
## - `NO2(GT)`       1   5657173 35294781 77078
# summary the backward
summary(backward)
## 
## Call:
## lm(formula = `CO(GT)` ~ Time + `NMHC(GT)` + `C6H6(GT)` + `NOx(GT)` + 
##     `PT08.S3(NOx)` + `NO2(GT)` + `PT08.S4(NO2)` + `PT08.S5(O3)` + 
##     T + RH, data = air)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -220.984   -8.926   10.349   23.820  187.541 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -50.421055   7.913484  -6.372 1.96e-10 ***
## Time            -0.426023   0.095856  -4.444 8.92e-06 ***
## `NMHC(GT)`       0.049016   0.004782  10.249  < 2e-16 ***
## `C6H6(GT)`      -0.670241   0.134491  -4.984 6.36e-07 ***
## `NOx(GT)`        0.036325   0.005530   6.568 5.36e-11 ***
## `PT08.S3(NOx)`   0.011369   0.004043   2.812 0.004932 ** 
## `NO2(GT)`        0.387472   0.009174  42.237  < 2e-16 ***
## `PT08.S4(NO2)`  -0.004892   0.002802  -1.746 0.080874 .  
## `PT08.S5(O3)`   -0.013820   0.003940  -3.508 0.000454 ***
## T                0.475280   0.099705   4.767 1.90e-06 ***
## RH               0.205332   0.036583   5.613 2.05e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56.31 on 9346 degrees of freedom
## Multiple R-squared:  0.4747, Adjusted R-squared:  0.4742 
## F-statistic: 844.6 on 10 and 9346 DF,  p-value: < 2.2e-16

Compare the adj.r.squared

We have create several model of linear regression, let’s we compare each models to know which model is the best one by adj.r.squared!

summary(lm1)$adj.r.squared # model without predictor
## [1] 0
summary(lm2)$adj.r.squared # model with 1 preductor
## [1] 0.4503696
summary(lm3)$adj.r.squared # model with 2 predictor
## [1] 0.2940751
summary(backward)$adj.r.squared # model using stepwise - backward
## [1] 0.4741592

Result above shows that it is recommended to use the model with backward regression, but we will predict all model that we have made.

Forecast! and Model Evaluation

Forecast!

Let’s forecast the model we had and add new column that represent the forecast result of each model in air.

air$predict1 <- predict(lm1, air) # prediction with lm1
air$predict2 <- predict(lm2, air) # prediction with lm2
air$predict3 <- predict(lm3, air) # prediction with lm3
air$predict4 <- predict(backward, air) # prediction with stepwise-backward

head(air)

After add the forecast result into new column of each model, we will evaluate each model by calculate the **root mean square error using RMSE() function by MLmetric package, to see how well the models that we have created.

Model Evaluation

# model evaluation by RMSE
RMSE(y_pred = air$predict1, y_true = air$`CO(GT)`) # 0 predictor
## [1] 77.65302
RMSE(y_pred = air$predict2, y_true = air$`CO(GT)`) # 1 predictor
## [1] 57.56659
RMSE(y_pred = air$predict3, y_true = air$`CO(GT)`) # 2 predictor
## [1] 65.23658
RMSE(y_pred = air$predict4, y_true = air$`CO(GT)`) # 10 predictor (backward)
## [1] 56.27989

The result of RMSE shows that to forecast the value of CO(GT), it will be better to use model from Stepwise Regression - Backward method. The Backward model has smallest error amongst of other model with RMSE of 56.28. After knowing which model we’ll be use to forecast, we need to check the assumption of the model. Next section will do assumption checking of the backward model.

Evaluation (Assumption Checking)

In this section we need to check assumption on models we have created. There are 4 assumption that we need to check;

1. Linearity Check

There are 2 ways to check the linearity of the model. Visualize it or by using cor.test() function. We will do both of them to do check and re-check of both method.

a. Using cor.test() Function:

# linearity of `Time` to `CO(GT)`
cor.test(air$`CO(GT)`, air$Time)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$Time
## t = 7.3705, df = 9355, p-value = 1.844e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05580707 0.09609791
## sample estimates:
##       cor 
## 0.0759835
# linearity of `NMHC(GT)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`NMHC(GT)`)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$`NMHC(GT)`
## t = 12.518, df = 9355, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1083706 0.1482280
## sample estimates:
##       cor 
## 0.1283512
# linearity of `C6H6(GT)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`C6H6(GT)`)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$`C6H6(GT)`
## t = -3.0363, df = 9355, p-value = 0.002401
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05160682 -0.01112199
## sample estimates:
##         cor 
## -0.03137728
# linearity of `NOX(GT)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`NOx(GT)`)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$`NOx(GT)`
## t = 59.89, df = 9355, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5116457 0.5409424
## sample estimates:
##       cor 
## 0.5264503
# linearity of `PT08.S3(NOx)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`PT08.S3(NOx)`)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$`PT08.S3(NOx)`
## t = -8.7385, df = 9355, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.11004232 -0.06984558
## sample estimates:
##         cor 
## -0.08998059
# linearity of `NO2(GT)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`NO2(GT)`)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$`NO2(GT)`
## t = 87.563, df = 9355, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6598504 0.6821258
## sample estimates:
##       cor 
## 0.6711396
# linearity of `PT08.S4(NO2)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`PT08.S4(NO2)`)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$`PT08.S4(NO2)`
## t = -7.1498, df = 9355, p-value = 9.342e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09384273 -0.05353816
## sample estimates:
##         cor 
## -0.07372055
# linearity of `PT08.S5(03)` to `CO(GT)`
cor.test(air$`CO(GT)`, air$`PT08.S5(O3)`)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$`PT08.S5(O3)`
## t = 7.7934, df = 9355, p-value = 7.219e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06015113 0.10041454
## sample estimates:
##       cor 
## 0.0803156
# linearity of `T` to `CO(GT)`
cor.test(air$`CO(GT)`, air$T)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$T
## t = -6.685, df = 9355, p-value = 2.441e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08908964 -0.04875751
## sample estimates:
##         cor 
## -0.06895175
# linearity of `RH` to `CO(GT)`
cor.test(air$`CO(GT)`, air$RH)
## 
##  Pearson's product-moment correlation
## 
## data:  air$`CO(GT)` and air$RH
## t = -4.6704, df = 9355, p-value = 3.049e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06842607 -0.02799558
## sample estimates:
##         cor 
## -0.04823058

The result of cor.test() shows that almost of the residual is not linear, because most of the residual p-value > 0.05. Only RH and C6H6(GT) has negative correlation. Then let’s we visualize it.

b. Visualize!

# prepare the data (simplify the model into a data frame)
refit <- data.frame(residuals = backward$residuals, fitted = backward$fitted.values)

# create the plot
refit %>% 
  ggplot(aes(fitted, residuals)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0)) +
  theme_minimal()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Both of method shows that each residual of the model Backward is not linear.

2. Normality of Residual

The seconf assupmtion check is the residual of the model is distributed normal or not.

hist(backward$residuals)

Because this data set has more than 5,000 rows, to check normality of residual we will use the hist() function. Histogram above shows that the distribution residual of the model is not normal.

3. Homoscedascity

Homoscedascity assumption checking will provide us information if the residual has pattern (heteroscedascity) or not (homoscedascity).

The hypothesis of homoscedascity are;

  • H0: Homoscedasticity (has no pattern) => p-value > 0.05
  • H1: Heteroscedasticity (has pattern) => p-value < 0.05

To check the homoscedascity assumption, we will use bptest() function from lmtest package.

bptest(backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  backward
## BP = 871.89, df = 10, p-value < 2.2e-16
plot(backward$fitted.values, backward$residuals)
abline(h = 0, col = "red")

Result above shows that the residual of model has pattern that the residuals of the model has pattern (Heteroscedascity).

4. No-Multicolinearity

This assumption aims to check is there any residual that will overlap other residual (has correlation between each residuals). Practically, it will be better to remove the residuals that has vif result exceeding 5 or 10, because it is more possible that the residual overlap other residual.

vif(backward)
##           Time     `NMHC(GT)`     `C6H6(GT)`      `NOx(GT)` `PT08.S3(NOx)` 
##       1.299347       1.318610      91.379023       5.979800       4.999218 
##      `NO2(GT)` `PT08.S4(NO2)`  `PT08.S5(O3)`              T             RH 
##       4.000454       5.056979       9.559672      54.745166      10.356872

Result above shows that there are residual C6H6(GT) and T has the highest vif value and only 4 residual has vif value less than 5. Thus, it will be better to create new regression model that removed all the residual except Time, NMHC(GT), PT08.S3(NOx), and NO2(GT).

Conclusion

1. After compare the adj.r.squared and RMSE of four models that we have created, model Backward from Stepwise Regression has the smallest value amongst of all. But still we will forecast the data by all model.

2. After checking assumption of the model Backward, we know that there is no assumptions has been fulfilled. This mean although Backward model has smallest RMSE and adj.r.squared, this model is not good enough. There are several ways to improve the model.

2.1. Model Improvement

Because all assumption is not fulfilled, there are several ways to improve our model:

  1. Multi-colinearity => remove all residual that has vif value exceeding 5 or 10.

  2. Heteroscedascity => there’s possibilities that affect the model has not been included, thus add several predictor that estimated affect the model into a better model than before.

  3. Normality of Residuals & Linearity => transform the data.

  4. We could improve our model to use the Decision Tree.

3. In the end, all we need to do is trial and error. If one method failed, we should try to use other method. After seeing the data set, there’s possibilities we could forecast the data set by other method, such as Time Series (because there are Date in the data set and the data set has frequency hourly) and Decision Tree.