Performing Pricncipal Component Analysis & Factor Analysis and connecting the factors to do a regression

In this case study we are trying to do a factor analysis of the delhi pollution dataset and then put those factors into regression to see how the model behaves.

Load the dataset

data <- read.csv("delhi_pollution_data.csv")

Check Summary

summary(data)
##       SrNo               Date           NO               CO        
##  Min.   :  1.0   01-01-2016:  1   Min.   :  1.12   Min.   : 0.220  
##  1st Qu.:101.5   01-02-2016:  1   1st Qu.: 11.38   1st Qu.: 1.202  
##  Median :202.0   01-03-2016:  1   Median : 39.22   Median : 1.780  
##  Mean   :202.0   01-05-2015:  1   Mean   : 81.32   Mean   : 2.210  
##  3rd Qu.:302.5   01-06-2015:  1   3rd Qu.:118.03   3rd Qu.: 2.737  
##  Max.   :403.0   01-07-2015:  1   Max.   :480.90   Max.   :19.900  
##                  (Other)   :397   NA's   :2        NA's   :1       
##       NO2               O3              SO2              PM2.5       
##  Min.   : 25.27   Min.   :  4.98   Min.   :  0.000   Min.   : 18.75  
##  1st Qu.: 54.50   1st Qu.: 30.12   1st Qu.:  9.515   1st Qu.: 68.11  
##  Median : 73.68   Median : 53.21   Median : 21.680   Median :108.24  
##  Mean   : 74.43   Mean   : 56.30   Mean   : 25.988   Mean   :130.16  
##  3rd Qu.: 91.43   3rd Qu.: 76.85   3rd Qu.: 37.145   3rd Qu.:166.16  
##  Max.   :149.01   Max.   :159.70   Max.   :371.750   Max.   :550.23  
##  NA's   :2        NA's   :6        NA's   :4         NA's   :2       
##     Benzene          Toulene          P_Xylene           NOx        
##  Min.   : 0.160   Min.   : 0.000   Min.   : 0.300   Min.   : 26.49  
##  1st Qu.: 3.235   1st Qu.: 9.695   1st Qu.: 3.382   1st Qu.: 87.67  
##  Median : 5.910   Median :16.240   Median : 6.060   Median :152.31  
##  Mean   : 6.526   Mean   :17.192   Mean   : 8.129   Mean   :194.68  
##  3rd Qu.: 8.940   3rd Qu.:22.973   3rd Qu.:10.570   3rd Qu.:264.38  
##  Max.   :17.510   Max.   :48.410   Max.   :32.350   Max.   :743.70  
##  NA's   :1        NA's   :1        NA's   :31       NA's   :2       
##       PM10        WindDirection        NH3               RH       
##  Min.   : 36.63   Min.   :126.7   Min.   :  1.00   Min.   :15.19  
##  1st Qu.:163.66   1st Qu.:180.4   1st Qu.: 25.90   1st Qu.:41.58  
##  Median :249.67   Median :209.4   Median : 34.70   Median :53.87  
##  Mean   :265.25   Mean   :202.9   Mean   : 39.63   Mean   :52.15  
##  3rd Qu.:353.27   3rd Qu.:228.9   3rd Qu.: 50.72   3rd Qu.:63.85  
##  Max.   :705.70   Max.   :258.1   Max.   :144.82   Max.   :86.82  
##  NA's   :2        NA's   :1       NA's   :2        NA's   :1      
##       Temp         WindSpeed      VerticalWindSpeed     Solar       
##  Min.   : 6.56   Min.   :0.3000   Min.   :-2.8700   Min.   :  3.31  
##  1st Qu.:18.82   1st Qu.:0.8525   1st Qu.:-0.1700   1st Qu.: 84.90  
##  Median :26.57   Median :1.1600   Median :-0.0500   Median :121.85  
##  Mean   :24.43   Mean   :1.2380   Mean   : 0.1889   Mean   :112.95  
##  3rd Qu.:29.86   3rd Qu.:1.5200   3rd Qu.: 0.4000   3rd Qu.:141.03  
##  Max.   :41.05   Max.   :3.4500   Max.   : 3.5600   Max.   :345.68  
##  NA's   :2       NA's   :1        NA's   :2         NA's   :2       
##   BarPressure       Weather       PD_PM2.5         PD_PM10      
##  Min.   :721.7   Autumn : 88   Min.   : 18.75   Min.   : 36.63  
##  1st Qu.:731.7   Monsoon: 62   1st Qu.: 68.05   1st Qu.:163.38  
##  Median :733.4   Spring : 58   Median :108.24   Median :250.10  
##  Mean   :732.6   Summer :140   Mean   :130.51   Mean   :265.62  
##  3rd Qu.:733.8   Winter : 55   3rd Qu.:170.99   3rd Qu.:357.95  
##  Max.   :736.1                 Max.   :550.23   Max.   :705.70  
##  NA's   :2                     NA's   :10       NA's   :11      
##      PD_NO2           PD_SO2            PD_CO       
##  Min.   : 25.27   Min.   :  0.000   Min.   : 0.370  
##  1st Qu.: 54.03   1st Qu.:  9.412   1st Qu.: 1.208  
##  Median : 73.52   Median : 21.605   Median : 1.775  
##  Mean   : 74.32   Mean   : 25.965   Mean   : 2.201  
##  3rd Qu.: 91.37   3rd Qu.: 37.197   3rd Qu.: 2.667  
##  Max.   :149.01   Max.   :371.750   Max.   :19.900  
##  NA's   :12       NA's   :13        NA's   :11

Check Structure

str(data)
## 'data.frame':    403 obs. of  27 variables:
##  $ SrNo             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date             : Factor w/ 403 levels "01-01-2016","01-02-2016",..: 39 53 105 119 133 146 159 173 187 200 ...
##  $ NO               : num  7.22 6.99 7.6 7.57 8.34 6.89 6.89 7.21 7.16 6.64 ...
##  $ CO               : num  1.77 0.22 0.5 0.77 0.48 0.41 0.37 1.25 1.81 1.58 ...
##  $ NO2              : num  47.9 45.3 59.9 63.6 62 ...
##  $ O3               : num  51.1 19.3 94.3 66.9 69.5 ...
##  $ SO2              : num  16.9 16.7 13.1 16.2 20.3 ...
##  $ PM2.5            : num  49 60.2 46.9 113 104.9 ...
##  $ Benzene          : num  2.53 3.19 2.29 3.92 5.19 2.12 2.29 2.36 3.03 2.91 ...
##  $ Toulene          : num  9.65 11.1 8.61 10.76 15.95 ...
##  $ P_Xylene         : num  3 2.67 3.43 4.66 7.66 2.47 2.8 2.57 4.32 2.75 ...
##  $ NOx              : num  53 51.3 65.5 68.8 67.4 ...
##  $ PM10             : num  82.8 113.5 171.4 232.2 235.1 ...
##  $ WindDirection    : num  142 166 220 233 221 ...
##  $ NH3              : num  26.5 31 17.4 25.9 29.5 ...
##  $ RH               : num  61.3 75.5 33.8 43 40.7 ...
##  $ Temp             : num  20.2 16.9 26.6 25.2 26.5 ...
##  $ WindSpeed        : num  1.22 0.62 1.55 1.18 0.88 1.56 1.35 1.3 0.37 1.45 ...
##  $ VerticalWindSpeed: num  0.08 -0.04 -0.17 -0.15 0.15 -0.16 -0.13 0.35 1.39 0.34 ...
##  $ Solar            : num  162.2 99.4 146.9 150.1 137 ...
##  $ BarPressure      : num  732 734 728 730 731 ...
##  $ Weather          : Factor w/ 5 levels "Autumn","Monsoon",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ PD_PM2.5         : num  NA 49 NA 46.9 113 ...
##  $ PD_PM10          : num  NA 82.8 NA 171.4 232.2 ...
##  $ PD_NO2           : num  NA 47.9 NA 59.9 63.6 ...
##  $ PD_SO2           : num  NA 16.9 NA 13.1 16.2 ...
##  $ PD_CO            : num  NA 1.77 NA 0.5 0.77 0.48 0.41 0.37 1.25 1.81 ...

Find Missing Values

sum(is.na(data))
## [1] 124

Now that we know there are missing values in the data, let’s find Missing Values per variable

Find Missing Values per Variable

na_count <-sapply(data, function(y) sum(length(which(is.na(y)))))   #this is a fixed syntax, only things changes here is the name of the dataset, you can use this in any case study.
na_count <- as.data.frame(na_count)
print(na_count)
##                   na_count
## SrNo                     0
## Date                     0
## NO                       2
## CO                       1
## NO2                      2
## O3                       6
## SO2                      4
## PM2.5                    2
## Benzene                  1
## Toulene                  1
## P_Xylene                31
## NOx                      2
## PM10                     2
## WindDirection            1
## NH3                      2
## RH                       1
## Temp                     2
## WindSpeed                1
## VerticalWindSpeed        2
## Solar                    2
## BarPressure              2
## Weather                  0
## PD_PM2.5                10
## PD_PM10                 11
## PD_NO2                  12
## PD_SO2                  13
## PD_CO                   11

Treat Missing Values

We are deleting all observations where we have missing values for this case

data <- na.omit(data)

Scale the variables

Scaling will help in standardising the data and it works only with numerical variables.

data_scaled <- data[, c(3:21,23:27)] # Filtered only Numeric/Integer variables
data_scaled <- scale(data_scaled)
data_scaled <- as.data.frame(data_scaled) # Convert the Scaled Dataset into Data Frame

Perform MLR without PCA/FA

We are assuming PM2.5 is the Response variable for us in this case

mod <- lm(PM2.5 ~ ., data = data_scaled)
summary(mod)
## 
## Call:
## lm(formula = PM2.5 ~ ., data = data_scaled)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00454 -0.13477  0.00270  0.09747  2.88892 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5.226e-16  1.554e-02   0.000  1.00000    
## NO                 2.250e-02  7.871e-02   0.286  0.77518    
## CO                 2.142e-02  2.371e-02   0.903  0.36695    
## NO2                3.761e-02  3.343e-02   1.125  0.26136    
## O3                 7.653e-04  2.474e-02   0.031  0.97534    
## SO2                1.398e-02  2.301e-02   0.608  0.54393    
## Benzene            3.720e-01  7.763e-02   4.793 2.52e-06 ***
## Toulene           -1.191e-01  8.332e-02  -1.429  0.15386    
## P_Xylene          -2.059e-01  8.558e-02  -2.406  0.01668 *  
## NOx                5.016e-02  6.851e-02   0.732  0.46460    
## PM10               5.553e-01  3.563e-02  15.585  < 2e-16 ***
## WindDirection      4.169e-04  1.899e-02   0.022  0.98250    
## NH3                2.040e-01  3.325e-02   6.135 2.48e-09 ***
## RH                 2.009e-02  4.428e-02   0.454  0.65032    
## Temp               2.693e-02  4.195e-02   0.642  0.52141    
## WindSpeed         -7.190e-02  2.650e-02  -2.713  0.00703 ** 
## VerticalWindSpeed -7.904e-03  1.996e-02  -0.396  0.69239    
## Solar              2.646e-02  3.030e-02   0.873  0.38322    
## BarPressure       -3.248e-02  2.363e-02  -1.374  0.17030    
## PD_PM2.5           2.955e-01  4.817e-02   6.135 2.49e-09 ***
## PD_PM10           -1.092e-01  4.538e-02  -2.406  0.01671 *  
## PD_NO2            -1.087e-01  2.728e-02  -3.984 8.39e-05 ***
## PD_SO2            -4.368e-03  2.865e-02  -0.152  0.87891    
## PD_CO              5.575e-03  2.106e-02   0.265  0.79143    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2895 on 323 degrees of freedom
## Multiple R-squared:  0.9217, Adjusted R-squared:  0.9162 
## F-statistic: 165.4 on 23 and 323 DF,  p-value: < 2.2e-16

Let’s perform PCA/FA and then do the regression to see the change

Let’s Remove the Response Variable (PM2.5) before doing PCA/FA.

names(data_scaled)
##  [1] "NO"                "CO"                "NO2"              
##  [4] "O3"                "SO2"               "PM2.5"            
##  [7] "Benzene"           "Toulene"           "P_Xylene"         
## [10] "NOx"               "PM10"              "WindDirection"    
## [13] "NH3"               "RH"                "Temp"             
## [16] "WindSpeed"         "VerticalWindSpeed" "Solar"            
## [19] "BarPressure"       "PD_PM2.5"          "PD_PM10"          
## [22] "PD_NO2"            "PD_SO2"            "PD_CO"
pollution_data <- data_scaled[, -6]  # Removing PM2.5 which is the 6th Variable

Check for Correlation between the remaining X variables

matrix <- cor(data_scaled)
print(matrix)
##                            NO          CO        NO2           O3
## NO                 1.00000000  0.67696029  0.5812679  0.120721119
## CO                 0.67696029  1.00000000  0.4133421 -0.053396186
## NO2                0.58126790  0.41334210  1.0000000  0.375333327
## O3                 0.12072112 -0.05339619  0.3753333  1.000000000
## SO2                0.39118329  0.22670798  0.3612119  0.329173213
## PM2.5              0.70463047  0.55417088  0.5795105  0.067004489
## Benzene            0.77305833  0.64672782  0.5461371 -0.109079794
## Toulene            0.77517892  0.62662264  0.5284132 -0.114901158
## P_Xylene           0.86604077  0.68608766  0.5137107 -0.074955118
## NOx                0.94940445  0.69152935  0.5879553  0.041109820
## PM10               0.67331264  0.47777436  0.5760809  0.183899056
## WindDirection      0.30335980  0.07026543  0.1814424  0.063300864
## NH3                0.62129889  0.49641984  0.5304001 -0.001958607
## RH                 0.14705025  0.25194487 -0.1197191 -0.612502424
## Temp              -0.48205483 -0.38782014 -0.3498939  0.107655790
## WindSpeed         -0.57909211 -0.44140548 -0.6106899 -0.011432555
## VerticalWindSpeed -0.05062095 -0.08169549  0.1377352  0.144604092
## Solar             -0.38751784 -0.44292191 -0.3394173  0.281918705
## BarPressure        0.39035928  0.21819739  0.2773735  0.209755524
## PD_PM2.5           0.60674886  0.49776287  0.4792056  0.061286921
## PD_PM10            0.55552959  0.42245257  0.4815410  0.166941310
## PD_NO2             0.49249620  0.34669363  0.7287087  0.322007022
## PD_SO2             0.49745230  0.30763592  0.4837342  0.370163792
## PD_CO              0.51037323  0.48592215  0.3155569 -0.106288795
##                           SO2       PM2.5     Benzene     Toulene
## NO                 0.39118329  0.70463047  0.77305833  0.77517892
## CO                 0.22670798  0.55417088  0.64672782  0.62662264
## NO2                0.36121189  0.57951054  0.54613710  0.52841323
## O3                 0.32917321  0.06700449 -0.10907979 -0.11490116
## SO2                1.00000000  0.37171324  0.29119218  0.24287386
## PM2.5              0.37171324  1.00000000  0.77516259  0.64730833
## Benzene            0.29119218  0.77516259  1.00000000  0.94894064
## Toulene            0.24287386  0.64730833  0.94894064  1.00000000
## P_Xylene           0.26278805  0.67903498  0.91812642  0.94978302
## NOx                0.32330575  0.64263526  0.77105018  0.79443858
## PM10               0.38643914  0.89286260  0.62730189  0.50773894
## WindDirection      0.22286092  0.29795269  0.23360704  0.20006514
## NH3                0.31317002  0.78947864  0.75652897  0.66175021
## RH                -0.14571485  0.17245031  0.44906696  0.44240742
## Temp              -0.29537661 -0.61164052 -0.64978331 -0.52427868
## WindSpeed         -0.20051890 -0.51683032 -0.68544770 -0.72367197
## VerticalWindSpeed -0.07283954 -0.07481952 -0.09556371 -0.07907504
## Solar             -0.13051213 -0.52053877 -0.60774868 -0.52748322
## BarPressure        0.40189299  0.43110477  0.44613277  0.39384610
## PD_PM2.5           0.31348411  0.84058878  0.71231859  0.58906063
## PD_PM10            0.31946031  0.77639440  0.57753277  0.45458289
## PD_NO2             0.30368321  0.50504787  0.47794333  0.44728494
## PD_SO2             0.70488667  0.51549579  0.39044229  0.33616007
## PD_CO              0.15642631  0.50485710  0.57254792  0.53438208
##                      P_Xylene         NOx        PM10 WindDirection
## NO                 0.86604077  0.94940445  0.67331264   0.303359796
## CO                 0.68608766  0.69152935  0.47777436   0.070265433
## NO2                0.51371066  0.58795526  0.57608095   0.181442450
## O3                -0.07495512  0.04110982  0.18389906   0.063300864
## SO2                0.26278805  0.32330575  0.38643914   0.222860919
## PM2.5              0.67903498  0.64263526  0.89286260   0.297952686
## Benzene            0.91812642  0.77105018  0.62730189   0.233607037
## Toulene            0.94978302  0.79443858  0.50773894   0.200065138
## P_Xylene           1.00000000  0.86401685  0.57462439   0.222664244
## NOx                0.86401685  1.00000000  0.57316609   0.242969309
## PM10               0.57462439  0.57316609  1.00000000   0.365052499
## WindDirection      0.22266424  0.24296931  0.36505250   1.000000000
## NH3                0.65798769  0.57815263  0.63180393   0.085675167
## RH                 0.34517373  0.25502520 -0.07986845  -0.154386275
## Temp              -0.46138203 -0.44560131 -0.47019550  -0.137419762
## WindSpeed         -0.67958725 -0.63574920 -0.39545378  -0.241231651
## VerticalWindSpeed -0.00740521 -0.09134434  0.01580934  -0.078233739
## Solar             -0.48401084 -0.41559695 -0.37328750  -0.004598241
## BarPressure        0.33281591  0.32513134  0.38291915   0.141059308
## PD_PM2.5           0.61853009  0.53926676  0.76602552   0.206533316
## PD_PM10            0.50310007  0.45332846  0.81136785   0.220450779
## PD_NO2             0.42810752  0.45920813  0.52338004   0.122535054
## PD_SO2             0.37349682  0.40680951  0.54857800   0.201381554
## PD_CO              0.55882065  0.50353813  0.42373914   0.069157579
##                            NH3          RH       Temp    WindSpeed
## NO                 0.621298889  0.14705025 -0.4820548 -0.579092113
## CO                 0.496419835  0.25194487 -0.3878201 -0.441405478
## NO2                0.530400134 -0.11971908 -0.3498939 -0.610689885
## O3                -0.001958607 -0.61250242  0.1076558 -0.011432555
## SO2                0.313170020 -0.14571485 -0.2953766 -0.200518902
## PM2.5              0.789478643  0.17245031 -0.6116405 -0.516830320
## Benzene            0.756528972  0.44906696 -0.6497833 -0.685447696
## Toulene            0.661750210  0.44240742 -0.5242787 -0.723671972
## P_Xylene           0.657987695  0.34517373 -0.4613820 -0.679587253
## NOx                0.578152629  0.25502520 -0.4456013 -0.635749200
## PM10               0.631803934 -0.07986845 -0.4701955 -0.395453785
## WindDirection      0.085675167 -0.15438628 -0.1374198 -0.241231651
## NH3                1.000000000  0.39461182 -0.6612095 -0.481880128
## RH                 0.394611818  1.00000000 -0.5773227 -0.288804264
## Temp              -0.661209462 -0.57732266  1.0000000  0.408975428
## WindSpeed         -0.481880128 -0.28880426  0.4089754  1.000000000
## VerticalWindSpeed -0.124593085 -0.35349227  0.2954563  0.004698092
## Solar             -0.600596228 -0.60120581  0.7718077  0.446223575
## BarPressure        0.511212413  0.18107633 -0.5173927 -0.210725045
## PD_PM2.5           0.737144348  0.18356150 -0.6216734 -0.409260603
## PD_PM10            0.607459569 -0.02185193 -0.5083227 -0.301171150
## PD_NO2             0.514390337 -0.12300313 -0.3249090 -0.433981971
## PD_SO2             0.481841057 -0.16651690 -0.3993471 -0.251553694
## PD_CO              0.491399970  0.27474197 -0.4019045 -0.344399910
##                   VerticalWindSpeed        Solar BarPressure    PD_PM2.5
## NO                     -0.050620948 -0.387517836   0.3903593  0.60674886
## CO                     -0.081695487 -0.442921910   0.2181974  0.49776287
## NO2                     0.137735151 -0.339417282   0.2773735  0.47920564
## O3                      0.144604092  0.281918705   0.2097555  0.06128692
## SO2                    -0.072839544 -0.130512131   0.4018930  0.31348411
## PM2.5                  -0.074819524 -0.520538766   0.4311048  0.84058878
## Benzene                -0.095563709 -0.607748677   0.4461328  0.71231859
## Toulene                -0.079075040 -0.527483222   0.3938461  0.58906063
## P_Xylene               -0.007405210 -0.484010841   0.3328159  0.61853009
## NOx                    -0.091344345 -0.415596953   0.3251313  0.53926676
## PM10                    0.015809345 -0.373287496   0.3829191  0.76602552
## WindDirection          -0.078233739 -0.004598241   0.1410593  0.20653332
## NH3                    -0.124593085 -0.600596228   0.5112124  0.73714435
## RH                     -0.353492275 -0.601205813   0.1810763  0.18356150
## Temp                    0.295456254  0.771807663  -0.5173927 -0.62167345
## WindSpeed               0.004698092  0.446223575  -0.2107250 -0.40926060
## VerticalWindSpeed       1.000000000  0.161457566  -0.3713759 -0.06067449
## Solar                   0.161457566  1.000000000  -0.2780028 -0.55260236
## BarPressure            -0.371375892 -0.278002823   1.0000000  0.43721807
## PD_PM2.5               -0.060674490 -0.552602356   0.4372181  1.00000000
## PD_PM10                 0.024801184 -0.447327133   0.3868305  0.90004926
## PD_NO2                  0.171290219 -0.287732044   0.3355824  0.57745670
## PD_SO2                 -0.049614518 -0.254231834   0.5181450  0.51520012
## PD_CO                  -0.044676261 -0.436613669   0.2527666  0.56953828
##                       PD_PM10     PD_NO2      PD_SO2       PD_CO
## NO                 0.55552959  0.4924962  0.49745230  0.51037323
## CO                 0.42245257  0.3466936  0.30763592  0.48592215
## NO2                0.48154105  0.7287087  0.48373420  0.31555691
## O3                 0.16694131  0.3220070  0.37016379 -0.10628880
## SO2                0.31946031  0.3036832  0.70488667  0.15642631
## PM2.5              0.77639440  0.5050479  0.51549579  0.50485710
## Benzene            0.57753277  0.4779433  0.39044229  0.57254792
## Toulene            0.45458289  0.4472849  0.33616007  0.53438208
## P_Xylene           0.50310007  0.4281075  0.37349682  0.55882065
## NOx                0.45332846  0.4592081  0.40680951  0.50353813
## PM10               0.81136785  0.5233800  0.54857800  0.42373914
## WindDirection      0.22045078  0.1225351  0.20138155  0.06915758
## NH3                0.60745957  0.5143903  0.48184106  0.49139997
## RH                -0.02185193 -0.1230031 -0.16651690  0.27474197
## Temp              -0.50832271 -0.3249090 -0.39934713 -0.40190450
## WindSpeed         -0.30117115 -0.4339820 -0.25155369 -0.34439991
## VerticalWindSpeed  0.02480118  0.1712902 -0.04961452 -0.04467626
## Solar             -0.44732713 -0.2877320 -0.25423183 -0.43661367
## BarPressure        0.38683054  0.3355824  0.51814501  0.25276661
## PD_PM2.5           0.90004926  0.5774567  0.51520012  0.56953828
## PD_PM10            1.00000000  0.5739629  0.55016779  0.49497449
## PD_NO2             0.57396293  1.0000000  0.48534503  0.41081304
## PD_SO2             0.55016779  0.4853450  1.00000000  0.31641021
## PD_CO              0.49497449  0.4108130  0.31641021  1.00000000

Visualize the Correlation Matrix

library(corrplot)
## corrplot 0.84 loaded
corrplot(matrix)

corrplot(matrix, type = "upper", method = "number")

corrplot.mixed(matrix,lower = "number", upper = "pie", tl.col = "black", tl.pos = "lt")

Bartlestt test

library(psych)
cortest.bartlett(matrix)
## Warning in cortest.bartlett(matrix): n not specified, 100 used
## $chisq
## [1] 2563.405
## 
## $p.value
## [1] 0
## 
## $df
## [1] 276
# Default n is 100, if sample is less than 100 rows, always good to declare the number of observations in the dataset
cortest.bartlett(matrix, nrow(pollution_data)) 
## $chisq
## [1] 9585.526
## 
## $p.value
## [1] 0
## 
## $df
## [1] 276

P value is less than default significance level (0.05) hence this test suggest that this data is suitable to do PCA/FA

KMO Test

KMO(matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = matrix)
## Overall MSA =  0.88
## MSA for each item = 
##                NO                CO               NO2                O3 
##              0.84              0.97              0.88              0.73 
##               SO2             PM2.5           Benzene           Toulene 
##              0.87              0.90              0.92              0.90 
##          P_Xylene               NOx              PM10     WindDirection 
##              0.90              0.85              0.90              0.80 
##               NH3                RH              Temp         WindSpeed 
##              0.91              0.66              0.81              0.95 
## VerticalWindSpeed             Solar       BarPressure          PD_PM2.5 
##              0.64              0.91              0.90              0.89 
##           PD_PM10            PD_NO2            PD_SO2             PD_CO 
##              0.89              0.90              0.88              0.98

Overall MSA is greater than 0.5 which means we have adequate number of samples in the dataset, we can go ahead with the analysis

Check for Eigen Values

A <- eigen(matrix)
EV <- A$values
print(EV)
##  [1] 11.21500818  2.98569744  1.78929681  1.34697073  1.11005651
##  [6]  0.89010563  0.73051047  0.60189395  0.57041627  0.41180518
## [11]  0.38091795  0.32580388  0.31021327  0.27113681  0.24302887
## [16]  0.21258838  0.16697205  0.14337784  0.09715854  0.08089151
## [21]  0.04637548  0.03109583  0.02014172  0.01853671

Scree Plot

plot(EV, xlab = "Principal Components", ylab = "Eigen Value", pch = 20, col = "blue")
lines(EV, col = "red")
abline(h=1, col = "green", lty = 2)

As per Kaiser rule we will go with the number of components/factors where anything for EV above 1

Perform Principal Component Analysis without rotation

pr1 <- principal(matrix, nfactors = 5, rotate = "none")
print(pr1)
## Principal Components Analysis
## Call: principal(r = matrix, nfactors = 5, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                     PC1   PC2   PC3   PC4   PC5   h2    u2 com
## NO                 0.86  0.06  0.24  0.23  0.08 0.86 0.138 1.3
## CO                 0.69 -0.15  0.23  0.06 -0.02 0.56 0.442 1.3
## NO2                0.68  0.37  0.23 -0.02 -0.29 0.75 0.254 2.2
## O3                 0.07  0.79 -0.03  0.14 -0.27 0.73 0.274 1.3
## SO2                0.44  0.44 -0.30  0.36 -0.12 0.62 0.378 3.8
## PM2.5              0.88  0.09 -0.10 -0.17  0.22 0.87 0.129 1.2
## Benzene            0.91 -0.23  0.13  0.06  0.01 0.91 0.091 1.2
## Toulene            0.85 -0.26  0.27  0.18 -0.05 0.90 0.104 1.5
## P_Xylene           0.86 -0.18  0.34  0.15  0.03 0.92 0.084 1.5
## NOx                0.83 -0.06  0.32  0.27  0.00 0.86 0.137 1.5
## PM10               0.79  0.31 -0.07 -0.19  0.32 0.87 0.135 1.8
## WindDirection      0.28  0.23  0.03  0.33  0.70 0.73 0.273 2.1
## NH3                0.83 -0.10 -0.18 -0.14 -0.12 0.76 0.235 1.2
## RH                 0.30 -0.88 -0.16  0.03 -0.13 0.90 0.096 1.4
## Temp              -0.70  0.30  0.43  0.05  0.09 0.79 0.214 2.1
## WindSpeed         -0.67  0.12 -0.36 -0.19  0.15 0.65 0.346 1.9
## VerticalWindSpeed -0.10  0.34  0.55 -0.47 -0.10 0.66 0.338 2.8
## Solar             -0.64  0.44  0.21  0.24  0.15 0.73 0.269 2.5
## BarPressure        0.53  0.09 -0.54  0.29 -0.16 0.69 0.307 2.8
## PD_PM2.5           0.84  0.08 -0.20 -0.35  0.16 0.89 0.108 1.6
## PD_PM10            0.75  0.26 -0.20 -0.41  0.22 0.89 0.114 2.2
## PD_NO2             0.64  0.39  0.10 -0.23 -0.29 0.70 0.297 2.5
## PD_SO2             0.60  0.48 -0.32  0.14 -0.13 0.74 0.264 2.8
## PD_CO              0.63 -0.15  0.05 -0.22  0.02 0.47 0.527 1.4
## 
##                         PC1  PC2  PC3  PC4  PC5
## SS loadings           11.22 2.99 1.79 1.35 1.11
## Proportion Var         0.47 0.12 0.07 0.06 0.05
## Cumulative Var         0.47 0.59 0.67 0.72 0.77
## Proportion Explained   0.61 0.16 0.10 0.07 0.06
## Cumulative Proportion  0.61 0.77 0.87 0.94 1.00
## 
## Mean item complexity =  1.9
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
## 
## Fit based upon off diagonal values = 0.99

To better understand the division of variables under different Principal Components, lets do a cutoff

print(pr1$loadings, cutoff = 0.5)
## 
## Loadings:
##                   PC1    PC2    PC3    PC4    PC5   
## NO                 0.861                            
## CO                 0.690                            
## NO2                0.685                            
## O3                        0.792                     
## SO2                                                 
## PM2.5              0.882                            
## Benzene            0.914                            
## Toulene            0.850                            
## P_Xylene           0.864                            
## NOx                0.827                            
## PM10               0.790                            
## WindDirection                                  0.697
## NH3                0.830                            
## RH                       -0.877                     
## Temp              -0.705                            
## WindSpeed         -0.674                            
## VerticalWindSpeed                0.554              
## Solar             -0.639                            
## BarPressure        0.532        -0.539              
## PD_PM2.5           0.838                            
## PD_PM10            0.751                            
## PD_NO2             0.636                            
## PD_SO2             0.602                            
## PD_CO              0.632                            
## 
##                   PC1   PC2   PC3   PC4   PC5
## SS loadings    11.215 2.986 1.789 1.347 1.110
## Proportion Var  0.467 0.124 0.075 0.056 0.046
## Cumulative Var  0.467 0.592 0.666 0.722 0.769

We clearly see all the separations and there is no variable in PC4, so we can limit the number of factors to 4 instead of 5.

BarPressure has the same explainability with PC1 & PC3, let’s try to use rotation and see if we get a clear picture

Perform Principal Component Analysis with rotation

pr2 <- principal(matrix, nfactors = 5, rotate = "varimax")
print(pr2)
## Principal Components Analysis
## Call: principal(r = matrix, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                     RC1   RC4   RC2   RC3   RC5   h2    u2 com
## NO                 0.80  0.33  0.24  0.09  0.22 0.86 0.138 1.7
## CO                 0.68  0.29  0.01  0.06  0.01 0.56 0.442 1.4
## NO2                0.56  0.32  0.53 -0.17 -0.12 0.75 0.254 2.9
## O3                -0.08 -0.04  0.83 -0.18  0.03 0.73 0.274 1.1
## SO2                0.18  0.14  0.66  0.33  0.15 0.62 0.378 1.9
## PM2.5              0.48  0.75  0.16  0.12  0.19 0.87 0.129 2.0
## Benzene            0.80  0.47  0.00  0.22  0.02 0.91 0.091 1.8
## Toulene            0.89  0.29 -0.01  0.17  0.01 0.90 0.104 1.3
## P_Xylene           0.89  0.32  0.01  0.08  0.10 0.92 0.084 1.3
## NOx                0.87  0.22  0.15  0.10  0.15 0.86 0.137 1.3
## PM10               0.35  0.75  0.28 -0.02  0.33 0.87 0.135 2.2
## WindDirection      0.17  0.13  0.08  0.07  0.82 0.73 0.273 1.2
## NH3                0.49  0.63  0.13  0.28 -0.15 0.76 0.235 2.6
## RH                 0.36  0.11 -0.59  0.55 -0.34 0.90 0.096 3.4
## Temp              -0.33 -0.57  0.01 -0.56  0.17 0.79 0.214 2.8
## WindSpeed         -0.79 -0.13 -0.09 -0.03  0.04 0.65 0.346 1.1
## VerticalWindSpeed  0.02  0.06  0.08 -0.79 -0.15 0.66 0.338 1.1
## Solar             -0.39 -0.55  0.21 -0.35  0.32 0.73 0.269 3.7
## BarPressure        0.18  0.29  0.42  0.63 -0.01 0.69 0.307 2.4
## PD_PM2.5           0.34  0.86  0.13  0.11  0.06 0.89 0.108 1.4
## PD_PM10            0.21  0.88  0.23 -0.02  0.13 0.89 0.114 1.3
## PD_NO2             0.39  0.48  0.50 -0.20 -0.20 0.70 0.297 3.6
## PD_SO2             0.20  0.40  0.68  0.26  0.07 0.74 0.264 2.2
## PD_CO              0.45  0.51 -0.06  0.04 -0.08 0.47 0.527 2.1
## 
##                        RC1  RC4  RC2  RC3  RC5
## SS loadings           6.68 5.22 2.98 2.26 1.30
## Proportion Var        0.28 0.22 0.12 0.09 0.05
## Cumulative Var        0.28 0.50 0.62 0.71 0.77
## Proportion Explained  0.36 0.28 0.16 0.12 0.07
## Cumulative Proportion 0.36 0.65 0.81 0.93 1.00
## 
## Mean item complexity =  2
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
## 
## Fit based upon off diagonal values = 0.99

After Cutoff

print(pr2$loadings, cutoff = 0.5)
## 
## Loadings:
##                   RC1    RC4    RC2    RC3    RC5   
## NO                 0.801                            
## CO                 0.684                            
## NO2                0.564         0.530              
## O3                               0.828              
## SO2                              0.660              
## PM2.5                     0.752                     
## Benzene            0.801                            
## Toulene            0.886                            
## P_Xylene           0.895                            
## NOx                0.871                            
## PM10                      0.745                     
## WindDirection                                  0.818
## NH3                       0.633                     
## RH                              -0.592  0.546       
## Temp                     -0.572        -0.563       
## WindSpeed         -0.791                            
## VerticalWindSpeed                      -0.793       
## Solar                    -0.552                     
## BarPressure                             0.632       
## PD_PM2.5                  0.861                     
## PD_PM10                   0.877                     
## PD_NO2                                              
## PD_SO2                           0.680              
## PD_CO                     0.511                     
## 
##                  RC1   RC4   RC2   RC3   RC5
## SS loadings    6.682 5.219 2.984 2.263 1.299
## Proportion Var 0.278 0.217 0.124 0.094 0.054
## Cumulative Var 0.278 0.496 0.620 0.714 0.769

Looks like PCA without Rotation gave better result and we can assign BarPressure to PC1 since it has a higher explainability to it.

Perform Factor Analysis

fa1 <- fa(matrix, nfactors = 5, rotate = "none", fm = "pa")  # Function from psych package
print(fa1)
## Factor Analysis using method =  pa
## Call: fa(r = matrix, nfactors = 5, rotate = "none", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                     PA1   PA2   PA3   PA4   PA5   h2    u2 com
## NO                 0.86  0.06  0.29  0.15  0.20 0.89 0.107 1.4
## CO                 0.67 -0.12  0.19 -0.01  0.09 0.50 0.495 1.3
## NO2                0.68  0.37  0.21  0.00 -0.41 0.81 0.188 2.5
## O3                 0.06  0.71  0.01  0.19 -0.15 0.57 0.433 1.3
## SO2                0.42  0.38 -0.17  0.38  0.04 0.50 0.501 3.3
## PM2.5              0.88  0.10 -0.14 -0.18  0.15 0.86 0.142 1.2
## Benzene            0.92 -0.23  0.13  0.01  0.01 0.91 0.089 1.2
## Toulene            0.85 -0.26  0.30  0.10 -0.01 0.89 0.107 1.5
## P_Xylene           0.87 -0.18  0.38  0.04  0.13 0.95 0.052 1.5
## NOx                0.83 -0.06  0.37  0.17  0.12 0.86 0.135 1.5
## PM10               0.79  0.32 -0.12 -0.22  0.23 0.84 0.163 1.8
## WindDirection      0.26  0.18  0.05  0.08  0.23 0.16 0.842 3.1
## NH3                0.82 -0.09 -0.19 -0.03 -0.11 0.73 0.272 1.2
## RH                 0.30 -0.91 -0.18  0.09 -0.12 0.96 0.038 1.4
## Temp              -0.70  0.29  0.43 -0.10  0.15 0.79 0.211 2.2
## WindSpeed         -0.66  0.11 -0.31 -0.07  0.21 0.59 0.408 1.8
## VerticalWindSpeed -0.10  0.28  0.32 -0.36 -0.16 0.34 0.656 3.5
## Solar             -0.63  0.41  0.25  0.11  0.23 0.69 0.306 2.5
## BarPressure        0.52  0.08 -0.37  0.39  0.00 0.56 0.439 2.8
## PD_PM2.5           0.84  0.09 -0.28 -0.31  0.08 0.89 0.108 1.6
## PD_PM10            0.75  0.28 -0.29 -0.37  0.13 0.88 0.124 2.2
## PD_NO2             0.62  0.37  0.05 -0.13 -0.36 0.67 0.328 2.4
## PD_SO2             0.59  0.45 -0.25  0.28 -0.01 0.69 0.307 2.8
## PD_CO              0.60 -0.12  0.01 -0.14  0.02 0.40 0.600 1.2
## 
##                         PA1  PA2  PA3  PA4  PA5
## SS loadings           11.00 2.72 1.50 0.99 0.73
## Proportion Var         0.46 0.11 0.06 0.04 0.03
## Cumulative Var         0.46 0.57 0.63 0.68 0.71
## Proportion Explained   0.65 0.16 0.09 0.06 0.04
## Cumulative Proportion  0.65 0.81 0.90 0.96 1.00
## 
## Mean item complexity =  2
## Test of the hypothesis that 5 factors are sufficient.
## 
## The degrees of freedom for the null model are  276  and the objective function was  28.43
## The degrees of freedom for the model are 166  and the objective function was  5.24 
## 
## The root mean square of the residuals (RMSR) is  0.03 
## The df corrected root mean square of the residuals is  0.04 
## 
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1  PA2  PA3  PA4  PA5
## Correlation of (regression) scores with factors   0.99 0.98 0.95 0.89 0.88
## Multiple R square of scores with factors          0.99 0.97 0.91 0.78 0.78
## Minimum correlation of possible factor scores     0.97 0.94 0.81 0.57 0.56

Factor Analysis using Rotation

fa2 <- fa(matrix, nfactors = 5, rotate = "varimax", fm = "pa")  # Varimax Rotation
print(fa2)
## Factor Analysis using method =  pa
## Call: fa(r = matrix, nfactors = 5, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                     PA1   PA3   PA2   PA4   PA5   h2    u2 com
## NO                 0.82  0.32  0.29  0.09 -0.14 0.89 0.107 1.7
## CO                 0.64  0.30  0.03  0.10  0.00 0.50 0.495 1.5
## NO2                0.52  0.28  0.49 -0.21  0.42 0.81 0.188 3.9
## O3                -0.09  0.00  0.70 -0.27  0.03 0.57 0.433 1.3
## SO2                0.18  0.16  0.63  0.22 -0.06 0.50 0.501 1.6
## PM2.5              0.50  0.74  0.21  0.14  0.00 0.86 0.142 2.0
## Benzene            0.80  0.43  0.04  0.25  0.13 0.91 0.089 1.8
## Toulene            0.88  0.25  0.03  0.19  0.12 0.89 0.107 1.3
## P_Xylene           0.92  0.30  0.04  0.09 -0.03 0.95 0.052 1.2
## NOx                0.88  0.22  0.19  0.10 -0.06 0.86 0.135 1.3
## PM10               0.39  0.75  0.33 -0.02 -0.12 0.84 0.163 2.0
## WindDirection      0.20  0.15  0.21  0.00 -0.23 0.16 0.842 3.8
## NH3                0.48  0.55  0.16  0.31  0.25 0.73 0.272 3.2
## RH                 0.33  0.03 -0.56  0.68  0.29 0.96 0.038 2.8
## Temp              -0.32 -0.47 -0.08 -0.61 -0.31 0.79 0.211 3.1
## WindSpeed         -0.70 -0.14 -0.10 -0.03 -0.27 0.59 0.408 1.4
## VerticalWindSpeed  0.00  0.03 -0.01 -0.57  0.12 0.34 0.656 1.1
## Solar             -0.37 -0.44  0.16 -0.42 -0.41 0.69 0.306 4.2
## BarPressure        0.18  0.25  0.44  0.51  0.04 0.56 0.439 2.8
## PD_PM2.5           0.36  0.84  0.15  0.14  0.09 0.89 0.108 1.5
## PD_PM10            0.24  0.87  0.24  0.00  0.02 0.88 0.124 1.3
## PD_NO2             0.36  0.42  0.43 -0.19  0.39 0.67 0.328 4.3
## PD_SO2             0.21  0.36  0.69  0.20  0.01 0.69 0.307 1.9
## PD_CO              0.45  0.42 -0.02  0.11  0.09 0.40 0.600 2.2
## 
##                        PA1  PA3  PA2  PA4  PA5
## SS loadings           6.53 4.58 2.76 2.12 0.96
## Proportion Var        0.27 0.19 0.12 0.09 0.04
## Cumulative Var        0.27 0.46 0.58 0.67 0.71
## Proportion Explained  0.39 0.27 0.16 0.12 0.06
## Cumulative Proportion 0.39 0.66 0.82 0.94 1.00
## 
## Mean item complexity =  2.2
## Test of the hypothesis that 5 factors are sufficient.
## 
## The degrees of freedom for the null model are  276  and the objective function was  28.43
## The degrees of freedom for the model are 166  and the objective function was  5.24 
## 
## The root mean square of the residuals (RMSR) is  0.03 
## The df corrected root mean square of the residuals is  0.04 
## 
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1  PA3  PA2  PA4  PA5
## Correlation of (regression) scores with factors   0.98 0.96 0.94 0.93 0.89
## Multiple R square of scores with factors          0.97 0.92 0.88 0.87 0.79
## Minimum correlation of possible factor scores     0.94 0.84 0.76 0.73 0.58

We will name the factors as per the attributes of the group

Get FA Scores

To get the FA Scores for all observations we give the name of the dataset instead of passing the Correlation matrix as the first argument. Both works the same way and gives the same output
new_fa <- fa(pollution_data, nfactors = 5, rotate = "varimax")  # New FA with dataset as the input
new_fa_scores <- new_fa$scores   # Fetch the Scores for all observations

Add the Dependent variable to these scores to prepare the new data for MLR

newdata <- cbind(data[,8], new_fa_scores)

Name the Factors

colnames(newdata) <- c("PM2.5", "factor1", "factor2", "factor3", "factor4", "factor5")
newdata <- as.data.frame(newdata)  # Convert the dataset into dataframe for analysis

Split the data

Split the data into 2 parts (70:30), create model on 70% data and validate the model on the rest 30%

set.seed(300) # Marking the 70% data so that everytime the model is run with the same set of observations in train & test
indices= sample(1:nrow(newdata), 0.7*nrow(newdata)) # Get the Splitting ratio
train = newdata[indices,]  # 70% data on which model will be created is known as training data
test = newdata[-indices,]  # 30% data on which model is validated is known as test data

Create a MLR Model on training data

model <- lm(PM2.5 ~ ., data = train)
summary(model)
## 
## Call:
## lm(formula = PM2.5 ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -115.291  -21.138   -4.201   13.400  138.552 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  135.392      2.566  52.764  < 2e-16 ***
## factor1       41.431      2.600  15.937  < 2e-16 ***
## factor2       55.134      2.733  20.174  < 2e-16 ***
## factor3       21.278      2.588   8.223 1.33e-14 ***
## factor4       30.399      3.152   9.644  < 2e-16 ***
## factor5       13.736      3.042   4.515 1.00e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.85 on 236 degrees of freedom
## Multiple R-squared:  0.8001, Adjusted R-squared:  0.7958 
## F-statistic: 188.9 on 5 and 236 DF,  p-value: < 2.2e-16

Get the predictions

pred <- predict(model, newdata = test) # Predicting the value of mpeg basis the model we created
print(pred)
##         2         8        11        13        14        16        20 
##  65.23659  59.31659  58.40312 125.07258 103.75376 111.20893 109.89821 
##        23        24        27        29        32        34        36 
##  56.22766  55.09754 115.17040 130.04334 115.90107  88.02867 123.64754 
##        41        42        48        58        60        63        64 
##  64.07610  59.97295  81.35553  59.72030  92.98331  86.21572  69.39076 
##        67        74        76        79        80        82        87 
## 111.29542  96.69836  56.18432  85.77372  96.32997  59.10137  41.81266 
##        93        94        97        99       108       109       110 
##  59.10010  66.35540  54.75233  47.33670  46.81475  58.58964  51.75934 
##       112       123       127       137       139       145       146 
##  27.93108  57.69575  89.51580  49.78534  70.97577  69.10980  56.52830 
##       148       157       161       163       164       165       166 
##  60.88432  73.88474 115.03974  95.05312  97.54431  76.89740  60.46939 
##       202       211       218       223       224       225       226 
## 154.07563 152.39611 234.40123 260.59057 266.14348 318.61584 310.07360 
##       227       232       233       238       240       241       243 
## 296.22519 218.78178 209.08794 220.48516 316.53335 350.40590 282.96281 
##       244       245       246       251       253       268       269 
## 261.15226 227.04338 233.59037 353.57853 202.32787 239.30635 207.76596 
##       270       271       273       280       291       295       296 
## 246.67427 229.99652 206.96458 210.25812 165.29813 211.83064 247.16510 
##       297       301       302       309       310       314       315 
## 177.41564 212.10145 173.84930 169.54414 179.38103 184.54009 200.92681 
##       316       323       329       333       334       336       338 
## 152.86998 134.73449 189.35367 117.02821 117.53976 155.53335  97.48087 
##       344       347       355       356       357       358       360 
##  69.16929 122.77815 135.90085 133.22076  92.91684  91.04382 146.85076 
##       363       364       367       371       376       387       400 
## 122.78123 138.83106 127.46221 176.24675 160.98754 132.42630 189.72443

Model Validity

There are many ways to validate the model, few of them are mentioned below

Comparing RSquare of train & test data - Calculating Rsquare value manually for the test data as we can’t see a rsquare using summary for prediction

Add the new pred column to the test dataset

test$pred <- pred
Two ways to compare rsquare of train & test

Model is valid if Rsquare value for train & test data is close or if Rsquare of test is better

  1. Test Rsquare value for the predicted data (Method 1)
rsquare <- cor(test$PM2.5,test$pred)^2
print(rsquare)
## [1] 0.8018959
In this case Rsquare for test data (0.8018959) is better/close to train data (0.7958), hence the model is valid
  1. Test Rsquare value for the predicted data (Method 2)
SST = sum((test$PM2.5 - mean(train$PM2.5))^2)  # Calculate the Sum of Squares of Total
SSE <- sum((pred - test$PM2.5)^2) # Calculate the Sum of Squares of Error
SSR <- sum((pred - mean(train$PM2.5))^2)  # Calculate the Sum of Square of Residual
rsqr <- SSR/SST
print(rsqr)
## [1] 0.8175005
In this case Rsquare for test data (0.8175005) is better/close to train data (0.7958), hence the model is valid

Calculate RMSE (Root Mean Square Error)

library(Metrics)
rmse(test$PM2.5, pred)
## [1] 39.59757

We get a RMSE of 39.59757 which means that the error lies in between the min & 1st quartile of the distribution of the dataset.

summary(data$PM2.5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.75   67.25  108.24  132.76  180.43  550.23
summary(train$PM2.5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.75   68.06  103.58  133.39  187.70  448.18
summary(test$PM2.5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.08   66.60  114.32  131.30  160.26  550.23

When RMSE is within the low range of distribution, Model is considered to be statistically significant (Good Model).

Calculating MAPE (Means Absolute Percentage Error)

mape(pred, test$PM2.5)
## [1] 0.1932948
We get a MAPE of 0.1932948 which means 19%, which means that our model accuracy is 81%. With such high accuracy it can be stated as a valid model.