In this case study we are trying to do a factor analysis of the delhi pollution dataset and then put those factors into regression to see how the model behaves.
data <- read.csv("delhi_pollution_data.csv")
summary(data)
## SrNo Date NO CO
## Min. : 1.0 01-01-2016: 1 Min. : 1.12 Min. : 0.220
## 1st Qu.:101.5 01-02-2016: 1 1st Qu.: 11.38 1st Qu.: 1.202
## Median :202.0 01-03-2016: 1 Median : 39.22 Median : 1.780
## Mean :202.0 01-05-2015: 1 Mean : 81.32 Mean : 2.210
## 3rd Qu.:302.5 01-06-2015: 1 3rd Qu.:118.03 3rd Qu.: 2.737
## Max. :403.0 01-07-2015: 1 Max. :480.90 Max. :19.900
## (Other) :397 NA's :2 NA's :1
## NO2 O3 SO2 PM2.5
## Min. : 25.27 Min. : 4.98 Min. : 0.000 Min. : 18.75
## 1st Qu.: 54.50 1st Qu.: 30.12 1st Qu.: 9.515 1st Qu.: 68.11
## Median : 73.68 Median : 53.21 Median : 21.680 Median :108.24
## Mean : 74.43 Mean : 56.30 Mean : 25.988 Mean :130.16
## 3rd Qu.: 91.43 3rd Qu.: 76.85 3rd Qu.: 37.145 3rd Qu.:166.16
## Max. :149.01 Max. :159.70 Max. :371.750 Max. :550.23
## NA's :2 NA's :6 NA's :4 NA's :2
## Benzene Toulene P_Xylene NOx
## Min. : 0.160 Min. : 0.000 Min. : 0.300 Min. : 26.49
## 1st Qu.: 3.235 1st Qu.: 9.695 1st Qu.: 3.382 1st Qu.: 87.67
## Median : 5.910 Median :16.240 Median : 6.060 Median :152.31
## Mean : 6.526 Mean :17.192 Mean : 8.129 Mean :194.68
## 3rd Qu.: 8.940 3rd Qu.:22.973 3rd Qu.:10.570 3rd Qu.:264.38
## Max. :17.510 Max. :48.410 Max. :32.350 Max. :743.70
## NA's :1 NA's :1 NA's :31 NA's :2
## PM10 WindDirection NH3 RH
## Min. : 36.63 Min. :126.7 Min. : 1.00 Min. :15.19
## 1st Qu.:163.66 1st Qu.:180.4 1st Qu.: 25.90 1st Qu.:41.58
## Median :249.67 Median :209.4 Median : 34.70 Median :53.87
## Mean :265.25 Mean :202.9 Mean : 39.63 Mean :52.15
## 3rd Qu.:353.27 3rd Qu.:228.9 3rd Qu.: 50.72 3rd Qu.:63.85
## Max. :705.70 Max. :258.1 Max. :144.82 Max. :86.82
## NA's :2 NA's :1 NA's :2 NA's :1
## Temp WindSpeed VerticalWindSpeed Solar
## Min. : 6.56 Min. :0.3000 Min. :-2.8700 Min. : 3.31
## 1st Qu.:18.82 1st Qu.:0.8525 1st Qu.:-0.1700 1st Qu.: 84.90
## Median :26.57 Median :1.1600 Median :-0.0500 Median :121.85
## Mean :24.43 Mean :1.2380 Mean : 0.1889 Mean :112.95
## 3rd Qu.:29.86 3rd Qu.:1.5200 3rd Qu.: 0.4000 3rd Qu.:141.03
## Max. :41.05 Max. :3.4500 Max. : 3.5600 Max. :345.68
## NA's :2 NA's :1 NA's :2 NA's :2
## BarPressure Weather PD_PM2.5 PD_PM10
## Min. :721.7 Autumn : 88 Min. : 18.75 Min. : 36.63
## 1st Qu.:731.7 Monsoon: 62 1st Qu.: 68.05 1st Qu.:163.38
## Median :733.4 Spring : 58 Median :108.24 Median :250.10
## Mean :732.6 Summer :140 Mean :130.51 Mean :265.62
## 3rd Qu.:733.8 Winter : 55 3rd Qu.:170.99 3rd Qu.:357.95
## Max. :736.1 Max. :550.23 Max. :705.70
## NA's :2 NA's :10 NA's :11
## PD_NO2 PD_SO2 PD_CO
## Min. : 25.27 Min. : 0.000 Min. : 0.370
## 1st Qu.: 54.03 1st Qu.: 9.412 1st Qu.: 1.208
## Median : 73.52 Median : 21.605 Median : 1.775
## Mean : 74.32 Mean : 25.965 Mean : 2.201
## 3rd Qu.: 91.37 3rd Qu.: 37.197 3rd Qu.: 2.667
## Max. :149.01 Max. :371.750 Max. :19.900
## NA's :12 NA's :13 NA's :11
str(data)
## 'data.frame': 403 obs. of 27 variables:
## $ SrNo : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Date : Factor w/ 403 levels "01-01-2016","01-02-2016",..: 39 53 105 119 133 146 159 173 187 200 ...
## $ NO : num 7.22 6.99 7.6 7.57 8.34 6.89 6.89 7.21 7.16 6.64 ...
## $ CO : num 1.77 0.22 0.5 0.77 0.48 0.41 0.37 1.25 1.81 1.58 ...
## $ NO2 : num 47.9 45.3 59.9 63.6 62 ...
## $ O3 : num 51.1 19.3 94.3 66.9 69.5 ...
## $ SO2 : num 16.9 16.7 13.1 16.2 20.3 ...
## $ PM2.5 : num 49 60.2 46.9 113 104.9 ...
## $ Benzene : num 2.53 3.19 2.29 3.92 5.19 2.12 2.29 2.36 3.03 2.91 ...
## $ Toulene : num 9.65 11.1 8.61 10.76 15.95 ...
## $ P_Xylene : num 3 2.67 3.43 4.66 7.66 2.47 2.8 2.57 4.32 2.75 ...
## $ NOx : num 53 51.3 65.5 68.8 67.4 ...
## $ PM10 : num 82.8 113.5 171.4 232.2 235.1 ...
## $ WindDirection : num 142 166 220 233 221 ...
## $ NH3 : num 26.5 31 17.4 25.9 29.5 ...
## $ RH : num 61.3 75.5 33.8 43 40.7 ...
## $ Temp : num 20.2 16.9 26.6 25.2 26.5 ...
## $ WindSpeed : num 1.22 0.62 1.55 1.18 0.88 1.56 1.35 1.3 0.37 1.45 ...
## $ VerticalWindSpeed: num 0.08 -0.04 -0.17 -0.15 0.15 -0.16 -0.13 0.35 1.39 0.34 ...
## $ Solar : num 162.2 99.4 146.9 150.1 137 ...
## $ BarPressure : num 732 734 728 730 731 ...
## $ Weather : Factor w/ 5 levels "Autumn","Monsoon",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ PD_PM2.5 : num NA 49 NA 46.9 113 ...
## $ PD_PM10 : num NA 82.8 NA 171.4 232.2 ...
## $ PD_NO2 : num NA 47.9 NA 59.9 63.6 ...
## $ PD_SO2 : num NA 16.9 NA 13.1 16.2 ...
## $ PD_CO : num NA 1.77 NA 0.5 0.77 0.48 0.41 0.37 1.25 1.81 ...
sum(is.na(data))
## [1] 124
Now that we know there are missing values in the data, let’s find Missing Values per variable
na_count <-sapply(data, function(y) sum(length(which(is.na(y))))) #this is a fixed syntax, only things changes here is the name of the dataset, you can use this in any case study.
na_count <- as.data.frame(na_count)
print(na_count)
## na_count
## SrNo 0
## Date 0
## NO 2
## CO 1
## NO2 2
## O3 6
## SO2 4
## PM2.5 2
## Benzene 1
## Toulene 1
## P_Xylene 31
## NOx 2
## PM10 2
## WindDirection 1
## NH3 2
## RH 1
## Temp 2
## WindSpeed 1
## VerticalWindSpeed 2
## Solar 2
## BarPressure 2
## Weather 0
## PD_PM2.5 10
## PD_PM10 11
## PD_NO2 12
## PD_SO2 13
## PD_CO 11
We are deleting all observations where we have missing values for this case
data <- na.omit(data)
Scaling will help in standardising the data and it works only with numerical variables.
data_scaled <- data[, c(3:21,23:27)] # Filtered only Numeric/Integer variables
data_scaled <- scale(data_scaled)
data_scaled <- as.data.frame(data_scaled) # Convert the Scaled Dataset into Data Frame
We are assuming PM2.5 is the Response variable for us in this case
mod <- lm(PM2.5 ~ ., data = data_scaled)
summary(mod)
##
## Call:
## lm(formula = PM2.5 ~ ., data = data_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00454 -0.13477 0.00270 0.09747 2.88892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.226e-16 1.554e-02 0.000 1.00000
## NO 2.250e-02 7.871e-02 0.286 0.77518
## CO 2.142e-02 2.371e-02 0.903 0.36695
## NO2 3.761e-02 3.343e-02 1.125 0.26136
## O3 7.653e-04 2.474e-02 0.031 0.97534
## SO2 1.398e-02 2.301e-02 0.608 0.54393
## Benzene 3.720e-01 7.763e-02 4.793 2.52e-06 ***
## Toulene -1.191e-01 8.332e-02 -1.429 0.15386
## P_Xylene -2.059e-01 8.558e-02 -2.406 0.01668 *
## NOx 5.016e-02 6.851e-02 0.732 0.46460
## PM10 5.553e-01 3.563e-02 15.585 < 2e-16 ***
## WindDirection 4.169e-04 1.899e-02 0.022 0.98250
## NH3 2.040e-01 3.325e-02 6.135 2.48e-09 ***
## RH 2.009e-02 4.428e-02 0.454 0.65032
## Temp 2.693e-02 4.195e-02 0.642 0.52141
## WindSpeed -7.190e-02 2.650e-02 -2.713 0.00703 **
## VerticalWindSpeed -7.904e-03 1.996e-02 -0.396 0.69239
## Solar 2.646e-02 3.030e-02 0.873 0.38322
## BarPressure -3.248e-02 2.363e-02 -1.374 0.17030
## PD_PM2.5 2.955e-01 4.817e-02 6.135 2.49e-09 ***
## PD_PM10 -1.092e-01 4.538e-02 -2.406 0.01671 *
## PD_NO2 -1.087e-01 2.728e-02 -3.984 8.39e-05 ***
## PD_SO2 -4.368e-03 2.865e-02 -0.152 0.87891
## PD_CO 5.575e-03 2.106e-02 0.265 0.79143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2895 on 323 degrees of freedom
## Multiple R-squared: 0.9217, Adjusted R-squared: 0.9162
## F-statistic: 165.4 on 23 and 323 DF, p-value: < 2.2e-16
Let’s Remove the Response Variable (PM2.5) before doing PCA/FA.
names(data_scaled)
## [1] "NO" "CO" "NO2"
## [4] "O3" "SO2" "PM2.5"
## [7] "Benzene" "Toulene" "P_Xylene"
## [10] "NOx" "PM10" "WindDirection"
## [13] "NH3" "RH" "Temp"
## [16] "WindSpeed" "VerticalWindSpeed" "Solar"
## [19] "BarPressure" "PD_PM2.5" "PD_PM10"
## [22] "PD_NO2" "PD_SO2" "PD_CO"
pollution_data <- data_scaled[, -6] # Removing PM2.5 which is the 6th Variable
matrix <- cor(data_scaled)
print(matrix)
## NO CO NO2 O3
## NO 1.00000000 0.67696029 0.5812679 0.120721119
## CO 0.67696029 1.00000000 0.4133421 -0.053396186
## NO2 0.58126790 0.41334210 1.0000000 0.375333327
## O3 0.12072112 -0.05339619 0.3753333 1.000000000
## SO2 0.39118329 0.22670798 0.3612119 0.329173213
## PM2.5 0.70463047 0.55417088 0.5795105 0.067004489
## Benzene 0.77305833 0.64672782 0.5461371 -0.109079794
## Toulene 0.77517892 0.62662264 0.5284132 -0.114901158
## P_Xylene 0.86604077 0.68608766 0.5137107 -0.074955118
## NOx 0.94940445 0.69152935 0.5879553 0.041109820
## PM10 0.67331264 0.47777436 0.5760809 0.183899056
## WindDirection 0.30335980 0.07026543 0.1814424 0.063300864
## NH3 0.62129889 0.49641984 0.5304001 -0.001958607
## RH 0.14705025 0.25194487 -0.1197191 -0.612502424
## Temp -0.48205483 -0.38782014 -0.3498939 0.107655790
## WindSpeed -0.57909211 -0.44140548 -0.6106899 -0.011432555
## VerticalWindSpeed -0.05062095 -0.08169549 0.1377352 0.144604092
## Solar -0.38751784 -0.44292191 -0.3394173 0.281918705
## BarPressure 0.39035928 0.21819739 0.2773735 0.209755524
## PD_PM2.5 0.60674886 0.49776287 0.4792056 0.061286921
## PD_PM10 0.55552959 0.42245257 0.4815410 0.166941310
## PD_NO2 0.49249620 0.34669363 0.7287087 0.322007022
## PD_SO2 0.49745230 0.30763592 0.4837342 0.370163792
## PD_CO 0.51037323 0.48592215 0.3155569 -0.106288795
## SO2 PM2.5 Benzene Toulene
## NO 0.39118329 0.70463047 0.77305833 0.77517892
## CO 0.22670798 0.55417088 0.64672782 0.62662264
## NO2 0.36121189 0.57951054 0.54613710 0.52841323
## O3 0.32917321 0.06700449 -0.10907979 -0.11490116
## SO2 1.00000000 0.37171324 0.29119218 0.24287386
## PM2.5 0.37171324 1.00000000 0.77516259 0.64730833
## Benzene 0.29119218 0.77516259 1.00000000 0.94894064
## Toulene 0.24287386 0.64730833 0.94894064 1.00000000
## P_Xylene 0.26278805 0.67903498 0.91812642 0.94978302
## NOx 0.32330575 0.64263526 0.77105018 0.79443858
## PM10 0.38643914 0.89286260 0.62730189 0.50773894
## WindDirection 0.22286092 0.29795269 0.23360704 0.20006514
## NH3 0.31317002 0.78947864 0.75652897 0.66175021
## RH -0.14571485 0.17245031 0.44906696 0.44240742
## Temp -0.29537661 -0.61164052 -0.64978331 -0.52427868
## WindSpeed -0.20051890 -0.51683032 -0.68544770 -0.72367197
## VerticalWindSpeed -0.07283954 -0.07481952 -0.09556371 -0.07907504
## Solar -0.13051213 -0.52053877 -0.60774868 -0.52748322
## BarPressure 0.40189299 0.43110477 0.44613277 0.39384610
## PD_PM2.5 0.31348411 0.84058878 0.71231859 0.58906063
## PD_PM10 0.31946031 0.77639440 0.57753277 0.45458289
## PD_NO2 0.30368321 0.50504787 0.47794333 0.44728494
## PD_SO2 0.70488667 0.51549579 0.39044229 0.33616007
## PD_CO 0.15642631 0.50485710 0.57254792 0.53438208
## P_Xylene NOx PM10 WindDirection
## NO 0.86604077 0.94940445 0.67331264 0.303359796
## CO 0.68608766 0.69152935 0.47777436 0.070265433
## NO2 0.51371066 0.58795526 0.57608095 0.181442450
## O3 -0.07495512 0.04110982 0.18389906 0.063300864
## SO2 0.26278805 0.32330575 0.38643914 0.222860919
## PM2.5 0.67903498 0.64263526 0.89286260 0.297952686
## Benzene 0.91812642 0.77105018 0.62730189 0.233607037
## Toulene 0.94978302 0.79443858 0.50773894 0.200065138
## P_Xylene 1.00000000 0.86401685 0.57462439 0.222664244
## NOx 0.86401685 1.00000000 0.57316609 0.242969309
## PM10 0.57462439 0.57316609 1.00000000 0.365052499
## WindDirection 0.22266424 0.24296931 0.36505250 1.000000000
## NH3 0.65798769 0.57815263 0.63180393 0.085675167
## RH 0.34517373 0.25502520 -0.07986845 -0.154386275
## Temp -0.46138203 -0.44560131 -0.47019550 -0.137419762
## WindSpeed -0.67958725 -0.63574920 -0.39545378 -0.241231651
## VerticalWindSpeed -0.00740521 -0.09134434 0.01580934 -0.078233739
## Solar -0.48401084 -0.41559695 -0.37328750 -0.004598241
## BarPressure 0.33281591 0.32513134 0.38291915 0.141059308
## PD_PM2.5 0.61853009 0.53926676 0.76602552 0.206533316
## PD_PM10 0.50310007 0.45332846 0.81136785 0.220450779
## PD_NO2 0.42810752 0.45920813 0.52338004 0.122535054
## PD_SO2 0.37349682 0.40680951 0.54857800 0.201381554
## PD_CO 0.55882065 0.50353813 0.42373914 0.069157579
## NH3 RH Temp WindSpeed
## NO 0.621298889 0.14705025 -0.4820548 -0.579092113
## CO 0.496419835 0.25194487 -0.3878201 -0.441405478
## NO2 0.530400134 -0.11971908 -0.3498939 -0.610689885
## O3 -0.001958607 -0.61250242 0.1076558 -0.011432555
## SO2 0.313170020 -0.14571485 -0.2953766 -0.200518902
## PM2.5 0.789478643 0.17245031 -0.6116405 -0.516830320
## Benzene 0.756528972 0.44906696 -0.6497833 -0.685447696
## Toulene 0.661750210 0.44240742 -0.5242787 -0.723671972
## P_Xylene 0.657987695 0.34517373 -0.4613820 -0.679587253
## NOx 0.578152629 0.25502520 -0.4456013 -0.635749200
## PM10 0.631803934 -0.07986845 -0.4701955 -0.395453785
## WindDirection 0.085675167 -0.15438628 -0.1374198 -0.241231651
## NH3 1.000000000 0.39461182 -0.6612095 -0.481880128
## RH 0.394611818 1.00000000 -0.5773227 -0.288804264
## Temp -0.661209462 -0.57732266 1.0000000 0.408975428
## WindSpeed -0.481880128 -0.28880426 0.4089754 1.000000000
## VerticalWindSpeed -0.124593085 -0.35349227 0.2954563 0.004698092
## Solar -0.600596228 -0.60120581 0.7718077 0.446223575
## BarPressure 0.511212413 0.18107633 -0.5173927 -0.210725045
## PD_PM2.5 0.737144348 0.18356150 -0.6216734 -0.409260603
## PD_PM10 0.607459569 -0.02185193 -0.5083227 -0.301171150
## PD_NO2 0.514390337 -0.12300313 -0.3249090 -0.433981971
## PD_SO2 0.481841057 -0.16651690 -0.3993471 -0.251553694
## PD_CO 0.491399970 0.27474197 -0.4019045 -0.344399910
## VerticalWindSpeed Solar BarPressure PD_PM2.5
## NO -0.050620948 -0.387517836 0.3903593 0.60674886
## CO -0.081695487 -0.442921910 0.2181974 0.49776287
## NO2 0.137735151 -0.339417282 0.2773735 0.47920564
## O3 0.144604092 0.281918705 0.2097555 0.06128692
## SO2 -0.072839544 -0.130512131 0.4018930 0.31348411
## PM2.5 -0.074819524 -0.520538766 0.4311048 0.84058878
## Benzene -0.095563709 -0.607748677 0.4461328 0.71231859
## Toulene -0.079075040 -0.527483222 0.3938461 0.58906063
## P_Xylene -0.007405210 -0.484010841 0.3328159 0.61853009
## NOx -0.091344345 -0.415596953 0.3251313 0.53926676
## PM10 0.015809345 -0.373287496 0.3829191 0.76602552
## WindDirection -0.078233739 -0.004598241 0.1410593 0.20653332
## NH3 -0.124593085 -0.600596228 0.5112124 0.73714435
## RH -0.353492275 -0.601205813 0.1810763 0.18356150
## Temp 0.295456254 0.771807663 -0.5173927 -0.62167345
## WindSpeed 0.004698092 0.446223575 -0.2107250 -0.40926060
## VerticalWindSpeed 1.000000000 0.161457566 -0.3713759 -0.06067449
## Solar 0.161457566 1.000000000 -0.2780028 -0.55260236
## BarPressure -0.371375892 -0.278002823 1.0000000 0.43721807
## PD_PM2.5 -0.060674490 -0.552602356 0.4372181 1.00000000
## PD_PM10 0.024801184 -0.447327133 0.3868305 0.90004926
## PD_NO2 0.171290219 -0.287732044 0.3355824 0.57745670
## PD_SO2 -0.049614518 -0.254231834 0.5181450 0.51520012
## PD_CO -0.044676261 -0.436613669 0.2527666 0.56953828
## PD_PM10 PD_NO2 PD_SO2 PD_CO
## NO 0.55552959 0.4924962 0.49745230 0.51037323
## CO 0.42245257 0.3466936 0.30763592 0.48592215
## NO2 0.48154105 0.7287087 0.48373420 0.31555691
## O3 0.16694131 0.3220070 0.37016379 -0.10628880
## SO2 0.31946031 0.3036832 0.70488667 0.15642631
## PM2.5 0.77639440 0.5050479 0.51549579 0.50485710
## Benzene 0.57753277 0.4779433 0.39044229 0.57254792
## Toulene 0.45458289 0.4472849 0.33616007 0.53438208
## P_Xylene 0.50310007 0.4281075 0.37349682 0.55882065
## NOx 0.45332846 0.4592081 0.40680951 0.50353813
## PM10 0.81136785 0.5233800 0.54857800 0.42373914
## WindDirection 0.22045078 0.1225351 0.20138155 0.06915758
## NH3 0.60745957 0.5143903 0.48184106 0.49139997
## RH -0.02185193 -0.1230031 -0.16651690 0.27474197
## Temp -0.50832271 -0.3249090 -0.39934713 -0.40190450
## WindSpeed -0.30117115 -0.4339820 -0.25155369 -0.34439991
## VerticalWindSpeed 0.02480118 0.1712902 -0.04961452 -0.04467626
## Solar -0.44732713 -0.2877320 -0.25423183 -0.43661367
## BarPressure 0.38683054 0.3355824 0.51814501 0.25276661
## PD_PM2.5 0.90004926 0.5774567 0.51520012 0.56953828
## PD_PM10 1.00000000 0.5739629 0.55016779 0.49497449
## PD_NO2 0.57396293 1.0000000 0.48534503 0.41081304
## PD_SO2 0.55016779 0.4853450 1.00000000 0.31641021
## PD_CO 0.49497449 0.4108130 0.31641021 1.00000000
library(corrplot)
## corrplot 0.84 loaded
corrplot(matrix)
corrplot(matrix, type = "upper", method = "number")
corrplot.mixed(matrix,lower = "number", upper = "pie", tl.col = "black", tl.pos = "lt")
library(psych)
cortest.bartlett(matrix)
## Warning in cortest.bartlett(matrix): n not specified, 100 used
## $chisq
## [1] 2563.405
##
## $p.value
## [1] 0
##
## $df
## [1] 276
# Default n is 100, if sample is less than 100 rows, always good to declare the number of observations in the dataset
cortest.bartlett(matrix, nrow(pollution_data))
## $chisq
## [1] 9585.526
##
## $p.value
## [1] 0
##
## $df
## [1] 276
P value is less than default significance level (0.05) hence this test suggest that this data is suitable to do PCA/FA
KMO(matrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = matrix)
## Overall MSA = 0.88
## MSA for each item =
## NO CO NO2 O3
## 0.84 0.97 0.88 0.73
## SO2 PM2.5 Benzene Toulene
## 0.87 0.90 0.92 0.90
## P_Xylene NOx PM10 WindDirection
## 0.90 0.85 0.90 0.80
## NH3 RH Temp WindSpeed
## 0.91 0.66 0.81 0.95
## VerticalWindSpeed Solar BarPressure PD_PM2.5
## 0.64 0.91 0.90 0.89
## PD_PM10 PD_NO2 PD_SO2 PD_CO
## 0.89 0.90 0.88 0.98
Overall MSA is greater than 0.5 which means we have adequate number of samples in the dataset, we can go ahead with the analysis
A <- eigen(matrix)
EV <- A$values
print(EV)
## [1] 11.21500818 2.98569744 1.78929681 1.34697073 1.11005651
## [6] 0.89010563 0.73051047 0.60189395 0.57041627 0.41180518
## [11] 0.38091795 0.32580388 0.31021327 0.27113681 0.24302887
## [16] 0.21258838 0.16697205 0.14337784 0.09715854 0.08089151
## [21] 0.04637548 0.03109583 0.02014172 0.01853671
plot(EV, xlab = "Principal Components", ylab = "Eigen Value", pch = 20, col = "blue")
lines(EV, col = "red")
abline(h=1, col = "green", lty = 2)
As per Kaiser rule we will go with the number of components/factors where anything for EV above 1
pr1 <- principal(matrix, nfactors = 5, rotate = "none")
print(pr1)
## Principal Components Analysis
## Call: principal(r = matrix, nfactors = 5, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 PC5 h2 u2 com
## NO 0.86 0.06 0.24 0.23 0.08 0.86 0.138 1.3
## CO 0.69 -0.15 0.23 0.06 -0.02 0.56 0.442 1.3
## NO2 0.68 0.37 0.23 -0.02 -0.29 0.75 0.254 2.2
## O3 0.07 0.79 -0.03 0.14 -0.27 0.73 0.274 1.3
## SO2 0.44 0.44 -0.30 0.36 -0.12 0.62 0.378 3.8
## PM2.5 0.88 0.09 -0.10 -0.17 0.22 0.87 0.129 1.2
## Benzene 0.91 -0.23 0.13 0.06 0.01 0.91 0.091 1.2
## Toulene 0.85 -0.26 0.27 0.18 -0.05 0.90 0.104 1.5
## P_Xylene 0.86 -0.18 0.34 0.15 0.03 0.92 0.084 1.5
## NOx 0.83 -0.06 0.32 0.27 0.00 0.86 0.137 1.5
## PM10 0.79 0.31 -0.07 -0.19 0.32 0.87 0.135 1.8
## WindDirection 0.28 0.23 0.03 0.33 0.70 0.73 0.273 2.1
## NH3 0.83 -0.10 -0.18 -0.14 -0.12 0.76 0.235 1.2
## RH 0.30 -0.88 -0.16 0.03 -0.13 0.90 0.096 1.4
## Temp -0.70 0.30 0.43 0.05 0.09 0.79 0.214 2.1
## WindSpeed -0.67 0.12 -0.36 -0.19 0.15 0.65 0.346 1.9
## VerticalWindSpeed -0.10 0.34 0.55 -0.47 -0.10 0.66 0.338 2.8
## Solar -0.64 0.44 0.21 0.24 0.15 0.73 0.269 2.5
## BarPressure 0.53 0.09 -0.54 0.29 -0.16 0.69 0.307 2.8
## PD_PM2.5 0.84 0.08 -0.20 -0.35 0.16 0.89 0.108 1.6
## PD_PM10 0.75 0.26 -0.20 -0.41 0.22 0.89 0.114 2.2
## PD_NO2 0.64 0.39 0.10 -0.23 -0.29 0.70 0.297 2.5
## PD_SO2 0.60 0.48 -0.32 0.14 -0.13 0.74 0.264 2.8
## PD_CO 0.63 -0.15 0.05 -0.22 0.02 0.47 0.527 1.4
##
## PC1 PC2 PC3 PC4 PC5
## SS loadings 11.22 2.99 1.79 1.35 1.11
## Proportion Var 0.47 0.12 0.07 0.06 0.05
## Cumulative Var 0.47 0.59 0.67 0.72 0.77
## Proportion Explained 0.61 0.16 0.10 0.07 0.06
## Cumulative Proportion 0.61 0.77 0.87 0.94 1.00
##
## Mean item complexity = 1.9
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.05
##
## Fit based upon off diagonal values = 0.99
To better understand the division of variables under different Principal Components, lets do a cutoff
print(pr1$loadings, cutoff = 0.5)
##
## Loadings:
## PC1 PC2 PC3 PC4 PC5
## NO 0.861
## CO 0.690
## NO2 0.685
## O3 0.792
## SO2
## PM2.5 0.882
## Benzene 0.914
## Toulene 0.850
## P_Xylene 0.864
## NOx 0.827
## PM10 0.790
## WindDirection 0.697
## NH3 0.830
## RH -0.877
## Temp -0.705
## WindSpeed -0.674
## VerticalWindSpeed 0.554
## Solar -0.639
## BarPressure 0.532 -0.539
## PD_PM2.5 0.838
## PD_PM10 0.751
## PD_NO2 0.636
## PD_SO2 0.602
## PD_CO 0.632
##
## PC1 PC2 PC3 PC4 PC5
## SS loadings 11.215 2.986 1.789 1.347 1.110
## Proportion Var 0.467 0.124 0.075 0.056 0.046
## Cumulative Var 0.467 0.592 0.666 0.722 0.769
We clearly see all the separations and there is no variable in PC4, so we can limit the number of factors to 4 instead of 5.
BarPressure has the same explainability with PC1 & PC3, let’s try to use rotation and see if we get a clear picture
pr2 <- principal(matrix, nfactors = 5, rotate = "varimax")
print(pr2)
## Principal Components Analysis
## Call: principal(r = matrix, nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC4 RC2 RC3 RC5 h2 u2 com
## NO 0.80 0.33 0.24 0.09 0.22 0.86 0.138 1.7
## CO 0.68 0.29 0.01 0.06 0.01 0.56 0.442 1.4
## NO2 0.56 0.32 0.53 -0.17 -0.12 0.75 0.254 2.9
## O3 -0.08 -0.04 0.83 -0.18 0.03 0.73 0.274 1.1
## SO2 0.18 0.14 0.66 0.33 0.15 0.62 0.378 1.9
## PM2.5 0.48 0.75 0.16 0.12 0.19 0.87 0.129 2.0
## Benzene 0.80 0.47 0.00 0.22 0.02 0.91 0.091 1.8
## Toulene 0.89 0.29 -0.01 0.17 0.01 0.90 0.104 1.3
## P_Xylene 0.89 0.32 0.01 0.08 0.10 0.92 0.084 1.3
## NOx 0.87 0.22 0.15 0.10 0.15 0.86 0.137 1.3
## PM10 0.35 0.75 0.28 -0.02 0.33 0.87 0.135 2.2
## WindDirection 0.17 0.13 0.08 0.07 0.82 0.73 0.273 1.2
## NH3 0.49 0.63 0.13 0.28 -0.15 0.76 0.235 2.6
## RH 0.36 0.11 -0.59 0.55 -0.34 0.90 0.096 3.4
## Temp -0.33 -0.57 0.01 -0.56 0.17 0.79 0.214 2.8
## WindSpeed -0.79 -0.13 -0.09 -0.03 0.04 0.65 0.346 1.1
## VerticalWindSpeed 0.02 0.06 0.08 -0.79 -0.15 0.66 0.338 1.1
## Solar -0.39 -0.55 0.21 -0.35 0.32 0.73 0.269 3.7
## BarPressure 0.18 0.29 0.42 0.63 -0.01 0.69 0.307 2.4
## PD_PM2.5 0.34 0.86 0.13 0.11 0.06 0.89 0.108 1.4
## PD_PM10 0.21 0.88 0.23 -0.02 0.13 0.89 0.114 1.3
## PD_NO2 0.39 0.48 0.50 -0.20 -0.20 0.70 0.297 3.6
## PD_SO2 0.20 0.40 0.68 0.26 0.07 0.74 0.264 2.2
## PD_CO 0.45 0.51 -0.06 0.04 -0.08 0.47 0.527 2.1
##
## RC1 RC4 RC2 RC3 RC5
## SS loadings 6.68 5.22 2.98 2.26 1.30
## Proportion Var 0.28 0.22 0.12 0.09 0.05
## Cumulative Var 0.28 0.50 0.62 0.71 0.77
## Proportion Explained 0.36 0.28 0.16 0.12 0.07
## Cumulative Proportion 0.36 0.65 0.81 0.93 1.00
##
## Mean item complexity = 2
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.05
##
## Fit based upon off diagonal values = 0.99
print(pr2$loadings, cutoff = 0.5)
##
## Loadings:
## RC1 RC4 RC2 RC3 RC5
## NO 0.801
## CO 0.684
## NO2 0.564 0.530
## O3 0.828
## SO2 0.660
## PM2.5 0.752
## Benzene 0.801
## Toulene 0.886
## P_Xylene 0.895
## NOx 0.871
## PM10 0.745
## WindDirection 0.818
## NH3 0.633
## RH -0.592 0.546
## Temp -0.572 -0.563
## WindSpeed -0.791
## VerticalWindSpeed -0.793
## Solar -0.552
## BarPressure 0.632
## PD_PM2.5 0.861
## PD_PM10 0.877
## PD_NO2
## PD_SO2 0.680
## PD_CO 0.511
##
## RC1 RC4 RC2 RC3 RC5
## SS loadings 6.682 5.219 2.984 2.263 1.299
## Proportion Var 0.278 0.217 0.124 0.094 0.054
## Cumulative Var 0.278 0.496 0.620 0.714 0.769
Looks like PCA without Rotation gave better result and we can assign BarPressure to PC1 since it has a higher explainability to it.
fa1 <- fa(matrix, nfactors = 5, rotate = "none", fm = "pa") # Function from psych package
print(fa1)
## Factor Analysis using method = pa
## Call: fa(r = matrix, nfactors = 5, rotate = "none", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 PA4 PA5 h2 u2 com
## NO 0.86 0.06 0.29 0.15 0.20 0.89 0.107 1.4
## CO 0.67 -0.12 0.19 -0.01 0.09 0.50 0.495 1.3
## NO2 0.68 0.37 0.21 0.00 -0.41 0.81 0.188 2.5
## O3 0.06 0.71 0.01 0.19 -0.15 0.57 0.433 1.3
## SO2 0.42 0.38 -0.17 0.38 0.04 0.50 0.501 3.3
## PM2.5 0.88 0.10 -0.14 -0.18 0.15 0.86 0.142 1.2
## Benzene 0.92 -0.23 0.13 0.01 0.01 0.91 0.089 1.2
## Toulene 0.85 -0.26 0.30 0.10 -0.01 0.89 0.107 1.5
## P_Xylene 0.87 -0.18 0.38 0.04 0.13 0.95 0.052 1.5
## NOx 0.83 -0.06 0.37 0.17 0.12 0.86 0.135 1.5
## PM10 0.79 0.32 -0.12 -0.22 0.23 0.84 0.163 1.8
## WindDirection 0.26 0.18 0.05 0.08 0.23 0.16 0.842 3.1
## NH3 0.82 -0.09 -0.19 -0.03 -0.11 0.73 0.272 1.2
## RH 0.30 -0.91 -0.18 0.09 -0.12 0.96 0.038 1.4
## Temp -0.70 0.29 0.43 -0.10 0.15 0.79 0.211 2.2
## WindSpeed -0.66 0.11 -0.31 -0.07 0.21 0.59 0.408 1.8
## VerticalWindSpeed -0.10 0.28 0.32 -0.36 -0.16 0.34 0.656 3.5
## Solar -0.63 0.41 0.25 0.11 0.23 0.69 0.306 2.5
## BarPressure 0.52 0.08 -0.37 0.39 0.00 0.56 0.439 2.8
## PD_PM2.5 0.84 0.09 -0.28 -0.31 0.08 0.89 0.108 1.6
## PD_PM10 0.75 0.28 -0.29 -0.37 0.13 0.88 0.124 2.2
## PD_NO2 0.62 0.37 0.05 -0.13 -0.36 0.67 0.328 2.4
## PD_SO2 0.59 0.45 -0.25 0.28 -0.01 0.69 0.307 2.8
## PD_CO 0.60 -0.12 0.01 -0.14 0.02 0.40 0.600 1.2
##
## PA1 PA2 PA3 PA4 PA5
## SS loadings 11.00 2.72 1.50 0.99 0.73
## Proportion Var 0.46 0.11 0.06 0.04 0.03
## Cumulative Var 0.46 0.57 0.63 0.68 0.71
## Proportion Explained 0.65 0.16 0.09 0.06 0.04
## Cumulative Proportion 0.65 0.81 0.90 0.96 1.00
##
## Mean item complexity = 2
## Test of the hypothesis that 5 factors are sufficient.
##
## The degrees of freedom for the null model are 276 and the objective function was 28.43
## The degrees of freedom for the model are 166 and the objective function was 5.24
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.04
##
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2 PA3 PA4 PA5
## Correlation of (regression) scores with factors 0.99 0.98 0.95 0.89 0.88
## Multiple R square of scores with factors 0.99 0.97 0.91 0.78 0.78
## Minimum correlation of possible factor scores 0.97 0.94 0.81 0.57 0.56
fa.diagram(fa1)
Looks like rotation will help in this case as the factors don’t give a clear picture of separation
fa2 <- fa(matrix, nfactors = 5, rotate = "varimax", fm = "pa") # Varimax Rotation
print(fa2)
## Factor Analysis using method = pa
## Call: fa(r = matrix, nfactors = 5, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA3 PA2 PA4 PA5 h2 u2 com
## NO 0.82 0.32 0.29 0.09 -0.14 0.89 0.107 1.7
## CO 0.64 0.30 0.03 0.10 0.00 0.50 0.495 1.5
## NO2 0.52 0.28 0.49 -0.21 0.42 0.81 0.188 3.9
## O3 -0.09 0.00 0.70 -0.27 0.03 0.57 0.433 1.3
## SO2 0.18 0.16 0.63 0.22 -0.06 0.50 0.501 1.6
## PM2.5 0.50 0.74 0.21 0.14 0.00 0.86 0.142 2.0
## Benzene 0.80 0.43 0.04 0.25 0.13 0.91 0.089 1.8
## Toulene 0.88 0.25 0.03 0.19 0.12 0.89 0.107 1.3
## P_Xylene 0.92 0.30 0.04 0.09 -0.03 0.95 0.052 1.2
## NOx 0.88 0.22 0.19 0.10 -0.06 0.86 0.135 1.3
## PM10 0.39 0.75 0.33 -0.02 -0.12 0.84 0.163 2.0
## WindDirection 0.20 0.15 0.21 0.00 -0.23 0.16 0.842 3.8
## NH3 0.48 0.55 0.16 0.31 0.25 0.73 0.272 3.2
## RH 0.33 0.03 -0.56 0.68 0.29 0.96 0.038 2.8
## Temp -0.32 -0.47 -0.08 -0.61 -0.31 0.79 0.211 3.1
## WindSpeed -0.70 -0.14 -0.10 -0.03 -0.27 0.59 0.408 1.4
## VerticalWindSpeed 0.00 0.03 -0.01 -0.57 0.12 0.34 0.656 1.1
## Solar -0.37 -0.44 0.16 -0.42 -0.41 0.69 0.306 4.2
## BarPressure 0.18 0.25 0.44 0.51 0.04 0.56 0.439 2.8
## PD_PM2.5 0.36 0.84 0.15 0.14 0.09 0.89 0.108 1.5
## PD_PM10 0.24 0.87 0.24 0.00 0.02 0.88 0.124 1.3
## PD_NO2 0.36 0.42 0.43 -0.19 0.39 0.67 0.328 4.3
## PD_SO2 0.21 0.36 0.69 0.20 0.01 0.69 0.307 1.9
## PD_CO 0.45 0.42 -0.02 0.11 0.09 0.40 0.600 2.2
##
## PA1 PA3 PA2 PA4 PA5
## SS loadings 6.53 4.58 2.76 2.12 0.96
## Proportion Var 0.27 0.19 0.12 0.09 0.04
## Cumulative Var 0.27 0.46 0.58 0.67 0.71
## Proportion Explained 0.39 0.27 0.16 0.12 0.06
## Cumulative Proportion 0.39 0.66 0.82 0.94 1.00
##
## Mean item complexity = 2.2
## Test of the hypothesis that 5 factors are sufficient.
##
## The degrees of freedom for the null model are 276 and the objective function was 28.43
## The degrees of freedom for the model are 166 and the objective function was 5.24
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.04
##
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA3 PA2 PA4 PA5
## Correlation of (regression) scores with factors 0.98 0.96 0.94 0.93 0.89
## Multiple R square of scores with factors 0.97 0.92 0.88 0.87 0.79
## Minimum correlation of possible factor scores 0.94 0.84 0.76 0.73 0.58
fa.diagram(fa2)
new_fa <- fa(pollution_data, nfactors = 5, rotate = "varimax") # New FA with dataset as the input
new_fa_scores <- new_fa$scores # Fetch the Scores for all observations
newdata <- cbind(data[,8], new_fa_scores)
colnames(newdata) <- c("PM2.5", "factor1", "factor2", "factor3", "factor4", "factor5")
newdata <- as.data.frame(newdata) # Convert the dataset into dataframe for analysis
Split the data into 2 parts (70:30), create model on 70% data and validate the model on the rest 30%
set.seed(300) # Marking the 70% data so that everytime the model is run with the same set of observations in train & test
indices= sample(1:nrow(newdata), 0.7*nrow(newdata)) # Get the Splitting ratio
train = newdata[indices,] # 70% data on which model will be created is known as training data
test = newdata[-indices,] # 30% data on which model is validated is known as test data
model <- lm(PM2.5 ~ ., data = train)
summary(model)
##
## Call:
## lm(formula = PM2.5 ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -115.291 -21.138 -4.201 13.400 138.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135.392 2.566 52.764 < 2e-16 ***
## factor1 41.431 2.600 15.937 < 2e-16 ***
## factor2 55.134 2.733 20.174 < 2e-16 ***
## factor3 21.278 2.588 8.223 1.33e-14 ***
## factor4 30.399 3.152 9.644 < 2e-16 ***
## factor5 13.736 3.042 4.515 1.00e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.85 on 236 degrees of freedom
## Multiple R-squared: 0.8001, Adjusted R-squared: 0.7958
## F-statistic: 188.9 on 5 and 236 DF, p-value: < 2.2e-16
pred <- predict(model, newdata = test) # Predicting the value of mpeg basis the model we created
print(pred)
## 2 8 11 13 14 16 20
## 65.23659 59.31659 58.40312 125.07258 103.75376 111.20893 109.89821
## 23 24 27 29 32 34 36
## 56.22766 55.09754 115.17040 130.04334 115.90107 88.02867 123.64754
## 41 42 48 58 60 63 64
## 64.07610 59.97295 81.35553 59.72030 92.98331 86.21572 69.39076
## 67 74 76 79 80 82 87
## 111.29542 96.69836 56.18432 85.77372 96.32997 59.10137 41.81266
## 93 94 97 99 108 109 110
## 59.10010 66.35540 54.75233 47.33670 46.81475 58.58964 51.75934
## 112 123 127 137 139 145 146
## 27.93108 57.69575 89.51580 49.78534 70.97577 69.10980 56.52830
## 148 157 161 163 164 165 166
## 60.88432 73.88474 115.03974 95.05312 97.54431 76.89740 60.46939
## 202 211 218 223 224 225 226
## 154.07563 152.39611 234.40123 260.59057 266.14348 318.61584 310.07360
## 227 232 233 238 240 241 243
## 296.22519 218.78178 209.08794 220.48516 316.53335 350.40590 282.96281
## 244 245 246 251 253 268 269
## 261.15226 227.04338 233.59037 353.57853 202.32787 239.30635 207.76596
## 270 271 273 280 291 295 296
## 246.67427 229.99652 206.96458 210.25812 165.29813 211.83064 247.16510
## 297 301 302 309 310 314 315
## 177.41564 212.10145 173.84930 169.54414 179.38103 184.54009 200.92681
## 316 323 329 333 334 336 338
## 152.86998 134.73449 189.35367 117.02821 117.53976 155.53335 97.48087
## 344 347 355 356 357 358 360
## 69.16929 122.77815 135.90085 133.22076 92.91684 91.04382 146.85076
## 363 364 367 371 376 387 400
## 122.78123 138.83106 127.46221 176.24675 160.98754 132.42630 189.72443
Comparing RSquare of train & test data - Calculating Rsquare value manually for the test data as we can’t see a rsquare using summary for prediction
Add the new pred column to the test dataset
test$pred <- pred
Model is valid if Rsquare value for train & test data is close or if Rsquare of test is better
rsquare <- cor(test$PM2.5,test$pred)^2
print(rsquare)
## [1] 0.8018959
SST = sum((test$PM2.5 - mean(train$PM2.5))^2) # Calculate the Sum of Squares of Total
SSE <- sum((pred - test$PM2.5)^2) # Calculate the Sum of Squares of Error
SSR <- sum((pred - mean(train$PM2.5))^2) # Calculate the Sum of Square of Residual
rsqr <- SSR/SST
print(rsqr)
## [1] 0.8175005
library(Metrics)
rmse(test$PM2.5, pred)
## [1] 39.59757
We get a RMSE of 39.59757 which means that the error lies in between the min & 1st quartile of the distribution of the dataset.
summary(data$PM2.5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.75 67.25 108.24 132.76 180.43 550.23
summary(train$PM2.5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.75 68.06 103.58 133.39 187.70 448.18
summary(test$PM2.5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.08 66.60 114.32 131.30 160.26 550.23
When RMSE is within the low range of distribution, Model is considered to be statistically significant (Good Model).
mape(pred, test$PM2.5)
## [1] 0.1932948