Introduction

In this article I will run Principal Component Analysis (PCA). This method is used to reduce dimension. Dimension reduction is the process of reducing the number of variables to a set of values of variables called principal variables.

Data

I used the data from Kaggle (https://www.kaggle.com/airqualityanthony/leamington-aurn-air-quality-data) which contains information about air quality in Leamington Spa Rugby Road AURN Station for 2009-2019.

library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot)
## corrplot 0.84 loaded
lear <- read.csv("LEAR.csv",sep=',')
head(lear)
##   X                date code                      site no no2 pm10 pm2.5 nox
## 1 1 2012-01-01 00:00:00 LEAR Leamington Spa Rugby Road NA  NA   NA    NA  NA
## 2 2 2012-01-01 01:00:00 LEAR Leamington Spa Rugby Road NA  NA   NA    NA  NA
## 3 3 2012-01-01 02:00:00 LEAR Leamington Spa Rugby Road NA  NA   NA    NA  NA
## 4 4 2012-01-01 03:00:00 LEAR Leamington Spa Rugby Road NA  NA   NA    NA  NA
## 5 5 2012-01-01 04:00:00 LEAR Leamington Spa Rugby Road NA  NA   NA    NA  NA
## 6 6 2012-01-01 05:00:00 LEAR Leamington Spa Rugby Road NA  NA   NA    NA  NA
##   nv10 v10 nv2.5 v2.5 ws wd latitude longitude     site.type
## 1   NA  NA    NA   NA NA NA 52.29488 -1.542911 Urban Traffic
## 2   NA  NA    NA   NA NA NA 52.29488 -1.542911 Urban Traffic
## 3   NA  NA    NA   NA NA NA 52.29488 -1.542911 Urban Traffic
## 4   NA  NA    NA   NA NA NA 52.29488 -1.542911 Urban Traffic
## 5   NA  NA    NA   NA NA NA 52.29488 -1.542911 Urban Traffic
## 6   NA  NA    NA   NA NA NA 52.29488 -1.542911 Urban Traffic

Many rows contain NA so we have to omit that rows. What’s more, we can not run PCA on all columns so we select only data contains information about air condition. I created correlation matrix.

lear_clear <- na.omit(lear) 
pca_lear <- lear_clear[, 5:14]
head(pca_lear)
##      no no2 pm10 pm2.5 nox nv10 v10 nv2.5 v2.5  ws
## 3659  5  21    7     1  29    8  -1     1    0 0.5
## 3660  2  17    8     5  21    7   1     6   -1 0.5
## 3661  2  15   10     8  19   10   0     7    1 0.6
## 3662  4  21    8     9  27    8   0     7    2 0.7
## 3663  5  27   12     9  34   10   2     8    1 0.3
## 3664  6  29   16    10  38   12   4     8    2 0.6
cor(pca_lear)
##               no        no2       pm10      pm2.5        nox       nv10
## no     1.0000000  0.6272406  0.3917034  0.4139024  0.9540851  0.4165769
## no2    0.6272406  1.0000000  0.4943874  0.5308617  0.8317067  0.5080110
## pm10   0.3917034  0.4943874  1.0000000  0.9097937  0.4692051  0.9738191
## pm2.5  0.4139024  0.5308617  0.9097937  1.0000000  0.4991542  0.8778759
## nox    0.9540851  0.8317067  0.4692051  0.4991542  1.0000000  0.4922268
## nv10   0.4165769  0.5080110  0.9738191  0.8778759  0.4922268  1.0000000
## v10    0.1765160  0.2785238  0.7266120  0.6879937  0.2326930  0.5516335
## nv2.5  0.4514532  0.5567832  0.9073420  0.9673263  0.5359112  0.9129132
## v2.5   0.2000605  0.3185216  0.6626572  0.7840825  0.2649854  0.5443361
## ws    -0.2206839 -0.3513879 -0.2613853 -0.2926741 -0.2923481 -0.2453564
##              v10      nv2.5       v2.5         ws
## no     0.1765160  0.4514532  0.2000605 -0.2206839
## no2    0.2785238  0.5567832  0.3185216 -0.3513879
## pm10   0.7266120  0.9073420  0.6626572 -0.2613853
## pm2.5  0.6879937  0.9673263  0.7840825 -0.2926741
## nox    0.2326930  0.5359112  0.2649854 -0.2923481
## nv10   0.5516335  0.9129132  0.5443361 -0.2453564
## v10    1.0000000  0.5711187  0.7863016 -0.2201102
## nv2.5  0.5711187  1.0000000  0.6241799 -0.2979088
## v2.5   0.7863016  0.6241799  1.0000000 -0.1904726
## ws    -0.2201102 -0.2979088 -0.1904726  1.0000000

PCA

I used two different function to run PCA.

PRINCOMP()

Firstly, I used princomp() function. I print some basic statistics of all components.

Below we can see scree plot. Note where the ‘kink’ in the graph is. This turning point in the level of explained variance indicates the number of principal components present in the data. In this case, there is just one, given the precipitous fall-off in the explained variance when moving from ‘one’ to ‘two’.

fit <- princomp(na.omit(pca_lear), cor = TRUE)
summary(fit)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     2.4366664 1.3387061 0.93759635 0.81156490 0.59637316
## Proportion of Variance 0.5937343 0.1792134 0.08790869 0.06586376 0.03556609
## Cumulative Proportion  0.5937343 0.7729477 0.86085640 0.92672016 0.96228626
##                            Comp.6    Comp.7       Comp.8       Comp.9
## Standard deviation     0.51676216 0.3219922 0.0797937205 6.242902e-03
## Proportion of Variance 0.02670431 0.0103679 0.0006367038 3.897383e-06
## Cumulative Proportion  0.98899057 0.9993585 0.9999951742 9.999991e-01
##                             Comp.10
## Standard deviation     3.046928e-03
## Proportion of Variance 9.283768e-07
## Cumulative Proportion  1.000000e+00
loadings(fit)
## 
## Loadings:
##       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## no     0.250  0.511  0.180  0.206  0.537                              0.554 
## no2    0.288  0.386               -0.798 -0.180                       0.299 
## pm10   0.381 -0.184        -0.228        -0.311  0.294        -0.754        
## pm2.5  0.388 -0.170                       0.345 -0.336 -0.758               
## nox    0.289  0.513  0.108  0.179                                    -0.777 
## nv10   0.367 -0.110  0.139 -0.445        -0.176  0.459         0.622        
## v10    0.288 -0.346         0.506  0.127 -0.607 -0.323         0.206        
## nv2.5  0.381               -0.322         0.254 -0.542  0.614               
## v2.5   0.297 -0.326         0.529 -0.103  0.529  0.434  0.210               
## ws    -0.157 -0.136  0.952  0.140 -0.163                                    
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings       1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
## Proportion Var    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
## Cumulative Var    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
##                Comp.10
## SS loadings        1.0
## Proportion Var     0.1
## Cumulative Var     1.0
plot(fit,type="lines")

Now we can add first principal component to our dataframe.

lear_clear$pc_scores <- fit$scores
pca_final <- lear_clear
head(pca_final)
##         X                date code                      site no no2 pm10 pm2.5
## 3659 3659 2012-06-01 10:00:00 LEAR Leamington Spa Rugby Road  5  21    7     1
## 3660 3660 2012-06-01 11:00:00 LEAR Leamington Spa Rugby Road  2  17    8     5
## 3661 3661 2012-06-01 12:00:00 LEAR Leamington Spa Rugby Road  2  15   10     8
## 3662 3662 2012-06-01 13:00:00 LEAR Leamington Spa Rugby Road  4  21    8     9
## 3663 3663 2012-06-01 14:00:00 LEAR Leamington Spa Rugby Road  5  27   12     9
## 3664 3664 2012-06-01 15:00:00 LEAR Leamington Spa Rugby Road  6  29   16    10
##      nox nv10 v10 nv2.5 v2.5  ws    wd latitude longitude     site.type
## 3659  29    8  -1     1    0 0.5 227.2 52.29488 -1.542911 Urban Traffic
## 3660  21    7   1     6   -1 0.5 270.7 52.29488 -1.542911 Urban Traffic
## 3661  19   10   0     7    1 0.6 156.6 52.29488 -1.542911 Urban Traffic
## 3662  27    8   0     7    2 0.7  40.5 52.29488 -1.542911 Urban Traffic
## 3663  34   10   2     8    1 0.3 194.8 52.29488 -1.542911 Urban Traffic
## 3664  38   12   4     8    2 0.6 325.3 52.29488 -1.542911 Urban Traffic
##      pc_scores.Comp.1 pc_scores.Comp.2 pc_scores.Comp.3 pc_scores.Comp.4
## 3659    -1.8104453917     1.4989181483    -1.6961337282    -0.8672604047
## 3660    -1.4765061459     0.9596157888    -1.7224792312    -1.0088537604
## 3661    -1.0660294882     0.6437343303    -1.5937957825    -1.1063039985
## 3662    -0.8953412705     0.8699294768    -1.5966107913    -0.7011248232
## 3663    -0.3276094423     0.9178937337    -1.7121486502    -0.7095390082
## 3664     0.2935270266     0.5706964095    -1.6034284404    -0.3271261175
##      pc_scores.Comp.5 pc_scores.Comp.6 pc_scores.Comp.7 pc_scores.Comp.8
## 3659     0.0219975459     0.1008595087     0.5079448862    -0.0356497521
## 3660     0.2202546082    -0.1547459042    -0.3880669283     0.0010885520
## 3661     0.2266905595     0.4517399051     0.0448153581    -0.0215987707
## 3662    -0.0830744516     0.6979658113     0.0055132386    -0.0257780041
## 3663    -0.1486088821    -0.0768479420    -0.2012226367    -0.0091106575
## 3664    -0.1324907969    -0.4593476683    -0.0981863959    -0.0099733820
##      pc_scores.Comp.9 pc_scores.Comp.10
## 3659    -0.0001379280     -0.0059575700
## 3660    -0.0009588762     -0.0172371901
## 3661    -0.0007120346     -0.0171216587
## 3662    -0.0012312427      0.0030960122
## 3663    -0.0006487376      0.0128908971
## 3664    -0.0001233950      0.0038479419

PRCOMP()

Second method which I will use is prcomp(). This method let us display some more graphics.

Firstly, I displayed eigenvalues with a scree plot.

pca_prcomp <- prcomp(na.omit(pca_lear, scale = TRUE))
fviz_eig(pca_prcomp, addlabels = TRUE, ylim = c(0, 85))

We can use get_eigenvalue() to display eigen values in data frame.

eig.val <- get_eigenvalue(pca_prcomp)
eig.val
##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.300343e+03     8.573815e+01                    85.73815
## Dim.2  2.484750e+02     9.261138e+00                    94.99929
## Dim.3  1.005530e+02     3.747803e+00                    98.74709
## Dim.4  1.600704e+01     5.966128e-01                    99.34371
## Dim.5  9.386185e+00     3.498410e-01                    99.69355
## Dim.6  5.882429e+00     2.192493e-01                    99.91280
## Dim.7  1.981130e+00     7.384050e-02                    99.98664
## Dim.8  3.499058e-01     1.304165e-02                    99.99968
## Dim.9  5.994503e-03     2.234265e-04                    99.99990
## Dim.10 2.613004e-03     9.739164e-05                   100.00000

Below we can see some grpahs. Firstly, we can see Quality of Representation (cos2). I plotted also total quality of representation (cos2) of variables on Dim.1 and Dim.2 and Individuals PCA.

res.var <- get_pca_var(pca_prcomp)
res.var$coord  
##            Dim.1      Dim.2      Dim.3       Dim.4       Dim.5        Dim.6
## no    18.0939198  3.5416069 -5.1739458  0.07123170 -0.01305072  0.045431325
## no2   13.3372532 -2.4589842  8.1829123 -0.12443968 -0.03758566 -0.091503770
## pm10   5.6874760 -8.7339462 -1.6560938 -1.44263754 -1.24767320  0.054562315
## pm2.5  5.4290320 -7.6009563 -1.0947332  2.42656578  0.43250791 -0.161038969
## nox   41.0773326  2.9762278  0.2515318 -0.01116148 -0.04610243 -0.026291469
## nv10   4.8565646 -6.8724005 -1.3415683 -2.11685533  0.58679895 -0.001494686
## v10    0.8175968 -1.8557058 -0.3094104  0.69041180 -1.83313323  0.062423143
## nv2.5  4.4889820 -5.7095404 -0.8512580  0.88365689  1.59704385  0.070102837
## v2.5   0.9278190 -1.8762210 -0.2240979  1.47988037 -1.17444578 -0.248334284
## ws    -0.7967879  0.4225929 -0.3853681 -0.29713698  0.06436571 -2.402479329
##              Dim.7         Dim.8         Dim.9        Dim.10
## no     0.002618540 -0.0018168453  5.690908e-02  1.164360e-04
## no2   -0.004957402 -0.0001587606  3.712020e-02  8.577614e-05
## pm10  -0.214850829 -0.0178108270 -2.441415e-04  2.941248e-02
## pm2.5  0.221356373  0.3370095465  1.822874e-04  1.796311e-03
## nox    0.001725446  0.0009633251 -3.711865e-02 -8.615341e-05
## nv10   0.508434577  0.0218009380  1.819953e-04 -2.951087e-02
## v10   -0.721918131  0.0211418012  3.915514e-05 -2.945815e-02
## nv2.5 -0.623630090 -0.3418700983 -9.026517e-05 -1.738472e-03
## v2.5   0.834796285 -0.3438052239 -1.262918e-04 -1.718557e-03
## ws    -0.143057907  0.0030647093  6.212378e-05 -7.268007e-05
res.var$contrib        
##             Dim.1       Dim.2       Dim.3        Dim.4        Dim.5
## no    14.23222463  5.04798364 26.62248426 3.169827e-02  0.001814596
## no2    7.73286204  2.43348515 66.59177846 9.674015e-02  0.015050646
## pm10   1.40619849 30.69999099  2.72756238 1.300180e+01 16.584888996
## pm2.5  1.28130428 23.25164588  1.19184950 3.678520e+01  1.992961830
## nox   73.35199643  3.56491813  0.06292029 7.782746e-04  0.022644284
## nv10   1.02533505 19.00790059  1.78990671 2.799441e+01  3.668508486
## v10    0.02905934  1.38591143  0.09520824 2.977868e+00 35.801311645
## nv2.5  0.87599815 13.11956795  0.72065473 4.878163e+00 27.173436086
## v2.5   0.03742260  1.41672392  0.04994366 1.368177e+01 14.695244681
## ws     0.02759897  0.07187232  0.14769177 5.515723e-01  0.044138750
##              Dim.6        Dim.7        Dim.8        Dim.9       Dim.10
## no    3.508764e-02 3.461029e-04 9.433760e-04 5.402690e+01 5.188415e-04
## no2   1.423381e-01 1.240496e-03 7.203345e-06 2.298622e+01 2.815743e-04
## pm10  5.060913e-02 2.330027e+00 9.066029e-02 9.943291e-04 3.310727e+01
## pm2.5 4.408646e-01 2.473267e+00 3.245886e+01 5.543194e-04 1.234875e-01
## nox   1.175095e-02 1.502759e-04 2.652129e-04 2.298429e+01 2.840566e-04
## nv10  3.797899e-05 1.304840e+01 1.358311e-01 5.525445e-04 3.332912e+01
## v10   6.624217e-02 2.630649e+01 1.277418e-01 2.557551e-05 3.321015e+01
## nv2.5 8.354385e-02 1.963094e+01 3.340190e+01 1.359212e-04 1.156632e-01
## v2.5  1.048375e+00 3.517612e+01 3.378111e+01 2.660706e-04 1.130285e-01
## ws    9.812115e+01 1.033025e+00 2.684278e-03 6.438173e-05 2.021579e-04
res.var$cos2     
##              Dim.1      Dim.2       Dim.3        Dim.4        Dim.5
## no     327.3899327 12.5429795 26.76971558 0.0050739547 0.0001703213
## no2    177.8823233  6.0466033 66.96005347 0.0154852339 0.0014126816
## pm10    32.3473836 76.2818158  2.74264672 2.0812030791 1.5566884085
## pm2.5   29.4743889 57.7745371  1.19844083 5.8882215073 0.1870630898
## nox   1687.3472558  8.8579319  0.06326826 0.0001245787 0.0021254345
## nv10    23.5862194 47.2298892  1.79980549 4.4810764978 0.3443330032
## v10      0.6684645  3.4436440  0.09573478 0.4766684502 3.3603774412
## nv2.5   20.1509590 32.5988521  0.72464019 0.7808494942 2.5505490561
## v2.5     0.8608481  3.5202054  0.05021987 2.1900459042 1.3793228921
## ws       0.6348709  0.1785848  0.14850855 0.0882903871 0.0041429448
##              Dim.6        Dim.7        Dim.8        Dim.9       Dim.10
## no    2.064005e-03 6.856750e-06 3.300927e-06 3.238644e-03 1.355735e-08
## no2   8.372940e-03 2.457584e-05 2.520492e-08 1.377909e-03 7.357547e-09
## pm10  2.977046e-03 4.616088e-02 3.172256e-04 5.960509e-08 8.650942e-04
## pm2.5 2.593355e-02 4.899864e-02 1.135754e-01 3.322869e-08 3.226733e-06
## nox   6.912413e-04 2.977162e-06 9.279952e-07 1.377794e-03 7.422410e-09
## nv10  2.234087e-06 2.585057e-01 4.752809e-04 3.312230e-08 8.708912e-04
## v10   3.896649e-03 5.211658e-01 4.469758e-04 1.533125e-09 8.677824e-04
## nv2.5 4.914408e-03 3.889145e-01 1.168752e-01 8.147800e-09 3.022284e-06
## v2.5  6.166992e-02 6.968848e-01 1.182020e-01 1.594961e-08 2.953440e-06
## ws    5.771907e+00 2.046556e-02 9.392443e-06 3.859364e-09 5.282393e-09
corrplot(res.var$cos2, is.corr=FALSE)

fviz_cos2(pca_prcomp, choice = "var", axes = 1:2)

fviz_pca_var(pca_prcomp, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE
)

fviz_pca_ind(pca_prcomp,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
)
## Warning: ggrepel: 43534 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Conclusion

In this article, I used PCA method to reduce dimension. I created first principal variable of this data set and plotted some graphs.