In this article I will run Principal Component Analysis (PCA). This method is used to reduce dimension. Dimension reduction is the process of reducing the number of variables to a set of values of variables called principal variables.
I used the data from Kaggle (https://www.kaggle.com/airqualityanthony/leamington-aurn-air-quality-data) which contains information about air quality in Leamington Spa Rugby Road AURN Station for 2009-2019.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(corrplot)
## corrplot 0.84 loaded
lear <- read.csv("LEAR.csv",sep=',')
head(lear)
## X date code site no no2 pm10 pm2.5 nox
## 1 1 2012-01-01 00:00:00 LEAR Leamington Spa Rugby Road NA NA NA NA NA
## 2 2 2012-01-01 01:00:00 LEAR Leamington Spa Rugby Road NA NA NA NA NA
## 3 3 2012-01-01 02:00:00 LEAR Leamington Spa Rugby Road NA NA NA NA NA
## 4 4 2012-01-01 03:00:00 LEAR Leamington Spa Rugby Road NA NA NA NA NA
## 5 5 2012-01-01 04:00:00 LEAR Leamington Spa Rugby Road NA NA NA NA NA
## 6 6 2012-01-01 05:00:00 LEAR Leamington Spa Rugby Road NA NA NA NA NA
## nv10 v10 nv2.5 v2.5 ws wd latitude longitude site.type
## 1 NA NA NA NA NA NA 52.29488 -1.542911 Urban Traffic
## 2 NA NA NA NA NA NA 52.29488 -1.542911 Urban Traffic
## 3 NA NA NA NA NA NA 52.29488 -1.542911 Urban Traffic
## 4 NA NA NA NA NA NA 52.29488 -1.542911 Urban Traffic
## 5 NA NA NA NA NA NA 52.29488 -1.542911 Urban Traffic
## 6 NA NA NA NA NA NA 52.29488 -1.542911 Urban Traffic
Many rows contain NA so we have to omit that rows. Whatâs more, we can not run PCA on all columns so we select only data contains information about air condition. I created correlation matrix.
lear_clear <- na.omit(lear)
pca_lear <- lear_clear[, 5:14]
head(pca_lear)
## no no2 pm10 pm2.5 nox nv10 v10 nv2.5 v2.5 ws
## 3659 5 21 7 1 29 8 -1 1 0 0.5
## 3660 2 17 8 5 21 7 1 6 -1 0.5
## 3661 2 15 10 8 19 10 0 7 1 0.6
## 3662 4 21 8 9 27 8 0 7 2 0.7
## 3663 5 27 12 9 34 10 2 8 1 0.3
## 3664 6 29 16 10 38 12 4 8 2 0.6
cor(pca_lear)
## no no2 pm10 pm2.5 nox nv10
## no 1.0000000 0.6272406 0.3917034 0.4139024 0.9540851 0.4165769
## no2 0.6272406 1.0000000 0.4943874 0.5308617 0.8317067 0.5080110
## pm10 0.3917034 0.4943874 1.0000000 0.9097937 0.4692051 0.9738191
## pm2.5 0.4139024 0.5308617 0.9097937 1.0000000 0.4991542 0.8778759
## nox 0.9540851 0.8317067 0.4692051 0.4991542 1.0000000 0.4922268
## nv10 0.4165769 0.5080110 0.9738191 0.8778759 0.4922268 1.0000000
## v10 0.1765160 0.2785238 0.7266120 0.6879937 0.2326930 0.5516335
## nv2.5 0.4514532 0.5567832 0.9073420 0.9673263 0.5359112 0.9129132
## v2.5 0.2000605 0.3185216 0.6626572 0.7840825 0.2649854 0.5443361
## ws -0.2206839 -0.3513879 -0.2613853 -0.2926741 -0.2923481 -0.2453564
## v10 nv2.5 v2.5 ws
## no 0.1765160 0.4514532 0.2000605 -0.2206839
## no2 0.2785238 0.5567832 0.3185216 -0.3513879
## pm10 0.7266120 0.9073420 0.6626572 -0.2613853
## pm2.5 0.6879937 0.9673263 0.7840825 -0.2926741
## nox 0.2326930 0.5359112 0.2649854 -0.2923481
## nv10 0.5516335 0.9129132 0.5443361 -0.2453564
## v10 1.0000000 0.5711187 0.7863016 -0.2201102
## nv2.5 0.5711187 1.0000000 0.6241799 -0.2979088
## v2.5 0.7863016 0.6241799 1.0000000 -0.1904726
## ws -0.2201102 -0.2979088 -0.1904726 1.0000000
I used two different function to run PCA.
Firstly, I used princomp() function. I print some basic statistics of all components.
Below we can see scree plot. Note where the âkinkâ in the graph is. This turning point in the level of explained variance indicates the number of principal components present in the data. In this case, there is just one, given the precipitous fall-off in the explained variance when moving from âoneâ to âtwoâ.
fit <- princomp(na.omit(pca_lear), cor = TRUE)
summary(fit)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.4366664 1.3387061 0.93759635 0.81156490 0.59637316
## Proportion of Variance 0.5937343 0.1792134 0.08790869 0.06586376 0.03556609
## Cumulative Proportion 0.5937343 0.7729477 0.86085640 0.92672016 0.96228626
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.51676216 0.3219922 0.0797937205 6.242902e-03
## Proportion of Variance 0.02670431 0.0103679 0.0006367038 3.897383e-06
## Cumulative Proportion 0.98899057 0.9993585 0.9999951742 9.999991e-01
## Comp.10
## Standard deviation 3.046928e-03
## Proportion of Variance 9.283768e-07
## Cumulative Proportion 1.000000e+00
loadings(fit)
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## no 0.250 0.511 0.180 0.206 0.537 0.554
## no2 0.288 0.386 -0.798 -0.180 0.299
## pm10 0.381 -0.184 -0.228 -0.311 0.294 -0.754
## pm2.5 0.388 -0.170 0.345 -0.336 -0.758
## nox 0.289 0.513 0.108 0.179 -0.777
## nv10 0.367 -0.110 0.139 -0.445 -0.176 0.459 0.622
## v10 0.288 -0.346 0.506 0.127 -0.607 -0.323 0.206
## nv2.5 0.381 -0.322 0.254 -0.542 0.614
## v2.5 0.297 -0.326 0.529 -0.103 0.529 0.434 0.210
## ws -0.157 -0.136 0.952 0.140 -0.163
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
## Proportion Var 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
## Cumulative Var 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
## Comp.10
## SS loadings 1.0
## Proportion Var 0.1
## Cumulative Var 1.0
plot(fit,type="lines")
Now we can add first principal component to our dataframe.
lear_clear$pc_scores <- fit$scores
pca_final <- lear_clear
head(pca_final)
## X date code site no no2 pm10 pm2.5
## 3659 3659 2012-06-01 10:00:00 LEAR Leamington Spa Rugby Road 5 21 7 1
## 3660 3660 2012-06-01 11:00:00 LEAR Leamington Spa Rugby Road 2 17 8 5
## 3661 3661 2012-06-01 12:00:00 LEAR Leamington Spa Rugby Road 2 15 10 8
## 3662 3662 2012-06-01 13:00:00 LEAR Leamington Spa Rugby Road 4 21 8 9
## 3663 3663 2012-06-01 14:00:00 LEAR Leamington Spa Rugby Road 5 27 12 9
## 3664 3664 2012-06-01 15:00:00 LEAR Leamington Spa Rugby Road 6 29 16 10
## nox nv10 v10 nv2.5 v2.5 ws wd latitude longitude site.type
## 3659 29 8 -1 1 0 0.5 227.2 52.29488 -1.542911 Urban Traffic
## 3660 21 7 1 6 -1 0.5 270.7 52.29488 -1.542911 Urban Traffic
## 3661 19 10 0 7 1 0.6 156.6 52.29488 -1.542911 Urban Traffic
## 3662 27 8 0 7 2 0.7 40.5 52.29488 -1.542911 Urban Traffic
## 3663 34 10 2 8 1 0.3 194.8 52.29488 -1.542911 Urban Traffic
## 3664 38 12 4 8 2 0.6 325.3 52.29488 -1.542911 Urban Traffic
## pc_scores.Comp.1 pc_scores.Comp.2 pc_scores.Comp.3 pc_scores.Comp.4
## 3659 -1.8104453917 1.4989181483 -1.6961337282 -0.8672604047
## 3660 -1.4765061459 0.9596157888 -1.7224792312 -1.0088537604
## 3661 -1.0660294882 0.6437343303 -1.5937957825 -1.1063039985
## 3662 -0.8953412705 0.8699294768 -1.5966107913 -0.7011248232
## 3663 -0.3276094423 0.9178937337 -1.7121486502 -0.7095390082
## 3664 0.2935270266 0.5706964095 -1.6034284404 -0.3271261175
## pc_scores.Comp.5 pc_scores.Comp.6 pc_scores.Comp.7 pc_scores.Comp.8
## 3659 0.0219975459 0.1008595087 0.5079448862 -0.0356497521
## 3660 0.2202546082 -0.1547459042 -0.3880669283 0.0010885520
## 3661 0.2266905595 0.4517399051 0.0448153581 -0.0215987707
## 3662 -0.0830744516 0.6979658113 0.0055132386 -0.0257780041
## 3663 -0.1486088821 -0.0768479420 -0.2012226367 -0.0091106575
## 3664 -0.1324907969 -0.4593476683 -0.0981863959 -0.0099733820
## pc_scores.Comp.9 pc_scores.Comp.10
## 3659 -0.0001379280 -0.0059575700
## 3660 -0.0009588762 -0.0172371901
## 3661 -0.0007120346 -0.0171216587
## 3662 -0.0012312427 0.0030960122
## 3663 -0.0006487376 0.0128908971
## 3664 -0.0001233950 0.0038479419
Second method which I will use is prcomp(). This method let us display some more graphics.
Firstly, I displayed eigenvalues with a scree plot.
pca_prcomp <- prcomp(na.omit(pca_lear, scale = TRUE))
fviz_eig(pca_prcomp, addlabels = TRUE, ylim = c(0, 85))
We can use get_eigenvalue() to display eigen values in data frame.
eig.val <- get_eigenvalue(pca_prcomp)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.300343e+03 8.573815e+01 85.73815
## Dim.2 2.484750e+02 9.261138e+00 94.99929
## Dim.3 1.005530e+02 3.747803e+00 98.74709
## Dim.4 1.600704e+01 5.966128e-01 99.34371
## Dim.5 9.386185e+00 3.498410e-01 99.69355
## Dim.6 5.882429e+00 2.192493e-01 99.91280
## Dim.7 1.981130e+00 7.384050e-02 99.98664
## Dim.8 3.499058e-01 1.304165e-02 99.99968
## Dim.9 5.994503e-03 2.234265e-04 99.99990
## Dim.10 2.613004e-03 9.739164e-05 100.00000
Below we can see some grpahs. Firstly, we can see Quality of Representation (cos2). I plotted also total quality of representation (cos2) of variables on Dim.1 and Dim.2 and Individuals PCA.
res.var <- get_pca_var(pca_prcomp)
res.var$coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## no 18.0939198 3.5416069 -5.1739458 0.07123170 -0.01305072 0.045431325
## no2 13.3372532 -2.4589842 8.1829123 -0.12443968 -0.03758566 -0.091503770
## pm10 5.6874760 -8.7339462 -1.6560938 -1.44263754 -1.24767320 0.054562315
## pm2.5 5.4290320 -7.6009563 -1.0947332 2.42656578 0.43250791 -0.161038969
## nox 41.0773326 2.9762278 0.2515318 -0.01116148 -0.04610243 -0.026291469
## nv10 4.8565646 -6.8724005 -1.3415683 -2.11685533 0.58679895 -0.001494686
## v10 0.8175968 -1.8557058 -0.3094104 0.69041180 -1.83313323 0.062423143
## nv2.5 4.4889820 -5.7095404 -0.8512580 0.88365689 1.59704385 0.070102837
## v2.5 0.9278190 -1.8762210 -0.2240979 1.47988037 -1.17444578 -0.248334284
## ws -0.7967879 0.4225929 -0.3853681 -0.29713698 0.06436571 -2.402479329
## Dim.7 Dim.8 Dim.9 Dim.10
## no 0.002618540 -0.0018168453 5.690908e-02 1.164360e-04
## no2 -0.004957402 -0.0001587606 3.712020e-02 8.577614e-05
## pm10 -0.214850829 -0.0178108270 -2.441415e-04 2.941248e-02
## pm2.5 0.221356373 0.3370095465 1.822874e-04 1.796311e-03
## nox 0.001725446 0.0009633251 -3.711865e-02 -8.615341e-05
## nv10 0.508434577 0.0218009380 1.819953e-04 -2.951087e-02
## v10 -0.721918131 0.0211418012 3.915514e-05 -2.945815e-02
## nv2.5 -0.623630090 -0.3418700983 -9.026517e-05 -1.738472e-03
## v2.5 0.834796285 -0.3438052239 -1.262918e-04 -1.718557e-03
## ws -0.143057907 0.0030647093 6.212378e-05 -7.268007e-05
res.var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## no 14.23222463 5.04798364 26.62248426 3.169827e-02 0.001814596
## no2 7.73286204 2.43348515 66.59177846 9.674015e-02 0.015050646
## pm10 1.40619849 30.69999099 2.72756238 1.300180e+01 16.584888996
## pm2.5 1.28130428 23.25164588 1.19184950 3.678520e+01 1.992961830
## nox 73.35199643 3.56491813 0.06292029 7.782746e-04 0.022644284
## nv10 1.02533505 19.00790059 1.78990671 2.799441e+01 3.668508486
## v10 0.02905934 1.38591143 0.09520824 2.977868e+00 35.801311645
## nv2.5 0.87599815 13.11956795 0.72065473 4.878163e+00 27.173436086
## v2.5 0.03742260 1.41672392 0.04994366 1.368177e+01 14.695244681
## ws 0.02759897 0.07187232 0.14769177 5.515723e-01 0.044138750
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## no 3.508764e-02 3.461029e-04 9.433760e-04 5.402690e+01 5.188415e-04
## no2 1.423381e-01 1.240496e-03 7.203345e-06 2.298622e+01 2.815743e-04
## pm10 5.060913e-02 2.330027e+00 9.066029e-02 9.943291e-04 3.310727e+01
## pm2.5 4.408646e-01 2.473267e+00 3.245886e+01 5.543194e-04 1.234875e-01
## nox 1.175095e-02 1.502759e-04 2.652129e-04 2.298429e+01 2.840566e-04
## nv10 3.797899e-05 1.304840e+01 1.358311e-01 5.525445e-04 3.332912e+01
## v10 6.624217e-02 2.630649e+01 1.277418e-01 2.557551e-05 3.321015e+01
## nv2.5 8.354385e-02 1.963094e+01 3.340190e+01 1.359212e-04 1.156632e-01
## v2.5 1.048375e+00 3.517612e+01 3.378111e+01 2.660706e-04 1.130285e-01
## ws 9.812115e+01 1.033025e+00 2.684278e-03 6.438173e-05 2.021579e-04
res.var$cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## no 327.3899327 12.5429795 26.76971558 0.0050739547 0.0001703213
## no2 177.8823233 6.0466033 66.96005347 0.0154852339 0.0014126816
## pm10 32.3473836 76.2818158 2.74264672 2.0812030791 1.5566884085
## pm2.5 29.4743889 57.7745371 1.19844083 5.8882215073 0.1870630898
## nox 1687.3472558 8.8579319 0.06326826 0.0001245787 0.0021254345
## nv10 23.5862194 47.2298892 1.79980549 4.4810764978 0.3443330032
## v10 0.6684645 3.4436440 0.09573478 0.4766684502 3.3603774412
## nv2.5 20.1509590 32.5988521 0.72464019 0.7808494942 2.5505490561
## v2.5 0.8608481 3.5202054 0.05021987 2.1900459042 1.3793228921
## ws 0.6348709 0.1785848 0.14850855 0.0882903871 0.0041429448
## Dim.6 Dim.7 Dim.8 Dim.9 Dim.10
## no 2.064005e-03 6.856750e-06 3.300927e-06 3.238644e-03 1.355735e-08
## no2 8.372940e-03 2.457584e-05 2.520492e-08 1.377909e-03 7.357547e-09
## pm10 2.977046e-03 4.616088e-02 3.172256e-04 5.960509e-08 8.650942e-04
## pm2.5 2.593355e-02 4.899864e-02 1.135754e-01 3.322869e-08 3.226733e-06
## nox 6.912413e-04 2.977162e-06 9.279952e-07 1.377794e-03 7.422410e-09
## nv10 2.234087e-06 2.585057e-01 4.752809e-04 3.312230e-08 8.708912e-04
## v10 3.896649e-03 5.211658e-01 4.469758e-04 1.533125e-09 8.677824e-04
## nv2.5 4.914408e-03 3.889145e-01 1.168752e-01 8.147800e-09 3.022284e-06
## v2.5 6.166992e-02 6.968848e-01 1.182020e-01 1.594961e-08 2.953440e-06
## ws 5.771907e+00 2.046556e-02 9.392443e-06 3.859364e-09 5.282393e-09
corrplot(res.var$cos2, is.corr=FALSE)
fviz_cos2(pca_prcomp, choice = "var", axes = 1:2)
fviz_pca_var(pca_prcomp, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
fviz_pca_ind(pca_prcomp,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Warning: ggrepel: 43534 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
In this article, I used PCA method to reduce dimension. I created first principal variable of this data set and plotted some graphs.