This study aims to perform dimensionality reduction on a dataset containing physical properties of superconducting materials. Since the dataset includes 81 features, many of which are likely correlated, dimensionality reduction provides an effective way to simplify the dataset and serves as a strong starting point for further analysis. To achieve this, Principal Component Analysis (PCA) will be employed. PCA is a linear technique that transforms the original variables into a set of uncorrelated variables, known as principal components. Keeping only the first few of them yields a smaller set of variables that retains as much variance as possible from the original dataset.
The dataset used in this study is obtained from the UCI Machine Learning Repository and was originally introduced in the paper “A Data-Driven Statistical Model for Predicting the Critical Temperature of a Superconductor” by K. Hamidieh (2018):
https://archive.ics.uci.edu/dataset/464/superconductivty+data
The primary objective of the original study was to predict the superconducting critical temperature based on a set of extracted material features.
The file “train.csv” contains 81 input features in addition to the critical temperature as the target variable, resulting in a high-dimensional feature space that can be challenging to analyze. Therefore, the main goal of this study is to apply Principal Component Analysis to the input features only, excluding the critical temperature, in order to reduce their dimensionality and simplify the dataset while retaining its most informative components.
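The variables included in the dataset are based on fundamental atomic properties of the chemical elements that make up each superconductor compound. Judging by the column names, these properties are atomic mass, first ionization energy (fie), atomic radius, density, electron affinity, fusion heat, thermal conductivity, and valence.
For each of these eight properties, ten descriptive statistics are computed across the elements present in a compound: the mean, weighted mean, geometric mean, weighted geometric mean, entropy, weighted entropy, range, weighted range, standard deviation, and weighted standard deviation.
Together with the number of elements in the compound, this results in a total of 81 features (8 × 10 + 1). As can be expected, many of these features are highly correlated (for example, the mean, weighted mean, and geometric mean of atomic mass). This high level of correlation motivates the use of dimensionality reduction techniques, such as Principal Component Analysis.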
The goal of this study will be achieved using Principal Component Analysis.
PCA is a dimensionality reduction technique used to simplify datasets that contain a large number of variables. The main idea behind PCA is to transform the original features into a new set of orthogonal variables called principal components (PCs), which are linear combinations of the original features. The first principal component explains the largest possible amount of variance in the data, the second explains the next largest amount (while being orthogonal to the first), and so on. Mathematically, PCA is based on the eigenvalues and eigenvectors of the covariance matrix. Eigenvectors define the directions of maximum variance, while eigenvalues indicate how much variance is captured along each direction. By keeping only the first few principal components that retain most of the total variance, PCA reduces the dimensionality of the dataset while preserving its most important structure, making the data easier to analyze.
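The mechanics described above can be sketched on a small synthetic example. The data, variable names (`x1`, `x2`, `x3`, `toy_values`, `toy_vectors`, `toy_scores`) and sample size are illustrative assumptions, not part of the superconductor dataset; the sketch only shows that projecting standardized data onto the eigenvectors of its covariance matrix gives uncorrelated scores whose variances equal the eigenvalues.

```r
# Toy illustration of PCA via the covariance matrix
# (synthetic data, not the superconductor dataset)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # strongly correlated with x1
x3 <- rnorm(n)                       # independent noise
X  <- scale(cbind(x1, x2, x3))       # center and scale each column

# Eigendecomposition of the covariance matrix
e <- eigen(cov(X))
toy_values  <- e$values   # variance captured along each direction
toy_vectors <- e$vectors  # directions (one per column), mutually orthogonal

# Principal component scores: project the data onto the eigenvectors
toy_scores <- X %*% toy_vectors

# The scores are uncorrelated and their variances equal the eigenvalues
round(cov(toy_scores), 6)

# Proportion of total variance captured by each component
toy_values / sum(toy_values)
```

Because `x1` and `x2` carry nearly the same information, most of the variance should concentrate in the first component, which is the effect PCA exploits on the real data.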
Let’s start with loading the libraries.
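# Load the libraries
library(corrplot)    # correlation matrix plots
library(caret)       # preProcess() for centering and scaling
library(recipes)     # preprocessing pipelines
library(stats)       # cov(), eigen(), prcomp() (attached by default)
library(factoextra)  # PCA visualization helpers
library(gridExtra)   # arranging several plots on one page
library(ggfortify)   # autoplot() support for PCA objects
library(grid)        # low-level graphics layout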
Let’s import the dataset called “train.csv”. As mentioned before, the original dependent variable, critical temperature, which appears in the last column, will be removed.
data <- read.csv("train.csv")
dim(data)
## [1] 21263 82
data <- data[, -82]  # drop column 82, the critical temperature (target)
head(data)
## number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass
## 1 4 88.94447 57.86269 66.36159
## 2 5 92.72921 58.51842 73.13279
## 3 4 88.94447 57.88524 66.36159
## 4 4 88.94447 57.87397 66.36159
## 5 4 88.94447 57.84014 66.36159
## 6 4 88.94447 57.79504 66.36159
## wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass
## 1 36.11661 1.181795 1.0623955
## 2 36.39660 1.449309 1.0577551
## 3 36.12251 1.181795 0.9759805
## 4 36.11956 1.181795 1.0222909
## 5 36.11072 1.181795 1.1292237
## 6 36.09893 1.181795 1.2252028
## range_atomic_mass wtd_range_atomic_mass std_atomic_mass wtd_std_atomic_mass
## 1 122.9061 31.79492 51.96883 53.62253
## 2 122.9061 36.16194 47.09463 53.97987
## 3 122.9061 35.74110 51.96883 53.65627
## 4 122.9061 33.76801 51.96883 53.63940
## 5 122.9061 27.84874 51.96883 53.58877
## 6 122.9061 20.68746 51.96883 53.52115
## mean_fie wtd_mean_fie gmean_fie wtd_gmean_fie entropy_fie wtd_entropy_fie
## 1 775.425 1010.269 718.1529 938.0168 1.305967 0.7914878
## 2 766.440 1010.613 720.6055 938.7454 1.544145 0.8070782
## 3 775.425 1010.820 718.1529 939.0090 1.305967 0.7736202
## 4 775.425 1010.544 718.1529 938.5128 1.305967 0.7832067
## 5 775.425 1009.717 718.1529 937.0256 1.305967 0.8052296
## 6 775.425 1008.614 718.1529 935.0463 1.305967 0.8247426
## range_fie wtd_range_fie std_fie wtd_std_fie mean_atomic_radius
## 1 810.6 735.9857 323.8118 355.5630 160.25
## 2 810.6 743.1643 290.1830 354.9635 161.20
## 3 810.6 743.1643 323.8118 354.8042 160.25
## 4 810.6 739.5750 323.8118 355.1839 160.25
## 5 810.6 728.8071 323.8118 356.3193 160.25
## 6 810.6 714.4500 323.8118 357.8246 160.25
## wtd_mean_atomic_radius gmean_atomic_radius wtd_gmean_atomic_radius
## 1 105.5143 136.1260 84.52842
## 2 104.9714 141.4652 84.37017
## 3 104.6857 136.1260 84.21457
## 4 105.1000 136.1260 84.37135
## 5 106.3429 136.1260 84.84344
## 6 108.0000 136.1260 85.47701
## entropy_atomic_radius wtd_entropy_atomic_radius range_atomic_radius
## 1 1.259244 1.207040 205
## 2 1.508328 1.204115 205
## 3 1.259244 1.132547 205
## 4 1.259244 1.173033 205
## 5 1.259244 1.261194 205
## 6 1.259244 1.331339 205
## wtd_range_atomic_radius std_atomic_radius wtd_std_atomic_radius mean_Density
## 1 42.91429 75.23754 69.23557 4654.357
## 2 50.57143 67.32132 68.00882 5821.486
## 3 49.31429 75.23754 67.79771 4654.357
## 4 46.11429 75.23754 68.52166 4654.357
## 5 36.51429 75.23754 70.63445 4654.357
## 6 23.71429 75.23754 73.32413 4654.357
## wtd_mean_Density gmean_Density wtd_gmean_Density entropy_Density
## 1 2961.502 724.9532 53.54381 1.033129
## 2 3021.017 1237.0951 54.09572 1.314442
## 3 2999.159 724.9532 53.97402 1.033129
## 4 2980.331 724.9532 53.75849 1.033129
## 5 2923.845 724.9532 53.11703 1.033129
## 6 2848.531 724.9532 52.27364 1.033129
## wtd_entropy_Density range_Density wtd_range_Density std_Density
## 1 0.8145982 8958.571 1579.583 3306.163
## 2 0.9148022 10488.571 1667.383 3767.403
## 3 0.7603052 8958.571 1667.383 3306.163
## 4 0.7888885 8958.571 1623.483 3306.163
## 5 0.8598109 8958.571 1491.783 3306.163
## 6 0.9323687 8958.571 1316.183 3306.163
## wtd_std_Density mean_ElectronAffinity wtd_mean_ElectronAffinity
## 1 3572.597 81.8375 111.7271
## 2 3632.649 90.8900 112.3164
## 3 3592.019 81.8375 112.2136
## 4 3582.371 81.8375 111.9704
## 5 3552.669 81.8375 111.2407
## 6 3511.262 81.8375 110.2679
## gmean_ElectronAffinity wtd_gmean_ElectronAffinity entropy_ElectronAffinity
## 1 60.12318 99.41468 1.159687
## 2 69.83331 101.16640 1.427997
## 3 60.12318 101.08215 1.159687
## 4 60.12318 100.24495 1.159687
## 5 60.12318 97.77472 1.159687
## 6 60.12318 94.57550 1.159687
## wtd_entropy_ElectronAffinity range_ElectronAffinity
## 1 0.7873817 127.05
## 2 0.8386665 127.05
## 3 0.7860067 127.05
## 4 0.7869005 127.05
## 5 0.7873962 127.05
## 6 0.7844615 127.05
## wtd_range_ElectronAffinity std_ElectronAffinity wtd_std_ElectronAffinity
## 1 80.98714 51.43371 42.55840
## 2 81.20786 49.43817 41.66762
## 3 81.20786 51.43371 41.63988
## 4 81.09750 51.43371 42.10234
## 5 80.76643 51.43371 43.45206
## 6 80.32500 51.43371 45.17068
## mean_FusionHeat wtd_mean_FusionHeat gmean_FusionHeat wtd_gmean_FusionHeat
## 1 6.9055 3.846857 3.479475 1.040986
## 2 7.7844 3.796857 4.403790 1.035251
## 3 6.9055 3.822571 3.479475 1.037439
## 4 6.9055 3.834714 3.479475 1.039211
## 5 6.9055 3.871143 3.479475 1.044545
## 6 6.9055 3.919714 3.479475 1.051699
## entropy_FusionHeat wtd_entropy_FusionHeat range_FusionHeat
## 1 1.088575 0.9949982 12.878
## 2 1.374977 1.0730938 12.878
## 3 1.088575 0.9274794 12.878
## 4 1.088575 0.9640310 12.878
## 5 1.088575 1.0449695 12.878
## 6 1.088575 1.1118503 12.878
## wtd_range_FusionHeat std_FusionHeat wtd_std_FusionHeat
## 1 1.744571 4.599064 4.666920
## 2 1.595714 4.473363 4.603000
## 3 1.757143 4.599064 4.649635
## 4 1.744571 4.599064 4.658301
## 5 1.744571 4.599064 4.684014
## 6 1.744571 4.599064 4.717642
## mean_ThermalConductivity wtd_mean_ThermalConductivity
## 1 107.7566 61.01519
## 2 172.2053 61.37233
## 3 107.7566 60.94376
## 4 107.7566 60.97947
## 5 107.7566 61.08662
## 6 107.7566 61.22947
## gmean_ThermalConductivity wtd_gmean_ThermalConductivity
## 1 7.062488 0.6219795
## 2 16.064228 0.6197346
## 3 7.062488 0.6190947
## 4 7.062488 0.6205354
## 5 7.062488 0.6248777
## 6 7.062488 0.6307148
## entropy_ThermalConductivity wtd_entropy_ThermalConductivity
## 1 0.3081480 0.2628483
## 2 0.8474042 0.5677061
## 3 0.3081480 0.2504774
## 4 0.3081480 0.2570451
## 5 0.3081480 0.2728199
## 6 0.3081480 0.2882356
## range_ThermalConductivity wtd_range_ThermalConductivity
## 1 399.9734 57.12767
## 2 429.9734 51.41338
## 3 399.9734 57.12767
## 4 399.9734 57.12767
## 5 399.9734 57.12767
## 6 399.9734 57.12767
## std_ThermalConductivity wtd_std_ThermalConductivity mean_Valence
## 1 168.8542 138.5172 2.25
## 2 198.5546 139.6309 2.00
## 3 168.8542 138.5406 2.25
## 4 168.8542 138.5289 2.25
## 5 168.8542 138.4937 2.25
## 6 168.8542 138.4466 2.25
## wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence
## 1 2.257143 2.213364 2.219783 1.368922
## 2 2.257143 1.888175 2.210679 1.557113
## 3 2.271429 2.213364 2.232679 1.368922
## 4 2.264286 2.213364 2.226222 1.368922
## 5 2.242857 2.213364 2.206963 1.368922
## 6 2.214286 2.213364 2.181543 1.368922
## wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence
## 1 1.066221 1 1.085714 0.4330127
## 2 1.047221 2 1.128571 0.6324555
## 3 1.029175 1 1.114286 0.4330127
## 4 1.048834 1 1.100000 0.4330127
## 5 1.096052 1 1.057143 0.4330127
## 6 1.141474 1 1.000000 0.4330127
## wtd_std_Valence
## 1 0.4370588
## 2 0.4686063
## 3 0.4446966
## 4 0.4409521
## 5 0.4288095
## 6 0.4103259
Next, a color-coded correlation matrix is plotted to see how strongly the variables are correlated.
data_corr <- cor(data, method="pearson")
corrplot(data_corr, order ="alphabet", tl.cex=0.6)
The correlation plot above shows that dimensionality reduction makes sense, given the large number of highly correlated variable pairs.
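A visual impression of this kind can be backed by a simple count of strongly correlated pairs. The following is a minimal sketch on synthetic data; the data frame `toy`, its columns `f1` to `f4`, and the cut-off of 0.9 are assumptions made for illustration only.

```r
# Synthetic stand-in for the feature table (hypothetical columns f1..f4)
set.seed(42)
base <- rnorm(500)
toy <- data.frame(
  f1 = base,
  f2 = base + rnorm(500, sd = 0.1),   # near-duplicate of f1
  f3 = rnorm(500),                    # unrelated to the others
  f4 = -base + rnorm(500, sd = 0.1)   # strongly negatively correlated with f1
)

toy_corr <- cor(toy, method = "pearson")

# Use only the upper triangle so that each pair is counted once
upper     <- upper.tri(toy_corr)
threshold <- 0.9  # arbitrary cut-off chosen for illustration
n_high    <- sum(abs(toy_corr[upper]) > threshold)
n_pairs   <- sum(upper)

cat(n_high, "of", n_pairs, "pairs have |r| >", threshold, "\n")
```

Applied to the real correlation matrix, the same pattern would use `data_corr` in place of `toy_corr`. The `findCorrelation()` function from caret offers a related, ready-made way to flag such features.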
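PCA is driven by variance, so features measured on large scales (density values here are in the thousands, while valence values are around 2) would dominate the components if left untouched. For this reason, the data are standardized before applying PCA.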
preproc1 <- preProcess(data, method=c("center", "scale"))
data.s <- predict(preproc1, data)
summary(data.s)
## number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass
## Min. :-2.16441 Min. :-2.71651 Min. :-1.9876 Min. :-2.1260
## 1st Qu.:-0.77484 1st Qu.:-0.50881 1st Qu.:-0.6224 1st Qu.:-0.4270
## Median :-0.08006 Median :-0.08879 Median :-0.3670 Median :-0.1588
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.61473 3rd Qu.: 0.43289 3rd Qu.: 0.3916 3rd Qu.: 0.2200
## Max. : 3.39387 Max. : 4.09155 Max. : 4.0606 Max. : 4.4373
## wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass
## Min. :-1.5437 Min. :-3.19406 Min. :-2.6503
## 1st Qu.:-0.6355 1st Qu.:-0.54512 1st Qu.:-0.7187
## Median :-0.5081 Median : 0.09298 Median : 0.2065
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.3976 3rd Qu.: 0.76434 3rd Qu.: 0.7362
## Max. : 4.1047 Max. : 2.24204 Max. : 2.2279
## range_atomic_mass wtd_range_atomic_mass std_atomic_mass wtd_std_atomic_mass
## Min. :-2.1162 Min. :-1.2320 Min. :-2.21567 Min. :-2.0741
## 1st Qu.:-0.6789 1st Qu.:-0.6082 1st Qu.:-0.57406 1st Qu.:-0.6460
## Median : 0.1337 Median :-0.2443 Median : 0.03652 Median : 0.1420
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.7051 3rd Qu.: 0.1903 3rd Qu.: 0.74523 3rd Qu.: 0.6096
## Max. : 1.6909 Max. : 6.3915 Max. : 2.82638 Max. : 2.9810
## mean_fie wtd_mean_fie gmean_fie wtd_gmean_fie
## Min. :-4.50475 Min. :-3.4544 Min. :-4.6213 Min. :-3.8178
## 1st Qu.:-0.52435 1st Qu.:-0.9178 1st Qu.:-0.5737 1st Qu.:-0.9406
## Median :-0.05389 Median : 0.1363 Median :-0.1215 Median : 0.1956
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.30524 3rd Qu.: 0.9330 3rd Qu.: 0.3605 3rd Qu.: 0.8750
## Max. : 6.21206 Max. : 3.3333 Max. : 7.3490 Max. : 4.1314
## entropy_fie wtd_entropy_fie range_fie wtd_range_fie
## Min. :-3.4016 Min. :-2.77448 Min. :-1.8482 Min. :-2.1581
## 1st Qu.:-0.5585 1st Qu.:-0.51784 1st Qu.:-1.0007 1st Qu.:-0.8589
## Median : 0.1494 Median :-0.02959 Median : 0.6197 Median : 0.1202
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6597 3rd Qu.: 0.40424 3rd Qu.: 0.7699 3rd Qu.: 0.9248
## Max. : 2.2480 Max. : 3.32867 Max. : 2.3651 Max. : 3.4294
## std_fie wtd_std_fie mean_atomic_radius wtd_mean_atomic_radius
## Min. :-1.9609 Min. :-1.7514 Min. :-5.4590 Min. :-3.0109
## 1st Qu.:-0.9230 1st Qu.:-1.0245 1st Qu.:-0.4293 1st Qu.:-0.7844
## Median : 0.4614 Median : 0.2689 Median : 0.1125 Median :-0.3038
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7465 3rd Qu.: 0.9271 3rd Qu.: 0.5894 3rd Qu.: 0.8175
## Max. : 2.5830 Max. : 1.9942 Max. : 6.9497 Max. : 5.6691
## gmean_atomic_radius wtd_gmean_atomic_radius entropy_atomic_radius
## Min. :-4.36598 Min. :-2.0367 Min. :-3.3770
## 1st Qu.:-0.49370 1st Qu.:-0.8868 1st Qu.:-0.5364
## Median :-0.07429 Median :-0.2179 Median : 0.1678
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.52010 3rd Qu.: 0.8371 3rd Qu.: 0.6515
## Max. : 6.95086 Max. : 4.9392 Max. : 2.3287
## wtd_entropy_atomic_radius range_atomic_radius wtd_range_atomic_radius
## Min. :-2.7781 Min. :-2.0711 Min. :-1.4669
## 1st Qu.:-0.6851 1st Qu.:-0.8819 1st Qu.:-0.6503
## Median : 0.2744 Median : 0.4708 Median :-0.2390
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7234 3rd Qu.: 0.9763 3rd Qu.: 0.2528
## Max. : 1.8976 Max. : 1.7344 Max. : 5.3911
## std_atomic_radius wtd_std_atomic_radius mean_Density wtd_mean_Density
## Min. :-2.2535 Min. :-2.0692 Min. :-2.1463 Min. :-1.6347
## 1st Qu.:-0.7201 1st Qu.:-0.8035 1st Qu.:-0.5613 1st Qu.:-0.7041
## Median : 0.3084 Median : 0.3002 Median :-0.2748 Median :-0.2992
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7784 3rd Qu.: 0.8475 3rd Qu.: 0.2166 3rd Qu.: 0.3567
## Max. : 2.7905 Max. : 1.7711 Max. : 5.7885 Max. : 5.3776
## gmean_Density wtd_gmean_Density entropy_Density wtd_entropy_Density
## Min. :-0.9341 Min. :-0.7840 Min. :-3.13248 Min. :-2.67712
## 1st Qu.:-0.6960 1st Qu.:-0.7674 1st Qu.:-0.46287 1st Qu.:-0.52334
## Median :-0.5727 Median :-0.4030 Median : 0.05312 Median : 0.08353
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.6303 3rd Qu.: 0.6663 3rd Qu.: 0.73463 3rd Qu.: 0.70335
## Max. : 5.1655 Max. : 4.8987 Max. : 2.57589 Max. : 2.65006
## range_Density wtd_range_Density std_Density wtd_std_Density
## Min. :-2.11500 Min. :-1.2102 Min. :-2.04162 Min. :-2.0593
## 1st Qu.:-0.49240 1st Qu.:-0.5195 1st Qu.:-0.35696 1st Qu.:-0.4683
## Median : 0.07155 Median :-0.3418 Median :-0.06873 Median : 0.1901
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.27169 3rd Qu.: 0.2111 3rd Qu.: 0.35095 3rd Qu.: 0.3971
## Max. : 3.39827 Max. : 8.1433 Max. : 4.36625 Max. : 4.3999
## mean_ElectronAffinity wtd_mean_ElectronAffinity gmean_ElectronAffinity
## Min. :-2.7211 Min. :-2.8261 Min. :-1.82227
## 1st Qu.:-0.5339 1st Qu.:-0.6001 1st Qu.:-0.71220
## Median :-0.1364 Median : 0.3141 Median :-0.09961
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3113 3rd Qu.: 0.5583 3rd Qu.: 0.45321
## Max. : 8.9965 Max. : 7.2308 Max. : 9.36796
## wtd_gmean_ElectronAffinity entropy_ElectronAffinity
## Min. :-2.24075 Min. :-3.1167
## 1st Qu.:-0.68389 1st Qu.:-0.5232
## Median : 0.02394 Median : 0.1981
## Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.55483 3rd Qu.: 0.8027
## Max. : 8.01568 Max. : 2.0312
## wtd_entropy_ElectronAffinity range_ElectronAffinity wtd_range_ElectronAffinity
## Min. :-2.69508 Min. :-2.0567 Min. :-2.0731
## 1st Qu.:-0.38497 1st Qu.:-0.5797 1st Qu.:-0.8839
## Median : 0.03653 Median : 0.1077 Median : 0.4131
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.37339 3rd Qu.: 0.3049 3rd Qu.: 0.6071
## Max. : 3.16324 Max. : 3.8887 Max. : 5.5682
## std_ElectronAffinity wtd_std_ElectronAffinity mean_FusionHeat
## Min. :-2.2498 Min. :-2.1738 Min. :-1.2455
## 1st Qu.:-0.4848 1st Qu.:-0.5369 1st Qu.:-0.5936
## Median : 0.1018 Median : 0.1772 Median :-0.4417
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3362 3rd Qu.: 0.4362 3rd Qu.: 0.2494
## Max. : 5.2429 Max. : 6.1023 Max. : 8.0268
## wtd_mean_FusionHeat gmean_FusionHeat wtd_gmean_FusionHeat entropy_FusionHeat
## Min. :-0.9542 Min. :-0.9850 Min. :-0.7552 Min. :-2.90835
## 1st Qu.:-0.6173 1st Qu.:-0.5988 1st Qu.:-0.6715 1st Qu.:-0.69164
## Median :-0.3864 Median :-0.4852 Median :-0.3968 Median : 0.04989
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3268 3rd Qu.: 0.3440 3rd Qu.: 0.4787 3rd Qu.: 0.75750
## Max. : 6.3835 Max. : 9.4242 Max. : 7.2224 Max. : 2.50329
## wtd_entropy_FusionHeat range_FusionHeat wtd_range_FusionHeat
## Min. :-2.4696 Min. :-1.0377 Min. :-0.7200
## 1st Qu.:-0.6520 1st Qu.:-0.4055 1st Qu.:-0.5160
## Median : 0.2187 Median :-0.4055 Median :-0.4190
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6574 3rd Qu.: 0.1012 3rd Qu.: 0.1998
## Max. : 2.2509 Max. : 4.1059 Max. : 8.2754
## std_FusionHeat wtd_std_FusionHeat mean_ThermalConductivity
## Min. :-0.95983 Min. :-1.05891 Min. :-2.3283
## 1st Qu.:-0.46842 1st Qu.:-0.42728 1st Qu.:-0.7453
## Median :-0.38922 Median :-0.30418 Median : 0.1765
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.08279 3rd Qu.: 0.04116 3rd Qu.: 0.5530
## Max. : 4.99463 Max. : 6.03203 Max. : 6.3035
## wtd_mean_ThermalConductivity gmean_ThermalConductivity
## Min. :-1.7909 Min. :-0.8754
## 1st Qu.:-0.6012 1st Qu.:-0.6313
## Median :-0.1805 Median :-0.4567
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3848 3rd Qu.: 0.3679
## Max. : 7.1489 Max. : 8.4570
## wtd_gmean_ThermalConductivity entropy_ThermalConductivity
## Min. :-0.6789 Min. :-2.23216
## 1st Qu.:-0.6524 1st Qu.:-0.82773
## Median :-0.5278 Median : 0.03394
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4976 3rd Qu.: 0.71965
## Max. : 8.6767 Max. : 2.78041
## wtd_entropy_ThermalConductivity range_ThermalConductivity
## Min. :-1.6968 Min. :-1.5809
## 1st Qu.:-0.9091 1st Qu.:-1.0366
## Median : 0.0182 Median : 0.9382
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7458 3rd Qu.: 0.9394
## Max. : 3.3716 Max. : 1.1284
## wtd_range_ThermalConductivity std_ThermalConductivity
## Min. :-1.4385 Min. :-1.6451
## 1st Qu.:-0.7579 1st Qu.:-1.0144
## Median :-0.1270 Median : 0.6122
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6919 3rd Qu.: 0.9122
## Max. : 7.8706 Max. : 1.9294
## wtd_std_ThermalConductivity mean_Valence wtd_mean_Valence
## Min. :-1.5105 Min. :-2.1044 Min. :-1.8075
## 1st Qu.:-1.0084 1st Qu.:-0.8280 1st Qu.:-0.8700
## Median : 0.2719 Median :-0.3493 Median :-0.4491
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.0434 3rd Qu.: 0.7675 3rd Qu.: 0.7329
## Max. : 1.8375 Max. : 3.6394 Max. : 3.2293
## gmean_Valence wtd_gmean_Valence entropy_Valence wtd_entropy_Valence
## Min. :-1.9656 Min. :-1.7500 Min. :-3.2956 Min. :-2.7685
## 1st Qu.:-0.7425 1st Qu.:-0.8211 1st Qu.:-0.5973 1st Qu.:-0.7288
## Median :-0.4217 Median :-0.5293 Median : 0.1863 Median : 0.2990
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6417 3rd Qu.: 0.7312 3rd Qu.: 0.7461 3rd Qu.: 0.7309
## Max. : 3.7691 Max. : 3.3572 Max. : 2.1525 Max. : 2.3585
## range_Valence wtd_range_Valence std_Valence wtd_std_Valence
## Min. :-1.64287 Min. :-1.5161 Min. :-1.73176 Min. :-1.4794
## 1st Qu.:-0.83794 1st Qu.:-0.5741 1st Qu.:-0.79969 1st Qu.:-0.8058
## Median :-0.03301 Median :-0.4293 Median :-0.08117 Median :-0.3819
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.77192 3rd Qu.: 0.4451 3rd Qu.: 0.74412 3rd Qu.: 0.7605
## Max. : 3.18671 Max. : 5.6321 Max. : 4.45794 Max. : 5.1056
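As the summary confirms, every feature now has a mean of 0. Each feature also has a standard deviation of 1, which summary() does not display.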
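Since `summary()` does not report standard deviations, the effect of standardization can be checked directly. The sketch below uses base R's `scale()` on synthetic data as a stand-in for the caret preprocessing step; the data frame `toy` and its two columns are hypothetical and only mimic a large-scale and a small-scale feature.

```r
# Synthetic features on very different scales (illustrative values only)
set.seed(7)
toy <- data.frame(
  density = rnorm(100, mean = 5000, sd = 2000),  # large-scale feature
  valence = rnorm(100, mean = 2, sd = 0.5)       # small-scale feature
)

toy_s <- scale(toy)  # center each column to mean 0, scale to sd 1

col_means <- colMeans(toy_s)
col_sds   <- apply(toy_s, 2, sd)

round(col_means, 10)
round(col_sds, 10)

# Without scaling, the large-scale feature dominates the first component
unscaled_pc1 <- prcomp(toy, center = TRUE, scale. = FALSE)$rotation[, 1]
unscaled_pc1
```

On the real data, the analogous check would be `apply(data.s, 2, sd)`, which should return 1 for every column.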
First, the eigenvectors and eigenvalues will be presented for illustrative purposes.
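Eigenvectors indicate the directions along which the data varies. The first eigenvector points in the direction of maximum variance, the second in the direction of maximum remaining variance orthogonal to the first, and so on. These eigenvectors define the directions of the principal components. In the output below, each column is one eigenvector and each row corresponds to one original feature; head() shows only the first six rows.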
data.cov <- cov(data.s)
data.eigen <- eigen(data.cov)
head(data.eigen$vectors)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.15577833 -0.09185596 -0.075197122 0.05764604 -0.0115456693 0.06250960
## [2,] -0.05181305 -0.22647459 -0.007334317 -0.18303403 -0.0350782024 0.05190857
## [3,] -0.09916962 -0.19990777 -0.055080664 -0.14724811 0.0285920161 -0.06732956
## [4,] -0.08332166 -0.21894420 0.025325496 -0.15064367 0.0040613110 0.08519657
## [5,] -0.12006367 -0.18477968 -0.028652254 -0.10861815 0.0440689581 -0.03733811
## [6,] 0.14641854 -0.12255116 -0.059729607 0.06044047 0.0009406503 0.12905797
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] -0.05749010 -0.020053317 0.05300080 0.07068765 -0.007686226 -0.01966041
## [2,] 0.07510766 0.129593196 0.06171558 -0.10585155 0.091201103 -0.27748970
## [3,] -0.02037932 0.005283779 0.12665782 -0.07590331 0.031584554 -0.27831251
## [4,] -0.02370737 0.122484110 0.04140090 -0.06334025 0.134187985 -0.24640830
## [5,] -0.08878995 -0.002823301 0.10087474 -0.05350412 0.074966395 -0.24007966
## [6,] -0.12944835 -0.035694539 0.04743043 0.07473660 0.013968364 0.03171565
## [,13] [,14] [,15] [,16] [,17]
## [1,] 0.037028950 -0.015896111 0.071093744 -0.06819561 0.003106938
## [2,] -0.009700472 0.083163916 0.092956904 -0.02373368 0.194171892
## [3,] -0.023166236 0.005116402 -0.002229394 0.09083238 -0.074176333
## [4,] 0.051504940 0.075185579 0.007247210 -0.01006571 0.180487564
## [5,] 0.029004056 0.003923106 -0.039131496 0.04008825 -0.024278994
## [6,] 0.074084689 0.006632305 -0.022041575 -0.04235568 0.015064226
## [,18] [,19] [,20] [,21] [,22] [,23]
## [1,] 0.09436509 0.04728067 0.08039423 0.274842325 0.04272168 -0.038649523
## [2,] 0.06041675 -0.01539508 0.06032762 -0.067004634 0.15787768 0.032812990
## [3,] -0.06606138 0.12731751 0.08161066 0.046309954 0.14417047 -0.047382809
## [4,] 0.04563700 0.04699556 0.04172270 -0.062342169 0.20203413 -0.002252252
## [5,] -0.07484941 0.14376597 0.09394998 0.024250357 0.13693565 0.057376253
## [6,] 0.04371487 0.04917930 -0.01812150 0.005682592 0.10713577 -0.119481052
## [,24] [,25] [,26] [,27] [,28] [,29]
## [1,] 0.03175957 0.099262665 -0.09463002 -0.19845564 0.03976466 -0.01875030
## [2,] -0.06834167 0.100334777 0.04753571 0.03563076 0.18857806 -0.10956745
## [3,] 0.04095633 -0.009716354 -0.10960098 0.03655163 -0.06356073 -0.04829886
## [4,] -0.02626309 0.064763591 0.02486778 0.08700908 0.17419545 -0.12858283
## [5,] 0.05834615 0.017286915 -0.13774106 0.04932558 -0.05263465 -0.04296815
## [6,] 0.02147843 0.012737578 -0.06607246 -0.06989853 0.03432158 -0.05295695
## [,30] [,31] [,32] [,33] [,34]
## [1,] 0.03922501 -0.01149625 0.068957020 -0.08880295 -0.050859692
## [2,] -0.16515906 0.09309351 -0.038393112 0.04125160 -0.015996726
## [3,] 0.02959113 0.01397550 -0.031840533 -0.00327170 0.027465885
## [4,] -0.10952550 0.13815365 -0.008433746 0.03117466 0.007668515
## [5,] -0.01359635 -0.03005260 -0.075900779 -0.05214849 -0.031831129
## [6,] -0.03336701 0.05377304 0.096626275 -0.04331764 -0.009472547
## [,35] [,36] [,37] [,38] [,39]
## [1,] 1.369521e-01 0.181517577 0.10198772 0.32779715 0.183898810
## [2,] 1.763473e-01 -0.007943967 0.15307184 -0.06681902 0.003866377
## [3,] -2.868083e-01 0.062368283 0.01765606 0.05966228 0.094260197
## [4,] 9.500306e-02 -0.087629697 0.08871157 -0.09557705 -0.050713547
## [5,] -3.280077e-01 0.008475281 -0.08647074 0.03924441 -0.046502277
## [6,] -4.783907e-05 -0.171122795 -0.15112147 -0.09597421 0.040085943
## [,40] [,41] [,42] [,43] [,44] [,45]
## [1,] -0.08224931 -0.06104958 0.12579853 0.41718545 -0.10147619 0.01071707
## [2,] -0.10729696 0.03487414 0.01969472 0.05685173 -0.02352946 -0.22086740
## [3,] 0.01209019 0.01395026 0.09684873 0.08895304 0.06110343 0.11358388
## [4,] 0.09819595 0.01397741 -0.02655539 -0.06859422 -0.01229771 -0.01469707
## [5,] 0.02778074 -0.06936179 -0.02960670 0.05821517 0.03310120 0.19506803
## [6,] 0.19631764 -0.01520802 -0.04772756 -0.19610793 0.13092696 0.09991225
## [,46] [,47] [,48] [,49] [,50] [,51]
## [1,] 0.08240012 0.18764849 -0.12510331 0.149117714 -0.17651661 0.17365403
## [2,] -0.10648996 0.15697834 0.09684737 -0.111467440 0.03882797 -0.04709902
## [3,] 0.03306138 -0.01626207 0.04152485 -0.138748512 0.14512737 -0.01300247
## [4,] -0.05205525 -0.07635537 -0.15359719 0.159079888 -0.12927260 0.03945734
## [5,] 0.15683292 -0.10837556 -0.04495233 -0.001073976 0.02111025 0.03857396
## [6,] -0.08916072 -0.02328785 -0.06527110 0.191706232 -0.08988014 0.08991247
## [,52] [,53] [,54] [,55] [,56] [,57]
## [1,] -0.14372211 0.09378129 -0.120261483 0.01873669 -0.03762937 0.033431935
## [2,] -0.11996732 -0.03222809 -0.009045325 0.06071338 0.24706132 0.018006944
## [3,] 0.03857375 -0.11404672 -0.110463848 -0.02701829 0.08895578 -0.070160458
## [4,] -0.03625125 0.09196424 0.179557668 0.04964075 -0.18509302 0.068699797
## [5,] 0.12991193 0.06616678 -0.061487455 -0.02155713 -0.18226460 -0.006601438
## [6,] 0.16180857 0.19338827 -0.104884853 -0.11690311 0.06885462 -0.333224985
## [,58] [,59] [,60] [,61] [,62] [,63]
## [1,] 0.05624044 -0.1045509899 -0.03410281 0.27177086 -0.03824890 -0.14006222
## [2,] 0.16194934 0.0233984971 0.04401181 -0.01169217 0.12588152 0.14209552
## [3,] -0.23618103 0.0524883695 -0.06250234 0.02708919 0.19157803 0.07787844
## [4,] 0.14334482 -0.0762401708 0.07877092 -0.02587727 -0.16246368 -0.14304949
## [5,] -0.24428371 -0.0005254918 -0.07356478 -0.01527817 -0.08445826 -0.12086654
## [6,] 0.16411891 0.0053891474 -0.16251028 -0.05361563 -0.28110332 -0.03790376
## [,64] [,65] [,66] [,67] [,68] [,69]
## [1,] -0.11477045 0.03996614 0.06913216 0.01052621 -0.009094840 0.087821088
## [2,] 0.05492107 -0.11247947 0.26072536 -0.11621543 0.003411829 -0.002252168
## [3,] -0.01250183 -0.11573176 0.18674041 -0.17220867 0.048057093 -0.004529280
## [4,] 0.01418947 0.18233058 -0.32203625 0.16354531 0.005364360 -0.021455146
## [5,] -0.06120593 0.05284737 -0.13822259 0.10269399 -0.059232912 0.018650239
## [6,] -0.02899267 -0.07186486 0.18557519 -0.40143572 0.149390270 -0.121237015
## [,70] [,71] [,72] [,73] [,74] [,75]
## [1,] 0.03076486 -0.003101844 0.03933664 0.11247979 -0.007909514 0.009523201
## [2,] -0.11879442 0.049499259 -0.07976154 0.02081370 0.026474043 0.089030328
## [3,] -0.10285875 -0.126230314 -0.01073326 -0.01072977 -0.016187438 -0.126835170
## [4,] 0.17566907 -0.061015274 0.06090120 -0.06512625 -0.048542444 -0.097633627
## [5,] 0.05355546 0.156795194 0.04645022 0.06016132 0.040803667 0.142096723
## [6,] -0.05498821 -0.080377512 -0.06469620 0.09748688 -0.004917660 0.033189371
## [,76] [,77] [,78] [,79] [,80] [,81]
## [1,] -0.02787139 0.01232390 0.01941766 -0.007987346 0.001680979 0.002106604
## [2,] 0.35061341 -0.02027277 -0.03671681 -0.035040961 -0.041983137 -0.009060502
## [3,] -0.52015400 0.02704500 0.04119994 0.068238083 0.063916108 0.013608959
## [4,] -0.34009207 0.01362914 0.03491495 0.036518203 0.046736110 0.006179888
## [5,] 0.53447723 -0.02165557 -0.04359082 -0.075933904 -0.075731457 -0.006908326
## [6,] 0.04022614 0.06054435 -0.02738245 0.001577812 -0.005263212 0.002731923
Eigenvalues indicate how much of the dataset’s variance is captured by each eigenvector. Higher eigenvalues correspond to directions that explain more variance and are therefore more important in describing the structure of the data.
data.eigen$values
## [1] 3.153476e+01 8.490589e+00 7.712172e+00 6.405261e+00 4.764568e+00
## [6] 3.068578e+00 2.934024e+00 2.516757e+00 1.912501e+00 1.602082e+00
## [11] 1.480977e+00 1.181464e+00 9.545415e-01 8.088287e-01 7.958057e-01
## [16] 6.330308e-01 5.822221e-01 4.402077e-01 3.914952e-01 3.091580e-01
## [21] 2.448415e-01 2.321071e-01 2.077513e-01 1.677602e-01 1.597464e-01
## [26] 1.508934e-01 1.377307e-01 1.155001e-01 1.022980e-01 9.325272e-02
## [31] 8.295171e-02 7.779386e-02 7.446023e-02 5.822024e-02 5.547902e-02
## [36] 5.029370e-02 4.623229e-02 3.993050e-02 3.538391e-02 3.423715e-02
## [41] 2.641669e-02 2.478717e-02 2.081580e-02 1.923129e-02 1.810621e-02
## [46] 1.744477e-02 1.591769e-02 1.519301e-02 1.464169e-02 1.419336e-02
## [51] 1.266700e-02 1.245829e-02 1.026368e-02 9.727861e-03 9.546853e-03
## [56] 8.542345e-03 7.230276e-03 6.590798e-03 6.472651e-03 5.816789e-03
## [61] 5.246126e-03 4.608991e-03 3.593798e-03 3.561390e-03 3.192283e-03
## [66] 2.861282e-03 2.607661e-03 2.278549e-03 1.846978e-03 1.667576e-03
## [71] 1.468079e-03 1.351208e-03 1.195417e-03 7.477059e-04 6.163952e-04
## [76] 4.464905e-04 3.309954e-04 1.866445e-04 1.201482e-04 7.401160e-05
## [81] 4.959679e-05
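The eigenvalues become easier to read once they are expressed as shares of the total variance. The sketch below copies the five leading eigenvalues from the output above. It assumes that the total variance equals 81, which holds because the features were standardized (the covariance matrix is then the correlation matrix, whose trace is the number of features). The names `lead_values`, `prop_var` and `cum_var` are introduced here for illustration.

```r
# Leading eigenvalues copied from the output above
lead_values <- c(31.53476, 8.490589, 7.712172, 6.405261, 4.764568)

# Assumption: standardized features, so the total variance equals
# the number of features (trace of the correlation matrix)
total_variance <- 81

prop_var <- lead_values / total_variance  # share of variance per component
cum_var  <- cumsum(prop_var)              # cumulative share

round(data.frame(PC = 1:5,
                 eigenvalue = lead_values,
                 proportion = prop_var,
                 cumulative = cum_var), 4)
```

Under this assumption, the first component alone accounts for roughly 39% of the variance and the first five together for roughly 73%. With the full object available, the same shares follow from `data.eigen$values / sum(data.eigen$values)`.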
Time to apply Principal Component Analysis.
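`prcomp()` obtains the components through a singular value decomposition rather than through `eigen()`, so its results should match the manual eigendecomposition only up to the sign of each column. A minimal sketch on synthetic data illustrates this; the matrix `Z` and the objects `eig` and `pca` are illustrative and unrelated to the superconductor data.

```r
# Synthetic, correlated data (illustrative only)
set.seed(3)
a <- rnorm(300)
b <- rnorm(300)
Z <- scale(cbind(a,
                 a + rnorm(300, sd = 0.5),
                 b,
                 b - a + rnorm(300, sd = 0.5)))

eig <- eigen(cov(Z))                               # manual route
pca <- prcomp(Z, center = FALSE, scale. = FALSE)   # Z is already standardized

# Squared standard deviations from prcomp match the eigenvalues
same_values <- all.equal(pca$sdev^2, eig$values)

# Loadings match the eigenvectors up to the sign of each column
same_vectors <- all.equal(abs(unname(pca$rotation)), abs(eig$vectors))

same_values
same_vectors
```

A flipped sign reverses the orientation of a component but not the variance it explains, so interpretation is unaffected.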
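First, the loadings of the first six original variables on each component are presented below. Since the data are already standardized, centering and scaling are switched off in prcomp(). The loadings agree with the eigenvectors shown earlier up to the sign of each column (PC2, for example, is flipped), which does not change the interpretation.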
data_pca <- prcomp(data.s, center=FALSE, scale.=FALSE)
head(data_pca$rotation)
## PC1 PC2 PC3 PC4
## number_of_elements 0.15577833 0.09185596 0.075197122 -0.05764604
## mean_atomic_mass -0.05181305 0.22647459 0.007334317 0.18303403
## wtd_mean_atomic_mass -0.09916962 0.19990777 0.055080664 0.14724811
## gmean_atomic_mass -0.08332166 0.21894420 -0.025325496 0.15064367
## wtd_gmean_atomic_mass -0.12006367 0.18477968 0.028652254 0.10861815
## entropy_atomic_mass 0.14641854 0.12255116 0.059729607 -0.06044047
## PC5 PC6 PC7 PC8
## number_of_elements 0.0115456693 -0.06250960 0.05749010 -0.020053317
## mean_atomic_mass 0.0350782024 -0.05190857 -0.07510766 0.129593196
## wtd_mean_atomic_mass -0.0285920161 0.06732956 0.02037932 0.005283779
## gmean_atomic_mass -0.0040613110 -0.08519657 0.02370737 0.122484110
## wtd_gmean_atomic_mass -0.0440689581 0.03733811 0.08878995 -0.002823301
## entropy_atomic_mass -0.0009406503 -0.12905797 0.12944835 -0.035694539
## PC9 PC10 PC11 PC12
## number_of_elements 0.05300080 -0.07068765 0.007686226 -0.01966041
## mean_atomic_mass 0.06171558 0.10585155 -0.091201103 -0.27748970
## wtd_mean_atomic_mass 0.12665782 0.07590331 -0.031584554 -0.27831251
## gmean_atomic_mass 0.04140090 0.06334025 -0.134187985 -0.24640830
## wtd_gmean_atomic_mass 0.10087474 0.05350412 -0.074966395 -0.24007966
## entropy_atomic_mass 0.04743043 -0.07473660 -0.013968364 0.03171565
## PC13 PC14 PC15 PC16
## number_of_elements 0.037028950 -0.015896111 -0.071093744 0.06819561
## mean_atomic_mass -0.009700472 0.083163916 -0.092956904 0.02373368
## wtd_mean_atomic_mass -0.023166236 0.005116402 0.002229394 -0.09083238
## gmean_atomic_mass 0.051504940 0.075185579 -0.007247210 0.01006571
## wtd_gmean_atomic_mass 0.029004056 0.003923106 0.039131496 -0.04008825
## entropy_atomic_mass 0.074084689 0.006632305 0.022041575 0.04235568
## PC17 PC18 PC19 PC20
## number_of_elements -0.003106938 0.09436509 -0.04728067 0.08039423
## mean_atomic_mass -0.194171892 0.06041675 0.01539508 0.06032762
## wtd_mean_atomic_mass 0.074176333 -0.06606138 -0.12731751 0.08161066
## gmean_atomic_mass -0.180487564 0.04563700 -0.04699556 0.04172270
## wtd_gmean_atomic_mass 0.024278994 -0.07484941 -0.14376597 0.09394998
## entropy_atomic_mass -0.015064226 0.04371487 -0.04917930 -0.01812150
## PC21 PC22 PC23 PC24
## number_of_elements -0.274842325 0.04272168 -0.038649523 0.03175957
## mean_atomic_mass 0.067004634 0.15787768 0.032812990 -0.06834167
## wtd_mean_atomic_mass -0.046309954 0.14417047 -0.047382809 0.04095633
## gmean_atomic_mass 0.062342169 0.20203413 -0.002252252 -0.02626309
## wtd_gmean_atomic_mass -0.024250357 0.13693565 0.057376253 0.05834615
## entropy_atomic_mass -0.005682592 0.10713577 -0.119481052 0.02147843
## PC25 PC26 PC27 PC28
## number_of_elements -0.099262665 0.09463002 -0.19845564 0.03976466
## mean_atomic_mass -0.100334777 -0.04753571 0.03563076 0.18857806
## wtd_mean_atomic_mass 0.009716354 0.10960098 0.03655163 -0.06356073
## gmean_atomic_mass -0.064763591 -0.02486778 0.08700908 0.17419545
## wtd_gmean_atomic_mass -0.017286915 0.13774106 0.04932558 -0.05263465
## entropy_atomic_mass -0.012737578 0.06607246 -0.06989853 0.03432158
## PC29 PC30 PC31 PC32
## number_of_elements -0.01875030 0.03922501 -0.01149625 -0.068957020
## mean_atomic_mass -0.10956745 -0.16515906 0.09309351 0.038393112
## wtd_mean_atomic_mass -0.04829886 0.02959113 0.01397550 0.031840533
## gmean_atomic_mass -0.12858283 -0.10952550 0.13815365 0.008433746
## wtd_gmean_atomic_mass -0.04296815 -0.01359635 -0.03005260 0.075900779
## entropy_atomic_mass -0.05295695 -0.03336701 0.05377304 -0.096626275
## PC33 PC34 PC35 PC36
## number_of_elements 0.08880295 0.050859692 1.369521e-01 -0.181517577
## mean_atomic_mass -0.04125160 0.015996726 1.763473e-01 0.007943967
## wtd_mean_atomic_mass 0.00327170 -0.027465885 -2.868083e-01 -0.062368283
## gmean_atomic_mass -0.03117466 -0.007668515 9.500306e-02 0.087629697
## wtd_gmean_atomic_mass 0.05214849 0.031831129 -3.280077e-01 -0.008475281
## entropy_atomic_mass 0.04331764 0.009472547 -4.783907e-05 0.171122795
## PC37 PC38 PC39 PC40
## number_of_elements 0.10198772 0.32779715 -0.183898810 -0.08224931
## mean_atomic_mass 0.15307184 -0.06681902 -0.003866377 -0.10729696
## wtd_mean_atomic_mass 0.01765606 0.05966228 -0.094260197 0.01209019
## gmean_atomic_mass 0.08871157 -0.09557705 0.050713547 0.09819595
## wtd_gmean_atomic_mass -0.08647074 0.03924441 0.046502277 0.02778074
## entropy_atomic_mass -0.15112147 -0.09597421 -0.040085943 0.19631764
## PC41 PC42 PC43 PC44
## number_of_elements 0.06104958 0.12579853 0.41718545 -0.10147619
## mean_atomic_mass -0.03487414 0.01969472 0.05685173 -0.02352946
## wtd_mean_atomic_mass -0.01395026 0.09684873 0.08895304 0.06110343
## gmean_atomic_mass -0.01397741 -0.02655539 -0.06859422 -0.01229771
## wtd_gmean_atomic_mass 0.06936179 -0.02960670 0.05821517 0.03310120
## entropy_atomic_mass 0.01520802 -0.04772756 -0.19610793 0.13092696
## PC45 PC46 PC47 PC48
## number_of_elements -0.01071707 -0.08240012 -0.18764849 -0.12510331
## mean_atomic_mass 0.22086740 0.10648996 -0.15697834 0.09684737
## wtd_mean_atomic_mass -0.11358388 -0.03306138 0.01626207 0.04152485
## gmean_atomic_mass 0.01469707 0.05205525 0.07635537 -0.15359719
## wtd_gmean_atomic_mass -0.19506803 -0.15683292 0.10837556 -0.04495233
## entropy_atomic_mass -0.09991225 0.08916072 0.02328785 -0.06527110
## PC49 PC50 PC51 PC52
## number_of_elements -0.149117714 -0.17651661 0.17365403 0.14372211
## mean_atomic_mass 0.111467440 0.03882797 -0.04709902 0.11996732
## wtd_mean_atomic_mass 0.138748512 0.14512737 -0.01300247 -0.03857375
## gmean_atomic_mass -0.159079888 -0.12927260 0.03945734 0.03625125
## wtd_gmean_atomic_mass 0.001073976 0.02111025 0.03857396 -0.12991193
## entropy_atomic_mass -0.191706232 -0.08988014 0.08991247 -0.16180857
## PC53 PC54 PC55 PC56
## number_of_elements 0.09378129 0.120261483 0.01873669 0.03762937
## mean_atomic_mass -0.03222809 0.009045325 0.06071338 -0.24706132
## wtd_mean_atomic_mass -0.11404672 0.110463848 -0.02701829 -0.08895578
## gmean_atomic_mass 0.09196424 -0.179557668 0.04964075 0.18509302
## wtd_gmean_atomic_mass 0.06616678 0.061487455 -0.02155713 0.18226460
## entropy_atomic_mass 0.19338827 0.104884853 -0.11690311 -0.06885462
## PC57 PC58 PC59 PC60
## number_of_elements -0.033431935 -0.05624044 -0.1045509899 -0.03410281
## mean_atomic_mass -0.018006944 -0.16194934 0.0233984971 0.04401181
## wtd_mean_atomic_mass 0.070160458 0.23618103 0.0524883695 -0.06250234
## gmean_atomic_mass -0.068699797 -0.14334482 -0.0762401708 0.07877092
## wtd_gmean_atomic_mass 0.006601438 0.24428371 -0.0005254918 -0.07356478
## entropy_atomic_mass 0.333224985 -0.16411891 0.0053891474 -0.16251028
## PC61 PC62 PC63 PC64
## number_of_elements -0.27177086 -0.03824890 0.14006222 0.11477045
## mean_atomic_mass 0.01169217 0.12588152 -0.14209552 -0.05492107
## wtd_mean_atomic_mass -0.02708919 0.19157803 -0.07787844 0.01250183
## gmean_atomic_mass 0.02587727 -0.16246368 0.14304949 -0.01418947
## wtd_gmean_atomic_mass 0.01527817 -0.08445826 0.12086654 0.06120593
## entropy_atomic_mass 0.05361563 -0.28110332 0.03790376 0.02899267
## PC65 PC66 PC67 PC68
## number_of_elements 0.03996614 0.06913216 0.01052621 0.009094840
## mean_atomic_mass -0.11247947 0.26072536 -0.11621543 -0.003411829
## wtd_mean_atomic_mass -0.11573176 0.18674041 -0.17220867 -0.048057093
## gmean_atomic_mass 0.18233058 -0.32203625 0.16354531 -0.005364360
## wtd_gmean_atomic_mass 0.05284737 -0.13822259 0.10269399 0.059232912
## entropy_atomic_mass -0.07186486 0.18557519 -0.40143572 -0.149390270
## PC69 PC70 PC71 PC72
## number_of_elements 0.087821088 0.03076486 -0.003101844 -0.03933664
## mean_atomic_mass -0.002252168 -0.11879442 0.049499259 0.07976154
## wtd_mean_atomic_mass -0.004529280 -0.10285875 -0.126230314 0.01073326
## gmean_atomic_mass -0.021455146 0.17566907 -0.061015274 -0.06090120
## wtd_gmean_atomic_mass 0.018650239 0.05355546 0.156795194 -0.04645022
## entropy_atomic_mass -0.121237015 -0.05498821 -0.080377512 0.06469620
## PC73 PC74 PC75 PC76
## number_of_elements -0.11247979 -0.007909514 -0.009523201 -0.02787139
## mean_atomic_mass -0.02081370 0.026474043 -0.089030328 0.35061341
## wtd_mean_atomic_mass 0.01072977 -0.016187438 0.126835170 -0.52015400
## gmean_atomic_mass 0.06512625 -0.048542444 0.097633627 -0.34009207
## wtd_gmean_atomic_mass -0.06016132 0.040803667 -0.142096723 0.53447723
## entropy_atomic_mass -0.09748688 -0.004917660 -0.033189371 0.04022614
## PC77 PC78 PC79 PC80
## number_of_elements -0.01232390 -0.01941766 -0.007987346 -0.001680979
## mean_atomic_mass 0.02027277 0.03671681 -0.035040961 0.041983137
## wtd_mean_atomic_mass -0.02704500 -0.04119994 0.068238083 -0.063916108
## gmean_atomic_mass -0.01362914 -0.03491495 0.036518203 -0.046736110
## wtd_gmean_atomic_mass 0.02165557 0.04359082 -0.075933904 0.075731457
## entropy_atomic_mass -0.06054435 0.02738245 0.001577812 0.005263212
## PC81
## number_of_elements -0.002106604
## mean_atomic_mass 0.009060502
## wtd_mean_atomic_mass -0.013608959
## gmean_atomic_mass -0.006179888
## wtd_gmean_atomic_mass 0.006908326
## entropy_atomic_mass -0.002731923
Below is the summary of the PCA, including the standard deviation of each component, the proportion of the original variance explained by each component, and the cumulative variance.
summary(data_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 5.6156 2.9139 2.77708 2.53086 2.18279 1.75174 1.71290
## Proportion of Variance 0.3893 0.1048 0.09521 0.07908 0.05882 0.03788 0.03622
## Cumulative Proportion 0.3893 0.4941 0.58935 0.66843 0.72725 0.76513 0.80136
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.58643 1.38293 1.26573 1.21695 1.08695 0.97701 0.89935
## Proportion of Variance 0.03107 0.02361 0.01978 0.01828 0.01459 0.01178 0.00999
## Cumulative Proportion 0.83243 0.85604 0.87582 0.89410 0.90869 0.92047 0.93046
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.89208 0.79563 0.76303 0.66348 0.62570 0.55602 0.49481
## Proportion of Variance 0.00982 0.00782 0.00719 0.00543 0.00483 0.00382 0.00302
## Cumulative Proportion 0.94028 0.94810 0.95529 0.96072 0.96555 0.96937 0.97239
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.48177 0.45580 0.40959 0.39968 0.38845 0.3711 0.33985
## Proportion of Variance 0.00287 0.00256 0.00207 0.00197 0.00186 0.0017 0.00143
## Cumulative Proportion 0.97526 0.97782 0.97989 0.98187 0.98373 0.9854 0.98686
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 0.31984 0.30537 0.28801 0.27892 0.27287 0.24129 0.23554
## Proportion of Variance 0.00126 0.00115 0.00102 0.00096 0.00092 0.00072 0.00068
## Cumulative Proportion 0.98812 0.98927 0.99029 0.99125 0.99217 0.99289 0.99358
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.22426 0.21502 0.19983 0.18811 0.18503 0.16253 0.15744
## Proportion of Variance 0.00062 0.00057 0.00049 0.00044 0.00042 0.00033 0.00031
## Cumulative Proportion 0.99420 0.99477 0.99526 0.99570 0.99612 0.99645 0.99675
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 0.14428 0.13868 0.13456 0.13208 0.1262 0.12326 0.12100
## Proportion of Variance 0.00026 0.00024 0.00022 0.00022 0.0002 0.00019 0.00018
## Cumulative Proportion 0.99701 0.99725 0.99747 0.99769 0.9979 0.99807 0.99825
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 0.11914 0.11255 0.11162 0.10131 0.09863 0.09771 0.09242
## Proportion of Variance 0.00018 0.00016 0.00015 0.00013 0.00012 0.00012 0.00011
## Cumulative Proportion 0.99843 0.99858 0.99874 0.99886 0.99898 0.99910 0.99921
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 0.08503 0.08118 0.08045 0.07627 0.07243 0.06789 0.05995
## Proportion of Variance 0.00009 0.00008 0.00008 0.00007 0.00006 0.00006 0.00004
## Cumulative Proportion 0.99930 0.99938 0.99946 0.99953 0.99959 0.99965 0.99970
## PC64 PC65 PC66 PC67 PC68 PC69 PC70
## Standard deviation 0.05968 0.05650 0.05349 0.05107 0.04773 0.04298 0.04084
## Proportion of Variance 0.00004 0.00004 0.00004 0.00003 0.00003 0.00002 0.00002
## Cumulative Proportion 0.99974 0.99978 0.99981 0.99985 0.99988 0.99990 0.99992
## PC71 PC72 PC73 PC74 PC75 PC76 PC77
## Standard deviation 0.03832 0.03676 0.03457 0.02734 0.02483 0.02113 0.01819
## Proportion of Variance 0.00002 0.00002 0.00001 0.00001 0.00001 0.00001 0.00000
## Cumulative Proportion 0.99994 0.99995 0.99997 0.99998 0.99999 0.99999 0.99999
## PC78 PC79 PC80 PC81
## Standard deviation 0.01366 0.01096 0.008603 0.007042
## Proportion of Variance 0.00000 0.00000 0.000000 0.000000
## Cumulative Proportion 1.00000 1.00000 1.000000 1.000000
The first few principal components capture a large share of the variance. The first 6 components alone retain over 75% of the original variance (76.5%), and the first 12 preserve more than 90% (90.9%). Given that the original dataset contained 81 variables, keeping 12 components corresponds to a reduction in dimensionality of approximately 85%. This result is consistent with the high level of correlation observed among many variables in the original dataset.
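The figures quoted here can be re-derived from the standard deviations printed in the summary. The sketch below is only an illustration: it copies the first 13 standard deviations from the output and assumes that the PCA was run on standardized features, so that the total variance equals the number of features (81). That assumption is consistent with the printed proportions. The names `sdev_first13` and `n_components_for` are hypothetical helpers, not part of the original analysis.

```r
# Standard deviations of PC1..PC13, copied from the summary output above.
sdev_first13 <- c(5.6156, 2.9139, 2.77708, 2.53086, 2.18279, 1.75174, 1.71290,
                  1.58643, 1.38293, 1.26573, 1.21695, 1.08695, 0.97701)

# Assumption: the features were standardized, so total variance = 81.
total_variance <- 81

# Variance of a component is its squared standard deviation (its eigenvalue).
eigenvalues <- sdev_first13^2
cum_prop <- cumsum(eigenvalues) / total_variance

# Smallest number of components whose cumulative share reaches `target`.
# Returns NA if the target is not reached within the supplied components.
n_components_for <- function(target) {
  which(cum_prop >= target)[1]
}

n_75 <- n_components_for(0.75)
n_90 <- n_components_for(0.90)

# Share of dimensions removed when 12 of the 81 are kept.
reduction_12 <- 1 - 12 / 81
```

With these inputs `n_75` evaluates to 6 and `n_90` to 12, and `reduction_12` is roughly 0.85, matching the numbers in the text.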
As an example, let’s visualize the variables in the space of the first two components.
fviz_pca_var(data_pca, col.var="darkred")
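Variables whose arrows point in a similar direction are positively correlated in the space defined by the first two principal components, while arrows pointing in opposite directions indicate a negative correlation. This reading is most reliable for variables with long arrows, since those are the ones well represented by the first two components.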
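To make the geometry of this plot concrete, the following sketch shows where the arrow coordinates come from. It uses the built-in `USArrests` data rather than the superconductor dataset, and `pca_demo` is a hypothetical object name. As far as I understand, for a `prcomp` result the variable coordinates are the loadings multiplied by the component standard deviations. For standardized data these coincide with the correlations between the variables and the components.

```r
# Illustration on a small built-in dataset (not the superconductor data).
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates: each column of loadings times that component's sd.
var_coord <- sweep(pca_demo$rotation, 2, pca_demo$sdev, "*")

# Correlations between the original variables and the component scores.
var_pc_cor <- cor(USArrests, pca_demo$x)

# For standardized data the two matrices agree up to rounding error.
max_abs_diff <- max(abs(var_coord - var_pc_cor))

# Coordinates of the arrows drawn in the PC1-PC2 plane.
round(var_coord[, 1:2], 3)
```

Because each coordinate is a correlation, no arrow can extend beyond the unit circle, and an arrow close to the circle belongs to a variable that is almost fully described by PC1 and PC2.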
Time to visualize the observations in two dimensions.
fviz_pca_ind(data_pca, col.ind="cos2", geom="point",
gradient.cols=c("yellow", "orange", "red"))
cos2 (squared cosine) measures how well each observation is represented by the selected principal components, here PC1 and PC2. A cos2 close to 1 indicates that the observation is well represented in this plane, while a cos2 close to 0 indicates that it is poorly represented.
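The quantity can be computed by hand from the component scores, which may help in interpreting the colour scale. The sketch below again uses the built-in `USArrests` data as a stand-in. The object names are hypothetical. It assumes the usual definition: the squared coordinate on a component divided by the squared distance of the observation from the origin.

```r
# Illustration on a small built-in dataset (not the superconductor data).
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Coordinates of every observation on all principal components.
scores <- pca_demo$x

# cos2 per observation and component: squared coordinate divided by the
# squared distance of the observation from the origin.
cos2 <- scores^2 / rowSums(scores^2)

# Quality of representation in the PC1-PC2 plane: sum over the two axes.
cos2_plane <- rowSums(cos2[, 1:2])

# Observations that are least well represented in two dimensions.
head(sort(cos2_plane), 3)
```

Across all components the cos2 values of an observation sum to 1, so `cos2_plane` can be read as the share of that observation's squared distance from the origin that is visible in the two-dimensional plot.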
PCA is often the first step before diving into further analysis. Deciding how many components to keep is important to capture enough information without retaining too much noise. To help with this choice, we can plot the eigenvalues and the variance explained by each component, along with the cumulative variance curve.
eig_plot <- fviz_eig(
data_pca,
choice = "eigenvalue",
addlabels = TRUE,
ncp = 15,
main = "Eigenvalues of each dimension"
) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
var_plot <- fviz_eig(
data_pca,
ncp = 15,
addlabels = TRUE,
main = "Percentage of variance explained by each dimension"
) +
scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
grid.arrange(eig_plot, var_plot, nrow = 2, top = "Eigenvalues and amount of variance explained by each component in PCA")
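# The cumulative proportion of variance is the third row of the importance matrix
pca_summary <- summary(data_pca)
plot(pca_summary$importance[3,], type="l", col = "blue", main = "Cumulative variance", xlab = "Number of components", ylab = "Cumulative proportion of variance")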
Choosing the optimal number of principal components is somewhat subjective, and there is no universal criterion. Since the dataset describes physical properties of materials, further analysis will likely require high precision. Therefore, I would say it is justified to retain a larger proportion of the variance.
Keeping the first 12 components reduces the dimensionality by approximately 85% while still preserving about 91% of the original variance, which seems quite satisfactory.
To further support this choice, the Kaiser rule can be applied, which suggests retaining components with eigenvalues greater than 1. According to the eigenvalue plot above, this also indicates that 12 components should be kept: the 12th eigenvalue is about 1.18 (the square of its standard deviation, 1.08695), while the 13th is about 0.95.
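The Kaiser cut-off can be checked directly from the standard deviations in the summary, since the eigenvalue of a component is its squared standard deviation. The sketch below copies the values around the cut-off (PC11 to PC14) from the output. The vector name `sdev_near_cutoff` is a hypothetical label.

```r
# Standard deviations of PC11..PC14, copied from the summary output above.
sdev_near_cutoff <- c(PC11 = 1.21695, PC12 = 1.08695,
                      PC13 = 0.97701, PC14 = 0.89935)

# Eigenvalue of a component = its variance = squared standard deviation.
eigenvalues <- sdev_near_cutoff^2

# Kaiser rule: keep components whose eigenvalue exceeds 1, i.e. components
# that carry more variance than a single standardized variable.
kept <- names(eigenvalues)[eigenvalues > 1]
```

Here `kept` contains PC11 and PC12 only. Since the components are ordered by decreasing variance, PC1 to PC10 pass the rule as well. On the fitted object itself, the same count should be obtainable with `sum(data_pca$sdev^2 > 1)`.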
For these reasons, I would personally choose to retain 12 PCs for any further analysis. Of course, this decision remains somewhat subjective, as there is no single correct answer for the number of principal components to retain.
Assuming that we keep the first 12 principal components, we can create plots to examine which variables contribute most to each of them.
# Contribution plots for the first 12 components (top 10 variables each)
contrib_plots <- lapply(1:12, function(i) {
  fviz_contrib(data_pca, choice = "var", fill = "steelblue",
               axes = i, top = 10)
})
# Display the plots in pairs, two per figure
for (i in seq(1, 11, by = 2)) {
  grid.arrange(contrib_plots[[i]], contrib_plots[[i + 1]], nrow = 2)
}
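The bar heights in such contribution plots can be reproduced from the loadings. The sketch below uses the built-in `USArrests` data as a stand-in, with hypothetical object names. It assumes that, for a single component, the contribution of a variable is its squared loading expressed as a percentage, and that the dashed reference line marks the level expected if all variables contributed equally.

```r
# Illustration on a small built-in dataset (not the superconductor data).
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Contribution (in %) of each variable to each component. Columns of the
# rotation matrix have unit length, so the squared loadings sum to 1.
contrib <- 100 * pca_demo$rotation^2

# Each component's contributions add up to 100%.
colSums(contrib)

# Level expected if every variable contributed equally to a component.
uniform_level <- 100 / nrow(pca_demo$rotation)

# Variables exceeding the uniform level on PC1, largest first.
pc1_contrib <- sort(contrib[, "PC1"], decreasing = TRUE)
above_uniform <- names(pc1_contrib)[pc1_contrib > uniform_level]
```

For the superconductor data with 81 features, the corresponding uniform level would be 100 / 81, or about 1.23%, so any variable above that level contributes more than its "fair share" to the component.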
The primary goal of this study was to reduce the dimensionality of a dataset describing physical properties of superconducting materials. The reduction was particularly warranted because of the high correlation between many variables in the original data.
The analysis was done using Principal Component Analysis (PCA), a technique that transforms the original features into a new set of orthogonal variables called principal components, which are linear combinations of the original features.
It was found that the majority of the original variance can be retained using only a few of the newly created components. The first 12 principal components retain approximately 91% of the original variance. Considering the characteristics of the data, the likely need for a precise approach, and the guidance of the Kaiser rule, I would say keeping these 12 components for further analysis seems reasonable. Reducing the dimensionality by 85% while retaining about 91% of the variance seems to be a good compromise. However, there is no strict, objective criterion for the number of components to retain, so this choice reflects my point of view.
Hamidieh, K. (2018). Superconductivty Data [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53P47.