Introduction

This study applies dimensionality reduction to a dataset containing physical properties of superconducting materials. Since the dataset includes 81 features, many of which are likely correlated, dimensionality reduction provides an effective way to simplify the dataset and serves as a strong starting point for further analysis. To achieve this, Principal Component Analysis (PCA) will be employed. PCA is a linear technique that transforms the original variables into a smaller set of uncorrelated variables, known as principal components, while retaining as much variance as possible from the original dataset.

Chapter 1. Description of the Dataset

1.1. Source of the Dataset

The dataset used in this study is obtained from the UCI Machine Learning Repository and was originally introduced in the paper “A Data-Driven Statistical Model for Predicting the Critical Temperature of a Superconductor” by K. Hamidieh (2018):

https://archive.ics.uci.edu/dataset/464/superconductivty+data

The primary objective of the original study was to predict the superconducting critical temperature based on a set of extracted material features.

The file “train.csv” contains 81 input features in addition to the critical temperature as the target variable, resulting in a high-dimensional feature space that can be challenging to analyze. Therefore, the main goal of this study is to apply Principal Component Analysis to the input features only (excluding the critical temperature), in order to reduce their dimensionality and simplify the dataset while retaining its most informative components.

1.2. Variables

The variables included in the dataset are based on fundamental atomic properties of the chemical elements that make up each superconductor compound. These properties include:

  1. number of elements - the number of distinct chemical elements present in a compound
  2. atomic mass - the combined rest mass of protons and neutrons
  3. first ionization energy - the energy required to remove a valence electron
  4. atomic radius
  5. density - density at standard temperature and pressure
  6. electron affinity - the energy change associated with adding an electron to a neutral atom
  7. fusion heat - the energy required to change a substance from solid to liquid without a change in temperature
  8. thermal conductivity - the thermal conductivity coefficient
  9. valence - the typical number of chemical bonds formed by an element

For each of these variables (except number of elements), several descriptive statistics are computed across the elements present in a compound, including:

  1. Mean
  2. Weighted mean
  3. Geometric mean
  4. Weighted geometric mean
  5. Entropy
  6. Weighted entropy
  7. Range
  8. Weighted range
  9. Standard deviation
  10. Weighted standard deviation

This results in a total of 81 features: 8 properties × 10 statistics = 80, plus the number of elements. As can be expected, many of these features are highly correlated (for example, mean atomic mass, weighted mean atomic mass, geometric mean atomic mass…). This high level of correlation motivates the use of dimensionality reduction techniques, such as Principal Component Analysis.
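The feature count can be reproduced with a short sketch that rebuilds the column names from the statistic prefixes and property suffixes seen in the dataset (for example, “fie” stands for first ionization energy). The object names used here (stats_prefix, properties, feature_names) are illustrative and not part of the original analysis.

```r
# Statistic prefixes and property suffixes as they appear in the column names
stats_prefix <- c("mean", "wtd_mean", "gmean", "wtd_gmean", "entropy",
                  "wtd_entropy", "range", "wtd_range", "std", "wtd_std")
properties <- c("atomic_mass", "fie", "atomic_radius", "Density",
                "ElectronAffinity", "FusionHeat", "ThermalConductivity",
                "Valence")

# Every statistic/property combination; the statistic varies fastest,
# so the names come out property by property
grid <- expand.grid(stat = stats_prefix, prop = properties,
                    stringsAsFactors = FALSE)
feature_names <- c("number_of_elements",
                   paste(grid$stat, grid$prop, sep = "_"))

length(feature_names)  # 8 * 10 + 1 = 81
```

If the real data frame is available, comparing feature_names with names(data) would be a quick way to confirm that no column is missing or misnamed.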

Chapter 2. Methodology - PCA

The goal of this study will be achieved using Principal Component Analysis.

PCA is a dimensionality reduction technique used to simplify datasets that contain a large number of variables. The main idea behind PCA is to transform the original features into a new set of orthogonal variables called principal components (PCs), which are linear combinations of the original features. The first principal component explains the largest possible amount of variance in the data, the second explains the next largest amount (while being orthogonal to the first), and so on. Mathematically, PCA is based on the eigenvalues and eigenvectors of the covariance matrix. Eigenvectors define the directions of maximum variance, while eigenvalues indicate how much variance is captured along each direction. By keeping only the first few principal components that retain most of the total variance, PCA reduces the dimensionality of the dataset while preserving its most important structure, making the data easier to analyze.
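The description above can be written compactly as follows. The notation is introduced here for illustration only: X is assumed to be the centered n × p data matrix, C its sample covariance matrix, v_k the k-th eigenvector and λ_k the corresponding eigenvalue.

```latex
\[
C = \frac{1}{n-1} X^{\top} X, \qquad
C v_k = \lambda_k v_k, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
\]
\[
z_k = X v_k, \qquad
\operatorname{Var}(z_k) = \lambda_k, \qquad
\text{share of variance explained by PC}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}
\]
```

When the variables are standardized beforehand, C is the correlation matrix, so the eigenvalues sum to p, the number of variables.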

Chapter 3. Data Preparation and Presentation

3.1. Loading and Preprocessing the Data

Let’s start by loading the libraries.

# Load the libraries
library(corrplot)
library(caret)
library(recipes)
library(stats)
library(factoextra)
library(gridExtra)
library(ggfortify)
library(grid)

Let’s import the dataset called “train.csv”. As mentioned before, the original dependent variable, critical temperature, which appears in the last column, will be removed.

data <- read.csv("train.csv")
dim(data)
## [1] 21263    82
# Drop column 82 (critical temperature) and keep the 81 input features
data <- data[, -82]
head(data)
##   number_of_elements mean_atomic_mass wtd_mean_atomic_mass gmean_atomic_mass
## 1                  4         88.94447             57.86269          66.36159
## 2                  5         92.72921             58.51842          73.13279
## 3                  4         88.94447             57.88524          66.36159
## 4                  4         88.94447             57.87397          66.36159
## 5                  4         88.94447             57.84014          66.36159
## 6                  4         88.94447             57.79504          66.36159
##   wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass
## 1              36.11661            1.181795               1.0623955
## 2              36.39660            1.449309               1.0577551
## 3              36.12251            1.181795               0.9759805
## 4              36.11956            1.181795               1.0222909
## 5              36.11072            1.181795               1.1292237
## 6              36.09893            1.181795               1.2252028
##   range_atomic_mass wtd_range_atomic_mass std_atomic_mass wtd_std_atomic_mass
## 1          122.9061              31.79492        51.96883            53.62253
## 2          122.9061              36.16194        47.09463            53.97987
## 3          122.9061              35.74110        51.96883            53.65627
## 4          122.9061              33.76801        51.96883            53.63940
## 5          122.9061              27.84874        51.96883            53.58877
## 6          122.9061              20.68746        51.96883            53.52115
##   mean_fie wtd_mean_fie gmean_fie wtd_gmean_fie entropy_fie wtd_entropy_fie
## 1  775.425     1010.269  718.1529      938.0168    1.305967       0.7914878
## 2  766.440     1010.613  720.6055      938.7454    1.544145       0.8070782
## 3  775.425     1010.820  718.1529      939.0090    1.305967       0.7736202
## 4  775.425     1010.544  718.1529      938.5128    1.305967       0.7832067
## 5  775.425     1009.717  718.1529      937.0256    1.305967       0.8052296
## 6  775.425     1008.614  718.1529      935.0463    1.305967       0.8247426
##   range_fie wtd_range_fie  std_fie wtd_std_fie mean_atomic_radius
## 1     810.6      735.9857 323.8118    355.5630             160.25
## 2     810.6      743.1643 290.1830    354.9635             161.20
## 3     810.6      743.1643 323.8118    354.8042             160.25
## 4     810.6      739.5750 323.8118    355.1839             160.25
## 5     810.6      728.8071 323.8118    356.3193             160.25
## 6     810.6      714.4500 323.8118    357.8246             160.25
##   wtd_mean_atomic_radius gmean_atomic_radius wtd_gmean_atomic_radius
## 1               105.5143            136.1260                84.52842
## 2               104.9714            141.4652                84.37017
## 3               104.6857            136.1260                84.21457
## 4               105.1000            136.1260                84.37135
## 5               106.3429            136.1260                84.84344
## 6               108.0000            136.1260                85.47701
##   entropy_atomic_radius wtd_entropy_atomic_radius range_atomic_radius
## 1              1.259244                  1.207040                 205
## 2              1.508328                  1.204115                 205
## 3              1.259244                  1.132547                 205
## 4              1.259244                  1.173033                 205
## 5              1.259244                  1.261194                 205
## 6              1.259244                  1.331339                 205
##   wtd_range_atomic_radius std_atomic_radius wtd_std_atomic_radius mean_Density
## 1                42.91429          75.23754              69.23557     4654.357
## 2                50.57143          67.32132              68.00882     5821.486
## 3                49.31429          75.23754              67.79771     4654.357
## 4                46.11429          75.23754              68.52166     4654.357
## 5                36.51429          75.23754              70.63445     4654.357
## 6                23.71429          75.23754              73.32413     4654.357
##   wtd_mean_Density gmean_Density wtd_gmean_Density entropy_Density
## 1         2961.502      724.9532          53.54381        1.033129
## 2         3021.017     1237.0951          54.09572        1.314442
## 3         2999.159      724.9532          53.97402        1.033129
## 4         2980.331      724.9532          53.75849        1.033129
## 5         2923.845      724.9532          53.11703        1.033129
## 6         2848.531      724.9532          52.27364        1.033129
##   wtd_entropy_Density range_Density wtd_range_Density std_Density
## 1           0.8145982      8958.571          1579.583    3306.163
## 2           0.9148022     10488.571          1667.383    3767.403
## 3           0.7603052      8958.571          1667.383    3306.163
## 4           0.7888885      8958.571          1623.483    3306.163
## 5           0.8598109      8958.571          1491.783    3306.163
## 6           0.9323687      8958.571          1316.183    3306.163
##   wtd_std_Density mean_ElectronAffinity wtd_mean_ElectronAffinity
## 1        3572.597               81.8375                  111.7271
## 2        3632.649               90.8900                  112.3164
## 3        3592.019               81.8375                  112.2136
## 4        3582.371               81.8375                  111.9704
## 5        3552.669               81.8375                  111.2407
## 6        3511.262               81.8375                  110.2679
##   gmean_ElectronAffinity wtd_gmean_ElectronAffinity entropy_ElectronAffinity
## 1               60.12318                   99.41468                 1.159687
## 2               69.83331                  101.16640                 1.427997
## 3               60.12318                  101.08215                 1.159687
## 4               60.12318                  100.24495                 1.159687
## 5               60.12318                   97.77472                 1.159687
## 6               60.12318                   94.57550                 1.159687
##   wtd_entropy_ElectronAffinity range_ElectronAffinity
## 1                    0.7873817                 127.05
## 2                    0.8386665                 127.05
## 3                    0.7860067                 127.05
## 4                    0.7869005                 127.05
## 5                    0.7873962                 127.05
## 6                    0.7844615                 127.05
##   wtd_range_ElectronAffinity std_ElectronAffinity wtd_std_ElectronAffinity
## 1                   80.98714             51.43371                 42.55840
## 2                   81.20786             49.43817                 41.66762
## 3                   81.20786             51.43371                 41.63988
## 4                   81.09750             51.43371                 42.10234
## 5                   80.76643             51.43371                 43.45206
## 6                   80.32500             51.43371                 45.17068
##   mean_FusionHeat wtd_mean_FusionHeat gmean_FusionHeat wtd_gmean_FusionHeat
## 1          6.9055            3.846857         3.479475             1.040986
## 2          7.7844            3.796857         4.403790             1.035251
## 3          6.9055            3.822571         3.479475             1.037439
## 4          6.9055            3.834714         3.479475             1.039211
## 5          6.9055            3.871143         3.479475             1.044545
## 6          6.9055            3.919714         3.479475             1.051699
##   entropy_FusionHeat wtd_entropy_FusionHeat range_FusionHeat
## 1           1.088575              0.9949982           12.878
## 2           1.374977              1.0730938           12.878
## 3           1.088575              0.9274794           12.878
## 4           1.088575              0.9640310           12.878
## 5           1.088575              1.0449695           12.878
## 6           1.088575              1.1118503           12.878
##   wtd_range_FusionHeat std_FusionHeat wtd_std_FusionHeat
## 1             1.744571       4.599064           4.666920
## 2             1.595714       4.473363           4.603000
## 3             1.757143       4.599064           4.649635
## 4             1.744571       4.599064           4.658301
## 5             1.744571       4.599064           4.684014
## 6             1.744571       4.599064           4.717642
##   mean_ThermalConductivity wtd_mean_ThermalConductivity
## 1                 107.7566                     61.01519
## 2                 172.2053                     61.37233
## 3                 107.7566                     60.94376
## 4                 107.7566                     60.97947
## 5                 107.7566                     61.08662
## 6                 107.7566                     61.22947
##   gmean_ThermalConductivity wtd_gmean_ThermalConductivity
## 1                  7.062488                     0.6219795
## 2                 16.064228                     0.6197346
## 3                  7.062488                     0.6190947
## 4                  7.062488                     0.6205354
## 5                  7.062488                     0.6248777
## 6                  7.062488                     0.6307148
##   entropy_ThermalConductivity wtd_entropy_ThermalConductivity
## 1                   0.3081480                       0.2628483
## 2                   0.8474042                       0.5677061
## 3                   0.3081480                       0.2504774
## 4                   0.3081480                       0.2570451
## 5                   0.3081480                       0.2728199
## 6                   0.3081480                       0.2882356
##   range_ThermalConductivity wtd_range_ThermalConductivity
## 1                  399.9734                      57.12767
## 2                  429.9734                      51.41338
## 3                  399.9734                      57.12767
## 4                  399.9734                      57.12767
## 5                  399.9734                      57.12767
## 6                  399.9734                      57.12767
##   std_ThermalConductivity wtd_std_ThermalConductivity mean_Valence
## 1                168.8542                    138.5172         2.25
## 2                198.5546                    139.6309         2.00
## 3                168.8542                    138.5406         2.25
## 4                168.8542                    138.5289         2.25
## 5                168.8542                    138.4937         2.25
## 6                168.8542                    138.4466         2.25
##   wtd_mean_Valence gmean_Valence wtd_gmean_Valence entropy_Valence
## 1         2.257143      2.213364          2.219783        1.368922
## 2         2.257143      1.888175          2.210679        1.557113
## 3         2.271429      2.213364          2.232679        1.368922
## 4         2.264286      2.213364          2.226222        1.368922
## 5         2.242857      2.213364          2.206963        1.368922
## 6         2.214286      2.213364          2.181543        1.368922
##   wtd_entropy_Valence range_Valence wtd_range_Valence std_Valence
## 1            1.066221             1          1.085714   0.4330127
## 2            1.047221             2          1.128571   0.6324555
## 3            1.029175             1          1.114286   0.4330127
## 4            1.048834             1          1.100000   0.4330127
## 5            1.096052             1          1.057143   0.4330127
## 6            1.141474             1          1.000000   0.4330127
##   wtd_std_Valence
## 1       0.4370588
## 2       0.4686063
## 3       0.4446966
## 4       0.4409521
## 5       0.4288095
## 6       0.4103259
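
Dropping the target by position works as long as the column order never changes. A slightly more defensive sketch selects the column by name and checks that the remaining features are numeric and complete, which PCA requires. The column name critical_temp and the toy values below are assumptions made for illustration; the target's name is not shown in the output above.

```r
# Toy data frame standing in for train.csv (values are illustrative)
toy <- data.frame(
  number_of_elements = c(4, 5, 4),
  mean_atomic_mass   = c(88.94447, 92.72921, 88.94447),
  critical_temp      = c(29, 26, 19)  # hypothetical target column
)

target <- "critical_temp"  # assumed name of the target column
stopifnot(target %in% names(toy))

# Keep every column except the target
features <- toy[, setdiff(names(toy), target), drop = FALSE]

# PCA needs numeric input without missing values
all_numeric <- all(sapply(features, is.numeric))
no_missing  <- !anyNA(features)

dim(features)
```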

3.2. Correlation Plot

A colored correlation matrix shows how strongly the variables are correlated with one another.

data_corr <- cor(data, method = "pearson")
corrplot(data_corr, order = "alphabet", tl.cex = 0.6)

The correlation plot above shows a large number of highly correlated variable pairs, which supports the use of dimensionality reduction.
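
The visual impression can be backed by a number, for instance the count of variable pairs whose absolute correlation exceeds a cut-off. The sketch below uses toy data and a hypothetical threshold of 0.9; neither comes from the original analysis.

```r
set.seed(1)

# Toy data standing in for the 81 features (an assumption for illustration)
x1 <- rnorm(200)
toy <- data.frame(
  a = x1,
  b = x1 + rnorm(200, sd = 0.1),   # nearly a copy of a
  c = rnorm(200),                  # unrelated to the other columns
  d = -x1 + rnorm(200, sd = 0.1)   # strongly negatively related to a
)

toy_corr <- cor(toy, method = "pearson")

# Look at each pair once: upper triangle, diagonal excluded
pair_r <- toy_corr[upper.tri(toy_corr)]

threshold  <- 0.9                              # hypothetical cut-off
n_high     <- sum(abs(pair_r) > threshold)     # number of highly correlated pairs
share_high <- mean(abs(pair_r) > threshold)    # share of all pairs
```

Applied to data_corr, the same lines would report how many of the 81 × 80 / 2 = 3240 feature pairs exceed the chosen threshold.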

3.3. Standardization of the Data

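Standardizing the data before applying PCA is good practice. The features are measured on very different scales (for example, mean_Density is in the thousands, while mean_Valence is around 2), so without standardization the principal components would be dominated by the variables with the largest variances.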

preproc1 <- preProcess(data, method=c("center", "scale"))
data.s <- predict(preproc1, data)
summary(data.s)
##  number_of_elements mean_atomic_mass   wtd_mean_atomic_mass gmean_atomic_mass
##  Min.   :-2.16441   Min.   :-2.71651   Min.   :-1.9876      Min.   :-2.1260  
##  1st Qu.:-0.77484   1st Qu.:-0.50881   1st Qu.:-0.6224      1st Qu.:-0.4270  
##  Median :-0.08006   Median :-0.08879   Median :-0.3670      Median :-0.1588  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000      Mean   : 0.0000  
##  3rd Qu.: 0.61473   3rd Qu.: 0.43289   3rd Qu.: 0.3916      3rd Qu.: 0.2200  
##  Max.   : 3.39387   Max.   : 4.09155   Max.   : 4.0606      Max.   : 4.4373  
##  wtd_gmean_atomic_mass entropy_atomic_mass wtd_entropy_atomic_mass
##  Min.   :-1.5437       Min.   :-3.19406    Min.   :-2.6503        
##  1st Qu.:-0.6355       1st Qu.:-0.54512    1st Qu.:-0.7187        
##  Median :-0.5081       Median : 0.09298    Median : 0.2065        
##  Mean   : 0.0000       Mean   : 0.00000    Mean   : 0.0000        
##  3rd Qu.: 0.3976       3rd Qu.: 0.76434    3rd Qu.: 0.7362        
##  Max.   : 4.1047       Max.   : 2.24204    Max.   : 2.2279        
##  range_atomic_mass wtd_range_atomic_mass std_atomic_mass    wtd_std_atomic_mass
##  Min.   :-2.1162   Min.   :-1.2320       Min.   :-2.21567   Min.   :-2.0741    
##  1st Qu.:-0.6789   1st Qu.:-0.6082       1st Qu.:-0.57406   1st Qu.:-0.6460    
##  Median : 0.1337   Median :-0.2443       Median : 0.03652   Median : 0.1420    
##  Mean   : 0.0000   Mean   : 0.0000       Mean   : 0.00000   Mean   : 0.0000    
##  3rd Qu.: 0.7051   3rd Qu.: 0.1903       3rd Qu.: 0.74523   3rd Qu.: 0.6096    
##  Max.   : 1.6909   Max.   : 6.3915       Max.   : 2.82638   Max.   : 2.9810    
##     mean_fie         wtd_mean_fie       gmean_fie       wtd_gmean_fie    
##  Min.   :-4.50475   Min.   :-3.4544   Min.   :-4.6213   Min.   :-3.8178  
##  1st Qu.:-0.52435   1st Qu.:-0.9178   1st Qu.:-0.5737   1st Qu.:-0.9406  
##  Median :-0.05389   Median : 0.1363   Median :-0.1215   Median : 0.1956  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.30524   3rd Qu.: 0.9330   3rd Qu.: 0.3605   3rd Qu.: 0.8750  
##  Max.   : 6.21206   Max.   : 3.3333   Max.   : 7.3490   Max.   : 4.1314  
##   entropy_fie      wtd_entropy_fie      range_fie       wtd_range_fie    
##  Min.   :-3.4016   Min.   :-2.77448   Min.   :-1.8482   Min.   :-2.1581  
##  1st Qu.:-0.5585   1st Qu.:-0.51784   1st Qu.:-1.0007   1st Qu.:-0.8589  
##  Median : 0.1494   Median :-0.02959   Median : 0.6197   Median : 0.1202  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6597   3rd Qu.: 0.40424   3rd Qu.: 0.7699   3rd Qu.: 0.9248  
##  Max.   : 2.2480   Max.   : 3.32867   Max.   : 2.3651   Max.   : 3.4294  
##     std_fie         wtd_std_fie      mean_atomic_radius wtd_mean_atomic_radius
##  Min.   :-1.9609   Min.   :-1.7514   Min.   :-5.4590    Min.   :-3.0109       
##  1st Qu.:-0.9230   1st Qu.:-1.0245   1st Qu.:-0.4293    1st Qu.:-0.7844       
##  Median : 0.4614   Median : 0.2689   Median : 0.1125    Median :-0.3038       
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000    Mean   : 0.0000       
##  3rd Qu.: 0.7465   3rd Qu.: 0.9271   3rd Qu.: 0.5894    3rd Qu.: 0.8175       
##  Max.   : 2.5830   Max.   : 1.9942   Max.   : 6.9497    Max.   : 5.6691       
##  gmean_atomic_radius wtd_gmean_atomic_radius entropy_atomic_radius
##  Min.   :-4.36598    Min.   :-2.0367         Min.   :-3.3770      
##  1st Qu.:-0.49370    1st Qu.:-0.8868         1st Qu.:-0.5364      
##  Median :-0.07429    Median :-0.2179         Median : 0.1678      
##  Mean   : 0.00000    Mean   : 0.0000         Mean   : 0.0000      
##  3rd Qu.: 0.52010    3rd Qu.: 0.8371         3rd Qu.: 0.6515      
##  Max.   : 6.95086    Max.   : 4.9392         Max.   : 2.3287      
##  wtd_entropy_atomic_radius range_atomic_radius wtd_range_atomic_radius
##  Min.   :-2.7781           Min.   :-2.0711     Min.   :-1.4669        
##  1st Qu.:-0.6851           1st Qu.:-0.8819     1st Qu.:-0.6503        
##  Median : 0.2744           Median : 0.4708     Median :-0.2390        
##  Mean   : 0.0000           Mean   : 0.0000     Mean   : 0.0000        
##  3rd Qu.: 0.7234           3rd Qu.: 0.9763     3rd Qu.: 0.2528        
##  Max.   : 1.8976           Max.   : 1.7344     Max.   : 5.3911        
##  std_atomic_radius wtd_std_atomic_radius  mean_Density     wtd_mean_Density 
##  Min.   :-2.2535   Min.   :-2.0692       Min.   :-2.1463   Min.   :-1.6347  
##  1st Qu.:-0.7201   1st Qu.:-0.8035       1st Qu.:-0.5613   1st Qu.:-0.7041  
##  Median : 0.3084   Median : 0.3002       Median :-0.2748   Median :-0.2992  
##  Mean   : 0.0000   Mean   : 0.0000       Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7784   3rd Qu.: 0.8475       3rd Qu.: 0.2166   3rd Qu.: 0.3567  
##  Max.   : 2.7905   Max.   : 1.7711       Max.   : 5.7885   Max.   : 5.3776  
##  gmean_Density     wtd_gmean_Density entropy_Density    wtd_entropy_Density
##  Min.   :-0.9341   Min.   :-0.7840   Min.   :-3.13248   Min.   :-2.67712   
##  1st Qu.:-0.6960   1st Qu.:-0.7674   1st Qu.:-0.46287   1st Qu.:-0.52334   
##  Median :-0.5727   Median :-0.4030   Median : 0.05312   Median : 0.08353   
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   
##  3rd Qu.: 0.6303   3rd Qu.: 0.6663   3rd Qu.: 0.73463   3rd Qu.: 0.70335   
##  Max.   : 5.1655   Max.   : 4.8987   Max.   : 2.57589   Max.   : 2.65006   
##  range_Density      wtd_range_Density  std_Density       wtd_std_Density  
##  Min.   :-2.11500   Min.   :-1.2102   Min.   :-2.04162   Min.   :-2.0593  
##  1st Qu.:-0.49240   1st Qu.:-0.5195   1st Qu.:-0.35696   1st Qu.:-0.4683  
##  Median : 0.07155   Median :-0.3418   Median :-0.06873   Median : 0.1901  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.27169   3rd Qu.: 0.2111   3rd Qu.: 0.35095   3rd Qu.: 0.3971  
##  Max.   : 3.39827   Max.   : 8.1433   Max.   : 4.36625   Max.   : 4.3999  
##  mean_ElectronAffinity wtd_mean_ElectronAffinity gmean_ElectronAffinity
##  Min.   :-2.7211       Min.   :-2.8261           Min.   :-1.82227      
##  1st Qu.:-0.5339       1st Qu.:-0.6001           1st Qu.:-0.71220      
##  Median :-0.1364       Median : 0.3141           Median :-0.09961      
##  Mean   : 0.0000       Mean   : 0.0000           Mean   : 0.00000      
##  3rd Qu.: 0.3113       3rd Qu.: 0.5583           3rd Qu.: 0.45321      
##  Max.   : 8.9965       Max.   : 7.2308           Max.   : 9.36796      
##  wtd_gmean_ElectronAffinity entropy_ElectronAffinity
##  Min.   :-2.24075           Min.   :-3.1167         
##  1st Qu.:-0.68389           1st Qu.:-0.5232         
##  Median : 0.02394           Median : 0.1981         
##  Mean   : 0.00000           Mean   : 0.0000         
##  3rd Qu.: 0.55483           3rd Qu.: 0.8027         
##  Max.   : 8.01568           Max.   : 2.0312         
##  wtd_entropy_ElectronAffinity range_ElectronAffinity wtd_range_ElectronAffinity
##  Min.   :-2.69508             Min.   :-2.0567        Min.   :-2.0731           
##  1st Qu.:-0.38497             1st Qu.:-0.5797        1st Qu.:-0.8839           
##  Median : 0.03653             Median : 0.1077        Median : 0.4131           
##  Mean   : 0.00000             Mean   : 0.0000        Mean   : 0.0000           
##  3rd Qu.: 0.37339             3rd Qu.: 0.3049        3rd Qu.: 0.6071           
##  Max.   : 3.16324             Max.   : 3.8887        Max.   : 5.5682           
##  std_ElectronAffinity wtd_std_ElectronAffinity mean_FusionHeat  
##  Min.   :-2.2498      Min.   :-2.1738          Min.   :-1.2455  
##  1st Qu.:-0.4848      1st Qu.:-0.5369          1st Qu.:-0.5936  
##  Median : 0.1018      Median : 0.1772          Median :-0.4417  
##  Mean   : 0.0000      Mean   : 0.0000          Mean   : 0.0000  
##  3rd Qu.: 0.3362      3rd Qu.: 0.4362          3rd Qu.: 0.2494  
##  Max.   : 5.2429      Max.   : 6.1023          Max.   : 8.0268  
##  wtd_mean_FusionHeat gmean_FusionHeat  wtd_gmean_FusionHeat entropy_FusionHeat
##  Min.   :-0.9542     Min.   :-0.9850   Min.   :-0.7552      Min.   :-2.90835  
##  1st Qu.:-0.6173     1st Qu.:-0.5988   1st Qu.:-0.6715      1st Qu.:-0.69164  
##  Median :-0.3864     Median :-0.4852   Median :-0.3968      Median : 0.04989  
##  Mean   : 0.0000     Mean   : 0.0000   Mean   : 0.0000      Mean   : 0.00000  
##  3rd Qu.: 0.3268     3rd Qu.: 0.3440   3rd Qu.: 0.4787      3rd Qu.: 0.75750  
##  Max.   : 6.3835     Max.   : 9.4242   Max.   : 7.2224      Max.   : 2.50329  
##  wtd_entropy_FusionHeat range_FusionHeat  wtd_range_FusionHeat
##  Min.   :-2.4696        Min.   :-1.0377   Min.   :-0.7200     
##  1st Qu.:-0.6520        1st Qu.:-0.4055   1st Qu.:-0.5160     
##  Median : 0.2187        Median :-0.4055   Median :-0.4190     
##  Mean   : 0.0000        Mean   : 0.0000   Mean   : 0.0000     
##  3rd Qu.: 0.6574        3rd Qu.: 0.1012   3rd Qu.: 0.1998     
##  Max.   : 2.2509        Max.   : 4.1059   Max.   : 8.2754     
##  std_FusionHeat     wtd_std_FusionHeat mean_ThermalConductivity
##  Min.   :-0.95983   Min.   :-1.05891   Min.   :-2.3283         
##  1st Qu.:-0.46842   1st Qu.:-0.42728   1st Qu.:-0.7453         
##  Median :-0.38922   Median :-0.30418   Median : 0.1765         
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000         
##  3rd Qu.: 0.08279   3rd Qu.: 0.04116   3rd Qu.: 0.5530         
##  Max.   : 4.99463   Max.   : 6.03203   Max.   : 6.3035         
##  wtd_mean_ThermalConductivity gmean_ThermalConductivity
##  Min.   :-1.7909              Min.   :-0.8754          
##  1st Qu.:-0.6012              1st Qu.:-0.6313          
##  Median :-0.1805              Median :-0.4567          
##  Mean   : 0.0000              Mean   : 0.0000          
##  3rd Qu.: 0.3848              3rd Qu.: 0.3679          
##  Max.   : 7.1489              Max.   : 8.4570          
##  wtd_gmean_ThermalConductivity entropy_ThermalConductivity
##  Min.   :-0.6789               Min.   :-2.23216           
##  1st Qu.:-0.6524               1st Qu.:-0.82773           
##  Median :-0.5278               Median : 0.03394           
##  Mean   : 0.0000               Mean   : 0.00000           
##  3rd Qu.: 0.4976               3rd Qu.: 0.71965           
##  Max.   : 8.6767               Max.   : 2.78041           
##  wtd_entropy_ThermalConductivity range_ThermalConductivity
##  Min.   :-1.6968                 Min.   :-1.5809          
##  1st Qu.:-0.9091                 1st Qu.:-1.0366          
##  Median : 0.0182                 Median : 0.9382          
##  Mean   : 0.0000                 Mean   : 0.0000          
##  3rd Qu.: 0.7458                 3rd Qu.: 0.9394          
##  Max.   : 3.3716                 Max.   : 1.1284          
##  wtd_range_ThermalConductivity std_ThermalConductivity
##  Min.   :-1.4385               Min.   :-1.6451        
##  1st Qu.:-0.7579               1st Qu.:-1.0144        
##  Median :-0.1270               Median : 0.6122        
##  Mean   : 0.0000               Mean   : 0.0000        
##  3rd Qu.: 0.6919               3rd Qu.: 0.9122        
##  Max.   : 7.8706               Max.   : 1.9294        
##  wtd_std_ThermalConductivity  mean_Valence     wtd_mean_Valence 
##  Min.   :-1.5105             Min.   :-2.1044   Min.   :-1.8075  
##  1st Qu.:-1.0084             1st Qu.:-0.8280   1st Qu.:-0.8700  
##  Median : 0.2719             Median :-0.3493   Median :-0.4491  
##  Mean   : 0.0000             Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.0434             3rd Qu.: 0.7675   3rd Qu.: 0.7329  
##  Max.   : 1.8375             Max.   : 3.6394   Max.   : 3.2293  
##  gmean_Valence     wtd_gmean_Valence entropy_Valence   wtd_entropy_Valence
##  Min.   :-1.9656   Min.   :-1.7500   Min.   :-3.2956   Min.   :-2.7685    
##  1st Qu.:-0.7425   1st Qu.:-0.8211   1st Qu.:-0.5973   1st Qu.:-0.7288    
##  Median :-0.4217   Median :-0.5293   Median : 0.1863   Median : 0.2990    
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000    
##  3rd Qu.: 0.6417   3rd Qu.: 0.7312   3rd Qu.: 0.7461   3rd Qu.: 0.7309    
##  Max.   : 3.7691   Max.   : 3.3572   Max.   : 2.1525   Max.   : 2.3585    
##  range_Valence      wtd_range_Valence  std_Valence       wtd_std_Valence  
##  Min.   :-1.64287   Min.   :-1.5161   Min.   :-1.73176   Min.   :-1.4794  
##  1st Qu.:-0.83794   1st Qu.:-0.5741   1st Qu.:-0.79969   1st Qu.:-0.8058  
##  Median :-0.03301   Median :-0.4293   Median :-0.08117   Median :-0.3819  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.77192   3rd Qu.: 0.4451   3rd Qu.: 0.74412   3rd Qu.: 0.7605  
##  Max.   : 3.18671   Max.   : 5.6321   Max.   : 4.45794   Max.   : 5.1056

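As expected, all means are equal to 0. The standard deviations, which summary() does not display, are all equal to 1 by construction.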
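
The effect of centering and scaling can be checked directly. The sketch below uses base R's scale() on toy data rather than caret's preProcess(); it is assumed here that both apply the same transformation (subtract the column mean, divide by the sample standard deviation). The toy column names and values are illustrative.

```r
set.seed(42)

# Two toy features on very different scales (values are illustrative)
toy <- data.frame(
  density = rnorm(100, mean = 5000, sd = 3000),
  valence = rnorm(100, mean = 2, sd = 0.5)
)

# Center each column to mean 0 and scale it to standard deviation 1
toy_s <- as.data.frame(scale(toy, center = TRUE, scale = TRUE))

col_means <- colMeans(toy_s)     # numerically zero
col_sds   <- sapply(toy_s, sd)   # equal to 1

round(col_means, 10)
col_sds
```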

Chapter 4. Principal Component Analysis

4.1. Covariance Matrix, Eigenvectors and Eigenvalues

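First, the eigenvectors and eigenvalues will be presented for illustrative purposes. Because the data have been standardized, the covariance matrix computed here is identical to the correlation matrix of the original features.

Eigenvectors indicate the directions along which the data varies. The first eigenvector points in the direction of maximum variance, the second in the next most important direction, and so on. These eigenvectors define the directions of the principal components. Each column of data.eigen$vectors is one eigenvector; head() prints only the first six rows, that is, the weights of the first six features in every eigenvector.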
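
The link between eigenvectors, eigenvalues and principal components can be checked on a small example. The sketch below uses three toy variables (an assumption for illustration) and compares a manual eigendecomposition with prcomp() from the stats package; eigenvectors are only defined up to sign, so absolute values are compared.

```r
set.seed(123)
n <- 300
z <- rnorm(n)

# Toy data: x1 and x2 share a common signal, x3 is unrelated
toy <- data.frame(
  x1 = z + rnorm(n, sd = 0.3),
  x2 = z + rnorm(n, sd = 0.3),
  x3 = rnorm(n)
)
toy_s <- scale(toy)

# Manual route: eigendecomposition of the covariance matrix
toy_eigen <- eigen(cov(toy_s))
prop_var  <- toy_eigen$values / sum(toy_eigen$values)  # share of variance per PC

# Reference route: prcomp() on the same standardized data
toy_pca <- prcomp(toy_s)

# Eigenvalues should equal the squared standard deviations of the PCs
same_values <- isTRUE(all.equal(toy_eigen$values, toy_pca$sdev^2))

# Eigenvectors should equal the rotation matrix up to sign
max_diff <- max(abs(abs(toy_eigen$vectors) - abs(unname(toy_pca$rotation))))

round(prop_var, 3)
```

For standardized data the eigenvalues sum to the number of variables (3 here, 81 in the full dataset), which is why each eigenvalue divided by that sum gives the share of total variance explained.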

data.cov <- cov(data.s)
data.eigen <- eigen(data.cov)
head(data.eigen$vectors)
##             [,1]        [,2]         [,3]        [,4]          [,5]        [,6]
## [1,]  0.15577833 -0.09185596 -0.075197122  0.05764604 -0.0115456693  0.06250960
## [2,] -0.05181305 -0.22647459 -0.007334317 -0.18303403 -0.0350782024  0.05190857
## [3,] -0.09916962 -0.19990777 -0.055080664 -0.14724811  0.0285920161 -0.06732956
## [4,] -0.08332166 -0.21894420  0.025325496 -0.15064367  0.0040613110  0.08519657
## [5,] -0.12006367 -0.18477968 -0.028652254 -0.10861815  0.0440689581 -0.03733811
## [6,]  0.14641854 -0.12255116 -0.059729607  0.06044047  0.0009406503  0.12905797
##             [,7]         [,8]       [,9]       [,10]        [,11]       [,12]
## [1,] -0.05749010 -0.020053317 0.05300080  0.07068765 -0.007686226 -0.01966041
## [2,]  0.07510766  0.129593196 0.06171558 -0.10585155  0.091201103 -0.27748970
## [3,] -0.02037932  0.005283779 0.12665782 -0.07590331  0.031584554 -0.27831251
## [4,] -0.02370737  0.122484110 0.04140090 -0.06334025  0.134187985 -0.24640830
## [5,] -0.08878995 -0.002823301 0.10087474 -0.05350412  0.074966395 -0.24007966
## [6,] -0.12944835 -0.035694539 0.04743043  0.07473660  0.013968364  0.03171565
##             [,13]        [,14]        [,15]       [,16]        [,17]
## [1,]  0.037028950 -0.015896111  0.071093744 -0.06819561  0.003106938
## [2,] -0.009700472  0.083163916  0.092956904 -0.02373368  0.194171892
## [3,] -0.023166236  0.005116402 -0.002229394  0.09083238 -0.074176333
## [4,]  0.051504940  0.075185579  0.007247210 -0.01006571  0.180487564
## [5,]  0.029004056  0.003923106 -0.039131496  0.04008825 -0.024278994
## [6,]  0.074084689  0.006632305 -0.022041575 -0.04235568  0.015064226
##            [,18]       [,19]       [,20]        [,21]      [,22]        [,23]
## [1,]  0.09436509  0.04728067  0.08039423  0.274842325 0.04272168 -0.038649523
## [2,]  0.06041675 -0.01539508  0.06032762 -0.067004634 0.15787768  0.032812990
## [3,] -0.06606138  0.12731751  0.08161066  0.046309954 0.14417047 -0.047382809
## [4,]  0.04563700  0.04699556  0.04172270 -0.062342169 0.20203413 -0.002252252
## [5,] -0.07484941  0.14376597  0.09394998  0.024250357 0.13693565  0.057376253
## [6,]  0.04371487  0.04917930 -0.01812150  0.005682592 0.10713577 -0.119481052
##            [,24]        [,25]       [,26]       [,27]       [,28]       [,29]
## [1,]  0.03175957  0.099262665 -0.09463002 -0.19845564  0.03976466 -0.01875030
## [2,] -0.06834167  0.100334777  0.04753571  0.03563076  0.18857806 -0.10956745
## [3,]  0.04095633 -0.009716354 -0.10960098  0.03655163 -0.06356073 -0.04829886
## [4,] -0.02626309  0.064763591  0.02486778  0.08700908  0.17419545 -0.12858283
## [5,]  0.05834615  0.017286915 -0.13774106  0.04932558 -0.05263465 -0.04296815
## [6,]  0.02147843  0.012737578 -0.06607246 -0.06989853  0.03432158 -0.05295695
##            [,30]       [,31]        [,32]       [,33]        [,34]
## [1,]  0.03922501 -0.01149625  0.068957020 -0.08880295 -0.050859692
## [2,] -0.16515906  0.09309351 -0.038393112  0.04125160 -0.015996726
## [3,]  0.02959113  0.01397550 -0.031840533 -0.00327170  0.027465885
## [4,] -0.10952550  0.13815365 -0.008433746  0.03117466  0.007668515
## [5,] -0.01359635 -0.03005260 -0.075900779 -0.05214849 -0.031831129
## [6,] -0.03336701  0.05377304  0.096626275 -0.04331764 -0.009472547
##              [,35]        [,36]       [,37]       [,38]        [,39]
## [1,]  1.369521e-01  0.181517577  0.10198772  0.32779715  0.183898810
## [2,]  1.763473e-01 -0.007943967  0.15307184 -0.06681902  0.003866377
## [3,] -2.868083e-01  0.062368283  0.01765606  0.05966228  0.094260197
## [4,]  9.500306e-02 -0.087629697  0.08871157 -0.09557705 -0.050713547
## [5,] -3.280077e-01  0.008475281 -0.08647074  0.03924441 -0.046502277
## [6,] -4.783907e-05 -0.171122795 -0.15112147 -0.09597421  0.040085943
##            [,40]       [,41]       [,42]       [,43]       [,44]       [,45]
## [1,] -0.08224931 -0.06104958  0.12579853  0.41718545 -0.10147619  0.01071707
## [2,] -0.10729696  0.03487414  0.01969472  0.05685173 -0.02352946 -0.22086740
## [3,]  0.01209019  0.01395026  0.09684873  0.08895304  0.06110343  0.11358388
## [4,]  0.09819595  0.01397741 -0.02655539 -0.06859422 -0.01229771 -0.01469707
## [5,]  0.02778074 -0.06936179 -0.02960670  0.05821517  0.03310120  0.19506803
## [6,]  0.19631764 -0.01520802 -0.04772756 -0.19610793  0.13092696  0.09991225
##            [,46]       [,47]       [,48]        [,49]       [,50]       [,51]
## [1,]  0.08240012  0.18764849 -0.12510331  0.149117714 -0.17651661  0.17365403
## [2,] -0.10648996  0.15697834  0.09684737 -0.111467440  0.03882797 -0.04709902
## [3,]  0.03306138 -0.01626207  0.04152485 -0.138748512  0.14512737 -0.01300247
## [4,] -0.05205525 -0.07635537 -0.15359719  0.159079888 -0.12927260  0.03945734
## [5,]  0.15683292 -0.10837556 -0.04495233 -0.001073976  0.02111025  0.03857396
## [6,] -0.08916072 -0.02328785 -0.06527110  0.191706232 -0.08988014  0.08991247
##            [,52]       [,53]        [,54]       [,55]       [,56]        [,57]
## [1,] -0.14372211  0.09378129 -0.120261483  0.01873669 -0.03762937  0.033431935
## [2,] -0.11996732 -0.03222809 -0.009045325  0.06071338  0.24706132  0.018006944
## [3,]  0.03857375 -0.11404672 -0.110463848 -0.02701829  0.08895578 -0.070160458
## [4,] -0.03625125  0.09196424  0.179557668  0.04964075 -0.18509302  0.068699797
## [5,]  0.12991193  0.06616678 -0.061487455 -0.02155713 -0.18226460 -0.006601438
## [6,]  0.16180857  0.19338827 -0.104884853 -0.11690311  0.06885462 -0.333224985
##            [,58]         [,59]       [,60]       [,61]       [,62]       [,63]
## [1,]  0.05624044 -0.1045509899 -0.03410281  0.27177086 -0.03824890 -0.14006222
## [2,]  0.16194934  0.0233984971  0.04401181 -0.01169217  0.12588152  0.14209552
## [3,] -0.23618103  0.0524883695 -0.06250234  0.02708919  0.19157803  0.07787844
## [4,]  0.14334482 -0.0762401708  0.07877092 -0.02587727 -0.16246368 -0.14304949
## [5,] -0.24428371 -0.0005254918 -0.07356478 -0.01527817 -0.08445826 -0.12086654
## [6,]  0.16411891  0.0053891474 -0.16251028 -0.05361563 -0.28110332 -0.03790376
##            [,64]       [,65]       [,66]       [,67]        [,68]        [,69]
## [1,] -0.11477045  0.03996614  0.06913216  0.01052621 -0.009094840  0.087821088
## [2,]  0.05492107 -0.11247947  0.26072536 -0.11621543  0.003411829 -0.002252168
## [3,] -0.01250183 -0.11573176  0.18674041 -0.17220867  0.048057093 -0.004529280
## [4,]  0.01418947  0.18233058 -0.32203625  0.16354531  0.005364360 -0.021455146
## [5,] -0.06120593  0.05284737 -0.13822259  0.10269399 -0.059232912  0.018650239
## [6,] -0.02899267 -0.07186486  0.18557519 -0.40143572  0.149390270 -0.121237015
##            [,70]        [,71]       [,72]       [,73]        [,74]        [,75]
## [1,]  0.03076486 -0.003101844  0.03933664  0.11247979 -0.007909514  0.009523201
## [2,] -0.11879442  0.049499259 -0.07976154  0.02081370  0.026474043  0.089030328
## [3,] -0.10285875 -0.126230314 -0.01073326 -0.01072977 -0.016187438 -0.126835170
## [4,]  0.17566907 -0.061015274  0.06090120 -0.06512625 -0.048542444 -0.097633627
## [5,]  0.05355546  0.156795194  0.04645022  0.06016132  0.040803667  0.142096723
## [6,] -0.05498821 -0.080377512 -0.06469620  0.09748688 -0.004917660  0.033189371
##            [,76]       [,77]       [,78]        [,79]        [,80]        [,81]
## [1,] -0.02787139  0.01232390  0.01941766 -0.007987346  0.001680979  0.002106604
## [2,]  0.35061341 -0.02027277 -0.03671681 -0.035040961 -0.041983137 -0.009060502
## [3,] -0.52015400  0.02704500  0.04119994  0.068238083  0.063916108  0.013608959
## [4,] -0.34009207  0.01362914  0.03491495  0.036518203  0.046736110  0.006179888
## [5,]  0.53447723 -0.02165557 -0.04359082 -0.075933904 -0.075731457 -0.006908326
## [6,]  0.04022614  0.06054435 -0.02738245  0.001577812 -0.005263212  0.002731923

Eigenvalues indicate how much of the dataset’s variance is captured by each eigenvector. Larger eigenvalues correspond to directions that explain more variance and are therefore more important in describing the structure of the data. eigen() returns them in decreasing order, so the k-th value belongs to the k-th eigenvector column.

data.eigen$values
##  [1] 3.153476e+01 8.490589e+00 7.712172e+00 6.405261e+00 4.764568e+00
##  [6] 3.068578e+00 2.934024e+00 2.516757e+00 1.912501e+00 1.602082e+00
## [11] 1.480977e+00 1.181464e+00 9.545415e-01 8.088287e-01 7.958057e-01
## [16] 6.330308e-01 5.822221e-01 4.402077e-01 3.914952e-01 3.091580e-01
## [21] 2.448415e-01 2.321071e-01 2.077513e-01 1.677602e-01 1.597464e-01
## [26] 1.508934e-01 1.377307e-01 1.155001e-01 1.022980e-01 9.325272e-02
## [31] 8.295171e-02 7.779386e-02 7.446023e-02 5.822024e-02 5.547902e-02
## [36] 5.029370e-02 4.623229e-02 3.993050e-02 3.538391e-02 3.423715e-02
## [41] 2.641669e-02 2.478717e-02 2.081580e-02 1.923129e-02 1.810621e-02
## [46] 1.744477e-02 1.591769e-02 1.519301e-02 1.464169e-02 1.419336e-02
## [51] 1.266700e-02 1.245829e-02 1.026368e-02 9.727861e-03 9.546853e-03
## [56] 8.542345e-03 7.230276e-03 6.590798e-03 6.472651e-03 5.816789e-03
## [61] 5.246126e-03 4.608991e-03 3.593798e-03 3.561390e-03 3.192283e-03
## [66] 2.861282e-03 2.607661e-03 2.278549e-03 1.846978e-03 1.667576e-03
## [71] 1.468079e-03 1.351208e-03 1.195417e-03 7.477059e-04 6.163952e-04
## [76] 4.464905e-04 3.309954e-04 1.866445e-04 1.201482e-04 7.401160e-05
## [81] 4.959679e-05
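
The properties described above can be checked on a small example. The sketch below is only an illustration: it uses the built-in mtcars data instead of the superconductor dataset, and all object names (X, eig, V, lambda) are assumptions introduced here. It verifies that the eigenvectors are orthonormal and that, for standardized data, the eigenvalues sum to the number of variables.

```r
# Toy illustration on a built-in dataset (mtcars), not the superconductor data.
X <- scale(mtcars)                 # centre and scale every column
eig <- eigen(cov(X))               # covariance of standardized data = correlation matrix

V <- eig$vectors                   # columns are the eigenvectors
lambda <- eig$values               # eigenvalues, returned in decreasing order

# Eigenvectors are orthonormal: t(V) %*% V is the identity matrix
ortho_ok <- isTRUE(all.equal(crossprod(V), diag(ncol(X))))

# For standardized data the eigenvalues sum to the number of variables
total_var <- sum(lambda)

# Share of the total variance captured by each direction
prop_var <- lambda / total_var

print(ortho_ok)
print(total_var)
print(round(prop_var, 3))
```

Dividing each eigenvalue by their sum gives the same quantity that summary() of a prcomp object reports as "Proportion of Variance".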

4.2. PCA

Time to apply Principal Component Analysis.

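First, the loadings of the first six original variables on each component are presented below. The columns match the eigenvectors from Section 4.1 up to sign (for example, the PC2 column is the second eigenvector multiplied by -1). The sign of an eigenvector is arbitrary, so both outputs describe the same directions.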

data_pca <- prcomp(data.s, center=FALSE, scale.=FALSE)
head(data_pca$rotation)
##                               PC1        PC2          PC3         PC4
## number_of_elements     0.15577833 0.09185596  0.075197122 -0.05764604
## mean_atomic_mass      -0.05181305 0.22647459  0.007334317  0.18303403
## wtd_mean_atomic_mass  -0.09916962 0.19990777  0.055080664  0.14724811
## gmean_atomic_mass     -0.08332166 0.21894420 -0.025325496  0.15064367
## wtd_gmean_atomic_mass -0.12006367 0.18477968  0.028652254  0.10861815
## entropy_atomic_mass    0.14641854 0.12255116  0.059729607 -0.06044047
##                                 PC5         PC6         PC7          PC8
## number_of_elements     0.0115456693 -0.06250960  0.05749010 -0.020053317
## mean_atomic_mass       0.0350782024 -0.05190857 -0.07510766  0.129593196
## wtd_mean_atomic_mass  -0.0285920161  0.06732956  0.02037932  0.005283779
## gmean_atomic_mass     -0.0040613110 -0.08519657  0.02370737  0.122484110
## wtd_gmean_atomic_mass -0.0440689581  0.03733811  0.08878995 -0.002823301
## entropy_atomic_mass   -0.0009406503 -0.12905797  0.12944835 -0.035694539
##                              PC9        PC10         PC11        PC12
## number_of_elements    0.05300080 -0.07068765  0.007686226 -0.01966041
## mean_atomic_mass      0.06171558  0.10585155 -0.091201103 -0.27748970
## wtd_mean_atomic_mass  0.12665782  0.07590331 -0.031584554 -0.27831251
## gmean_atomic_mass     0.04140090  0.06334025 -0.134187985 -0.24640830
## wtd_gmean_atomic_mass 0.10087474  0.05350412 -0.074966395 -0.24007966
## entropy_atomic_mass   0.04743043 -0.07473660 -0.013968364  0.03171565
##                               PC13         PC14         PC15        PC16
## number_of_elements     0.037028950 -0.015896111 -0.071093744  0.06819561
## mean_atomic_mass      -0.009700472  0.083163916 -0.092956904  0.02373368
## wtd_mean_atomic_mass  -0.023166236  0.005116402  0.002229394 -0.09083238
## gmean_atomic_mass      0.051504940  0.075185579 -0.007247210  0.01006571
## wtd_gmean_atomic_mass  0.029004056  0.003923106  0.039131496 -0.04008825
## entropy_atomic_mass    0.074084689  0.006632305  0.022041575  0.04235568
##                               PC17        PC18        PC19        PC20
## number_of_elements    -0.003106938  0.09436509 -0.04728067  0.08039423
## mean_atomic_mass      -0.194171892  0.06041675  0.01539508  0.06032762
## wtd_mean_atomic_mass   0.074176333 -0.06606138 -0.12731751  0.08161066
## gmean_atomic_mass     -0.180487564  0.04563700 -0.04699556  0.04172270
## wtd_gmean_atomic_mass  0.024278994 -0.07484941 -0.14376597  0.09394998
## entropy_atomic_mass   -0.015064226  0.04371487 -0.04917930 -0.01812150
##                               PC21       PC22         PC23        PC24
## number_of_elements    -0.274842325 0.04272168 -0.038649523  0.03175957
## mean_atomic_mass       0.067004634 0.15787768  0.032812990 -0.06834167
## wtd_mean_atomic_mass  -0.046309954 0.14417047 -0.047382809  0.04095633
## gmean_atomic_mass      0.062342169 0.20203413 -0.002252252 -0.02626309
## wtd_gmean_atomic_mass -0.024250357 0.13693565  0.057376253  0.05834615
## entropy_atomic_mass   -0.005682592 0.10713577 -0.119481052  0.02147843
##                               PC25        PC26        PC27        PC28
## number_of_elements    -0.099262665  0.09463002 -0.19845564  0.03976466
## mean_atomic_mass      -0.100334777 -0.04753571  0.03563076  0.18857806
## wtd_mean_atomic_mass   0.009716354  0.10960098  0.03655163 -0.06356073
## gmean_atomic_mass     -0.064763591 -0.02486778  0.08700908  0.17419545
## wtd_gmean_atomic_mass -0.017286915  0.13774106  0.04932558 -0.05263465
## entropy_atomic_mass   -0.012737578  0.06607246 -0.06989853  0.03432158
##                              PC29        PC30        PC31         PC32
## number_of_elements    -0.01875030  0.03922501 -0.01149625 -0.068957020
## mean_atomic_mass      -0.10956745 -0.16515906  0.09309351  0.038393112
## wtd_mean_atomic_mass  -0.04829886  0.02959113  0.01397550  0.031840533
## gmean_atomic_mass     -0.12858283 -0.10952550  0.13815365  0.008433746
## wtd_gmean_atomic_mass -0.04296815 -0.01359635 -0.03005260  0.075900779
## entropy_atomic_mass   -0.05295695 -0.03336701  0.05377304 -0.096626275
##                              PC33         PC34          PC35         PC36
## number_of_elements     0.08880295  0.050859692  1.369521e-01 -0.181517577
## mean_atomic_mass      -0.04125160  0.015996726  1.763473e-01  0.007943967
## wtd_mean_atomic_mass   0.00327170 -0.027465885 -2.868083e-01 -0.062368283
## gmean_atomic_mass     -0.03117466 -0.007668515  9.500306e-02  0.087629697
## wtd_gmean_atomic_mass  0.05214849  0.031831129 -3.280077e-01 -0.008475281
## entropy_atomic_mass    0.04331764  0.009472547 -4.783907e-05  0.171122795
##                              PC37        PC38         PC39        PC40
## number_of_elements     0.10198772  0.32779715 -0.183898810 -0.08224931
## mean_atomic_mass       0.15307184 -0.06681902 -0.003866377 -0.10729696
## wtd_mean_atomic_mass   0.01765606  0.05966228 -0.094260197  0.01209019
## gmean_atomic_mass      0.08871157 -0.09557705  0.050713547  0.09819595
## wtd_gmean_atomic_mass -0.08647074  0.03924441  0.046502277  0.02778074
## entropy_atomic_mass   -0.15112147 -0.09597421 -0.040085943  0.19631764
##                              PC41        PC42        PC43        PC44
## number_of_elements     0.06104958  0.12579853  0.41718545 -0.10147619
## mean_atomic_mass      -0.03487414  0.01969472  0.05685173 -0.02352946
## wtd_mean_atomic_mass  -0.01395026  0.09684873  0.08895304  0.06110343
## gmean_atomic_mass     -0.01397741 -0.02655539 -0.06859422 -0.01229771
## wtd_gmean_atomic_mass  0.06936179 -0.02960670  0.05821517  0.03310120
## entropy_atomic_mass    0.01520802 -0.04772756 -0.19610793  0.13092696
##                              PC45        PC46        PC47        PC48
## number_of_elements    -0.01071707 -0.08240012 -0.18764849 -0.12510331
## mean_atomic_mass       0.22086740  0.10648996 -0.15697834  0.09684737
## wtd_mean_atomic_mass  -0.11358388 -0.03306138  0.01626207  0.04152485
## gmean_atomic_mass      0.01469707  0.05205525  0.07635537 -0.15359719
## wtd_gmean_atomic_mass -0.19506803 -0.15683292  0.10837556 -0.04495233
## entropy_atomic_mass   -0.09991225  0.08916072  0.02328785 -0.06527110
##                               PC49        PC50        PC51        PC52
## number_of_elements    -0.149117714 -0.17651661  0.17365403  0.14372211
## mean_atomic_mass       0.111467440  0.03882797 -0.04709902  0.11996732
## wtd_mean_atomic_mass   0.138748512  0.14512737 -0.01300247 -0.03857375
## gmean_atomic_mass     -0.159079888 -0.12927260  0.03945734  0.03625125
## wtd_gmean_atomic_mass  0.001073976  0.02111025  0.03857396 -0.12991193
## entropy_atomic_mass   -0.191706232 -0.08988014  0.08991247 -0.16180857
##                              PC53         PC54        PC55        PC56
## number_of_elements     0.09378129  0.120261483  0.01873669  0.03762937
## mean_atomic_mass      -0.03222809  0.009045325  0.06071338 -0.24706132
## wtd_mean_atomic_mass  -0.11404672  0.110463848 -0.02701829 -0.08895578
## gmean_atomic_mass      0.09196424 -0.179557668  0.04964075  0.18509302
## wtd_gmean_atomic_mass  0.06616678  0.061487455 -0.02155713  0.18226460
## entropy_atomic_mass    0.19338827  0.104884853 -0.11690311 -0.06885462
##                               PC57        PC58          PC59        PC60
## number_of_elements    -0.033431935 -0.05624044 -0.1045509899 -0.03410281
## mean_atomic_mass      -0.018006944 -0.16194934  0.0233984971  0.04401181
## wtd_mean_atomic_mass   0.070160458  0.23618103  0.0524883695 -0.06250234
## gmean_atomic_mass     -0.068699797 -0.14334482 -0.0762401708  0.07877092
## wtd_gmean_atomic_mass  0.006601438  0.24428371 -0.0005254918 -0.07356478
## entropy_atomic_mass    0.333224985 -0.16411891  0.0053891474 -0.16251028
##                              PC61        PC62        PC63        PC64
## number_of_elements    -0.27177086 -0.03824890  0.14006222  0.11477045
## mean_atomic_mass       0.01169217  0.12588152 -0.14209552 -0.05492107
## wtd_mean_atomic_mass  -0.02708919  0.19157803 -0.07787844  0.01250183
## gmean_atomic_mass      0.02587727 -0.16246368  0.14304949 -0.01418947
## wtd_gmean_atomic_mass  0.01527817 -0.08445826  0.12086654  0.06120593
## entropy_atomic_mass    0.05361563 -0.28110332  0.03790376  0.02899267
##                              PC65        PC66        PC67         PC68
## number_of_elements     0.03996614  0.06913216  0.01052621  0.009094840
## mean_atomic_mass      -0.11247947  0.26072536 -0.11621543 -0.003411829
## wtd_mean_atomic_mass  -0.11573176  0.18674041 -0.17220867 -0.048057093
## gmean_atomic_mass      0.18233058 -0.32203625  0.16354531 -0.005364360
## wtd_gmean_atomic_mass  0.05284737 -0.13822259  0.10269399  0.059232912
## entropy_atomic_mass   -0.07186486  0.18557519 -0.40143572 -0.149390270
##                               PC69        PC70         PC71        PC72
## number_of_elements     0.087821088  0.03076486 -0.003101844 -0.03933664
## mean_atomic_mass      -0.002252168 -0.11879442  0.049499259  0.07976154
## wtd_mean_atomic_mass  -0.004529280 -0.10285875 -0.126230314  0.01073326
## gmean_atomic_mass     -0.021455146  0.17566907 -0.061015274 -0.06090120
## wtd_gmean_atomic_mass  0.018650239  0.05355546  0.156795194 -0.04645022
## entropy_atomic_mass   -0.121237015 -0.05498821 -0.080377512  0.06469620
##                              PC73         PC74         PC75        PC76
## number_of_elements    -0.11247979 -0.007909514 -0.009523201 -0.02787139
## mean_atomic_mass      -0.02081370  0.026474043 -0.089030328  0.35061341
## wtd_mean_atomic_mass   0.01072977 -0.016187438  0.126835170 -0.52015400
## gmean_atomic_mass      0.06512625 -0.048542444  0.097633627 -0.34009207
## wtd_gmean_atomic_mass -0.06016132  0.040803667 -0.142096723  0.53447723
## entropy_atomic_mass   -0.09748688 -0.004917660 -0.033189371  0.04022614
##                              PC77        PC78         PC79         PC80
## number_of_elements    -0.01232390 -0.01941766 -0.007987346 -0.001680979
## mean_atomic_mass       0.02027277  0.03671681 -0.035040961  0.041983137
## wtd_mean_atomic_mass  -0.02704500 -0.04119994  0.068238083 -0.063916108
## gmean_atomic_mass     -0.01362914 -0.03491495  0.036518203 -0.046736110
## wtd_gmean_atomic_mass  0.02165557  0.04359082 -0.075933904  0.075731457
## entropy_atomic_mass   -0.06054435  0.02738245  0.001577812  0.005263212
##                               PC81
## number_of_elements    -0.002106604
## mean_atomic_mass       0.009060502
## wtd_mean_atomic_mass  -0.013608959
## gmean_atomic_mass     -0.006179888
## wtd_gmean_atomic_mass  0.006908326
## entropy_atomic_mass   -0.002731923
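
The link between prcomp() and the eigendecomposition can be confirmed numerically. The following is a minimal sketch on the built-in mtcars data (an assumption made for illustration, as are the object names); it checks that the squared standard deviations equal the eigenvalues and that the loadings equal the eigenvectors up to the sign of each column.

```r
# Toy illustration on mtcars, not the superconductor data.
X <- scale(mtcars)                                  # standardized data
eig <- eigen(cov(X))                                # eigendecomposition of the covariance matrix
pca <- prcomp(X, center = FALSE, scale. = FALSE)    # PCA on the already standardized data

# Squared standard deviations of the components equal the eigenvalues
same_values <- isTRUE(all.equal(pca$sdev^2, eig$values))

# Loadings equal the eigenvectors up to the sign of each column
same_vectors <- isTRUE(all.equal(abs(unname(pca$rotation)), abs(eig$vectors)))

print(same_values)
print(same_vectors)
```

Comparing absolute values is a simple way to ignore the arbitrary sign of each column.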

Below is the summary of the PCA, including the standard deviation of each component, the proportion of the original variance explained by each component, and the cumulative variance.

summary(data_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     5.6156 2.9139 2.77708 2.53086 2.18279 1.75174 1.71290
## Proportion of Variance 0.3893 0.1048 0.09521 0.07908 0.05882 0.03788 0.03622
## Cumulative Proportion  0.3893 0.4941 0.58935 0.66843 0.72725 0.76513 0.80136
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.58643 1.38293 1.26573 1.21695 1.08695 0.97701 0.89935
## Proportion of Variance 0.03107 0.02361 0.01978 0.01828 0.01459 0.01178 0.00999
## Cumulative Proportion  0.83243 0.85604 0.87582 0.89410 0.90869 0.92047 0.93046
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.89208 0.79563 0.76303 0.66348 0.62570 0.55602 0.49481
## Proportion of Variance 0.00982 0.00782 0.00719 0.00543 0.00483 0.00382 0.00302
## Cumulative Proportion  0.94028 0.94810 0.95529 0.96072 0.96555 0.96937 0.97239
##                           PC22    PC23    PC24    PC25    PC26   PC27    PC28
## Standard deviation     0.48177 0.45580 0.40959 0.39968 0.38845 0.3711 0.33985
## Proportion of Variance 0.00287 0.00256 0.00207 0.00197 0.00186 0.0017 0.00143
## Cumulative Proportion  0.97526 0.97782 0.97989 0.98187 0.98373 0.9854 0.98686
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.31984 0.30537 0.28801 0.27892 0.27287 0.24129 0.23554
## Proportion of Variance 0.00126 0.00115 0.00102 0.00096 0.00092 0.00072 0.00068
## Cumulative Proportion  0.98812 0.98927 0.99029 0.99125 0.99217 0.99289 0.99358
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     0.22426 0.21502 0.19983 0.18811 0.18503 0.16253 0.15744
## Proportion of Variance 0.00062 0.00057 0.00049 0.00044 0.00042 0.00033 0.00031
## Cumulative Proportion  0.99420 0.99477 0.99526 0.99570 0.99612 0.99645 0.99675
##                           PC43    PC44    PC45    PC46   PC47    PC48    PC49
## Standard deviation     0.14428 0.13868 0.13456 0.13208 0.1262 0.12326 0.12100
## Proportion of Variance 0.00026 0.00024 0.00022 0.00022 0.0002 0.00019 0.00018
## Cumulative Proportion  0.99701 0.99725 0.99747 0.99769 0.9979 0.99807 0.99825
##                           PC50    PC51    PC52    PC53    PC54    PC55    PC56
## Standard deviation     0.11914 0.11255 0.11162 0.10131 0.09863 0.09771 0.09242
## Proportion of Variance 0.00018 0.00016 0.00015 0.00013 0.00012 0.00012 0.00011
## Cumulative Proportion  0.99843 0.99858 0.99874 0.99886 0.99898 0.99910 0.99921
##                           PC57    PC58    PC59    PC60    PC61    PC62    PC63
## Standard deviation     0.08503 0.08118 0.08045 0.07627 0.07243 0.06789 0.05995
## Proportion of Variance 0.00009 0.00008 0.00008 0.00007 0.00006 0.00006 0.00004
## Cumulative Proportion  0.99930 0.99938 0.99946 0.99953 0.99959 0.99965 0.99970
##                           PC64    PC65    PC66    PC67    PC68    PC69    PC70
## Standard deviation     0.05968 0.05650 0.05349 0.05107 0.04773 0.04298 0.04084
## Proportion of Variance 0.00004 0.00004 0.00004 0.00003 0.00003 0.00002 0.00002
## Cumulative Proportion  0.99974 0.99978 0.99981 0.99985 0.99988 0.99990 0.99992
##                           PC71    PC72    PC73    PC74    PC75    PC76    PC77
## Standard deviation     0.03832 0.03676 0.03457 0.02734 0.02483 0.02113 0.01819
## Proportion of Variance 0.00002 0.00002 0.00001 0.00001 0.00001 0.00001 0.00000
## Cumulative Proportion  0.99994 0.99995 0.99997 0.99998 0.99999 0.99999 0.99999
##                           PC78    PC79     PC80     PC81
## Standard deviation     0.01366 0.01096 0.008603 0.007042
## Proportion of Variance 0.00000 0.00000 0.000000 0.000000
## Cumulative Proportion  1.00000 1.00000 1.000000 1.000000

What is particularly interesting is the amount of variance captured by the first few principal components. Using only 6 components, over 75% of the original variance is retained. Additionally, increasing the number to 12 components allows us to preserve more than 90% of the total variance. Given that the original dataset contained 81 variables, this corresponds to an approximate 85% reduction in dimensionality. This result is consistent with the high level of correlation observed among many variables in the original dataset.
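
The number of components needed for a given share of variance can also be computed directly instead of being read from the table. The helper n_components_for() below is hypothetical (it is not part of any package), and the toy eigenvalues are chosen only for illustration.

```r
# Hypothetical helper: smallest number of components whose cumulative
# share of variance reaches a given threshold.
n_components_for <- function(eigenvalues, threshold) {
  cum_share <- cumsum(eigenvalues) / sum(eigenvalues)
  which(cum_share >= threshold)[1]
}

# Toy eigenvalues (they sum to 10); cumulative shares: 0.50 0.70 0.85 0.95 1.00
toy_values <- c(5, 2, 1.5, 1, 0.5)
k80 <- n_components_for(toy_values, 0.80)
k90 <- n_components_for(toy_values, 0.90)
print(c(k80 = k80, k90 = k90))

# Reduction in dimensionality when keeping 12 of 81 variables
reduction <- 1 - 12 / 81
print(round(reduction, 3))

# With a prcomp object such as data_pca, the eigenvalues are data_pca$sdev^2:
# n_components_for(data_pca$sdev^2, 0.90)
```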

Let’s make some example visualizations for the first two components.

fviz_pca_var(data_pca, col.var="darkred")

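Each variable is drawn as an arrow in the plane of the first two principal components. Variables whose arrows point in similar directions are positively correlated in this plane, while arrows pointing in opposite directions indicate negative correlation. The longer the arrow, the better the variable is represented by these two components.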
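
To make the meaning of the plotted coordinates more concrete, here is a minimal sketch on the built-in mtcars data (an assumption made for illustration). It assumes that the coordinate of a variable on a component is its loading multiplied by the standard deviation of that component, and checks that for standardized data this equals the correlation between the variable and the component scores.

```r
# Toy illustration on mtcars, not the superconductor data.
X <- scale(mtcars)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Coordinates of the variables: loadings scaled by component standard deviations
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# For standardized data these equal the correlations between variables and scores
var_cor <- cor(X, pca$x)

coords_match <- isTRUE(all.equal(var_coord, var_cor, check.attributes = FALSE))
print(coords_match)
print(round(var_coord[, 1:2], 2))   # coordinates in the PC1-PC2 plane
```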

Time to visualize the observations in two dimensions.

fviz_pca_ind(data_pca, col.ind="cos2", geom="point", 
             gradient.cols=c("yellow", "orange", "red"))

cos2 measures how well each observation is represented by the selected principal components. A high cos2 indicates a good representation, while a low cos2 indicates poor representation.
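
The idea behind cos2 can be sketched in a few lines of base R. The example below uses the built-in mtcars data and hypothetical object names. It assumes that cos2 of an observation on a component is its squared coordinate divided by its squared distance from the origin, so the values over all components sum to 1.

```r
# Toy illustration on mtcars, not the superconductor data.
X <- scale(mtcars)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

scores <- pca$x                  # coordinates of the observations on all components
dist2 <- rowSums(scores^2)       # squared distance of each observation from the origin
cos2 <- scores^2 / dist2         # share of that squared distance carried by each component

# Quality of representation in the plane of the first two components
cos2_plane <- rowSums(cos2[, 1:2])
print(round(head(sort(cos2_plane, decreasing = TRUE)), 3))
```

Observations with cos2_plane close to 1 lie almost entirely in the PC1-PC2 plane, while values close to 0 mean that most of their variation is carried by other components.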

Chapter 5. How Many Principal Components to Keep? Which Variables Contribute Most?

5.1. How Many Principal Components to Keep?

PCA is often the first step before diving into further analysis. Deciding how many components to keep is important to capture enough information without retaining too much noise. To help with this choice, we can plot the eigenvalues and the variance explained by each component, along with the cumulative variance curve.

eig_plot <- fviz_eig(
  data_pca,
  choice = "eigenvalue",
  addlabels = TRUE,
  ncp = 15,
  main = "Eigenvalues of each dimension"
) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))
var_plot <- fviz_eig(
  data_pca,
  ncp = 15,
  addlabels = TRUE,
  main = "Percentage of variance explained by each dimension"
) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1)))

grid.arrange(eig_plot, var_plot, nrow = 2,  top = "Eigenvalues and amount of variance explained by each component in PCA")

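a <- summary(data_pca)
# Row 3 of the importance matrix holds the cumulative proportion of variance
plot(a$importance[3,], type="l", col = "blue", main = "Cumulative variance",
     xlab = "Number of components", ylab = "Cumulative proportion")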

Choosing the optimal number of principal components can be subjective and there is no universal criterion. Since the original dataset is about the physical properties of materials, the analysis likely requires higher precision. Therefore, I would say it’s justified to retain a larger proportion of the variance.

According to the results, keeping the first 12 components reduces the dimensionality by approximately 85% while still preserving about 91% of the original variance, which seems quite satisfactory.

To further support this choice, the Kaiser rule can be applied, which suggests retaining components with eigenvalues greater than 1. The 12th eigenvalue is 1.18 and the 13th is 0.95, so this rule also indicates that 12 components should be kept.
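
The Kaiser rule amounts to counting the eigenvalues above 1. The short sketch below uses the first 15 eigenvalues copied from the output in Section 4.1; the remaining ones are all smaller than 1, so leaving them out does not change the count. The names ev and n_kaiser are introduced here for illustration.

```r
# First 15 eigenvalues copied from the output of data.eigen$values in Section 4.1
ev <- c(31.53476, 8.490589, 7.712172, 6.405261, 4.764568,
        3.068578, 2.934024, 2.516757, 1.912501, 1.602082,
        1.480977, 1.181464, 0.9545415, 0.8088287, 0.7958057)

# Kaiser rule: count the components with an eigenvalue greater than 1
n_kaiser <- sum(ev > 1)
print(n_kaiser)

# With the full objects from the analysis, the equivalent call would be:
# sum(data.eigen$values > 1)
```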

For these reasons, I would personally choose to retain 12 PCs for any further analysis. Of course, this decision remains somewhat subjective, as there is no single correct answer for the number of principal components to retain.

5.2. Which Variables Contribute Most?

Assuming that we keep the first 12 principal components, we can create plots to examine which variables contribute most to each of them.

# Contribution plots for the first 12 components (top 10 variables each)
contrib_plots <- lapply(1:12, function(i) {
  fviz_contrib(data_pca, choice = "var", fill = "steelblue",
               axes = i, top = 10)
})

# Display the plots in pairs: PC1 with PC2, PC3 with PC4, and so on
for (i in seq(1, 11, by = 2)) {
  grid.arrange(contrib_plots[[i]], contrib_plots[[i + 1]], nrow = 2)
}
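
The contributions shown in such plots can be approximated by hand. The sketch below uses the built-in mtcars data and hypothetical object names. It assumes that the contribution of a variable to a component is its squared loading expressed as a percentage, which works because each loading column has unit length. The value 100 / p (with p the number of variables) is the level expected if all variables contributed equally.

```r
# Toy illustration on mtcars, not the superconductor data.
X <- scale(mtcars)
pca <- prcomp(X, center = FALSE, scale. = FALSE)

# Squared loadings sum to 1 within each component, so this gives percentages
contrib <- 100 * pca$rotation^2

# Reference level if all variables contributed equally
expected <- 100 / ncol(X)

# Variables contributing most to the first component
top_pc1 <- sort(contrib[, "PC1"], decreasing = TRUE)
print(round(head(top_pc1, 5), 2))
print(round(expected, 2))
```

Variables whose contribution exceeds the expected level can be considered important for the given component.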

Summary and Conclusions

The primary goal of this study was to reduce the dimensionality of a dataset describing physical properties of superconducting materials. The reduction seemed particularly useful because of the high correlation between many variables in the original data.

The analysis was done using Principal Component Analysis (PCA) - a technique that transforms the original features into a new set of orthogonal variables called principal components, which are linear combinations of the original features.

It was found that it is possible to retain the majority of the original variance using only a few of the newly created components. Using the first 12 principal components retains approximately 91% of the original variance. Considering the characteristics of the data, the likely need for a precise approach, and the guidance of the Kaiser rule, I would say keeping these 12 components for further analysis seems reasonable. Reducing the dimensionality by 85% while retaining about 91% of the variance seems to be a good compromise. However, there is no strict, objective criterion for the number of components to retain, so this choice reflects my point of view.

References

Hamidieh, K. (2018). Superconductivty Data [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C53P47.