Introduction

A high-dimensional (HD) setting is a situation in which the dimension \(p\) (the number of regressors) is large relative to \(n\) (the number of observations). In this case the number of parameters to estimate (governed by \(p\)) is too large for the information we have (governed by \(n\)). Applying classic OLS methods to such data results in high variance and overfitting. Even if an estimator is unbiased, high variance can make it perform very poorly. This problem is often called the “curse of dimensionality”; the extreme case is \(p>n\). However, there are several methods which tackle the problem of dimensionality reduction. The most popular unsupervised learning tool is PCA, and PCR builds a regression on top of it. These techniques are useful even outside the HD setting - e.g. when the regressors are strongly correlated, PCR is preferable to OLS. Supervised learning, on the other hand, offers yet another technique - penalized regression. In this study, I’m going to conduct a PCA/PCR analysis on a dataset containing the dependent variable - the suicide rate per 100,000 people - and independent macroeconomic variables: GDP per capita, unemployment rate, alcohol consumption per capita, fertility rate and CPI (2010 current prices). For the purpose of conducting PCR, I’m going to generate additional variables artificially and see whether the method selects the proper ones.

Theory behind PCA and PCR

Principal Component Analysis is a statistical procedure whose aim is to reveal the internal structure of the data in a way that best explains its variance. It applies an orthogonal transformation to a set of correlated variables to convert them into a set of linearly uncorrelated variables called principal components. PCA is carried out by an eigenvalue decomposition of the covariance or correlation matrix of the (normalised) data: the covariance matrix is used when the variables are measured on comparable scales (e.g. percentage shares), the correlation matrix when they are not.

Usually the procedure is conducted as follows:

  1. Computing the means for the rows of the matrix containing the observations.
  2. Computing the deviation matrix, i.e. the initial matrix minus the one obtained in the first step.
  3. Computing the covariance matrix from the deviation matrix.
  4. Computing the eigenvalues and eigenvectors of the covariance matrix and selecting the leading ones.
  5. Projecting the data onto the matrix of selected eigenvectors (by matrix multiplication) - obtaining the matrix of principal components (see the sketch below).
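
A minimal R sketch of these steps, assuming a hypothetical numeric matrix X with observations in rows and variables in columns (the result agrees, up to column signs, with prcomp()):

# Hypothetical data: 100 observations of 5 variables
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
X.centered <- scale(X, center = TRUE, scale = FALSE)  # steps 1-2: deviations from the means
S <- cov(X.centered)                                  # step 3: covariance matrix
eig <- eigen(S)                                       # step 4: eigenvalues and eigenvectors
V.q <- eig$vectors[, 1:2]                             # keep the two leading eigenvectors
scores <- X.centered %*% V.q                          # step 5: projection = principal components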

Mathematically the procedure can be written:

Let’s state that we have a set of observations \(x_1, x_2..., x_N\) and we consider a rank-\(q\) linear model for representing them:
\(f(\lambda)=\mu+V_q\lambda\)
\(\mu\) is a location vector in \(R^p\), \(V_q\) is a \(p \times q\) matrix with \(q\) orthogonal unit vectors as columns, and \(\lambda\) is a vector of \(q\) parameters. Then \(f(\lambda)\) is a parametric representation of an affine hyperplane of rank \(q\). Fitting such a model by least squares amounts to minimizing the reconstruction error. After optimising over \(\mu\) and \(\lambda_i\) we obtain:
\(\hat{\mu}=\overline{x}\) \(\hat{\lambda_i}=V^T_q(x_i-\overline{x})\)
Then the orthogonal matrix \(V_q\) is to be found:
\(\min_{V_q}\sum_{i = 1}^{N} ||(x_i-\overline{x})-V_qV^T_q(x_i-\overline{x})||^2\)
The \(p \times p\) matrix \(H_q = V_qV^T_q\) is a projection matrix which maps each point \(x_i\) onto its rank-\(q\) reconstruction \(H_qx_i\), the orthogonal projection of \(x_i\) onto the subspace spanned by the columns of \(V_q\). We can construct the singular value decomposition of \(X\):
\(X=UDV^T\)

The columns of \(UD\) are then called the principal components of \(X\). The \(N\) optimal \(\hat{\lambda_i}\) are given by the first \(q\) principal components: the \(N\) rows of the \(N \times q\) matrix \(U_qD_q\).
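
In R this decomposition can be obtained directly; a small sketch reusing the hypothetical X.centered from the chunk above:

sv <- svd(X.centered)          # X = U D V'
pcs <- sv$u %*% diag(sv$d)     # principal components UD
# identical, up to column signs, to prcomp(X, center = TRUE)$x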

Principal Component Regression is a technique based on PCA. In a traditional regression the dependent variable is regressed on the explanatory variables, while in PCR principal components of the explanatory variables are used as regressors. In most cases the principal components with the highest variances are chosen. PCR has two major advantages: it can reduce dimensionality and it can overcome the problem of multicollinearity.

Usually the procedure is conducted as follows:

  1. Performing PCA on the initial dataset and selecting principal components for further use.
  2. Regressing the observed vector of outcomes on the selected principal components as covariates, using OLS.
  3. Transforming the obtained vector back to the scale of the actual covariates, using the selected PCA loadings.
  4. Estimating the regression coefficients of the original model using the obtained PCR estimator.

Mathematically the shortened procedure can be written:

Let \(z_m=Xv_m\) be the derived inputs, which form orthogonal columns. \(y\) is regressed on \(z_1,...,z_M\) for some \(M\leq p\). Since the \(z_m\) are orthogonal, this regression is simply a sum of univariate regressions:
\(\hat{y}_{(M)}^{pcr}=\overline{y}1+\sum_{m = 1}^{M} \hat{\theta}_mz_m\)
where \(\hat{\theta}_m=(z_m,y)/(z_m,z_m)\). Since the \(z_m\) are linear combinations of the original \(x_j\), the solution can be expressed:
\(\hat{\beta}_{(M)}^{pcr}=\sum_{m = 1}^{M} \hat{\theta}_mv_m\)

If \(M=p\) we would get the usual least squares estimates since the columns of \(Z=UD\) span the column space of \(X\). For \(M<p\) we get the reduced regression.
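
A minimal sketch of this procedure on hypothetical data, mirroring the formulas above (the pls package used later automates exactly these steps):

set.seed(2)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
y <- drop(X %*% c(1, -1, 0.5, 0, 0)) + rnorm(100)   # hypothetical response
pca <- prcomp(X, center = TRUE, scale. = TRUE)
M <- 3
Z <- pca$x[, 1:M]                          # derived inputs z_1, ..., z_M
theta <- coef(lm(y ~ Z))[-1]               # OLS coefficients theta_m on the components
beta.pcr <- pca$rotation[, 1:M] %*% theta  # back-transformed coefficients (standardized scale)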

Theory behind penalized regression

Let’s take an HD linear model with \(p\) regressors:
\(Y_i=\beta'X_i+\varepsilon_i=\sum_{j=1}^p \beta_jX_{i,j}+\varepsilon_i, i=1,...,n\)

In the HD setting, estimation without imposing restrictions on \(\beta\) becomes difficult. However, some reasonable restrictions can often be imposed. We call \(\beta\) sparse if most of its components are zero. Assuming sparsity allows us to do variable selection (via information criteria or cross-validation), i.e. to find a trade-off between fit and model complexity.

For finite, positive \(q\) we can define the penalized LS estimator:
\(\hat{\beta}=argmin_{\beta}(\dfrac{1}{n}\sum_{i=1}^n(Y_i-\beta'X_i)^2+\lambda\sum_{j=1}^p|\beta_j|^q)\)

This is also called regularized least squares, with \(\lambda||\beta||_q^q\) being the regularizer. The first part of the sum corresponds to the lack of fit, the second to model complexity. \(\lambda\geq 0\) is a tuning parameter which controls how strongly the coefficients are penalized and thus how aggressively variables are screened out.

For \(q=1\) we get the lasso solution and for \(q=2\) - ridge regression.

However, these older, well-known methods have limitations. For example, the lasso tends to select too many variables, and it does not perform well when the features are highly correlated. In these cases we may use the Elastic Net estimator:
\(\hat{\beta_{EN}}=argmin_{\beta}(||Y-X\beta||_2^2/n+\lambda_1||\beta||_1+\lambda_2||\beta||_2^2)\)

The lasso also ignores any ordering of the features; because of this, more general methods which impose additional structure, such as the Adaptive Lasso, the Fused Lasso or the CARDS algorithm, are recommended.
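
As an illustration (not used elsewhere in this study), a short sketch with the glmnet package on hypothetical data, where alpha = 1 gives the lasso, alpha = 0 ridge regression and intermediate values the elastic net:

library(glmnet)
set.seed(3)
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(100)   # sparse true coefficient vector
cv.lasso <- cv.glmnet(X, y, alpha = 1)             # lambda chosen by cross-validation
coef(cv.lasso, s = "lambda.min")                   # most coefficients shrunk exactly to zero
enet <- glmnet(X, y, alpha = 0.5)                  # elastic net mixes the L1 and L2 penalties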

PCA

First, I’m loading and browsing the dataset.
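
The chunks below assume the following packages are attached beforehand (an assumed preamble, not shown in the original code; the package providing create.random.matrix(), used later, is not identified here):

library(readxl)      # read_excel()
library(corrplot)    # correlation plots
library(clusterSim)  # data.Normalization()
library(factoextra)  # fviz_eig(), fviz_pca_var(), fviz_pca_ind()
library(ggfortify)   # autoplot() for prcomp objects
library(pca3d)       # pca2d(), pca3d()
library(pls)         # pcr(), coefplot(), validationplot()
library(knitr)       # knit_hooks
library(rgl)         # hook_webgl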

suicide <- read_excel("suicide.xlsx")
View(suicide)
summary(suicide)
##    country             suicide           GDPpc              unempl       
##  Length:181         Min.   : 0.500   Min.   :   237.4   Min.   :0.00140  
##  Class :character   1st Qu.: 4.700   1st Qu.:  1587.1   1st Qu.:0.04120  
##  Mode  :character   Median : 7.800   Median :  4808.6   Median :0.06180  
##                     Mean   : 9.235   Mean   : 12299.5   Mean   :0.08024  
##                     3rd Qu.:12.500   3rd Qu.: 14912.7   3rd Qu.:0.10200  
##                     Max.   :31.900   Max.   :100738.7   Max.   :0.30600  
##     alcohol        fertility          cpi           
##  Min.   : 0.00   Min.   :1.172   Min.   :9.800e+01  
##  1st Qu.: 2.40   1st Qu.:1.733   1st Qu.:1.100e+02  
##  Median : 6.30   Median :2.334   Median :1.200e+02  
##  Mean   : 6.14   Mean   :2.803   Mean   :3.669e+07  
##  3rd Qu.: 9.50   3rd Qu.:3.760   3rd Qu.:1.440e+02  
##  Max.   :15.20   Max.   :7.239   Max.   :6.641e+09
suicide$country <- NULL
dim(suicide)
## [1] 181   6

Now, I’m splitting the dataset into train and test parts.

suicide.test <- suicide[151:181,]
suicide.train <- suicide[1:150,]
suicide.y <- suicide.train$suicide
suicide.train$suicide <- NULL

Then, I’m normalizing the data and checking correlation between variables.

dim(suicide.train)
## [1] 150   5
suicide.stand<- as.data.frame(apply(suicide.train[, 1:5], 2, function(x) (x - mean(x))/sd(x)))
suicide.y.stand <- data.Normalization(suicide.y, type="n1", normalization="column")
res <- cor(suicide.train, method="pearson")
corrplot::corrplot(res, order = "alphabet")

corrplot::corrplot(res, method= "color", order = "hclust", tl.pos = 'n')

The two correlation plots show mild correlation between some of the variables. To obtain a detailed correlation matrix and a second matrix with the p-values for the significance of the correlations, I’m running some more commands.

source("http://www.sthda.com/upload/rquery_cormat.r")
rquery.cormat(suicide)

## $r
##              cpi GDPpc suicide alcohol unempl fertility
## cpi            1                                       
## GDPpc     -0.031     1                                 
## suicide     0.21  0.26       1                         
## alcohol    0.092  0.38    0.57       1                 
## unempl    -0.092 -0.13  -0.025    0.08      1          
## fertility -0.059 -0.48   -0.33    -0.4  -0.11         1
## 
## $p
##              cpi   GDPpc suicide alcohol unempl fertility
## cpi            0                                         
## GDPpc       0.68       0                                 
## suicide   0.0044 0.00037       0                         
## alcohol     0.22 9.6e-08 8.3e-17       0                 
## unempl      0.22   0.084    0.74    0.29      0          
## fertility   0.43   1e-11 7.8e-06 3.3e-08   0.14         0
## 
## $sym
##           cpi GDPpc suicide alcohol unempl fertility
## cpi       1                                         
## GDPpc         1                                     
## suicide             1                               
## alcohol       .     .       1                       
## unempl                              1               
## fertility     .     .       .              1        
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

We can see that the highest correlation between variables is equal to 0.57. However, many pairs are significantly correlated (p-value less than 0.05): suicide with cpi, alcohol and fertility; GDPpc with alcohol and fertility; alcohol with fertility.

Now, let’s run PCA.

suicide.pca1 <- prcomp(suicide.stand, center=TRUE, scale.=TRUE)
summary(suicide.pca1)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5
## Standard deviation     1.3545 1.0549 0.9936 0.7826 0.67291
## Proportion of Variance 0.3669 0.2225 0.1975 0.1225 0.09056
## Cumulative Proportion  0.3669 0.5895 0.7870 0.9094 1.00000

The quality of this PCA is good enough. We have 5 principal components which together explain 100% of the variance, the first one explaining about 37%. Their number equals the number of independent variables, as it always does by construction. The validation plots later on will show what the optimal number of components is. I can then check what each component of my output matrix looks like.

# Principal component scores (UD) - first observation
suicide.pca1$x [,1:5] %>% head(1) 
##            PC1        PC2        PC3       PC4       PC5
## [1,] -1.985734 0.01469254 0.09627027 0.3186402 0.5266854
# V matrix - eigenvectors
suicide.pca1$rotation
##                   PC1         PC2         PC3        PC4        PC5
## GDPpc      0.57282151 -0.14603512  0.37533527  0.1654400  0.6944799
## unempl     0.04958360  0.80674071 -0.44961037  0.1980481  0.3245587
## alcohol    0.54874028  0.08907206 -0.21971588 -0.7916179 -0.1265554
## fertility -0.60165291 -0.06551524  0.01687948 -0.5247764  0.5983697
## cpi        0.07946611 -0.56179809 -0.78000895  0.1770801  0.1956955
# Scores matrix - each column holds the values of one principal component for all observations
suicide.pca1$x
##                 PC1          PC2          PC3          PC4         PC5
##   [1,] -1.985733808  0.014692540  0.096270271  0.318640201  0.52668537
##   [2,]  0.415218294  1.095788457 -0.696368359  0.306294904 -0.49283659
##   [3,] -0.954988454  0.263837950  0.013862358  0.974897923 -0.09892425
##   [4,] -1.562247086 -0.072167413 -0.072453303 -1.256317838  0.85800282
##   [5,]  0.509385862  0.323673036 -0.072290501  0.209715632 -0.17278593
##   [6,]  0.694633430  0.206523571 -0.154347788 -0.469050861 -0.29718589
##   [7,]  0.204480466  1.424717993 -0.806870958  0.803156768 -0.35302162
##   [8,]  2.241501938 -0.454906021  0.832522205 -0.165041569  0.83426478
##   [9,]  2.339712462 -0.337044467  0.641921541 -0.271232191  0.46912612
##  [10,] -0.643131320 -0.364996369  0.379698568  1.149676344 -0.73079492
##  [11,]  0.849035907  0.480361529  0.222018831  1.040829554  0.56012215
##  [12,]  0.057946346 -0.992119900  1.006380724  0.970556084 -0.15588356
##  [13,] -0.909017304 -0.451419566  0.414121846  1.191678466 -0.77172943
##  [14,]  1.003519709  0.329583244 -0.151843323 -0.177929055 -0.32278577
##  [15,]  1.773802354 -7.605131945 -9.368733652  1.319351424  1.07803377
##  [16,]  2.209919489 -0.073310543  0.411857482 -0.421712108  0.49812180
##  [17,] -0.068287448  0.096061433 -0.104126635 -0.071251253 -0.46428548
##  [18,] -1.814460385 -0.752330159  0.410120390 -0.532713301  0.27650731
##  [19,] -0.775273094 -0.693128718  0.549897215  1.053301609 -0.85354062
##  [20,] -0.576765267 -0.567856516  0.305208063 -0.027047730 -0.57222792
##  [21,]  0.527788034  2.345861051 -1.332051570  0.963889008 -0.07863454
##  [22,]  0.195504998  1.285230608 -0.803959994 -0.164512759  0.13740915
##  [23,]  0.571762325  0.599738397 -0.354283481  0.175299196 -0.49842616
##  [24,]  0.125591583 -0.320003242  0.770080988  1.535022814  0.29352174
##  [25,]  1.222592062  0.204178184 -0.348362180 -0.805761023 -0.98693543
##  [26,] -1.280135530 -0.165792020 -0.134363985 -1.530961969  0.47642275
##  [27,] -1.570924722 -0.809867360  0.242150242 -1.685964156  0.38648691
##  [28,] -0.166512131  0.409684176 -0.269383460  0.240748060 -0.46506460
##  [29,] -0.275294796 -0.863873227  0.366382377 -0.369131636 -0.99107034
##  [30,] -0.897270564 -0.399950908 -0.009642107 -1.474708127  0.09495233
##  [31,]  1.868555980 -0.251754694  0.660273332  0.200227417  0.55183143
##  [32,] -1.713281167 -0.273530927  0.125732650 -0.437616577  0.39037604
##  [33,] -2.414114064 -0.413406047  0.265306785 -0.522581739  0.90661410
##  [34,]  0.886174831 -0.041310364  0.033900171 -0.225560206 -0.55644598
##  [35,]  0.466208021 -0.296435447  0.161747607  0.102435990 -0.90553480
##  [36,]  0.133645515  0.158110300 -0.079322438  0.370477290 -0.66899984
##  [37,] -1.796673835 -0.535068278  0.382312532  0.158417136  0.15677058
##  [38,] -2.367243223 -0.664813197  0.355230036 -0.859662658  0.82928269
##  [39,] -0.965674620  0.367946426 -0.390022284 -1.055265020  0.43151654
##  [40,]  0.240759681  0.169900788  0.058592978  0.662485914 -0.39317202
##  [41,] -1.058987225 -0.634626315  0.141126959 -1.509565212  0.11940109
##  [42,]  0.992629516  0.798212571 -0.441252545  0.176671055 -0.44743841
##  [43,]  1.641653101  0.725218197 -0.281929833 -0.052664590 -0.07798473
##  [44,]  1.768012146 -0.320803290  0.065566086 -1.146284712 -0.75753967
##  [45,]  2.386614292 -0.426989591  0.890049130 -0.039386813  0.97107136
##  [46,] -1.141323182 -0.290076781  0.302314607  0.865336583 -0.36456401
##  [47,]  0.038961701 -0.221669776  0.097842965 -0.133376588 -0.55873918
##  [48,] -0.346748541 -0.384910480  0.275085728  0.274677141 -0.53351975
##  [49,] -1.231897203  0.517158601 -0.121770022  0.946594390  0.22575116
##  [50,] -0.336081615 -0.383977522  0.269686379  0.534068659 -0.79141102
##  [51,] -0.297197649 -0.167762401 -0.099772791 -1.778922516  0.43387263
##  [52,] -1.640676146 -0.244749309  0.204164341  0.229884235  0.15212842
##  [53,]  1.397186782 -0.013423483 -0.006147975 -0.544592689 -0.55483619
##  [54,]  0.166674757  2.519890052 -1.629766024 -0.330963521  0.55264288
##  [55,] -1.480848797 -0.391689634  0.227028730 -0.117411608  0.07078858
##  [56,] -0.556370523 -0.198686893  0.213005747  0.568933904 -0.42749820
##  [57,]  2.131093118  0.007925676  0.461150865 -0.088117282  0.65795508
##  [58,]  2.030404341  0.248527502  0.132785859 -0.587382880  0.53252213
##  [59,]  0.160272457  1.654298470 -1.147337771 -1.060950896  0.65374682
##  [60,] -1.863661719  0.129348040 -0.125320213 -0.640253319  0.79072327
##  [61,]  0.553267791  0.689185766 -0.570941993 -0.344296813 -0.62752717
##  [62,]  2.469935841 -0.520582425  0.628484533 -0.695923273  0.22008473
##  [63,] -1.395113372 -0.753862447  0.449865486 -0.098119070 -0.13301145
##  [64,]  1.488352613  2.124382153 -1.143225592  0.305644243  0.24375013
##  [65,]  0.738664319  2.192219892 -1.286784922  0.151513448  0.31018940
##  [66,] -0.897348337 -0.661251699  0.470131688  0.388681530 -0.42899129
##  [67,] -1.967410194 -0.529789974  0.355537839 -0.106151056  0.37393720
##  [68,] -1.406098935 -0.234858542  0.051827317 -0.625992025  0.25253983
##  [69,] -0.104337189  0.584515796 -0.367670201  0.120553941 -0.27140292
##  [70,] -0.458925623  0.875239799 -0.581037798  0.091387964 -0.11686632
##  [71,] -0.510257353 -0.351936207  0.210191103  0.325540892 -0.67714612
##  [72,]  1.249744819 -0.180862800  0.012123625 -0.557129855 -0.89147322
##  [73,]  2.387185412 -0.928817429  1.340841404  0.136404836  1.16791557
##  [74,] -0.261214144 -0.453406513  0.188937670  0.016217302 -0.87438672
##  [75,] -0.846242254 -0.495432865  0.440968394  0.956957160 -0.60417366
##  [76,] -0.391034900  0.594117072 -0.135755574  1.466051592 -0.42055429
##  [77,] -1.714477693 -0.122719025  0.238543406  0.392688929  0.52212986
##  [78,]  2.999648703 -0.253226293  0.867401419 -0.450489892  1.50269915
##  [79,]  0.352247053 -0.669231494  0.987179720  0.457407299  1.03935395
##  [80,]  1.440298050  0.432551267  0.136988432  0.592216266  0.24979815
##  [81,] -0.133941843  0.745309566 -0.359621888  0.754203782 -0.33907308
##  [82,]  1.677781260 -0.730489023  0.907236345  0.275324735  0.17135874
##  [83,] -1.199489550  0.880347049 -0.327352185  0.943281533  0.43880593
##  [84,]  0.034415525 -0.295431675  0.118273973 -0.408856286 -0.43793738
##  [85,] -1.175146958  0.455453554 -0.251326527  0.112412283  0.26400583
##  [86,] -1.321680591  2.848187122 -1.459117061  1.353163778  1.25573878
##  [87,]  1.702938257 -0.499376487  0.497413059 -0.126242721 -0.44550144
##  [88,]  0.008878048 -0.940086401  1.137790507  1.429204071  0.12217589
##  [89,] -0.527028357  0.000488651 -0.105541935 -0.262949819 -0.38454998
##  [90,]  0.184011247 -0.742300285  0.166268128 -1.089831141 -0.97404995
##  [91,]  1.412493848  0.409120865 -0.359006577 -0.778937494 -0.54240292
##  [92,] -0.295429019 -0.174711733  0.324275863  1.195516647 -0.58810718
##  [93,] -0.520695865  2.614967562 -1.515673120  0.611059138  0.70705139
##  [94,] -1.289998170 -0.697967719  0.267097426 -0.915090334 -0.00445978
##  [95,] -0.745873663  1.313267819 -0.506717681  1.606357791  0.18250317
##  [96,]  1.699399763  0.217814614 -0.322317358 -1.220022769 -0.66274010
##  [97,]  4.418258008 -0.751451198  1.784573563  0.027753906  2.71929262
##  [98,]  0.673866906  2.198481118 -1.314655921  0.540565769 -0.10865667
##  [99,] -1.627990488 -0.830219402  0.502460245 -0.050371301 -0.09122334
## [100,] -1.535657087 -0.288891964  0.121058352 -0.408701198  0.22935544
## [101,] -0.495554370 -0.615935900  0.611373758  1.099407417 -0.54220806
## [102,] -0.261737798 -0.380117004  0.413015930  0.794453606 -0.47484572
## [103,] -2.468048046 -0.163851204  0.134919414 -0.462144634  1.07054131
## [104,]  1.220567402 -0.405568213  0.476604386  0.164102216 -0.32598831
## [105,] -2.002730839  0.132504792  0.047431201  0.377959892  0.62345874
## [106,]  0.170850957 -0.034958714  0.191087655  0.961008876 -0.69491496
## [107,]  0.134406455 -0.439399280  0.267201324 -0.001984059 -0.66640956
## [108,] -0.888050149  1.056166451 -0.510728506  0.715405396  0.29237065
## [109,]  1.460168391 -0.116473834 -0.362093921 -1.323993929 -1.58972412
## [110,] -0.133787160  0.023927944 -0.117501434 -0.330326196 -0.46364565
## [111,]  0.617413504  1.401764403 -0.836292598  0.336552562 -0.28164706
## [112,] -0.909587559  0.177317565  0.060538801  1.105064135 -0.29748102
## [113,] -1.844563845  2.125995345 -1.180716047  0.185792065  1.56080039
## [114,] -0.364820625 -0.811351123  0.417319235  0.139523948 -1.06118155
## [115,]  0.045020603  2.102033566 -1.366669564 -0.515989074  0.61855306
## [116,] -0.686920438 -0.569757852  0.388534479  0.765135561 -0.91765764
## [117,]  1.919190083 -0.414057441  0.814741812  0.215509856  0.66719182
## [118,]  1.900836199 -0.451855889  0.660660950 -0.319349674  0.43597513
## [119,] -0.249007527 -0.340747026  0.155439141  0.191904581 -0.84841792
## [120,] -3.163597457 -1.191604766  0.715834228 -1.004164912  1.20842064
## [121,] -0.623276730  0.007903968 -0.415856746 -2.534659788  0.49457451
## [122,]  2.576868654 -0.829575795  1.530053444  0.620922405  1.69153020
## [123,] -0.597710441 -0.718454772  0.760179019  0.926176895 -0.04516988
## [124,] -1.475832545 -0.576554196  0.453506796  0.593054477 -0.20213434
## [125,]  0.373265199 -0.415561797  0.294625049 -0.318342692 -0.29939942
## [126,] -1.411868773 -0.728988186  0.517336570  0.328812359 -0.17219136
## [127,] -0.040899406 -0.225594196  0.040205247 -0.245730760 -0.66564363
## [128,] -0.070374035 -0.478127012  0.252054692 -0.081732601 -0.68264237
## [129,] -0.372124392 -0.575645121  0.233586771 -0.394736235 -0.63044396
## [130,]  1.327421381 -0.032272816 -0.083870051 -0.515671024 -0.91657626
## [131,]  1.712454785  0.548752851 -0.306862524 -0.415789650 -0.38412217
## [132,]  1.355646800 -1.437463523  1.887584736  1.323750210  1.23140925
## [133,]  1.243947548 -0.032237505 -0.177656317 -0.838711491 -0.96408326
## [134,]  1.023594830 -0.099252981 -0.119895151 -0.754237732 -0.91320266
## [135,] -0.569163414 -0.741334039  0.176923720 -1.280093907 -0.45079874
## [136,] -1.283869850  0.019478872  0.070442535  0.162912209  0.29338907
## [137,] -0.973649203  0.738755999 -0.549302674 -0.692828519  0.53644222
## [138,] -0.433657337 -0.460780871  0.726944696  1.212367149  0.23650445
## [139,] -1.997989719 -0.496847826  0.366081652  0.051733282  0.38008345
## [140,]  1.037473993  1.177753509 -0.860886525 -0.257071296 -0.65617759
## [141,]  0.805654804 -0.470386629  0.169006631 -1.210863070 -0.32544354
## [142,] -1.229862213 -0.404473208  0.109766753 -0.777465938  0.06205166
## [143,]  1.541726591 -1.155699725  1.670165095  1.607082216  0.84175085
## [144,]  1.445528414  0.376705624 -0.237831428 -0.377042344 -0.52955578
## [145,]  1.672921314  0.133586056 -0.060547174 -0.650909516 -0.36539471
## [146,] -1.488573868 -0.807179044  0.543111247  0.195061635 -0.14175106
## [147,] -2.753474410 -0.428110332  0.326157850 -0.358472710  1.09624914
## [148,]  0.449743745  2.541543375 -1.565361129  0.051883772  0.41013658
## [149,] -0.670117423  0.580962845 -0.662560241 -1.706010815  0.42559012
## [150,]  1.699731959  1.539432606 -0.650199347  0.342368755  0.41054336
# Standard deviations of the components (their squares are the eigenvalues)
suicide.pca1$sdev
## [1] 1.3545382 1.0548554 0.9936200 0.7825733 0.6729076
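# Eigenvalues and their shares of the total variance (matches "Proportion of Variance" above)
suicide.pca1$sdev^2
suicide.pca1$sdev^2 / sum(suicide.pca1$sdev^2)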
res1 <- cor(suicide.pca1$x, method="pearson")
corrplot::corrplot(res1, method = "color", order = "hclust", tl.pos = 'n')

The correlation plot of the principal component scores shows no off-diagonal correlation, which means that the principal components are indeed orthogonal. Now, let’s do some more plotting to visualise this PCA.

plot(summary(suicide.pca1)$importance[3,])

fviz_eig(suicide.pca1)

The first plot shows the cumulative proportion of variance explained, which reaches 1 with all 5 components; the second is the corresponding scree plot, consistent with the initial PCA summary.

fviz_pca_var(suicide.pca1, col.var="black")

The third plot (the variable correlation circle) shows how strongly each variable’s loading contributes to the first two principal components.

fviz_pca_ind(suicide.pca1, col.ind="cos2", geom = "point", gradient.cols = c("white", "#2E9FDF", "#FC4E07" ))

The next plot shows the unlabelled observations in two dimensions, coloured by their quality of representation (cos2). The points are fairly evenly spread except for one clear outlier.
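
A quick way to locate that outlying observation in the score matrix (a small sketch):

# Index of the observation furthest from the origin in the principal component space
which.max(rowSums(suicide.pca1$x^2))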

autoplot(prcomp(suicide, center=TRUE, scale.=TRUE), data=suicide, loadings = TRUE, loadings.colour = 'blue', loadings.label = TRUE, loadings.label.size = 3)

pca2d(suicide.pca1, biplot=TRUE, biplot.vars=3)

The two plots above show the same structure. Let’s now investigate the output in three dimensions.

# a trick for including 3d charts in html output
knit_hooks$set(webgl = hook_webgl)
pca3d(suicide.pca1)
## [1] 0.08836516 0.15210264 0.18737467
## Creating new device


It can be seen in the 3D plot that the point cloud is almost linear, i.e. stretched mainly along one direction.

PCR

Now, I’m generating 100 artificial independent variables from the Gaussian distribution and merging them with my initial dataset. First, I need to check the standard deviations of my variables in order to simulate data with accurate variance.

sapply(suicide, sd, na.rm = TRUE)
##      suicide        GDPpc       unempl      alcohol    fertility          cpi 
## 6.009867e+00 1.745159e+04 6.123073e-02 4.130803e+00 1.354098e+00 4.936389e+08
x = create.random.matrix(100, 181, st.dev = 1, perturb = 0.2)
suicidex <- merge(suicide, x)
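# note: with no common column names, merge() returns the Cartesian product of the two
# data frames (181 * 181 = 32761 rows, as visible in the PCR summary below);
# cbind() would instead simply append the simulated variables as new columns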
View(suicidex)

I have to split this new dataset into train and test data.

suicide.test <- suicidex[151:181,]
suicide.train <- suicidex[1:150,]
suicide.y <- suicide.train$suicide
suicide.train$suicide <- NULL

Now I’m going to run PCR on both the augmented and the original dataset and prepare the data for forecasting.

reg.pcr.x<- pcr(suicide~., data=suicidex, scale=TRUE, validation="CV")
summary(reg.pcr.x)
## Data:    X dimension: 32761 105 
##  Y dimension: 32761 1
## Fit method: svdpc
## Number of components considered: 105
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           5.993    5.994    5.995    5.995    5.995    5.995    5.996
## adjCV        5.993    5.994    5.995    5.995    5.995    5.995    5.996
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       5.996    5.996    5.997     5.997     5.997     5.998     5.998
## adjCV    5.996    5.996    5.997     5.997     5.997     5.998     5.999
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps
## CV        5.999     5.998     5.934     5.217     5.214     5.213     5.212
## adjCV     6.000     6.002     6.006     5.208     5.208     5.210     5.210
##        21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps
## CV        5.211     5.211     5.211      5.21     5.211     5.211     5.211
## adjCV     5.210     5.210     5.210      5.21     5.210     5.211     5.211
##        28 comps  29 comps  30 comps  31 comps  32 comps  33 comps  34 comps
## CV        5.212     5.212     5.212     5.213     5.213     5.213     5.213
## adjCV     5.211     5.211     5.212     5.212     5.212     5.212     5.213
##        35 comps  36 comps  37 comps  38 comps  39 comps  40 comps  41 comps
## CV        5.213     5.213     5.214     5.214     5.214     5.178     5.176
## adjCV     5.213     5.213     5.213     5.213     5.213     5.177     5.176
##        42 comps  43 comps  44 comps  45 comps  46 comps  47 comps  48 comps
## CV        5.177     5.177     5.177     5.178     5.147     5.070     5.065
## adjCV     5.177     5.177     5.177     5.179     5.176     5.064     5.064
##        49 comps  50 comps  51 comps  52 comps  53 comps  54 comps  55 comps
## CV        5.065     5.065     5.065     5.065     5.065     5.065     5.066
## adjCV     5.064     5.064     5.064     5.065     5.065     5.065     5.065
##        56 comps  57 comps  58 comps  59 comps  60 comps  61 comps  62 comps
## CV        5.066     5.066     5.066     5.067     5.067     5.068     5.068
## adjCV     5.065     5.066     5.066     5.066     5.067     5.068     5.068
##        63 comps  64 comps  65 comps  66 comps  67 comps  68 comps  69 comps
## CV        5.061     4.825     4.824     4.824     4.824     4.824     4.824
## adjCV     5.069     4.820     4.822     4.823     4.823     4.823     4.823
##        70 comps  71 comps  72 comps  73 comps  74 comps  75 comps  76 comps
## CV        4.824     4.825     4.825     4.825     4.825     4.796     4.797
## adjCV     4.824     4.824     4.824     4.824     4.824     4.795     4.796
##        77 comps  78 comps  79 comps  80 comps  81 comps  82 comps  83 comps
## CV        4.797     4.797     4.797     4.797     4.797     4.797     4.797
## adjCV     4.796     4.796     4.796     4.796     4.797     4.797     4.797
##        84 comps  85 comps  86 comps  87 comps  88 comps  89 comps  90 comps
## CV        4.798     4.798     4.798     4.798     4.798     4.798     4.799
## adjCV     4.797     4.797     4.797     4.797     4.798     4.798     4.798
##        91 comps  92 comps  93 comps  94 comps  95 comps  96 comps  97 comps
## CV        4.799     4.799     4.799     4.799     4.799     4.799     4.800
## adjCV     4.798     4.798     4.798     4.799     4.799     4.799     4.799
##        98 comps  99 comps  100 comps  101 comps  102 comps  103 comps
## CV        4.800     4.800      4.800      4.800      4.800        4.8
## adjCV     4.799     4.799      4.799      4.799      4.799        4.8
##        104 comps  105 comps
## CV         4.801      4.801
## adjCV      4.800      4.800
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X          2.705    5.347    7.913     10.3    12.64    14.91     17.1    19.24
## suicide    0.000    0.000    0.000      0.0     0.00     0.00      0.0     0.00
##          9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X          21.34     23.41     25.41      27.4     29.31     31.19        33
## suicide     0.00      0.00      0.00       0.0      0.00      0.00         0
##          16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps
## X           34.77     36.53     38.24     39.95     41.59     43.23     44.84
## suicide      0.00     24.55     24.55     24.55     24.55     24.55     24.55
##          23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
## X           46.40     47.96     49.49     50.95     52.39     53.80     55.12
## suicide     24.55     24.55     24.55     24.55     24.55     24.55     24.55
##          30 comps  31 comps  32 comps  33 comps  34 comps  35 comps  36 comps
## X           56.43     57.71     58.99     60.24     61.46     62.64     63.79
## suicide     24.55     24.55     24.55     24.55     24.55     24.55     24.55
##          37 comps  38 comps  39 comps  40 comps  41 comps  42 comps  43 comps
## X           64.92     66.05     67.14     68.19     69.23     70.26     71.26
## suicide     24.55     24.55     24.55     25.63     25.63     25.63     25.63
##          44 comps  45 comps  46 comps  47 comps  48 comps  49 comps  50 comps
## X           72.21     73.17     74.11     75.04     75.98     76.88     77.74
## suicide     25.63     25.63     25.63     28.84     28.84     28.84     28.84
##          51 comps  52 comps  53 comps  54 comps  55 comps  56 comps  57 comps
## X           78.59     79.42     80.21     80.98     81.74     82.49     83.23
## suicide     28.84     28.84     28.84     28.84     28.84     28.84     28.84
##          58 comps  59 comps  60 comps  61 comps  62 comps  63 comps  64 comps
## X           83.94     84.61     85.24     85.86     86.47     87.07     87.65
## suicide     28.84     28.84     28.84     28.84     28.84     28.84     35.50
##          65 comps  66 comps  67 comps  68 comps  69 comps  70 comps  71 comps
## X           88.23      88.8     89.34     89.87     90.39     90.88     91.34
## suicide     35.50      35.5     35.50     35.50     35.50     35.50     35.50
##          72 comps  73 comps  74 comps  75 comps  76 comps  77 comps  78 comps
## X            91.8     92.25     92.69     93.11     93.53     93.93     94.30
## suicide      35.5     35.50     35.50     36.27     36.27     36.27     36.27
##          79 comps  80 comps  81 comps  82 comps  83 comps  84 comps  85 comps
## X           94.67     95.02     95.36     95.68     95.99     96.29     96.58
## suicide     36.27     36.27     36.27     36.27     36.27     36.27     36.27
##          86 comps  87 comps  88 comps  89 comps  90 comps  91 comps  92 comps
## X           96.86     97.13     97.38     97.63     97.86     98.07     98.29
## suicide     36.27     36.27     36.27     36.27     36.27     36.27     36.27
##          93 comps  94 comps  95 comps  96 comps  97 comps  98 comps  99 comps
## X           98.49     98.68     98.86     99.02     99.18     99.33     99.46
## suicide     36.27     36.27     36.27     36.27     36.27     36.27     36.27
##          100 comps  101 comps  102 comps  103 comps  104 comps  105 comps
## X            99.58      99.68      99.77      99.86      99.94     100.00
## suicide      36.27      36.27      36.27      36.27      36.27      36.27
reg.pcr<- pcr(suicide~., data=suicide, scale=TRUE, validation="CV")
summary(reg.pcr)
## Data:    X dimension: 181 5 
##  Y dimension: 181 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
## CV           6.027   410636   343029   631621  1247524  1245562
## adjCV        6.027   389683   325526   599392  1183869  1182006
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          36.93    59.04    78.70    91.03   100.00
## suicide    24.55    25.63    28.84    35.50    36.27

It can be seen that in both cases the same percentage of variance of the dependent variable is explained (36.27%). Therefore, adding the artificial variables didn’t change anything in terms of accuracy; hence, the procedure picks up the correct variables. For the purpose of forecasting, I will use the original dataset without the artificial variables.

suicide.test <- suicide[151:181,]
suicide.train <- suicide[1:150,]
suicide.y <- suicide.train$suicide
suicide.train$suicide <- NULL
suicide.train.all=cbind(suicide.y, suicide.train)
test.x<-suicide.test[, 2:6]
test.y <- suicide.test[,1]
reg.pcr.train<- pcr(suicide.y~., data=suicide.train.all, na.action=na.omit, scale=TRUE, validation="CV")
reg.pcr.pred<-predict(reg.pcr.train, test.x, ncomp = 3)
reg.train<- lm(suicide.y~., data=suicide.train.all)
reg.pred<-predict(reg.train, test.x)
plot(reg.pcr.pred, col="red", cex=0.8)
points(test.y)
points(reg.pred, col="blue")
abline(v=(-10:100)*5, lty=3, col="grey80")

Since I have fitted and predicted values, I can plot them together. Red dots correspond to the values predicted via PCR, black ones to the empirical values and blue ones to the values predicted using OLS. The empirical values are further away from the PCR and OLS predictions, but the latter two are quite close to each other. The reason might be that the multicollinearity between the independent variables wasn’t severe (correlations up to 0.57, although most pairs were significantly correlated).
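
To put a number on this comparison, a short sketch of the out-of-sample root mean squared error of both forecasts, using the prediction objects created above (the exact values depend on the train/test split):

rmse <- function(obs, pred) sqrt(mean((as.numeric(unlist(obs)) - as.numeric(pred))^2))
rmse(test.y, reg.pcr.pred)   # PCR forecast with 3 components
rmse(test.y, reg.pred)       # OLS forecast

Then I’m comparing the PCR coefficients from the full-sample regression and the training-sample model.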

coefplot(reg.pcr)
coefplot(reg.pcr.train, add=TRUE, lwd=2)

It can be seen that both sets of coefficients are quite close to each other, so the quality of this forecast is good enough. Finally, I’m drawing validation plots to choose an optimal number of components, using different accuracy measures. The biggest change occurs at 4 components, so this can be pointed to as the optimal number of components.

validationplot(reg.pcr, val.type = "R2")
validationplot(reg.pcr.train, val.type = "R2", add=TRUE, lwd=2)
axis(side = 1, at = c(4), cex.axis=0.7)
abline(v = 4, col = "blue", lty = 3)

validationplot(reg.pcr, val.type = "RMSEP")
validationplot(reg.pcr.train, val.type = "RMSEP", add=TRUE, lwd=2)
axis(side = 1, at = c(4), cex.axis=0.7)
abline(v = 4, col = "blue", lty = 3)

validationplot(reg.pcr, val.type = "MSEP")
validationplot(reg.pcr.train, val.type = "MSEP", add=TRUE, lwd=2)
axis(side = 1, at = c(4), cex.axis=0.7)
abline(v = 4, col = "blue", lty = 3)
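
Instead of eyeballing the validation plots, the choice can also be automated; a hedged sketch using pls::selectNcomp(), which suggests a parsimonious number of components from the cross-validation curve:

selectNcomp(reg.pcr.train, method = "onesigma", plot = TRUE)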

Conclusion

PCA and PCR are useful tools for reducing dimensionality: PCA is an unsupervised learning technique, and PCR builds a regression on top of its output. However, they can also be useful with smaller datasets; PCA can be applied when we have a few similar variables which we would like to “merge” into one to make interpretation clearer. In this suicide dataset, the optimal number of components turned out to be 4 out of 5. The variables responsible for a surprisingly low share of the variance are cpi and alcohol consumption per capita; excluding them would probably result in almost no change in fit. On the other hand, PCR is preferable to OLS when multicollinearity occurs. In this dataset the variables were collinear to some extent, but no severe multicollinearity was present, so the PCR and OLS estimates turned out to be quite close to each other, as could be seen in the scatter plot.

Sources:

Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning, Second Edition, Springer Series in Statistics

Unsupervised Learning, Block 2, Learning Materials