The high-dimensional (HD) setting is a situation in which the dimension \(p\) (the number of regressors) is large relative to \(n\) (the number of observations). In this case the number of parameters to estimate (governed by \(p\)) is too large for the information we have (governed by \(n\)). Applying classic OLS methods to such data results in high variance and overfitting; even if an estimator is unbiased, high variance can make it perform very poorly. This problem is often called the “curse of dimensionality”, and the extreme case is \(p>n\). However, there are several methods which tackle the problem of dimensionality reduction. The most popular unsupervised learning tools are PCA and PCR. They can also be used in a non-HD setting - e.g. when variables are strongly correlated, PCR is preferable to OLS. Supervised learning, on the other hand, offers yet another technique - penalized regression. In this study, I’m going to conduct a PCA/PCR analysis on a dataset containing the dependent variable - the suicide rate per 100,000 people - and independent macroeconomic variables: GDP per capita, unemployment rate, alcohol consumption per capita, fertility rate and CPI (2010 current prices). For the purpose of conducting PCR, I’m going to generate additional variables artificially and see if it selects the proper ones.
Principal Component Analysis (PCA) is a statistical procedure whose aim is to reveal the internal structure of the data in a way that best explains its variance. It uses an orthogonal transformation to convert a set of correlated variables into a set of linearly uncorrelated variables called principal components. PCA is carried out by the eigenvalue decomposition of the covariance or correlation matrix (the covariance matrix is used when the variables are on comparable scales - e.g. percentage shares - and the correlation matrix when they are not) after normalisation of the data.
Usually the procedure is conducted as follows:
1. Standardize (centre and, if needed, scale) the variables.
2. Compute the covariance or correlation matrix.
3. Find its eigenvalues and eigenvectors.
4. Order the eigenvectors by decreasing eigenvalue and keep the first \(q\) of them.
5. Project the data onto the retained eigenvectors to obtain the principal component scores.
Mathematically the procedure can be written:
Let’s state that we have a set of observations \(x_1, x_2, ..., x_N\) and we consider a rank-\(q\) linear model for representing them:
\[ f(\lambda) = \mu + \mathbf{V}_q \lambda, \]
where \(\mu\) is a location vector in \(\mathbb{R}^p\) and \(\mathbf{V}_q\) is a \(p \times q\) matrix with orthonormal columns. Fitting this model by least squares amounts to minimizing the reconstruction error
\[ \min_{\mu,\,\{\lambda_i\},\,\mathbf{V}_q} \; \sum_{i=1}^{N} \lVert x_i - \mu - \mathbf{V}_q \lambda_i \rVert^2 . \]
The solution can be expressed through the singular value decomposition of the centred data matrix, \(X = UDV^T\). The columns of \(UD\) are then called the principal components of \(X\). The \(N\) optimal \(\hat{\lambda}_i\) are given by the first \(q\) principal components: the \(N\) rows of the \(N \times q\) matrix \(U_qD_q\).
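A minimal numerical sketch of this equivalence, on synthetic data and using only base R (this chunk is illustrative and not part of the original analysis; all object names are made up):
set.seed(1)
X  <- matrix(rnorm(100 * 4), nrow = 100, ncol = 4)   # hypothetical data matrix
Xs <- scale(X, center = TRUE, scale = TRUE)          # normalisation step
sv <- svd(Xs)                                        # Xs = U D V'
scores.svd <- sv$u %*% diag(sv$d)                    # columns of UD = principal components
pca <- prcomp(Xs, center = FALSE, scale. = FALSE)
max(abs(abs(scores.svd) - abs(pca$x)))               # ~0 (component signs are arbitrary)
all.equal(eigen(cor(X))$values, pca$sdev^2)          # eigen-decomposition of the correlation matrix gives the same variances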
Principal Component Regression (PCR) is a technique based on PCA. In a traditional regression the dependent variable is regressed on the explanatory variables, while in PCR principal components of the explanatory variables are used as regressors. In most cases the principal components with the highest variances are chosen. PCR has two major advantages: it can reduce dimensionality and it can overcome the problem of multicollinearity.
Usually the procedure is conducted as follows:
1. Standardize the explanatory variables and run PCA on them.
2. Choose the number of components \(M\) (e.g. by cross-validation).
3. Regress the dependent variable on the first \(M\) principal components.
4. If needed, transform the estimated coefficients back to the scale of the original variables.
Mathematically the shortened procedure can be written:
Let \(z_m = Xv_m\) be the derived inputs with orthogonal columns. \(y\) is regressed on \(z_1, ..., z_M\) for some \(M \leq p\). Since the \(z_m\) are orthogonal, this regression is just a sum of univariate regressions:
\[ \hat{y}^{pcr}_{(M)} = \bar{y}\mathbf{1} + \sum_{m=1}^{M} \hat{\theta}_m z_m, \qquad \hat{\theta}_m = \frac{\langle z_m, y\rangle}{\langle z_m, z_m\rangle}. \]
If \(M = p\) we would get the usual least squares estimates, since the columns of \(Z = UD\) span the column space of \(X\). For \(M < p\) we get a reduced regression.
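A “by hand” PCR sketch on synthetic data (illustrative only, base R): regress \(y\) on the first \(M\) principal components; because the \(z_m\) are orthogonal, each coefficient is a simple univariate regression, exactly as in the formula above.
set.seed(2)
n <- 100; p <- 5; M <- 3                      # illustrative sizes
X <- scale(matrix(rnorm(n * p), n, p))        # centred and scaled inputs
y <- rnorm(n)
Z <- prcomp(X)$x                              # derived inputs z_1, ..., z_p
theta <- sapply(1:M, function(m) sum(Z[, m] * y) / sum(Z[, m]^2))
y.hat.pcr <- mean(y) + Z[, 1:M] %*% theta     # reduced (rank-M) fit
# with M = p the fit reproduces ordinary least squares on X:
theta.all <- sapply(1:p, function(m) sum(Z[, m] * y) / sum(Z[, m]^2))
max(abs(mean(y) + Z %*% theta.all - fitted(lm(y ~ X))))   # ~ 0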
In the HD setting, without imposing restrictions on \(\beta\), life becomes difficult. However, often some reasonable restrictions can be imposed. We call \(\beta\) sparse if most of its components are zero. Assuming sparsity allows us to do variable selection (via information criteria or cross-validation), i.e. to find a trade-off between fit and model complexity.
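As a tiny illustration of selection via an information criterion (not part of the original analysis; simulated data, base R only):
# Backward selection with a BIC-type penalty on simulated sparse data:
# only x1 and x2 truly enter the model, x3 and x4 are noise.
set.seed(4)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 1.5 * x1 - 2 * x2 + rnorm(n)
full <- lm(y ~ x1 + x2 + x3 + x4)
step(full, direction = "backward", k = log(n))   # k = log(n) corresponds to BIC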
For finite, positive \(q\) we can define the penalized LS estimator:
\[ \hat{\beta} = \arg\min_{\beta} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_q^q \right\}, \qquad \lVert \beta \rVert_q^q = \sum_{j=1}^{p} |\beta_j|^q, \]
which is also called regularized least squares, with \(\lambda\lVert\beta\rVert_q^q\) being the regularizer. The first part of the sum corresponds to the lack of fit, the second one to model complexity. \(\lambda \geq 0\) is a tuning parameter controlling how strongly the coefficients are penalized (and hence how many variables are effectively screened out); it is usually chosen by cross-validation.
For \(q=1\) we get the lasso solution and for \(q=2\) - ridge regression.
However, these old, well-known methods have limitations. For example, the lasso tends to select too many variables and does not perform well when features are highly correlated. In such cases we may use the Elastic Net estimator:
\[ \hat{\beta}^{EN} = \arg\min_{\beta} \left\{ \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \right\}, \]
which combines the \(\ell_1\) and \(\ell_2\) penalties. The lasso also ignores any ordering of the features; for such problems more general methods (which impose additional structure on the penalty) are recommended, such as the Adaptive Lasso, the Fused Lasso or the CARDS algorithm.
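A short sketch of these estimators with the glmnet package (assumed installed; it is not used elsewhere in this study, and the simulated data are illustrative). The mixing parameter alpha gives the lasso for alpha = 1, ridge for alpha = 0 and the Elastic Net in between; lambda is chosen by cross-validation:
library(glmnet)
set.seed(3)
n <- 100; p <- 50                                  # p of the same order as n
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 1, rep(0, p - 3))               # sparse true coefficient vector
y <- as.numeric(X %*% beta + rnorm(n))
cv.lasso <- cv.glmnet(X, y, alpha = 1)             # lasso
cv.ridge <- cv.glmnet(X, y, alpha = 0)             # ridge
cv.enet  <- cv.glmnet(X, y, alpha = 0.5)           # Elastic Net
coef(cv.lasso, s = "lambda.min")                   # most noise coefficients are set to zero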
First, I’m loading and browsing the dataset.
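The chunks below rely on several packages. A setup chunk along the following lines is assumed (package names inferred from the functions used later in this document):
library(readxl)      # read_excel()
library(dplyr)       # the %>% pipe
library(clusterSim)  # data.Normalization()
library(corrplot)    # correlation plots
library(pls)         # pcr(), validationplot(), coefplot(), RMSEP()
library(factoextra)  # fviz_eig(), fviz_pca_var(), fviz_pca_ind()
library(ggfortify)   # autoplot() for prcomp objects
library(pca3d)       # pca2d(), pca3d()
library(knitr)       # knit_hooks
library(rgl)         # hook_webgl for embedding 3D plots
# create.random.matrix(), used further below, comes from an additional package not listed here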
suicide <- read_excel("suicide.xlsx")
View(suicide)
summary(suicide)
## country suicide GDPpc unempl
## Length:181 Min. : 0.500 Min. : 237.4 Min. :0.00140
## Class :character 1st Qu.: 4.700 1st Qu.: 1587.1 1st Qu.:0.04120
## Mode :character Median : 7.800 Median : 4808.6 Median :0.06180
## Mean : 9.235 Mean : 12299.5 Mean :0.08024
## 3rd Qu.:12.500 3rd Qu.: 14912.7 3rd Qu.:0.10200
## Max. :31.900 Max. :100738.7 Max. :0.30600
## alcohol fertility cpi
## Min. : 0.00 Min. :1.172 Min. :9.800e+01
## 1st Qu.: 2.40 1st Qu.:1.733 1st Qu.:1.100e+02
## Median : 6.30 Median :2.334 Median :1.200e+02
## Mean : 6.14 Mean :2.803 Mean :3.669e+07
## 3rd Qu.: 9.50 3rd Qu.:3.760 3rd Qu.:1.440e+02
## Max. :15.20 Max. :7.239 Max. :6.641e+09
suicide$country <- NULL
dim(suicide)
## [1] 181 6
Now, I’m splitting the dataset into train and test parts.
suicide.test <- suicide[151:181,]
suicide.train <- suicide[1:150,]
suicide.y <- suicide.train$suicide
suicide.train$suicide <- NULL
Then, I’m normalizing the data and checking correlation between variables.
dim(suicide.train)
## [1] 150 5
suicide.stand<- as.data.frame(apply(suicide.train[, 1:5], 2, function(x) (x - mean(x))/sd(x)))
suicide.y.stand <- data.Normalization(suicide.y, type="n1", normalization="column")
res <- cor(suicide.train, method="pearson")
corrplot::corrplot(res, order = "alphabet")
corrplot::corrplot(res, method= "color", order = "hclust", tl.pos = 'n')
The two correlation plots show mild correlation between some of the variables. To obtain a detailed correlation matrix, together with a second matrix of p-values for the significance of the correlations, I’m running some more commands.
source("http://www.sthda.com/upload/rquery_cormat.r")
rquery.cormat(suicide)
## $r
## cpi GDPpc suicide alcohol unempl fertility
## cpi 1
## GDPpc -0.031 1
## suicide 0.21 0.26 1
## alcohol 0.092 0.38 0.57 1
## unempl -0.092 -0.13 -0.025 0.08 1
## fertility -0.059 -0.48 -0.33 -0.4 -0.11 1
##
## $p
## cpi GDPpc suicide alcohol unempl fertility
## cpi 0
## GDPpc 0.68 0
## suicide 0.0044 0.00037 0
## alcohol 0.22 9.6e-08 8.3e-17 0
## unempl 0.22 0.084 0.74 0.29 0
## fertility 0.43 1e-11 7.8e-06 3.3e-08 0.14 0
##
## $sym
## cpi GDPpc suicide alcohol unempl fertility
## cpi 1
## GDPpc 1
## suicide 1
## alcohol . . 1
## unempl 1
## fertility . . . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
We can see that the highest correlation between variables is equal to 0.57. However, many pairs are significantly correlated (p-value less than 0.05): suicide with cpi, alcohol and fertility; GDPpc with alcohol and fertility; and alcohol with fertility.
Now, let’s run PCA.
suicide.pca1 <- prcomp(suicide.stand, center=TRUE, scale.=TRUE)
summary(suicide.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.3545 1.0549 0.9936 0.7826 0.67291
## Proportion of Variance 0.3669 0.2225 0.1975 0.1225 0.09056
## Cumulative Proportion 0.3669 0.5895 0.7870 0.9094 1.00000
The quality of this PCA is good enough. By construction we obtain as many principal components as there are independent variables, so the 5 components together explain 100% of the variance, with the first one alone explaining about 37%. The validation plots later on will show what the optimal number of components is. I can then check what the individual parts of the PCA output look like.
# X=UD
suicide.pca1$x [,1:5] %>% head(1)
## PC1 PC2 PC3 PC4 PC5
## [1,] -1.985734 0.01469254 0.09627027 0.3186402 0.5266854
# V matrix - eigenvectors
suicide.pca1$rotation
## PC1 PC2 PC3 PC4 PC5
## GDPpc 0.57282151 -0.14603512 0.37533527 0.1654400 0.6944799
## unempl 0.04958360 0.80674071 -0.44961037 0.1980481 0.3245587
## alcohol 0.54874028 0.08907206 -0.21971588 -0.7916179 -0.1265554
## fertility -0.60165291 -0.06551524 0.01687948 -0.5247764 0.5983697
## cpi 0.07946611 -0.56179809 -0.78000895 0.1770801 0.1956955
# Score matrix - each column holds the scores of all observations on one principal component
suicide.pca1$x
## PC1 PC2 PC3 PC4 PC5
## [1,] -1.985733808 0.014692540 0.096270271 0.318640201 0.52668537
## [2,] 0.415218294 1.095788457 -0.696368359 0.306294904 -0.49283659
## [3,] -0.954988454 0.263837950 0.013862358 0.974897923 -0.09892425
## [4,] -1.562247086 -0.072167413 -0.072453303 -1.256317838 0.85800282
## [5,] 0.509385862 0.323673036 -0.072290501 0.209715632 -0.17278593
## [6,] 0.694633430 0.206523571 -0.154347788 -0.469050861 -0.29718589
## [7,] 0.204480466 1.424717993 -0.806870958 0.803156768 -0.35302162
## [8,] 2.241501938 -0.454906021 0.832522205 -0.165041569 0.83426478
## [9,] 2.339712462 -0.337044467 0.641921541 -0.271232191 0.46912612
## [10,] -0.643131320 -0.364996369 0.379698568 1.149676344 -0.73079492
## [11,] 0.849035907 0.480361529 0.222018831 1.040829554 0.56012215
## [12,] 0.057946346 -0.992119900 1.006380724 0.970556084 -0.15588356
## [13,] -0.909017304 -0.451419566 0.414121846 1.191678466 -0.77172943
## [14,] 1.003519709 0.329583244 -0.151843323 -0.177929055 -0.32278577
## [15,] 1.773802354 -7.605131945 -9.368733652 1.319351424 1.07803377
## [16,] 2.209919489 -0.073310543 0.411857482 -0.421712108 0.49812180
## [17,] -0.068287448 0.096061433 -0.104126635 -0.071251253 -0.46428548
## [18,] -1.814460385 -0.752330159 0.410120390 -0.532713301 0.27650731
## [19,] -0.775273094 -0.693128718 0.549897215 1.053301609 -0.85354062
## [20,] -0.576765267 -0.567856516 0.305208063 -0.027047730 -0.57222792
## [21,] 0.527788034 2.345861051 -1.332051570 0.963889008 -0.07863454
## [22,] 0.195504998 1.285230608 -0.803959994 -0.164512759 0.13740915
## [23,] 0.571762325 0.599738397 -0.354283481 0.175299196 -0.49842616
## [24,] 0.125591583 -0.320003242 0.770080988 1.535022814 0.29352174
## [25,] 1.222592062 0.204178184 -0.348362180 -0.805761023 -0.98693543
## [26,] -1.280135530 -0.165792020 -0.134363985 -1.530961969 0.47642275
## [27,] -1.570924722 -0.809867360 0.242150242 -1.685964156 0.38648691
## [28,] -0.166512131 0.409684176 -0.269383460 0.240748060 -0.46506460
## [29,] -0.275294796 -0.863873227 0.366382377 -0.369131636 -0.99107034
## [30,] -0.897270564 -0.399950908 -0.009642107 -1.474708127 0.09495233
## [31,] 1.868555980 -0.251754694 0.660273332 0.200227417 0.55183143
## [32,] -1.713281167 -0.273530927 0.125732650 -0.437616577 0.39037604
## [33,] -2.414114064 -0.413406047 0.265306785 -0.522581739 0.90661410
## [34,] 0.886174831 -0.041310364 0.033900171 -0.225560206 -0.55644598
## [35,] 0.466208021 -0.296435447 0.161747607 0.102435990 -0.90553480
## [36,] 0.133645515 0.158110300 -0.079322438 0.370477290 -0.66899984
## [37,] -1.796673835 -0.535068278 0.382312532 0.158417136 0.15677058
## [38,] -2.367243223 -0.664813197 0.355230036 -0.859662658 0.82928269
## [39,] -0.965674620 0.367946426 -0.390022284 -1.055265020 0.43151654
## [40,] 0.240759681 0.169900788 0.058592978 0.662485914 -0.39317202
## [41,] -1.058987225 -0.634626315 0.141126959 -1.509565212 0.11940109
## [42,] 0.992629516 0.798212571 -0.441252545 0.176671055 -0.44743841
## [43,] 1.641653101 0.725218197 -0.281929833 -0.052664590 -0.07798473
## [44,] 1.768012146 -0.320803290 0.065566086 -1.146284712 -0.75753967
## [45,] 2.386614292 -0.426989591 0.890049130 -0.039386813 0.97107136
## [46,] -1.141323182 -0.290076781 0.302314607 0.865336583 -0.36456401
## [47,] 0.038961701 -0.221669776 0.097842965 -0.133376588 -0.55873918
## [48,] -0.346748541 -0.384910480 0.275085728 0.274677141 -0.53351975
## [49,] -1.231897203 0.517158601 -0.121770022 0.946594390 0.22575116
## [50,] -0.336081615 -0.383977522 0.269686379 0.534068659 -0.79141102
## [51,] -0.297197649 -0.167762401 -0.099772791 -1.778922516 0.43387263
## [52,] -1.640676146 -0.244749309 0.204164341 0.229884235 0.15212842
## [53,] 1.397186782 -0.013423483 -0.006147975 -0.544592689 -0.55483619
## [54,] 0.166674757 2.519890052 -1.629766024 -0.330963521 0.55264288
## [55,] -1.480848797 -0.391689634 0.227028730 -0.117411608 0.07078858
## [56,] -0.556370523 -0.198686893 0.213005747 0.568933904 -0.42749820
## [57,] 2.131093118 0.007925676 0.461150865 -0.088117282 0.65795508
## [58,] 2.030404341 0.248527502 0.132785859 -0.587382880 0.53252213
## [59,] 0.160272457 1.654298470 -1.147337771 -1.060950896 0.65374682
## [60,] -1.863661719 0.129348040 -0.125320213 -0.640253319 0.79072327
## [61,] 0.553267791 0.689185766 -0.570941993 -0.344296813 -0.62752717
## [62,] 2.469935841 -0.520582425 0.628484533 -0.695923273 0.22008473
## [63,] -1.395113372 -0.753862447 0.449865486 -0.098119070 -0.13301145
## [64,] 1.488352613 2.124382153 -1.143225592 0.305644243 0.24375013
## [65,] 0.738664319 2.192219892 -1.286784922 0.151513448 0.31018940
## [66,] -0.897348337 -0.661251699 0.470131688 0.388681530 -0.42899129
## [67,] -1.967410194 -0.529789974 0.355537839 -0.106151056 0.37393720
## [68,] -1.406098935 -0.234858542 0.051827317 -0.625992025 0.25253983
## [69,] -0.104337189 0.584515796 -0.367670201 0.120553941 -0.27140292
## [70,] -0.458925623 0.875239799 -0.581037798 0.091387964 -0.11686632
## [71,] -0.510257353 -0.351936207 0.210191103 0.325540892 -0.67714612
## [72,] 1.249744819 -0.180862800 0.012123625 -0.557129855 -0.89147322
## [73,] 2.387185412 -0.928817429 1.340841404 0.136404836 1.16791557
## [74,] -0.261214144 -0.453406513 0.188937670 0.016217302 -0.87438672
## [75,] -0.846242254 -0.495432865 0.440968394 0.956957160 -0.60417366
## [76,] -0.391034900 0.594117072 -0.135755574 1.466051592 -0.42055429
## [77,] -1.714477693 -0.122719025 0.238543406 0.392688929 0.52212986
## [78,] 2.999648703 -0.253226293 0.867401419 -0.450489892 1.50269915
## [79,] 0.352247053 -0.669231494 0.987179720 0.457407299 1.03935395
## [80,] 1.440298050 0.432551267 0.136988432 0.592216266 0.24979815
## [81,] -0.133941843 0.745309566 -0.359621888 0.754203782 -0.33907308
## [82,] 1.677781260 -0.730489023 0.907236345 0.275324735 0.17135874
## [83,] -1.199489550 0.880347049 -0.327352185 0.943281533 0.43880593
## [84,] 0.034415525 -0.295431675 0.118273973 -0.408856286 -0.43793738
## [85,] -1.175146958 0.455453554 -0.251326527 0.112412283 0.26400583
## [86,] -1.321680591 2.848187122 -1.459117061 1.353163778 1.25573878
## [87,] 1.702938257 -0.499376487 0.497413059 -0.126242721 -0.44550144
## [88,] 0.008878048 -0.940086401 1.137790507 1.429204071 0.12217589
## [89,] -0.527028357 0.000488651 -0.105541935 -0.262949819 -0.38454998
## [90,] 0.184011247 -0.742300285 0.166268128 -1.089831141 -0.97404995
## [91,] 1.412493848 0.409120865 -0.359006577 -0.778937494 -0.54240292
## [92,] -0.295429019 -0.174711733 0.324275863 1.195516647 -0.58810718
## [93,] -0.520695865 2.614967562 -1.515673120 0.611059138 0.70705139
## [94,] -1.289998170 -0.697967719 0.267097426 -0.915090334 -0.00445978
## [95,] -0.745873663 1.313267819 -0.506717681 1.606357791 0.18250317
## [96,] 1.699399763 0.217814614 -0.322317358 -1.220022769 -0.66274010
## [97,] 4.418258008 -0.751451198 1.784573563 0.027753906 2.71929262
## [98,] 0.673866906 2.198481118 -1.314655921 0.540565769 -0.10865667
## [99,] -1.627990488 -0.830219402 0.502460245 -0.050371301 -0.09122334
## [100,] -1.535657087 -0.288891964 0.121058352 -0.408701198 0.22935544
## [101,] -0.495554370 -0.615935900 0.611373758 1.099407417 -0.54220806
## [102,] -0.261737798 -0.380117004 0.413015930 0.794453606 -0.47484572
## [103,] -2.468048046 -0.163851204 0.134919414 -0.462144634 1.07054131
## [104,] 1.220567402 -0.405568213 0.476604386 0.164102216 -0.32598831
## [105,] -2.002730839 0.132504792 0.047431201 0.377959892 0.62345874
## [106,] 0.170850957 -0.034958714 0.191087655 0.961008876 -0.69491496
## [107,] 0.134406455 -0.439399280 0.267201324 -0.001984059 -0.66640956
## [108,] -0.888050149 1.056166451 -0.510728506 0.715405396 0.29237065
## [109,] 1.460168391 -0.116473834 -0.362093921 -1.323993929 -1.58972412
## [110,] -0.133787160 0.023927944 -0.117501434 -0.330326196 -0.46364565
## [111,] 0.617413504 1.401764403 -0.836292598 0.336552562 -0.28164706
## [112,] -0.909587559 0.177317565 0.060538801 1.105064135 -0.29748102
## [113,] -1.844563845 2.125995345 -1.180716047 0.185792065 1.56080039
## [114,] -0.364820625 -0.811351123 0.417319235 0.139523948 -1.06118155
## [115,] 0.045020603 2.102033566 -1.366669564 -0.515989074 0.61855306
## [116,] -0.686920438 -0.569757852 0.388534479 0.765135561 -0.91765764
## [117,] 1.919190083 -0.414057441 0.814741812 0.215509856 0.66719182
## [118,] 1.900836199 -0.451855889 0.660660950 -0.319349674 0.43597513
## [119,] -0.249007527 -0.340747026 0.155439141 0.191904581 -0.84841792
## [120,] -3.163597457 -1.191604766 0.715834228 -1.004164912 1.20842064
## [121,] -0.623276730 0.007903968 -0.415856746 -2.534659788 0.49457451
## [122,] 2.576868654 -0.829575795 1.530053444 0.620922405 1.69153020
## [123,] -0.597710441 -0.718454772 0.760179019 0.926176895 -0.04516988
## [124,] -1.475832545 -0.576554196 0.453506796 0.593054477 -0.20213434
## [125,] 0.373265199 -0.415561797 0.294625049 -0.318342692 -0.29939942
## [126,] -1.411868773 -0.728988186 0.517336570 0.328812359 -0.17219136
## [127,] -0.040899406 -0.225594196 0.040205247 -0.245730760 -0.66564363
## [128,] -0.070374035 -0.478127012 0.252054692 -0.081732601 -0.68264237
## [129,] -0.372124392 -0.575645121 0.233586771 -0.394736235 -0.63044396
## [130,] 1.327421381 -0.032272816 -0.083870051 -0.515671024 -0.91657626
## [131,] 1.712454785 0.548752851 -0.306862524 -0.415789650 -0.38412217
## [132,] 1.355646800 -1.437463523 1.887584736 1.323750210 1.23140925
## [133,] 1.243947548 -0.032237505 -0.177656317 -0.838711491 -0.96408326
## [134,] 1.023594830 -0.099252981 -0.119895151 -0.754237732 -0.91320266
## [135,] -0.569163414 -0.741334039 0.176923720 -1.280093907 -0.45079874
## [136,] -1.283869850 0.019478872 0.070442535 0.162912209 0.29338907
## [137,] -0.973649203 0.738755999 -0.549302674 -0.692828519 0.53644222
## [138,] -0.433657337 -0.460780871 0.726944696 1.212367149 0.23650445
## [139,] -1.997989719 -0.496847826 0.366081652 0.051733282 0.38008345
## [140,] 1.037473993 1.177753509 -0.860886525 -0.257071296 -0.65617759
## [141,] 0.805654804 -0.470386629 0.169006631 -1.210863070 -0.32544354
## [142,] -1.229862213 -0.404473208 0.109766753 -0.777465938 0.06205166
## [143,] 1.541726591 -1.155699725 1.670165095 1.607082216 0.84175085
## [144,] 1.445528414 0.376705624 -0.237831428 -0.377042344 -0.52955578
## [145,] 1.672921314 0.133586056 -0.060547174 -0.650909516 -0.36539471
## [146,] -1.488573868 -0.807179044 0.543111247 0.195061635 -0.14175106
## [147,] -2.753474410 -0.428110332 0.326157850 -0.358472710 1.09624914
## [148,] 0.449743745 2.541543375 -1.565361129 0.051883772 0.41013658
## [149,] -0.670117423 0.580962845 -0.662560241 -1.706010815 0.42559012
## [150,] 1.699731959 1.539432606 -0.650199347 0.342368755 0.41054336
# Standard deviations of the components (square roots of the eigenvalues)
suicide.pca1$sdev
## [1] 1.3545382 1.0548554 0.9936200 0.7825733 0.6729076
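The reported values are the square roots of the eigenvalues, so the proportions of variance shown in the PCA summary can be recovered directly as a quick check:
ev <- suicide.pca1$sdev^2          # eigenvalues
round(ev / sum(ev), 4)             # proportion of variance per component
round(cumsum(ev) / sum(ev), 4)     # cumulative proportion, reaching 1 at the 5th component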
res1 <- cor(suicide.pca1$x, method="pearson")
corrplot::corrplot(res1, method = "color", order = "hclust", tl.pos = 'n')
The correlation plot of the principal component scores is clean, which confirms that the components are indeed orthogonal. Now, let’s do some more plotting to visualise this PCA.
plot(summary(suicide.pca1)$importance[3,])
fviz_eig(suicide.pca1)
The first two plots show the variance explained: the cumulative proportion reaches 1 with all 5 components, as could already be seen in the initial PCA summary, while the scree plot shows each component’s individual share.
fviz_pca_var(suicide.pca1, col.var="black")
The third plot presents how strongly the loading of each variable contributes to the first two principal components.
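The numbers behind this plot can also be inspected directly (a small sketch, assuming the factoextra package is loaded):
var.info <- get_pca_var(suicide.pca1)
round(var.info$contrib, 2)   # contribution (in %) of each variable to each component
round(var.info$cos2, 2)      # quality of representation of each variable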
fviz_pca_ind(suicide.pca1, col.ind="cos2", geom = "point", gradient.cols = c("white", "#2E9FDF", "#FC4E07" ))
The next plot shows the unlabelled observations in two dimensions, coloured by their quality of representation (cos2). The points seem fairly evenly distributed, except for one outlier.
autoplot(prcomp(suicide, center=TRUE, scale.=TRUE), data=suicide, loadings = TRUE, loadings.colour = 'blue', loadings.label = TRUE, loadings.label.size = 3)
pca2d(suicide.pca1, biplot=TRUE, biplot.vars=3)
The two plots above show the same pattern. Let’s now investigate the output in three dimensions.
# a trick for including 3d charts in html output
knit_hooks$set(webgl = hook_webgl)
pca3d(suicide.pca1)
## [1] 0.08836516 0.15210264 0.18737467
## Creating new device
It can be seen on the 3D plot that the output is almost linear.
Now, I’m generating 100 artificial independent variables from the Gaussian distribution and merging them with my initial dataset. First, I need to check the standard deviations of my variables in order to simulate data with appropriate variance.
sapply(suicide, sd, na.rm = TRUE)
## suicide GDPpc unempl alcohol fertility cpi
## 6.009867e+00 1.745159e+04 6.123073e-02 4.130803e+00 1.354098e+00 4.936389e+08
x = create.random.matrix(100, 181, st.dev = 1, perturb = 0.2)
# merge() finds no common columns here, so it returns the Cartesian product of the rows
# (hence the 32761 = 181 x 181 rows reported by pcr() below); cbind() would simply append the columns
suicidex <- merge(suicide, x)
View(suicidex)
I have to split this new dataset into train and test data.
suicide.test <- suicidex[151:181,]
suicide.train <- suicidex[1:150,]
suicide.y <- suicide.train$suicide
suicide.train$suicide <- NULL
Now I’m going to run PCR on both the augmented (artificial) and the original dataset and prepare the data for forecasting.
reg.pcr.x<- pcr(suicide~., data=suicidex, scale=TRUE, validation="CV")
summary(reg.pcr.x)
## Data: X dimension: 32761 105
## Y dimension: 32761 1
## Fit method: svdpc
## Number of components considered: 105
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 5.993 5.994 5.995 5.995 5.995 5.995 5.996
## adjCV 5.993 5.994 5.995 5.995 5.995 5.995 5.996
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5.996 5.996 5.997 5.997 5.997 5.998 5.998
## adjCV 5.996 5.996 5.997 5.997 5.997 5.998 5.999
## 14 comps 15 comps 16 comps 17 comps 18 comps 19 comps 20 comps
## CV 5.999 5.998 5.934 5.217 5.214 5.213 5.212
## adjCV 6.000 6.002 6.006 5.208 5.208 5.210 5.210
## 21 comps 22 comps 23 comps 24 comps 25 comps 26 comps 27 comps
## CV 5.211 5.211 5.211 5.21 5.211 5.211 5.211
## adjCV 5.210 5.210 5.210 5.21 5.210 5.211 5.211
## 28 comps 29 comps 30 comps 31 comps 32 comps 33 comps 34 comps
## CV 5.212 5.212 5.212 5.213 5.213 5.213 5.213
## adjCV 5.211 5.211 5.212 5.212 5.212 5.212 5.213
## 35 comps 36 comps 37 comps 38 comps 39 comps 40 comps 41 comps
## CV 5.213 5.213 5.214 5.214 5.214 5.178 5.176
## adjCV 5.213 5.213 5.213 5.213 5.213 5.177 5.176
## 42 comps 43 comps 44 comps 45 comps 46 comps 47 comps 48 comps
## CV 5.177 5.177 5.177 5.178 5.147 5.070 5.065
## adjCV 5.177 5.177 5.177 5.179 5.176 5.064 5.064
## 49 comps 50 comps 51 comps 52 comps 53 comps 54 comps 55 comps
## CV 5.065 5.065 5.065 5.065 5.065 5.065 5.066
## adjCV 5.064 5.064 5.064 5.065 5.065 5.065 5.065
## 56 comps 57 comps 58 comps 59 comps 60 comps 61 comps 62 comps
## CV 5.066 5.066 5.066 5.067 5.067 5.068 5.068
## adjCV 5.065 5.066 5.066 5.066 5.067 5.068 5.068
## 63 comps 64 comps 65 comps 66 comps 67 comps 68 comps 69 comps
## CV 5.061 4.825 4.824 4.824 4.824 4.824 4.824
## adjCV 5.069 4.820 4.822 4.823 4.823 4.823 4.823
## 70 comps 71 comps 72 comps 73 comps 74 comps 75 comps 76 comps
## CV 4.824 4.825 4.825 4.825 4.825 4.796 4.797
## adjCV 4.824 4.824 4.824 4.824 4.824 4.795 4.796
## 77 comps 78 comps 79 comps 80 comps 81 comps 82 comps 83 comps
## CV 4.797 4.797 4.797 4.797 4.797 4.797 4.797
## adjCV 4.796 4.796 4.796 4.796 4.797 4.797 4.797
## 84 comps 85 comps 86 comps 87 comps 88 comps 89 comps 90 comps
## CV 4.798 4.798 4.798 4.798 4.798 4.798 4.799
## adjCV 4.797 4.797 4.797 4.797 4.798 4.798 4.798
## 91 comps 92 comps 93 comps 94 comps 95 comps 96 comps 97 comps
## CV 4.799 4.799 4.799 4.799 4.799 4.799 4.800
## adjCV 4.798 4.798 4.798 4.799 4.799 4.799 4.799
## 98 comps 99 comps 100 comps 101 comps 102 comps 103 comps
## CV 4.800 4.800 4.800 4.800 4.800 4.8
## adjCV 4.799 4.799 4.799 4.799 4.799 4.8
## 104 comps 105 comps
## CV 4.801 4.801
## adjCV 4.800 4.800
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 2.705 5.347 7.913 10.3 12.64 14.91 17.1 19.24
## suicide 0.000 0.000 0.000 0.0 0.00 0.00 0.0 0.00
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 21.34 23.41 25.41 27.4 29.31 31.19 33
## suicide 0.00 0.00 0.00 0.0 0.00 0.00 0
## 16 comps 17 comps 18 comps 19 comps 20 comps 21 comps 22 comps
## X 34.77 36.53 38.24 39.95 41.59 43.23 44.84
## suicide 0.00 24.55 24.55 24.55 24.55 24.55 24.55
## 23 comps 24 comps 25 comps 26 comps 27 comps 28 comps 29 comps
## X 46.40 47.96 49.49 50.95 52.39 53.80 55.12
## suicide 24.55 24.55 24.55 24.55 24.55 24.55 24.55
## 30 comps 31 comps 32 comps 33 comps 34 comps 35 comps 36 comps
## X 56.43 57.71 58.99 60.24 61.46 62.64 63.79
## suicide 24.55 24.55 24.55 24.55 24.55 24.55 24.55
## 37 comps 38 comps 39 comps 40 comps 41 comps 42 comps 43 comps
## X 64.92 66.05 67.14 68.19 69.23 70.26 71.26
## suicide 24.55 24.55 24.55 25.63 25.63 25.63 25.63
## 44 comps 45 comps 46 comps 47 comps 48 comps 49 comps 50 comps
## X 72.21 73.17 74.11 75.04 75.98 76.88 77.74
## suicide 25.63 25.63 25.63 28.84 28.84 28.84 28.84
## 51 comps 52 comps 53 comps 54 comps 55 comps 56 comps 57 comps
## X 78.59 79.42 80.21 80.98 81.74 82.49 83.23
## suicide 28.84 28.84 28.84 28.84 28.84 28.84 28.84
## 58 comps 59 comps 60 comps 61 comps 62 comps 63 comps 64 comps
## X 83.94 84.61 85.24 85.86 86.47 87.07 87.65
## suicide 28.84 28.84 28.84 28.84 28.84 28.84 35.50
## 65 comps 66 comps 67 comps 68 comps 69 comps 70 comps 71 comps
## X 88.23 88.8 89.34 89.87 90.39 90.88 91.34
## suicide 35.50 35.5 35.50 35.50 35.50 35.50 35.50
## 72 comps 73 comps 74 comps 75 comps 76 comps 77 comps 78 comps
## X 91.8 92.25 92.69 93.11 93.53 93.93 94.30
## suicide 35.5 35.50 35.50 36.27 36.27 36.27 36.27
## 79 comps 80 comps 81 comps 82 comps 83 comps 84 comps 85 comps
## X 94.67 95.02 95.36 95.68 95.99 96.29 96.58
## suicide 36.27 36.27 36.27 36.27 36.27 36.27 36.27
## 86 comps 87 comps 88 comps 89 comps 90 comps 91 comps 92 comps
## X 96.86 97.13 97.38 97.63 97.86 98.07 98.29
## suicide 36.27 36.27 36.27 36.27 36.27 36.27 36.27
## 93 comps 94 comps 95 comps 96 comps 97 comps 98 comps 99 comps
## X 98.49 98.68 98.86 99.02 99.18 99.33 99.46
## suicide 36.27 36.27 36.27 36.27 36.27 36.27 36.27
## 100 comps 101 comps 102 comps 103 comps 104 comps 105 comps
## X 99.58 99.68 99.77 99.86 99.94 100.00
## suicide 36.27 36.27 36.27 36.27 36.27 36.27
reg.pcr<- pcr(suicide~., data=suicide, scale=TRUE, validation="CV")
summary(reg.pcr)
## Data: X dimension: 181 5
## Y dimension: 181 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 6.027 410636 343029 631621 1247524 1245562
## adjCV 6.027 389683 325526 599392 1183869 1182006
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 36.93 59.04 78.70 91.03 100.00
## suicide 24.55 25.63 28.84 35.50 36.27
It can be seen that in both cases the same percentage of the variance of suicide is explained (36.27%). Therefore, adding artificial variables didn’t change anything in terms of accuracy; hence, PCR effectively picks out the proper variables. For the purpose of forecasting, I will use the original dataset without the artificial variables.
suicide.test <- suicide[151:181,]
suicide.train <- suicide[1:150,]
suicide.y <- suicide.train$suicide
suicide.train$suicide <- NULL
suicide.train.all=cbind(suicide.y, suicide.train)
test.x<-suicide.test[, 2:6]
test.y <- suicide.test[,1]
reg.pcr.train<- pcr(suicide.y~., data=suicide.train.all, na.action=na.omit, scale=TRUE, validation="CV")
reg.pcr.pred<-predict(reg.pcr.train, test.x, ncomp = 3)
reg.train<- lm(suicide.y~., data=suicide.train.all)
reg.pred<-predict(reg.train, test.x)
plot(reg.pcr.pred, col="red", cex=0.8)
points(test.y)
points(reg.pred, col="blue")
abline(v=(-10:100)*5, lty=3, col="grey80")
Since I have fitted and predicted values, I can plot them together. Red dots correspond to values predicted via PCR, black to the empirical ones and blue to values predicted using OLS. The empirical values lie further away from both the PCR and OLS predictions, but the latter two are pretty close to each other. The reason might be that the multicollinearity between the independent variables wasn’t severe (correlations up to 0.57, although most pairs were significantly correlated).
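As a quick numerical complement to the plot, the test-set accuracy of the two sets of predictions can be compared (a small sketch using the objects defined above; rmse is a hypothetical helper):
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(unlist(test.y), as.numeric(reg.pcr.pred))   # PCR predictions with 3 components
rmse(unlist(test.y), as.numeric(reg.pred))       # OLS predictions
Then I’m comparing the PCR coefficients from the full regression and from the train model.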
coefplot(reg.pcr)
coefplot(reg.pcr.train, add=TRUE, lwd=2)
It can be seen that both sets of coefficients are pretty close to each other - the quality of this forecast is good enough. Finally, I’m drawing validation plots to choose an optimal number of components, using different accuracy measures. The biggest change occurs at 4 components, so 4 can be pointed out as the optimal number of components.
validationplot(reg.pcr, val.type = "R2")
validationplot(reg.pcr.train, val.type = "R2", add=TRUE, lwd=2)
axis(side = 1, at = c(4), cex.axis=0.7)
abline(v = 4, col = "blue", lty = 3)
validationplot(reg.pcr, val.type = "RMSEP")
validationplot(reg.pcr.train, val.type = "RMSEP", add=TRUE, lwd=2)
axis(side = 1, at = c(4), cex.axis=0.7)
abline(v = 4, col = "blue", lty = 3)
validationplot(reg.pcr, val.type = "MSEP")
validationplot(reg.pcr.train, val.type = "MSEP", add=TRUE, lwd=2)
axis(side = 1, at = c(4), cex.axis=0.7)
abline(v = 4, col = "blue", lty = 3)
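The choice suggested by the plots can also be checked numerically with the cross-validated RMSEP, or with pls::selectNcomp() (a sketch; both functions come from the pls package already used above):
rmsep.cv <- RMSEP(reg.pcr.train)$val["adjCV", , ]
which.min(rmsep.cv) - 1                          # ncomp minimising adjusted CV error (first entry is the intercept-only model)
selectNcomp(reg.pcr.train, method = "onesigma")  # more parsimonious "one-sigma" rule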
PCA and PCR are useful Unsupervised Learning tools for reducing dimensionality. However, they can also be useful with smaller datasets: PCA can be used when we have a few similar variables which we would like to “merge” into one to make interpretation clearer. In this suicide dataset, the optimal number of components turned out to be 4 out of 5. The variables responsible for a surprisingly low percentage of the variance are cpi and alcohol consumption per capita (p-value of the t-test close to 0); excluding them would probably result in no change in fit. On the other hand, PCR is preferable to OLS when multicollinearity occurs. In this dataset the variables were collinear to some extent, but no perfect multicollinearity was present; therefore the PCR and OLS estimation results are pretty close to each other, as could be seen on the scatter plot.
Sources:
Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, Second Edition. Springer Series in Statistics.
Unsupervised Learning, Block 2, Learning Materials