Dimension Reduction Activity

The purpose of this activity is to use dimension reduction to reduce the dimensionality of data without losing much of the information it can provide. In other words we are reducing the importance of some of the attributes without any loss in data integrity. The method being used in this example is Principal Component Analysis (PCA).

# Read the cars data in excluding cateforical data
cars_numeric <- mtcars[, c(1:7, 10, 11)]
head (cars_numeric)
##                    mpg cyl disp  hp drat    wt  qsec gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22    3    1

Compute Correlation Matrix

We have 9 attributes in this data set, and therefor we will obtain 9 principal components from running a PCA on the dataset. Each of these explains the percentage of total variance in the dataset. Looking at the dataset below, for example, we can see that PC1 accounts for 62.8% of the total variance. This means that 62% of the information in the dataset can be encapsulated by the PC1 principal component.

cat("\nSummary of Cars PCA\n")
## 
## Summary of Cars PCA
cars_numeric.pca <- prcomp(cars_numeric, center = TRUE, scale. = TRUE)
summary(cars_numeric.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion  0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
##                           PC8     PC9
## Standard deviation     0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion  0.9975 1.00000
cat("\nCorrelation Matrix\n")
## 
## Correlation Matrix
cars_numeric.cor <- cor(cars_numeric)
cars_numeric.cor
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec       gear       carb
## mpg   0.41868403  0.4802848 -0.5509251
## cyl  -0.59124207 -0.4926866  0.5269883
## disp -0.43369788 -0.5555692  0.3949769
## hp   -0.70822339 -0.1257043  0.7498125
## drat  0.09120476  0.6996101 -0.0907898
## wt   -0.17471588 -0.5832870  0.4276059
## qsec  1.00000000 -0.2126822 -0.6562492
## gear -0.21268223  1.0000000  0.2740728
## carb -0.65624923  0.2740728  1.0000000

Compute the Variance Covariance Matrix

Generate a covariance matrix, which is a sqaure matrix giving the covariance between each pair of a given vector.

cars_numeric.cov = cov(cars_numeric)

cat("Covariance Matrix")
## Covariance Matrix
cars_numeric.cov
##              mpg         cyl        disp          hp         drat          wt
## mpg    36.324103  -9.1723790  -633.09721 -320.732056   2.19506351  -5.1166847
## cyl    -9.172379   3.1895161   199.66028  101.931452  -0.66836694   1.3673710
## disp -633.097208 199.6602823 15360.79983 6721.158669 -47.06401915 107.6842040
## hp   -320.732056 101.9314516  6721.15867 4700.866935 -16.45110887  44.1926613
## drat    2.195064  -0.6683669   -47.06402  -16.451109   0.28588135  -0.3727207
## wt     -5.116685   1.3673710   107.68420   44.192661  -0.37272073   0.9573790
## qsec    4.509149  -1.8868548   -96.05168  -86.770081   0.08714073  -0.3054816
## gear    2.135685  -0.6491935   -50.80262   -6.358871   0.27598790  -0.4210806
## carb   -5.363105   1.5201613    79.06875   83.036290  -0.07840726   0.6757903
##              qsec        gear        carb
## mpg    4.50914919   2.1356855 -5.36310484
## cyl   -1.88685484  -0.6491935  1.52016129
## disp -96.05168145 -50.8026210 79.06875000
## hp   -86.77008065  -6.3588710 83.03629032
## drat   0.08714073   0.2759879 -0.07840726
## wt    -0.30548161  -0.4210806  0.67579032
## qsec   3.19316613  -0.2804032 -1.89411290
## gear  -0.28040323   0.5443548  0.32661290
## carb  -1.89411290   0.3266129  2.60887097

Check Statistical Significance

cars_numeric.sig <- corr.p(cars_numeric.cor, 32, alpha=.05)
print(cars_numeric.sig, short=FALSE)
## Call:corr.p(r = cars_numeric.cor, n = 32, alpha = 0.05)
## Correlation matrix 
##        mpg   cyl  disp    hp  drat    wt  qsec  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00 -0.21 -0.66
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66  0.27  1.00
## Sample Size 
## [1] 32
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##       mpg cyl disp   hp drat   wt qsec gear carb
## mpg  0.00   0 0.00 0.00 0.00 0.00 0.14 0.06 0.02
## cyl  0.00   0 0.00 0.00 0.00 0.00 0.01 0.05 0.03
## disp 0.00   0 0.00 0.00 0.00 0.00 0.13 0.02 0.18
## hp   0.00   0 0.00 0.00 0.11 0.00 0.00 1.00 0.00
## drat 0.00   0 0.00 0.01 0.00 0.00 1.00 0.00 1.00
## wt   0.00   0 0.00 0.00 0.00 0.00 1.00 0.01 0.13
## qsec 0.02   0 0.01 0.00 0.62 0.34 0.00 1.00 0.00
## gear 0.01   0 0.00 0.49 0.00 0.00 0.24 0.00 0.77
## carb 0.00   0 0.03 0.00 0.62 0.01 0.00 0.13 0.00
## 
##  Confidence intervals based upon normal theory.  To get bootstrapped values, try cor.ci
##           lower     r upper    p
## mpg-cyl   -0.93 -0.85 -0.72 0.00
## mpg-disp  -0.92 -0.85 -0.71 0.00
## mpg-hp    -0.89 -0.78 -0.59 0.00
## mpg-drat   0.44  0.68  0.83 0.00
## mpg-wt    -0.93 -0.87 -0.74 0.00
## mpg-qsec   0.08  0.42  0.67 0.02
## mpg-gear   0.16  0.48  0.71 0.01
## mpg-carb  -0.75 -0.55 -0.25 0.00
## cyl-disp   0.81  0.90  0.95 0.00
## cyl-hp     0.68  0.83  0.92 0.00
## cyl-drat  -0.84 -0.70 -0.46 0.00
## cyl-wt     0.60  0.78  0.89 0.00
## cyl-qsec  -0.78 -0.59 -0.31 0.00
## cyl-gear  -0.72 -0.49 -0.17 0.00
## cyl-carb   0.22  0.53  0.74 0.00
## disp-hp    0.61  0.79  0.89 0.00
## disp-drat -0.85 -0.71 -0.48 0.00
## disp-wt    0.78  0.89  0.94 0.00
## disp-qsec -0.68 -0.43 -0.10 0.01
## disp-gear -0.76 -0.56 -0.26 0.00
## disp-carb  0.05  0.39  0.65 0.03
## hp-drat   -0.69 -0.45 -0.12 0.01
## hp-wt      0.40  0.66  0.82 0.00
## hp-qsec   -0.85 -0.71 -0.48 0.00
## hp-gear   -0.45 -0.13  0.23 0.49
## hp-carb    0.54  0.75  0.87 0.00
## drat-wt   -0.85 -0.71 -0.48 0.00
## drat-qsec -0.27  0.09  0.43 0.62
## drat-gear  0.46  0.70  0.84 0.00
## drat-carb -0.43 -0.09  0.27 0.62
## wt-qsec   -0.49 -0.17  0.19 0.34
## wt-gear   -0.77 -0.58 -0.29 0.00
## wt-carb    0.09  0.43  0.68 0.01
## qsec-gear -0.52 -0.21  0.15 0.24
## qsec-carb -0.82 -0.66 -0.40 0.00
## gear-carb -0.08  0.27  0.57 0.13

Test Assumption 1 and 2

# coerce data into a matrix and ignore headers
dat <- data.matrix(cars_numeric)  
paf.pca <- paf(dat, eigcrit=1, convcrit=.001)
summary(paf.pca)  # Notice Bartlett and KMO values and ignore the rest
## $KMO
## [1] 0.7737
## 
## $MSA
##          MSA
## mpg  0.92464
## cyl  0.86063
## disp 0.71852
## hp   0.83555
## drat 0.92452
## wt   0.68649
## qsec 0.66585
## gear 0.76358
## carb 0.57355
## 
## $Bartlett
## [1] 332.33
## 
## $Communalities
##      Initial Communalities Final Extraction
## mpg                0.85958          0.85411
## cyl                0.92480          0.91407
## disp               0.95435          0.90336
## hp                 0.89477          0.92262
## drat               0.70238          0.70892
## wt                 0.94397          0.81366
## qsec               0.82578          0.63180
## gear               0.79958          0.85598
## carb               0.87276          0.74790
## 
## $Factor.Loadings
##          [,1]      [,2]
## mpg   0.92358 -0.033209
## cyl  -0.95572 -0.025825
## disp -0.94224  0.124652
## hp   -0.87239 -0.401940
## drat  0.71425 -0.445841
## wt   -0.87181  0.231556
## qsec  0.50520  0.613656
## gear  0.49386 -0.782361
## carb -0.56232 -0.657034
## 
## $RMS
## [1] 0.03966

Test Assumption 3

Verify that the determinate is positive. If the determinate is positive then the assumptions tests have been completed.

det(cars_numeric.cor)
## [1] 4.8675e-06

Compute the PCA

cars_numeric.pca_1_components <- principal(cars_numeric.cor, rotate = "none")
cat("\nPCA Components\n")
## 
## PCA Components
cars_numeric.pca_1_components
## Principal Components Analysis
## Call: principal(r = cars_numeric.cor, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PC1   h2    u2 com
## mpg  -0.93 0.87 0.126   1
## cyl   0.96 0.92 0.083   1
## disp  0.94 0.89 0.107   1
## hp    0.87 0.76 0.238   1
## drat -0.74 0.55 0.450   1
## wt    0.89 0.79 0.211   1
## qsec -0.53 0.28 0.715   1
## gear -0.50 0.25 0.752   1
## carb  0.58 0.34 0.662   1
## 
##                 PC1
## SS loadings    5.66
## Proportion Var 0.63
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 component is sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.2 
## 
## Fit based upon off diagonal values = 0.9
cars_numeric.pca_all_components <- principal(cars_numeric.cor, nfactors=9, rotate="none")   #we calculate all 5 components
cat("\nAll PCA Components Calculated\n")
## 
## All PCA Components Calculated
cars_numeric.pca_all_components
## Principal Components Analysis
## Call: principal(r = cars_numeric.cor, nfactors = 9, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9 h2       u2 com
## mpg  -0.93  0.04 -0.16  0.00  0.14  0.25  0.12 -0.03  0.02  1  4.4e-16 1.3
## cyl   0.96  0.02 -0.18  0.02 -0.05  0.08  0.05  0.20  0.02  1  7.8e-16 1.2
## disp  0.94 -0.13 -0.06  0.17  0.21 -0.01  0.06 -0.02 -0.10  1 -2.2e-16 1.2
## hp    0.87  0.39 -0.01  0.04  0.13  0.12 -0.23 -0.04  0.04  1  2.2e-16 1.6
## drat -0.74  0.49  0.11  0.44 -0.07 -0.01 -0.02  0.03  0.01  1  7.8e-16 2.5
## wt    0.89 -0.25  0.32  0.10  0.08 -0.03  0.14 -0.05  0.08  1  4.4e-16 1.6
## qsec -0.53 -0.70  0.45 -0.02  0.06  0.09 -0.09  0.09 -0.03  1  5.6e-16 2.8
## gear -0.50  0.79  0.15 -0.15  0.24 -0.11  0.03  0.08  0.01  1  7.8e-16 2.2
## carb  0.58  0.70  0.33 -0.11 -0.17  0.13  0.07 -0.03 -0.05  1  4.4e-16 2.7
## 
##                        PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9
## SS loadings           5.66 2.08 0.50 0.27 0.18 0.12 0.11 0.06 0.02
## Proportion Var        0.63 0.23 0.06 0.03 0.02 0.01 0.01 0.01 0.00
## Cumulative Var        0.63 0.86 0.92 0.95 0.97 0.98 0.99 1.00 1.00
## Proportion Explained  0.63 0.23 0.06 0.03 0.02 0.01 0.01 0.01 0.00
## Cumulative Proportion 0.63 0.86 0.92 0.95 0.97 0.98 0.99 1.00 1.00
## 
## Mean item complexity =  1.9
## Test of the hypothesis that 9 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
## 
## Fit based upon off diagonal values = 1

Calculate the Residual Error

We need to account for the residual error, no matter how small the error is. This is done below:

alpha(cars_numeric.cor)
## Warning in alpha(cars_numeric.cor): Some items were negatively correlated with the total scale and probably 
## should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( mpg drat qsec gear ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## 
## Reliability analysis   
## Call: alpha(x = cars_numeric.cor)
## 
##   raw_alpha std.alpha G6(smc) average_r   S/N median_r
##      -0.56     -0.56     0.8    -0.041 -0.36    -0.15
## 
##  Reliability if an item is dropped:
##      raw_alpha std.alpha G6(smc) average_r    S/N var.r  med.r
## mpg      0.195     0.195    0.88     0.029  0.242  0.37 -0.108
## cyl     -1.034    -1.034    0.69    -0.068 -0.508  0.35 -0.150
## disp    -1.055    -1.055    0.65    -0.069 -0.513  0.36 -0.150
## hp      -1.835    -1.835    0.61    -0.088 -0.647  0.37 -0.194
## drat    -0.093    -0.093    0.87    -0.011 -0.085  0.41 -0.150
## wt      -1.045    -1.045    0.65    -0.068 -0.511  0.37 -0.169
## qsec     0.186     0.186    0.88     0.028  0.228  0.45  0.092
## gear    -0.369    -0.369    0.82    -0.035 -0.269  0.45 -0.133
## carb    -2.047    -2.047    0.55    -0.092 -0.672  0.43 -0.323
## 
##  Item statistics 
##           r r.cor r.drop
## mpg  -0.536 -0.66  -0.75
## cyl   0.574  0.61   0.20
## disp  0.583  0.63   0.21
## hp    0.804  0.85   0.56
## drat -0.078 -0.22  -0.44
## wt    0.578  0.62   0.20
## qsec -0.516 -0.66  -0.73
## gear  0.197  0.13  -0.21
## carb  0.846  0.89   0.64

Plot Results

cars_numeric.pca
## Standard deviations (1, .., p=9):
## [1] 2.37822 1.44295 0.71008 0.51481 0.42797 0.35184 0.32413 0.24190 0.14896
## 
## Rotation (n x k) = (9 x 9):
##           PC1       PC2       PC3        PC4      PC5       PC6       PC7
## mpg  -0.39315  0.027539 -0.221193 -0.0061264 -0.32076  0.720156 -0.381381
## cyl   0.40255  0.015710 -0.252316  0.0407003  0.11714  0.224325 -0.158933
## disp  0.39735 -0.088885 -0.078251  0.3394937 -0.48678 -0.019675 -0.182331
## hp    0.36708  0.269414 -0.017212  0.0683010 -0.29473  0.353942  0.696208
## drat -0.31182  0.341653  0.149955  0.8456585  0.16193 -0.015368  0.047680
## wt    0.37348 -0.171943  0.453734  0.1912600 -0.18748 -0.083772 -0.427776
## qsec -0.22435 -0.484044  0.628128 -0.0303291 -0.14825  0.257529  0.276226
## gear -0.20947  0.550783  0.206584 -0.2823818 -0.56249 -0.322982 -0.085557
## carb  0.24458  0.484313  0.464121 -0.2144922  0.39978  0.357069 -0.206042
##            PC8       PC9
## mpg  -0.124660  0.114929
## cyl   0.810322  0.162663
## disp -0.064167 -0.661908
## hp   -0.165740  0.251773
## drat  0.135051  0.038091
## wt   -0.198394  0.569188
## qsec  0.356133 -0.168737
## gear  0.316365  0.047197
## carb -0.108328 -0.320459
fa.parallel(cars_numeric, n.obs = 32, fm="pa", fa="pc")
## Warning in fa.parallel(cars_numeric, n.obs = 32, fm = "pa", fa = "pc"): You
## specified the number of subjects, implying a correlation matrix, but do not have
## a correlation matrix, correlations found
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2
fa.diagram(cars_numeric.pca_all_components)

pcaDF <- data.frame(cars_numeric)   # convert pca matrix to dataframe
pcaDF
##                      mpg cyl  disp  hp drat    wt  qsec gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60    4    2
attach(pcaDF)              # so we can use variable names

pcscores <- .53*mpg + .57*cyl + .58*disp + .80*hp + .07*drat + .57*wt + .51*qsec + .19*gear +.84*carb 
pcscores <- sort(pcscores, decreasing=FALSE)

pcscores
##  [1] 117.05 126.58 128.00 130.97 164.09 164.61 171.31 174.21 176.33 184.80
## [11] 188.22 209.63 210.06 224.46 224.89 241.40 253.69 265.77 322.19 330.27
## [21] 331.22 331.71 332.10 376.38 400.09 424.85 431.21 441.52 464.13 465.19
## [31] 467.64 472.49

Plot the Histograms

par(mfrow = c(1,2))

hist(pcscores)