The purpose of this activity is to use dimension reduction to reduce the dimensionality of data without losing much of the information it can provide. In other words we are reducing the importance of some of the attributes without any loss in data integrity. The method being used in this example is Principal Component Analysis (PCA).
# Read the cars data in excluding cateforical data
cars_numeric <- mtcars[, c(1:7, 10, 11)]
head (cars_numeric)
## mpg cyl disp hp drat wt qsec gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 3 1
We have 9 attributes in this data set, and therefor we will obtain 9 principal components from running a PCA on the dataset. Each of these explains the percentage of total variance in the dataset. Looking at the dataset below, for example, we can see that PC1 accounts for 62.8% of the total variance. This means that 62% of the information in the dataset can be encapsulated by the PC1 principal component.
cat("\nSummary of Cars PCA\n")
##
## Summary of Cars PCA
cars_numeric.pca <- prcomp(cars_numeric, center = TRUE, scale. = TRUE)
summary(cars_numeric.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
## PC8 PC9
## Standard deviation 0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion 0.9975 1.00000
cat("\nCorrelation Matrix\n")
##
## Correlation Matrix
cars_numeric.cor <- cor(cars_numeric)
cars_numeric.cor
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec gear carb
## mpg 0.41868403 0.4802848 -0.5509251
## cyl -0.59124207 -0.4926866 0.5269883
## disp -0.43369788 -0.5555692 0.3949769
## hp -0.70822339 -0.1257043 0.7498125
## drat 0.09120476 0.6996101 -0.0907898
## wt -0.17471588 -0.5832870 0.4276059
## qsec 1.00000000 -0.2126822 -0.6562492
## gear -0.21268223 1.0000000 0.2740728
## carb -0.65624923 0.2740728 1.0000000
Generate a covariance matrix, which is a sqaure matrix giving the covariance between each pair of a given vector.
cars_numeric.cov = cov(cars_numeric)
cat("Covariance Matrix")
## Covariance Matrix
cars_numeric.cov
## mpg cyl disp hp drat wt
## mpg 36.324103 -9.1723790 -633.09721 -320.732056 2.19506351 -5.1166847
## cyl -9.172379 3.1895161 199.66028 101.931452 -0.66836694 1.3673710
## disp -633.097208 199.6602823 15360.79983 6721.158669 -47.06401915 107.6842040
## hp -320.732056 101.9314516 6721.15867 4700.866935 -16.45110887 44.1926613
## drat 2.195064 -0.6683669 -47.06402 -16.451109 0.28588135 -0.3727207
## wt -5.116685 1.3673710 107.68420 44.192661 -0.37272073 0.9573790
## qsec 4.509149 -1.8868548 -96.05168 -86.770081 0.08714073 -0.3054816
## gear 2.135685 -0.6491935 -50.80262 -6.358871 0.27598790 -0.4210806
## carb -5.363105 1.5201613 79.06875 83.036290 -0.07840726 0.6757903
## qsec gear carb
## mpg 4.50914919 2.1356855 -5.36310484
## cyl -1.88685484 -0.6491935 1.52016129
## disp -96.05168145 -50.8026210 79.06875000
## hp -86.77008065 -6.3588710 83.03629032
## drat 0.08714073 0.2759879 -0.07840726
## wt -0.30548161 -0.4210806 0.67579032
## qsec 3.19316613 -0.2804032 -1.89411290
## gear -0.28040323 0.5443548 0.32661290
## carb -1.89411290 0.3266129 2.60887097
cars_numeric.sig <- corr.p(cars_numeric.cor, 32, alpha=.05)
print(cars_numeric.sig, short=FALSE)
## Call:corr.p(r = cars_numeric.cor, n = 32, alpha = 0.05)
## Correlation matrix
## mpg cyl disp hp drat wt qsec gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.58 0.43
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 -0.21 -0.66
## gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 1.00 0.27
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 0.27 1.00
## Sample Size
## [1] 32
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## mpg cyl disp hp drat wt qsec gear carb
## mpg 0.00 0 0.00 0.00 0.00 0.00 0.14 0.06 0.02
## cyl 0.00 0 0.00 0.00 0.00 0.00 0.01 0.05 0.03
## disp 0.00 0 0.00 0.00 0.00 0.00 0.13 0.02 0.18
## hp 0.00 0 0.00 0.00 0.11 0.00 0.00 1.00 0.00
## drat 0.00 0 0.00 0.01 0.00 0.00 1.00 0.00 1.00
## wt 0.00 0 0.00 0.00 0.00 0.00 1.00 0.01 0.13
## qsec 0.02 0 0.01 0.00 0.62 0.34 0.00 1.00 0.00
## gear 0.01 0 0.00 0.49 0.00 0.00 0.24 0.00 0.77
## carb 0.00 0 0.03 0.00 0.62 0.01 0.00 0.13 0.00
##
## Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
## lower r upper p
## mpg-cyl -0.93 -0.85 -0.72 0.00
## mpg-disp -0.92 -0.85 -0.71 0.00
## mpg-hp -0.89 -0.78 -0.59 0.00
## mpg-drat 0.44 0.68 0.83 0.00
## mpg-wt -0.93 -0.87 -0.74 0.00
## mpg-qsec 0.08 0.42 0.67 0.02
## mpg-gear 0.16 0.48 0.71 0.01
## mpg-carb -0.75 -0.55 -0.25 0.00
## cyl-disp 0.81 0.90 0.95 0.00
## cyl-hp 0.68 0.83 0.92 0.00
## cyl-drat -0.84 -0.70 -0.46 0.00
## cyl-wt 0.60 0.78 0.89 0.00
## cyl-qsec -0.78 -0.59 -0.31 0.00
## cyl-gear -0.72 -0.49 -0.17 0.00
## cyl-carb 0.22 0.53 0.74 0.00
## disp-hp 0.61 0.79 0.89 0.00
## disp-drat -0.85 -0.71 -0.48 0.00
## disp-wt 0.78 0.89 0.94 0.00
## disp-qsec -0.68 -0.43 -0.10 0.01
## disp-gear -0.76 -0.56 -0.26 0.00
## disp-carb 0.05 0.39 0.65 0.03
## hp-drat -0.69 -0.45 -0.12 0.01
## hp-wt 0.40 0.66 0.82 0.00
## hp-qsec -0.85 -0.71 -0.48 0.00
## hp-gear -0.45 -0.13 0.23 0.49
## hp-carb 0.54 0.75 0.87 0.00
## drat-wt -0.85 -0.71 -0.48 0.00
## drat-qsec -0.27 0.09 0.43 0.62
## drat-gear 0.46 0.70 0.84 0.00
## drat-carb -0.43 -0.09 0.27 0.62
## wt-qsec -0.49 -0.17 0.19 0.34
## wt-gear -0.77 -0.58 -0.29 0.00
## wt-carb 0.09 0.43 0.68 0.01
## qsec-gear -0.52 -0.21 0.15 0.24
## qsec-carb -0.82 -0.66 -0.40 0.00
## gear-carb -0.08 0.27 0.57 0.13
# coerce data into a matrix and ignore headers
dat <- data.matrix(cars_numeric)
paf.pca <- paf(dat, eigcrit=1, convcrit=.001)
summary(paf.pca) # Notice Bartlett and KMO values and ignore the rest
## $KMO
## [1] 0.7737
##
## $MSA
## MSA
## mpg 0.92464
## cyl 0.86063
## disp 0.71852
## hp 0.83555
## drat 0.92452
## wt 0.68649
## qsec 0.66585
## gear 0.76358
## carb 0.57355
##
## $Bartlett
## [1] 332.33
##
## $Communalities
## Initial Communalities Final Extraction
## mpg 0.85958 0.85411
## cyl 0.92480 0.91407
## disp 0.95435 0.90336
## hp 0.89477 0.92262
## drat 0.70238 0.70892
## wt 0.94397 0.81366
## qsec 0.82578 0.63180
## gear 0.79958 0.85598
## carb 0.87276 0.74790
##
## $Factor.Loadings
## [,1] [,2]
## mpg 0.92358 -0.033209
## cyl -0.95572 -0.025825
## disp -0.94224 0.124652
## hp -0.87239 -0.401940
## drat 0.71425 -0.445841
## wt -0.87181 0.231556
## qsec 0.50520 0.613656
## gear 0.49386 -0.782361
## carb -0.56232 -0.657034
##
## $RMS
## [1] 0.03966
Verify that the determinate is positive. If the determinate is positive then the assumptions tests have been completed.
det(cars_numeric.cor)
## [1] 4.8675e-06
cars_numeric.pca_1_components <- principal(cars_numeric.cor, rotate = "none")
cat("\nPCA Components\n")
##
## PCA Components
cars_numeric.pca_1_components
## Principal Components Analysis
## Call: principal(r = cars_numeric.cor, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 h2 u2 com
## mpg -0.93 0.87 0.126 1
## cyl 0.96 0.92 0.083 1
## disp 0.94 0.89 0.107 1
## hp 0.87 0.76 0.238 1
## drat -0.74 0.55 0.450 1
## wt 0.89 0.79 0.211 1
## qsec -0.53 0.28 0.715 1
## gear -0.50 0.25 0.752 1
## carb 0.58 0.34 0.662 1
##
## PC1
## SS loadings 5.66
## Proportion Var 0.63
##
## Mean item complexity = 1
## Test of the hypothesis that 1 component is sufficient.
##
## The root mean square of the residuals (RMSR) is 0.2
##
## Fit based upon off diagonal values = 0.9
cars_numeric.pca_all_components <- principal(cars_numeric.cor, nfactors=9, rotate="none") #we calculate all 5 components
cat("\nAll PCA Components Calculated\n")
##
## All PCA Components Calculated
cars_numeric.pca_all_components
## Principal Components Analysis
## Call: principal(r = cars_numeric.cor, nfactors = 9, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 h2 u2 com
## mpg -0.93 0.04 -0.16 0.00 0.14 0.25 0.12 -0.03 0.02 1 4.4e-16 1.3
## cyl 0.96 0.02 -0.18 0.02 -0.05 0.08 0.05 0.20 0.02 1 7.8e-16 1.2
## disp 0.94 -0.13 -0.06 0.17 0.21 -0.01 0.06 -0.02 -0.10 1 -2.2e-16 1.2
## hp 0.87 0.39 -0.01 0.04 0.13 0.12 -0.23 -0.04 0.04 1 2.2e-16 1.6
## drat -0.74 0.49 0.11 0.44 -0.07 -0.01 -0.02 0.03 0.01 1 7.8e-16 2.5
## wt 0.89 -0.25 0.32 0.10 0.08 -0.03 0.14 -0.05 0.08 1 4.4e-16 1.6
## qsec -0.53 -0.70 0.45 -0.02 0.06 0.09 -0.09 0.09 -0.03 1 5.6e-16 2.8
## gear -0.50 0.79 0.15 -0.15 0.24 -0.11 0.03 0.08 0.01 1 7.8e-16 2.2
## carb 0.58 0.70 0.33 -0.11 -0.17 0.13 0.07 -0.03 -0.05 1 4.4e-16 2.7
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## SS loadings 5.66 2.08 0.50 0.27 0.18 0.12 0.11 0.06 0.02
## Proportion Var 0.63 0.23 0.06 0.03 0.02 0.01 0.01 0.01 0.00
## Cumulative Var 0.63 0.86 0.92 0.95 0.97 0.98 0.99 1.00 1.00
## Proportion Explained 0.63 0.23 0.06 0.03 0.02 0.01 0.01 0.01 0.00
## Cumulative Proportion 0.63 0.86 0.92 0.95 0.97 0.98 0.99 1.00 1.00
##
## Mean item complexity = 1.9
## Test of the hypothesis that 9 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0
##
## Fit based upon off diagonal values = 1
We need to account for the residual error, no matter how small the error is. This is done below:
alpha(cars_numeric.cor)
## Warning in alpha(cars_numeric.cor): Some items were negatively correlated with the total scale and probably
## should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( mpg drat qsec gear ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
##
## Reliability analysis
## Call: alpha(x = cars_numeric.cor)
##
## raw_alpha std.alpha G6(smc) average_r S/N median_r
## -0.56 -0.56 0.8 -0.041 -0.36 -0.15
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N var.r med.r
## mpg 0.195 0.195 0.88 0.029 0.242 0.37 -0.108
## cyl -1.034 -1.034 0.69 -0.068 -0.508 0.35 -0.150
## disp -1.055 -1.055 0.65 -0.069 -0.513 0.36 -0.150
## hp -1.835 -1.835 0.61 -0.088 -0.647 0.37 -0.194
## drat -0.093 -0.093 0.87 -0.011 -0.085 0.41 -0.150
## wt -1.045 -1.045 0.65 -0.068 -0.511 0.37 -0.169
## qsec 0.186 0.186 0.88 0.028 0.228 0.45 0.092
## gear -0.369 -0.369 0.82 -0.035 -0.269 0.45 -0.133
## carb -2.047 -2.047 0.55 -0.092 -0.672 0.43 -0.323
##
## Item statistics
## r r.cor r.drop
## mpg -0.536 -0.66 -0.75
## cyl 0.574 0.61 0.20
## disp 0.583 0.63 0.21
## hp 0.804 0.85 0.56
## drat -0.078 -0.22 -0.44
## wt 0.578 0.62 0.20
## qsec -0.516 -0.66 -0.73
## gear 0.197 0.13 -0.21
## carb 0.846 0.89 0.64
cars_numeric.pca
## Standard deviations (1, .., p=9):
## [1] 2.37822 1.44295 0.71008 0.51481 0.42797 0.35184 0.32413 0.24190 0.14896
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## mpg -0.39315 0.027539 -0.221193 -0.0061264 -0.32076 0.720156 -0.381381
## cyl 0.40255 0.015710 -0.252316 0.0407003 0.11714 0.224325 -0.158933
## disp 0.39735 -0.088885 -0.078251 0.3394937 -0.48678 -0.019675 -0.182331
## hp 0.36708 0.269414 -0.017212 0.0683010 -0.29473 0.353942 0.696208
## drat -0.31182 0.341653 0.149955 0.8456585 0.16193 -0.015368 0.047680
## wt 0.37348 -0.171943 0.453734 0.1912600 -0.18748 -0.083772 -0.427776
## qsec -0.22435 -0.484044 0.628128 -0.0303291 -0.14825 0.257529 0.276226
## gear -0.20947 0.550783 0.206584 -0.2823818 -0.56249 -0.322982 -0.085557
## carb 0.24458 0.484313 0.464121 -0.2144922 0.39978 0.357069 -0.206042
## PC8 PC9
## mpg -0.124660 0.114929
## cyl 0.810322 0.162663
## disp -0.064167 -0.661908
## hp -0.165740 0.251773
## drat 0.135051 0.038091
## wt -0.198394 0.569188
## qsec 0.356133 -0.168737
## gear 0.316365 0.047197
## carb -0.108328 -0.320459
fa.parallel(cars_numeric, n.obs = 32, fm="pa", fa="pc")
## Warning in fa.parallel(cars_numeric, n.obs = 32, fm = "pa", fa = "pc"): You
## specified the number of subjects, implying a correlation matrix, but do not have
## a correlation matrix, correlations found
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect. Try a
## different factor score estimation method.
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
fa.diagram(cars_numeric.pca_all_components)
pcaDF <- data.frame(cars_numeric) # convert pca matrix to dataframe
pcaDF
## mpg cyl disp hp drat wt qsec gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 4 2
attach(pcaDF) # so we can use variable names
pcscores <- .53*mpg + .57*cyl + .58*disp + .80*hp + .07*drat + .57*wt + .51*qsec + .19*gear +.84*carb
pcscores <- sort(pcscores, decreasing=FALSE)
pcscores
## [1] 117.05 126.58 128.00 130.97 164.09 164.61 171.31 174.21 176.33 184.80
## [11] 188.22 209.63 210.06 224.46 224.89 241.40 253.69 265.77 322.19 330.27
## [21] 331.22 331.71 332.10 376.38 400.09 424.85 431.21 441.52 464.13 465.19
## [31] 467.64 472.49
par(mfrow = c(1,2))
hist(pcscores)