PCA Assignment Instructions


See the Data Set Definition for a list of variables.


Data Analysis

Data Set Summary

Investigate the distributions of the variables (boxplots and/or histograms): How close to, or far from, normality are the variables?


The variables are mostly close to normally distributed, with the exception of “Health_Care_and_Environment”, “The_Arts” and “Housing”, which are all skewed to the right (for each of these, the mean is noticeably larger than the median).

Variable                          Mean    Median   Minimum    Maximum       Variance
X1                               165.0     165.0       1.0      329.0         9047.5
Climate_and_Terrain           538.7325  542.0000  105.0000   910.0000     14594.6356
Housing                       8346.559  7877.000  5159.000  23640.000    5689477.778
Health_Care_and_Environment   1185.739   833.000    43.000   7850.000    1006013.084
Crime                         961.0547  947.0000  308.0000  2498.0000    127559.1129
Transportation                4210.082  4080.000  1145.000   8625.000    2105921.185
Education                     2814.888  2794.000  1701.000   3781.000     102908.118
The_Arts                      3150.884  1871.000    52.000  56745.000   21550798.304
Recreation                    1845.957  1670.000   300.000   4800.000     652683.297
Economics                     5525.365  5384.000  3045.000   9980.000    1176071.976
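
As a rough numeric cross-check on the skew described above, one option is Pearson's second skewness coefficient, 3 * (mean - median) / standard deviation, which can be computed directly from the summary table. This is only a sketch: the choice of this coefficient is an assumption (the assignment asks for boxplots and/or histograms, which remain the primary evidence), and Python is used here purely for illustration. The numbers are copied from the table.

```python
import math

# (mean, median, variance) for each rating variable, copied from the summary table.
summary = {
    "Climate_and_Terrain": (538.7325, 542.0, 14594.6356),
    "Housing": (8346.559, 7877.0, 5689477.778),
    "Health_Care_and_Environment": (1185.739, 833.0, 1006013.084),
    "Crime": (961.0547, 947.0, 127559.1129),
    "Transportation": (4210.082, 4080.0, 2105921.185),
    "Education": (2814.888, 2794.0, 102908.118),
    "The_Arts": (3150.884, 1871.0, 21550798.304),
    "Recreation": (1845.957, 1670.0, 652683.297),
    "Economics": (5525.365, 5384.0, 1176071.976),
}


def pearson_median_skewness(mean, median, variance):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3.0 * (mean - median) / math.sqrt(variance)


skew = {name: pearson_median_skewness(*stats) for name, stats in summary.items()}

# Print the variables from most right-skewed to least.
for name, value in sorted(skew.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {value:6.2f}")
```

Values near zero suggest a roughly symmetric distribution, while clearly positive values point to a right skew. The coefficient uses only three summary numbers, so it can support but not replace a look at the plots.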


Analysis

Examine the covariance matrix and correlation matrix, and give a brief discussion comparing the variability of the different variables and the pairwise correlations:

Covariance


The following output summarizes the covariance matrix for the Euclidean distances (minimum, quartiles, mean and maximum):

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.7318  0.8555  1.0044  1.0024  1.1543  3.2536

Correlation


The following output summarizes the correlation matrix for the Euclidean distances (minimum, quartiles, mean and maximum):

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0007268 0.0475655 0.0879974 0.1369932 0.1661601 1.1755652

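The covariance matrix of the scaled data matches the correlation matrix of the unscaled data. This is expected, because standardizing each variable to unit variance turns every covariance into a correlation, and it indicates that the data is scaled properly for use in a PCA.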
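
The match between the covariance of scaled data and the correlation of unscaled data can be illustrated with a minimal sketch. The two short lists below are made-up toy values, not the assignment data, and the helper names are hypothetical. Python's standard library is used only for illustration.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance with the usual n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def standardize(xs):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mx = mean(xs)
    sd = math.sqrt(covariance(xs, xs))
    return [(x - mx) / sd for x in xs]


# Hypothetical toy values on very different scales (NOT the ratings data).
x = [2.0, 4.0, 4.0, 7.0, 9.0]
y = [10.0, 30.0, 20.0, 60.0, 50.0]

zx, zy = standardize(x), standardize(y)

cov_scaled = covariance(zx, zy)   # covariance after scaling
corr_raw = correlation(x, y)      # correlation before scaling

print(f"covariance of scaled data: {cov_scaled:.6f}")
print(f"correlation of raw data:   {corr_raw:.6f}")
```

The two printed numbers agree up to floating-point rounding, and each standardized variable has variance 1, which is why a PCA on scaled data is equivalent to a PCA on the correlation matrix.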


Find the principal components from the original data. Repeat the previous steps after standardizing the data.

Pre-Scaling

## Importance of components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     4941.0201 2099.5436 1279.8636 1.037e+03 691.62778
## Proportion of Variance    0.7527    0.1359    0.0505 3.319e-02   0.01475
## Cumulative Proportion     0.7527    0.8886    0.9391 9.723e-01   0.98703
##                              PC6       PC7       PC8       PC9     PC10
## Standard deviation     490.76856 304.67149 258.89207 105.33685 93.55214
## Proportion of Variance   0.00743   0.00286   0.00207   0.00034  0.00027
## Cumulative Proportion    0.99446   0.99732   0.99939   0.99973  1.00000
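
One way to see why the first component dominates before scaling is to compare each variable's variance with the total variance, since the total variance equals the sum of the component variances. The sketch below uses the variances from the summary table and the PC1 standard deviation from the output above. It assumes that all ten columns, including X1, went into the pre-scaling PCA (the output lists ten components). Python is used only for illustration.

```python
# Variances copied from the summary table (assumption: X1 was included in the PCA,
# because the pre-scaling output reports ten components).
variances = {
    "X1": 9047.5,
    "Climate_and_Terrain": 14594.6356,
    "Housing": 5689477.778,
    "Health_Care_and_Environment": 1006013.084,
    "Crime": 127559.1129,
    "Transportation": 2105921.185,
    "Education": 102908.118,
    "The_Arts": 21550798.304,
    "Recreation": 652683.297,
    "Economics": 1176071.976,
}

total_variance = sum(variances.values())

# Share of the total variance contributed by each original variable.
shares = {name: v / total_variance for name, v in variances.items()}

# PC1 standard deviation as reported in the pre-scaling output.
pc1_sd = 4941.0201
pc1_share = pc1_sd ** 2 / total_variance

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {share:7.4f}")
print(f"PC1 share of total variance:   {pc1_share:7.4f}")
```

Under that assumption, the PC1 share computed this way reproduces the reported 0.7527, and “The_Arts” alone contributes roughly two thirds of the total variance. In other words, the unscaled PCA is driven mostly by the variable with the largest raw spread, which is the motivation for standardizing.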

Post-scaling

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.8475 1.1147 1.0794 0.98811 0.94541 0.86623
## Proportion of Variance 0.3413 0.1243 0.1165 0.09764 0.08938 0.07504
## Cumulative Proportion  0.3413 0.4656 0.5821 0.67973 0.76911 0.84414
##                            PC7     PC8     PC9    PC10
## Standard deviation     0.79306 0.70170 0.56292 0.34692
## Proportion of Variance 0.06289 0.04924 0.03169 0.01204
## Cumulative Proportion  0.90704 0.95628 0.98796 1.00000
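
The "Proportion of Variance" and "Cumulative Proportion" rows follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the sum over all components. A minimal sketch using the post-scaling standard deviations from the output above (Python is used only for illustration):

```python
from itertools import accumulate

# Standard deviations of PC1..PC10, copied from the post-scaling output.
sdev = [1.8475, 1.1147, 1.0794, 0.98811, 0.94541,
        0.86623, 0.79306, 0.70170, 0.56292, 0.34692]

# Component variances (the eigenvalues of the correlation matrix).
component_variance = [s ** 2 for s in sdev]
total = sum(component_variance)

proportion = [v / total for v in component_variance]
cumulative = list(accumulate(proportion))

print(f"total variance: {total:.4f}")
for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i:<2d} proportion={p:.4f} cumulative={c:.4f}")
```

With ten standardized variables the total variance comes out at about 10 (one unit per variable), so each proportion is roughly the squared standard deviation divided by 10.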


Post-analysis questions:


Find and interpret the relevant scree plot and a biplot:

The plots are shown below.

Biplot

The biplot is difficult to interpret because of the number of points on it. When I zoom in, I notice that the “Transportation”, “The_Arts” and “Health_Care_and_Environment” variables are clustered together; could these hold similar importance to people rating these places? I notice two other similar groups: (Crime, Housing and Recreation) and (Economics and Climate_and_Terrain).

Scree Plot

According to the scree plot, the first component is the most important to keep; after that, the explanatory value of each subsequent component diminishes.
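
The reading of the scree plot can be backed up with the post-scaling proportions of variance. The sketch below finds where the largest drop between consecutive components occurs and how many components are needed to reach a cumulative threshold. The 80% threshold and the function name are assumptions chosen for illustration, not part of the assignment. Python is used only for illustration.

```python
# Proportion of variance for PC1..PC10, copied from the post-scaling output.
proportion = [0.3413, 0.1243, 0.1165, 0.09764, 0.08938,
              0.07504, 0.06289, 0.04924, 0.03169, 0.01204]


def components_needed(props, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    running = 0.0
    for count, p in enumerate(props, start=1):
        running += p
        if running >= threshold:
            return count
    return len(props)  # threshold not reached (e.g. because of rounding)


# Drop in explained variance from each component to the next.
drops = [a - b for a, b in zip(proportion, proportion[1:])]

# 1-based index of the component after which the largest drop occurs.
elbow_after = drops.index(max(drops)) + 1

threshold = 0.80  # hypothetical cut-off, chosen only for illustration
needed = components_needed(proportion, threshold)

print(f"largest drop occurs after PC{elbow_after}")
print(f"components needed for {threshold:.0%} of variance: {needed}")
```

The largest drop falls right after the first component, which agrees with the scree plot, while a cumulative-variance rule would keep more components. Which rule to apply is a judgment call.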

Which analysis do you prefer and why?

Of all the analyses that we have studied these past weeks, I really like PCA because it can be applied in many different scenarios and it gives me a clear indication of “how to proceed”. It is not limited by the boundaries of the existing variables: it takes the most interesting and impactful elements of the entire dataset, recognizes interactions, and proposes groups that are most likely to be predictive.

Data Set Definition

The data set is a set of ratings about places. There are 329 observations with nine numerical “rating” criteria. The variables in the data set are the rating criteria.




