See the Data Set Definition for a list of variables.
Investigate the distributions of the variables (boxplots and/or histograms): how close to, or far from, normality are the variables?
The variables are mostly approximately normally distributed, with the exception of “Health_Care_and_Environment”, “The_Arts”, and “Housing”, which are all skewed to the right.
| Statistic | X1 | Climate_and_Terrain | Housing | Health_Care_and_Environment | Crime | Transportation | Education | The_Arts | Recreation | Economics |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 165.0 | 538.7325 | 8346.559 | 1185.739 | 961.0547 | 4210.082 | 2814.888 | 3150.884 | 1845.957 | 5525.365 |
| Median | 165.0 | 542.0000 | 7877.000 | 833.000 | 947.0000 | 4080.000 | 2794.000 | 1871.000 | 1670.000 | 5384.000 |
| Minimum | 1.0 | 105.0000 | 5159.000 | 43.000 | 308.0000 | 1145.000 | 1701.000 | 52.000 | 300.000 | 3045.000 |
| Maximum | 329.0 | 910.0000 | 23640.000 | 7850.000 | 2498.0000 | 8625.000 | 3781.000 | 56745.000 | 4800.000 | 9980.000 |
| Variance | 9047.5 | 14594.6356 | 5689477.778 | 1006013.084 | 127559.1129 | 2105921.185 | 102908.118 | 21550798.304 | 652683.297 | 1176071.976 |
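A rough check of the skewness reading can be made from the summary statistics alone. The sketch below computes Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, for each rating variable. It assumes that the first numeric column in each row of the table is the X1 row index, so the nine ratings start with Climate_and_Terrain; the dictionary and function names are illustrative only.

```python
import math

# Values copied from the summary table above: (mean, median, variance).
# Assumption: the first numeric column of the table is the X1 row index,
# so the nine rating variables start with Climate_and_Terrain.
summary = {
    "Climate_and_Terrain": (538.7325, 542.0, 14594.6356),
    "Housing": (8346.559, 7877.0, 5689477.778),
    "Health_Care_and_Environment": (1185.739, 833.0, 1006013.084),
    "Crime": (961.0547, 947.0, 127559.1129),
    "Transportation": (4210.082, 4080.0, 2105921.185),
    "Education": (2814.888, 2794.0, 102908.118),
    "The_Arts": (3150.884, 1871.0, 21550798.304),
    "Recreation": (1845.957, 1670.0, 652683.297),
    "Economics": (5525.365, 5384.0, 1176071.976),
}


def pearson_median_skewness(mean, median, variance):
    """Pearson's second skewness coefficient: 3 * (mean - median) / sd."""
    return 3.0 * (mean - median) / math.sqrt(variance)


skewness = {
    name: pearson_median_skewness(*stats) for name, stats in summary.items()
}

# Print the variables from most to least right-skewed.
for name, value in sorted(skewness.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {value:6.2f}")
```

Positive values point to a right skew (mean above median). This is only a coarse indicator, not a normality test, so it complements the boxplots and histograms and does not replace them.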
Examine the covariance matrix and correlation matrix, and give a brief discussion comparing the variability of the different variables and the pairwise correlations:
The following is a summary of the covariance matrix for the Euclidean distances:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -4.7318  0.8555  1.0044  1.0024  1.1543  3.2536
The following is a summary of the correlation matrix for the Euclidean distances:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.0007268 0.0475655 0.0879974 0.1369932 0.1661601 1.1755652
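The covariance matrix of the scaled data matches the correlation matrix of the non-scaled data. This is expected, because standardizing each variable to zero mean and unit variance turns covariances into correlations, and it indicates that the data are scaled properly for use in a PCA.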
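The identity between the covariance of scaled data and the correlation of unscaled data can be checked on a small example. The sketch below uses only the Python standard library and two short, hypothetical rating vectors (not the report's data); all helper names are illustrative.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def standardize(xs):
    """Center to zero mean and scale to unit sample variance."""
    m = mean(xs)
    sd = math.sqrt(covariance(xs, xs))
    return [(x - m) / sd for x in xs]


# Hypothetical toy ratings on very different scales (not the report's data).
housing = [5200.0, 7900.0, 8300.0, 12000.0, 23600.0]
crime = [310.0, 950.0, 870.0, 1200.0, 2400.0]

cov_scaled = covariance(standardize(housing), standardize(crime))
cor_raw = correlation(housing, crime)
print(cov_scaled, cor_raw)  # the two values agree up to rounding error
```

Because each standardized variable has variance 1, the diagonal of the scaled covariance matrix is all ones, just like the diagonal of a correlation matrix.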
PCA on the unscaled data:
## Importance of components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     4941.0201 2099.5436 1279.8636 1.037e+03 691.62778
## Proportion of Variance    0.7527    0.1359    0.0505 3.319e-02   0.01475
## Cumulative Proportion     0.7527    0.8886    0.9391 9.723e-01   0.98703
##                              PC6       PC7       PC8       PC9     PC10
## Standard deviation     490.76856 304.67149 258.89207 105.33685 93.55214
## Proportion of Variance   0.00743   0.00286   0.00207   0.00034  0.00027
## Cumulative Proportion    0.99446   0.99732   0.99939   0.99973  1.00000
PCA on the scaled data:
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.8475 1.1147 1.0794 0.98811 0.94541 0.86623
## Proportion of Variance 0.3413 0.1243 0.1165 0.09764 0.08938 0.07504
## Cumulative Proportion  0.3413 0.4656 0.5821 0.67973 0.76911 0.84414
##                            PC7     PC8     PC9    PC10
## Standard deviation     0.79306 0.70170 0.56292 0.34692
## Proportion of Variance 0.06289 0.04924 0.03169 0.01204
## Cumulative Proportion  0.90704 0.95628 0.98796 1.00000
The plots are shown below.
The biplot is difficult to interpret because of the number of points on it. When I zoom in, I notice that the “Transportation”, “The_Arts”, and “Health_Care_and_Environment” variables are clustered together; could these hold similar importance to people rating these places? I notice two other groups like this one: (Crime, Housing, and Recreation) and (Economics and Climate_and_Terrain).
According to the scree plot, the first component is the most important to keep; after that, the explanatory value of each successive component diminishes.
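The quantities behind a scree plot can be recomputed from the component standard deviations. The sketch below takes the ten standard deviations printed in the second "Importance of components" output above (the one beginning 1.8475), squares them to get the component variances, and derives the proportion and cumulative proportion of variance. The variable names are illustrative.

```python
# Standard deviations of PC1..PC10 as printed in the second
# "Importance of components" output above.
sdev = [1.8475, 1.1147, 1.0794, 0.98811, 0.94541,
        0.86623, 0.79306, 0.70170, 0.56292, 0.34692]

# Component variances (eigenvalues): the heights plotted in a scree plot.
variances = [s ** 2 for s in sdev]
total = sum(variances)

# Share of the total variance explained by each component.
proportion = [v / total for v in variances]

# Running total of the explained share.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

for i, (v, p, c) in enumerate(zip(variances, proportion, cumulative), start=1):
    print(f"PC{i:<2d} variance={v:6.3f} proportion={p:6.4f} cumulative={c:6.4f}")
```

The variances sum to roughly 10, which is what one would expect if ten standardized columns entered the analysis. The drop from the first component's variance to the second's is the "elbow" the scree plot shows.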
Of all the analyses that we have studied these past weeks, I really like PCA because it can be applied in many different scenarios and gives me a clear indication of “how to proceed”. It is not limited by the boundaries of the existing variables: it takes the most interesting and impactful elements of the entire dataset, recognizes interactions, and proposes groups that are most likely to be predictive.
The data set is a set of ratings of places. There are 329 observations with nine numerical “rating” criteria. The variables in the data set are the rating criteria.