Principal Component Analysis of U.S. Cities

1. Examining the Data

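library(FactoMineR)  # provides PCA()
library(factoextra)  # provides get_eigenvalue(), get_pca_*(), fviz_*()

cities30 <- read.csv("C:/Users/nicho/OneDrive/Desktop/MSDA/MK 6460 - Marketing Research & Analytics/Week 5 - Principal Component Analysis & Multidemensional Scaling/Week5_task_data/cities30.csv")
rownames(cities30) <- cities30[, 1]  # city names become plot labels
cities30_data <- cities30[, 2:10]    # Climate through Econ; excludes City and Pop
str(cities30)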

## 'data.frame':    30 obs. of  11 variables:
##  $ City    : chr  "Anaheim-Santa-AnaCA" "AtlantaGA" "BaltimoreMD" "BostonMA" ...
##  $ Climate : int  885 696 567 623 514 584 579 544 521 536 ...
##  $ Housing : int  16047 8316 9148 11609 10913 8143 9168 9318 10789 8525 ...
##  $ HlthCare: int  2025 3195 3562 5301 5766 2138 3167 2825 2533 4142 ...
##  $ Crime   : int  983 1308 1730 1215 1034 978 1138 1529 1365 1587 ...
##  $ Transp  : int  3954 8409 7405 6801 7742 5748 7333 6213 8145 4808 ...
##  $ Educ    : int  2843 3057 3471 3479 3486 2918 2972 3269 3145 3064 ...
##  $ Arts    : int  5632 7559 9788 21042 24846 9688 12679 10438 8477 10389 ...
##  $ Recreat : int  3156 1362 2925 3066 2856 2451 3300 2310 2324 2483 ...
##  $ Econ    : int  6220 6315 5503 6363 5205 5270 4879 7710 7164 3904 ...
##  $ Pop     : int  1932709 2138231 2199531 2805911 6060387 1401491 1898825 1957378 1428836 4488072 ...

summary(cities30)

##      City              Climate         Housing         HlthCare   
##  Length:30          Min.   :293.0   Min.   : 7442   Min.   :1189  
##  Class :character   1st Qu.:536.2   1st Qu.: 8978   1st Qu.:2376  
##  Mode  :character   Median :608.0   Median :10180   Median :3110  
##                     Mean   :633.2   Mean   :10972   Mean   :3385  
##                     3rd Qu.:686.0   3rd Qu.:13302   3rd Qu.:4063  
##                     Max.   :910.0   Max.   :17158   Max.   :7850  
##      Crime          Transp          Educ           Arts          Recreat    
##  Min.   : 566   Min.   :2119   Min.   :2596   Min.   : 4573   Min.   :1362  
##  1st Qu.:1060   1st Qu.:4842   1st Qu.:2924   1st Qu.: 7788   1st Qu.:2148  
##  Median :1307   Median :5682   Median :3060   Median : 9738   Median :2478  
##  Mean   :1338   Mean   :5932   Mean   :3133   Mean   :12640   Mean   :2762  
##  3rd Qu.:1522   3rd Qu.:7280   3rd Qu.:3369   3rd Qu.:13838   3rd Qu.:3264  
##  Max.   :2498   Max.   :8625   Max.   :3781   Max.   :56745   Max.   :4800  
##       Econ           Pop         
##  Min.   :3904   Min.   :1295071  
##  1st Qu.:5322   1st Qu.:1609002  
##  Median :5820   Median :1915767  
##  Mean   :5931   Mean   :2637847  
##  3rd Qu.:6351   3rd Qu.:2703278  
##  Max.   :7710   Max.   :8274961

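The dataset cities30.csv contains 30 cities described by 9 quantitative metrics (Climate, Housing, HlthCare, Crime, Transp, Educ, Arts, Recreat, Econ) plus population (Pop), which is held out of the main analysis. City names are set as row names so that plots are labeled correctly. The structure and summary output report no NA counts, and every metric is integer-valued, so the data can go into PCA without further cleaning. The ranges differ widely (Climate tops out at 910, Arts at 56,745), which is why standardization matters in the next step.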
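
The checks described above can also be run explicitly rather than read off the printed output. The sketch below is only an illustration: it rebuilds a four-city, three-variable excerpt from the str() listing, and the object names `excerpt` and `metrics` are hypothetical. On the real data the same calls would be applied to cities30_data.

```r
# Four-city excerpt copied from the str() output above (illustration only).
excerpt <- data.frame(
  City    = c("Anaheim-Santa-AnaCA", "AtlantaGA", "BaltimoreMD", "BostonMA"),
  Climate = c(885L, 696L, 567L, 623L),
  Housing = c(16047L, 8316L, 9148L, 11609L),
  Arts    = c(5632L, 7559L, 9788L, 21042L)
)
rownames(excerpt) <- excerpt$City
metrics <- excerpt[, -1]  # drop the label column, keep the numeric metrics

# Count missing values per column; all zeros means nothing needs imputing.
na_counts <- colSums(is.na(metrics))

# Confirm every analysis column is numeric, as PCA requires.
all_numeric <- all(vapply(metrics, is.numeric, logical(1)))

na_counts
all_numeric
```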

2. Conducting PCA

cities_pca <- PCA(cities30_data, scale.unit = TRUE, graph = TRUE)

The PCA() function re-expresses the 9 original variables as uncorrelated principal components. Because scale.unit = TRUE (the default), each variable is standardized to unit variance, so variables measured on large scales, such as Arts, do not dominate the solution. The aim is to summarize most of the information in the 9 variables with a few components that are easier to interpret.
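
To see why standardization matters, the sketch below uses base R's prcomp() rather than FactoMineR's PCA(), applied to a ten-city, three-variable excerpt copied from the str() output. The excerpt and the object names are assumptions made for illustration, so the numbers will not match the full 30-city results.

```r
# Ten-city excerpt copied from the str() output above (illustration only).
excerpt <- data.frame(
  Climate = c(885, 696, 567, 623, 514, 584, 579, 544, 521, 536),
  Housing = c(16047, 8316, 9148, 11609, 10913, 8143, 9168, 9318, 10789, 8525),
  Arts    = c(5632, 7559, 9788, 21042, 24846, 9688, 12679, 10438, 8477, 10389)
)

# Raw variances differ by orders of magnitude, so an unscaled PCA
# would be driven by the variables with the largest spread.
raw_var <- sapply(excerpt, var)

# scale. = TRUE standardizes each column, which amounts to
# running PCA on the correlation matrix.
pca_scaled <- prcomp(excerpt, center = TRUE, scale. = TRUE)
eig <- pca_scaled$sdev^2

raw_var
eig
sum(eig)  # equals the number of variables when columns are standardized
```

With standardized inputs, every variable starts with the same weight, and the eigenvalues sum to the number of variables.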

3. Interpreting Eigenvalues and Variable Relationships

eig.val <- get_eigenvalue(cities_pca)
eig.val

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.67474678       29.7194087                    29.71941
## Dim.2 2.15113497       23.9014997                    53.62091
## Dim.3 1.11906827       12.4340919                    66.05500
## Dim.4 0.96928314       10.7698127                    76.82481
## Dim.5 0.72550534        8.0611704                    84.88598
## Dim.6 0.66815639        7.4239599                    92.30994
## Dim.7 0.43170487        4.7967208                    97.10666
## Dim.8 0.20364202        2.2626891                    99.36935
## Dim.9 0.05675822        0.6306469                   100.00000

fviz_eig(cities_pca, addlabels = TRUE, ylim = c(0, 40))

var <- get_pca_var(cities_pca)
var$coord

##               Dim.1       Dim.2       Dim.3      Dim.4        Dim.5
## Climate   0.2142764  0.83044623  0.21295021 -0.1656714 -0.273473946
## Housing   0.3664159  0.70180909  0.44344162 -0.1972430  0.004846431
## HlthCare  0.8787761 -0.27856262  0.03992466 -0.2747599  0.169397449
## Crime     0.5272423  0.20545412 -0.31333529  0.6019620  0.215697338
## Transp    0.5334603 -0.24041140  0.13141777  0.4414921 -0.642712107
## Educ      0.3293290 -0.65602402  0.42948026 -0.1573546  0.035975482
## Arts      0.9101290 -0.15896992  0.01769167 -0.0344317  0.170960926
## Recreat   0.3447763  0.57870936 -0.20249105  0.1027961  0.197285189
## Econ     -0.3226404  0.02843216  0.73093603  0.4833753  0.304878405

var$cos2

##               Dim.1       Dim.2        Dim.3       Dim.4        Dim.5
## Climate  0.04591437 0.689640943 0.0453477921 0.027447027 0.0747879992
## Housing  0.13426058 0.492535993 0.1966404722 0.038904805 0.0000234879
## HlthCare 0.77224752 0.077597134 0.0015939788 0.075493001 0.0286954959
## Crime    0.27798448 0.042211397 0.0981790013 0.362358300 0.0465253417
## Transp   0.28457993 0.057797640 0.0172706312 0.194915282 0.4130788521
## Educ     0.10845760 0.430367514 0.1844532956 0.024760485 0.0012942353
## Arts     0.82833475 0.025271436 0.0003129951 0.001185542 0.0292276383
## Recreat  0.11887069 0.334904525 0.0410026252 0.010567028 0.0389214457
## Econ     0.10409686 0.000808388 0.5342674813 0.233651668 0.0929508421

var$contrib

##              Dim.1      Dim.2       Dim.3      Dim.4        Dim.5
## Climate   1.716588 32.0593990  4.05228110  2.8316831 10.308400960
## Housing   5.019562 22.8965639 17.57180299  4.0137710  0.003237453
## HlthCare 28.871799  3.6072648  0.14243803  7.7885396  3.955242552
## Crime    10.392927  1.9622849  8.77328075 37.3841539  6.412818663
## Transp   10.639509  2.6868439  1.54330452 20.1092204 56.936707467
## Educ      4.054874 20.0065324 16.48275624  2.5545152  0.178390871
## Arts     30.968717  1.1747955  0.02796926  0.1223112  4.028590390
## Recreat   4.444185 15.5687360  3.66399675  1.0901900  5.364735951
## Econ      3.891840  0.0375796 47.74217036 24.1056157 12.811875694

fviz_pca_var(cities_pca, col.var = "cos2",
             gradient.cols = c("blue", "yellow", "red"),
             repel = TRUE)

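The eigenvalues show that the first two components together explain 53.6% of the variance, and the first three (the only ones with eigenvalues above 1) explain 66.1%. A two-dimensional PCA map is therefore a reasonable summary, although it leaves nearly half of the variance unshown. The cos2 values show how well each variable is represented on each component axis, and the contrib values give each variable's percentage share of a component. Arts (31.0%) and HlthCare (28.9%) dominate PC1; Econ contributes only 3.9% to PC1 and instead drives PC3 (47.7%). PC2 is shaped mainly by Climate (32.1%) and Housing (22.9%), with Recreat (15.6%) on the same side and Educ (20.0%) loading in the opposite direction.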
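
The three tables are linked by simple arithmetic, which can be checked by hand. The sketch below uses only numbers printed above: variance percentages come from dividing each eigenvalue by their sum, a variable's cos2 is its squared coordinate, and its contribution is cos2 divided by the component's eigenvalue. The eigenvalue-above-1 cutoff (the Kaiser rule) is a common rule of thumb included here as an assumption about how one might choose the number of components; the variable names are illustrative.

```r
# Eigenvalues copied from the get_eigenvalue() output above.
eig <- c(2.67474678, 2.15113497, 1.11906827, 0.96928314, 0.72550534,
         0.66815639, 0.43170487, 0.20364202, 0.05675822)

# Percentage of variance per component and its running total.
var_pct <- 100 * eig / sum(eig)
cum_pct <- cumsum(var_pct)

# Kaiser rule of thumb: count components with eigenvalue > 1.
n_keep <- sum(eig > 1)

# One variable/component pair as a worked case: Arts on Dim.1.
arts_coord   <- 0.9101290                 # from var$coord
arts_cos2    <- arts_coord^2              # should agree with var$cos2
arts_contrib <- 100 * arts_cos2 / eig[1]  # should agree with var$contrib

round(cum_pct[1:3], 2)
n_keep
round(c(cos2 = arts_cos2, contrib = arts_contrib), 3)
```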

4. City Projections on PCA Space

ind <- get_pca_ind(cities_pca)
fviz_pca_ind(cities_pca, col.ind = "cos2",
             gradient.cols = c("blue", "yellow", "red"),
             repel = TRUE)

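Cities are plotted on the first two components, where proximity indicates similar profiles on the metrics those components capture. Given the variable coordinates in section 3, cities on the right tend to score higher on Arts and HlthCare (and, to a lesser extent, Transp and Crime), while Econ loads weakly in the opposite direction. Cities toward the top tend to score higher on Climate and Housing and lower on Educ. Points with low cos2 are poorly represented in this plane, so their apparent closeness to other cities should be read with caution.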
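
How far proximity on the map can be trusted depends on how much of each city's position the first two components capture. The sketch below illustrates the idea with base R's prcomp() on a ten-city, four-variable excerpt copied from the str() output. The excerpt, the choice of variables, and the object names are assumptions for illustration and will not reproduce the plotted coordinates.

```r
# Ten-city excerpt copied from the str() output above (illustration only).
excerpt <- data.frame(
  Climate  = c(885, 696, 567, 623, 514, 584, 579, 544, 521, 536),
  Housing  = c(16047, 8316, 9148, 11609, 10913, 8143, 9168, 9318, 10789, 8525),
  HlthCare = c(2025, 3195, 3562, 5301, 5766, 2138, 3167, 2825, 2533, 4142),
  Arts     = c(5632, 7559, 9788, 21042, 24846, 9688, 12679, 10438, 8477, 10389)
)

pca <- prcomp(excerpt, center = TRUE, scale. = TRUE)
scores <- pca$x  # one row of component scores per city

# Share of each city's squared distance from the origin that the
# first two components capture (quality of representation).
cos2_plane <- rowSums(scores[, 1:2]^2) / rowSums(scores^2)

# Pairwise distances on the 2-D map versus in the full component space.
d_plane <- dist(scores[, 1:2])
d_full  <- dist(scores)

round(cos2_plane, 2)
```

Distances on the map can only understate the full-space distances, so two cities that look close may still differ on the components that are not shown.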

5. Biplot of Cities and Variables

fviz_pca_biplot(cities_pca, col.var = "cos2", col.ind = "cos2",
                gradient.cols = c("blue", "yellow", "red"),
                repel = TRUE)

The biplot overlays city scores and variable vectors, showing both where each city falls and which variables drive that position. A city lying far out in the direction of the Arts vector, for example, tends to score above average on Arts, and one lying in the opposite direction tends to score below average. This makes it easier to spot the metrics on which a given city stands out from otherwise similar cities.

6. Supplementary Variable: Population

cities_pca_sup <- PCA(cities30[, 2:11], quanti.sup = 10, graph = TRUE)  # column 10 of this subset is Pop

fviz_pca_var(cities_pca_sup, col.var = "black",
             col.quanti.sup = "red")

By introducing population (Pop) as a supplementary variable, we can see how city size aligns with the other indicators. A supplementary variable is projected onto the components after they have been computed, so it does not affect the PCA itself and only adds context. If the Pop arrow points the same way as Arts and HlthCare, for instance, that would suggest larger cities tend to have greater cultural and health-care capacity.
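
One way to picture what "supplementary" means: in a standardized PCA, a quantitative variable's coordinates on the variable plot are its correlations with the component scores, and a supplementary variable can be placed the same way without ever entering the decomposition. The sketch below illustrates this with base R's prcomp() on a ten-city excerpt copied from the str() output. The excerpt, the four active variables chosen, and the object names are assumptions for illustration, not the analysis above.

```r
# Ten-city excerpt copied from the str() output above (illustration only).
excerpt <- data.frame(
  Climate  = c(885, 696, 567, 623, 514, 584, 579, 544, 521, 536),
  Housing  = c(16047, 8316, 9148, 11609, 10913, 8143, 9168, 9318, 10789, 8525),
  HlthCare = c(2025, 3195, 3562, 5301, 5766, 2138, 3167, 2825, 2533, 4142),
  Arts     = c(5632, 7559, 9788, 21042, 24846, 9688, 12679, 10438, 8477, 10389),
  Pop      = c(1932709, 2138231, 2199531, 2805911, 6060387,
               1401491, 1898825, 1957378, 1428836, 4488072)
)

# Only the active variables enter the PCA; Pop is left out.
active <- excerpt[, c("Climate", "Housing", "HlthCare", "Arts")]
pca <- prcomp(active, center = TRUE, scale. = TRUE)

# Supplementary coordinates: correlation of Pop with the first two
# components' scores, computed after the fact.
pop_coord <- cor(excerpt$Pop, pca$x[, 1:2])

# Active variables placed the same way, for comparison on the same plot.
active_coord <- cor(active, pca$x[, 1:2])

round(pop_coord, 2)
round(active_coord, 2)
```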