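library(FactoMineR)  # PCA()
library(factoextra)  # get_eigenvalue(), get_pca_var(), get_pca_ind(), fviz_*()
cities30 <- read.csv("C:/Users/nicho/OneDrive/Desktop/MSDA/MK 6460 - Marketing Research & Analytics/Week 5 - Principal Component Analysis & Multidemensional Scaling/Week5_task_data/cities30.csv")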
rownames(cities30) <- cities30[, 1]
cities30_data <- cities30[, 2:10]
str(cities30)
## 'data.frame': 30 obs. of 11 variables:
## $ City : chr "Anaheim-Santa-AnaCA" "AtlantaGA" "BaltimoreMD" "BostonMA" ...
## $ Climate : int 885 696 567 623 514 584 579 544 521 536 ...
## $ Housing : int 16047 8316 9148 11609 10913 8143 9168 9318 10789 8525 ...
## $ HlthCare: int 2025 3195 3562 5301 5766 2138 3167 2825 2533 4142 ...
## $ Crime : int 983 1308 1730 1215 1034 978 1138 1529 1365 1587 ...
## $ Transp : int 3954 8409 7405 6801 7742 5748 7333 6213 8145 4808 ...
## $ Educ : int 2843 3057 3471 3479 3486 2918 2972 3269 3145 3064 ...
## $ Arts : int 5632 7559 9788 21042 24846 9688 12679 10438 8477 10389 ...
## $ Recreat : int 3156 1362 2925 3066 2856 2451 3300 2310 2324 2483 ...
## $ Econ : int 6220 6315 5503 6363 5205 5270 4879 7710 7164 3904 ...
## $ Pop : int 1932709 2138231 2199531 2805911 6060387 1401491 1898825 1957378 1428836 4488072 ...
summary(cities30)
## City Climate Housing HlthCare
## Length:30 Min. :293.0 Min. : 7442 Min. :1189
## Class :character 1st Qu.:536.2 1st Qu.: 8978 1st Qu.:2376
## Mode :character Median :608.0 Median :10180 Median :3110
## Mean :633.2 Mean :10972 Mean :3385
## 3rd Qu.:686.0 3rd Qu.:13302 3rd Qu.:4063
## Max. :910.0 Max. :17158 Max. :7850
## Crime Transp Educ Arts Recreat
## Min. : 566 Min. :2119 Min. :2596 Min. : 4573 Min. :1362
## 1st Qu.:1060 1st Qu.:4842 1st Qu.:2924 1st Qu.: 7788 1st Qu.:2148
## Median :1307 Median :5682 Median :3060 Median : 9738 Median :2478
## Mean :1338 Mean :5932 Mean :3133 Mean :12640 Mean :2762
## 3rd Qu.:1522 3rd Qu.:7280 3rd Qu.:3369 3rd Qu.:13838 3rd Qu.:3264
## Max. :2498 Max. :8625 Max. :3781 Max. :56745 Max. :4800
## Econ Pop
## Min. :3904 Min. :1295071
## 1st Qu.:5322 1st Qu.:1609002
## Median :5820 Median :1915767
## Mean :5931 Mean :2637847
## 3rd Qu.:6351 3rd Qu.:2703278
## Max. :7710 Max. :8274961
The dataset cities30.csv covers 30 cities and contains 11 columns: the
city name, nine quantitative quality-of-life metrics (Climate, Housing,
HlthCare, Crime, Transp, Educ, Arts, Recreat, Econ), and population
(Pop). City names are set as row names so that plots are labeled
correctly. The structure and summary statistics show no missing values,
and every metric is numeric, so the nine metrics in cities30_data are
suitable for PCA.
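The checks described above can also be made explicit in code rather than read off the printed summary. The sketch below uses a hypothetical four-row stand-in (`toy`), built from the first four cities and three of the columns shown in the str() output, since the full CSV is not reproduced here. The same two checks would apply to cities30_data.

```r
# Hypothetical stand-in: first four cities and three columns from str() above
toy <- data.frame(
  City    = c("Anaheim-Santa-AnaCA", "AtlantaGA", "BaltimoreMD", "BostonMA"),
  Climate = c(885, 696, 567, 623),
  Housing = c(16047, 8316, 9148, 11609),
  Pop     = c(1932709, 2138231, 2199531, 2805911)
)

# Use city names as row names so plots are labeled, then drop the name column
rownames(toy) <- toy$City
metrics <- toy[, -1]

# Count missing values and confirm every remaining column is numeric
n_missing   <- sum(is.na(metrics))
all_numeric <- all(sapply(metrics, is.numeric))

print(n_missing)
print(all_numeric)
```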
cities_pca <- PCA(cities30_data, scale.unit = TRUE, graph = TRUE)
Using the PCA() function from FactoMineR, we reduce the nine original
variables to uncorrelated principal components. Because scale.unit =
TRUE, each variable is standardized to unit variance first, so
variables measured on larger scales (such as Housing or Arts) do not
dominate the solution. This step compresses the essential information
from 9 variables into fewer components for clearer interpretation.
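To make the standardization step concrete, here is a minimal base R sketch. It swaps in prcomp() for FactoMineR's PCA() and uses only the ten rows visible in the str() output above, so its numbers will not match the 30-city results. The name `first10` is introduced here for illustration only.

```r
# First ten rows of the nine metrics, copied from the str() output above.
# Assumption: illustration only; the full analysis uses all 30 cities.
first10 <- data.frame(
  Climate  = c(885, 696, 567, 623, 514, 584, 579, 544, 521, 536),
  Housing  = c(16047, 8316, 9148, 11609, 10913, 8143, 9168, 9318, 10789, 8525),
  HlthCare = c(2025, 3195, 3562, 5301, 5766, 2138, 3167, 2825, 2533, 4142),
  Crime    = c(983, 1308, 1730, 1215, 1034, 978, 1138, 1529, 1365, 1587),
  Transp   = c(3954, 8409, 7405, 6801, 7742, 5748, 7333, 6213, 8145, 4808),
  Educ     = c(2843, 3057, 3471, 3479, 3486, 2918, 2972, 3269, 3145, 3064),
  Arts     = c(5632, 7559, 9788, 21042, 24846, 9688, 12679, 10438, 8477, 10389),
  Recreat  = c(3156, 1362, 2925, 3066, 2856, 2451, 3300, 2310, 2324, 2483),
  Econ     = c(6220, 6315, 5503, 6363, 5205, 5270, 4879, 7710, 7164, 3904)
)

# Standardize by hand: subtract each column mean, divide by its standard deviation
z <- scale(first10)

# Base R PCA; scale. = TRUE performs the same standardization internally
pc <- prcomp(first10, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- pc$sdev^2

# The same eigenvalues come from the correlation matrix directly
eig_cor <- eigen(cor(first10))$values

print(round(eig, 3))
print(sum(eig))  # total variance of standardized data = number of variables
```

Because every standardized variable has variance 1, the eigenvalues sum to the number of variables, which is why each eigenvalue can be read as "how many original variables' worth" of variance a component carries.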
eig.val <- get_eigenvalue(cities_pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.67474678 29.7194087 29.71941
## Dim.2 2.15113497 23.9014997 53.62091
## Dim.3 1.11906827 12.4340919 66.05500
## Dim.4 0.96928314 10.7698127 76.82481
## Dim.5 0.72550534 8.0611704 84.88598
## Dim.6 0.66815639 7.4239599 92.30994
## Dim.7 0.43170487 4.7967208 97.10666
## Dim.8 0.20364202 2.2626891 99.36935
## Dim.9 0.05675822 0.6306469 100.00000
fviz_eig(cities_pca, addlabels = TRUE, ylim = c(0, 40))
var <- get_pca_var(cities_pca)
var$coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Climate 0.2142764 0.83044623 0.21295021 -0.1656714 -0.273473946
## Housing 0.3664159 0.70180909 0.44344162 -0.1972430 0.004846431
## HlthCare 0.8787761 -0.27856262 0.03992466 -0.2747599 0.169397449
## Crime 0.5272423 0.20545412 -0.31333529 0.6019620 0.215697338
## Transp 0.5334603 -0.24041140 0.13141777 0.4414921 -0.642712107
## Educ 0.3293290 -0.65602402 0.42948026 -0.1573546 0.035975482
## Arts 0.9101290 -0.15896992 0.01769167 -0.0344317 0.170960926
## Recreat 0.3447763 0.57870936 -0.20249105 0.1027961 0.197285189
## Econ -0.3226404 0.02843216 0.73093603 0.4833753 0.304878405
var$cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Climate 0.04591437 0.689640943 0.0453477921 0.027447027 0.0747879992
## Housing 0.13426058 0.492535993 0.1966404722 0.038904805 0.0000234879
## HlthCare 0.77224752 0.077597134 0.0015939788 0.075493001 0.0286954959
## Crime 0.27798448 0.042211397 0.0981790013 0.362358300 0.0465253417
## Transp 0.28457993 0.057797640 0.0172706312 0.194915282 0.4130788521
## Educ 0.10845760 0.430367514 0.1844532956 0.024760485 0.0012942353
## Arts 0.82833475 0.025271436 0.0003129951 0.001185542 0.0292276383
## Recreat 0.11887069 0.334904525 0.0410026252 0.010567028 0.0389214457
## Econ 0.10409686 0.000808388 0.5342674813 0.233651668 0.0929508421
var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Climate 1.716588 32.0593990 4.05228110 2.8316831 10.308400960
## Housing 5.019562 22.8965639 17.57180299 4.0137710 0.003237453
## HlthCare 28.871799 3.6072648 0.14243803 7.7885396 3.955242552
## Crime 10.392927 1.9622849 8.77328075 37.3841539 6.412818663
## Transp 10.639509 2.6868439 1.54330452 20.1092204 56.936707467
## Educ 4.054874 20.0065324 16.48275624 2.5545152 0.178390871
## Arts 30.968717 1.1747955 0.02796926 0.1223112 4.028590390
## Recreat 4.444185 15.5687360 3.66399675 1.0901900 5.364735951
## Econ 3.891840 0.0375796 47.74217036 24.1056157 12.811875694
fviz_pca_var(cities_pca, col.var = "cos2",
             gradient.cols = c("blue", "yellow", "red"),
             repel = TRUE)
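The eigenvalues show that the first two components together explain
53.6% of the variance in the dataset. That is enough for a
two-dimensional PCA map to be informative, although nearly half of the
variance lies outside it: three components have eigenvalues above 1 and
together explain 66.1%. The cos2 values show how well each variable is
represented on the component axes, and the contrib values show which
variables most influence each principal component. Arts (31.0%) and
HlthCare (28.9%) dominate PC1, while Climate (32.1%) and Housing (22.9%)
shape PC2, with Educ (20.0%) loading negatively on it. Econ contributes
little to either axis (3.9% and 0.04%) and instead drives PC3 (47.7%).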
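The quantities reported above are linked by simple arithmetic, which can be reproduced from the printed output alone. The sketch below copies the eigenvalues and the Dim.1 column of var$coord from the output and recomputes the variance percentages, cos2, and contrib. The object names are illustrative.

```r
# Eigenvalues copied from the get_eigenvalue() output above
eig <- c(2.67474678, 2.15113497, 1.11906827, 0.96928314, 0.72550534,
         0.66815639, 0.43170487, 0.20364202, 0.05675822)

# Share of total variance per component, and the running total
var_pct <- eig / sum(eig) * 100
cum_pct <- cumsum(var_pct)

# Dim.1 coordinates (var$coord) copied from the output above
coord1 <- c(Climate = 0.2142764, Housing = 0.3664159, HlthCare = 0.8787761,
            Crime = 0.5272423, Transp = 0.5334603, Educ = 0.3293290,
            Arts = 0.9101290, Recreat = 0.3447763, Econ = -0.3226404)

# cos2 is the squared coordinate; contrib is cos2 as a percentage of the eigenvalue
cos2_1    <- coord1^2
contrib_1 <- cos2_1 / eig[1] * 100

print(round(cum_pct, 2))
print(round(sort(contrib_1, decreasing = TRUE), 2))
```

Sorting contrib_1 is a quick way to confirm which variables lead a component instead of scanning the table by eye.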
ind <- get_pca_ind(cities_pca)
fviz_pca_ind(cities_pca, col.ind = "cos2",
             gradient.cols = c("blue", "yellow", "red"),
             repel = TRUE)
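Cities are plotted in the PCA space, where proximity indicates similar profiles on the first two components (about 54% of the variance across the 9 metrics). According to the contrib table, PC1 is driven mainly by Arts and HlthCare, so cities on the right tend to score high on arts and health care, while those on the left tend to score lower. Point color shows cos2, so cities with low cos2 are poorly represented in this plane and their positions should be read with caution. This visual helps identify which cities are comparable in overall quality-of-life profile.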
fviz_pca_biplot(cities_pca, col.var = "cos2", col.ind = "cos2",
                gradient.cols = c("blue", "yellow", "red"),
                repel = TRUE)
The biplot overlays city scores and variable vectors, showing both where each city falls and which variables drive that position. A city lying far out in the direction of the Arts vector, for example, scores high on Arts. This plot is especially helpful for strategic insights, such as identifying the specific strengths or weaknesses of individual cities.
# Columns 2:11 are the nine metrics plus Pop; Pop is the 10th of these
cities_pca_sup <- PCA(cities30[, 2:11], quanti.sup = 10, graph = TRUE)
fviz_pca_var(cities_pca_sup, col.var = "black",
             col.quanti.sup = "red")
By introducing population (Pop) as a supplementary variable, we can see
how city size aligns with the other indicators. A supplementary
variable is projected onto the components after they are computed, so
it does not affect the PCA itself but provides context. For instance,
if the Pop arrow points in the same direction as Arts and HlthCare,
larger cities such as New York tend to score higher on those metrics,
suggesting that larger populations often accompany broader
infrastructure or cultural capacity.
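The idea of a supplementary variable can be sketched in base R: fit the PCA on the active metrics only, then correlate the held-out variable with the resulting component scores. This is a stand-in for FactoMineR's quanti.sup option, not its implementation, and it uses only the ten rows shown in the str() output, so the values will differ from the full analysis. The names `active` and `pop` are illustrative.

```r
# First ten rows of the nine active metrics, copied from the str() output above.
# Assumption: illustration only; the full analysis uses all 30 cities.
active <- data.frame(
  Climate  = c(885, 696, 567, 623, 514, 584, 579, 544, 521, 536),
  Housing  = c(16047, 8316, 9148, 11609, 10913, 8143, 9168, 9318, 10789, 8525),
  HlthCare = c(2025, 3195, 3562, 5301, 5766, 2138, 3167, 2825, 2533, 4142),
  Crime    = c(983, 1308, 1730, 1215, 1034, 978, 1138, 1529, 1365, 1587),
  Transp   = c(3954, 8409, 7405, 6801, 7742, 5748, 7333, 6213, 8145, 4808),
  Educ     = c(2843, 3057, 3471, 3479, 3486, 2918, 2972, 3269, 3145, 3064),
  Arts     = c(5632, 7559, 9788, 21042, 24846, 9688, 12679, 10438, 8477, 10389),
  Recreat  = c(3156, 1362, 2925, 3066, 2856, 2451, 3300, 2310, 2324, 2483),
  Econ     = c(6220, 6315, 5503, 6363, 5205, 5270, 4879, 7710, 7164, 3904)
)

# Population for the same ten cities; held out of the PCA fit
pop <- c(1932709, 2138231, 2199531, 2805911, 6060387,
         1401491, 1898825, 1957378, 1428836, 4488072)

# Fit the PCA on the active variables only, so pop cannot influence the axes
pc     <- prcomp(active, center = TRUE, scale. = TRUE)
scores <- pc$x[, 1:2]

# A supplementary variable's coordinates are its correlations with the scores
pop_coord <- cor(pop, scores)

# An active variable's coordinates can be computed the same way
arts_coord <- cor(active$Arts, scores)

print(round(pop_coord, 3))
print(round(arts_coord, 3))
```

For an active variable, this correlation equals its loading times the component's standard deviation, which is the same quantity var$coord reports. The supplementary variable is placed on the same footing without having shaped the axes.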