Consider the following 6 eigenvalues from a \(6\times 6\) correlation matrix:
\[\lambda_1 = 3.5, \lambda_2 = 1.0, \lambda_3 = 0.7, \lambda_4 = 0.4, \lambda_5 = 0.25, \lambda_6 = 0.15\]
If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
eig <- c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)
prop <- eig / 6
prop
[1] 0.58333333 0.16666667 0.11666667 0.06666667 0.04166667 0.02500000
sum(prop[1:3])
[1] 0.8666667
sum(prop[1:4])
[1] 0.9333333
The first 3 PCs explain only ~87% of the total variability, which falls short of 90%. The first 4 explain ~93%, so I would keep 4.
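The same cutoff can be found without summing slices by hand. A minimal sketch, assuming the eigenvalues from the question; `cum_prop` and `k` are illustrative names that do not appear in the original:

```r
# Eigenvalues from the 6 x 6 correlation matrix in the question
eig <- c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)

# For a correlation matrix the eigenvalues sum to the number of variables (6),
# so dividing by sum(eig) gives the same proportions as eig / 6
cum_prop <- cumsum(eig) / sum(eig)

# Smallest number of components whose cumulative proportion reaches 90%
k <- which(cum_prop >= 0.90)[1]
k  # 4
```

Dividing by `sum(eig)` rather than a hard-coded 6 keeps the calculation valid if the number of variables changes.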
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set contained a measurement of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
library(tidyverse)
five_irises <- data.frame(
row.names = 1:5,
Sepal.Length = c(0.189, 0.551, -0.415, 0.310, -0.898),
Sepal.Width = c(-1.97, 0.786, 2.62, -0.590, 1.70),
Petal.Length = c(0.137, 1.04, -1.34, 0.534, -1.05),
Petal.Width = c(-0.262, 1.58, -1.31, 0.000875, -1.05)
) %>% as.matrix
five_irises
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        0.189      -1.970        0.137   -0.262000
2        0.551       0.786        1.040    1.580000
3       -0.415       2.620       -1.340   -1.310000
4        0.310      -0.590        0.534    0.000875
5       -0.898       1.700       -1.050   -1.050000
Consider also the loadings for the first two principal components:
# Create the data frame
pc_loadings <- data.frame(
PC1 = c(0.5210659, -0.2693474, 0.5804131, 0.5648565),
PC2 = c(-0.37741762, -0.92329566, -0.02449161, -0.06694199),
row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
) %>% as.matrix
pc_loadings
                    PC1         PC2
Sepal.Length  0.5210659 -0.37741762
Sepal.Width  -0.2693474 -0.92329566
Petal.Length  0.5804131 -0.02449161
Petal.Width   0.5648565 -0.06694199
pc_scores <- five_irises %*% pc_loadings
pc_scores
         PC1        PC2
1  0.5606200  1.7617440
2  1.5715031 -1.0649071
3 -2.4396481 -2.1418936
4  0.6308802  0.4146079
5 -2.1283408 -1.1346763
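Each entry of the score matrix is a dot product of one iris's scaled measurements with one loading vector. As a check on the matrix product, here is a minimal sketch that recomputes the PC1 score for iris 1 by hand; the names `iris_1`, `pc1`, and `score_1_pc1` are illustrative only:

```r
# Scaled measurements for iris 1 (first row of five_irises)
iris_1 <- c(Sepal.Length = 0.189, Sepal.Width = -1.97,
            Petal.Length = 0.137, Petal.Width = -0.262)

# PC1 loadings (first column of pc_loadings)
pc1 <- c(Sepal.Length = 0.5210659, Sepal.Width = -0.2693474,
         Petal.Length = 0.5804131, Petal.Width = 0.5648565)

# A PC score is the sum of measurement-times-loading over the four variables
score_1_pc1 <- sum(iris_1 * pc1)
round(score_1_pc1, 4)  # 0.5606, matching row 1 of pc_scores
```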
The first two PC scores for these five irises are shown in the plot below.

Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.
A - 3
B - 1
C - 4
D - 2
E - 5
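These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are (as named in the data file): Climate, Housing, HlthCare, Crime, Transp, Educ, Arts, Recreat, and Econ.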
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
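Since the coordinates and population are to be left out, the columns passed to the PCA should stop at the ninth rating. A minimal sketch on a made-up data frame (the `toy` values and the names `rating_cols`, `ratings_only`, and `through_pop` are illustrative, not from the original), showing that a column range ending at `Pop` also spans `Long` and `Lat`, whereas selecting through `Econ` keeps only the nine ratings:

```r
# Hypothetical three-row stand-in with the same column layout as the data file;
# the values are made up purely to show the selection
toy <- data.frame(
  City = c("A", "B", "C"),
  Climate = 1:3, Housing = 4:6, HlthCare = 7:9, Crime = 1:3, Transp = 4:6,
  Educ = 7:9, Arts = 1:3, Recreat = 4:6, Econ = 7:9,
  Long = c(-99, -81, -84), Lat = c(32, 41, 31), Pop = c(1e5, 6e5, 1e5)
)

rating_cols <- c("Climate", "Housing", "HlthCare", "Crime", "Transp",
                 "Educ", "Arts", "Recreat", "Econ")

# Base-R equivalent of dplyr's select(Climate:Econ): the nine ratings only
ratings_only <- toy[, rating_cols]

# A positional range from Climate through Pop picks up three extra columns
first <- match("Climate", names(toy))
last <- match("Pop", names(toy))
through_pop <- names(toy)[first:last]
setdiff(through_pop, rating_cols)  # "Long" "Lat" "Pop"
```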
Use PCA to identify the major components of variation in the ratings among cities.
places <- read.csv('Data/Places.csv')
head(places)
City Climate Housing HlthCare Crime Transp Educ Arts
1 AbileneTX 521 6200 237 923 4031 2757 996
2 AkronOH 575 8138 1656 886 4883 2438 5564
3 AlbanyGA 468 7339 618 970 2531 2560 237
4 Albany-Schenectady-TroyNY 476 7908 1431 610 6883 3399 4655
5 AlbuquerqueNM 659 8393 1853 1483 6558 3026 4496
6 AlexandriaLA 520 5819 640 727 2444 2972 334
Recreat Econ Long Lat Pop
1 1405 7633 -99.6890 32.5590 110932
2 2632 4350 -81.5180 41.0850 660328
3 859 5250 -84.1580 31.5750 112402
4 1617 5864 -73.7983 42.7327 835880
5 2612 5727 -106.6500 35.0830 419700
6 1018 5254 -92.4530 31.3020 135282
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
library(factoextra)
library(tidyverse)
numeric_variables <- places %>% select(Climate:Pop)
rownames(numeric_variables) <- places$City
places_pca <- prcomp(numeric_variables, scale. = TRUE)
fviz_eig(places_pca)
fviz_eig(places_pca, geom='line') +
  labs(title='Scree plot, % variability explained') +
  theme_classic(base_size = 16)
Plot Interpretation:
I see the slope of the cliff starts to flatten out going from PC3 to PC4, so I think using 3 is the right choice here. The percent of total variability that these retained PCs explain is ~62%. I got this by adding up the first 3 PCs: ~37% + ~13% + ~12%.
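The percentages read off a scree plot can also be computed exactly from the component standard deviations that `prcomp` returns. A minimal sketch, using the built-in `USArrests` data as a stand-in because the places file is not reproduced here; `prop_var`, `cum_var`, and `var_table` are illustrative names:

```r
# USArrests ships with base R and stands in for the places data here
pca <- prcomp(USArrests, scale. = TRUE)

# Variance of each component is its squared standard deviation;
# dividing by the total gives the proportion explained
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(prop_var)

# One row per component: individual and running share of total variance
var_table <- data.frame(PC = seq_along(prop_var),
                        percent = round(100 * prop_var, 1),
                        cumulative = round(100 * cum_var, 1))
var_table
```

Applying the same three lines to `places_pca$sdev` would give exact values in place of the approximate ones read from the plot.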
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
fviz_contrib(places_pca, choice = "var", axes = 1)
fviz_contrib(places_pca, choice = "var", axes = 2)
fviz_pca_var(places_pca, axes = c(1, 2))
PC1 (~40% of variance) separates cities by their level of urban services and amenities.
PC2 (~14% of variance) contrasts economic and recreation strength with the quality of education.
Based on the signs of the loadings, arts and housing increase along PC1, while on PC2 education points in the opposite direction from the economy.
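When discussing signs, it helps to list the loadings in order and to remember that the overall sign of a component is arbitrary: only the relative signs between variables carry meaning. A minimal sketch on the built-in `USArrests` data (a stand-in, not the places data); `loadings` and `flipped_scores` are illustrative names:

```r
# Stand-in PCA on a built-in data set
pca <- prcomp(USArrests, scale. = TRUE)
loadings <- pca$rotation[, 1:2]

# Sorting shows which variables sit at each end of PC1
sort(loadings[, "PC1"])

# Flipping the sign of the whole loading vector mirrors the scores
# but leaves the variance along that axis unchanged
flipped_scores <- scale(USArrests) %*% (-loadings[, "PC1"])
all.equal(var(as.numeric(flipped_scores)), pca$sdev[1]^2)
```

Variables whose loadings share a sign move together along that axis, and variables with opposite signs are contrasted by it.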
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
library(ggrepel)
fviz_pca(places_pca, axes = 1:2)
places_pca$x[,1:2]
PC1 PC2
AbileneTX -1.200063833 1.556514809
AkronOH 0.504815548 -0.538215899
AlbanyGA -1.898891933 0.502877996
Albany-Schenectady-TroyNY 0.974876662 -2.266101330
AlbuquerqueNM 1.863754939 1.357399512
AlexandriaLA -1.796047280 0.137333763
AllentownBethlehemPA-NJ -0.884954353 -1.675815005
AltonGranite-CityIL -0.701084596 -0.607919131
AltoonaPA -1.781442928 -1.648523852
AmarilloTX -0.638014016 0.861976953
Anaheim-Santa-AnaCA 3.259134701 2.311003372
AnchorageAK 0.205196615 1.632538860
AndersonIN -2.199121763 -0.908264633
AndersonSC -1.787423447 -0.376195529
Ann-ArborMI 1.617484649 -1.350467442
AnnistonAL -1.985151476 0.866313945
Appleton-Oshkosh-NeenahWI -1.037626367 -1.126998905
AshevilleNC -0.418962001 -0.688164633
AthensGA -1.508627704 0.273492649
AtlantaGA 3.497164728 -0.126991856
Atlantic-CityNJ 0.978290913 1.917224570
AugustaGA-SC -0.985887212 -0.354344869
Aurora-ElginIL -0.523509501 -0.127779095
AustinTX -0.041053963 1.167223094
BakersfieldCA -0.488984431 2.434842333
BaltimoreMD 4.702824985 -0.768488830
BangorME -0.507123313 -1.899889518
Baton-RougeLA -0.486070305 1.844211419
Battle-CreekMI -1.146110953 -0.660807160
Beaumont-Port-ArthurTX -0.984237656 1.566754033
Beaver-CountyPA -1.741312165 -1.683180644
BellinghamWA -0.233746537 0.411051278
Benton-HarborMI -1.476780926 -0.406721181
Bergen-PassaicNJ 2.903190850 -1.708385130
BillingsMT -0.235538359 -0.406607708
Biloxi-GulfportMS -1.447240947 1.251000886
BinghamptonNY -0.510393424 -1.631945288
BirminghamAL 0.436104566 -0.066472158
BismarckND -1.590080459 -1.314330492
BloomingtonIN -1.331011178 -0.528838832
Bloomington-NormalIL -0.416836567 -0.644133544
Boise-CityID -0.475759269 0.330168131
BostonMA 6.873900487 -1.944584517
Boulder-LongmontCO 1.565915021 1.726217982
BradentonFL -1.121023511 1.673082894
BrazoriaTX -1.519778067 1.134316882
BremertonWA -1.176029418 0.849001842
Bridgeport-MilfordCT 1.964978981 -0.465487101
BristolCT -1.036111069 -1.039312206
BrocktonMA -0.324139897 0.004038204
Brownsville-HarlingtonTX -1.656851557 2.137154623
Bryan-College-StationTX -0.945300971 1.558367906
BuffaloNY 1.990296220 -1.491883312
BurlingtonNC -1.992953347 -0.105259432
BurlingtonVT 1.203868378 -1.737240778
CantonOH -1.208082838 -1.056469262
CasperWY -1.127248587 0.664401960
Cedar-RapidsIA -1.012388419 -0.783834597
Champaign-Urbana-RantoulIL 0.463229775 -0.653035035
CharlestonSC 0.814263699 0.566294753
CharlestonWV -0.541434090 -0.909817115
Charlotte-Gastonia-Rock-HillNC-SC 0.818706204 -0.010874242
CharlottesvilleVA 0.720786604 -1.082670655
ChattanoogaTN-GA -1.405043885 0.182480725
ChicagoIL 8.659281177 -2.652465063
ChicoCA -0.507501482 0.639521507
CincinnatiOH-KY-IN 2.107522755 -0.656572014
Clarksville-HopkinsvilleTN-KY -1.714502384 0.149302959
ClevelandOH 3.958585527 -1.063678205
Colorado-SpringsCO -0.173922405 1.635525681
ColumbiaMO -0.507843466 0.151877698
ColumbiaSC 0.505360553 0.335827597
ColumbusGA-AL -1.545398821 0.024972282
ColumbusOH 1.203190729 -0.640850522
Corpus-ChristiTX -0.670171077 2.023378319
CumberlandMD-WV -1.341728501 -2.101722874
DallasTX 3.740902845 1.148595645
DanburyCT 1.154367759 -1.060648416
DanvilleVA -2.180178738 -1.562002180
Davenport-Rock-Island-MolineIA-IL -0.577336413 -0.746002499
Dayton-SpringfieldOH 0.829194919 -0.990687124
Daytona-BeachFL 0.223357254 2.418698618
DecaturIL -1.407279071 -0.658240901
DenverCO 3.511865831 0.499810721
Des-MoinesIA -0.180994974 -0.191828009
DetroitMI 4.744807079 -1.444202134
DothganAL -2.692268644 0.679780508
DubuqueIA -1.624768044 -1.287993505
DuluthMN-WI -0.950462023 -1.817257767
East-St.-Louis-BellevilleIL -0.460219772 0.414848836
Eau-ClaireWI -1.679307903 -1.794236276
El-PasoTX -0.714127038 1.683468587
Elkhart-GoshenIN -1.880204380 -1.271930137
ElmiraNY -1.398006103 -1.611050321
EnidOK -1.549808162 1.175032708
EriePA -0.804473771 -1.323801840
Eugene-SpringfieldOR 0.742976619 0.074183995
EvansvilleIN-KY -0.941848659 -0.202286811
Fall-RiverMA-RI -0.567464971 -0.558642888
Fargo-MoorheadND-MN -1.215272445 -1.543803789
FayettevilleNC -0.804148973 -0.422658171
Fayettteville-SprindaleAR -1.703390541 0.127595822
Fitchburg-LeominsterMA -1.815514476 -1.517391261
FlintMI -0.372649437 -0.212315502
FlorenceAL -2.527724277 -0.204493382
FlorenceSC -0.614608384 -0.275812877
Fort-Collins-Lover=landCO -0.247722998 1.705572328
Fort-Lauderdale-Hollywood-Pompano-BeachFL 1.151467608 2.452191468
Fort-MyersFL -0.136301037 1.904729142
Fort-PierceFL -0.187193444 2.252927062
Fort-SmithAR-OK -1.647865979 -0.190120753
Fort-Walton-BeachFL -1.353871273 0.739738274
Fort-WayneIN -0.524205985 -1.292184359
Forth-ArlingtonTX 0.873147460 1.871226429
FresnoCA 0.357537857 2.686433734
GadsdenAL -2.774332411 -0.153840062
GainesvilleFL 0.088820214 0.759290368
Galveston-Texas-CityTX 0.445169849 2.095521753
Gary-HammondIN 0.099817086 -0.869402452
Glens-FallsNY -1.996518584 -1.111953240
Grand-ForksND -1.375262780 -2.037662928
Grand-RapidsMI -0.294368353 -0.721385024
Great-FallsMT -1.222593304 -0.546397343
GreeleyCO -1.531299049 0.102548974
Green-BayWI -1.026569435 -0.973309498
Greensboro-Winston-Salem-High-PointNC 0.599541415 -0.620278790
Greenville-SpartanburgSC 0.126043499 -0.288846820
HagerstownMD -1.096677537 -1.328268511
Hamilton-MiddletownOH -0.688252101 -0.578474401
Harrisburg-Lebanon-CarlislePA 0.732842277 -1.872578180
HartfordCT 2.611692441 -1.463284188
HickoryNC -1.567732352 -0.493246591
HonoluluHI 2.583656234 2.685777106
Houma-ThibodauxLA -2.288363630 1.830014482
HoustonTX 3.335454279 1.083343549
Huntington-AshlandWV-KY-OH -1.318366385 -0.829367405
HuntsvilleAL -1.643523782 1.043077920
IndianapolisIN 1.189534056 -0.662689483
Iowa-CityIA -0.282147341 -0.908631569
JacksonMI -0.961406787 -0.795056234
JacksonMS 0.146464382 -0.200673790
JacksonvilleFL 0.688067893 0.843828101
JacksonvilleNC -1.697251824 0.478726274
Janesville-BeloitWI -1.504173291 -0.635768591
Jersey-CityNJ 0.786310867 -0.548555357
Johnson-City-Kingsport-BristolTN-VA -0.879595840 -0.698971060
JohnstownPA -1.719812029 -1.782322830
JolietIL -0.195463171 -0.982403783
JoplinMO -2.090643220 -0.323652405
KalamazooMI 0.102662498 -0.603185775
KankakeeIL -1.326982135 -0.442571686
Kansas-CityKS 0.269922095 -0.106653788
Kansas-CityMO 1.428707538 0.007102818
KenoshaWI -0.002131713 -0.687397225
Kileen-TempleTX -1.645071157 0.265749043
KnoxvilleTN -0.121943817 0.037187636
KokomoIN -2.289267727 -1.771720675
La-CrosseWI -0.434582793 -1.200596768
LafayetteIN -0.853140403 -1.415937674
LafayetteLA -0.514734942 2.708772016
Lake-CharlesLA -1.404362091 1.742893032
Lake-CountyIL 0.428268233 0.283683698
Lakeland-Winter-HavenFL -0.848108227 1.202262132
LancasterPA -0.765460280 -1.760368511
Lansing-East-LansingMI 0.148332548 -1.319770602
LaredoTX -1.826156175 1.017982973
Las-CrucesNM -0.731596022 1.798283505
Las-VegasNV 1.098840944 3.648509795
LawrenceKS -1.037490547 0.079765100
Lawrence-HaverhillMA-NH -0.225391663 -0.275348385
LawtonOK -1.660727108 1.588178250
Lewiston-AuburnME -1.467332962 -1.498225513
Lexington-FayetteKY 0.361066087 0.007133817
LimaOH -1.229311382 -0.805735912
LincolnNE -0.255039861 -0.750005582
Little-RockNorth-Little-RockAR -0.113262477 0.665486748
Longview-MarshallTX -1.577972288 1.025378274
Lorain-ElyriaOH -0.981926229 -1.406888507
Los-AngelesLong-BeachCA 9.959538863 1.676407866
LouisvilleKY-IN 1.123922291 -0.665806824
LowellMA-NH -0.625566992 -0.591626388
LubbockTX -0.107961331 2.031476002
LynchburgVA -0.895518865 -0.753362623
MaconWarner-RobbinsGA -1.662193827 0.158792072
MadisonWI 0.969836317 -1.441424711
ManchesterNH -1.068796808 -1.113841197
MansfieldOH -1.482116991 -0.681947215
McAllen-Edinburg-MissionTX -2.436032653 1.092357346
MedfordOR -1.124035849 -0.020136029
Melbourne-Titusville-Palm-BayFL -0.083492578 2.266717862
MemphisTN-AR-MS 1.283802082 0.429453305
Miami-HialeahFL 3.696920472 3.309422602
Middlesex-SomersetHunterdonNJ 2.104552814 -1.814468398
MiddletownCT 0.041268922 -1.050399027
MidlandTX -0.470459653 3.256285756
MilwaukeeWI 2.455775290 -1.391537406
Minneapolis-St.-PaulMN-WI 3.155763009 -1.805230663
MobileAL -0.229739535 0.991359550
ModestoCA -1.397287395 1.466264939
Monmouth-OceanNJ 1.874970795 -0.148370134
MonroeLA -1.640563634 0.953952857
MontgomeryAL -1.644024193 0.301131582
MuncieIN -0.952493999 -1.356590233
MuskegonMI -0.598026237 0.154534252
NashuaNH -0.714471085 -0.907230307
NashvilleTN 0.760304997 0.140209606
Nassua-SuffolkNY 3.650657797 -0.838034691
New-BedsfordMA -1.218143097 -0.391231141
New-BritainCT 0.121361375 -1.085223695
New-Haven-MeridenCT 2.028886161 -1.908991396
New-London-NorwichCT-RI 0.119247044 -0.887475327
New-OrleansLA 2.677428774 1.706737854
New-YorkNY 15.262947353 -1.450383321
NewarkNJ 4.438830711 -1.190321590
Niagara-FallsNY -0.837725429 -0.981456048
Norfolk-Virginia-Beach-Newport-NewsVA 1.545484593 -0.059117510
NorwalkCT 2.443285979 -0.721482119
OaklandCA 3.494823316 2.129645111
OcalaFL -0.540334708 1.830874038
OdessaTX -1.126436049 2.732710187
Oklahoma-CityOK 0.945190652 1.434359272
OlympiaWA -0.974984334 0.232004425
OmahaNE-IA 1.048095737 -0.735914623
Orange-CountyNY 0.282636690 -0.867406473
OrlandoFL 1.176296758 2.149205393
OwensboroKY -2.051568205 -0.130742194
Oxnard-VenturaCA 0.224480804 2.592043795
Panama-CityFL -1.288781003 1.951417390
Parkerburg-MariettaWV-OH -1.596336357 -0.741210912
PascagoulaMS -2.656184178 1.431366545
Pawtucket-Woonsocket-AttleboroRI-MA -0.182777612 -1.507958470
PensacolaFL -0.452071452 1.592964059
PeoriaIL -0.967415720 -0.415197981
PhiladelphiaPA-NJ 6.441800018 -2.703367209
PhoenixAZ 1.567556598 1.651304302
Pine-BluffAR -2.254385535 0.373318850
PittsburghPA 3.324507568 -2.313286094
PittsfieldMA -0.982696935 -1.781623498
PortlandME 0.326112680 -0.796036560
PortlandOR 2.542755789 0.345530520
Portsmouth-Dover-RochesterNH-ME -1.076949387 -0.785092044
PoughkeepsieNY -0.717228788 -0.737840489
ProvidenceRI 2.051923181 -1.915098288
Provo-OremUT -0.694647105 0.050578737
PuebloCO -1.220750409 0.631404014
RacineWI -0.615212529 -0.501425002
Raleigh-DurhamNC 2.383550299 -1.053497785
ReadingPA -0.766074104 -1.239685833
ReddingCA -0.449020929 1.151172140
RenoNV 0.671774355 1.563235306
Richland-Kinnewick-PascoWA -1.653666679 1.084927556
Richmond-PetersburgVA 1.726955963 -0.567895153
Riverside-San-BernardinoCA 1.378696755 1.783466532
RoanokeVA -0.565148868 -0.130207566
RochesterMN -0.784896951 -0.913302125
RochesterNY 2.056092811 -1.513766391
RockfordIL -1.226809455 -0.585231113
SacramentoCA 1.198972101 1.484830937
Saginaw-Bay-City-MidlandMI -0.660958676 -0.981463797
St.-CloudMN -1.860746567 -1.456794953
St.-JosephMO -1.663932940 0.039167162
St.-LouisMO-IL 3.365409495 -0.873289947
SalemOR 0.122205547 -0.261548877
Salem-GlousterMA 0.125973175 -0.802820911
Salinas-Seaside-MontereyCA 0.756083364 2.911001825
Salt-Lake-City-OgdenUT 1.510294185 1.082777053
San-AngeloTX -1.499630303 1.837342004
San-AntonioTX 0.530588811 1.187789611
San-DiegoCA 3.893643450 2.133807068
San-FranciscoCA 7.066608220 1.991905725
San-JoseCA 2.943155149 2.163589445
Santa-Barbara-Santa-Maria-LompocCA 1.752157939 2.667380013
Santa-CruzCA 0.995335598 1.746740009
Santa-Rosa-PetalumaCA 0.613642135 1.673071616
SarasotaFL -0.486190979 1.928735502
SavannahGA 0.212999030 1.497634484
Scranton-Wilkes-BarrePA -0.588304956 -2.165625920
SeattleWA 4.138599849 1.217847383
SharonPA -1.955831369 -1.881820706
SheboyganWI -1.855062771 -1.257041792
Sherman-DenisonTX -2.021653588 1.101109993
ShreveportLA -0.126404105 0.947505389
Sioux-CityIA-NE -1.419783891 -0.906372579
Sioux-FallsSD -1.112014847 -1.284034123
South-Bend-MishawakaIN -0.266646015 -0.652671289
SpokaneWA -0.468545554 -0.471644171
SpringfieldIL 0.636302974 -0.520520219
SpringfieldMA 0.651299152 -1.661257138
SpringfieldMO -1.016963375 0.441663713
StamfordCT 2.851115769 -0.604495033
State-CollegePA -1.201592873 -1.637458395
Steubenville-WeirtonOH-WV -2.323838771 -1.365951908
StocktonCA -0.649411647 2.424950860
SyracuseNY 1.011805346 -1.780136172
TacomaWA 0.940445234 1.475320755
TallahasseeFL -0.408786698 0.947776537
Tampa-St.-Petersburg-ClearwaterFL 1.774278930 1.720222229
Terre-HauteIN -1.493359506 -1.143634852
TexarkanaTX-TexarkanaAR -2.177274115 -0.111957862
ToledoOH 0.547772356 -1.068127975
TopekaKS -0.539580059 -0.200892962
TrentonNJ 1.192456947 -0.775677432
TusconAZ 1.045842182 2.464916464
TulsaOK 0.270558143 1.151435042
TuscaloosaAL -1.431805518 0.422386063
TylerTX -0.950635649 0.584359884
Utica-RomeNY -1.095990823 -1.727952801
Vallejo-Fairfield-NapaCA 0.233705101 1.758598365
VancouverWA -0.939301075 0.590439112
VictoriaTX -1.744535494 2.006486653
Vineland-Millville-BridgetonNJ -0.597134072 -0.388241991
Visalia-Tulare-PortervilleCA -1.166424369 2.588735214
WacoTX -1.344257407 1.117428254
WashingtonDC-MD-VA 6.955969521 -1.898310859
MaterburyCT -0.813894618 -1.166787491
Waterloo-Cedar-FallsIA -1.013755780 -1.216041659
WausauWI -1.732533360 -1.811080840
West-Palm-Beach-Boca-Raton-Delray-BeachFL 1.810695221 2.918767359
WheelingWV-OH -1.770686937 -1.655561844
WichitaKS -0.317848649 0.398089847
Wichita-FallsTX -1.424109467 1.301088205
WilliamsportPA -1.827999153 -2.243763781
WilmingtonDE-NJ-MD 0.766135464 -0.507038195
WilmingtonNC -0.242250107 0.870711924
WorcesterMA -0.151707572 -2.225050976
YakimaWA -1.129388040 0.713392447
YorkPA -1.587300224 -1.513414797
Youngstown-WarrenOH -0.622605220 -1.321755396
Yuba-CityCA -1.719840576 1.425412640
fviz_pca(places_pca, axes = c(1,2),
label ='var',
repel = TRUE,
pointsize = 3)
Los Angeles-Long Beach: recreation, climate, housing
New York: health care, transportation
Philadelphia, Chicago: education, health care
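One way to back up a visual pick of outliers is to rank cities by their distance from the origin of the PC1-PC2 plane. A minimal sketch using five rows copied from the printed scores; the data frame `scores`, the `dist` column, and `ranked` are illustrative names, and the five-city subset is only for demonstration:

```r
# A few rows copied from the printed PC1/PC2 scores
scores <- data.frame(
  City = c("AbileneTX", "AkronOH", "ChicagoIL",
           "Los-AngelesLong-BeachCA", "New-YorkNY"),
  PC1 = c(-1.200063833, 0.504815548, 8.659281177, 9.959538863, 15.262947353),
  PC2 = c(1.556514809, -0.538215899, -2.652465063, 1.676407866, -1.450383321)
)

# Euclidean distance from the origin of the PC1-PC2 plane
scores$dist <- sqrt(scores$PC1^2 + scores$PC2^2)

# Largest distances first
ranked <- scores[order(-scores$dist), ]
ranked
```

With the full data, the scores could be attached first with something like `cbind(places, places_pca$x[, 1:2])`, and the same distance ranking applied to every city.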
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was to determine whether malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
bc_cells <- read.csv('Data/BreastDiag.csv')
head(bc_cells)
Diagnosis Radius Texture Smoothness Compactness Concavity ConcavePts Symmetry
1 M 17.99 10.38 0.11840 0.27760 0.3001 0.14710 0.2419
2 M 20.57 17.77 0.08474 0.07864 0.0869 0.07017 0.1812
3 M 19.69 21.25 0.10960 0.15990 0.1974 0.12790 0.2069
4 M 11.42 20.38 0.14250 0.28390 0.2414 0.10520 0.2597
5 M 20.29 14.34 0.10030 0.13280 0.1980 0.10430 0.1809
6 M 12.45 15.70 0.12780 0.17000 0.1578 0.08089 0.2087
FracDim
1 0.07871
2 0.05667
3 0.05999
4 0.09744
5 0.05883
6 0.07613
My analysis suggests 3 PCs should be retained. Support or refute this suggestion. What percent of variability is explained by the first 3 PCs?
numeric_variables <- bc_cells %>% select(Radius:FracDim)
cells_pca <- prcomp(numeric_variables, scale. = TRUE)
cells_pca
Standard deviations (1, .., p=8):
[1] 2.0705378 1.3503646 0.9086939 0.7061387 0.6101579 0.3035518 0.2622598
[8] 0.1783697
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4 PC5
Radius -0.3003952 0.52850910 0.27751200 -0.0449523963 0.04245937
Texture -0.1432175 0.35378530 -0.89839046 -0.0002176232 0.21581443
Smoothness -0.3482386 -0.32661945 0.12684205 0.1097614573 0.84332416
Compactness -0.4584098 -0.07219238 -0.02956419 0.1825835334 -0.23762997
Concavity -0.4508935 0.12707085 0.04245883 0.1571126948 -0.30459047
ConcavePts -0.4459288 0.22823091 0.17458320 0.0608428515 0.01923459
Symmetry -0.3240333 -0.28112508 -0.08456832 -0.8897711849 -0.11359240
FracDim -0.2251375 -0.57996072 -0.24389523 0.3640273309 -0.27912206
PC6 PC7 PC8
Radius -0.518437923 0.36152546 -0.387460874
Texture -0.006127134 0.02418201 0.004590238
Smoothness 0.079444068 -0.04732075 -0.155456892
Compactness -0.388065805 -0.73686177 0.020239147
Concavity 0.700061530 0.02347868 -0.413095816
ConcavePts 0.125314641 0.21313047 0.808318445
Symmetry -0.018262848 0.05764443 -0.023810142
FracDim -0.261064577 0.52365191 -0.026129456
fviz_eig(cells_pca)
fviz_eig(cells_pca, geom='line') +
  labs(title='Scree plot, % variability explained') +
  theme_classic(base_size = 16)
Based on the scree plot for the bc_cells data, the elbow occurs at the 3rd PC, so I agree with the suggestion to retain 3 PCs. Squaring the standard deviations printed above and dividing by their sum (8) shows that the first three PCs explain about 87% of the variability (roughly 54% + 23% + 10%).
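The share of variability can be computed directly from the standard deviations in the `cells_pca` printout instead of being read off the plot. A minimal sketch with those eight values typed in by hand; `sdev`, `prop_var`, and `cum3` are illustrative names:

```r
# Standard deviations copied from the cells_pca printout
sdev <- c(2.0705378, 1.3503646, 0.9086939, 0.7061387,
          0.6101579, 0.3035518, 0.2622598, 0.1783697)

# Eigenvalues are the squared standard deviations; with 8 scaled
# variables they sum to 8 (up to rounding in the printout)
prop_var <- sdev^2 / sum(sdev^2)
round(100 * prop_var[1:3], 1)  # 53.6 22.8 10.3

# Cumulative share for the first three components
cum3 <- sum(prop_var[1:3])
round(100 * cum3, 1)  # 86.7
```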
Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.
fviz_pca_var(cells_pca, axes = c(1, 2))
fviz_contrib(cells_pca, choice = "var", axes = 1)
fviz_contrib(cells_pca, choice = "var", axes = 2)
fviz_contrib(cells_pca, choice = "var", axes = 3)
Based on my plots …
PC 1 - 53.6% of variance - compactness, concavity, concave points
Interpretation - reflects shape irregularity and compactness of the cells; all eight loadings share the same sign
PC 2 - 22.8% of variance - radius, FracDim, texture
Interpretation - contrasts radius and texture (positive loadings) with fractal dimension (negative loading)
PC 3 - 10.3% of variance - texture (about 80% of the contribution)
Interpretation - driven almost entirely by texture
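The contribution percentages can be cross-checked against the loadings in the rotation matrix. Because each loading vector has unit length, the percent contribution of a variable to a single component is 100 times its squared loading. This is my understanding of what `fviz_contrib` reports for one axis, so treat that link as an assumption. A minimal sketch for PC3, with the loadings typed in from the printout and `pc3` and `contrib` as illustrative names:

```r
# PC3 loadings copied from the rotation matrix
pc3 <- c(Radius = 0.27751200, Texture = -0.89839046,
         Smoothness = 0.12684205, Compactness = -0.02956419,
         Concavity = 0.04245883, ConcavePts = 0.17458320,
         Symmetry = -0.08456832, FracDim = -0.24389523)

# Squared loadings sum to 1, so scaling by 100 gives percentages
contrib <- 100 * pc3^2
round(sort(contrib, decreasing = TRUE), 1)  # Texture comes first, at 80.7
```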
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
cells_scores <- as.data.frame(cells_pca$x[, 1:3])
cells_scores$Diagnosis <- as.factor(bc_cells$Diagnosis)
# Biplot of PCs 1 and 2; points are colored by diagnosis and sized by PC3
fviz_pca(cells_pca,
         axes = c(1, 2),
         col.ind = "white",
         label = "var",
         repel = TRUE,
         pointsize = 0) +
  geom_point(data = cells_scores,
             aes(x = PC1, y = PC2,
                 color = Diagnosis, size = PC3)) +
  labs(color = "Diagnosis",
       size = "PC3",
       x = "PC1",
       y = "PC2",
       title = "Breast cancer cells: PCs 1–3") +
  theme_classic()
Characteristics that distinguish malignant from benign cells:
- Malignant cells have higher radius, texture, concave points, compactness, and concavity
- These values are lower in benign cells
PC that does the best job differentiating benign from malignant:
- PC1, because it is driven by radius, concavity, compactness, and related variables, which tend to be larger in the malignant tumors. PC2 and PC3 do not separate the two diagnoses as well.
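When reading the direction of PC1 on the biplot, note that every PC1 loading in the rotation matrix is negative, so cells with larger-than-average values on these features land at negative PC1 scores. A minimal sketch with the PC1 loadings typed in from the printout; the all-ones vector `z` is a hypothetical cell one standard deviation above the mean on every feature, and `score_pc1` is an illustrative name:

```r
# PC1 loadings copied from the rotation matrix
pc1 <- c(Radius = -0.3003952, Texture = -0.1432175,
         Smoothness = -0.3482386, Compactness = -0.4584098,
         Concavity = -0.4508935, ConcavePts = -0.4459288,
         Symmetry = -0.3240333, FracDim = -0.2251375)

# Every loading has the same (negative) sign
all(pc1 < 0)

# Hypothetical cell: +1 standard deviation on all eight scaled features
z <- rep(1, length(pc1))
score_pc1 <- sum(z * pc1)
score_pc1  # about -2.696
```

Since the sign of a component is arbitrary, "larger features" corresponds to the negative end of PC1 here only because of how this particular fit came out.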