Activity 3.3 - PCA implementation

SUBMISSION INSTRUCTIONS
- Render to html
- Publish your html to RPubs
- Submit a link to your published solutions

library(tidyverse)
Problem 1
Consider the following 6 eigenvalues from a \(6\times 6\) correlation matrix:
\[\lambda_1 = 3.5, \lambda_2 = 1.0, \lambda_3 = 0.7, \lambda_4 = 0.4, \lambda_5 = 0.25, \lambda_6 = 0.15\]
If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
I would keep 4 principal components.

PC1: 3.5 / 6 = 58.33%
PC1-2: (3.5 + 1.0) / 6 = 75%
PC1-3: (3.5 + 1.0 + 0.7) / 6 = 86.67%
PC1-4: (3.5 + 1.0 + 0.7 + 0.4) / 6 = 93.33%

Since the eigenvalues of a correlation matrix sum to the number of variables (6 here), dividing the running total by 6 gives the cumulative proportion of variance explained. I’d want to keep PC1-4 to explain at least 90% of the variability.
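The same arithmetic can be checked quickly in R; a small sketch using only base functions:

# Eigenvalues of the 6 x 6 correlation matrix; for a correlation matrix they sum to p = 6
eigenvalues <- c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)

# Cumulative proportion of variance explained by the first k PCs
round(cumsum(eigenvalues) / sum(eigenvalues), 4)
#> 0.5833 0.7500 0.8667 0.9333 0.9750 1.0000

The fourth entry is the first to exceed 0.90, matching the answer above.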
Problem 2
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set contained a measurement of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:

five_irises <- data.frame(
  row.names = 1:5,
  Sepal.Length = c(0.189, 0.551, -0.415, 0.310, -0.898),
  Sepal.Width = c(-1.97, 0.786, 2.62, -0.590, 1.70),
  Petal.Length = c(0.137, 1.04, -1.34, 0.534, -1.05),
  Petal.Width = c(-0.262, 1.58, -1.31, 0.000875, -1.05)
) %>% as.matrix()

Consider also the loadings for the first two principal components:

# Create the data frame
pc_loadings <- data.frame(
  PC1 = c(0.5210659, -0.2693474, 0.5804131, 0.5648565),
  PC2 = c(-0.37741762, -0.92329566, -0.02449161, -0.06694199),
  row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
) %>% as.matrix()

A plot of the first two PC scores for these five irises is shown in the plot below.
Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.
Iris ID 1: B
Iris ID 2: D
Iris ID 3: A
Iris ID 4: C
Iris ID 5: E
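These matches can be verified numerically. Because the five irises are already mean-centered and scaled, their scores on the first two PCs are just the data matrix multiplied by the loading matrix; a short check using the five_irises and pc_loadings matrices defined above:

# PC scores = (centered and scaled data) %*% (loading matrix)
iris_scores <- five_irises %*% pc_loadings
round(iris_scores, 2)

Comparing each iris's (PC1, PC2) coordinates with the lettered points on the plot gives the pairing listed above.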
Problem 3
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:
- Climate & Terrain
- Housing
- Health Care & Environment
- Crime
- Transportation
- Education
- The Arts
- Recreation
- Economics
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
- Climate & Terrain: very hot and very cold months, seasonal temperature variation, heating- and cooling-degree days, freezing days, zero-degree days, ninety-degree days.
- Housing: utility bills, property taxes, mortgage payments.
- Health Care & Environment: per capita physicians, teaching hospitals, medical schools, cardiac rehabilitation centers, comprehensive cancer treatment centers, hospices, insurance/hospitalization costs index, fluoridation of drinking water, air pollution.
- Crime: violent crime rate, property crime rate.
- Transportation: daily commute, public transportation, Interstate highways, air service, passenger rail service.
- Education: pupil/teacher ratio in the public K-12 system, effort index in K-12, academic options in higher education.
- The Arts: museums, fine arts and public radio stations, public television stations, universities offering a degree or degrees in the arts, symphony orchestras, theatres, opera companies, dance companies, public libraries.
- Recreation: good restaurants, public golf courses, certified lanes for tenpin bowling, movie theatres, zoos, aquariums, family theme parks, sanctioned automobile race tracks, pari-mutuel betting attractions, major- and minor-league professional sports teams, NCAA Division I football and basketball teams, miles of ocean or Great Lakes coastline, inland water, national forests, national parks, or national wildlife refuges, Consolidated Metropolitan Statistical Area access.
- Economics: average household income adjusted for taxes and living costs, income growth, job growth.
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
places <- read.csv('Data/Places.csv')
head(places)

                       City Climate Housing HlthCare Crime Transp Educ Arts Recreat Econ      Long     Lat    Pop
1                 AbileneTX     521    6200      237   923   4031 2757  996    1405 7633  -99.6890 32.5590 110932
2                   AkronOH     575    8138     1656   886   4883 2438 5564    2632 4350  -81.5180 41.0850 660328
3                  AlbanyGA     468    7339      618   970   2531 2560  237     859 5250  -84.1580 31.5750 112402
4 Albany-Schenectady-TroyNY     476    7908     1431   610   6883 3399 4655    1617 5864  -73.7983 42.7327 835880
5             AlbuquerqueNM     659    8393     1853  1483   6558 3026 4496    2612 5727 -106.6500 35.0830 419700
6              AlexandriaLA     520    5819      640   727   2444 2972  334    1018 5254  -92.4530 31.3020 135282
A.
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
# Keep only the nine rating criteria (drop City, Long, Lat, and Pop)
places_data <- places %>%
  select(Climate, Housing, HlthCare, Crime, Transp,
         Educ, Arts, Recreat, Econ)

# scale() already standardizes the data, so center/scale. in prcomp() are redundant here but harmless
places_scaled <- scale(places_data)
pca_places <- prcomp(places_scaled, center = TRUE, scale. = TRUE)
summary(pca_places)

Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.8462 1.1018 1.0684 0.9596 0.8679 0.79408 0.70217
Proportion of Variance 0.3787 0.1349 0.1268 0.1023 0.0837 0.07006 0.05478
Cumulative Proportion 0.3787 0.5136 0.6404 0.7427 0.8264 0.89650 0.95128
PC8 PC9
Standard deviation 0.56395 0.34699
Proportion of Variance 0.03534 0.01338
Cumulative Proportion 0.98662 1.00000
var_explained <- pca_places$sdev^2 / sum(pca_places$sdev^2)

df <- data.frame(
  PC = 1:length(var_explained),
  Variance = var_explained,
  Cumulative = cumsum(var_explained)
)

plot(pca_places, type = "l", main = "Scree Plot: Places Rated Almanac")

ggplot(df, aes(x = PC, y = Variance)) +
  geom_line() +
  geom_point() +
  geom_text(aes(label = paste0(round(Cumulative * 100, 1), "%")),
            vjust = -0.7, size = 3) +
  labs(
    title = "Scree Plot: Variance Explained by Principal Components",
    x = "Principal Component",
    y = "Proportion of Variance Explained"
  ) +
  theme_minimal()

To explore this data set in lower-dimensional space while preserving most of the information, I would retain the first 6 PCs, which together explain about 89.7% of the total variance in the Places Rated Almanac ratings. The scree plot levels off after this point, so the later components contribute little additional variability.
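To tie the 89.7% figure to the numbers rather than the plot, the cumulative proportions can also be printed directly from the var_explained vector computed above:

# Cumulative proportion of variance explained by the first k PCs
round(cumsum(var_explained), 3)

The sixth value is about 0.897 (matching the Cumulative Proportion row of the summary output), which is where the "about 90%" for six PCs comes from.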
B.
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
pca_places$rotation

              PC1        PC2          PC3         PC4        PC5         PC6
Climate 0.2064140 0.2178353 -0.689955982 0.13732125 -0.3691499 0.37460469
Housing 0.3565216 0.2506240 -0.208172230 0.51182871 0.2334878 -0.14163983
HlthCare 0.4602146 -0.2994653 -0.007324926 0.01470183 -0.1032405 -0.37384804
Crime 0.2812984 0.3553423 0.185104981 -0.53905047 -0.5239397 0.08092329
Transp 0.3511508 -0.1796045 0.146376283 -0.30290371 0.4043485 0.46759180
Educ 0.2752926 -0.4833821 0.229702548 0.33541103 -0.2088191 0.50216981
Arts 0.4630545 -0.1947899 -0.026484298 -0.10108039 -0.1050976 -0.46188072
Recreat 0.3278879 0.3844746 -0.050852640 -0.18980082 0.5295406 0.08991578
Econ 0.1354123 0.4712833 0.607314475 0.42176994 -0.1596201 0.03260813
PC7 PC8 PC9
Climate -0.08470577 -0.36230833 0.0013913515
Housing -0.23063862 0.61385513 0.0136003402
HlthCare 0.01386761 -0.18567612 -0.7163548935
Crime 0.01860646 0.43002477 -0.0586084614
Transp -0.58339097 -0.09359866 0.0036294527
Educ 0.42618186 0.18866756 0.1108401911
Arts -0.02152515 -0.20398969 0.6857582127
Recreat 0.62787789 -0.15059597 -0.0255062915
Econ -0.14974066 -0.40480926 0.0004377942
library(ggplot2)

loadings_df <- as.data.frame(pca_places$rotation[, 1:2]) %>%
  rownames_to_column("Variable")

ggplot(loadings_df, aes(x = PC1, y = PC2, label = Variable)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  geom_point() +
  geom_text(vjust = -0.7) +
  labs(title = "Loadings Plot for PC1 and PC2",
       x = "PC1 Loadings",
       y = "PC2 Loadings") +
  theme_minimal()

From the loadings plot, PC1 mainly measures a city’s overall quality of life and opportunity level. All nine variables load positively on PC1, so cities with high PC1 scores tend to rate better on health care, the arts, education, transportation, and recreation. However, since Crime and Housing are criteria where lower scores are better, their positive loadings mean that high-PC1 cities also tend to have higher crime rates and housing costs. Essentially, positive-PC1 cities are larger and more developed but more expensive, while negative-PC1 cities are smaller, quieter, and more affordable.
PC2 mixes positive and negative loadings, showing a trade-off between lifestyle and infrastructure. Economics, Recreation, Climate, Housing, and Crime load positively, while Education, Health Care, the Arts, and Transportation load negatively. Cities with positive PC2 scores therefore have stronger economies, better climates, and more recreation (but also higher housing costs and crime), while cities with negative PC2 scores rate higher on education, health care, the arts, and transportation.
Overall, PC1 separates cities by livability and cost, while PC2 contrasts climate and recreation with infrastructure and education.
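One way to back up this reading is to rank the variables by the size of their loadings on each axis while keeping their signs visible; a small sketch using the pca_places object from above:

# Variables ordered by the magnitude of their PC1 and PC2 loadings (signs retained)
pc1_load <- pca_places$rotation[, "PC1"]
pc2_load <- pca_places$rotation[, "PC2"]
round(pc1_load[order(abs(pc1_load), decreasing = TRUE)], 3)
round(pc2_load[order(abs(pc2_load), decreasing = TRUE)], 3)

For PC1 the top-weighted variables (Arts, HlthCare, Housing, Transp) are all positive, while for PC2 the largest weights split between positive (Econ, Recreat, Crime) and negative (Educ, HlthCare) loadings, consistent with the interpretation above.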
C.
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
library(tidyverse)
library(ggrepel)
places_pcs <- cbind(places, pca_places$x[, 1:2])

# The most extreme cities along PC1 and PC2, used for labeling in the plot
outliers <- places_pcs %>%
  slice_max(PC1, n = 3) %>%
  bind_rows(slice_min(places_pcs, PC1, n = 3)) %>%
  bind_rows(slice_max(places_pcs, PC2, n = 3)) %>%
  bind_rows(slice_min(places_pcs, PC2, n = 3)) %>%
  distinct(City, .keep_all = TRUE)
ggplot(places_pcs, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.4, color = "steelblue", size = 1.8) +
  geom_text_repel(data = outliers, aes(label = City),
                  size = 3, color = "black", max.overlaps = 30) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(
    title = "Biplot of U.S. Cities (First Two Principal Components)",
    x = "PC1: Overall Quality & Opportunity",
    y = "PC2: Climate vs. Infrastructure"
  ) +
  theme_minimal(base_size = 12)

From the biplot, a few cities clearly stand out from the rest. New York City sits far to the right, meaning it ranks very high in areas like economy, culture, and education, but it’s also expensive and crowded. Los Angeles and San Francisco are also high on overall quality and opportunity, yet they score even better for climate and recreation, making them attractive places to live weather-wise. Pittsburgh, Philadelphia, and Chicago also rate high on opportunity but appear lower on the plot, suggesting they have strong infrastructure and job markets but less favorable climates. Meanwhile, cities like Midland, TX, Las Vegas, NV, and Atlantic City, NJ score high for warm weather and recreation, while Pascagoula, MS and Dothan, AL fall toward the lower-opportunity end, reflecting smaller, more affordable communities. In general, cities to the right of the plot are big, developed, and costly; those higher up enjoy better weather and outdoor activities; and those to the left are smaller, quieter, and more affordable.
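The plot above shows scores only, with the most extreme cities labeled via ggrepel. A biplot proper (scores plus loading arrows) could also be drawn with factoextra; a minimal sketch, assuming the places_scaled matrix and places data frame from above, with select.ind used only to keep the labeling readable:

library(factoextra)

# Use city names as row names so they become the point labels
rownames(places_scaled) <- places$City
pca_places_named <- prcomp(places_scaled)   # data are already centered and scaled

fviz_pca_biplot(pca_places_named,
                repel = TRUE,                      # repelled text labels
                col.ind = "steelblue", col.var = "firebrick",
                select.ind = list(contrib = 10),   # draw and label only the 10 highest-contribution cities
                labelsize = 3)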
Problem 4
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was to determine whether the malignancy of a tumor could be established using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
- ID - patient identification number (not used in PCA)
- Diagnosis determined by biopsy - B = benign or M = malignant
- Radius: mean of distances from center to points on the perimeter
- Texture: standard deviation of gray-scale values
- Smoothness: local variation in radius lengths
- Compactness: perimeter^2 / area - 1.0
- Concavity: severity of concave portions of the contour
- Concavepts: number of concave portions of the contour
- Symmetry: measure of symmetry of the cell nucleus
- FracDim: fractal dimension; “coastline approximation” - 1
bc_cells <- read.csv('Data/BreastDiag.csv')
head(bc_cells)

  Diagnosis Radius Texture Smoothness Compactness Concavity ConcavePts Symmetry FracDim
1         M  17.99   10.38    0.11840     0.27760    0.3001    0.14710   0.2419 0.07871
2         M  20.57   17.77    0.08474     0.07864    0.0869    0.07017   0.1812 0.05667
3         M  19.69   21.25    0.10960     0.15990    0.1974    0.12790   0.2069 0.05999
4         M  11.42   20.38    0.14250     0.28390    0.2414    0.10520   0.2597 0.09744
5         M  20.29   14.34    0.10030     0.13280    0.1980    0.10430   0.1809 0.05883
6         M  12.45   15.70    0.12780     0.17000    0.1578    0.08089   0.2087 0.07613
A.
My analysis suggests 3 PCs should be retained. Support or refute this suggestion. What percent of variability is explained by the first 3 PCs?
bc_data <- bc_cells %>%
  select(Radius, Texture, Smoothness, Compactness,
         Concavity, ConcavePts, Symmetry, FracDim)

bc_scaled <- scale(bc_data)
pca_bc <- prcomp(bc_scaled, center = TRUE, scale. = TRUE)
summary(pca_bc)

Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0705 1.3504 0.9087 0.70614 0.61016 0.30355 0.2623
Proportion of Variance 0.5359 0.2279 0.1032 0.06233 0.04654 0.01152 0.0086
Cumulative Proportion 0.5359 0.7638 0.8670 0.92937 0.97591 0.98743 0.9960
PC8
Standard deviation 0.17837
Proportion of Variance 0.00398
Cumulative Proportion 1.00000
var_explained <- pca_bc$sdev^2 / sum(pca_bc$sdev^2)

df <- data.frame(
  PC = 1:length(var_explained),
  Variance = var_explained,
  Cumulative = cumsum(var_explained)
)
ggplot(df, aes(x = PC, y = Variance)) +
  geom_line() +
  geom_point() +
  geom_text(aes(label = paste0(round(Cumulative * 100, 1), "%")),
            vjust = -0.7, size = 3) +
  labs(title = "Scree Plot: Variance Explained by Principal Components",
       x = "Principal Component",
       y = "Proportion of Variance Explained") +
  theme_minimal()

The scree plot shows that the first three principal components explain most of the variation in the data. PC1 alone explains about 54%, PC2 adds another 23%, and PC3 adds about 10%, for a total of roughly 87% of the total variability. After the third component, the curve flattens out, meaning the remaining components don’t add much useful information. So yes, keeping three components makes sense, as it captures almost all of the important patterns in the breast cancer cell data while keeping things simple and easy to interpret.
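As a quick numeric confirmation of the roughly 87% figure, using the pca_bc fit above:

# Proportion of total variance captured by the first three PCs
sum(pca_bc$sdev[1:3]^2) / sum(pca_bc$sdev^2)
#> roughly 0.867

This matches the Cumulative Proportion entry for PC3 in the summary output.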
B.
Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.
pca_bc$rotation

                  PC1         PC2          PC3           PC4         PC5
Radius -0.3003952 0.52850910 0.27751200 -0.0449523963 0.04245937
Texture -0.1432175 0.35378530 -0.89839046 -0.0002176232 0.21581443
Smoothness -0.3482386 -0.32661945 0.12684205 0.1097614573 0.84332416
Compactness -0.4584098 -0.07219238 -0.02956419 0.1825835334 -0.23762997
Concavity -0.4508935 0.12707085 0.04245883 0.1571126948 -0.30459047
ConcavePts -0.4459288 0.22823091 0.17458320 0.0608428515 0.01923459
Symmetry -0.3240333 -0.28112508 -0.08456832 -0.8897711849 -0.11359240
FracDim -0.2251375 -0.57996072 -0.24389523 0.3640273309 -0.27912206
PC6 PC7 PC8
Radius -0.518437923 0.36152546 -0.387460874
Texture -0.006127134 0.02418201 0.004590238
Smoothness 0.079444068 -0.04732075 -0.155456892
Compactness -0.388065805 -0.73686177 0.020239147
Concavity 0.700061530 0.02347868 -0.413095816
ConcavePts 0.125314641 0.21313047 0.808318445
Symmetry -0.018262848 0.05764443 -0.023810142
FracDim -0.261064577 0.52365191 -0.026129456
round(pca_bc$rotation[, 1:3], 3)

              PC1    PC2    PC3
Radius -0.300 0.529 0.278
Texture -0.143 0.354 -0.898
Smoothness -0.348 -0.327 0.127
Compactness -0.458 -0.072 -0.030
Concavity -0.451 0.127 0.042
ConcavePts -0.446 0.228 0.175
Symmetry -0.324 -0.281 -0.085
FracDim -0.225 -0.580 -0.244
Looking at the loadings, the first three principal components describe different aspects of cell shape, and the signs matter. All eight variables load negatively on PC1, with the largest weights on Compactness, Concavity, and ConcavePts, so PC1 acts as an overall size-and-irregularity axis: cells with strongly negative PC1 scores are larger and have more irregular, concave boundaries (features typical of malignancy), while cells with positive PC1 scores are smaller, smoother, and more regular. PC2 contrasts Radius and Texture (positive loadings) against FracDim, Smoothness, and Symmetry (negative loadings): high PC2 scores describe large nuclei with variable gray-scale texture but relatively simple, low-fractal-dimension outlines, while low PC2 scores describe cells whose boundaries are comparatively rough and asymmetric for their size. PC3 is dominated by Texture (loading of about -0.90), so it mainly separates cells by gray-scale texture variation, with more negative PC3 scores indicating greater texture variability. Together, these three components summarize the size, boundary irregularity, and texture differences that distinguish benign from malignant cells.
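A picture of the loadings makes these sign patterns easier to see; a small sketch assuming the pca_bc object and the tidyverse packages loaded above:

# Reshape the first three loading vectors for a faceted bar chart
bc_loadings <- as.data.frame(pca_bc$rotation[, 1:3]) %>%
  rownames_to_column("Variable") %>%
  pivot_longer(PC1:PC3, names_to = "PC", values_to = "Loading")

ggplot(bc_loadings, aes(x = Variable, y = Loading)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ PC) +
  labs(title = "Loadings for the First Three PCs") +
  theme_minimal()

Bars pointing the same direction within a panel indicate variables that move together along that component.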
C.
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
library(factoextra)
library(ggplot2)
library(dplyr)
bc_pcs <- cbind(bc_cells, pca_bc$x[, 1:3])

fviz_pca_biplot(pca_bc, col.ind = "white") +
  geom_point(data = bc_pcs,
             aes(x = PC1, y = PC2,
                 color = Diagnosis,
                 size = PC3),
             alpha = 0.7) +
  scale_color_manual(values = c("B" = "steelblue", "M" = "firebrick")) +
  labs(
    title = "Biplot of Breast Cancer Cells (PC1 vs PC2, PC3 as Size)",
    x = "Principal Component 1 (Cell Size & Irregularity)",
    y = "Principal Component 2 (Smoothness & Symmetry)",
    size = "PC3 (Texture Detail)",
    color = "Diagnosis"
  ) +
  theme_minimal(base_size = 12)
- What characteristics distinguish malignant from benign cells?
Malignant cells are typically larger, rougher, and more uneven, showing greater texture detail and irregularity, while benign cells are smaller, smoother, and more uniform in both shape and texture. This clear separation suggests that overall cell size, surface roughness, and structural complexity are the key features that distinguish malignant from benign tumors.
- Of the 3 PCs, which does the best job of differentiating malignant from benign cells?
PC1 does the best job separating malignant from benign cells. It mainly reflects cell size and shape irregularity, so malignant cells tend to be larger and more uneven, while benign cells are smaller and smoother. In simple terms, differences in overall size and roughness are what most clearly distinguish the two groups.
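One way to check this claim visually is to compare the distribution of each PC score across the two diagnosis groups; a minimal sketch using the bc_pcs data frame built above:

# Distribution of the first three PC scores for benign (B) vs. malignant (M) cells;
# the component with the least overlap between groups separates them best
bc_pcs %>%
  pivot_longer(PC1:PC3, names_to = "PC", values_to = "Score") %>%
  ggplot(aes(x = Diagnosis, y = Score, fill = Diagnosis)) +
  geom_boxplot() +
  facet_wrap(~ PC) +
  scale_fill_manual(values = c("B" = "steelblue", "M" = "firebrick")) +
  theme_minimal()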