Activity 3.3 - PCA implementation

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

Problem 1

Consider the following 6 eigenvalues from a \(6\times 6\) correlation matrix:

\[\lambda_1 = 3.5, \lambda_2 = 1.0, \lambda_3 = 0.7, \lambda_4 = 0.4, \lambda_5 = 0.25, \lambda_6 = 0.15\]

If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?

eigenvalues <- c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)
total_variance <- sum(eigenvalues)
prop_var <- eigenvalues / total_variance

cum_var <- cumsum(prop_var)

data.frame(
  Component = 1:6,
  Eigenvalue = eigenvalues,
  Proportion = round(prop_var, 3),
  Cumulative = round(cum_var, 3)
)

  Component Eigenvalue Proportion Cumulative
1         1       3.50      0.583      0.583
2         2       1.00      0.167      0.750
3         3       0.70      0.117      0.867
4         4       0.40      0.067      0.933
5         5       0.25      0.042      0.975
6         6       0.15      0.025      1.000

We need to keep 4 principal components: the first three explain only 86.7% of the variability, while the first four explain 93.3%.
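The cutoff can also be found programmatically instead of being read off the table. This is a minimal sketch that reuses the eigenvalues from above; the names `target` and `k` are introduced here for illustration only.

```r
# Eigenvalues from the problem statement
eigenvalues <- c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)

# Cumulative proportion of variance explained
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)

# Smallest number of components whose cumulative share reaches the target
target <- 0.90
k <- which(cum_var >= target)[1]
k
```

Changing `target` to another threshold (for example 0.80) gives the matching component count without rebuilding the table.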

Problem 2

The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set contained a measurement of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:

library(tidyverse)
five_irises <- data.frame(
  row.names = 1:5,
  Sepal.Length = c(0.189, 0.551, -0.415, 0.310, -0.898),
  Sepal.Width  = c(-1.97, 0.786, 2.62, -0.590, 1.70),
  Petal.Length = c(0.137, 1.04, -1.34, 0.534, -1.05),
  Petal.Width  = c(-0.262, 1.58, -1.31, 0.000875, -1.05)
) %>% as.matrix

Consider also the loadings for the first two principal components:

# Create the data frame
pc_loadings <- data.frame(
  PC1 = c(0.5210659, -0.2693474, 0.5804131, 0.5648565),
  PC2 = c(-0.37741762, -0.92329566, -0.02449161, -0.06694199),
  row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
) %>% as.matrix
# Calculate PC scores by matrix multiplication
pc_scores <- five_irises %*% pc_loadings

# Display the results
pc_scores

         PC1        PC2
1  0.5606200  1.7617440
2  1.5715031 -1.0649071
3 -2.4396481 -2.1418936
4  0.6308802  0.4146079
5 -2.1283408 -1.1346763
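Each entry of the score matrix is a dot product between one standardized iris and one loading vector. As a small check, the sketch below recomputes the PC1 score for iris 1 by hand; `x1` and `w1` are illustrative names holding the first row of `five_irises` and the PC1 column of `pc_loadings`.

```r
# Standardized measurements for iris 1 (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
x1 <- c(0.189, -1.97, 0.137, -0.262)

# PC1 loadings in the same variable order
w1 <- c(0.5210659, -0.2693474, 0.5804131, 0.5648565)

# Score = sum of (measurement * loading) over the four variables
score_1_pc1 <- sum(x1 * w1)
round(score_1_pc1, 4)
```

The result should agree with the first entry in the PC1 column of `pc_scores`.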

A plot of the first two PC scores for these five irises is shown in the plot below.

Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.

Answers: 1b 2d 3a 4c 5e

Problem 3

These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:

Climate & Terrain
Housing
Health Care & Environment
Crime
Transportation
Education
The Arts
Recreation
Economics

For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):

Climate & Terrain: very hot and very cold months, seasonal temperature variation, heating- and cooling-degree days, freezing days, zero-degree days, ninety-degree days.
Housing: utility bills, property taxes, mortgage payments.
Health Care & Environment: per capita physicians, teaching hospitals, medical schools, cardiac rehabilitation centers, comprehensive cancer treatment centers, hospices, insurance/hospitalization costs index, fluoridation of drinking water, air pollution.
Crime: violent crime rate, property crime rate.
Transportation: daily commute, public transportation, Interstate highways, air service, passenger rail service.
Education: pupil/teacher ratio in the public K-12 system, effort index in K-12, academic options in higher education.
The Arts: museums, fine arts and public radio stations, public television stations, universities offering a degree or degrees in the arts, symphony orchestras, theatres, opera companies, dance companies, public libraries.
Recreation: good restaurants, public golf courses, certified lanes for tenpin bowling, movie theatres, zoos, aquariums, family theme parks, sanctioned automobile race tracks, pari-mutuel betting attractions, major- and minor- league professional sports teams, NCAA Division I football and basketball teams, miles of ocean or Great Lakes coastline, inland water, national forests, national parks, or national wildlife refuges, Consolidated Metropolitan Statistical Area access.
Economics: average household income adjusted for taxes and living costs, income growth, job growth.

In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.

Use PCA to identify the major components of variation in the ratings among cities.

places <- read.csv('Data/Places.csv')
head(places)

                       City Climate Housing HlthCare Crime Transp Educ Arts
1                 AbileneTX     521    6200      237   923   4031 2757  996
2                   AkronOH     575    8138     1656   886   4883 2438 5564
3                  AlbanyGA     468    7339      618   970   2531 2560  237
4 Albany-Schenectady-TroyNY     476    7908     1431   610   6883 3399 4655
5             AlbuquerqueNM     659    8393     1853  1483   6558 3026 4496
6              AlexandriaLA     520    5819      640   727   2444 2972  334
  Recreat Econ      Long     Lat    Pop
1    1405 7633  -99.6890 32.5590 110932
2    2632 4350  -81.5180 41.0850 660328
3     859 5250  -84.1580 31.5750 112402
4    1617 5864  -73.7983 42.7327 835880
5    2612 5727 -106.6500 35.0830 419700
6    1018 5254  -92.4530 31.3020 135282

A.

If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.

exclude_cols <- c("ID", "City", "State", "Population", "Latitude", "Longitude", "id", "city", "state", "pop", "lat", "long")
rating_cols <- setdiff(colnames(places), exclude_cols)

pca_results <- prcomp(places[, rating_cols], scale. = TRUE)

variance_explained <- pca_results$sdev^2
prop_variance <- variance_explained / sum(variance_explained)
cumulative_variance <- cumsum(prop_variance)

scree_data <- data.frame(
  PC = 1:length(prop_variance),
  Variance = prop_variance,
  Cumulative = cumulative_variance
)

ggplot(scree_data, aes(x = PC, y = Variance)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "blue", size = 3) +
  geom_hline(yintercept = 1/length(rating_cols), linetype = "dashed", color = "red") +
  scale_x_continuous(breaks = 1:length(rating_cols)) +
  labs(title = "Scree Plot for Places Rated PCA",
       x = "Principal Component",
       y = "Proportion of Variance Explained") +
  theme_minimal()

print(scree_data)

   PC    Variance Cumulative
1   1 0.341293088  0.3412931
2   2 0.151262047  0.4925551
3   3 0.126326675  0.6188818
4   4 0.086263882  0.7051457
5   5 0.078754984  0.7839007
6   6 0.058476911  0.8423776
7   7 0.048180992  0.8905586
8   8 0.036616349  0.9271749
9   9 0.029881747  0.9570567
10 10 0.024745215  0.9818019
11 11 0.010601412  0.9924033
12 12 0.007596698  1.0000000

k <- sum(cumulative_variance < 0.80) + 1
cat("\nAnswer: Retain the first", k, 
    "PCs, which explains", 
    round(cumulative_variance[k] * 100, 2), 
    "% of total variability\n")


Answer: Retain the first 6 PCs, which explains 84.24 % of total variability
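The `head(places)` output lists the location and population columns as `Long`, `Lat`, and `Pop`, and `setdiff()` matches names case-sensitively, which is why the scree table above has 12 components rather than 9. The sketch below illustrates the effect on a small made-up data frame; `toy`, `not_dropped`, and `rating_cols_alt` are hypothetical names, and the values are placeholders rather than the real data.

```r
# Toy stand-in with the same style of column names as places (values are placeholders)
toy <- data.frame(
  City    = c("A", "B"),
  Climate = c(521, 575),
  Housing = c(6200, 8138),
  Long    = c(-99.7, -81.5),
  Lat     = c(32.6, 41.1),
  Pop     = c(110932, 660328)
)

# Lowercase names do not match, so nothing is removed
not_dropped <- setdiff(colnames(toy), c("city", "long", "lat", "pop"))

# Exact, case-matched names remove the non-rating columns
rating_cols_alt <- setdiff(colnames(toy), c("City", "Long", "Lat", "Pop"))
rating_cols_alt
```

Applied to the real `places` data, an exclusion list of `c("City", "Long", "Lat", "Pop")` would presumably leave the nine rating criteria, and the scree table and loadings would need to be regenerated from that fit.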

B.

Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!

loadings <- pca_results$rotation

print(round(loadings[, 1:3], 3))

            PC1    PC2    PC3
Climate   0.179  0.157 -0.388
Housing   0.298  0.080 -0.250
HlthCare  0.437 -0.197  0.086
Crime     0.252  0.427  0.173
Transp    0.306 -0.096 -0.082
Educ      0.245 -0.323  0.294
Arts      0.446 -0.096  0.037
Recreat   0.281  0.254 -0.187
Econ      0.098  0.353  0.412
Long     -0.011 -0.437  0.480
Lat       0.023 -0.492 -0.466
Pop       0.428 -0.051  0.064

loading_data <- as.data.frame(loadings[, 1:2])
loading_data$Variable <- rownames(loading_data)

ggplot(loading_data, aes(x = PC1, y = PC2)) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "blue", linewidth = 1) +
  geom_text(aes(label = Variable), vjust = -0.5, size = 3.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Loading Plot: PC1 vs PC2",
       x = paste0("PC1 (", round(prop_variance[1]*100, 1), "%)"),
       y = paste0("PC2 (", round(prop_variance[2]*100, 1), "%)")) +
  theme_minimal() +
  coord_fixed()

pc1_data <- data.frame(
  Variable = rownames(loadings),
  Loading = loadings[, 1]
) %>% arrange(desc(abs(Loading)))

ggplot(pc1_data, aes(x = reorder(Variable, Loading), y = Loading)) +
  geom_col(aes(fill = Loading > 0)) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "coral")) +
  labs(title = "PC1 Loadings",
       x = "Variable",
       y = "Loading") +
  theme_minimal() +
  theme(legend.position = "none")

pc2_data <- data.frame(
  Variable = rownames(loadings),
  Loading = loadings[, 2]
) %>% arrange(desc(abs(Loading)))

ggplot(pc2_data, aes(x = reorder(Variable, Loading), y = Loading)) +
  geom_col(aes(fill = Loading > 0)) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "coral")) +
  labs(title = "PC2 Loadings",
       x = "Variable",
       y = "Loading") +
  theme_minimal() +
  theme(legend.position = "none")

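PC1 has positive loadings on every rating, with the largest on Arts (0.446), HlthCare (0.437), and Pop (0.428), followed by Transp, Housing, and Recreat; Long and Lat are close to zero. Cities with high PC1 scores are therefore large cities that score high across the board, which also means high Housing and Crime scores, where lower is better. PC2 contrasts Crime (0.427), Econ (0.353), and Recreat (0.254) on the positive side with Lat (-0.492), Long (-0.437), and Educ (-0.323) on the negative side. Cities with high PC2 scores tend to have higher crime and stronger economies and sit farther south and west, while cities with low PC2 scores sit farther north and east and rate better on education and health care. Pop, Long, and Lat appear in these loadings because the exclusion list in part A uses lowercase names that do not match the column names.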
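To discuss signs as well as magnitudes, it can help to rank the loadings by absolute value and then split them by sign. The sketch below does this for the PC2 column, with the values copied from the rounded loadings table above; `pc2`, `ranked`, `positive`, and `negative` are illustrative names.

```r
# PC2 loadings copied from the rounded table above
pc2 <- c(Climate = 0.157, Housing = 0.080, HlthCare = -0.197, Crime = 0.427,
         Transp = -0.096, Educ = -0.323, Arts = -0.096, Recreat = 0.254,
         Econ = 0.353, Long = -0.437, Lat = -0.492, Pop = -0.051)

# Rank by magnitude, keeping the sign
ranked <- pc2[order(abs(pc2), decreasing = TRUE)]

# Variables pulling a city toward high PC2 scores vs. low PC2 scores
positive <- names(ranked[ranked > 0])
negative <- names(ranked[ranked < 0])

positive
negative
```

Reading the two lists side by side makes the contrast explicit: variables in `positive` push a city up the PC2 axis, and variables in `negative` push it down.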

C.

Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?

library(ggrepel)

places$PC1 <- pca_results$x[, 1]
places$PC2 <- pca_results$x[, 2]
loadings_plot <- as.data.frame(pca_results$rotation[, 1:2])
loadings_plot$Variable <- rownames(loadings_plot)

scale_factor <- max(abs(places$PC1), abs(places$PC2)) * 0.8
loadings_plot$PC1_scaled <- loadings_plot$PC1 * scale_factor
loadings_plot$PC2_scaled <- loadings_plot$PC2 * scale_factor

ggplot() +
  geom_point(data = places, 
             aes(x = PC1, y = PC2),
             alpha = 0.6, 
             color = "darkblue",
             size = 2) +
  geom_text_repel(data = places,
                  aes(x = PC1, y = PC2, label = City), 
                  size = 2.5, 
                  max.overlaps = 15,
                  color = "darkblue") +
  geom_segment(data = loadings_plot,
               aes(x = 0, y = 0, xend = PC1_scaled, yend = PC2_scaled),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "red", 
               linewidth = 1,
               alpha = 0.7) +
  geom_text_repel(data = loadings_plot,
                  aes(x = PC1_scaled, y = PC2_scaled, label = Variable),
                  color = "red",
                  size = 3.5,
                  fontface = "bold",
                  box.padding = 0.5,
                  point.padding = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Biplot of Cities: PC1 vs PC2 with Variable Loadings",
       subtitle = "Red arrows show variable contributions",
       x = paste0("PC1 (", round(prop_variance[1]*100, 1), "%)"),
       y = paste0("PC2 (", round(prop_variance[2]*100, 1), "%)")) +
  theme_minimal()

Warning: ggrepel: 307 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

A few outliers are visible on the plot of PC1 against PC2. New York is very high in Arts and Pop, both of which load heavily on PC1, while Long Beach is very high in Housing, where a higher score is less favorable.
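With several hundred cities, most labels are dropped by `ggrepel`, so one option is to flag outliers numerically before labeling. The sketch below ranks points by their distance from the origin in the PC1-PC2 plane; the data frame `scores` and its values are hypothetical placeholders, not the actual city scores.

```r
# Hypothetical PC scores for five cities (placeholder values)
scores <- data.frame(
  City = c("A", "B", "C", "D", "E"),
  PC1  = c(0.2, -0.5, 6.1, 0.9, -0.3),
  PC2  = c(0.1, 0.4, -1.2, 3.8, -0.2),
  stringsAsFactors = FALSE
)

# Euclidean distance from the origin of the PC1-PC2 plane
scores$dist <- sqrt(scores$PC1^2 + scores$PC2^2)

# Keep the two most extreme points as candidate outliers
top <- scores[order(scores$dist, decreasing = TRUE)[1:2], ]
top$City
```

With the real data, the same idea could be applied to `places$PC1` and `places$PC2`, passing only the flagged rows to `geom_text_repel()` so that the outlying cities are always labeled.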

Problem 4

The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was to determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.

The variables in the data file you will be using are:

ID - patient identification number (not used in PCA)
Diagnosis determined by biopsy - B = benign or M = malignant
Radius: mean of distances from center to points on the perimeter
Texture: standard deviation of gray-scale values
Smoothness: local variation in radius lengths
Compactness: perimeter^2 / area - 1.0
Concavity: severity of concave portions of the contour
ConcavePts: number of concave portions of the contour
Symmetry: measure of symmetry of the cell nucleus
FracDim: fractal dimension; “coastline approximation” - 1

bc_cells <- read.csv('Data/BreastDiag.csv')
head(bc_cells)

  Diagnosis Radius Texture Smoothness Compactness Concavity ConcavePts Symmetry
1         M  17.99   10.38    0.11840     0.27760    0.3001    0.14710   0.2419
2         M  20.57   17.77    0.08474     0.07864    0.0869    0.07017   0.1812
3         M  19.69   21.25    0.10960     0.15990    0.1974    0.12790   0.2069
4         M  11.42   20.38    0.14250     0.28390    0.2414    0.10520   0.2597
5         M  20.29   14.34    0.10030     0.13280    0.1980    0.10430   0.1809
6         M  12.45   15.70    0.12780     0.17000    0.1578    0.08089   0.2087
  FracDim
1 0.07871
2 0.05667
3 0.05999
4 0.09744
5 0.05883
6 0.07613

A.

My analysis suggests 3 PCs should be retained. Support or refute this suggestion. What percent of variability is explained by the first 3 PCs?

exclude_cols <- c("ID", "Diagnosis", "id", "diagnosis")
pca_cols <- setdiff(colnames(bc_cells), exclude_cols)

print(pca_cols)

[1] "Radius"      "Texture"     "Smoothness"  "Compactness" "Concavity"  
[6] "ConcavePts"  "Symmetry"    "FracDim"

pca_results <- prcomp(bc_cells[, pca_cols], scale. = TRUE)

variance_explained <- pca_results$sdev^2
prop_variance <- variance_explained / sum(variance_explained)
cumulative_variance <- cumsum(prop_variance)

scree_data <- data.frame(
  PC = 1:length(prop_variance),
  Variance = prop_variance,
  Cumulative = cumulative_variance
)

ggplot(scree_data, aes(x = PC, y = Variance)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "blue", size = 3) +
  geom_hline(yintercept = 1/length(pca_cols), linetype = "dashed", color = "red", alpha = 0.5) +
  scale_x_continuous(breaks = 1:length(pca_cols)) +
  labs(title = "Scree Plot for Breast Cancer Cell PCA",
       subtitle = "Red line shows average variance (Kaiser criterion)",
       x = "Principal Component",
       y = "Proportion of Variance Explained") +
  theme_minimal()

ggplot(scree_data, aes(x = PC, y = Cumulative)) +
  geom_line(color = "darkgreen", linewidth = 1) +
  geom_point(color = "darkgreen", size = 3) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "red", alpha = 0.5) +
  geom_hline(yintercept = 0.90, linetype = "dashed", color = "orange", alpha = 0.5) +
  scale_x_continuous(breaks = 1:length(pca_cols)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Cumulative Variance Explained",
       x = "Principal Component",
       y = "Cumulative Proportion of Variance") +
  theme_minimal()

print(round(scree_data, 4))

  PC Variance Cumulative
1  1   0.5359     0.5359
2  2   0.2279     0.7638
3  3   0.1032     0.8670
4  4   0.0623     0.9294
5  5   0.0465     0.9759
6  6   0.0115     0.9874
7  7   0.0086     0.9960
8  8   0.0040     1.0000

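The first 3 PCs retain 86.7% of the total variance, which supports the suggestion: two PCs explain only 76.4%, short of the 80% reference line, and the fourth PC adds just 6.2%. The average-variance (Kaiser) line alone would keep only 2, since PC3's share (10.3%) falls below 1/8 = 12.5%.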
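The two retention rules drawn on the plots can be compared directly. This sketch uses the rounded proportions printed above, so the eigenvalues are approximate; `prop`, `eig`, `kaiser_k`, and `k80` are illustrative names.

```r
# Proportions of variance copied from the rounded table above
prop <- c(0.5359, 0.2279, 0.1032, 0.0623, 0.0465, 0.0115, 0.0086, 0.0040)
p <- length(prop)

# For a correlation-matrix PCA the eigenvalues sum to p,
# so each eigenvalue is approximately proportion * p
eig <- prop * p

# Kaiser rule: keep components with eigenvalue above 1 (above-average variance)
kaiser_k <- sum(eig > 1)

# Cumulative rule: smallest k reaching 80% of the variance
k80 <- which(cumsum(prop) >= 0.80)[1]

c(kaiser = kaiser_k, cumulative_80 = k80)
```

The rules disagree by one component here, which is why the choice of 3 rests on the cumulative-variance argument rather than on the Kaiser line.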

B.

Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.

loadings <- pca_results$rotation[, 1:3]

print(round(loadings, 3))

               PC1    PC2    PC3
Radius      -0.300  0.529  0.278
Texture     -0.143  0.354 -0.898
Smoothness  -0.348 -0.327  0.127
Compactness -0.458 -0.072 -0.030
Concavity   -0.451  0.127  0.042
ConcavePts  -0.446  0.228  0.175
Symmetry    -0.324 -0.281 -0.085
FracDim     -0.225 -0.580 -0.244

for(i in 1:3) {
  loading_data <- data.frame(
    Variable = rownames(loadings),
    Loading = loadings[, i]
  ) %>% arrange(desc(abs(Loading)))
  
  p <- ggplot(loading_data, aes(x = reorder(Variable, Loading), y = Loading)) +
    geom_col(aes(fill = Loading > 0)) +
    coord_flip() +
    scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "coral")) +
    labs(title = paste0("PC", i, " Loadings (", round(prop_variance[i]*100, 1), "% variance)"),
         x = "Variable",
         y = "Loading") +
    theme_minimal() +
    theme(legend.position = "none")
  
  print(p)
}

loading_data_12 <- as.data.frame(loadings[, 1:2])
loading_data_12$Variable <- rownames(loading_data_12)

ggplot(loading_data_12, aes(x = PC1, y = PC2)) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "blue", linewidth = 1) +
  geom_text(aes(label = Variable), vjust = -0.5, hjust = 0.5, size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50") +
  labs(title = "Loading Plot: PC1 vs PC2",
       x = paste0("PC1 (", round(prop_variance[1]*100, 1), "%)"),
       y = paste0("PC2 (", round(prop_variance[2]*100, 1), "%)")) +
  theme_minimal() +
  coord_fixed()

cat("Variables with highest positive loadings:", 
    paste(names(sort(loadings[,1], decreasing=TRUE)[1:3]), collapse=", "), "\n")

Variables with highest positive loadings: Texture, FracDim, Radius

cat("Variables with highest negative loadings:", 
    paste(names(sort(loadings[,1], decreasing=FALSE)[1:3]), collapse=", "), "\n")

Variables with highest negative loadings: Compactness, Concavity, ConcavePts

cat("\nPC2 Interpretation:\n")


PC2 Interpretation:

cat("Variables with highest positive loadings:", 
    paste(names(sort(loadings[,2], decreasing=TRUE)[1:3]), collapse=", "), "\n")

Variables with highest positive loadings: Radius, Texture, ConcavePts

cat("Variables with highest negative loadings:", 
    paste(names(sort(loadings[,2], decreasing=FALSE)[1:3]), collapse=", "), "\n")

Variables with highest negative loadings: FracDim, Smoothness, Symmetry

cat("\nPC3 Interpretation:\n")


PC3 Interpretation:

cat("Variables with highest positive loadings:", 
    paste(names(sort(loadings[,3], decreasing=TRUE)[1:3]), collapse=", "), "\n")

Variables with highest positive loadings: Radius, ConcavePts, Smoothness

cat("Variables with highest negative loadings:", 
    paste(names(sort(loadings[,3], decreasing=FALSE)[1:3]), collapse=", "), "\n")

Variables with highest negative loadings: Texture, FracDim, Symmetry
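Every PC1 loading in the table above is negative, so the "highest positive" list printed for PC1 actually holds the three least negative loadings. The sign of a principal component is arbitrary: flipping a loading vector flips the scores but leaves the variance explained unchanged. The sketch below demonstrates this on R's built-in `iris` measurements, which stand in for the cell data here as an assumption for illustration.

```r
# Built-in iris measurements stand in for the cell data (illustration only)
X <- scale(iris[, 1:4])                      # center and scale, as prcomp(scale. = TRUE) does
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# First loading vector and its sign-flipped version
v1 <- pca$rotation[, 1]
scores_original <- as.numeric(X %*% v1)
scores_flipped  <- as.numeric(X %*% (-v1))

# The scores are mirrored, but the variance along the axis is identical
var_original <- var(scores_original)
var_flipped  <- var(scores_flipped)
c(var_original, var_flipped, pca$sdev[1]^2)
```

Because of this, interpretation should focus on which variables share a sign and which oppose each other within a component, not on whether the signs happen to be positive or negative.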

C.

Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:

What characteristics distinguish malignant from benign cells?
Of the 3 PCs, which does the best job of differentiating malignant from benign cells?

library(factoextra)

library(ggrepel)

bc_cells$PC1 <- pca_results$x[, 1]
bc_cells$PC2 <- pca_results$x[, 2]
bc_cells$PC3 <- pca_results$x[, 3]

loadings_plot <- as.data.frame(pca_results$rotation[, 1:2])
loadings_plot$Variable <- rownames(loadings_plot)

scale_factor <- 5  
loadings_plot$PC1_scaled <- loadings_plot$PC1 * scale_factor
loadings_plot$PC2_scaled <- loadings_plot$PC2 * scale_factor

bc_cells$PC3_size <- abs(bc_cells$PC3)

ggplot() +
  geom_point(data = bc_cells, 
             aes(x = PC1, y = PC2, color = Diagnosis, size = PC3_size),
             alpha = 0.6) +
  geom_segment(data = loadings_plot,
               aes(x = 0, y = 0, xend = PC1_scaled, yend = PC2_scaled),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "blue", 
               linewidth = 0.8,
               alpha = 0.7) +
  geom_text_repel(data = loadings_plot,
                  aes(x = PC1_scaled, y = PC2_scaled, label = Variable),
                  color = "blue",
                  size = 3.5,
                  fontface = "bold",
                  box.padding = 0.5,
                  point.padding = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50", alpha = 0.5) +
  scale_color_manual(values = c("B" = "green3", "M" = "red2"),
                     labels = c("B" = "Benign", "M" = "Malignant")) +
  scale_size_continuous(name = "|PC3|", range = c(1, 5)) +
  labs(title = "Biplot: Breast Cancer Cells with Variable Loadings",
       subtitle = "Points colored by diagnosis and sized by |PC3|",
       x = paste0("PC1 (", round(prop_variance[1]*100, 1), "%)"),
       y = paste0("PC2 (", round(prop_variance[2]*100, 1), "%)"),
       color = "Diagnosis") +
  theme_minimal() +
  theme(legend.position = "right")

for(i in 1:3) {
  cat(paste0("\nPC", i, " by Diagnosis:\n"))
  pc_summary <- bc_cells %>%
    group_by(Diagnosis) %>%
    summarise(
      Mean = mean(get(paste0("PC", i))),
      SD = sd(get(paste0("PC", i))),
      Min = min(get(paste0("PC", i))),
      Max = max(get(paste0("PC", i)))
    )
  print(pc_summary)
}


PC1 by Diagnosis:
# A tibble: 2 × 5
  Diagnosis  Mean    SD   Min   Max
  <chr>     <dbl> <dbl> <dbl> <dbl>
1 B          1.11  1.16 -4.24  3.43
2 M         -1.87  1.91 -8.52  1.81

PC2 by Diagnosis:
# A tibble: 2 × 5
  Diagnosis   Mean    SD   Min   Max
  <chr>      <dbl> <dbl> <dbl> <dbl>
1 B         -0.426  1.14 -6.17  1.97
2 M          0.717  1.38 -4.71  3.59

PC3 by Diagnosis:
# A tibble: 2 × 5
  Diagnosis     Mean    SD   Min   Max
  <chr>        <dbl> <dbl> <dbl> <dbl>
1 B          0.00172 0.915 -3.48  1.95
2 M         -0.00289 0.901 -3.60  2.34

pc_long <- bc_cells %>%
  select(Diagnosis, PC1, PC2, PC3) %>%
  pivot_longer(cols = c(PC1, PC2, PC3), names_to = "PC", values_to = "Score")

ggplot(pc_long, aes(x = Diagnosis, y = Score, fill = Diagnosis)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~PC, scales = "free_y") +
  scale_fill_manual(values = c("B" = "green3", "M" = "red2"),
                    labels = c("B" = "Benign", "M" = "Malignant")) +
  labs(title = "Distribution of PC Scores by Diagnosis",
       x = "Diagnosis",
       y = "PC Score") +
  theme_minimal()

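for(i in 1:3) {
  pc_col <- paste0("PC", i)
  benign <- bc_cells[bc_cells$Diagnosis == "B", pc_col]
  malignant <- bc_cells[bc_cells$Diagnosis == "M", pc_col]
  
  # Standardized mean difference between groups (equal-weight pooled SD)
  mean_diff <- abs(mean(malignant) - mean(benign))
  pooled_sd <- sqrt((var(benign) + var(malignant)) / 2)
  effect_size <- mean_diff / pooled_sd
  
  # Print the result so the loop produces visible output
  cat(pc_col, "effect size:", round(effect_size, 2), "\n")
}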
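The standardized difference computed in the loop can also be approximated from the group summaries printed above. This is a rough sketch using the rounded means and standard deviations from those tibbles, so the values are approximate; `summ` and `d` are illustrative names.

```r
# Group means and SDs copied from the printed summaries above
summ <- data.frame(
  PC     = c("PC1", "PC2", "PC3"),
  mean_B = c(1.11, -0.426, 0.00172),
  mean_M = c(-1.87, 0.717, -0.00289),
  sd_B   = c(1.16, 1.14, 0.915),
  sd_M   = c(1.91, 1.38, 0.901)
)

# Absolute mean difference divided by an equal-weight pooled SD
d <- abs(summ$mean_M - summ$mean_B) / sqrt((summ$sd_B^2 + summ$sd_M^2) / 2)
names(d) <- summ$PC

round(d, 2)
names(which.max(d))   # component with the largest standardized separation
```

A larger value means the two diagnosis groups sit farther apart relative to their spread, which gives a numeric companion to the boxplots.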

Answer: Malignant cells score high on all of these size and shape measurements, while benign cells score low. Every PC1 loading is negative and the malignant group has the lower mean PC1 score (-1.87 versus 1.11), so benign cells are smaller and more regular, and malignant cells are larger and misshapen.

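PC1 does the best job of differentiating malignant from benign cells: the group means differ by about 3 units on PC1 (1.11 for benign versus -1.87 for malignant), by about 1.1 on PC2 (-0.426 versus 0.717), and by essentially nothing on PC3 (0.002 versus -0.003). PC2 adds some separation, with malignant cells tending toward larger Radius and Texture and lower FracDim, but the groups overlap more. The sign of a loading is arbitrary, so a component with mostly negative loadings does not describe benign cells in particular. This is reinforced by the boxplots.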