3.5+1.0+0.7+0.4+0.25+0.15[1] 6
3.5/6 + 1/6 +0.7/6 + 0.4/6[1] 0.9333333
SUBMISSION INSTRUCTIONS
Consider the following 6 eigenvalues from a \(6\times 6\) correlation matrix:
\[\lambda_1 = 3.5, \lambda_2 = 1.0, \lambda_3 = 0.7, \lambda_4 = 0.4, \lambda_5 = 0.25, \lambda_6 = 0.15\]
If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
3.5+1.0+0.7+0.4+0.25+0.15[1] 6
3.5/6 + 1/6 +0.7/6 + 0.4/6[1] 0.9333333
You should keep 4 of the values as at that point you reach a value greater than 90%.
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set contained a measurement of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
library(tidyverse)
five_irises <- data.frame(
row.names = 1:5,
Sepal.Length = c(0.189, 0.551, -0.415, 0.310, -0.898),
Sepal.Width = c(-1.97, 0.786, 2.62, -0.590, 1.70),
Petal.Length = c(0.137, 1.04, -1.34, 0.534, -1.05),
Petal.Width = c(-0.262, 1.58, -1.31, 0.000875, -1.05)
) %>% as.matrixConsider also the loadings for the first two principal components:
# Create the data frame
pc_loadings <- data.frame(
PC1 = c(0.5210659, -0.2693474, 0.5804131, 0.5648565),
PC2 = c(-0.37741762, -0.92329566, -0.02449161, -0.06694199),
row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
) %>% as.matrixA plot of the first two PC scores for these five irises is shown in the plot below.
Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.
five_irises Sepal.Length Sepal.Width Petal.Length Petal.Width
1 0.189 -1.970 0.137 -0.262000
2 0.551 0.786 1.040 1.580000
3 -0.415 2.620 -1.340 -1.310000
4 0.310 -0.590 0.534 0.000875
5 -0.898 1.700 -1.050 -1.050000
pc_loadings PC1 PC2
Sepal.Length 0.5210659 -0.37741762
Sepal.Width -0.2693474 -0.92329566
Petal.Length 0.5804131 -0.02449161
Petal.Width 0.5648565 -0.06694199
Iris1PC1 <- .189*.521 + -1.97 * -.269 + .137 * .580 + -.262 * .565
Iris1PC2 <- .189*-.377 + -1.97 * -.923 + .137 * -0.024 + -.262 * -0.067
Iris2PC1 <- .551*.521 + .786 * -.269 + 1.040 * .580 + 1.58 * .565
Iris2PC2 <- .551*-.377 + .786 * -.923 + 1.040 * -0.024 + 1.58 * -0.067
Iris3PC1 <- -0.415*.521 + 2.620 * -.269 + -1.340 * .580 + -1.31 * .565
Iris3PC2 <- -0.415*-.377 + 2.620 * -.923 + -1.340 * -0.024 + -1.31 * -0.067
Iris4PC1 <- 0.310*.521 + -.590 * -.269 + .534 * .580 + .000875 * .565
Iris4PC2 <- 0.310*-.377 + -.590 * -.923 + .534 * -0.024 + .000875 * -0.067
Iris5PC1 <- -0.898*.521 + 1.7 * -.269 + -1.050 * .580 + -1.05 * .565
Iris5PC2 <- -0.898*-.377 + 1.7 * -.923 + -1.050 * -0.024 + -1.05 * -0.067Iris1: B Iris2: D Iris3: A Iris4: C Iris5: E
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
places <- read.csv('Data/Places.csv')
head(places) City Climate Housing HlthCare Crime Transp Educ Arts
1 AbileneTX 521 6200 237 923 4031 2757 996
2 AkronOH 575 8138 1656 886 4883 2438 5564
3 AlbanyGA 468 7339 618 970 2531 2560 237
4 Albany-Schenectady-TroyNY 476 7908 1431 610 6883 3399 4655
5 AlbuquerqueNM 659 8393 1853 1483 6558 3026 4496
6 AlexandriaLA 520 5819 640 727 2444 2972 334
Recreat Econ Long Lat Pop
1 1405 7633 -99.6890 32.5590 110932
2 2632 4350 -81.5180 41.0850 660328
3 859 5250 -84.1580 31.5750 112402
4 1617 5864 -73.7983 42.7327 835880
5 2612 5727 -106.6500 35.0830 419700
6 1018 5254 -92.4530 31.3020 135282
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
library(GGally)
library(factoextra)Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
places_numeric <- (places
%>% select(-City,-Long,-Lat,-Pop)
)
places_pca <- prcomp(places_numeric, scale. = TRUE)
fviz_eig(places_pca) I would use the first 3 dimensions as after trhat they seem to not have as sharp of a decline. This would explain approximately 65-70% of the total variability.
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
fviz_contrib(places_pca, choice = 'var', axes = 1) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to first dimension')sort(places_pca$rotation[,1], decreasing=TRUE) Arts HlthCare Housing Transp Recreat Crime Educ Climate
0.4630545 0.4602146 0.3565216 0.3511508 0.3278879 0.2812984 0.2752926 0.2064140
Econ
0.1354123
Variables that would be utilized to seperate cities along the first principal axes would be Arts, Health Care, Housing, and Transportation. We would expect to have higher values in Arts, Health Care, Transportation and lower values in Housing due to the postive values of these loadings.
fviz_contrib(places_pca, choice = 'var', axes = 2) +
theme_classic(base_size = 16)+
labs(x = 'Variable',
title = 'Contribution to second dimension')sort(places_pca$rotation[,2], decreasing=TRUE) Econ Recreat Crime Housing Climate Transp Arts
0.4712833 0.3844746 0.3553423 0.2506240 0.2178353 -0.1796045 -0.1947899
HlthCare Educ
-0.2994653 -0.4833821
Variables that would be utilized to separate cities along the second principal axes would be Education Econ, Recreation, and Crime. We would expect higher values in Econ, Recreation, and lower values in Crime due to the positive signs on the loadings. We would expect low values though in education due to the negative sign associated with the loading.
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
places$PC1 <- places_pca$x[,1]
places$PC2 <- places_pca$x[,2]
library(ggplot2)
library(ggrepel)
var_loadings <- as.data.frame(places_pca$rotation[, 1:2]) * 5
var_loadings$Variable <- rownames(places_pca$rotation)
ggplot() +
# Variable arrows
geom_segment(
data = var_loadings,
aes(x = 0, y = 0, xend = PC1, yend = PC2),
arrow = arrow(length = unit(0.2, "cm")),
color = "firebrick",
size = 0.8
) +
geom_text_repel(
data = var_loadings,
aes(x = PC1, y = PC2, label = Variable),
color = "firebrick",
size = 3
) +
# City points (transparent)
geom_point(
data = places,
aes(x = PC1, y = PC2),
color = "steelblue",
alpha = 0.6, # <-- transparency (0 = invisible, 1 = solid)
size = 2
) +
geom_text_repel(
data = places,
aes(x = PC1, y = PC2, label = City),
size = 3
) +
theme_classic(base_size = 14) +
labs(
title = "Biplot of Cities on the First Two Principal Components",
x = "Principal Component 1",
y = "Principal Component 2"
)Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: ggrepel: 305 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
From our Biplot we can see that New York is our largest outlier as it is located in the bottom right of the graph indicating high PC1 value and low PC2 value which means we would expect scores with high economic, high education, high arts, and low housing scores. Similarily in this area we see Boston, Washington DC, Philidelphia, and Chicago and would expect similiar traits. Other outliers we have are San Francisco and Los Angeles/Long Beach California which is located in the middle right which indicates High PC1 value and a near 0 PC2 value meaning we would expect high arts, low housing, and average scores in education and econ scores.
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
bc_cells <- read.csv('Data/BreastDiag.csv')
head(bc_cells) Diagnosis Radius Texture Smoothness Compactness Concavity ConcavePts Symmetry
1 M 17.99 10.38 0.11840 0.27760 0.3001 0.14710 0.2419
2 M 20.57 17.77 0.08474 0.07864 0.0869 0.07017 0.1812
3 M 19.69 21.25 0.10960 0.15990 0.1974 0.12790 0.2069
4 M 11.42 20.38 0.14250 0.28390 0.2414 0.10520 0.2597
5 M 20.29 14.34 0.10030 0.13280 0.1980 0.10430 0.1809
6 M 12.45 15.70 0.12780 0.17000 0.1578 0.08089 0.2087
FracDim
1 0.07871
2 0.05667
3 0.05999
4 0.09744
5 0.05883
6 0.07613
My analysis suggests 3 PCs should be retained. Support or refute this suggestion. What percent of variability is explained by the first 3 PCs?
cells_numeric <- (bc_cells
%>% select(-Diagnosis)
)
cells_pca <- prcomp(cells_numeric, scale. = TRUE)
summary(cells_pca)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0705 1.3504 0.9087 0.70614 0.61016 0.30355 0.2623
Proportion of Variance 0.5359 0.2279 0.1032 0.06233 0.04654 0.01152 0.0086
Cumulative Proportion 0.5359 0.7638 0.8670 0.92937 0.97591 0.98743 0.9960
PC8
Standard deviation 0.17837
Proportion of Variance 0.00398
Cumulative Proportion 1.00000
There is 86.7% of the total variablity explained by the first 3 PCs. Due to this I would support this suggestion that is a good variability value to keep at.
Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.
fviz_contrib(cells_pca, choice = 'var', axes = 1) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to first dimension')sort(cells_pca$rotation[,1], decreasing=TRUE) Texture FracDim Radius Symmetry Smoothness ConcavePts
-0.1432175 -0.2251375 -0.3003952 -0.3240333 -0.3482386 -0.4459288
Concavity Compactness
-0.4508935 -0.4584098
The main driving forces for principal component 1 is Compactness, Concavity, and ConcavePts. These values are all negative though so this means we would expect low values for each of these variables when a cell is high in PC1.
fviz_contrib(cells_pca, choice = 'var', axes = 2) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to second dimension')sort(cells_pca$rotation[,2], decreasing=TRUE) Radius Texture ConcavePts Concavity Compactness Symmetry
0.52850910 0.35378530 0.22823091 0.12707085 -0.07219238 -0.28112508
Smoothness FracDim
-0.32661945 -0.57996072
The main driving forces for principal component 2 is Fractal Dimension, Radius, and Texture. Fractal Dimension is negative so this means we would expect a low value for this variable when a cell is high in PC2. The other two variables are positive so we would expect greater values for each of these variables in Radius and Texture for a cell that has a high value of PC2.
fviz_contrib(cells_pca, choice = 'var', axes = 3) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to third dimension')sort(cells_pca$rotation[,3], decreasing=TRUE) Radius ConcavePts Smoothness Concavity Compactness Symmetry
0.27751200 0.17458320 0.12684205 0.04245883 -0.02956419 -0.08456832
FracDim Texture
-0.24389523 -0.89839046
The main driving force for principal component 3 is Texture. This is a large negative variable so we would expect low values for this variable when cells are high in PC3.
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
library(factoextra)
library(ggplot2)
library(dplyr)
pca_coords <- as.data.frame(cells_pca$x[, 1:3])
pca_coords$Diagnosis <- bc_cells$Diagnosis
pca_biplot_base <- fviz_pca_biplot(
cells_pca,
col.ind = "white",
repel = TRUE
)
ggplot(pca_coords, aes(x = PC1, y = PC2)) +
geom_point(aes(size = PC3, color = Diagnosis), alpha = 0.2) +
scale_size_continuous(name = "PC3") +
scale_color_manual(values = c("B" = "blue", "M" = "red")) +
geom_segment(
data = as.data.frame(cells_pca$rotation[, 1:2]),
aes(x = 0, y = 0, xend = PC1*5, yend = PC2*5),
arrow = arrow(length = unit(0.3, "cm")),
color = "black"
) +
geom_text(
data = as.data.frame(cells_pca$rotation[, 1:2]),
aes(x = PC1*5.3, y = PC2*5.3, label = rownames(cells_pca$rotation)),
size = 4
) +
theme_minimal(base_size = 14) +
labs(
title = "PCA Biplot (PC1 vs PC2, size = PC3)",
x = "PC1",
y = "PC2"
)The characteristics that distinguish between malignant and benign cells appears to be a mixture of PC1 and PC2 variables as they have a diagonal split between them. The PC that does the best at differentiating between the benign and malignant cells would be the PC1 as we have a near vertical line split between our high concentration points of benign and malignant cells and seem to notice more blue on the right side of the graph and more red on the left side of the graph. Malignant cells appear to have high Radius, number of ConcavePts, Concavity, and Texture. Whereas benign cells have lower values in these variables and thats due to higher values in PC1.