We need to have 4 dimensions to explain at least 90% of the variability.
Problem 2
The iris data set is a classic data set often used to demonstrate PCA. For each iris, the data set contains measurements of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
A plot of the first two PC scores for these five irises is shown in the plot below.
Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.
Answers: 1b 2d 3a 4c 5e
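The matching above relies on how PC scores are obtained: each score is the centered and scaled measurement vector projected onto the loading vectors. The sketch below illustrates this with R's built-in `iris` data. The row indices in `ids` are hypothetical picks, since the five irises from the problem are not reproduced here, so the resulting coordinates are not meant to match the plot.

```r
# Center and scale the four measurements, as described in the problem.
iris_scaled <- scale(iris[, 1:4])

# PCA on the already standardized data.
iris_pca <- prcomp(iris_scaled)

# Hypothetical choice of five irises (NOT the five from the problem).
ids <- c(1, 51, 101, 75, 120)

# Scores for PC1 and PC2: standardized data times the loading matrix.
manual_scores <- iris_scaled[ids, ] %*% iris_pca$rotation[, 1:2]

# These agree with the scores stored by prcomp().
print(round(manual_scores, 3))
print(all.equal(manual_scores, iris_pca$x[ids, 1:2], check.attributes = FALSE))
```

With the actual five rows substituted for `ids`, each row of `manual_scores` would give the (PC1, PC2) coordinates to look for on the plot.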
Problem 3
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:
Climate & Terrain
Housing
Health Care & Environment
Crime
Transportation
Education
The Arts
Recreation
Economics
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
Climate & Terrain: very hot and very cold months, seasonal temperature variation, heating- and cooling-degree days, freezing days, zero-degree days, ninety-degree days.
Health Care & Environment: per capita physicians, teaching hospitals, medical schools, cardiac rehabilitation centers, comprehensive cancer treatment centers, hospices, insurance/hospitalization costs index, fluoridation of drinking water, air pollution.
Crime: violent crime rate, property crime rate.
Transportation: daily commute, public transportation, Interstate highways, air service, passenger rail service.
Education: pupil/teacher ratio in the public K-12 system, effort index in K-12, academic options in higher education.
The Arts: museums, fine arts and public radio stations, public television stations, universities offering a degree or degrees in the arts, symphony orchestras, theatres, opera companies, dance companies, public libraries.
Recreation: good restaurants, public golf courses, certified lanes for tenpin bowling, movie theatres, zoos, aquariums, family theme parks, sanctioned automobile race tracks, pari-mutuel betting attractions, major- and minor- league professional sports teams, NCAA Division I football and basketball teams, miles of ocean or Great Lakes coastline, inland water, national forests, national parks, or national wildlife refuges, Consolidated Metropolitan Statistical Area access.
Economics: average household income adjusted for taxes and living costs, income growth, job growth.
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
A.
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
k <- sum(cumulative_variance < 0.80) + 1
cat("\nAnswer: Retain the first", k, "PCs, which explains",
    round(cumulative_variance[k] * 100, 2), "% of total variability\n")
Answer: Retain the first 6 PCs, which explains 84.24 % of total variability
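The code that builds `cumulative_variance` is not shown in this excerpt. One way such a vector might be computed, together with a simple scree plot, is sketched below. It uses the built-in `USArrests` data as a stand-in (an assumption, since the Places Rated data are not available here), and all names ending in `_demo` are hypothetical.

```r
# Stand-in data: built-in USArrests (NOT the Places Rated data).
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance per PC: each squared sdev over the total.
prop_var_demo <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cumulative_var_demo <- cumsum(prop_var_demo)

# Smallest k whose cumulative proportion reaches the 80% threshold.
k_demo <- which(cumulative_var_demo >= 0.80)[1]

# Base-R scree plot: look for the "elbow" where the curve flattens.
plot(prop_var_demo, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained",
     main = "Scree plot (stand-in data)")

print(round(cumulative_var_demo, 3))
print(k_demo)
```

The `which(... >= 0.80)[1]` rule picks the same `k` as counting the PCs below the threshold and adding one, as long as the threshold is eventually reached.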
B.
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
PC1 is driven mostly by population, the arts, and health care. PC2 is more balanced in its positive loadings: its five positive variables decrease in relevance at a consistent rate, while crime is the most important loading.
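When discussing signs, it may help to remember that the sign of a whole principal component is arbitrary: flipping every loading and every score on that component describes the same axis. What matters is which variables share a sign and which oppose each other. A minimal sketch, again using the built-in `USArrests` data as a stand-in (an assumption; the names ending in `_demo` are hypothetical):

```r
# Stand-in data: built-in USArrests (NOT the Places Rated data).
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings for the first two PCs, signs included.
loadings_demo <- pca_demo$rotation[, 1:2]
print(round(loadings_demo, 2))

# Variables ordered by their PC1 loading, from most negative to most positive.
print(sort(loadings_demo[, "PC1"]))

# Flipping the sign of both the scores and the loadings of PC1
# leaves its contribution to the reconstructed data unchanged.
recon_original <- outer(pca_demo$x[, 1], pca_demo$rotation[, 1])
recon_flipped <- outer(-pca_demo$x[, 1], -pca_demo$rotation[, 1])
print(all.equal(recon_original, recon_flipped))
```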
C.
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
Warning: ggrepel: 307 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
We can see a few outliers visually on the PC plot. New York is very high in arts and population, while Long Beach is very high in housing.
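Visual inspection can be complemented by ranking observations by their distance from the origin in the PC1-PC2 plane. The sketch below shows one simple way to do this on the built-in `USArrests` data, used as a stand-in (an assumption; the names ending in `_out` are hypothetical). Because PC1 has more variance than PC2, plain Euclidean distance weights PC1 more heavily.

```r
# Stand-in data: built-in USArrests (NOT the Places Rated data).
pca_out <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Keep the first two PC scores; row names identify the observations.
scores_out <- as.data.frame(pca_out$x[, 1:2])

# Euclidean distance of each observation from the origin of the score plot.
scores_out$dist <- sqrt(scores_out$PC1^2 + scores_out$PC2^2)

# Observations farthest from the origin are candidate outliers.
ranked_out <- scores_out[order(scores_out$dist, decreasing = TRUE), ]
print(head(ranked_out, 5))
```

The signs of a candidate's PC1 and PC2 scores, read against the loadings, then indicate which ratings make it unusual.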
Problem 4
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was to determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
ID - patient identification number (not used in PCA)
Diagnosis determined by biopsy - B = benign or M = malignant
Radius: mean of distances from center to points on the perimeter
Texture: standard deviation of gray-scale values
Smoothness: local variation in radius lengths
Compactness: perimeter^2 / area - 1.0
Concavity: severity of concave portions of the contour
Concavepts: number of concave portions of the contour
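To give a feel for the compactness formula listed above, the sketch below evaluates perimeter^2 / area - 1.0 for two idealized shapes. The helper `compactness()` and the shapes are illustrative assumptions and are not part of the data set.

```r
# Compactness as defined in the variable list: perimeter^2 / area - 1.0
compactness <- function(perimeter, area) {
  perimeter^2 / area - 1.0
}

# Circle of radius r: perimeter 2*pi*r, area pi*r^2, so the value is 4*pi - 1.
r <- 2
circle_compactness <- compactness(2 * pi * r, pi * r^2)

# Square of side s: perimeter 4*s, area s^2, so the value is 16 - 1 = 15.
s <- 3
square_compactness <- compactness(4 * s, s^2)

print(circle_compactness)
print(square_compactness)
```

The value does not depend on the size of the shape, only on its outline, and the circle gives the smallest possible value; less regular outlines score higher.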
cat("Variables with highest positive loadings:", paste(names(sort(loadings[,1], decreasing=TRUE)[1:3]), collapse=", "), "\n")
Variables with highest positive loadings: Texture, FracDim, Radius
cat("Variables with highest negative loadings:", paste(names(sort(loadings[,1], decreasing=FALSE)[1:3]), collapse=", "), "\n")
Variables with highest negative loadings: Compactness, Concavity, ConcavePts
cat("\nPC2 Interpretation:\n")
PC2 Interpretation:
cat("Variables with highest positive loadings:", paste(names(sort(loadings[,2], decreasing=TRUE)[1:3]), collapse=", "), "\n")
Variables with highest positive loadings: Radius, Texture, ConcavePts
cat("Variables with highest negative loadings:", paste(names(sort(loadings[,2], decreasing=FALSE)[1:3]), collapse=", "), "\n")
Variables with highest negative loadings: FracDim, Smoothness, Symmetry
cat("\nPC3 Interpretation:\n")
PC3 Interpretation:
cat("Variables with highest positive loadings:", paste(names(sort(loadings[,3], decreasing=TRUE)[1:3]), collapse=", "), "\n")
Variables with highest positive loadings: Radius, ConcavePts, Smoothness
cat("Variables with highest negative loadings:", paste(names(sort(loadings[,3], decreasing=FALSE)[1:3]), collapse=", "), "\n")
Variables with highest negative loadings: Texture, FracDim, Symmetry
C.
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
What characteristics distinguish malignant from benign cells?
Of the 3 PCs, which does the best job of differentiating malignant from benign cells?
library(factoextra)
library(ggrepel)

# Add the first three PC scores to the data set
bc_cells$PC1 <- pca_results$x[, 1]
bc_cells$PC2 <- pca_results$x[, 2]
bc_cells$PC3 <- pca_results$x[, 3]

# Loadings for PC1 and PC2, scaled up so the arrows are visible among the scores
loadings_plot <- as.data.frame(pca_results$rotation[, 1:2])
loadings_plot$Variable <- rownames(loadings_plot)
scale_factor <- 5
loadings_plot$PC1_scaled <- loadings_plot$PC1 * scale_factor
loadings_plot$PC2_scaled <- loadings_plot$PC2 * scale_factor

# Point size is mapped to the magnitude of the PC3 score
bc_cells$PC3_size <- abs(bc_cells$PC3)

ggplot() +
  geom_point(data = bc_cells,
             aes(x = PC1, y = PC2, color = Diagnosis, size = PC3_size),
             alpha = 0.6) +
  # linewidth replaces size for line geoms in ggplot2 >= 3.4.0
  geom_segment(data = loadings_plot,
               aes(x = 0, y = 0, xend = PC1_scaled, yend = PC2_scaled),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "blue", linewidth = 0.8, alpha = 0.7) +
  geom_text_repel(data = loadings_plot,
                  aes(x = PC1_scaled, y = PC2_scaled, label = Variable),
                  color = "blue", size = 3.5, fontface = "bold",
                  box.padding = 0.5, point.padding = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50", alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray50", alpha = 0.5) +
  scale_color_manual(values = c("B" = "green3", "M" = "red2"),
                     labels = c("B" = "Benign", "M" = "Malignant")) +
  scale_size_continuous(name = "|PC3|", range = c(1, 5)) +
  labs(title = "Biplot: Breast Cancer Cells with Variable Loadings",
       subtitle = "Points colored by diagnosis and sized by |PC3|",
       x = paste0("PC1 (", round(prop_variance[1] * 100, 1), "%)"),
       y = paste0("PC2 (", round(prop_variance[2] * 100, 1), "%)"),
       color = "Diagnosis") +
  theme_minimal() +
  theme(legend.position = "right")
PC1 by Diagnosis:
# A tibble: 2 × 5
Diagnosis Mean SD Min Max
<chr> <dbl> <dbl> <dbl> <dbl>
1 B 1.11 1.16 -4.24 3.43
2 M -1.87 1.91 -8.52 1.81
PC2 by Diagnosis:
# A tibble: 2 × 5
Diagnosis Mean SD Min Max
<chr> <dbl> <dbl> <dbl> <dbl>
1 B -0.426 1.14 -6.17 1.97
2 M 0.717 1.38 -4.71 3.59
PC3 by Diagnosis:
# A tibble: 2 × 5
Diagnosis Mean SD Min Max
<chr> <dbl> <dbl> <dbl> <dbl>
1 B 0.00172 0.915 -3.48 1.95
2 M -0.00289 0.901 -3.60 2.34
pc_long <- bc_cells %>%
  select(Diagnosis, PC1, PC2, PC3) %>%
  pivot_longer(cols = c(PC1, PC2, PC3), names_to = "PC", values_to = "Score")

ggplot(pc_long, aes(x = Diagnosis, y = Score, fill = Diagnosis)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~PC, scales = "free_y") +
  scale_fill_manual(values = c("B" = "green3", "M" = "red2"),
                    labels = c("B" = "Benign", "M" = "Malignant")) +
  labs(title = "Distribution of PC Scores by Diagnosis",
       x = "Diagnosis",
       y = "PC Score") +
  theme_minimal()
Answer: Malignant cells are high in all of these size and shape measurements, while benign cells are low in all of them. In short, benign cells are smaller, and malignant cells are larger and misshapen.
I think PC2 is the best at describing what makes a cell malignant, while the other two PCs have mostly negative loadings, meaning they better describe what a benign cell is. Looking at the boxplots reinforces this.