If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
eigenvalues <-c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)prop_var <- eigenvalues /sum(eigenvalues)cum_var <-cumsum(prop_var)num_components <-which(cum_var >=0.90)[1]cat("Number of components to retain:", num_components, "\n")
Number of components to retain: 4
You should keep 4 of the principal components to explain at least 90% of the variability of the data set.
Problem 2
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set contained a measurement of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:
Climate & Terrain
Housing
Health Care & Environment
Crime
Transportation
Education
The Arts
Recreation
Economics
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
Climate & Terrain: very hot and very cold months, seasonal temperature variation, heating- and cooling-degree days, freezing days, zero-degree days, ninety-degree days.
Health Care & Environment: per capita physicians, teaching hospitals, medical schools, cardiac rehabilitation centers, comprehensive cancer treatment centers, hospices, insurance/hospitalization costs index, flouridation of drinking water, air pollution.
Crime: violent crime rate, property crime rate.
Transportation: daily commute, public transportation, Interstate highways, air service, passenger rail service.
Education: pupil/teacher ratio in the public K-12 system, effort index in K-12, accademic options in higher education.
The Arts: museums, fine arts and public radio stations, public television stations, universities offering a degree or degrees in the arts, symphony orchestras, theatres, opera companies, dance companies, public libraries.
Recreation: good restaurants, public golf courses, certified lanes for tenpin bowling, movie theatres, zoos, aquariums, family theme parks, sanctioned automobile race tracks, pari-mutuel betting attractions, major- and minor- league professional sports teams, NCAA Division I football and basketball teams, miles of ocean or Great Lakes coastline, inland water, national forests, national parks, or national wildlife refuges, Consolidated Metropolitan Statistical Area access.
Economics: average household income adjusted for taxes and living costs, income growth, job growth.
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
ggplot(scree_data, aes(x = PC, y = Proportion)) +geom_bar(stat ="identity", fill ="steelblue") +geom_line(aes(y = Cumulative, group =1), color ="red", size =1) +geom_point(aes(y = Cumulative), color ="red", size =2) +ylab("Proportion of Variance Explained") +ggtitle("Scree Plot of PCA on Places Rated Criteria") +theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
From the graph I can see that we should use about 5 of the principal components and they would explain roughly 85% of the total variability.
B.
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
loadings <- places_pca$rotation[, 1:2] # PC1 and PC2loadings_df <-as.data.frame(loadings) %>%rownames_to_column("Variable")ggplot(loadings_df, aes(x = PC1, y = PC2, label = Variable)) +geom_segment(aes(x =0, y =0, xend = PC1, yend = PC2),arrow =arrow(length =unit(0.2, "cm")), color ="blue") +geom_text(hjust =0.5, vjust =-0.5, size =4) +xlim(-1, 1) +ylim(-1, 1) +labs(title ="Loadings Plot for PC1 and PC2",x ="PC1", y ="PC2") +theme_minimal()
The top 6 variables that will be used to separate the cities along the first and second principal axes will be Arts, HlthCare, and Educ for one and Econ, Recreat, and Housing for the other since they have the longest vectors, and the two groups are perpindicular to each other meaning that they help explain different things.
C.
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
library(tidyverse)library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.4.3
places_with_scores <- places %>%mutate(PC1 = places_pca$x[, 1],PC2 = places_pca$x[, 2])ggplot(places_with_scores, aes(x = PC1, y = PC2, label = City)) +geom_point(color ="steelblue", size =2) +geom_text_repel(size =3.5, max.overlaps =20) +labs(title ="Biplot of Cities on First Two Principal Components",x ="PC1", y ="PC2") +theme_minimal()
Warning: ggrepel: 282 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
The biggest outlier cities seem to be New-YorkNY, San-FranciscoCA, and Los-AngelesLong-BeachCA. These are all more popular cites and more coastal cities with a higher population density.
Problem 4
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
ID - patient identification number (not used in PCA)
Diagnosis determined by biopsy - B = benign or M = malignant
Radius: mean of distances from center to points on the perimeter
Texture: standard deviation of gray-scale values
Smoothness: local variation in radius lengths
Compactness: perimeter^2 / area - 1.0
Concavity: severity of concave portions of the contour
Concavepts: number of concave portions of the contour
From what I can see it looks like PC1 is all negative, PC2 and PC3 is a mix of positive and negative
C.
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
What characteristics distinguish malignant from benign cells?
Of the 3 PCs, which does the best job of differentiating malignant from benign cells?
library(tidyverse)shape_vars <- bc_cells %>%select(-Diagnosis) %>%scale()bc_pca <-prcomp(shape_vars, center =TRUE, scale. =TRUE)bc_scores <-as.data.frame(bc_pca$x[, 1:3]) %>%rename(PC1 = PC1, PC2 = PC2, PC3 = PC3)bc_plot_data <- bc_cells %>%select(Diagnosis) %>%bind_cols(bc_scores)ggplot(bc_plot_data, aes(x = PC1, y = PC2)) +geom_point(aes(color = Diagnosis, size =abs(PC3)), alpha =0.7) +scale_color_manual(values =c("B"="steelblue", "M"="firebrick")) +labs(title ="PCA Biplot of Breast Cancer Cells",x ="PC1", y ="PC2", size ="PC3 Magnitude") +theme_minimal()
Of the 3 PCs PC1 I feel does the best job at differentiating malignant from benign cells since with the coloring we are able to see that benign cells are further right most of the time then the malignant cells. It also seems that most of the benign cells are clustered where as teh malignant cells are a little more scattered in the plot.