If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
3.5/6+1/6+0.7/6+0.4/6
[1] 0.9333333
We need to keep 4 principal components to explain at least 90% of the variability in the data set. Keeping only 3 preserves 86.7%, while keeping 4 preserves 93.3%.
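The arithmetic can be checked with a short sketch. It assumes, as the sum above does, that the four largest eigenvalues are 3.5, 1, 0.7, and 0.4 and that the total variance is 6. The variable names are illustrative.

```r
# Eigenvalues of the first four PCs and the total variance,
# taken from the sum above (names are illustrative).
eigenvalues <- c(3.5, 1, 0.7, 0.4)
total_variance <- 6

# Cumulative proportion of variance explained by the first k PCs
cum_prop <- cumsum(eigenvalues) / total_variance
round(cum_prop, 4)

# Smallest k whose cumulative proportion reaches 90%
k <- which(cum_prop >= 0.90)[1]
k
```

The third cumulative proportion falls just short of 0.90, which is why the fourth component is needed.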
Problem 2
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set has measurements of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
Iris 4: C - PC1 and PC2 are both small and positive as well
Iris 5: E - PC1 is about -2 and PC2 is about -1
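Matching irises to points on a score plot relies on how PC scores are computed: each score is the standardized measurement vector multiplied by the loadings. The sketch below illustrates this with the built-in iris data. The five irises from the problem are not reproduced here, so the first five rows are used purely as a stand-in.

```r
# Mean-center and scale the four measurements
X <- scale(iris[, 1:4])

# PCA on the standardized data
pca <- prcomp(X)

# Scores computed by hand: standardized data times the loadings
scores_manual <- X %*% pca$rotation

# The manual scores agree with the scores stored by prcomp()
all.equal(scores_manual, pca$x, check.attributes = FALSE)

# PC1 and PC2 scores for the first five irises (illustrative rows only)
round(scores_manual[1:5, 1:2], 2)
```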
Problem 3
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:
Climate & Terrain
Housing
Health Care & Environment
Crime
Transportation
Education
The Arts
Recreation
Economics
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
Climate & Terrain: very hot and very cold months, seasonal temperature variation, heating- and cooling-degree days, freezing days, zero-degree days, ninety-degree days.
Health Care & Environment: per capita physicians, teaching hospitals, medical schools, cardiac rehabilitation centers, comprehensive cancer treatment centers, hospices, insurance/hospitalization costs index, fluoridation of drinking water, air pollution.
Crime: violent crime rate, property crime rate.
Transportation: daily commute, public transportation, Interstate highways, air service, passenger rail service.
Education: pupil/teacher ratio in the public K-12 system, effort index in K-12, academic options in higher education.
The Arts: museums, fine arts and public radio stations, public television stations, universities offering a degree or degrees in the arts, symphony orchestras, theatres, opera companies, dance companies, public libraries.
Recreation: good restaurants, public golf courses, certified lanes for tenpin bowling, movie theatres, zoos, aquariums, family theme parks, sanctioned automobile race tracks, pari-mutuel betting attractions, major- and minor- league professional sports teams, NCAA Division I football and basketball teams, miles of ocean or Great Lakes coastline, inland water, national forests, national parks, or national wildlife refuges, Consolidated Metropolitan Statistical Area access.
Economics: average household income adjusted for taxes and living costs, income growth, job growth.
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
A.
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
The scree plot shows a large drop-off in the variability explained after the leading components, so I will use the first 2 principal components to explore this data set in lower dimensions. Together, the first two components explain 51.36% of the variability in our data set.
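A minimal sketch of how the proportions behind a scree plot can be computed is shown below. The places data are not reproduced here, so the built-in USArrests data serve as a stand-in (an assumption made for illustration only), and base R's screeplot() is used rather than the plotting function from the original analysis.

```r
# Stand-in data (assumption): built-in USArrests, not the places data
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the PCs
eig <- pca$sdev^2

# Proportion and cumulative proportion of variance explained
prop_var <- eig / sum(eig)
cum_var <- cumsum(prop_var)
round(rbind(prop_var, cum_var), 4)

# Base R scree plot of the eigenvalues
screeplot(pca, type = "lines", main = "Scree plot (stand-in data)")
```

Reading the cumulative row at the chosen number of components gives the percentage of total variability retained.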
B.
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
l1 <- base_plot +
  geom_col(aes(x = reorder(Variable, PC1, decreasing = TRUE), y = PC1))
l2 <- base_plot +
  geom_col(aes(x = reorder(Variable, PC2, decreasing = TRUE), y = PC2))
(l1 + l2) +
  plot_annotation(title = 'Loadings plots for PCA on Places Rated',
                  theme = theme(plot.title = element_text(hjust = 0.5)))
Loadings for PC1: Every loading on PC1 is positive, so cities that score high on every variable will have a high PC1. The two largest loadings are Arts and Health Care & Environment, making these the variables that contribute most to a large PC1. Economics and Climate have the two smallest loadings: high economics and climate scores still push PC1 up, but they contribute the least.
Loadings for PC2: Roughly half of the loadings on PC2 are positive and half are negative. High scores for economics, recreation, crime, housing, and climate contribute to a large PC2, while high scores for transportation, arts, health care, or education contribute to a low PC2. The most important positive loadings are economics, recreation, and climate. The most important negative loadings are education and health care.
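Reading signs and magnitudes is easier from a sorted table of loadings. The sketch below builds one from the rotation matrix returned by prcomp(). It uses the built-in USArrests data as a stand-in (an assumption, since the places data are not reproduced here), and loadings_df is an illustrative name.

```r
# Stand-in data (assumption): built-in USArrests, not the places data
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# One row per variable, with its loadings on the first two PCs
loadings_df <- data.frame(
  Variable = rownames(pca$rotation),
  PC1 = pca$rotation[, "PC1"],
  PC2 = pca$rotation[, "PC2"],
  row.names = NULL
)

# Sort by PC2 so positive and negative loadings group together
loadings_df[order(loadings_df$PC2, decreasing = TRUE), ]

# Each column of the rotation matrix has unit length,
# so loadings are comparable within a PC
colSums(pca$rotation^2)
```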
C.
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
library(ggrepel)
From this plot we can see that the cities far away from the others are San Francisco, Long Beach/LA, and New York City. New York City has a higher score for transportation and arts than nearly every other city: it lies much further in the direction those arrows point than any other point. San Francisco and Long Beach/LA are closer to each other but far from every other point. They sit between the Housing arrow and the Arts and Transportation arrows, which tells us that they have high arts and transportation scores but also a high housing score, meaning expensive housing.
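Outliers on the biplot can also be listed numerically by attaching the first two PC scores to the data and ranking observations by their distance from the origin. The sketch below is one way to do this. It uses the built-in USArrests data as a stand-in (an assumption), and the column name dist_from_origin is illustrative.

```r
# Stand-in data (assumption): built-in USArrests, not the places data
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Add the first two PC scores to the original data
scored <- cbind(USArrests, pca$x[, 1:2])

# Distance of each observation from the origin of the PC1-PC2 plane
scored$dist_from_origin <- sqrt(scored$PC1^2 + scored$PC2^2)

# The three observations farthest from the origin
outliers <- head(scored[order(scored$dist_from_origin, decreasing = TRUE), ], 3)
outliers[, c("PC1", "PC2", "dist_from_origin")]
```

Distance from the origin is only a rough screen. The direction of each outlier relative to the variable arrows is still what explains why it stands out.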
Problem 4
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was to determine whether the malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
ID - patient identification number (not used in PCA)
Diagnosis determined by biopsy - B = benign or M = malignant
Radius: mean of distances from center to points on the perimeter
Texture: standard deviation of gray-scale values
Smoothness: local variation in radius lengths
Compactness: perimeter^2 / area - 1.0
Concavity: severity of concave portions of the contour
Concavepts: number of concave portions of the contour
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6    PC7     PC8
Standard deviation     2.0705 1.3504 0.9087 0.70614 0.61016 0.30355 0.2623 0.17837
Proportion of Variance 0.5359 0.2279 0.1032 0.06233 0.04654 0.01152 0.0086 0.00398
Cumulative Proportion  0.5359 0.7638 0.8670 0.92937 0.97591 0.98743 0.9960 1.00000
The first 3 PCs explain 86.70% of the variability in cell shape features. The variability explained per PC falls from 22.79% for PC2 to 10.32% for PC3, and every later PC explains less than 7%. That tells us that we can use 3 PCs moving forward.
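The proportions in the summary follow directly from the reported standard deviations: each PC's variance is its standard deviation squared, and the variances sum to approximately 8. A short sketch using the values printed in the summary (variable names are illustrative):

```r
# Standard deviations of the eight PCs, copied from the summary output
sdev <- c(2.0705, 1.3504, 0.9087, 0.70614, 0.61016, 0.30355, 0.2623, 0.17837)

# Variance of each PC is its squared standard deviation
variances <- sdev^2
sum(variances)

# Proportion and cumulative proportion of variance explained
prop_var <- variances / sum(variances)
cum_var <- cumsum(prop_var)
round(data.frame(PC = seq_along(sdev), prop_var, cum_var), 4)
```

The third entry of cum_var reproduces the 86.70% quoted for the first 3 PCs, up to rounding in the printed standard deviations.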
B.
Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.
Looking at the loadings for PC1, we can see that all of them are negative, so cells with large values for each measurement will have a low PC1. The largest loadings in absolute value are those for Compactness, Concavity, and Concave points, so cells with low values for each of these will have a very high PC1.
Next, we can look at PC2 and see how it differs. A good portion of the loadings for PC2 are positive; the negative ones are fractal dimension, symmetry, compactness, and concavity. The largest loadings in absolute value are radius and fractal dimension: cells with a large radius will have a large PC2, and cells with a large fractal dimension will have a small PC2.
Lastly, let's look at PC3. PC3 has many loadings close to 0, which makes sense because much of the variability has already been explained by the other PCs. Notably, texture has a strongly negative loading of -0.898, the largest individual loading in absolute value for any of the PCs. This tells us that cells with a large standard deviation in texture will have a small PC3.
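One caveat when reading signs: the sign of a principal component is arbitrary, so a PC whose loadings are all negative describes the same axis as one whose loadings are all positive, with high and low scores swapped. The sketch below demonstrates this on the built-in USArrests data, used as a stand-in (an assumption) for the cell data.

```r
# Stand-in data (assumption): built-in USArrests, not the cell data
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Flip the sign of PC1 in both the loadings and the scores
rotation_flipped <- pca$rotation
scores_flipped <- pca$x
rotation_flipped[, 1] <- -rotation_flipped[, 1]
scores_flipped[, 1] <- -scores_flipped[, 1]

# Both versions reconstruct the same standardized data
recon_original <- pca$x %*% t(pca$rotation)
recon_flipped <- scores_flipped %*% t(rotation_flipped)
all.equal(recon_original, recon_flipped)
```

What matters for interpretation is therefore which variables share a sign and which have opposite signs, not whether a given loading is positive or negative in isolation.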
C.
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
What characteristics distinguish malignant from benign cells?
Of the 3 PCs, which does the best job of differentiating malignant from benign cells?
cells_pc_df <- cbind(bc_cells, cells_pca$x[, 1:4])
fviz_pca_biplot(cells_pca, col.ind = 'white') +
  geom_point(data = cells_pc_df, aes(x = PC1, y = PC2, size = PC3, color = Diagnosis))
The characteristics that best differentiate benign from malignant cells are concavity, concave points, compactness, and radius. There are two large clouds of points, one mostly malignant and the other mostly benign. Most of the arrows point toward the malignant cloud, and those four arrows in particular have no benign points in their directions. The PC that best differentiates the cells is PC1: the left end of the PC1 axis is almost entirely malignant cells, and the right end is almost entirely benign cells. Additionally, the most important variables for PC1 are compactness, concavity, and concave points. Cells with large values of these have low PC1 scores, and each of these arrows points toward the malignant cells.
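The visual judgment about which PC separates the groups best can be backed by a simple numeric summary. The sketch below computes, for each of the first three PCs, the absolute difference in group means divided by the pooled standard deviation. This is one possible measure, not the method used in the analysis. It runs on a stand-in: two species from the built-in iris data play the role of the two diagnoses (an assumption for illustration).

```r
# Stand-in data (assumption): two iris species act as the two diagnoses
two <- droplevels(subset(iris, Species != "setosa"))
pca <- prcomp(two[, 1:4], center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:3])

# Standardized mean difference between the groups on each PC:
# absolute difference in group means divided by the pooled SD
separation <- sapply(scores, function(s) {
  m <- tapply(s, two$Species, mean)
  v <- tapply(s, two$Species, var)
  n <- tapply(s, two$Species, length)
  pooled_sd <- sqrt(sum((n - 1) * v) / (sum(n) - 2))
  as.numeric(abs(diff(m)) / pooled_sd)
})
round(separation, 2)
```

A larger value indicates a PC along which the two groups overlap less.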