As I was preparing an assignment for my statistics class in descriptive statistics using the U.S. Census survey, I was curious about using Principal Component Analysis (PCA) on the data set. This R Markdown file provides a detailed walk through of how to perform PCA on the U.S. Census survey data using R, including the necessary mathematical concepts.
# Load necessary libraries
library(readxl)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
The data used in this analysis is from the American Community Survey and is stored in an Excel file. We read this data into R using the readxl package.
# Read the American Community Survey data from Excel file
acs_data <- read_excel("/Users/cynthiamcginnis/Documents/incomeAgeGenderEdu.xlsx")
Before performing PCA, we need to prepare the data. This involves converting relevant columns to numeric and handling any missing values.
# Convert relevant columns to numeric
acs_data <- acs_data %>%
mutate(
income = as.numeric(income),
hrs_work = as.numeric(hrs_work),
age = as.numeric(age),
edu = as.factor(edu) # Convert education to a factor
)
# Encode education levels as numeric
edu_levels <- c("hs or lower" = 1, "college" = 2, "grad" = 3)
acs_data$edu <- as.numeric(recode(acs_data$edu, !!!edu_levels))
# Check for NA values and remove rows with any NA values
pca_data <- acs_data %>%
select(income, hrs_work, age, edu) %>%
na.omit() # Remove rows with missing values
Standardizing the data is crucial in PCA as it ensures that each feature contributes equally to the analysis. Standardization involves subtracting the mean and dividing by the standard deviation for each feature.
The standard z formula for a given data point \(x_i\) is defined as:
\[\begin{equation} z_i = \frac{x_i - \mu}{\sigma} \end{equation}\]
where:The formula subtracts the mean \(\mu\) from each data point \(x_i\) and then divides the result by the standard deviation \(\sigma\). This process centers the data around zero and scales it to have unit variance.
In R, the standardization can be easily performed using the
scale()
function, which applies the standard z formula to
each feature in the dataset.
By standardizing the data, we ensure that all features have a mean of zero and a standard deviation of one. This is important in PCA because it prevents features with larger values from dominating the analysis and allows each feature to contribute equally to the identification of the principal components.
Once the data is standardized, PCA can be applied to identify the principal components that capture the maximum variance in the dataset. These components serve as a new, reduced-dimensionality basis for the data, enabling further analysis and visualization.
# Standardize the data
pca_data_scaled <- scale(pca_data)
We perform PCA using the PCA function from the FactoMineR package. This function computes the principal components and provides various outputs for interpretation.
# Perform PCA
pca_result <- PCA(pca_data_scaled, graph = FALSE)
Visualizing the PCA results helps in understanding the distribution of data points and the importance of each principal component. We use the factoextra package to create these visualizations.
# Plot PCA results
fviz_pca_ind(pca_result, geom.ind = "point", pointshape = 21,
pointsize = 2, fill.ind = acs_data$gender, col.ind = "black",
palette = "jco", addEllipses = TRUE, legend.title = "Gender") +
theme_minimal()
fviz_pca_var(pca_result, col.var = "steelblue") +
theme_minimal()
Mathematical Calculation:
Loadings: Calculated as part of the PCA process through eigenvalue decomposition of the covariance matrix or correlation matrix. They represent the coefficients in the linear combinations of the original variables that form the principal components.
Correlation Coefficient: Calculated as the covariance of the two variables divided by the product of their standard deviations.
\[r_{xy} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}\]
where:
The correlation coefficient ranges from -1 to +1, where:
In PCA, the correlation matrix is often used instead of the covariance matrix when the variables are measured on different scales or have different units. The correlation matrix standardizes the variables by dividing the covariance by the product of their standard deviations, resulting in a scale-invariant measure of the relationship between variables.
The loadings obtained from the eigenvalue decomposition of the correlation matrix represent the correlations between the original variables and the principal components. These loadings help in interpreting the principal components and understanding the contribution of each variable to the overall variance in the dataset.
#install.packages("kableExtra")
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# Extract loadings for the first two principal components
loadings <- pca_result$var$coord[, 1:2]
# Create a dataframe for the loadings
loadings_df <- as.data.frame(loadings)
colnames(loadings_df) <- c("PC1", "PC2")
loadings_df$Variable <- rownames(loadings_df)
rownames(loadings_df) <- NULL
# Display the loadings table using kableExtra
kable(loadings_df, caption = "Loadings for the First Two Principal Components") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
PC1 | PC2 | Variable |
---|---|---|
0.8468957 | 0.0540743 | income |
0.4448023 | -0.5711667 | hrs_work |
0.3082897 | 0.8336759 | age |
0.7318477 | -0.0666159 | edu |
PC1 (38.6% of variance): Dominated by income and education, suggesting it represents a socioeconomic status dimension.
High positive loadings for income and education.
Moderate positive loading for hours worked.
Small positive loading for age.
PC2 (25.7% of variance): Dominated by age and hours worked, suggesting it represents a work-life balance or age-related dimension.
These interpretations help to understand the underlying structure of the data and the key factors that drive the variability captured by the first two principal components. By examining the loadings, you can identify which variables are most influential in defining each principal component.
# Print PCA results
print(pca_result)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 620 individuals, described by 4 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
Purpose: Understand the amount of variance captured by each principal component.
A scree plot is a graphical representation used in Principal Component Analysis (PCA) to display the variance explained by each principal component
Usage:
Plot the eigenvalues to create a scree plot. This helps in determining how many principal components to retain.
# Scree plot
fviz_eig(pca_result)
# Variable Results ($var)
Purpose: Understand the relationship between the original variables and the principal components.
Coordinates for the Variables (\(var\)coord): These coordinates indicate how the original variables are projected onto the principal components.
The provided PCA biplot (or variable plot) visualizes the relationships between the original variables and the first two principal components (PC1 and PC2).
Key Observations Axes and Variance Explained:
Dim1 (38.6%): The x-axis represents the first principal component, which explains 38.6% of the total variance in the data.
Dim2 (25.7%): The y-axis represents the second principal component, explaining 25.7% of the variance.
Together, these two components explain approximately 64.3% of the total variance, indicating a substantial portion of the data’s variability is captured by these two dimensions.
Variable Representation:
Vectors (Arrows): Each arrow represents one of the original variables. The direction and length of the arrows indicate the contribution of each variable to the principal components. Angles Between Arrows: The angles between the arrows show the correlation between variables. Small angles indicate high positive correlation. Angles close to 90 degrees indicate little to no correlation. Angles close to 180 degrees (opposite directions) indicate high negative correlation. Interpretation of Variables Income and Education:
Both income and education have vectors pointing in the same general direction along Dim1, indicating a positive correlation between these variables. Their vectors are relatively long, suggesting they contribute significantly to Dim1. Hours Worked:
The hours worked vector points in the negative direction along Dim1 and slightly along Dim2, indicating a negative correlation with income and education and a smaller contribution to Dim2. The long vector length indicates a significant contribution to Dim1 but in the opposite direction to income and education. Age:
The age vector points primarily along Dim2 and has a long length, indicating it has a strong contribution to Dim2. Its vector forms a large angle with income and education, indicating a weaker correlation with these variables. The angle between age and hours worked is close to 90 degrees, suggesting little to no correlation between these two variables.
# Plot variables on the PCA biplot
fviz_pca_var(pca_result, col.var = "steelblue") +
theme_minimal()
# Plot individuals on the PCA biplot with custom colors
fviz_pca_ind(pca_result,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("darkorchid4", "gold", "darkorange"),
repel = TRUE
) +
theme_minimal()
The provided PCA biplot visualizes the distribution of individuals in
the principal component space defined by the first two principal
components, Dim1 (PC1) and Dim2 (PC2). Here’s a detailed interpretation
of the plot:
Axes and Variance Explained:
Together, these two components explain approximately 64.3% of the total variance, indicating a substantial portion of the data’s variability is captured by these two dimensions.
Individual Representation:
Cosine Squared (cos2):
Density and Spread:
The R Markdown script provided demonstrates the process of performing Principal Component Analysis (PCA) on a data set from the American Community Survey and generating a scree plot to visualize the variance explained by each principal component.