Application of Principal Component Analysis (PCA)

While preparing a descriptive-statistics assignment for my statistics class using U.S. Census survey data, I became curious about applying Principal Component Analysis (PCA) to the data set. This R Markdown file provides a detailed walkthrough of how to perform PCA on the data in R, including the necessary mathematical concepts.

Load the Libraries

# Load necessary libraries
library(readxl)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Read the Data

The data used in this analysis is from the American Community Survey and is stored in an Excel file. We read this data into R using the readxl package.

# Read the American Community Survey data from Excel file
acs_data <- read_excel("/Users/cynthiamcginnis/Documents/incomeAgeGenderEdu.xlsx")

Data Preparation

Before performing PCA, we need to prepare the data. This involves converting relevant columns to numeric and handling any missing values.

# Convert relevant columns to numeric
acs_data <- acs_data %>%
  mutate(
    income = as.numeric(income),
    hrs_work = as.numeric(hrs_work),
    age = as.numeric(age),
    edu = as.factor(edu)  # Convert education to a factor
  )

# Encode education levels as numeric
edu_levels <- c("hs or lower" = 1, "college" = 2, "grad" = 3)
acs_data$edu <- as.numeric(recode(acs_data$edu, !!!edu_levels))

# Check for NA values and remove rows with any NA values.
# gender is kept alongside so it stays row-aligned with the PCA input
# when we color the individuals plot later.
pca_data_full <- acs_data %>%
  select(income, hrs_work, age, edu, gender) %>%
  na.omit()  # Remove rows with missing values

pca_data <- pca_data_full %>% select(-gender)  # numeric columns only for PCA
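
It is worth confirming how much the NA filter removed; a quick check using the objects defined above:

# NAs per column before filtering, and number of rows dropped by na.omit()
colSums(is.na(select(acs_data, income, hrs_work, age, edu)))
nrow(acs_data) - nrow(pca_data)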

Standardize the Data

Standardizing the data is crucial in PCA as it ensures that each feature contributes equally to the analysis. Standardization involves subtracting the mean and dividing by the standard deviation for each feature.

The z-score for a given data point \(x_i\) is defined as:

\[\begin{equation} z_i = \frac{x_i - \mu}{\sigma} \end{equation}\]

where \(x_i\) is the observed value, \(\mu\) is the mean of the feature, and \(\sigma\) is its standard deviation. Subtracting the mean centers the data around zero, and dividing by the standard deviation scales it to unit variance.

In R, the standardization can be easily performed using the scale() function, which applies the standard z formula to each feature in the dataset.

By standardizing the data, we ensure that all features have a mean of zero and a standard deviation of one. This is important in PCA because it prevents features with larger values from dominating the analysis and allows each feature to contribute equally to the identification of the principal components.
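
As a quick illustration, here is a minimal sketch verifying that scale() reproduces the manual z-score formula (the numbers are made up for the example):

# Verify that scale() matches the manual z-score computation
x <- c(23000, 45000, 61000, 38000)   # hypothetical income values
z_manual <- (x - mean(x)) / sd(x)    # (x_i - mu) / sigma
z_scaled <- as.numeric(scale(x))     # same computation via scale()
all.equal(z_manual, z_scaled)        # should be TRUE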

Once the data is standardized, PCA can be applied to identify the principal components that capture the maximum variance in the dataset. These components serve as a new, reduced-dimensionality basis for the data, enabling further analysis and visualization.

# Standardize the data
pca_data_scaled <- scale(pca_data)

Perform PCA

We perform PCA using the PCA() function from the FactoMineR package, which computes the principal components and provides various outputs for interpretation. Note that PCA() standardizes the input by default (scale.unit = TRUE), so passing the pre-scaled matrix is harmless here.

# Perform PCA
pca_result <- PCA(pca_data_scaled, graph = FALSE)
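
As an optional cross-check, base R's prcomp() should produce the same components (up to the arbitrary sign of each axis):

# Cross-check with base R; the proportions of variance should match pca_result$eig
pr <- prcomp(pca_data, scale. = TRUE)
summary(pr)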

Visualize PCA Results

Visualizing the PCA results helps in understanding the distribution of data points and the importance of each principal component. We use the factoextra package to create these visualizations.

# Plot PCA results; gender comes from pca_data_full so it is row-aligned
# with the PCA input (acs_data may contain rows dropped by na.omit)
fviz_pca_ind(pca_result, geom.ind = "point", pointshape = 21,
             pointsize = 2, fill.ind = pca_data_full$gender, col.ind = "black",
             palette = "jco", addEllipses = TRUE, legend.title = "Gender") +
  theme_minimal()

fviz_pca_var(pca_result, col.var = "steelblue") +
  theme_minimal()

The Correlation Coefficient

The Pearson correlation coefficient between two variables \(X\) and \(Y\) is defined as:

\[r_{xy} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}\]

where \(\text{Cov}(X, Y)\) is the covariance between \(X\) and \(Y\), and \(\sigma_X\) and \(\sigma_Y\) are their standard deviations.

The correlation coefficient ranges from -1 to +1: a value of +1 indicates a perfect positive linear relationship, 0 indicates no linear relationship, and -1 indicates a perfect negative linear relationship.

In PCA, the correlation matrix is often used instead of the covariance matrix when the variables are measured on different scales or have different units. The correlation matrix standardizes the variables by dividing the covariance by the product of their standard deviations, resulting in a scale-invariant measure of the relationship between variables.

The loadings obtained from the eigenvalue decomposition of the correlation matrix represent the correlations between the original variables and the principal components. These loadings help in interpreting the principal components and understanding the contribution of each variable to the overall variance in the dataset.
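
To make this connection concrete, here is a minimal base-R sketch showing that PCA on standardized variables is the eigendecomposition of their correlation matrix (using pca_data from above):

# PCA from first principles: eigendecomposition of the correlation matrix
R <- cor(pca_data)
eig <- eigen(R)
eig$values   # should match the eigenvalues in pca_result$eig[, 1]

# Loadings = eigenvectors scaled by sqrt(eigenvalue); signs may differ
# from FactoMineR's output, since the sign of each axis is arbitrary
eig$vectors[, 1] * sqrt(eig$values[1])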

#install.packages("kableExtra")
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# Extract loadings for the first two principal components
loadings <- pca_result$var$coord[, 1:2]

# Create a dataframe for the loadings
loadings_df <- as.data.frame(loadings)
colnames(loadings_df) <- c("PC1", "PC2")
loadings_df$Variable <- rownames(loadings_df)
rownames(loadings_df) <- NULL

# Display the loadings table using kableExtra
kable(loadings_df, caption = "Loadings for the First Two Principal Components") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Loadings for the First Two Principal Components

Variable     PC1         PC2
income       0.8468957   0.0540743
hrs_work     0.4448023  -0.5711667
age          0.3082897   0.8336759
edu          0.7318477  -0.0666159

Interpretation

PC1 (38.6% of variance): Dominated by income and education, with a smaller positive loading for age, suggesting it represents a socioeconomic-status dimension.

PC2 (25.7% of variance): Dominated by age (positive loading) and hours worked (negative loading), suggesting it represents a work-life balance or age-related dimension.

These interpretations help to understand the underlying structure of the data and the key factors that drive the variability captured by the first two principal components. By examining the loadings, you can identify which variables are most influential in defining each principal component.
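
The variance percentages quoted above can be read directly from the eigenvalue table:

# Percentage of variance explained by each component
round(pca_result$eig[, "percentage of variance"], 1)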

# Print PCA results
print(pca_result)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 620 individuals, described by 4 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

Eigenvalues ($eig)

Purpose: Understand the amount of variance captured by each principal component.

A scree plot is a graphical display of the variance explained by each principal component, in decreasing order.

Usage:

Plot the eigenvalues to create a scree plot. This helps in determining how many principal components to retain.

# Scree plot
fviz_eig(pca_result)
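
Beyond the plot, the eigenvalue table can be inspected directly; one common rule of thumb (the Kaiser criterion) retains components whose eigenvalue exceeds 1:

# Eigenvalue, % of variance, and cumulative % for each component
round(pca_result$eig, 2)

# Kaiser criterion: components with eigenvalue > 1
which(pca_result$eig[, "eigenvalue"] > 1)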

Variable Results ($var)

Purpose: Understand the relationship between the original variables and the principal components.

Coordinates for the Variables ($var$coord): These coordinates indicate how the original variables are projected onto the principal components.
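
Both the coordinates and the variables' contributions can be pulled straight from the result object:

# Coordinates (correlations with the components) and contributions (%)
round(pca_result$var$coord, 3)
round(pca_result$var$contrib, 2)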

The provided PCA biplot (or variable plot) visualizes the relationships between the original variables and the first two principal components (PC1 and PC2).

Key Observations

Axes and Variance Explained:

Dim1 (38.6%): The x-axis represents the first principal component, which explains 38.6% of the total variance in the data.

Dim2 (25.7%): The y-axis represents the second principal component, explaining 25.7% of the variance.

Together, these two components explain approximately 64.3% of the total variance, indicating a substantial portion of the data’s variability is captured by these two dimensions.

Variable Representation:

Vectors (Arrows): Each arrow represents one of the original variables. The direction and length of an arrow indicate that variable's contribution to the principal components.

Angles Between Arrows: The angle between two arrows reflects the correlation between the variables: small angles indicate high positive correlation, angles close to 90 degrees indicate little to no correlation, and angles close to 180 degrees (opposite directions) indicate high negative correlation.

Interpretation of Variables

Income and Education: Both vectors point in the same general direction along Dim1, indicating a positive correlation between the two variables. Their vectors are relatively long, so they contribute substantially to Dim1.

Hours Worked: Consistent with its loadings (0.445 on PC1, -0.571 on PC2), the hours worked vector points in the positive direction along Dim1 and more strongly in the negative direction along Dim2, so its dominant contribution is to Dim2.

Age: The age vector points primarily along Dim2 and is long, indicating a strong contribution to Dim2. It forms a large angle with income and education, indicating weaker correlations with those variables, and an angle somewhat greater than 90 degrees with hours worked, suggesting at most a weak negative correlation between the two.

# Plot variables on the PCA biplot
fviz_pca_var(pca_result, col.var = "steelblue") +
  theme_minimal()

# Plot individuals on the PCA biplot with custom colors
fviz_pca_ind(pca_result,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("darkorchid4", "gold", "darkorange"),
             repel = TRUE
             ) +
  theme_minimal()

The provided PCA biplot visualizes the distribution of individuals in the principal component space defined by the first two principal components, Dim1 (PC1) and Dim2 (PC2). Here’s a detailed interpretation of the plot:

Key Observations

  1. Axes and Variance Explained:

    • Dim1 (38.6%): The x-axis represents the first principal component, which explains 38.6% of the total variance in the data.
    • Dim2 (25.7%): The y-axis represents the second principal component, explaining 25.7% of the variance.

    Together, these two components explain approximately 64.3% of the total variance, indicating a substantial portion of the data’s variability is captured by these two dimensions.

  2. Individual Representation:

    • Each point on the plot represents an individual from the dataset.
    • The color of the points represents the quality of representation (cos2) on the principal components. The color gradient ranges from purple to yellow:
      • Purple: Low cos2 values, indicating poorer representation of the individual in the principal component space.
      • Yellow: High cos2 values, indicating better representation of the individual in the principal component space.
  3. Cosine Squared (cos2):

    • The cos2 values indicate the quality of representation of each individual in the PCA space. Higher cos2 values mean that the individual is well-represented by the principal components (see the snippet after this list).
  4. Density and Spread:

    • The plot shows a high density of individuals clustered around the center, indicating that many individuals have similar scores on the first two principal components.
    • There are individuals spread out along both dimensions, indicating variability in the dataset.
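
The cos2 values can be computed directly from the result object; summing across the first two components gives each individual's quality of representation in the Dim1-Dim2 plane:

# Quality of representation of each individual in the Dim1-Dim2 plane
ind_cos2_plane <- rowSums(pca_result$ind$cos2[, 1:2])
summary(ind_cos2_plane)   # values near 1 = well represented by this plane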

Interpretation of Principal Components

  1. Dim1 (PC1):
    • This dimension explains the largest amount of variance in the data.
    • Individuals spread out along this axis differ the most in terms of the variables contributing to PC1 (likely income and education based on previous loadings).
  2. Dim2 (PC2):
    • This dimension explains the second-largest amount of variance.
    • Individuals spread out along this axis differ in terms of the variables contributing to PC2 (likely age and hours worked based on previous loadings).

Detailed Interpretation

  • High-Density Cluster:
    • The high-density cluster near the origin (0,0) indicates a large number of individuals with average values for the principal components. These individuals are not extreme in any of the variables.
  • Color Gradient:
    • Individuals with yellow coloring have high cos2 values, meaning they are well-represented by the first two principal components. These individuals are positioned more clearly along the directions of the principal components.
    • Individuals with purple coloring have low cos2 values, meaning they are less well-represented by these two components and would need additional components to be described accurately.
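
If a cleaner view is needed, factoextra can restrict the plot to well-represented individuals via the select.ind argument; the 0.6 cutoff below is an arbitrary choice for illustration:

# Show only individuals with cos2 > 0.6 on the first two components
fviz_pca_ind(pca_result,
             col.ind = "cos2",
             gradient.cols = c("darkorchid4", "gold", "darkorange"),
             select.ind = list(cos2 = 0.6)) +
  theme_minimal()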

Conclusion:

This R Markdown script demonstrates the full PCA workflow on American Community Survey data: preparing and standardizing the variables, extracting and interpreting the loadings, and visualizing the results with a scree plot and biplots of the variables and individuals.