Discussion 1: PCA Basics Review

For this discussion we will be using a data set from a study investigating potential civil rights violations among insurance companies operating in Chicago, Illinois. A description of the data set and of each variable is available in the Sleuth3 documentation. Please take the time to read through the description before continuing.

library(Sleuth3)
mydata<-ex1716
head(mydata)
##   ZIP Fire Theft  Age Income Race Vol Invol
## 1  26  6.2    29 60.4  11744 10.0 5.3   0.0
## 2  40  9.5    44 76.5   9323 22.2 3.1   0.1
## 3  13 10.5    36 73.5   9948 19.6 4.8   1.2
## 4  57  7.7    37 66.9  10656 17.3 5.7   0.5
## 5  14  8.6    53 81.4   9730 24.5 5.9   0.7
## 6  10 34.1    68 52.6   8231 54.0 4.0   0.3
names(mydata)
## [1] "ZIP"    "Fire"   "Theft"  "Age"    "Income" "Race"   "Vol"    "Invol"
#?ex1716  Run this for description.

For this exercise we will treat the Vol variable as the response because it indicates the number of new policies issued per 100 housing units. These new policies were argued to have been offered to certain locations within Chicago while being withheld from others. The Age, Income, Race, Theft, and Fire variables will be considered as potential predictors of the voluntary new policies issued by insurance companies. We will ignore the Invol variable.

In a regression or classification setting, PCA would be applied to the predictors to help reduce the number of predictors used. Let's run through a few computations and answer some basic questions about our understanding of PCA.

  A. First off, let's observe basic summary statistics of the predictor variables. Make note of anything interesting. Additionally, create a scatter plot matrix and a correlation matrix of the five predictor variables.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)



mysummary <- mydata %>%
  select(Fire, Theft, Age, Income,Race) %>% # select variables to summarise
  summarise_each(funs(min = min, 
                      q25 = quantile(., 0.25), 
                      median = median, 
                      q75 = quantile(., 0.75), 
                      max = max,
                      mean = mean, 
                      sd = sd,
                      variance= var))
## Warning: `summarise_each()` was deprecated in dplyr 0.7.0.
## ℹ Please use `across()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# reshape it using tidyr functions

clean.summary <- mysummary %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, max, mean, sd,variance) # reorder columns
print(clean.summary)
##      var  min     max        mean          sd     variance
## 1    Age    2    90.1    60.32766   22.574964 5.096290e+02
## 2   Fire    2    39.7    12.27872    9.302266 8.653215e+01
## 3 Income 5583 21480.0 10695.82979 2754.198008 7.585607e+06
## 4   Race    1    99.7    34.98511   32.587614 1.061953e+03
## 5  Theft    3    75.0    30.23404   14.533638 2.112266e+02

ANSWER A:

library(dplyr)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Summary statistics using across()
summary_stats <- mydata %>%
  select(Fire, Theft, Age, Income, Race) %>% 
  summarise(across(.cols = everything(), 
                   .fns = list(min = ~min(.),
                               q25 = ~quantile(., 0.25),
                               median = ~median(.),
                               q75 = ~quantile(., 0.75),
                               max = ~max(.),
                               mean = ~mean(.),
                               sd = ~sd(.),
                               variance = ~var(.))))

# Print summary statistics
print(summary_stats)
##   Fire_min Fire_q25 Fire_median Fire_q75 Fire_max Fire_mean  Fire_sd
## 1        2     5.65        10.4    16.05     39.7  12.27872 9.302266
##   Fire_variance Theft_min Theft_q25 Theft_median Theft_q75 Theft_max Theft_mean
## 1      86.53215         3        22           29        38        75   30.23404
##   Theft_sd Theft_variance Age_min Age_q25 Age_median Age_q75 Age_max Age_mean
## 1 14.53364       211.2266       2    48.6         65    77.3    90.1 60.32766
##     Age_sd Age_variance Income_min Income_q25 Income_median Income_q75
## 1 22.57496      509.629       5583       8447         10694      11989
##   Income_max Income_mean Income_sd Income_variance Race_min Race_q25
## 1      21480    10695.83  2754.198         7585607        1     3.75
##   Race_median Race_q75 Race_max Race_mean  Race_sd Race_variance
## 1        24.5    57.65     99.7  34.98511 32.58761      1061.953
# Scatter plot matrix using ggpairs()
scatter_plot_matrix <- ggpairs(mydata %>% select(Fire, Theft, Age, Income, Race))
print(scatter_plot_matrix)

From the matrix:

1. There is a strong negative correlation between Income and Race: as income increases, the minority percentage decreases.

2. Fire incidents have a moderate positive correlation with Theft, suggesting that areas with more fires also tend to have more thefts.

3. Age shows little to no correlation with Theft and Income, indicating these variables do not change together in any specific linear pattern.
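
For reference, the numeric correlation matrix behind these observations can be pulled directly:

# Correlation matrix of the five predictors, rounded for readability
round(cor(mydata[, c("Fire", "Theft", "Age", "Income", "Race")]), 2)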

  B. The following code produces three key results: the eigenvectors, the eigenvalues, and a scree plot from a PCA that uses the covariance matrix rather than the correlation matrix. Note this is not typically what is done. What is the scree plot telling us here? Comment on the “weirdness” of the eigenvector for the first PC. Any thoughts on why this occurred?
pc.result<-prcomp(mydata[,2:6],scale.=FALSE)

#Eigen Vectors
pc.result$rotation
##                  PC1          PC2         PC3          PC4          PC5
## Fire    0.0020617673 -0.097740929 0.123122260 -0.209894627  0.965001355
## Theft   0.0004721111 -0.243966789 0.485518217 -0.798134174 -0.260257343
## Age     0.0043331964  0.252841469 0.860436215  0.442219199  0.012004758
## Income -0.9999537066 -0.006974434 0.004990079  0.004031319  0.001670201
## Race    0.0083267254 -0.931101211 0.093475068  0.351215055 -0.029859620
#Eigen Values
eigenvals<-pc.result$sdev^2
eigenvals
## [1] 7.586309e+06 5.975363e+02 4.137852e+02 1.157325e+02 3.995145e+01
#Scree plot
par(mfrow=c(1,2))
plot(eigenvals,type="l",main="Scree Plot",ylab="Eigen Values",xlab="PC #")
plot(eigenvals/sum(eigenvals),type="l",main="Scree Plot",ylab="Prop. Var. Explained",xlab="PC #",ylim=c(0,1))
cumulative.prop<-cumsum(eigenvals/sum(eigenvals))
lines(cumulative.prop,lty=2)

ANSWER B:

The PCA was performed on the covariance matrix, which is not standard practice because it does not account for the very different measurement scales of the variables. This is the cause of the “weirdness” in the eigenvector of the first principal component (PC): the first PC is almost entirely driven by the Income variable because its variance on the raw scale dwarfs that of the other variables. This is also evident in the first eigenvalue, which is orders of magnitude larger than the rest and captures essentially all of the variability in the data set.

The scree plot displays the eigenvalues of the principal components and helps determine how many components to retain. The rule of thumb is to look for the “elbow,” the point where the variance explained by additional components starts to level off, indicating that further components contribute little to the explanation of the variance.
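
To put a number on that dominance, the proportion of total variance carried by PC1 can be computed directly from the eigenvalues above:

# Share of the total variance carried by PC1 in the covariance-based (unscaled) PCA
eigenvals[1] / sum(eigenvals)   # about 0.9998, i.e. essentially all of it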

# PCA with scaled data
pc.result.scaled <- prcomp(mydata[, 2:6], scale. = TRUE)

# Eigen Vectors of scaled data
pc.result.scaled$rotation
##               PC1         PC2        PC3        PC4        PC5
## Fire    0.5048229 -0.03090255  0.1890711 -0.8296296  0.1420061
## Theft   0.3090776 -0.83261158  0.2407813  0.2183405 -0.3249320
## Age     0.3985347 -0.19233538 -0.8163274  0.1235444  0.3500337
## Income -0.5050429 -0.45724546  0.1426136 -0.1371460  0.7047769
## Race    0.4855169  0.24441088  0.4685589  0.4795518  0.5049944
# Eigen Values of scaled data
eigenvals.scaled <- pc.result.scaled$sdev^2
eigenvals.scaled
## [1] 2.7586160 0.9496824 0.7537101 0.3865744 0.1514171
# Scree plot for scaled data
par(mfrow = c(1, 2))
plot(eigenvals.scaled, type = "l", main = "Scree Plot with Scaled Data", ylab = "Eigen Values", xlab = "PC #")
plot(eigenvals.scaled / sum(eigenvals.scaled), type = "l", main = "Scree Plot with Scaled Data", ylab = "Prop. Var. Explained", xlab = "PC #", ylim = c(0, 1))
cumulative.prop.scaled <- cumsum(eigenvals.scaled / sum(eigenvals.scaled))
lines(cumulative.prop.scaled, lty = 2)

# Prepare data for regression: combine the first two PCs from the scaled PCA with the
# response. The choice of two PCs matches the summary output shown below.
pcs_to_use <- data.frame(pc.result.scaled$x[, 1:2])
regression_data <- data.frame(pcs_to_use, response = mydata$Vol)

# Fit a linear model using the selected PCs
lm_result_2pc <- lm(response ~ ., data = regression_data)

# Look at the summary of the linear model
summary(lm_result_2pc)

## 
## Call:
## lm(formula = response ~ ., data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8321 -1.1298 -0.1105  0.8892  4.6884 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.7426     0.2636  25.580  < 2e-16 ***
## PC1          -2.2573     0.1604 -14.071  < 2e-16 ***
## PC2          -1.1595     0.2734  -4.241 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.807 on 44 degrees of freedom
## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8231 
## F-statistic:   108 on 2 and 44 DF,  p-value: < 2.2e-16

Analysis

1: Unscaled Data Scree Plot:

The first principal component (PC1) explains the vast majority of the variance, as shown by the dominant eigenvalue in the left plot and the steep rise in cumulative variance explained in the right plot. Subsequent components add very little to the explained variance, which suggests that PC1 dominates because of the scale differences among the variables.

2: Scaled Data Scree Plot:

The eigenvalues for each PC are closer in magnitude, suggesting a more balanced contribution of variance. The proportion of variance explained by each PC increases more gradually, indicating no single variable is dominating the PCA.
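
The cumulative proportions of variance make the contrast with the unscaled run concrete (computed from the scaled eigenvalues printed above):

# Cumulative proportion of variance explained by the scaled PCs
round(cumsum(eigenvals.scaled) / sum(eigenvals.scaled), 3)
# Roughly 55% for PC1 and 74% for PC1 + PC2, versus ~99.98% for PC1 alone when unscaled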

Summary from data:

The summary of the linear regression model provides several important pieces of information:

  1. Coefficients: The model estimated the intercept to be approximately 6.7426, with a standard error of 0.2636. Both principal components (PC1 and PC2) are significantly associated with the response variable (Vol), with p-values less than 0.001, indicating strong evidence against the null hypothesis of no association.

  2. PC1:

    • The coefficient for PC1 is -2.2573, suggesting that as the value of PC1 increases, the expected value of Vol decreases. The negative sign indicates an inverse relationship.
    • This component is highly significant, as indicated by the low p-value (< 2e-16).
  3. PC2:

    • The coefficient for PC2 is -1.1595, also indicating an inverse relationship with Vol.
    • PC2 is also statistically significant, with a p-value of 0.000113.
  4. Residuals: The residuals have a median close to zero, which is good; however, the range suggests there might be some outliers or that the variance is not constant (heteroscedasticity).

  5. R-squared: The Multiple R-squared value is 0.8308, which means that approximately 83.08% of the variability in Vol is explained by the two principal components used in the model. The Adjusted R-squared is 0.8231, slightly lower, adjusted for the number of predictors in the model.

  6. F-statistic: The overall F-statistic is significant (p < 2.2e-16), indicating that the model as a whole is statistically significant. These results suggest that the model fits the data well, with both principal components contributing significantly to explaining the variation in Vol.
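One further way to read these coefficients is to map them back onto the original predictors through the loadings and the standardization constants. The sketch below is illustrative only: it assumes the two-PC fit above (called lm_result_2pc here) and gives the effect of each original predictor implied by that two-PC approximation, ignoring the omitted PCs 3 through 5.

# Implied coefficients on the original predictor scale from the two-PC fit.
# Each PC is a linear combination of the standardized predictors, so the chain rule
# gives d(Vol)/d(x_k) = sum_j b_j * rotation[k, j] / sd(x_k).
b.pc <- coef(lm_result_2pc)[c("PC1", "PC2")]
beta.per.sd   <- as.vector(pc.result.scaled$rotation[, 1:2] %*% b.pc)   # per 1-SD change
beta.per.unit <- beta.per.sd / pc.result.scaled$scale                   # per 1-unit change
data.frame(predictor = rownames(pc.result.scaled$rotation),
           per.SD = beta.per.sd, per.unit = beta.per.unit)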

C. We can extract the new variables created by PCA using the following commands. I am going to append the response variable to the data set for future work.

pc.scores<-pc.result$x
pc.scores<-data.frame(pc.scores,Vol=ex1716$Vol)

Use this new data set and compute the same summary statistics on these new variables. Examine the variances of these new variables. Do they look familiar?

ANSWER C:

# Extract the principal component scores
pc.scores <- pc.result$x

# Create a data frame of the PC scores
pc.scores.df <- data.frame(pc.scores)

# Calculate summary statistics for each principal component
summary_stats_pcs <- summary(pc.scores.df)

# Get the variances (squared standard deviations) for each PC
variances_pcs <- apply(pc.scores.df, 2, var)

# Compare the variances of the PCs to the eigenvalues from the PCA analysis
# This should be equivalent to the square of the standard deviation of the PCs
eigenvalues <- pc.result$sdev^2

# Printing the variances of the PCs and the eigenvalues for comparison
print(variances_pcs)
##          PC1          PC2          PC3          PC4          PC5 
## 7.586309e+06 5.975363e+02 4.137852e+02 1.157325e+02 3.995145e+01
print(eigenvalues)
## [1] 7.586309e+06 5.975363e+02 4.137852e+02 1.157325e+02 3.995145e+01

The output confirms that the variances of the new principal component scores (pc.scores.df) match the eigenvalues from the PCA analysis (eigenvalues). This is expected and correct, as the variance of each principal component in the transformed space should equal the corresponding eigenvalue from the covariance matrix of the original dataset.

In detail:

  • PC1 has a very large variance (eigenvalue of approximately 7.59 million), indicating that it captures the vast majority of the variability in the data.
  • PC2 through PC5 have much smaller eigenvalues, indicating that they capture progressively less variability in the dataset.

Given these variances (eigenvalues), PC1 appears to be the most significant factor in terms of explaining the variance in the dataset. However, as seen from the scree plot with the scaled data, the first two PCs together explain a substantial portion of the variance (over 70%).

Here’s what we could do next with this information:

  • Interpret the PCs: Examine the loadings (the weights used to construct each PC from the original variables) to interpret what each principal component represents in terms of the original variables.
  • Use PCs in Further Analysis: Depending on our goals, we might use only the first few PCs as predictors in a regression model to predict Vol or for other analyses like clustering.

The exact steps we would take would depend on the objectives of the data analysis. If the aim is to build a predictive model, we would typically proceed with the regression analysis using the PCs as we discussed before. If we are more interested in data exploration, we might further investigate the loadings to understand the structure and relationships in the data.
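
As a concrete example of interpreting the PCs through their loadings (the first bullet above), the loadings of the first PC from the scaled analysis can be pulled and ordered by size:

# Loadings of the first scaled PC, ordered by absolute magnitude
l1 <- pc.result.scaled$rotation[, "PC1"]
round(l1[order(abs(l1), decreasing = TRUE)], 3)

All five variables load with similar magnitudes, with Income opposite in sign to the rest, which is consistent with the rotation matrix printed earlier.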

  D. Create a scatterplot matrix for these new variables to visualize that they are uncorrelated.

ANSWER D:

library(GGally)
ggpairs(pc.scores.df[, 1:5])

# We should see that the off-diagonal plots show no trend or correlation, confirming the orthogonality of the principal components.

The scatterplot matrix illustrates the relationships between the principal components after performing PCA. As expected from the properties of PCA, the scatterplots between different principal components show no apparent correlation, which is confirmed by the correlation coefficients (Corr:) being zero or very close to zero. This is precisely what we expect because PCA is designed to create orthogonal components that capture the variance in the dataset without redundancy.

The histograms on the diagonal show the distribution of each principal component. Since PCA also orders the components by the amount of variance they capture from the data, you should typically see these distributions decrease in spread from PC1 to PC5.

Given that the correlation coefficients between the principal components are effectively zero, this confirms that the PCA has been conducted correctly and that the principal components are indeed orthogonal (uncorrelated with one another). This orthogonality is a crucial characteristic of PCA that makes the components useful for applications like dimensionality reduction, feature extraction, and data visualization.

This orthogonality also means that these components can be used as independent predictors in regression models without concerns about multicollinearity, which can be a problem when using the original, possibly correlated predictors.
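
A quick numeric check of that claim (the same information as the scatterplot matrix, just the correlations to more decimal places):

# Pairwise correlations among the PC scores; off-diagonal entries should be zero
# up to numerical rounding error
round(cor(pc.scores.df), 10)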

  E. Run the PCA again, this time setting the scale option to TRUE. Compare the eigenvectors, eigenvalues, and scree plot to those from the first run. How does the interpretation of the first principal component change?

ANSWER E:

# Combine the scaled PC scores with the response variable Vol from mydata
pc.scores <- data.frame(pc.result.scaled$x)
regression_data <- cbind(pc.scores, Vol = mydata$Vol)

# Run a linear regression using PCs as predictors
lm_result <- lm(Vol ~ ., data = regression_data)
summary(lm_result)
## 
## Call:
## lm(formula = Vol ~ ., data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3771 -1.0478 -0.0123  0.8525  3.9422 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.7426     0.2478  27.208  < 2e-16 ***
## PC1          -2.2573     0.1508 -14.967  < 2e-16 ***
## PC2          -1.1595     0.2570  -4.511 5.32e-05 ***
## PC3           0.4958     0.2885   1.718   0.0933 .  
## PC4          -0.8393     0.4029  -2.083   0.0435 *  
## PC5           0.7850     0.6437   1.219   0.2297    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.699 on 41 degrees of freedom
## Multiple R-squared:  0.8606, Adjusted R-squared:  0.8436 
## F-statistic: 50.63 on 5 and 41 DF,  p-value: < 2.2e-16

The summary of the linear regression model with principal components as predictors shows some insightful results. Here's how we can interpret them:

Coefficients:

  • Intercept: The average value of Vol when all PCs are zero is approximately 6.7426, with a very small p-value, indicating this estimate is significantly different from zero.
  • PC1: A one-unit increase in PC1 is associated with a decrease of about 2.2573 in Vol, which is highly significant (p < 2e-16).
  • PC2: A one-unit increase in PC2 is associated with a decrease of about 1.1595 in Vol, which is also significant (p-value 5.32e-05).
  • PC3: The association between PC3 and Vol is positive but not quite significant at the 0.05 level (p-value 0.0933), suggesting it is a less reliable predictor in this context.
  • PC4: There is a significant (p-value 0.0435) negative relationship with Vol, but it is less impactful than PC1 or PC2.
  • PC5: This component is not statistically significant (p-value 0.2297), indicating it may not be a useful predictor of Vol.

Model Diagnostics:

  • Residuals: The median is close to zero, which is good, but the spread between the minimum and maximum residuals is worth exploring for potential outliers or heteroscedasticity.
  • Multiple R-squared: About 86.06% of the variability in Vol is explained by the model, which is quite high and suggests a good fit.
  • Adjusted R-squared: Slightly lower at 84.36% after adjusting for the number of predictors, but still indicative of a good model fit.
  • F-statistic: The overall model is statistically significant with a very low p-value, meaning the model fits the data better than a model with no predictors.

These results indicate that the model with the principal components as predictors is explaining a significant amount of the variance in Vol. It’s clear that PC1 and PC2 are particularly important in this regression model.

Before finalizing the model, we would typically look at diagnostic plots to check for any assumptions that might have been violated. This includes checking for normality of residuals, constant variance of residuals (homoscedasticity), and influential points that might unduly affect the model.

# Diagnostic Plots
par(mfrow = c(2, 2)) # setting the plotting area into a 2x2 layout
plot(lm_result) # this will generate four diagnostic plots

Analysis of Plots:

The diagnostic plots give insight into how well the linear regression assumptions are met:

  1. Residuals vs Fitted:
    • This plot shows no clear patterns, which is a good sign that the relationship might be linear. However, there are some outliers visible, such as points 24 and 46, which could potentially be influential.
  2. Normal Q-Q:
    • Most points lie along the line, but there are deviations at the tails, suggesting that the residuals might not be perfectly normally distributed. Points 24, 46, and others far from the line might be outliers or influential points.
  3. Scale-Location (Spread-Location):
    • The red line is not perfectly horizontal, and there are variations in spread along the range of fitted values, indicating potential heteroscedasticity.
  4. Residuals vs Leverage:
    • There are a few points outside the Cook’s distance lines, notably points 24, 46, and a few others, indicating they may be influential points that could unduly affect the regression model.

Next Steps:

  • Investigate Outliers: The outliers identified could be impacting the model. We may want to investigate these points further to determine if they should be included or if they are the result of data entry errors or other issues.

  • Addressing Heteroscedasticity: If heteroscedasticity is present, we might consider transformations such as log, square root, or others, or use a different type of regression model like weighted least squares.

  • Testing Normality: If the normality assumption is violated, we could use non-parametric methods, or again, consider data transformation.

  • Influential Points: Investigate points with high leverage or large residuals to determine if they are valid observations or if they should be excluded or down-weighted in some way.

  • Model Refinement: Based on these diagnostics, we may need to refine the model. This might involve removing or transforming variables, adding interaction terms, or using a robust regression technique.

Each of these steps requires careful consideration of the data and the context of the analysis. It’s also crucial to document and justify any decisions we make regarding the handling of outliers or influential points.
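
Two of the follow-ups listed above can be sketched briefly. This is illustrative only, assuming the five-PC fit lm_result and regression_data from above; the square-root refit is just one of the transformation options mentioned, not a recommendation.

# (a) Influence: the largest Cook's distances, to follow up on the points flagged
#     in the Residuals vs Leverage plot
head(sort(cooks.distance(lm_result), decreasing = TRUE), 3)

# (b) A variance-stabilizing refit on the square-root scale (formula written out
#     explicitly so there is no ambiguity about what '.' would expand to)
lm_sqrt <- lm(sqrt(Vol) ~ PC1 + PC2 + PC3 + PC4 + PC5, data = regression_data)
summary(lm_sqrt)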

During the live session, we will investigate a regression model using the original variables and a regression model using the PCs to further the discussion.
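
As a head start on that comparison, here is a minimal sketch using the original predictors (assuming mydata as loaded above). Because the five scaled PCs are just a rotation of the standardized predictors, the full five-PC model and this model should give the same fitted values and R-squared; the interest is in how the coefficients differ in interpretability and collinearity.

# Regression on the original predictors, for comparison with the PC-based fit
lm_original <- lm(Vol ~ Fire + Theft + Age + Income + Race, data = mydata)
summary(lm_original)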

Discussion 2: Image Compression

The following list of commands demonstrates the image compression example discussed in the videos. There is nothing due for this section; it is just for curious students.

  1. Load the jpeg package to read in a JPEG file; it also contains a function to write objects back to JPEG. Note that the object is just an array of three matrices: one each for the red, green, and blue channels. Each matrix is 516 rows by 470 columns.
library(jpeg)
#library(magick)
photo<-readJPEG("Luna.jpg")
dim(photo)
## [1] 516 470   3
attributes(photo)
## $dim
## [1] 516 470   3
  2. The next step is to perform data reduction using PCA on each color channel.
pcaR<-prcomp(photo[,,1],scale.=T)
pcaG<-prcomp(photo[,,2],scale.=T)
pcaB<-prcomp(photo[,,3],scale.=T)
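Before reconstructing the image, it can be worth checking how much of each channel's variance a small number of PCs retains; a minimal sketch (20 PCs anticipates the choice made below):

# Proportion of variance captured by the first 20 PCs in each color channel
sapply(list(R = pcaR, G = pcaG, B = pcaB),
       function(p) sum(p$sdev[1:20]^2) / sum(p$sdev^2))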
  3. Load the function Dr. Turner provided that performs the matrix multiplication to go from the PCs back to the scale of the original variables.
#Function here always assumes you are centering and scaling variables
pc.approx<-function(pcres,num){
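  # Reconstruct X from the first 'num' PCs: X.approx = (scores %*% t(rotation)) * scale + center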
  n<-nrow(pcres$x)
  #X.approx<-(pcres$x[,1:num]%*%t(pcres$rotation[,1:num])+rep(pcres$center,each=n))/rep(pcres$scale,each=n)
  X.approx<-rep(pcres$scale,each=n)*(pcres$x[,1:num]%*%t(pcres$rotation[,1:num]))+rep(pcres$center,each=n)
  names(X.approx)<-names(pcres$center)
  X.approx
}
  4. Store the PCA results in a list and apply the transformation function using just 20 PCs. The image is then written to your working directory. You can change the number of PCs to see how the image changes.
pcaRGB<-list(pcaR,pcaG,pcaB)

compImage<-sapply(pcaRGB,pc.approx,num=20,simplify='array')

writeJPEG(compImage,"Luna_PC20.jpg")
#File will be much smaller than original

Summary of discussion 2:

Image Compression with PCA

  • Reading the Image: We start by loading an image into R using the jpeg package, which represents the image as an array. This array is made up of three matrices corresponding to the red, green, and blue color channels of the image.

  • Applying PCA: PCA is performed separately on each of the three color matrices. By doing this, we reduce the dimensionality of each color channel, identifying the principal components that capture the most variance in the image’s colors.

  • Reconstructing the Image: After PCA, we use a custom function to approximate the original image from a reduced number of principal components. This function reconstructs the original data (scaled color channel values) by performing matrix multiplication with the selected principal components.

  • Compression and Saving: The image is compressed by keeping only the 20 most significant principal components for each color channel. The result is a new image, saved back to a file, that requires less storage space than the original.

  • Exploration: Students are encouraged to experiment with the number of principal components to see how it affects the image quality and file size. This exploration can help understand the trade-off between compression and image quality.
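
Building on that suggestion, here is a minimal sketch of such an experiment (file names are illustrative; it assumes pcaRGB and pc.approx from above):

# Write out reconstructions with different numbers of PCs to compare quality and file size
for (k in c(5, 20, 50)) {
  approx.k <- sapply(pcaRGB, pc.approx, num = k, simplify = "array")
  approx.k <- pmin(pmax(approx.k, 0), 1)   # clamp to the [0, 1] range writeJPEG expects
  writeJPEG(approx.k, paste0("Luna_PC", k, ".jpg"))
}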

This exercise provides a hands-on example of how PCA can be used for practical applications like image compression, beyond the usual statistical analysis scenarios.