For this discussion we will be using a data set from a study investigating potential civil rights violations by insurance companies operating in Chicago, Illinois. A description of the data set, along with a description of each variable, is available in the package documentation. Please take the time to read through the description before continuing.
library(Sleuth3)
mydata<-ex1716
head(mydata)
## ZIP Fire Theft Age Income Race Vol Invol
## 1 26 6.2 29 60.4 11744 10.0 5.3 0.0
## 2 40 9.5 44 76.5 9323 22.2 3.1 0.1
## 3 13 10.5 36 73.5 9948 19.6 4.8 1.2
## 4 57 7.7 37 66.9 10656 17.3 5.7 0.5
## 5 14 8.6 53 81.4 9730 24.5 5.9 0.7
## 6 10 34.1 68 52.6 8231 54.0 4.0 0.3
names(mydata)
## [1] "ZIP" "Fire" "Theft" "Age" "Income" "Race" "Vol" "Invol"
#?ex1716 Run this for description.
For this exercise we will treat the Vol variable as the response because it indicates the number of new policies provided per 100 housing units. These new policies were argued to favor certain locations within Chicago while not being offered in others. The Age, Income, Race, Theft, and Fire variables will be considered as potential predictors of the voluntary new policies awarded by insurance companies. We will ignore the Invol variable.
In a regression or classification setting, PCA would be applied to the predictors to help reduce the number of predictors used. Let's run through a few computations and answer some basic questions about our understanding of PCA.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
mysummary <- mydata %>%
select(Fire, Theft, Age, Income,Race) %>% # select variables to summarise
summarise_each(funs(min = min,
q25 = quantile(., 0.25),
median = median,
q75 = quantile(., 0.75),
max = max,
mean = mean,
sd = sd,
variance= var))
## Warning: `summarise_each()` was deprecated in dplyr 0.7.0.
## ℹ Please use `across()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# reshape it using tidyr functions
clean.summary <- mysummary %>% gather(stat, val) %>%
separate(stat, into = c("var", "stat"), sep = "_") %>%
spread(stat, val) %>%
select(var, min, max, mean, sd,variance) # reorder columns
print(clean.summary)
## var min max mean sd variance
## 1 Age 2 90.1 60.32766 22.574964 5.096290e+02
## 2 Fire 2 39.7 12.27872 9.302266 8.653215e+01
## 3 Income 5583 21480.0 10695.82979 2754.198008 7.585607e+06
## 4 Race 1 99.7 34.98511 32.587614 1.061953e+03
## 5 Theft 3 75.0 30.23404 14.533638 2.112266e+02
library(dplyr)
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Summary statistics using across()
summary_stats <- mydata %>%
select(Fire, Theft, Age, Income, Race) %>%
summarise(across(.cols = everything(),
.fns = list(min = ~min(.),
q25 = ~quantile(., 0.25),
median = ~median(.),
q75 = ~quantile(., 0.75),
max = ~max(.),
mean = ~mean(.),
sd = ~sd(.),
variance = ~var(.))))
# Print summary statistics
print(summary_stats)
## Fire_min Fire_q25 Fire_median Fire_q75 Fire_max Fire_mean Fire_sd
## 1 2 5.65 10.4 16.05 39.7 12.27872 9.302266
## Fire_variance Theft_min Theft_q25 Theft_median Theft_q75 Theft_max Theft_mean
## 1 86.53215 3 22 29 38 75 30.23404
## Theft_sd Theft_variance Age_min Age_q25 Age_median Age_q75 Age_max Age_mean
## 1 14.53364 211.2266 2 48.6 65 77.3 90.1 60.32766
## Age_sd Age_variance Income_min Income_q25 Income_median Income_q75
## 1 22.57496 509.629 5583 8447 10694 11989
## Income_max Income_mean Income_sd Income_variance Race_min Race_q25
## 1 21480 10695.83 2754.198 7585607 1 3.75
## Race_median Race_q75 Race_max Race_mean Race_sd Race_variance
## 1 24.5 57.65 99.7 34.98511 32.58761 1061.953
# Scatter plot matrix using ggpairs()
scatter_plot_matrix <- ggpairs(mydata %>% select(Fire, Theft, Age, Income, Race))
print(scatter_plot_matrix)
1. There is a strong negative correlation between Income and Race, meaning that as income increases, the percentage of minority residents (Race) decreases.
2. Fire incidents have a moderate positive correlation with Theft, suggesting that areas with more fires also tend to have more thefts.
3. Age shows little to no correlation with Theft and Income, indicating these variables do not change together in any specific linear pattern.
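These observations can be checked numerically by printing the correlation matrix for the same five predictors; a minimal sketch using base R's cor():
# Pairwise correlations among the five candidate predictors
round(cor(mydata[, c("Fire", "Theft", "Age", "Income", "Race")]), 2)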
pc.result<-prcomp(mydata[,2:6],scale.=FALSE)
#Eigen Vectors
pc.result$rotation
## PC1 PC2 PC3 PC4 PC5
## Fire 0.0020617673 -0.097740929 0.123122260 -0.209894627 0.965001355
## Theft 0.0004721111 -0.243966789 0.485518217 -0.798134174 -0.260257343
## Age 0.0043331964 0.252841469 0.860436215 0.442219199 0.012004758
## Income -0.9999537066 -0.006974434 0.004990079 0.004031319 0.001670201
## Race 0.0083267254 -0.931101211 0.093475068 0.351215055 -0.029859620
#Eigen Values
eigenvals<-pc.result$sdev^2
eigenvals
## [1] 7.586309e+06 5.975363e+02 4.137852e+02 1.157325e+02 3.995145e+01
#Scree plot
par(mfrow=c(1,2))
plot(eigenvals,type="l",main="Scree Plot",ylab="Eigen Values",xlab="PC #")
plot(eigenvals/sum(eigenvals),type="l",main="Scree Plot",ylab="Prop. Var. Explained",xlab="PC #",ylim=c(0,1))
cumulative.prop<-cumsum(eigenvals/sum(eigenvals))
lines(cumulative.prop,lty=2)
The PCA analysis was performed using the covariance matrix, which is not the standard practice as it does not account for the different scales of measurement of the variables. This is likely the cause of the “weirdness” observed in the loadings (eigenvector) of the first principal component (PC). The first PC is dominated by the Income variable because of its much larger scale compared to the other variables. This is evident in the fact that the first eigenvalue is much larger than the others, capturing most of the variability in the data set.
The scree plot displays the eigenvalues of the principal components, which helps in determining how many components to retain. The rule of thumb is to look for the “elbow,” which usually marks where the variance explained by the components starts to level off, indicating that additional components contribute little extra explanation of the variance.
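A simple numeric complement to the elbow rule is to ask how many components are needed to reach a chosen cumulative-variance cutoff; a minimal sketch using the cumulative.prop vector computed above (the 90% threshold is only an example):
# Cumulative proportion of variance explained by the unscaled PCA
round(cumulative.prop, 4)
# Smallest number of components that reaches 90% of the total variance
which(cumulative.prop >= 0.90)[1]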
# PCA with scaled data
pc.result.scaled <- prcomp(mydata[, 2:6], scale. = TRUE)
# Eigen Vectors of scaled data
pc.result.scaled$rotation
## PC1 PC2 PC3 PC4 PC5
## Fire 0.5048229 -0.03090255 0.1890711 -0.8296296 0.1420061
## Theft 0.3090776 -0.83261158 0.2407813 0.2183405 -0.3249320
## Age 0.3985347 -0.19233538 -0.8163274 0.1235444 0.3500337
## Income -0.5050429 -0.45724546 0.1426136 -0.1371460 0.7047769
## Race 0.4855169 0.24441088 0.4685589 0.4795518 0.5049944
# Eigen Values of scaled data
eigenvals.scaled <- pc.result.scaled$sdev^2
eigenvals.scaled
## [1] 2.7586160 0.9496824 0.7537101 0.3865744 0.1514171
# Scree plot for scaled data
par(mfrow = c(1, 2))
plot(eigenvals.scaled, type = "l", main = "Scree Plot with Scaled Data", ylab = "Eigen Values", xlab = "PC #")
plot(eigenvals.scaled / sum(eigenvals.scaled), type = "l", main = "Scree Plot with Scaled Data", ylab = "Prop. Var. Explained", xlab = "PC #", ylim = c(0, 1))
cumulative.prop.scaled <- cumsum(eigenvals.scaled / sum(eigenvals.scaled))
lines(cumulative.prop.scaled, lty = 2)
# Prepare data for regression by combining the selected PCs and the response into a new data frame
# pcs_to_use <- pc.result.scaled$x[, 1:2]  # e.g., keep the first two scaled PCs
# regression_data <- data.frame(pcs_to_use, response = mydata$Vol)
# Fit a linear model using the selected PCs
# lm_result <- lm(response ~ ., data = regression_data)
# Look at the summary of the linear model
# summary(lm_result)
Call:
lm(formula = response ~ ., data = regression_data)

Residuals:
    Min      1Q  Median      3Q     Max
-3.8321 -1.1298 -0.1105  0.8892  4.6884

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.7426     0.2636  25.580  < 2e-16 ***
PC1          -2.2573     0.1604 -14.071  < 2e-16 ***
PC2          -1.1595     0.2734  -4.241 0.000113 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.807 on 44 degrees of freedom
Multiple R-squared:  0.8308, Adjusted R-squared:  0.8231
F-statistic:   108 on 2 and 44 DF,  p-value: < 2.2e-16
1. Unscaled data scree plot: The first principal component (PC1) explains the vast majority of the variance, indicated by its eigenvalue on the left plot and the steep rise in cumulative variance explained on the right plot. Subsequent components add very little to the explained variance, which suggests that PC1 dominates due to the scale differences among the variables.
2. Scaled data scree plot: The eigenvalues are much closer in magnitude, suggesting a more balanced contribution of variance. The cumulative proportion of variance explained rises more gradually, indicating that no single variable dominates the PCA (compare the proportions printed below).
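The contrast between the two fits can be made explicit by comparing the proportion of variance each component explains; a short sketch using the eigenvalue vectors computed above:
# Proportion of variance explained: unscaled vs. scaled PCA
round(eigenvals / sum(eigenvals), 4)
round(eigenvals.scaled / sum(eigenvals.scaled), 4)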
The summary of the linear regression model provides several important pieces of information:
Coefficients: The model estimated the intercept to be approximately 6.7426, with a standard error of 0.2636. Both principal components (PC1 and PC2) are significantly associated with the response variable (Vol), with p-values less than 0.001, indicating strong evidence against the null hypothesis of no association.
PC1: For each one-unit increase in PC1, Vol decreases by about 2.26; the negative sign indicates an inverse relationship.
PC2: PC2 also has a negative relationship with Vol, with Vol decreasing by about 1.16 for each one-unit increase.
Residuals: The residuals have a median close to zero, which is good; however, the range suggests there might be some outliers or that the variance is not constant (heteroscedasticity).
R-squared: The Multiple R-squared value is 0.8308, which means that approximately 83.08% of the variability in Vol is explained by the two principal components used in the model. The Adjusted R-squared is slightly lower at 0.8231, adjusted for the number of predictors in the model.
F-statistic: The overall F-statistic is significant (p < 2.2e-16), indicating that the model as a whole is statistically significant. These results suggest that the model fits the data well, with both principal components contributing significantly to explaining the variation in Vol.
pc.scores<-pc.result$x
pc.scores<-data.frame(pc.scores,Vol=ex1716$Vol)
Use this new data set and compute the same summary statistics on these new variables. Examine the variances of these new variables. Do they look familiar?
# Extract the principal component scores
pc.scores <- pc.result$x
# Create a data frame of the PC scores
pc.scores.df <- data.frame(pc.scores)
# Calculate summary statistics for each principal component
summary_stats_pcs <- summary(pc.scores.df)
# Get the variances (squared standard deviations) for each PC
variances_pcs <- apply(pc.scores.df, 2, var)
# Compare the variances of the PCs to the eigenvalues from the PCA analysis
# This should be equivalent to the square of the standard deviation of the PCs
eigenvalues <- pc.result$sdev^2
# Printing the variances of the PCs and the eigenvalues for comparison
print(variances_pcs)
## PC1 PC2 PC3 PC4 PC5
## 7.586309e+06 5.975363e+02 4.137852e+02 1.157325e+02 3.995145e+01
print(eigenvalues)
## [1] 7.586309e+06 5.975363e+02 4.137852e+02 1.157325e+02 3.995145e+01
The output confirms that the variances of the new principal component scores (pc.scores.df) match the eigenvalues from the PCA analysis (eigenvalues). This is expected and correct, as the variance of each principal component in the transformed space should equal the corresponding eigenvalue from the covariance matrix of the original dataset.
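As a quick programmatic check, the two vectors can be compared directly; a one-line sketch (unname() drops the PC labels so only the numeric values are compared):
# Should return TRUE: each PC's sample variance equals its eigenvalue
all.equal(unname(variances_pcs), eigenvalues)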
In detail:
- PC1 has a very large variance (eigenvalue of approximately 7.59 million), which indicates that it captures a vast amount of the variability within the data.
- PC2 through PC5 have much smaller eigenvalues in comparison, indicating that they capture progressively less variability within the dataset.
Given these variances (eigenvalues), PC1 appears to be the most significant factor in terms of explaining the variance in the dataset. However, as seen from the scree plot with the scaled data, the first two PCs together explain a substantial portion of the variance (over 70%).
Here’s what we could do next with this information:
- Examine the loadings (the weights used to construct each PC from the original variables) to interpret what each principal component represents in terms of the original variables (see the short sketch below).
- Use the principal component scores as predictors in a regression model for Vol, or for other analyses like clustering.
The exact steps we would take would depend on the objectives of the data analysis. If the aim is to build a predictive model, we would typically proceed with the regression analysis using the PCs as we discussed before. If we are more interested in data exploration, we might further investigate the loadings to understand the structure and relationships in the data.
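As an example of the first point, the loadings of the scaled fit can be printed and rounded for easier reading; a minimal sketch:
# Loadings (eigenvectors) of the scaled PCA, rounded for readability
round(pc.result.scaled$rotation, 2)
# Variables ordered by the size of their weight on PC1
sort(abs(pc.result.scaled$rotation[, "PC1"]), decreasing = TRUE)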
library(GGally)
ggpairs(pc.scores.df[, 1:5])
# We should see that the off-diagonal plots do not show any trend or correlation, confirming the orthogonality of the principal components.
The scatterplot matrix illustrates the relationships between the principal components after performing PCA. As expected from the properties of PCA, the scatterplots between different principal components show no apparent correlation, which is confirmed by the correlation coefficients (Corr:) being zero or very close to zero. This is precisely what we expect because PCA is designed to create orthogonal components that capture the variance in the dataset without redundancy.
The histograms on the diagonal show the distribution of each principal component. Since PCA also orders the components by the amount of variance they capture from the data, you should typically see these distributions decrease in spread from PC1 to PC5.
Given that the correlation coefficients between the principal components are effectively zero, this confirms that the PCA has been conducted correctly and that the principal components are indeed orthogonal (independent) of each other. This orthogonality is a crucial characteristic of PCA that makes the components useful for various applications like dimensionality reduction, feature extraction, and data visualization.
This orthogonality also means that these components can be used as independent predictors in regression models without concerns about multicollinearity, which can be a problem when using the original, possibly correlated predictors.
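The same orthogonality can be verified numerically before fitting any model; a one-line sketch:
# Correlations between PC scores should be zero (up to floating-point error)
round(cor(pc.scores.df), 3)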
# response variable 'Vol' is in a variable called 'mydata$Vol'
pc.scores <- data.frame(pc.result.scaled$x)
regression_data <- cbind(pc.scores, Vol = mydata$Vol)
# Run a linear regression using PCs as predictors
lm_result <- lm(Vol ~ ., data = regression_data)
summary(lm_result)
##
## Call:
## lm(formula = Vol ~ ., data = regression_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3771 -1.0478 -0.0123 0.8525 3.9422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7426 0.2478 27.208 < 2e-16 ***
## PC1 -2.2573 0.1508 -14.967 < 2e-16 ***
## PC2 -1.1595 0.2570 -4.511 5.32e-05 ***
## PC3 0.4958 0.2885 1.718 0.0933 .
## PC4 -0.8393 0.4029 -2.083 0.0435 *
## PC5 0.7850 0.6437 1.219 0.2297
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.699 on 41 degrees of freedom
## Multiple R-squared: 0.8606, Adjusted R-squared: 0.8436
## F-statistic: 50.63 on 5 and 41 DF, p-value: < 2.2e-16
The summary of the linear regression model with principal components as predictors shows some insightful results. Here’s how we can interpret them:
Coefficients:
- Intercept: The average value of Vol when all PCs are zero is approximately 6.7426, with a very small p-value, indicating this estimate is significantly different from zero.
- PC1: A one-unit increase in PC1 is associated with a decrease of about 2.2573 in Vol, which is highly significant (p < 2e-16).
- PC2: A one-unit increase in PC2 is associated with a decrease of about 1.1595 in Vol, which is also significant (p-value 5.32e-05).
- PC3: The association between PC3 and Vol is positive but not quite significant at the 0.05 level (p-value 0.0933), suggesting it is a less reliable predictor in this context.
- PC4: There is a significant (p-value 0.0435) negative relationship with Vol, but it is less impactful than PC1 or PC2.
- PC5: This component is not statistically significant (p-value 0.2297), indicating it may not be a useful predictor of Vol.
Model Diagnostics:
- Residuals: The median is close to zero, which is good, but the spread between the minimum and maximum residuals is worth exploring for potential outliers or heteroscedasticity.
- Multiple R-squared: About 86.06% of the variability in Vol is explained by the model, which is quite high, suggesting a good fit.
- Adjusted R-squared: This is slightly lower at 84.36%, adjusting for the number of predictors, but still indicates a good model fit.
- F-statistic: The overall model is statistically significant with a very low p-value, meaning the model fits the data better than a model with no predictors.
These results indicate that the model with the principal components as predictors explains a significant amount of the variance in Vol. It is clear that PC1 and PC2 are particularly important in this regression model.
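One way to check whether the last three components add much beyond PC1 and PC2 is a nested-model comparison; a minimal sketch using the fit above (lm_reduced is just a local name for the smaller model):
# Reduced model keeping only the two clearly significant components
lm_reduced <- lm(Vol ~ PC1 + PC2, data = regression_data)
# Nested-model F test: does adding PC3-PC5 significantly improve the fit?
anova(lm_reduced, lm_result)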
Before finalizing the model, we would typically look at diagnostic plots to check for any assumptions that might have been violated. This includes checking for normality of residuals, constant variance of residuals (homoscedasticity), and influential points that might unduly affect the model.
# Diagnostic Plots
par(mfrow = c(2, 2)) # setting the plotting area into a 2x2 layout
plot(lm_result) # this will generate four diagnostic plots
The diagnostic plots give insight into how well the linear regression assumptions are met:
Investigate Outliers: The outliers identified could be impacting the model. We may want to investigate these points further to determine if they should be included or if they are the result of data entry errors or other issues.
Addressing Heteroscedasticity: If heteroscedasticity is present, we might consider transformations such as log, square root, or others, or use a different type of regression model like weighted least squares.
Testing Normality: If the normality assumption is violated, we could use non-parametric methods, or again, consider data transformation.
Influential Points: Investigate points with high leverage or large residuals to determine if they are valid observations or if they should be excluded or down-weighted in some way.
Model Refinement: Based on these diagnostics, we may need to refine the model. This might involve removing or transforming variables, adding interaction terms, or using a robust regression technique.
Each of these steps requires careful consideration of the data and the context of the analysis. It’s also crucial to document and justify any decisions we make regarding the handling of outliers or influential points.
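For the outlier and influence checks in particular, a couple of standard base-R helpers can flag candidate observations; a minimal sketch (the 4/n cutoff for Cook's distance is a common rule of thumb, not a hard rule):
# Cook's distance: flag observations with unusually high influence
cooks <- cooks.distance(lm_result)
which(cooks > 4 / nrow(regression_data))
# Standardized residuals larger than about 2 in absolute value
which(abs(rstandard(lm_result)) > 2)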
During the live session, we will investigate a regression model using the original variables and a regression model using the PCs to further the discussion.
The following commands demonstrate the image compression example discussed in the videos. There is nothing due for this section; it is just for curious students.
library(jpeg)
#library(magick)
photo<-readJPEG("Luna.jpg")
dim(photo)
## [1] 516 470 3
attributes(photo)
## $dim
## [1] 516 470 3
pcaR<-prcomp(photo[,,1],scale.=T)
pcaG<-prcomp(photo[,,2],scale.=T)
pcaB<-prcomp(photo[,,3],scale.=T)
#Function here always assumes you are centering and scaling variables
pc.approx<-function(pcres,num){
n<-nrow(pcres$x)
#X.approx<-(pcres$x[,1:num]%*%t(pcres$rotation[,1:num])+rep(pcres$center,each=n))/rep(pcres$scale,each=n)
X.approx<-rep(pcres$scale,each=n)*(pcres$x[,1:num]%*%t(pcres$rotation[,1:num]))+rep(pcres$center,each=n)
names(X.approx)<-names(pcres$center)
X.approx
}
pcaRGB<-list(pcaR,pcaG,pcaB)
compImage<-sapply(pcaRGB,pc.approx,num=20,simplify='array')
writeJPEG(compImage,"Luna_PC20.jpg")
#File will be much smaller than original
Reading the Image: We start by loading an image into R using the jpeg package, which represents the image as an array. This array is made up of three matrices corresponding to the red, green, and blue color channels of the image.
Applying PCA: PCA is performed separately on each of the three color matrices. By doing this, we reduce the dimensionality of each color channel, identifying the principal components that capture the most variance in the image’s colors.
Reconstructing the Image: After PCA, we use a custom function to approximate the original image from a reduced number of principal components. This function reconstructs the original data (scaled color channel values) by performing matrix multiplication with the selected principal components.
Compression and Saving: The image is then compressed by keeping only the most significant 20 principal components for each color channel. The result is a new, smaller image that is saved back to a file. This compressed image requires less storage space and will be smaller than the original image file.
Exploration: Students are encouraged to experiment with the number of principal components to see how it affects the image quality and file size. This exploration can help understand the trade-off between compression and image quality.
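As a starting point for that exploration, the sketch below reuses the pc.approx() helper and the pcaRGB list defined above to write out versions of the image for a few different component counts; pmin()/pmax() clamp the reconstructed values to [0, 1] because the approximation can drift slightly outside the valid pixel range:
# Try several numbers of retained components and write one file per setting
for (k in c(5, 20, 100)) {
  approx.img <- sapply(pcaRGB, pc.approx, num = k, simplify = "array")
  approx.img <- pmin(pmax(approx.img, 0), 1)  # keep pixel values in [0, 1]
  writeJPEG(approx.img, paste0("Luna_PC", k, ".jpg"))
}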
This exercise provides a hands-on example of how PCA can be used for practical applications like image compression, beyond the usual statistical analysis scenarios.