Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis and then create a regression model using the first few principal components. Specify your new model in terms of the original variables (not the principal components), and compare its quality to that of your solution to Question 8.2. You can use the R function prcomp for PCA. (Note that to first scale the data, you can include scale. = TRUE to scale as part of the PCA function. Don’t forget that, to make a prediction for the new city, you’ll need to unscale the coefficients (i.e., do the scaling calculation in reverse)!)
ANSWER: For this question, we conducted Principlal Component Analysis on a set of data that we scale and then extract the first 4 principal components based on explaining 80% of the variance as well as a scree plot that we generate. We then use these components as predictors in an lm() model, and then we unscale the coefficients back to the original data’s scale and predict crime rates. We perform R-sq and adjusted R-sq analysis and we then take the test data from the previous week’s homework and use it to predicrt crime rates.
CODE:
First we’ll clear our environment
rm(list = ls())
Then we’ll set the seed for reproducibility and load our libraries
set.seed(123)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We’ll reimport the crime data from HW5 Q 8.2 for this question
crime_data <- read.table("/Users/djmariano/Downloads/hw6-SP22/uscrime.txt", header = T, stringsAsFactors = F)
summary(crime_data)
## M So Ed Po1
## Min. :11.90 Min. :0.0000 Min. : 8.70 Min. : 4.50
## 1st Qu.:13.00 1st Qu.:0.0000 1st Qu.: 9.75 1st Qu.: 6.25
## Median :13.60 Median :0.0000 Median :10.80 Median : 7.80
## Mean :13.86 Mean :0.3404 Mean :10.56 Mean : 8.50
## 3rd Qu.:14.60 3rd Qu.:1.0000 3rd Qu.:11.45 3rd Qu.:10.45
## Max. :17.70 Max. :1.0000 Max. :12.20 Max. :16.60
## Po2 LF M.F Pop
## Min. : 4.100 Min. :0.4800 Min. : 93.40 Min. : 3.00
## 1st Qu.: 5.850 1st Qu.:0.5305 1st Qu.: 96.45 1st Qu.: 10.00
## Median : 7.300 Median :0.5600 Median : 97.70 Median : 25.00
## Mean : 8.023 Mean :0.5612 Mean : 98.30 Mean : 36.62
## 3rd Qu.: 9.700 3rd Qu.:0.5930 3rd Qu.: 99.20 3rd Qu.: 41.50
## Max. :15.700 Max. :0.6410 Max. :107.10 Max. :168.00
## NW U1 U2 Wealth
## Min. : 0.20 Min. :0.07000 Min. :2.000 Min. :2880
## 1st Qu.: 2.40 1st Qu.:0.08050 1st Qu.:2.750 1st Qu.:4595
## Median : 7.60 Median :0.09200 Median :3.400 Median :5370
## Mean :10.11 Mean :0.09547 Mean :3.398 Mean :5254
## 3rd Qu.:13.25 3rd Qu.:0.10400 3rd Qu.:3.850 3rd Qu.:5915
## Max. :42.30 Max. :0.14200 Max. :5.800 Max. :6890
## Ineq Prob Time Crime
## Min. :12.60 Min. :0.00690 Min. :12.20 Min. : 342.0
## 1st Qu.:16.55 1st Qu.:0.03270 1st Qu.:21.60 1st Qu.: 658.5
## Median :17.60 Median :0.04210 Median :25.80 Median : 831.0
## Mean :19.40 Mean :0.04709 Mean :26.60 Mean : 905.1
## 3rd Qu.:22.75 3rd Qu.:0.05445 3rd Qu.:30.45 3rd Qu.:1057.5
## Max. :27.60 Max. :0.11980 Max. :44.00 Max. :1993.0
Now we will perform Principal Component Analysis (PCA) on the first 15 columns of our crime dataset, and we will scale all the variables before our analysis.
crimedata_pca <- prcomp(crime_data[,1:15], scale = T)
Let’s view a summary of the PCA results, specifically the variance of each component.
summary(crimedata_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4534 1.6739 1.4160 1.07806 0.97893 0.74377 0.56729
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688 0.02145
## Cumulative Proportion 0.4013 0.5880 0.7217 0.79920 0.86308 0.89996 0.92142
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55444 0.48493 0.44708 0.41915 0.35804 0.26333 0.2418
## Proportion of Variance 0.02049 0.01568 0.01333 0.01171 0.00855 0.00462 0.0039
## Cumulative Proportion 0.94191 0.95759 0.97091 0.98263 0.99117 0.99579 0.9997
## PC15
## Standard deviation 0.06793
## Proportion of Variance 0.00031
## Cumulative Proportion 1.00000
We’ll also do a scree plot to also help us determine how many components we should use.
screeplot(crimedata_pca, main = "pca components", type = "line")
According to the scree plot and the ordered importance of the components
in the summary, we’ll choose the first 4 components since their
eigenvalue > 1 and they account for 80% of the variance
We will now use these 4 components to build our model.
components <- crimedata_pca$x[,1:4] #Here we extract the first 4 PCs from the PCA
crimedata_pc <- cbind(components, crime_data[,16]) #We combine the first 4 components columns and the 16th Crime data column.
crimedata_components <- data.frame(crimedata_pc) #Here we turn the components into a dataframe before performing lm()
Here we’ll now perform linear regression with the 5 PCs as predictors and the Crime column as the response.
lm_pca <- lm(V5 ~ ., data = crimedata_components)
Let’s view a summary of our regression model and check out the coefficients and their significance levels.
summary(lm_pca)
##
## Call:
## lm(formula = V5 ~ ., data = crimedata_components)
##
## Residuals:
## Min 1Q Median 3Q Max
## -557.76 -210.91 -29.08 197.26 810.35
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.09 49.07 18.443 < 2e-16 ***
## PC1 65.22 20.22 3.225 0.00244 **
## PC2 -70.08 29.63 -2.365 0.02273 *
## PC3 25.19 35.03 0.719 0.47602
## PC4 69.45 46.01 1.509 0.13872
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 336.4 on 42 degrees of freedom
## Multiple R-squared: 0.3091, Adjusted R-squared: 0.2433
## F-statistic: 4.698 on 4 and 42 DF, p-value: 0.003178
lm_pca$coefficients
## (Intercept) PC1 PC2 PC3 PC4
## 905.08511 65.21593 -70.08312 25.19408 69.44603
Now we’ll perform the transformation. We will need to keep in mind that our data has been scaled and therefore properly handle unscaling it.
int <- lm_pca$coefficients[1] #Store the intercept
beta <- lm_pca$coefficients[2:(4+1)] #Extract coefficients
Now, we’ll multiply the PCA rotation matrix (which contains the loadings of the original variables on each principal component) by the regression coefficients to calculate the loadings of the original variables and transform the regression coefficients back to the scale of the original variables by calculating the mu and the std of the first 15 columns in the crime dataset.
alpha <- crimedata_pca$rotation[,1:4] %*% beta
mu <- sapply(crime_data[,1:15], mean)
sigma <- sapply(crime_data[,1:15], sd)
ogalpha <- alpha / sigma
ogbeta <- int - sum(alpha*mu / sigma)
We compute the estimated values of the response variable using the original data and the adjusted model parameters.
est <- as.matrix(crime_data[,1:15]) %*% ogalpha + ogbeta
Let’s calculate the models SSE and total sum of squares.
SSE = sum((est - crime_data[,16])^2)
SS_sum = sum((crime_data[,16] - mean(crime_data[,16]))^2)
And now let’s calculate the R-squared and adjusted R-squared values.
Rsq <- 1 - SSE/SS_sum
Rsq
## [1] 0.3091121
Rsq2 <- Rsq - (1-Rsq)*4/(nrow(crime_data)-4-1)
Rsq2
## [1] 0.2433132
We now compare to the given city data in HW5 Q8.2 and compare to see what the new crime rate is for this model.
city_data <- data.frame(M= 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.040, Time = 39.0)
pred_city <- data.frame(predict(crimedata_pca, city_data))
pred_city_model <- predict(lm_pca, pred_city)
pred_city_model
## 1
## 1112.678
As we can see, the crime rate that this model calculates for the city_data still lies within a reasonable range of our the overall crime data. Our answer last week was similar (approx 1300~) so we believe this model can also be useful to help predict crime rates.