HW6

Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis and then create a regression model using the first few principal components. Specify your new model in terms of the original variables (not the principal components), and compare its quality to that of your solution to Question 8.2. You can use the R function prcomp for PCA. (Note that to first scale the data, you can include scale. = TRUE to scale as part of the PCA function. Don’t forget that, to make a prediction for the new city, you’ll need to unscale the coefficients (i.e., do the scaling calculation in reverse)!)

ANSWER: For this question, we conducted Principlal Component Analysis on a set of data that we scale and then extract the first 4 principal components based on explaining 80% of the variance as well as a scree plot that we generate. We then use these components as predictors in an lm() model, and then we unscale the coefficients back to the original data’s scale and predict crime rates. We perform R-sq and adjusted R-sq analysis and we then take the test data from the previous week’s homework and use it to predicrt crime rates.

CODE:

First we’ll clear our environment

rm(list = ls())

Then we’ll set the seed for reproducibility and load our libraries

set.seed(123)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We’ll reimport the crime data from HW5 Q 8.2 for this question

crime_data <- read.table("/Users/djmariano/Downloads/hw6-SP22/uscrime.txt", header = T, stringsAsFactors = F)
summary(crime_data)

##        M               So               Ed             Po1       
##  Min.   :11.90   Min.   :0.0000   Min.   : 8.70   Min.   : 4.50  
##  1st Qu.:13.00   1st Qu.:0.0000   1st Qu.: 9.75   1st Qu.: 6.25  
##  Median :13.60   Median :0.0000   Median :10.80   Median : 7.80  
##  Mean   :13.86   Mean   :0.3404   Mean   :10.56   Mean   : 8.50  
##  3rd Qu.:14.60   3rd Qu.:1.0000   3rd Qu.:11.45   3rd Qu.:10.45  
##  Max.   :17.70   Max.   :1.0000   Max.   :12.20   Max.   :16.60  
##       Po2               LF              M.F              Pop        
##  Min.   : 4.100   Min.   :0.4800   Min.   : 93.40   Min.   :  3.00  
##  1st Qu.: 5.850   1st Qu.:0.5305   1st Qu.: 96.45   1st Qu.: 10.00  
##  Median : 7.300   Median :0.5600   Median : 97.70   Median : 25.00  
##  Mean   : 8.023   Mean   :0.5612   Mean   : 98.30   Mean   : 36.62  
##  3rd Qu.: 9.700   3rd Qu.:0.5930   3rd Qu.: 99.20   3rd Qu.: 41.50  
##  Max.   :15.700   Max.   :0.6410   Max.   :107.10   Max.   :168.00  
##        NW              U1                U2            Wealth    
##  Min.   : 0.20   Min.   :0.07000   Min.   :2.000   Min.   :2880  
##  1st Qu.: 2.40   1st Qu.:0.08050   1st Qu.:2.750   1st Qu.:4595  
##  Median : 7.60   Median :0.09200   Median :3.400   Median :5370  
##  Mean   :10.11   Mean   :0.09547   Mean   :3.398   Mean   :5254  
##  3rd Qu.:13.25   3rd Qu.:0.10400   3rd Qu.:3.850   3rd Qu.:5915  
##  Max.   :42.30   Max.   :0.14200   Max.   :5.800   Max.   :6890  
##       Ineq            Prob              Time           Crime       
##  Min.   :12.60   Min.   :0.00690   Min.   :12.20   Min.   : 342.0  
##  1st Qu.:16.55   1st Qu.:0.03270   1st Qu.:21.60   1st Qu.: 658.5  
##  Median :17.60   Median :0.04210   Median :25.80   Median : 831.0  
##  Mean   :19.40   Mean   :0.04709   Mean   :26.60   Mean   : 905.1  
##  3rd Qu.:22.75   3rd Qu.:0.05445   3rd Qu.:30.45   3rd Qu.:1057.5  
##  Max.   :27.60   Max.   :0.11980   Max.   :44.00   Max.   :1993.0

Now we will perform Principal Component Analysis (PCA) on the first 15 columns of our crime dataset, and we will scale all the variables before our analysis.

crimedata_pca <- prcomp(crime_data[,1:15], scale = T)

Let’s view a summary of the PCA results, specifically the variance of each component.

summary(crimedata_pca)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4534 1.6739 1.4160 1.07806 0.97893 0.74377 0.56729
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688 0.02145
## Cumulative Proportion  0.4013 0.5880 0.7217 0.79920 0.86308 0.89996 0.92142
##                            PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     0.55444 0.48493 0.44708 0.41915 0.35804 0.26333 0.2418
## Proportion of Variance 0.02049 0.01568 0.01333 0.01171 0.00855 0.00462 0.0039
## Cumulative Proportion  0.94191 0.95759 0.97091 0.98263 0.99117 0.99579 0.9997
##                           PC15
## Standard deviation     0.06793
## Proportion of Variance 0.00031
## Cumulative Proportion  1.00000

We’ll also do a scree plot to also help us determine how many components we should use.

screeplot(crimedata_pca, main = "pca components", type = "line")

According to the scree plot and the ordered importance of the components in the summary, we’ll choose the first 4 components since their eigenvalue > 1 and they account for 80% of the variance

We will now use these 4 components to build our model.

components <- crimedata_pca$x[,1:4] #Here we extract the first 4 PCs from the PCA
crimedata_pc <- cbind(components, crime_data[,16]) #We combine the first 4 components columns and the 16th Crime data column.
crimedata_components <- data.frame(crimedata_pc) #Here we turn the components into a dataframe before performing lm()

Here we’ll now perform linear regression with the 5 PCs as predictors and the Crime column as the response.

lm_pca <- lm(V5 ~ ., data = crimedata_components)

Let’s view a summary of our regression model and check out the coefficients and their significance levels.

summary(lm_pca)

## 
## Call:
## lm(formula = V5 ~ ., data = crimedata_components)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -557.76 -210.91  -29.08  197.26  810.35 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      49.07  18.443  < 2e-16 ***
## PC1            65.22      20.22   3.225  0.00244 ** 
## PC2           -70.08      29.63  -2.365  0.02273 *  
## PC3            25.19      35.03   0.719  0.47602    
## PC4            69.45      46.01   1.509  0.13872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 336.4 on 42 degrees of freedom
## Multiple R-squared:  0.3091, Adjusted R-squared:  0.2433 
## F-statistic: 4.698 on 4 and 42 DF,  p-value: 0.003178

lm_pca$coefficients

## (Intercept)         PC1         PC2         PC3         PC4 
##   905.08511    65.21593   -70.08312    25.19408    69.44603

Now we’ll perform the transformation. We will need to keep in mind that our data has been scaled and therefore properly handle unscaling it.

int <- lm_pca$coefficients[1] #Store the intercept
beta <- lm_pca$coefficients[2:(4+1)] #Extract coefficients

Now, we’ll multiply the PCA rotation matrix (which contains the loadings of the original variables on each principal component) by the regression coefficients to calculate the loadings of the original variables and transform the regression coefficients back to the scale of the original variables by calculating the mu and the std of the first 15 columns in the crime dataset.

alpha <- crimedata_pca$rotation[,1:4] %*% beta
mu <- sapply(crime_data[,1:15], mean)
sigma <- sapply(crime_data[,1:15], sd)
ogalpha <- alpha / sigma
ogbeta <- int - sum(alpha*mu / sigma)

We compute the estimated values of the response variable using the original data and the adjusted model parameters.

est <- as.matrix(crime_data[,1:15]) %*% ogalpha + ogbeta

Let’s calculate the models SSE and total sum of squares.

SSE = sum((est - crime_data[,16])^2)
SS_sum = sum((crime_data[,16] - mean(crime_data[,16]))^2)

And now let’s calculate the R-squared and adjusted R-squared values.

Rsq <- 1 - SSE/SS_sum
Rsq

## [1] 0.3091121

Rsq2 <- Rsq - (1-Rsq)*4/(nrow(crime_data)-4-1)
Rsq2

## [1] 0.2433132

We now compare to the given city data in HW5 Q8.2 and compare to see what the new crime rate is for this model.

city_data <- data.frame(M= 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.040, Time = 39.0)
pred_city <- data.frame(predict(crimedata_pca, city_data)) 
pred_city_model <- predict(lm_pca, pred_city)
pred_city_model

##        1 
## 1112.678

As we can see, the crime rate that this model calculates for the city_data still lies within a reasonable range of our the overall crime data. Our answer last week was similar (approx 1300~) so we believe this model can also be useful to help predict crime rates.

HW6

2024-02-21