Question 9.1 - Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis and then create a regression model using the first few principal components. Specify your new model in terms of the original variables (not the principal components), and compare its quality to that of your solution to Question 8.2. You can use the R function prcomp for PCA. (Note that to first scale the data, you can include scale. = TRUE to scale as part of the PCA function. Don’t forget that, to make a prediction for the new city, you’ll need to unscale the coefficients (i.e., do the scaling calculation in reverse)!)

Methodology - We are going to do a Principal Component Analysis to create a new model where our predictors are less correlated. Once we create our PCA Model, we are going to go ahead and only take so many of our Principal Component Variables depending on the cumulative variance explained. Then we are going to create a model of just this Principal Components to predict our new city values.

crime_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 6/Homework6_ISYE6501/Homework6_ISYE6501/data 9.1/uscrime.txt", header = TRUE)

head(crime_data)

We are going to use our prcomp function here to get our different principal components.

crime_pca = prcomp(crime_data[,1:15], scale = TRUE, center = TRUE)

summary(crime_pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4534 1.6739 1.4160 1.07806 0.97893 0.74377 0.56729
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688 0.02145
## Cumulative Proportion  0.4013 0.5880 0.7217 0.79920 0.86308 0.89996 0.92142
##                            PC8     PC9    PC10    PC11    PC12    PC13   PC14
## Standard deviation     0.55444 0.48493 0.44708 0.41915 0.35804 0.26333 0.2418
## Proportion of Variance 0.02049 0.01568 0.01333 0.01171 0.00855 0.00462 0.0039
## Cumulative Proportion  0.94191 0.95759 0.97091 0.98263 0.99117 0.99579 0.9997
##                           PC15
## Standard deviation     0.06793
## Proportion of Variance 0.00031
## Cumulative Proportion  1.00000
plot(crime_pca)

We can see that once we get to PC9, we are explaining over 95% of our variance, so we are going to choose 9 different PCs for our model.

crimes_pcs = crime_pca$x[,1:9]

head(crimes_pcs)
##            PC1        PC2         PC3         PC4         PC5        PC6
## [1,] -4.199284 -1.0938312 -1.11907395  0.67178115  0.05528338  0.3073383
## [2,]  1.172663  0.6770136 -0.05244634 -0.08350709 -1.17319982 -0.5832373
## [3,] -4.173725  0.2767750 -0.37107658  0.37793995  0.54134525  0.7187223
## [4,]  3.834962 -2.5769060  0.22793998  0.38262331 -1.64474650  0.7294884
## [5,]  1.839300  1.3309856  1.27882805  0.71814305  0.04159032 -0.3940902
## [6,]  2.907234 -0.3305421  0.53288181  1.22140635  1.37436096 -0.6922513
##              PC7          PC8         PC9
## [1,] -0.56640816 -0.007801727  0.22350995
## [2,]  0.19561119  0.154566472  0.43677720
## [3,]  0.10330693  0.351138883  0.06299232
## [4,]  0.26699499 -1.547460841 -0.37954181
## [5,]  0.07050766 -0.543237437  0.22463245
## [6,]  0.22648209  0.562323186  0.41772217

Let us make a new dataset of our PCs and our Crime Statistic.

new_crime = as.data.frame(cbind(crimes_pcs, crime_data[,16]))

head(new_crime)
stan_model = lm(V10~., data = new_crime)

summary(stan_model)
## 
## Call:
## lm(formula = V10 ~ ., data = new_crime)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -455.9 -132.5   21.5  139.9  393.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   905.09      34.91  25.928  < 2e-16 ***
## PC1            65.22      14.38   4.535 5.88e-05 ***
## PC2           -70.08      21.08  -3.325  0.00201 ** 
## PC3            25.19      24.92   1.011  0.31857    
## PC4            69.45      32.73   2.122  0.04061 *  
## PC5          -229.04      36.04  -6.355 2.08e-07 ***
## PC6           -60.21      47.44  -1.269  0.21228    
## PC7           117.26      62.20   1.885  0.06728 .  
## PC8            28.72      63.64   0.451  0.65446    
## PC9           -37.18      72.76  -0.511  0.61244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239.3 on 37 degrees of freedom
## Multiple R-squared:  0.692,  Adjusted R-squared:  0.6171 
## F-statistic: 9.239 on 9 and 37 DF,  p-value: 3.588e-07

Now that we have created our model, let us predict our value for our new city’s metrics.

city_data = data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)

pred_new = data.frame(predict(crime_pca, city_data))
pred_model = predict(stan_model,pred_new)

pred_model
##        1 
## 1136.169

We have a predicted value of 1136 where last week our predicted value was 1096.

Discussion of Results - Ultimately, our crime value this week is pretty similar to last week’s crime statistic found. Since our crime values are anywhere from around 350 - 2000, we are within 40 points of each other, so I would say that is pretty consistent. Ultimately, if we chose a lower amount of principal components, we would expect this value to vary more since less variance is being accounted for in our model.