Question 9.1 - Using the same crime data set uscrime.txt
as in Question 8.2, apply Principal Component Analysis and then create a
regression model using the first few principal components. Specify your
new model in terms of the original variables (not the principal
components), and compare its quality to that of your solution to
Question 8.2. You can use the R function prcomp for PCA. (Note that to
first scale the data, you can include scale. = TRUE to scale as part of
the PCA function. Don’t forget that, to make a prediction for the new
city, you’ll need to unscale the coefficients (i.e., do the scaling
calculation in reverse)!)
Methodology - We are going to do a Principal Component Analysis to create a new model where our predictors are less correlated. Once we create our PCA Model, we are going to go ahead and only take so many of our Principal Component Variables depending on the cumulative variance explained. Then we are going to create a model of just this Principal Components to predict our new city values.
crime_data <- read.table("C:/Users/james/OneDrive/Desktop/Georgia Tech/ISYE 6501/Homeworks/Homework 6/Homework6_ISYE6501/Homework6_ISYE6501/data 9.1/uscrime.txt", header = TRUE)
head(crime_data)
We are going to use our prcomp function here to get our different principal components.
crime_pca = prcomp(crime_data[,1:15], scale = TRUE, center = TRUE)
summary(crime_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4534 1.6739 1.4160 1.07806 0.97893 0.74377 0.56729
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688 0.02145
## Cumulative Proportion 0.4013 0.5880 0.7217 0.79920 0.86308 0.89996 0.92142
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55444 0.48493 0.44708 0.41915 0.35804 0.26333 0.2418
## Proportion of Variance 0.02049 0.01568 0.01333 0.01171 0.00855 0.00462 0.0039
## Cumulative Proportion 0.94191 0.95759 0.97091 0.98263 0.99117 0.99579 0.9997
## PC15
## Standard deviation 0.06793
## Proportion of Variance 0.00031
## Cumulative Proportion 1.00000
plot(crime_pca)
We can see that once we get to PC9, we are explaining over 95% of our variance, so we are going to choose 9 different PCs for our model.
crimes_pcs = crime_pca$x[,1:9]
head(crimes_pcs)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -4.199284 -1.0938312 -1.11907395 0.67178115 0.05528338 0.3073383
## [2,] 1.172663 0.6770136 -0.05244634 -0.08350709 -1.17319982 -0.5832373
## [3,] -4.173725 0.2767750 -0.37107658 0.37793995 0.54134525 0.7187223
## [4,] 3.834962 -2.5769060 0.22793998 0.38262331 -1.64474650 0.7294884
## [5,] 1.839300 1.3309856 1.27882805 0.71814305 0.04159032 -0.3940902
## [6,] 2.907234 -0.3305421 0.53288181 1.22140635 1.37436096 -0.6922513
## PC7 PC8 PC9
## [1,] -0.56640816 -0.007801727 0.22350995
## [2,] 0.19561119 0.154566472 0.43677720
## [3,] 0.10330693 0.351138883 0.06299232
## [4,] 0.26699499 -1.547460841 -0.37954181
## [5,] 0.07050766 -0.543237437 0.22463245
## [6,] 0.22648209 0.562323186 0.41772217
Let us make a new dataset of our PCs and our Crime Statistic.
new_crime = as.data.frame(cbind(crimes_pcs, crime_data[,16]))
head(new_crime)
stan_model = lm(V10~., data = new_crime)
summary(stan_model)
##
## Call:
## lm(formula = V10 ~ ., data = new_crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -455.9 -132.5 21.5 139.9 393.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.09 34.91 25.928 < 2e-16 ***
## PC1 65.22 14.38 4.535 5.88e-05 ***
## PC2 -70.08 21.08 -3.325 0.00201 **
## PC3 25.19 24.92 1.011 0.31857
## PC4 69.45 32.73 2.122 0.04061 *
## PC5 -229.04 36.04 -6.355 2.08e-07 ***
## PC6 -60.21 47.44 -1.269 0.21228
## PC7 117.26 62.20 1.885 0.06728 .
## PC8 28.72 63.64 0.451 0.65446
## PC9 -37.18 72.76 -0.511 0.61244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 239.3 on 37 degrees of freedom
## Multiple R-squared: 0.692, Adjusted R-squared: 0.6171
## F-statistic: 9.239 on 9 and 37 DF, p-value: 3.588e-07
Now that we have created our model, let us predict our value for our new city’s metrics.
city_data = data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0)
pred_new = data.frame(predict(crime_pca, city_data))
pred_model = predict(stan_model,pred_new)
pred_model
## 1
## 1136.169
We have a predicted value of 1136 where last week our predicted value was 1096.
Discussion of Results - Ultimately, our crime value this week is pretty similar to last week’s crime statistic found. Since our crime values are anywhere from around 350 - 2000, we are within 40 points of each other, so I would say that is pretty consistent. Ultimately, if we chose a lower amount of principal components, we would expect this value to vary more since less variance is being accounted for in our model.