library(DAAG)
library(magrittr)
#set the working directory so we can correctly pull in the data
setwd("C:/Users/tarra/Documents/Intro to Analytics Modeling/Fall2020hw6/data 9.1")
uscrime_data <- read.table("uscrime.txt",stringsAsFactors = FALSE, header = TRUE)
uscrime_prc <- prcomp(uscrime_data[,-16], center=TRUE, scale=TRUE)
summary(uscrime_prc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4534 1.6739 1.4160 1.07806 0.97893 0.74377 0.56729
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688 0.02145
## Cumulative Proportion 0.4013 0.5880 0.7217 0.79920 0.86308 0.89996 0.92142
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55444 0.48493 0.44708 0.41915 0.35804 0.26333 0.2418
## Proportion of Variance 0.02049 0.01568 0.01333 0.01171 0.00855 0.00462 0.0039
## Cumulative Proportion 0.94191 0.95759 0.97091 0.98263 0.99117 0.99579 0.9997
## PC15
## Standard deviation 0.06793
## Proportion of Variance 0.00031
## Cumulative Proportion 1.00000
screeplot(uscrime_prc, main = "Scree Plot", type = "line")
Looking at the scree plot, the curve levels off after the fifth component, and the summary above shows the first five PCs together explain about 86% of the total variance, so we choose the first five PCs for our model. Next we want to focus on the model.
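We can confirm this directly from the component standard deviations:
#each PC's variance is its standard deviation squared; the running total
#reproduces the "Cumulative Proportion" row in the summary above
pc_var <- uscrime_prc$sdev^2
round(cumsum(pc_var) / sum(pc_var), 3)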
#we set the number of PCs we want to use, then combine them with the Crime column from the original data.
pc = 5
uscrime_matrix <- cbind(uscrime_prc$x[,1:pc],uscrime_data[,16])
#now that it's combined, we can create the linear regression model (the unnamed Crime column becomes V6 in the data frame).
uscrime_model <- lm(V6~., data = as.data.frame(uscrime_matrix))
summary(uscrime_model)
##
## Call:
## lm(formula = V6 ~ ., data = as.data.frame(uscrime_matrix))
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.79 -185.01 12.21 146.24 447.86
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.09 35.59 25.428 < 2e-16 ***
## PC1 65.22 14.67 4.447 6.51e-05 ***
## PC2 -70.08 21.49 -3.261 0.00224 **
## PC3 25.19 25.41 0.992 0.32725
## PC4 69.45 33.37 2.081 0.04374 *
## PC5 -229.04 36.75 -6.232 2.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 244 on 41 degrees of freedom
## Multiple R-squared: 0.6452, Adjusted R-squared: 0.6019
## F-statistic: 14.91 on 5 and 41 DF, p-value: 2.446e-08
Now we want to transform the PC coefficients back to the scale of the original variables, so that we can compare this model's quality against our 8.2 model and predict on last week's data point.
#what we want to do here is perform the transformation as explained in this week's lessons.
#to start, we need our intercept
intercept <- uscrime_model$coefficients[1]
#now we need our other coefficients to make the beta vector
beta_vector <- uscrime_model$coefficients[2:(pc+1)]
#next we take the rotation matrix and multiply it by our coefficients to create the alpha vector.
alpha_vector <- uscrime_prc$rotation[,1:pc] %*% beta_vector
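A quick dimension check, to make the multiplication concrete:
dim(uscrime_prc$rotation[,1:pc])  #15 x 5: original variables by retained PCs
length(beta_vector)               #5: one coefficient per PC
length(alpha_vector)              #15: one implied coefficient per scaled predictor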
#we can get our original-scale alpha values as alpha_vector/sigma, and our original intercept (beta) by subtracting the sum of (alpha*mu)/sigma from the model intercept
#for this we need mu and sigma, the column means and standard deviations used in scaling (equivalently, uscrime_prc$center and uscrime_prc$scale).
mu <- sapply(uscrime_data[,1:15], mean)
sigma <- sapply(uscrime_data[,1:15], sd)
original_alpha <- alpha_vector / sigma
original_beta <- intercept - sum(alpha_vector*mu / sigma)
#with this we can now make our estimates from the model y = aX + b
estimates <- as.matrix(uscrime_data[,1:15]) %*% original_alpha + original_beta
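As a sanity check on the back-transformation, these estimates should match the fitted values of the PC regression itself, since the PCs are an exact linear transform of the scaled predictors:
#should return TRUE up to floating-point tolerance
all.equal(as.vector(estimates), unname(fitted(uscrime_model)))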
#We can use this to calculate R^2 and adjusted R^2 to compare to our 8.2 answer
SSE = sum((estimates - uscrime_data[,16])^2)
SStot = sum((uscrime_data[,16] - mean(uscrime_data[,16]))^2)
R2 <- 1 - SSE/SStot
R2
## [1] 0.6451941
R2_adjust <- R2 - (1-R2)*pc/(nrow(uscrime_data)-pc-1)
R2_adjust
## [1] 0.601925
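Both R-squared figures here are computed on the same 47 training points, so they flatter the models somewhat. A fairer comparison would cross-validate; DAAG (loaded above) provides cv.lm for this. A minimal sketch (argument names such as printit and plotit may vary slightly across DAAG versions):
#5-fold cross-validation of the PC regression; cvpred holds the held-out predictions
cv_result <- cv.lm(as.data.frame(uscrime_matrix), V6 ~ PC1 + PC2 + PC3 + PC4 + PC5,
                   m = 5, printit = FALSE, plotit = FALSE)
SSE_cv <- sum((cv_result$cvpred - cv_result$V6)^2)
1 - SSE_cv/SStot  #cross-validated R^2, expected to be lower than 0.645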
#we can now move on to predicting on last week's data point with our new model to compare.
new_city <- data.frame(M= 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5,
LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120,
U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.040,
Time = 39.0)
#project the new city onto the principal components, then feed those scores to the regression model
pred_newcity <- data.frame(predict(uscrime_prc, new_city))
pred_newcity_model <- predict(uscrime_model, pred_newcity)
pred_newcity_model
## 1
## 1388.926
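As a final consistency check, predicting with the unscaled coefficients directly should reproduce the same number:
#same prediction via the original-scale coefficients; should match 1388.926
as.matrix(new_city) %*% original_alpha + original_beta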
Comparing this week's prediction to last week's, the PCA model returned fairly close results. With a prediction of 1389 versus last week's 1304, and an R-squared of 0.645 versus 0.671, this model is slightly less satisfactory overall. Granted, the data set here is small (47 observations); with a bigger data set the PCA model might match or even beat last week's. Overall, we've shown that a PCA-based model can give just about the same accuracy while using fewer predictors.