First, I want to check if there are outliers, in general, in the datset. The purpose of this is not to detect any specific outlier or eliminate any of them, but to know the general situation to provide proof for later data transformation.
As is seen in the boxplot below, most of the outliers are the ones closer to the upper limit, and variables like male percentage, number of population, non-white percentage have outliers that stretch very far to the upper boundary.
setwd("C:/Users/Jian Pang/Desktop/OMA/HW/HW6")
crime = read.csv("data 9.1/uscrime.txt",sep = "", header = T)
par(mfrow = c(4, 4), mar = c(1,2,1,1))
for (i in 1:16){
boxplot = boxplot(crime[,i], boxlty = 6, whisklty = 1, staplelwd = 4, medcol = 'blue',
outpch = 7,outcol = "red",main=names(crime)[i])
}
I want to test the normality of each variable. The purpose of this step is mainly to see if scaling is needed in the later PCA process. From the graph below, we can see that there are some variables that exhibits non-normality: “Souther State”, “Education Level”, “Male Percentage”, “Population”,“Non-white Percentage” and “unemployment rate of urban males 14-24”.
par(mfrow = c(4, 4), mar = c(1,2,3,1))
for (i in 1:16){
qqnorm(crime[,i], main = names(crime)[i])
qqline(crime[,i], col='blue')
}
To get a more precise idea of their normality, I also do the shapiro test to see the probability of them being normal. And the below table shows the variables that are not normally distributed.
plist = c()
for (i in 1:16){
s = shapiro.test(crime[,i])
p = s$p.value
plist[i] = p
}
pdf = data.frame(names(crime),plist)
library(knitr)
knitr::kable(pdf)
| names.crime. | plist |
|---|---|
| M | 0.0360754 |
| So | 0.0000000 |
| Ed | 0.0031970 |
| Po1 | 0.0042808 |
| Po2 | 0.0065808 |
| LF | 0.1719905 |
| M.F | 0.0037724 |
| Pop | 0.0000002 |
| NW | 0.0000047 |
| U1 | 0.0086167 |
| U2 | 0.2133081 |
| Wealth | 0.3375134 |
| Ineq | 0.0131909 |
| Prob | 0.0179052 |
| Time | 0.6132446 |
| Crime | 0.0018825 |
#print out the ones that have a p value below 0.05
knitr::kable(pdf[pdf$plist<0.05,])
| names.crime. | plist | |
|---|---|---|
| 1 | M | 0.0360754 |
| 2 | So | 0.0000000 |
| 3 | Ed | 0.0031970 |
| 4 | Po1 | 0.0042808 |
| 5 | Po2 | 0.0065808 |
| 7 | M.F | 0.0037724 |
| 8 | Pop | 0.0000002 |
| 9 | NW | 0.0000047 |
| 10 | U1 | 0.0086167 |
| 13 | Ineq | 0.0131909 |
| 14 | Prob | 0.0179052 |
| 16 | Crime | 0.0018825 |
It’s also noticeable that the variables do not have the same unit, so it is neccessary to scale them when doing PCA later.
Even though I test the normality of each variable before, and Souther State is unique in its nature to not having a normal distribution, instead, might follow a binomial distribution.
Since one of the steps of PCA is to center the data, and to construct PCs along the axis where the data has the largest variance, the binary indicator - Southern State might be an issue here because its mean and variance is not calculated in the same ways as continuous variables. However, I would still like to test to see if the PCA result with and without this variable will be different by comparing two models.
It can be seen from the screeplot below that both model’s PCs start to gain minimal variance explained staring from the fourth PC. So when I add up the first four PCs, model 1 shows a total variance explained of 79.9%, and the second model shows a total variance explained of 80.2%. This minimal increase brought by the addition variable is almost neglectable, but meanwhile, since it doesn’t bring down the whole model’s explanation effect either, I think it is still safe to keep it within the model.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
library(ggpubr)
## Loading required package: magrittr
##this PCA model has all the variables
model = prcomp(crime[,1:16], center = T, scale = T)
##this PCA model exludes the Southern variable
model2 = prcomp(crime[,c(1,3:16)], center = T, scale = T)
pv = cbind(model$sdev,c(model2$sdev,0))
screeplt1= fviz_screeplot(model,addlabels = T)+labs(title = "All Variables")
screeplt2 = fviz_screeplot(model2,addlabels = T)+labs(title = "No Southern Variable")
figure <- ggarrange(screeplt1,screeplt2,ncol = 1, nrow = 2)
figure
So I decided to use all the variables for constructing PCA, and the result is already shown in the model above. I decide to go with 5 PCs.
I try to plot the data points in the new PC dimensions: The warmer the dots in color, the higher the contribution it has to the new PC, also meaning it is more correlated to the new PC.
indpcplot1= fviz_pca_ind(model, axes = c(1,2), col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
indpcplot2 = fviz_pca_ind(model, axes = c(3,4), col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
figure <- ggarrange(indpcplot1,indpcplot2,ncol = 1, nrow = 2)
figure
And I also plot the original variables on the new PC dimensions
varpcplot1= fviz_pca_var(model, axes = c(1,2), col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
varpcplot2= fviz_pca_var(model, axes = c(3,4), col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
figure <- ggarrange(varpcplot1,varpcplot2,ncol = 1, nrow = 2)
figure
before doing the regression, I need to reverse the new data points in the PC dimensions back to the original data dimensions, and also scale it back to the original units and scales. Therefore, I get the newcrime data
I also extract the loading matrix in the PCA result for later use.
columnmeans = colMeans(crime)
columnsd = sapply(crime, sd)
newcrime = matrix(, nrow=47, ncol=16)
for (i in 1:16){
newcrime[,i]=(crime[,i]-columnmeans[i])/columnsd[i]
}
newcrime = as.data.frame(newcrime)
summary(model)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4944 1.7111 1.4208 1.19585 1.06341 0.75087 0.60237
## Proportion of Variance 0.3889 0.1830 0.1262 0.08938 0.07068 0.03524 0.02268
## Cumulative Proportion 0.3889 0.5719 0.6981 0.78744 0.85812 0.89336 0.91603
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55503 0.49244 0.47036 0.43856 0.41777 0.29147 0.26063
## Proportion of Variance 0.01925 0.01516 0.01383 0.01202 0.01091 0.00531 0.00425
## Cumulative Proportion 0.93529 0.95044 0.96427 0.97629 0.98720 0.99251 0.99676
## PC15 PC16
## Standard deviation 0.21813 0.06584
## Proportion of Variance 0.00297 0.00027
## Cumulative Proportion 0.99973 1.00000
loadings = model$rotation[,1:5]
loadingssum = apply(loadings, 1, function(x) sum(x))
Then I use the new crime data to fit a linear regression, using a 5-fold cross validation, and I use mean squared error as my indicator to see the goodness of fit of the model.
As is shown below, the MSE values are very low
library(caret)
## Loading required package: lattice
set.seed(11)
cv_full_data = as.data.frame(cbind(model$x[,1:5], newcrime[,16]))
samples <- createFolds(y=cv_full_data[,6],k=5)
mselist = c()
coefficientdf = matrix(,nrow = 0, ncol = 4)
rsqlist = c()
for (i in 1:5){
test_sample <- cv_full_data[samples[[i]],]
train_sample <- cv_full_data[-samples[[i]],]
lrmodel = lm( V6 ~ PC1 + PC2 + PC3 + PC4 + PC5,
data = train_sample)
p = predict(lrmodel, test_sample)
mse = mean((test_sample$V6 - p)^2)
mselist = c(mselist, mse)
coefficients = summary(lrmodel)$coefficients
coefficientdf = rbind(coefficientdf, coefficients)
rsqlist = c(rsqlist, summary(lrmodel)$r.squared)
}
knitr::kable(coefficientdf)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0105649 | 0.0583987 | -0.1809100 | 0.8575447 |
| PC1 | 0.1828170 | 0.0238526 | 7.6644423 | 0.0000000 |
| PC2 | -0.2386618 | 0.0337460 | -7.0723032 | 0.0000000 |
| PC3 | 0.0861524 | 0.0397226 | 2.1688498 | 0.0374005 |
| PC4 | -0.5641051 | 0.0474047 | -11.8997703 | 0.0000000 |
| PC5 | 0.1042511 | 0.0654031 | 1.5939778 | 0.1204754 |
| (Intercept) | 0.0315471 | 0.0636961 | 0.4952758 | 0.6240139 |
| PC1 | 0.2041121 | 0.0244072 | 8.3627949 | 0.0000000 |
| PC2 | -0.2151018 | 0.0371996 | -5.7823741 | 0.0000026 |
| PC3 | 0.0843171 | 0.0439652 | 1.9178130 | 0.0646984 |
| PC4 | -0.5188083 | 0.0611355 | -8.4862081 | 0.0000000 |
| PC5 | 0.1464174 | 0.0605796 | 2.4169408 | 0.0219408 |
| (Intercept) | -0.0146769 | 0.0589547 | -0.2489516 | 0.8049892 |
| PC1 | 0.2038253 | 0.0245213 | 8.3121692 | 0.0000000 |
| PC2 | -0.2271544 | 0.0339961 | -6.6817801 | 0.0000002 |
| PC3 | 0.1039358 | 0.0426976 | 2.4342285 | 0.0206829 |
| PC4 | -0.5459811 | 0.0484723 | -11.2637662 | 0.0000000 |
| PC5 | 0.1081045 | 0.0526829 | 2.0519870 | 0.0484279 |
| (Intercept) | 0.0621212 | 0.0571977 | 1.0860790 | 0.2855581 |
| PC1 | 0.1856399 | 0.0227869 | 8.1467675 | 0.0000000 |
| PC2 | -0.2700878 | 0.0351675 | -7.6800496 | 0.0000000 |
| PC3 | 0.1527147 | 0.0439095 | 3.4779443 | 0.0014782 |
| PC4 | -0.5599658 | 0.0497201 | -11.2623519 | 0.0000000 |
| PC5 | 0.1575920 | 0.0516872 | 3.0489545 | 0.0045825 |
| (Intercept) | -0.0531016 | 0.0630829 | -0.8417748 | 0.4063612 |
| PC1 | 0.2074998 | 0.0271854 | 7.6327628 | 0.0000000 |
| PC2 | -0.2442538 | 0.0375338 | -6.5075775 | 0.0000003 |
| PC3 | 0.1157941 | 0.0446701 | 2.5922041 | 0.0144181 |
| PC4 | -0.5533537 | 0.0501055 | -11.0437761 | 0.0000000 |
| PC5 | 0.1563405 | 0.0579853 | 2.6962098 | 0.0112331 |
print(rsqlist)
## [1] 0.8872666 0.8631892 0.8930504 0.9057765 0.8719073
Above is the result for all the 5 cv trials (because it’s 5 fold).
It can be seen from the above R square value that the 2nd cv trial has the best fit.
From the coefficient table above, we can also see that the Intercept term is highly likely not correlated with the crime response, and also it has a very small value as well, meaning that it has little influence. Besides PC3 and PC4, the rest 3 PCs are all significant. PC1 and PC2 have the same amount of opposite influence to the response, and PC5 has the highest influence of all.
I use this regression model to predict the final crime value for the given input. Since the 2nd cv trial has the best fit, I extract the coefficient from the 2nd cv trial.
#There is a model that has the highest r square, which is the 3rd one
lmcoeff = coefficientdf[7:12,1]
input = c(
M = 14.0,
So = 0,
Ed = 10.0,
Po1 = 12.0,
Po2 = 15.5,
LF = 0.640,
M.F = 94.0,
Pop = 150,
NW = 1.1,
U1 = 0.120,
U2 = 3.6,
Wealth = 3200,
Ineq = 20.1,
Prob = 0.04,
Time = 39.0)
finalpred = 0
colsum = 0
for (i in 1:5){
for (j in 1:15){
a = lmcoeff[i+1]* loadings[j,i] * (input[j] - columnmeans[j])/columnsd[j]
colsum = colsum + a
}
finalpred = finalpred + colsum
}
finalpred = finalpred + lmcoeff[1]
finalpredict = finalpred*columnsd[16] + columnmeans[16]
finalpredict
## PC1
## 2156.721
As shown above, the final prediction for crime is 2574, which is way beyond our original data range. Considering that the linear regression with 5 PCs do not perform very well, with 2 PCs being very insignificant, I decided to remove the 2 PCs and do a new linear regression to fit the data.
I have to also repeat the cv process again to make sure to find a best trial within the 5 trials.
mselist2 = c()
coefficientdf2 = matrix(,nrow = 0, ncol = 4)
rsqlist2 = c()
for (i in 1:5){
test_sample <- cv_full_data[samples[[i]],]
train_sample <- cv_full_data[-samples[[i]],]
lrmodel = lm( V6 ~ PC1 + PC2 + PC5,
data = train_sample)
p = predict(lrmodel, test_sample)
mse = mean((test_sample$V6 - p)^2)
mselist = c(mselist, mse)
coefficients = summary(lrmodel)$coefficients
coefficientdf2 = rbind(coefficientdf2, coefficients)
rsqlist2 = c(rsqlist, summary(lrmodel)$r.squared)
}
lmcoeff2 = coefficientdf2[5:8,1]
knitr::kable(coefficientdf2)
| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | -0.0053310 | 0.1322794 | -0.0403013 | 0.9680819 |
| PC1 | 0.1915169 | 0.0540282 | 3.5447604 | 0.0011382 |
| PC2 | -0.2209979 | 0.0763490 | -2.8945757 | 0.0064971 |
| PC5 | 0.2125099 | 0.1462619 | 1.4529407 | 0.1551478 |
| (Intercept) | -0.0144060 | 0.1141162 | -0.1262399 | 0.9003319 |
| PC1 | 0.1759272 | 0.0435363 | 4.0409336 | 0.0003121 |
| PC2 | -0.2502902 | 0.0665351 | -3.7617785 | 0.0006805 |
| PC5 | 0.1490230 | 0.1087104 | 1.3708262 | 0.1799681 |
| (Intercept) | -0.0306512 | 0.1291058 | -0.2374113 | 0.8137625 |
| PC1 | 0.2010474 | 0.0537406 | 3.7410704 | 0.0006756 |
| PC2 | -0.2532496 | 0.0741899 | -3.4135307 | 0.0016737 |
| PC5 | 0.0921799 | 0.1148911 | 0.8023238 | 0.4279384 |
| (Intercept) | 0.0894205 | 0.1285137 | 0.6958054 | 0.4912798 |
| PC1 | 0.2323975 | 0.0510999 | 4.5479061 | 0.0000657 |
| PC2 | -0.2661636 | 0.0798236 | -3.3343988 | 0.0020745 |
| PC5 | 0.1016820 | 0.1175452 | 0.8650466 | 0.3930759 |
| (Intercept) | -0.0552218 | 0.1381241 | -0.3997985 | 0.6918810 |
| PC1 | 0.1842553 | 0.0594928 | 3.0971007 | 0.0039724 |
| PC2 | -0.2191160 | 0.0826273 | -2.6518596 | 0.0122062 |
| PC5 | 0.1295725 | 0.1276173 | 1.0153206 | 0.3173397 |
print(rsqlist2)
## [1] 0.8872666 0.8631892 0.8930504 0.9057765 0.8719073 0.3364407
Although the R squared value doesn’t improve compared to the last model, or the fitting quality doesn’t improve, it can still be shown that the coefficients are all significant, except the intercept term still. The 2nd cv trial turns out to be the best performer, so I still extract the coefficient from the trial.
Same as before, I predict the crime with this new regression model
finalpred = 0
colsum = 0
for (i in 1:3){
for (j in 1:15){
a = lmcoeff2[i+1]* loadings[,-c(3:4)][j,i] * (input[j] - columnmeans[j])/columnsd[j]
colsum = colsum + a
}
finalpred = finalpred + colsum
}
finalpred2 = finalpred + lmcoeff2[1]
finalpredict2 = finalpred*columnsd[16] + columnmeans[16]
print(finalpredict2)
## PC1
## 1752.131
This time, the crime value is almost 1980. This value is still within our data range, although it is very close to the upper bound. This value is much higher than the last prediction. However, for the pca method, it proves that it’s still a valid approach to deal with high dimensional data, and return a data matrix that still retains good amount of information for doing linear regression.