First, I want to check whether there are any outliers in the dataset in general. The purpose is not to detect or eliminate specific outliers, but to understand the overall situation and provide justification for the data transformation later on.
As the boxplots below show, most of the outliers lie near the upper limit, and variables such as male percentage, population, and non-white percentage have outliers that stretch far beyond the upper whisker.
setwd("C:/Users/Jian Pang/Desktop/OMA/HW/HW6")
crime = read.csv("data 9.1/uscrime.txt",sep = "", header = T)
# draw a boxplot of each of the 16 variables in a 4x4 grid
par(mfrow = c(4, 4), mar = c(1,2,1,1))
for (i in 1:16){
  bp = boxplot(crime[,i], boxlty = 6, whisklty = 1, staplelwd = 4, medcol = 'blue',
               outpch = 7, outcol = "red", main = names(crime)[i])
}
I want to test the normality of each variable. The purpose of this step is mainly to see whether scaling is needed in the later PCA. From the Q-Q plots below, several variables exhibit non-normality: "Southern State", "Education Level", "Male Percentage", "Population", "Non-white Percentage", and "unemployment rate of urban males 14-24".
# normal Q-Q plot for each variable, with the reference line in blue
par(mfrow = c(4, 4), mar = c(1,2,3,1))
for (i in 1:16){
  qqnorm(crime[,i], main = names(crime)[i])
  qqline(crime[,i], col = 'blue')
}
To get a more precise idea of their normality, I also run the Shapiro-Wilk test on each variable; a small p-value means we reject the null hypothesis that the variable is normally distributed. The first table below shows the p-value for every variable, and the second table keeps only those with p < 0.05, i.e. the variables that are not normally distributed.
# Shapiro-Wilk test on every variable; store the p-values
plist = c()
for (i in 1:16){
  s = shapiro.test(crime[,i])
  p = s$p.value
  plist[i] = p
}
pdf = data.frame(names(crime), plist)
library(knitr)
knitr::kable(pdf)
| names.crime. | plist |
|---|---|
| M | 0.0360754 |
| So | 0.0000000 |
| Ed | 0.0031970 |
| Po1 | 0.0042808 |
| Po2 | 0.0065808 |
| LF | 0.1719905 |
| M.F | 0.0037724 |
| Pop | 0.0000002 |
| NW | 0.0000047 |
| U1 | 0.0086167 |
| U2 | 0.2133081 |
| Wealth | 0.3375134 |
| Ineq | 0.0131909 |
| Prob | 0.0179052 |
| Time | 0.6132446 |
| Crime | 0.0018825 |
#print out the ones that have a p value below 0.05
knitr::kable(pdf[pdf$plist<0.05,])
|   | names.crime. | plist |
|---|---|---|
| 1 | M | 0.0360754 |
| 2 | So | 0.0000000 |
| 3 | Ed | 0.0031970 |
| 4 | Po1 | 0.0042808 |
| 5 | Po2 | 0.0065808 |
| 7 | M.F | 0.0037724 |
| 8 | Pop | 0.0000002 |
| 9 | NW | 0.0000047 |
| 10 | U1 | 0.0086167 |
| 13 | Ineq | 0.0131909 |
| 14 | Prob | 0.0179052 |
| 16 | Crime | 0.0018825 |
It is also noticeable that the variables are not measured in the same units, so it is necessary to scale them when doing PCA later.
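As a quick illustration of the differing scales (a sketch; output not shown), the per-variable ranges can be compared directly:
# the columns span very different ranges (rates, percentages, population counts),
# so an unscaled PCA would be dominated by the large-magnitude variables
sapply(crime, range)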
Although I tested the normality of each variable above, Southern State is unique in that it cannot be normally distributed: it is a binary indicator and instead follows a Bernoulli (binomial) distribution.
Since PCA centers the data and constructs PCs along the directions of largest variance, the binary Southern State indicator might be an issue, because its mean and variance do not have the same interpretation as for continuous variables. However, I would still like to see whether the PCA results with and without this variable differ, by comparing two models.
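A quick check of this indicator (a sketch; output not shown) confirms it only takes the values 0 and 1, so its variance simply reflects the split between southern and non-southern states:
# So is a 0/1 dummy variable; its sample variance is driven by the class balance
table(crime$So)
var(crime$So)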
It can be seen from the screeplots below that, in both models, the PCs start to add only minimal explained variance from the fourth PC onward. Summing the first four PCs, model 1 explains 79.9% of the total variance and model 2 explains 80.2% (a quick way to verify these figures is shown right after the screeplot code). The difference caused by the additional variable is negligible, and since keeping it does not hurt the model's explanatory power either, I think it is safe to keep it in the model.
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(gridExtra)
library(ggpubr)
## Loading required package: magrittr
##this PCA model has all the variables
model = prcomp(crime[,1:16], center = T, scale = T)
##this PCA model excludes the Southern variable
model2 = prcomp(crime[,c(1,3:16)], center = T, scale = T)
pv = cbind(model$sdev, c(model2$sdev, 0))  # PC standard deviations side by side (model2 padded with 0)
screeplt1= fviz_screeplot(model,addlabels = T)+labs(title = "All Variables")
screeplt2 = fviz_screeplot(model2,addlabels = T)+labs(title = "No Southern Variable")
figure <- ggarrange(screeplt1,screeplt2,ncol = 1, nrow = 2)
figure
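To back up the cumulative-variance figures quoted above, the proportions of variance explained can be computed directly from the PC standard deviations (a quick check; output not shown):
# cumulative proportion of variance explained by the leading PCs of each model
cumsum(model$sdev^2) / sum(model$sdev^2)    # all variables
cumsum(model2$sdev^2) / sum(model2$sdev^2)  # without the Southern variable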
So I decided to use all the variables for the PCA, and the result is the first model above. I decide to go with 5 PCs for the regression.
I plot the data points in the new PC dimensions. The warmer a point's color, the higher its cos2 value, meaning it is better represented by (more strongly correlated with) the displayed PCs.
indpcplot1= fviz_pca_ind(model, axes = c(1,2), col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
indpcplot2 = fviz_pca_ind(model, axes = c(3,4), col.ind = "cos2", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
figure <- ggarrange(indpcplot1,indpcplot2,ncol = 1, nrow = 2)
figure
I also plot the original variables in the new PC dimensions; here the color encodes each variable's contribution to the displayed PCs.
varpcplot1= fviz_pca_var(model, axes = c(1,2), col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
varpcplot2= fviz_pca_var(model, axes = c(3,4), col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
figure <- ggarrange(varpcplot1,varpcplot2,ncol = 1, nrow = 2)
figure
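To see more precisely which original variables drive each PC, factoextra can also display the contributions as bar charts (a sketch; output not shown):
# contribution of each original variable to PC1 and to PC2
fviz_contrib(model, choice = "var", axes = 1)
fviz_contrib(model, choice = "var", axes = 2)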
Before doing the regression, I standardize the original data (center each column and divide by its standard deviation) so that the response is on the same scale as the PC scores; this gives the newcrime data. After predicting, I will transform the result back to the original units.
I also extract the loading matrix from the PCA result for later use.
# standardize every column of the original data (same centering and scaling as prcomp)
columnmeans = colMeans(crime)
columnsd = sapply(crime, sd)
newcrime = matrix(, nrow = 47, ncol = 16)
for (i in 1:16){
  newcrime[,i] = (crime[,i] - columnmeans[i])/columnsd[i]
}
newcrime = as.data.frame(newcrime)
summary(model)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4944 1.7111 1.4208 1.19585 1.06341 0.75087 0.60237
## Proportion of Variance 0.3889 0.1830 0.1262 0.08938 0.07068 0.03524 0.02268
## Cumulative Proportion 0.3889 0.5719 0.6981 0.78744 0.85812 0.89336 0.91603
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55503 0.49244 0.47036 0.43856 0.41777 0.29147 0.26063
## Proportion of Variance 0.01925 0.01516 0.01383 0.01202 0.01091 0.00531 0.00425
## Cumulative Proportion 0.93529 0.95044 0.96427 0.97629 0.98720 0.99251 0.99676
## PC15 PC16
## Standard deviation 0.21813 0.06584
## Proportion of Variance 0.00297 0.00027
## Cumulative Proportion 0.99973 1.00000
# loading matrix (rotation) of the first 5 PCs, plus row sums for reference
loadings = model$rotation[,1:5]
loadingssum = apply(loadings, 1, function(x) sum(x))
Then I fit a linear regression on the PC scores using 5-fold cross validation, and I use the mean squared error (MSE) on each held-out fold as my indicator of goodness of fit.
Because the response has been standardized, the fold-wise MSE values come out very low; they can be inspected right after the R-squared values printed below.
library(caret)
## Loading required package: lattice
set.seed(11)
# combine the first 5 PC scores with the scaled response (column V6)
cv_full_data = as.data.frame(cbind(model$x[,1:5], newcrime[,16]))
samples <- createFolds(y = cv_full_data[,6], k = 5)
mselist = c()
coefficientdf = matrix(, nrow = 0, ncol = 4)
rsqlist = c()
for (i in 1:5){
  # hold out fold i, train on the remaining folds
  test_sample <- cv_full_data[samples[[i]],]
  train_sample <- cv_full_data[-samples[[i]],]
  lrmodel = lm(V6 ~ PC1 + PC2 + PC3 + PC4 + PC5, data = train_sample)
  p = predict(lrmodel, test_sample)
  mse = mean((test_sample$V6 - p)^2)
  mselist = c(mselist, mse)
  coefficients = summary(lrmodel)$coefficients
  coefficientdf = rbind(coefficientdf, coefficients)
  rsqlist = c(rsqlist, summary(lrmodel)$r.squared)
}
knitr::kable(coefficientdf)
|   | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -0.0105649 | 0.0583987 | -0.1809100 | 0.8575447 |
| PC1 | 0.1828170 | 0.0238526 | 7.6644423 | 0.0000000 |
| PC2 | -0.2386618 | 0.0337460 | -7.0723032 | 0.0000000 |
| PC3 | 0.0861524 | 0.0397226 | 2.1688498 | 0.0374005 |
| PC4 | -0.5641051 | 0.0474047 | -11.8997703 | 0.0000000 |
| PC5 | 0.1042511 | 0.0654031 | 1.5939778 | 0.1204754 |
| (Intercept) | 0.0315471 | 0.0636961 | 0.4952758 | 0.6240139 |
| PC1 | 0.2041121 | 0.0244072 | 8.3627949 | 0.0000000 |
| PC2 | -0.2151018 | 0.0371996 | -5.7823741 | 0.0000026 |
| PC3 | 0.0843171 | 0.0439652 | 1.9178130 | 0.0646984 |
| PC4 | -0.5188083 | 0.0611355 | -8.4862081 | 0.0000000 |
| PC5 | 0.1464174 | 0.0605796 | 2.4169408 | 0.0219408 |
| (Intercept) | -0.0146769 | 0.0589547 | -0.2489516 | 0.8049892 |
| PC1 | 0.2038253 | 0.0245213 | 8.3121692 | 0.0000000 |
| PC2 | -0.2271544 | 0.0339961 | -6.6817801 | 0.0000002 |
| PC3 | 0.1039358 | 0.0426976 | 2.4342285 | 0.0206829 |
| PC4 | -0.5459811 | 0.0484723 | -11.2637662 | 0.0000000 |
| PC5 | 0.1081045 | 0.0526829 | 2.0519870 | 0.0484279 |
| (Intercept) | 0.0621212 | 0.0571977 | 1.0860790 | 0.2855581 |
| PC1 | 0.1856399 | 0.0227869 | 8.1467675 | 0.0000000 |
| PC2 | -0.2700878 | 0.0351675 | -7.6800496 | 0.0000000 |
| PC3 | 0.1527147 | 0.0439095 | 3.4779443 | 0.0014782 |
| PC4 | -0.5599658 | 0.0497201 | -11.2623519 | 0.0000000 |
| PC5 | 0.1575920 | 0.0516872 | 3.0489545 | 0.0045825 |
| (Intercept) | -0.0531016 | 0.0630829 | -0.8417748 | 0.4063612 |
| PC1 | 0.2074998 | 0.0271854 | 7.6327628 | 0.0000000 |
| PC2 | -0.2442538 | 0.0375338 | -6.5075775 | 0.0000003 |
| PC3 | 0.1157941 | 0.0446701 | 2.5922041 | 0.0144181 |
| PC4 | -0.5533537 | 0.0501055 | -11.0437761 | 0.0000000 |
| PC5 | 0.1563405 | 0.0579853 | 2.6962098 | 0.0112331 |
print(rsqlist)
## [1] 0.8872666 0.8631892 0.8930504 0.9057765 0.8719073
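The fold-wise MSEs mentioned earlier are stored in mselist but not printed above; one way to make them interpretable (a sketch; output not shown) is to convert each fold's RMSE back to the original Crime units:
# MSE is on the standardized response; multiplying the RMSE by the standard
# deviation of Crime expresses the typical prediction error in the original units
print(mselist)
print(sqrt(mselist) * columnsd[16])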
Above are the results for all five CV folds.
The training R-squared values are all similar, roughly between 0.86 and 0.91, so each fold fits its training data well.
From the coefficient table, the intercept is essentially zero and clearly not significant, which is expected since the data were centered. PC1, PC2 and PC4 are highly significant in every fold; PC3 and PC5 are significant in most folds but only marginal in others. PC1 and PC2 have coefficients of similar magnitude but opposite sign, and PC4 has the largest effect on the response.
I use this regression model to predict the crime value for the given input, taking the coefficients fitted in the 2nd CV fold (rows 7-12 of the coefficient table).
# extract the coefficients from the 2nd CV fold (rows 7-12 of coefficientdf)
lmcoeff = coefficientdf[7:12,1]
input = c(
M = 14.0,
So = 0,
Ed = 10.0,
Po1 = 12.0,
Po2 = 15.5,
LF = 0.640,
M.F = 94.0,
Pop = 150,
NW = 1.1,
U1 = 0.120,
U2 = 3.6,
Wealth = 3200,
Ineq = 20.1,
Prob = 0.04,
Time = 39.0)
# scaled prediction = intercept + sum over PCs of (coefficient_i * input's score on PC i)
finalpred = 0
for (i in 1:5){
  colsum = 0   # reset for each PC: contribution of PC i to the scaled prediction
  for (j in 1:15){
    a = lmcoeff[i+1] * loadings[j,i] * (input[j] - columnmeans[j])/columnsd[j]
    colsum = colsum + a
  }
  finalpred = finalpred + colsum
}
finalpred = finalpred + lmcoeff[1]
# transform the scaled prediction back to the original Crime units
finalpredict = finalpred*columnsd[16] + columnmeans[16]
finalpredict
## PC1
## 2156.721
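The same computation can also be written compactly in matrix form (a sketch of the equivalent calculation, assuming each PC's contribution is summed exactly once; output not shown):
# score the new input on PC1-PC5, then apply the regression coefficients
scaled_input = (input - columnmeans[1:15]) / columnsd[1:15]
pc_scores    = as.numeric(scaled_input %*% loadings[1:15, ])
pred_scaled  = lmcoeff[1] + sum(lmcoeff[2:6] * pc_scores)
pred_scaled * columnsd[16] + columnmeans[16]    # back to original Crime units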
As shown above, the final prediction for crime is about 2157, which is beyond the range of the Crime values in the original data. Since this prediction seems too high, I also try a reduced linear regression that keeps only PC1, PC2 and PC5 (dropping PC3 and PC4) to see how the prediction changes.
I repeat the 5-fold cross validation for this reduced model as well, and again pick one fold's coefficients for the prediction.
# repeat the 5-fold CV with only PC1, PC2 and PC5 as predictors
mselist2 = c()
coefficientdf2 = matrix(, nrow = 0, ncol = 4)
rsqlist2 = c()
for (i in 1:5){
  test_sample <- cv_full_data[samples[[i]],]
  train_sample <- cv_full_data[-samples[[i]],]
  lrmodel = lm(V6 ~ PC1 + PC2 + PC5, data = train_sample)
  p = predict(lrmodel, test_sample)
  mse = mean((test_sample$V6 - p)^2)
  mselist2 = c(mselist2, mse)
  coefficients = summary(lrmodel)$coefficients
  coefficientdf2 = rbind(coefficientdf2, coefficients)
  rsqlist2 = c(rsqlist2, summary(lrmodel)$r.squared)
}
# extract the coefficients fitted in the 2nd CV fold (rows 5-8 of coefficientdf2)
lmcoeff2 = coefficientdf2[5:8,1]
knitr::kable(coefficientdf2)
|   | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -0.0053310 | 0.1322794 | -0.0403013 | 0.9680819 |
| PC1 | 0.1915169 | 0.0540282 | 3.5447604 | 0.0011382 |
| PC2 | -0.2209979 | 0.0763490 | -2.8945757 | 0.0064971 |
| PC5 | 0.2125099 | 0.1462619 | 1.4529407 | 0.1551478 |
| (Intercept) | -0.0144060 | 0.1141162 | -0.1262399 | 0.9003319 |
| PC1 | 0.1759272 | 0.0435363 | 4.0409336 | 0.0003121 |
| PC2 | -0.2502902 | 0.0665351 | -3.7617785 | 0.0006805 |
| PC5 | 0.1490230 | 0.1087104 | 1.3708262 | 0.1799681 |
| (Intercept) | -0.0306512 | 0.1291058 | -0.2374113 | 0.8137625 |
| PC1 | 0.2010474 | 0.0537406 | 3.7410704 | 0.0006756 |
| PC2 | -0.2532496 | 0.0741899 | -3.4135307 | 0.0016737 |
| PC5 | 0.0921799 | 0.1148911 | 0.8023238 | 0.4279384 |
| (Intercept) | 0.0894205 | 0.1285137 | 0.6958054 | 0.4912798 |
| PC1 | 0.2323975 | 0.0510999 | 4.5479061 | 0.0000657 |
| PC2 | -0.2661636 | 0.0798236 | -3.3343988 | 0.0020745 |
| PC5 | 0.1016820 | 0.1175452 | 0.8650466 | 0.3930759 |
| (Intercept) | -0.0552218 | 0.1381241 | -0.3997985 | 0.6918810 |
| PC1 | 0.1842553 | 0.0594928 | 3.0971007 | 0.0039724 |
| PC2 | -0.2191160 | 0.0826273 | -2.6518596 | 0.0122062 |
| PC5 | 0.1295725 | 0.1276173 | 1.0153206 | 0.3173397 |
print(rsqlist2)
## [1] 0.8872666 0.8631892 0.8930504 0.9057765 0.8719073 0.3364407
Compared with the 5-PC model, the reduced model explains noticeably less of the variance, so the fit does not improve. From the coefficient table, PC1 and PC2 remain clearly significant in every fold, the intercept stays near zero and insignificant, and PC5 is not significant in any fold. I again take the coefficients from the 2nd CV fold.
As before, I predict the crime value with this new regression model.
finalpred = 0
for (i in 1:3){
  colsum = 0   # reset for each retained PC (PC1, PC2, PC5)
  for (j in 1:15){
    a = lmcoeff2[i+1] * loadings[,-c(3:4)][j,i] * (input[j] - columnmeans[j])/columnsd[j]
    colsum = colsum + a
  }
  finalpred = finalpred + colsum
}
finalpred2 = finalpred + lmcoeff2[1]
# transform the scaled prediction back to the original Crime units
finalpredict2 = finalpred2*columnsd[16] + columnmeans[16]
print(finalpredict2)
## PC1
## 1752.131
This time, the predicted crime value is about 1752. This value is within the range of the original data, although toward its upper end, and it is much lower than the previous prediction. Overall, PCA still proves to be a valid approach for dealing with high-dimensional data: it returns a lower-dimensional representation that retains a good amount of information for linear regression.
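As a sanity check on the statements about the data range (a sketch; output not shown), the observed range of Crime can be compared directly with the two predictions:
# observed range of the response versus the two PCA-regression predictions
range(crime$Crime)
c(finalpredict, finalpredict2)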