A multivariate normal distribution is made up of multiple single dimentional normal distributions. To accomplish a normal distribution for each of the variables, each variable will be graphed to visually detect the linearity. The following series of variables illustrates the correction for each variable.
First some data work is needed to set up the data set we will be using.
## 'data.frame': 381 obs. of 15 variables:
## $ CSA : int 999 184 999 440 104 106 999 408 999 108 ...
## $ CBSA : int 10180 10420 10500 10540 10580 10740 10780 10900 11020 11100 ...
## $ Name : Factor w/ 381 levels "AbileneTX","AkronOH",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ TotalV : int 58424 181440 30971 77351 406696 436320 57600 261907 17545 173848 ...
## $ OneV : int 58424 173540 28674 65175 296663 407739 56565 221371 16215 128990 ...
## $ TwoV : int 0 0 1651 409 8161 1700 1035 350 640 575 ...
## $ ThreeFourV : int 0 1900 458 0 16113 5618 0 2164 0 2279 ...
## $ FiveMoreV : int 0 6000 188 11767 85759 21263 0 38022 690 42004 ...
## $ TotalN : int 284 763 271 393 2231 2543 323 1801 100 918 ...
## $ OneN : int 284 684 235 270 1203 2128 311 1051 88 496 ...
## $ TwoN : int 0 0 22 2 62 12 12 2 4 6 ...
## $ ThreeFourN : int 0 16 9 0 127 67 0 14 0 44 ...
## $ FiveMoreN : int 0 63 5 121 839 336 0 734 8 372 ...
## $ FiveStructN: int 0 1 1 11 70 15 0 38 1 13 ...
## $ Units : int 70507 313241 66262 49304 397239 380455 66123 344181 56067 107463 ...
one <- permits$OneV/permits$OneN
two <- permits$TwoV/(permits$TwoN*2)
three <- permits$ThreeFourV/(permits$ThreeFourN*3.5)
five <- permits$FiveMoreV/(permits$FiveMoreN*6)
total <- permits$TotalV/permits$TotalN
perunit <- data.frame(cbind(one, two, three, five, total))
colnames(perunit) <- c("one", "two", "three", "five", "total")
str(perunit)
## 'data.frame': 381 obs. of 5 variables:
## $ one : num 206 254 122 241 247 ...
## $ two : num NaN NaN 37.5 102.2 65.8 ...
## $ three: num NaN 33.9 14.5 NaN 36.2 ...
## $ five : num NaN 15.87 6.27 16.21 17.04 ...
## $ total: num 206 238 114 197 182 ...
For future use, I will reassign these new variables to the original dataset, “permits.”
permits$single <- perunit$one
permits$multi <- apply(perunit[,2:4], 1, mean, na.rm = TRUE)
permits$perunit <- apply(perunit, 1, mean, na.rm = TRUE)
There may be some value in exploring
summary(permits$single)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.35 184.30 212.60 219.10 247.80 545.10
summary(permits$multi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.556 25.460 34.480 37.110 43.890 510.600 33
To find out, we can do a Q-Q plot in R and eye-ball it to determine if the variable is normally distributed.
Perunit cost of: All units built Single unit dwellings Multiunit dwellings Two unit dwellings Three/four unit dwellingz Five-of-more unit dwellings
Number of units
The red line represents the linear model
par(mfrow = c(3,3), mar = c(2,2,1,1))
qqnorm(perunit$total,main = "Per Unit - All Units")
qqline(perunit$total,col = "red")
qqnorm(perunit$one,main = "Per Unit - Single Units")
qqline(perunit$one,col = "red")
qqnorm(perunit$two,main = "Per Unit -Two Units")
qqline(perunit$two,col = "red")
qqnorm(perunit$three,main = "Per Unit - Three Units")
qqline(perunit$three,col = "red")
qqnorm(perunit$five,main = "Per Unit - Five-or-more Units")
qqline(perunit$five,col = "red")
qqnorm(log(permits$TotalN),main = "Total Number of Units")
qqline(log(permits$TotalN),col = "red")
qqnorm(permits$multi, main = "All Multi Unit")
qqline(permits$multi,col = "red")
Ok, now that the variables have been checked (and passed) for normality, I will combine them into one dataset.
dev.off()
## null device
## 1
data <- data.frame(cbind(perunit[,1:5],log(permits$TotalN),permits$Name),row.names = NULL)
colnames(data) <- c("one", "two", "three", "five", "total","totaln","name")
data <- na.omit(data)
#note : this omits the names of locations with incomplete records
data.names <- data[,6]
data.pca <- prcomp(data[,1:6],
center = T,
scale. = T)
print(data.pca)
## Standard deviations:
## [1] 1.6427419 0.9742211 0.9307107 0.8672024 0.7768310 0.3613358
##
## Rotation:
## PC1 PC2 PC3 PC4 PC5
## one 0.5341055 -0.06606659 -0.09289466 0.4385035 0.12537952
## two 0.2841587 0.35667362 -0.76589337 -0.4440871 -0.09022784
## three 0.4020910 -0.02038588 0.32989565 -0.4731929 0.70446058
## five 0.3837891 0.18115432 0.50293671 -0.3394017 -0.66103990
## total 0.5233138 0.14644462 -0.02253836 0.4865521 -0.04841227
## totaln 0.2261783 -0.90207574 -0.20613945 -0.1861809 -0.20138696
## PC6
## one 0.702660529
## two 0.009524601
## three -0.094339218
## five 0.121558294
## total -0.681991144
## totaln -0.131868250
plot(data.pca, type = "l")
summary(data.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.6427 0.9742 0.9307 0.8672 0.7768 0.36134
## Proportion of Variance 0.4498 0.1582 0.1444 0.1253 0.1006 0.02176
## Cumulative Proportion 0.4498 0.6079 0.7523 0.8777 0.9782 1.00000
library(devtools)
library(ggbiplot)
g <- ggbiplot(data.pca, obs.scale = 1, var.scale = 1,
groups = NULL, ellipse = TRUE,
circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)