Multivariate Normal

A multivariate normal distribution is made up of multiple single dimentional normal distributions. To accomplish a normal distribution for each of the variables, each variable will be graphed to visually detect the linearity. The following series of variables illustrates the correction for each variable.

First some data work is needed to set up the data set we will be using.

## 'data.frame':    381 obs. of  15 variables:
##  $ CSA        : int  999 184 999 440 104 106 999 408 999 108 ...
##  $ CBSA       : int  10180 10420 10500 10540 10580 10740 10780 10900 11020 11100 ...
##  $ Name       : Factor w/ 381 levels "AbileneTX","AkronOH",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ TotalV     : int  58424 181440 30971 77351 406696 436320 57600 261907 17545 173848 ...
##  $ OneV       : int  58424 173540 28674 65175 296663 407739 56565 221371 16215 128990 ...
##  $ TwoV       : int  0 0 1651 409 8161 1700 1035 350 640 575 ...
##  $ ThreeFourV : int  0 1900 458 0 16113 5618 0 2164 0 2279 ...
##  $ FiveMoreV  : int  0 6000 188 11767 85759 21263 0 38022 690 42004 ...
##  $ TotalN     : int  284 763 271 393 2231 2543 323 1801 100 918 ...
##  $ OneN       : int  284 684 235 270 1203 2128 311 1051 88 496 ...
##  $ TwoN       : int  0 0 22 2 62 12 12 2 4 6 ...
##  $ ThreeFourN : int  0 16 9 0 127 67 0 14 0 44 ...
##  $ FiveMoreN  : int  0 63 5 121 839 336 0 734 8 372 ...
##  $ FiveStructN: int  0 1 1 11 70 15 0 38 1 13 ...
##  $ Units      : int  70507 313241 66262 49304 397239 380455 66123 344181 56067 107463 ...

one <- permits$OneV/permits$OneN
two <- permits$TwoV/(permits$TwoN*2)
three <- permits$ThreeFourV/(permits$ThreeFourN*3.5)
five <- permits$FiveMoreV/(permits$FiveMoreN*6)
total <- permits$TotalV/permits$TotalN

perunit <- data.frame(cbind(one, two, three, five, total))
colnames(perunit) <- c("one", "two", "three", "five", "total")

str(perunit)

## 'data.frame':    381 obs. of  5 variables:
##  $ one  : num  206 254 122 241 247 ...
##  $ two  : num  NaN NaN 37.5 102.2 65.8 ...
##  $ three: num  NaN 33.9 14.5 NaN 36.2 ...
##  $ five : num  NaN 15.87 6.27 16.21 17.04 ...
##  $ total: num  206 238 114 197 182 ...

For future use, I will reassign these new variables to the original dataset, “permits.”

permits$single <- perunit$one
permits$multi <- apply(perunit[,2:4], 1, mean, na.rm = TRUE)
permits$perunit <- apply(perunit, 1, mean, na.rm = TRUE)

There may be some value in exploring

summary(permits$single)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.35  184.30  212.60  219.10  247.80  545.10

summary(permits$multi)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   5.556  25.460  34.480  37.110  43.890 510.600      33

To find out, we can do a Q-Q plot in R and eye-ball it to determine if the variable is normally distributed.

Perunit cost of: All units built Single unit dwellings Multiunit dwellings Two unit dwellings Three/four unit dwellingz Five-of-more unit dwellings

Number of units

The red line represents the linear model

par(mfrow = c(3,3), mar  = c(2,2,1,1))
qqnorm(perunit$total,main = "Per Unit - All Units")
qqline(perunit$total,col = "red")

qqnorm(perunit$one,main = "Per Unit - Single Units")
qqline(perunit$one,col = "red")

qqnorm(perunit$two,main = "Per Unit -Two Units")
qqline(perunit$two,col = "red")

qqnorm(perunit$three,main = "Per Unit - Three Units")
qqline(perunit$three,col = "red")

qqnorm(perunit$five,main  = "Per Unit - Five-or-more Units")
qqline(perunit$five,col = "red")

qqnorm(log(permits$TotalN),main = "Total Number of Units")
qqline(log(permits$TotalN),col = "red")

qqnorm(permits$multi, main = "All Multi Unit")
qqline(permits$multi,col = "red")

Ok, now that the variables have been checked (and passed) for normality, I will combine them into one dataset.

dev.off()

## null device 
##           1

data <- data.frame(cbind(perunit[,1:5],log(permits$TotalN),permits$Name),row.names = NULL)
colnames(data) <- c("one", "two", "three", "five", "total","totaln","name")
data <- na.omit(data)
#note : this omits the names of locations with incomplete records
data.names <- data[,6]

data.pca <- prcomp(data[,1:6],
       center = T,
       scale. = T)

print(data.pca)

## Standard deviations:
## [1] 1.6427419 0.9742211 0.9307107 0.8672024 0.7768310 0.3613358
## 
## Rotation:
##              PC1         PC2         PC3        PC4         PC5
## one    0.5341055 -0.06606659 -0.09289466  0.4385035  0.12537952
## two    0.2841587  0.35667362 -0.76589337 -0.4440871 -0.09022784
## three  0.4020910 -0.02038588  0.32989565 -0.4731929  0.70446058
## five   0.3837891  0.18115432  0.50293671 -0.3394017 -0.66103990
## total  0.5233138  0.14644462 -0.02253836  0.4865521 -0.04841227
## totaln 0.2261783 -0.90207574 -0.20613945 -0.1861809 -0.20138696
##                 PC6
## one     0.702660529
## two     0.009524601
## three  -0.094339218
## five    0.121558294
## total  -0.681991144
## totaln -0.131868250

plot(data.pca, type = "l")

summary(data.pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6
## Standard deviation     1.6427 0.9742 0.9307 0.8672 0.7768 0.36134
## Proportion of Variance 0.4498 0.1582 0.1444 0.1253 0.1006 0.02176
## Cumulative Proportion  0.4498 0.6079 0.7523 0.8777 0.9782 1.00000

library(devtools)
library(ggbiplot)
g <- ggbiplot(data.pca, obs.scale = 1, var.scale = 1, 
              groups = NULL, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)

Multivariate Normal

Ryan Thomas

January 28, 2016