mu <- c(2,1,2)
Sigma <- matrix(c(2,1,1,1,3,0,1,0,1), nrow=3, ncol=3)
Sigma

     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    1    3    0
[3,]    1    0    1
a <- c(1,-1,0)
a%*%mu

     [,1]
[1,]    1
t(a)%*%mu

     [,1]
[1,]    1

Note that R treats the plain vector a as a row vector in a%*%mu, so a%*%mu and t(a)%*%mu return the same 1x1 matrix, namely a'mu = 1(2) + (-1)(1) + 0(2) = 1.
a%*%Sigma%*%a

     [,1]
[1,]    3
t(a)%*%Sigma%*%a

     [,1]
[1,]    3
A <- matrix(c(1,1,1,1,-1,0), nrow=2, ncol=3, byrow=TRUE)
A

     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1   -1    0
A%*%mu

     [,1]
[1,]    5
[2,]    1
A%*%Sigma%*%t(A)

     [,1] [,2]
[1,]   10    0
[2,]    0    3
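As a quick sanity check of these formulas, E(AX) = A mu and Cov(AX) = A Sigma A' can be verified empirically. The following is a minimal simulation sketch (not part of the original lab), assuming X ~ N3(mu, Sigma) and using mvrnorm() from the MASS package.

library(MASS, quietly=TRUE)
set.seed(1)                                  # for reproducibility
X <- mvrnorm(n=100000, mu=mu, Sigma=Sigma)   # 100000 simulated draws of X
Y <- X %*% t(A)                              # each row of Y is A %*% x
round(colMeans(Y), 2)                        # should be close to A%*%mu = (5, 1)
round(cov(Y), 2)                             # should be close to A%*%Sigma%*%t(A)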
setwd("C:/Users/asank/Desktop/My Personal/01-JobWork/UOM Docs/MultivariateMethods/Intake 20/RLabs")
data <- read.table("datasets/T4-1.dat")
data

     V1
1  0.15
2  0.09
3  0.18
4  0.10
5  0.05
6  0.12
7  0.08
8  0.05
9  0.08
10 0.10
11 0.07
12 0.02
13 0.01
14 0.10
15 0.10
16 0.10
17 0.02
18 0.10
19 0.01
20 0.40
21 0.10
22 0.05
23 0.03
24 0.05
25 0.15
26 0.10
27 0.15
28 0.09
29 0.08
30 0.18
31 0.10
32 0.20
33 0.11
34 0.30
35 0.02
36 0.20
37 0.20
38 0.30
39 0.30
40 0.40
41 0.30
42 0.05
The R base functions qqnorm() and qqline() can be used to produce normal quantile-quantile (Q-Q) plots.
qqnorm(): Produces a normal Q-Q plot of the variable
qqline(): Adds a reference line
qqnorm(data$V1, pch=19, frame=FALSE)
qqline(data$V1, col="blue", lwd=2)
It is also possible to use the function qqPlot() in the car package.
library(car, quietly=TRUE)
qqPlot(data$V1, pch=19)
[1] 20 40
It appears from the plot that the data as a whole are not normally distributed. The points labelled 20 and 40 appear to be outliers, that is, values that are too large relative to the rest of the observations.
For the radiation data, several observations are equal. When this occurs, those observations with like values are associated with the same sample quantile.
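For intuition, here is a sketch of the coordinates that qqnorm() plots, assuming the standard ppoints() convention: tied observations repeat the same sample quantile but are paired with distinct theoretical quantiles.

n <- length(data$V1)
theo <- qnorm(ppoints(n))   # theoretical standard normal quantiles
samp <- sort(data$V1)       # ordered sample quantiles; ties repeat
head(cbind(theo, samp))     # first few (theoretical, sample) pairs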
shapiro.test(data$V1)
	Shapiro-Wilk normality test

data:  data$V1
W = 0.85793, p-value = 9.902e-05
We reject the null hypothesis of normality at the 5% level of significance since the p-value = 9.902e-05 < 0.05. We can also produce a histogram to see visually that the data are not normally distributed.
hist(data$V1, col="steelblue", main="Histogram of radiation data")
We can see that the distribution is right-skewed and does not have the typical "bell shape" associated with a normal distribution. Thus, the histogram agrees with the result of the Shapiro-Wilk test: the data do not appear to come from a normal distribution.
For right-skewed data, common transformations include the square root, cube root, and log.
For left-skewed data, common transformations include the square root, cube root, and log of (constant - x). Because log(0) is undefined, as is the log of any negative number, a constant should be added to all values to make them positive before a log transformation is applied. It is also sometimes helpful to add a constant when using other transformations.
Another approach is to use a general power transformation, such as the Box-Cox transformation. This procedure estimates a lambda value, which is then used as the power when transforming the values.
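For reference, the Box-Cox family is (x^lambda - 1)/lambda for lambda != 0 and log(x) for lambda = 0, defined for strictly positive x. A minimal sketch of a helper function (hypothetical, not part of MASS):

bc <- function(x, lambda) {
  # Box-Cox transform; x must be strictly positive
  if (lambda == 0) log(x) else (x^lambda - 1)/lambda
}

Note that bc(x, 1/3) equals 3*x^(1/3) - 3, i.e., the cube root transformation up to a linear rescaling.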
The square root transformation improves the distribution of the data somewhat.
V1sqrt <- sqrt(data$V1)
hist(V1sqrt, col="steelblue", main="")
qqPlot(V1sqrt, pch=19)
[1] 20 40
shapiro.test(V1sqrt)
	Shapiro-Wilk normality test

data:  V1sqrt
W = 0.95242, p-value = 0.07889
The cube root transformation is stronger than the square root transformation.
V1cub <- sign(data$V1) * abs(data$V1)^(1/3)  # sign() trick: x^(1/3) returns NaN for negative x
hist(V1cub, col="steelblue", main="")
qqPlot(V1cub, pch=19)
[1] 20 40
shapiro.test(V1cub)
	Shapiro-Wilk normality test

data:  V1cub
W = 0.96457, p-value = 0.2147
The log transformation is a relatively strong transformation. Because certain measurements in nature are naturally log-normal, it is often a successful transformation for certain data sets.
V1log <- log(data$V1)
hist(V1log, col="steelblue", main="")
qqPlot(V1log, pch=19)
[1] 13 19
shapiro.test(V1log)
	Shapiro-Wilk normality test

data:  V1log
W = 0.93878, p-value = 0.02591
It seems that the log transformation does not work well for the radiation data!
The Box-Cox procedure is available with the boxcox() function in the MASS package. However, a few steps are needed to extract the lambda value and transform the data set.
You must compute a linear model with the lm() function and pass it to the boxcox() function as shown below in order to determine the appropriate lambda value.
library(MASS, quietly=TRUE)
boxcox(lm(data$V1 ~ 1))
Keep in mind that the middle dashed vertical line marks the estimated parameter lambda-hat, which here appears to lie between 0 and 0.5. Note that if lambda-hat were around 0, the logarithmic transformation of the data would be used.
b <- boxcox(lm(data$V1 ~ 1))
lambda <- b$x[which.max(b$y)]
lambda
[1] 0.2626263
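As an optional follow-up (not part of the original lab), the estimated lambda can be applied directly and the transformed data re-tested:

V1bc <- (data$V1^lambda - 1)/lambda   # Box-Cox transform with the estimated lambda
shapiro.test(V1bc)                    # re-check normality of the transformed data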
Since lambda-hat = 0.26 is close to 1/3, it seems that the cube root transformation works well for this data set.
data("USArrests") #install.packages("QuantPsyc") library(QuantPsyc, quietly=TRUE, warn.conflicts=FALSE)
# Perform multivariate normality test
mult.norm(USArrests)$mult.test
          Beta-hat      kappa      p-val
Skewness  4.362273 36.3522752 0.01397765
Kurtosis 24.860830  0.4392903 0.66045117
The mult.norm() function tests for multivariate normality using both the skewness and the kurtosis of the data set. Since the p-value for skewness is less than 0.05, we reject the null hypothesis of the test. That is, we have enough evidence to say that the four variables in our data set do not follow a multivariate normal distribution.
Note that both p-values, for skewness and for kurtosis, need to be non-significant before concluding that the variables in a given data set follow a multivariate normal distribution.
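For intuition, mult.norm() is based on Mardia's measures of multivariate skewness and kurtosis. Below is a rough sketch of those statistics under the usual Mardia definitions (using the maximum-likelihood covariance estimate); exact values may differ slightly from the QuantPsyc output depending on the covariance estimator used.

mardia <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X); p <- ncol(X)
  Xc <- scale(X, center=TRUE, scale=FALSE)   # centred data matrix
  Sinv <- solve(cov(X) * (n - 1) / n)        # inverse of the ML covariance estimate
  D <- Xc %*% Sinv %*% t(Xc)                 # D[i,j] = (xi - xbar)' Sinv (xj - xbar)
  b1 <- mean(D^3)                            # Mardia's multivariate skewness
  b2 <- mean(diag(D)^2)                      # Mardia's multivariate kurtosis
  c(skewness = b1,
    p.skew   = pchisq(n * b1 / 6, df = p*(p+1)*(p+2)/6, lower.tail = FALSE),
    kurtosis = b2,
    p.kurt   = 2 * pnorm(-abs((b2 - p*(p+2)) / sqrt(8*p*(p+2)/n))))
}
mardia(USArrests)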
# Consider the data on lumber
mydata <- read.table("datasets/T4-3.dat")
mydata1 <- cbind(mydata, ObsNo=c(1:30))
mydata1
     V1   V2   V3   V4 ObsNo
1  1889 1651 1561 1778     1
2  2403 2048 2087 2197     2
3  2119 1700 1815 2222     3
4  1645 1627 1110 1533     4
5  1976 1916 1614 1883     5
6  1712 1712 1439 1546     6
7  1943 1685 1271 1671     7
8  2104 1820 1717 1874     8
9  2983 2794 2412 2581     9
10 1745 1600 1384 1508    10
11 1710 1591 1518 1667    11
12 2046 1907 1627 1898    12
13 1840 1841 1595 1741    13
14 1867 1685 1493 1678    14
15 1859 1649 1389 1714    15
16 1954 2149 1180 1281    16
17 1325 1170 1002 1176    17
18 1419 1371 1252 1308    18
19 1828 1634 1602 1755    19
20 1725 1594 1313 1646    20
21 2276 2189 1547 2111    21
22 1899 1614 1422 1477    22
23 1633 1513 1290 1516    23
24 2061 1867 1646 2037    24
25 1856 1493 1356 1533    25
26 1727 1412 1238 1469    26
27 2168 1896 1701 1834    27
28 1655 1675 1414 1597    28
29 2326 2301 2065 2234    29
30 1490 1382 1214 1284    30
dotchart(mydata1$V1, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V2, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V3, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V4, labels=mydata1$ObsNo, pch=19)
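The four dot charts can also be drawn on a single device; a minimal sketch, assuming a 2-by-2 layout is wanted:

op <- par(mfrow=c(2, 2))                        # 2 x 2 plotting grid
for (v in c("V1", "V2", "V3", "V4")) {
  dotchart(mydata1[[v]], labels=mydata1$ObsNo, pch=19, main=v)
}
par(op)                                         # restore the layout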
It seems that observation 9 stands out as clearly different from the rest of the pattern.
pairs(mydata[1:4], pch=19)
round(scale(mydata[1:4]), 1)
        V1   V2   V3   V4
 [1,] -0.1 -0.3  0.2  0.2
 [2,]  1.5  0.9  1.9  1.5
 [3,]  0.7 -0.2  1.0  1.5
 [4,] -0.8 -0.4 -1.3 -0.6
 [5,]  0.2  0.5  0.3  0.5
 [6,] -0.6 -0.1 -0.2 -0.6
 [7,]  0.1 -0.2 -0.8 -0.2
 [8,]  0.6  0.2  0.7  0.5
 [9,]  3.3  3.3  3.0  2.7
[10,] -0.5 -0.5 -0.4 -0.7
[11,] -0.6 -0.5  0.0 -0.2
[12,]  0.4  0.5  0.4  0.5
[13,] -0.2  0.3  0.3  0.0
[14,] -0.1 -0.2 -0.1 -0.1
[15,] -0.1 -0.3 -0.4  0.0
[16,]  0.1  1.3 -1.1 -1.4
[17,] -1.8 -1.8 -1.7 -1.7
[18,] -1.5 -1.2 -0.8 -1.3
[19,] -0.2 -0.4  0.3  0.1
[20,] -0.6 -0.5 -0.6 -0.2
[21,]  1.1  1.4  0.1  1.2
[22,]  0.0 -0.4 -0.3 -0.8
[23,] -0.8 -0.7 -0.7 -0.6
[24,]  0.5  0.4  0.5  1.0
[25,] -0.2 -0.8 -0.5 -0.6
[26,] -0.6 -1.1 -0.9 -0.8
[27,]  0.8  0.5  0.6  0.3
[28,] -0.8 -0.2 -0.3 -0.4
[29,]  1.3  1.7  1.8  1.6
[30,] -1.3 -1.2 -1.0 -1.4
attr(,"scaled:center")
      V1       V2       V3       V4
1906.100 1749.533 1509.133 1724.967
attr(,"scaled:scale")
      V1       V2       V3       V4
324.9866 318.6065 303.1783 322.8436
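A simple univariate screen is to flag cells with large standardized values; the 2.5 cutoff below is an arbitrary choice, not from the lab:

z <- scale(mydata[1:4])
which(abs(z) > 2.5, arr.ind=TRUE)   # rows/columns with extreme standardized values

A screen like this flags observation 9 in every column but misses observation 16, whose coordinates are individually unremarkable; the squared Mahalanobis distances computed next pick it up.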
round(mahalanobis(mydata, center=colMeans(mydata), cov=cov(mydata)), 2)
 [1]  0.60  5.48  7.62  5.21  1.40  2.22  4.99  1.49 12.26  0.77  1.93  0.46
[13]  2.70  0.13  1.08 16.85  3.50  3.99  1.36  1.46  9.90  5.06  0.80  2.54
[25]  4.58  3.40  2.38  3.00  6.28  2.58
The two observations (9 and 16) with large squared distances stand out as clearly different from the rest of the pattern.
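To make "large" precise, the squared distances can be compared with an upper chi-square quantile with df = 4 (the number of variables); the 97.5th percentile used here is an assumed convention, not from the lab:

d2 <- mahalanobis(mydata, center=colMeans(mydata), cov=cov(mydata))
which(d2 > qchisq(0.975, df=4))   # observations exceeding the cutoff (about 11.14)

From the distances printed above, only observations 9 (12.26) and 16 (16.85) exceed this cutoff.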
If outliers are identified, they should be examined for content. Depending upon the nature of the outliers and the objectives of the investigation, outliers may be deleted or appropriately "weighted" in a subsequent analysis.