setwd("C:/Users/asank/Desktop/My Personal/01-JobWork/UOM Docs/MultivariateMethods/Intake21/RLabs")
# Radiation data
data <- read.table("datasets/T4-1.dat")
data
V1
1 0.15
2 0.09
3 0.18
4 0.10
5 0.05
6 0.12
7 0.08
8 0.05
9 0.08
10 0.10
11 0.07
12 0.02
13 0.01
14 0.10
15 0.10
16 0.10
17 0.02
18 0.10
19 0.01
20 0.40
21 0.10
22 0.05
23 0.03
24 0.05
25 0.15
26 0.10
27 0.15
28 0.09
29 0.08
30 0.18
31 0.10
32 0.20
33 0.11
34 0.30
35 0.02
36 0.20
37 0.20
38 0.30
39 0.30
40 0.40
41 0.30
42 0.05
We can use a histogram to visually assess whether the data are normally distributed.
hist(data$V1, probability=TRUE, col="steelblue", main="Histogram of radiation data")
We can see that the distribution is right-skewed and does not have the typical “bell shape” associated with a normal distribution.
The R base function qqnorm() can be used to produce a quantile-quantile (Q-Q) plot. qqnorm() produces a normal Q-Q plot of the variable, and qqline() adds a reference line.
qqnorm(data$V1, pch=19, frame=FALSE)
qqline(data$V1, col="blue", lwd=2)
It is also possible to use the qqPlot() function in the car package.
library(car)
qqPlot(data$V1, pch=19)
[1] 20 40
It appears from the plot that the data as a whole are not normally distributed. The points labelled 20 and 40 appear to be outliers: values that are too large relative to the rest of the observations.
For the radiation data, several observations are equal. When this occurs, those observations with like values are associated with the same sample quantile.
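We can tabulate the sample to see the tied values directly (a quick check in base R; output omitted here):
table(data$V1)  # counts of each distinct value; counts above 1 are ties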
shapiro.test(data$V1)
Shapiro-Wilk normality test
data: data$V1
W = 0.85793, p-value = 9.902e-05
We reject the null hypothesis of normality at the 5% level of significance, since the p-value = 9.902e-05 < 0.05.
For right-skewed data, common transformations include the square root, cube root, and log.
For left-skewed data, common transformations include the square root, cube root, and log of (constant - x). Because log(0) is undefined, as is the log of any negative number, a constant should be added to all values to make them positive before applying a log transformation. It is also sometimes helpful to add a constant when using other transformations.
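As a minimal sketch, assuming a hypothetical left-skewed vector x (not part of the radiation data), the reflect-then-transform approach looks like this:
# Hypothetical left-skewed sample: most values bunch near the upper end
x <- c(9.1, 9.5, 9.7, 9.8, 9.9, 9.95, 10)
# Reflect about a constant above max(x) so all reflected values are positive;
# note that reflection reverses the order of the observations
k <- max(x) + 1
x_sqrt <- sqrt(k - x)    # square root of (constant - x)
x_cub  <- (k - x)^(1/3)  # cube root of (constant - x)
x_log  <- log(k - x)     # log of (constant - x)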
Another approach is to use a general power transformation, such as the Box-Cox transformation. This procedure determines a lambda value, which is used as the power coefficient to transform values: the transformed data are (x^lambda - 1)/lambda for lambda != 0, and log(x) for lambda = 0.
The square root transformation improves the distribution of the data somewhat.
V1sqrt <- sqrt(data$V1)
hist(V1sqrt, probability=TRUE, col="steelblue", main="")
qqPlot(V1sqrt, pch=19)
[1] 20 40
shapiro.test(V1sqrt)
Shapiro-Wilk normality test
data: V1sqrt
W = 0.95242, p-value = 0.07889
With p-value = 0.07889 > 0.05, the square-root transformed data just pass the Shapiro-Wilk test at the 5% level. The cube root transformation is stronger than the square root transformation.
V1cub <- sign(data$V1) * abs(data$V1)^(1/3) # sign()*abs()^(1/3) gives real cube roots; ^(1/3) returns NaN for negative values in R
hist(V1cub, probability=TRUE, col="steelblue", main="")
qqPlot(V1cub, pch=19)
[1] 20 40
shapiro.test(V1cub)
Shapiro-Wilk normality test
data: V1cub
W = 0.96457, p-value = 0.2147
With p-value = 0.2147, the cube-root transformed data comfortably pass the Shapiro-Wilk test. The log transformation is a relatively strong transformation. Because certain measurements in nature are naturally log-normal, it is often a successful transformation for such data sets.
V1log <- log(data$V1)
hist(V1log, probability=TRUE, col="steelblue", main="")
qqPlot(V1log, pch=19)
[1] 13 19
shapiro.test(V1log)
Shapiro-Wilk normality test
data: V1log
W = 0.93878, p-value = 0.02591
It seems that for the radiation data, the log transformation doesn't work well!
The Box-Cox procedure is available with the boxcox() function in the MASS package. However, a few steps are needed to extract the lambda value and transform the data set. You must fit a linear model with the lm() function and pass it to boxcox(), as shown below, in order to determine the appropriate lambda value.
library(MASS)
boxcox(lm(data$V1 ~ 1))
Keep in mind that the dashed vertical line in the middle represents the estimated parameter lambda hat. It seems that the lambda hat value is between 0 and 0.5. Note that if the lambda hat value is around 0, the logarithmic transformation of the data should be used.
b <- boxcox(lm(data$V1 ~ 1))
lambda <- b$x[which.max(b$y)]
lambda
[1] 0.2626263
Since lambda hat = 0.26 is close to 1/3, it seems that the cube root transformation works well for this data set.
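As a minimal sketch, the extracted lambda can also be applied directly with the Box-Cox formula (V1bc is our own variable name, not part of MASS; output omitted):
# Apply the Box-Cox transformation using the extracted lambda;
# fall back to log() when lambda is essentially zero
if (abs(lambda) < 1e-8) {
  V1bc <- log(data$V1)
} else {
  V1bc <- (data$V1^lambda - 1) / lambda
}
hist(V1bc, probability=TRUE, col="steelblue", main="")
shapiro.test(V1bc)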
The function chisq.plot() in the mvoutlier R package plots the ordered Mahalanobis distances of the data against the quantiles of the chi-squared distribution.
If the data are multivariate normally distributed, the plot should resemble a straight line through the origin with slope 1. A systematic curved pattern suggests lack of normality. One or two points far above the line indicate large distances, or outlying observations, that merit further attention.
data("USArrests")
library(mvoutlier)
chisq.plot(USArrests, pch=19)
Remove outliers with left-click, stop with right-click on plotting device
$outliers
NULL
abline(a=0, b=1, col="red", lwd=2)
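Equivalently, the same plot can be constructed manually in base R; this sketch only illustrates the underlying computation and is not part of mvoutlier:
# Ordered squared Mahalanobis distances vs. chi-squared quantiles
d2 <- mahalanobis(USArrests, center=colMeans(USArrests), cov=cov(USArrests))
n <- nrow(USArrests); p <- ncol(USArrests)
q <- qchisq((1:n - 0.5) / n, df=p)  # theoretical quantiles, df = number of variables
plot(q, sort(d2), pch=19, xlab="Chi-squared quantiles", ylab="Ordered squared distances")
abline(a=0, b=1, col="red", lwd=2)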
The points do not fall on the straight line, indicating that the data probably do not follow a multivariate normal distribution.
library(QuantPsyc)
# Perform Multivariate normality test
mult.norm(USArrests)$mult.test
Beta-hat kappa p-val
Skewness 4.362273 36.3522752 0.01397765
Kurtosis 24.860830 0.4392903 0.66045117
The mult.norm() function tests for multivariate normality in both the skewness and kurtosis of the data set. Since the p-value for skewness is less than 0.05, we reject the null hypothesis of the test. That means we have enough evidence to say that the four variables in our data set do not follow a multivariate normal distribution.
Note that you need to check both p-values, for skewness and for kurtosis, before concluding that the variables in a given data set follow a multivariate normal distribution.
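As a convenience, a small wrapper (check.mvn() is our own hypothetical helper, not part of QuantPsyc) can apply that rule in one call:
# Hypothetical helper: conclude multivariate normality only if BOTH
# the skewness and kurtosis p-values exceed the significance level
check.mvn <- function(x, alpha=0.05) {
  pvals <- QuantPsyc::mult.norm(x)$mult.test[, "p-val"]
  all(pvals > alpha)
}
check.mvn(USArrests)  # FALSE here, since the skewness p-value is about 0.014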
# Consider the data on lumber
mydata <- read.table("datasets/T4-3.dat")
mydata1 <- cbind(mydata, ObsNo=c(1:30))
mydata1
V1 V2 V3 V4 ObsNo
1 1889 1651 1561 1778 1
2 2403 2048 2087 2197 2
3 2119 1700 1815 2222 3
4 1645 1627 1110 1533 4
5 1976 1916 1614 1883 5
6 1712 1712 1439 1546 6
7 1943 1685 1271 1671 7
8 2104 1820 1717 1874 8
9 2983 2794 2412 2581 9
10 1745 1600 1384 1508 10
11 1710 1591 1518 1667 11
12 2046 1907 1627 1898 12
13 1840 1841 1595 1741 13
14 1867 1685 1493 1678 14
15 1859 1649 1389 1714 15
16 1954 2149 1180 1281 16
17 1325 1170 1002 1176 17
18 1419 1371 1252 1308 18
19 1828 1634 1602 1755 19
20 1725 1594 1313 1646 20
21 2276 2189 1547 2111 21
22 1899 1614 1422 1477 22
23 1633 1513 1290 1516 23
24 2061 1867 1646 2037 24
25 1856 1493 1356 1533 25
26 1727 1412 1238 1469 26
27 2168 1896 1701 1834 27
28 1655 1675 1414 1597 28
29 2326 2301 2065 2234 29
30 1490 1382 1214 1284 30
dotchart(mydata1$V1, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V2, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V3, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V4, labels=mydata1$ObsNo, pch=19)
It seems that observation 9 stands out as clearly different from the rest of the pattern.
pairs(mydata[1:4], pch=19)
round(scale(mydata[1:4]), 1)
V1 V2 V3 V4
[1,] -0.1 -0.3 0.2 0.2
[2,] 1.5 0.9 1.9 1.5
[3,] 0.7 -0.2 1.0 1.5
[4,] -0.8 -0.4 -1.3 -0.6
[5,] 0.2 0.5 0.3 0.5
[6,] -0.6 -0.1 -0.2 -0.6
[7,] 0.1 -0.2 -0.8 -0.2
[8,] 0.6 0.2 0.7 0.5
[9,] 3.3 3.3 3.0 2.7
[10,] -0.5 -0.5 -0.4 -0.7
[11,] -0.6 -0.5 0.0 -0.2
[12,] 0.4 0.5 0.4 0.5
[13,] -0.2 0.3 0.3 0.0
[14,] -0.1 -0.2 -0.1 -0.1
[15,] -0.1 -0.3 -0.4 0.0
[16,] 0.1 1.3 -1.1 -1.4
[17,] -1.8 -1.8 -1.7 -1.7
[18,] -1.5 -1.2 -0.8 -1.3
[19,] -0.2 -0.4 0.3 0.1
[20,] -0.6 -0.5 -0.6 -0.2
[21,] 1.1 1.4 0.1 1.2
[22,] 0.0 -0.4 -0.3 -0.8
[23,] -0.8 -0.7 -0.7 -0.6
[24,] 0.5 0.4 0.5 1.0
[25,] -0.2 -0.8 -0.5 -0.6
[26,] -0.6 -1.1 -0.9 -0.8
[27,] 0.8 0.5 0.6 0.3
[28,] -0.8 -0.2 -0.3 -0.4
[29,] 1.3 1.7 1.8 1.6
[30,] -1.3 -1.2 -1.0 -1.4
attr(,"scaled:center")
V1 V2 V3 V4
1906.100 1749.533 1509.133 1724.967
attr(,"scaled:scale")
V1 V2 V3 V4
324.9866 318.6065 303.1783 322.8436
round(mahalanobis(mydata, center=colMeans(mydata), cov=cov(mydata)), 2)
[1] 0.60 5.48 7.62 5.21 1.40 2.22 4.99 1.49 12.26 0.77 1.93 0.46
[13] 2.70 0.13 1.08 16.85 3.50 3.99 1.36 1.46 9.90 5.06 0.80 2.54
[25] 4.58 3.40 2.38 3.00 6.28 2.58
The two observations (9 and 16) with large squared distances stand out as clearly different from the rest of the pattern.
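As a rough sketch, the same observations can be flagged programmatically by comparing the squared distances to an upper chi-squared quantile (the 0.975 cutoff is our own assumption):
# Flag observations whose squared distance exceeds the chi-squared cutoff (df = 4 variables)
d2 <- mahalanobis(mydata, center=colMeans(mydata), cov=cov(mydata))
which(d2 > qchisq(0.975, df=4))  # flags observations 9 and 16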
chisq.plot(mydata, pch=19)
Remove outliers with left-click, stop with right-click on plotting device
$outliers
NULL
abline(a=0, b=1, col="red", lwd=2)
If outliers are identified, they should be examined for content. Depending upon the nature of the outliers and the objectives of the investigation, outliers may be deleted or appropriately “weighted” in a subsequent analysis.