R Lab 4 (II): 2023-05-08

Properties of the multivariate normal distribution

Result 1 - Example 1

Result 1 states that if X ~ N3(mu, Sigma), then any linear combination a'X is univariate normal with mean a'mu and variance a'Sigma a. Here a = (1, -1, 0)', so a'X = X1 - X2.

mu <- c(2,1,2)
Sigma <- matrix(c(2,1,1,1,3,0,1,0,1), nrow=3, ncol=3)
Sigma
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    1    3    0
[3,]    1    0    1
a <- c(1,-1,0)

a%*%mu             # R conforms the vector a as a row vector, so this is a'mu
     [,1]
[1,]    1
t(a)%*%mu          # the explicit transpose gives the same 1 x 1 result
     [,1]
[1,]    1
a%*%Sigma%*%a      # a'Sigma a, the variance of a'X
     [,1]
[1,]    3
t(a)%*%Sigma%*%a
     [,1]
[1,]    3

Result 2 - Example 2

Result 2 states that if X ~ Np(mu, Sigma) and A is a q x p matrix of constants, then AX ~ Nq(A mu, A Sigma A'). Here the two rows of A give the combinations X1 + X2 + X3 and X1 - X2.

A <- matrix(c(1,1,1,1,-1,0), nrow=2, ncol=3, byrow=TRUE)
A
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1   -1    0
A%*%mu
     [,1]
[1,]    5
[2,]    1
A%*%Sigma%*%t(A)
     [,1] [,2]
[1,]   10    0
[2,]    0    3
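
Note that A Sigma A' is diagonal, so the two combinations are uncorrelated; being jointly normal, they are therefore also independent. As a quick empirical sanity check of both results (a sketch, using mvrnorm() from the MASS package, which this lab loads later anyway; the seed and sample size are arbitrary):

library(MASS, quietly=TRUE)
set.seed(123)                             # arbitrary seed for reproducibility
X <- mvrnorm(100000, mu=mu, Sigma=Sigma)  # 100000 draws from N3(mu, Sigma)
var(X %*% a)                              # should be close to a'Sigma a = 3
Y <- X %*% t(A)                           # each row is one draw of AX
colMeans(Y)                               # should be close to A mu = (5, 1)
cov(Y)                                    # should be close to A Sigma A' = diag(10, 3)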

Assessing the assumption of normality

# Set the working directory (adjust this path to your own machine)
setwd("C:/Users/asank/Desktop/My Personal/01-JobWork/UOM Docs/MultivariateMethods/Intake 20/RLabs")

Create a Q-Q plot for the radiation data

data <- read.table("datasets/T4-1.dat")
data
     V1
1  0.15
2  0.09
3  0.18
4  0.10
5  0.05
6  0.12
7  0.08
8  0.05
9  0.08
10 0.10
11 0.07
12 0.02
13 0.01
14 0.10
15 0.10
16 0.10
17 0.02
18 0.10
19 0.01
20 0.40
21 0.10
22 0.05
23 0.03
24 0.05
25 0.15
26 0.10
27 0.15
28 0.09
29 0.08
30 0.18
31 0.10
32 0.20
33 0.11
34 0.30
35 0.02
36 0.20
37 0.20
38 0.30
39 0.30
40 0.40
41 0.30
42 0.05

The R base functions qqnorm() and qqline() can be used to produce quantile-quantile plots.

qqnorm(): produces a normal Q-Q plot of the variable
qqline(): adds a reference line through the first and third sample quartiles

qqnorm(data$V1, pch=19, frame=FALSE)
qqline(data$V1, col="blue", lwd=2)
[Figure: normal Q-Q plot of the radiation data with reference line]
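
To see what qqnorm() is doing (a sketch; for n > 10, R's ppoints(n) uses the probability levels (i - 1/2)/n), we can build the plot by hand and also compute the correlation rQ between the ordered data and the theoretical quantiles, which should be close to 1 under normality:

n <- length(data$V1)
q_theory <- qnorm((1:n - 0.5)/n)    # standard normal quantiles at levels (i - 1/2)/n
plot(q_theory, sort(data$V1), pch=19,
     xlab="Standard normal quantiles", ylab="Ordered radiation data")
cor(q_theory, sort(data$V1))        # rQ: values near 1 support normality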

It is also possible to use the function qqPlot() in the car package.

library(car, quietly=TRUE)
qqPlot(data$V1, pch=19)
[Figure: car::qqPlot of the radiation data with confidence envelope; extreme points labelled]
[1] 20 40

The plot suggests that the data as a whole are not normally distributed. The points labelled 20 and 40 (both equal to 0.40) appear to be outliers: values that are too large relative to the rest of the observations.

For the radiation data, several observations are equal. When this occurs, those observations with like values are associated with the same sample quantile.
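
To see the ties concretely (a one-liner sketch):

table(data$V1)   # counts per distinct value; counts greater than 1 are tied observations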

A test of normality

shapiro.test(data$V1)
    Shapiro-Wilk normality test

data:  data$V1
W = 0.85793, p-value = 9.902e-05

We reject the null hypothesis of normality at the 5% level of significance since the p-value (9.902e-05) is less than 0.05. We can also produce a histogram to see visually that the data are not normally distributed.

hist(data$V1, col="steelblue", main="Histogram of radiation data")
[Figure: histogram of the radiation data]

We can see that the distribution is right-skewed and lacks the typical bell shape associated with a normal distribution. Thus the histogram agrees with the Shapiro-Wilk test and confirms that the data do not come from a normal distribution.

Transformations to near normality

For right-skewed data: common transformations include the square root, cube root, and log.

For left-skewed data: common transformations include the square root, cube root, and log of (constant - x). Because log(0) is undefined, as is the log of any negative number, a constant should be added to all values to make them positive before a log transformation. It is sometimes helpful to add a constant with the other transformations as well.

Another approach is a general power transformation, such as the Box-Cox transformation, which estimates a value lambda to use as the power when transforming the values.
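
For reference, the Box-Cox family for x > 0 is (x^lambda - 1)/lambda when lambda != 0, and log(x) when lambda = 0. A minimal sketch as an R function (the name bc_transform is ours, not part of the lab):

bc_transform <- function(x, lambda) {
  # Box-Cox power transformation; assumes all x are positive
  if (lambda == 0) log(x) else (x^lambda - 1)/lambda
}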

Square root transformation

The square root transformation improves the distribution of the data somewhat: the Shapiro-Wilk test below no longer rejects normality at the 5% level, although the p-value (0.079) is only marginally above 0.05.

V1sqrt <- sqrt(data$V1)
hist(V1sqrt, col="steelblue", main="")
[Figure: histogram of the square-root-transformed data]
qqPlot(V1sqrt, pch=19)
[Figure: Q-Q plot of the square-root-transformed data]
[1] 20 40
shapiro.test(V1sqrt)
    Shapiro-Wilk normality test

data:  V1sqrt
W = 0.95242, p-value = 0.07889

Cube root transformation

The cube root transformation is stronger than the square root transformation, and here it works better: the Shapiro-Wilk p-value below (0.21) gives no evidence against normality.

V1cub <- sign(data$V1) * abs(data$V1)^(1/3) # To avoid complex numbers for some cube roots
hist(V1cub, col="steelblue", main="")
[Figure: histogram of the cube-root-transformed data]
qqPlot(V1cub, pch=19)
[Figure: Q-Q plot of the cube-root-transformed data]
[1] 20 40
shapiro.test(V1cub)
    Shapiro-Wilk normality test

data:  V1cub
W = 0.96457, p-value = 0.2147

Log transformation

The log transformation is a relatively strong transformation. Because many measurements in nature are naturally log-normal, it is often successful for such data sets.

V1log <- log(data$V1)
hist(V1log, col="steelblue", main="")
[Figure: histogram of the log-transformed data]
qqPlot(V1log, pch=19)
[Figure: Q-Q plot of the log-transformed data]
[1] 13 19
shapiro.test(V1log)
    Shapiro-Wilk normality test

data:  V1log
W = 0.93878, p-value = 0.02591

It seems that the log transformation does not work well for the radiation data: the Shapiro-Wilk test again rejects normality at the 5% level (p = 0.026).

Box-Cox transformation

The Box-Cox procedure is available through the boxcox() function in the MASS package. However, a few steps are needed to extract the lambda value and transform the data set.

Fit an intercept-only linear model with lm() and pass it to boxcox(), as shown below, to obtain the profile log-likelihood used to determine the appropriate lambda value.

library(MASS, quietly=TRUE)
boxcox(lm(data$V1 ~ 1))
[Figure: Box-Cox profile log-likelihood with 95% confidence interval for lambda]

Keep in mind that the middle dashed vertical line marks the estimate lambda-hat, and the outer dashed lines mark its 95% confidence interval. Here lambda-hat appears to lie between 0 and 0.5. Note that if lambda-hat were close to 0, the logarithmic transformation would be the natural choice.

Extracting the exact lambda

b <- boxcox(lm(data$V1 ~ 1))
lambda <- b$x[which.max(b$y)]
lambda
[1] 0.2626263
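
With the estimated lambda in hand, the transformation itself is one line (a sketch; the object name V1bc is ours):

V1bc <- (data$V1^lambda - 1)/lambda   # Box-Cox transform with the estimated lambda
shapiro.test(V1bc)                    # re-check normality of the transformed data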

The estimate lambda-hat = 0.26 is close to 1/3, which agrees with our earlier finding that the cube root transformation works well for this data set.

Testing multivariate normality

data("USArrests")

#install.packages("QuantPsyc")
library(QuantPsyc, quietly=TRUE, warn.conflicts=FALSE)
# Perform Multivariate normality test
mult.norm(USArrests)$mult.test
          Beta-hat      kappa      p-val
Skewness  4.362273 36.3522752 0.01397765
Kurtosis 24.860830  0.4392903 0.66045117

The mult.norm() function tests for multivariate normality based on both the skewness and the kurtosis of the data set. Since the p-value for skewness is less than 0.05, we reject the null hypothesis of the test. That means we have enough evidence to say that the four variables in our data set do not follow a multivariate normal distribution.

Note that both p-values, for skewness and for kurtosis, need to be non-significant before concluding that the variables in a given data set follow a multivariate normal distribution.
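
A common visual companion to this test (a sketch, not part of the original lab): if the data are multivariate normal, the squared Mahalanobis distances should behave like a chi-squared sample with degrees of freedom equal to the number of variables, so a chi-squared Q-Q plot should be roughly a straight line through the origin with slope 1.

d2 <- mahalanobis(USArrests, center=colMeans(USArrests), cov=cov(USArrests))
qqplot(qchisq(ppoints(nrow(USArrests)), df=ncol(USArrests)), d2,
       xlab="Chi-squared quantiles (4 df)", ylab="Squared Mahalanobis distances")
abline(0, 1, col="blue", lwd=2)   # reference line for perfect agreement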

Detecting outliers

# Consider the data on lumber
mydata <- read.table("datasets/T4-3.dat")
mydata1 <- cbind(mydata, ObsNo=c(1:30))
mydata1
     V1   V2   V3   V4 ObsNo
1  1889 1651 1561 1778     1
2  2403 2048 2087 2197     2
3  2119 1700 1815 2222     3
4  1645 1627 1110 1533     4
5  1976 1916 1614 1883     5
6  1712 1712 1439 1546     6
7  1943 1685 1271 1671     7
8  2104 1820 1717 1874     8
9  2983 2794 2412 2581     9
10 1745 1600 1384 1508    10
11 1710 1591 1518 1667    11
12 2046 1907 1627 1898    12
13 1840 1841 1595 1741    13
14 1867 1685 1493 1678    14
15 1859 1649 1389 1714    15
16 1954 2149 1180 1281    16
17 1325 1170 1002 1176    17
18 1419 1371 1252 1308    18
19 1828 1634 1602 1755    19
20 1725 1594 1313 1646    20
21 2276 2189 1547 2111    21
22 1899 1614 1422 1477    22
23 1633 1513 1290 1516    23
24 2061 1867 1646 2037    24
25 1856 1493 1356 1533    25
26 1727 1412 1238 1469    26
27 2168 1896 1701 1834    27
28 1655 1675 1414 1597    28
29 2326 2301 2065 2234    29
30 1490 1382 1214 1284    30

Make a dot plot for each variable

dotchart(mydata1$V1, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V2, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V3, labels=mydata1$ObsNo, pch=19)
dotchart(mydata1$V4, labels=mydata1$ObsNo, pch=19)
[Figures: dot plots of V1-V4, labelled by observation number]

It seems that observation 9 stands out as clearly different from the rest of the pattern.

Make a scatter plot for each pair of variables

pairs(mydata[1:4], pch=19)
[Figure: scatter plot matrix of V1-V4]

Calculate the standardized values

round(scale(mydata[1:4]), 1)
        V1   V2   V3   V4
 [1,] -0.1 -0.3  0.2  0.2
 [2,]  1.5  0.9  1.9  1.5
 [3,]  0.7 -0.2  1.0  1.5
 [4,] -0.8 -0.4 -1.3 -0.6
 [5,]  0.2  0.5  0.3  0.5
 [6,] -0.6 -0.1 -0.2 -0.6
 [7,]  0.1 -0.2 -0.8 -0.2
 [8,]  0.6  0.2  0.7  0.5
 [9,]  3.3  3.3  3.0  2.7
[10,] -0.5 -0.5 -0.4 -0.7
[11,] -0.6 -0.5  0.0 -0.2
[12,]  0.4  0.5  0.4  0.5
[13,] -0.2  0.3  0.3  0.0
[14,] -0.1 -0.2 -0.1 -0.1
[15,] -0.1 -0.3 -0.4  0.0
[16,]  0.1  1.3 -1.1 -1.4
[17,] -1.8 -1.8 -1.7 -1.7
[18,] -1.5 -1.2 -0.8 -1.3
[19,] -0.2 -0.4  0.3  0.1
[20,] -0.6 -0.5 -0.6 -0.2
[21,]  1.1  1.4  0.1  1.2
[22,]  0.0 -0.4 -0.3 -0.8
[23,] -0.8 -0.7 -0.7 -0.6
[24,]  0.5  0.4  0.5  1.0
[25,] -0.2 -0.8 -0.5 -0.6
[26,] -0.6 -1.1 -0.9 -0.8
[27,]  0.8  0.5  0.6  0.3
[28,] -0.8 -0.2 -0.3 -0.4
[29,]  1.3  1.7  1.8  1.6
[30,] -1.3 -1.2 -1.0 -1.4
attr(,"scaled:center")
      V1       V2       V3       V4 
1906.100 1749.533 1509.133 1724.967 
attr(,"scaled:scale")
      V1       V2       V3       V4 
324.9866 318.6065 303.1783 322.8436 
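
A quick programmatic flag for unusually large standardized values (a sketch; the cutoff of 3 is a common rule of thumb, not part of the original lab):

which(abs(scale(mydata[1:4])) > 3, arr.ind=TRUE)   # entries with |z| > 3: observation 9 on V1 and V2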

Calculate the generalized squared distances

round(mahalanobis(mydata, center=colMeans(mydata), cov=cov(mydata)), 2)
 [1]  0.60  5.48  7.62  5.21  1.40  2.22  4.99  1.49 12.26  0.77  1.93  0.46
[13]  2.70  0.13  1.08 16.85  3.50  3.99  1.36  1.46  9.90  5.06  0.80  2.54
[25]  4.58  3.40  2.38  3.00  6.28  2.58
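
One common yardstick (a sketch, not part of the original lab) is to compare the squared distances with an upper percentage point of the chi-squared distribution with p = 4 degrees of freedom; by this rule observation 21 (9.90) is borderline as well, although 9 and 16 remain the most extreme:

d2 <- mahalanobis(mydata, center=colMeans(mydata), cov=cov(mydata))
qchisq(0.95, df=4)                # upper 5% point, about 9.49
which(d2 > qchisq(0.95, df=4))    # flags observations 9, 16 and, marginally, 21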

The two observations (9 and 16) with the largest squared distances stand out as clearly different from the rest of the pattern; observation 9 was also flagged by its standardized values.

If outliers are identified, they should be examined for content. Depending upon the nature of the outliers and the objectives of the investigation, outliers may be deleted or appropriately "weighted" in a subsequent analysis.