describeBy(iris[,1:4])
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Sepal.Length 1 150 5.84 0.83 5.80 5.81 1.04 4.3 7.9 3.6 0.31 -0.61 0.07
## Sepal.Width 2 150 3.06 0.44 3.00 3.04 0.44 2.0 4.4 2.4 0.31 0.14 0.04
## Petal.Length 3 150 3.76 1.77 4.35 3.76 1.85 1.0 6.9 5.9 -0.27 -1.42 0.14
## Petal.Width 4 150 1.20 0.76 1.30 1.18 1.04 0.1 2.5 2.4 -0.10 -1.36 0.06
The iris data set consist of observations about the structure of three flowers specimens. The observed variables are sepal length and width, and petal length and width. there are 150 observations, 50 for each species. The dataset exhibits some skewness and negative kurtosis
iris[,-5] %>% t() %>% mshapiro.test()
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.97935, p-value = 0.02342
Ho: {Sepal Lenght ^ Sepal Width ^ Petal Lenght ^ Petal Width} follow a multivariate normal distribution
Ha: one or more of the variables are not normally distributed
At a 5% significance level we can reject the null hypothesis that the variables have multivariate normality. However, at the 1% level we fail to reject the null hypothesis that the variables have multivariate normality.
iris[,-5] %>% jb.norm.test()
##
## Jarque-Bera test for normality
##
## data: .
## JB = 1.4978, p-value < 2.2e-16
help('jb.norm.test')
Ho: data is normally distributed
Ha: data is not normally distributed
We can reject the null hypothesis of normality at the 1% level given the small p-value obtained from the test
options(scipen = 20)
Norm.Test <- c('Anderson-Darling test','Cramer-von Mises test','Kolmogorov-Smirnov test','Pearson chi-square test','Shapiro-Francia test')
processor <- function(x){
a1 <- ad.test(x)$statistic
a2 <- cvm.test(x)$statistic
a3 <- lillie.test(x)$statistic
a4 <- pearson.test(x)$statistic
a5 <- sf.test(x)$statistic
vec <- c(a1,a2,a3,a4,a5)
return(vec)
}
processor.p <- function(x){
a1 <- ad.test(x)$p
a2 <- cvm.test(x)$p
a3 <- lillie.test(x)$p
a4 <- pearson.test(x)$p
a5 <- sf.test(x)$p
vec <- c(a1,a2,a3,a4,a5)
return(vec)
}
Sepal.Lenght.p <- processor.p(iris$Sepal.Length)
Sepal.Width.p <- processor.p(iris$Sepal.Width)
Petal.Lenght.p <- processor.p(iris$Petal.Length)
Petal.Width.p <- processor.p(iris$Petal.Width)
df.p <- data.frame(Sepal.Lenght.p, Sepal.Width.p, Petal.Lenght.p, Petal.Width.p)
b.p <- cbind(Norm.Test, df.p)
Sepal.Lenght <- processor(iris$Sepal.Length)
Sepal.Width <- processor(iris$Sepal.Width)
Petal.Lenght <- processor(iris$Petal.Length)
Petal.Width <- processor(iris$Petal.Width)
df <- data.frame(Sepal.Lenght, Sepal.Width, Petal.Lenght, Petal.Width)
b <- cbind(Norm.Test, df)
b
## Norm.Test Sepal.Lenght Sepal.Width Petal.Lenght Petal.Width
## 1 Anderson-Darling test 0.88919949 0.9079550 7.6785455 5.1056620
## 2 Cramer-von Mises test 0.12739762 0.1806514 1.2222851 0.7215556
## 3 Kolmogorov-Smirnov test 0.08865361 0.1056588 0.1981541 0.1728342
## 4 Pearson chi-square test 17.40000000 46.2000000 192.8000000 155.6000000
## 5 Shapiro-Francia test 0.97961292 0.9848323 0.8818577 0.9083072
b.p
## Norm.Test Sepal.Lenght.p Sepal.Width.p Petal.Lenght.p Petal.Width.p
## 1 Anderson-Darling test 0.022510515 0.020226513623 8.087375e-19 1.124861e-12
## 2 Cramer-von Mises test 0.047064590 0.009335620521 7.370000e-10 4.337563e-08
## 3 Kolmogorov-Smirnov test 0.005788395 0.000314214678 7.900863e-16 7.330204e-12
## 4 Pearson chi-square test 0.135159998 0.000006408907 9.956178e-35 4.131010e-27
## 5 Shapiro-Francia test 0.026210801 0.090202398614 1.844088e-08 3.005838e-07
The null hypothesis for all the tests is that the data is normally distributed. The second table contains the p-values for all the tests accross all the iris variables. Petal Length and Petal Width are not normally distributed given the small p-values(<0.01) for all the test results, since we can reject the null hypothesis that the data is normal distributed at the 1% level. Sepal Length seems to be normally distributed since the Kolmogorv-Smirnov test is the only one that rejects the null hypothesis at the 1% level, the remainder of test fail to reject the null hypothesis at the 1% level. The null hypothesis for Sepal Width normality can be rejected at the 1% level for the Anderson Darling test and the Shapiro-Francia test. Overall, only sepal lenght and sepal width seem to violate the normality assumption
par(mfrow=c(2,2))
plot(density(iris$Sepal.Length), main = 'Density Plot of Sepal Length', xlab = ' Sepal Length')
plot(density(iris$Sepal.Width), main = 'Density Plot of Sepal Width', xlab = ' Sepal Width')
plot(density(iris$Petal.Length), main = 'Density Plot of Petal Length', xlab = ' Petal Width')
plot(density(iris$Petal.Width), main = 'Density Plot of Petal Width', xlab = 'Petal Width')
Visually the variables do not seem normally distributed. Sepal lenght has a flat top with positive skewness. Sepal width seems to have positive kurtosis, hence the fat tails. Petal lenght and petal width are bimodal, which might indicate the presence of multiple populations
par(mfrow=c(2,2))
hist(iris$Sepal.Length, probability = T, main = 'Histogram of Sepal Length', xlab = 'Sepal length')
hist(iris$Sepal.Width, probability = T, main = 'Histogram of Sepal Width', xlab = 'Sepal Width')
hist(iris$Petal.Length, probability = T, main = 'Histogram of Petal Length', xlab='Petal Length')
hist(iris$Petal.Width, probability = T, main = 'Histogram of Petal Width', xlab='Petal Width')
We can observe the same trend plotting the histograms. However, if we break drown the graphs by species we might observe different behavior.
petlen.plot <- ggplot(data = iris, aes(y = Petal.Length,x = Species, colour = Species)) + geom_boxplot() + labs(title = 'Petal Length')
ptwidth.plot <- ggplot(data = iris, aes(y = Petal.Width, x = Species, colour = Species)) + geom_boxplot() + labs(title = 'Petal Width')
grid.arrange(petlen.plot,ptwidth.plot,ncol = 2)
We can observe that the Setosa species greatly differs from Versicolor and Virginica both in length and width. It also seems to have fatter tails and positive skewness given the outliers and the assymetic shape of the boxplot, showing visually that the 95th percentile is relatively distant from the 50th and 25th percentile, that seem to be packed altogether. Versicolor and Virginica seem to be more uniform relative to Setosa, but still exhibit some skewness in their distributions.
seplen.plot <- ggplot(data = iris, aes(y = Sepal.Length,x = Species, colour = Species)) + geom_boxplot() + labs(title = 'Sepal Length')
sepwid.plot <- ggplot(data = iris, aes(y = Sepal.Width, x = Species, colour = Species)) + geom_boxplot() + labs(title = 'Sepal Width')
grid.arrange(seplen.plot,sepwid.plot,ncol=2)
Sepal length and sepal width seem to be slightly more uniform accross the three different species, although Setosa and Virginica have some outliers and fat tails.
cov(iris[,-5])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
cor(iris[,-5])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
det(cor(iris[,-5]))
## [1] 0.008109611
Sepal lenght is highly positively correlated with petal length and petal width. Petal lenght and petal width are also highly correlated. These observations might indicate that multicolinearity could be present among the variables. The determinant of the correlation matrix is greater than zero, indicating that our statistical analysis can proceed given that we can solve for the inverse.
BoxM(data = iris[,1:4], group = iris[,5])
## $Chisq
## [1] 140.943
##
## $df
## [1] 20
##
## $p.value
## [1] 0.00000000000000000003352034
##
## $Test
## [1] "BoxM"
##
## attr(,"class")
## [1] "MVTests" "list"
Ho: Var-cov M setosa = var cov M versi = var-cov M virginica
Ha: at least two matrices are not equal
Variance-covariance matrices are not the same given that p-value < 0.01, we reject the null hypothesis at the 1% level. However, the BoxM test is sensitive to violations of multivariate nonnormality. Hence, we can be suspicious of the BoxM results given that we observed that the data is not multivariate normally distributed in prior tests. The rejection of the null might be associated with the absence of normality in the data rather than the variance-covariance matrices being unequal.