HW 1

describeBy(iris[,1:4])

##              vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31    -0.61 0.07
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31     0.14 0.04
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27    -1.42 0.14
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10    -1.36 0.06

The iris data set consist of observations about the structure of three flowers specimens. The observed variables are sepal length and width, and petal length and width. there are 150 observations, 50 for each species. The dataset exhibits some skewness and negative kurtosis

iris[,-5] %>% t() %>% mshapiro.test()

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.97935, p-value = 0.02342

Ho: {Sepal Lenght ^ Sepal Width ^ Petal Lenght ^ Petal Width} follow a multivariate normal distribution

Ha: one or more of the variables are not normally distributed

At a 5% significance level we can reject the null hypothesis that the variables have multivariate normality. However, at the 1% level we fail to reject the null hypothesis that the variables have multivariate normality.

iris[,-5] %>% jb.norm.test()

## 
##  Jarque-Bera test for normality
## 
## data:  .
## JB = 1.4978, p-value < 2.2e-16

help('jb.norm.test')

Ho: data is normally distributed

Ha: data is not normally distributed

We can reject the null hypothesis of normality at the 1% level given the small p-value obtained from the test

options(scipen = 20)
Norm.Test <- c('Anderson-Darling test','Cramer-von Mises test','Kolmogorov-Smirnov test','Pearson chi-square test','Shapiro-Francia test')
processor <- function(x){
  a1 <- ad.test(x)$statistic
  a2 <- cvm.test(x)$statistic
  a3 <- lillie.test(x)$statistic
  a4 <- pearson.test(x)$statistic
  a5 <- sf.test(x)$statistic
  vec <- c(a1,a2,a3,a4,a5)
  return(vec)
} 
processor.p <- function(x){
  a1 <- ad.test(x)$p
  a2 <- cvm.test(x)$p
  a3 <- lillie.test(x)$p
  a4 <- pearson.test(x)$p
  a5 <- sf.test(x)$p
  vec <- c(a1,a2,a3,a4,a5)
  return(vec)
}
Sepal.Lenght.p <- processor.p(iris$Sepal.Length)
Sepal.Width.p <-  processor.p(iris$Sepal.Width)
Petal.Lenght.p <- processor.p(iris$Petal.Length)
Petal.Width.p <-  processor.p(iris$Petal.Width)
df.p <- data.frame(Sepal.Lenght.p, Sepal.Width.p, Petal.Lenght.p, Petal.Width.p)
b.p <- cbind(Norm.Test, df.p)

Sepal.Lenght <- processor(iris$Sepal.Length)
Sepal.Width <-  processor(iris$Sepal.Width)
Petal.Lenght <- processor(iris$Petal.Length)
Petal.Width <-  processor(iris$Petal.Width)
df <- data.frame(Sepal.Lenght, Sepal.Width, Petal.Lenght, Petal.Width)
b <- cbind(Norm.Test, df)
b

##                 Norm.Test Sepal.Lenght Sepal.Width Petal.Lenght Petal.Width
## 1   Anderson-Darling test   0.88919949   0.9079550    7.6785455   5.1056620
## 2   Cramer-von Mises test   0.12739762   0.1806514    1.2222851   0.7215556
## 3 Kolmogorov-Smirnov test   0.08865361   0.1056588    0.1981541   0.1728342
## 4 Pearson chi-square test  17.40000000  46.2000000  192.8000000 155.6000000
## 5    Shapiro-Francia test   0.97961292   0.9848323    0.8818577   0.9083072

b.p

##                 Norm.Test Sepal.Lenght.p  Sepal.Width.p Petal.Lenght.p Petal.Width.p
## 1   Anderson-Darling test    0.022510515 0.020226513623   8.087375e-19  1.124861e-12
## 2   Cramer-von Mises test    0.047064590 0.009335620521   7.370000e-10  4.337563e-08
## 3 Kolmogorov-Smirnov test    0.005788395 0.000314214678   7.900863e-16  7.330204e-12
## 4 Pearson chi-square test    0.135159998 0.000006408907   9.956178e-35  4.131010e-27
## 5    Shapiro-Francia test    0.026210801 0.090202398614   1.844088e-08  3.005838e-07

The null hypothesis for all the tests is that the data is normally distributed. The second table contains the p-values for all the tests accross all the iris variables. Petal Length and Petal Width are not normally distributed given the small p-values(<0.01) for all the test results, since we can reject the null hypothesis that the data is normal distributed at the 1% level. Sepal Length seems to be normally distributed since the Kolmogorv-Smirnov test is the only one that rejects the null hypothesis at the 1% level, the remainder of test fail to reject the null hypothesis at the 1% level. The null hypothesis for Sepal Width normality can be rejected at the 1% level for the Anderson Darling test and the Shapiro-Francia test. Overall, only sepal lenght and sepal width seem to violate the normality assumption

par(mfrow=c(2,2))
plot(density(iris$Sepal.Length), main = 'Density Plot of Sepal Length', xlab = ' Sepal Length')
plot(density(iris$Sepal.Width), main = 'Density Plot of Sepal Width', xlab = ' Sepal Width')
plot(density(iris$Petal.Length), main = 'Density Plot of Petal Length', xlab = ' Petal Width')
plot(density(iris$Petal.Width), main = 'Density Plot of Petal Width', xlab = 'Petal Width')

Visually the variables do not seem normally distributed. Sepal lenght has a flat top with positive skewness. Sepal width seems to have positive kurtosis, hence the fat tails. Petal lenght and petal width are bimodal, which might indicate the presence of multiple populations

par(mfrow=c(2,2))
hist(iris$Sepal.Length, probability = T, main = 'Histogram of Sepal Length', xlab = 'Sepal length')
hist(iris$Sepal.Width, probability = T, main = 'Histogram of Sepal Width', xlab = 'Sepal Width')
hist(iris$Petal.Length, probability = T, main = 'Histogram of Petal Length', xlab='Petal Length')
hist(iris$Petal.Width, probability = T, main = 'Histogram of Petal Width', xlab='Petal Width')

We can observe the same trend plotting the histograms. However, if we break drown the graphs by species we might observe different behavior.

petlen.plot <- ggplot(data = iris, aes(y = Petal.Length,x = Species, colour = Species)) + geom_boxplot() + labs(title = 'Petal Length')
ptwidth.plot <- ggplot(data = iris, aes(y = Petal.Width, x = Species, colour = Species)) + geom_boxplot() + labs(title = 'Petal Width')
grid.arrange(petlen.plot,ptwidth.plot,ncol = 2)

We can observe that the Setosa species greatly differs from Versicolor and Virginica both in length and width. It also seems to have fatter tails and positive skewness given the outliers and the assymetic shape of the boxplot, showing visually that the 95th percentile is relatively distant from the 50th and 25th percentile, that seem to be packed altogether. Versicolor and Virginica seem to be more uniform relative to Setosa, but still exhibit some skewness in their distributions.

seplen.plot <- ggplot(data = iris, aes(y = Sepal.Length,x = Species, colour = Species)) + geom_boxplot() + labs(title = 'Sepal Length')
sepwid.plot <- ggplot(data = iris, aes(y = Sepal.Width, x = Species, colour = Species)) + geom_boxplot() + labs(title = 'Sepal Width')

grid.arrange(seplen.plot,sepwid.plot,ncol=2)

Sepal length and sepal width seem to be slightly more uniform accross the three different species, although Setosa and Virginica have some outliers and fat tails.

cov(iris[,-5])

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
## Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
## Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
## Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063

cor(iris[,-5])

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

det(cor(iris[,-5]))

## [1] 0.008109611

Sepal lenght is highly positively correlated with petal length and petal width. Petal lenght and petal width are also highly correlated. These observations might indicate that multicolinearity could be present among the variables. The determinant of the correlation matrix is greater than zero, indicating that our statistical analysis can proceed given that we can solve for the inverse.

BoxM(data = iris[,1:4], group = iris[,5])

## $Chisq
## [1] 140.943
## 
## $df
## [1] 20
## 
## $p.value
## [1] 0.00000000000000000003352034
## 
## $Test
## [1] "BoxM"
## 
## attr(,"class")
## [1] "MVTests" "list"

Ho: Var-cov M setosa = var cov M versi = var-cov M virginica

Ha: at least two matrices are not equal

Variance-covariance matrices are not the same given that p-value < 0.01, we reject the null hypothesis at the 1% level. However, the BoxM test is sensitive to violations of multivariate nonnormality. Hence, we can be suspicious of the BoxM results given that we observed that the data is not multivariate normally distributed in prior tests. The rejection of the null might be associated with the absence of normality in the data rather than the variance-covariance matrices being unequal.