Homework 1

R take-home exam - Luka Mihailović Potrč

Introduction

In this research, I will observe the differences between petal length, petal width, sepal length, and sepal width of irises. The dataset I will be using throughout this research separates observations of irises by three species - virginica, setosa, and versicolor.

Since I have one-time observations of various iris species (as groups of the population), I am able to observe and conduct research on the them as independent samples.

I want to know, for example, whether irises of the virginica, setosa, and versicolor species differ in petal width. To figure this out, I will conduct an independent samples test - whether it will be the parametric version of one-way ANOVA or the non-parametric version of Kruskal-Wallis rank sum test.

Independent samples ANOVA requires the following assumptions:

The analyzed variable is numeric. This is fulfilled when it comes to petal width.
The distribution of the variable (petal width) is normal within each of the species groups in the population. If this is violated, I will conduct the non-parametric version of the test.
There is homoskedasticity; the variance of the analyzed variable (petal width) is the same within all groups (species). To deduce this, I will use Levene’s test for homogeneity of variance. If only the assumption of homogeneity of variances turns out to be violated (but the assumption of normal distribution within each of the species groups holds), I will apply the parametric Welch’s F-test instead of ANOVA, because the former is less affected by the conditions of heteroskedasticity. As soon as (also) the assumption of normal distribution of the variable within the species groups is violated, I will apply the non-parametric Kruskal-Wallis rank sum test.
There are no significant outliers.

As such, my research question is the following: Do petal widths differ between the iris species of virginica, setosa, and versicolor?

To answer this research question, I will put forward my research hypothesis as follows:

H0: Mu(virginica) = Mu(setosa) = Mu(versicolor)

H1: At least one Mu is different from the others (note: I denoted population arithmetic mean as Mu)

I will now import my data in R studio and display some of it.

#data()
#data(package = .packages(all.available = TRUE))
mydata <- force(iris)
head(mydata)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Explanations of the data set:

the respective unit (ID) represents an observed iris flower
Sepal.Length: Measurement of the respective iris flower’s sepal length in centimeters
Sepal.Width: Measurement of the respective iris flower’s sepal width in centimeters
Petal.Length: Measurement of the respective iris flower’s petal length in centimeters
Petal.Width: Measurement of the respective iris flower’s petal width in centimeters
Species: Classification of the respective iris flower into one of the three species: setosa, versicolor, or virginica.

The unit of observation is a separate iris flower. A consecutive unit is represented by its ID respectively.

The sample size of our data encompasses 150 units along the 5 variables delineated above.

Explanations of variables:

The first four variables (first four columns) measure an iris’ sepal as well as petal length and width. All four variables are numerical on a ratio scale. This is because the measurements are scalable (I can say that one flower’s petal is twice as wide as another) and because they present an absolute zero in the sense that length and width yield non negative results.

The fifth variable tells us about the species the observed iris flower belongs to - it is either of setosa, versicolor, or virginica species. This is a nominal variable. For example, among the first 6 units observed, all of the flowers belong to setosa species - this is because they are sorted by species by default.

Following is the data source:

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.

The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5.

Finally, before proceeding, I will let the program know to treat the Species variable as a factor.

mydata$SpeciesFactor <- factor(mydata$Species, 
                             levels = c("setosa", "versicolor", "virginica"), 
                             labels = c("setosa", "versicolor", "virginica"))
head(mydata)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SpeciesFactor
## 1          5.1         3.5          1.4         0.2  setosa        setosa
## 2          4.9         3.0          1.4         0.2  setosa        setosa
## 3          4.7         3.2          1.3         0.2  setosa        setosa
## 4          4.6         3.1          1.5         0.2  setosa        setosa
## 5          5.0         3.6          1.4         0.2  setosa        setosa
## 6          5.4         3.9          1.7         0.4  setosa        setosa

Descriptive Statistics of the data

I will not perform data manipulations here because it is not called for. Just for little practice, I can, for example, rename columns of variables for illustration.

mydata2 <- mydata
colnames(mydata2) <- c("Sepal L", "Sepal W", "Petal L", "Petal W", "Species", "SpeciesF")
head(mydata2)

##   Sepal L Sepal W Petal L Petal W Species SpeciesF
## 1     5.1     3.5     1.4     0.2  setosa   setosa
## 2     4.9     3.0     1.4     0.2  setosa   setosa
## 3     4.7     3.2     1.3     0.2  setosa   setosa
## 4     4.6     3.1     1.5     0.2  setosa   setosa
## 5     5.0     3.6     1.4     0.2  setosa   setosa
## 6     5.4     3.9     1.7     0.4  setosa   setosa

Let us now take a look at some descriptive statistics. I will not be using all possible functions, for they serve a similar purpose, but will make use of some for showing understanding.

If I wish to check the arithmetic mean of petal width of all irises in centimeters, irrespective of species, I can do the following:

round(mean(mydata$Petal.Width), 2)

## [1] 1.2

This tells me that on average, the observed irises’ petals are 1.2cm wide (rounding applies).

median(mydata$Petal.Width)

## [1] 1.3

Looking at the median above, I can say that irrespective of species, a half of the observed irises’ petals are up to 1.3cm wide, and the other half are wider than 1.3cm.

I will now use the “describeBy” function for a broader overview of descriptive statistics of petal width.

library(psych)
describeBy(x = mydata$Petal.Width, group = mydata$SpeciesFactor)

## 
##  Descriptive statistics by group 
## group: setosa
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 50 0.25 0.11    0.2    0.24   0 0.1 0.6   0.5 1.18     1.26 0.01
## ------------------------------------------------------------------------------------------ 
## group: versicolor
##    vars  n mean  sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 50 1.33 0.2    1.3    1.32 0.22   1 1.8   0.8 -0.03    -0.59 0.03
## ------------------------------------------------------------------------------------------ 
## group: virginica
##    vars  n mean   sd median trimmed mad min max range  skew kurtosis   se
## X1    1 50 2.03 0.27      2    2.03 0.3 1.4 2.5   1.1 -0.12    -0.75 0.04

Interpretations of some results:

min of “Petal.Width”: The narrowest petal among all measured irises was 0.1cm wide.
max of “Petal.Width”: The widest petal among all measured irises was 2.5cm wide.
skew of “Petal Width”: When looking at versicolor and virginica species’ widths of the irises’ petals, I can see that the distribution is slightly left (negatively) skewed. This means that our results tend to cluster slightly on the right and the distribution’s slope can be seen falling slightly on the left (the “tail” is longer on the left side). The case is the opposite when it comes to the setosa species where skewness is the other way around; positive (right) skewness.

As displayed above, I can see that petal width differs notably both in arithmetic mean and median across species. I expect I will be able to confirm my research hypothesis, but it remains to be seen.

Graphing the distribution - differences by species

I will now graph the histogram of irises’ Petal Width below. But because one of the assumptions for ANOVA is the variable being normally distributed in the population within all groups of species, we want to graph the data accounting for species and observe distributions as such. To do this, I can utilize the “ggplot” function.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(mydata, aes(x=Petal.Width)) +
  geom_histogram(binwidth = 0.1, colour="black", fill="darkred") +
  facet_wrap(~SpeciesFactor, ncol = 1) + 
  xlab("Petal Width") +
  ylab("Frequency")

As I can see from the histograms above, neither of the species appears to feature a clearly normal distribution of the variable. I can confirm this by looking at the “decribeBy” function I used above and noting skewness. This could pose a problem, because in order to use ANOVA later on, I want to be sure that the variable is normally distributed within each of the groups (species) in the population. To test this formally, I will later use the Shapiro-Wilks test.

In order to test assumptions for ANOVA, let’s first test to see whether petal width in the population is normally distributed within all species groups. I will do this with the Shapiro-Wilks test for each of the population groups separately.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.3.2

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(SpeciesFactor) %>%
  shapiro_test(Petal.Width)

## # A tibble: 3 × 4
##   SpeciesFactor variable    statistic           p
##   <fct>         <chr>           <dbl>       <dbl>
## 1 setosa        Petal.Width     0.800 0.000000866
## 2 versicolor    Petal.Width     0.948 0.0273     
## 3 virginica     Petal.Width     0.960 0.0870

Our hypotheses for this are as follows for each of our species groups:

H0: The variable of petal width in the population is normally distributed within the respective species groups.

H1: The variable of petal width in the population is not normally distributed within the respective species groups.

As I can see from the test, petal width is not normally distributed in our setosa (p<0.001) and versicolor groups (p=0.03). I reject the null hypothesis for those two groups at p<0.001 and p=0.03 respectively. This violates our normality assumption for ANOVA, which states that the variable in the population should be normally distributed within each of the species groups. Consequently, I will have to use the non-parametric version of the test; Kruskal-Wallis rank sum test.

I lack sufficient evidence to claim that the distribution of the variable is not normal with virginica species, but that does not matter, because a violation in one species is sufficient to violate ANOVA assumptions.

Regardless of the fact that I will be using the non-parametric test etiher way, I will also use Levene’s test for homogeneity of variance to determine whether I can claim with statistical significance that the variance of petal width is the same within all species groups.

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:psych':
## 
##     logit

leveneTest(mydata$Petal.Width, group = mydata$SpeciesFactor)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   2  19.892 2.261e-08 ***
##       147                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With this test, our hypotheses are:

H0: The variance of petal width is the same within each species group.

H0: The variance of petal width differs for at least one species group.

As I can see from the Levene’s test , the variance of petal width is not the same between species groups and I cannot claim homogeneity of variance (p<0.001). Because both of the ANOVA assumptions about homoskedasticity and normal distribution of variable in the population within each group are violated, I will have to resort to using the non-parametric Kruskal-Wallis rank sum test (if only homoskedasticity was violated but the variable was normally distributed in the population in each of the groups, I could have used Welch’s F-test, a parametric test less sensitive to heteroskedasticity than ANOVA).

Kruskal-Wallis rank sum test and post-hoc tests

kruskal.test(Petal.Width ~ SpeciesFactor, 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Petal.Width by SpeciesFactor
## Kruskal-Wallis chi-squared = 131.19, df = 2, p-value < 2.2e-16

The hypotheses for the Kruskal-Wallis rank sum test are as follows

H0: The location distribution of petal width is the same for all three species of irises

H1: The location distribution of petal width differs for at least one species of iris

I can reject the null hypothesis at statistical significance and conclude that the location distribution of petal width differs with at least one species of iris (p<0.001).

Let’s check the effect size of our test.

kruskal_effsize(Petal.Width ~ SpeciesFactor, 
                data = mydata)

## # A tibble: 1 × 5
##   .y.             n effsize method  magnitude
## * <chr>       <int>   <dbl> <chr>   <ord>    
## 1 Petal.Width   150   0.879 eta2[H] large

As I can see, the observed effect of differing in location distribution of petal width of at least one of the iris species is large.

But with the Kruskall-Wallis rank sum test I cannot claim anything beyond at least one species of iris differing in its petal width location distribution and the magnitute of that effect. In order to better compare irises of differing species with each other, I can conduct post-hoc tests. When doing so, I am essentially comparing each of the species to its counterparts and will as such end up observing 3 combinations. Because these are comparisons of medians (I have a non-parametric test so I am no longer observing the mean), I will use Wilcoxon Rank Sum Test in every combination of species to determine whether in each of the combinations the respective two species differ in petal width location distribution at statistical significance.

library(rstatix)

groups_nonpar <- wilcox_test(Petal.Width ~ SpeciesFactor,
                             paired = FALSE,
                             p.adjust.method = "bonferroni",
                             data = mydata)

groups_nonpar

## # A tibble: 3 × 9
##   .y.         group1     group2        n1    n2 statistic        p    p.adj p.adj.signif
## * <chr>       <chr>      <chr>      <int> <int>     <dbl>    <dbl>    <dbl> <chr>       
## 1 Petal.Width setosa     versicolor    50    50         0 2.28e-18 6.84e-18 ****        
## 2 Petal.Width setosa     virginica     50    50         0 2.43e-18 7.29e-18 ****        
## 3 Petal.Width versicolor virginica     50    50        49 9.70e-17 2.91e-16 ****

I used bonferroni which is essentially a method for adjusting p-values when doing multiple tests. When drawing conclusions, I observe the adjusted p-values by bonferroni and can see that all of them present statistical significance.

For the sake of showing understanding, I will explain one of the three combinations above. Let’s take the first one. The first test compares location distribution of iris width of setosa species to those of versicolor species. As such, its hypotheses are:

H0: The distribution locations of iris width of setosa and versicolor species are the same.

H1: The distribution locations of iris width of setosa and versicolor species are not the same.

I can reject the null hypothesis and conclude that setosa irises differ from versicolor irises in petal width location distribution with statistical significance (p<0.001).

By repeating the Wilcoxon rank sum test 3 times, for each of the possible combinations, I can see that all 3 combinations turn out to be statistically significant; all observed iris species differ from the other species in petal width location distribution at statistical significance (p<0.001).

Finally, I will just paint the whole picture in a boxplot graph. The red dots in the graph lie exactly on median lines of boxplots because, due to me having used a non-parametric test, they are themselves medians. The graph essentially explains what I have already said before, so I will not go over it again in detail; but I can see the results of the Kruskal-Wallis test as well as the confirmed statistical significance of all three possible combinations of Wilcoxon rank sum tests.

#install.packages("ggpubr")
pwc <- mydata %>%
  wilcox_test(Petal.Width ~ SpeciesFactor, 
              paired = FALSE, 
              p.adjust.method = "bonferroni")
 
Kruskal_results <- kruskal_test(Petal.Width ~ SpeciesFactor, 
                                data = mydata)

library(rstatix)
pwc <- pwc %>% 
       add_y_position(fun = "median", step.increase = 0.35)

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.3.2

ggboxplot(mydata, x = "SpeciesFactor", y = "Petal.Width", add = "point", ylim=c(0, 12)) +
  stat_pvalue_manual(pwc, hide.ns = FALSE) +
  stat_summary(fun = median, geom = "point", shape = 16, size = 4,
               aes(group = SpeciesFactor), color = "darkred",
               position = position_dodge(width = 0.8)) +
  stat_summary(fun = median, colour = "darkred", 
               position = position_dodge(width = 0.8),
               geom = "text", vjust = -0.5, hjust = -1,
               aes(label = round(after_stat(y), digits = 2), group = SpeciesFactor)) +
  labs(subtitle = get_test_label(Kruskal_results,  detailed = TRUE),
       caption = get_pwc_label(pwc))

Back to our research question

I framed my research question as: Do petal widths differ between the iris species of virginica, setosa, and versicolor?

To answer my research question, I can attest to the differences in petal width between iris species by differences in the location distributions.

I have essentially rejected the null hypothesis along with the Kruskall-Wallis rank sum test when I concluded that at least one species of iris differs in the location distribution of its petal width (p<0.001).

I have then further gone to prove with Wilcoxon rank sum tests that each of the iris species differs from the others in the location distribution of its petal width. I have thus shown that the widths of irises differs between iris species with statistical significance. Irises of setosa species have the narrowest petal width, followed by irises of versicolor species. Finally, irises of virginica speceis have the widest petal width (the p-values for all three t-tests are less than 0.001).

Sorry for the wall of text. I’ll get you eye drops.