mydata <- read.csv("./penguins_size.csv", header = TRUE, sep = ",", dec = ".")
mydata <- drop_na(mydata)
mydata <- mydata[mydata$sex != ".",]

mydata <- mydata %>% rename(Species=species, Island=island, Culmen_Length=culmen_length_mm, Culmen_Depth=culmen_depth_mm, Flipper_Length=flipper_length_mm, Body_Mass=body_mass_g, Sex=sex)


head(mydata)
##   Species    Island Culmen_Length Culmen_Depth Flipper_Length Body_Mass    Sex
## 1  Adelie Torgersen          39.1         18.7            181      3750   MALE
## 2  Adelie Torgersen          39.5         17.4            186      3800 FEMALE
## 3  Adelie Torgersen          40.3         18.0            195      3250 FEMALE
## 4  Adelie Torgersen          36.7         19.3            193      3450 FEMALE
## 5  Adelie Torgersen          39.3         20.6            190      3650   MALE
## 6  Adelie Torgersen          38.9         17.8            181      3625 FEMALE

Description of variables

Source: https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris/notebook#Flipperlength-distribution

HYPOTHESIS TESTING

1. HYPOTHESIS ABOUT THE DIFFERENCE BETWEEN TWO POPULATION ARITHMETIC MEANS (independent samples)

describeBy(mydata$Body_Mass, group = mydata$Sex)
## 
##  Descriptive statistics by group 
## group: FEMALE
##    vars   n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 165 3862.27 666.17   3650 3834.02 667.17 2700 5200  2500 0.44    -1.15
##       se
## X1 51.86
## ------------------------------------------------------------ 
## group: MALE
##    vars   n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 168 4545.68 787.63   4300 4515.07 815.43 3250 6300  3050 0.37    -1.23
##       se
## X1 60.77
describeBy(mydata$Body_Mass, g = mydata$Sex,mat = TRUE)
##     item group1 vars   n     mean       sd median  trimmed    mad  min  max
## X11    1 FEMALE    1 165 3862.273 666.1720   3650 3834.023 667.17 2700 5200
## X12    2   MALE    1 168 4545.685 787.6289   4300 4515.074 815.43 3250 6300
##     range      skew  kurtosis       se
## X11  2500 0.4370033 -1.145108 51.86142
## X12  3050 0.3693754 -1.225441 60.76689
  1. MEAN represents the arithmetic average of the data. It is calculated by taking the sum of the values and dividing by the number of observations.

Female and Male Penguin Body Mass: From the output, we can infer that the average female penguin body mass is 3862.273 and average male penguin body mass is 4545.685.

  1. MEDIAN value is the middle most value of a variable in a data.

Female and Male Penguin Body Mass: From the output, we can infer that the median (the middle most value of a variable) female penguin body mass is 3650 and male penguin body mass is 4300.

From the above we can see there is difference in arithmetic mean and median of female penguins body mass in comparison to male penguins body mass. In the following I will make a formal test and if the assumptions are violated I will use non-parametric alternative test.

Independent samples t-test

H0: Female and male penguins have the same body mass.

H1: Female and male penguins don’t have the same body mass.

In the firs step I will check if there is normal distribution (graphically). Then I will use Shapiro-Wilk test. If there is a violation in assumption, I will apply non-parametric test: Wilcoxon Rank Sum Test.

mydata$SexF <- factor(mydata$Sex, levels = c("FEMALE", "MALE"))
ggplot(mydata, aes(x = Body_Mass)) + geom_histogram(bins = 10, fill = "lightskyblue", color = "white") +
  facet_wrap(~SexF, ncol=1) +
  ylab("Frequency") +
  theme_bw()

From the histogram above we cannot conclude if distribution of variable (in both groups) is normal or not so I will use Shapiro Wilk test in the following.

H0: Distribution of variable body mass is normal

H1: Distribution of variable body mass is not normal

mydata %>% group_by(Sex) %>% shapiro_test(Body_Mass)
## # A tibble: 2 × 4
##   Sex    variable  statistic            p
##   <chr>  <chr>         <dbl>        <dbl>
## 1 FEMALE Body_Mass     0.919 0.0000000616
## 2 MALE   Body_Mass     0.925 0.000000123

From the results above we can say:

Category FEMALE: p value= 6.155316e-08 < 5 %

Category MALE: p value= 1.226814e-07 < 5 %

We can reject null hypothesis, meaning distribution of body mass in both categories is not normal.

Because distribution is not normal I will apply alternative test –> non-parametric test: Wilcoxon Rank Sum Test.

H0: Distribution locations of variable Body Mass are the same.

H1: Distribution location of variable Body Mass are not the same.

wilcox.test(mydata$Body_Mass ~ mydata$Sex, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided")
## 
##  Wilcoxon rank sum test
## 
## data:  mydata$Body_Mass by mydata$Sex
## W = 6874.5, p-value = 1.805e-15
## alternative hypothesis: true location shift is not equal to 0

Based on the sample data, we find that female and male penguins differ in the body mass (𝑝 < 0.001) - male penguins weight more than female penguins. To see how big the difference is I will use effectsize.

effectsize(wilcox.test(mydata$Body_Mass ~ mydata$Sex, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided"))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.50             | [-0.59, -0.41]
interpret_rank_biserial(0.5)
## [1] "very large"
## (Rules: funder2019)

The difference in distribution is very large (𝑟 = 0.50 ).

Based on the sample data, we have found that female and male penguins differ in the body mass. Male penguins weigh more. The difference in distribution of variable is very large (r = 0.50).

HYPOTHESIS ABOUT THE EQUALITY OF THREE OR MORE POPULATION ARITHMETIC MEANS FOR INDEPENDENT SAMPLES

In the first step I will do the one-way analysis of variance - Anova and if assumptions are violated then I will use Kruskal-Wallis rank sum test.

Research hypothesis: All penguin species (Adelie, Chinstrap and Gentoo) have the same flipper length.

H0: μ (Adelie flipper length) = μ (Chinstrap flipper length) = μ (Gentoo flipper length)

H1: At least one μ is different.

mydata$Species <- factor(mydata$Species)
levels(mydata$Species)
## [1] "Adelie"    "Chinstrap" "Gentoo"
describeBy(mydata$Flipper_Length, group = mydata$Species)
## 
##  Descriptive statistics by group 
## group: Adelie
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 146 190.1 6.52    190  190.08 7.41 172 210    38 0.08     0.28 0.54
## ------------------------------------------------------------ 
## group: Chinstrap
##    vars  n   mean   sd median trimmed  mad min max range  skew kurtosis   se
## X1    1 68 195.82 7.13    196  195.75 7.41 178 212    34 -0.01    -0.13 0.86
## ------------------------------------------------------------ 
## group: Gentoo
##    vars   n   mean   sd median trimmed  mad min max range skew kurtosis  se
## X1    1 119 217.24 6.59    216  216.91 7.41 203 231    28 0.36    -0.72 0.6

First I will graphicaly check if variable (Flipper_Length) is normally distributed in all three groups.

library(ggplot2)
ggplot(mydata, aes(x = Flipper_Length)) +
  geom_histogram(binwidth = 10, colour="white", fill = "lightskyblue") +
  facet_wrap(~Species , ncol = 1) + 
  ylab("Frequency") +
  theme_bw()

With Shapiro-Wilk test I will check the normality.

Hypothesis for each group separately:

H0: Distribution of variable Flipper_Length is normal.

H1: Distribution of variable Flipper_Length is not normal.

mydata %>%
  group_by(Species) %>%
  shapiro_test(Flipper_Length)
## # A tibble: 3 × 4
##   Species   variable       statistic       p
##   <fct>     <chr>              <dbl>   <dbl>
## 1 Adelie    Flipper_Length     0.993 0.743  
## 2 Chinstrap Flipper_Length     0.989 0.811  
## 3 Gentoo    Flipper_Length     0.961 0.00176

From the results above we can say:

  • Category Adelie: p value= 0.742742565 > 5 %: we cannot reject H0.

  • Category Chinstrap: p value= 0.810644659 > 5 %: we cannot reject H0.

  • Category Gentoo: p value= 0.001759983 < 5 %: we can reject H0, meaning distribution of flipper length in this group is not normal.

Because distribution of choosen variable is not normal I will apply alternative test non-parametric Kruskal-Wallis Rank Sum Test.

H0: All distribution locations of variable Flipper_Length are the same in all 3 groups

H1: At least one distribution location of variable Flipper_Length is different.

kruskal.test(Flipper_Length ~ Species,
             data = mydata)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Flipper_Length by Species
## Kruskal-Wallis chi-squared = 237.35, df = 2, p-value < 2.2e-16

Based on Kruskal-Wallis rank sum test we can reject H0, because p-value (=2.2e-16) is smaller than 0.001.

kruskal_effsize(Flipper_Length ~ Species,
             data = mydata)
## # A tibble: 1 × 5
##   .y.                n effsize method  magnitude
## * <chr>          <int>   <dbl> <chr>   <ord>    
## 1 Flipper_Length   333   0.713 eta2[H] large

Based on the sample data, we have found that the distribution of flipper length differs for at least one the group (chi-squared = 237.35 and p < 0.001 ) - the difference in distribution locations of variable is large (effsize=0.7131689).

ASSOCIATION BETWEEN CATEGORIES - CHI-SQUARE TEST

I the following I want to test if the species (Adelie, Chinstrap, Gentoo) varied depending on the island (Biscoe, Dream, Torgersen). I will use Chi-square test.

H0: There is no association between the species and island

H1: There is an association between the species and island

results <- chisq.test(mydata$Species, mydata$Island,
                      correct = TRUE)
results
## 
##  Pearson's Chi-squared test
## 
## data:  mydata$Species and mydata$Island
## X-squared = 284.59, df = 4, p-value < 2.2e-16

Based on Chi-square test we can reject H0, because p-value (=2.2e-16) is smaller than 0.001. Therefore we can conclude that there is an association between species and island.

addmargins(results$observed)
##               mydata$Island
## mydata$Species Biscoe Dream Torgersen Sum
##      Adelie        44    55        47 146
##      Chinstrap      0    68         0  68
##      Gentoo       119     0         0 119
##      Sum          163   123        47 333

As seen from above results, Chinstraps only live on one island (Dream) as well as Gentoo species where only live on Biscoe Island. Only Adelie species lives on all three islands.

round(results$expected, 2)
##               mydata$Island
## mydata$Species Biscoe Dream Torgersen
##      Adelie     71.47 53.93     20.61
##      Chinstrap  33.29 25.12      9.60
##      Gentoo     58.25 43.95     16.80

We can see that all of expected frequencies are larger than 5, therefore we can confirm first assumption. The other assumption is that the observations are independent which is also true.

round(results$res, 2)
##               mydata$Island
## mydata$Species Biscoe Dream Torgersen
##      Adelie     -3.25  0.15      5.81
##      Chinstrap  -5.77  8.56     -3.10
##      Gentoo      7.96 -6.63     -4.10
effectsize::cramers_v(mydata$Species, mydata$Island)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.65              | [0.58, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.65)
## [1] "very large"
## (Rules: funder2019)

The effect size is very large (r=0,65)