mydata <- read.csv("./penguins_size.csv", header = TRUE, sep = ",", dec = ".")
mydata <- drop_na(mydata)
mydata <- mydata[mydata$sex != ".",]
mydata <- mydata %>% rename(Species=species, Island=island, Culmen_Length=culmen_length_mm, Culmen_Depth=culmen_depth_mm, Flipper_Length=flipper_length_mm, Body_Mass=body_mass_g, Sex=sex)
head(mydata)
## Species Island Culmen_Length Culmen_Depth Flipper_Length Body_Mass Sex
## 1 Adelie Torgersen 39.1 18.7 181 3750 MALE
## 2 Adelie Torgersen 39.5 17.4 186 3800 FEMALE
## 3 Adelie Torgersen 40.3 18.0 195 3250 FEMALE
## 4 Adelie Torgersen 36.7 19.3 193 3450 FEMALE
## 5 Adelie Torgersen 39.3 20.6 190 3650 MALE
## 6 Adelie Torgersen 38.9 17.8 181 3625 FEMALE
Description of variables
describeBy(mydata$Body_Mass, group = mydata$Sex)
##
## Descriptive statistics by group
## group: FEMALE
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 165 3862.27 666.17 3650 3834.02 667.17 2700 5200 2500 0.44 -1.15
## se
## X1 51.86
## ------------------------------------------------------------
## group: MALE
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 168 4545.68 787.63 4300 4515.07 815.43 3250 6300 3050 0.37 -1.23
## se
## X1 60.77
describeBy(mydata$Body_Mass, g = mydata$Sex,mat = TRUE)
## item group1 vars n mean sd median trimmed mad min max
## X11 1 FEMALE 1 165 3862.273 666.1720 3650 3834.023 667.17 2700 5200
## X12 2 MALE 1 168 4545.685 787.6289 4300 4515.074 815.43 3250 6300
## range skew kurtosis se
## X11 2500 0.4370033 -1.145108 51.86142
## X12 3050 0.3693754 -1.225441 60.76689
Female and Male Penguin Body Mass: From the output, we can infer that the average female penguin body mass is 3862.273 and average male penguin body mass is 4545.685.
Female and Male Penguin Body Mass: From the output, we can infer that the median (the middle most value of a variable) female penguin body mass is 3650 and male penguin body mass is 4300.
From the above we can see there is difference in arithmetic mean and median of female penguins body mass in comparison to male penguins body mass. In the following I will make a formal test and if the assumptions are violated I will use non-parametric alternative test.
Independent samples t-test
H0: Female and male penguins have the same body mass.
H1: Female and male penguins don’t have the same body mass.
In the firs step I will check if there is normal distribution (graphically). Then I will use Shapiro-Wilk test. If there is a violation in assumption, I will apply non-parametric test: Wilcoxon Rank Sum Test.
mydata$SexF <- factor(mydata$Sex, levels = c("FEMALE", "MALE"))
ggplot(mydata, aes(x = Body_Mass)) + geom_histogram(bins = 10, fill = "lightskyblue", color = "white") +
facet_wrap(~SexF, ncol=1) +
ylab("Frequency") +
theme_bw()
From the histogram above we cannot conclude if distribution of variable (in both groups) is normal or not so I will use Shapiro Wilk test in the following.
H0: Distribution of variable body mass is normal
H1: Distribution of variable body mass is not normal
mydata %>% group_by(Sex) %>% shapiro_test(Body_Mass)
## # A tibble: 2 × 4
## Sex variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 FEMALE Body_Mass 0.919 0.0000000616
## 2 MALE Body_Mass 0.925 0.000000123
From the results above we can say:
Category FEMALE: p value= 6.155316e-08 < 5 %
Category MALE: p value= 1.226814e-07 < 5 %
We can reject null hypothesis, meaning distribution of body mass in both categories is not normal.
Because distribution is not normal I will apply alternative test –> non-parametric test: Wilcoxon Rank Sum Test.
H0: Distribution locations of variable Body Mass are the same.
H1: Distribution location of variable Body Mass are not the same.
wilcox.test(mydata$Body_Mass ~ mydata$Sex, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided")
##
## Wilcoxon rank sum test
##
## data: mydata$Body_Mass by mydata$Sex
## W = 6874.5, p-value = 1.805e-15
## alternative hypothesis: true location shift is not equal to 0
Based on the sample data, we find that female and male penguins differ in the body mass (𝑝 < 0.001) - male penguins weight more than female penguins. To see how big the difference is I will use effectsize.
effectsize(wilcox.test(mydata$Body_Mass ~ mydata$Sex, paired = FALSE, correct = FALSE, exact = FALSE, alternative = "two.sided"))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.50 | [-0.59, -0.41]
interpret_rank_biserial(0.5)
## [1] "very large"
## (Rules: funder2019)
The difference in distribution is very large (𝑟 = 0.50 ).
Based on the sample data, we have found that female and male penguins differ in the body mass. Male penguins weigh more. The difference in distribution of variable is very large (r = 0.50).
In the first step I will do the one-way analysis of variance - Anova and if assumptions are violated then I will use Kruskal-Wallis rank sum test.
Research hypothesis: All penguin species (Adelie, Chinstrap and Gentoo) have the same flipper length.
H0: μ (Adelie flipper length) = μ (Chinstrap flipper length) = μ (Gentoo flipper length)
H1: At least one μ is different.
mydata$Species <- factor(mydata$Species)
levels(mydata$Species)
## [1] "Adelie" "Chinstrap" "Gentoo"
describeBy(mydata$Flipper_Length, group = mydata$Species)
##
## Descriptive statistics by group
## group: Adelie
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 146 190.1 6.52 190 190.08 7.41 172 210 38 0.08 0.28 0.54
## ------------------------------------------------------------
## group: Chinstrap
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 68 195.82 7.13 196 195.75 7.41 178 212 34 -0.01 -0.13 0.86
## ------------------------------------------------------------
## group: Gentoo
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 119 217.24 6.59 216 216.91 7.41 203 231 28 0.36 -0.72 0.6
First I will graphicaly check if variable (Flipper_Length) is normally distributed in all three groups.
library(ggplot2)
ggplot(mydata, aes(x = Flipper_Length)) +
geom_histogram(binwidth = 10, colour="white", fill = "lightskyblue") +
facet_wrap(~Species , ncol = 1) +
ylab("Frequency") +
theme_bw()
With Shapiro-Wilk test I will check the normality.
Hypothesis for each group separately:
H0: Distribution of variable Flipper_Length is normal.
H1: Distribution of variable Flipper_Length is not normal.
mydata %>%
group_by(Species) %>%
shapiro_test(Flipper_Length)
## # A tibble: 3 × 4
## Species variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 Adelie Flipper_Length 0.993 0.743
## 2 Chinstrap Flipper_Length 0.989 0.811
## 3 Gentoo Flipper_Length 0.961 0.00176
From the results above we can say:
Category Adelie: p value= 0.742742565 > 5 %: we cannot reject H0.
Category Chinstrap: p value= 0.810644659 > 5 %: we cannot reject H0.
Category Gentoo: p value= 0.001759983 < 5 %: we can reject H0, meaning distribution of flipper length in this group is not normal.
Because distribution of choosen variable is not normal I will apply alternative test non-parametric Kruskal-Wallis Rank Sum Test.
H0: All distribution locations of variable Flipper_Length are the same in all 3 groups
H1: At least one distribution location of variable Flipper_Length is different.
kruskal.test(Flipper_Length ~ Species,
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: Flipper_Length by Species
## Kruskal-Wallis chi-squared = 237.35, df = 2, p-value < 2.2e-16
Based on Kruskal-Wallis rank sum test we can reject H0, because p-value (=2.2e-16) is smaller than 0.001.
kruskal_effsize(Flipper_Length ~ Species,
data = mydata)
## # A tibble: 1 × 5
## .y. n effsize method magnitude
## * <chr> <int> <dbl> <chr> <ord>
## 1 Flipper_Length 333 0.713 eta2[H] large
Based on the sample data, we have found that the distribution of flipper length differs for at least one the group (chi-squared = 237.35 and p < 0.001 ) - the difference in distribution locations of variable is large (effsize=0.7131689).
I the following I want to test if the species (Adelie, Chinstrap, Gentoo) varied depending on the island (Biscoe, Dream, Torgersen). I will use Chi-square test.
H0: There is no association between the species and island
H1: There is an association between the species and island
results <- chisq.test(mydata$Species, mydata$Island,
correct = TRUE)
results
##
## Pearson's Chi-squared test
##
## data: mydata$Species and mydata$Island
## X-squared = 284.59, df = 4, p-value < 2.2e-16
Based on Chi-square test we can reject H0, because p-value (=2.2e-16) is smaller than 0.001. Therefore we can conclude that there is an association between species and island.
addmargins(results$observed)
## mydata$Island
## mydata$Species Biscoe Dream Torgersen Sum
## Adelie 44 55 47 146
## Chinstrap 0 68 0 68
## Gentoo 119 0 0 119
## Sum 163 123 47 333
As seen from above results, Chinstraps only live on one island (Dream) as well as Gentoo species where only live on Biscoe Island. Only Adelie species lives on all three islands.
round(results$expected, 2)
## mydata$Island
## mydata$Species Biscoe Dream Torgersen
## Adelie 71.47 53.93 20.61
## Chinstrap 33.29 25.12 9.60
## Gentoo 58.25 43.95 16.80
We can see that all of expected frequencies are larger than 5, therefore we can confirm first assumption. The other assumption is that the observations are independent which is also true.
round(results$res, 2)
## mydata$Island
## mydata$Species Biscoe Dream Torgersen
## Adelie -3.25 0.15 5.81
## Chinstrap -5.77 8.56 -3.10
## Gentoo 7.96 -6.63 -4.10
effectsize::cramers_v(mydata$Species, mydata$Island)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.65 | [0.58, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.65)
## [1] "very large"
## (Rules: funder2019)
The effect size is very large (r=0,65)