#install.packages("palmerpenguins")
library(palmerpenguins)
library(ggplot2)
library(ggpubr)
library(dplyr)The Palmer penguins dataset by Allison Horst, Alison Hill, and Kristen Gorman was first made publicly available as an R package.
Size measurements, clutch observations, and blood isotope ratios for 344 adult foraging Adelie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica.
The variables in the dataset are:
species
island
bill length
bill depth
flipper length
body max
sex
year
dat<- penguins
dat <- na.omit(dat)
dat<- dat[-c(259,260,263), ]
#View(penguins)
#View(dat)
dat# A tibble: 330 × 8
species island bill_length_mm bill_depth_mm flipper_…¹ body_…² sex year
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
4 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
5 Adelie Torgersen 39.3 20.6 190 3650 male 2007
6 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
7 Adelie Torgersen 39.2 19.6 195 4675 male 2007
8 Adelie Torgersen 41.1 17.6 182 3200 fema… 2007
9 Adelie Torgersen 38.6 21.2 191 3800 male 2007
10 Adelie Torgersen 34.6 21.1 198 4400 male 2007
# … with 320 more rows, and abbreviated variable names ¹flipper_length_mm,
# ²body_mass_g
# ℹ Use `print(n = ...)` to see more rows
As a first analysis I would focus my attention on the male and female penguins of the “Adelie” species
Let’s test with a two sample unpaired t-test
H0 : there’s no difference between the bill length of the male and female penguins (μM=μF)
H1: there’s difference between the bill length of the male and female penguins (μM≠μF)
display our data
library(tseries)
library(dplyr)
new_male <- dat %>%
filter(species == 'Adelie' & sex == 'male')
new_female <- dat %>%
filter(species == 'Adelie' & sex == 'female')
male<- new_male[,3] #select only value of interest
male<- as.numeric(unlist(male)) #eliminate list
female<- new_female[,3] #select only value of interest
female<- as.numeric(unlist(female)) #eliminate listLet’s display our data in order to see graphically if there are already some difference.
par(mfrow=c(1,2))
boxplot(male, main ="male", col ="orange")
boxplot(female, main = "female", col ="yellow")check for normality
shapiro.test(male)
Shapiro-Wilk normality test
data: male
W = 0.98613, p-value = 0.6067
shapiro.test(female)
Shapiro-Wilk normality test
data: female
W = 0.99117, p-value = 0.8952
since that the value is not less than 0.5 we can say that, our data, are normally distributed.
we can also display it graphically
par(mfrow=c(1,2))
x1<- hist(male, col='steelblue')
x2<- hist(female,col='steelblue')check for “equal” variances
sd(male)/sd(female) should not be greater than 2.
In this case we’ll use an F- test.\
var.test(male, female,
alternative = "two.sided")
F test to compare two variances
data: male and female
F = 1.2597, num df = 72, denom df = 72, p-value = 0.3296
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.7907415 2.0067311
sample estimates:
ratio of variances
1.259685
The p-value of F-test is p = 0.3296 which is greater than the significance level 0.05. In conclusion, there is no significant difference between the two variances.
So we can say that there’s homogeneity of variances
par(mfrow=c(1,2))
boxplot(male,main = "male", col ="orange")
outliers_values=boxplot.stats(male)
outliers_values$stats
[1] 36.3 39.0 40.6 41.5 44.1
$n
[1] 73
$conf
[1] 40.13769 41.06231
$out
[1] 34.6 46.0 45.8 35.1 45.6
boxplot(female,main = "female", col ="yellow")outliers_values=boxplot.stats(female)$out
outliers_valuesnumeric(0)
remove_outliers_iqr=function(x){
IQR=summary(x)[5]-summary(x)[2]
up_out=summary(x)[5]+1.5*IQR
down_out=summary(x)[2]-1.5*IQR
out_data=c(which(x<down_out), which(x>up_out))
x_new=x[-out_data]
print(length(x_new))
boxplot(x_new)
summary(x_new)}
remove_outliers_iqr(male)[1] 68
Min. 1st Qu. Median Mean 3rd Qu. Max.
36.30 39.08 40.60 40.31 41.42 44.10
now let’s perform our test
source('http://www.sthda.com/upload/rquery_t_test.r')
rquery.t.test(male, female, paired=FALSE)
Two Sample t-test
data: x and y
t = 8.7765, df = 144, p-value = 4.44e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.427319 3.838435
sample estimates:
mean of x mean of y
40.39041 37.25753
There aren’t enough evidence to claim that the null hypothesis is true.
The means of the bill length of the penguins are different
Now we can focus our research on the difference of flipper length of three species of penguins.
We can also divide them by sex. The test that we can use is a two way anova.
There is no difference in the means of first factor
There is no difference in means of second factor
There is no interaction between factors
The alternative hypothesis for cases 1 and 2 is: the means are not equal.
The alternative hypothesis for case 3 is: there is an interaction between factor
let’s see our data
str(dat)tibble [330 × 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:330] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
$ bill_depth_mm : num [1:330] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
$ flipper_length_mm: int [1:330] 181 186 195 193 190 181 195 182 191 198 ...
$ body_mass_g : int [1:330] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
$ year : int [1:330] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
- attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...
Balanced design?
table(dat$species, dat$sex)
female male
Adelie 73 73
Chinstrap 34 34
Gentoo 58 58
display our data
library("ggpubr")
ggboxplot(dat, x = "species", y = "flipper_length_mm", color = "sex",
palette = c("#00AFBB", "#E7B800"))ggline(dat, x = "species", y = "flipper_length_mm", color = "sex",
add = c("mean_se", "dotplot"),
palette = c("#00AFBB", "#E7B800"))Remove outliers
there aren’t significant outliers
#find Q1, Q3, and interquartile range for values in column A
df<- dat
Q1 <- quantile(df$flipper_length_mm, .25)
Q3 <- quantile(df$flipper_length_mm, .65)
IQR <- IQR(df$flipper_length_mm)
#only keep rows in dataframe that have values within 1.5*IQR of Q1 and Q3
no_outlier <- subset(df, df$flipper_length_mm > (Q1 - 1.5*IQR) & df$flipper_length_mm < (Q3 + 1.5*IQR))
#view row and column count of new data frame
print("dimension dataset always equal : 330 8")[1] "dimension dataset always equal : 330 8"
#dim(no_outlier) Check for normality
library(rstatix)
no_outlier %>%
group_by(species,sex) %>%
shapiro_test(flipper_length_mm)# A tibble: 6 × 5
species sex variable statistic p
<fct> <fct> <chr> <dbl> <dbl>
1 Adelie female flipper_length_mm 0.984 0.491
2 Adelie male flipper_length_mm 0.984 0.498
3 Chinstrap female flipper_length_mm 0.972 0.507
4 Chinstrap male flipper_length_mm 0.975 0.620
5 Gentoo female flipper_length_mm 0.974 0.245
6 Gentoo male flipper_length_mm 0.963 0.0746
the p-value is never lower than 0.05
now let’s run our test
Two-way interaction plot
library(stats)
interaction.plot(x.factor = no_outlier$species, trace.factor = no_outlier$sex,
response = no_outlier$flipper_length_mm, fun = mean,
type = "b", legend = TRUE,
xlab = "species", ylab="flipper_length_mm",
pch=c(1,19), col = c("#00AFBB", "#E7B800"))Two-way ANOVA with interaction effect
mod2 <- aov(flipper_length_mm ~ sex + species, data = no_outlier)
summary(mod2) Df Sum Sq Mean Sq F value Pr(>F)
sex 1 3788 3788 115.5 <2e-16 ***
species 2 48995 24498 746.9 <2e-16 ***
Residuals 326 10692 33
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Two-way ANOVA without interaction effect
mod3 <- aov(flipper_length_mm ~ sex * species, data = no_outlier)
summary(mod3) Df Sum Sq Mean Sq F value Pr(>F)
sex 1 3788 3788 118.175 <2e-16 ***
species 2 48995 24498 764.330 <2e-16 ***
sex:species 2 308 154 4.803 0.0088 **
Residuals 324 10385 32
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
check for normality of the residuals
first plot
# 2. Normality
plot(mod3, 2)then use the shapiro test
# Extract the residuals
aov_residuals <- residuals(object = mod3)
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )
Shapiro-Wilk normality test
data: aov_residuals
W = 0.99565, p-value = 0.4917
check for HOMOSCEDASTICITY
first plot residuals
Use the Levene’s test to check the homogeneity of variances. The function leveneTest() [in car package] will be used:
library(car)
library(PMCMRplus)
leveneTest(flipper_length_mm ~ species * sex, data = no_outlier)## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 2.4276 0.03519 *
## 324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
try another test
hartleyTest(flipper_length_mm ~ interaction(sex,species), data = no_outlier)##
## Hartley's maximum F-ratio test of homogeneity of variances
##
## data: flipper_length_mm by interaction(sex, species)
## F Max = 2.8665, df = 57, k = 6, p-value = 0.001734
library(onewaytests)
welch.test(flipper_length_mm ~ interaction(sex,species), data = no_outlier)##
## Welch's Heteroscedastic F Test (alpha = 0.05)
## -------------------------------------------------------------
## data : flipper_length_mm and interaction(sex, species)
##
## statistic : 358.0201
## num df : 5
## denom df : 129.3854
## p.value : 6.10673e-74
##
## Result : Difference is statistically significant.
## -------------------------------------------------------------
Another interesting aspect was to understand if the bill length could have changed from an island to another.
So I’ve completed this task thanks to a non parametric test
#View(dat)
new_m <- dat %>%
filter(island == 'Dream')
new_f <- dat %>%
filter(island == 'Biscoe')
dream<- new_m[1:100,3] #select only value of interest
dream<- as.numeric(unlist(dream)) #eliminate list
biscoe<- new_f[1:100,3] #select only value of interest
biscoe<- as.numeric(unlist(biscoe)) #eliminate listcheck normality
shapiro.test(dream)
Shapiro-Wilk normality test
data: dream
W = 0.94772, p-value = 0.0005885
shapiro.test(biscoe)
Shapiro-Wilk normality test
data: biscoe
W = 0.96445, p-value = 0.008451
the p value is lower than 0.05 so, in this case, the value of the bill length is not normally distributed.
fligner.test(dream ~ biscoe)
Fligner-Killeen test of homogeneity of variances
data: dream by biscoe
Fligner-Killeen:med chi-squared = 73.958, df = 72, p-value = 0.414
display our data
library(tidyverse)
library(rstatix)
library(ggpubr)
my_data <- data.frame(
group = rep(c("dream", "biscoe"), each = 100),
weight = c(dream, biscoe)
)
ggboxplot(my_data, x = "group", y = "weight",
color = "group", palette = c("red", "pink"),
ylab = "Weight", xlab = "Groups")run our test
res <- wilcox.test(dream , biscoe)
res
Wilcoxon rank sum test with continuity correction
data: dream and biscoe
W = 4819, p-value = 0.6592
alternative hypothesis: true location shift is not equal to 0
it doesn’t seem that there’s a real difference.
effect size
library(coin)
my_data %>% wilcox_effsize(weight ~ group)# A tibble: 1 × 7
.y. group1 group2 effsize n1 n2 magnitude
* <chr> <chr> <chr> <dbl> <int> <int> <ord>
1 weight biscoe dream 0.0313 100 100 small
The Wilcoxon test showed that the difference was not significant (p > 0.05, effect size r = 0.0313).