Palmer Penguin data

dataset

The Palmer penguins dataset by Allison Horst, Alison Hill, and Kristen Gorman was first made publicly available as an R package.

Size measurements, clutch observations, and blood isotope ratios for 344 adult foraging Adelie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica.

The variables in the dataset are:

species
island
bill length
bill depth
flipper length
body max
sex
year

dat<- penguins
dat <- na.omit(dat)

dat<- dat[-c(259,260,263), ]

#View(penguins)
#View(dat)

dat

# A tibble: 330 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 8 Adelie  Torgersen           41.1          17.6        182    3200 fema…  2007
 9 Adelie  Torgersen           38.6          21.2        191    3800 male   2007
10 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
# … with 320 more rows, and abbreviated variable names ¹flipper_length_mm,
#   ²body_mass_g
# ℹ Use `print(n = ...)` to see more rows

bill length by sex

As a first analysis I would focus my attention on the male and female penguins of the “Adelie” species

Let’s test with a two sample unpaired t-test

H0 : there’s no difference between the bill length of the male and female penguins (μM=μF)

H1: there’s difference between the bill length of the male and female penguins (μM≠μF)

display our data

library(tseries)
library(dplyr)

new_male <- dat %>%
  filter(species == 'Adelie' & sex == 'male')

new_female <- dat %>%
  filter(species == 'Adelie' & sex == 'female')


male<- new_male[,3] #select only value of interest
male<- as.numeric(unlist(male)) #eliminate list

female<- new_female[,3] #select only value of interest
female<- as.numeric(unlist(female)) #eliminate list

Let’s display our data in order to see graphically if there are already some difference.

par(mfrow=c(1,2))
boxplot(male, main ="male", col ="orange")
boxplot(female, main = "female", col ="yellow")

check for normality

shapiro.test(male)


    Shapiro-Wilk normality test

data:  male
W = 0.98613, p-value = 0.6067

shapiro.test(female)


    Shapiro-Wilk normality test

data:  female
W = 0.99117, p-value = 0.8952

since that the value is not less than 0.5 we can say that, our data, are normally distributed.

we can also display it graphically

par(mfrow=c(1,2))

x1<- hist(male, col='steelblue')
x2<- hist(female,col='steelblue')

check for “equal” variances

sd(male)/sd(female) should not be greater than 2.

In this case we’ll use an F- test.\

var.test(male, female, 
         alternative = "two.sided")


    F test to compare two variances

data:  male and female
F = 1.2597, num df = 72, denom df = 72, p-value = 0.3296
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7907415 2.0067311
sample estimates:
ratio of variances 
          1.259685

The p-value of F-test is p = 0.3296 which is greater than the significance level 0.05. In conclusion, there is no significant difference between the two variances.

So we can say that there’s homogeneity of variances

check for outliers

par(mfrow=c(1,2))
boxplot(male,main = "male", col ="orange")
outliers_values=boxplot.stats(male)
outliers_values

$stats
[1] 36.3 39.0 40.6 41.5 44.1

$n
[1] 73

$conf
[1] 40.13769 41.06231

$out
[1] 34.6 46.0 45.8 35.1 45.6

boxplot(female,main = "female", col ="yellow")

outliers_values=boxplot.stats(female)$out
outliers_values

numeric(0)

remove outliers for male

remove_outliers_iqr=function(x){
  IQR=summary(x)[5]-summary(x)[2]
  up_out=summary(x)[5]+1.5*IQR
  down_out=summary(x)[2]-1.5*IQR
  out_data=c(which(x<down_out), which(x>up_out))
  x_new=x[-out_data]
  print(length(x_new))
  boxplot(x_new)
  summary(x_new)}



remove_outliers_iqr(male)

[1] 68

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  36.30   39.08   40.60   40.31   41.42   44.10

now let’s perform our test

source('http://www.sthda.com/upload/rquery_t_test.r')
rquery.t.test(male, female, paired=FALSE)


    Two Sample t-test

data:  x and y
t = 8.7765, df = 144, p-value = 4.44e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.427319 3.838435
sample estimates:
mean of x mean of y 
 40.39041  37.25753

There aren’t enough evidence to claim that the null hypothesis is true.

The means of the bill length of the penguins are different

Difference of flipper length by species and sex

Now we can focus our research on the difference of flipper length of three species of penguins.

We can also divide them by sex. The test that we can use is a two way anova.

Two-way ANOVA test hypotheses

There is no difference in the means of first factor
There is no difference in means of second factor
There is no interaction between factors

The alternative hypothesis for cases 1 and 2 is: the means are not equal.

The alternative hypothesis for case 3 is: there is an interaction between factor

let’s see our data

str(dat)

tibble [330 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:330] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
 $ bill_depth_mm    : num [1:330] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
 $ flipper_length_mm: int [1:330] 181 186 195 193 190 181 195 182 191 198 ...
 $ body_mass_g      : int [1:330] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 1 2 2 ...
 $ year             : int [1:330] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
 - attr(*, "na.action")= 'omit' Named int [1:11] 4 9 10 11 12 48 179 219 257 269 ...
  ..- attr(*, "names")= chr [1:11] "4" "9" "10" "11" ...

Balanced design?

table(dat$species, dat$sex)

           
            female male
  Adelie        73   73
  Chinstrap     34   34
  Gentoo        58   58

display our data

library("ggpubr")

ggboxplot(dat, x = "species", y = "flipper_length_mm", color = "sex",
          palette = c("#00AFBB", "#E7B800"))

ggline(dat, x = "species", y = "flipper_length_mm", color = "sex",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))

Remove outliers

there aren’t significant outliers

#find Q1, Q3, and interquartile range for values in column A

df<- dat

Q1 <- quantile(df$flipper_length_mm, .25)
Q3 <- quantile(df$flipper_length_mm, .65)
IQR <- IQR(df$flipper_length_mm)

#only keep rows in dataframe that have values within 1.5*IQR of Q1 and Q3
no_outlier <- subset(df, df$flipper_length_mm > (Q1 - 1.5*IQR) & df$flipper_length_mm < (Q3 + 1.5*IQR))

#view row and column count of new data frame
print("dimension dataset always equal : 330   8")

[1] "dimension dataset always equal : 330   8"

#dim(no_outlier)

Check for normality

library(rstatix)

 no_outlier %>%
  group_by(species,sex) %>%
  shapiro_test(flipper_length_mm)

# A tibble: 6 × 5
  species   sex    variable          statistic      p
  <fct>     <fct>  <chr>                 <dbl>  <dbl>
1 Adelie    female flipper_length_mm     0.984 0.491 
2 Adelie    male   flipper_length_mm     0.984 0.498 
3 Chinstrap female flipper_length_mm     0.972 0.507 
4 Chinstrap male   flipper_length_mm     0.975 0.620 
5 Gentoo    female flipper_length_mm     0.974 0.245 
6 Gentoo    male   flipper_length_mm     0.963 0.0746

the p-value is never lower than 0.05

normality checked

now let’s run our test

Two-way interaction plot

library(stats)
interaction.plot(x.factor = no_outlier$species, trace.factor = no_outlier$sex, 
                 response = no_outlier$flipper_length_mm, fun = mean, 
                 type = "b", legend = TRUE, 
                 xlab = "species", ylab="flipper_length_mm",
                 pch=c(1,19), col = c("#00AFBB", "#E7B800"))

Two-way ANOVA with interaction effect

mod2 <- aov(flipper_length_mm ~ sex + species, data = no_outlier)
summary(mod2)

             Df Sum Sq Mean Sq F value Pr(>F)    
sex           1   3788    3788   115.5 <2e-16 ***
species       2  48995   24498   746.9 <2e-16 ***
Residuals   326  10692      33                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Two-way ANOVA without interaction effect

mod3 <- aov(flipper_length_mm ~ sex * species, data = no_outlier)
summary(mod3)

             Df Sum Sq Mean Sq F value Pr(>F)    
sex           1   3788    3788 118.175 <2e-16 ***
species       2  48995   24498 764.330 <2e-16 ***
sex:species   2    308     154   4.803 0.0088 ** 
Residuals   324  10385      32                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

check for normality of the residuals

first plot

# 2. Normality
plot(mod3, 2)

then use the shapiro test

# Extract the residuals
aov_residuals <- residuals(object = mod3)
# Run Shapiro-Wilk test
shapiro.test(x = aov_residuals )


    Shapiro-Wilk normality test

data:  aov_residuals
W = 0.99565, p-value = 0.4917

checked

check for HOMOSCEDASTICITY

first plot residuals

Use the Levene’s test to check the homogeneity of variances. The function leveneTest() [in car package] will be used:

library(car)
library(PMCMRplus)
leveneTest(flipper_length_mm ~ species * sex, data = no_outlier)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   5  2.4276 0.03519 *
##       324                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

try another test

hartleyTest(flipper_length_mm ~ interaction(sex,species), data = no_outlier)

## 
##  Hartley's maximum F-ratio test of homogeneity of variances
## 
## data:  flipper_length_mm by interaction(sex, species)
## F Max = 2.8665, df = 57, k = 6, p-value = 0.001734

library(onewaytests)

welch.test(flipper_length_mm ~ interaction(sex,species), data = no_outlier)

## 
##   Welch's Heteroscedastic F Test (alpha = 0.05) 
## ------------------------------------------------------------- 
##   data : flipper_length_mm and interaction(sex, species) 
## 
##   statistic  : 358.0201 
##   num df     : 5 
##   denom df   : 129.3854 
##   p.value    : 6.10673e-74 
## 
##   Result     : Difference is statistically significant. 
## -------------------------------------------------------------

homogeneity of variances not present

bill length by island

Another interesting aspect was to understand if the bill length could have changed from an island to another.

So I’ve completed this task thanks to a non parametric test

#View(dat)
new_m <- dat %>%
  filter(island == 'Dream')


new_f <- dat %>%
  filter(island == 'Biscoe')


dream<- new_m[1:100,3] #select only value of interest
dream<- as.numeric(unlist(dream)) #eliminate list

biscoe<- new_f[1:100,3] #select only value of interest
biscoe<- as.numeric(unlist(biscoe)) #eliminate list

check normality

shapiro.test(dream)


    Shapiro-Wilk normality test

data:  dream
W = 0.94772, p-value = 0.0005885

shapiro.test(biscoe)


    Shapiro-Wilk normality test

data:  biscoe
W = 0.96445, p-value = 0.008451

the p value is lower than 0.05 so, in this case, the value of the bill length is not normally distributed.

check variances

fligner.test(dream ~ biscoe)


    Fligner-Killeen test of homogeneity of variances

data:  dream by biscoe
Fligner-Killeen:med chi-squared = 73.958, df = 72, p-value = 0.414

does the median value of the penguins on the first island differ from the median of the penguin on the second one?

display our data

library(tidyverse)
library(rstatix)
library(ggpubr)

my_data <- data.frame( 
  group = rep(c("dream", "biscoe"), each = 100),
  weight = c(dream,  biscoe)
)


ggboxplot(my_data, x = "group", y = "weight", 

          color = "group", palette = c("red", "pink"),

          ylab = "Weight", xlab = "Groups")

run our test

res <- wilcox.test(dream , biscoe)

res


    Wilcoxon rank sum test with continuity correction

data:  dream and biscoe
W = 4819, p-value = 0.6592
alternative hypothesis: true location shift is not equal to 0

it doesn’t seem that there’s a real difference.

effect size

library(coin)
my_data %>% wilcox_effsize(weight ~ group)

# A tibble: 1 × 7
  .y.    group1 group2 effsize    n1    n2 magnitude
* <chr>  <chr>  <chr>    <dbl> <int> <int> <ord>    
1 weight biscoe dream   0.0313   100   100 small

The Wilcoxon test showed that the difference was not significant (p > 0.05, effect size r = 0.0313).

EX. RESEARCH DESIGN

Antonio

Palmer Penguin data

dataset

bill length by sex

check for outliers

remove outliers for male

Difference of flipper length by species and sex

Two-way ANOVA test hypotheses

bill length by island

does the median value of the penguins on the first island differ from the median of the penguin on the second one?