Lab_6_Assignment

Author

Chii Kojima

Essentials

1.

Load PalmerPenguins and other necessary packages

#insert packages here :)
library(tidyverse) #For data manipulation and graphing
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggsci) #for color scales
library(patchwork)  #to panel our graphs
library(performance) #check stat/model assumptions
library(see)

Attaching package: 'see'

The following objects are masked from 'package:ggsci':

    scale_color_material, scale_colour_material, scale_fill_material
library(rstatix) #test for outliers, welch_anova_test

Attaching package: 'rstatix'

The following object is masked from 'package:stats':

    filter
library(car) #levenes test
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(palmerpenguins)

2.

Using the penguins data, perform a 1-way ANOVA involving the effect of a categorical variable (x) on a numerical variable (y). Group and filter the data (remove NA, for example), calculate means and error, then make a graph. Pair that graph with an ANOVA test. Use the graph + statistical test to assess the null hypothesis of ANOVA.

#Null hypothesis: There is no difference in flipper length among species
#Alternate hypothesis: There are differences in flipper length among species

head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
#species vs flipper_length_mm

pen1<-penguins

#DROP NA
pen2<-pen1%>%
  drop_na(flipper_length_mm,species)

pen2
# A tibble: 342 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 8 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
 9 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
10 Adelie  Torgersen           37.8          17.1        186    3300 <NA>   2007
# … with 332 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
#Group and filter the data
pen_flipp<-pen2%>%
  group_by(species)%>%
  drop_na(flipper_length_mm)%>%
  dplyr::summarize(meanflipp=mean(flipper_length_mm),sd=sd(flipper_length_mm),n=n(),se=sd/sqrt(n))

pen_flipp
# A tibble: 3 × 5
  species   meanflipp    sd     n    se
  <fct>         <dbl> <dbl> <int> <dbl>
1 Adelie         190.  6.54   151 0.532
2 Chinstrap      196.  7.13    68 0.865
3 Gentoo         217.  6.48   123 0.585
#make a graph
ggplot(data=pen_flipp,aes(x=species,y=meanflipp,color=species))+
  geom_point()+
  geom_errorbar(aes(ymin = meanflipp - se, ymax = meanflipp + se), width = 0.2)+
  theme_bw()+
  geom_point(data=pen2,aes(x=species,y=flipper_length_mm,group=species, fill=species))+
  scale_color_npg()+
  scale_fill_npg()

#Gentoo has the longest flipper, and Chinstrap has the second longest flipper. Adelie has the shortest flipper.


#Regular ANOVA
anovflipp1<-aov(flipper_length_mm~species,data=pen2)
anovflipp1
Call:
   aov(formula = flipper_length_mm ~ species, data = pen2)

Terms:
                 species Residuals
Sum of Squares  52473.28  14953.26
Deg. of Freedom        2       339

Residual standard error: 6.641529
Estimated effects may be unbalanced
summary(anovflipp1) #The p value is less than 0.01, there is a significant difference in flipper_length_mm among  species. We reject our null hypothesis that there is no difference in flipper length among species. The box plot also supports this.
             Df Sum Sq Mean Sq F value Pr(>F)    
species       2  52473   26237   594.8 <2e-16 ***
Residuals   339  14953      44                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.

Test your assumptions (individually– do not use check_model) and interpret your assumption checks

#Check outliers
pen_flipp1<-penguins%>%
  group_by(species)%>%
  drop_na(flipper_length_mm)%>%
  identify_outliers(flipper_length_mm)

pen_flipp1 ##There are only 2 outliers, (not extreme), so we can still use the data.
# A tibble: 2 × 10
  species island    bill_l…¹ bill_…² flipp…³ body_…⁴ sex    year is.ou…⁵ is.ex…⁶
  <fct>   <fct>        <dbl>   <dbl>   <int>   <int> <fct> <int> <lgl>   <lgl>  
1 Adelie  Biscoe        37.9    18.6     172    3150 fema…  2007 TRUE    FALSE  
2 Adelie  Torgersen     44.1    18       210    4000 male   2009 TRUE    FALSE  
# … with abbreviated variable names ¹​bill_length_mm, ²​bill_depth_mm,
#   ³​flipper_length_mm, ⁴​body_mass_g, ⁵​is.outlier, ⁶​is.extreme
modelflipp<-aov(flipper_length_mm ~ species, data=pen2)
modelflipp
Call:
   aov(formula = flipper_length_mm ~ species, data = pen2)

Terms:
                 species Residuals
Sum of Squares  52473.28  14953.26
Deg. of Freedom        2       339

Residual standard error: 6.641529
Estimated effects may be unbalanced
#entire model normality
shapiro_test(residuals(modelflipp)) #p.value greater than 0.05 indicates the nornality in the residuals. 
# A tibble: 1 × 3
  variable              statistic p.value
  <chr>                     <dbl>   <dbl>
1 residuals(modelflipp)     0.995   0.261
#normality of each group
pen1%>%
  group_by(species)%>%
  shapiro_test(flipper_length_mm) #Adelie and Chinstrap are normal (p>0.05) but Gentoo is not normal (p<0.01)
# A tibble: 3 × 4
  species   variable          statistic       p
  <fct>     <chr>                 <dbl>   <dbl>
1 Adelie    flipper_length_mm     0.993 0.720  
2 Chinstrap flipper_length_mm     0.989 0.811  
3 Gentoo    flipper_length_mm     0.962 0.00162
#Homogeneity of variance/Homoscedasticity
levene_test(flipper_length_mm~species,data=pen2) #This is not significant (p>0.05) so heterogeneity of variance is fine and the assumption is met. We don't need to do welch test.
# A tibble: 1 × 4
    df1   df2 statistic     p
  <int> <int>     <dbl> <dbl>
1     2   339     0.331 0.719
#Visual Check Using check_model()
check_model(modelflipp) 

4.

Run a TukeyHSD test on your ANOVA and interpret the results. Make sure your comparisons are easily visible and comparable in your graph (above).

aovtest<-aov(flipper_length_mm ~ species, data=pen2)
TukeyHSD(aovtest) #p<0.01 for all 3 interactions. There are significant differencce between all 3 species.
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = flipper_length_mm ~ species, data = pen2)

$species
                      diff       lwr       upr p adj
Chinstrap-Adelie  5.869887  3.586583  8.153191     0
Gentoo-Adelie    27.233349 25.334376 29.132323     0
Gentoo-Chinstrap 21.363462 19.000841 23.726084     0

Depth

Repeat 1-4 above using 2 or more explanatory variables from Palmer Penguins. Assess the effecs of multiple variable on a single numerical variable in the data frame. Make a graph or graphs (if needed). You will need to group, filter, and summarize your data to do this. Perform the necessary ANOVA and TukeyHSD test. Interpret your results (using the graph(s) and stats outputs). Check your assumptions and interpret your assumption check (you can do this individually or with check_model() if you can get the later to work with your ANOVA– it is not optimized for ANOVA and often does not work)

1.

Load PalmerPenguins and other necessary packages

#insert packages here :)
library(tidyverse) #For data manipulation and graphing
library(ggsci) #for color scales
library(patchwork)  #to panel our graphs
library(performance) #check stat/model assumptions
library(see)
library(rstatix) #test for outliers, welch_anova_test
library(car) #levenes test
library(palmerpenguins)

2.

Using the penguins data, perform a 1-way ANOVA involving the effect of a categorical variable (x) on a numerical variable (y). Group and filter the data (remove NA, for example), calculate means and error, then make a graph. Pair that graph with an ANOVA test. Use the graph + statistical test to assess the null hypothesis of ANOVA.

#Hypotheses
#Null Hypothesis: There is no difference in flipper length among species nor sex
#Alternate Hypothesis: There is differences in flipper length among species and sex

#Group and Filter the data

pen_flipp_sex<-pen1%>%
  group_by(species,sex)%>%
  drop_na(flipper_length_mm)%>%
   dplyr::summarize(meanflipp=mean(flipper_length_mm),sd=sd(flipper_length_mm),n=n(),se=sd/sqrt(n))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
pen_flipp_sex
# A tibble: 8 × 6
# Groups:   species [3]
  species   sex    meanflipp    sd     n    se
  <fct>     <fct>      <dbl> <dbl> <int> <dbl>
1 Adelie    female      188.  5.60    73 0.655
2 Adelie    male        192.  6.60    73 0.772
3 Adelie    <NA>        186.  6.11     5 2.73 
4 Chinstrap female      192.  5.75    34 0.987
5 Chinstrap male        200.  5.98    34 1.02 
6 Gentoo    female      213.  3.90    58 0.512
7 Gentoo    male        222.  5.67    61 0.726
8 Gentoo    <NA>        216.  1.26     4 0.629
#Drop NA
pen3<-pen1%>%
  drop_na(flipper_length_mm,sex)
pen3
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           39.1          18.7        181    3750 male   2007
 2 Adelie  Torgersen           39.5          17.4        186    3800 fema…  2007
 3 Adelie  Torgersen           40.3          18          195    3250 fema…  2007
 4 Adelie  Torgersen           36.7          19.3        193    3450 fema…  2007
 5 Adelie  Torgersen           39.3          20.6        190    3650 male   2007
 6 Adelie  Torgersen           38.9          17.8        181    3625 fema…  2007
 7 Adelie  Torgersen           39.2          19.6        195    4675 male   2007
 8 Adelie  Torgersen           41.1          17.6        182    3200 fema…  2007
 9 Adelie  Torgersen           38.6          21.2        191    3800 male   2007
10 Adelie  Torgersen           34.6          21.1        198    4400 male   2007
# … with 323 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
#Make a graph
ggplot(data=pen_flipp_sex, aes(x=species, y=meanflipp, color = sex))+
  geom_point()+
  geom_errorbar(aes(ymin = meanflipp - se, ymax = meanflipp + se), width = 0.2)+
  geom_point(data = pen3, aes(x=species, y=flipper_length_mm,color=sex), position = position_jitterdodge(dodge.width = 0.75), alpha = 0.3)

#For all 3 species, male tend to have longer flipper than female. Gentoo has the longest flipper, and Chinstrap has the second longest flipper. Adelie has the shortest flipper.


aov_pen2<-aov(flipper_length_mm ~ species*sex, data=pen3)
summary(aov_pen2) #The low p-values (<0.01) suggest the sigificant difference in flipper length among species and sex. We can reject our null hypothesis that there is no difference in flipper length among species nor sex. The graph also supports this.
             Df Sum Sq Mean Sq F value  Pr(>F)    
species       2  50526   25263 789.912 < 2e-16 ***
sex           1   3906    3906 122.119 < 2e-16 ***
species:sex   2    329     165   5.144 0.00631 ** 
Residuals   327  10458      32                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.

Test your assumptions (individually– do not use check_model) and interpret your assumption checks

#Visual Check Using check_model()
check_model(aov_pen2) #Linearity, Homogeneity, Normality look fine to some extent for ANOVA.
Variable `Component` is not in your data frame :/

#Outlier test
pen_flipp2<-pen3%>%
  group_by(species,sex)%>%
  drop_na(flipper_length_mm)%>%
  identify_outliers(flipper_length_mm)

pen_flipp2 ##There are 4 outliers, (not extreme), so we can still use the data.
# A tibble: 4 × 10
  species sex    island    bill_…¹ bill_…² flipp…³ body_…⁴  year is.ou…⁵ is.ex…⁶
  <fct>   <fct>  <fct>       <dbl>   <dbl>   <int>   <int> <int> <lgl>   <lgl>  
1 Adelie  female Biscoe       37.8    18.3     174    3400  2007 TRUE    FALSE  
2 Adelie  female Biscoe       37.9    18.6     172    3150  2007 TRUE    FALSE  
3 Adelie  female Dream        35.7    18       202    3550  2008 TRUE    FALSE  
4 Adelie  male   Torgersen    44.1    18       210    4000  2009 TRUE    FALSE  
# … with abbreviated variable names ¹​bill_length_mm, ²​bill_depth_mm,
#   ³​flipper_length_mm, ⁴​body_mass_g, ⁵​is.outlier, ⁶​is.extreme

4.

Run a TukeyHSD test on your ANOVA and interpret the results. Make sure your comparisons are easily visible and comparable in your graph (above).

TukeyHSD(aov_pen2) #Adelie:male and Adelie:female, Chinstrap:male-Chinstrap:female and Gentoo:male-Gentoo:female have p<0.01, which suggest that there is a significant difference in lipper length between sex in species. male-female has p=~0, which suggests the significant difference in flipper length between female and male.
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = flipper_length_mm ~ species * sex, data = pen3)

$species
                     diff      lwr       upr p adj
Chinstrap-Adelie  5.72079  3.76593  7.675649     0
Gentoo-Adelie    27.13255 25.48814 28.776974     0
Gentoo-Chinstrap 21.41176 19.38766 23.435867     0

$sex
                diff      lwr      upr p adj
male-female 6.849169 5.629796 8.068542     0

$`species:sex`
                                       diff        lwr        upr     p adj
Chinstrap:female-Adelie:female    3.9407736   0.574682   7.306865 0.0113141
Gentoo:female-Adelie:female      24.9123760  22.060733  27.764019 0.0000000
Adelie:male-Adelie:female         4.6164384   1.933019   7.299857 0.0000191
Chinstrap:male-Adelie:female     12.1172442   8.751153  15.483336 0.0000000
Gentoo:male-Adelie:female        33.7464631  30.934167  36.558759 0.0000000
Gentoo:female-Chinstrap:female   20.9716024  17.469931  24.473274 0.0000000
Adelie:male-Chinstrap:female      0.6756648  -2.690427   4.041756 0.9925701
Chinstrap:male-Chinstrap:female   8.1764706   4.244498  12.108443 0.0000001
Gentoo:male-Chinstrap:female     29.8056895  26.335986  33.275393 0.0000000
Adelie:male-Gentoo:female       -20.2959376 -23.147581 -17.444295 0.0000000
Chinstrap:male-Gentoo:female    -12.7951318 -16.296803  -9.293460 0.0000000
Gentoo:male-Gentoo:female         8.8340871   5.860850  11.807324 0.0000000
Chinstrap:male-Adelie:male        7.5008058   4.134714  10.866897 0.0000000
Gentoo:male-Adelie:male          29.1300247  26.317729  31.942320 0.0000000
Gentoo:male-Chinstrap:male       21.6292189  18.159515  25.098922 0.0000000