Attaching package: 'see'
The following objects are masked from 'package:ggsci':
scale_color_material, scale_colour_material, scale_fill_material
library(rstatix)
Attaching package: 'rstatix'
The following object is masked from 'package:stats':
filter
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Using the penguins data, perform a 1-way ANOVA involving the effect of a categorical variable (x) on a numerical variable (y). Group and filter the data (remove NA, for example), calculate means and error, then make a graph. Pair that graph with an ANOVA test. Use the graph + statistical test to assess the null hypothesis of ANOVA.
penguin <- penguins %>%drop_na()aov_penguins <-aov(body_mass_g ~ island, penguin)summary(aov_penguins)
Df Sum Sq Mean Sq F value Pr(>F)
island 2 83740680 41870340 105.1 <2e-16 ***
Residuals 330 131518986 398542
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
penguin_graph <- penguin %>%group_by(island) %>%summarise(mean =mean(body_mass_g), sd =sd(body_mass_g), n =n(), se = sd/sqrt(n))ggplot(data = penguin_graph, aes(x = island, y = mean, color = island)) +geom_point() +geom_errorbar(data = penguin_graph, aes(x = island, ymin = mean-se, ymax = mean+se), width =0.2) +theme_classic() +labs(y ="Mean body mass", x ="Island") +scale_color_npg()
\(H_0:\) There is no difference between the mean body masses found on each island. \(H_A:\) At least one island has a different mean body mass. \(\alpha = 0.05\) The F-statistic is 105.1 and the p-value is less than 2e-16 which is less than our significance threshold (\(\alpha = 0.05\)). There is extremely strong evidence against the null hypothesis. We would conclude that at least one island has a mean body mass that has a statistically significant difference than the others.
Test your assumptions (individually– do not use check_model) and interpret your assumption checks
Independence
We don’t know much about the experimental design so I cannot say anything concrete regarding whether or not we will be violating the independence assumption. That being said, I will continue to run this test because it is a lab assignment.
Outliers
ggplot(data = penguin, aes(x = island, y = body_mass_g, fill = island)) +geom_boxplot() +theme_classic() +scale_fill_npg()
# A tibble: 1 × 10
island species bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
<fct> <fct> <dbl> <dbl> <int> <int> <fct>
1 Dream Chinst… 52 20.7 210 4800 male
# … with 3 more variables: year <int>, is.outlier <lgl>, is.extreme <lgl>
There is one outlier: a penguin found on Dream island who has a higher body mass than the others. However, his/her body mass is approximately the same as the median on Biscoe Island and it is possible that this is just a large penguin in comparison to the rest on the island. Therefore, I will keep the single outlier.
# A tibble: 3 × 4
island variable statistic p
<fct> <chr> <dbl> <dbl>
1 Biscoe body_mass_g 0.971 0.00159
2 Dream body_mass_g 0.988 0.367
3 Torgersen body_mass_g 0.973 0.341
According to these tests, we cannot assume normality for the whole dataset as the p-value is less than 0.05. The Shapiro-Wilk test for each group indicates that we can assume normality for the Torgersen and Dream island data since the p-values are higher than 0.05 but we cannot assume normality for Biscoe island because the p-value is less than 0.05. However, the sample size for this dataset is 276 which is sufficiently large so we can run an ANOVA and TukeyHSD even though the assumption regarding normality is violated.
Homoscedaticity
leveneTest(body_mass_g ~ island, data = penguin)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 25.969 3.358e-11 ***
330
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the levene test, we do not have homogeneity of variance because we have a p-value of 3.358e-11 which is smaller than 0.05. So we should run a welch test one-way anova.
welchpenguin <-welch_anova_test(body_mass_g ~ island, data = penguin)welchpenguin
# A tibble: 1 × 7
.y. n statistic DFn DFd p method
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 body_mass_g 333 102. 2 137. 6.06e-28 Welch ANOVA
The p-value is less than 0.05 so we reject the null hypothesis and conclude that there is a significant difference of mean body masses between islands even when using a test that does not require homogeneity of variance.
Run a TukeyHSD test on your ANOVA and interpret the results. Make sure your comparisons are easily visible and comparable in your graph (above).
TukeyHSD(aov_penguins)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = body_mass_g ~ island, data = penguin)
$island
diff lwr upr p adj
Dream-Biscoe -1000.2693 -1177.7875 -822.7512 0.000000
Torgersen-Biscoe -1010.6611 -1256.7392 -764.5830 0.000000
Torgersen-Dream -10.3918 -265.2678 244.4842 0.994933
According to the TukeyHSD, there is a statistically significant difference between the mean body mass found on Dream and Biscoe islands (p-value = 0.00) and between Torgersen and Bisoce islands (p-value = 0.00). There is not a statistically significant difference between the amen body mass found on Torgersen and Dream islands (p-value = 0.99). This is consistent with our graph from above where visually the mean body mass for Torgersen island is much higher than the other two.
Depth
Repeat 1-4 above using 2 or more explanatory variables from Palmer Penguins. Assess the effecs of multiple variable on a single numerical variable in the data frame. Make a graph or graphs (if needed). You will need to group, filter, and summarize your data to do this. Perform the necessary ANOVA and TukeyHSD test. Interpret your results (using the graph(s) and stats outputs). Check your assumptions and interpret your assumption check (you can do this individually or with check_model() if you can get the later to work with your ANOVA– it is not optimized for ANOVA and often does not work)
# A tibble: 2 × 10
species sex island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
<fct> <fct> <fct> <dbl> <dbl> <int> <int>
1 Chinst… fema… Dream 46.9 16.6 192 2700
2 Chinst… male Dream 52 20.7 210 4800
# … with 3 more variables: year <int>, is.outlier <lgl>, is.extreme <lgl>
According to these graphs the assumption of normality and homoscedasticity are both met. There are two outliers but I think that it is okay in this case because it is probably just one larger male Chinstrap and one smaller female Chinstrap. In terms of the assumption of independence, we still don’t know much about the experimental design so we cannot make a definite conclusion but I will continue with this anova anyways.
summary(aov_penmss)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 145190219 72595110 758.358 < 2e-16 ***
sex 1 37090262 37090262 387.460 < 2e-16 ***
species:sex 2 1676557 838278 8.757 0.000197 ***
Residuals 327 31302628 95727
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\(H_0:\) There is no interactive effect of species and sex on mean body mass, or an effect of species or sex on mean body mass.
\(H_A:\) There is an interactive effect between species and sex on mean body mass.
\(\alpha = 0.05\)
According to this ANOVA, at least one species has a significantly different mean body mass (p-value <2e-16) and male and female penguins have a significantly different mean body masses (p-value <2e-16). There is a significant interactive effect of species and sex on mean body mass (p-value = 0.000197). Therefore we reject the null hypothesis that says
The Tukey Post-Hoc Honest Significant Difference indicates that there is a significant difference in mean body mass between male and female penguins (p-value = 0) which is consistent with the ANOVA test. There is not a significant difference in mean body mass between Chinstrap and Adelie (p-value = 0.824) which follows what the graph shows us visually but there is a significant difference in mean body mass between Gentoo and Adelie penguins (p= 0.00) and between Gentoo and Chinstrap (p=0.00). There is a significant interactive effects between sex and species on mean body mass for all groups except Chinstrap and Adelie males (p-value = 0.5812) and Chinstrap and Adelie females (p-value = 0.1376). This makes sense because when one looks at the effect of species alone on mean body mass, there is not a significant difference between Chinstrap and Adelie. Additionally, in the graph, we can see that male Chinstrap and male Adelie appear to have similar mean body masses and the same is true for female Chinstrap and Adelie penguins.