── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(see)
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Code
library(patchwork)
library(ggsci)
Attaching package: 'ggsci'
The following objects are masked from 'package:see':
scale_color_material, scale_colour_material, scale_fill_material
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
Code
library(car) # contains some statistical tests we need to assess assumptions
library(gt)
Attaching package: 'gt'
The following objects are masked from 'package:Hmisc':
html, latex
Code
library(rstatix) #test for outliers, welch_anova_test
Attaching package: 'rstatix'
The following object is masked from 'package:stats':
filter
Code
library(ggplot2)
Lab 5
November-December 2025
Step 1: Species and Bill Length
1.) Grab your trusty penguins data frame and test the following hypothesis: “There is no effect of species on bill length in penguins.” Show a useful graphical representation of the data, run the appropriate stats, and interpret the results. Put the stats into a table! Don’t forget to check assumptions.
Hypotheses
H0: There is no effect of species on bill length in penguins.
HA: There is an effect of species on bill length in penguins.
Code
pens <- penguins %>% # make a data frame without NA values to make the ANOVA tests smoother
  na.omit()
pens
colors <- c('#b9ef69', '#ed8e88', '#ce8bee') # this is a BEAUTIFUL, custom, 3-color palette
ggplot(data = pens, aes(x = species, y = bill_len, group = species, color = species)) +
  geom_boxplot() + # box plot
  theme_classic() + # white bg
  scale_color_manual(values = colors) + # use my colors
  labs(title = "Species and Bill Length Box Plot", x = "Species", y = "Bill Length (mm)", color = "Species") + # add titles
  theme(text = element_text(size = 18)) # change text size
There is one outlier for Gentoo penguins, but there do not appear to be any other notable outliers. The graph suggests that there is very likely an effect of species on bill length.
Just to double check, I am running a grouped pipeline to identify the outliers in bill length for each species. As seen in the box plot, there is only one outlier, and it is flagged as not extreme, so outliers are not a major concern for the ANOVA moving forward.
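A minimal sketch of that check, assuming rstatix::identify_outliers() applied to the grouped data:
Code
pens %>%
  group_by(species) %>% # check each species separately
  identify_outliers(bill_len) # flags outliers and whether they are extreme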
# A tibble: 1 × 10
species island bill_len bill_dep flipper_len body_mass sex year is.outlier
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <lgl>
1 Gentoo Biscoe 59.6 17 230 6050 male 2007 TRUE
# ℹ 1 more variable: is.extreme <lgl>
Use a one-way ANOVA.
Code
pensaov <- aov(bill_len ~ species, data = pens)
pensaov
Call:
aov(formula = bill_len ~ species, data = pens)
Terms:
species Residuals
Sum of Squares 7015.386 2913.517
Deg. of Freedom 2 330
Residual standard error: 2.971336
Estimated effects may be unbalanced
The default print of the aov object does not show a p-value (the "estimated effects may be unbalanced" note refers to the design, not the p-value), so I am running summary() to get the p-value.
Code
summary(pensaov) #get p value
Df Sum Sq Mean Sq F value Pr(>F)
species 2 7015 3508 397.3 <2e-16 ***
Residuals 330 2914 9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.05, which means there is a significant effect of species on bill length.
Welch's ANOVA test
Another test of the p-value: "Welch's ANOVA compares two means to see if they are equal" (https://www.statisticshowto.com/welchs-anova/); unlike the standard one-way ANOVA, it does not assume equal variances across groups.
Code
welch_anova_test(bill_len~species, data=pens)
# A tibble: 1 × 7
.y. n statistic DFn DFd p method
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 bill_len 333 410. 2 165. 8.27e-65 Welch ANOVA
The p-value for the Welch's ANOVA test is 8.27e-65, which is an extremely small number. Because the p-value for the effect of species on bill length is less than 0.05, there is an effect of species on bill length, and the null hypothesis is rejected.
Code
modelpens <- aov(bill_len ~ species, data = pens) # check for normality
check_model(modelpens)
Code
shapiro_test(residuals(modelpens)) # this is a normality test
The Shapiro-Wilk test is significant because its p-value is less than 0.05, indicating that the model residuals deviate from normality. (Note that the hypothesis rejected here is the Shapiro test's null of normality, not the ANOVA's null about species.)
Code
pens %>%
  group_by(species) %>%
  shapiro_test(bill_len) # check normality by species
# A tibble: 3 × 4
species variable statistic p
<fct> <chr> <dbl> <dbl>
1 Adelie bill_len 0.993 0.685
2 Chinstrap bill_len 0.975 0.194
3 Gentoo bill_len 0.974 0.0199
The Shapiro test shows that bill length is consistent with normality for Adelie and Chinstrap penguins, but the Gentoo data deviates from normality (p = 0.0199). This can be kept in mind for the two-way ANOVA in a later section.
Code
# Homogeneity of variance / homoscedasticity
leveneTest(bill_len ~ species, data = pens)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 2.2855 0.1033
330
The p-value is greater than 0.05, so the variances do not differ significantly across species; the homogeneity-of-variance assumption is met, even though the Gentoo group departs from normality, as seen in the Shapiro test.
Step 2: Tukey Test
2.) We almost never care ONLY about an effect or no effect, we also care about magnitude and direction of effects. For example, a better hypothesis for #1 might be: “There is no difference in bill length between the three species of penguin.” Test this hypothesis! – You will need to do additional stats. Interpret your results!
Hypotheses
H0 = There is no difference in bill length between the three species of penguin.
HA = There is a difference in bill length between the three species of penguin.
Code
pensaov <- aov(bill_len ~ species, data = pens) # make an ANOVA model of bill length in relation to penguin species
summary(pensaov)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 7015 3508 397.3 <2e-16 ***
Residuals 330 2914 9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.05, so there is a significant effect of species on bill length. To see which species differ, and by how much, I ran Tukey's HSD post-hoc test.
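A minimal sketch of that call on the fitted model, assuming base R's TukeyHSD():
Code
TukeyHSD(pensaov) # pairwise differences in mean bill length with 95% family-wise confidence intervals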
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = bill_len ~ species, data = pens)
$species
diff lwr upr p adj
Chinstrap-Adelie 10.009851 8.982789 11.0369128 0.0000000
Gentoo-Adelie 8.744095 7.880135 9.6080546 0.0000000
Gentoo-Chinstrap -1.265756 -2.329197 -0.2023151 0.0148212
Every pairwise comparison has an adjusted p-value below 0.05, so each species differs from the others in mean bill length. Therefore, H0 can be rejected: there is a difference in bill length between the species. To check this visually, I will graph the data with boxplots plus means and error bars.
Code
billsum <- penguins %>% # first I have to make a summary dataset to get the means
  group_by(species) %>%
  na.omit() %>%
  summarise(meanbill = mean(bill_len), sd = sd(bill_len), n = n(), se = sd / sqrt(n)) # the standard deviation and sample size give the standard error
billsum
# A tibble: 3 × 5
species meanbill sd n se
<fct> <dbl> <dbl> <int> <dbl>
1 Adelie 38.8 2.66 146 0.220
2 Chinstrap 48.8 3.34 68 0.405
3 Gentoo 47.6 3.11 119 0.285
Code
ggplot() +
  geom_boxplot(data = pens, aes(x = species, y = bill_len, fill = species)) + # box plot
  geom_errorbar(data = billsum, aes(x = species, ymin = meanbill - se, ymax = meanbill + se), width = 0.2) + # error bars with billsum data
  geom_point(data = billsum, aes(x = species, y = meanbill), size = 1) + # averages with billsum data
  theme_classic() + # white bg
  scale_fill_manual(values = colors) + # use my colors
  labs(title = "Bill Length by Species", x = "Species", y = "Bill Length (mm)", fill = "Species") + # add titles
  theme(text = element_text(size = 12)) # change text size
As seen in the graph, Adelie penguins have significantly shorter bills on average than Chinstrap and Gentoo penguins. The Tukey test supports this claim: the mean difference in bill length between Chinstrap and Adelie is 10.009851 mm and between Gentoo and Adelie is 8.744095 mm, whereas the mean difference between Gentoo and Chinstrap bill lengths is only 1.265756 mm (still significant, p adj = 0.0148). Therefore, the null hypothesis is rejected because bill length differs between species.
Step 3: Two-Way ANOVA
3.) Repeat the process for the following hypothesis: “There is no difference in flipper length between binary biological sexes and/or penguin species.” Include an optimal graph (means+error at least- raw data in the background is best). Keep in mind that you now have 3 variables and only 2 dimensions, so you will need to figure out how to represent the 3rd variable. Run the appropriate stats and interpret. Don’t forget about assumptions.
Hypotheses
H0 = There is no difference in flipper length between binary biological sexes and/or penguin species.
HA = There is a difference in flipper length between binary biological sexes and/or penguin species.
Code
flipsum <- pens %>% # get the data for means and error bars
  group_by(species, sex) %>%
  na.omit() %>% # removes rows with NA values (a few rows may otherwise have NA due to sampling error in the field)
  summarise(meanflip = mean(flipper_len), sd = sd(flipper_len), n = n(), se = sd / sqrt(n))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
Code
flipsum
# A tibble: 6 × 6
# Groups: species [3]
species sex meanflip sd n se
<fct> <fct> <dbl> <dbl> <int> <dbl>
1 Adelie female 188. 5.60 73 0.655
2 Adelie male 192. 6.60 73 0.772
3 Chinstrap female 192. 5.75 34 0.987
4 Chinstrap male 200. 5.98 34 1.02
5 Gentoo female 213. 3.90 58 0.512
6 Gentoo male 222. 5.67 61 0.726
This is a patchwork plot of the data of flipper length by species and sex:
Code
a <- ggplot() +
  geom_point(data = pens, aes(x = sex, y = flipper_len, color = sex), alpha = 0.2) + # raw data
  geom_errorbar(data = flipsum, aes(x = sex, ymin = meanflip - se, ymax = meanflip + se, color = sex), width = 0.7) + # error bars
  scale_color_manual(values = colors) + # use my colors
  labs(title = "Flipper Length by Sex", x = "Sex", y = "Flipper Length (mm)", color = "Sex") + # add titles
  theme_classic() +
  theme(text = element_text(size = 10)) # change text size

b <- ggplot() +
  geom_point(data = pens, aes(x = species, y = flipper_len, color = species), alpha = 0.2) + # raw data
  geom_errorbar(data = flipsum, aes(x = species, ymin = meanflip - se, ymax = meanflip + se, color = species), width = 0.9) +
  theme_classic() + # white bg
  scale_color_manual(values = colors) + # use my colors
  labs(title = "Flipper Length by Species", x = "Species", y = "Flipper Length (mm)", color = "Species") + # add titles
  theme(text = element_text(size = 6)) # change text size

c <- ggplot() +
  geom_point(data = pens, aes(x = species, y = flipper_len, color = sex), alpha = 0.2) + # raw data
  geom_errorbar(data = flipsum, aes(x = species, ymin = meanflip - se, ymax = meanflip + se, color = sex), width = 0.9) +
  theme_classic() + # white bg
  scale_color_manual(values = colors) + # use my colors
  labs(title = "Flipper Length by Sex and Species", x = "Species", y = "Flipper Length (mm)", color = "Sex") + # add titles
  theme(text = element_text(size = 6)) # change text size

a + b + c # patchwork!
Code
# additive ANOVA
flipaov <- aov(flipper_len ~ species + sex, data = pens) # this calls up the anova model, but I'll mostly be looking at the summary output
flipaov
Call:
aov(formula = flipper_len ~ species + sex, data = pens)
Terms:
species sex Residuals
Sum of Squares 50525.88 3905.60 10787.15
Deg. of Freedom 2 1 329
Residual standard error: 5.726053
Estimated effects may be unbalanced
Code
summary(flipaov) # anova data, but prettier!
Df Sum Sq Mean Sq F value Pr(>F)
species 2 50526 25263 770.5 <2e-16 ***
sex 1 3906 3906 119.1 <2e-16 ***
Residuals 329 10787 33
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA summary has three main takeaways: (1) species has a significant effect on flipper length, with a p-value < 2e-16, far below 0.05. (2) Sex also has a significant effect on flipper length, again with a p-value < 2e-16. (3) The residuals row captures the within-group variation left over after accounting for species and sex (statsandr.com/blog/anova-in-r/); the 329 is the residual degrees of freedom, not a measure of difference. In conclusion, the null hypothesis is rejected: both species and sex are associated with flipper length. The graphs support this conclusion, since Gentoo penguins have much longer flippers on average than Adelie and Chinstrap penguins, and males have longer flippers than females within each species.
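Because the hypothesis says "and/or," the species-by-sex interaction could also be checked. A minimal sketch of that extra model (not fitted above; flipaovint is a hypothetical name):
Code
flipaovint <- aov(flipper_len ~ species * sex, data = pens) # '*' adds the species:sex interaction term
summary(flipaovint) # a significant species:sex row would mean the sex difference in flipper length varies by species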
Step 4: Mr. Trash Wheel
Mr. Trash Wheel and friends collect pollution from tributaries and Baltimore Harbor. There are now 4 such trash wheels in the area. Let’s see if we can figure out which of these wheels does the most cleaning! Using the data you just read in, test the following hypothesis: “There is no difference in plastic bottle collection across months or trashwheels.” Remember, you will need to make a useful graph, run your stats, test assumptions, and interpret.
Hypotheses
H0 = There is no difference in plastic bottle collection across months or trashwheels.
HA = There is a difference in plastic bottle collection across months or trashwheels.
First, I have to read in the data and make a dataframe. I will call it “trash”.
Code
trash <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-05/trashwheel.csv")
head(trash) # show data
plasticaov <- aov(PlasticBottles ~ Month + Dumpster, data = trash) # this calls up the anova data, but I'll mostly be looking at the summary output
plasticaov
Call:
aov(formula = PlasticBottles ~ Month + Dumpster, data = trash)
Terms:
Month Dumpster Residuals
Sum of Squares 80806432 48879879 2569781377
Deg. of Freedom 13 1 977
Residual standard error: 1621.813
Estimated effects may be unbalanced
1 observation deleted due to missingness
Code
summary(plasticaov) #here's the important info
Df Sum Sq Mean Sq F value Pr(>F)
Month 13 8.081e+07 6215879 2.363 0.0041 **
Dumpster 1 4.888e+07 48879879 18.584 1.79e-05 ***
Residuals 977 2.570e+09 2630278
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 observation deleted due to missingness
The data shows that there is a difference in plastic bottle collection by both month and dumpster: the p-value for month is 0.0041 and the p-value for dumpster is 1.79e-05, both below 0.05, so both effects are significant.
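The prompt also asks about assumptions; a minimal sketch of checks analogous to Step 1, reusing leveneTest() and shapiro_test() on this model (treating Month as a factor here is an assumption about how to handle it):
Code
leveneTest(PlasticBottles ~ as.factor(Month), data = trash) # homogeneity of variance across months
shapiro_test(residuals(plasticaov)) # normality of the ANOVA residuals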
I will now graph this data to see a visual representation.
Code
plasum <- trash %>% # get the data for means and error bars
  group_by(Dumpster, Month) %>%
  na.omit() %>% # removes rows with NA values
  summarise(meanpla = mean(PlasticBottles), sd = sd(PlasticBottles), n = n(), se = sd / sqrt(n))
`summarise()` has grouped output by 'Dumpster'. You can override using the
`.groups` argument.
Code
plasum
# A tibble: 629 × 6
# Groups: Dumpster [629]
Dumpster Month meanpla sd n se
<int> <chr> <dbl> <dbl> <int> <dbl>
1 1 May 1450 NA 1 NA
2 2 May 1120 NA 1 NA
3 3 May 2450 NA 1 NA
4 4 May 2380 NA 1 NA
5 5 May 980 NA 1 NA
6 6 May 1430 NA 1 NA
7 7 May 910 NA 1 NA
8 8 May 3580 NA 1 NA
9 9 June 2400 NA 1 NA
10 10 June 1340 NA 1 NA
# ℹ 619 more rows
I troubleshot this for over two hours: sd and se come back as NA because each Dumpster-Month combination contains only one observation (n = 1), so sd() has nothing to compute and returns NA. This is particularly annoying because it means I do not have error bars. Instead, I used a line of best fit / linear regression line to show the trend, which fits this dataset reasonably well anyway.
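One hedged alternative, assuming the file's Name column identifies each trash wheel (it is not used elsewhere in this report), would be to summarise by wheel and month so each group has more than one load and the standard error is defined:
Code
plasum2 <- trash %>% # hypothetical alternative summary table
  group_by(Name, Month) %>% # Name = which trash wheel; assumed to exist in the csv
  summarise(meanpla = mean(PlasticBottles, na.rm = TRUE),
            sd = sd(PlasticBottles, na.rm = TRUE),
            n = n(),
            se = sd / sqrt(n),
            .groups = "drop")
plasum2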
Code
d <- ggplot() +
  geom_point(data = trash, aes(x = Dumpster, y = PlasticBottles, color = Dumpster), alpha = 0.2) + # raw data
  geom_smooth(data = trash, aes(x = Dumpster, y = PlasticBottles), method = 'lm') +
  labs(title = "Plastic Bottles by Dumpster", x = "Dumpster", y = "Plastic Bottles", color = "Dumpster") + # add titles
  theme_classic() +
  theme(text = element_text(size = 8)) # change text size

e <- ggplot() +
  geom_point(data = trash, aes(x = Month, y = PlasticBottles, color = Month), alpha = 0.2) + # raw data
  geom_smooth(data = trash, aes(x = Month, y = PlasticBottles), method = 'lm') +
  labs(title = "Plastic Bottles by Month", x = "Month", y = "Plastic Bottles", color = "Month") + # add titles
  theme_classic() +
  theme(text = element_text(size = 6)) # change text size

f <- ggplot() +
  geom_point(data = trash, aes(x = Dumpster, y = PlasticBottles, color = Month), alpha = 0.2) + # raw data
  geom_smooth(data = trash, aes(x = Dumpster, y = PlasticBottles), method = 'lm') +
  labs(title = "Plastic Bottles by Month and Dumpster", x = "Dumpster", y = "Plastic Bottles", color = "Month") + # add titles
  theme_classic() +
  theme(text = element_text(size = 8)) # change text size

d + e + f # patchwork <3
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range (`stat_smooth()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range (`stat_smooth()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
From the graphs we can see that there is a negative relationship between dumpster number and plastic bottle collection: the number of bottles collected per load decreases, on average, as the dumpster number increases. There does not appear to be a large visual difference in plastic bottle collection by month. The F-value for dumpster (18.584) is much higher than the F-value for month (2.363); a higher F-value means the variation explained by that factor is large relative to the residual (within-group) variation, so dumpster accounts for more of the variation in plastic bottle collection than month does.
Bonus Content
This is extra work to fulfill the requirements for an E. If there is not enough data analysis, please email me and I will add more information.
I am using a dataset from Tidy Tuesday. The All Recipes dataset has lots of information about each recipe, including various nutrition facts (fats, calories, etc.), reviews, prep time, cook time, servings, and ratings. First, I will be focusing on the relationship between calories and reviews.
H0 = There is no relationship between calories and average reviews of a recipe.
HA = There is a relationship between calories and average reviews of a recipe.
Code
all_recipes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-09-16/all_recipes.csv') #read in the data
Rows: 14426 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, url, author, ingredients
dbl (11): calories, fat, carbs, protein, avg_rating, total_ratings, reviews...
date (1): date_published
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
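A minimal sketch of the correlation test whose output follows, assuming calsandreviews is simply the recipes with non-missing calories and avg_rating, and that base cor.test() with method = "spearman" produced it:
Code
calsandreviews <- all_recipes %>% # assumed construction of this data frame
  select(calories, avg_rating) %>%
  drop_na()
cor.test(calsandreviews$calories, calsandreviews$avg_rating, method = "spearman")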
Warning in cor.test.default(calsandreviews$calories, calsandreviews$avg_rating,
: Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: calsandreviews$calories and calsandreviews$avg_rating
S = 3.7529e+11, p-value = 8.827e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.03855334
The p-value is 8.827e-06, which is less than 0.05, so there is a statistically significant correlation between average reviews and calories. However, rho is only 0.0386, which means the correlation is very weak, almost nonexistent, so it is not worth pursuing further with ANOVA tests. I will make a graph to double-check visually, in case there is a pattern that the Spearman test did not capture.
Code
ggplot(calsandreviews, aes(x = avg_rating, y = calories, color = avg_rating)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = 'lm') +
  theme_classic() +
  labs(title = "Calories and Average Ratings Correlation", x = "Average Rating", y = "Calories", color = "Average Rating") +
  theme(text = element_text(size = 14)) # change text size
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
When I made this graph, I used a color scale to show the average rating because the plotted points cluster around the integers 1 through 5, while in the raw data the ratings are recorded to the tenth. I also used a linear regression line to show the correlation, or lack thereof, between average ratings and calories. The line is almost entirely flat, showing that there is hardly any relationship between the two values. There are high-calorie outliers, but the line stays flat regardless, so they have minimal influence on the fit. The null hypothesis can technically be rejected because there is a statistically significant relationship between calories and average ratings, but it is a very small, positive relationship.
Since this first hypothesis test was a bust, I will continue examining the data to hopefully find a correlation. As someone who has always enjoyed baking as a hobby, I am fairly sure that fat content and/or carbohydrate content will have an effect on calories, and I expect the correlation to be positive.
Hypotheses
H0 = There is no effect of fat content or carb content on calories.
HA = There is an effect of fat content and/or carb content on calories.
I’m going to start by making a graph to show the data. Then, I will use Two-Way ANOVA.
Code
somecals <- subset(all_recipes, calories < 9000) # this removes the most extreme outliers
somecals
calsum <- somecals %>% # get the data for means and error bars
  group_by(fat, carbs) %>%
  na.omit() %>% # removes rows with NA values
  summarise(meancal = mean(calories), sd = sd(calories), n = n(), se = sd / sqrt(n))
`summarise()` has grouped output by 'fat'. You can override using the `.groups`
argument.
I will make a patchwork graph to show the individual relationships of calories and fat, calories and carbs, and the combined relationship of calories with fat and carbs.
Code
g <- ggplot() +
  geom_point(data = somecals, aes(x = fat, y = calories, color = fat), alpha = 0.2) + # raw data
  labs(title = "Calories by Fat Content", x = "Fat", y = "Calories", color = "Fat") + # add titles
  geom_smooth(data = somecals, aes(x = fat, y = calories), method = 'lm') + # add linear regression
  theme_classic() +
  theme(text = element_text(size = 10)) # change text size

h <- ggplot() +
  geom_point(data = somecals, aes(x = carbs, y = calories, color = carbs), alpha = 0.2) + # raw data
  theme_classic() + # white bg
  geom_smooth(data = somecals, aes(x = carbs, y = calories), method = 'lm') + # add linear regression
  labs(title = "Calories by Carb Content", x = "Carbohydrates", y = "Calories", color = "Carbs") + # add titles
  theme(text = element_text(size = 10)) # change text size

i <- ggplot() +
  geom_point(data = somecals, aes(x = fat, y = calories, color = carbs), alpha = 0.2) + # raw data
  theme_classic() + # white bg
  geom_smooth(data = somecals, aes(x = fat, y = carbs), method = 'lm') + # add linear regression of fat vs carbs
  labs(title = "Calories by Fat and Carbs", x = "Fat", y = "Calories", color = "Carbs") + # add titles
  theme(text = element_text(size = 10)) # change text size

g + h + i # patchwork :0
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 156 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 156 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 170 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 156 rows containing missing values or values outside the scale range
(`geom_point()`).
The graphs support the idea that calories differ with fat and carb levels. The line of best fit increases in all three graphs, showing a positive relationship. Let's see if the ANOVA supports this with numbers.
Code
aovcals <- aov(calories ~ fat * carbs, data = somecals) # use ANOVA to get the test statistics
summary(aovcals) # brings up the p-values and other important numbers
Df Sum Sq Mean Sq F value Pr(>F)
fat 1 603035489 603035489 1.557e+05 <2e-16 ***
carbs 1 134609245 134609245 3.474e+04 <2e-16 ***
fat:carbs 1 24541 24541 6.334e+00 0.0119 *
Residuals 14051 54437811 3874
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
170 observations deleted due to missingness
This ANOVA output shows a highly significant relationship between fat and calories (p < 2e-16) and between carbs and calories (p < 2e-16). The fat-by-carb interaction is also significant, but much less strongly so, with a p-value of 0.0119.
Code
check_model(aovcals)
The data is clearly skewed toward 0-2000 calories rather than 2000+ calories. Now I'll identify outliers.
Code
less_outliers <- somecals %>%
  group_by(fat, carbs) %>%
  identify_outliers(calories) # returns the rows flagged as outliers; none are extreme, so let's carry on
Code
aovnormal <- aov(calories ~ fat * carbs, data = less_outliers) # now I'm re-running the ANOVA on this subset
summary(aovnormal)
Df Sum Sq Mean Sq F value Pr(>F)
fat 1 9567007 9567007 2178.596 <2e-16 ***
carbs 1 2020677 2020677 460.148 <2e-16 ***
fat:carbs 1 3094 3094 0.705 0.402
Residuals 727 3192522 4391
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
15 observations deleted due to missingness
Code
check_model(aovnormal) #see the data
The data looks MUCH more normal and less skewed. The ANOVA results still show a significant relationship between fat and calories (p < 2e-16) and between carbs and calories (p < 2e-16). However, the p-value for the fat-by-carb interaction is 0.402, which is much higher than 0.05, so there is no evidence of an interaction between fat and carbs in this subset. This shows the importance of testing for and accommodating normality! For the interaction, we fail to reject the null hypothesis, while both main effects remain significant.