── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(see)
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Code
library(patchwork)
library(ggsci)
Attaching package: 'ggsci'
The following objects are masked from 'package:see':
scale_color_material, scale_colour_material, scale_fill_material
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
Code
library(car) # contains some statistical tests we need to assess assumptions
library(gt)
Attaching package: 'gt'
The following objects are masked from 'package:Hmisc':
html, latex
Code
library(rstatix) #test for outliers, welch_anova_test
Attaching package: 'rstatix'
The following object is masked from 'package:stats':
filter
Code
library(ggplot2)
Lab 5
November-December 2025
Step 1: Species and Bill Length
1.) Grab your trusty penguins data frame and test the following hypothesis: “There is no effect of species on bill length in penguins.” Show a useful graphical representation of the data, run the appropriate stats, and interpret the results. Put the stats into a table! Don’t forget to check assumptions.
Hypotheses
H0: There is no effect of species on bill length in penguins.
HA: There is an effect of species on bill length in penguins.
Code
pens <- penguins %>% # make a data frame without NA values to make the ANOVA tests smoother
  na.omit()
pens
colors <- c('#b9ef69', '#ed8e88', '#ce8bee') # this is a BEAUTIFUL, custom, 3-color palette
ggplot(data = pens, aes(x = species, y = bill_len, group = species, color = species)) +
  geom_boxplot() + # box plot
  theme_classic() + # white bg
  scale_color_manual(values = colors) + # use my colors
  labs(title = "Species and Bill Length Box Plot", x = "Species", y = "Bill Length (mm)", color = "Species") + # add titles
  theme(text = element_text(size = 18)) # change text size
There is one outlier for Gentoo penguins, but there do not appear to be any other notable outliers. The graph suggests that there is very likely an effect of species on bill length.
Just to double check, I am running a grouped pipeline to identify the outliers in bill length for each species. As seen in the box plot, there is only one outlier, and it is flagged as not extreme, so outliers are not a major concern for the ANOVA moving forward.
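A minimal sketch of that check, assuming rstatix::identify_outliers() applied to the grouped data:
Code
pens %>%
  group_by(species) %>% # check each species separately
  identify_outliers(bill_len) # flags outliers and whether they are extreme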
# A tibble: 1 × 10
species island bill_len bill_dep flipper_len body_mass sex year is.outlier
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <lgl>
1 Gentoo Biscoe 59.6 17 230 6050 male 2007 TRUE
# ℹ 1 more variable: is.extreme <lgl>
Use a one-way ANOVA.
Code
pensaov <- aov(bill_len ~ species, data = pens)
pensaov
Call:
aov(formula = bill_len ~ species, data = pens)
Terms:
species Residuals
Sum of Squares 7015.386 2913.517
Deg. of Freedom 2 330
Residual standard error: 2.971336
Estimated effects may be unbalanced
The default print of the aov object does not show a p-value (the "estimated effects may be unbalanced" note refers to the design, not the p-value), so I am running summary() to get the p-value.
Code
summary(pensaov) #get p value
Df Sum Sq Mean Sq F value Pr(>F)
species 2 7015 3508 397.3 <2e-16 ***
Residuals 330 2914 9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.05, which means there is a significant effect of species on bill length.
Welch's ANOVA test
Another test of the p-value: "Welch's ANOVA compares two means to see if they are equal" (https://www.statisticshowto.com/welchs-anova/); unlike the standard one-way ANOVA, it does not assume equal variances across groups.
Code
welch_anova_test(bill_len~species, data=pens)
# A tibble: 1 × 7
.y. n statistic DFn DFd p method
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <chr>
1 bill_len 333 410. 2 165. 8.27e-65 Welch ANOVA
The p-value for the Welch's ANOVA test is 8.27e-65, which is an extremely small number. Because the p-value for the effect of species on bill length is less than 0.05, there is an effect of species on bill length, and the null hypothesis is rejected.
Code
modelpens <- aov(bill_len ~ species, data = pens) # check for normality
check_model(modelpens)
Code
shapiro_test(residuals(modelpens)) # this is a normality test
The Shapiro-Wilk test is significant because its p-value is less than 0.05, indicating that the model residuals deviate from normality. (Note that the hypothesis rejected here is the Shapiro test's null of normality, not the ANOVA's null about species.)
Code
pens %>%
  group_by(species) %>%
  shapiro_test(bill_len) # check normality by species
# A tibble: 3 × 4
species variable statistic p
<fct> <chr> <dbl> <dbl>
1 Adelie bill_len 0.993 0.685
2 Chinstrap bill_len 0.975 0.194
3 Gentoo bill_len 0.974 0.0199
The Shapiro test shows that bill length is consistent with normality for Adelie and Chinstrap penguins, but the Gentoo data deviates from normality (p = 0.0199). This can be kept in mind for the two-way ANOVA in a later section.
Code
# Homogeneity of variance / homoscedasticity
leveneTest(bill_len ~ species, data = pens)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 2.2855 0.1033
330
The p-value is greater than 0.05, so the variances do not differ significantly across species; the homogeneity-of-variance assumption is met, even though the Gentoo group departs from normality, as seen in the Shapiro test.
Step 2: Tukey Test
2.) We almost never care ONLY about an effect or no effect, we also care about magnitude and direction of effects. For example, a better hypothesis for #1 might be: “There is no difference in bill length between the three species of penguin.” Test this hypothesis! – You will need to do additional stats. Interpret your results!
Hypotheses
H0 = There is no difference in bill length between the three species of penguin.
HA = There is a difference in bill length between the three species of penguin.
Code
pensaov <- aov(bill_len ~ species, data = pens) # make an ANOVA model of bill length in relation to penguin species
summary(pensaov)
Df Sum Sq Mean Sq F value Pr(>F)
species 2 7015 3508 397.3 <2e-16 ***
Residuals 330 2914 9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.05, so there is a significant effect of species on bill length. To see which species differ, and by how much, I ran Tukey's HSD post-hoc test.
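A minimal sketch of that call on the fitted model, assuming base R's TukeyHSD():
Code
TukeyHSD(pensaov) # pairwise differences in mean bill length with 95% family-wise confidence intervals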
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = bill_len ~ species, data = pens)
$species
diff lwr upr p adj
Chinstrap-Adelie 10.009851 8.982789 11.0369128 0.0000000
Gentoo-Adelie 8.744095 7.880135 9.6080546 0.0000000
Gentoo-Chinstrap -1.265756 -2.329197 -0.2023151 0.0148212
Every pairwise comparison has an adjusted p-value below 0.05, so each species differs from the others in mean bill length. Therefore, H0 can be rejected: there is a difference in bill length between the species. To check this visually, I will graph the data with boxplots plus means and error bars.
Code
billsum <- penguins %>% # first I have to make a summary dataset to get the means
  group_by(species) %>%
  na.omit() %>%
  summarise(meanbill = mean(bill_len), sd = sd(bill_len), n = n(), se = sd / sqrt(n)) # the standard deviation and sample size give the standard error
billsum
# A tibble: 3 × 5
species meanbill sd n se
<fct> <dbl> <dbl> <int> <dbl>
1 Adelie 38.8 2.66 146 0.220
2 Chinstrap 48.8 3.34 68 0.405
3 Gentoo 47.6 3.11 119 0.285
Code
ggplot() +
  geom_boxplot(data = pens, aes(x = species, y = bill_len, fill = species)) + # box plot
  geom_errorbar(data = billsum, aes(x = species, ymin = meanbill - se, ymax = meanbill + se), width = 0.2) + # error bars with billsum data
  geom_point(data = billsum, aes(x = species, y = meanbill), size = 1) + # averages with billsum data
  theme_classic() + # white bg
  scale_fill_manual(values = colors) + # use my colors
  labs(title = "Bill Length by Species", x = "Species", y = "Bill Length (mm)", fill = "Species") + # add titles
  theme(text = element_text(size = 12)) # change text size
As seen in the graph, Adelie penguins have significantly shorter bills on average than Chinstrap and Gentoo penguins. The Tukey test supports this claim: the mean difference in bill length between Chinstrap and Adelie is 10.009851 mm and between Gentoo and Adelie is 8.744095 mm, whereas the mean difference between Gentoo and Chinstrap bill lengths is only 1.265756 mm (still significant, p adj = 0.0148). Therefore, the null hypothesis is rejected because bill length differs between species.
Step 3: Two-Way ANOVA
3.) Repeat the process for the following hypothesis: “There is no difference in flipper length between binary biological sexes and/or penguin species.” Include an optimal graph (means+error at least- raw data in the background is best). Keep in mind that you now have 3 variables and only 2 dimensions, so you will need to figure out how to represent the 3rd variable. Run the appropriate stats and interpret. Don’t forget about assumptions.
Hypotheses
H0 = There is no difference in flipper length between binary biological sexes and/or penguin species.
HA = There is a difference in flipper length between binary biological sexes and/or penguin species.
Code
flipsum <- pens %>% # get the data for means and error bars
  group_by(species, sex) %>%
  na.omit() %>% # removes rows with NA values (a few rows may otherwise have NA due to sampling error in the field)
  summarise(meanflip = mean(flipper_len), sd = sd(flipper_len), n = n(), se = sd / sqrt(n))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
Code
flipsum
# A tibble: 6 × 6
# Groups: species [3]
species sex meanflip sd n se
<fct> <fct> <dbl> <dbl> <int> <dbl>
1 Adelie female 188. 5.60 73 0.655
2 Adelie male 192. 6.60 73 0.772
3 Chinstrap female 192. 5.75 34 0.987
4 Chinstrap male 200. 5.98 34 1.02
5 Gentoo female 213. 3.90 58 0.512
6 Gentoo male 222. 5.67 61 0.726
This is a patchwork plot of the data of flipper length by species and sex:
Code
a <- ggplot() +
  geom_point(data = pens, aes(x = sex, y = flipper_len, color = sex), alpha = 0.2) + # raw data
  geom_errorbar(data = flipsum, aes(x = sex, ymin = meanflip - se, ymax = meanflip + se, color = sex), width = 0.7) + # error bars
  scale_color_manual(values = colors) + # use my colors
  labs(title = "Flipper Length by Sex", x = "Sex", y = "Flipper Length (mm)", color = "Sex") + # add titles
  theme_classic() +
  theme(text = element_text(size = 10)) # change text size

b <- ggplot() +
  geom_point(data = pens, aes(x = species, y = flipper_len, color = species), alpha = 0.2) + # raw data
  geom_errorbar(data = flipsum, aes(x = species, ymin = meanflip - se, ymax = meanflip + se, color = species), width = 0.9) +
  theme_classic() + # white bg
  scale_color_manual(values = colors) + # use my colors
  labs(title = "Flipper Length by Species", x = "Species", y = "Flipper Length (mm)", color = "Species") + # add titles
  theme(text = element_text(size = 6)) # change text size

c <- ggplot() +
  geom_point(data = pens, aes(x = species, y = flipper_len, color = sex), alpha = 0.2) + # raw data
  geom_errorbar(data = flipsum, aes(x = species, ymin = meanflip - se, ymax = meanflip + se, color = sex), width = 0.9) +
  theme_classic() + # white bg
  scale_color_manual(values = colors) + # use my colors
  labs(title = "Flipper Length by Sex and Species", x = "Species", y = "Flipper Length (mm)", color = "Sex") + # add titles
  theme(text = element_text(size = 6)) # change text size

a + b + c # patchwork!
Code
# additive ANOVA
flipaov <- aov(flipper_len ~ species + sex, data = pens) # this calls up the anova model, but I'll mostly be looking at the summary output
flipaov
Call:
aov(formula = flipper_len ~ species + sex, data = pens)
Terms:
species sex Residuals
Sum of Squares 50525.88 3905.60 10787.15
Deg. of Freedom 2 1 329
Residual standard error: 5.726053
Estimated effects may be unbalanced
Code
summary(flipaov) # anova data, but prettier!
Df Sum Sq Mean Sq F value Pr(>F)
species 2 50526 25263 770.5 <2e-16 ***
sex 1 3906 3906 119.1 <2e-16 ***
Residuals 329 10787 33
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA summary has three main takeaways: (1) species has a significant effect on flipper length, with a p-value < 2e-16, far below 0.05. (2) Sex also has a significant effect on flipper length, again with a p-value < 2e-16. (3) The residuals row captures the within-group variation left over after accounting for species and sex (statsandr.com/blog/anova-in-r/); the 329 is the residual degrees of freedom, not a measure of difference. In conclusion, the null hypothesis is rejected: both species and sex are associated with flipper length. The graphs support this conclusion, since Gentoo penguins have much longer flippers on average than Adelie and Chinstrap penguins, and males have longer flippers than females within each species.
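Because the hypothesis says "and/or," the species-by-sex interaction could also be checked. A minimal sketch of that extra model (not fitted above; flipaovint is a hypothetical name):
Code
flipaovint <- aov(flipper_len ~ species * sex, data = pens) # '*' adds the species:sex interaction term
summary(flipaovint) # a significant species:sex row would mean the sex difference in flipper length varies by species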
Step 4: Mr. Trash Wheel
Mr. Trash Wheel and friends collect pollution from tributaries and Baltimore Harbor. There are now 4 such trash wheels in the area. Let’s see if we can figure out which of these wheels does the most cleaning! Using the data you just read in, test the following hypothesis: “There is no difference in plastic bottle collection across months or trashwheels.” Remember, you will need to make a useful graph, run your stats, test assumptions, and interpret.
Hypotheses
H0 = There is no difference in plastic bottle collection across months or trashwheels.
HA = There is a difference in plastic bottle collection across months or trashwheels.
First, I have to read in the data and make a dataframe. I will call it “trash”.
Code
trash <- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-03-05/trashwheel.csv")
head(trash) # show data
plasticaov <- aov(PlasticBottles ~ Month + Dumpster, data = trash) # this calls up the anova data, but I'll mostly be looking at the summary output
plasticaov
Call:
aov(formula = PlasticBottles ~ Month + Dumpster, data = trash)
Terms:
Month Dumpster Residuals
Sum of Squares 80806432 48879879 2569781377
Deg. of Freedom 13 1 977
Residual standard error: 1621.813
Estimated effects may be unbalanced
1 observation deleted due to missingness
Code
summary(plasticaov) #here's the important info
Df Sum Sq Mean Sq F value Pr(>F)
Month 13 8.081e+07 6215879 2.363 0.0041 **
Dumpster 1 4.888e+07 48879879 18.584 1.79e-05 ***
Residuals 977 2.570e+09 2630278
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 observation deleted due to missingness
The data shows that there is a difference in plastic bottle collection by both month and dumpster: the p-value for month is 0.0041 and the p-value for dumpster is 1.79e-05, both below 0.05, so both effects are significant.
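The prompt also asks about assumptions; a minimal sketch of checks analogous to Step 1, reusing leveneTest() and shapiro_test() on this model (treating Month as a factor here is an assumption about how to handle it):
Code
leveneTest(PlasticBottles ~ as.factor(Month), data = trash) # homogeneity of variance across months
shapiro_test(residuals(plasticaov)) # normality of the ANOVA residuals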
I will now graph this data to see a visual representation.
Code
plasum <- trash %>% # get the data for means and error bars
  group_by(Dumpster, Month) %>%
  na.omit() %>% # removes rows with NA values
  summarise(meanpla = mean(PlasticBottles), sd = sd(PlasticBottles), n = n(), se = sd / sqrt(n))
`summarise()` has grouped output by 'Dumpster'. You can override using the
`.groups` argument.
Code
plasum
# A tibble: 629 × 6
# Groups: Dumpster [629]
Dumpster Month meanpla sd n se
<int> <chr> <dbl> <dbl> <int> <dbl>
1 1 May 1450 NA 1 NA
2 2 May 1120 NA 1 NA
3 3 May 2450 NA 1 NA
4 4 May 2380 NA 1 NA
5 5 May 980 NA 1 NA
6 6 May 1430 NA 1 NA
7 7 May 910 NA 1 NA
8 8 May 3580 NA 1 NA
9 9 June 2400 NA 1 NA
10 10 June 1340 NA 1 NA
# ℹ 619 more rows
I troubleshot this for over two hours: sd and se come back as NA because each Dumpster-Month combination contains only one observation (n = 1), so sd() has nothing to compute and returns NA. This is particularly annoying because it means I do not have error bars. Instead, I used a line of best fit / linear regression line to show the trend, which fits this dataset reasonably well anyway.
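One hedged alternative, assuming the file's Name column identifies each trash wheel (it is not used elsewhere in this report), would be to summarise by wheel and month so each group has more than one load and the standard error is defined:
Code
plasum2 <- trash %>% # hypothetical alternative summary table
  group_by(Name, Month) %>% # Name = which trash wheel; assumed to exist in the csv
  summarise(meanpla = mean(PlasticBottles, na.rm = TRUE),
            sd = sd(PlasticBottles, na.rm = TRUE),
            n = n(),
            se = sd / sqrt(n),
            .groups = "drop")
plasum2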
Code
d <- ggplot() +
  geom_point(data = trash, aes(x = Dumpster, y = PlasticBottles, color = Dumpster), alpha = 0.2) + # raw data
  geom_smooth(data = trash, aes(x = Dumpster, y = PlasticBottles), method = 'lm') +
  labs(title = "Plastic Bottles by Dumpster", x = "Dumpster", y = "Plastic Bottles", color = "Dumpster") + # add titles
  theme_classic() +
  theme(text = element_text(size = 8)) # change text size

e <- ggplot() +
  geom_point(data = trash, aes(x = Month, y = PlasticBottles, color = Month), alpha = 0.2) + # raw data
  geom_smooth(data = trash, aes(x = Month, y = PlasticBottles), method = 'lm') +
  labs(title = "Plastic Bottles by Month", x = "Month", y = "Plastic Bottles", color = "Month") + # add titles
  theme_classic() +
  theme(text = element_text(size = 6)) # change text size

f <- ggplot() +
  geom_point(data = trash, aes(x = Dumpster, y = PlasticBottles, color = Month), alpha = 0.2) + # raw data
  geom_smooth(data = trash, aes(x = Dumpster, y = PlasticBottles), method = 'lm') +
  labs(title = "Plastic Bottles by Month and Dumpster", x = "Dumpster", y = "Plastic Bottles", color = "Month") + # add titles
  theme_classic() +
  theme(text = element_text(size = 8)) # change text size

d + e + f # patchwork <3
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range (`stat_smooth()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range (`stat_smooth()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
From the graphs we can see that there is a negative relationship between dumpster number and plastic bottle collection: the number of bottles collected per load decreases, on average, as the dumpster number increases. There does not appear to be a large visual difference in plastic bottle collection by month. The F-value for dumpster (18.584) is much higher than the F-value for month (2.363); a higher F-value means the variation explained by that factor is large relative to the residual (within-group) variation, so dumpster accounts for more of the variation in plastic bottle collection than month does.
Bonus Content
This is extra work to fulfill the requirements for an E. If there is not enough data analysis, please email me and I will add more information.
I am using a dataset from Tidy Tuesday. The All Recipes dataset has lots of information about each recipe, including various nutrition facts (fats, calories, etc.), reviews, prep time, cook time, servings, and ratings. First, I will be focusing on the relationship between calories and reviews.
H0 = There is no relationship between calories and average reviews of a recipe.
HA = There is a relationship between calories and average reviews of a recipe.
Code
all_recipes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2025/2025-09-16/all_recipes.csv') #read in the data
Rows: 14426 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, url, author, ingredients
dbl (11): calories, fat, carbs, protein, avg_rating, total_ratings, reviews...
date (1): date_published
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
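A minimal sketch of the correlation test whose output follows, assuming calsandreviews is simply the recipes with non-missing calories and avg_rating, and that base cor.test() with method = "spearman" produced it:
Code
calsandreviews <- all_recipes %>% # assumed construction of this data frame
  select(calories, avg_rating) %>%
  drop_na()
cor.test(calsandreviews$calories, calsandreviews$avg_rating, method = "spearman")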
Warning in cor.test.default(calsandreviews$calories, calsandreviews$avg_rating,
: Cannot compute exact p-value with ties
Spearman's rank correlation rho
data: calsandreviews$calories and calsandreviews$avg_rating
S = 3.7529e+11, p-value = 8.827e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.03855334
The p-value is 8.827e-06, which is less than 0.05, so there is a statistically significant correlation between average reviews and calories. However, rho is only 0.0386, which means the correlation is very weak, almost nonexistent, so it is not worth pursuing further with ANOVA tests. I will make a graph to double-check visually, in case there is a pattern that the Spearman test did not capture.
Code
ggplot(calsandreviews, aes(x = avg_rating, y = calories, color = avg_rating)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = 'lm') +
  theme_classic() +
  labs(title = "Calories and Average Ratings Correlation", x = "Average Rating", y = "Calories", color = "Average Rating") +
  theme(text = element_text(size = 14)) # change text size
`geom_smooth()` using formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
When I made this graph, I used a color scale to show the average rating because the plotted points cluster around the integers 1 through 5, while in the raw data the ratings are recorded to the tenth. I also used a linear regression line to show the correlation, or lack thereof, between average ratings and calories. The line is almost entirely flat, showing that there is hardly any relationship between the two values. There are high-calorie outliers, but the line stays flat regardless, so they have minimal influence on the fit. The null hypothesis can technically be rejected because there is a statistically significant relationship between calories and average ratings, but it is a very small, positive relationship.
Since this first hypothesis test was a bust, I will continue examining the data to hopefully find a correlation. As someone who has always enjoyed baking as a hobby, I am fairly sure that fat content and/or carbohydrate content will have an effect on calories, and I expect the correlation to be positive.
Hypotheses
H0 = There is no effect of fat content or carb content on calories.
HA = There is an effect of fat content and/or carb content on calories.
I’m going to start by making a graph to show the data. Then, I will use Two-Way ANOVA.
Code
somecals <- subset(all_recipes, calories < 9000) # this removes the most extreme outliers
somecals
calsum <- somecals %>% # get the data for means and error bars
  group_by(fat, carbs) %>%
  na.omit() %>% # removes rows with NA values
  summarise(meancal = mean(calories), sd = sd(calories), n = n(), se = sd / sqrt(n))
`summarise()` has grouped output by 'fat'. You can override using the `.groups`
argument.
I will make a patchwork graph to show the individual relationships of calories and fat, calories and carbs, and the combined relationship of calories with fat and carbs.
Code
g <- ggplot() +
  geom_point(data = somecals, aes(x = fat, y = calories, color = fat), alpha = 0.2) + # raw data
  labs(title = "Calories by Fat Content", x = "Fat", y = "Calories", color = "Fat") + # add titles
  geom_smooth(data = somecals, aes(x = fat, y = calories), method = 'lm') + # add linear regression
  theme_classic() +
  theme(text = element_text(size = 10)) # change text size

h <- ggplot() +
  geom_point(data = somecals, aes(x = carbs, y = calories, color = carbs), alpha = 0.2) + # raw data
  theme_classic() + # white bg
  geom_smooth(data = somecals, aes(x = carbs, y = calories), method = 'lm') + # add linear regression
  labs(title = "Calories by Carb Content", x = "Carbohydrates", y = "Calories", color = "Carbs") + # add titles
  theme(text = element_text(size = 10)) # change text size

i <- ggplot() +
  geom_point(data = somecals, aes(x = fat, y = calories, color = carbs), alpha = 0.2) + # raw data
  theme_classic() + # white bg
  geom_smooth(data = somecals, aes(x = fat, y = carbs), method = 'lm') + # add linear regression of fat vs carbs
  labs(title = "Calories by Fat and Carbs", x = "Fat", y = "Calories", color = "Carbs") + # add titles
  theme(text = element_text(size = 10)) # change text size

g + h + i # patchwork :0
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 156 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 156 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 170 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 156 rows containing missing values or values outside the scale range
(`geom_point()`).
The graphs support the idea that calories differ with fat and carb levels. The line of best fit increases in all three graphs, showing a positive relationship. Let's see if the ANOVA supports this with numbers.
Code
aovcals <- aov(calories ~ fat * carbs, data = somecals) # use ANOVA to get the test statistics
summary(aovcals) # brings up the p-values and other important numbers
Df Sum Sq Mean Sq F value Pr(>F)
fat 1 603035489 603035489 1.557e+05 <2e-16 ***
carbs 1 134609245 134609245 3.474e+04 <2e-16 ***
fat:carbs 1 24541 24541 6.334e+00 0.0119 *
Residuals 14051 54437811 3874
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
170 observations deleted due to missingness
This ANOVA output shows a highly significant relationship between fat and calories (p < 2e-16) and between carbs and calories (p < 2e-16). The fat-by-carb interaction is also significant, but much less strongly so, with a p-value of 0.0119.
Code
check_model(aovcals)
The data is clearly skewed toward 0-2000 calories rather than 2000+ calories. Now I'll identify outliers.
Code
less_outliers <- somecals %>%
  group_by(fat, carbs) %>%
  identify_outliers(calories) # returns the rows flagged as outliers; none are extreme, so let's carry on
Code
aovnormal <- aov(calories ~ fat * carbs, data = less_outliers) # now I'm re-running the ANOVA on this subset
summary(aovnormal)
Df Sum Sq Mean Sq F value Pr(>F)
fat 1 9567007 9567007 2178.596 <2e-16 ***
carbs 1 2020677 2020677 460.148 <2e-16 ***
fat:carbs 1 3094 3094 0.705 0.402
Residuals 727 3192522 4391
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
15 observations deleted due to missingness
Code
check_model(aovnormal) #see the data
The data looks MUCH more normal and less skewed. The ANOVA results still show a significant relationship between fat and calories (p < 2e-16) and between carbs and calories (p < 2e-16). However, the p-value for the fat-by-carb interaction is 0.402, which is much higher than 0.05, so there is no evidence of an interaction between fat and carbs in this subset. This shows the importance of testing for and accommodating normality! For the interaction, we fail to reject the null hypothesis, while both main effects remain significant.