Anova HW: Fish

Markdown Author: Jessie Bell, 2023

I. Fish

a

Question: State your statistical hypothesis (null and alternative).

Answer:

H₀: μ_tf = μ_d = μ_e

H_A: not all means are equal

b

Question: Generate a subsample of 25 fish from our fish data.

Answer: See Table 1

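#packages this report relies on (presumably loaded in a hidden setup chunk)
library(gt)      #tables: gt, google_font, opt_stylize
library(ggplot2) #figures
library(dplyr)   #group_by, summarize, %>%
library(tibble)  #add_column
library(car)     #leveneTest

set.seed(50) #fix the random number generator's seed so the simulated data are reproducible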

tf <- round(rnorm(25, 29.7, 0.6), 1) #rnorm(n, mean, sd) draws n normal values; round(x, digits) rounds them
#here: 25 random lengths with a mean of 29.7 and a standard deviation of 0.6, rounded to 1 decimal place

#let tf represent tidal fresh (higher order stream sample)
#let d represent delta (mid order stream)
#let e represent the estuary (river mouth)

d <- round(rnorm(25, 32.7, 0.8), 1)
e <- round(rnorm(25, 33.5, 1.2), 1)

#now add categorical column to identify site: 
#(named site_* rather than T, D, E so that R's built-in T shorthand for TRUE is not overwritten)
site_tf <- rep("tidal", 25) #OH WOW. I could have used this when I didn't know ifelse during lab1
site_d <- rep("delta", 25)
site_e <- rep("estuary", 25)

#now combine the data
length_mm <- c(tf, d, e) #combine all of column 1
habitat <- c(site_tf, site_d, site_e)

fishData <- data.frame(habitat, length_mm)
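As a side note on the labelling step above, the three `rep()` calls can probably be collapsed into one by using the `each` argument. This is only a minimal sketch of that alternative; `site_counts` is a name introduced here for illustration and is not part of the original analysis.

```r
# Build all 75 habitat labels in one call:
# each = 25 repeats every label 25 times before moving to the next one
habitat <- rep(c("tidal", "delta", "estuary"), each = 25)

# Quick sanity check: how many rows carry each label?
site_counts <- table(habitat)
print(site_counts)
```

The order of the labels has to match the order in which the length vectors are combined (tf, then d, then e).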

gt(fishData) |>
  opt_table_font(google_font("Caveat"))|>
  opt_table_outline(style="solid")|>
  tab_header(
    title = "Table 1: Fish Data",
    subtitle = "Data collected from Skagit River, high order (tf), midorder (d), and low order (e)")|>
  opt_stylize(style = 5, color = "cyan")#styles refer to the look of the table.

Table 1: Fish Data
Data collected from Skagit River, high order (tf), midorder (d), and low order (e)
habitat	length_mm
tidal	30.0
tidal	29.2
tidal	29.7
tidal	30.0
tidal	28.7
tidal	29.5
tidal	29.9
tidal	29.3
tidal	30.3
tidal	28.8
tidal	29.9
tidal	30.0
tidal	29.4
tidal	29.8
tidal	29.4
tidal	29.5
tidal	29.6
tidal	29.2
tidal	29.0
tidal	29.5
tidal	29.5
tidal	29.3
tidal	28.7
tidal	30.7
tidal	30.0
delta	34.8
delta	33.0
delta	32.4
delta	33.2
delta	32.7
delta	32.9
delta	32.0
delta	31.8
delta	33.2
delta	31.8
delta	32.7
delta	33.0
delta	34.2
delta	33.0
delta	33.0
delta	32.8
delta	32.4
delta	32.8
delta	30.4
delta	32.8
delta	33.1
delta	33.5
delta	31.4
delta	31.5
delta	32.3
estuary	33.5
estuary	33.9
estuary	31.1
estuary	30.7
estuary	34.7
estuary	34.9
estuary	32.5
estuary	31.8
estuary	31.9
estuary	33.3
estuary	33.9
estuary	33.6
estuary	33.5
estuary	32.7
estuary	31.8
estuary	33.1
estuary	34.5
estuary	32.0
estuary	33.6
estuary	32.4
estuary	33.6
estuary	32.7
estuary	31.5
estuary	34.2
estuary	33.6

c

Question: Assessment of assumptions: Generate plot(s) and/or statistical tests to assess the data. Paste any plot(s) into your file and justify (1-2 sentences) why you have selected the type of plot you have and summarize in words what it shows. Any statistical tests should be explained–why did you use this and what did you find? Label as appropriate.

Answer: Figure 1 below shows that fish lengths at each site are roughly bell-shaped. As a second check, I made a normal quantile plot; the fish lengths follow a fairly straight line, which suggests they are approximately normally distributed. Finally, I ran the Shapiro-Wilk test for each habitat (Table 3) to triple check my assumption: every p-value is above 0.05 (0.27 to 0.69), so I fail to reject H₀ at all three locations and treat fishData as normal enough.

The assumption of equal variance: Table 2 shows that the standard deviation for tidal is 0.49, and the estuary sd (1.13) is more than 2 times that value, so I ran Levene's test for homogeneity of variance (Table 4). The test indicates a violation of homogeneity (F_{2, 72}=5.88, p=0.004). Because each group has fewer than 30 fish (n=25), I will accept this violation as a problem with sample size and proceed with the ANOVA; the groups are also equal in size, a situation in which ANOVA is generally less sensitive to unequal variances.

Step 1: look at summary stats

#basic summary stats
fishmeans <- tapply(fishData$length_mm, fishData$habitat, mean)
fishsds<- tapply(fishData$length_mm, fishData$habitat, sd)
Skagit_locations <- c("Delta", "Estuary", "Tidal")

fishiesummarystats <- data.frame(Skagit_locations, fishmeans, fishsds)

gt(fishiesummarystats)|>
  opt_table_font(google_font("Caveat"))|>
  opt_table_outline(style="solid")|>
  tab_header(
    title = "Table 2: FishData Summary Statistics")|>
  opt_stylize(style = 5, color = "cyan")%>%
  tab_options(table.align='left')

Table 2: FishData Summary Statistics
Skagit_locations	fishmeans	fishsds
Delta	32.668	0.8952281
Estuary	33.000	1.1343133
Tidal	29.556	0.4899660
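The "more than 2 times" comparison of standard deviations can be made explicit with a little arithmetic. The sketch below simply re-types the standard deviations from Table 2; the names `sd_ratio` and `var_ratio` are introduced here for illustration. A commonly quoted rule of thumb treats a largest-to-smallest SD ratio above about 2 as a warning sign for the equal-variance assumption.

```r
# Standard deviations copied from Table 2
fishsds <- c(delta = 0.8952281, estuary = 1.1343133, tidal = 0.4899660)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(fishsds) / min(fishsds)

# The same comparison on the variance scale
var_ratio <- sd_ratio^2

round(c(sd_ratio = sd_ratio, var_ratio = var_ratio), 2)
```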

Step 2: Check for normality

ggplot(fishData, aes(length_mm))+
  geom_histogram(bins=30, color="white", fill="#db9799")+
  labs(title="Figure 1: Fish Length Distribution", x="Fish Length (mm)", 
       caption = "Figure 1: Distribution data for fish lengths at different sites.")+
  theme(plot.caption =
          element_text(hjust = 0))+
  facet_wrap(~habitat) #one panel per habitat; facet_wrap expects a formula or vars()

qqnorm(fishData$length_mm, main="All Fish Length") #Looks relatively straight, but I will run the Shapiro-Wilk test to double check.

gt(fishData %>%
     group_by(habitat) %>%
     summarize(Statistic = shapiro.test(length_mm)$statistic, p.value = 
shapiro.test(length_mm)$p.value)) |>
  opt_table_font(google_font("Caveat")) |>
  tab_header(
    title = "Table 3: FishData Shapiro-Wilk")|>
  opt_stylize(style=5, color = "cyan")%>%
  tab_options(table.align='left')

Table 3: FishData Shapiro-Wilk
habitat	Statistic	p.value
delta	0.9516023	0.2723815
estuary	0.9666068	0.5607988
tidal	0.9717184	0.6888610
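A complementary check, not part of the original analysis, is to test the residuals of the one-way model rather than the raw lengths, since the ANOVA normality assumption is about the residuals. The sketch below re-simulates the data with the same seed and call order as the report, so the lengths should match but that is an assumption; `fit`, `res`, and `resid_test` are names introduced here.

```r
# Re-create the simulated data (same seed and order of rnorm calls as the report)
set.seed(50)
length_mm <- c(round(rnorm(25, 29.7, 0.6), 1),   # tidal fresh
               round(rnorm(25, 32.7, 0.8), 1),   # delta
               round(rnorm(25, 33.5, 1.2), 1))   # estuary
habitat <- factor(rep(c("tidal", "delta", "estuary"), each = 25))
fishData <- data.frame(habitat, length_mm)

# Fit the one-way model and pull out its residuals
fit <- aov(length_mm ~ habitat, data = fishData)
res <- residuals(fit)

# Shapiro-Wilk on the pooled residuals: one test instead of three
resid_test <- shapiro.test(res)
resid_test

# Visual counterpart when running interactively:
# qqnorm(res); qqline(res)
```

Pooling the residuals uses all 75 observations at once, whereas the per-habitat tests in Table 3 each use 25.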

Step 3: Check for homoscedasticity

ggplot(fishData, 
       aes(habitat, length_mm, color=habitat))+
  geom_jitter(width=0.15, show.legend = F)+
  scale_color_manual(values = c("#eaad9f", "#b9aea9", "#869c8d"))+
  labs(title="Homoscedasticity for Fish Length", 
       caption = "Figure 2: Variance in fishlength (mm) among habitat type", y="Fish Length (mm)")+
  theme(plot.caption =
          element_text(hjust = 0)) #variance seems smaller for tidal than the other two locations.

suppressWarnings({gt(leveneTest(length_mm ~ factor(habitat), data=fishData))|>
  opt_table_font(google_font("Caveat"))|>
  tab_header(
    title = "Table 4: FishData Levene's Test")|>
  opt_stylize(style=5, color = "cyan")|>
  tab_options(table.align='left')})

Table 4: FishData Levene's Test
Df	F value	Pr(>F)
2	5.879092	0.00431508
72	NA	NA
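Since Levene's test in Table 4 flags unequal variances, one possible cross-check (a swapped-in technique, not what the report itself uses) is Welch's one-way ANOVA, which does not assume equal variances. Base R provides it through `oneway.test()` with `var.equal = FALSE`. The data are re-simulated below under the assumption that the same seed reproduces the report's values; `welch` is a name introduced here.

```r
# Re-create the simulated data (same seed and order of rnorm calls as the report)
set.seed(50)
length_mm <- c(round(rnorm(25, 29.7, 0.6), 1),   # tidal fresh
               round(rnorm(25, 32.7, 0.8), 1),   # delta
               round(rnorm(25, 33.5, 1.2), 1))   # estuary
habitat <- factor(rep(c("tidal", "delta", "estuary"), each = 25))
fishData <- data.frame(habitat, length_mm)

# Welch's ANOVA: compares group means without pooling the variances
welch <- oneway.test(length_mm ~ habitat, data = fishData, var.equal = FALSE)
welch
```

If Welch's test and the classical ANOVA lead to the same conclusion, the variance violation is unlikely to be driving the result.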

d

Question: Conduct an ANOVA (using α=0.05) to determine if fish size differs among habitats, AND if so, how it does. Provide relevant statistical output, which means pulling the important elements out of your R output and putting them in your doc for submittal, not just pasting the R output in your doc–you may paste, but you clearly need to explain which parts are relevant/important to your conclusions (which you will articulate in part e).

Does mean fish length differ among habitat type?

H₀: μ_tf = μ_d = μ_e

H_A: not all means are equal

fish.aov<- aov(fishData$length_mm ~ fishData$habitat)

summary(fish.aov)

##                  Df Sum Sq Mean Sq F value Pr(>F)    
## fishData$habitat  2 180.47   90.23   116.3 <2e-16 ***
## Residuals        72  55.88    0.78                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AOV Summary: Yes, mean fish length differs between at least 2 habitats. The aov output above gives p < 0.05, which is evidence that mean fish length varies among at least 2 of the habitat types tidal, delta, and estuary (F_{2, 72}=116, p<0.001).
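To see where the F value comes from, and to attach an effect size to it, the sums of squares printed by `summary(fish.aov)` can be plugged into the usual formulas. This is a worked-arithmetic sketch using the rounded values from that output; eta squared (`eta_sq`) is an addition here and is not reported in the original analysis.

```r
# Values copied from the summary(fish.aov) output above
ss_habitat <- 180.47   # Sum Sq for habitat
ss_resid   <- 55.88    # Sum Sq for residuals
df_habitat <- 2
df_resid   <- 72

# Mean squares are sums of squares divided by their degrees of freedom
ms_habitat <- ss_habitat / df_habitat
ms_resid   <- ss_resid / df_resid

# F is the ratio of between-group to within-group mean squares
f_value <- ms_habitat / ms_resid

# Eta squared: share of total variation in length explained by habitat
eta_sq <- ss_habitat / (ss_habitat + ss_resid)

round(c(F = f_value, eta_sq = eta_sq), 3)
```

The recomputed F agrees with the reported 116.3 up to rounding, and habitat accounts for roughly three quarters of the variation in length.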

Which habitats show difference in mean fish length?

tukeytable <- TukeyHSD(fish.aov, ordered = TRUE)

tukeytable <- as.data.frame(tukeytable$`fishData$habitat`)

Habitat <- c("delta-tidal", "estuary-tidal", "estuary-delta")

tukeytable<- add_column(tukeytable, Habitat, .before = "diff")

gt(tukeytable) |>
  opt_table_font(google_font("Caveat"))|>
  tab_header(
    title = "Table 5: FishData Tukey HSD")|>
  opt_stylize(style=5, color = "cyan")|>
  tab_options(table.align='left')

Table 5: FishData Tukey HSD
Habitat	diff	lwr	upr	p adj
delta-tidal	3.112	2.5157113	3.7082887	0.0000000
estuary-tidal	3.444	2.8477113	4.0402887	0.0000000
estuary-delta	0.332	-0.2642887	0.9282887	0.3820299
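The comparison labels in Table 5 were typed by hand, which only works if they happen to be in the same order that `TukeyHSD()` returns. A slightly safer sketch, assuming the same simulated data, takes the labels from the row names of the result instead; `fit` and `tk` are names introduced here.

```r
# Re-create the simulated data (same seed and order of rnorm calls as the report)
set.seed(50)
length_mm <- c(round(rnorm(25, 29.7, 0.6), 1),   # tidal fresh
               round(rnorm(25, 32.7, 0.8), 1),   # delta
               round(rnorm(25, 33.5, 1.2), 1))   # estuary
habitat <- factor(rep(c("tidal", "delta", "estuary"), each = 25))
fishData <- data.frame(habitat, length_mm)

fit <- aov(length_mm ~ habitat, data = fishData)

# TukeyHSD returns one matrix per factor; its row names are the comparisons
tk <- as.data.frame(TukeyHSD(fit, ordered = TRUE)$habitat)

# Move the row names into a regular column so a table package can display them
tk <- cbind(Habitat = rownames(tk), tk)
rownames(tk) <- NULL
tk
```

With `ordered = TRUE` the groups are sorted by increasing mean, so every reported difference is positive.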

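Tukey HSD Summary: Mean fish length differs between tidal and each of the other two habitat types (Tukey HSD, p < 0.001 for both comparisons): delta fish average 3.11 mm longer and estuary fish 3.44 mm longer than tidal fish. Delta and estuary do not differ significantly from each other (difference 0.33 mm, p = 0.38). Tidal therefore has the shortest fish, and delta and estuary have the longest.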

e

Question: What are your conclusions about fish size in different habitats? Provide supporting information (descriptive statistics, quality plots, statistical output) and write your summary in plain language with the appropriate evidence.

AOV Summary: Mean fish length differs between at least 2 habitats. The aov output in section d gives p < 0.05, which is evidence that mean fish length varies among at least 2 of the habitat types tidal, delta, and estuary (F_{2, 72}=116, p<0.001). A Tukey HSD post-hoc test to determine which means differ showed that tidal freshwater salmon are on average the shortest (mean 29.6 mm, versus 32.7 mm in the delta and 33.0 mm in the estuary; Tukey HSD, p < 0.001), while delta and estuary fish do not differ significantly (p = 0.38). Figure 3 shows the same pattern.

ggplot(fishData, 
       aes(habitat, length_mm, fill=habitat))+ #scale_fill_manual only works with fill not color. INTERESTING!
  geom_boxplot(width=0.15, show.legend = F)+
  scale_fill_manual(values = c("#eaad9f", "#b9aea9", "#869c8d"))+
  labs(title="Fish length differs among stream order", x="", 
       caption = "Figure 3: Fish length is smallest in higher order streams (tidal)", y="Fish Length (mm)")+
  theme(plot.caption =
          element_text(hjust = 0)) #variance seems smaller for tidal than the other two locations.

—————————————————————-

II. Fish with a covariate

temp<-c(rnorm(25, 15, 1.5), rnorm(25, 14, 2.5), rnorm(25, 12, 2.5)) 
fish2<-cbind(fishData, temp)

a

Question: Carry the analysis through. Does the covariate make a difference? Show your work.

#basic summary stats
fishmeans <- tapply(fish2$length_mm, fish2$habitat, mean) #group by fish2's own habitat column
fishsds<- tapply(fish2$length_mm, fish2$habitat, sd)
temperaturemeans <- tapply(fish2$temp, fish2$habitat, mean)
temperaturesds <- tapply(fish2$temp, fish2$habitat, sd)
Skagit_locations <- c("Delta", "Estuary", "Tidal")

fishiesummarystats2 <- data.frame(Skagit_locations, fishmeans, fishsds, temperaturemeans, temperaturesds)

gt(fishiesummarystats2)|>
  opt_table_font(google_font("Caveat"))|>
  opt_table_outline(style="solid")|>
  tab_header(
    title = "Table 6: Fish2 Summary Statistics")|>
  opt_stylize(style = 5, color = "cyan")%>%
  tab_options(table.align='left')

Table 6: Fish2 Summary Statistics
Skagit_locations	fishmeans	fishsds	temperaturemeans	temperaturesds
Delta	32.668	0.8952281	14.27953	2.710873
Estuary	33.000	1.1343133	11.79314	2.913717
Tidal	29.556	0.4899660	14.92297	1.554025

Look at the data:

suppressWarnings({ggplot(fish2, aes(temp, length_mm, color=habitat, shape=habitat))+geom_point()+geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
    scale_color_manual(values = c("#eaad9f", "#b9aea9", "#869c8d"))+
    labs(title="Fish Length vs. Temperature", x="temperature", 
       caption = "Figure 4: Is there a relationship between fish tail length and temperature?", y="Fish Length (mm)")+
  theme(plot.caption =
          element_text(hjust = 0))})

## `geom_smooth()` using formula = 'y ~ x'

Check for both normality and homoscedasticity of residuals:

ancova <- lm(length_mm ~ temp*habitat, data=fish2)

#do residuals follow normal distribution?
plot(density(ancova$residuals)) #looks so cute!

#do residuals satisfy the homoscedasticity assumption?
#base graphics calls are issued one after another, not chained with `+`
plot(ancova$residuals~ancova$fitted.values)
lines(lowess(ancova$fitted.values,ancova$residuals), col="lightblue")
text(ancova$fitted.values, ancova$residuals, row.names(fish2), cex=0.6, pos=4, col="tomato")

#I honestly have NO IDEA how to interpret this situation. :()
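One way to read a residuals-versus-fitted plot is to ask two questions: does the lowess line stay close to zero, and is the vertical spread of the points about the same across the range of fitted values? A funnel or cluster with visibly wider spread points to unequal variance. The sketch below gives a numeric companion to the plot by comparing residual spread per habitat. The temperature values are re-simulated and need not match the report's; `resid_sd` and `mean_resid` are names introduced here.

```r
# Re-create the simulated data; temp is drawn fresh and may differ from the report
set.seed(50)
length_mm <- c(round(rnorm(25, 29.7, 0.6), 1),   # tidal fresh
               round(rnorm(25, 32.7, 0.8), 1),   # delta
               round(rnorm(25, 33.5, 1.2), 1))   # estuary
habitat <- factor(rep(c("tidal", "delta", "estuary"), each = 25))
temp <- c(rnorm(25, 15, 1.5), rnorm(25, 14, 2.5), rnorm(25, 12, 2.5))
fish2 <- data.frame(habitat, length_mm, temp)

ancova <- lm(length_mm ~ temp * habitat, data = fish2)

# Spread of the residuals within each habitat:
# similar values support homoscedasticity, very different values argue against it
resid_sd <- tapply(residuals(ancova), fish2$habitat, sd)
round(resid_sd, 2)

# Residual means per habitat should be essentially zero for this model
mean_resid <- tapply(residuals(ancova), fish2$habitat, mean)
```

Because the fitted values cluster by habitat, the plot tends to show three groups of points, and the per-habitat standard deviations summarize how tall each group is.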

Running the ANOVA on the lm()

anova(ancova)

## Analysis of Variance Table
## 
## Response: length_mm
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## temp          1  39.372  39.372 53.8711 3.212e-10 ***
## habitat       2 144.224  72.112 98.6677 < 2.2e-16 ***
## temp:habitat  2   2.317   1.158  1.5851    0.2123    
## Residuals    69  50.429   0.731                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
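Two follow-ups may help when reading this table. First, `anova()` on an `lm` reports sequential sums of squares, so the temperature row is tested before habitat is accounted for. Second, when the interaction is not significant, a common next step is to compare the full model with a simpler additive one. The sketch below shows both under the assumption of re-simulated data (temperature values need not match the report's); `full`, `additive`, and `cmp` are names introduced here.

```r
# Re-create the simulated data; temp is drawn fresh and may differ from the report
set.seed(50)
length_mm <- c(round(rnorm(25, 29.7, 0.6), 1),   # tidal fresh
               round(rnorm(25, 32.7, 0.8), 1),   # delta
               round(rnorm(25, 33.5, 1.2), 1))   # estuary
habitat <- factor(rep(c("tidal", "delta", "estuary"), each = 25))
temp <- c(rnorm(25, 15, 1.5), rnorm(25, 14, 2.5), rnorm(25, 12, 2.5))
fish2 <- data.frame(habitat, length_mm, temp)

# Full model: separate temperature slope for each habitat
full <- lm(length_mm ~ temp * habitat, data = fish2)

# Additive model: one common temperature slope, habitat shifts the intercept
additive <- lm(length_mm ~ temp + habitat, data = fish2)

# F test for whether the interaction terms improve the fit
cmp <- anova(additive, full)
cmp

# Each term tested after adjusting for the other (order no longer matters)
drop1(additive, test = "F")
```

In the `drop1()` output, the temperature row answers the covariate question more directly, because it asks whether temperature explains length once habitat is already in the model.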

b

Question: Summarize your findings here.

Answer: The effect of temperature on fish length is statistically significant (F_{1, 69}=53.9, p=3.2×10^-10). The effect of habitat on fish length is also statistically significant (F_{2, 69}=98.7, p<2.2×10^-16). The temperature-by-habitat interaction is not statistically significant (F_{2, 69}=1.59, p=0.212), meaning that the relationship between temperature and fish length does not differ detectably among habitats.

—————————————————————-

III. ANOVA in the Wild

a

Question: Cite the article and give a one-two sentence summary of the paper and what it addresses. Include a link to the article.

Answer: Zvereva et al. (2010) use ANCOVA to determine the effects of climate change and sap-feeding insects on the growth and reproduction of plants. ANCOVA is an ANOVA with an added covariate (just like we did in this lab). The researchers found that higher greenhouse temperatures led to a decrease in plant growth when the plants were fed on by sap-feeding insects.

b

c

Question: Do you understand the question, the null hypothesis, and the findings? Elaborate a little – what is clear and what is not? Be prepared to share something about the article you found with your work group.
