Markdown Author: Jessie Bell, 2023

I. Fish

a

Question: State your statistical hypothesis (null and alternative).
Answer:
H0: μ_tf = μ_d = μ_e
HA: not all means are equal

b

Question: Generate a subsample of 25 fish from our fish data.
Answer: See Table 1
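library(gt)        #tables
library(tidyverse) #ggplot2, dplyr, tibble (add_column)
library(car)       #leveneTest

set.seed(50) #fixes the random number generator so the simulated data are reproducible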

tf <- round(rnorm(25, 29.7, 0.6), 1) #rnorm(n, mean, sd) draws n random normal values; round(x, digits) rounds them
#here: 25 simulated lengths with a mean of 29.7 and a standard deviation of 0.6, rounded to 1 decimal place

#let tf represent tidal fresh (higher order stream sample)
#let d represent delta (mid order stream)
#let e represent the estuary (river mouth)

d <- round(rnorm(25, 32.7, 0.8), 1)
e <- round(rnorm(25, 33.5, 1.2), 1)

#now add categorical column to identify site:
site_tf <- rep("tidal", 25) #OH WOW. I could have used this when I didn't know ifelse during lab1
site_d <- rep("delta", 25)
site_e <- rep("estuary", 25)

#now combine the data
length_mm <- c(tf, d, e) #combine all of column 1
habitat <- c(site_tf, site_d, site_e) #not named T, D, E because T masks the shorthand for TRUE

fishData <- data.frame(habitat, length_mm)

gt(fishData) |>
  opt_table_font(google_font("Caveat"))|>
  opt_table_outline(style="solid")|>
  tab_header(
    title = "Table 1: Fish Data",
    subtitle = "Data collected from Skagit River, high order (tf), midorder (d), and low order (e)")|>
  opt_stylize(style = 5, color = "cyan")#styles refer to the look of the table. 
Table 1: Fish Data
Data collected from Skagit River, high order (tf), midorder (d), and low order (e)
habitat length_mm
tidal 30.0
tidal 29.2
tidal 29.7
tidal 30.0
tidal 28.7
tidal 29.5
tidal 29.9
tidal 29.3
tidal 30.3
tidal 28.8
tidal 29.9
tidal 30.0
tidal 29.4
tidal 29.8
tidal 29.4
tidal 29.5
tidal 29.6
tidal 29.2
tidal 29.0
tidal 29.5
tidal 29.5
tidal 29.3
tidal 28.7
tidal 30.7
tidal 30.0
delta 34.8
delta 33.0
delta 32.4
delta 33.2
delta 32.7
delta 32.9
delta 32.0
delta 31.8
delta 33.2
delta 31.8
delta 32.7
delta 33.0
delta 34.2
delta 33.0
delta 33.0
delta 32.8
delta 32.4
delta 32.8
delta 30.4
delta 32.8
delta 33.1
delta 33.5
delta 31.4
delta 31.5
delta 32.3
estuary 33.5
estuary 33.9
estuary 31.1
estuary 30.7
estuary 34.7
estuary 34.9
estuary 32.5
estuary 31.8
estuary 31.9
estuary 33.3
estuary 33.9
estuary 33.6
estuary 33.5
estuary 32.7
estuary 31.8
estuary 33.1
estuary 34.5
estuary 32.0
estuary 33.6
estuary 32.4
estuary 33.6
estuary 32.7
estuary 31.5
estuary 34.2
estuary 33.6
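
The data-building steps in part b can also be written more compactly. The following is a minimal sketch, assuming the same seed and the same order of `rnorm()` draws as above, so it should produce the same simulated lengths; `fish_alt`, `sites`, and `n_per_site` are illustrative names that do not appear in the original analysis.

```r
# Compact rebuild of the simulated fish data (illustrative names, not the original objects)
set.seed(50)

n_per_site <- 25                          # fish sampled per habitat
sites <- c("tidal", "delta", "estuary")   # same order as tf, d, e above

fish_alt <- data.frame(
  habitat   = rep(sites, each = n_per_site),
  length_mm = round(c(rnorm(n_per_site, 29.7, 0.6),   # tidal fresh
                      rnorm(n_per_site, 32.7, 0.8),   # delta
                      rnorm(n_per_site, 33.5, 1.2)),  # estuary
                    1)
)

table(fish_alt$habitat)  # number of rows per habitat
str(fish_alt)            # one character column, one numeric column
```

Using `rep(sites, each = n_per_site)` builds the habitat labels in one call, which avoids creating a separate label vector for each site.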

c

Question: Assessment of assumptions: Generate plot(s) and/or statistical tests to assess the data. Paste any plot(s) into your file and justify (1-2 sentences) why you have selected the type of plot you have and summarize in words what it shows. Any statistical tests should be explained–why did you use this and what did you find? Label as appropriate.
Answer: Figure 1 below shows that fish lengths are roughly bell-shaped in each habitat. To confirm, I made a normal quantile plot and found that the fish length data follow a fairly straight line, suggesting they follow a normal distribution. Finally, I ran the Shapiro-Wilk test for each habitat (Table 3) to triple check my assumption; every test fails to reject H0 (all p > 0.27), so fishData is normal enough.
The assumption of equal variance: Table 2 shows the standard deviation for tidal is 0.49, and the estuary sd (1.13) is more than 2 times that value, so I ran Levene’s test for homogeneity of variance (Table 4). The test indicates a violation of homogeneity (F(2, 72) = 5.88, p = 0.004), but because the sample size in each habitat is less than 30, I will accept this violation as a problem with sample size.

Step 1: look at summary stats

#basic summary stats
fishmeans <- tapply(fishData$length_mm, fishData$habitat, mean)
fishsds<- tapply(fishData$length_mm, fishData$habitat, sd)
Skagit_locations <- c("Delta", "Estuary", "Tidal")

fishiesummarystats <- data.frame(Skagit_locations, fishmeans, fishsds)

gt(fishiesummarystats)|>
  opt_table_font(google_font("Caveat"))|>
  opt_table_outline(style="solid")|>
  tab_header(
    title = "Table 2: FishData Summary Statistics")|>
  opt_stylize(style = 5, color = "cyan")%>%
  tab_options(table.align='left') 
Table 2: FishData Summary Statistics
Skagit_locations fishmeans fishsds
Delta 32.668 0.8952281
Estuary 33.000 1.1343133
Tidal 29.556 0.4899660
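
As a quick arithmetic check on the "more than 2 times" comparison, the ratio of the largest to the smallest standard deviation can be computed directly from Table 2. This is only a sketch using the values copied from the table; `sds`, `sd_ratio`, and `var_ratio` are illustrative names.

```r
# Standard deviations copied from Table 2
sds <- c(Delta = 0.8952281, Estuary = 1.1343133, Tidal = 0.4899660)

sd_ratio  <- max(sds) / min(sds)   # largest sd relative to smallest sd
var_ratio <- sd_ratio^2            # the same comparison on the variance scale

round(c(sd_ratio = sd_ratio, var_ratio = var_ratio), 2)
```

The standard deviation ratio comes out at roughly 2.3, which is the spread difference that Levene's test in Step 3 goes on to assess formally.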

Step 2: Check for normality

ggplot(fishData, aes(length_mm))+
  geom_histogram(bins=30, color="white", fill="#db9799")+
  labs(title="Figure 1: Fish Length Distribution", x="Fish Length (mm)", 
       caption = "Figure 1: Distribution data for fish lengths at different sites.")+
  theme(plot.caption =
          element_text(hjust = 0))+
  facet_wrap(vars(habitat)) #one panel per habitat

qqnorm(fishData$length_mm, main="All Fish Length") #Looks relatively straight, but I will run the Shapiro-wilk test to double check. 

gt(fishData %>%
     group_by(habitat) %>%
     summarize(Statistic = shapiro.test(length_mm)$statistic, p.value = 
shapiro.test(length_mm)$p.value)) |>
  opt_table_font(google_font("Caveat")) |>
  tab_header(
    title = "Table 3: FishData Shapiro-Wilk")|>
  opt_stylize(style=5, color = "cyan")%>%
  tab_options(table.align='left')
Table 3: FishData Shapiro-Wilk
habitat Statistic p.value
delta 0.9516023 0.2723815
estuary 0.9666068 0.5607988
tidal 0.9717184 0.6888610
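
The per-habitat Shapiro-Wilk table can also be built in base R, running each test once and reusing the result for both columns. This is a sketch under stated assumptions: the data are re-simulated with the same seed and parameters as part b so the block is self-contained, and `fish`, `sw`, and `sw_table` are illustrative names.

```r
# Re-simulate the fish data so this block runs on its own
set.seed(50)
fish <- data.frame(
  habitat   = rep(c("tidal", "delta", "estuary"), each = 25),
  length_mm = round(c(rnorm(25, 29.7, 0.6),
                      rnorm(25, 32.7, 0.8),
                      rnorm(25, 33.5, 1.2)), 1)
)

# One Shapiro-Wilk test per habitat; split() orders the groups alphabetically
sw <- lapply(split(fish$length_mm, fish$habitat), shapiro.test)

sw_table <- data.frame(
  habitat   = names(sw),
  statistic = sapply(sw, function(x) unname(x$statistic)),
  p.value   = sapply(sw, function(x) x$p.value),
  row.names = NULL
)
sw_table
```

A p-value above 0.05 in every row means the test does not reject normality for that habitat, which is the reading applied to Table 3.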

Step 3: Check for homoscedasticity

ggplot(fishData, 
       aes(habitat, length_mm, color=habitat))+
  geom_jitter(width=0.15, show.legend = F)+
  scale_color_manual(values = c("#eaad9f", "#b9aea9", "#869c8d"))+
  labs(title="Homoscedasticity for Fish Length", 
       caption = "Figure 2: Variance in fishlength (mm) among habitat type", y="Fish Length (mm)")+
  theme(plot.caption =
          element_text(hjust = 0)) #variance seems smaller for tidal than the other two locations. 

gt(leveneTest(length_mm ~ factor(habitat), data=fishData))|> #formula interface; the grouping variable must be a factor
  opt_table_font(google_font("Caveat"))|>
  tab_header(
    title = "Table 4: FishData Levene's Test")|>
  opt_stylize(style=5, color = "cyan")|>
  tab_options(table.align='left')
Table 4: FishData Levene's Test
Df    F value     Pr(>F)
2     5.879092    0.00431508
72    NA          NA
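
Because Levene's test flags unequal variances, one common cross-check is Welch's one-way test, which does not assume equal variances. The sketch below is not part of the original analysis. It re-simulates the data with the same seed and parameters as part b so it can run on its own, and `fish` and `welch` are illustrative names.

```r
# Re-simulate the fish data so this block runs on its own
set.seed(50)
fish <- data.frame(
  habitat   = factor(rep(c("tidal", "delta", "estuary"), each = 25)),
  length_mm = round(c(rnorm(25, 29.7, 0.6),
                      rnorm(25, 32.7, 0.8),
                      rnorm(25, 33.5, 1.2)), 1)
)

# Welch's one-way test: var.equal = FALSE drops the equal-variance assumption
welch <- oneway.test(length_mm ~ habitat, data = fish, var.equal = FALSE)
welch
```

If the Welch test and the classical ANOVA lead to the same conclusion, the variance violation is unlikely to be driving the result.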

d

Question: Conduct an ANOVA (using α=0.05) to determine if fish size differs among habitats, AND if so, how it does. Provide relevant statistical output, which means pulling the important elements out of your R output and putting them in your doc for submittal, not just pasting the R output in your doc–you may paste, but you clearly need to explain which parts are relevant/important to your conclusions (which you will articulate in part e).
Does mean fish length differ among habitat type?
H0: μ_tf = μ_d = μ_e
HA: not all means are equal
fish.aov<- aov(fishData$length_mm ~ fishData$habitat)

summary(fish.aov)
##                  Df Sum Sq Mean Sq F value Pr(>F)    
## fishData$habitat  2 180.47   90.23   116.3 <2e-16 ***
## Residuals        72  55.88    0.78                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AOV Summary: Yes, mean fish length differs between at least 2 habitats. The aov output above gives p < 0.05, which is evidence that mean fish length varies among at least 2 of the following habitat types: tidal, delta, estuary (F(2, 72) = 116.3, p < 0.001).
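
The F value in the ANOVA table can be recovered by hand from the sums of squares, which also gives an effect size. This is a worked sketch using the rounded numbers printed by `summary(fish.aov)`; the variable names are illustrative, and eta-squared is an addition that the original output does not report.

```r
# Values copied from the summary(fish.aov) output above
ss_habitat <- 180.47; df_habitat <- 2
ss_resid   <- 55.88;  df_resid   <- 72

ms_habitat <- ss_habitat / df_habitat   # mean square between habitats
ms_resid   <- ss_resid / df_resid       # mean square within habitats

f_value <- ms_habitat / ms_resid                         # F = MS_between / MS_within
p_value <- pf(f_value, df_habitat, df_resid,
              lower.tail = FALSE)                        # upper-tail probability
eta_sq  <- ss_habitat / (ss_habitat + ss_resid)          # share of variance explained

round(c(F = f_value, eta_sq = eta_sq), 3)
p_value
```

The hand calculation gives F of about 116, matching the table, and an eta-squared of about 0.76, meaning habitat accounts for roughly three quarters of the variation in fish length.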
Which habitats show difference in mean fish length?
tukeytable <- TukeyHSD(fish.aov, ordered = TRUE)

tukeytable <- as.data.frame(tukeytable$`fishData$habitat`)

Habitat <- c("delta-tidal", "estuary-tidal", "estuary-delta")

tukeytable<- add_column(tukeytable, Habitat, .before = "diff")

gt(tukeytable) |>
  opt_table_font(google_font("Caveat"))|>
  tab_header(
    title = "Table 5: FishData Tukey HSD")|>
  opt_stylize(style=5, color = "cyan")|>
  tab_options(table.align='left')
Table 5: FishData Tukey HSD
Habitat          diff     lwr           upr          p adj
delta-tidal      3.112    2.5157113     3.7082887    0.0000000
estuary-tidal    3.444    2.8477113     4.0402887    0.0000000
estuary-delta    0.332    -0.2642887    0.9282887    0.3820299
Tukey HSD Summary: Mean fish length differs between tidal and each of the other two habitat types (Tukey HSD, p < 0.001 for both comparisons), with tidal having the shortest fish. Delta and estuary have the longest fish and do not differ from each other (Tukey HSD, p = 0.38).
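
The comparison labels in the Tukey table can be taken from the test result itself instead of being typed by hand, which guards against mislabeled rows if the ordering changes. This is a sketch under stated assumptions: the data are re-simulated with the same seed and parameters as part b, the model uses the formula interface with a `data` argument (so the result is indexed as `$habitat`), and `fish`, `fit`, `tk`, and `tk_table` are illustrative names.

```r
# Re-simulate the fish data so this block runs on its own
set.seed(50)
fish <- data.frame(
  habitat   = factor(rep(c("tidal", "delta", "estuary"), each = 25)),
  length_mm = round(c(rnorm(25, 29.7, 0.6),
                      rnorm(25, 32.7, 0.8),
                      rnorm(25, 33.5, 1.2)), 1)
)

fit <- aov(length_mm ~ habitat, data = fish)
tk  <- TukeyHSD(fit, ordered = TRUE)$habitat   # matrix: diff, lwr, upr, p adj

# Row names of the matrix are the pairwise comparison labels
tk_table <- data.frame(Habitat = rownames(tk), tk,
                       row.names = NULL, check.names = FALSE)
tk_table
```

With `ordered = TRUE` the groups are sorted by mean before differencing, so every `diff` is positive and each row reads as "larger mean minus smaller mean".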

e

Question: What are your conclusions about fish size in different habitats? Provide supporting information (descriptive statistics, quality plots, statistical output) and write your summary in plain language with the appropriate evidence.
AOV Summary: Mean fish length differs between at least 2 habitats. The aov output in section d gives p < 0.05, which is evidence that mean fish length varies among at least 2 of the following habitat types: tidal, delta, estuary (F(2, 72) = 116.3, p < 0.001). The Tukey HSD post-hoc test, used to determine which means differ, showed that tidal freshwater salmon have on average the shortest length (mean 29.6 mm, compared with 32.7 mm in the delta and 33.0 mm in the estuary; Tukey HSD, p < 0.001), while delta and estuary did not differ from each other (Tukey HSD, p = 0.38).
ggplot(fishData, 
       aes(habitat, length_mm, fill=habitat))+ #scale_fill_manual only works with fill not color. INTERESTING!
  geom_boxplot(width=0.15, show.legend = F)+
  scale_fill_manual(values = c("#eaad9f", "#b9aea9", "#869c8d"))+
  labs(title="Fish length differs among stream order", x="", 
       caption = "Figure 3: Fish length is smallest in higher order streams (tidal)", y="Fish Length (mm)")+
  theme(plot.caption =
          element_text(hjust = 0)) #variance seems smaller for tidal than the other two locations. 

—————————————————————-

II. Fish with a covariate

temp<-c(rnorm(25, 15, 1.5), rnorm(25, 14, 2.5), rnorm(25, 12, 2.5)) #simulated temperatures in the same row order as fishData: tidal, delta, estuary
fish2<-cbind(fishData, temp)

a

Question: Carry the analysis through. Does the covariate make a difference? Show your work.
#basic summary stats
fishmeans <- tapply(fish2$length_mm, fish2$habitat, mean)
fishsds<- tapply(fish2$length_mm, fish2$habitat, sd)
temperaturemeans <- tapply(fish2$temp, fish2$habitat, mean)
temperaturesds <- tapply(fish2$temp, fish2$habitat, sd)
Skagit_locations <- c("Delta", "Estuary", "Tidal")

fishiesummarystats2 <- data.frame(Skagit_locations, fishmeans, fishsds, temperaturemeans, temperaturesds)

gt(fishiesummarystats2)|>
  opt_table_font(google_font("Caveat"))|>
  opt_table_outline(style="solid")|>
  tab_header(
    title = "Table 6: Fish2 Summary Statistics")|>
  opt_stylize(style = 5, color = "cyan")%>%
  tab_options(table.align='left') 
Table 6: Fish2 Summary Statistics
Skagit_locations fishmeans fishsds temperaturemeans temperaturesds
Delta 32.668 0.8952281 14.27953 2.710873
Estuary 33.000 1.1343133 11.79314 2.913717
Tidal 29.556 0.4899660 14.92297 1.554025

Look at the data:

suppressWarnings({ggplot(fish2, aes(temp, length_mm, color=habitat, shape=habitat))+geom_point()+geom_smooth(method=lm, se=FALSE, fullrange=TRUE)+
    scale_color_manual(values = c("#eaad9f", "#b9aea9", "#869c8d"))+
    labs(title="Fish Length vs. Temperature", x="temperature", 
       caption = "Figure 4: Is there a relationship between fish length and temperature?", y="Fish Length (mm)")+
  theme(plot.caption =
          element_text(hjust = 0))})
## `geom_smooth()` using formula = 'y ~ x'

Check for both normality and homoscedasticity of residuals:

ancova <- lm(length_mm ~ temp*habitat, data=fish2)

#do residuals follow normal distribution?
plot(density(ancova$residuals)) #looks so cute!

#do residuals show homoscedasticity (constant spread)?
plot(ancova$residuals~ancova$fitted.values) #base graphics calls are separate statements, not joined with +
lines(lowess(ancova$fitted.values,ancova$residuals), col="lightblue")
text(ancova$fitted.values, ancova$residuals, row.names(fish2), cex=0.6, pos=4, col="tomato")
#an even scatter around zero with no funnel shape would support constant variance; I am honestly not sure how to interpret this plot. :(

Running the ANOVA on the lm()

anova(ancova)
## Analysis of Variance Table
## 
## Response: length_mm
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## temp          1  39.372  39.372 53.8711 3.212e-10 ***
## habitat       2 144.224  72.112 98.6677 < 2.2e-16 ***
## temp:habitat  2   2.317   1.158  1.5851    0.2123    
## Residuals    69  50.429   0.731                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
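
One caveat when reading this table: `anova()` on an `lm` reports sequential sums of squares, so the temp row tests temperature before habitat has been accounted for. A way to ask whether the covariate adds anything beyond habitat is to compare nested models. The sketch below is an illustration, not the original analysis: the data are re-simulated with the same parameters, the temperature draws will not necessarily match the values used above, and `fish`, `m_habitat`, `m_ancova`, and `cmp` are illustrative names.

```r
# Re-simulate lengths and temperatures so this block runs on its own
set.seed(50)
fish <- data.frame(
  habitat   = factor(rep(c("tidal", "delta", "estuary"), each = 25)),
  length_mm = round(c(rnorm(25, 29.7, 0.6),
                      rnorm(25, 32.7, 0.8),
                      rnorm(25, 33.5, 1.2)), 1)
)
fish$temp <- c(rnorm(25, 15, 1.5), rnorm(25, 14, 2.5), rnorm(25, 12, 2.5))

m_habitat <- lm(length_mm ~ habitat, data = fish)          # habitat only
m_ancova  <- lm(length_mm ~ habitat + temp, data = fish)   # habitat plus covariate

# F test for the extra variance explained by temp once habitat is in the model
cmp <- anova(m_habitat, m_ancova)
cmp
```

A small p-value in the second row would suggest temperature explains variation in length beyond habitat; a large one would suggest the temp effect in the sequential table mostly reflects temperature differences between habitats.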

b

Question: Summarize your findings here.
Answer: The effect of temperature on fish length is statistically significant (p = 3.2 × 10^-10). The effect of habitat on fish length is also statistically significant (p < 2.2 × 10^-16). The interaction between temperature and habitat is not statistically significant (p = 0.212), meaning there is no evidence that the relationship between temperature and fish length differs among habitats.

—————————————————————-

III. ANOVA in the Wild

a

Question: Cite the article and give a one-two sentence summary of the paper and what it addresses. Include a link to the article.

Answer: Zvereva et al. (2010) use ANCOVA to determine the effects of climate change and sap-feeding insects on the growth and reproduction of plants. ANCOVA is an ANOVA with an added covariate (just like we did in this lab). The researchers found that higher greenhouse temperatures yielded a decrease in plant growth when plants were attacked by sap-feeding insects.

b

c

Question: Do you understand the question, the null hypothesis, and the findings? Elaborate a little – what is clear and what is not? Be prepared to share something about the article you found with your work group.

Answer: The question was: Does average plant size and reproduction differ with changing temperatures? The null hypothesis is that all means are the same. The alternative is that at least 1 mean differs from the others. I did not have time to read the entire paper, but their analysis of variance with a covariate seems legit!