b
Question: Generate a subsample of 25 fish from our
fish data.
Answer: See
Table 1
set.seed(50) #fix the random number generator's seed so the same "random" values are generated on every run
tf <- round(rnorm(25, 29.7, 0.6), 1) #rnorm(n, mean = 0, sd = 1), round(x, digits = 0)
#round(x, digits) rounds x to the given number of decimal places. Here x is the output of rnorm(), which draws a random sample of 25 values from a normal distribution with mean 29.7 and standard deviation 0.6
#let tf represent tidal fresh (higher order stream sample)
#let d represent delta (mid order stream)
#let e represent estuary (the river mouth)
d <- round(rnorm(25, 32.7, 0.8), 1)
e <- round(rnorm(25, 33.5, 1.2), 1)
#now add categorical column to identify site:
#avoid naming a vector T: it would mask R's shorthand for TRUE
tidal <- rep("tidal", 25) #OH WOW. I could have used this when I didn't know ifelse during lab1
delta <- rep("delta", 25)
estuary <- rep("estuary", 25)
#now combine the data
length_mm <- c(tf, d, e) #combine all of column 1
habitat <- c(tidal, delta, estuary)
fishData <- data.frame(habitat, length_mm)
gt(fishData) |>
opt_table_font(google_font("Caveat"))|>
opt_table_outline(style="solid")|>
tab_header(
title = "Table 1: Fish Data",
subtitle = "Data collected from Skagit River, high order (tf), midorder (d), and low order (e)")|>
opt_stylize(style = 5, color = "cyan")#styles refer to the look of the table.
Table 1: Fish Data
Data collected from Skagit River, high order (tf), midorder (d), and low order (e)
| habitat | length_mm |
| --- | --- |
| tidal | 30.0 |
| tidal | 29.2 |
| tidal | 29.7 |
| tidal | 30.0 |
| tidal | 28.7 |
| tidal | 29.5 |
| tidal | 29.9 |
| tidal | 29.3 |
| tidal | 30.3 |
| tidal | 28.8 |
| tidal | 29.9 |
| tidal | 30.0 |
| tidal | 29.4 |
| tidal | 29.8 |
| tidal | 29.4 |
| tidal | 29.5 |
| tidal | 29.6 |
| tidal | 29.2 |
| tidal | 29.0 |
| tidal | 29.5 |
| tidal | 29.5 |
| tidal | 29.3 |
| tidal | 28.7 |
| tidal | 30.7 |
| tidal | 30.0 |
| delta | 34.8 |
| delta | 33.0 |
| delta | 32.4 |
| delta | 33.2 |
| delta | 32.7 |
| delta | 32.9 |
| delta | 32.0 |
| delta | 31.8 |
| delta | 33.2 |
| delta | 31.8 |
| delta | 32.7 |
| delta | 33.0 |
| delta | 34.2 |
| delta | 33.0 |
| delta | 33.0 |
| delta | 32.8 |
| delta | 32.4 |
| delta | 32.8 |
| delta | 30.4 |
| delta | 32.8 |
| delta | 33.1 |
| delta | 33.5 |
| delta | 31.4 |
| delta | 31.5 |
| delta | 32.3 |
| estuary | 33.5 |
| estuary | 33.9 |
| estuary | 31.1 |
| estuary | 30.7 |
| estuary | 34.7 |
| estuary | 34.9 |
| estuary | 32.5 |
| estuary | 31.8 |
| estuary | 31.9 |
| estuary | 33.3 |
| estuary | 33.9 |
| estuary | 33.6 |
| estuary | 33.5 |
| estuary | 32.7 |
| estuary | 31.8 |
| estuary | 33.1 |
| estuary | 34.5 |
| estuary | 32.0 |
| estuary | 33.6 |
| estuary | 32.4 |
| estuary | 33.6 |
| estuary | 32.7 |
| estuary | 31.5 |
| estuary | 34.2 |
| estuary | 33.6 |
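The same simulation can also be written more compactly. The following is a minimal sketch, not the code that produced Table 1: the names `sites`, `n_fish`, `samples`, and `fishData2` are hypothetical, and it assumes the three samples are drawn in the same order after the same `set.seed(50)` call, so the values should match the step-by-step version.

```r
# Site parameters taken from the rnorm() calls above (length in mm)
sites <- data.frame(
  habitat = c("tidal", "delta", "estuary"),
  mean    = c(29.7, 32.7, 33.5),
  sd      = c(0.6, 0.8, 1.2),
  stringsAsFactors = FALSE
)
n_fish <- 25  # fish sampled per habitat

set.seed(50)  # same seed and same draw order as the step-by-step version
# Map() walks through the sites in order and draws one rounded sample per site
samples <- Map(function(m, s) round(rnorm(n_fish, m, s), 1),
               sites$mean, sites$sd)

fishData2 <- data.frame(
  habitat   = rep(sites$habitat, each = n_fish),  # each label repeated 25 times
  length_mm = unlist(samples),
  stringsAsFactors = FALSE
)
str(fishData2)
```

Keeping the means and standard deviations in one small data frame makes it easier to add a fourth site later without copying another block of `rnorm()` and `rep()` lines.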
c
Question: Assessment of assumptions: Generate
plot(s) and/or statistical tests to assess the data. Paste any plot(s)
into your file and justify (1-2 sentences) why you have selected the
type of plot you have and summarize in words what it shows. Any
statistical tests should be explained–why did you use this and what did
you find? Label as appropriate.
Answer: Figure 1 below
shows that fish lengths in each habitat are roughly bell-shaped. To
confirm, I made a normal quantile plot and found that the fish length
data follow a fairly straight line, suggesting they follow a normal
distribution. Finally, I ran the Shapiro-Wilk test for each habitat to
triple check my assumption: every habitat fails to reject H0 (p > 0.05
for all three, Table 3), so fishData is normal enough.
The assumption of equal variance: Table 2 shows the standard deviation for tidal is
0.49, and the estuary sd (1.13) is more than 2 times that value, so I
decided to run Levene’s test for homogeneity of variance. The test
(Table 4) tells us that there is a violation of homogeneity
(F(2, 72) = 5.88, p = 0.004), but because our sample size is less than
30 per habitat, I will accept this violation as a problem with sample
size.
Step 1: Look at summary stats
#basic summary stats
fishmeans <- tapply(fishData$length_mm, fishData$habitat, mean)
fishsds<- tapply(fishData$length_mm, fishData$habitat, sd)
Skagit_locations <- c("Delta", "Estuary", "Tidal")
fishiesummarystats <- data.frame(Skagit_locations, fishmeans, fishsds)
gt(fishiesummarystats)|>
opt_table_font(google_font("Caveat"))|>
opt_table_outline(style="solid")|>
tab_header(
title = "Table 2: FishData Summary Statistics")|>
opt_stylize(style = 5, color = "cyan")%>%
tab_options(table.align='left')
Table 2: FishData Summary Statistics
| Skagit_locations | fishmeans | fishsds |
| --- | --- | --- |
| Delta | 32.668 | 0.8952281 |
| Estuary | 33.000 | 1.1343133 |
| Tidal | 29.556 | 0.4899660 |
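The "more than 2 times" comparison of standard deviations can be checked with a line of arithmetic. The sketch below copies the standard deviations from Table 2 by hand; the names `sds`, `sd_ratio`, and `var_ratio` are hypothetical and used only for illustration.

```r
# Standard deviations copied from Table 2 (mm)
sds <- c(Delta = 0.8952281, Estuary = 1.1343133, Tidal = 0.4899660)

sd_ratio  <- max(sds) / min(sds)      # largest sd relative to the smallest
var_ratio <- max(sds)^2 / min(sds)^2  # same comparison on the variance scale

round(c(sd_ratio = sd_ratio, var_ratio = var_ratio), 2)
names(which.max(sds))  # habitat with the most spread
names(which.min(sds))  # habitat with the least spread
```

The sd ratio comes out at about 2.3 (estuary versus tidal), which is what motivates the formal test of equal variance in Step 3.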
Step 2: Check for normality
ggplot(fishData, aes(length_mm))+
geom_histogram(bins=30, color="white", fill="#db9799")+
labs(title="Figure 1: Fish Length Distribution", x="Fish Length (mm)",
caption = "Figure 1: Distribution data for fish lengths at different sites.")+
theme(plot.caption =
element_text(hjust = 0))+
facet_wrap(~habitat) #one histogram panel per habitat

qqnorm(fishData$length_mm, main="All Fish Length") #Looks relatively straight, but I will run the Shapiro-wilk test to double check.

gt(fishData %>%
group_by(habitat) %>%
summarize(Statistic = shapiro.test(length_mm)$statistic, p.value =
shapiro.test(length_mm)$p.value)) |>
opt_table_font(google_font("Caveat")) |>
tab_header(
title = "Table 3: FishData Shapiro-Wilk")|>
opt_stylize(style=5, color = "cyan")%>%
tab_options(table.align='left')
Table 3: FishData Shapiro-Wilk
| habitat | Statistic | p.value |
| --- | --- | --- |
| delta | 0.9516023 | 0.2723815 |
| estuary | 0.9666068 | 0.5607988 |
| tidal | 0.9717184 | 0.6888610 |
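The normality assumption in ANOVA concerns the residuals (each fish's length minus its habitat mean), so a complementary check is to test the residuals of the one-way model alongside the per-habitat tests. This is a rough sketch using base R only; it rebuilds the simulated data with the recipe from section b, and the names `sw_by_habitat`, `fit`, and `sw_resid` are hypothetical.

```r
# Rebuild the simulated data with the same seed and draw order as section b
set.seed(50)
fishData <- data.frame(
  habitat   = factor(rep(c("tidal", "delta", "estuary"), each = 25)),
  length_mm = c(round(rnorm(25, 29.7, 0.6), 1),
                round(rnorm(25, 32.7, 0.8), 1),
                round(rnorm(25, 33.5, 1.2), 1))
)

# One Shapiro-Wilk test per habitat, each computed only once
sw_by_habitat <- do.call(rbind, lapply(
  split(fishData$length_mm, fishData$habitat),
  function(x) {
    sw <- shapiro.test(x)
    data.frame(Statistic = unname(sw$statistic), p.value = sw$p.value)
  }
))
sw_by_habitat

# Same test on the residuals of the one-way model (all 75 fish at once)
fit <- aov(length_mm ~ habitat, data = fishData)
sw_resid <- shapiro.test(residuals(fit))
sw_resid
```

A p-value above 0.05 in either check means there is no evidence against normality at the 5% level. Testing the residuals also avoids the problem with a pooled quantile plot, where differences in habitat means can bend the line even when each habitat is normal.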
Step 3: Check for homoscedasticity
ggplot(fishData,
aes(habitat, length_mm, color=habitat))+
geom_jitter(width=0.15, show.legend = F)+
scale_color_manual(values = c("#eaad9f", "#b9aea9", "#869c8d"))+
labs(title="Homoscedasticity for Fish Length",
caption = "Figure 2: Variance in fish length (mm) among habitat type", y="Fish Length (mm)")+
theme(plot.caption =
element_text(hjust = 0)) #variance seems smaller for tidal than the other two locations.

suppressWarnings({gt(leveneTest(length_mm ~ factor(habitat), data = fishData))|> #formula interface; habitat must be a factor
opt_table_font(google_font("Caveat"))|>
tab_header(
title = "Table 4: FishData Levene's Test")|>
opt_stylize(style=5, color = "cyan")|>
tab_options(table.align='left')})
Table 4: FishData Levene's Test
| Df | F value | Pr(>F) |
| --- | --- | --- |
| 2 | 5.879092 | 0.00431508 |
| 72 | NA | NA |
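Because Levene's test flags unequal variances, one optional cross-check is Welch's one-way ANOVA, which does not assume equal variances. This is a sketch of a sensitivity check under stated assumptions, not part of the analysis reported here: it rebuilds the simulated data with the recipe from section b and uses base R's `oneway.test()`; the names `welch` and `classic` are hypothetical.

```r
# Rebuild the simulated data with the same seed and draw order as section b
set.seed(50)
fishData <- data.frame(
  habitat   = factor(rep(c("tidal", "delta", "estuary"), each = 25)),
  length_mm = c(round(rnorm(25, 29.7, 0.6), 1),
                round(rnorm(25, 32.7, 0.8), 1),
                round(rnorm(25, 33.5, 1.2), 1))
)

# Welch's ANOVA: does not pool the variance across habitats
welch <- oneway.test(length_mm ~ habitat, data = fishData, var.equal = FALSE)
welch

# Classic one-way ANOVA for comparison (assumes equal variances)
classic <- oneway.test(length_mm ~ habitat, data = fishData, var.equal = TRUE)

# Side-by-side F statistics and p-values
data.frame(
  test    = c("Welch", "classic"),
  F       = c(unname(welch$statistic), unname(classic$statistic)),
  p.value = c(welch$p.value, classic$p.value)
)
```

If both versions reject H0, the conclusion about habitat differences does not depend on the equal-variance assumption.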
d
Question: Conduct an ANOVA (using α=0.05) to
determine if fish size differs among habitats, AND if so, how it does.
Provide relevant statistical output, which means pulling the important
elements out of your R output and putting them in your doc for
submittal, not just pasting the R output in your doc–you may paste, but
you clearly need to explain which parts are relevant/important to your
conclusions (which you will articulate in part e).
Does mean fish length differ among habitat type?
H0: μ_tf = μ_d = μ_e
HA: not all means are equal
fish.aov<- aov(fishData$length_mm ~ fishData$habitat)
summary(fish.aov)
##                  Df Sum Sq Mean Sq F value Pr(>F)
## fishData$habitat  2 180.47   90.23   116.3 <2e-16 ***
## Residuals        72  55.88    0.78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AOV Summary: Yes, mean
fish length differs between at least 2 habitats. The aov output above
gives p < 0.05, which is evidence that mean fish length varies among
at least 2 of the following habitat types: tidal, delta, estuary
(F(2, 72) = 116.3, p < 0.001).
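The numbers reported in the summary can be pulled out of the ANOVA table programmatically instead of being retyped. The following is a minimal sketch, assuming the simulated data from section b and a model fitted with the `data =` argument; the names `aov_tab`, `f_value`, `df_num`, `df_den`, `p_value`, and `report` are hypothetical.

```r
# Rebuild the simulated data with the same seed and draw order as section b
set.seed(50)
fishData <- data.frame(
  habitat   = factor(rep(c("tidal", "delta", "estuary"), each = 25)),
  length_mm = c(round(rnorm(25, 29.7, 0.6), 1),
                round(rnorm(25, 32.7, 0.8), 1),
                round(rnorm(25, 33.5, 1.2), 1))
)

fish.aov <- aov(length_mm ~ habitat, data = fishData)
aov_tab  <- summary(fish.aov)[[1]]  # data frame: habitat row, then Residuals row

f_value <- aov_tab[1, "F value"]    # F statistic for habitat
df_num  <- aov_tab[1, "Df"]         # numerator df (groups - 1)
df_den  <- aov_tab[2, "Df"]         # denominator df (residual)
p_value <- aov_tab[1, "Pr(>F)"]

# Assemble the string used in the written summary
report <- sprintf("F(%d, %d) = %.1f, p %s",
                  as.integer(df_num), as.integer(df_den), f_value,
                  format.pval(p_value, eps = 0.001))
report
```

The three pieces that matter for the conclusion are the F value, its two degrees of freedom, and the p-value; the sums of squares and mean squares are intermediate steps.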
Which habitats show a difference in mean fish length?
tukeytable <- TukeyHSD(fish.aov, ordered = TRUE)
tukeytable <- as.data.frame(tukeytable$`fishData$habitat`)
Habitat <- c("delta-tidal", "estuary-tidal", "estuary-delta")
tukeytable<- add_column(tukeytable, Habitat, .before = "diff")
gt(tukeytable) |>
opt_table_font(google_font("Caveat"))|>
tab_header(
title = "Table 5: FishData Tukey HSD")|>
opt_stylize(style=5, color = "cyan")|>
tab_options(table.align='left')
Table 5: FishData Tukey HSD
| Habitat | diff | lwr | upr | p adj |
| --- | --- | --- | --- | --- |
| delta-tidal | 3.112 | 2.5157113 | 3.7082887 | 0.0000000 |
| estuary-tidal | 3.444 | 2.8477113 | 4.0402887 | 0.0000000 |
| estuary-delta | 0.332 | -0.2642887 | 0.9282887 | 0.3820299 |
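Tukey HSD Summary: The
fishData shows that mean fish length differs between tidal and the other
two habitat types (F(2, 72) = 116.3, p < 0.001), with tidal having
the shortest fish (Tukey HSD, p < 0.001 for both delta-tidal and
estuary-tidal). Delta and estuary have the longest fish and do not
differ significantly from each other (Tukey HSD, p = 0.38).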
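The comparison labels in the Tukey table can be taken from the row names of the `TukeyHSD()` result rather than typed by hand, which guards against a label ending up next to the wrong row. This is a sketch under the same assumptions as before (simulated data from section b, model fitted with `data =`); the names `tk`, `tukey_df`, and the `differs` column are hypothetical.

```r
# Rebuild the simulated data with the same seed and draw order as section b
set.seed(50)
fishData <- data.frame(
  habitat   = factor(rep(c("tidal", "delta", "estuary"), each = 25)),
  length_mm = c(round(rnorm(25, 29.7, 0.6), 1),
                round(rnorm(25, 32.7, 0.8), 1),
                round(rnorm(25, 33.5, 1.2), 1))
)

fish.aov <- aov(length_mm ~ habitat, data = fishData)

# ordered = TRUE sorts habitats by mean, so every difference is positive
tk <- TukeyHSD(fish.aov, ordered = TRUE)$habitat  # matrix: diff, lwr, upr, p adj

# Move the comparison labels from the row names into a regular column
tukey_df <- data.frame(Habitat = rownames(tk), tk,
                       check.names = FALSE, row.names = NULL)

# A pair differs at the 5% level when its interval excludes zero
tukey_df$differs <- tukey_df$lwr > 0 | tukey_df$upr < 0
tukey_df
```

Reading the `lwr` and `upr` columns this way gives the same answer as the adjusted p-values: an interval that spans zero corresponds to a pair whose means are not distinguishable.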
e
Question: What are your conclusions about fish size
in different habitats? Provide supporting information (descriptive
statistics, quality plots, statistical output) and write your summary in
plain language with the appropriate evidence.
AOV Summary: Mean
fish length does differ between at least 2 habitats. The aov output in
section d gives p < 0.05, which is evidence that mean fish length
varies among at least 2 of the following habitat types: tidal, delta,
estuary (F(2, 72) = 116.3, p < 0.001). The Tukey HSD post-hoc test,
used to determine which means differ, shows that tidal freshwater
salmon have on average the shortest length (mean 29.6 mm, versus
32.7 mm in the delta and 33.0 mm in the estuary; Tukey HSD,
p < 0.001), while delta and estuary fish do not differ significantly
(Tukey HSD, p = 0.38).
ggplot(fishData,
aes(habitat, length_mm, fill=habitat))+ #scale_fill_manual only works with fill not color. INTERESTING!
geom_boxplot(width=0.15, show.legend = F)+
scale_fill_manual(values = c("#eaad9f", "#b9aea9", "#869c8d"))+
labs(title="Fish length differs among stream order", x="",
caption = "Figure 3: Fish length is smallest in higher order streams (tidal)", y="Fish Length (mm)")+
theme(plot.caption =
element_text(hjust = 0)) #variance seems smaller for tidal than the other two locations.
