library(tidyverse)
large_city <- read.csv('Data/large_city.csv')

ggplot(large_city, aes(x = X, y = Y)) +
  geom_point(pch = 15) +
  theme_bw(base_size = 14) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_line(color = 'lightgrey'),
        axis.text = element_blank(),
        axis.ticks = element_blank()) +
  scale_x_continuous(expand = c(0.01, 0.01), limits = c(0, 100), breaks = 0:100) +
  scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 100), breaks = 0:100)

Activity 2.1 - dissimilarity measures
SUBMISSION INSTRUCTIONS:
Submit both:
- Your source .qmd file;
- A PDF version of your work. There are two options for the PDF:
- Rendered directly with Quarto;
- A PDF print of your HTML. For this option, print the rendered HTML to a PDF document with Ctrl+P or Cmd+P.
Question 1
A)
Consider the data set below:
| Person | Age (years) | Income (k$) | Gender | Education |
|---|---|---|---|---|
| P1 | 25 | 35 | Female | Bachelor |
| P2 | 40 | 50 | Male | Bachelor |
| P3 | 30 | 65 | Female | Master |
Which distance measure would be best for measuring the dissimilarity between these individuals? Find the 3 pairwise distances using your selected metric “from scratch”. Which pair is most different? Most similar?
Gower distance is best here because the data mix numeric variables (Age, Income) and categorical variables (Gender, Education): numeric differences are scaled by the variable's range, categorical variables contribute 0 if they match and 1 if they differ, and the four contributions are averaged.

Ranges: Age = 40 − 25 = 15, Income = 65 − 35 = 30.

P1 vs P2:
Age = |25 − 40| / 15 = 15/15 = 1
Income = |35 − 50| / 30 = 15/30 = 0.5
Gender = 1 (differ), Education = 0 (match)
Avg = (1 + 0.5 + 1 + 0)/4 = 2.5/4 = 0.625

P1 vs P3:
Age = |25 − 30| / 15 = 5/15 = 0.333
Income = |35 − 65| / 30 = 30/30 = 1
Gender = 0 (match), Education = 1 (differ)
Avg = (0.333 + 1 + 0 + 1)/4 = 2.333/4 = 0.583

P2 vs P3:
Age = |40 − 30| / 15 = 10/15 = 0.667
Income = |50 − 65| / 30 = 15/30 = 0.5
Gender = 1 (differ), Education = 1 (differ)
Avg = (0.667 + 0.5 + 1 + 1)/4 = 3.167/4 = 0.792

Most similar: P1 & P3 (0.583)
Most different: P2 & P3 (0.792)
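As a cross-check of the hand calculations, here is a minimal sketch assuming the `cluster` package is available; its `daisy()` function computes Gower dissimilarity by range-scaling numeric columns and simple-matching the factors.

```r
# Minimal sketch, assuming the cluster package is installed.
library(cluster)

people <- data.frame(
  Age       = c(25, 40, 30),
  Income    = c(35, 50, 65),
  Gender    = factor(c("Female", "Male", "Female")),
  Education = factor(c("Bachelor", "Bachelor", "Master")),
  row.names = c("P1", "P2", "P3")
)

daisy(people, metric = "gower")
# Expected to agree with the hand calculations:
# d(P1,P2) = 0.625, d(P1,P3) = 0.583, d(P2,P3) = 0.792
```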
B)
Consider the data set below:
| Person | Preferred Cuisine | Hobby | Device Used Most |
|---|---|---|---|
| P1 | Italian | Reading | Laptop |
| P2 | Mexican | Reading | Phone |
| P3 | Italian | Hiking | Laptop |
Which distance measure would be best for measuring the dissimilarity between these individuals? Find the 3 pairwise distances using your selected metric “from scratch”. Which pair is most different? Most similar?

With all three variables categorical, simple matching dissimilarity (the proportion of variables on which two people disagree) is the natural choice.

P1 vs P2: Cuisine differs, Hobby matches, Device differs → 2/3 ≈ 0.667
P1 vs P3: Cuisine matches, Hobby differs, Device matches → 1/3 ≈ 0.333
P2 vs P3: all three variables differ → 3/3 = 1.0

Most similar: P1 & P3 (0.333)
Most different: P2 & P3 (1.0)
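A small from-scratch sketch of the same simple-matching computation in R; the data frame below simply re-enters the table above.

```r
# Simple matching dissimilarity: proportion of variables on which two people disagree
prefs <- data.frame(
  Cuisine = c("Italian", "Mexican", "Italian"),
  Hobby   = c("Reading", "Reading", "Hiking"),
  Device  = c("Laptop", "Phone", "Laptop"),
  row.names = c("P1", "P2", "P3")
)

simple_match <- function(i, j) mean(as.matrix(prefs[i, ]) != as.matrix(prefs[j, ]))

simple_match("P1", "P2")  # 2/3 ≈ 0.667
simple_match("P1", "P3")  # 1/3 ≈ 0.333
simple_match("P2", "P3")  # 3/3 = 1.000
```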
C)
Consider the data set below representing counts of three different species at three locations:
| Site | Species A | Species B | Species C |
|---|---|---|---|
| Site 1 | 4 | 2 | 0 |
| Site 2 | 1 | 3 | 1 |
| Site 3 | 0 | 5 | 2 |
Which distance measure would be best for measuring the dissimilarity between these individuals? Find the 3 pairwise distances using your selected metric “from scratch”. Which pair is most different? Most similar?
Bray-Curtis dissimilarity is best here because these are species abundance (count) data: it compares sites using sum|x − y| / sum(x + y), so it accounts for how many individuals are present, is bounded between 0 and 1, and ignores joint absences.

Site 1 vs Site 2:
Numerator = |4−1| + |2−3| + |0−1| = 3 + 1 + 1 = 5
Denominator = (4+1) + (2+3) + (0+1) = 5 + 5 + 1 = 11
Distance = 5/11 ≈ 0.455

Site 1 vs Site 3:
Numerator = |4−0| + |2−5| + |0−2| = 4 + 3 + 2 = 9
Denominator = (4+0) + (2+5) + (0+2) = 4 + 7 + 2 = 13
Distance = 9/13 ≈ 0.692

Site 2 vs Site 3:
Numerator = |1−0| + |3−5| + |1−2| = 1 + 2 + 1 = 4
Denominator = (1+0) + (3+5) + (1+2) = 1 + 8 + 3 = 12
Distance = 4/12 ≈ 0.333

Most similar: Site 2 & Site 3 (0.333)
Most different: Site 1 & Site 3 (0.692)
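The same three distances can be cross-checked with `vegdist()` from the vegan package (which is also used in Question 4); a minimal sketch:

```r
# Cross-check of the hand calculations: vegdist(method = "bray")
# computes sum|x - y| / sum(x + y) for each pair of rows.
library(vegan)

sites <- matrix(c(4, 2, 0,
                  1, 3, 1,
                  0, 5, 2),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("Site1", "Site2", "Site3"),
                                c("A", "B", "C")))

vegdist(sites, method = "bray")
# Expected: Site1-Site2 ≈ 0.455, Site1-Site3 ≈ 0.692, Site2-Site3 ≈ 0.333
```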
Question 2
You’re a city manager in charge of building a new fire station. The fire station serves a city of only 4 households. There are 3 locations in the city suitable for the fire station.
The grid below shows the locations of the houses (grey squares) and suitable fire station locations (red circles):
A)
Which distance metric would be most appropriate for determining the optimal place to build the fire station? Explain your reasoning.
L1 (Manhattan) distance is the most appropriate because the fire truck has to drive along the street grid: the distance that matters is the blocks traveled east-west plus the blocks traveled north-south, not the straight-line distance.
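For reference, the quantity being minimized for a house at $(x_h, y_h)$ and a candidate station at $(x_s, y_s)$ is the city-block distance

$$
d_{L1} = |x_h - x_s| + |y_h - y_s|,
$$

whereas the straight-line (L2) distance $\sqrt{(x_h - x_s)^2 + (y_h - y_s)^2}$ cuts across blocks that a truck cannot drive through.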
B)
Which is the optimal place of the available locations to build the fire station that minimizes the cumulative distance between the station and the houses?
Location B, because it sits at the center of the grid relative to the four houses and therefore minimizes the cumulative L1 distance to them.
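This claim can be checked numerically with the same crossing/summarize pattern used in part C below. The coordinates in this sketch are hypothetical placeholders; the real positions come from the grid figure above.

```r
# Hypothetical coordinates for illustration only -- replace with the positions
# read off the grid figure.
library(tidyverse)

houses   <- tibble(HX = c(1, 2, 8, 9), HY = c(2, 9, 1, 8))   # hypothetical
stations <- tibble(Station = c("A", "B", "C"),
                   X = c(1, 5, 9), Y = c(1, 5, 9))            # hypothetical

stations %>%
  crossing(houses) %>%
  mutate(L1 = abs(HX - X) + abs(HY - Y)) %>%
  group_by(Station) %>%
  summarize(total_L1 = sum(L1))
# The station with the smallest total_L1 is the optimal choice.
```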
C)
Now consider a much larger city:
The code below produces the coordinates of all locations in the city:
(possibilities <- expand.grid(X_FS = 0:100, Y_FS = 0:100)
 %>% mutate(FireStationID = 1:n())
) %>% head

  X_FS Y_FS FireStationID
1 0 0 1
2 1 0 2
3 2 0 3
4 3 0 4
5 4 0 5
6 5 0 6
The next chunk of code crosses each house coordinate with all city coordinates:
# PART C: Find optimal location using L1 (Manhattan) distance
distances <- large_city %>%
crossing(possibilities) %>%
mutate(L1_distance = abs(X - X_FS) + abs(Y - Y_FS))
optimal_L1 <- distances %>%
group_by(X_FS, Y_FS) %>%
summarize(total_distance = sum(L1_distance), .groups = 'drop') %>%
anti_join(large_city, by = c("X_FS" = "X", "Y_FS" = "Y")) %>%
slice_min(total_distance, n = 1)
print(optimal_L1)

# A tibble: 8 × 3
X_FS Y_FS total_distance
<int> <int> <int>
1 38 49 1271
2 38 50 1271
3 39 49 1271
4 39 50 1271
5 40 49 1271
6 40 50 1271
7 41 49 1271
8 41 50 1271
ggplot(large_city, aes(x = X, y = Y)) +
geom_point(pch = 15) +
geom_point(data = optimal_L1, aes(x = X_FS, y = Y_FS),
color = 'red', size = 6) +
theme_bw(base_size = 14) +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = 'lightgrey'),
axis.text = element_blank(),
axis.ticks = element_blank()) +
scale_x_continuous(expand = c(0.01, 0.01), limits = c(0, 100), breaks = 0:100) +
scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 100), breaks = 0:100)

# PART D: Find optimal location using L2 (Euclidean) distance
distances <- distances %>%
mutate(L2_distance = sqrt((X - X_FS)^2 + (Y - Y_FS)^2))
optimal_L2 <- distances %>%
group_by(X_FS, Y_FS) %>%
summarize(total_distance = sum(L2_distance), .groups = 'drop') %>%
anti_join(large_city, by = c("X_FS" = "X", "Y_FS" = "Y")) %>%
slice_min(total_distance, n = 1)
print(optimal_L2)

# A tibble: 1 × 3
X_FS Y_FS total_distance
<int> <int> <dbl>
1 41 50 981.
ggplot(large_city, aes(x = X, y = Y)) +
geom_point(pch = 15) +
geom_point(data = optimal_L1, aes(x = X_FS, y = Y_FS),
color = 'red', size = 6) +
geom_point(data = optimal_L2, aes(x = X_FS, y = Y_FS),
color = 'blue', size = 6) +
theme_bw(base_size = 14) +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = 'lightgrey'),
axis.text = element_blank(),
axis.ticks = element_blank()) +
scale_x_continuous(expand = c(0.01, 0.01), limits = c(0, 100), breaks = 0:100) +
scale_y_continuous(expand = c(0.01, 0.01), limits = c(0, 100), breaks = 0:100)

Suppose any location where a house is not already built is a candidate for a fire station location. Complete the dplyr chain above to find the fire station location that minimizes the L1 distance from all houses. Add the location of the fire station to the graph of the city using a large red dot. You should of course make sure that you don’t build a fire station on top of an existing house!
D)
Modify your code to find the optimal location of the station if L2 distance is used instead. Add the new location to the grid.
Question 3
Consider the R code below, which produced one of the plots from the lecture slides:
library(plotly)
# Read and wrangle
city77 <- (read.csv("Data/City77.csv")
%>% mutate(City = gsub('\\.',' ',City))
%>% mutate(highlight = ifelse(City %in% c("Minneapolis MN", "Kansas City MO", "Milwaukee WI"), 'yes','no'))
)
#Plot
plot_ly(city77,
x = ~popdens,
y = ~k12enr,
z = ~medinc,
type = "scatter3d",
mode = "markers+text",
text = ~ifelse(highlight == "yes", City, ""),
textposition = "top center",
color = ~highlight,
colors = c("gray", "red"),
marker = list(size = 5)) %>%
layout(title = "3D Plot: Population Density vs K-12 vs Median Income",
scene = list(
xaxis = list(title = "Population Density"),
yaxis = list(title = "K-12 enrollment %"),
zaxis = list(title = "Median Income")
  ))

Use this code as a template to analyze bankruptcy data (source: Johnson and Wichern 6e page 657). This data set contains four metrics on both bankrupt and financially sound firms. The metrics were recorded 2 years prior to bankruptcy for the bankrupt firms, and at about the same time for the financially sound firms.
Variables:
- bankrupt = 1 for yes, 0 for financially sound
- x1 = cash flow / total debt
- x2 = net income / total assets
- x3 = current assets / current liabilities
- x4 = current assets / net sales
bankruptcy <- read.csv('Data/bankruptcy_data.csv')

A)
Create an interactive 3D scatterplot of this data using x1, x2, and x3. Color-code the points by bankruptcy status. Imagine your boss is a bank manager. Use your plot to explain what are some warning signs to look out for in terms of the characteristics of bankrupt banks.
Based on this 3D visualization, here are the key warning signs to look out for:
Low Cash Flow Ratio (x1): Bankrupt firms tend to have negative or very low cash flow relative to total debt.
Poor Profitability (x2): Bankrupt firms generally show negative or very low net income relative to total assets. This indicates the firm is either losing money or barely breaking even on its asset base.
Weak Liquidity (x3): Bankrupt firms often have lower current ratios (current assets / current liabilities). When this ratio is low (roughly below 2 in these data), the firm may struggle to meet short-term obligations; financially sound firms typically have current ratios well above 2.
In the 3D space, bankrupt firms cluster in the region with low x1 (≤ 0.2), low or negative x2 (≤ 0), and lower x3 values. Since these metrics were recorded about two years before the bankrupt firms failed, firms currently sitting in this region warrant close monitoring.
Key Insight: The financially sound firms (green) occupy a distinctly different region of the 3D space: they have positive cash flow, positive net income, and stronger liquidity positions. This clear separation suggests these three metrics together provide strong warning signals for bankruptcy.
library(plotly)
library(dplyr)
# Read bankruptcy data
bankruptcy <- read.csv('Data/bankruptcy_data.csv')
# Create a label for bankruptcy status
bankruptcy <- bankruptcy %>%
mutate(status = ifelse(bankrupt == 1, "Bankrupt", "Financially Sound"))
# Create 3D scatterplot
plot_ly(bankruptcy,
x = ~x1,
y = ~x2,
z = ~x3,
type = "scatter3d",
mode = "markers",
color = ~status,
colors = c("red", "darkgreen"),
marker = list(size = 5)) %>%
layout(title = "3D Plot: Cash Flow vs Net Income vs Current Ratio",
scene = list(
xaxis = list(title = "Cash Flow / Total Debt (x1)"),
yaxis = list(title = "Net Income / Total Assets (x2)"),
zaxis = list(title = "Current Assets / Current Liabilities (x3)")
  ))

B)
Find the mean vectors of x1 through x4 for bankrupt and sound banks (so 2 mean vectors). Then consider a new bank with the following metrics: x1 = 0.31, x2 = 0.04, x3 = 4.47, x4 = 0.30
# Calculate mean vectors for each group
mean_bankrupt <- bankruptcy %>%
filter(bankrupt == 1) %>%
summarise(
mean_x1 = mean(x1),
mean_x2 = mean(x2),
mean_x3 = mean(x3),
mean_x4 = mean(x4)
)
mean_sound <- bankruptcy %>%
filter(bankrupt == 0) %>%
summarise(
mean_x1 = mean(x1),
mean_x2 = mean(x2),
mean_x3 = mean(x3),
mean_x4 = mean(x4)
)
print("Mean vector for Bankrupt firms:")[1] "Mean vector for Bankrupt firms:"
print(mean_bankrupt) mean_x1 mean_x2 mean_x3 mean_x4
1 -0.06904762 -0.08142857 1.366667 0.437619
print("Mean vector for Financially Sound firms:")[1] "Mean vector for Financially Sound firms:"
print(mean_sound) mean_x1 mean_x2 mean_x3 mean_x4
1 0.2352 0.0556 2.5936 0.4268
# New bank metrics
new_bank <- c(0.31, 0.04, 4.47, 0.30)
# Calculate Euclidean distances
dist_to_bankrupt <- sqrt(sum((new_bank - c(mean_bankrupt$mean_x1,
mean_bankrupt$mean_x2,
mean_bankrupt$mean_x3,
mean_bankrupt$mean_x4))^2))
dist_to_sound <- sqrt(sum((new_bank - c(mean_sound$mean_x1,
mean_sound$mean_x2,
mean_sound$mean_x3,
mean_sound$mean_x4))^2))
print(paste("Distance to Bankrupt mean:", round(dist_to_bankrupt, 4)))[1] "Distance to Bankrupt mean: 3.1318"
print(paste("Distance to Sound mean:", round(dist_to_sound, 4)))[1] "Distance to Sound mean: 1.8822"
# Classification
if(dist_to_sound < dist_to_bankrupt) {
print("Classification: This bank is doing WELL (closer to financially sound firms)")
} else {
print("Classification: This bank is in TROUBLE (closer to bankrupt firms)")
}[1] "Classification: This bank is doing WELL (closer to financially sound firms)"
Is this bank doing well or are they in trouble? Justify your answer using distance between this bank’s metrics and the mean vector for each bank type.
The new bank with metrics (x1=0.31, x2=0.04, x3=4.47, x4=0.30) is doing relatively well and appears financially sound.
The distance to the “Financially Sound” mean vector (1.88) is smaller than the distance to the “Bankrupt” mean vector (3.13). Using nearest-centroid classification, this bank should be classified as financially sound.
- x1 = 0.31: positive cash flow relative to debt (good sign)
- x2 = 0.04: positive net income relative to assets (profitable)
- x3 = 4.47: strong current ratio above 4 (excellent liquidity)
- x4 = 0.30: reasonable current assets to sales ratio
All four metrics show the bank is generating cash, is profitable, and has strong liquidity. These are characteristics much more consistent with financially sound firms than with bankrupt ones.
Question 4
These data come from a collaboration with WSU biology faculty and students and were published in Water. The Whitewater River has three forks: north, middle, and south.
Image source: https://www.whitewaterwatershed.org/
Electrofishing at over 60 locations on the three forks was used to sample fish populations. The data are below.
whitewater <- read.csv('Data/Whitewater.csv')
head(whitewater, 3)

  fork site siteID lat long American.brook.lamprey
1 NF Pries source 1 NF32 44.03638 -92.25187 0
2 NF Pries source 2 NF33 44.04202 -92.26845 0
3 NF Pries 1 NF30 44.03670 -92.25552 0
Rainbow.trout Brown.trout Brook.trout Central.stoneroller Common.shiner
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
Sand.shiner Southern.redbelly.dace Bluntnose.minnow Fathead.minnow
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
Blacknose.dace Longnose.dace Creek.chub White.sucker Brook.stickleback
1 0 0 0 0 38
2 10 0 4 0 0
3 0 0 0 0 22
Green.sunfish Bluegill Fantail.darter Johnny.darter mottled.sculpin
1 0 0 0 0 0
2 0 0 1 1 0
3 0 0 0 1 0
slimy.sculpin
1 0
2 0
3 0
library(tidyverse)
library(vegan)
Loading required package: permute
whitewater <- read.csv('Data/Whitewater.csv')
cat("\n=== PART A: Bray-Curtis Analysis ===\n")
=== PART A: Bray-Curtis Analysis ===
site_names <- whitewater$siteID
species_cols <- whitewater %>%
select(-fork, -site, -siteID, -lat, -long)
species_cols <- as.data.frame(species_cols)
rownames(species_cols) <- site_names
bc_dist <- vegdist(species_cols, method = "bray")
bc_matrix <- as.matrix(bc_dist)
rownames(bc_matrix) <- site_names
colnames(bc_matrix) <- site_names
bc_matrix_upper <- bc_matrix
bc_matrix_upper[lower.tri(bc_matrix_upper, diag = TRUE)] <- NA
min_dissim <- min(bc_matrix_upper, na.rm = TRUE)
min_idx <- which(bc_matrix_upper == min_dissim, arr.ind = TRUE)
cat(sprintf("\nMost similar: %s and %s (BC = %.4f)\n",
site_names[min_idx[1,1]], site_names[min_idx[1,2]], min_dissim))
Most similar: MF5 and MF4 (BC = 0.0804)
max_dissim <- max(bc_matrix_upper, na.rm = TRUE)
max_idx <- which(bc_matrix_upper == max_dissim, arr.ind = TRUE)
cat(sprintf("Most dissimilar: %s and %s (BC = %.4f)\n",
            site_names[max_idx[1,1]], site_names[max_idx[1,2]], max_dissim))
Most dissimilar: NF32 and NF33 (BC = 1.0000)
cat("\n=== PART B: Jaccard Dissimilarity (from scratch) ===\n")
=== PART B: Jaccard Dissimilarity (from scratch) ===
present <- \(species) ifelse(species > 0, 1, 0)
whitewater_present <- whitewater %>%
mutate(across(.cols = American.brook.lamprey:slimy.sculpin, .fns = present))
site2 <- whitewater_present %>% slice(2) %>%
select(American.brook.lamprey:slimy.sculpin) %>% as.numeric()
site4 <- whitewater_present %>% slice(4) %>%
select(American.brook.lamprey:slimy.sculpin) %>% as.numeric()
a <- sum(site2 == 1 & site4 == 1) # Both present
b <- sum(site2 == 1 & site4 == 0) # Only site 2
c <- sum(site2 == 0 & site4 == 1) # Only site 4
jaccard_dissim <- (b + c) / (a + b + c)
cat(sprintf("\nSites: %s and %s\n", site_names[2], site_names[4]))
Sites: NF33 and NF31
cat(sprintf("a (both present) = %d, b (only site 2) = %d, c (only site 4) = %d\n", a, b, c))a (both present) = 4, b (only site 2) = 0, c (only site 4) = 3
cat(sprintf("Jaccard = (b+c)/(a+b+c) = %.4f\n", jaccard_dissim))Jaccard = (b+c)/(a+b+c) = 0.4286
cat("\n=== PART C: Comparing Bray-Curtis, Gower, and Jaccard ===\n")
=== PART C: Comparing Bray-Curtis, Gower, and Jaccard ===
gower_dist <- vegdist(species_cols, method = "gower")
gower_matrix <- as.matrix(gower_dist)
jaccard_dist <- vegdist(whitewater_present %>%
select(American.brook.lamprey:slimy.sculpin),
method = "jaccard")
jaccard_matrix <- as.matrix(jaccard_dist)
n_sites <- nrow(whitewater)
comparisons <- expand.grid(site1 = 1:n_sites, site2 = 1:n_sites) %>%
filter(site1 < site2) %>%
mutate(
site1_name = site_names[site1],
site2_name = site_names[site2],
bray_curtis = map2_dbl(site1, site2, ~bc_matrix[.x, .y]),
gower = map2_dbl(site1, site2, ~gower_matrix[.x, .y]),
jaccard = map2_dbl(site1, site2, ~jaccard_matrix[.x, .y])
)
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(comparisons$bray_curtis, comparisons$gower,
xlab = "Bray-Curtis", ylab = "Gower",
main = "Bray-Curtis vs Gower",
pch = 16, col = rgb(0, 0, 1, 0.5))
abline(0, 1, col = "red", lty = 2)
cor_bg <- cor(comparisons$bray_curtis, comparisons$gower)
text(0.1, 0.9, sprintf("r = %.3f", cor_bg), pos = 4)
plot(comparisons$bray_curtis, comparisons$jaccard,
xlab = "Bray-Curtis", ylab = "Jaccard",
main = "Bray-Curtis vs Jaccard",
pch = 16, col = rgb(0, 1, 0, 0.5))
abline(0, 1, col = "red", lty = 2)
cor_bj <- cor(comparisons$bray_curtis, comparisons$jaccard)
text(0.1, 0.9, sprintf("r = %.3f", cor_bj), pos = 4)
plot(comparisons$gower, comparisons$jaccard,
xlab = "Gower", ylab = "Jaccard",
main = "Gower vs Jaccard",
pch = 16, col = rgb(1, 0, 0, 0.5))
abline(0, 1, col = "red", lty = 2)
cor_gj <- cor(comparisons$gower, comparisons$jaccard)
text(0.1, 0.9, sprintf("r = %.3f", cor_gj), pos = 4)
cat(sprintf("\nCorrelations: BC-Gower=%.3f, BC-Jaccard=%.3f, Gower-Jaccard=%.3f\n",
cor_bg, cor_bj, cor_gj))
Correlations: BC-Gower=0.323, BC-Jaccard=0.719, Gower-Jaccard=0.083
cat("\n=== PART D: Largest Bray-Curtis vs Jaccard Discrepancies ===\n")
=== PART D: Largest Bray-Curtis vs Jaccard Discrepancies ===
comparisons <- comparisons %>%
mutate(bc_jac_diff = abs(bray_curtis - jaccard)) %>%
arrange(desc(bc_jac_diff))
top_discrepancies <- comparisons %>% slice(1:2)
for(i in 1:2) {
row <- top_discrepancies[i, ]
cat(sprintf("\nPair %d: %s vs %s\n", i, row$site1_name, row$site2_name))
cat(sprintf(" Bray-Curtis=%.4f, Jaccard=%.4f, Difference=%.4f\n",
row$bray_curtis, row$jaccard, row$bc_jac_diff))
site1_counts <- species_cols[row$site1, ]
site2_counts <- species_cols[row$site2, ]
present1 <- colnames(species_cols)[site1_counts > 0]
present2 <- colnames(species_cols)[site2_counts > 0]
shared <- intersect(present1, present2)
cat(sprintf(" Species counts: %d, %d (shared: %d)\n",
length(present1), length(present2), length(shared)))
if(length(shared) > 0 && length(shared) <= 10) {
cat(" Shared species abundances:\n")
for(sp in shared) {
cat(sprintf(" %s: %.0f vs %.0f\n", sp, site1_counts[[sp]], site2_counts[[sp]]))
}
}
}
Pair 1: NF21 vs NF20
Bray-Curtis=0.8014, Jaccard=0.0000, Difference=0.8014
Species counts: 7, 7 (shared: 7)
Shared species abundances:
Brown.trout: 11 vs 27
Longnose.dace: 10 vs 150
Creek.chub: 10 vs 10
White.sucker: 54 vs 6
Fantail.darter: 1 vs 4
Johnny.darter: 11 vs 3
slimy.sculpin: 1 vs 125
Pair 2: NF36 vs SF52
Bray-Curtis=0.7226, Jaccard=0.0000, Difference=0.7226
Species counts: 8, 8 (shared: 8)
Shared species abundances:
Brown.trout: 3 vs 5
Fathead.minnow: 1 vs 1
Blacknose.dace: 7 vs 7
Longnose.dace: 111 vs 4
Creek.chub: 124 vs 12
White.sucker: 47 vs 23
Fantail.darter: 6 vs 34
Johnny.darter: 56 vs 6
A)
Explain why Bray-Curtis is appropriate for measuring similarity/dissimilarity in populations across sites. Use Bray-Curtis and data management techniques to find the two most similar, and the two most dissimilar sites.
Bray-Curtis dissimilarity is appropriate for measuring fish population similarity across sites because it accounts for abundance. It considers how many of each species are present, not just whether they exist. A site with 150 longnose dace is ecologically very different from one with 10, even though both have the species present.
It also ignores joint absences, has a bounded 0-1 scale, and handles count data well.
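For reference, the Bray-Curtis dissimilarity between two sites with species counts $x_i$ and $y_i$ is

$$
BC = \frac{\sum_i |x_i - y_i|}{\sum_i (x_i + y_i)},
$$

which equals 0 when the two abundance vectors are identical and 1 when the sites share no individuals of any species.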
B)
The code below transforms the data set to measure only presence/absence of each species:
present <- \(species) ifelse(species>0, 1, 0)
whitewater_present <- (whitewater
%>% mutate(across(.cols=American.brook.lamprey:slimy.sculpin,
.fns=present))
)

Consider these two sites:
whitewater_present %>%
slice(2, 4)

  fork site siteID lat long American.brook.lamprey
1 NF Pries source 2 NF33 44.04202 -92.26845 0
2 NF Pries 2 NF31 44.04505 -92.26420 0
Rainbow.trout Brown.trout Brook.trout Central.stoneroller Common.shiner
1 0 0 0 0 0
2 0 0 0 0 0
Sand.shiner Southern.redbelly.dace Bluntnose.minnow Fathead.minnow
1 0 0 0 0
2 0 1 0 0
Blacknose.dace Longnose.dace Creek.chub White.sucker Brook.stickleback
1 1 0 1 0 0
2 1 1 1 0 1
Green.sunfish Bluegill Fantail.darter Johnny.darter mottled.sculpin
1 0 0 1 1 0
2 0 0 1 1 0
slimy.sculpin
1 0
2 0
Compute the Jaccard dissimilarity metric between these two sites from scratch.
a = 4, b = 0, c = 3, so the Jaccard dissimilarity is (b + c)/(a + b + c) = 3/7 ≈ 0.429.
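This agrees with vegan's built-in measure; a quick check, assuming `whitewater_present` from the chunk above is in scope:

```r
# Cross-check: on presence/absence data, vegdist(method = "jaccard")
# reduces to (b + c)/(a + b + c).
vegdist(whitewater_present %>%
          slice(2, 4) %>%
          select(American.brook.lamprey:slimy.sculpin),
        method = "jaccard")
# Expected: 3/7 ≈ 0.4286
```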
C)
Use the Gower measure on the count data, and also find the Jaccard dissimilarities for all sites using the present/absent data. Turn each into a data frame and join them with the Bray-Curtis measures (so you have 3 measures for each unique pair of site comparisons). Then create pairwise scatterplots of the 3 dissimilarity measures versus each other. How do they compare?
Correlations: BC-Gower = 0.323, BC-Jaccard = 0.719, Gower-Jaccard = 0.083
The scatterplots show that Bray-Curtis and Jaccard track each other most closely (r = 0.719), while Gower is only weakly related to either. For the Whitewater River fish communities, abundance patterns matter: sites can have identical species lists yet very different ecological structure.
D)
Find the 2 pairs of sites with the largest discrepancies between the Bray-Curtis and Jaccard measures. Can you identify why they are in disagreement?
Pair 1: NF21 vs NF20
Bray-Curtis: 0.8014 (highly dissimilar); Jaccard: 0.0000 (identical species lists); Difference: 0.8014
Why they disagree: These two North Fork sites show the most extreme case of disagreement. They have perfect species overlap: all 7 species are found at both sites, giving Jaccard = 0 (identical composition). However, the Bray-Curtis value of 0.80 indicates they are highly dissimilar in structure. The abundance differences are dramatic:
- Longnose dace: 10 vs 150 → 15-fold difference
- Slimy sculpin: 1 vs 125 → 125-fold difference
- White sucker: 54 vs 6 → 9-fold difference in the opposite direction
- Brown trout: 11 vs 27 → 2.5-fold difference
Ecological explanation: Site NF20 is dominated by longnose dace and slimy sculpin (a combined 275 of 325 fish), suggesting a cold-water, fast-flowing stream community. Site NF21 has much lower densities overall but proportionally more white suckers. Jaccard says “same species = identical sites”; Bray-Curtis says “different dominant species = very different communities.”
Pair 2: NF36 vs SF52
Bray-Curtis: 0.7226 (dissimilar); Jaccard: 0.0000 (identical species lists); Difference: 0.7226
Why they disagree: Again, perfect species overlap (all 8 species shared), but substantial structural differences:
- Longnose dace: 111 vs 4 → 28-fold difference
- Creek chub: 124 vs 12 → 10-fold difference
- Johnny darter: 56 vs 6 → 9-fold difference
- Fantail darter: 6 vs 34 → 6-fold difference (reversed dominance)
Ecological explanation: This comparison is particularly interesting because it is a North Fork vs South Fork comparison. Despite having access to the same regional species pool, the two forks have developed very different community structures:
- NF36 is dominated by longnose dace (111) and creek chub (124), suggesting fast-flowing, high-gradient habitat.
- SF52 has more fantail darters (34), a species that prefers slower pools and riffles.
The identical species lists suggest both forks have been colonized by the same Whitewater River fish fauna, but local environmental conditions create dramatically different abundance patterns.