The probability of higher-than-median soil organic carbon increased by approximately 0.22 under sagebrush cover relative to interspace soils.
zilb_loglik <- function(y, size, pzero, pbinomial) {
bin_prob <- dbinom(y, size = size, prob = pbinomial)
prob <- ifelse( #If I saw zero Halosphaera, it could be because there were none at all or because I missed them. If I saw any Halosphaera, then I know they were present, and the zero-inflation part no longer applies.
y == 0,
pzero + (1 - pzero) * bin_prob,
(1 - pzero) * bin_prob #Halosphaera were present somewhere, but not in this tow.
)
sum(log(prob))
}
For my final project, I intend to examine how microsite vegetation influences soil organic carbon in a dryland ecosystem, using data from the Desert Detrital Input and Removal Treatment (DDIRT) experiment. Specifically, I will assess whether the presence of sagebrush is associated with an increased probability of higher soil organic carbon relative to interspace areas. The data consist of soil cores collected from multiple plots, with measurements of percent organic carbon and associated vegetation type. I plan to analyze these data using binomial and generalized linear modeling approaches to quantify differences in probability across microsites and to explore uncertainty in parameter estimates using likelihood-based methods.
ddirt <- read.csv("ddirt.csv") #the DDIRT data!
carbonmedian <- median(ddirt$percentOrgC_2023, na.rm = T) #deciding the cutoff for carbon values using median (0.9)
dirt <- data.frame(cbind(VegType=ddirt$vegtype, LitterType = ddirt$litter_treatment, Carbon = ddirt$percentOrgC_2023)) #make new data frame with just the 2023 data, easy to look at.
dirt <- drop_na(dirt) #drop the NAs in the data
MoreThan <- ifelse(dirt$Carbon > 0.9, "Yes", "No") #add binary to continuous carbon data
dirt <- cbind(dirt, MoreThan) #add the binaries back to the dataframe
dirt |> #make a table using the gtable package
gt() |>
data_color(column=Carbon, row = Carbon > 0.90 , palette = "#BCC2B2") |>
tab_header(
title = 'Desert Detrital Input and Removal Treatment Experiments (DDIRT)',
subtitle="VegType SB: Sage, IS: Interspace, Carbon: % Organic Carbon 2023")|>
tab_footnote(footnote = "Data from the DDIRT project has been used. The median value for % organic carbon in 2023 cores was determined, and values > 0.9 were labeled, 'Yes', all others, 'No' and placed in MoreThan column.") |>
opt_interactive(use_highlight = T, use_compact_mode = T)
ggplot(dirt, aes(VegType, as.numeric(Carbon)))+
labs(title="% organic carbon in 2023 soils of differing veg type", x="Vegetation Type", y="% Organic Carbon")+
geom_boxplot(fill=c("white", "#757F6E"))
The boxplot shows substantial overlap in % organic carbon between sagebrush and interspace soils, motivating the use of a binomial model to assess differences in probability rather than attempting to model continuous carbon values directly.
Response: MoreThan turned binary (y) Covariate: sage
WHICH PARAMETER IS BEST
#turning the words into numbers
dirt$y <- ifelse(dirt$MoreThan == "Yes", 1, 0)
dirt$sage <- ifelse(dirt$VegType == "SB", 1, 0)
Decide the ranges for \(b_0\) and \(b_1\).
b0_seq <- seq(-6, 6, by = 0.1)
b1_seq <- seq(-6, 6, by = 0.1)
#create grid
grid <- expand.grid(bo = b0_seq, b1 = b1_seq)
The \(plogis\) function will create a vector of probabilities p the same length as the data (127).
For each data point i:
\(p_i = plogis(b_0 + b_1*x_i)\)
and
\(log L = \sum log(BinomPMF(y_i | 1, p_i))\)
#create an empty column on teh grid and get ready for a forloop (forloop newb here)):
grid$loglik <- NA
for (i in 1:nrow(grid)) {
b0 <- grid$bo[i]
b1 <- grid$b1[i] #compute predicted probs for ALLLLL data points
p <- plogis(b0 + b1*dirt$sage)
#attach the probabilities back to the dirt dataframe:
grid$loglik[i] <- sum(dbinom(x=dirt$y, size=1, prob=p, log=T)) }
best_index <- which.max(grid$loglik)
best <- grid[best_index, ]
best
## bo b1 loglik
## 8405 -0.5 0.9 -85.0777
ggplot(grid, aes(b0, b1, fill = loglik))+
geom_raster()+
theme_minimal()+
labs(title="Parameter space heatmap (sum log-likelihood", x="Intercept (b0)", y="Slope (b1)", fill="Sum log-lik")+
scale_fill_gradient(low = "white", high = "#757769")
Which combinations of intercept (\(b_0\)) and slope (\(b_1\)) make my observed data most likely?
Color: Lighter = worse fit Darker = better fit
b0_hat <- best$bo #baseline situation: NO SAGE PRESENT
b1_hat <- best$b1
p_no_sage <- plogis(b0_hat) #The prob that soil carbon is high when there is no sagebrush. In the interspace areas, the model estimates 38% chance that soil organic carbon is above the median (0.9).
p_sage <- plogis(b0_hat+b1_hat) #the prob that soil carbon is high when sage is present. The model estimates a 60% chance of higher than median soil carbon.
odds <- exp(b1_hat) # the odds of having higher than 0.9 (median for soil carbon) are about 2.5 times larger under sagebrush cover than interspace areas.
# Sentence templates (fill in values you just printed):
# Intercept: When sage=0, P(high carbon) = p_no_sage
# Slope: When sage=1, P(high carbon) = p_sage (odds multiply by exp(b1_hat))
\(P(high C ∣ no sage)\) ~ \(0.38\)
\(P(high C ∣ sage)\) ~ \(0.60\)
In areas without sagebrush, soils have about a 38% chance of having higher-than-median organic carbon. Under sagebrush, that probability increases to about 60%. This suggests that sagebrush is associated with higher soil carbon, though the difference is not large enough to clearly separate sagebrush and interspace soils.
While soils beneath sagebrush show a small tendency toward higher carbon, carbon levels vary widely in both sagebrush and interspace areas, making it difficult to clearly separate the two based on vegetation type alone.
I am imagining this question using Halosphaera data (marine phytoplankton cell counts). These are data I collected over the past 2 years while living in Bellingham, WA.
In ecological count data, zero observations often arise from multiple underlying processes. In my previous phytoplankton sampling, many tows contained zero Halosphaera, reflecting days or locations where the organism was effectively absent from the sampled water mass due to spatial patchiness, phenology, or environmental conditions, even though the organism may have been present elsewhere in the bay. When Halosphaera were present, observed counts then varied according to sampling variability and local abundance.
To represent this two-stage biological process, a zero-inflated binomial likelihood can be constructed in which zeros arise from both an absence process and a binomial counting process, while non-zero counts arise only from the counting process.
zilb_loglik <- function(y, size, pzero, pbinomial) {
bin_prob <- dbinom(y, size = size, prob = pbinomial)
prob <- ifelse( #If I saw zero Halosphaera, it could be because there were none at all or because I missed them. If I saw any Halosphaera, then I know they were present, and the zero-inflation part no longer applies.
y == 0,
pzero + (1 - pzero) * bin_prob,
(1 - pzero) * bin_prob #Halosphaera were present somewhere, but not in this tow.
)
sum(log(prob))
}
Create vector with counts of success.
y <- c(0,0,1,0,2,0,0,1,0,5,0,1,0,2,0,3,0,1,0,2)
#halosphaera counts per tow
#zero can mean no halosphaera present or maybe they were somewhere else in the bay that day. a non-zero here means halosphaera were present.
zilb_loglik(
y = y,
size = 5,
pzero = 0.4, #What fraction of my tows happened on days when Halosphaera were absent?
pbinomial = 0.3 #On days when Halosphaera were present, how likely was it that any given ‘trial’ resulted in a count? captures variation in one day
)
## [1] -27.85718