HW2_DDIRT_Heatmap.knit

DDIRT

Questions

1 | Heatmap

The probability of higher-than-median soil organic carbon increased by approximately 0.22 under sagebrush cover relative to interspace soils.

2 | Zero-inflated Function

zilb_loglik <- function(y, size, pzero, pbinomial) {

  bin_prob <- dbinom(y, size = size, prob = pbinomial)

  prob <- ifelse( #If I saw zero Halosphaera, it could be because there were none at all or because I missed them. If I saw any Halosphaera, then I know they were present, and the zero-inflation part no longer applies.
    y == 0,
    pzero + (1 - pzero) * bin_prob,
    (1 - pzero) * bin_prob #Halosphaera were present somewhere, but not in this tow.
  )

  sum(log(prob))
}

3 | Project Pitch

For my final project, I intend to examine how microsite vegetation influences soil organic carbon in a dryland ecosystem, using data from the Desert Detrital Input and Removal Treatment (DDIRT) experiment. Specifically, I will assess whether the presence of sagebrush is associated with an increased probability of higher soil organic carbon relative to interspace areas. The data consist of soil cores collected from multiple plots, with measurements of percent organic carbon and associated vegetation type. I plan to analyze these data using binomial and generalized linear modeling approaches to quantify differences in probability across microsites and to explore uncertainty in parameter estimates using likelihood-based methods.

Data Exploration

Question: Does the presence of sagebrush increase the probability of higher soil organic carbon?

ddirt <- read.csv("ddirt.csv") #the DDIRT data!

carbonmedian <- median(ddirt$percentOrgC_2023, na.rm = T) #deciding the cutoff for carbon values using median (0.9)

dirt <- data.frame(cbind(VegType=ddirt$vegtype, LitterType = ddirt$litter_treatment, Carbon = ddirt$percentOrgC_2023)) #make new data frame with just the 2023 data, easy to look at. 

dirt <- drop_na(dirt) #drop the NAs in the data

MoreThan <- ifelse(dirt$Carbon > 0.9, "Yes", "No") #add binary to continuous carbon data

dirt <- cbind(dirt, MoreThan) #add the binaries back to the dataframe

dirt |> #make a table using the gtable package
  gt() |>
  data_color(column=Carbon, row = Carbon > 0.90 , palette = "#BCC2B2") |>
  tab_header(
    title = 'Desert Detrital Input and Removal Treatment Experiments (DDIRT)',
    subtitle="VegType SB: Sage, IS: Interspace, Carbon: % Organic Carbon 2023")|>
  tab_footnote(footnote = "Data from the DDIRT project has been used. The median value for % organic carbon in 2023 cores was determined, and values > 0.9 were labeled, 'Yes', all others, 'No' and placed in MoreThan column.") |>
  opt_interactive(use_highlight = T, use_compact_mode = T)

Desert Detrital Input and Removal Treatment Experiments (DDIRT)

VegType SB: Sage, IS: Interspace, Carbon: % Organic Carbon 2023

Data from the DDIRT project has been used. The median value for % organic carbon in 2023 cores was determined, and values > 0.9 were labeled, 'Yes', all others, 'No' and placed in MoreThan column.

ggplot(dirt, aes(VegType, as.numeric(Carbon)))+
  labs(title="% organic carbon in 2023 soils of differing veg type", x="Vegetation Type", y="% Organic Carbon")+
  geom_boxplot(fill=c("white", "#757F6E"))

The boxplot shows substantial overlap in % organic carbon between sagebrush and interspace soils, motivating the use of a binomial model to assess differences in probability rather than attempting to model continuous carbon values directly.

1 Heatmap

Response: MoreThan turned binary (y) Covariate: sage

WHICH PARAMETER IS BEST

1 | Build the Data

#turning the words into numbers
dirt$y <- ifelse(dirt$MoreThan == "Yes", 1, 0)

dirt$sage <- ifelse(dirt$VegType == "SB", 1, 0)

2 | Choose a grid of parameter values

Decide the ranges for \(b_0\) and \(b_1\).

b0_seq <- seq(-6, 6, by = 0.1)
b1_seq <- seq(-6, 6, by = 0.1)

#create grid
grid <- expand.grid(bo = b0_seq, b1 = b1_seq)

3 | Compute prediction probabilities

The \(plogis\) function will create a vector of probabilities p the same length as the data (127).

For each data point i:

\(p_i = plogis(b_0 + b_1*x_i)\)

and

4 | Compute sum of log-likelihoods

\(log L = \sum log(BinomPMF(y_i | 1, p_i))\)

#create an empty column on teh grid and get ready for a forloop (forloop newb here)):

grid$loglik <- NA

for (i in 1:nrow(grid)) {
  b0 <- grid$bo[i]
  b1 <- grid$b1[i] #compute predicted probs for ALLLLL data points
  p <- plogis(b0 + b1*dirt$sage)
  
  #attach the probabilities back to the dirt dataframe: 
  grid$loglik[i] <- sum(dbinom(x=dirt$y, size=1, prob=p, log=T)) }

5 | Find the best slope and intercept

best_index <- which.max(grid$loglik)

best <- grid[best_index, ]

best

##        bo  b1   loglik
## 8405 -0.5 0.9 -85.0777

6 | Create heatmap

ggplot(grid, aes(b0, b1, fill = loglik))+
  geom_raster()+
  theme_minimal()+
  labs(title="Parameter space heatmap (sum log-likelihood", x="Intercept (b0)", y="Slope (b1)", fill="Sum log-lik")+
  scale_fill_gradient(low = "white", high = "#757769")

7 | Interpretation of heatmap

Which combinations of intercept (\(b_0\)) and slope (\(b_1\)) make my observed data most likely?

Color: Lighter = worse fit Darker = better fit

b0_hat <- best$bo #baseline situation: NO SAGE PRESENT
b1_hat <- best$b1 

p_no_sage <- plogis(b0_hat) #The prob that soil carbon is high when there is no sagebrush. In the interspace areas, the model estimates 38% chance that soil organic carbon is above the median (0.9). 

p_sage <- plogis(b0_hat+b1_hat) #the prob that soil carbon is high when sage is present. The model estimates a 60% chance of higher than median soil carbon. 

odds <- exp(b1_hat) # the odds of having higher than 0.9 (median for soil carbon) are about 2.5 times larger under sagebrush cover than interspace areas. 

# Sentence templates (fill in values you just printed):
# Intercept: When sage=0, P(high carbon) = p_no_sage
# Slope: When sage=1, P(high carbon) = p_sage (odds multiply by exp(b1_hat))

\(P(high C ∣ no sage)\) ~ \(0.38\)

\(P(high C ∣ sage)\) ~ \(0.60\)

In areas without sagebrush, soils have about a 38% chance of having higher-than-median organic carbon. Under sagebrush, that probability increases to about 60%. This suggests that sagebrush is associated with higher soil carbon, though the difference is not large enough to clearly separate sagebrush and interspace soils.

8 | Sentence

While soils beneath sagebrush show a small tendency toward higher carbon, carbon levels vary widely in both sagebrush and interspace areas, making it difficult to clearly separate the two based on vegetation type alone.

2 Zero-inflated Function

1 | Imagine cell counts in marine phytoplankton tows

I am imagining this question using Halosphaera data (marine phytoplankton cell counts). These are data I collected over the past 2 years while living in Bellingham, WA.

In ecological count data, zero observations often arise from multiple underlying processes. In my previous phytoplankton sampling, many tows contained zero Halosphaera, reflecting days or locations where the organism was effectively absent from the sampled water mass due to spatial patchiness, phenology, or environmental conditions, even though the organism may have been present elsewhere in the bay. When Halosphaera were present, observed counts then varied according to sampling variability and local abundance.

To represent this two-stage biological process, a zero-inflated binomial likelihood can be constructed in which zeros arise from both an absence process and a binomial counting process, while non-zero counts arise only from the counting process.

2 | Create the function

zilb_loglik <- function(y, size, pzero, pbinomial) {

  bin_prob <- dbinom(y, size = size, prob = pbinomial)

  prob <- ifelse( #If I saw zero Halosphaera, it could be because there were none at all or because I missed them. If I saw any Halosphaera, then I know they were present, and the zero-inflation part no longer applies.
    y == 0,
    pzero + (1 - pzero) * bin_prob,
    (1 - pzero) * bin_prob #Halosphaera were present somewhere, but not in this tow.
  )

  sum(log(prob))
}

3 | Create a vector of data

Create vector with counts of success.

y <- c(0,0,1,0,2,0,0,1,0,5,0,1,0,2,0,3,0,1,0,2)
 #halosphaera counts per tow

#zero can mean no halosphaera present or maybe they were somewhere else in the bay that day. a non-zero here means halosphaera were present.

4 | Call the function

zilb_loglik(
  y = y,
  size = 5,
  pzero = 0.4, #What fraction of my tows happened on days when Halosphaera were absent?
  pbinomial = 0.3 #On days when Halosphaera were present, how likely was it that any given ‘trial’ resulted in a count? captures variation in one day
)

## [1] -27.85718