For this series of exercises, I used OpenAI’s ChatGPT to refine my writing, clarify my English phrasing and explain some of the questions I didn’t understand (with prompts such as ‘explain this {question} in a simplified way, with bullet points’). Additionally, I used Github Co-pilot’s code completion within Rstudio when I struggled to run my chunks of code multiple times. These chunks will be boxed in a red contour line. For smaller things, there is simply a comment for the line I needed help with.
For this exercise we use again the cancer data from the 3rd set which you can download here.
target_deathrate into a new variable,
deathrate_binary, that has value 0 if
target_deathrate is below its median and 1 otherwise. Train
a logistic regression model using incidencerate as
predictor and deathrate_binary as dependent variable. Plot
these two variables (incidencereate and
deathrate_binary) as well as the predicted logistic curve
(see also https://www.econometrics-with-r.org/11.2-palr.html).Hint: use the glm function to fit the
logistic regression and the predict and line
functions for the predicted logistic curve.
# Import data
cancer_reg <- read.csv("data/cancer_reg.csv", header = TRUE)
# Binarize target_deathrate
median_deathrate <- median(cancer_reg$target_deathrate, na.rm = TRUE)
cancer_reg$deathrate_binary <- ifelse(cancer_reg$target_deathrate >= median_deathrate, 1, 0)
# Logistic regression
logit_model <- glm(deathrate_binary ~ incidencerate, data = cancer_reg, family = binomial)
# Predicted logistic curve
cancer_reg$predicted_probs <- predict(logit_model, type = "response")
# Plot the data points
plot(cancer_reg$incidencerate, cancer_reg$deathrate_binary,
xlab = "Incidence Rate", ylab = "Deathrate Binary",
main = "Logistic Regression: Incidence Rate vs Deathrate Binary",
pch = 20, col = "red")
# Sort incidencerate for smoother line (Co-pilot suggested to do that)
sorted_indices <- order(cancer_reg$incidencerate)
sorted_incidencerate <- cancer_reg$incidencerate[sorted_indices]
sorted_predicted_probs <- cancer_reg$predicted_probs[sorted_indices]
# Add logistic regression curve
lines(sorted_incidencerate, sorted_predicted_probs, col = "blue", lwd = 2)
# Define new data points for prediction
new_data <- data.frame(incidencerate = c(300, 800))
# Predict the probabilities
predicted_values <- predict(logit_model, newdata = new_data, type = "response")
# Classify based on a cutoff of 0.5
classification <- ifelse(predicted_values >= 0.5, "High Death Rate", "Low Death Rate")
For an incidence rate of 300, the predicted probability is 0.1192475, which is below the cutoff of 0.5 and thus classified as ‘low death rate’. For the incidence rate of 800, the predicted probability is 0.9913219, which is above the cutoff of 0.5. This is classified as ‘high death rate’.
target_deathrate is between the 25th-75th percentiles of
target_deathrate itself. In other words, keep in the data
frame only the observations below the 25th percentile or above the 75th
percentile of target_deathrate (Hint: use
the quantile function of R). Repeat points 1a and 1b. How
does the new logistic curve look compared to the previously plotted one?
How were the predicted values for input incidence rates 300 and 800
impacted by the removal of the 25th-75th percentile data?# Extreme cases of death rate
q25 <- quantile(cancer_reg$target_deathrate, 0.25, na.rm = TRUE)
q75 <- quantile(cancer_reg$target_deathrate, 0.75, na.rm = TRUE)
# New data frame with extreme values
cancer_reg_extremes <- cancer_reg[cancer_reg$target_deathrate < q25 | cancer_reg$target_deathrate > q75, ]
# Binarize target_deathrate
median_deathrate_extremes <- median(cancer_reg_extremes$target_deathrate, na.rm = TRUE)
cancer_reg_extremes$deathrate_binary <- ifelse(cancer_reg_extremes$target_deathrate >= median_deathrate_extremes, 1, 0)
# Logistic regression
logit_model_extremes <- glm(deathrate_binary ~ incidencerate, data = cancer_reg_extremes, family = binomial)
# Predicted probabilities for incidence rate 300 and 800
predicted_values_extremes <- predict(logit_model_extremes, newdata = new_data, type = "response")
# Plot data points
plot(cancer_reg_extremes$incidencerate, cancer_reg_extremes$deathrate_binary,
xlab = "Incidence Rate", ylab = "Deathrate Binary",
main = "Logistic Regression: Extreme Cases",
pch = 20, col = "red")
# Sort incidence rate for smoother line
sorted_indices_extremes <- order(cancer_reg_extremes$incidencerate)
sorted_incidencerate_extremes <- cancer_reg_extremes$incidencerate[sorted_indices_extremes]
sorted_predicted_probs_extremes <- predict(logit_model_extremes, type = "response")[sorted_indices_extremes]
# Add the logistic regression line for extreme cases
lines(sorted_incidencerate_extremes, sorted_predicted_probs_extremes, col = "blue", lwd = 2)
I see that the logistic curve now seems steeper, this means that the relationship between incidence rate and death rate has become more “clear-cut”. A small change in the incidence rate leads to a big change in the probability of high death rate.
For an incidence rate of 300, the new predicted probability is 0.0370384, which is far below the cutoff of 0.5 and thus classified as ‘low death rate’. For the incidence rate of 800, the new predicted probability is 0.9996404, which is above the cutoff of 0.5. This is classified as ‘high death rate’. The removal of the 25th-75th quantiles impacted the predicted values and pushed them to the extremes. The low became lower and the high became higher. This makes sens since by removing these values we basically polarized our data. The model now only trains on extreme values, which makes it more extreme in its predictions as well.
Consider two populations of ground nesting birds. Preliminary data suggest that young chicks in population B have better chances of survival than those of population A, and you want to plan a study to investigate if the difference is real. The goal of this exercise is to generate and analyze in-silico data in order to obtain an idea about the population sizes that need to be studied for how long to observe a significant difference in survival.
the number of observation days: \(N_{\text{days}}\)
the daily chick survival probability \(P_{\text{surv}}\) (the expected fraction of chicks who survived between two observation time points)
the daily loss probability \(P_{\text{loss}}\) (the expected fraction of chicks who disappeared from the population between two observation time points for unknown reasons, so it is unknown if they are still alive or died)
The function should return a dataframe with one row for each chick with two columns:
observation time (in days)
event status: 0 = alive or lost, 1 = dead
To implement this function you should decide stochastically for each chick on a given observation day if it survived, died or got lost. Whenever a chick dies or is lost, its observation is terminated.
Task: Generate two sets of observation data for \(N_{\text{days}} = 20\) days corresponding to two populations of the same size (\(N_{\text{pop}} = 100\)) with different daily survival probabilities (\(P_{\text{surv}} = 0.8\) and \(0.95\), respectively) and the same loss probability (\(P_{\text{loss}} = 0.03\)). Print the two data frames.
# Function to simulate chick survival
simulate_chick_survival <- function(N_pop, N_days, P_surv, P_loss) {
# Create empty lists to store results
times <- numeric(N_pop) # Stores the day of the event
statuses <- numeric(N_pop) # Stores the event status
# Loop through each chick
for (i in 1:N_pop) {
# Observations for 1 chick
for (day in 1:N_days) {
# Figure out if the chick is lost
if (runif(1) < P_loss) { # Co-pilot suggested using runif()
times[i] <- day
statuses[i] <- 0 # Lost
break # Leave the loop because it's lost
}
# Simulate if the chick survives or dies
if (runif(1) > P_surv) {
times[i] <- day
statuses[i] <- 1 # Died
break # Leave the loop because it's dead
}
# If neither loss nor death, the chick survives this day
if (day == N_days) { # If true the chick survived all days
times[i] <- N_days # Recod the final day as the time
statuses[i] <- 0 # Still alive or lost
}
}
}
# Create a data frame with observation times and event status
data.frame(observation_time = times, event_status = statuses)
}
# Simulate two populations with different survival probabilities
N_pop <- 100
N_days <- 20
P_loss <- 0.03
P_surv_A <- 0.80
P_surv_B <- 0.95
# Population A
pop_A_data <- simulate_chick_survival(N_pop, N_days, P_surv_A, P_loss)
# Population B
pop_B_data <- simulate_chick_survival(N_pop, N_days, P_surv_B, P_loss)
# Print the two data frames
print(head(pop_A_data, 10)) # Print first 10 rows for Population A
## observation_time event_status
## 1 2 1
## 2 2 1
## 3 4 1
## 4 2 1
## 5 2 1
## 6 4 1
## 7 2 0
## 8 12 1
## 9 3 1
## 10 1 1
print(head(pop_B_data, 10)) # Print first 10 rows for Population B
## observation_time event_status
## 1 5 1
## 2 4 0
## 3 5 1
## 4 4 0
## 5 15 0
## 6 20 0
## 7 9 1
## 8 3 0
## 9 8 1
## 10 20 0
Hints: Install the survival package,
and inspect the Surv object. Use rbind to
combine the dataframes of the two populations, and use rep
to add a new column for group membership. Use survfit and
survdiff to plot the Kaplan-Meier curve and to perform the
log-rank test respectively.
# Load package
library(survival)
# Combine data for both populations
pop_A_data$group <- rep("Population A", N_pop)
pop_B_data$group <- rep("Population B", N_pop)
combined_data <- rbind(pop_A_data, pop_B_data)
# Create a Surv object
surv_object <- Surv(combined_data$observation_time, combined_data$event_status)
# Fit Kaplan-Meier survival curves for each group
km_fit <- survfit(surv_object ~ combined_data$group)
# Plot the Kaplan-Meier survival curves
plot(km_fit, col = c("red", "blue"),
xlab = "Days", ylab = "Survival Probability",
main = "Kaplan-Meier Survival Curves",
lwd = 2)
legend("topright", legend = c("Population A", "Population B"),
col = c("red", "blue"), lwd = 2)
# Apply the log-rank test to compare the two survival curves
log_rank_test <- survdiff(surv_object ~ combined_data$group)
# Print log-rank test result
print(log_rank_test)
## Call:
## survdiff(formula = surv_object ~ combined_data$group)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## combined_data$group=Population A 100 92 41.9 59.9 107
## combined_data$group=Population B 100 48 98.1 25.6 107
##
## Chisq= 108 on 1 degrees of freedom, p= <2e-16
I see that both population’s survival probability are decreasing during the whole time we observe them, but while the population B’s survival probability decreases slowly, the population A’s survival probability decreases rapidly and reaches 0% before the end of the observation. The log rank test is used to compare the survival distributions of 2 populations. ChatGPT explained to me what each of the parameters meant.
Based on the difference between the observed events (death) and the expected ones (\(\frac{(\text{observed}-\text{expected})^2}{\text{expected}}\)) coupled to a very low p-value of 3.5350657^{-25}. I conclude that there is no other choice than to reject the hypothesis that there is no difference in suvival distribution for these two populations, and that chicks from the population B indeed have better chances of survival than the ones from population A.
Hint: Use coxph to build the model.
# Cox proportional hazards model
cox_model <- coxph(surv_object ~ combined_data$group)
Yes, the Cox proportional hazard model shows similar results. We see that the hazard ratio for Population B compared to Population A is a lot smaller than 1, meaning that chicks from Population B have a lot less chances of dying than the ones in Population A, here the models predicts that the risk of death of chicks from Population B is around 20% of the risk of death of the chicks from Population A. We also see that the 95% confidence interval is very far from 1, meaning that even there is almost no chance that these results were obtained by pure luck. Finally, we see that all the p-values from three different tests, including the log rank test are very close to 0, indicating that there is indeed a difference betweent the survival chance of chicks from Population A and Population B, that the chicks from Population B have higher chances of surviving.
\(N_{\text{pop}} = 100\), \(95\), \(90\), … Is there a sample size at which you fail to see a significant difference in survival? Repeat the same experiments for \(P_{\text{surv}} = 0.85\) and \(0.95\) for the two populations. Comment on your results!
# Function to perform analysis
perform_analysis <- function(N_pop, P_surv_A, P_surv_B, P_loss, N_days) {
# Data Population A
pop_A_data <- simulate_chick_survival(N_pop, N_days, P_surv_A, P_loss)
pop_A_data$group <- rep("Population A", N_pop)
# Data Population B
pop_B_data <- simulate_chick_survival(N_pop, N_days, P_surv_B, P_loss)
pop_B_data$group <- rep("Population B", N_pop)
# Combine data
combined_data <- rbind(pop_A_data, pop_B_data)
# Surv object
surv_object <- Surv(combined_data$observation_time,
combined_data$event_status)
# Fit Kaplan-Meier
km_fit <- survfit(surv_object ~ combined_data$group)
# Log-rank test
log_rank_test <- survdiff(surv_object ~ combined_data$group)
# Cox proportional hazards model
cox_model <- coxph(surv_object ~ combined_data$group)
# Return results
list(km_fit = km_fit, log_rank_test = log_rank_test, cox_model = cox_model)
}
# Set parameters
P_loss <- 0.03
N_days <- 20
P_surv_A_values <- c(0.85, 0.95)
P_surv_B_values <- c(0.95, 0.85)
sample_sizes <- seq(100, 5, by = -5)
# Initialize results
results <- data.frame(SampleSize = integer(),
P_Surv_A = numeric(),
P_Surv_B = numeric(),
LogRank_p_value = numeric(),
Significant = logical(),
Hazard_Ratio = numeric())
# Conduct the analysis for varying sample sizes
for (N_pop in sample_sizes) {
for (i in 1:length(P_surv_A_values)) {
P_surv_A <- P_surv_A_values[i]
P_surv_B <- P_surv_B_values[i]
analysis_results <- perform_analysis(N_pop,
P_surv_A,
P_surv_B,
P_loss,
N_days)
# Extract log-rank p-value and hazard ratio (Co-Pilot told me how to)
log_rank_p_value <- analysis_results$log_rank_test$p[1]
hazard_ratio <- exp(coef(analysis_results$cox_model)[1])
# Determine significance
significant <- log_rank_p_value < 0.05
# Store results
results <- rbind(results, data.frame(SampleSize = N_pop,
P_Surv_A = P_surv_A,
P_Surv_B = P_surv_B,
LogRank_p_value = log_rank_p_value,
Significant = significant,
Hazard_Ratio = hazard_ratio))
}
}
I asked ChatGPT to put the results in a mardown table so that they would be easier to analyze:
| Sample Size | P_Surv_A | P_Surv_B | LogRank_p_value | Significant | Hazard_Ratio |
|---|---|---|---|---|---|
| 100 | 0.85 | 0.95 | 2.812351e-07 | TRUE | 0.3986257 |
| 100 | 0.95 | 0.85 | 9.982393e-18 | TRUE | 4.9368533 |
| 95 | 0.85 | 0.95 | 4.481588e-09 | TRUE | 0.3321471 |
| 95 | 0.95 | 0.85 | 2.242478e-19 | TRUE | 5.2235494 |
| 90 | 0.85 | 0.95 | 5.168561e-09 | TRUE | 0.3257784 |
| 90 | 0.95 | 0.85 | 2.528762e-09 | TRUE | 2.9063339 |
| 85 | 0.85 | 0.95 | 4.746107e-08 | TRUE | 0.3504010 |
| 85 | 0.95 | 0.85 | 1.094125e-06 | TRUE | 2.5429018 |
| 80 | 0.85 | 0.95 | 1.600992e-09 | TRUE | 0.2813703 |
| 80 | 0.95 | 0.85 | 3.093697e-11 | TRUE | 4.0058033 |
| 75 | 0.85 | 0.95 | 2.569515e-07 | TRUE | 0.3093675 |
| 75 | 0.95 | 0.85 | 1.922076e-07 | TRUE | 2.8053062 |
| 70 | 0.85 | 0.95 | 3.184703e-06 | TRUE | 0.3755245 |
| 70 | 0.95 | 0.85 | 5.796793e-06 | TRUE | 2.5867076 |
| 65 | 0.85 | 0.95 | 3.258108e-11 | TRUE | 0.2143716 |
| 65 | 0.95 | 0.85 | 3.075501e-11 | TRUE | 4.9509234 |
| 60 | 0.85 | 0.95 | 1.058994e-06 | TRUE | 0.3188875 |
| 60 | 0.95 | 0.85 | 6.419377e-08 | TRUE | 3.4674736 |
| 55 | 0.85 | 0.95 | 7.755740e-06 | TRUE | 0.3437639 |
| 55 | 0.95 | 0.85 | 3.377469e-04 | TRUE | 2.3648032 |
| 50 | 0.85 | 0.95 | 2.386045e-05 | TRUE | 0.3495918 |
| 50 | 0.95 | 0.85 | 5.478384e-09 | TRUE | 4.8218152 |
| 45 | 0.85 | 0.95 | 5.070939e-09 | TRUE | 0.2144768 |
| 45 | 0.95 | 0.85 | 1.254435e-03 | TRUE | 2.2388702 |
| 40 | 0.85 | 0.95 | 5.553948e-06 | TRUE | 0.2723527 |
| 40 | 0.95 | 0.85 | 1.445351e-04 | TRUE | 3.1360523 |
| 35 | 0.85 | 0.95 | 0.01293186 | TRUE | 0.4716814 |
| 35 | 0.95 | 0.85 | 4.049820e-05 | TRUE | 3.0824715 |
| 30 | 0.85 | 0.95 | 1.343616e-03 | TRUE | 0.3580566 |
| 30 | 0.95 | 0.85 | 1.450126e-04 | TRUE | 3.0984165 |
| 25 | 0.85 | 0.95 | 3.416597e-04 | TRUE | 0.2646231 |
| 25 | 0.95 | 0.85 | 6.140147e-03 | TRUE | 2.6877334 |
| 20 | 0.85 | 0.95 | 4.222875e-03 | TRUE | 0.2907575 |
| 20 | 0.95 | 0.85 | 0.01205016 | TRUE | 2.8011101 |
| 15 | 0.85 | 0.95 | 0.04115161 | TRUE | 0.3942332 |
| 15 | 0.95 | 0.85 | 0.01620333 | TRUE | 3.1423553 |
| 10 | 0.85 | 0.95 | 0.6509290 | FALSE | 0.7998961 |
| 10 | 0.95 | 0.85 | 0.1145348 | FALSE | 2.7692377 |
| 5 | 0.85 | 0.95 | 0.02727102 | TRUE | 0.1299541 |
| 5 | 0.95 | 0.85 | 0.3034837 | FALSE | 3.1819433 |
As sample sizes decreased from \(N_{\text{pop}} = 100\) to \(N_{\text{pop}} = 5\), our ability to detect statistically significant differences in survival rates diminished. The table shows that when the population size drops below \(N_{\text{pop}} \leq 10\), the log-rank test p-value becomes non-significant, indicating reduced statistical power. This suggests that smaller sample sizes increase the variability of results and may mask true differences between populations. This makes complete sens since with smaller sample size you are more subject to stochastic effects masking true differences between populations.
A plant biologist wants to study seedling growth. In the experiment the diameter of the seedlings is measured daily on five consecutive days. From preliminary data it appears that the diameter seems to increase as a linear function in time, but there is substantial variation in the initial diameter and growth rate. In order to help the biologist to get a feeling for how many seedlings should be studied you offer to do some simulations:
Number of timepoints of the experiment (same for all seedlings): \(N_t\)
The expected seedling diameter \(\mu_d\) and its variance \(\sigma_d^2\) across seedlings at the first day of measurements
The expected (linear) growth rate \(\mu_r\) and its variance \(\sigma_r^2\) across seedlings
The measurement error \(\varepsilon\) which is modeled to be Gaussian with variance \(\sigma_{\varepsilon}^2\)
The output of the function is a \(N_s \times N_t\) data sheet of simulated diameter measurements with one row for each seedling and one column for each time point \(t\). To this end first draw for each seedling its initial diameter \(d_0\) from a normal distribution \(\mathcal{N}(\mu_d , \sigma_d)\) and its growth rate \(r\) from \(\mathcal{N}(\mu_r, \sigma_r)\). Then assume a linear model to generate the diameter measurements, adding measurement errors drawn from \(\mathcal{N}(0, \sigma_{\varepsilon})\):
\[ d(t) = d_0 + r t + \varepsilon \]
Test the function by generating measurements for \(N_s = 10\), \(N_t = 5\), \(\mu_d = 1\)mm, \(\sigma_d = 0.2\)mm, \(\mu_r = 0.5\)mm/day, \(\sigma_r = 0.1\)mm/day, \(\varepsilon = 0.1\)mm. Plot the corresponding data as time courses.
# Load necessary library for plotting
library(ggplot2)
# Function to simulate seedling growth
simulate_seedling_growth <- function(N_s, N_t, mu_d, sigma_d, mu_r, sigma_r, epsilon) {
# Initialize a matrix to store the results
measurements <- matrix(0, nrow = N_s, ncol = N_t)
# Generate initial diameters and growth rates for each seedling
initial_diameters <- rnorm(N_s, mean = mu_d, sd = sigma_d)
growth_rates <- rnorm(N_s, mean = mu_r, sd = sigma_r)
# Simulate measurements for each time point
for (t in 1:N_t) {
# Calculate diameter with measurement error
measurement_error <- rnorm(N_s, mean = 0, sd = epsilon)
measurements[, t] <- initial_diameters + growth_rates * (t - 1) + measurement_error
}
return(measurements)
}
# Parameters for the simulation
N_s <- 10
N_t <- 5
mu_d <- 1
sigma_d <- 0.2
mu_r <- 0.5
sigma_r <- 0.1
epsilon <- 0.1
# Run the simulation
seedling_data <- simulate_seedling_growth(N_s, N_t, mu_d, sigma_d, mu_r, sigma_r, epsilon)
# Convert the matrix into data frame (Co-Pilot helped me)
time_points <- 1:N_t
seedling_df <- data.frame(time = rep(time_points, each = N_s),
seedling = rep(1:N_s, times = N_t),
diameter = as.vector(seedling_data))
# Plot the results
# Made it without using ggplot but asked Co-pilot to make it prettier
ggplot(seedling_df, aes(x = time,
y = diameter,
group = seedling,
color = as.factor(seedling))) +
geom_line() +
geom_point() +
labs(title = "Simulated Seedling Growth Over Time",
x = "Time (days)",
y = "Diameter (mm)",
color = "Seedling") +
theme_minimal()
# Fit linear models to each seedling's growth data
model_results <- list() # List to store models
d0_estimates <- numeric(N_s) # Vector to store initial diameter estimates
r_estimates <- numeric(N_s) # Vector to store growth rate estimates
for (i in 1:N_s) {
# Extract the diameter measurements for the current seedling
seedling_measurements <- seedling_data[i, ]
time_points <- 0:(N_t - 1) # Time starts at 0
# Fit a linear model: diameter ~ time
model <- lm(seedling_measurements ~ time_points)
model_results[[i]] <- model
# Store estimates
d0_estimates[i] <- coef(model)[1] # Intercept (initial diameter)
r_estimates[i] <- coef(model)[2] # Slope (growth rate)
}
# Compute sample mean and variance for d0 and r
mean_d0 <- mean(d0_estimates)
var_d0 <- var(d0_estimates)
mean_r <- mean(r_estimates)
var_r <- var(r_estimates)
# True parameters
true_mu_d <- mu_d
true_sigma_d2 <- sigma_d^2
true_mu_r <- mu_r
true_sigma_r2 <- sigma_r^2
# Compute absolute and relative differences
abs_diff_d0 <- abs(mean_d0 - true_mu_d)
rel_diff_d0 <- abs_diff_d0 / true_mu_d
abs_diff_r <- abs(mean_r - true_mu_r)
rel_diff_r <- abs_diff_r / true_mu_r
# Display results
results <- data.frame(
Parameter = c("Initial Diameter (d0)", "Growth Rate (r)"),
Estimated_Mean = c(mean_d0, mean_r),
Estimated_Variance = c(var_d0, var_r),
True_Mean = c(true_mu_d, true_mu_r),
True_Variance = c(true_sigma_d2, true_sigma_r2),
Abs_Difference_Mean = c(abs_diff_d0, abs_diff_r),
Rel_Difference_Mean = c(rel_diff_d0, rel_diff_r)
)
| Parameter | Estimated Mean | Estimated Variance | True Mean | True Variance | Abs Difference (Mean) | Rel Difference (Mean) |
|---|---|---|---|---|---|---|
| Initial Diameter (\(d_0\)) | 1.080041 | 0.05370362 | 1.0 | 0.04 | 0.08004098 | 0.08004098 |
| Growth Rate (\(r\)) | 0.469813 | 0.01583878 | 0.5 | 0.01 | 0.03018693 | 0.06037386 |
The estimated means for both the initial diameter and growth rate are fairly close to the true values, with small absolute and relative differences, indicating that the linear model effectively captures the parameters from the simulated data. The variance of the estimates is slightly higher than the true variance, likely due to the small sample size (N = 10). However, this difference might be too small to be considered important.
pivot_longer function of the
tidyr library provides a simple way to transform your \(N_s \times N_t\) data sheet into the
required format.# Load necessary libraries
library(dplyr) # for the mutate() function
library(tidyr)
# Convert the seedling data to a data frame
seedling_data <- as.data.frame(seedling_data)
# Reshape the data from wide to long format
seedling_long <- pivot_longer(
seedling_data,
cols = everything(),
names_to = "Timepoint",
values_to = "Diameter")
# Clean up the Timepoint column to make it numeric
seedling_long$Timepoint <- as.numeric(gsub("V", "", seedling_long$Timepoint))
# Fit a linear model using the reshaped data
combined_model <- lm(Diameter ~ Timepoint, data = seedling_long)
# Get the coefficients from the model
d0_combined <- coef(combined_model)[1] # Initial diameter (intercept)
r_combined <- coef(combined_model)[2] # Growth rate (slope)
# Calculate the absolute difference from the true values
abs_diff_d0_combined <- abs(d0_combined - mu_d)
rel_diff_d0_combined <- abs_diff_d0_combined / mu_d # Relative difference
abs_diff_r_combined <- abs(r_combined - mu_r)
rel_diff_r_combined <- abs_diff_r_combined / mu_r # Relative difference
# Create a data frame to store results
results_combined <- data.frame(
Parameter = c("Initial Diameter (d0)", "Growth Rate (r)"),
Estimated_Value = c(d0_combined, r_combined),
True_Value = c(mu_d, mu_r),
Abs_Difference = c(abs_diff_d0_combined, abs_diff_r_combined),
Rel_Difference = c(rel_diff_d0_combined, rel_diff_r_combined))
# Count the number of parameters estimated from the model
num_parameters <- length(coef(combined_model))
| Parameter | Estimated Value | True Value | Abs Difference | Rel Difference |
|---|---|---|---|---|
| Initial Diameter (\(d_0\)) | 0.6102279 | 1.0 | 0.38977209 | 0.38977209 |
| Growth Rate (\(r\)) | 0.4698131 | 0.5 | 0.03018693 | 0.06037386 |
In this approach, we model all seedlings using a single linear regression rather than individual models for each seedling. The number of parameters estimated in this combined model is 2: one for the initial diameter (\(d_0\)) and one for the growth rate (\(r\)). The estimates are further from the true values than the previous model. By fitting a single linear model to all seedlings, the approach assumes that all seedlings grow at the same rate and start from the same size. Which isn’t a great idea since the plant biologist said that the preliminary data indicated that there was “substantial variation in the initial diameter and growth rate”. This explains why the first model we did fitted the true values much better.
Hint: One way to estimate errors is doing repeated experiments.
Here, we will create two additional models that account for the substantial variation in both the initial diameter (intercept) and growth rate (slope) of the seedlings. I attempted to implement this in R and compare the estimates with the true values, but at that point, most of my code had been generated using Co-Pilot, so I won’t include it here. However, I believe that using random slope and intercept models is a suitable approach for capturing the variations in both the diameter and growth rate of the seedlings. I think the estimated values would be the closest to the true values in this model.