0. Acknowledgment

For this series of exercises, I used OpenAI’s ChatGPT to refine my writing, clarify my English phrasing and explain some of the questions I didn’t understand (with prompts such as ‘explain this {question} in a simplified way, with bullet points’). Additionally, I used Github Co-pilot’s code completion within Rstudio when I struggled to run my chunks of code multiple times. These chunks will be boxed in a red contour line. For smaller things, there is simply a comment for the line I needed help with.

1. Logistic regressions (1.5 points max - 0.5 for each subtask)

For this exercise we use again the cancer data from the 3rd set which you can download here.

Binarize target_deathrate into a new variable, deathrate_binary, that has value 0 if target_deathrate is below its median and 1 otherwise. Train a logistic regression model using incidencerate as predictor and deathrate_binary as dependent variable. Plot these two variables (incidencereate and deathrate_binary) as well as the predicted logistic curve (see also https://www.econometrics-with-r.org/11.2-palr.html).

Hint: use the glm function to fit the logistic regression and the predict and line functions for the predicted logistic curve.

# Import data
cancer_reg <- read.csv("data/cancer_reg.csv", header = TRUE)

# Binarize target_deathrate
median_deathrate <- median(cancer_reg$target_deathrate, na.rm = TRUE)
cancer_reg$deathrate_binary <- ifelse(cancer_reg$target_deathrate >= median_deathrate, 1, 0)

# Logistic regression
logit_model <- glm(deathrate_binary ~ incidencerate, data = cancer_reg, family = binomial)

# Predicted logistic curve
cancer_reg$predicted_probs <- predict(logit_model, type = "response")

# Plot the data points
plot(cancer_reg$incidencerate, cancer_reg$deathrate_binary, 
     xlab = "Incidence Rate", ylab = "Deathrate Binary", 
     main = "Logistic Regression: Incidence Rate vs Deathrate Binary",
     pch = 20, col = "red")

# Sort incidencerate for smoother line (Co-pilot suggested to do that)
sorted_indices <- order(cancer_reg$incidencerate)
sorted_incidencerate <- cancer_reg$incidencerate[sorted_indices]
sorted_predicted_probs <- cancer_reg$predicted_probs[sorted_indices]

# Add logistic regression curve
lines(sorted_incidencerate, sorted_predicted_probs, col = "blue", lwd = 2)

Compute the predicted value for input incidence rates of 300 and 800. With a probability cutoff of 0.5, would these two incidence rates (300 and 800) be classified as “high death rate” or “low death rate”?

# Define new data points for prediction
new_data <- data.frame(incidencerate = c(300, 800))

# Predict the probabilities
predicted_values <- predict(logit_model, newdata = new_data, type = "response")

# Classify based on a cutoff of 0.5
classification <- ifelse(predicted_values >= 0.5, "High Death Rate", "Low Death Rate")

For an incidence rate of 300, the predicted probability is 0.1192475, which is below the cutoff of 0.5 and thus classified as ‘low death rate’. For the incidence rate of 800, the predicted probability is 0.9913219, which is above the cutoff of 0.5. This is classified as ‘high death rate’.

Let’s now investigate extreme cases of the death rate. First, remove from your data.frame all the observations whose target_deathrate is between the 25th-75th percentiles of target_deathrate itself. In other words, keep in the data frame only the observations below the 25th percentile or above the 75th percentile of target_deathrate (Hint: use the quantile function of R). Repeat points 1a and 1b. How does the new logistic curve look compared to the previously plotted one? How were the predicted values for input incidence rates 300 and 800 impacted by the removal of the 25th-75th percentile data?

# Extreme cases of death rate
q25 <- quantile(cancer_reg$target_deathrate, 0.25, na.rm = TRUE)
q75 <- quantile(cancer_reg$target_deathrate, 0.75, na.rm = TRUE)

# New data frame with extreme values
cancer_reg_extremes <- cancer_reg[cancer_reg$target_deathrate < q25 | cancer_reg$target_deathrate > q75, ]

# Binarize target_deathrate
median_deathrate_extremes <- median(cancer_reg_extremes$target_deathrate, na.rm = TRUE)
cancer_reg_extremes$deathrate_binary <- ifelse(cancer_reg_extremes$target_deathrate >= median_deathrate_extremes, 1, 0)

# Logistic regression
logit_model_extremes <- glm(deathrate_binary ~ incidencerate, data = cancer_reg_extremes, family = binomial)

# Predicted probabilities for incidence rate 300 and 800
predicted_values_extremes <- predict(logit_model_extremes, newdata = new_data, type = "response")

# Plot data points
plot(cancer_reg_extremes$incidencerate, cancer_reg_extremes$deathrate_binary, 
     xlab = "Incidence Rate", ylab = "Deathrate Binary", 
     main = "Logistic Regression: Extreme Cases",
     pch = 20, col = "red")

# Sort incidence rate for smoother line
sorted_indices_extremes <- order(cancer_reg_extremes$incidencerate)
sorted_incidencerate_extremes <- cancer_reg_extremes$incidencerate[sorted_indices_extremes]
sorted_predicted_probs_extremes <- predict(logit_model_extremes, type = "response")[sorted_indices_extremes]

# Add the logistic regression line for extreme cases
lines(sorted_incidencerate_extremes, sorted_predicted_probs_extremes, col = "blue", lwd = 2)

I see that the logistic curve now seems steeper, this means that the relationship between incidence rate and death rate has become more “clear-cut”. A small change in the incidence rate leads to a big change in the probability of high death rate.

For an incidence rate of 300, the new predicted probability is 0.0370384, which is far below the cutoff of 0.5 and thus classified as ‘low death rate’. For the incidence rate of 800, the new predicted probability is 0.9996404, which is above the cutoff of 0.5. This is classified as ‘high death rate’. The removal of the 25th-75th quantiles impacted the predicted values and pushed them to the extremes. The low became lower and the high became higher. This makes sens since by removing these values we basically polarized our data. The model now only trains on extreme values, which makes it more extreme in its predictions as well.

2. Survival analysis (2 points max - 0.5 for each subtask)

Consider two populations of ground nesting birds. Preliminary data suggest that young chicks in population B have better chances of survival than those of population A, and you want to plan a study to investigate if the difference is real. The goal of this exercise is to generate and analyze in-silico data in order to obtain an idea about the population sizes that need to be studied for how long to observe a significant difference in survival.

Create a function that simulates chick survival. The function has four input variables: i.the size of the population: \(N_{\text{pop}}\)

the number of observation days: \(N_{\text{days}}\)
the daily chick survival probability \(P_{\text{surv}}\) (the expected fraction of chicks who survived between two observation time points)
the daily loss probability \(P_{\text{loss}}\) (the expected fraction of chicks who disappeared from the population between two observation time points for unknown reasons, so it is unknown if they are still alive or died)

The function should return a dataframe with one row for each chick with two columns:

observation time (in days)
event status: 0 = alive or lost, 1 = dead

To implement this function you should decide stochastically for each chick on a given observation day if it survived, died or got lost. Whenever a chick dies or is lost, its observation is terminated.

Task: Generate two sets of observation data for \(N_{\text{days}} = 20\) days corresponding to two populations of the same size (\(N_{\text{pop}} = 100\)) with different daily survival probabilities (\(P_{\text{surv}} = 0.8\) and \(0.95\), respectively) and the same loss probability (\(P_{\text{loss}} = 0.03\)). Print the two data frames.

# Function to simulate chick survival
simulate_chick_survival <- function(N_pop, N_days, P_surv, P_loss) {
  
  # Create empty lists to store results
  times <- numeric(N_pop)  # Stores the day of the event
  statuses <- numeric(N_pop)  # Stores the event status
  
  # Loop through each chick
  for (i in 1:N_pop) {
    # Observations for 1 chick
    for (day in 1:N_days) {
      
      # Figure out if the chick is lost
      if (runif(1) < P_loss) { # Co-pilot suggested using runif()
        times[i] <- day
        statuses[i] <- 0  # Lost
        break # Leave the loop because it's lost
      }
      
      # Simulate if the chick survives or dies
      if (runif(1) > P_surv) {
        times[i] <- day
        statuses[i] <- 1  # Died
        break # Leave the loop because it's dead
      }
      
      # If neither loss nor death, the chick survives this day
      
      if (day == N_days) { # If true the chick survived all days 
        times[i] <- N_days # Recod the final day as the time
        statuses[i] <- 0  # Still alive or lost
      }
    }
  }
  
  # Create a data frame with observation times and event status
  data.frame(observation_time = times, event_status = statuses)
}

# Simulate two populations with different survival probabilities
N_pop <- 100
N_days <- 20
P_loss <- 0.03
P_surv_A <- 0.80
P_surv_B <- 0.95

# Population A
pop_A_data <- simulate_chick_survival(N_pop, N_days, P_surv_A, P_loss)

# Population B
pop_B_data <- simulate_chick_survival(N_pop, N_days, P_surv_B, P_loss)

# Print the two data frames
print(head(pop_A_data, 10))  # Print first 10 rows for Population A

##    observation_time event_status
## 1                 2            1
## 2                 2            1
## 3                 4            1
## 4                 2            1
## 5                 2            1
## 6                 4            1
## 7                 2            0
## 8                12            1
## 9                 3            1
## 10                1            1

print(head(pop_B_data, 10))  # Print first 10 rows for Population B

##    observation_time event_status
## 1                 5            1
## 2                 4            0
## 3                 5            1
## 4                 4            0
## 5                15            0
## 6                20            0
## 7                 9            1
## 8                 3            0
## 9                 8            1
## 10               20            0

Plot survival probability as a function of observation days for both groups in terms of a Kaplan-Meier Curve. Comment on what you observe. Apply a log-rank test. What do you conclude?

Hints: Install the survival package, and inspect the Surv object. Use rbind to combine the dataframes of the two populations, and use rep to add a new column for group membership. Use survfit and survdiff to plot the Kaplan-Meier curve and to perform the log-rank test respectively.

# Load package
library(survival)

# Combine data for both populations
pop_A_data$group <- rep("Population A", N_pop)
pop_B_data$group <- rep("Population B", N_pop)
combined_data <- rbind(pop_A_data, pop_B_data)

# Create a Surv object
surv_object <- Surv(combined_data$observation_time, combined_data$event_status)

# Fit Kaplan-Meier survival curves for each group
km_fit <- survfit(surv_object ~ combined_data$group)

# Plot the Kaplan-Meier survival curves
plot(km_fit, col = c("red", "blue"), 
     xlab = "Days", ylab = "Survival Probability", 
     main = "Kaplan-Meier Survival Curves", 
     lwd = 2)
legend("topright", legend = c("Population A", "Population B"), 
       col = c("red", "blue"), lwd = 2)

# Apply the log-rank test to compare the two survival curves
log_rank_test <- survdiff(surv_object ~ combined_data$group)

# Print log-rank test result
print(log_rank_test)

## Call:
## survdiff(formula = surv_object ~ combined_data$group)
## 
##                                    N Observed Expected (O-E)^2/E (O-E)^2/V
## combined_data$group=Population A 100       92     41.9      59.9       107
## combined_data$group=Population B 100       48     98.1      25.6       107
## 
##  Chisq= 108  on 1 degrees of freedom, p= <2e-16

I see that both population’s survival probability are decreasing during the whole time we observe them, but while the population B’s survival probability decreases slowly, the population A’s survival probability decreases rapidly and reaches 0% before the end of the observation. The log rank test is used to compare the survival distributions of 2 populations. ChatGPT explained to me what each of the parameters meant.

Based on the difference between the observed events (death) and the expected ones (\(\frac{(\text{observed}-\text{expected})^2}{\text{expected}}\)) coupled to a very low p-value of 3.5350657^{-25}. I conclude that there is no other choice than to reject the hypothesis that there is no difference in suvival distribution for these two populations, and that chicks from the population B indeed have better chances of survival than the ones from population A.

Build a Cox proportional hazard model of the same data. Does the estimated hazard ratio correspond to your expectation?

Hint: Use coxph to build the model.

# Cox proportional hazards model
cox_model <- coxph(surv_object ~ combined_data$group)

Yes, the Cox proportional hazard model shows similar results. We see that the hazard ratio for Population B compared to Population A is a lot smaller than 1, meaning that chicks from Population B have a lot less chances of dying than the ones in Population A, here the models predicts that the risk of death of chicks from Population B is around 20% of the risk of death of the chicks from Population A. We also see that the 95% confidence interval is very far from 1, meaning that even there is almost no chance that these results were obtained by pure luck. Finally, we see that all the p-values from three different tests, including the log rank test are very close to 0, indicating that there is indeed a difference betweent the survival chance of chicks from Population A and Population B, that the chicks from Population B have higher chances of surviving.

Conduct an in-silico experiment repeating the above analysis reducing the sample sizes by steps of 5:

\(N_{\text{pop}} = 100\), \(95\), \(90\), … Is there a sample size at which you fail to see a significant difference in survival? Repeat the same experiments for \(P_{\text{surv}} = 0.85\) and \(0.95\) for the two populations. Comment on your results!

# Function to perform analysis
perform_analysis <- function(N_pop, P_surv_A, P_surv_B, P_loss, N_days) {
  
  # Data Population A
  pop_A_data <- simulate_chick_survival(N_pop, N_days, P_surv_A, P_loss)
  pop_A_data$group <- rep("Population A", N_pop)
  
  # Data Population B
  pop_B_data <- simulate_chick_survival(N_pop, N_days, P_surv_B, P_loss)
  pop_B_data$group <- rep("Population B", N_pop)
  
  # Combine data
  combined_data <- rbind(pop_A_data, pop_B_data)
  
  # Surv object
  surv_object <- Surv(combined_data$observation_time,
                      combined_data$event_status)
  
  # Fit Kaplan-Meier
  km_fit <- survfit(surv_object ~ combined_data$group)
  
  # Log-rank test
  log_rank_test <- survdiff(surv_object ~ combined_data$group)
  
  # Cox proportional hazards model
  cox_model <- coxph(surv_object ~ combined_data$group)
  
  # Return results
  list(km_fit = km_fit, log_rank_test = log_rank_test, cox_model = cox_model)
}

# Set parameters
P_loss <- 0.03
N_days <- 20
P_surv_A_values <- c(0.85, 0.95)
P_surv_B_values <- c(0.95, 0.85)
sample_sizes <- seq(100, 5, by = -5)

# Initialize results
results <- data.frame(SampleSize = integer(), 
                      P_Surv_A = numeric(), 
                      P_Surv_B = numeric(), 
                      LogRank_p_value = numeric(), 
                      Significant = logical(), 
                      Hazard_Ratio = numeric())

# Conduct the analysis for varying sample sizes
for (N_pop in sample_sizes) {
  for (i in 1:length(P_surv_A_values)) {
    P_surv_A <- P_surv_A_values[i]
    P_surv_B <- P_surv_B_values[i]
    
    analysis_results <- perform_analysis(N_pop,
                                         P_surv_A,
                                         P_surv_B,
                                         P_loss,
                                         N_days)
    
    # Extract log-rank p-value and hazard ratio (Co-Pilot told me how to)
    log_rank_p_value <- analysis_results$log_rank_test$p[1]
    hazard_ratio <- exp(coef(analysis_results$cox_model)[1])
    
    # Determine significance
    significant <- log_rank_p_value < 0.05
    
    # Store results
    results <- rbind(results, data.frame(SampleSize = N_pop, 
                                          P_Surv_A = P_surv_A, 
                                          P_Surv_B = P_surv_B, 
                                          LogRank_p_value = log_rank_p_value, 
                                          Significant = significant, 
                                          Hazard_Ratio = hazard_ratio))
  }
}

I asked ChatGPT to put the results in a mardown table so that they would be easier to analyze:

Sample Size	P_Surv_A	P_Surv_B	LogRank_p_value	Significant	Hazard_Ratio
100	0.85	0.95	2.812351e-07	TRUE	0.3986257
100	0.95	0.85	9.982393e-18	TRUE	4.9368533
95	0.85	0.95	4.481588e-09	TRUE	0.3321471
95	0.95	0.85	2.242478e-19	TRUE	5.2235494
90	0.85	0.95	5.168561e-09	TRUE	0.3257784
90	0.95	0.85	2.528762e-09	TRUE	2.9063339
85	0.85	0.95	4.746107e-08	TRUE	0.3504010
85	0.95	0.85	1.094125e-06	TRUE	2.5429018
80	0.85	0.95	1.600992e-09	TRUE	0.2813703
80	0.95	0.85	3.093697e-11	TRUE	4.0058033
75	0.85	0.95	2.569515e-07	TRUE	0.3093675
75	0.95	0.85	1.922076e-07	TRUE	2.8053062
70	0.85	0.95	3.184703e-06	TRUE	0.3755245
70	0.95	0.85	5.796793e-06	TRUE	2.5867076
65	0.85	0.95	3.258108e-11	TRUE	0.2143716
65	0.95	0.85	3.075501e-11	TRUE	4.9509234
60	0.85	0.95	1.058994e-06	TRUE	0.3188875
60	0.95	0.85	6.419377e-08	TRUE	3.4674736
55	0.85	0.95	7.755740e-06	TRUE	0.3437639
55	0.95	0.85	3.377469e-04	TRUE	2.3648032
50	0.85	0.95	2.386045e-05	TRUE	0.3495918
50	0.95	0.85	5.478384e-09	TRUE	4.8218152
45	0.85	0.95	5.070939e-09	TRUE	0.2144768
45	0.95	0.85	1.254435e-03	TRUE	2.2388702
40	0.85	0.95	5.553948e-06	TRUE	0.2723527
40	0.95	0.85	1.445351e-04	TRUE	3.1360523
35	0.85	0.95	0.01293186	TRUE	0.4716814
35	0.95	0.85	4.049820e-05	TRUE	3.0824715
30	0.85	0.95	1.343616e-03	TRUE	0.3580566
30	0.95	0.85	1.450126e-04	TRUE	3.0984165
25	0.85	0.95	3.416597e-04	TRUE	0.2646231
25	0.95	0.85	6.140147e-03	TRUE	2.6877334
20	0.85	0.95	4.222875e-03	TRUE	0.2907575
20	0.95	0.85	0.01205016	TRUE	2.8011101
15	0.85	0.95	0.04115161	TRUE	0.3942332
15	0.95	0.85	0.01620333	TRUE	3.1423553
10	0.85	0.95	0.6509290	FALSE	0.7998961
10	0.95	0.85	0.1145348	FALSE	2.7692377
5	0.85	0.95	0.02727102	TRUE	0.1299541
5	0.95	0.85	0.3034837	FALSE	3.1819433

As sample sizes decreased from \(N_{\text{pop}} = 100\) to \(N_{\text{pop}} = 5\), our ability to detect statistically significant differences in survival rates diminished. The table shows that when the population size drops below \(N_{\text{pop}} \leq 10\), the log-rank test p-value becomes non-significant, indicating reduced statistical power. This suggests that smaller sample sizes increase the variability of results and may mask true differences between populations. This makes complete sens since with smaller sample size you are more subject to stochastic effects masking true differences between populations.

3. Mixed model analysis (2.0 points max - 0.5 for each subtask)

A plant biologist wants to study seedling growth. In the experiment the diameter of the seedlings is measured daily on five consecutive days. From preliminary data it appears that the diameter seems to increase as a linear function in time, but there is substantial variation in the initial diameter and growth rate. In order to help the biologist to get a feeling for how many seedlings should be studied you offer to do some simulations:

Create a function that simulates seedling growth. The function takes as input the following parameters:
Number of seedlings: \(N_s\)

Number of timepoints of the experiment (same for all seedlings): \(N_t\)
The expected seedling diameter \(\mu_d\) and its variance \(\sigma_d^2\) across seedlings at the first day of measurements
The expected (linear) growth rate \(\mu_r\) and its variance \(\sigma_r^2\) across seedlings
The measurement error \(\varepsilon\) which is modeled to be Gaussian with variance \(\sigma_{\varepsilon}^2\)

The output of the function is a \(N_s \times N_t\) data sheet of simulated diameter measurements with one row for each seedling and one column for each time point \(t\). To this end first draw for each seedling its initial diameter \(d_0\) from a normal distribution \(\mathcal{N}(\mu_d , \sigma_d)\) and its growth rate \(r\) from \(\mathcal{N}(\mu_r, \sigma_r)\). Then assume a linear model to generate the diameter measurements, adding measurement errors drawn from \(\mathcal{N}(0, \sigma_{\varepsilon})\):

\[ d(t) = d_0 + r t + \varepsilon \]

Test the function by generating measurements for \(N_s = 10\), \(N_t = 5\), \(\mu_d = 1\)mm, \(\sigma_d = 0.2\)mm, \(\mu_r = 0.5\)mm/day, \(\sigma_r = 0.1\)mm/day, \(\varepsilon = 0.1\)mm. Plot the corresponding data as time courses.

# Load necessary library for plotting
library(ggplot2)

# Function to simulate seedling growth
simulate_seedling_growth <- function(N_s, N_t, mu_d, sigma_d, mu_r, sigma_r, epsilon) {
  # Initialize a matrix to store the results
  measurements <- matrix(0, nrow = N_s, ncol = N_t)
  
  # Generate initial diameters and growth rates for each seedling
  initial_diameters <- rnorm(N_s, mean = mu_d, sd = sigma_d)
  growth_rates <- rnorm(N_s, mean = mu_r, sd = sigma_r)
  
  # Simulate measurements for each time point
  for (t in 1:N_t) {
    # Calculate diameter with measurement error
    measurement_error <- rnorm(N_s, mean = 0, sd = epsilon)
    measurements[, t] <- initial_diameters + growth_rates * (t - 1) + measurement_error
  }
  
  return(measurements)
}

# Parameters for the simulation
N_s <- 10
N_t <- 5
mu_d <- 1
sigma_d <- 0.2
mu_r <- 0.5
sigma_r <- 0.1
epsilon <- 0.1

# Run the simulation
seedling_data <- simulate_seedling_growth(N_s, N_t, mu_d, sigma_d, mu_r, sigma_r, epsilon)

# Convert the matrix into data frame (Co-Pilot helped me)
time_points <- 1:N_t
seedling_df <- data.frame(time = rep(time_points, each = N_s), 
                           seedling = rep(1:N_s, times = N_t), 
                           diameter = as.vector(seedling_data))

# Plot the results
# Made it without using ggplot but asked Co-pilot to make it prettier
ggplot(seedling_df, aes(x = time,
                        y = diameter,
                        group = seedling,
                        color = as.factor(seedling))) +
  geom_line() +
  geom_point() +
  labs(title = "Simulated Seedling Growth Over Time", 
       x = "Time (days)", 
       y = "Diameter (mm)", 
       color = "Seedling") +
  theme_minimal()

Model the data you generated for each seedling with a simple linear model to estimate its initial diameter \(d_0\) and growth rate \(r\). What is the total number of many parameters you estimated? Compute the sample mean and variance from these estimates for the initial diameter \(d_0\) and growth rate \(r\) across the seedlings. Compare them to \(\mu_d\) , \(\sigma_d^2\) and \(\mu_r\), \(\sigma_r^2\). What is their absolute and relative difference? Please comment on your observations.

# Fit linear models to each seedling's growth data
model_results <- list()  # List to store models
d0_estimates <- numeric(N_s)  # Vector to store initial diameter estimates
r_estimates <- numeric(N_s)  # Vector to store growth rate estimates

for (i in 1:N_s) {
  # Extract the diameter measurements for the current seedling
  seedling_measurements <- seedling_data[i, ]
  time_points <- 0:(N_t - 1)  # Time starts at 0
  
  # Fit a linear model: diameter ~ time
  model <- lm(seedling_measurements ~ time_points)
  model_results[[i]] <- model
  
  # Store estimates
  d0_estimates[i] <- coef(model)[1]  # Intercept (initial diameter)
  r_estimates[i] <- coef(model)[2]   # Slope (growth rate)
}

# Compute sample mean and variance for d0 and r
mean_d0 <- mean(d0_estimates)
var_d0 <- var(d0_estimates)
mean_r <- mean(r_estimates)
var_r <- var(r_estimates)

# True parameters
true_mu_d <- mu_d
true_sigma_d2 <- sigma_d^2
true_mu_r <- mu_r
true_sigma_r2 <- sigma_r^2

# Compute absolute and relative differences
abs_diff_d0 <- abs(mean_d0 - true_mu_d)
rel_diff_d0 <- abs_diff_d0 / true_mu_d

abs_diff_r <- abs(mean_r - true_mu_r)
rel_diff_r <- abs_diff_r / true_mu_r

# Display results
results <- data.frame(
  Parameter = c("Initial Diameter (d0)", "Growth Rate (r)"),
  Estimated_Mean = c(mean_d0, mean_r),
  Estimated_Variance = c(var_d0, var_r),
  True_Mean = c(true_mu_d, true_mu_r),
  True_Variance = c(true_sigma_d2, true_sigma_r2),
  Abs_Difference_Mean = c(abs_diff_d0, abs_diff_r),
  Rel_Difference_Mean = c(rel_diff_d0, rel_diff_r)
)

Parameter	Estimated Mean	Estimated Variance	True Mean	True Variance	Abs Difference (Mean)	Rel Difference (Mean)
Initial Diameter (\(d_0\))	1.080041	0.05370362	1.0	0.04	0.08004098	0.08004098
Growth Rate (\(r\))	0.469813	0.01583878	0.5	0.01	0.03018693	0.06037386

The estimated means for both the initial diameter and growth rate are fairly close to the true values, with small absolute and relative differences, indicating that the linear model effectively captures the parameters from the simulated data. The variance of the estimates is slightly higher than the true variance, likely due to the small sample size (N = 10). However, this difference might be too small to be considered important.

Join the data from all seedlings and model it with a single simple linear model. How many parameters do you estimate now? Compare their estimates to \(\mu_d\) and \(\mu_r\). What is their absolute and relative difference? Please comment on your observations. Hint: The pivot_longer function of the tidyr library provides a simple way to transform your \(N_s \times N_t\) data sheet into the required format.

# Load necessary libraries
library(dplyr) # for the mutate() function
library(tidyr)

# Convert the seedling data to a data frame
seedling_data <- as.data.frame(seedling_data)

# Reshape the data from wide to long format
seedling_long <- pivot_longer(
  seedling_data, 
  cols = everything(),
  names_to = "Timepoint",
  values_to = "Diameter")

# Clean up the Timepoint column to make it numeric
seedling_long$Timepoint <- as.numeric(gsub("V", "", seedling_long$Timepoint))

# Fit a linear model using the reshaped data
combined_model <- lm(Diameter ~ Timepoint, data = seedling_long)

# Get the coefficients from the model
d0_combined <- coef(combined_model)[1]  # Initial diameter (intercept)
r_combined <- coef(combined_model)[2]   # Growth rate (slope)

# Calculate the absolute difference from the true values
abs_diff_d0_combined <- abs(d0_combined - mu_d)
rel_diff_d0_combined <- abs_diff_d0_combined / mu_d  # Relative difference

abs_diff_r_combined <- abs(r_combined - mu_r)
rel_diff_r_combined <- abs_diff_r_combined / mu_r  # Relative difference

# Create a data frame to store results
results_combined <- data.frame(
  Parameter = c("Initial Diameter (d0)", "Growth Rate (r)"),
  Estimated_Value = c(d0_combined, r_combined),
  True_Value = c(mu_d, mu_r),
  Abs_Difference = c(abs_diff_d0_combined, abs_diff_r_combined),
  Rel_Difference = c(rel_diff_d0_combined, rel_diff_r_combined))

# Count the number of parameters estimated from the model
num_parameters <- length(coef(combined_model))

Parameter	Estimated Value	True Value	Abs Difference	Rel Difference
Initial Diameter (\(d_0\))	0.6102279	1.0	0.38977209	0.38977209
Growth Rate (\(r\))	0.4698131	0.5	0.03018693	0.06037386

In this approach, we model all seedlings using a single linear regression rather than individual models for each seedling. The number of parameters estimated in this combined model is 2: one for the initial diameter (\(d_0\)) and one for the growth rate (\(r\)). The estimates are further from the true values than the previous model. By fitting a single linear model to all seedlings, the approach assumes that all seedlings grow at the same rate and start from the same size. Which isn’t a great idea since the plant biologist said that the preliminary data indicated that there was “substantial variation in the initial diameter and growth rate”. This explains why the first model we did fitted the true values much better.

Model the data you generated for all seedlings with a Random Intercept Model and with a Random Intercept and Slope Model. How many parameters do you estimate in each case? Compare your estimates for \(d_0\) and \(r\) to \(\mu_d\) and \(\mu_r\), respectively. How do their (point) estimates and errors compare to those of the models with fixed effects only? How do your results change when masking 10% of your data as missing?

Hint: One way to estimate errors is doing repeated experiments.

Here, we will create two additional models that account for the substantial variation in both the initial diameter (intercept) and growth rate (slope) of the seedlings. I attempted to implement this in R and compare the estimates with the true values, but at that point, most of my code had been generated using Co-Pilot, so I won’t include it here. However, I believe that using random slope and intercept models is a suitable approach for capturing the variations in both the diameter and growth rate of the seedlings. I think the estimated values would be the closest to the true values in this model.

Data Analysis 2024 - 4th set of exercises

Douhan Wicht

2024-10-23

0. Acknowledgment

1. Logistic regressions (1.5 points max - 0.5 for each subtask)

2. Survival analysis (2 points max - 0.5 for each subtask)

3. Mixed model analysis (2.0 points max - 0.5 for each subtask)