1 Introduction

This research investigates the differences between the standard and bootstrap approaches to simple linear regression. We aim to determine which method is better suited to future research.

2 Data Set

The data set of interest contains detailed information on various environmental and soil factors that influence plant growth. It includes measurements of soil contents, environmental factors, and the plant yield. Information is provided for many kinds of fruit, but for this analysis, only strawberry data will be used. Data was recorded throughout the plant growth period and during harvest.

The data set was acquired from Kaggle courtesy of user Masha Sanaei. The link for the data set is https://www.kaggle.com/datasets/snmahsa/soil-nutrients. A copy was uploaded to a GitHub repository and is available at https://github.com/ncbrechbill/STA321/blob/main/STA321/Soil%20Nutrients.csv.

  • Name; character: The plant name. This data set was pruned to keep only strawberry data
  • Temperature (\(X_1\)); numeric: The average recorded air temperature in degrees Celsius
  • Rainfall (\(X_2\)); numeric: The total recorded rainfall in centimeters
  • pH (\(X_3\)); numeric: The average recorded soil pH
  • Nitrogen (\(X_4\)); numeric: Average soil nitrogen content in parts per million (ppm)
  • Phosphorus (\(X_5\)); numeric: Average soil phosphorus content in ppm
  • Potassium (\(X_6\)); numeric: Average soil potassium content in ppm
  • Light_Hours (\(X_7\)); numeric: Average hours of sunlight covering the plant each day
  • Yield (\(Y\)); numeric: The mass of harvested strawberries in grams

This data set contains 700 strawberry plant observations, all with complete data and no missing values. With 700 observations and at most seven candidate predictors, the sample is large enough to support stable regression estimates. There are additional categorical variables that group ranges of these variables together. However, these will not be used in this analysis.
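The pruning step described above can be sketched as follows. This is a minimal illustration, assuming the plant name column is called `Name` as in the variable list; the small data frame and its values are made up so that the snippet runs without the CSV file, and the local file path in the comment is an assumption.

```r
# Hypothetical stand-in for the full data set. With the real file one would use
# something like: soil <- read.csv("Soil Nutrients.csv")  # local path is an assumption
soil <- data.frame(
  Name  = c("Strawberry", "Blueberry", "Strawberry", "Raspberry"),  # made-up rows
  pH    = c(6.1, 4.8, 6.5, 5.9),
  Yield = c(20.3, 15.1, 19.8, 17.2)
)

# Keep only the strawberry rows
strawberry <- subset(soil, Name == "Strawberry")

# Count missing values across all retained columns
n_missing <- sum(is.na(strawberry))

nrow(strawberry)  # 2 in this toy example; the text reports 700 for the real data
n_missing         # 0 here; the text reports no missing values in the real data
```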

3 Exploratory Data Analysis


The pairwise scatter plots and correlation coefficients indicate no strong linear relationship between the response variable and any of the predictors, and the scatter plots give no indication that transforming the variables would strengthen these relationships. The analysis variables all appear approximately normally distributed, so a Box-Cox transformation is not required.
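The kind of check summarized above can be sketched in base R. The data below are simulated stand-ins, not the strawberry measurements, and the column subset, means, and standard deviations are assumptions chosen only to make the example run.

```r
set.seed(1)

# Simulated stand-in data (values are NOT from the real data set)
n <- 700
sim <- data.frame(
  Temperature = rnorm(n, mean = 20, sd = 3),
  pH          = rnorm(n, mean = 6.4, sd = 0.4),
  Nitrogen    = rnorm(n, mean = 150, sd = 20)
)
sim$Yield <- rnorm(n, mean = 20, sd = 2)  # generated independently of the predictors

# Pearson correlation of each predictor with the response
predictors     <- setdiff(names(sim), "Yield")
cor_with_yield <- sapply(sim[predictors], cor, y = sim$Yield)
round(cor_with_yield, 3)

# Shapiro-Wilk test as one possible numerical check of normality for the response
yield_normality_p <- shapiro.test(sim$Yield)$p.value
yield_normality_p
```

Correlations near zero, as in this simulated case, are what the text reports for the real predictors; with the actual data, `pairs()` would give the corresponding scatter plot matrix.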

4 Standard Simple Linear Regression

A simple linear regression model was fitted with soil pH as the single predictor of strawberry yield. Determining whether a relationship exists between soil pH and strawberry yield helps answer our research question of which treatments or environments could be worth using to increase strawberry yield.

The simple linear regression equation is given by the following:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where \(y\) is the strawberry yield, \(\beta_0\) is the intercept, \(\beta_1\) is the slope coefficient for pH (so \(x\) is the soil pH), and \(\epsilon\) is an error term. This is a parametric model, so we proceed under the assumption \(\epsilon \sim N(0, \sigma^2)\) or, equivalently, \(y \sim N(\beta_0 + \beta_1 x, \sigma^2)\).
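For reference, the least-squares estimates and the slope test statistic for this model take the standard textbook form below. The symbols \(n\), \(\bar{x}\), \(\bar{y}\), and \(\hat{\sigma}\) (sample size, sample means, and residual standard error) are notation introduced here for illustration.

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

\[ t = \frac{\hat{\beta}_1}{\operatorname{SE}(\hat{\beta}_1)}, \qquad \operatorname{SE}(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad \hat{\sigma}^2 = \frac{1}{n - 2} \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2 \]

Under the null hypothesis \(\beta_1 = 0\) and the normality assumption above, \(t\) follows a t distribution with \(n - 2\) degrees of freedom.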

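The overall F statistic was 1.007 on 1 and 698 degrees of freedom (equivalently, a slope t statistic of 1.004 on 698 degrees of freedom). The p-value was 0.3159, so the relationship between soil pH and yield is not statistically significant.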
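The reported test statistics can be cross-checked from the slope estimate (0.1731) and standard error (0.1725) given in the regression summary. The sketch below shows that arithmetic; the variable names are illustrative, and small discrepancies from the reported values come from rounding of the inputs.

```r
# Values as reported in the regression summary of this report
estimate <- 0.1731   # pH slope estimate
std_err  <- 0.1725   # standard error of the slope
df_resid <- 698      # residual degrees of freedom, n - 2 with n = 700

t_stat  <- estimate / std_err               # slope t statistic
f_stat  <- t_stat^2                         # overall F statistic in simple regression
p_value <- 2 * pt(-abs(t_stat), df_resid)   # two-sided p-value

round(c(t = t_stat, F = f_stat, p = p_value), 4)
```

In simple linear regression the overall F statistic is the square of the slope t statistic, which is why both lead to the same p-value.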

The residuals appear to be approximately normally distributed with constant variance. No residuals are both outlying and high-leverage, and leverage is spread reasonably evenly across the observations.
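One way such residual checks could be carried out numerically is sketched below. The data are simulated stand-ins (the coefficients and noise level are assumptions), and the cutoffs used to flag points are common rules of thumb, not thresholds taken from this report.

```r
set.seed(2)

# Simulated stand-in data; the real analysis would use the strawberry data frame
n   <- 700
sim <- data.frame(pH = rnorm(n, mean = 6.4, sd = 0.4))
sim$Yield <- 19 + 0.17 * sim$pH + rnorm(n, sd = 2)

fit <- lm(Yield ~ pH, data = sim)

res  <- rstandard(fit)        # standardized residuals
lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # influence of each observation

# Rule-of-thumb flags (conventional cutoffs, not from the report)
high_leverage <- which(lev > 2 * length(coef(fit)) / n)
large_resid   <- which(abs(res) > 3)
flagged       <- intersect(high_leverage, large_resid)

shapiro.test(residuals(fit))$p.value  # normality check on the residuals
length(flagged)                       # count of points that are both outlying and high-leverage
```

The graphical equivalents are available through `plot(fit, which = c(1, 2, 5))`, which draws the residuals versus fitted, normal Q-Q, and residuals versus leverage plots.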

reg.table <- coef(summary(slr))
pander(reg.table, caption = "Standard Linear Regression Coefficient Estimates")
Standard Linear Regression Coefficient Estimates
              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)   18.92      1.106        17.1      4.927e-55
pH            0.1731     0.1725       1.004     0.3159
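The table above reports estimates and standard errors, not interval endpoints. A minimal sketch of how a 95% t-based interval for the pH slope follows from those two numbers is given here; the variable names are illustrative, and the result is only as precise as the rounded inputs.

```r
# Values as reported in the coefficient table
estimate <- 0.1731   # pH slope estimate
std_err  <- 0.1725   # standard error of the slope
df_resid <- 698      # residual degrees of freedom

t_crit   <- qt(0.975, df_resid)                    # critical value for a 95% interval
slope_ci <- estimate + c(-1, 1) * t_crit * std_err # estimate +/- margin of error
names(slope_ci) <- c("2.5%", "97.5%")

round(slope_ci, 4)
```

The interval contains zero, in line with the non-significant p-value. With the fitted model object available, `confint(slr, level = 0.95)` would return the intervals for both coefficients directly, assuming `slr` is the `lm` fit used in the code above.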

5 Bootstrap Simple Linear Regression

The regression model is unchanged under the bootstrap approach; only the way the sampling distribution of the coefficients is estimated differs. The strawberry observations are resampled with replacement 1000 times, the model is refitted to each resample, and the first few bootstrap coefficient estimates are shown below. The bootstrap coefficient distributions appear approximately normal. Since zero is included in the 95% bootstrap confidence interval for the pH slope, we cannot conclude that the population slope differs from zero at that confidence level, which matches the conclusion reached under the standard regression model.

set.seed(123)  # for reproducibility

n <- nrow(strawberry)   # should be 700
B <- 1000               # number of bootstrap replications

# Matrix to store coefficients: Intercept and pH slope
beta_mat <- matrix(NA, nrow = B, ncol = 2)

for (b in 1:B) {
  # Sample row indices with replacement
  idx <- sample(1:n, size = n, replace = TRUE)
  
  # Fit linear model on bootstrap sample
  fit <- lm(Yield ~ pH, data = strawberry[idx, ])
  
  # Store coefficients
  beta_mat[b, ] <- coef(fit)
}

# Convert to data frame for easier handling
beta_df <- as.data.frame(beta_mat)
names(beta_df) <- c("Intercept", "Slope")

# Quick look at results
head(beta_df)
##   Intercept       Slope
## 1  20.85783 -0.12725981
## 2  19.28336  0.11662004
## 3  20.21753 -0.03029722
## 4  19.95392  0.01711187
## 5  19.25598  0.12264456
## 6  21.06940 -0.16711248
boot.beta0.ci <- quantile(beta_df$Intercept, c(0.025, 0.975))
boot.beta1.ci <- quantile(beta_df$Slope, c(0.025, 0.975))
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci)) 
names(boot.coef) <- c("2.5%", "97.5%")
pander(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")
Bootstrap confidence intervals of regression coefficients.
                2.5%      97.5%
boot.beta0.ci   16.69     21.04
boot.beta1.ci   -0.1575   0.5172
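Beyond the percentile interval, the bootstrap slopes can also be summarized by their standard deviation and checked for normality. The sketch below does this on simulated stand-in data (the coefficients and noise level are assumptions, not estimates from the strawberry data) and uses `replicate()` as a compact alternative to the explicit loop.

```r
set.seed(123)

# Simulated stand-in for the strawberry data (hypothetical values)
n   <- 700
sim <- data.frame(pH = rnorm(n, mean = 6.4, sd = 0.4))
sim$Yield <- 19 + 0.17 * sim$pH + rnorm(n, sd = 2)

# Bootstrap slopes: resample rows with replacement and refit the model
B <- 1000
boot_slope <- replicate(B, {
  idx <- sample.int(n, size = n, replace = TRUE)
  coef(lm(Yield ~ pH, data = sim[idx, ]))[["pH"]]
})

boot_se  <- sd(boot_slope)  # bootstrap standard error of the slope
model_se <- coef(summary(lm(Yield ~ pH, data = sim)))["pH", "Std. Error"]
normal_p <- shapiro.test(boot_slope)$p.value  # normality check of the bootstrap slopes

c(boot_se = boot_se, model_se = model_se, shapiro_p = normal_p)
```

When the model assumptions hold and the sample is large, the bootstrap standard error should be close to the model-based one. A histogram such as `hist(boot_slope)` gives the visual counterpart of the normality check.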

6 Method Comparison

The two methods yielded similar results: neither found a significant effect of soil pH on strawberry yield. This suggests that, within the range of pH values observed in this data set, strawberry yield is not strongly affected by factors that change soil pH (such as sulfur treatment or other natural compounds that raise or lower pH). Because the data are observational and were not collected under an experimental design, future research that tests a hypothesis at controlled pH levels may reach different conclusions.

Within the scope of this research, the approximately normal distribution of the data and the large sample size mean that bootstrapping is not needed to reach a conclusive analysis. As expected, with a sample of this size the bootstrap yields results similar to those of the standard simple linear regression. For computational ease, the standard approach may be preferable.
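The comparison can be put side by side with a small helper. `compare_ci()` is a hypothetical function written for this illustration, and the data are simulated stand-ins whose coefficients and noise level are assumptions.

```r
# Hypothetical helper: t-based and percentile bootstrap intervals for the pH slope
compare_ci <- function(data, B = 1000, level = 0.95) {
  alpha <- 1 - level
  n     <- nrow(data)

  # Standard (t-based) interval from the fitted model
  fit  <- lm(Yield ~ pH, data = data)
  t_ci <- confint(fit, "pH", level = level)

  # Percentile bootstrap interval from refits on resampled rows
  slopes <- replicate(B, {
    idx <- sample.int(n, size = n, replace = TRUE)
    coef(lm(Yield ~ pH, data = data[idx, ]))[["pH"]]
  })
  boot_ci <- quantile(slopes, c(alpha / 2, 1 - alpha / 2))

  out <- rbind(standard = as.numeric(t_ci), bootstrap = as.numeric(boot_ci))
  colnames(out) <- c("lower", "upper")
  out
}

set.seed(321)
n   <- 700
sim <- data.frame(pH = rnorm(n, mean = 6.4, sd = 0.4))  # simulated, not real data
sim$Yield <- 19 + 0.17 * sim$pH + rnorm(n, sd = 2)

ci_table <- compare_ci(sim)
round(ci_table, 4)
```

The standard interval needs a single model fit, while the bootstrap interval needs `B` refits, which is the computational cost the text refers to.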

