Introduction
This research investigates the differences between the standard and
bootstrap approaches to simple linear regression. We aim to determine
which method is better suited to future research.
Data Set
The data set of interest contains detailed information on various
environmental and soil factors that influence plant growth. It includes
measurements of soil contents, environmental factors, and the plant
yield. Information is provided for many kinds of fruit, but for this
analysis, only strawberry data will be used. Data was recorded
throughout the plant growth period and during harvest.
The data set was acquired from Kaggle courtesy of user Masha Sanaei.
The link for the dataset is https://www.kaggle.com/datasets/snmahsa/soil-nutrients.
The dataset was uploaded to a GitHub repository and can be accessed at https://github.com/ncbrechbill/STA321/blob/main/STA321/Soil%20Nutrients.csv.
- Name; character: The plant name. This dataset was filtered to Strawberry data only
- Temperature (\(X_1\)); numeric: The average recorded air temperature in degrees Celsius
- Rainfall (\(X_2\)); numeric: The total recorded rainfall in centimeters
- pH (\(X_3\)); numeric: The average recorded soil pH
- Nitrogen (\(X_4\)); numeric: Average soil nitrogen content in parts per million (ppm)
- Phosphorus (\(X_5\)); numeric: Average soil phosphorus content in ppm
- Potassium (\(X_6\)); numeric: Average soil potassium content in ppm
- Light_Hours (\(X_7\)); numeric: Average hours of sunlight covering the plant each day
- Yield (\(Y\)); numeric: The mass of harvested strawberries in grams
This data set contains 700 strawberry plant observations, all with
complete data and no missing values. A sample of this size is large
enough to support a regression on any subset of these predictors. The
data set also includes categorical variables that bin these measurements
into ranges; they are not used in this analysis.
Exploratory Data Analysis

The pairwise scatter plots and correlation coefficients indicate no
strong linear relationships between the response variable and the
predictors. Transforming the variables would not be expected to
strengthen these relationships. The analysis variables all appear
approximately normally distributed, so a Box-Cox transformation is not
required.
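The correlation screen described above can be sketched in base R as follows. This is a minimal illustration on simulated stand-in data: the data frame `sim`, its column values, and the choice of two predictors are assumptions made for the example, not the strawberry measurements.

```r
# Simulated stand-in for the strawberry data (hypothetical values only)
set.seed(1)
n <- 700
sim <- data.frame(
  pH          = rnorm(n, mean = 6.4, sd = 0.5),
  Temperature = rnorm(n, mean = 20, sd = 3),
  Yield       = rnorm(n, mean = 20, sd = 2)
)

# Pearson correlation of each predictor with the response
predictors <- setdiff(names(sim), "Yield")
cors <- sapply(predictors, function(v) cor(sim[[v]], sim$Yield))
print(round(cors, 3))
```

With the real data, the same `sapply` call would run over the seven numeric predictors; coefficients close to zero would match the weak relationships described above.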
Standard Simple Linear Regression
A simple linear regression model was fitted with soil pH as the
predictor and strawberry yield as the response. Determining whether soil
pH is related to yield helps answer our research question: which
treatments or environments could be worth using to increase strawberry
yield.
The simple linear regression equation is given by the following:
\[
y = \beta_0 + \beta_1 x + \epsilon
\]
Here, \(y\) is the strawberry yield, \(\beta_0\) is the intercept,
\(\beta_1\) is the slope coefficient for pH (so \(x\) is the soil pH),
and \(\epsilon\) is an error term. This is a parametric model, so we
proceed under the assumption \(\epsilon \sim N(0, \sigma^2)\) and,
equivalently, \(y \sim N(\beta_0 + \beta_1 x, \sigma^2)\).
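For reference, the least-squares estimates of the two coefficients have a standard closed form. The notation \(\bar{x}\), \(\bar{y}\) (sample means) and the index \(i = 1, \dots, n\) are introduced here for illustration and do not appear elsewhere in this report:
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]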
The F statistic was 1.007 on 1 and 698 degrees of freedom, and the t
statistic for the pH slope was 1.004. The P value was 0.3159, so the
slope is not statistically significant at the 5% level.
The residuals appear to be normally distributed with constant
variance. There are no outlying residuals with high leverage, and
leverage is spread reasonably evenly across the observations.




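# slr is assumed to be the fitted lm object for Yield ~ pH
reg.table <- coef(summary(slr))
pander(reg.table, caption = "Standard linear regression coefficient estimates")
Standard linear regression coefficient estimates
|             | Estimate | Std. Error | t value | P value   |
|-------------|----------|------------|---------|-----------|
| (Intercept) | 18.92    | 1.106      | 17.1    | 4.927e-55 |
| pH          | 0.1731   | 0.1725     | 1.004   | 0.3159    |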
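The table reports the estimate and standard error rather than interval endpoints. A 95% interval for the pH slope can be recovered from those two values, as in the sketch below. The variable names `est`, `se`, `df`, and `ci` are illustrative; the numbers are copied from the table, and the degrees of freedom assume n - 2 with n = 700.

```r
# Values copied from the coefficient table for the pH slope
est <- 0.1731   # slope estimate
se  <- 0.1725   # standard error of the slope
df  <- 698      # residual degrees of freedom (n - 2 with n = 700)

# Two-sided 95% t interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 3))
```

The resulting interval contains zero, in line with the reported P value of 0.3159. If the fitted model object is available, `confint()` applied to it should give the same interval up to rounding.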
Bootstrap Simple Linear Regression
The regression model is unchanged under the bootstrap; only the way the
sampling distribution of the coefficients is estimated differs. The
first few bootstrap coefficient estimates are shown. Below, we can see
that the bootstrap coefficient distributions are approximately normal.
Since zero is included in the 95% confidence interval for the slope, we
cannot conclude that the population slope is not equal to zero at that
confidence level, the same conclusion we reached under the standard
regression model.
set.seed(123)             # for reproducibility
n <- nrow(strawberry)     # should be 700
B <- 1000                 # number of bootstrap replications
# Matrix to store coefficients: Intercept and pH slope
beta_mat <- matrix(NA, nrow = B, ncol = 2)
for (b in 1:B) {
  # Sample row indices with replacement
  idx <- sample(1:n, size = n, replace = TRUE)
  # Fit linear model on bootstrap sample
  fit <- lm(Yield ~ pH, data = strawberry[idx, ])
  # Store coefficients
  beta_mat[b, ] <- coef(fit)
}
# Convert to data frame for easier handling
beta_df <- as.data.frame(beta_mat)
names(beta_df) <- c("Intercept", "Slope")
# Quick look at results
head(beta_df)
## Intercept Slope
## 1 20.85783 -0.12725981
## 2 19.28336 0.11662004
## 3 20.21753 -0.03029722
## 4 19.95392 0.01711187
## 5 19.25598 0.12264456
## 6 21.06940 -0.16711248
boot.beta0.ci <- quantile(beta_df$Intercept, c(0.025, 0.975))
boot.beta1.ci <- quantile(beta_df$Slope, c(0.025, 0.975))
boot.coef <- data.frame(rbind(boot.beta0.ci, boot.beta1.ci))
names(boot.coef) <- c("2.5%", "97.5%")
pander(boot.coef, caption="Bootstrap confidence intervals of regression coefficients.")
Bootstrap confidence intervals of regression coefficients.
|               | 2.5%    | 97.5%  |
|---------------|---------|--------|
| boot.beta0.ci | 16.69   | 21.04  |
| boot.beta1.ci | -0.1575 | 0.5172 |
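One way to check the claims about the bootstrap distribution numerically is to compare the bootstrap standard error of the slope with the model-based standard error and to run a rough normality test on the bootstrap slopes. The sketch below does this on simulated stand-in data. The data frame `sim`, the simulation parameters, and the smaller `B` are assumptions for illustration, so the printed values will not match the strawberry results.

```r
# Simulated stand-in data (hypothetical; not the strawberry measurements)
set.seed(42)
n <- 700
sim <- data.frame(pH = rnorm(n, mean = 6.4, sd = 0.45))
sim$Yield <- 19 + 0.17 * sim$pH + rnorm(n, sd = 2)

# Case-resampling bootstrap of the pH slope
B <- 500
boot_slope <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample.int(n, size = n, replace = TRUE)
  boot_slope[b] <- coef(lm(Yield ~ pH, data = sim[idx, ]))[["pH"]]
}

# Bootstrap standard error versus the model-based standard error
boot_se  <- sd(boot_slope)
model_se <- coef(summary(lm(Yield ~ pH, data = sim)))["pH", "Std. Error"]

# Shapiro-Wilk test as a rough numerical check of normality
sw <- shapiro.test(boot_slope)

print(c(boot_se = boot_se, model_se = model_se, shapiro_p = sw$p.value))
```

When the model assumptions hold and the sample is large, the two standard errors should be close, which is the pattern the percentile interval above suggests for the real data.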
Method Comparison
The two methods yielded similar results: neither found a significant
effect of soil pH on strawberry yield. This suggests that, within the
pH range observed, strawberry yield is not strongly affected by factors
that change soil pH (such as sulfur treatment or other natural compounds
that raise or lower pH). However, the data were collected without any
designed experimental treatments, so future research that tests a
hypothesis at controlled pH levels may reach different conclusions.
In the scope of this research, the approximately normal distribution of
the data and the large sample size indicate that bootstrapping is not
necessary for a conclusive analysis. As expected, with a sample of this
size the bootstrap yields results similar to those of the standard
simple linear regression. For computational ease, the standard approach
may be preferable.
