Introduction
This report will compare several multiple linear regression models to
determine which best predicts the yield of a strawberry plant based on
several soil and environmental factors.
Data Set
The data set of interest contains detailed information on various
environmental and soil factors that influence plant growth. It includes
measurements of soil contents, environmental factors, and the plant
yield. Information is provided for many kinds of fruit, but for this
analysis, only strawberry data will be used. Data was recorded
throughout the plant growth period and during harvest.
The data set was acquired from Kaggle courtesy of user Masha Sanaei.
The dataset is available at https://www.kaggle.com/datasets/snmahsa/soil-nutrients.
A copy was uploaded to a GitHub repository for access at https://github.com/ncbrechbill/STA321/blob/main/STA321/Soil%20Nutrients.csv.
The variables used in this analysis are:
- Name; character: The plant name. The dataset was filtered to keep only strawberry data
- Temperature (\(X_1\)); numeric: The average recorded air temperature in degrees Celsius
- Rainfall (\(X_2\)); numeric: The total recorded rainfall in centimeters
- pH (\(X_3\)); numeric: The average recorded soil pH
- Nitrogen (\(X_4\)); numeric: Average soil nitrogen content in parts per million (ppm)
- Phosphorus (\(X_5\)); numeric: Average soil phosphorus content in ppm
- Potassium (\(X_6\)); numeric: Average soil potassium content in ppm
- Light_Hours (\(X_7\)); numeric: Average hours of sunlight covering the plant each day
- Yield (\(Y\)); numeric: The mass of harvested strawberries in grams
This data set contains 700 strawberry plant observations, all with
complete data and no missing values. With at most seven predictors, this
sample is large enough to estimate every model considered here. The data
set also includes categorical variables that group ranges of these
measurements together. These are not used in this analysis.
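The filtering step described above can be sketched as follows. This is a minimal illustration rather than the report's actual import code: the helper name `load_strawberry` and the local file path are hypothetical, and the sketch assumes the plant name is stored in a column called `Name` with the value "Strawberry", as the variable list suggests.

```r
# Hypothetical helper: read the soil-nutrient CSV and keep strawberry rows only.
# Assumes a `Name` column holding the plant name (an assumption based on the
# variable list, not confirmed against the raw file).
load_strawberry <- function(path) {
  soil <- read.csv(path, stringsAsFactors = FALSE)
  soil[soil$Name == "Strawberry", , drop = FALSE]
}

# Usage with an assumed local copy of the file:
# strawberry <- load_strawberry("Soil Nutrients.csv")
# nrow(strawberry)              # number of strawberry observations
# colSums(is.na(strawberry))    # missing values per column
```

Keeping the filter in a small function makes it easy to confirm the row count and the absence of missing values before any model is fitted.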
Exploratory Data Analysis

The pairwise scatterplots, correlation coefficients, and density plots
show that the variables are approximately normally distributed and that
the individual correlation coefficients are small. This, alongside
previous research, indicates that no single variable predicts plant
yield on its own. We aim to determine whether any combination of factors
can predict yield through multiple linear regression models.
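A quick numeric companion to the scatterplot matrix is the correlation of each predictor with the response. The sketch below uses simulated stand-in data (the data frame `demo` and its values are invented for illustration and are not the strawberry measurements), so its output says nothing about the real data set.

```r
# Illustrative only: simulated stand-in data, NOT the strawberry measurements.
set.seed(1)
n <- 200
demo <- data.frame(
  Temperature = rnorm(n, 20, 3),
  pH          = rnorm(n, 6.5, 0.4),
  Light_Hours = rnorm(n, 8, 1),
  Yield       = rnorm(n, 20, 1.5)  # generated independently of the predictors
)

# Pearson correlation of each predictor with the response
predictors <- setdiff(names(demo), "Yield")
yield_cor <- sapply(demo[predictors], function(x) cor(x, demo$Yield))
print(round(yield_cor, 3))

# pairs(demo)  # base-R pairwise scatterplot matrix, if a plot is wanted
```

With the real data, `demo` would be replaced by the `strawberry` data frame restricted to its numeric columns.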
Multiple Linear Regression
Let \(\{x_1, x_2, \cdots, x_k\}\) be \(k\) explanatory variables and
\(y\) be the response variable. The general form of the multiple linear
regression model is defined as
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon, \]
where \(\epsilon\) is the random error term. This is a special form in
that \(y\) is linear in both the parameters and the predictors. Linear
regression in general only requires that \(y\) be linear in the
parameters, not in the predictors, since the values of the predictors
are observed in the data.
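As an illustration of this distinction (the specific terms are chosen here for the example and are not models fitted in this report), a model with a squared predictor is still a linear regression model, whereas a model with a parameter inside an exponent is not:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon \quad \text{(linear in the parameters)}, \qquad y = \beta_0 + \beta_1 e^{\beta_2 x_1} + \epsilon \quad \text{(not linear in the parameters)}. \]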
A multiple linear regression model was first fitted using all seven
candidate predictors. None of the factors is statistically significant;
the smallest p-value is 0.274 (pH). The next model will be reduced by
removing the least significant factors. The residual plots were examined
to check the assumptions of the parametric model, and no violations were
apparent, so no transformation is justified.
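# Keep only the seven numeric predictors and the response
strawberry <- dplyr::select(strawberry, Temperature, Rainfall, pH, Light_Hours, Nitrogen, Phosphorus, Potassium, Yield)
# Full model: Yield regressed on every remaining column
full.model <- lm(Yield ~ ., data = strawberry)
pander(full.model)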
Fitting linear model: Yield ~ .
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 20.46 | 2.435 | 8.402 | 2.48e-16 |
| Temperature | -0.006367 | 0.01052 | -0.6051 | 0.5453 |
| Rainfall | -0.0003384 | 0.0004649 | -0.728 | 0.4669 |
| pH | 0.1901 | 0.1737 | 1.095 | 0.274 |
| Light_Hours | -0.07511 | 0.07 | -1.073 | 0.2836 |
| Nitrogen | -0.001444 | 0.002753 | -0.5246 | 0.6001 |
| Phosphorus | 0.001231 | 0.007015 | 0.1755 | 0.8608 |
| Potassium | -0.0008489 | 0.007013 | -0.121 | 0.9037 |
resid_panel(full.model)
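
The residual checks described above can also be approximated numerically in base R. The sketch below is an assumption-laden stand-in, not the report's procedure: it uses simulated data (the data frame `demo` is invented), a Shapiro-Wilk test for normality of the residuals, and a rough check of constant variance.

```r
# Illustrative only: simulated data, not the strawberry measurements.
set.seed(2)
demo <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
demo$y <- 20 + 0.5 * demo$x1 + rnorm(150)
fit <- lm(y ~ x1 + x2, data = demo)

res <- residuals(fit)
fitted_vals <- fitted(fit)

# Normality: a large Shapiro-Wilk p-value gives no evidence against normality
sw <- shapiro.test(res)

# Constant variance, informally: |residuals| should not trend with fitted values
spread_cor <- cor(abs(res), fitted_vals)

cat("Shapiro-Wilk p-value:", round(sw$p.value, 3), "\n")
cat("cor(|residual|, fitted):", round(spread_cor, 3), "\n")

# plot(fit, which = 1:2)  # residuals-vs-fitted and normal Q-Q plots
```

These numbers complement the plots rather than replace them; a visual check remains the more informative diagnostic.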

The next model removes the three least significant factors: potassium,
phosphorus, and nitrogen. Still, no remaining factor is statistically
significant. The residual plots are provided again to check the
parametric assumptions, and no violations were apparent.
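# Drop Nitrogen, Phosphorus, and Potassium; each removal must sit inside the formula.
# The output shown below comes from a call with a stray comma before `- Potassium`, so it still lists Potassium.
reduced.model1 <- lm(Yield ~ . - Nitrogen - Phosphorus - Potassium, data = strawberry)
pander(reduced.model1)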
Fitting linear model: Yield ~ . - Nitrogen - Phosphorus
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 20.53 | 2.263 | 9.073 | 1.257e-18 |
| Temperature | -0.006573 | 0.01064 | -0.6179 | 0.5369 |
| Rainfall | -0.000428 | 0.0004754 | -0.9003 | 0.3683 |
| pH | 0.1662 | 0.1762 | 0.9435 | 0.3458 |
| Light_Hours | -0.06902 | 0.07095 | -0.9728 | 0.331 |
| Potassium | -0.0009551 | 0.00715 | -0.1336 | 0.8938 |
resid_panel(reduced.model1)
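
Whether the dropped nutrient terms jointly add anything can be checked with a partial F-test comparing the reduced and full models. The following is a minimal sketch on simulated data in which the response is unrelated to every predictor by construction; the data frame `demo` and its values are invented for illustration, and the result does not describe the strawberry data.

```r
# Illustrative only: simulated data, not the strawberry measurements.
set.seed(3)
n <- 300
demo <- data.frame(
  pH = rnorm(n, 6.5, 0.4), Light_Hours = rnorm(n, 8, 1),
  Nitrogen = rnorm(n, 100, 20), Phosphorus = rnorm(n, 50, 10),
  Potassium = rnorm(n, 50, 10), Yield = rnorm(n, 20, 1.5)
)

full    <- lm(Yield ~ ., data = demo)
# All removals are part of the formula, joined by `-`
reduced <- lm(Yield ~ . - Nitrogen - Phosphorus - Potassium, data = demo)

# Partial F-test: H0 says the three dropped coefficients are all zero
cmp <- anova(reduced, full)
print(cmp)
p_partial <- cmp[["Pr(>F)"]][2]
```

`anova()` requires both models to be fitted to the same rows, which is one more reason to keep every term removal inside the formula rather than in a separate argument.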

The model was reduced once more to include only two factors, pH and
Light_Hours. Neither factor is significant (p = 0.289 and p = 0.2988,
respectively). The residual plots are provided again to check the
parametric assumptions, and no violations were apparent.
# Two-predictor model: pH and Light_Hours only
reduced.model2 <- lm(Yield ~ pH + Light_Hours, data = strawberry)
pander(reduced.model2)
Fitting linear model: Yield ~ pH + Light_Hours
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 19.79 | 1.39 | 14.24 | 1.418e-40 |
| pH | 0.1833 | 0.1728 | 1.061 | 0.289 |
| Light_Hours | -0.07247 | 0.0697 | -1.04 | 0.2988 |
resid_panel(reduced.model2)

Results Analysis
None of the fitted multiple linear regression models contains a
predictor that is statistically significant for plant yield. Reducing
the model further would yield a simple linear regression model, which
was analyzed previously and was not significant either. In this sample
there is therefore no evidence that any of the factors, individually or
combined, predicts strawberry yield.
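The coefficient tables report one t-test per predictor; the significance of a model as a whole is usually judged with the overall F-test. A minimal sketch of extracting that p-value follows. The helper name `overall_p` is hypothetical, and the data are simulated with a response unrelated to the predictors, so the printed value does not describe the strawberry data.

```r
# Hypothetical helper: p-value of the overall F-test for an lm fit
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

# Illustrative only: simulated data, not the strawberry measurements.
set.seed(4)
demo <- data.frame(pH = rnorm(200, 6.5, 0.4), Light_Hours = rnorm(200, 8, 1))
demo$Yield <- rnorm(200, 20, 1.5)  # unrelated to the predictors by construction

fit <- lm(Yield ~ pH + Light_Hours, data = demo)
p_overall <- overall_p(fit)
cat("Overall F-test p-value:", round(p_overall, 3), "\n")
```

Applied to the fitted models in this report, for example `overall_p(reduced.model2)`, this would give a single p-value per model to accompany the per-coefficient results.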
Conclusions
Models with significant predictors may be attainable under different
sampling circumstances. For example, a designed experiment that sets the
various factors at different, controlled levels may reveal differences
in plant yield and provide more conclusive evidence.
