1 Introduction

This report will compare several multiple linear regression models to determine which best predicts the yield of a strawberry plant based on several soil and environmental factors.

2 Data Set

The data set of interest contains detailed information on various environmental and soil factors that influence plant growth. It includes measurements of soil contents, environmental factors, and the plant yield. Information is provided for many kinds of fruit, but for this analysis, only strawberry data will be used. Data was recorded throughout the plant growth period and during harvest.

The data set was acquired from Kaggle courtesy of user Masha Sanaei. The link for the dataset is https://www.kaggle.com/datasets/snmahsa/soil-nutrients. The dataset was uploaded to a GitHub repository and can be accessed at https://github.com/ncbrechbill/STA321/blob/main/STA321/Soil%20Nutrients.csv.

  • Name; character: The plant name. The data set was filtered to keep only the Strawberry rows
  • Temperature (\(X_1\)); numeric: The average recorded air temperature in degrees Celsius
  • Rainfall (\(X_2\)); numeric: The total recorded rainfall in centimeters
  • pH (\(X_3\)); numeric: The average recorded soil pH
  • Nitrogen (\(X_4\)); numeric: Average soil nitrogen content in parts per million (ppm)
  • Phosphorus (\(X_5\)); numeric: Average soil phosphorus content in ppm
  • Potassium (\(X_6\)); numeric: Average soil potassium content in ppm
  • Light_Hours (\(X_7\)); numeric: Average hours of sunlight covering the plant each day
  • Yield (\(Y\)); numeric: The mass of harvested strawberries in grams

This data set contains 700 strawberry plant observations, all with complete data and no missing values. With 700 observations and at most seven predictors, the sample is large enough to fit every model considered in this report. The data set also includes categorical variables that group ranges of these measurements together. However, these will not be used in this analysis.
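The filtering step described above can be sketched as follows. This is a minimal sketch rather than the code used for this report: the helper name `load_strawberry` is hypothetical, and it assumes the CSV uses exactly the column names listed above, with `Name` equal to "Strawberry" for the rows of interest.

```r
# Hypothetical helper: read the soil nutrients CSV and keep only the
# strawberry rows and the columns used in this analysis.
# Assumption: column names match the variable list above.
load_strawberry <- function(path) {
  soil <- read.csv(path, stringsAsFactors = FALSE)
  keep <- c("Temperature", "Rainfall", "pH", "Nitrogen",
            "Phosphorus", "Potassium", "Light_Hours", "Yield")
  strawberry <- soil[soil$Name == "Strawberry", keep]
  # Stop early rather than silently carrying incomplete rows into the models
  stopifnot(!anyNA(strawberry))
  rownames(strawberry) <- NULL
  strawberry
}
```

With a local copy of the file, the call would look like `strawberry <- load_strawberry("Soil Nutrients.csv")`; the local file name is an assumption.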

3 Exploratory Data Analysis

The pairwise scatterplots, correlation coefficients, and density estimates were examined for all variables. Each variable appears approximately normally distributed, and none of the pairwise correlation coefficients is large. This, alongside previous research, indicates that no single variable predicts the plant yield on its own. We aim to determine whether any combination of factors can predict yield to any degree through multiple linear regression models.
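The correlations with yield can also be tabulated numerically. The following is a minimal sketch, assuming a data frame whose columns are all numeric and include `Yield`; the helper name `yield_correlations` is hypothetical, and the demo data frame is simulated, not the strawberry data.

```r
# Hypothetical helper: Pearson correlation of each predictor with the
# response, sorted from largest to smallest absolute value.
yield_correlations <- function(data, response = "Yield") {
  predictors <- setdiff(names(data), response)
  r <- sapply(predictors, function(v) cor(data[[v]], data[[response]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Simulated stand-in data (NOT the strawberry measurements)
set.seed(1)
demo <- data.frame(pH = rnorm(50, 6.5, 0.3),
                   Light_Hours = rnorm(50, 8, 1))
demo$Yield <- rnorm(50, 20, 1)

demo_r <- yield_correlations(demo)
print(demo_r)
```

On the real data, `yield_correlations(strawberry)` would give one coefficient per predictor, and base R's `pairs(strawberry)` draws the matching scatterplot matrix.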

4 Multiple Linear Regression

Let \(\{x_1, x_2, \cdots, x_k \}\) be \(k\) explanatory variables and \(y\) be the response variable. The general form of the multiple linear regression model is defined as

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon. \]

This is a special form in that \(y\) is linear in both the parameters and the predictors. Linear regression in general only requires \(y\) to be linear in the parameters, not in the predictors, since the predictor values are observed in the data and can be transformed before fitting.
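To illustrate the distinction, a model with a squared predictor is still a linear regression model, because it remains linear in the coefficients:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon. \]

The notation in the rest of this paragraph is introduced here for illustration and is not used elsewhere in the report. Stacking \(n\) observations gives the matrix form \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \(\mathbf{X}\) is the \(n \times (k+1)\) matrix whose first column is all ones and whose remaining columns hold the predictor values. Provided \(\mathbf{X}^\top \mathbf{X}\) is invertible, the ordinary least squares estimate is

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}. \]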

A multiple linear regression model was first fitted using all candidate predictors. None of the factors is statistically significant at any conventional level (all p-values exceed 0.27). The next model will be reduced by removing the least significant factors. The residual plots were examined to check the assumptions of the parametric model, and no violations were apparent, so no transformation of the variables is justified.

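library(pander)        # pander() prints model summaries as formatted tables
library(ggResidpanel)  # resid_panel() draws the residual diagnostic plots

# Keep only the seven numeric predictors and the response
strawberry <- dplyr::select(strawberry, Temperature, Rainfall, pH, Light_Hours, Nitrogen, Phosphorus, Potassium, Yield)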

full.model <- lm(Yield ~ ., data = strawberry)
pander(full.model)
Fitting linear model: Yield ~ .
              Estimate     Std. Error  t value  Pr(>|t|)
(Intercept)   20.46        2.435       8.402    2.48e-16
Temperature   -0.006367    0.01052     -0.6051  0.5453
Rainfall      -0.0003384   0.0004649   -0.728   0.4669
pH            0.1901       0.1737      1.095    0.274
Light_Hours   -0.07511     0.07        -1.073   0.2836
Nitrogen      -0.001444    0.002753    -0.5246  0.6001
Phosphorus    0.001231     0.007015    0.1755   0.8608
Potassium     -0.0008489   0.007013    -0.121   0.9037
resid_panel(full.model)

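The next model drops the three least significant factors: potassium, phosphorus, and nitrogen. Still, no factors are statistically significant. The residual plots are provided again to check the parametric assumptions, and no violations were apparent.

# All three removals must sit inside the formula. A comma before `- Potassium`
# ends the formula early, leaves Potassium in the model, and passes `-Potassium`
# to lm() as its next positional argument (`subset`).
# The output reproduced below came from such a call, which is why it still
# lists Potassium; rerunning the corrected call will change the table.
reduced.model1 <- lm(Yield ~ . - Nitrogen - Phosphorus - Potassium, data = strawberry)
pander(reduced.model1)
Fitting linear model: Yield ~ . - Nitrogen - Phosphorus
              Estimate     Std. Error  t value  Pr(>|t|)
(Intercept)   20.53        2.263       9.073    1.257e-18
Temperature   -0.006573    0.01064     -0.6179  0.5369
Rainfall      -0.000428    0.0004754   -0.9003  0.3683
pH            0.1662       0.1762      0.9435   0.3458
Light_Hours   -0.06902     0.07095     -0.9728  0.331
Potassium     -0.0009551   0.00715     -0.1336  0.8938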
resid_panel(reduced.model1)

The model was once again reduced, now only including two factors, pH and Light Hours. Still, neither factor was significant. The residual plots are provided again to check the parametric assumptions, all of which were valid.

reduced.model2 <- lm(Yield ~ pH + Light_Hours, data = strawberry)
pander(reduced.model2)
Fitting linear model: Yield ~ pH + Light_Hours
              Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)   19.79      1.39        14.24    1.418e-40
pH            0.1833     0.1728      1.061    0.289
Light_Hours   -0.07247   0.0697      -1.04    0.2988
resid_panel(reduced.model2)

5 Results Analysis

None of the tested multiple linear regression models explained the variance in plant yield at a statistically significant level. Reducing the model again would result in a previously analyzed simple linear regression model, which was not significant either. In our sampled population, the data therefore provide no evidence that any of the factors predicts strawberry yield, individually or in combination.
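One way to formalize a comparison like this is a partial F-test between nested models, alongside adjusted R-squared. The following is a minimal sketch using base R's `anova()`. The data are simulated with the same column names, with arbitrary means and spreads, and `Yield` is generated independently of every predictor, so the printed numbers are not results from this report.

```r
# Simulated stand-in data (NOT the strawberry measurements):
# Yield is drawn independently of every predictor by construction.
set.seed(321)
n <- 200
sim <- data.frame(
  Temperature = rnorm(n, 22, 3),
  Rainfall    = rnorm(n, 900, 150),
  pH          = rnorm(n, 6.5, 0.4),
  Light_Hours = rnorm(n, 8, 1),
  Nitrogen    = rnorm(n, 150, 30),
  Phosphorus  = rnorm(n, 60, 10),
  Potassium   = rnorm(n, 60, 10),
  Yield       = rnorm(n, 20, 1.5)
)

null    <- lm(Yield ~ 1, data = sim)                 # intercept only
reduced <- lm(Yield ~ pH + Light_Hours, data = sim)  # two predictors
full    <- lm(Yield ~ ., data = sim)                 # all seven predictors

# Partial F-tests: does each larger model explain significantly more
# variance than the one nested inside it?
comparison <- anova(null, reduced, full)
print(comparison)

# Adjusted R-squared penalizes extra predictors, so it can be compared
# across models of different size
adj_r2 <- c(reduced = summary(reduced)$adj.r.squared,
            full    = summary(full)$adj.r.squared)
print(adj_r2)
```

Applied to the models in this report, the equivalent call would be `anova(reduced.model2, full.model)`, which is valid because both are fitted to the same rows of `strawberry`.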

6 Conclusions

Models with statistically significant predictors might be obtained under different sampling circumstances. For example, a designed experiment that deliberately sets each factor at several levels could reveal differences in plant yield that this observational sample does not, and would provide more conclusive evidence.

