Friedman (1991) introduced several benchmark data sets created by simulation. One of these simulations used the following nonlinear equation to create data:
\(y = 10 \sin(\pi x_{1}x_{2}) + 20(x_{3} - 0.5)^2 + 10x_{4} + 5x_{5} + N(0, \sigma^2)\)
where the \(x\) values are random variables uniformly distributed on [0, 1] (the simulation also creates 5 other non-informative variables). The package mlbench contains a function called mlbench.friedman1 that simulates these data:
library(mlbench)
library(caret)
set.seed(200)
trainingData <- mlbench.friedman1(200, sd = 1)
## We convert the 'x' data from a matrix to a data frame
## One reason is that this will give the columns names.
trainingData$x <- data.frame(trainingData$x)
## Look at the data using
featurePlot(trainingData$x, trainingData$y)
## or other methods.
## This creates a list with a vector 'y' and a matrix
## of predictors 'x'. Also simulate a large test set to
## estimate the true error rate with good precision:
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
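To make the simulation formula concrete, here is a minimal base-R sketch of the same data-generating process. The function name `simulateFriedman1` is hypothetical and is not part of mlbench. It is not guaranteed to reproduce the exact values that `mlbench.friedman1` returns for a given seed; it only draws from the same kind of model.

```r
# Hypothetical helper (not part of mlbench): simulate the Friedman (1991)
# benchmark with 5 informative and 5 non-informative uniform predictors.
simulateFriedman1 <- function(n, sd = 1) {
  # Ten independent predictors, each uniform on [0, 1]
  x <- matrix(runif(n * 10), ncol = 10)
  colnames(x) <- paste0("X", 1:10)

  # Only the first five columns enter the response
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]

  # Add Gaussian noise with standard deviation 'sd'
  y <- signal + rnorm(n, sd = sd)
  list(x = as.data.frame(x), y = y)
}

set.seed(200)
sim <- simulateFriedman1(200, sd = 1)
dim(sim$x)
```

Columns X6 to X10 never appear in the response, which is why a model that performs feature selection should ideally ignore them.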
knnModel <- train(x = trainingData$x,
y = trainingData$y,
method = "knn",
preProc = c("center", "scale"),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
knnPred <- predict(knnModel, newdata = testData$x)
## The function 'postResample' can be used to get the test set
## performance values.
knnPerformance <- postResample(pred = knnPred, obs = testData$y)
knnPerformance
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
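The three values returned by `postResample` can be reproduced by hand. The sketch below is a base-R illustration, assuming the default behaviour in which R-squared is the squared correlation between predictions and observations; `regressionMetrics` and the small example vectors are hypothetical.

```r
# Hypothetical helper mirroring the three metrics reported above.
regressionMetrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),   # root mean squared error
    Rsquared = cor(pred, obs)^2,         # squared Pearson correlation
    MAE = mean(abs(pred - obs)))         # mean absolute error
}

# Made-up observations and predictions for illustration only
obs <- c(1, 2, 3, 4)
pred <- c(1.5, 2, 2.5, 4)
metrics <- regressionMetrics(pred, obs)
metrics
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE does, which is why the two can rank models differently.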
Model 1: MARS Model
set.seed(50)
# Define and tune the MARS model.
marsGrid <- expand.grid(.degree = 1:2, .nprune = 2:15)
marsModel <- train(x = trainingData$x,
y = trainingData$y,
method = 'earth',
tuneGrid = marsGrid,  # tuneLength is not needed when tuneGrid is supplied
preProc = c('center', 'scale'))
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.404347 0.2245451 3.590453
## 1 3 3.746178 0.4394258 3.031879
## 1 4 2.810191 0.6837637 2.267477
## 1 5 2.546676 0.7381424 2.038306
## 1 6 2.457201 0.7571937 1.956201
## 1 7 1.982313 0.8395771 1.561345
## 1 8 1.853756 0.8616590 1.462258
## 1 9 1.812192 0.8676035 1.419271
## 1 10 1.764617 0.8742948 1.397465
## 1 11 1.759589 0.8748903 1.383201
## 1 12 1.770417 0.8729590 1.382469
## 1 13 1.781220 0.8710731 1.388037
## 1 14 1.799692 0.8682999 1.405146
## 1 15 1.807710 0.8668916 1.407880
## 2 2 4.401107 0.2279086 3.575905
## 2 3 3.733032 0.4437917 3.015929
## 2 4 2.853802 0.6737997 2.299631
## 2 5 2.578260 0.7313585 2.052106
## 2 6 2.438420 0.7625103 1.916612
## 2 7 2.085843 0.8225249 1.632268
## 2 8 1.925515 0.8499074 1.486561
## 2 9 1.797766 0.8676310 1.393979
## 2 10 1.628736 0.8920037 1.278338
## 2 11 1.530777 0.9053842 1.200624
## 2 12 1.513383 0.9065194 1.179791
## 2 13 1.482438 0.9110205 1.147230
## 2 14 1.468260 0.9123245 1.134669
## 2 15 1.473961 0.9113928 1.143106
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 14 and degree = 2.
# Run predict() and postResample() on the model.
marsPred <- predict(marsModel, newdata = testData$x)
marsPerformance <- postResample(pred = marsPred, obs = testData$y)
marsPerformance
## RMSE Rsquared MAE
## 1.2779993 0.9338365 1.0147070
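In the tuning grid above, `degree` is the maximum interaction degree of the MARS terms and `nprune` is the maximum number of terms kept after pruning. The following base-R sketch illustrates what first- and second-degree terms look like. The `hinge` helper, the knot locations and the input values are illustrative assumptions, not values taken from the fitted model.

```r
# Hinge function used as the MARS building block: max(0, x - knot).
hinge <- function(x, knot) pmax(0, x - knot)

# Made-up predictor values for illustration only
x1 <- c(0.1, 0.4, 0.7)
x2 <- c(0.2, 0.5, 0.9)

# A degree-1 term depends on a single predictor ...
term1 <- hinge(x1, knot = 0.3)

# ... while a degree-2 term is a product of hinges on two predictors,
# so it is non-zero only where both hinges are active.
term2 <- hinge(x1, knot = 0.3) * hinge(x2, knot = 0.4)

cbind(x1, x2, term1, term2)
```

Product terms of this kind are what allow a degree-2 model to approximate an interaction such as the \(\sin(\pi x_{1}x_{2})\) component of the simulation, which is consistent with degree = 2 being selected above.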
Model 2: SVM Model
set.seed(50)
# Define and tune the SVM model.
svmModel <- train(x = trainingData$x,
y = trainingData$y,
method = 'svmRadial',
preProc = c('center', 'scale'),
tuneLength = 14,
trControl = trainControl(method = 'cv'))
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.485542 0.8040069 1.988122
## 0.50 2.224708 0.8215673 1.762504
## 1.00 2.033732 0.8438484 1.600925
## 2.00 1.906322 0.8578861 1.499489
## 4.00 1.811983 0.8692457 1.435218
## 8.00 1.775768 0.8736945 1.413409
## 16.00 1.768571 0.8754132 1.410033
## 32.00 1.769243 0.8754177 1.410352
## 64.00 1.769243 0.8754177 1.410352
## 128.00 1.769243 0.8754177 1.410352
## 256.00 1.769243 0.8754177 1.410352
## 512.00 1.769243 0.8754177 1.410352
## 1024.00 1.769243 0.8754177 1.410352
## 2048.00 1.769243 0.8754177 1.410352
##
## Tuning parameter 'sigma' was held constant at a value of 0.05909722
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.05909722 and C = 16.
# Run predict() and postResample().
svmPred <- predict(svmModel, newdata = testData$x)
svmPerformance <- postResample(pred = svmPred, obs = testData$y)
svmPerformance
## RMSE Rsquared MAE
## 2.062750 0.827448 1.567249
Model 3: Neural Network Model
set.seed(50)
# Define and tune the Neural Network model.
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)
nnetModel <- train(x = trainingData$x,
y = trainingData$y,
method = 'avNNet',
preProc = c('center', 'scale'),
tuneGrid = nnetGrid,
trControl = trainControl(method = 'cv'),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(trainingData$x) + 1) + 10 + 1,
maxit = 500)
nnetModel
## Model Averaged Neural Network
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 2.412253 0.7684991 1.921744
## 0.00 2 2.478253 0.7571375 1.968920
## 0.00 3 2.073243 0.8297694 1.661660
## 0.00 4 1.896207 0.8535420 1.481233
## 0.00 5 1.967575 0.8499912 1.535623
## 0.00 6 3.155466 0.6667462 2.134195
## 0.00 7 4.644263 0.5211336 2.793921
## 0.00 8 4.692602 0.5686019 3.052220
## 0.00 9 5.478626 0.5330334 3.055039
## 0.00 10 3.875366 0.6236835 2.483791
## 0.01 1 2.388403 0.7729080 1.880208
## 0.01 2 2.458497 0.7642903 1.919754
## 0.01 3 2.045352 0.8326055 1.586264
## 0.01 4 2.013498 0.8409427 1.613255
## 0.01 5 2.013716 0.8434760 1.582315
## 0.01 6 2.141465 0.8135164 1.701146
## 0.01 7 2.376010 0.7806695 1.857197
## 0.01 8 2.536101 0.7604519 2.023233
## 0.01 9 2.387242 0.7848141 1.957367
## 0.01 10 2.332668 0.7841584 1.842190
## 0.10 1 2.399640 0.7708286 1.889741
## 0.10 2 2.492352 0.7504716 1.967391
## 0.10 3 2.113141 0.8259432 1.660050
## 0.10 4 2.067737 0.8290000 1.659741
## 0.10 5 2.011310 0.8384875 1.625327
## 0.10 6 2.208217 0.8102918 1.773579
## 0.10 7 2.169940 0.8173408 1.723025
## 0.10 8 2.206817 0.8088873 1.728806
## 0.10 9 2.323232 0.7906531 1.830399
## 0.10 10 2.241793 0.7980504 1.754629
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0 and bag = FALSE.
# Run predict() and postResample().
nnetPred <- predict(nnetModel, newdata = testData$x)
nnetPerformance <- postResample(pred = nnetPred, obs = testData$y)
nnetPerformance
## RMSE Rsquared MAE
## 2.0073619 0.8399851 1.5368340
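The `MaxNWts` expression in the `train` call above is the number of weights in the largest network in the grid. As a worked check, the sketch below counts the parameters of a network with one hidden layer, one output, bias terms and no skip-layer connections; `countWeights` is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: number of weights for one hidden layer and one output.
# Each hidden unit has one weight per predictor plus a bias, and the output
# unit has one weight per hidden unit plus a bias.
countWeights <- function(nPredictors, nHidden) {
  nHidden * (nPredictors + 1) + nHidden + 1
}

# Largest network in the grid: 10 predictors, 10 hidden units
countWeights(nPredictors = 10, nHidden = 10)
```

Setting `MaxNWts` to this count ensures that the largest candidate network is not rejected for exceeding the weight limit.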
Model Performance Comparison
rbind('MARS' = marsPerformance, 'SVM' = svmPerformance, 'Neural Network' = nnetPerformance, 'KNN' = knnPerformance) %>%
kable() %>% kable_styling()
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| MARS | 1.277999 | 0.9338365 | 1.014707 |
| SVM | 2.062750 | 0.8274480 | 1.567249 |
| Neural Network | 2.007362 | 0.8399851 | 1.536834 |
| KNN | 3.204060 | 0.6819919 | 2.568346 |
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.40
## X2 49.00
## X5 15.72
## X3 0.00
Answer:
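From the output above, MARS selected the informative predictors X1, X2, X4, and X5. X3 has an overall importance of 0.00, which suggests that it contributes little or nothing to the final model even though it is informative in the simulation. None of the non-informative predictors (X6 to X10) appear in the importance list.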
Exercise 6.3 describes data for a chemical manufacturing process. Use the same data imputation, data splitting, and pre-processing steps as before and train several nonlinear regression models.
# Load the ChemicalManufacturingProcess data set provided by the AppliedPredictiveModeling package.
library(AppliedPredictiveModeling)
data(ChemicalManufacturingProcess)
1. Impute missing values using KNN.
# Fit a KNN imputation model; caret's 'knnImpute' method also centers and scales the data.
cmpImputed <- preProcess(ChemicalManufacturingProcess, 'knnImpute')
2. Apply the imputation to the data.
# Apply the fitted pre-processing to fill in the missing values.
chemicalMPData <- predict(cmpImputed, ChemicalManufacturingProcess)
3. Split the data into training and test sets.
# Draw an 80% training partition, stratified on Yield.
trainingData <- createDataPartition(ChemicalManufacturingProcess$Yield, p = 0.8, list = FALSE)
# Note: chemicalMPData still includes the (standardized) Yield column, so the
# response is carried into the predictor sets used by the models below.
xTrainData <- chemicalMPData[trainingData, ]
yTrainData <- ChemicalManufacturingProcess$Yield[trainingData]
# The remaining rows form the test set.
xTestData <- chemicalMPData[-trainingData, ]
yTestData <- ChemicalManufacturingProcess$Yield[-trainingData]
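The model printouts that follow report 58 predictors, and the variable-importance tables list `Yield` itself, which indicates that the response column is still present in `xTrainData`. One way to avoid this is to separate the response before building the predictor sets and to set the seed before drawing the partition. The sketch below shows the idea on a small made-up data frame. The `toy` data, its column names and the plain `sample()` split are all illustrative assumptions, not the code used for the results in this document.

```r
# Toy stand-in for the real data: one response and two predictors.
toy <- data.frame(Yield = c(38, 42, 41, 40, 44, 39, 43, 37, 45, 36),
                  ProcessA = 1:10,
                  MaterialB = c(5, 3, 6, 2, 7, 4, 8, 1, 9, 0))

# Keep the response out of the predictor set.
predictorNames <- setdiff(names(toy), "Yield")

# Seed before the random split so that it can be reproduced.
set.seed(50)
trainRows <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

# Predictors and response are stored separately for each partition.
xTrain <- toy[trainRows, predictorNames]
yTrain <- toy$Yield[trainRows]
xTest <- toy[-trainRows, predictorNames]
yTest <- toy$Yield[-trainRows]

names(xTrain)
```

Here `sample()` is only a simple stand-in for `createDataPartition`, which additionally stratifies the split on the outcome.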
set.seed(50)
# Define and tune a PLS model.
plsModel <- train(x = xTrainData,
y = yTrainData,
method = 'pls',
metric = 'Rsquared',
tuneLength = 20,
trControl = trainControl(method = 'cv'),
preProcess = c('center', 'scale'))
# Print out the results.
plsModel
## Partial Least Squares
##
## 144 samples
## 58 predictor
##
## Pre-processing: centered (58), scaled (58)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 129, 130, 129, 131, 131, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.33574102 0.5280627 1.03699637
## 2 1.55514570 0.6253031 0.94935173
## 3 0.86332117 0.8111104 0.65667569
## 4 1.04296461 0.7662041 0.63454464
## 5 0.91339824 0.8086960 0.51672968
## 6 0.79162632 0.8453758 0.44255201
## 7 0.68159846 0.8717648 0.36204944
## 8 0.56873260 0.9100155 0.29161510
## 9 0.42468882 0.9447098 0.21310311
## 10 0.42815268 0.9437162 0.19383465
## 11 0.37270636 0.9524233 0.16135526
## 12 0.32895662 0.9588218 0.14359705
## 13 0.26436139 0.9698894 0.11528017
## 14 0.19059348 0.9820459 0.08601711
## 15 0.15370618 0.9850314 0.07005732
## 16 0.14405399 0.9873845 0.06228076
## 17 0.13376304 0.9904345 0.05565696
## 18 0.12491794 0.9920739 0.05081665
## 19 0.10329226 0.9945144 0.04253655
## 20 0.06676993 0.9970457 0.03046372
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 20.
# Run predict() and postResample() on the model.
plsPred <- predict(plsModel, newdata = xTestData)
plsPerformance <- postResample(pred = plsPred, obs = yTestData)
plsPerformance
## RMSE Rsquared MAE
## 0.01695540 0.99991846 0.01312557
set.seed(50)
# Train a KNN model.
knnModel <- train(x = xTrainData,
y = yTrainData,
method = 'knn',
preProc = c('center', 'scale'),
tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 144 samples
## 58 predictor
##
## Pre-processing: centered (58), scaled (58)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 144, 144, 144, 144, 144, 144, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.313240 0.4976614 1.036873
## 7 1.284785 0.5237572 1.012093
## 9 1.284953 0.5348281 1.017324
## 11 1.283164 0.5439781 1.025511
## 13 1.272175 0.5600633 1.018044
## 15 1.282842 0.5582924 1.026190
## 17 1.286213 0.5646184 1.030484
## 19 1.299337 0.5596987 1.041285
## 21 1.310624 0.5571506 1.054980
## 23 1.324173 0.5526370 1.063355
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 13.
# Run predict() and postResample().
knnPred <- predict(knnModel, newdata = xTestData)
knnPerformance <- postResample(pred = knnPred, obs = yTestData)
knnPerformance
## RMSE Rsquared MAE
## 1.2430863 0.6463763 1.0065625
set.seed(50)
# Define and tune the SVM model.
svmModel <- train(x = xTrainData,
y = yTrainData,
method = 'svmRadial',
preProc = c('center', 'scale'),
tuneLength = 14,
trControl = trainControl(method = 'cv'))
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 144 samples
## 58 predictor
##
## Pre-processing: centered (58), scaled (58)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 129, 130, 129, 131, 131, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.1701760 0.6858684 0.9520916
## 0.50 0.9344061 0.7987497 0.7447998
## 1.00 0.7425587 0.8749638 0.5722410
## 2.00 0.6482239 0.9040551 0.5022892
## 4.00 0.6344307 0.9068769 0.4892183
## 8.00 0.6344307 0.9068769 0.4892183
## 16.00 0.6344307 0.9068769 0.4892183
## 32.00 0.6344307 0.9068769 0.4892183
## 64.00 0.6344307 0.9068769 0.4892183
## 128.00 0.6344307 0.9068769 0.4892183
## 256.00 0.6344307 0.9068769 0.4892183
## 512.00 0.6344307 0.9068769 0.4892183
## 1024.00 0.6344307 0.9068769 0.4892183
## 2048.00 0.6344307 0.9068769 0.4892183
##
## Tuning parameter 'sigma' was held constant at a value of 0.01299667
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01299667 and C = 4.
# Run predict() and postResample().
svmPred <- predict(svmModel, newdata = xTestData)
svmPerformance <- postResample(pred = svmPred, obs = yTestData)
svmPerformance
## RMSE Rsquared MAE
## 0.5865390 0.9263753 0.4751181
set.seed(50)
# Define and tune the Neural Network model.
nnetGrid <- expand.grid(.decay = c(0, 0.01, .1), .size = c(1:10), .bag = FALSE)
nnetModel <- train(x = xTrainData,
y = yTrainData,
method = 'avNNet',
preProc = c('center', 'scale'),
tuneGrid = nnetGrid,
trControl = trainControl(method = 'cv'),
linout = TRUE,
trace = FALSE,
MaxNWts = 10 * (ncol(xTrainData) + 1) + 10 + 1,
maxit = 500)
nnetModel
## Model Averaged Neural Network
##
## 144 samples
## 58 predictor
##
## Pre-processing: centered (58), scaled (58)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 129, 129, 130, 129, 131, 131, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 1.4531348 0.4533271 1.1823720
## 0.00 2 1.3162565 0.5395748 1.0778069
## 0.00 3 1.2198049 0.6526343 0.9471283
## 0.00 4 1.3378670 0.6284613 1.0905312
## 0.00 5 1.5719641 0.5865243 1.2495912
## 0.00 6 1.6265453 0.5154893 1.2914693
## 0.00 7 2.4829943 0.4833184 1.8647638
## 0.00 8 3.2255752 0.3735543 2.3738178
## 0.00 9 5.8292389 0.2897651 3.7393765
## 0.00 10 6.2528966 0.2918346 4.2571667
## 0.01 1 0.3281779 0.9561907 0.1631346
## 0.01 2 0.4054010 0.9454843 0.2604388
## 0.01 3 0.8581809 0.8090849 0.5085160
## 0.01 4 1.0382264 0.7905566 0.6555289
## 0.01 5 1.2616141 0.6974517 0.8464117
## 0.01 6 1.0336227 0.7570536 0.7710083
## 0.01 7 1.0238360 0.7628288 0.7573270
## 0.01 8 1.1738440 0.6735672 0.9121379
## 0.01 9 1.6752530 0.5879600 1.1506262
## 0.01 10 1.7321663 0.5827655 1.2045439
## 0.10 1 0.6123655 0.8988199 0.3389761
## 0.10 2 1.1331552 0.7436390 0.6257841
## 0.10 3 1.2289181 0.7491524 0.6729656
## 0.10 4 1.3192214 0.7293678 0.7547688
## 0.10 5 1.6319327 0.6785774 0.8655697
## 0.10 6 1.5203251 0.6820915 0.8815065
## 0.10 7 1.5876946 0.6720674 0.9679185
## 0.10 8 1.3918967 0.6790485 0.8999611
## 0.10 9 1.3915676 0.6176788 0.9516209
## 0.10 10 1.1999465 0.6977182 0.8571284
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1, decay = 0.01 and bag = FALSE.
# Run predict() and postResample().
nnetPred <- predict(nnetModel, newdata = xTestData)
nnetPerformance <- postResample(pred = nnetPred, obs = yTestData)
nnetPerformance
## RMSE Rsquared MAE
## 0.07730387 0.99839211 0.06183762
(a) Which nonlinear regression model gives the optimal resampling and test set performance?
rbind('PLS (Linear Model)' = plsPerformance, 'SVM' = svmPerformance, 'Neural Network' = nnetPerformance, 'KNN' = knnPerformance) %>%
kable() %>% kable_styling()
| | RMSE | Rsquared | MAE |
|---|---|---|---|
| PLS (Linear Model) | 0.0169554 | 0.9999185 | 0.0131256 |
| SVM | 0.5865390 | 0.9263753 | 0.4751181 |
| Neural Network | 0.0773039 | 0.9983921 | 0.0618376 |
| KNN | 1.2430863 | 0.6463763 | 1.0065625 |
Answer:
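Among the nonlinear models, the averaged neural network has the lowest RMSE and the highest R^2 both in resampling (RMSE 0.328, R^2 0.956 at size = 1, decay = 0.01) and on the test set (RMSE 0.077, R^2 0.998). It is ahead of the SVM (resampled RMSE 0.634, test RMSE 0.587) and KNN (resampled RMSE 1.272, test RMSE 1.243). These near-perfect scores, like those of the PLS model, should be read with caution: the variable-importance output in part (b) lists Yield among the predictors, so the response was available to the models. Parts (b) and (c) examine the SVM model.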
(b) Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list? How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
svmImportantPredictors <- varImp(svmModel)
svmImportantPredictors
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 58)
##
## Overall
## Yield 100.00
## ManufacturingProcess32 38.78
## BiologicalMaterial06 34.33
## ManufacturingProcess13 32.88
## BiologicalMaterial03 27.23
## ManufacturingProcess17 27.00
## BiologicalMaterial02 26.83
## ManufacturingProcess36 26.58
## BiologicalMaterial12 25.98
## ManufacturingProcess31 25.05
## ManufacturingProcess09 24.97
## ManufacturingProcess02 20.54
## ManufacturingProcess33 20.13
## BiologicalMaterial04 19.17
## ManufacturingProcess06 18.50
## ManufacturingProcess29 18.49
## ManufacturingProcess11 17.51
## BiologicalMaterial11 17.05
## BiologicalMaterial08 16.83
## BiologicalMaterial01 16.18
Answer:
B1 Which predictors are most important in the optimal nonlinear regression model? Do either the biological or process variables dominate the list?
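The most important predictors for the SVM model are shown above. Yield itself ranks first because the response column was left among the predictors. Of the remaining 19 variables shown, 11 are ManufacturingProcess predictors and 8 are BiologicalMaterial predictors, so the process variables hold a modest majority of the list.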
B2 How do the top ten important predictors compare to the top ten predictors from the optimal linear model?
varImp(plsModel)
## pls variable importance
##
## only 20 most important variables shown (out of 58)
##
## Overall
## Yield 100.00
## ManufacturingProcess32 40.43
## ManufacturingProcess17 37.81
## ManufacturingProcess13 35.58
## ManufacturingProcess09 34.20
## ManufacturingProcess36 31.78
## BiologicalMaterial02 27.78
## BiologicalMaterial06 26.54
## BiologicalMaterial08 26.47
## BiologicalMaterial11 25.60
## BiologicalMaterial12 25.23
## ManufacturingProcess33 25.13
## ManufacturingProcess11 24.50
## BiologicalMaterial01 23.96
## ManufacturingProcess12 23.29
## ManufacturingProcess06 23.29
## BiologicalMaterial03 23.22
## ManufacturingProcess28 23.00
## BiologicalMaterial04 22.78
## ManufacturingProcess04 22.02
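The top 10 predictors of the nonlinear SVM model are very similar to the top 10 predictors of the linear PLS model: the two lists share seven entries (Yield, ManufacturingProcess32, ManufacturingProcess13, ManufacturingProcess17, ManufacturingProcess36, BiologicalMaterial02 and BiologicalMaterial06). BiologicalMaterial03, BiologicalMaterial12 and ManufacturingProcess31 appear only in the SVM top ten, while ManufacturingProcess09, BiologicalMaterial08 and BiologicalMaterial11 appear only in the PLS top ten. In both lists, ManufacturingProcess32 is the highest-ranked predictor after Yield.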
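The comparison of the two top-ten lists can be checked with base-R set operations. The vectors below are copied by hand from the two importance tables above; the variable names `svmTop` and `plsTop` are introduced here for illustration only.

```r
# Top ten entries of the SVM importance table (copied from the output above)
svmTop <- c("Yield", "ManufacturingProcess32", "BiologicalMaterial06",
            "ManufacturingProcess13", "BiologicalMaterial03",
            "ManufacturingProcess17", "BiologicalMaterial02",
            "ManufacturingProcess36", "BiologicalMaterial12",
            "ManufacturingProcess31")

# Top ten entries of the PLS importance table (copied from the output above)
plsTop <- c("Yield", "ManufacturingProcess32", "ManufacturingProcess17",
            "ManufacturingProcess13", "ManufacturingProcess09",
            "ManufacturingProcess36", "BiologicalMaterial02",
            "BiologicalMaterial06", "BiologicalMaterial08",
            "BiologicalMaterial11")

shared <- intersect(svmTop, plsTop)   # entries in both lists
svmOnly <- setdiff(svmTop, plsTop)    # entries unique to the SVM list
plsOnly <- setdiff(plsTop, svmTop)    # entries unique to the PLS list

length(shared)
svmOnly
plsOnly
```

The entries in `svmOnly` are the predictors that part (c) asks about, since they are the ones unique to the nonlinear model's list.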
(c) Explore the relationships between the top predictors and the response for the predictors that are unique to the optimal nonlinear regression model. Do these plots reveal intuition about the biological or process predictors and their relationship with yield?
Yield <- which(colnames(chemicalMPData) == 'Yield')
SVMTopTenPredictors <- head(rownames(svmImportantPredictors$importance)[order(-svmImportantPredictors$importance$Overall)], 10)
as.data.frame(SVMTopTenPredictors)
## SVMTopTenPredictors
## 1 Yield
## 2 ManufacturingProcess32
## 3 BiologicalMaterial06
## 4 ManufacturingProcess13
## 5 BiologicalMaterial03
## 6 ManufacturingProcess17
## 7 BiologicalMaterial02
## 8 ManufacturingProcess36
## 9 BiologicalMaterial12
## 10 ManufacturingProcess31
Y <- chemicalMPData[,Yield]
X <- chemicalMPData[,SVMTopTenPredictors]
colnames(X) <- gsub('(Process|Material)', '', colnames(X))
featurePlot(x = X, y = Y, plot = 'scatter', type = c('p', 'smooth'), span = 0.5)
The plots above show the relationship between the response variable (Yield) and each of the top 10 predictor variables (one panel plots Yield against itself, because Yield is in the top-ten list). Most of the predictor variables appear to have a roughly linear relationship with the response variable. For example, there is a clear positive linear relationship between Yield and Biological03, whereas Manufacturing17 appears to have a negative relationship.