$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + N(0, \sigma^2)$$
where the x values are random variables uniformly distributed on [0, 1]. The simulation also creates 5 other non-informative variables. The package mlbench contains a function, mlbench.friedman1, that simulates these data.
## Loading required package: lattice
## Loading required package: ggplot2
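The chunk that created trainingData and testData is not shown in this output. As a rough sketch of what the formula above means, the following base-R function generates data of the same form: 10 uniform predictors, of which only the first five enter the response. The name simulate_friedman1 is hypothetical and not part of mlbench; in practice mlbench.friedman1(n, sd) would be used instead. The sample size of 200 matches the training size printed later, while the seed and noise level are assumptions.

```r
# Hypothetical helper mirroring the Friedman 1 benchmark formula.
# Returns a list with a data frame `x` (10 predictors) and a numeric vector `y`.
simulate_friedman1 <- function(n, sd = 1) {
  x <- matrix(runif(n * 10), nrow = n, ncol = 10)
  colnames(x) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * x[, 1] * x[, 2]) +
    20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] +
    5 * x[, 5]
  # X6..X10 are generated but never used: they are the non-informative predictors
  list(x = as.data.frame(x), y = signal + rnorm(n, mean = 0, sd = sd))
}

set.seed(200)  # assumed seed, for reproducibility only
simData <- simulate_friedman1(200, sd = 1)
str(simData$x)
```

Keeping the predictors in a data frame gives the columns names (X1 to X10), which is what makes the later variable-importance output readable.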
Tune several models on these data. Let's try a KNN model first.
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
knnModel
## k-Nearest Neighbors
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 3.466085 0.5121775 2.816838
## 7 3.349428 0.5452823 2.727410
## 9 3.264276 0.5785990 2.660026
## 11 3.214216 0.6024244 2.603767
## 13 3.196510 0.6176570 2.591935
## 15 3.184173 0.6305506 2.577482
## 17 3.183130 0.6425367 2.567787
## 19 3.198752 0.6483184 2.592683
## 21 3.188993 0.6611428 2.588787
## 23 3.200458 0.6638353 2.604529
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.
The KNN model selected k = 17, the value with the lowest resampled RMSE (3.183). Each prediction is therefore the average response of the 17 nearest training points. Next, we run the KNN model we just created on the test data.
knnPred <- predict(knnModel, newdata = testData$x)
postResample(pred = knnPred, obs = testData$y)
## RMSE Rsquared MAE
## 3.2040595 0.6819919 2.5683461
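To make concrete what k = 17 means, here is a minimal base-R sketch of k-nearest-neighbour regression: for one new point, find the k closest training rows by Euclidean distance and average their responses. knn_predict_one is a hypothetical helper for illustration only; caret's "knn" method does essentially this on the centered and scaled predictors.

```r
# Hypothetical helper: predict one new point as the mean of its k nearest neighbours.
knn_predict_one <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from new_x to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(d)[seq_len(k)]
  mean(train_y[nearest])
}

# Toy data (made up): three points close together and one far away
train_x <- data.frame(a = c(0, 1, 2, 10), b = c(0, 1, 2, 10))
train_y <- c(1, 2, 3, 100)
knn_predict_one(train_x, train_y, new_x = c(1, 1), k = 3)  # averages 2, 1 and 3
```

Because distances drive the neighbour choice, predictors on larger scales would dominate without the center and scale pre-processing used in the train() call.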
Let's try the SVM model.
#install.packages("kernlab")
svmModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "svmRadial",
                  tuneLength = 10,
                  preProc = c("center", "scale"))
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 2.545335 0.7804647 2.015121
## 0.50 2.319786 0.7965148 1.830009
## 1.00 2.188349 0.8119636 1.726027
## 2.00 2.103655 0.8241314 1.655842
## 4.00 2.066879 0.8294322 1.631051
## 8.00 2.052681 0.8313929 1.623550
## 16.00 2.049867 0.8318312 1.621820
## 32.00 2.049867 0.8318312 1.621820
## 64.00 2.049867 0.8318312 1.621820
## 128.00 2.049867 0.8318312 1.621820
##
## Tuning parameter 'sigma' was held constant at a value of 0.06802164
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.06802164 and C = 16.
svmModelpred <- predict(svmModel, newdata = testData$x)
postResample(pred = svmModelpred, obs = testData$y)
## RMSE Rsquared MAE
## 2.0864652 0.8236735 1.5854649
The SVM has lower test-set error than the KNN model (RMSE 2.09 vs. 3.20, MAE 1.59 vs. 2.57), and R-squared also improves (0.82 vs. 0.68).
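The three numbers reported by postResample can be reproduced by hand. The sketch below assumes caret's default behaviour, where Rsquared is the squared Pearson correlation between predictions and observations rather than 1 - SSE/SST; regression_metrics is a hypothetical helper and the data are made up.

```r
# Hypothetical helper reproducing the three postResample-style metrics.
regression_metrics <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared correlation (assumed default)
    MAE      = mean(abs(pred - obs)))       # mean absolute error
}

obs  <- c(3, 5, 7, 9)
pred <- c(2, 5, 8, 10)
m <- regression_metrics(pred, obs)
m
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently when a few predictions are far off.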
Let's try MARS and compare the accuracy.
marsGrid <- expand.grid(.degree = 1:2,
                        .nprune = 2:20)
marsModel <- train(x = trainingData$x,
                   y = trainingData$y,
                   method = "earth",
                   tuneGrid = marsGrid,
                   preProc = c("center", "scale"))
## Loading required package: earth
## Warning: package 'earth' was built under R version 4.0.3
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 4.0.3
## Loading required package: plotmo
## Warning: package 'plotmo' was built under R version 4.0.3
## Loading required package: plotrix
## Warning: package 'plotrix' was built under R version 4.0.3
## Loading required package: TeachingDemos
## Warning: package 'TeachingDemos' was built under R version 4.0.3
marsModel
## Multivariate Adaptive Regression Spline
##
## 200 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 200, 200, 200, 200, 200, 200, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 4.447386 0.2254125 3.620675
## 1 3 3.790305 0.4344625 3.058704
## 1 4 2.801182 0.6884819 2.233531
## 1 5 2.551283 0.7412626 2.051644
## 1 6 2.493135 0.7492201 1.986528
## 1 7 2.089713 0.8239588 1.645996
## 1 8 1.889475 0.8565881 1.484798
## 1 9 1.816053 0.8673608 1.420333
## 1 10 1.819611 0.8674028 1.417343
## 1 11 1.819783 0.8670556 1.415058
## 1 12 1.832487 0.8651613 1.426371
## 1 13 1.845943 0.8632112 1.436005
## 1 14 1.855353 0.8613778 1.452115
## 1 15 1.854557 0.8617322 1.452920
## 1 16 1.856173 0.8616879 1.455393
## 1 17 1.856989 0.8615480 1.456862
## 1 18 1.856989 0.8615480 1.456862
## 1 19 1.856989 0.8615480 1.456862
## 1 20 1.856989 0.8615480 1.456862
## 2 2 4.434592 0.2241213 3.616685
## 2 3 3.799538 0.4319047 3.064845
## 2 4 2.806374 0.6871266 2.237911
## 2 5 2.524002 0.7462965 2.023657
## 2 6 2.446243 0.7602514 1.931404
## 2 7 2.147529 0.8127597 1.682839
## 2 8 1.977186 0.8393569 1.557609
## 2 9 1.831267 0.8635192 1.428370
## 2 10 1.639428 0.8902850 1.280510
## 2 11 1.545708 0.9019039 1.213559
## 2 12 1.499558 0.9081641 1.171249
## 2 13 1.494111 0.9087340 1.161702
## 2 14 1.492700 0.9102980 1.160345
## 2 15 1.484444 0.9116520 1.153052
## 2 16 1.487065 0.9109633 1.151057
## 2 17 1.496021 0.9098876 1.156630
## 2 18 1.487296 0.9111035 1.150491
## 2 19 1.486280 0.9113126 1.149198
## 2 20 1.486280 0.9113126 1.149198
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 15 and degree = 2.
marsModelpred <- predict(marsModel, newdata = testData$x)
postResample(pred = marsModelpred, obs = testData$y)
## RMSE Rsquared MAE
## 1.1908806 0.9428866 0.9496858
Of the three models, MARS gives the lowest test-set error (RMSE 1.19, MAE 0.95), and its R-squared (0.94) is the closest to 1.
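Collecting the three test-set results printed above in one matrix makes the comparison easier to read. The values are copied (rounded to four decimals) from the postResample outputs; the object name testResults is just illustrative.

```r
# Test-set metrics for the simulated Friedman data, copied from the outputs above
testResults <- rbind(
  KNN  = c(RMSE = 3.2041, Rsquared = 0.6820, MAE = 2.5683),
  SVM  = c(RMSE = 2.0865, Rsquared = 0.8237, MAE = 1.5855),
  MARS = c(RMSE = 1.1909, Rsquared = 0.9429, MAE = 0.9497)
)

# Sort models from lowest to highest test RMSE
testResults[order(testResults[, "RMSE"]), ]

# Name of the model with the smallest test RMSE
bestModel <- rownames(testResults)[which.min(testResults[, "RMSE"])]
bestModel
```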
varImp(marsModel)
## earth variable importance
##
## Overall
## X1 100.00
## X4 75.31
## X2 48.86
## X5 15.61
## X3 0.00
# Load the chemical manufacturing data
#install.packages('AppliedPredictiveModeling')
library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.0.3
data("ChemicalManufacturingProcess")
Let's look at a sample of the data frame. There are 58 columns, and the target variable is Yield.
head(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
ncol(ChemicalManufacturingProcess)
## [1] 58
Let's do some preprocessing and clean the data. As part of that, we check for missing values.
There are 176 rows in this data set, of which 24 contain NAs. In total there are 106 NA occurrences in the data set.
Total number of observations:
nrow(ChemicalManufacturingProcess)
## [1] 176
Total number of NAs:
length(which(is.na(ChemicalManufacturingProcess)))
## [1] 106
Total number of rows with NAs:
length(which(!complete.cases(ChemicalManufacturingProcess)))
## [1] 24
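The same three counts can be obtained a little more directly with sum(), and colSums() adds a per-column breakdown. A small made-up data frame stands in for ChemicalManufacturingProcess here so the sketch runs without the package.

```r
# Toy stand-in for the real data (values are made up)
toy <- data.frame(
  Yield = c(38.0, 42.4, 42.0, 41.4),
  P1    = c(NA, 0.0, 0.0, 10.7),
  P2    = c(NA, NA, 3, 4)
)

nrow(toy)                   # number of observations
sum(is.na(toy))             # total number of NA cells
sum(!complete.cases(toy))   # rows with at least one NA
colSums(is.na(toy))         # NA count per column
mean(!complete.cases(toy))  # share of rows with at least one NA
```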
Let's look at a summary of the data.
summary(ChemicalManufacturingProcess)
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
## 1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
## Median :39.97 Median :6.305 Median :55.09 Median :67.22
## Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
## 3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
## Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
##
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## Min. : 9.38 Min. :13.24 Min. :40.60
## 1st Qu.:11.24 1st Qu.:17.23 1st Qu.:46.05
## Median :12.10 Median :18.49 Median :48.46
## Mean :12.35 Mean :18.60 Mean :48.91
## 3rd Qu.:13.22 3rd Qu.:19.90 3rd Qu.:51.34
## Max. :23.09 Max. :24.85 Max. :59.38
##
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## Min. :100.0 Min. :15.88 Min. :11.44
## 1st Qu.:100.0 1st Qu.:17.06 1st Qu.:12.60
## Median :100.0 Median :17.51 Median :12.84
## Mean :100.0 Mean :17.49 Mean :12.85
## 3rd Qu.:100.0 3rd Qu.:17.88 3rd Qu.:13.13
## Max. :100.8 Max. :19.14 Max. :14.08
##
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## Min. :1.770 Min. :135.8 Min. :18.35
## 1st Qu.:2.460 1st Qu.:143.8 1st Qu.:19.73
## Median :2.710 Median :146.1 Median :20.12
## Mean :2.801 Mean :147.0 Mean :20.20
## 3rd Qu.:2.990 3rd Qu.:149.6 3rd Qu.:20.75
## Max. :6.870 Max. :158.7 Max. :22.21
##
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## Min. : 0.00 Min. : 0.00 Min. :1.47
## 1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
## Median :11.40 Median :21.00 Median :1.54
## Mean :11.21 Mean :16.68 Mean :1.54
## 3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
## Max. :14.10 Max. :22.50 Max. :1.60
## NA's :1 NA's :3 NA's :15
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## Min. :911.0 Min. : 923.0 Min. :203.0
## 1st Qu.:928.0 1st Qu.: 986.8 1st Qu.:205.7
## Median :934.0 Median : 999.2 Median :206.8
## Mean :931.9 Mean :1001.7 Mean :207.4
## 3rd Qu.:936.0 3rd Qu.:1008.9 3rd Qu.:208.7
## Max. :946.0 Max. :1175.3 Max. :227.4
## NA's :1 NA's :1 NA's :2
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## Min. :177.0 Min. :177.0 Min. :38.89
## 1st Qu.:177.0 1st Qu.:177.0 1st Qu.:44.89
## Median :177.0 Median :178.0 Median :45.73
## Mean :177.5 Mean :177.6 Mean :45.66
## 3rd Qu.:178.0 3rd Qu.:178.0 3rd Qu.:46.52
## Max. :178.0 Max. :178.0 Max. :49.36
## NA's :1 NA's :1
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## Min. : 7.500 Min. : 7.500 Min. : 0.0
## 1st Qu.: 8.700 1st Qu.: 9.000 1st Qu.: 0.0
## Median : 9.100 Median : 9.400 Median : 0.0
## Mean : 9.179 Mean : 9.386 Mean : 857.8
## 3rd Qu.: 9.550 3rd Qu.: 9.900 3rd Qu.: 0.0
## Max. :11.600 Max. :11.500 Max. :4549.0
## NA's :9 NA's :10 NA's :1
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## Min. :32.10 Min. :4701 Min. :5904
## 1st Qu.:33.90 1st Qu.:4828 1st Qu.:6010
## Median :34.60 Median :4856 Median :6032
## Mean :34.51 Mean :4854 Mean :6039
## 3rd Qu.:35.20 3rd Qu.:4882 3rd Qu.:6061
## Max. :38.60 Max. :5055 Max. :6233
## NA's :1
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## Min. : 0 Min. :31.30 Min. : 0
## 1st Qu.:4561 1st Qu.:33.50 1st Qu.:4813
## Median :4588 Median :34.40 Median :4835
## Mean :4566 Mean :34.34 Mean :4810
## 3rd Qu.:4619 3rd Qu.:35.10 3rd Qu.:4862
## Max. :4852 Max. :40.00 Max. :4971
##
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## Min. :5890 Min. : 0 Min. :-1.8000
## 1st Qu.:6001 1st Qu.:4553 1st Qu.:-0.6000
## Median :6022 Median :4582 Median :-0.3000
## Mean :6028 Mean :4556 Mean :-0.1642
## 3rd Qu.:6050 3rd Qu.:4610 3rd Qu.: 0.0000
## Max. :6146 Max. :4759 Max. : 3.6000
##
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## Min. : 0.000 Min. :0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.000
## Median : 5.000 Median :3.000 Median : 8.000
## Mean : 5.406 Mean :3.017 Mean : 8.834
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.000
## Max. :12.000 Max. :6.000 Max. :23.000
## NA's :1 NA's :1 NA's :1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:4832 1st Qu.:6020 1st Qu.:4560
## Median :4855 Median :6047 Median :4587
## Mean :4828 Mean :6016 Mean :4563
## 3rd Qu.:4877 3rd Qu.:6070 3rd Qu.:4609
## Max. :4990 Max. :6161 Max. :4710
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.:19.70 1st Qu.: 8.800
## Median :10.400 Median :19.90 Median : 9.100
## Mean : 6.592 Mean :20.01 Mean : 9.161
## 3rd Qu.:10.750 3rd Qu.:20.40 3rd Qu.: 9.700
## Max. :11.500 Max. :22.00 Max. :11.200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## Min. : 0.00 Min. :143.0 Min. :56.00
## 1st Qu.:70.10 1st Qu.:155.0 1st Qu.:62.00
## Median :70.80 Median :158.0 Median :64.00
## Mean :70.18 Mean :158.5 Mean :63.54
## 3rd Qu.:71.40 3rd Qu.:162.0 3rd Qu.:65.00
## Max. :72.50 Max. :173.0 Max. :70.00
## NA's :5 NA's :5
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## Min. :2.300 Min. :463.0 Min. :0.01700
## 1st Qu.:2.500 1st Qu.:490.0 1st Qu.:0.01900
## Median :2.500 Median :495.0 Median :0.02000
## Mean :2.494 Mean :495.6 Mean :0.01957
## 3rd Qu.:2.500 3rd Qu.:501.5 3rd Qu.:0.02000
## Max. :2.600 Max. :522.0 Max. :0.02200
## NA's :5 NA's :5 NA's :5
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:0.700 1st Qu.:2.000 1st Qu.:7.100
## Median :1.000 Median :3.000 Median :7.200
## Mean :1.014 Mean :2.534 Mean :6.851
## 3rd Qu.:1.300 3rd Qu.:3.000 3rd Qu.:7.300
## Max. :2.300 Max. :3.000 Max. :7.500
##
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## Min. :0.00000 Min. :0.00000 Min. : 0.00
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:11.40
## Median :0.00000 Median :0.00000 Median :11.60
## Mean :0.01771 Mean :0.02371 Mean :11.21
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:11.70
## Max. :0.10000 Max. :0.20000 Max. :12.10
## NA's :1 NA's :1
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## Min. : 0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 0.6000 1st Qu.:1.800 1st Qu.:2.100
## Median : 0.8000 Median :1.900 Median :2.200
## Mean : 0.9119 Mean :1.805 Mean :2.138
## 3rd Qu.: 1.0250 3rd Qu.:1.900 3rd Qu.:2.300
## Max. :11.0000 Max. :2.100 Max. :2.600
##
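Many features have missing values: 28 of the 57 predictors contain at least one NA, all of them ManufacturingProcess variables. The aggr() plot does not show the feature names on the x-axis, but its printed summary lists the counts per variable. About 13.6% of rows (24 of 176) have missing values. We can either drop these rows or impute the missing values.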
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
summary(aggr(ChemicalManufacturingProcess))
##
## Missings per variable:
## Variable Count
## Yield 0
## BiologicalMaterial01 0
## BiologicalMaterial02 0
## BiologicalMaterial03 0
## BiologicalMaterial04 0
## BiologicalMaterial05 0
## BiologicalMaterial06 0
## BiologicalMaterial07 0
## BiologicalMaterial08 0
## BiologicalMaterial09 0
## BiologicalMaterial10 0
## BiologicalMaterial11 0
## BiologicalMaterial12 0
## ManufacturingProcess01 1
## ManufacturingProcess02 3
## ManufacturingProcess03 15
## ManufacturingProcess04 1
## ManufacturingProcess05 1
## ManufacturingProcess06 2
## ManufacturingProcess07 1
## ManufacturingProcess08 1
## ManufacturingProcess09 0
## ManufacturingProcess10 9
## ManufacturingProcess11 10
## ManufacturingProcess12 1
## ManufacturingProcess13 0
## ManufacturingProcess14 1
## ManufacturingProcess15 0
## ManufacturingProcess16 0
## ManufacturingProcess17 0
## ManufacturingProcess18 0
## ManufacturingProcess19 0
## ManufacturingProcess20 0
## ManufacturingProcess21 0
## ManufacturingProcess22 1
## ManufacturingProcess23 1
## ManufacturingProcess24 1
## ManufacturingProcess25 5
## ManufacturingProcess26 5
## ManufacturingProcess27 5
## ManufacturingProcess28 5
## ManufacturingProcess29 5
## ManufacturingProcess30 5
## ManufacturingProcess31 5
## ManufacturingProcess32 0
## ManufacturingProcess33 5
## ManufacturingProcess34 5
## ManufacturingProcess35 5
## ManufacturingProcess36 5
## ManufacturingProcess37 0
## ManufacturingProcess38 0
## ManufacturingProcess39 0
## ManufacturingProcess40 1
## ManufacturingProcess41 1
## ManufacturingProcess42 0
## ManufacturingProcess43 0
## ManufacturingProcess44 0
## ManufacturingProcess45 0
##
## Missings in combinations of variables:
## Combinations
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:1:1:1:1:1:1:0:1:1:1:1:0:0:0:0:0:0:0:0:0
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:0:0:0:0:0:1:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:0:0:0:0:0:1:1:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
## 0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0
## 0:0:0:0:0:0:0:0:0:0:0:0:0:1:1:1:1:1:1:1:1:0:1:1:1:0:0:0:0:0:0:0:0:0:1:1:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:1:1:0:0:0:0
## Count Percent
## 152 86.3636364
## 5 2.8409091
## 1 0.5681818
## 1 0.5681818
## 6 3.4090909
## 7 3.9772727
## 1 0.5681818
## 2 1.1363636
## 1 0.5681818
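The models that follow use X.train, y.train, X.test and y.test, but the chunk that builds them is not shown. One plausible way to get there is sketched below in base R, under stated assumptions: a random 75/25 split (consistent with the 132 training samples reported in the model output, since 0.75 × 176 = 132) and median imputation using training-set medians. Both choices, the seed, and the helper name split_and_impute are assumptions; the original may well have used caret's createDataPartition and preProcess instead.

```r
# Hypothetical helper: split a data frame and impute NAs with training medians.
split_and_impute <- function(df, outcome, p = 0.75, seed = 1) {
  set.seed(seed)  # assumed seed, for reproducibility only
  n <- nrow(df)
  train_idx <- sample(n, size = round(p * n))
  x <- df[, setdiff(names(df), outcome), drop = FALSE]
  y <- df[[outcome]]
  # Medians come from the training rows only, so the test set does not leak in
  med <- vapply(x[train_idx, , drop = FALSE], median, numeric(1), na.rm = TRUE)
  for (col in names(x)) {
    x[[col]][is.na(x[[col]])] <- med[[col]]
  }
  list(X.train = x[train_idx, , drop = FALSE], y.train = y[train_idx],
       X.test  = x[-train_idx, , drop = FALSE], y.test  = y[-train_idx])
}

# Toy data standing in for ChemicalManufacturingProcess (values are made up)
toy <- data.frame(Yield = c(38, 42, 42, 41, 42, 44, 40, 39),
                  A = c(NA, 2, 3, 4, 5, 6, 7, NA),
                  B = c(1, NA, 3, 4, 5, 6, 7, 8))
parts <- split_and_impute(toy, outcome = "Yield")
nrow(parts$X.train)
anyNA(parts$X.train)
```

With the real data, split_and_impute(ChemicalManufacturingProcess, outcome = "Yield") would return 132 training rows and 44 test rows. Dropping the 24 incomplete rows instead would leave only 152 observations to split.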
Let's try our first model, KNN.
knnModel <- train(x = X.train,
                  y = y.train,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
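The repeated warning is most likely caused by BiologicalMaterial07, which the summary above shows is almost always 100 (minimum 100.0, maximum 100.8): in a bootstrap resample it can end up constant, and a constant column cannot be scaled. Below is a hedged base-R sketch of how such columns could be flagged before modeling. The helper dominant_share and its 0.95 cut-off are illustrative assumptions, and the data are made up. caret's nearZeroVar() offers a more complete version of this check.

```r
# Hypothetical helper: share of rows taken by the most frequent value in each column
dominant_share <- function(df) {
  vapply(df, function(col) max(table(col)) / length(col), numeric(1))
}

# Toy data: one nearly constant column and one that varies (values are made up)
toy <- data.frame(
  BiologicalMaterial07 = c(rep(100, 39), 100.8),
  BiologicalMaterial01 = seq(4.5, 8.3, length.out = 40)
)

share <- dominant_share(toy)
nearly_constant <- names(share)[share > 0.95]  # assumed cut-off
nearly_constant

# Drop the flagged columns before centering and scaling
toy_clean <- toy[, setdiff(names(toy), nearly_constant), drop = FALSE]
names(toy_clean)
```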
knnModel
## k-Nearest Neighbors
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 5 1.386818 0.4069868 1.130305
## 7 1.370576 0.4193097 1.120586
## 9 1.384592 0.4058596 1.130505
## 11 1.398555 0.3989628 1.144835
## 13 1.410754 0.3902433 1.154346
## 15 1.416930 0.3904867 1.160173
## 17 1.417252 0.3977882 1.161859
## 19 1.425860 0.3955884 1.167947
## 21 1.435216 0.3926678 1.175034
## 23 1.447417 0.3907068 1.186565
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
Let's predict on the test data and compute the errors and R-squared.
knnPred <- predict(knnModel, newdata = X.test)
postResample(pred = knnPred, obs = y.test)
## RMSE Rsquared MAE
## 1.4967980 0.5121008 1.1812662
Let's try SVM.
svmModel <- train(x = X.train,
                  y = y.train,
                  method = "svmRadial",
                  tuneLength = 10,
                  preProc = c("center", "scale"))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
svmModel
## Support Vector Machines with Radial Basis Function Kernel
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 1.445287 0.4618584 1.1834154
## 0.50 1.340407 0.5008550 1.0926548
## 1.00 1.264424 0.5370274 1.0281439
## 2.00 1.223382 0.5561368 0.9872089
## 4.00 1.212827 0.5576232 0.9710348
## 8.00 1.215786 0.5571255 0.9705215
## 16.00 1.216463 0.5570912 0.9711200
## 32.00 1.216463 0.5570912 0.9711200
## 64.00 1.216463 0.5570912 0.9711200
## 128.00 1.216463 0.5570912 0.9711200
##
## Tuning parameter 'sigma' was held constant at a value of 0.01386174
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.01386174 and C = 4.
Predict on the test set:
svmPred <- predict(svmModel, newdata = X.test)
postResample(pred = svmPred, obs = y.test)
## RMSE Rsquared MAE
## 1.2585773 0.6251842 1.0072735
Let's try our last model, MARS.
marsGrid <- expand.grid(.degree = 1:2,
                        .nprune = 2:20)
marsModel <- train(x = X.train,
                   y = y.train,
                   method = "earth",
                   tuneGrid = marsGrid,
                   preProc = c("center", "scale"))
## Warning in preProcess.default(thresh = 0.95, k = 5, freqCut = 19, uniqueCut =
## 10, : These variables have zero variances: BiologicalMaterial07
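The warning above reports that BiologicalMaterial07 has zero variance when preProcess runs, so it cannot be centered and scaled meaningfully. One common remedy is to drop constant columns before calling train. The sketch below shows the idea in base R on a toy data frame; the column names and the helper drop_constant are hypothetical, and caret's nearZeroVar is the more usual tool for this.

```r
# Hypothetical helper: drop columns whose values are all identical
drop_constant <- function(df) {
  is_constant <- vapply(df, function(col) length(unique(col)) <= 1, logical(1))
  df[, !is_constant, drop = FALSE]
}

# Toy data for illustration (not the chemical manufacturing data)
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(5, 5, 5, 5),        # constant, so scaling is undefined
                  c = c(0.1, 0.4, 0.2, 0.9))

names(drop_constant(toy))
```

The same column selection would need to be applied to both X.train and X.test to keep them aligned. A column that is only nearly constant overall can still become constant inside a bootstrap sample, which is presumably why the warning shows up during resampling.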
marsModel
## Multivariate Adaptive Regression Spline
##
## 132 samples
## 57 predictor
##
## Pre-processing: centered (57), scaled (57)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 132, 132, 132, 132, 132, 132, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 2 1.535570 0.3375963 1.187483
## 1 3 1.340388 0.4919964 1.049814
## 1 4 3.932410 0.4895318 1.393907
## 1 5 3.443236 0.4770059 1.325244
## 1 6 3.813132 0.4446017 1.396980
## 1 7 4.875685 0.3940506 1.585389
## 1 8 4.140970 0.4097375 1.466265
## 1 9 5.218548 0.3875712 1.642331
## 1 10 5.355814 0.3723563 1.683616
## 1 11 5.510000 0.3261928 1.738215
## 1 12 5.337506 0.3145538 1.727836
## 1 13 6.157829 0.3109134 1.888466
## 1 14 6.702187 0.2993218 1.987325
## 1 15 6.831920 0.2997222 2.034952
## 1 16 6.564210 0.3027182 1.991806
## 1 17 6.548614 0.3019879 1.995204
## 1 18 6.533720 0.3032393 1.990802
## 1 19 6.537509 0.3022043 1.994733
## 1 20 6.535477 0.3033874 1.994230
## 2 2 1.533899 0.3395688 1.190380
## 2 3 1.386810 0.4557288 1.090908
## 2 4 1.934649 0.4611534 1.155901
## 2 5 1.695142 0.4757415 1.118137
## 2 6 2.101742 0.4556661 1.186778
## 2 7 2.287001 0.4452221 1.226043
## 2 8 2.229628 0.4527732 1.214881
## 2 9 2.241617 0.4342188 1.231014
## 2 10 2.260622 0.4340273 1.237285
## 2 11 2.473603 0.4086686 1.288932
## 2 12 2.429886 0.3959948 1.307211
## 2 13 2.443104 0.4019047 1.312503
## 2 14 2.401061 0.3988870 1.305754
## 2 15 2.534867 0.3969389 1.328302
## 2 16 2.673429 0.3762920 1.376511
## 2 17 2.865160 0.3688746 1.411509
## 2 18 2.696567 0.3724343 1.390933
## 2 19 2.854104 0.3589748 1.439246
## 2 20 2.893790 0.3584368 1.440659
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 3 and degree = 1.
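In the table above, RMSE for the degree-1 models jumps from about 1.34 at nprune = 3 to about 3.93 at nprune = 4, while MAE rises only from 1.05 to 1.39. A gap like that usually points to a few very large errors in some resamples rather than uniformly worse predictions, although that is an inference from the summary numbers and not something the output shows directly. The toy example below (made-up errors) illustrates how a single extreme prediction inflates RMSE far more than MAE.

```r
# Made-up prediction errors for illustration only
errors_typical <- c(0.8, -1.1, 0.5, -0.9, 1.2, -0.7, 1.0, -1.3)
errors_outlier <- c(errors_typical, 25)  # one wildly wrong prediction added

rmse <- function(e) sqrt(mean(e^2))
mae  <- function(e) mean(abs(e))

rbind(typical = c(RMSE = rmse(errors_typical), MAE = mae(errors_typical)),
      outlier = c(RMSE = rmse(errors_outlier), MAE = mae(errors_outlier)))
```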
Predict on test set
marsModelpred <- predict(marsModel, newdata = X.test)
postResample(pred = marsModelpred, obs = y.test)
## RMSE Rsquared MAE
## 1.4956756 0.4415462 1.1910414
resamp <- resamples(list(KNN=knnModel, MARS=marsModel, SVM=svmModel))
summary(resamp)
##
## Call:
## summary.resamples(object = resamp)
##
## Models: KNN, MARS, SVM
## Number of resamples: 25
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## KNN 0.9466563 1.0526108 1.1242598 1.1205860 1.184430 1.288532 0
## MARS 0.8329489 0.9640332 1.0187049 1.0498143 1.116347 1.457561 0
## SVM 0.7748748 0.9059040 0.9316888 0.9710348 1.070985 1.156674 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## KNN 1.1145218 1.305401 1.376347 1.370576 1.440231 1.530900 0
## MARS 1.0329034 1.200956 1.313639 1.340388 1.456943 1.868650 0
## SVM 0.9890417 1.116357 1.154228 1.212827 1.327700 1.441459 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## KNN 0.2742013 0.3331019 0.4318388 0.4193097 0.4978750 0.5836994 0
## MARS 0.1500053 0.4257317 0.5111191 0.4919964 0.5902840 0.6738815 0
## SVM 0.3867768 0.4820051 0.5613091 0.5576232 0.6431642 0.6893192 0
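Of the three models I tried, SVM performs best across the resamples: it has the highest mean R-squared (0.558, versus 0.492 for MARS and 0.419 for KNN) and the lowest mean RMSE (1.213) and MAE (0.971). The test-set results point the same way: SVM reaches an RMSE of 1.259 and an R-squared of 0.625, compared with 1.496 and 0.442 for MARS.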
varImp(svmModel)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 57)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.37
## BiologicalMaterial03 86.71
## BiologicalMaterial06 86.61
## ManufacturingProcess09 73.49
## ManufacturingProcess17 70.57
## BiologicalMaterial12 63.83
## ManufacturingProcess36 63.76
## ManufacturingProcess06 58.47
## ManufacturingProcess31 54.85
## BiologicalMaterial02 54.22
## ManufacturingProcess11 48.59
## BiologicalMaterial11 45.41
## ManufacturingProcess33 45.09
## ManufacturingProcess30 43.00
## BiologicalMaterial04 42.53
## ManufacturingProcess20 40.85
## ManufacturingProcess12 40.53
## ManufacturingProcess29 37.11
## ManufacturingProcess02 36.01
The top-ranked predictor is ManufacturingProcess32. Of the top 10 predictors in the list above, 7 are ManufacturingProcess variables and 3 are BiologicalMaterial variables, so the process predictors dominate the biological ones in this ranking.
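The "loess r-squared" header indicates that these scores are computed one predictor at a time, independently of the fitted SVM. The sketch below imitates that idea on simulated data: fit a loess smoother of the outcome on each predictor, take a pseudo R-squared of 1 - SSE/SST, and rescale to 0-100. The data, the column names, and the helper loess_r2 are all hypothetical, and the rescaling is my assumption about how the scores are normalized, so treat this as an approximation of what varImp does rather than its exact code.

```r
set.seed(1)
n <- 150
# Simulated predictors (hypothetical): x1 strongly related to y, x2 weakly, x3 pure noise
X <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
y <- sin(2 * pi * X$x1) + 0.3 * X$x2 + rnorm(n, sd = 0.3)

# Pseudo R-squared of a univariate loess fit: 1 - SSE / SST, floored at 0
loess_r2 <- function(x, y) {
  fit <- loess(y ~ x)
  r2  <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
  max(r2, 0)
}

raw    <- vapply(X, loess_r2, numeric(1), y = y)
scaled <- 100 * (raw - min(raw)) / (max(raw) - min(raw))  # rescale to 0-100
sort(round(scaled, 2), decreasing = TRUE)
```

Because each predictor is scored on its own, a variable can rank highly here even if the SVM relies on it mainly through interactions with other variables, or not at all.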
Below are scatter plots of ten predictors against our target variable (yield). Looked at individually, none of them shows a noticeable relationship with yield.
ManufacturingProcess33 v/s Yield
plot(x = X.train$ManufacturingProcess33, y.train)
ManufacturingProcess31 v/s Yield
plot(x = X.train$ManufacturingProcess31, y.train)
BiologicalMaterial11 v/s Yield
plot(x = X.train$BiologicalMaterial11, y.train)
BiologicalMaterial04 v/s Yield
plot(x = X.train$BiologicalMaterial04, y.train)
ManufacturingProcess29 v/s Yield
plot(x = X.train$ManufacturingProcess29, y.train)
ManufacturingProcess11 v/s Yield
plot(x = X.train$ManufacturingProcess11, y.train)
ManufacturingProcess12 v/s Yield
plot(x = X.train$ManufacturingProcess12, y.train)
BiologicalMaterial08 v/s Yield
plot(x = X.train$BiologicalMaterial08, y.train)
BiologicalMaterial09 v/s Yield
plot(x = X.train$BiologicalMaterial09, y.train)
BiologicalMaterial01 v/s Yield
plot(x = X.train$BiologicalMaterial01, y.train)
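The ten plot calls above differ only in the column name, so they could be generated in a loop. A minimal sketch, using a small simulated stand-in for X.train and y.train (the real objects are assumed to be a data frame and a numeric vector) and labelling the axes so each panel identifies its predictor:

```r
set.seed(42)
# Simulated stand-ins for X.train / y.train (hypothetical values)
X.demo <- data.frame(ManufacturingProcess33 = rnorm(30),
                     BiologicalMaterial11   = rnorm(30))
y.demo <- rnorm(30, mean = 40)

vars <- names(X.demo)  # in the report this would be the ten column names
pdf(NULL)              # null device so the sketch also runs non-interactively
for (v in vars) {
  plot(x = X.demo[[v]], y = y.demo,
       xlab = v, ylab = "Yield", main = paste(v, "vs. Yield"))
}
invisible(dev.off())
```

In an interactive session or a knitted report the pdf(NULL) and dev.off() lines would be dropped so the plots are actually displayed.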