Here we split the data.
## x.1 x.2 x.3 x.4 x.5 x.6
## 1 0.5337724 0.6478064 0.85078526 0.18159957 0.92903976 0.36179060
## 2 0.5837650 0.4381528 0.67272659 0.66924914 0.16379784 0.45305931
## 3 0.5895783 0.5879065 0.40967108 0.33812728 0.89409334 0.02681911
## 4 0.6910399 0.2259548 0.03335447 0.06691274 0.63744519 0.52500637
## 5 0.6673315 0.8188985 0.71676079 0.80324287 0.08306864 0.22344157
## 6 0.8392937 0.3862983 0.64618857 0.86105431 0.63038947 0.43703891
## x.7 x.8 x.9 x.10 y
## 1 0.8266609 0.4214081 0.59111440 0.5886216 18.46398
## 2 0.6489601 0.8446239 0.92819306 0.7584008 16.09836
## 3 0.1785614 0.3495908 0.01759542 0.4441185 17.76165
## 4 0.5133614 0.7970260 0.68986918 0.4450716 13.78730
## 5 0.6644906 0.9038919 0.39696995 0.5500808 18.42984
## 6 0.3360117 0.6489177 0.53116033 0.9066182 20.85817
Here I generate the test data.
testData <- mlbench.friedman1(5000, sd = 1)
testData$x <- data.frame(testData$x)
Here I build a KNN regression model (the response y is numeric, so caret fits a regressor rather than a classifier).
knnModel <- train(x = trainingData$x,
                  y = trainingData$y,
                  method = "knn",
                  preProc = c("center", "scale"),
                  tuneLength = 10)
plot(knnModel$results$RMSE)
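The RMSE appears to be minimized around the sixth candidate value, with the fifth and seventh giving similar results. The plot shows RMSE against the position of each candidate in the tuning grid rather than against k, so the position has to be mapped back through knnModel$results$k. Note that KNN is tuned over the number of neighbors k, not a number of clusters.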
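To make the role of k concrete, here is a minimal base-R sketch of KNN regression. It is not caret's implementation: the helper name `knn_regress` and the toy data are hypothetical, and the inputs are assumed to be on comparable scales already (which is what the centering and scaling step is for). The prediction for a new point is simply the mean response of its k nearest training points.

```r
# Minimal KNN regression sketch (illustrative only; not caret's implementation).
# knn_regress is a hypothetical helper name.
knn_regress <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    # Euclidean distance from this query row to every training row
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    # Average the responses of the k closest training rows
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy data: y equals x, so the neighbors are easy to check by hand
toy_x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
toy_y <- c(1, 2, 3, 10, 11, 12)
toy_pred <- knn_regress(toy_x, toy_y, matrix(c(2, 11), ncol = 1), k = 3)
print(toy_pred)
```

A small k follows the training data closely, while a large k averages over more points and gives a smoother fit, which is why the RMSE is examined across several candidate values.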
mars <- earth(trainingData$x, trainingData$y)
mars
## Selected 12 of 18 terms, and 6 of 10 predictors
## Termination condition: Reached nk 21
## Importance: X1, X4, X2, X5, X3, X6, X7-unused, X8-unused, X9-unused, ...
## Number of terms at each degree of interaction: 1 11 (additive model)
## GCV 2.540556 RSS 397.9654 GRSq 0.8968524 RSq 0.9183982
MARS retains X1 through X6 (6 of the 10 predictors) and leaves X7 through X10 unused.
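For context on which predictors should matter: the Friedman 1 benchmark that mlbench.friedman1 simulates builds the response from the first five inputs only, with the remaining five acting as pure noise. The base-R sketch below reproduces that formula directly, without mlbench; the names `sim_x` and `sim_y`, the seed, and the sample size are illustrative choices, not values from this report.

```r
# Friedman 1 benchmark simulated in base R (illustrative seed and sample size).
set.seed(1)
n <- 1000
sim_x <- matrix(runif(n * 10), ncol = 10)  # ten independent U(0, 1) inputs
sim_y <- 10 * sin(pi * sim_x[, 1] * sim_x[, 2]) +
  20 * (sim_x[, 3] - 0.5)^2 +
  10 * sim_x[, 4] +
  5 * sim_x[, 5] +
  rnorm(n, sd = 1)                         # inputs 6 to 10 never enter the formula

# Linear correlation of each input with the response, as a rough check
print(round(cor(sim_x, sim_y)[, 1], 2))
```

Seen this way, the MARS fit recovers the five informative inputs, and X6, which ranks last among the selected predictors, is most likely a spurious inclusion.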
Below is the data presented for modelling.
## Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
## 1 38.00 6.25 49.58 56.97
## 2 42.44 8.01 60.97 67.48
## 3 42.03 8.01 60.97 67.48
## 4 41.42 8.01 60.97 67.48
## 5 42.49 7.47 63.33 72.25
## 6 43.57 6.12 58.36 65.31
## BiologicalMaterial04 BiologicalMaterial05 BiologicalMaterial06
## 1 12.74 19.51 43.73
## 2 14.65 19.36 53.14
## 3 14.65 19.36 53.14
## 4 14.65 19.36 53.14
## 5 14.02 17.91 54.66
## 6 15.17 21.79 51.23
## BiologicalMaterial07 BiologicalMaterial08 BiologicalMaterial09
## 1 100 16.66 11.44
## 2 100 19.04 12.55
## 3 100 19.04 12.55
## 4 100 19.04 12.55
## 5 100 18.22 12.80
## 6 100 18.30 12.13
## BiologicalMaterial10 BiologicalMaterial11 BiologicalMaterial12
## 1 3.46 138.09 18.83
## 2 3.46 153.67 21.05
## 3 3.46 153.67 21.05
## 4 3.46 153.67 21.05
## 5 3.05 147.61 21.05
## 6 3.78 151.88 20.76
## ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
## 1 NA NA NA
## 2 0.0 0 NA
## 3 0.0 0 NA
## 4 0.0 0 NA
## 5 10.7 0 NA
## 6 12.0 0 NA
## ManufacturingProcess04 ManufacturingProcess05 ManufacturingProcess06
## 1 NA NA NA
## 2 917 1032.2 210.0
## 3 912 1003.6 207.1
## 4 911 1014.6 213.3
## 5 918 1027.5 205.7
## 6 924 1016.8 208.9
## ManufacturingProcess07 ManufacturingProcess08 ManufacturingProcess09
## 1 NA NA 43.00
## 2 177 178 46.57
## 3 178 178 45.07
## 4 177 177 44.92
## 5 178 178 44.96
## 6 178 178 45.32
## ManufacturingProcess10 ManufacturingProcess11 ManufacturingProcess12
## 1 NA NA NA
## 2 NA NA 0
## 3 NA NA 0
## 4 NA NA 0
## 5 NA NA 0
## 6 NA NA 0
## ManufacturingProcess13 ManufacturingProcess14 ManufacturingProcess15
## 1 35.5 4898 6108
## 2 34.0 4869 6095
## 3 34.8 4878 6087
## 4 34.8 4897 6102
## 5 34.6 4992 6233
## 6 34.0 4985 6222
## ManufacturingProcess16 ManufacturingProcess17 ManufacturingProcess18
## 1 4682 35.5 4865
## 2 4617 34.0 4867
## 3 4617 34.8 4877
## 4 4635 34.8 4872
## 5 4733 33.9 4886
## 6 4786 33.4 4862
## ManufacturingProcess19 ManufacturingProcess20 ManufacturingProcess21
## 1 6049 4665 0.0
## 2 6097 4621 0.0
## 3 6078 4621 0.0
## 4 6073 4611 0.0
## 5 6102 4659 -0.7
## 6 6115 4696 -0.6
## ManufacturingProcess22 ManufacturingProcess23 ManufacturingProcess24
## 1 NA NA NA
## 2 3 0 3
## 3 4 1 4
## 4 5 2 5
## 5 8 4 18
## 6 9 1 1
## ManufacturingProcess25 ManufacturingProcess26 ManufacturingProcess27
## 1 4873 6074 4685
## 2 4869 6107 4630
## 3 4897 6116 4637
## 4 4892 6111 4630
## 5 4930 6151 4684
## 6 4871 6128 4687
## ManufacturingProcess28 ManufacturingProcess29 ManufacturingProcess30
## 1 10.7 21.0 9.9
## 2 11.2 21.4 9.9
## 3 11.1 21.3 9.4
## 4 11.1 21.3 9.4
## 5 11.3 21.6 9.0
## 6 11.4 21.7 10.1
## ManufacturingProcess31 ManufacturingProcess32 ManufacturingProcess33
## 1 69.1 156 66
## 2 68.7 169 66
## 3 69.3 173 66
## 4 69.3 171 68
## 5 69.4 171 70
## 6 68.2 173 70
## ManufacturingProcess34 ManufacturingProcess35 ManufacturingProcess36
## 1 2.4 486 0.019
## 2 2.6 508 0.019
## 3 2.6 509 0.018
## 4 2.5 496 0.018
## 5 2.5 468 0.017
## 6 2.5 490 0.018
## ManufacturingProcess37 ManufacturingProcess38 ManufacturingProcess39
## 1 0.5 3 7.2
## 2 2.0 2 7.2
## 3 0.7 2 7.2
## 4 1.2 2 7.2
## 5 0.2 2 7.3
## 6 0.4 2 7.2
## ManufacturingProcess40 ManufacturingProcess41 ManufacturingProcess42
## 1 NA NA 11.6
## 2 0.1 0.15 11.1
## 3 0.0 0.00 12.0
## 4 0.0 0.00 10.6
## 5 0.0 0.00 11.0
## 6 0.0 0.00 11.5
## ManufacturingProcess43 ManufacturingProcess44 ManufacturingProcess45
## 1 3.0 1.8 2.4
## 2 0.9 1.9 2.2
## 3 1.0 1.8 2.3
## 4 1.1 1.8 2.1
## 5 1.1 1.7 2.1
## 6 2.2 1.8 2.0
Here, I replace each NA with the mean of its column and then center and scale all of the data, including the Yield column.
na_to_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
df <- replace(df, TRUE, lapply(df, na_to_mean))
pre <- preProcess(df, method = c("center", "scale"))
data <- predict(pre, df)
head(data)
## Yield BiologicalMaterial01 BiologicalMaterial02
## 1 -1.1792673 -0.2261036 -1.5140979
## 2 1.2263678 2.2391498 1.3089960
## 3 1.0042258 2.2391498 1.3089960
## 4 0.6737219 2.2391498 1.3089960
## 5 1.2534583 1.4827653 1.8939391
## 6 1.8386128 -0.4081962 0.6620886
## BiologicalMaterial03 BiologicalMaterial04 BiologicalMaterial05
## 1 -2.68303622 0.2201765 0.4941942
## 2 -0.05623504 1.2964386 0.4128555
## 3 -0.05623504 1.2964386 0.4128555
## 4 -0.05623504 1.2964386 0.4128555
## 5 1.13594780 0.9414412 -0.3734185
## 6 -0.59859075 1.5894524 1.7305423
## BiologicalMaterial06 BiologicalMaterial07 BiologicalMaterial08
## 1 -1.3828880 -0.1313107 -1.233131
## 2 1.1290767 -0.1313107 2.282619
## 3 1.1290767 -0.1313107 2.282619
## 4 1.1290767 -0.1313107 2.282619
## 5 1.5348350 -0.1313107 1.071310
## 6 0.6192092 -0.1313107 1.189487
## BiologicalMaterial09 BiologicalMaterial10 BiologicalMaterial11
## 1 -3.3962895 1.1005296 -1.838655
## 2 -0.7227225 1.1005296 1.393395
## 3 -0.7227225 1.1005296 1.393395
## 4 -0.7227225 1.1005296 1.393395
## 5 -0.1205678 0.4162193 0.136256
## 6 -1.7343424 1.6346255 1.022062
## BiologicalMaterial12 ManufacturingProcess01 ManufacturingProcess02
## 1 -1.7709224 0.0000000 0.000000
## 2 1.0989855 -6.1673490 -1.986352
## 3 1.0989855 -6.1673490 -1.986352
## 4 1.0989855 -6.1673490 -1.986352
## 5 1.0989855 -0.2792335 -1.986352
## 6 0.7240877 0.4361451 -1.986352
## ManufacturingProcess03 ManufacturingProcess04 ManufacturingProcess05
## 1 0 0.000000 0.00000000
## 2 0 -2.373764 1.00220071
## 3 0 -3.172935 0.06264341
## 4 0 -3.332769 0.42401160
## 5 0 -2.213930 0.84779794
## 6 0 -1.254926 0.49628524
## ManufacturingProcess06 ManufacturingProcess07 ManufacturingProcess08
## 1 0.0000000 0.0000000 0.0000000
## 2 0.9680861 -0.9607689 0.8967295
## 3 -0.1124188 1.0408330 0.8967295
## 4 2.1976262 -0.9607689 -1.1151636
## 5 -0.6340418 1.0408330 0.8967295
## 6 0.5582394 1.0408330 0.8967295
## ManufacturingProcess09 ManufacturingProcess10 ManufacturingProcess11
## 1 -1.7201524 0 0
## 2 0.5883746 0 0
## 3 -0.3815947 0 0
## 4 -0.4785917 0 0
## 5 -0.4527258 0 0
## 6 -0.2199332 0 0
## ManufacturingProcess12 ManufacturingProcess13 ManufacturingProcess14
## 1 0.000000 0.97711512 0.8117224
## 2 -0.482073 -0.50030980 0.2783168
## 3 -0.482073 0.28765016 0.4438565
## 4 -0.482073 0.28765016 0.7933291
## 5 -0.482073 0.09066017 2.5406922
## 6 -0.482073 -0.50030980 2.4119391
## ManufacturingProcess15 ManufacturingProcess16 ManufacturingProcess17
## 1 1.1846438 0.3303945 0.9263296
## 2 0.9617071 0.1455765 -0.2753953
## 3 0.8245152 0.1455765 0.3655246
## 4 1.0817499 0.1967569 0.3655246
## 5 3.3282665 0.4754056 -0.3555103
## 6 3.1396277 0.6261033 -0.7560852
## ManufacturingProcess18 ManufacturingProcess19 ManufacturingProcess20
## 1 0.1505348 0.4563798 0.3109942
## 2 0.1559773 1.5095063 0.1849230
## 3 0.1831898 1.0926437 0.1849230
## 4 0.1695836 0.9829430 0.1562704
## 5 0.2076811 1.6192070 0.2938027
## 6 0.1423710 1.9044287 0.3998171
## ManufacturingProcess21 ManufacturingProcess22 ManufacturingProcess23
## 1 0.2109804 0.0000000 0.0000000
## 2 0.2109804 -0.7243735 -1.8199757
## 3 0.2109804 -0.4232681 -1.2167640
## 4 0.2109804 -0.1221628 -0.6135524
## 5 -0.6884239 0.7811534 0.5928709
## 6 -0.5599376 1.0822588 -1.2167640
## ManufacturingProcess24 ManufacturingProcess25 ManufacturingProcess26
## 1 0.0000000 0.1217705 0.1274689
## 2 -1.0088982 0.1109041 0.1994933
## 3 -0.8359725 0.1869689 0.2191363
## 4 -0.6630467 0.1733859 0.2082235
## 5 1.5849880 0.2766168 0.2955257
## 6 -1.3547497 0.1163373 0.2453269
## ManufacturingProcess27 ManufacturingProcess28 ManufacturingProcess29
## 1 0.3510871 0.7940899 0.6030010
## 2 0.1934449 0.8907371 0.8469115
## 3 0.2135084 0.8714077 0.7859338
## 4 0.1934449 0.8714077 0.7859338
## 5 0.3482209 0.9100666 0.9688667
## 6 0.3568195 0.9293960 1.0298443
## ManufacturingProcess30 ManufacturingProcess31 ManufacturingProcess32
## 1 0.7677420 -0.1981058 -0.4568829
## 2 0.7677420 -0.2711540 1.9517531
## 3 0.2480117 -0.1615817 2.6928719
## 4 0.2480117 -0.1615817 2.3223125
## 5 -0.1677726 -0.1433197 2.3223125
## 6 0.9756342 -0.3624642 2.6928719
## ManufacturingProcess33 ManufacturingProcess34 ManufacturingProcess35
## 1 1.003470 -1.7453870 -0.89989600
## 2 1.003470 1.9853777 1.16311970
## 3 1.003470 1.9853777 1.25689314
## 4 1.820581 0.1199954 0.03783841
## 5 2.637692 0.1199954 -2.58781794
## 6 2.637692 0.1199954 -0.52480224
## ManufacturingProcess36 ManufacturingProcess37 ManufacturingProcess38
## 1 -0.6653513 -1.1540243 0.7174727
## 2 -0.6653513 2.2161351 -0.8224687
## 3 -1.8263214 -0.7046697 -0.8224687
## 4 -1.8263214 0.4187168 -0.8224687
## 5 -2.9872915 -1.8280562 -0.8224687
## 6 -1.8263214 -1.3787016 -0.8224687
## ManufacturingProcess39 ManufacturingProcess40 ManufacturingProcess41
## 1 0.2317270 0.0000000 0.0000000
## 2 0.2317270 2.1552636 2.3529953
## 3 0.2317270 -0.4639804 -0.4418521
## 4 0.2317270 -0.4639804 -0.4418521
## 5 0.2981503 -0.4639804 -0.4418521
## 6 0.2317270 -0.4639804 -0.4418521
## ManufacturingProcess42 ManufacturingProcess43 ManufacturingProcess44
## 1 0.20279570 2.40564734 -0.01588055
## 2 -0.05472265 -0.01374656 0.29467248
## 3 0.40881037 0.10146268 -0.01588055
## 4 -0.31224099 0.21667191 -0.01588055
## 5 -0.10622632 0.21667191 -0.32643359
## 6 0.15129203 1.48397347 -0.01588055
## ManufacturingProcess45
## 1 0.64371849
## 2 0.15220242
## 3 0.39796046
## 4 -0.09355562
## 5 -0.09355562
## 6 -0.33931365
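The zeros in the first row of the scaled output are a direct consequence of mean imputation: a value filled in with the column mean sits exactly at the center after centering. The sketch below shows this on a small made-up data frame, using base R's scale() as a stand-in for preProcess with "center" and "scale"; the `toy` data and its column names are hypothetical.

```r
# Mean imputation followed by centering and scaling, on made-up data.
toy <- data.frame(a = c(1, NA, 3, 5), b = c(10, 20, NA, 40))

# Same helper as in the report: fill NAs with the column mean
na_to_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
toy_imputed <- as.data.frame(lapply(toy, na_to_mean))

# scale() subtracts the column mean and divides by the column standard deviation
toy_scaled <- as.data.frame(scale(toy_imputed))
print(toy_scaled)
```

The imputed cells (row 2 of `a`, row 3 of `b`) come out as 0 after scaling, matching the pattern seen for the ManufacturingProcess columns that were NA in row 1.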
Here, I create my training and test sets as well as label sets.
all_indexes = 1:176
training_index = sample(1:176, size = 140)
test_index = setdiff(all_indexes, training_index)
training = data[training_index,]
testing = data[test_index,]
test_labels <- testing['Yield']
train_labels <- training['Yield']
features <- training
features['Yield'] <- NULL
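The split above draws a different sample on every run and hard-codes the row count. A minimal sketch of a reproducible variant follows, assuming a fixed seed and a row count taken from the data itself; the seed value, the 80% fraction, and the `toy_*` names are illustrative and not taken from this report.

```r
# Reproducible train/test split on made-up data (illustrative seed and fraction).
set.seed(123)
toy_data <- data.frame(Yield = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))

n <- nrow(toy_data)                                   # avoid hard-coding the row count
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx <- setdiff(seq_len(n), train_idx)

toy_train <- toy_data[train_idx, ]
toy_test <- toy_data[test_idx, ]

# Separate the predictors from the response for interfaces that take x and y
toy_features <- toy_train[, setdiff(names(toy_train), "Yield"), drop = FALSE]
toy_labels <- toy_train$Yield
```

Fixing the seed means the reported errors can be regenerated exactly, which matters when several models are compared on the same held-out rows.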
Below we build a neural net.
library(neuralnet, quietly = TRUE)
model1 <- neuralnet(Yield~. , data = training, hidden = c(5,3))
#plot(model1)
tmp <- compute(model1, testing)
pred1 <- tmp$net.result
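# Note: as written, this is the sum of absolute test errors, not a true RMSE,
# because the square root is applied to each squared residual before summing.
rmse1 <- sum(((pred1 - test_labels)^2)^.5)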
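The error expressions in this section have the form sum(((pred - actual)^2)^.5), which takes the square root of each squared residual and then sums, giving a sum of absolute errors. A root mean squared error averages the squared residuals first and takes one square root at the end. The sketch below contrasts the two on made-up numbers; `rmse` and `sum_abs_error` are hypothetical helper names.

```r
# RMSE versus sum of absolute errors, on made-up numbers.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

sum_abs_error <- function(actual, predicted) {
  # Equivalent to sum(((predicted - actual)^2)^.5)
  sum(abs(actual - predicted))
}

toy_actual <- c(1, 2, 3)
toy_predicted <- c(2, 2, 5)

toy_rmse <- rmse(toy_actual, toy_predicted)          # sqrt((1 + 0 + 4) / 3)
toy_sae <- sum_abs_error(toy_actual, toy_predicted)  # 1 + 0 + 2 = 3
print(c(rmse = toy_rmse, sum_abs = toy_sae))
```

Applied to the models here, a call along the lines of rmse(test_labels$Yield, as.numeric(pred1)) would give a true RMSE. The summed form grows with the size of the test set, so its values are not directly comparable with RMSE figures reported elsewhere.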
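Then, we build a KNN model. Here we find that the optimal value is 5. KNN is tuned over the number of neighbors k rather than a cluster size, and because the plot below shows RMSE against the position of each candidate in the tuning grid, model2$bestTune is the direct way to read off the selected k.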
model2 <- train(x = features,
                y = train_labels$Yield,
                method = "knn",
                preProc = c("center", "scale"),
                tuneLength = 10)
plot(model2$results$RMSE)
Model 3 was a MARS model.
model3 <- earth(x = features , y = train_labels$Yield)
pred3 <- predict(model3, testing)
# As written, this is the sum of absolute test errors rather than a true RMSE.
rmse3 <- sum(((pred3 - test_labels)^2)^.5)
rmse3
## [1] 21.57898
Below is an SVM, using the polynomial kernel.
library(e1071, quietly = TRUE)
labels = list(train_labels)
model4 <- svm(Yield ~ ., data = training, kernel = "polynomial")
pred4 <- predict(model4, testing)
# As written, this is the sum of absolute test errors rather than a true RMSE.
rmse4 <- sum(((pred4 - test_labels)^2)^.5)
rmse4
## [1] 23.71598
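Below we can compare the results of three of the models: the neural net, MARS, and the SVM.
barplot(c(rmse1, rmse3, rmse4), names.arg = c('NN', 'MARS', 'SVM'), main = "RMSE for Various Models")
The KNN model is left out of this comparison because no test-set error was computed for model2. This is not a limitation of the method: caret's "knn" performs regression when the outcome is numeric. Lower values are better, so MARS (21.58) does better than the SVM (23.72). Both MARS and the NN perform well with the data, but the NN wins slightly.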