Juan Olazaba 2/25/2024
The objective of this analysis is to find the number of PLS components that yields the optimal R^2 value (Sect. 5.1). PLS models with 1-10 components were each evaluated using five repeats of 10-fold cross-validation, and the results are presented in the following table:
Answer in PDF
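As a sketch, such an evaluation could be set up in caret as follows; X and y are hypothetical placeholders for the exercise's predictors and response, which are not loaded here:
library(caret)
set.seed(100)  # arbitrary seed for reproducibility
# Five repeats of 10-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
# Tune PLS over 1-10 components, tracking the resampled R^2
# (X and y are placeholders for the exercise data)
plsFit <- train(x = X, y = y, method = "pls",
                tuneGrid = expand.grid(ncomp = 1:10),
                trControl = ctrl, metric = "Rsquared")
plsFit$results[, c("ncomp", "Rsquared", "RsquaredSD")]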
The support vector machine model provides a near-optimal R^2 value, very close to that of the random forest model. I also chose the SVM model because it has a significant advantage over the best-performing model (random forest) in terms of computational time.
I would choose boosted regression due to its good performance and model simplicity. Its prediction time is also among the best of all the models presented.
Brodnjak-Vonina et al. (2005) developed a methodology for food laboratories to determine the type of oil from a sample. In their procedure, they used a gas chromatograph (an instrument that separates chemicals in a sample) to measure seven different fatty acids in an oil. These measurements would then be used to predict the type of oil in food samples. To create their model, they used 96 samples of seven types of oils. These data can be found in the caret package using data(oil). The oil types are contained in a factor variable called oilType. The types are pumpkin (coded as A), sunflower (B), peanut (C), olive (D), soybean (E), rapeseed (F) and corn (G).
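A minimal sketch of loading the data and inspecting the class distribution, which produces the output below (assuming the caret package is installed):
library(caret)
data(oil)        # loads fattyAcids (predictors) and oilType (response)
str(oilType)     # factor with seven levels, A-G
table(oilType)   # class frequencies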
## Factor w/ 7 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## oilType
## A B C D E F G
## 37 26 3 7 11 10 2
set.seed(629)
# Four completely random samples of 60 oils, drawn without replacement
Ja <- sample(oilType, 60, replace = FALSE)
Jb <- sample(oilType, 60, replace = FALSE)
Jc <- sample(oilType, 60, replace = FALSE)
Jd <- sample(oilType, 60, replace = FALSE)
table(Ja)
## Ja
## A B C D E F G
## 20 16 3 4 8 8 1
table(Jb)
## Jb
## A B C D E F G
## 24 17 2 4 7 5 1
table(Jc)
## Jc
## A B C D E F G
## 24 14 2 4 6 8 2
table(Jd)
## Jd
## A B C D E F G
## 25 13 1 5 8 6 2
Stratified random sampling allocates the same number of samples from each class on every round of sampling, substantially reducing the between-sample variance relative to the completely random samples above.
set.seed(318)
# 20 stratified 70% splits of the data
oil <- createDataPartition(oilType, p = .70, times = 20)
# Class frequencies within each partition
oil <- lapply(oil, function(x, y) table(y[x]), y = oilType)
head(oil, 3)
## $Resample01
##
## A B C D E F G
## 26 19 3 5 8 7 2
##
## $Resample02
##
## A B C D E F G
## 26 19 3 5 8 7 2
##
## $Resample03
##
## A B C D E F G
## 26 19 3 5 8 7 2
Leave-one-out cross-validation (LOOCV) could be an option: each sample in turn is held out, the model is trained on all remaining samples and evaluated on the held-out sample, and the results are averaged over all iterations. Another option that is robust for small sample sizes is the bootstrap, which uses repeated sampling with replacement.
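As a sketch, both schemes can be requested through caret's trainControl and passed to train; the number of bootstrap resamples here is an arbitrary illustration:
library(caret)
# Leave-one-out cross-validation
ctrlLOOCV <- trainControl(method = "LOOCV")
# Bootstrap resampling with 50 resamples (50 is an arbitrary choice)
ctrlBoot <- trainControl(method = "boot", number = 50)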
The function below uses an exact binomial test to compute a confidence interval (lower and upper bound) for a given test-set accuracy, evaluated over a grid of test-set sizes and accuracy values:
getWidth <- function(values) {
  # Exact binomial confidence interval for the observed number of correct
  # predictions out of `size` test samples
  binom.test(x = floor(values["size"] * values["accuracy"]) + 1,
             n = values["size"])$conf.int
}
ciInfo <- expand.grid(size = 10:30, accuracy = seq(.7, .95, by = 0.01))
ciWidths <- t(apply(ciInfo, 1, getWidth))
head(ciWidths)
## [,1] [,2]
## [1,] 0.4439045 0.9747893
## [2,] 0.3902574 0.9397823
## [3,] 0.4281415 0.9451394
## [4,] 0.4618685 0.9496189
## [5,] 0.4189647 0.9161107
## [6,] 0.4489968 0.9221285
Problems (a)-(f) were handwritten and submitted as a PDF. I am still learning LaTeX so that I can submit future work digitally.
If we would like a closer look at what is going on with the function 1 - (1 - 1/n)^n (the probability that a given observation appears in a bootstrap sample of size n), we can scale it down to n = 1, ..., 50.
# Generate x values
x <- 1:50
# Probability that a bootstrap sample of size x contains a given observation
y <- 1 - (1 - 1/x)^x
# Plot with a logarithmic scale on the y-axis
plot(x, y, log = "y", type = "o", col = "grey", lwd = 2, xlab = "x", ylab = "y",
     main = "Plot with Logarithmic Y-axis Scale, Scaled Down")

# Estimate, by simulation, the probability that a bootstrap sample of
# size 100 contains the 4th observation
store <- rep(NA, 10000)
for (i in 1:10000) {
  store[i] <- sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.6311
The simulation estimate is close to the theoretical probability 1 - (1 - 1/100)^100 ≈ 0.634 that the 4th observation appears in a bootstrap sample of size 100.
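The summary output below comes from fitting the logistic regression named in its Call line; a minimal sketch of that fit, assuming the ISLR package that supplies the Default data:
library(ISLR)
# Logistic regression of default on balance and income
summary(glm(default ~ balance + income, family = binomial, data = Default))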
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn <- function(data, index = 1:nrow(data)) {
  # Refit the logistic regression on the bootstrap sample and return the
  # balance and income coefficients (the intercept is dropped)
  coef(glm(default ~ balance + income, data = data, subset = index,
           family = "binomial"))[-1]
}
boot.fn(Default)
## balance income
## 5.647103e-03 2.080898e-05
Multiple logistic regression (glm summary) standard errors - balance: 2.274e-04, income: 4.985e-06.
Bootstrap standard errors - balance: 2.390732e-04, income: 4.940896e-06.
The two approaches produce very similar standard error estimates.
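For reference, a sketch of the bootstrap run whose output appears below; it uses the boot package, and the seed is an assumption since the original is not shown:
library(boot)
set.seed(1)  # assumed seed; the original seed is not shown
boot(data = Default, statistic = boot.fn, R = 500)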
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 5.647103e-03 1.702938e-05 2.390732e-04
## t2* 2.080898e-05 3.682295e-07 4.940896e-06