library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.2
data(ChemicalManufacturingProcess)
The objective of this analysis is to find the number of PLS components that yields the optimal R2 value (Sect. 5.1). PLS models with 1 through 10 components were each evaluated using five repeats of 10-fold cross-validation and the results are presented in the following table:
library(knitr)
resampled_r2 <- data.frame(
Components = 1:10,
Mean = c(0.444, 0.500, 0.533, 0.545, 0.542, 0.537, 0.534, 0.534, 0.520, 0.507),
Std_Error = c(0.0272, 0.0298, 0.0302, 0.0308, 0.0322, 0.0327, 0.0333, 0.0330, 0.0326, 0.0324)
)
kable(resampled_r2, col.names = c("Components", "Mean", "Std. Error"),
caption = "Resampled R² Results")
| Components | Mean | Std. Error |
|---|---|---|
| 1 | 0.444 | 0.0272 |
| 2 | 0.500 | 0.0298 |
| 3 | 0.533 | 0.0302 |
| 4 | 0.545 | 0.0308 |
| 5 | 0.542 | 0.0322 |
| 6 | 0.537 | 0.0327 |
| 7 | 0.534 | 0.0333 |
| 8 | 0.534 | 0.0330 |
| 9 | 0.520 | 0.0326 |
| 10 | 0.507 | 0.0324 |
The one-standard error method involves selecting the simplest model with fewest components whose R^2 is within 1 standard error of the R^2 of the model with best performance. Best overall model is at 4 components with R^2 of 0.545 and SE of 0.0308. The lower bound of a simpler model would be 0.545-0.0308=0.5142. There is one model that fits the criteria for the one-standard error method, and it is the model with 3 PLS components, with R^2 of 0.533.
Tolerance is equal to R2-R_best2/R_best^2 x 100%. Since the best R^2=0.545, the tolerance for 1 component is -18.53%, 2 is -8.26%, and 3 is -2.20%. If a 10% loss is in R^2 is acceptable, then 2 PLS components is the smallest number of components that meets this criterion.
If the sole objective is to optimize R^2, the best model is RF since it achieves the highest R^2 among all models. However, SVM performs similarly with overlapping confidence intervals in R^2. This suggests that the true performance difference between RF and SVM may not be statistically significant, meaning either could be considered a strong candidate.
I would choose boosted linear regression because it has the second highest R2 estimates while providing short prediction time and relatively low model complexity, second only to linear regression.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
data(oil) # loads oil dataset which the caret package, includes fattyAcid (predictor variable - chemical measurements) and oilType (response variable - type of oil coded A to G)
str(oilType)
## Factor w/ 7 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
table(oilType)
## oilType
## A B C D E F G
## 37 26 3 7 11 10 2
head(fattyAcids)
## Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic
## 1 9.7 5.2 31.0 52.7 0.4 0.4 0.1
## 2 11.1 5.0 32.9 49.8 0.3 0.4 0.1
## 3 11.5 5.2 35.0 47.2 0.2 0.4 0.1
## 4 10.0 4.8 30.4 53.5 0.3 0.4 0.1
## 5 12.2 5.0 31.1 50.5 0.3 0.4 0.1
## 6 9.8 4.2 43.0 39.2 2.4 0.4 0.5
summary(fattyAcids)
## Palmitic Stearic Oleic Linoleic
## Min. : 4.50 Min. :1.700 Min. :22.80 Min. : 7.90
## 1st Qu.: 6.20 1st Qu.:3.475 1st Qu.:26.30 1st Qu.:43.10
## Median : 9.85 Median :4.200 Median :30.70 Median :50.80
## Mean : 9.04 Mean :4.200 Mean :36.73 Mean :46.49
## 3rd Qu.:11.12 3rd Qu.:5.000 3rd Qu.:38.62 3rd Qu.:58.08
## Max. :14.90 Max. :6.700 Max. :76.70 Max. :66.10
## Linolenic Eicosanoic Eicosenoic
## Min. :0.100 Min. :0.100 Min. :0.1000
## 1st Qu.:0.375 1st Qu.:0.100 1st Qu.:0.1000
## Median :0.800 Median :0.400 Median :0.1000
## Mean :2.272 Mean :0.399 Mean :0.3115
## 3rd Qu.:2.650 3rd Qu.:0.400 3rd Qu.:0.3000
## Max. :9.500 Max. :2.800 Max. :1.8000
original_freq <- table(oilType)
original_pct <- original_freq / sum(original_freq) * 100
cat("Original % frequencies:\n", original_pct)
## Original % frequencies:
## 38.54167 27.08333 3.125 7.291667 11.45833 10.41667 2.083333
set.seed(0)
for (i in 1:5) {
sampled_indices <- sample(1:length(oilType), 60)
sampled_oilType <- oilType[sampled_indices]
sampled_freq <- table(sampled_oilType)
sampled_pct <- sampled_freq / sum(sampled_freq) * 100
# To show all oil types, even if some are missing (just insert 0% if not present)
all_types <- levels(oilType) # all 7 oil types: [A,B,C,D,E,F,G]
sampled_pct_full <- setNames(rep(0, length(all_types)), all_types) # setNames(rep(0,7), A-G)
sampled_pct_full[names(sampled_pct)] <- sampled_pct
cat("\nSample", i, "% frequencies:\n", sampled_pct_full)
}
##
## Sample 1 % frequencies:
## 38.33333 26.66667 1.666667 10 8.333333 13.33333 1.666667
## Sample 2 % frequencies:
## 36.66667 26.66667 3.333333 6.666667 10 15 1.666667
## Sample 3 % frequencies:
## 41.66667 25 5 3.333333 11.66667 10 3.333333
## Sample 4 % frequencies:
## 40 25 5 5 11.66667 10 3.333333
## Sample 5 % frequencies:
## 43.33333 21.66667 3.333333 8.333333 10 11.66667 1.666667
Overall, the frequencies of the random sample are quite similar to the original sample. It can over/under represent different types of oil. For example, soybean oil (E) is the 3rd most common oil from the original dataset, but it is only properly represented as such in 2 out of the 5 random samples.
set.seed(0)
train_indices <- createDataPartition(oilType, p = 60/96, list = FALSE)
stratified_sample <- oilType[train_indices]
cat("\nStratified sample frequencies:\n", table(stratified_sample) / length(stratified_sample) * 100)
##
## Stratified sample frequencies:
## 37.5 26.5625 3.125 7.8125 10.9375 10.9375 3.125
The stratified random sample maintains class balance and better matches with the original frequencies than the completely random samples. However, length(stratified_sample) is not guaranteed to be 60.
Given the small sample size (n = 96), the most appropriate approach for determining model performance is repeated CV. This allows all samples to be used for both training and evaluation, improving performance estimates while preserving the limited data. A separate test set should not be used since it would leave too little data for effective model training.
binom.test(16,20)
##
## Exact binomial test
##
## data: 16 and 20
## number of successes = 16, number of trials = 20, p-value = 0.01182
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.563386 0.942666
## sample estimates:
## probability of success
## 0.8
In this case, the width of the 95 % confidence interval is 37.9 %. Try different samples sizes and accuracy rates to understand the trade-off between the uncertainty in the results, the model performance, and the test set size.
binom.test(1, 20) #25
##
## Exact binomial test
##
## data: 1 and 20
## number of successes = 1, number of trials = 20, p-value = 4.005e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.001265089 0.248732763
## sample estimates:
## probability of success
## 0.05
binom.test(10, 20) #50
##
## Exact binomial test
##
## data: 10 and 20
## number of successes = 10, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.2719578 0.7280422
## sample estimates:
## probability of success
## 0.5
binom.test(19, 20) #25
##
## Exact binomial test
##
## data: 19 and 20
## number of successes = 19, number of trials = 20, p-value = 4.005e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.7512672 0.9987349
## sample estimates:
## probability of success
## 0.95
binom.test(10, 200) #7
##
## Exact binomial test
##
## data: 10 and 200
## number of successes = 10, number of trials = 200, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.02423417 0.09002754
## sample estimates:
## probability of success
## 0.05
binom.test(100, 200) #14
##
## Exact binomial test
##
## data: 100 and 200
## number of successes = 100, number of trials = 200, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4286584 0.5713416
## sample estimates:
## probability of success
## 0.5
binom.test(190, 200) #7
##
## Exact binomial test
##
## data: 190 and 200
## number of successes = 190, number of trials = 200, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.9099725 0.9757658
## sample estimates:
## probability of success
## 0.95
binom.test(100, 2000) #2
##
## Exact binomial test
##
## data: 100 and 2000
## number of successes = 100, number of trials = 2000, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.04086427 0.06048172
## sample estimates:
## probability of success
## 0.05
binom.test(1000, 2000) #4
##
## Exact binomial test
##
## data: 1000 and 2000
## number of successes = 1000, number of trials = 2000, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4778506 0.5221494
## sample estimates:
## probability of success
## 0.5
binom.test(1900, 2000) #2
##
## Exact binomial test
##
## data: 1900 and 2000
## number of successes = 1900, number of trials = 2000, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.9395183 0.9591357
## sample estimates:
## probability of success
## 0.95
For small test sets (n=20), the confidence interval is wide, meaning the estimated accuracy could vary greatly. As test sets increase, the interval narrows, reflecting increased confidence in accuracy estimate. Moreover, as accuracy increases from 50%, the interval narrows as well.
There are n observations in the original sample, each equally likely to be selected in each bootstrap draw. Thus, the probability of excluding any specific observation is: 1 - 1/n.
Since bootstrap samples with replacement, it will still be 1 - 1/n.
The probability that the jth observation is excluded from a single draw is 1 - 1/n. The probability that it will be excluded from all n bootstrap draws is thus (1 - 1/n)^n since each draw is independent (sample with replacement).
The formula would be 1 - (c). If you plug in 5, 1-(1-1/5)^5 = 1-0.8^5 = 0.672
Plug in 1000, so 1-(1-1/100)^100 = 1-0.99^100 = 0.634
Plug in 10,000 which is 0.632.
n <- 1:100000
plot(n, 1 - (1 - 1/n)^n)
The model relatively quickly approaches an asymptotic probability of: \[ \lim_{n \to \infty} 1 - \left(1 - \frac{1}{n}\right)^n = 1-\frac{1}{e} \approx 0.632 \] This shows that even with very large samples, about 63.2% of observations appear at least once in each bootstrap sample, while about 36.8% are left out entirely.
store <- rep(NA, 10000)
for(i in 1:10000){
store[i] <- sum(sample(1:100, rep=TRUE) == 4) > 0
}
mean(store)
## [1] 0.6355
We loop over 10,000 bootstrap samples of size 100 with replacement. The result is 0.6317 which is approximately 0.632, matching what we expected. This highlights the property of bootstrap sampling, where each observation has about a 63.2% chance of appearing at least once in a bootstrap sample and 36.8% chance of being excluded entirely.
library(ISLR)
data(Default)
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
set.seed(0)
logit_model <- glm(default ~ income + balance, data = Default, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
SE of income is 4.985e-6 and balance is 2.274e-4.
boot.fn <- function(data, index) {
fit <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
return(coef(fit))
}
library(boot)
## Warning: package 'boot' was built under R version 4.4.2
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -4.029654e-02 4.369085e-01
## t2* 2.080898e-05 2.593598e-07 4.723616e-06
## t3* 5.647103e-03 1.627427e-05 2.274881e-04
Using my bootstrap function, the SE of income is 4.724e-6 and balance is 2.275e-4. The estimated standard errors for both income and balance are very similar between the glm function and my bootstrap function, providing confidence that the model performance will be similar when tested on new data.