HW 2

1. Exercise 4.3 from the reference textbook (Applied Predictive Modeling, p. 105): Partial least squares (Sect. 6.3) was used to model the yield of a chemical manufacturing process (Sect. 1.4). The data can be found in the AppliedPredictiveModeling package and can be loaded using

library(AppliedPredictiveModeling)
## Warning: package 'AppliedPredictiveModeling' was built under R version 4.4.2
data(ChemicalManufacturingProcess)

The objective of this analysis is to find the number of PLS components that yields the optimal R2 value (Sect. 5.1). PLS models with 1 through 10 components were each evaluated using five repeats of 10-fold cross-validation and the results are presented in the following table:

library(knitr)
resampled_r2 <- data.frame(
  Components = 1:10,
  Mean = c(0.444, 0.500, 0.533, 0.545, 0.542, 0.537, 0.534, 0.534, 0.520, 0.507),
  Std_Error = c(0.0272, 0.0298, 0.0302, 0.0308, 0.0322, 0.0327, 0.0333, 0.0330, 0.0326, 0.0324)
)
kable(resampled_r2, col.names = c("Components", "Mean", "Std. Error"),
      caption = "Resampled R² Results")
Resampled R² Results
Components Mean Std. Error
1 0.444 0.0272
2 0.500 0.0298
3 0.533 0.0302
4 0.545 0.0308
5 0.542 0.0322
6 0.537 0.0327
7 0.534 0.0333
8 0.534 0.0330
9 0.520 0.0326
10 0.507 0.0324

a) Using the “one-standard error” method, what number of PLS components provides the most parsimonious model?

The one-standard error method involves selecting the simplest model with fewest components whose R^2 is within 1 standard error of the R^2 of the model with best performance. Best overall model is at 4 components with R^2 of 0.545 and SE of 0.0308. The lower bound of a simpler model would be 0.545-0.0308=0.5142. There is one model that fits the criteria for the one-standard error method, and it is the model with 3 PLS components, with R^2 of 0.533.

b) Compute the tolerance values for this example. If a 10 % loss in R2 is acceptable, then what is the optimal number of PLS components?

Tolerance is equal to R2-R_best2/R_best^2 x 100%. Since the best R^2=0.545, the tolerance for 1 component is -18.53%, 2 is -8.26%, and 3 is -2.20%. If a 10% loss is in R^2 is acceptable, then 2 PLS components is the smallest number of components that meets this criterion.

c) Several other models (discussed in Part II) with varying degrees of complexity were trained and tuned and the results are presented in Fig. 4.13. If the goal is to select the model that optimizes R2, then which model(s) would you choose, and why?

If the sole objective is to optimize R^2, the best model is RF since it achieves the highest R^2 among all models. However, SVM performs similarly with overlapping confidence intervals in R^2. This suggests that the true performance difference between RF and SVM may not be statistically significant, meaning either could be considered a strong candidate.

d) Prediction time, as well as model complexity (Sect. 4.8) are other factors to consider when selecting the optimal model(s). Given each model’s prediction time, model complexity, and R2 estimates, which model(s) would you choose, and why?

I would choose boosted linear regression because it has the second highest R2 estimates while providing short prediction time and relatively low model complexity, second only to linear regression.

2. Exercise 4.4 from the reference textbook (Applied Predictive Modeling, p. 105): Brodnjak-Vonina et al. (2005) develop a methodology for food laboratories to determine the type of oil from a sample. In their procedure, they used a gas chromatograph (an instrument that separate chemicals in a sample) to measure seven different fatty acids in an oil. These measurements would then be used to predict the type of oil in a food samples. To create their model, they used 96 samples of seven types of oils. These data can be found in the caret package using data(oil). The oil types are contained in a factor variable called oilType. The types are pumpkin (coded as A), sunflower (B), peanut (C), olive (D), soybean (E), rapeseed (F), and corn (G).

library(caret) 
## Loading required package: ggplot2
## Loading required package: lattice
data(oil) # loads oil dataset which the caret package, includes fattyAcid (predictor variable - chemical measurements) and oilType (response variable - type of oil coded A to G)
str(oilType)
##  Factor w/ 7 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
table(oilType)
## oilType
##  A  B  C  D  E  F  G 
## 37 26  3  7 11 10  2

a) Use the sample function in base R to create a completely random sample of 60 oils. How closely do the frequencies of the random sample match the original samples? Repeat this procedure several times of understand the variation in the sampling process.

head(fattyAcids)
##   Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic
## 1      9.7     5.2  31.0     52.7       0.4        0.4        0.1
## 2     11.1     5.0  32.9     49.8       0.3        0.4        0.1
## 3     11.5     5.2  35.0     47.2       0.2        0.4        0.1
## 4     10.0     4.8  30.4     53.5       0.3        0.4        0.1
## 5     12.2     5.0  31.1     50.5       0.3        0.4        0.1
## 6      9.8     4.2  43.0     39.2       2.4        0.4        0.5
summary(fattyAcids)
##     Palmitic        Stearic          Oleic          Linoleic    
##  Min.   : 4.50   Min.   :1.700   Min.   :22.80   Min.   : 7.90  
##  1st Qu.: 6.20   1st Qu.:3.475   1st Qu.:26.30   1st Qu.:43.10  
##  Median : 9.85   Median :4.200   Median :30.70   Median :50.80  
##  Mean   : 9.04   Mean   :4.200   Mean   :36.73   Mean   :46.49  
##  3rd Qu.:11.12   3rd Qu.:5.000   3rd Qu.:38.62   3rd Qu.:58.08  
##  Max.   :14.90   Max.   :6.700   Max.   :76.70   Max.   :66.10  
##    Linolenic       Eicosanoic      Eicosenoic    
##  Min.   :0.100   Min.   :0.100   Min.   :0.1000  
##  1st Qu.:0.375   1st Qu.:0.100   1st Qu.:0.1000  
##  Median :0.800   Median :0.400   Median :0.1000  
##  Mean   :2.272   Mean   :0.399   Mean   :0.3115  
##  3rd Qu.:2.650   3rd Qu.:0.400   3rd Qu.:0.3000  
##  Max.   :9.500   Max.   :2.800   Max.   :1.8000
original_freq <- table(oilType)
original_pct <- original_freq / sum(original_freq) * 100
cat("Original % frequencies:\n", original_pct)
## Original % frequencies:
##  38.54167 27.08333 3.125 7.291667 11.45833 10.41667 2.083333
set.seed(0)
for (i in 1:5) {
  sampled_indices <- sample(1:length(oilType), 60) 
  sampled_oilType <- oilType[sampled_indices]
  sampled_freq <- table(sampled_oilType)
  sampled_pct <- sampled_freq / sum(sampled_freq) * 100
  
  # To show all oil types, even if some are missing (just insert 0% if not present)
  all_types <- levels(oilType) # all 7 oil types: [A,B,C,D,E,F,G]
  sampled_pct_full <- setNames(rep(0, length(all_types)), all_types) # setNames(rep(0,7), A-G)
  sampled_pct_full[names(sampled_pct)] <- sampled_pct

  cat("\nSample", i, "% frequencies:\n", sampled_pct_full)
}
## 
## Sample 1 % frequencies:
##  38.33333 26.66667 1.666667 10 8.333333 13.33333 1.666667
## Sample 2 % frequencies:
##  36.66667 26.66667 3.333333 6.666667 10 15 1.666667
## Sample 3 % frequencies:
##  41.66667 25 5 3.333333 11.66667 10 3.333333
## Sample 4 % frequencies:
##  40 25 5 5 11.66667 10 3.333333
## Sample 5 % frequencies:
##  43.33333 21.66667 3.333333 8.333333 10 11.66667 1.666667

Overall, the frequencies of the random sample are quite similar to the original sample. It can over/under represent different types of oil. For example, soybean oil (E) is the 3rd most common oil from the original dataset, but it is only properly represented as such in 2 out of the 5 random samples.

b) Use the caret package function createDataPartition to create a stratified random sample. How does this compare to the completely random samples?

set.seed(0)
train_indices <- createDataPartition(oilType, p = 60/96, list = FALSE)
stratified_sample <- oilType[train_indices]
cat("\nStratified sample frequencies:\n", table(stratified_sample) / length(stratified_sample) * 100) 
## 
## Stratified sample frequencies:
##  37.5 26.5625 3.125 7.8125 10.9375 10.9375 3.125

The stratified random sample maintains class balance and better matches with the original frequencies than the completely random samples. However, length(stratified_sample) is not guaranteed to be 60.

c) With such a small samples size, what are the options for determining performance of the model? Should a test set be used?

Given the small sample size (n = 96), the most appropriate approach for determining model performance is repeated CV. This allows all samples to be used for both training and evaluation, improving performance estimates while preserving the limited data. A separate test set should not be used since it would leave too little data for effective model training.

d) One method for understanding the uncertainty of a test set is to use a confidence interval. To obtain a confidence interval for the overall accuracy, the based R function binom.test can be used. It requires the user to input the number of samples and the number correctly classified to calculate the interval. For example, suppose a test set sample of 20 oil samples was set aside and 76 were used for model training. For this test set size and a model that is about 80 % accurate (16 out of 20 correct), the confidence interval would be computed using:

binom.test(16,20)
## 
##  Exact binomial test
## 
## data:  16 and 20
## number of successes = 16, number of trials = 20, p-value = 0.01182
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.563386 0.942666
## sample estimates:
## probability of success 
##                    0.8

In this case, the width of the 95 % confidence interval is 37.9 %. Try different samples sizes and accuracy rates to understand the trade-off between the uncertainty in the results, the model performance, and the test set size.

binom.test(1, 20) #25
## 
##  Exact binomial test
## 
## data:  1 and 20
## number of successes = 1, number of trials = 20, p-value = 4.005e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.001265089 0.248732763
## sample estimates:
## probability of success 
##                   0.05
binom.test(10, 20) #50
## 
##  Exact binomial test
## 
## data:  10 and 20
## number of successes = 10, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2719578 0.7280422
## sample estimates:
## probability of success 
##                    0.5
binom.test(19, 20) #25
## 
##  Exact binomial test
## 
## data:  19 and 20
## number of successes = 19, number of trials = 20, p-value = 4.005e-05
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.7512672 0.9987349
## sample estimates:
## probability of success 
##                   0.95
binom.test(10, 200) #7
## 
##  Exact binomial test
## 
## data:  10 and 200
## number of successes = 10, number of trials = 200, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.02423417 0.09002754
## sample estimates:
## probability of success 
##                   0.05
binom.test(100, 200) #14
## 
##  Exact binomial test
## 
## data:  100 and 200
## number of successes = 100, number of trials = 200, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4286584 0.5713416
## sample estimates:
## probability of success 
##                    0.5
binom.test(190, 200) #7
## 
##  Exact binomial test
## 
## data:  190 and 200
## number of successes = 190, number of trials = 200, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.9099725 0.9757658
## sample estimates:
## probability of success 
##                   0.95
binom.test(100, 2000) #2
## 
##  Exact binomial test
## 
## data:  100 and 2000
## number of successes = 100, number of trials = 2000, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.04086427 0.06048172
## sample estimates:
## probability of success 
##                   0.05
binom.test(1000, 2000) #4
## 
##  Exact binomial test
## 
## data:  1000 and 2000
## number of successes = 1000, number of trials = 2000, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4778506 0.5221494
## sample estimates:
## probability of success 
##                    0.5
binom.test(1900, 2000) #2
## 
##  Exact binomial test
## 
## data:  1900 and 2000
## number of successes = 1900, number of trials = 2000, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.9395183 0.9591357
## sample estimates:
## probability of success 
##                   0.95

For small test sets (n=20), the confidence interval is wide, meaning the estimated accuracy could vary greatly. As test sets increase, the interval narrows, reflecting increased confidence in accuracy estimate. Moreover, as accuracy increases from 50%, the interval narrows as well.

3. Exercise 5.4.2 from the reference textbook (An Introduction to Statistical Learning with Applications in R, p. 230): We will now derive the probability that a given observation is part of a bootstrap sample. Suppose that we obtain a bootstrap sample from a set of n observations.

a) What is the probability that the first bootstrap observation is not the jth observation from the original sample? Justify your answer.

There are n observations in the original sample, each equally likely to be selected in each bootstrap draw. Thus, the probability of excluding any specific observation is: 1 - 1/n.

b) What is the probability that the second bootstrap observation is not the jth observation from the original sample?

Since bootstrap samples with replacement, it will still be 1 - 1/n.

c) Argue that the probability that the jth observation is not in the bootstrap sample is (1 − 1/n)n.

The probability that the jth observation is excluded from a single draw is 1 - 1/n. The probability that it will be excluded from all n bootstrap draws is thus (1 - 1/n)^n since each draw is independent (sample with replacement).

d) When n = 5, what is the probability that the jth observation is in the bootstrap sample?

The formula would be 1 - (c). If you plug in 5, 1-(1-1/5)^5 = 1-0.8^5 = 0.672

e) When n = 100, what is the probability that the jth observation is in the bootstrap sample?

Plug in 1000, so 1-(1-1/100)^100 = 1-0.99^100 = 0.634

f) When n = 10, 000, what is the probability that the jth observation is in the bootstrap sample?

Plug in 10,000 which is 0.632.

g) Create a plot that displays, for each integer value of n from 1 to 100, 000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe.

n <- 1:100000
plot(n, 1 - (1 - 1/n)^n)

Final Probability of Inclusion

The model relatively quickly approaches an asymptotic probability of: \[ \lim_{n \to \infty} 1 - \left(1 - \frac{1}{n}\right)^n = 1-\frac{1}{e} \approx 0.632 \] This shows that even with very large samples, about 63.2% of observations appear at least once in each bootstrap sample, while about 36.8% are left out entirely.

h) We will now investigate numerically the probability that a bootstrap sample of size n = 100 contains the jth observation. Here j = 4. We repeatedly create bootstrap samples, and each time we record whether or not the fourth observation is contained in the bootstrap sample. Comment on results obtained.

store <- rep(NA, 10000)
for(i in 1:10000){
  store[i] <- sum(sample(1:100, rep=TRUE) == 4) > 0
}
mean(store)
## [1] 0.6355

We loop over 10,000 bootstrap samples of size 100 with replacement. The result is 0.6317 which is approximately 0.632, matching what we expected. This highlights the property of bootstrap sampling, where each observation has about a 63.2% chance of appearing at least once in a bootstrap sample and 36.8% chance of being excluded entirely.

4. Exercise 5.4.6 from the reference textbook (An Introduction to Statistical Learning with Applications in R, p. 232): We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coeffcients in two diferent ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

a) Using the summary() and glm() functions, determine the estimated standard errors for the coeffcients associated with income and balance in a multiple logistic regression model that uses both predictors.

library(ISLR)
data(Default)
str(Default)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554
set.seed(0)
logit_model <- glm(default ~ income + balance, data = Default, family = binomial)
summary(logit_model)
## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

SE of income is 4.985e-6 and balance is 2.274e-4.

b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coeffcient estimates for income and balance in the multiple logistic regression model.

boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
  return(coef(fit))
}

c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coeffcients for income and balance.

library(boot)
## Warning: package 'boot' was built under R version 4.4.2
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
boot(Default, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -4.029654e-02 4.369085e-01
## t2*  2.080898e-05  2.593598e-07 4.723616e-06
## t3*  5.647103e-03  1.627427e-05 2.274881e-04

d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

Using my bootstrap function, the SE of income is 4.724e-6 and balance is 2.275e-4. The estimated standard errors for both income and balance are very similar between the glm function and my bootstrap function, providing confidence that the model performance will be similar when tested on new data.