Advanced Statistical Learning Homework 2

Juan Olazaba 2/25/2024

Question 4.3

The objective of this analysis is to find the number of PLS components that yields the optimal R^2 value (Sect. 5.1). PLS models with 1-10 components were each evaluated using five repeats of 10-fold cross-validation; the resampled results are tabulated in the exercise, and a sketch of the resampling setup appears after the answers below.

  1. Answer in PDF

  2. Answer in PDF

  3. The support vector machine model provides a near-optimal R^2 value, very close to that of the best-performing random forest model. I chose the SVM because it has a significant advantage over the random forest in computational time.

  4. I would choose boosted regression due to its good performance and relative model simplicity. Its prediction time is also among the best of all the models presented.
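For reference, a minimal sketch of the resampling scheme described above, using caret. The exercise supplies only the resampled results, so the predictors and outcome below are synthetic stand-ins:

library(caret)

# Synthetic stand-in data; the exercise does not provide the raw data
set.seed(100)
predictors <- data.frame(matrix(rnorm(100 * 20), ncol = 20))
outcome <- rnorm(100)

# Five repeats of 10-fold cross-validation over 1-10 PLS components
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
plsTune <- train(x = predictors, y = outcome, method = "pls",
                 preProcess = c("center", "scale"),
                 tuneGrid = data.frame(ncomp = 1:10),
                 trControl = ctrl)
plsTune$results[, c("ncomp", "Rsquared", "RsquaredSD")]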

Question 4.4

Brodnjak-Vonina et al. (2005) developed a methodology for food laboratories to determine the type of oil in a sample. In their procedure, they used a gas chromatograph (an instrument that separates the chemicals in a sample) to measure seven different fatty acids in an oil. These measurements are then used to predict the type of oil in a food sample. To create their model, they used 96 samples of seven types of oils. These data can be found in the caret package using data(oil). The oil types are contained in a factor variable called oilType. The types are pumpkin (coded as A), sunflower (B), peanut (C), olive (D), soybean (E), rapeseed (F) and corn (G).

library(caret)
data(oil)
str(oilType)
##  Factor w/ 7 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
table(oilType)
## oilType
##  A  B  C  D  E  F  G 
## 37 26  3  7 11 10  2
  1. Performing the sampling function several times shows that completely random sampling is fairly accurate: the class frequencies of each random sample closely match those of the original data (a direct comparison of the proportions follows the tables below). There is some variation, but not enough to be a concern.
set.seed(629)
# Draw four completely random (unstratified) samples of 60 oils each
Ja <- sample(oilType, 60, replace = FALSE)
Jb <- sample(oilType, 60, replace = FALSE)
Jc <- sample(oilType, 60, replace = FALSE)
Jd <- sample(oilType, 60, replace = FALSE)

table(Ja)
## Ja
##  A  B  C  D  E  F  G 
## 20 16  3  4  8  8  1
table(Jb)
## Jb
##  A  B  C  D  E  F  G 
## 24 17  2  4  7  5  1
table(Jc)
## Jc
##  A  B  C  D  E  F  G 
## 24 14  2  4  6  8  2
table(Jd)
## Jd
##  A  B  C  D  E  F  G 
## 25 13  1  5  8  6  2
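To make that comparison concrete, a quick sketch using the objects above places the class proportions side by side:

# Class proportions: original data vs. the four random samples
round(rbind(original = prop.table(table(oilType)),
            Ja = prop.table(table(Ja)),
            Jb = prop.table(table(Jb)),
            Jc = prop.table(table(Jc)),
            Jd = prop.table(table(Jd))), 3)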
  2. Use the caret package function createDataPartition to create a stratified random sample. How does this compare to the completely random samples?

Stratified random sampling preserves the class proportions of the original data in every resample, which substantially reduces the sample-to-sample variation in class frequencies relative to completely random sampling.

set.seed(318)
# Create 20 stratified resamples, each containing 70% of the data,
# then tabulate the class counts within each resample
oil <- createDataPartition(oilType, p = .70, times = 20)
oil <- lapply(oil, function(x, y) table(y[x]), y = oilType)
head(oil, 3)
## $Resample01
## 
##  A  B  C  D  E  F  G 
## 26 19  3  5  8  7  2 
## 
## $Resample02
## 
##  A  B  C  D  E  F  G 
## 26 19  3  5  8  7  2 
## 
## $Resample03
## 
##  A  B  C  D  E  F  G 
## 26 19  3  5  8  7  2
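For a single stratified train/test split rather than 20 resamples, the partition index can be used directly; a sketch assuming the same data:

set.seed(318)
trainIdx <- createDataPartition(oilType, p = .70, list = FALSE)
trainX <- fattyAcids[trainIdx, ]   # stratified training predictors
testX  <- fattyAcids[-trainIdx, ]  # held-out predictors
table(oilType[trainIdx])           # class proportions are preserved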
  3. With such a small sample size, what are the options for determining performance of the model? Should a test set be used?

Leave-one-out or k-fold cross-validation could be an option: divide the data set into folds, train the model on all but one fold, evaluate it on the left-out fold, and repeat this process across all folds before averaging the results (in the leave-one-out case, each fold is a single observation). Another option that is robust for small sample sizes is the bootstrap, which requires repeated sampling with replacement. With only 96 samples, setting aside a separate test set would leave too little data for training, so resampling-based estimates are preferable. Both options are sketched below.
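A minimal sketch of both options with caret, assuming the oil data loaded above (linear discriminant analysis is an arbitrary classifier chosen for illustration):

library(caret)
data(oil)

# Option 1: leave-one-out cross-validation (each observation held out once)
looFit <- train(fattyAcids, oilType, method = "lda",
                trControl = trainControl(method = "LOOCV"))

# Option 2: the bootstrap (resampling with replacement, out-of-bag evaluation)
set.seed(629)
bootFit <- train(fattyAcids, oilType, method = "lda",
                 trControl = trainControl(method = "boot", number = 100))

looFit$results$Accuracy
bootFit$results$Accuracy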

  4. Using confidence intervals to understand uncertainty: the plot below shows that as the test set size and accuracy increase, the width of the confidence interval decreases, and the relationship is approximately linear over these ranges.
# Width of an exact binomial confidence interval for a given
# test set size and accuracy
getWidth <- function(values) {
  binom.test(x = floor(values["size"]*values["accuracy"])+1, 
             n = values["size"])$conf.int
}

ciInfo <- expand.grid(size = 10:30, accuracy = seq(.7, .95, by = 0.01))
ciWidths <- t(apply(ciInfo, 1, getWidth))
head(ciWidths)
##           [,1]      [,2]
## [1,] 0.4439045 0.9747893
## [2,] 0.3902574 0.9397823
## [3,] 0.4281415 0.9451394
## [4,] 0.4618685 0.9496189
## [5,] 0.4189647 0.9161107
## [6,] 0.4489968 0.9221285
ciInfo$length <- ciWidths[,2] - ciWidths[,1]


library(lattice)
levelplot(length ~ size * accuracy, data = ciInfo)

Question 5.4.2 (From Introduction to Statistical Learning)

Parts (a) - (f) were handwritten and submitted in a separate PDF; I am still learning LaTeX so that I can submit them digitally.

  (g) After observing the plot, it is easy to see a horizontal asymptote at a y-value of 0.63212: as the number of samples n increases, the function 1 - (1 - 1/n)^n tends to 1 - 1/e, which is the limiting probability that the jth observation is in the bootstrap sample. The behavior of the probability function is demonstrated below; to get a clearer picture, I also show a second plot with fewer samples.
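A sketch of the first plot described above; the range of n (1 to 100,000, following the textbook prompt) is an assumption here:

# Probability that the jth observation appears in a bootstrap sample of size n
n <- 1:100000
p <- 1 - (1 - 1/n)^n
plot(n, p, type = "l", col = "grey", lwd = 2, xlab = "n",
     ylab = "probability", main = "Convergence to 1 - 1/e")
abline(h = 1 - exp(-1), lty = 2)  # horizontal asymptote at ~0.63212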

If we would like a closer look at what is going on with the function, we can scale it down:

# Generate x values
x <- 1:50
# Calculate y values
y <- 1 - (1 - 1/x)^x

# Plot with a logarithmic scale on the y-axis
plot(x, y, log = "y", type = "o", col = "grey", lwd = 2, xlab = "x", ylab = "y",
     main = "Plot with Logarithmic Y-axis Scale, Scaled down")

  (h) As expected, the estimated probability that the 4th observation is in a bootstrap sample of size 100 is approximately 0.63. The simulated value of 0.6311 is close to the exact probability for n = 100 (computed after the code below), consistent with a random observation belonging to a bootstrap sample with probability near 1 - 1/e.
store <- rep(NA, 10000)
for (i in 1:10000) {
  # TRUE if observation 4 appears in this bootstrap sample of size 100
  store[i] <- sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
mean(store)
## [1] 0.6311
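For comparison, the exact probability for n = 100 can be computed directly:

1 - (1 - 1/100)^100
## [1] 0.6339677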

Question 5.4.6 (From Introduction to Statistical Learning)

  1. Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
library(ISLR)   # provides the Default data set
set.seed(35)
lmod <- glm(default ~ balance + income, data = Default, family = binomial)
summary(lmod)
## 
## Call:
## glm(formula = default ~ balance + income, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
  2. Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
# Refit the logistic regression on the rows given by `index` and return
# the balance and income coefficients (dropping the intercept)
boot.fn <- function(data, index = 1:nrow(data)) {
  coef(glm(default ~ balance + income, data = data, subset = index,
           family = "binomial"))[-1]
}

boot.fn(Default)
##      balance       income 
## 5.647103e-03 2.080898e-05

Multiple logistic regression method - balance: 2.274e-04, income: 4.985e-06

Bootstrap method (R = 500; output below) - balance: 2.390732e-04, income: 4.940896e-06

library(boot)
set.seed(35)
boot.t <- boot(Default, boot.fn, R = 500)
print(boot.t)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 500)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 5.647103e-03 1.702938e-05 2.390732e-04
## t2* 2.080898e-05 3.682295e-07 4.940896e-06
  3. The two methods produced very similar estimated standard errors for both balance and income. This agreement suggests that the model-based standard errors are reliable here, and that the bootstrap is an effective way to obtain accurate standard errors without relying on the model's distributional assumptions.
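As a quick check, a sketch using the lmod and boot.t objects above places the two sets of standard errors side by side:

# Formula-based standard errors from the fitted logistic regression
summary(lmod)$coefficients[c("balance", "income"), "Std. Error"]

# Bootstrap standard errors: column-wise SD of the replicate estimates
apply(boot.t$t, 2, sd)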