Assignment_4

Irving Cedillo

Assignment_4 : Sapling Methods

3

  1. K-Cross Fold validation works on the premise of repeated test/train implementations. For a given data set it is partitioned into test/train sets. K fold works by selecting a different subset of test train and interates k number of times. The accuracy of a model is them able to be tested by getting the mean of percent accuracy.

  2. K-Fold vs Validation Set Approach. The VSA is a subset of K-Fold vlidation where K=1. The advantage would be processing sped and compute for very large datasets. Since the VSA holds larger portion of data for testing, it is tested on significant lower amount of data raising its bias.

  3. K-Fold vs LOOCV is the otehr extreme where K=n, where n is the number of observations. The disadvatage of th LOOCV is that computationally it is expensive, but since it is trained on a much largter training set the bias is very low.

5

#| echo : false
library(ISLR2)
Warning: package 'ISLR2' was built under R version 4.5.3
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
library(ggplot2)
library(GGally)
library(mlbench)
Warning: package 'mlbench' was built under R version 4.5.2
library(tidyr)
library(modelr)
Warning: package 'modelr' was built under R version 4.5.2
library(MASS)
Warning: package 'MASS' was built under R version 4.5.2

Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

The following object is masked from 'package:ISLR2':

    Boston
library(gtsummary)

Attaching package: 'gtsummary'

The following object is masked from 'package:MASS':

    select
library(caret)
Warning: package 'caret' was built under R version 4.5.2
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
attach(Default)
library(boot)

Attaching package: 'boot'

The following object is masked from 'package:lattice':

    melanoma

The following object is masked from 'package:psych':

    logit

From the summary it is clear to see that the Default column is not wekk balanced and can skew the results. The number of rows who are no are significantly more than the yes, which must be handled appropriately. Income also has two local maximums bimodal, which may affect the results.

default_clean <- Default|>
  select(default, income, balance, student)|>
  mutate(
    default = as.factor(default),
    student = as.factor(student)
  )

summary(default_clean)
 default        income         balance       student   
 No :9667   Min.   :  772   Min.   :   0.0   No :7056  
 Yes: 333   1st Qu.:21340   1st Qu.: 481.7   Yes:2944  
            Median :34553   Median : 823.6             
            Mean   :33517   Mean   : 835.4             
            3rd Qu.:43808   3rd Qu.:1166.3             
            Max.   :73554   Max.   :2654.3             
describe(default_clean)
         vars     n     mean       sd   median  trimmed      mad    min
default*    1 10000     1.03     0.18     1.00     1.00     0.00   1.00
income      2 10000 33516.98 13336.64 34552.64 33305.57 16350.86 771.97
balance     3 10000   835.37   483.71   823.64   823.73   507.52   0.00
student*    4 10000     1.29     0.46     1.00     1.24     0.00   1.00
              max    range skew kurtosis     se
default*     2.00     1.00 5.20    25.06   0.00
income   73554.23 72782.27 0.07    -0.90 133.37
balance   2654.32  2654.32 0.25    -0.36   4.84
student*     2.00     1.00 0.90    -1.19   0.00
ggpairs(default_clean, progress = FALSE)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Train test split of the data

library(modelr)

set.seed(42)
partitions<- resample_partition(default_clean, c(train=.75, test=.25))

train_df<-as.data.frame(partitions$train)
test_df<-as.data.frame(partitions$test)

Evaluate the test train split on one iteration

log_model <- glm(default ~ income+balance, data=train_df, family="binomial")
summary(log_model)

Call:
glm(formula = default ~ income + balance, family = "binomial", 
    data = train_df)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.163e+01  5.114e-01 -22.737  < 2e-16 ***
income       2.058e-05  5.842e-06   3.523 0.000426 ***
balance      5.709e-03  2.661e-04  21.454  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2185.4  on 7498  degrees of freedom
Residual deviance: 1166.1  on 7496  degrees of freedom
AIC: 1172.1

Number of Fisher Scoring iterations: 8
tbl_regression(log_model, exponentiate = TRUE)
Characteristic OR 95% CI p-value
income 1.00 1.00, 1.00 <0.001
balance 1.01 1.01, 1.01 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
pred_probs <- predict(log_model, newdata = test_df, type = "response")
pred_classes <- ifelse(pred_probs >= 0.5, "Yes", "No")

    
actual_factor <- as.factor(test_df$default)
pred_factor <- factor(pred_classes, levels = levels(actual_factor))

misclassified<- pred_factor != actual_factor
validation_error_direct <- mean(misclassified)
cat("Validation Set Error:", round(validation_error_direct, 4))
Validation Set Error: 0.0268

Extend the evualtion process to 3 more K fold validations

err_logistic<- numeric(3)
accuracy_logistic<- numeric(3)
cv_frame <-crossv_kfold(default_clean, k=3)

for (i in 1:3){
  train<- as.data.frame(cv_frame$train[[i]])
  test<- as.data.frame(cv_frame$test[[i]])

  
  #fit logistic 
  fit_logistic<- glm(default ~ income+balance, 
                     data=train, family="binomial")
  
  #log odds
  prob_log<- predict(fit_logistic, 
                     newdata = test,
                     type="response")#logs odds to prob
  
  #classification of prediction
  pred_log<- if_else(prob_log>=.5, "Yes", "No")
  
  err_logistic[i]<- mean(pred_log != test$default)
  accuracy_logistic[i]<- mean(pred_log == test$default)
  

}

cat('lower is better for the error\n')
lower is better for the error
print(mean(err_logistic))
[1] 0.0267
cat('higher is better for the accuracy\n')
higher is better for the accuracy
print(mean(accuracy_logistic))
[1] 0.9733

Now we include the is student column re evaluate the models:

err_logistic<- numeric(3)
accuracy_logistic<- numeric(3)
cv_frame <-crossv_kfold(default_clean, k=3)

for (i in 1:3){
  train<- as.data.frame(cv_frame$train[[i]])
  test<- as.data.frame(cv_frame$test[[i]])

  
  #fit logistic 
  fit_logistic<- glm(default ~ income+balance+student, 
                     data=train, family="binomial")
  
  #log odds
  prob_log<- predict(fit_logistic, 
                     newdata = test,
                     type="response")#logs odds to prob
  
  #classification of prediction
  pred_log<- if_else(prob_log>=.5, "Yes", "No")
  
  err_logistic[i]<- mean(pred_log != test$default)
  accuracy_logistic[i]<- mean(pred_log == test$default)
  

}

cat('lower is better for the error\n')
lower is better for the error
print(mean(err_logistic))
[1] 0.02670036
cat('higher is better for the accuracy\n')
higher is better for the accuracy
print(mean(accuracy_logistic))
[1] 0.9732996

From the addition of the student column, it did not provide a significant increase in accuracy or conversely a significant decrease in error. By parsimony it is best to leave the student column out of any model.

6

The results of the standrd glm function of std error converge to similar results from the boot function in 1000 repetitions. This can be interpreted as the parametric linear assumptions made by the glm function to be correct:

  1. Variance correctly identified and constant
  2. Relationship is linear
set.seed(420)

#standard method

fit_logistic<- glm(default ~ income+balance, data=Default, family="binomial")

summary(fit_logistic)

Call:
glm(formula = default ~ income + balance, family = "binomial", 
    data = Default)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1579.0  on 9997  degrees of freedom
AIC: 1585

Number of Fisher Scoring iterations: 8
#using the bootstrap method
boot.fn <- function(data, index){
  model <- glm(default ~ income+balance, 
               data=data, 
               subset=index, 
               family="binomial"
               )
  return(coef(model))
}

boot_results <- boot(Default, boot.fn, R=1000)

print(boot_results)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Default, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
         original        bias     std. error
t1* -1.154047e+01 -3.825618e-02 4.294395e-01
t2*  2.080898e-05  2.327731e-07 4.790983e-06
t3*  5.647103e-03  1.849025e-05 2.244517e-04

9

attach(Boston)

The standard deviation measure how spread the individual housing prices are from each other, while the standard error measures the precision of the sample mean to the population mean

#saple population median
mu <- mean(Boston$medv)
print(mu)
[1] 22.53281
#std sample mean
std_err_mu <- sd(Boston$medv)/sqrt(nrow(Boston))
print(std_err_mu)
[1] 0.4088611
#describe(Boston$medv)

The standard error from the bootstrap method (.419) aligns closely to the standard error calculated by the standard deviation (.408). Meaning that the assumptions of a standard error for mean are valid and that there are enough samples for Central Limit Theorem to hold and that no extreme outliers exist.

#using the bootstrap method
boot_mean <- function(data, index){
  selected_col <- Boston$medv[index]
  ans_mean <- mean(selected_col)
  return(ans_mean)
}

set.seed(420)
boot_results <- boot(Boston, boot_mean, R=1000)
print(boot_results)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston, statistic = boot_mean, R = 1000)


Bootstrap Statistics :
    original     bias    std. error
t1* 22.53281 0.01103281   0.4196645

Both ranges for the lower and upper bound of the confidence interval for medv are similar. The small technical difference is due to the fact that the t.test uses the actual t distribution value while the standard formula uses 2.

t.test(Boston$medv)

    One Sample t-test

data:  Boston$medv
t = 55.111, df = 505, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 21.72953 23.33608
sample estimates:
mean of x 
 22.53281 
lower<- mu-2*std_err_mu
upper<- mu+2*std_err_mu
cat("Confidence interval of Mu:[",lower,",",upper,"]")
Confidence interval of Mu:[ 21.71508 , 23.35053 ]

The median was 21.2 , and the standard error found through the bootstrap process informs of a small number, .3797. Meaning that the data is a large enough in sample size to be able to estimate the population median and that it is spread enough that outliers will not affect a deviation of sample median to population median. Repeated bootstrap of 1000 would results in a medain that deviates by .3797.

mu_med <- median(Boston$medv)
cat('Mu_mean medv: ', mu_med)
Mu_mean medv:  21.2
boot_median<- function(data, index){
  return(median(data$medv[index]))
}

std_err_med <- boot(Boston, boot_median, R=1000)
cat('\n\nSD error Mu_mean medv: ')


SD error Mu_mean medv: 
print(std_err_med)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston, statistic = boot_median, R = 1000)


Bootstrap Statistics :
    original  bias    std. error
t1*     21.2  -0.014   0.3740038

From the quantile this tells us that 10% of the houses are at a price of 12.75 k, and that 90% of our sample is higher. From the standard error, given the bootstrap that would mean that the true population mean would differ by .486 (486$). With the bias so low it tells us that the 12.75 is an unbiased estimator of the true population 10%.

mu_10_percentile <- quantile(Boston$medv, probs = c(.1))
cat('10th percentile: ', mu_10_percentile)
10th percentile:  12.75
boot_quant_10 <- function(data, index){
  return(quantile(Boston$medv[index], probs = c(.1)))
}

boot_quant_10 <- boot(Boston, boot_quant_10, R=1000)
print(boot_quant_10)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston, statistic = boot_quant_10, R = 1000)


Bootstrap Statistics :
    original  bias    std. error
t1*    12.75 0.03335   0.5037152