K-Cross Fold validation works on the premise of repeated test/train implementations. For a given data set it is partitioned into test/train sets. K fold works by selecting a different subset of test train and interates k number of times. The accuracy of a model is them able to be tested by getting the mean of percent accuracy.
K-Fold vs Validation Set Approach. The VSA is a subset of K-Fold vlidation where K=1. The advantage would be processing sped and compute for very large datasets. Since the VSA holds larger portion of data for testing, it is tested on significant lower amount of data raising its bias.
K-Fold vs LOOCV is the otehr extreme where K=n, where n is the number of observations. The disadvatage of th LOOCV is that computationally it is expensive, but since it is trained on a much largter training set the bias is very low.
5
#| echo : falselibrary(ISLR2)
Warning: package 'ISLR2' was built under R version 4.5.3
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
library(ggplot2)library(GGally)library(mlbench)
Warning: package 'mlbench' was built under R version 4.5.2
library(tidyr)library(modelr)
Warning: package 'modelr' was built under R version 4.5.2
library(MASS)
Warning: package 'MASS' was built under R version 4.5.2
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
The following object is masked from 'package:ISLR2':
Boston
library(gtsummary)
Attaching package: 'gtsummary'
The following object is masked from 'package:MASS':
select
library(caret)
Warning: package 'caret' was built under R version 4.5.2
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
attach(Default)library(boot)
Attaching package: 'boot'
The following object is masked from 'package:lattice':
melanoma
The following object is masked from 'package:psych':
logit
From the summary it is clear to see that the Default column is not wekk balanced and can skew the results. The number of rows who are no are significantly more than the yes, which must be handled appropriately. Income also has two local maximums bimodal, which may affect the results.
default income balance student
No :9667 Min. : 772 Min. : 0.0 No :7056
Yes: 333 1st Qu.:21340 1st Qu.: 481.7 Yes:2944
Median :34553 Median : 823.6
Mean :33517 Mean : 835.4
3rd Qu.:43808 3rd Qu.:1166.3
Max. :73554 Max. :2654.3
describe(default_clean)
vars n mean sd median trimmed mad min
default* 1 10000 1.03 0.18 1.00 1.00 0.00 1.00
income 2 10000 33516.98 13336.64 34552.64 33305.57 16350.86 771.97
balance 3 10000 835.37 483.71 823.64 823.73 507.52 0.00
student* 4 10000 1.29 0.46 1.00 1.24 0.00 1.00
max range skew kurtosis se
default* 2.00 1.00 5.20 25.06 0.00
income 73554.23 72782.27 0.07 -0.90 133.37
balance 2654.32 2654.32 0.25 -0.36 4.84
student* 2.00 1.00 0.90 -1.19 0.00
ggpairs(default_clean, progress =FALSE)
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Extend the evualtion process to 3 more K fold validations
err_logistic<-numeric(3)accuracy_logistic<-numeric(3)cv_frame <-crossv_kfold(default_clean, k=3)for (i in1:3){ train<-as.data.frame(cv_frame$train[[i]]) test<-as.data.frame(cv_frame$test[[i]])#fit logistic fit_logistic<-glm(default ~ income+balance, data=train, family="binomial")#log odds prob_log<-predict(fit_logistic, newdata = test,type="response")#logs odds to prob#classification of prediction pred_log<-if_else(prob_log>=.5, "Yes", "No") err_logistic[i]<-mean(pred_log != test$default) accuracy_logistic[i]<-mean(pred_log == test$default)}cat('lower is better for the error\n')
lower is better for the error
print(mean(err_logistic))
[1] 0.0267
cat('higher is better for the accuracy\n')
higher is better for the accuracy
print(mean(accuracy_logistic))
[1] 0.9733
Now we include the is student column re evaluate the models:
err_logistic<-numeric(3)accuracy_logistic<-numeric(3)cv_frame <-crossv_kfold(default_clean, k=3)for (i in1:3){ train<-as.data.frame(cv_frame$train[[i]]) test<-as.data.frame(cv_frame$test[[i]])#fit logistic fit_logistic<-glm(default ~ income+balance+student, data=train, family="binomial")#log odds prob_log<-predict(fit_logistic, newdata = test,type="response")#logs odds to prob#classification of prediction pred_log<-if_else(prob_log>=.5, "Yes", "No") err_logistic[i]<-mean(pred_log != test$default) accuracy_logistic[i]<-mean(pred_log == test$default)}cat('lower is better for the error\n')
lower is better for the error
print(mean(err_logistic))
[1] 0.02670036
cat('higher is better for the accuracy\n')
higher is better for the accuracy
print(mean(accuracy_logistic))
[1] 0.9732996
From the addition of the student column, it did not provide a significant increase in accuracy or conversely a significant decrease in error. By parsimony it is best to leave the student column out of any model.
6
The results of the standrd glm function of std error converge to similar results from the boot function in 1000 repetitions. This can be interpreted as the parametric linear assumptions made by the glm function to be correct:
The standard deviation measure how spread the individual housing prices are from each other, while the standard error measures the precision of the sample mean to the population mean
#saple population medianmu <-mean(Boston$medv)print(mu)
The standard error from the bootstrap method (.419) aligns closely to the standard error calculated by the standard deviation (.408). Meaning that the assumptions of a standard error for mean are valid and that there are enough samples for Central Limit Theorem to hold and that no extreme outliers exist.
Both ranges for the lower and upper bound of the confidence interval for medv are similar. The small technical difference is due to the fact that the t.test uses the actual t distribution value while the standard formula uses 2.
t.test(Boston$medv)
One Sample t-test
data: Boston$medv
t = 55.111, df = 505, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
21.72953 23.33608
sample estimates:
mean of x
22.53281
lower<- mu-2*std_err_muupper<- mu+2*std_err_mucat("Confidence interval of Mu:[",lower,",",upper,"]")
Confidence interval of Mu:[ 21.71508 , 23.35053 ]
The median was 21.2 , and the standard error found through the bootstrap process informs of a small number, .3797. Meaning that the data is a large enough in sample size to be able to estimate the population median and that it is spread enough that outliers will not affect a deviation of sample median to population median. Repeated bootstrap of 1000 would results in a medain that deviates by .3797.
From the quantile this tells us that 10% of the houses are at a price of 12.75 k, and that 90% of our sample is higher. From the standard error, given the bootstrap that would mean that the true population mean would differ by .486 (486$). With the bias so low it tells us that the 12.75 is an unbiased estimator of the true population 10%.