Question 3

We now review k-fold cross-validation.

a) Explain how k-fold cross-validation is implemented. You divide the set of observations into k groups (folds) of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k - 1 folds. The MSE is then computed on the held-out observations in the validation fold. This process is repeated k times, with a different fold serving as the validation set on each repetition. After obtaining all k MSEs, we average them to obtain the cross-validation estimate.
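As a minimal sketch of the mechanics (assuming a hypothetical data frame df with a numeric response y), the procedure can be coded directly:

# assign each observation to one of k folds at random; each fold then
# serves once as the validation set
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))
cv_errors <- sapply(1:k, function(i) {
  fit <- lm(y ~ ., data = df[folds != i, ])          # fit on the other k - 1 folds
  preds <- predict(fit, newdata = df[folds == i, ])  # predict the held-out fold
  mean((df$y[folds == i] - preds)^2)                 # validation-fold MSE
})
mean(cv_errors)  # the k-fold CV estimate of the test MSE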

b) What are the advantages and disadvantages of k-fold cross-validation relative to:

i) the validation set approach? Relative to the validation set approach, k-fold cross-validation uses more of the data for fitting and yields a less variable estimate of the test error, since the result does not depend on a single random split. A drawback of any such method is that with real data we do not know the true test MSE, making it difficult to assess how accurate the cross-validation estimate is; the cross-validation curve may capture the general shape of the true test MSE curve while still under- or overestimating its level. Still, in some instances we only need the location of the minimum of the estimated test MSE curve, which cross-validation identifies well (see the sketch after part ii).

ii) LOOCV? LOOCV is the special case of k-fold cross-validation in which k = n. When k is large, the procedure can become very computationally expensive, since the model must be refit once for every fold (n times for LOOCV); choosing a smaller k such as 5 or 10 is far cheaper. Cross-validation in either form has the advantage of being a very general approach that can be applied to almost any statistical learning method.
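As a sketch of both points above (assuming, as before, a hypothetical data frame df with response y and a single predictor x), boot::cv.glm traces out the estimated test MSE curve, and its K argument makes the LOOCV cost explicit:

library(boot)

# estimated test-MSE curve over polynomial degrees via 10-fold CV;
# the minimum of this curve picks the preferred flexibility level
cv_curve <- sapply(1:5, function(d) {
  fit <- glm(y ~ poly(x, d), data = df)
  cv.glm(df, fit, K = 10)$delta[1]  # 10 refits per degree
})
which.min(cv_curve)  # degree at the minimum of the estimated curve

# LOOCV is the special case K = n (cv.glm's default): n refits,
# which is much more expensive on large data sets
loocv_err <- cv.glm(df, glm(y ~ x, data = df))$delta[1]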

Question 5

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

a) Fit a logistic regression model that uses income and balance to predict default

default_data <- ISLR::Default
default_log_original <- glm(default ~ income + balance, data = default_data, family = "binomial")
summary(default_log_original)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = default_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
default_predict_original <- predict(default_log_original, newdata = default_data, type = "response")
predict_class_original <- ifelse(default_predict_original > 0.5, "Yes", "No")
mean(default_data$default!=predict_class_original)
## [1] 0.0263

In the logistic regression fit on the full data set, both income and balance are statistically significant predictors: each has a p-value far below our alpha level of 0.05. The misclassification rate on the training data is about 0.0263.

b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

i) Split the sample set into a training set and a validation set

set.seed(1)
subset_default <- sample(10000, 5000)  # row indices for the training half
default_train <- default_data[subset_default, ]
default_test <- default_data[-subset_default, ]

ii) Fit a multiple logistic regression model using only the training observations

default_log_train <- glm(default ~ income + balance, data = default_train, family = "binomial")
summary(default_log_train)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = default_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5830  -0.1428  -0.0573  -0.0213   3.3395  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.194e+01  6.178e-01 -19.333  < 2e-16 ***
## income       3.262e-05  7.024e-06   4.644 3.41e-06 ***
## balance      5.689e-03  3.158e-04  18.014  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1523.8  on 4999  degrees of freedom
## Residual deviance:  803.3  on 4997  degrees of freedom
## AIC: 809.3
## 
## Number of Fisher Scoring iterations: 8

iii) Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5

predict_class_train <- predict(default_log_train, default_test, type = "response")  # posterior probabilities
log_class_train <- ifelse(predict_class_train > 0.5, "Yes", "No")  # classify at the 0.5 threshold

iv) Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified

table(default_test$default, log_class_train)
##      log_class_train
##         No  Yes
##   No  4824   19
##   Yes  108   49
mean(default_test$default!=log_class_train)
## [1] 0.0254

Here, using the 50/50 training/validation split, the validation set misclassification rate is about 0.0254.

c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

subset_default_rerun1<- sample(10000, 1000)
default_train_rerun1 <- default_data[subset_default_rerun1, ]
default_test_rerun1 <- default_data[-subset_default_rerun1, ]
default_log_train_rerun1 <- glm(default ~ income + balance, data = default_train_rerun1, family = "binomial")
predict_class_train_rerun1 <- predict(default_log_train_rerun1, default_test_rerun1, type = "response")
log_class_train_rerun1 <- ifelse(predict_class_train_rerun1>0.5, "Yes", "No")
table(default_test_rerun1$default, log_class_train_rerun1)
##      log_class_train_rerun1
##         No  Yes
##   No  8671   27
##   Yes  225   77
mean(default_test_rerun1$default!=log_class_train_rerun1)
## [1] 0.028

When setting the training set to one-tenth the size of the original data, we get a misclassification rate of about 0.028.

subset_default_rerun2<- sample(10000, 2500)
default_train_rerun2 <- default_data[subset_default_rerun2, ]
default_test_rerun2 <- default_data[-subset_default_rerun2, ]
default_log_train_rerun2 <- glm(default ~ income + balance, data = default_train_rerun2, family = "binomial")
predict_class_train_rerun2 <- predict(default_log_train_rerun2, default_test_rerun2, type = "response")
log_class_train_rerun2 <- ifelse(predict_class_train_rerun2>0.5, "Yes", "No")
table(default_test_rerun2$default, log_class_train_rerun2)
##      log_class_train_rerun2
##         No  Yes
##   No  7208   41
##   Yes  174   77
mean(default_test_rerun2$default!=log_class_train_rerun2)
## [1] 0.02866667

When setting the training set to one-quarter the size of the original data, we get a misclassification rate of about 0.029.

subset_default_rerun3<- sample(10000, 6667)
default_train_rerun3 <- default_data[subset_default_rerun3, ]
default_test_rerun3 <- default_data[-subset_default_rerun3, ]
default_log_train_rerun3 <- glm(default ~ income + balance, data = default_train_rerun3, family = "binomial")
predict_class_train_rerun3 <- predict(default_log_train_rerun3, default_test_rerun3, type = "response")
log_class_train_rerun3 <- ifelse(predict_class_train_rerun3>0.5, "Yes", "No")
table(default_test_rerun3$default, log_class_train_rerun3)
##      log_class_train_rerun3
##         No  Yes
##   No  3210   10
##   Yes   77   36
mean(default_test_rerun3$default!=log_class_train_rerun3)
## [1] 0.02610261

When setting the training set to two-thirds the size of the original data, we get a misclassification rate of about 0.026. Across the three splits, the error rate stays in the 0.026-0.029 range: the estimate varies somewhat with the particular split and the size of the training set, which illustrates the main drawback of the validation set approach; the sketch below shows this variability directly.
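A minimal sketch that makes this variability concrete, reusing default_data from above and repeating a 50/50 split several times:

# each replication draws a fresh 50/50 split, refits the model, and
# records the validation error for that split
errs <- replicate(5, {
  idx <- sample(nrow(default_data), nrow(default_data) / 2)
  fit <- glm(default ~ income + balance,
             data = default_data[idx, ], family = "binomial")
  p <- predict(fit, default_data[-idx, ], type = "response")
  mean(ifelse(p > 0.5, "Yes", "No") != default_data$default[-idx])
})
errs  # a slightly different test-error estimate from each split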

d) Now consider the logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate

set.seed(1)
library(dplyr)  # for %>%, mutate(), case_when()
default_data_dummy <- default_data %>%
                      mutate(student_numeric = case_when(student == "Yes" ~ 1,
                                                         student == "No"  ~ 0))  # dummy-code student
subset_dummy <- sample(10000, 5000)
default_train_dummy <- default_data_dummy[subset_dummy, ]
default_test_dummy <- default_data_dummy[-subset_dummy, ]
log_train_dummy <- glm(default ~ income + balance + student_numeric, data = default_train_dummy, family = "binomial")
predict_default_dummy <- predict(log_train_dummy, default_test_dummy, type = "response")
class_dummy <- ifelse(predict_default_dummy>0.5, "Yes", "No")
table(default_test_dummy$default, class_dummy)
##      class_dummy
##         No  Yes
##   No  4825   18
##   Yes  112   45
mean(default_test_dummy$default!=class_dummy)
## [1] 0.026

Including the student dummy variable does not lead to a reduction in the test error rate; the validation set misclassification rate in fact increases slightly, from 0.0254 to 0.026, though the difference is small.

Question 6

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

default_log_original <- glm(default ~ income + balance, data = default_data, family = "binomial")
summary(default_log_original)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = default_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

As given by the summary output, the estimated standard errors for the coefficients of income and balance are 4.985e-06 and 2.274e-04, respectively.
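These values can also be pulled out of the summary object programmatically:

# the "Std. Error" column of the coefficient matrix holds the SEs
summary(default_log_original)$coefficients[c("income", "balance"), "Std. Error"]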

b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn <- function(data = default_data, index) {
        model_estimates <- glm(default ~ income + balance, data = data[index, ], family = "binomial")
        coef(model_estimates)[2:3] #this gives us the estimates of only income and balance, taking out the estimate for the intercept of our model
}
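As a quick sanity check, calling boot.fn() with the full set of row indices should reproduce the coefficients from the fit in Question 5(a):

boot.fn(default_data, 1:nrow(default_data))  # matches coef(default_log_original)[2:3]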

c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)   # boot()
library(broom)  # tidy()
library(knitr)  # kable()

boot_coefs <- boot(data = default_data, boot.fn, R = 200)

boot_coefs %>%
  tidy() %>%
  dplyr::select(std.error) %>%
  kable()
std.error
0.0000047
0.0002261

d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The bootstrap standard errors (about 4.7e-06 for income and 2.261e-04 for balance) are very close to those reported by summary() on the logistic regression fit (4.985e-06 and 2.274e-04, respectively).

Question 9

We will now consider the Boston housing data set, from the MASS library.

a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate µhat.

boston_data <- MASS::Boston
muhat <- mean(boston_data$medv)
muhat
## [1] 22.53281

The estimated population mean of medv is about 22.53.

b) Provide an estimate of the standard error of µhat. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

se_muhat <- sqrt(var(boston_data$medv) / nrow(boston_data))
se_muhat
## [1] 0.4088611
# or we could use plotrix::std.error(boston_data$medv)

The standard error of muhat is about 0.409. Since medv is measured in thousands of dollars, this means that across repeated samples of this size the sample mean would typically differ from the true population mean by roughly $409.

c) Now estimate the standard error of µhat using the bootstrap. How does this compare to your answer from (b)?

# statistic for boot(): the sample mean of medv over the bootstrap indices
se_muhat_function <- function(boston_data, index) {
        mean(boston_data$medv[index])
}
boot_coefs_boston <- boot(data = boston_data, se_muhat_function, R = 200)
boot_coefs_boston
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston_data, statistic = se_muhat_function, R = 200)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 22.53281 0.05218577   0.3739795

When using the bootstrap, we get a standard error of about 0.374, which is slightly lower than the formula-based estimate of 0.409 from (b) (but not by much).

d) Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95 % confidence interval using the formula [µhat − 2SE(µhat), µhat + 2SE(µhat)].

boot_coefs_boston_second <- boot_coefs_boston %>%
                            tidy() %>%
                            dplyr::select(std.error)
lower_estimate <- muhat - (2 * boot_coefs_boston_second)
upper_estimate <- muhat + (2 * boot_coefs_boston_second)
lower_estimate
##   std.error
## 1  21.78485
upper_estimate
##   std.error
## 1  23.28077
t.test(boston_data$medv)
## 
##  One Sample t-test
## 
## data:  boston_data$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

The 95% confidence interval from the bootstrap, (21.78, 23.28), is very similar (though not identical) to the t-test interval, (21.73, 23.34).
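As an aside (not required by the exercise), boot.ci() can construct an interval directly from the boot object; a sketch using the normal approximation:

# normal-approximation bootstrap CI for the mean of medv
boot.ci(boot_coefs_boston, type = "norm")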

e) Based on this data set, provide an estimate, µhatmed, for the median value of medv in the population.

muhatmed <- median(boston_data$medv)
muhatmed
## [1] 21.2

Our estimate of the population median of medv is 21.2.

f) We now would like to estimate the standard error of µhatmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

# statistic for boot(): the sample median of medv over the bootstrap indices
se_muhatmed_function <- function(boston_data, index) {
        median(boston_data$medv[index])
}
boot_med_boston <- boot(data = boston_data, se_muhatmed_function, R = 200)
boot_med_boston <- boot_med_boston %>%
                  tidy() %>%
                  dplyr::select(std.error)
boot_med_boston
## # A tibble: 1 x 1
##   std.error
##       <dbl>
## 1     0.354

The bootstrap estimate of the standard error of the median is about 0.354, which is small relative to the estimated median of 21.2, so the sample median is a fairly precise estimate.

g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity µhat0.1. (You can use the quantile() function.)

muhat_perc_ten <- quantile(boston_data$medv, 0.1)
muhat_perc_ten
##   10% 
## 12.75

The estimated tenth percentile of medv is 12.75.

h) Use the bootstrap to estimate the standard error of µhat0.1. Comment on your findings.

# statistic for boot(): the sample tenth percentile of medv over the bootstrap indices
se_muhat10_function <- function(boston_data, index) {
        quantile(boston_data$medv[index], 0.1)
}
boot_10_boston <- boot(data = boston_data, se_muhat10_function, R = 200)
boot_10_boston <- boot_10_boston %>%
                  tidy() %>%
                  dplyr::select(std.error)
boot_10_boston
## # A tibble: 1 x 1
##   std.error
##       <dbl>
## 1     0.529

The bootstrap estimate of the standard error of the tenth percentile is about 0.53. This is larger than the standard errors of the mean and the median, but still small relative to the estimate of 12.75.