R Markdown
- We now review k-fold cross-validation.
- Explain how k-fold cross-validation is implemented. cross-validation
is implemented when real life data is being used to model for
accuracy.k-fold is used when you want to find the average of data
sets
- What are the advantages and disadvantages of k-fold crossvalidation
relative to: An advantages of k-fold is that it is not biased however a
disadvantage is tha tit can over estimate.
- The validation set approach? the validation set approach can be
higher
- LOOCV? reduces bias
- In Chapter 4, we used logistic regression to predict the probability
of default using income and balance on the Default data set. We will now
estimate the test error of this logistic regression model using the
validation set approach. Do not forget to set a random seed before
beginning your analysis.
- Fit a logistic regression model that uses income and balance to
predict default.
library(ISLR)
set.seed(1)
fit.glm = glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## Number of Fisher Scoring iterations: 8
- Using the validation set approach, estimate the test error of this
model. In order to do this, you must perform the following steps:
- Split the sample set into a training set and a validation set.
train = sample(dim(Default)[1], dim(Default)[1] / 2)
- Fit a multiple logistic regression model using only the training
observations.
fit.glm = glm(default ~ income + balance, data = Default[train,], family = "binomial")
fit.glm = glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
#Both above formulas have the same outcome.
summary(fit.glm)
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3583 -0.1268 -0.0475 -0.0165 3.8116
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.208e+01 6.658e-01 -18.148 <2e-16 ***
## income 1.858e-05 7.573e-06 2.454 0.0141 *
## balance 6.053e-03 3.467e-04 17.457 <2e-16 ***
## Null deviance: 1457.0 on 4999 degrees of freedom
## Residual deviance: 734.4 on 4997 degrees of freedom
## AIC: 740.4
## Scoring iterations: 8
- Obtain a prediction of default status for each individual in the
validation set by computing the posterior probability of default for
that individual, and classifying the individual to the default category
if the posterior probability is greater than 0.5.
glm.probs = predict(fit.glm, newdata = Default[-train, ], type="response")
glm.pred[glm.probs>0.5] = "Yes"
- Compute the validation set error, which is the fraction of the
observations in the validation set that are misclassified.
mean(glm.pred != Default[-train, ]$default)
## 0.0286
- Repeat the process in (b) three times, using three different splits
of the observations into a training set and a validation set. Comment on
the results obtained.
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
probs <- predict(fit.glm, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## 0.028
- Now consider a logistic regression model that predicts the
probability of default using income, balance, and a dummy variable for
student. Estimate the test error for this model using the validation set
approach. Comment on whether or not including a dummy variable for
student leads to a reduction in the test error rate.
fit.glm <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
probs <- predict(fit.glm, newdata = Default[-train, ], type = "response")
pred.glm[probs > 0.5] <- "Yes"
- We continue to consider the use of a logistic regression model to
predict the probability of default using income and balance on the
Default data set. In particular, we will now compute estimates for the
standard errors of the income and balance logistic regression
coefficients in two different ways: (1) using the bootstrap, and (2)
using the standard formula for computing the standard errors in the
glm() function. Do not forget to set a random seed before beginning your
analysis.
- Using the summary() and glm() functions, determine the estimated
standard errors for the coefficients associated with income and balance
in a multiple logistic regression model that uses both predictors.
set.seed(1)
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(fit.glm)
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3583 -0.1268 -0.0475 -0.0165 3.8116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.208e+01 6.658e-01 -18.148 <2e-16 ***
## income 1.858e-05 7.573e-06 2.454 0.0141 *
## balance 6.053e-03 3.467e-04 17.457 <2e-16 ***
## Null deviance: 1457.0 on 4999 degrees of freedom
## Residual deviance: 734.4 on 4997 degrees of freedom
## AIC: 740.4
## Scoring iterations: 8
- Write a function, boot.fn(), that takes as input the Default data
set as well as an index of the observations, and that outputs the
coefficient estimates for income and balance in the multiple logistic
regression model.
boot.fn <- function(data, index) {
fit <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
return (coef(fit))
}
- Use the boot() function together with your boot.fn() function to
estimate the standard errors of the logistic regression coefficients for
income and balance.
library(boot)
boot(Default, boot.fn, 1000)
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -8.331926e-03 4.240329e-01
## t2* 2.080898e-05 5.792741e-08 4.590086e-06
- We will now consider the Boston housing data set, from the ISLR2
library.
- Based on this data set, provide an estimate for the population mean
of medv. Call this estimate ˆµ.
mu = df_boston['medv'].mean(axis=0)
print("Population Mean:", mu)
- Provide an estimate of the standard error of ˆµ. Interpret this
result. Hint: We can compute the standard error of the sample mean by
dividing the sample standard deviation by the square root of the number
of observations.
def std_err(data):
return np.std(data, ddof=1) / np.sqrt(np.size(data))
- Now estimate the standard error of ˆµ using the bootstrap. How does
this compare to your answer from (b)?
sample_means = list() # Empty list
for _ in range(1000):
y = np.random.choice(df_boston['medv'], size=300, replace=True)
avg = np.mean(y)
sample_means.append(avg)
plt.hist(sample_means)
plt.ylabel("Count")
plt.xlabel("Mean")
- Based on your bootstrap estimate from (c), provide a 95 % confidence
interval for the mean of medv. Compare it to the results obtained using
t.test(Boston$medv). Hint: You can approximate a 95 % confidence
interval using the formula [ˆµ − 2SE(ˆµ), µˆ + 2SE(ˆµ)].
std_err_boot = np.std(sample_means)
print(mu - 2 * std_err_boot, mu + 2 * std_err_boot
- Based on this data set, provide an estimate, ˆµmed, for the median
value of medv in the population.
print("Population Median:", df_boston['medv'].median(axis=0))
- We now would like to estimate the standard error of ˆµmed.
Unfortunately, there is no simple formula for computing the standard
error of the median. Instead, estimate the standard error of the median
using the bootstrap. Comment on your findings.
sample_median = list()
for _ in range(1000):
y = np.random.choice(df_boston['medv'], size=300, replace=True)
sample_median.append(np.median(y)
plt.hist(sample_median)
plt.ylabel("Count")
plt.xlabel("Median")
- Based on this data set, provide an estimate for the tenth percentile
of medv in Boston census tracts. Call this quantity ˆµ0.1. (You can use
the quantile() function.)
print("Tenth percentile of medv is:", mu_10
- Use the bootstrap to estimate the standard error of ˆµ0.1. Comment
on your findings
for _ in range(1000):
y = np.random.choice(df_boston['medv'], size=505, replace=True)
sample_quantile_10.append(np.quantile(y, 0.1)