Homework 7

knitr::opts_chunk$set(echo=TRUE,warning=F,message=F)

Please do the following problems from the text book ISLR or written otherwise. (use set.seed(702) to replicate your results).

#1. Question 5.4.1 pg 197

##Using basic statistical properties of the variance, as well as single variable calculus, derive (5.6). In other words, prove that: $/alpha$

##given by (5.6) does indeed minimize $Var({\alpha}X + [1 - {\alpha}]Y)$

##Where $Var(X) = \sigma_{X}^{2}$, $Var(Y) = \sigma_{Y}^{2}$, $Cov(X,Y) = \sigma_{XY}$

$Var({\alpha}X + [1 - {\alpha}]Y) = \alpha^{2}\sigma_{X}^{2} + (1-\alpha)^{2}\sigma_{Y}^{2} + 2\alpha(1-\alpha)*\sigma_{XY}$

##Expand

$Var({\alpha}X + [1 - {\alpha}]Y) = \alpha^{2}\sigma_{X}^{2} + \sigma_{Y}^{2} - 2\alpha\sigma_{Y}^2 + \alpha^{2}\sigma_{Y}^{2} + 2\alpha\sigma_{XY} - 2\alpha^{2}\sigma_{XY}$

##Differentiate

$\frac{d}{d\alpha}Var({\alpha}X + [1 - {\alpha}]Y) = 2\alpha\sigma_{X}^{2} - 2\sigma_{Y}^2 + 2\alpha\sigma_{Y}^{2} + 2\sigma_{XY} - 4\alpha\sigma_{XY}$

##Set right hand expression equal to zero

$0 = 2\alpha\sigma_{X}^{2} - 2\sigma_{Y}^2 + 2\alpha\sigma_{Y}^{2} + 2\sigma_{XY} - 4\alpha\sigma_{XY}$

##Solve for alpha

$2\sigma_{Y} - 2\sigma_{XY} = \alpha(2\sigma_{X}^{2} + 2\sigma_{Y}^{2} - 4\sigma_{XY})$

$\frac{2\sigma_{Y} - 2\sigma_{XY}}{2\sigma_{X}^{2} + 2\sigma_{Y}^{2} - 4\sigma_{XY}} = \alpha$

##Common factor of two so it can be reduced by 2

##Hence:

$\frac{\sigma_{Y} - \sigma_{XY}}{\sigma_{X}^{2} + \sigma_{Y}^{2} - 2\sigma_{XY}} = \alpha$

#2. Question 5.4.6 pg 199 We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

##Problem 6 a)

I will load the data set and use the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

set.seed(602)
library(ISLR)
data(Default)

default.glm <- glm(default ~ income + balance, data=Default, family='binomial')


d.coef <- summary(default.glm)$coefficients[,1]
d.std.err <- summary(default.glm)$coefficients[,2]

output.glm <- data.frame(
  Coefficients = d.coef,
  Standard.Errors = d.std.err
)

output.glm

##Problem 6 b)

Now I will write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

library(boot)

boot.fn <- function(data, index) {
  glm.fit <-glm(default ~ income + balance, data=data, family=binomial, subset=index)
  return(glm.fit$coefficients)
}

##Problem 6 c)

Now I will use the boot() function together with the boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

b.method <- boot(Default, boot.fn, 1000)

b.method

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.891298e-02 4.366882e-01
## t2*  2.080898e-05  9.411346e-08 4.747218e-06
## t3*  5.647103e-03  1.936295e-05 2.305903e-04

compare <- data.frame(
 glm.coefficinets = c(d.coef),
 boot.coefficients = c(-1.154047e+01,2.080898e-05,5.647103e-03),
 glm.standard.errors = c(d.std.err),
 boot.standard.errors = c(4.326330e-01,4.893396e-06,2.337714e-04)
)

compare

##Problem 5 d)

The coefficients are the same but the standard errors between the glm method and boot method are negligibly different.

#3. Question 5.4.9 pg 201

##Problem 9 a)

I will load the Boston data set from the library MASS. Then I will calculate the mean value medv variable and call this $\hat{\mu}$

library(MASS)
data(Boston)

mu_hat <- mean(Boston$medv)
mu_hat

## [1] 22.53281

$\hat{\mu}$ = 22.53281

##Problem 9 b)

Now I will provide an estimate of the standard error of $\hat{\mu}$ and intepret the result

#Number of observations
num_obs <- nrow(Boston)

#standard deviation
std <- sd(Boston$medv)

st_err <- std/sqrt(num_obs)

cat('The standard error of the sample mean is: ', st_err)

## The standard error of the sample mean is:  0.4088611

If we take the standard error and multiply by 1.96 and add and then add and subtract from it from mu then we can acquire the upper 95% confidence interval and the lower 95% confidence interval. This provides an estimates range of values of which the population mean is likely to fall. Now I will calculate the 95% confidence intervals

Upper_95_CI <- (mu_hat+(st_err*1.96))
Lower_95_CI <- ((mu_hat-st_err*1.96))

confid <- data.frame(
  Lower_95_CI = Lower_95_CI,
   Upper_95_CI = Upper_95_CI
)

confid

##Problem 9 c)

Now I will compare the standard error of $\hat{\mu}$ using the bootstrap method

set.seed(602)
boot.fn <- function(data,index) {
  mu <- mean(data[index])
  return(mu)
}

boot(Boston$medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 22.53281 0.02075435   0.4233395

There is a negligible difference between computation computing the standard error and using the bootstrap method. Below a side by side comparision with the absolute difference between the two.

compare_medv <- data.frame(
  computation_standard_error = st_err,
  boot_standard_error = 0.4107738,
  difference = abs(st_err - 0.4107738)
)

compare_medv

#Problem 9 d)

Now I will provide the 95% confidence intervals using the bootstrap standard error value and $\hat{\mu}$.

b_se <- 0.4107738
b_mu <- 22.53281

upper_95_CI <- (b_mu + (b_se*1.96))
lower_95_CI <- (b_mu - (b_se*1.96))

boot_CI <- data.frame(
  Lower_95_CI = lower_95_CI,
  Upper_95_CI = upper_95_CI
)

boot_CI

Now I will use the t.test(Boston$medv) to compare the mu the and SE values

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

The 95% confidence interval from the bootstrap method and the 95% confidence interval for the t.test are negligibly different. Below are the differences between the two methods

compare_CI <- data.frame(
  lower_95_CI_difference = c(abs(lower_95_CI-21.72953)),
  upper_95_CI_difference = c(abs(upper_95_CI - 23.33608))
)

compare_CI

#Problem 10 e)

Based on this data set, I will provide an estimate, $\hat{\mu_{med}}$, for the median value of medv in the population.

med <- median(Boston$medv)

$\hat{\mu_{med}} = 21.2$

##Probelm 9 f)

We now would like to estimate the standard error of $\hat{\mu_{med}}$. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

set.seed(602)


boot.fn <- function(data, index) {
  mu_med <- median(data[index])
  return(mu_med)
}

boot(Boston$medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0057   0.3825295

it turns out the median value between simply calculating the median as median(Boston$medv) and using the bootstrap method are identical which is 21.2

#Problem 9 g)

Based on this data set, I will provide an estimate for the tenth percentile of medv in Boston suburbs. I will call this quantity $/hat{\mu_{0.1}}$ I will use the quantile() function.

p10 <- quantile(Boston$medv, c(0.1))
p10

##   10% 
## 12.75

Therefore $\hat{\mu_{10}} = 12.75$

##Problem 9 h)

Now I will use a bootstap method to estimate the standard error of $\hat{\mu_{0.1}}$ and compare the results

set.seed(602)

boot.fn <- function(data,index) {
  p10_mu <- quantile(data[index], c(0.1))
  return(p10_mu)
}

boot_p10_mu <- boot(Boston$medv, boot.fn, 1000)

boot_p10_mu

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0282   0.4891923

Therefore, $\hat{\mu_{0.10}} = 12.75$ in both the computation and using the bootstrap method. The bootstrap method produces a standard error of 0.5232901

##Conclusion

Overall the methods above by using simple computation or bootstrap method produced very similar results based on this data set.

#4. Last homework you have used different classification methods to analyze the dataset you chose.

First I will load the data set using the package(data.table) and insert all the columns

#install.packags(data.table)
library(data.table)

bc.dat <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data', col.names=c('class','age','menopause','tumorsize','invnodes','nodecaps','degmalig','breast','breastquad','irradiant'))

head(bc.dat,5)

##Model Fitting

I will use all the variables in data set and asses their pvalues. Therefore class will be explained by all the other covariates. class consists of no-recurrence-events and recurrence-events

##Lostitic Regression

glm.mod.full <- glm(as.factor(class) ~ ., data=bc.dat, family='binomial')
summary(glm.mod.full)

## 
## Call:
## glm(formula = as.factor(class) ~ ., family = "binomial", data = bc.dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6589  -0.7997  -0.4962   0.9368   2.2117  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -1.12860 3393.46916   0.000  0.99973   
## age30-39              16.05306 2399.54484   0.007  0.99466   
## age40-49              15.56564 2399.54482   0.006  0.99482   
## age50-59              15.48799 2399.54484   0.006  0.99485   
## age60-69              16.15688 2399.54488   0.007  0.99463   
## age70-79              15.93559 2399.54517   0.007  0.99470   
## menopauselt40          0.85838    0.96702   0.888  0.37473   
## menopausepremeno       0.49325    0.48562   1.016  0.30977   
## tumorsize10-14        -1.92983    1.63282  -1.182  0.23725   
## tumorsize15-19        -0.01695    1.30377  -0.013  0.98963   
## tumorsize20-24         0.38423    1.24737   0.308  0.75806   
## tumorsize25-29         0.32987    1.26109   0.262  0.79365   
## tumorsize30-34         0.39770    1.25561   0.317  0.75144   
## tumorsize35-39         0.11220    1.36190   0.082  0.93434   
## tumorsize40-44        -0.14348    1.34646  -0.107  0.91514   
## tumorsize45-49        -0.12714    1.78397  -0.071  0.94319   
## tumorsize5-9         -14.89178 1181.98983  -0.013  0.98995   
## tumorsize50-54         0.73428    1.45873   0.503  0.61471   
## invnodes12-14          1.22253    1.43624   0.851  0.39466   
## invnodes15-17          0.64516    0.97806   0.660  0.50949   
## invnodes24-26         15.99287 2399.54485   0.007  0.99468   
## invnodes3-5            0.57988    0.50035   1.159  0.24648   
## invnodes6-8            0.93369    0.67977   1.374  0.16959   
## invnodes9-11           0.99059    0.84136   1.177  0.23905   
## nodecapsno            -0.17754    0.95816  -0.185  0.85300   
## nodecapsyes            0.20114    0.97186   0.207  0.83604   
## degmalig               0.66217    0.23548   2.812  0.00492 **
## breastright           -0.34880    0.33266  -1.049  0.29440   
## breastquadcentral    -17.77212 2399.54487  -0.007  0.99409   
## breastquadleft_low   -17.33917 2399.54479  -0.007  0.99423   
## breastquadleft_up    -17.53362 2399.54479  -0.007  0.99417   
## breastquadright_low  -17.82730 2399.54485  -0.007  0.99407   
## breastquadright_up   -16.84093 2399.54481  -0.007  0.99440   
## irradiantyes           0.31221    0.36792   0.849  0.39611   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.05  on 285  degrees of freedom
## Residual deviance: 282.73  on 252  degrees of freedom
## AIC: 350.73
## 
## Number of Fisher Scoring iterations: 15

It appears that degmalig is the only variable that is significant pvalue in the model. However, based on the histograms from last week I am going to try fitting a model that is class is explained by degmalig and invnodes

glm.mod <- glm(as.factor(class) ~ degmalig + invnodes, data=bc.dat, family='binomial')
summary(glm.mod)

## 
## Call:
## glm(formula = as.factor(class) ~ degmalig + invnodes, family = "binomial", 
##     data = bc.dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5709  -0.6916  -0.6916   0.9028   2.0980  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.8582     0.4841  -5.904 3.55e-09 ***
## degmalig        0.7747     0.2122   3.650 0.000262 ***
## invnodes12-14   1.2273     1.2502   0.982 0.326277    
## invnodes15-17   0.6597     0.8557   0.771 0.440709    
## invnodes24-26  15.1002   882.7434   0.017 0.986352    
## invnodes3-5     0.9749     0.3845   2.536 0.011226 *  
## invnodes6-8     1.2212     0.5377   2.271 0.023135 *  
## invnodes9-11    1.4239     0.6894   2.065 0.038893 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.05  on 285  degrees of freedom
## Residual deviance: 306.55  on 278  degrees of freedom
## AIC: 322.55
## 
## Number of Fisher Scoring iterations: 13

it appears degmalig still contains a highly signiciant pvalue in the model. Some of the invnodes have a a significance as well. Therefore, I will keep the model with both covariates.

#Splitng and testing data for GLM

Now I will split the data set 75% for training and 25% for testing. Then I acquire the error rate. I will also dummy code variables. class will 0 for no-recurrence-events and 1 otherwise.

invnodes will be: 0-2 = 0 3-5 = 1 6-8 = 2 9-11 = 3 12-14 = 4 15-17 = 5 24-26 = 6

The dummy coding is so that it can be fit some of the models.

i) Validation Set Approach (VSA)

bc.dat$class <- ifelse(bc.dat$class=="no-recurrence-events",0,1)

bc.dat$dummy <- ifelse(bc.dat$invnodes=='0-2',0,(ifelse(bc.dat$invnodes=='3-5',1,
(ifelse(bc.dat$invnodes=='6-8',2,(ifelse(bc.dat$invnodes=='9-11',3,(ifelse(bc.dat$invnodes=='12-14',4,(ifelse(bc.dat$invnodes=='15-17',5,(ifelse(bc.dat$invnodes=='24-26',6,99)))))))))))))


bc.dat <- subset(bc.dat, select=c(class, degmalig, dummy))

set.seed(123)
index <- sample(x=nrow(bc.dat), size=0.75*nrow(bc.dat))
training <- bc.dat[index,]
testing <- bc.dat[-index,]


glm.train <- glm(class ~ degmalig + dummy, data=training, family='binomial')

#putting together the confusion matrix

glm.probs <- predict(glm.train, testing, type='response', na.action=na.pass)
glm.preds <- ifelse(glm.probs > 0.50, '1','0')
cm.glm <- table(testing$class, glm.preds)
print(cm.glm)

##    glm.preds
##      0  1
##   0 46  2
##   1 15  9

glm.error.rate <- 1-((cm.glm[1] + cm.glm[2,2]) / sum(cm.glm))

cat("the glm error rate is: ",glm.error.rate)

## the glm error rate is:  0.2361111

ii) LOOCV and 5-fold Cross Validation 
to estimate the test error for the following models. Choose the best model based on test error.

library(boot)

# leave-one-out and 5-fold cross-validation prediction error for 
# the nodal data set.  Since the response is a binary variable an
# appropriate cost function is
set.seed(123)
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
bc.glm <- glm(class ~ degmalig + dummy, binomial, data = bc.dat)
cv.err <- cv.glm(bc.dat, bc.glm, cost, K = nrow(bc.dat))$delta
cv.5.err <- cv.glm(bc.dat, bc.glm, cost, K = 5)$delta

cv.err

## [1] 0.2587413 0.2589246

cv.5.err

## [1] 0.2482517 0.2552203

##i) Logistic Regression (or Multinomial Logistic Regression for more than two classes) Validation and testing set

glm.train <- glm(class ~ degmalig + dummy, data=training, family='binomial')

#putting together the confusion matrix

glm.probs <- predict(glm.train, testing, type='response', na.action=na.pass)
glm.preds <- ifelse(glm.probs > 0.50, '1','0')
cm.glm <- table(testing$class, glm.preds)
print(cm.glm)

##    glm.preds
##      0  1
##   0 46  2
##   1 15  9

glm.error.rate <- 1-((cm.glm[1] + cm.glm[2,2]) / sum(cm.glm))

cat("the glm error rate is: ",glm.error.rate)

## the glm error rate is:  0.2361111

##ii) KNN (choose the best of K)

library(class)
library(MASS)
library(tidyverse)
set.seed(123)

end <- data.frame(k=1:100, correct=NA)
for(i in 1:100){
  knn.pred = knn(train=data.frame(training$degmalig, training$dummy),
  test=data.frame(testing$degmalig, testing$dummy), cl=training$class, k=i, prob=TRUE)
  cm_knn <-table(testing$class, knn.pred)
  correct <- (cm_knn['0','0'] + cm_knn['1','1'])/sum(cm_knn)
  end$correct[i] <- correct
}

library(tidyverse)
ggplot(data=end, aes(k,correct)) + geom_line() +labs(title='Plot of K for KNN classifiers vs Accuracy of Model', x='No. K', y='Accuracy')

max_accuracy <- end[order(-end$correct),]

head(max_accuracy,10)

##iii) LDA

lda.bc <- lda(class ~ degmalig + dummy, data=training)

lda.bc.preds <- predict(lda.bc, newdata=testing, type='response')
cm.lda <- table(lda.bc.preds$class, testing$class)
print(cm.lda)

##    
##      0  1
##   0 46 15
##   1  2  9

lda.error.rate <- 1-((cm.lda[1] + cm.lda[2,2]) /sum(cm.lda))
cat("The LDA error rate is: ", lda.error.rate)

## The LDA error rate is:  0.2361111

##iv) QDA

qda.bc <- qda(class ~ degmalig + dummy, data=training)

qda.bc.preds <- predict(qda.bc, newdata=testing, type='response')
cm.qda <- table(qda.bc.preds$class, testing$class)
print(cm.qda)

##    
##      0  1
##   0 46 18
##   1  2  6

qda.error.rate <- 1-((cm.qda[1] + cm.qda[2,2]) /sum(cm.qda))
cat("The QDA error rate is: ", qda.error.rate)

## The QDA error rate is:  0.2777778

##v) MclustDA - best model chosen by BIC

library('mclust')

head(training)

mclust.mod <- MclustDA(training[,2:3], training$class, G=9)

summary(mclust.mod, newdata=testing[,2:3], newclass=testing$class)

## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood   n df       BIC
##        76.63528 214 57 -152.5901
##        
## Classes   n    % Model G
##       0 153 71.5   EEE 9
##       1  61 28.5   EEI 9
## 
## Training confusion matrix:
##      Predicted
## Class   0   1
##     0 144   9
##     1  41  20
## Classification error = 0.2336 
## Brier score          = 0.1753 
## 
## Test confusion matrix:
##      Predicted
## Class  0  1
##     0 45  3
##     1 14 10
## Classification error = 0.2361 
## Brier score          = 0.2111

##vi) MclustDA with modelType=“EDDA”

mclust.mod2.edda <- MclustDA(training[,2:3], training$class, modelType="EDDA")

summary(mclust.mod2.edda, newdata=testing[,2:3], newclass = testing$class)

## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## EDDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -528.2006 214  8 -1099.329
##        
## Classes   n    % Model G
##       0 153 71.5   VVI 1
##       1  61 28.5   VVI 1
## 
## Training confusion matrix:
##      Predicted
## Class   0   1
##     0 146   7
##     1  46  15
## Classification error = 0.2477 
## Brier score          = 0.1842 
## 
## Test confusion matrix:
##      Predicted
## Class  0  1
##     0 46  2
##     1 18  6
## Classification error = 0.2778 
## Brier score          = 0.2065

##vii) Find a new method that we haven’t covered in class that can do classification.

#install.packages('e1071')
library(e1071)

svm.train <- svm(class ~ degmalig + dummy, data=training, family='binomial')

#putting together the confusion matrix

svm.probs <- predict(svm.train, testing, type='response', na.action=na.pass)
svm.preds <- ifelse(svm.probs > 0.50, '1','0')
cm.svm <- table(testing$class, svm.preds)
print(cm.svm)

##    svm.preds
##      0  1
##   0 46  2
##   1 15  9

svm.error.rate <- 1-((cm.svm[1] + cm.svm[2,2]) / sum(cm.svm))

cat("the svm error rate is: ",svm.error.rate)

## the svm error rate is:  0.2361111

#Conclusion

The following testing error rates are:

GLM : 0.3194444 KNN : 0.2916667 LDA : 0.3194444 QDA : 0.3055556 MclustDA : 0.2777778 MclustDA (EDDA) : 0.3055556 SVM: 0.2916667 LOOCV: 0.2589246 CV 5: 0.2657709

It appears that LOOCV method has the lowest error rate with mclustDA as the second lowest error rate.

Homework 7

Modern Applied Statistics II

i) Validation Set Approach (VSA)