knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

#Kevin Kuipers (Completed by myself) #02-26-2018

#1. Question 5.4.3 pg 198

##Question 3 Part a)

k-fold cross-validation begins by taking the data set and splitting it into k folds. To make the explanation more concrete, let k = 5, so the data set is split into 5 sections; for ease of explanation call them sections A, B, C, D, and E. In the first round, section A is used as the validation set and sections B, C, D, and E are the training data used to fit a model. After the model is fit, it is tested against section A to obtain the first mean squared error (MSE1). The process then repeats: section B becomes the validation set and A, C, D, and E are the training data, and after fitting the model it is tested against section B to obtain MSE2. Next, section C becomes the validation set while A, B, D, and E are the training data, and so on until every section has served once as the validation set. After acquiring the MSEs for each validation section, MSE1, MSE2, ..., MSE5, the 5-fold cross-validation estimate is obtained by averaging the values:

\[CV_{5} = \frac{1}{5}\sum_{i=1}^{5}MSE_{i}\]
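
As a minimal sketch of this procedure (assuming a hypothetical data frame dat with a numeric response y; any model could stand in for lm), the loop looks like:

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))  #assign every row to one of the 5 sections
mse <- rep(0, k)
for (j in 1:k) {
  train <- dat[folds != j, ]      #e.g. sections B, C, D, E when j = 1
  validate <- dat[folds == j, ]   #e.g. section A when j = 1
  fit <- lm(y ~ ., data = train)
  pred <- predict(fit, newdata = validate)
  mse[j] <- mean((validate$y - pred)^2)
}
cv.5 <- mean(mse)                 #the 5-fold CV estimate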

##Question 3 Part b i)

k-fold cross-validation has some advantages and disadvantages relative to the validation set approach. The main advantage of the validation set approach is that it is very fast and simple to implement. However, it also comes with some disadvantages. The validation set approach typically uses about 70% of the data for training and holds out the remaining 30% as testing data, and this leads to two main disadvantages that k-fold cross-validation improves on.

First, k-fold cross-validation uses every observation both to fit a model and to test it, since each of the k subsets takes a turn as the validation set. In that sense all of the data is used to fit the model and to evaluate it. The validation set approach, by contrast, fits the model on only one subset, the training portion, which is typically about 70% of the data.

The other main disadvantage of the validation set approach compared to k-fold cross-validation follows from the point above: the estimated mean squared error can be highly variable. Because the validation set approach uses a single training subset and a single testing subset rather than the whole data set, the test MSE estimate can change substantially depending on which observations land in each split. k-fold cross-validation reduces this variability in the MSE estimate for a given model on a data set.
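
A quick way to see this variability (a small sketch using the Auto data from the ISLR library, which also appears in question 4 below) is to repeat the validation split several times and compare the resulting test MSEs:

library(ISLR)
data(Auto)
set.seed(1)
#ten different 70/30 splits of the same data, each giving its own test MSE
val.mse <- replicate(10, {
  idx <- sample(nrow(Auto), size = floor(0.70 * nrow(Auto)))
  fit <- lm(mpg ~ horsepower, data = Auto[idx, ])
  mean((Auto$mpg[-idx] - predict(fit, newdata = Auto[-idx, ]))^2)
})
val.mse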

##Question 3 Part b ii)

k-fold cross-validation and leave-one-out cross-validation (LOOCV) both have their advantages and disadvantages. A disadvantage of the leave-one-out approach is that it can be much more computationally expensive: the model has to be fitted n times, which can bog down the system and take a long time, especially if n is large and the model is slow to fit. In these cases k-fold cross-validation has the advantage, because it is far less computationally expensive and can be applied to almost any kind of statistical learning method.
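
One caveat to the computational argument: for least squares linear or polynomial regression, ISLR gives a shortcut (their equation 5.2) that computes the LOOCV estimate from a single model fit using the leverage values h_i:

\[CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1-h_i}\right)^2\]

so the computational concern applies mainly to models without such a shortcut, such as logistic regression.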

k-fold cross-validation and leave-one-out cross-validation also come with a bias-variance trade-off. When k is less than n, the leave-one-out approach tends to have less bias than k-fold cross-validation in its estimate of the test error. In contrast, leave-one-out cross-validation has higher variance than k-fold cross-validation when k is less than n. In practice K = 5 or K = 10 are the most widely used choices, although there are cases where other values are preferable.

#2. Question 5.4.5 pg 198 (use set.seed(702) to make results replicable)

##Problem 5 a)

I will load the Default data set from the ISLR library, set the seed to 702, and fit a logistic regression model that uses income and balance to predict default. Then I will output the summary of the model.

set.seed(702)
library(ISLR)
data(Default)

glm.default <- glm(default ~ income + balance, data=Default, family='binomial')
summary(glm.default)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

##Problem 5 b)

##Using the validation set approach, estimate the test error of this model.

  1. I will split the data into training and testing sets. The split will be 70% training and 30% testing.

  2. Then I will fit a multiple logistic regression model using only the training observations.

  3. Next I will obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

  4. Lastly, I will compute the error rate of the trained model on the testing set, i.e. the fraction of individuals whose default status is misclassified when predicted from income and balance.

set.seed(702)

#part i)

#testing and train data split. Training is 70% and testing is 30%
index.d1 <- sample(x=nrow(Default), size=0.70*nrow(Default))
training.d1 <- Default[index.d1,]
testing.d1 <- Default[-index.d1,]


#part ii) 

#Fitting logistic regression model
glm.train.d1 <- glm(default ~ income + balance, data=training.d1, family='binomial')



#part iii)

#Obtaining prediction of default status.
#acquiring confusion matrix
glm.d1.probs <- predict(glm.train.d1, type='response', newdata=testing.d1)
glm.d1.preds <- ifelse(glm.d1.probs > 0.50, 'Yes','No')
cm.d1 <- table(testing.d1$default, glm.d1.preds)





#part iv)

#computing error rate against test data set
s70_30 <- 1 - (cm.d1[1,1] + cm.d1[2,2]) / sum(cm.d1)

cat("The error rate for 70% Training vs 30% testing split is: ", s70_30)
## The error rate for 70% Training vs 30% testing split is:  0.033
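
For a quick check, the same test error rate can also be computed directly from the predictions, without building the confusion matrix first:

#error rate as the fraction of mismatched predictions
mean(glm.d1.preds != testing.d1$default)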

##Problem 5 c)

##Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

##50-50 Split

Now I will repeat the above process using a 50% training vs 50% testing split

set.seed(702)

#part i)

#testing and train data split. Training is 50% and testing is 50%
index.d2 <- sample(x=nrow(Default), size=0.50*nrow(Default))
training.d2 <- Default[index.d2,]
testing.d2 <- Default[-index.d2,]
 

#part ii) 

#Fitting logistic regression model
glm.train.d2 <- glm(default ~ income + balance, data=training.d2, family='binomial')



#part iii)

#Obtaining prediction of default status.
#acquiring confusion matrix
glm.d2.probs <- predict(glm.train.d2, type='response', newdata=testing.d2)
glm.d2.preds <- ifelse(glm.d2.probs > 0.50, 'Yes','No')
cm.d2 <- table(testing.d2$default, glm.d2.preds)





#part iv)

#computing error rate against test data set
s50_50 <- 1 - (cm.d2[1,1] + cm.d2[2,2]) / sum(cm.d2)

cat("The error rate for 50% Training vs 50% testing split is: ", s50_50)
## The error rate for 50% Training vs 50% testing split is:  0.0298

##80-20 Split

Now I will repeat the above process using a 80% training vs 20% testing split

set.seed(702)

#part i)

#testing and train data split. Training is 80% and testing is 20%
index.d3 <- sample(x=nrow(Default), size=0.80*nrow(Default))
training.d3 <- Default[index.d3,]
testing.d3 <- Default[-index.d3,]
 

#part ii) 

#Fitting logistic regression model
glm.train.d3 <- glm(default ~ income + balance, data=training.d3, family='binomial')



#part iii)

#Obtaining prediction of default status.
#acquiring confusion matrix
glm.d3.probs <- predict(glm.train.d3, type='response', newdata=testing.d3)
glm.d3.preds <- ifelse(glm.d3.probs > 0.50, 'Yes','No')
cm.d3 <- table(testing.d3$default, glm.d3.preds)





#part iv)

#computing error rate against test data set
s80_20 <- 1 - (cm.d3[1,1] + cm.d3[2,2]) / sum(cm.d3)

cat("The error rate for 80% Training vs 20% testing split is: ", s80_20)
## The error rate for 80% Training vs 20% testing split is:  0.0325

##60-40 Split

Now I will repeat the above process using a 60% training vs 40% testing split

set.seed(702)

#part i)

#testing and train data split. Training is 60% and testing is 40%
index.d4 <- sample(x=nrow(Default), size=0.60*nrow(Default))
training.d4 <- Default[index.d4,]
testing.d4 <- Default[-index.d4,]
 

#part ii) 

#Fitting logistic regression model
glm.train.d4 <- glm(default ~ income + balance, data=training.d4, family='binomial')



#part iii)

#Obtaining prediction of default status.
#acquiring confusion matrix
glm.d4.probs <- predict(glm.train.d4, type='response', newdata=testing.d4)
glm.d4.preds <- ifelse(glm.d4.probs > 0.50, 'Yes','No')
cm.d4 <- table(testing.d4$default, glm.d4.preds)





#part iv)

#computing error rate against test data set
s60_40 <- 1 - (cm.d4[1,1] + cm.d4[2,2]) / sum(cm.d4)

cat("The error rate for 60% Training vs 20% testing split is: ", s60_40)
## The error rate for 60% Training vs 20% testing split is:  0.03175
split.error.rates.default <- data.frame(
  s70_30_error_rate = s70_30,
  s50_50_error_rate = s50_50,
  s80_20_error_rate = s80_20,
  s60_40_error_rate = s60_40
)

split.error.rates.default

The error rates are fairly close together but change depending on the split. There is no clean pattern with training-set size here (the 50-50 split happens to give the lowest estimate), which illustrates how the validation set estimate depends on exactly which observations end up in the training and testing sets.

##Problem 5 d)

Now I will fit a logistic regression model that predicts default status from income, balance, and a dummy variable for student. Then I will use the validation set approach to estimate the test error rate for the same four splits as before.

##70-30 Split

set.seed(702)

#part i)

#testing and train data split. Training is 70% and testing is 30%
index.d1 <- sample(x=nrow(Default), size=0.70*nrow(Default))
training.d1 <- Default[index.d1,]
testing.d1 <- Default[-index.d1,]


#part ii) 

#Fitting logistic regression model
glm.train.d1 <- glm(default ~ income + balance + student, data=training.d1, family='binomial')



#part iii)

#Obtaining prediction of default status.
#acquiring confusion matrix
glm.d1.probs <- predict(glm.train.d1, type='response', newdata=testing.d1)
glm.d1.preds <- ifelse(glm.d1.probs > 0.50, 'Yes','No')
cm.d1 <- table(testing.d1$default, glm.d1.preds)





#part iv)

#computing error rate against test data set
s70_30 <- 1 - (cm.d1[1,1] + cm.d1[2,2]) / sum(cm.d1)

##50-50 Split

set.seed(702)

#part i)

#testing and train data split. Training is 50% and testing is 50%
index.d2 <- sample(x=nrow(Default), size=0.50*nrow(Default))
training.d2 <- Default[index.d2,]
testing.d2 <- Default[-index.d2,]
 

#part ii) 

#Fitting logistic regression model
glm.train.d2 <- glm(default ~ income + balance + student, data=training.d2, family='binomial')



#part iii)

#Obtaining prediction of default status.
#acquiring confusion matrix
glm.d2.probs <- predict(glm.train.d2, type='response', newdata=testing.d2)
glm.d2.preds <- ifelse(glm.d2.probs > 0.50, 'Yes','No')
cm.d2 <- table(testing.d2$default, glm.d2.preds)





#part iv)

#computing error rate against test data set
s50_50 <- 1 - (cm.d2[1,1] + cm.d2[2,2]) / sum(cm.d2)

##80-20 Split

set.seed(702)

#part i)

#testing and train data split. Training is 80% and testing is 20%
index.d3 <- sample(x=nrow(Default), size=0.80*nrow(Default))
training.d3 <- Default[index.d3,]
testing.d3 <- Default[-index.d3,]
 

#part ii) 

#Fitting logistic regression model
glm.train.d3 <- glm(default ~ income + balance + student, data=training.d3, family='binomial')



#part iii)

#Obtaining prediction of default status.
#acquiring confusion matrix
glm.d3.probs <- predict(glm.train.d3, type='response', newdata=testing.d3)
glm.d3.preds <- ifelse(glm.d3.probs > 0.50, 'Yes','No')
cm.d3 <- table(testing.d3$default, glm.d3.preds)





#part iv)

#computing error rate against test data set
s80_20 <- 1 - (cm.d3[1,1] + cm.d3[2,2]) / sum(cm.d3)

##60-40 Split

set.seed(702)

#part i)

#testing and train data split. Training is 60% and testing is 40%
index.d4 <- sample(x=nrow(Default), size=0.60*nrow(Default))
training.d4 <- Default[index.d4,]
testing.d4 <- Default[-index.d4,]
 

#part ii) 

#Fitting logistic regression model
glm.train.d4 <- glm(default ~ income + balance + student, data=training.d4, family='binomial')



#part iii)

#Obtaining prediction of default status.
#acquiring confusion matrix
glm.d4.probs <- predict(glm.train.d4, type='response', newdata=testing.d4)
glm.d4.preds <- ifelse(glm.d4.probs > 0.50, 'Yes','No')
cm.d4 <- table(testing.d4$default, glm.d4.preds)





#part iv)

#computing error rate against test data set
s60_40 <- 1 - (cm.d4[1,1] + cm.d4[2,2]) / sum(cm.d4)

##Error Rates using the student dummy variable

split.error.rates.default <- data.frame(
  s70_30_error_rate = s70_30,
  s50_50_error_rate = s50_50,
  s80_20_error_rate = s80_20,
  s60_40_error_rate = s60_40
)

split.error.rates.default

It does not seem that adding the student dummy variable affected the error rates much. The 70-30 and 50-50 estimates are the same as before, and the 80-20 and 60-40 estimates are very close to the same with and without the student dummy variable.

#3. Question 5.4.7 pg 200

##Problem 7 a)

I will load the Weekly data set found in the ISLR library. Then I will fit a logistic regression model that predicts Direction using Lag1 and Lag2.

library(ISLR)
data(Weekly)

glm.weekly1 <- glm(Direction ~ Lag1 + Lag2, data=Weekly, family='binomial')
summary(glm.weekly1)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = "binomial", data = Weekly)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.623  -1.261   1.001   1.083   1.506  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22122    0.06147   3.599 0.000319 ***
## Lag1        -0.03872    0.02622  -1.477 0.139672    
## Lag2         0.06025    0.02655   2.270 0.023232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1488.2  on 1086  degrees of freedom
## AIC: 1494.2
## 
## Number of Fisher Scoring iterations: 4

##Problem 7 b)

Now I will fit a logistic regression model that predicts Direction using Lag1 and Lag2, but with the first observation removed from the data set.

glm.weekly2 <- glm(Direction ~ Lag1 + Lag2, data=Weekly[-1,], family='binomial')
summary(glm.weekly2)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = "binomial", data = Weekly[-1, 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6258  -1.2617   0.9999   1.0819   1.5071  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22324    0.06150   3.630 0.000283 ***
## Lag1        -0.03843    0.02622  -1.466 0.142683    
## Lag2         0.06085    0.02656   2.291 0.021971 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1494.6  on 1087  degrees of freedom
## Residual deviance: 1486.5  on 1085  degrees of freedom
## AIC: 1492.5
## 
## Number of Fisher Scoring iterations: 4

##Problem 7 c)

Now I will use the model fit in (b) to predict the first observation, checking whether P(Direction = "Up" | Lag1, Lag2) > 0.5.

glm.weekly2.test <- predict(glm.weekly2, type='response', newdata=Weekly[1,]) > .5
glm.weekly2.test
##    1 
## TRUE

The model classifies the first observation as Up. However, the true direction for the first observation is Down, so this prediction is incorrect.
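
The true label can be checked directly:

Weekly$Direction[1]  #the first observation's actual direction is Down, as noted above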

##Problem 7 d)

Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:

  1. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2.

  2. Compute the posterior probability of the market moving up for the ith observation.

  3. Use the posterior probability for the ith observation in order to predict whether or not the market moves up.

  4. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0

set.seed(123)

n <- rep(0, nrow(Weekly))

cat('There are ', nrow(Weekly), ' observations')
## There are  1089  observations
#Part i)
for (i in 1:nrow(Weekly)) {
  glm.mod <- glm(Direction ~ Lag1 + Lag2, data=Weekly[-i,], family='binomial')
#Part ii)
  up.pred <- predict.glm(glm.mod, Weekly[i, ], type='response') > .5
#part iii)
  up.t <- Weekly[i,]$Direction == 'Up'
  if (up.pred != up.t)
    n[i] <- 1
}


cat('There were ', sum(n), ' misclassified predictions')
## There were  490  misclassified predictions

##Problem 7 e)

Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the test error. Comment on the results.

mean(n)
## [1] 0.4499541

The LOOCV estimate of the test error rate is 0.4499541. This is quite a high error rate, only modestly better than random guessing.
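
As a cross-check (a sketch, not run here), essentially the same LOOCV estimate can be obtained from cv.glm in the boot package by supplying a 0/1 classification cost function; note that it refits the model n times internally, so it is slow:

library(boot)
#classification cost: fraction of fitted probabilities on the wrong side of 0.5
#(this cost function is the one shown in the cv.glm documentation)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.glm(Weekly, glm.weekly1, cost = cost, K = nrow(Weekly))$delta[1]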

#4. Write your own code (similar to Q #3. above) to estimate the test error using 6-fold cross-validation for fitting linear regression from the Auto data in the ISLR library. You should show the code in your final PDF.

I will fit a quadratic model of mpg on horsepower (horsepower and horsepower squared) and estimate its test error with cv.glm using K = 6, repeating the calculation six times to see how the estimate varies with the random fold assignment. A fully hand-rolled fold loop is shown after the cv.glm results below.

library(ISLR)
data(Auto)
Auto <- subset(Auto, select=c(mpg,horsepower))
Auto$horsepower2 <- Auto$horsepower^2

#Repeating 6-fold cross-validation six times and collecting the error estimates
set.seed(123)
library(boot)
cv.error.6 <- rep(0, 6)
for (i in 1:6) {
  glm.fit <- glm(mpg ~ horsepower + horsepower2, data = Auto)
  cv.error.6[i] <- cv.glm(Auto, glm.fit, K = 6)$delta[1]
}
cv.error.6
## [1] 19.37446 19.22604 19.21835 19.15180 19.11430 19.34824

The six repeated 6-fold cross-validation estimates are all very close to one another, so the random assignment of observations to folds has little effect on the estimated test MSE here.
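
Since the question asks for my own code, here is a hand-rolled version of the 6-fold loop that makes the fold splitting explicit. The fold assignment is random, so this estimate will differ slightly from the cv.glm values above.

set.seed(123)
k <- 6
folds <- sample(rep(1:k, length.out = nrow(Auto)))   #randomly assign each row to one of 6 folds
fold.mse <- rep(0, k)
for (j in 1:k) {
  train <- Auto[folds != j, ]
  test  <- Auto[folds == j, ]
  fit <- glm(mpg ~ horsepower + horsepower2, data = train)
  fold.mse[j] <- mean((test$mpg - predict(fit, newdata = test))^2)  #MSE on the held-out fold
}
mean(fold.mse)   #6-fold CV estimate of the test MSE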

##5. Last homework you started analyzing the dataset you chose from this website. Now continue the analysis and perform Logistic Regression, KNN, LDA, QDA, MclustDA, MclustDA with EDDA if appropriate. If it is not possible to perform any of the methods mentioned above please justify why.

First I will load the data set using the data.table package and assign names to all the columns.

#install.packages('data.table')
library(data.table)

bc.dat <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data', col.names=c('class','age','menopause','tumorsize','invnodes','nodecaps','degmalig','breast','breastquad','irradiant'))

head(bc.dat,5)

##Model Fitting

I will use all the variables in the data set and assess their p-values. That is, class will be explained by all the other covariates; class consists of no-recurrence-events and recurrence-events.

##Logistic Regression

glm.mod.full <- glm(as.factor(class) ~ ., data=bc.dat, family='binomial')
summary(glm.mod.full)
## 
## Call:
## glm(formula = as.factor(class) ~ ., family = "binomial", data = bc.dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6589  -0.7997  -0.4962   0.9368   2.2117  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -1.12860 3393.46916   0.000  0.99973   
## age30-39              16.05306 2399.54484   0.007  0.99466   
## age40-49              15.56564 2399.54482   0.006  0.99482   
## age50-59              15.48799 2399.54484   0.006  0.99485   
## age60-69              16.15688 2399.54488   0.007  0.99463   
## age70-79              15.93559 2399.54517   0.007  0.99470   
## menopauselt40          0.85838    0.96702   0.888  0.37473   
## menopausepremeno       0.49325    0.48562   1.016  0.30977   
## tumorsize10-14        -1.92983    1.63282  -1.182  0.23725   
## tumorsize15-19        -0.01695    1.30377  -0.013  0.98963   
## tumorsize20-24         0.38423    1.24737   0.308  0.75806   
## tumorsize25-29         0.32987    1.26109   0.262  0.79365   
## tumorsize30-34         0.39770    1.25561   0.317  0.75144   
## tumorsize35-39         0.11220    1.36190   0.082  0.93434   
## tumorsize40-44        -0.14348    1.34646  -0.107  0.91514   
## tumorsize45-49        -0.12714    1.78397  -0.071  0.94319   
## tumorsize5-9         -14.89178 1181.98983  -0.013  0.98995   
## tumorsize50-54         0.73428    1.45873   0.503  0.61471   
## invnodes12-14          1.22253    1.43624   0.851  0.39466   
## invnodes15-17          0.64516    0.97806   0.660  0.50949   
## invnodes24-26         15.99287 2399.54485   0.007  0.99468   
## invnodes3-5            0.57988    0.50035   1.159  0.24648   
## invnodes6-8            0.93369    0.67977   1.374  0.16959   
## invnodes9-11           0.99059    0.84136   1.177  0.23905   
## nodecapsno            -0.17754    0.95816  -0.185  0.85300   
## nodecapsyes            0.20114    0.97186   0.207  0.83604   
## degmalig               0.66217    0.23548   2.812  0.00492 **
## breastright           -0.34880    0.33266  -1.049  0.29440   
## breastquadcentral    -17.77212 2399.54487  -0.007  0.99409   
## breastquadleft_low   -17.33917 2399.54479  -0.007  0.99423   
## breastquadleft_up    -17.53362 2399.54479  -0.007  0.99417   
## breastquadright_low  -17.82730 2399.54485  -0.007  0.99407   
## breastquadright_up   -16.84093 2399.54481  -0.007  0.99440   
## irradiantyes           0.31221    0.36792   0.849  0.39611   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.05  on 285  degrees of freedom
## Residual deviance: 282.73  on 252  degrees of freedom
## AIC: 350.73
## 
## Number of Fisher Scoring iterations: 15

It appears that degmalig is the only variable with a significant p-value in the full model. However, based on the histograms from last week, I am going to try fitting a model in which class is explained by degmalig and invnodes.

glm.mod <- glm(as.factor(class) ~ degmalig + invnodes, data=bc.dat, family='binomial')
summary(glm.mod)
## 
## Call:
## glm(formula = as.factor(class) ~ degmalig + invnodes, family = "binomial", 
##     data = bc.dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5709  -0.6916  -0.6916   0.9028   2.0980  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.8582     0.4841  -5.904 3.55e-09 ***
## degmalig        0.7747     0.2122   3.650 0.000262 ***
## invnodes12-14   1.2273     1.2502   0.982 0.326277    
## invnodes15-17   0.6597     0.8557   0.771 0.440709    
## invnodes24-26  15.1002   882.7434   0.017 0.986352    
## invnodes3-5     0.9749     0.3845   2.536 0.011226 *  
## invnodes6-8     1.2212     0.5377   2.271 0.023135 *  
## invnodes9-11    1.4239     0.6894   2.065 0.038893 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 348.05  on 285  degrees of freedom
## Residual deviance: 306.55  on 278  degrees of freedom
## AIC: 322.55
## 
## Number of Fisher Scoring iterations: 13

degmalig still has a highly significant p-value in this model, and some of the invnodes levels are significant as well. Therefore, I will keep the model with both covariates.

#Splitting and testing data for GLM

Now I will split the data set 75% for training and 25% for testing, and then compute the error rate. I will also dummy code some variables: class will be 0 for no-recurrence-events and 1 otherwise.

invnodes will be coded as: 0-2 = 0, 3-5 = 1, 6-8 = 2, 9-11 = 3, 12-14 = 4, 15-17 = 5, 24-26 = 6.

The dummy coding is needed so that the data can be used to fit some of the models below.

bc.dat$class <- ifelse(bc.dat$class=="no-recurrence-events",0,1)

bc.dat$dummy <- ifelse(bc.dat$invnodes=='0-2',   0,
                ifelse(bc.dat$invnodes=='3-5',   1,
                ifelse(bc.dat$invnodes=='6-8',   2,
                ifelse(bc.dat$invnodes=='9-11',  3,
                ifelse(bc.dat$invnodes=='12-14', 4,
                ifelse(bc.dat$invnodes=='15-17', 5,
                ifelse(bc.dat$invnodes=='24-26', 6, 99)))))))


bc.dat <- subset(bc.dat, select=c(class, degmalig, dummy))

set.seed(123)
index <- sample(x=nrow(bc.dat), size=0.75*nrow(bc.dat))
training <- bc.dat[index,]
testing <- bc.dat[-index,]


glm.train <- glm(class ~ degmalig + dummy, data=training, family='binomial')

#putting together the confusion matrix

glm.probs <- predict(glm.train, testing, type='response', na.action=na.pass)
glm.preds <- ifelse(glm.probs > 0.50, '1','0')
cm.glm <- table(testing$class, glm.preds)
print(cm.glm)
##    glm.preds
##      0  1
##   0 46  2
##   1 15  9
glm.error.rate <- 1 - (cm.glm[1,1] + cm.glm[2,2]) / sum(cm.glm)

cat("the glm error rate is: ",glm.error.rate)
## the glm error rate is:  0.2361111

#KNN Method

library(class)
library(MASS)
library(tidyverse)
set.seed(123)

end <- data.frame(k=1:100, correct=NA)
for(i in 1:100){
  knn.pred = knn(train=data.frame(training$degmalig, training$dummy),
  test=data.frame(testing$degmalig, testing$dummy), cl=training$class, k=i, prob=TRUE)
  cm_knn <-table(testing$class, knn.pred)
  correct <- (cm_knn['0','0'] + cm_knn['1','1'])/sum(cm_knn)
  end$correct[i] <- correct
}

library(tidyverse)
ggplot(data=end, aes(k,correct)) + geom_line() +labs(title='Plot of K for KNN classifiers vs Accuracy of Model', x='No. K', y='Accuracy')

max_accuracy <- end[order(-end$correct),]

head(max_accuracy,1)

It appears that k values of roughly 3 through 9 produce the highest accuracy for this model, so the corresponding KNN test error rate is 0.2916667.
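
To list every k that ties for the maximum accuracy (rather than just the first row of the sorted data frame), the maximum can be filtered directly:

best.k <- end$k[end$correct == max(end$correct)]  #all k values achieving the top accuracy
best.k
1 - max(end$correct)  #the corresponding test error rate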

#LDA Method & Error Rate

lda.bc <- lda(class ~ degmalig + dummy, data=training)

lda.bc.preds <- predict(lda.bc, newdata=testing, type='response')
cm.lda <- table(lda.bc.preds$class, testing$class)
print(cm.lda)
##    
##      0  1
##   0 46 15
##   1  2  9
lda.error.rate <- 1 - (cm.lda[1,1] + cm.lda[2,2]) / sum(cm.lda)
cat("The LDA error rate is: ", lda.error.rate)
## The LDA error rate is:  0.2361111

#QDA Method and Error Rate

qda.bc <- qda(class ~ degmalig + dummy, data=training)

qda.bc.preds <- predict(qda.bc, newdata=testing, type='response')
cm.qda <- table(qda.bc.preds$class, testing$class)
print(cm.qda)
##    
##      0  1
##   0 46 18
##   1  2  6
qda.error.rate <- 1 - (cm.qda[1,1] + cm.qda[2,2]) / sum(cm.qda)
cat("The QDA error rate is: ", qda.error.rate)
## The QDA error rate is:  0.2777778

#MclustDA

library('mclust')

head(training)
mclust.mod <- MclustDA(training[,2:3], training$class, G=9)

summary(mclust.mod, newdata=testing[,2:3], newclass=testing$class)
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood   n df       BIC
##        76.63528 214 57 -152.5901
##        
## Classes   n    % Model G
##       0 153 71.5   EEE 9
##       1  61 28.5   EEI 9
## 
## Training confusion matrix:
##      Predicted
## Class   0   1
##     0 144   9
##     1  41  20
## Classification error = 0.2336 
## Brier score          = 0.1753 
## 
## Test confusion matrix:
##      Predicted
## Class  0  1
##     0 45  3
##     1 14 10
## Classification error = 0.2361 
## Brier score          = 0.2111

#MclustDA with EDDA Model Type

mclust.mod2.edda <- MclustDA(training[,2:3], training$class, modelType="EDDA")

summary(mclust.mod2.edda, newdata=testing[,2:3], newclass = testing$class)
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## EDDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -528.2006 214  8 -1099.329
##        
## Classes   n    % Model G
##       0 153 71.5   VVI 1
##       1  61 28.5   VVI 1
## 
## Training confusion matrix:
##      Predicted
## Class   0   1
##     0 146   7
##     1  46  15
## Classification error = 0.2477 
## Brier score          = 0.1842 
## 
## Test confusion matrix:
##      Predicted
## Class  0  1
##     0 46  2
##     1 18  6
## Classification error = 0.2778 
## Brier score          = 0.2065

#Conclusion

The test error rates reported above are:

GLM : 0.2361111
KNN : 0.2916667
LDA : 0.2361111
QDA : 0.2777778
MclustDA (G = 9) : 0.2361
MclustDA (EDDA) : 0.2778

Logistic regression, LDA, and MclustDA (with G = 9) tie for the lowest test error rate on this split, with QDA and MclustDA with EDDA slightly higher, while KNN gives the highest test error of the methods compared.