We now examine the differences between LDA and QDA.
(a). If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
LDA is a special case of QDA (QDA with a common covariance matrix across classes reduces to LDA). On the training set, QDA will generally fit at least as well as LDA because it is the more flexible model, so we expect QDA to have the lower training error. But when the true boundary is linear, that extra flexibility buys nothing and mostly adds variance (overfitting), so we expect LDA to achieve the lower test error.
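As a quick illustration (a sketch added for intuition, not part of the required answer), we can simulate two Gaussian classes that share a covariance matrix, so the true boundary is linear, and compare the two fits; exact numbers depend on the seed:
library(MASS)  # for lda() and qda()
set.seed(1)
n <- 100
x <- rbind(matrix(rnorm(n * 2), n, 2), matrix(rnorm(n * 2, mean = 1), n, 2))
sim <- data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(rep(c(0, 1), each = n)))
idx <- sample(2 * n, n)  # half the points for training, half held out
lda.sim <- lda(y ~ x1 + x2, data = sim, subset = idx)
qda.sim <- qda(y ~ x1 + x2, data = sim, subset = idx)
mean(predict(lda.sim, sim[-idx, ])$class != sim$y[-idx])  # LDA test error
mean(predict(qda.sim, sim[-idx, ])$class != sim$y[-idx])  # QDA test error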
(b). If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
If the decision boundary is non-linear, we expect QDA to perform better on both the training and the test set: it can capture a non-linear boundary while still taking advantage of a parametric form, so it fits the training data better, and (given enough data to estimate its extra parameters) that advantage carries over to test data.
(c). In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
We expect the test prediction accuracy of QDA relative to LDA to improve. QDA estimates more parameters than LDA, so its main weakness is variance; as the sample size n increases, that variance shrinks, and QDA's more flexible boundary can track the Bayes decision boundary more closely without overfitting.
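To see the sample-size effect concretely, here is a small sketch (simulated Gaussian classes with unequal covariances, so the Bayes boundary is quadratic; numbers vary with the seed):
library(MASS)  # for lda() and qda()
set.seed(2)
errs <- sapply(c(50, 200, 1000), function(n) {
  m <- rep(0:1, each = n); s <- rep(1:2, each = n)
  tr <- data.frame(a = rnorm(2 * n, m, s), b = rnorm(2 * n, m, s),
                   y = factor(rep(0:1, each = n)))
  mt <- rep(0:1, each = 1000); st <- rep(1:2, each = 1000)
  te <- data.frame(a = rnorm(2000, mt, st), b = rnorm(2000, mt, st),
                   y = factor(rep(0:1, each = 1000)))
  c(LDA = mean(predict(lda(y ~ a + b, data = tr), te)$class != te$y),
    QDA = mean(predict(qda(y ~ a + b, data = tr), te)$class != te$y))
})
colnames(errs) <- c("n=50", "n=200", "n=1000")
errs  # QDA's test error should move toward the Bayes rate as n grows; LDA's cannot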
(d). True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.
False. When there are relatively few training observations and the Bayes decision boundary is linear, LDA tends to be the better bet: QDA's extra parameters make it prone to overfitting, so despite being flexible enough to represent a linear boundary, it will likely have the higher test error.
Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?
We should prefer logistic regression. With K = 1, KNN classifies every training point by its own label, so its training error is 0%, i.e. it has overfit the training data. Since its average error over the training and test sets is 18%, its test error must be 36%, which is higher than logistic regression's 30% test error. That said, a 30% test error for logistic regression is itself fairly high, which suggests the true boundary may be non-linear; in that case a more flexible method such as QDA might do better than either.
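The arithmetic behind the 36% figure:
avg_knn_err <- 0.18    # given: average of training and test error for 1-NN
train_knn_err <- 0     # 1-NN classifies each training point by itself
test_knn_err <- 2 * avg_knn_err - train_knn_err
test_knn_err
## [1] 0.36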
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
library(ISLR2)
library(MASS)
library(class)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
(a). Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
data("Auto")
mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## Min. :0.0
## 1st Qu.:0.0
## Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
##
(b). Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
library(corrplot)
corrplot::corrplot.mixed(cor(Auto[, -9]), upper="circle")
pairs(Auto[, -9])
par(mfrow=c(2,3))
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")
boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")
mpg01 shows strong negative correlations with cylinders, displacement, horsepower, and weight, and the boxplots show clear separation on these variables, so those four features look most useful for predicting mpg01.
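We can back up the plots numerically by ranking the correlations with mpg01 by absolute value, using the correlation matrix computed above; cylinders (0.76), weight (0.76), and displacement (0.75) lead, followed by horsepower (0.67), ahead of origin, year, and acceleration:
sort(abs(cor(Auto[, -9])[, "mpg01"]), decreasing = TRUE)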
(c). Split the data into a training set and a test set.
set.seed(123)
train <- sample(1:nrow(Auto), nrow(Auto) * 0.7, replace = FALSE)
test <- -train
training_data <- Auto[train, ]
testing_data <- Auto[test, ]
mpg01.test <- mpg01[test]
(d). Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
lda_model <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
lda_model
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
##
## Prior probabilities of groups:
## 0 1
## 0.4963504 0.5036496
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.786765 3641.022 275.2941 130.96324
## 1 4.188406 2314.000 114.5290 78.00725
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.3974647924
## weight -0.0009670704
## displacement -0.0029615583
## horsepower 0.0049004106
pred.lda <- predict(lda_model, testing_data)
names(pred.lda)
## [1] "class" "posterior" "x"
table(pred.lda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 50 3
## 1 10 55
mean(pred.lda$class != mpg01.test)
## [1] 0.1101695
The test error is 11.02%.
(e). Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
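# Note: acceleration is used below in place of displacement, a slight departure
# from the feature set identified in (b)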
qda_model = qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data=training_data)
qda.class=predict(qda_model, testing_data)$class
table(qda.class, testing_data$mpg01)
##
## qda.class 0 1
## 0 53 4
## 1 7 54
mean(qda.class != testing_data$mpg01)
## [1] 0.09322034
The test error is 9.32%.
(f). Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
glm_model <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data, family = binomial)
probs <- predict(glm_model, testing_data, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, mpg01.test)
## mpg01.test
## pred.glm 0 1
## 0 53 6
## 1 7 52
mean(pred.glm != mpg01.test)
## [1] 0.1101695
The test error is 11.02%.
(g). Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
library(e1071)
nb_model <- naiveBayes(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
nb_pred <- predict(nb_model, testing_data)
mean(nb_pred != testing_data$mpg01)
## [1] 0.1016949
The test error is 10.17%.
(h). Perform KNN on the training data, with several values of K, in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
library(tidyverse)
set.seed(1)
acc <- list()
num_train <- nrow(Auto) * 0.75
inTrain <- sample(nrow(Auto), size = num_train)
training <- Auto[inTrain, ]
testing <- Auto[-inTrain, ]
x_train <- training[, c('cylinders', 'displacement', 'horsepower', 'weight')]
y_train <- training$mpg01
x_test <- testing[, c('cylinders', 'displacement', 'horsepower', 'weight')]
for (i in 1:20) {
  knn_pred <- knn(train = x_train, test = x_test, cl = y_train, k = i)
  acc[[as.character(i)]] <- mean(knn_pred == testing$mpg01)
}
acc <- unlist(acc)
tibble(acc = acc) %>%
  mutate(k = row_number()) %>%
  ggplot(aes(k, acc)) +
  geom_col(aes(fill = k == which.max(acc))) +
  labs(x = 'K', y = 'Accuracy', title = 'KNN Accuracy for different values of K') +
  scale_x_continuous(breaks = 1:20) +
  scale_y_continuous(breaks = round(c(seq(0.90, 0.94, 0.01), max(acc)), digits = 3)) +
  geom_hline(yintercept = max(acc), lty = 2) +
  coord_cartesian(ylim = c(min(acc), max(acc))) +
  guides(fill = "none")
Based on the accuracy plot, the value of K that performs best on this data set is 3.
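The best K can also be extracted programmatically from the acc vector built above (the winner depends on the seed and split):
best_k <- as.integer(names(acc)[which.max(acc)])
best_k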
This problem involves writing functions.
(a). Write a function, Power(), that prints out the result of raising 2 to the 3rd power. In other words, your function should compute \(2^3\) and print out the results.
Hint: Recall that x^a raises x to the power a. Use the print() function to output the result.
Power <- function() {
  print(2^3)
}
Power()
## [1] 8
(b). Create a new function, Power2(), that allows you to pass any two numbers, x and a, and prints out the value of x^a. You can do this by beginning your function with the line
Power2 <- function(x, a) {
x^a
}
You should be able to call your function by entering, for instance,
Power2(3, 8)
## [1] 6561
on the command line. This should output the value of \(3^8\), namely, 6561.
(c). Using the Power2() function that you just wrote, compute \(10^3\), \(8^{17}\) , and \(131^3\).
Power2(10, 3)
## [1] 1000
Power2(8, 17)
## [1] 2.2518e+15
Power2(131, 3)
## [1] 2248091
(d). Now create a new function, Power3(), that actually returns the result x^a as an R object, rather than simply printing it to the screen. That is, if you store the value x^a in an object called result within your function, then you can simply return() this result, using the following line:
# return(result)
The line above should be the last line in your function, before the } symbol.
Power3 = function(x, a) {
result = x^a
return(result)
}
(e). Now using the Power3() function, create a plot of \(f(x)=x^2\). The x-axis should display a range of integers from 1 to 10, and the y-axis should display \(x^2\). Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the x-axis, the y-axis, or both on the log-scale. You can do this by using log = “x”, log = “y”, or log = “xy” as arguments to the plot() function.
x = 1:10
plot(x, Power3(x, 2), log = "xy", ylab = "Log of y = x^2", xlab = "Log of x",
main = "Log of x^2 versus Log of x")
(f). Create a function, PlotPower(), that allows you to create a plot of x against \(x^a\) for a fixed a and for a range of values of x. For instance, if you call
# PlotPower(1:10, 3)
then a plot should be created with an x-axis taking on values 1, 2, …, 10, and a y-axis taking on values \(1^3\), \(2^3\), …, \(10^3\).
PlotPower <- function(x, a) {
  plot(x, Power3(x, a), xlab = "x", ylab = paste0("x^", a),
       main = paste0("x^", a, " versus x"))
}
PlotPower(1:10, 3)
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore the logistic regression, LDA, naive Bayes and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
library(MASS)
library(class)
library(e1071)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Boston$response <- as.factor(ifelse(Boston$crim > median(Boston$crim),"Above","Below"))
Boston %>% filter(response=="Above") %>% head()
# Set up Training and Test Sets
set.seed(2)
train <- sample(1:nrow(Boston), round((nrow(Boston)/4)*3,0))
Boston.test <- Boston[-train,]
boston.response <- Boston$response[-train]
boston.train <- Boston[train,]
Logistic regression
glm.fits <- glm(response ~ zn + indus+chas+nox+rm+age+dis+rad+tax, family = binomial, data = Boston, subset = train)
summary(glm.fits)
##
## Call:
## glm(formula = response ~ zn + indus + chas + nox + rm + age +
## dis + rad + tax, family = binomial, data = Boston, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.05155 -0.00772 0.00850 0.28098 1.80684
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 24.861723 4.752476 5.231 1.68e-07 ***
## zn 0.061220 0.031442 1.947 0.051524 .
## indus 0.008342 0.047150 0.177 0.859563
## chas -0.125309 0.847579 -0.148 0.882467
## nox -34.478533 6.935623 -4.971 6.65e-07 ***
## rm -0.515107 0.353771 -1.456 0.145381
## age -0.011939 0.009914 -1.204 0.228468
## dis -0.338121 0.202467 -1.670 0.094918 .
## rad -0.594197 0.153996 -3.859 0.000114 ***
## tax 0.005827 0.002710 2.151 0.031508 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 526.53 on 379 degrees of freedom
## Residual deviance: 175.75 on 370 degrees of freedom
## AIC: 195.75
##
## Number of Fisher Scoring iterations: 9
glm.probs <- predict(glm.fits, Boston.test, type = "response")
contrasts(Boston$response)
## Below
## Above 0
## Below 1
glm.pred <- rep("Above", length(glm.probs))
glm.pred[glm.probs > .5] <- "Below"
table(glm.pred,boston.response)
## boston.response
## glm.pred Above Below
## Above 55 3
## Below 13 55
mean(glm.pred==boston.response)
## [1] 0.8730159
acc <- mean(glm.pred==boston.response)
Prediction_Accuracy <- tibble("Model" = "Logistic Regression", "Accuracy" = acc)
acc
## [1] 0.8730159
Prediction_Accuracy
LDA
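# Note: unlike the other models, this LDA is fit and evaluated on the full Boston
# data set (no train/test split), so the accuracy reported below is in-sample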
lda.fit <- lda(response ~ zn + indus+chas+nox+rm+age+dis+rad+tax, data = Boston)
summary(lda.fit)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 18 -none- numeric
## scaling 9 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list
lda.pred <- predict(lda.fit,Boston)
lda.class <- lda.pred$class
table(lda.class,Boston$response)
##
## lda.class Above Below
## Above 194 14
## Below 59 239
acc <- mean(lda.class==Boston$response)
Prediction_Accuracy <- rbind(Prediction_Accuracy, tibble(Model = "LDA", Accuracy = acc))
acc
## [1] 0.8557312
Prediction_Accuracy
Naive Bayes
nb.fit <- naiveBayes(response ~ zn + indus + chas + nox + rm + age + dis + rad + tax, data = Boston, subset = train)
nb.class <- predict(nb.fit, Boston.test)
table(nb.class, boston.response)
## boston.response
## nb.class Above Below
## Above 52 11
## Below 16 47
acc <- mean(nb.class == boston.response)
Prediction_Accuracy <- rbind(Prediction_Accuracy, tibble(Model = "Naive Bayes", Accuracy = acc))
acc
## [1] 0.7857143
Prediction_Accuracy
KNN
# For KNN, build a numeric response and use a simple first-half / second-half split
# (note this is a different, non-random split from the one used above)
crime01 <- rep(0, nrow(Boston))
crime01[Boston$crim > median(Boston$crim)] <- 1
Boston <- data.frame(Boston, crime01)
train <- 1:(nrow(Boston) / 2)
test <- (nrow(Boston) / 2 + 1):nrow(Boston)
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
crime01.test <- crime01[test]
vars <- c("zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax")
train.X <- as.matrix(Boston[train, vars])
test.X <- as.matrix(Boston[test, vars])
train.crime01 <- crime01[train]
set.seed(1)
x <- NULL
y <- NULL
for (i in 1:20) {
  knn.pred <- knn(train.X, test.X, train.crime01, k = i)
  x <- rbind(x, mean(knn.pred != crime01.test))
  y <- rbind(y, paste(i))
}
data.frame("Kvalue" = y, "Error rate" = x) %>% arrange(desc(x))
# KNN(k=10) has the highest accuracy
knn.pred = knn(train.X, test.X, train.crime01, k = 10)
acc <- 1 - mean(knn.pred != crime01.test)
Prediction_Accuracy <- rbind(Prediction_Accuracy, tibble(Model = "KNN", Accuracy = acc))
acc
## [1] 0.8893281
Prediction_Accuracy
Among the models compared, KNN (k = 10) achieves the highest accuracy (about 88.9%), so I would use the KNN model to classify new observations.