Conceptual Exercises

Exercise 5 in section 4.7

We now examine the differences between LDA and QDA.

(a). If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
ANS:
QDA is expected to perform better on the training set, while LDA is expected to perform better on the test set. QDA's extra flexibility lets it overfit noise in the training data that mimics non-linearity; that noise does not recur in the test set, where LDA's linear boundary, which matches the true Bayes boundary, generalizes better.
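
As an illustration, here is a minimal simulation sketch; the data-generating setup is my own assumption, not part of the exercise. With a linear Bayes boundary, QDA typically attains a lower training error but a higher test error than LDA.

library(MASS)
set.seed(1)
n = 200
x = matrix(rnorm(2 * n), ncol = 2)
# Linear Bayes boundary: the class depends on x1 + x2 plus noise (assumption)
y = factor(ifelse(x[,1] + x[,2] + rnorm(n, sd = 0.5) > 0, 1, 0))
dat = data.frame(x1 = x[,1], x2 = x[,2], y = y)
tr = dat[1:100,]; te = dat[101:200,]
lda.fit = lda(y ~ x1 + x2, data = tr)
qda.fit = qda(y ~ x1 + x2, data = tr)
mean(predict(lda.fit, tr)$class != tr$y) # LDA training error
mean(predict(qda.fit, tr)$class != tr$y) # QDA training error (typically lower)
mean(predict(lda.fit, te)$class != te$y) # LDA test error (typically lower)
mean(predict(qda.fit, te)$class != te$y) # QDA test error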

(b). If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
ANS:
QDA is expected to perform better on both the training set and the test set. Because QDA is more flexible, it approximates the non-linear Bayes boundary more accurately. (See ISLR, Chapter 4, Figure 4.9.)

(c). In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
ANS:
Whether the test prediction accuracy of QDA relative to LDA improves depends on the true Bayes decision boundary. If the boundary is linear, LDA performs better, and as n increases LDA will continue to predict more accurately than QDA. If the boundary is non-linear, QDA's test accuracy relative to LDA improves as n increases, because the extra variance from estimating a separate covariance matrix for each class shrinks with more data.

(d). True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.
ANS:
False. Flexibility is not the only criterion for a good model. As argued in (a), with a linear Bayes boundary LDA will achieve a superior (lower) test error rate than QDA: QDA's extra flexibility adds variance without reducing bias, so it tends to overfit.



Exercise 8 in section 4.7

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First, we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?
ANS:
We should prefer logistic regression for classifying new observations. When a 1-NN classifier is evaluated on its own training data, any reasonable distance metric retrieves each training point itself as its nearest neighbor, so each point's own label is used to predict it and the training error is 0%.

Since the reported 18% is the average of the training and test error rates, the test error must be 2 × 18% = 36%, which is worse than the 30% test error of logistic regression. The logistic regression classifier will therefore be better at predicting new data.
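
A minimal sketch on toy data (the data are my own assumption, not from the exercise) confirming that 1-NN attains zero training error:

library(class)
set.seed(1)
X = matrix(rnorm(100 * 2), ncol = 2)  # toy predictors (assumption)
Y = factor(rbinom(100, 1, 0.5))       # toy labels (assumption)
pred = knn(X, X, Y, k = 1)            # evaluate 1-NN on its own training set
mean(pred != Y)                       # 0: each point is its own nearest neighbor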



Exercise 11 in section 4.7

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

library(ISLR)
library(MASS)
library(class)
summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

(a). Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

attach(Auto)
mpg01 = rep(0, length(mpg))      # start with all zeros
mpg01[mpg > median(mpg)] = 1     # 1 if mpg is above its median
mpg01 = as.factor(mpg01)
Auto = data.frame(Auto, mpg01)   # append mpg01 to the Auto data frame

(b). Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

library(dplyr)
library(ggplot2)
library(gridExtra) # provides grid.arrange() for arranging several plots in one figure
glimpse(Auto)
## Rows: 392
## Columns: 10
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14,…
## $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4…
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, …
## $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, …
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 3…
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5…
## $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70,…
## $ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3…
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth …
## $ mpg01        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1…
# I delete "mpg", although there is strong correlation between mpg and mpg01, it is meaningless.
# I delete "name", because the number of its unique value is 301, I don't think name has influence in mpg01.

# 1. Cylinders
g1 = ggplot(Auto, aes(x = mpg01, y = cylinders, col = mpg01)) + 
  geom_jitter() + ggtitle("Cylinders vs mpg01 - Jitter Plot")
# 2. displacement
g2 = ggplot(Auto, aes(x = mpg01, y = displacement, fill = mpg01)) + 
  geom_boxplot() + ggtitle("Displacement vs mpg01 - Boxplot")
# 3. horsepower
g3 = ggplot(Auto, aes(x = mpg01, y = horsepower, fill = mpg01)) + 
  geom_boxplot() + ggtitle("Horsepower vs mpg01 - Boxplot")
# 4. weight
g4 = ggplot(Auto, aes(x = mpg01,y = weight, fill = mpg01)) + 
  geom_boxplot() + ggtitle("Weight vs mpg01 - Boxplot")
# 5. acceleration
g5 = ggplot(Auto, aes(x = mpg01,y = acceleration , fill = mpg01)) + 
  geom_boxplot() + ggtitle("Acceleration vs mpg01 - Boxplot")
# 6. year
g6 = ggplot(Auto, aes(x = mpg01, y = year, fill = mpg01)) + 
  geom_boxplot() + ggtitle("Year vs mpg01 - Boxplot")
# 7. origin
g7 = ggplot(Auto, aes(x = origin, fill = mpg01)) + 
  geom_bar(position = 'fill') + ggtitle("Origin vs mpg01 - Bar")
  
grid.arrange(g1,g2,g3,g4, nrow=2, ncol=2)

grid.arrange(g5,g6,g7, nrow=2, ncol=2)

Features useful for predicting mpg01: cylinders, displacement, horsepower, weight, and origin. These variables show clear separation between the two mpg01 classes in the plots above, whereas acceleration and year separate the classes much less.

(c). Split the data into a training set and a test set.

# Set 70% training and 30% test data
set.seed(1)
a = seq(1, nrow(Auto), by = 1)
i = sample(a, floor(nrow(Auto) * 0.7), replace = FALSE)

# LDA & QDA: keep the full data frame
train = Auto[i,]
test = Auto[-i,]

# KNN: keep only the predictors chosen in (b)
# (drop mpg, acceleration, year, name, and mpg01)
train.X = Auto[i, -c(1,6,7,9,10)]
test.X = Auto[-i, -c(1,6,7,9,10)]

train.mpg01 = mpg01[i]
test.mpg01 = mpg01[-i]

(d). Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

library(MASS)
lda.fit = lda(mpg01~cylinders+weight+displacement+horsepower+origin, data = train)
lda.pred = predict(lda.fit,test)
lda.class = lda.pred$class
table(lda.class,test.mpg01)
##          test.mpg01
## lda.class  0  1
##         0 50  3
##         1 11 54
mean(lda.class!= test.mpg01) # test error of LDA: 11.9%
## [1] 0.1186441
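
As a side note, not required by the exercise: sensitivity and specificity can be read directly off the confusion matrix above, treating class 1 (above-median mpg) as the positive class.

tab = table(lda.class, test.mpg01)
tab[2,2] / sum(tab[,2])  # sensitivity: 54/57 ≈ 0.947
tab[1,1] / sum(tab[,1])  # specificity: 50/61 ≈ 0.820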

(e). Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

qda.fit = qda(mpg01~cylinders+weight+displacement+horsepower+origin, data = train) 
qda.class = predict(qda.fit,test)$class
table(qda.class,test.mpg01)
##          test.mpg01
## qda.class  0  1
##         0 52  5
##         1  9 52
mean(qda.class != test.mpg01) # test error of QDA: 11.9%
## [1] 0.1186441
# class 1: miles per gallon above the median

(f). Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

glm.fit = glm(mpg01~cylinders+weight+displacement+horsepower+origin, data = train, family = binomial)
glm.probs = predict(glm.fit, test, type = "response")
glm.pred = rep(0, nrow(test))   # default prediction: class 0
glm.pred[glm.probs > .5] = 1    # predict class 1 when P(mpg01 = 1) > 0.5
table(glm.pred,test.mpg01)
##         test.mpg01
## glm.pred  0  1
##        0 53  3
##        1  8 54
mean(glm.pred != test.mpg01) # test error of logistic regression: 9.3%
## [1] 0.09322034

(g). Perform KNN on the training data, with several values of K, in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

# k = 1
library(class)
knn.pred = knn(train.X,test.X,train.mpg01,k=1)
mean(test.mpg01 != knn.pred) # test error of model: 13.6%
## [1] 0.1355932
# k = 3
knn.pred = knn(train.X,test.X,train.mpg01,k=3)
mean(test.mpg01 != knn.pred) # test error of model: 11%
## [1] 0.1101695
# k = 5
knn.pred = knn(train.X,test.X,train.mpg01,k=5)
mean(test.mpg01 != knn.pred) # test error of model: 12.7%
## [1] 0.1271186

With k = 3, the KNN model performs best on this data set (test error 11.0%).
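
One optional refinement, not part of the original answer: KNN is based on Euclidean distance, so predictors on large scales (such as weight) dominate the distance calculation. Standardizing the predictors first is a common fix; a sketch:

train.X.s = scale(train.X)
test.X.s = scale(test.X, center = attr(train.X.s, "scaled:center"),
                 scale = attr(train.X.s, "scaled:scale"))
knn.pred = knn(train.X.s, test.X.s, train.mpg01, k = 3)
mean(test.mpg01 != knn.pred)  # may differ from the unscaled 11% above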



Exercise 12 in section 4.7

This problem involves writing functions.

(a). Write a function, Power(), that prints out the result of raising 2 to the 3rd power. In other words, your function should compute \(2^3\) and print out the results. Hint: Recall that \(x^a\) raises x to the power a. Use the print() function to output the result.

Power = function() { print(2^3) }  # the hint asks for print()
Power()
## [1] 8

(b). Create a new function, Power2(), that allows you to pass any two numbers, x and a, and prints out the value of \(x^a\). You can do this by beginning your function with the line

Power2 = function(x,a) { x^a }

You should be able to call your function by entering, for instance,

Power2(3,8)
## [1] 6561

on the command line. This should output the value of \(3^8\), namely, 6561.

(c). Using the Power2() function that you just wrote, compute \(10^3\), \(8^{17}\), and \(131^3\).

Power2(10,3)
## [1] 1000
Power2(8,17)
## [1] 2.2518e+15
Power2(131,3)
## [1] 2248091

(d). Now create a new function, Power3(), that actually returns the result “\(x^a\)” as an R object, rather than simply printing it to the screen. That is, if you store the value “\(x^a\)” in an object called result within your function, then you can simply return() this result.

Power3 = function(x,a){
  result = x^a
  return(result)
}

The line return(result) should be the last line in your function, before the } symbol.

(e). Now using the Power3() function, create a plot of \(f(x)=x^2\). The x-axis should display a range of integers from 1 to 10, and the y-axis should display \(x^2\). Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the x-axis, the y-axis, or both on the log scale. You can do this by using log="x", log="y", or log="xy" as arguments to the plot() function.

x = c(1:10)
y = Power3(x, 2)
# With log = 'xy' the axes display the original values on log-scaled axes
plot(x, y, log = 'xy',
     ylab = "y = x^2 (log scale)",
     xlab = "x (log scale)",
     main = "x^2 vs x on log-log axes",
     col = 'blue')

(f). Create a function, PlotPower(), that allows you to create a plot of x against \(x^a\) for a fixed a and for a range of values of x.

PlotPower = function(x, a){
  plot(x, Power2(x, a),
       xlab = "x", ylab = paste0("x^", a),
       main = paste0("x^", a, " vs x"))
}

For instance, if you call

PlotPower (1:10,3)

then a plot should be created with an x-axis taking on values 1, 2,…, 10, and a y-axis taking on values \(1^3\), \(2^3\),…, \(10^3\).



Exercise 13 in section 4.7

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore the logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

library(MASS)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
attach(Boston)

crim01 = rep(0, length(crim))       # start with all zeros
crim01[crim > median(crim)] = 1     # 1 if the crime rate is above its median
Boston = data.frame(Boston, crim01) # append crim01 to the Boston data frame

# Set 50% training and 50% test data
set.seed(1)
b = seq(1, nrow(Boston), by = 1)
j = sample(b, floor(nrow(Boston) * 0.5), replace = FALSE)
# Logistic regression & LDA: drop crim (column 1), keep crim01 as the response
train = Boston[j, -1]
test = Boston[-j, -1]
# KNN: drop crim (column 1) and the response crim01 (column 15)
train.X = Boston[j, -c(1,15)]
test.X = Boston[-j, -c(1,15)]

train.Y = crim01[j]
test.Y = crim01[-j]

Logistic regression

glm.fit = glm(crim01~., data = train, family = binomial)
glm.probs = predict(glm.fit, test, type = "response")
glm.pred = rep(0,nrow(test))
glm.pred[glm.probs>0.5] = 1
# test error of logistic regression model: 9.9%
mean(glm.pred != test.Y)
## [1] 0.09881423

LDA

lda.fit = lda(crim01~.,data = train)
lda.pred = predict(lda.fit, test)
# test error of LDA: 16.2%
mean(lda.pred$class != test.Y)
## [1] 0.1620553

KNN

# k = 1, test error of KNN: 9.4% 
knn.pred = knn(train.X,test.X,train.Y, k=1)
mean(knn.pred != test.Y)
## [1] 0.09486166
# k = 3, test error of KNN: 10.3% 
knn.pred = knn(train.X,test.X,train.Y, k=3)
mean(knn.pred != test.Y)
## [1] 0.1027668
# k = 5, test error of KNN: 13.4% 
knn.pred = knn(train.X,test.X,train.Y, k=5)
mean(knn.pred != test.Y)
## [1] 0.1343874

Findings: using all predictors except crim, KNN with k = 1 achieves the lowest test error (9.5%), closely followed by logistic regression (9.9%); LDA performs noticeably worse (16.2%), which suggests the decision boundary for above-median crime is non-linear. Increasing k hurts KNN here (10.3% at k = 3, 13.4% at k = 5).
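
The exercise also asks for various subsets of the predictors. As one example, the logistic regression can be refit with a smaller subset; the subset below is my own illustrative choice, not from the original answer:

glm.fit2 = glm(crim01 ~ nox + rad + dis + tax + age, data = train, family = binomial)
glm.probs2 = predict(glm.fit2, test, type = "response")
glm.pred2 = rep(0, nrow(test))
glm.pred2[glm.probs2 > 0.5] = 1
mean(glm.pred2 != test.Y)  # compare with the 9.9% full-model error above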