DeclareUnicodeCharacter{7A0B}{}
We now examine the differences between LDA and QDA.
(a). If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
QDA will perform better on the training set because its increased flexibility will result in a closer fit. However, LDA will have better performance on test set, and QDA may cause overfitting problem.
(b). If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
QDA will have better performance in accuracy both for train and test data.
(c). In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?
We expect the test prediction accuracy of QDA relative to LDA improve. As sample size n enlarge, variance may be offset and more flexible method can yield better fit.
(d). True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.
False, because here we focus on the prediction performance on test data. When the Bayes decision boundary is linear, although QDA may fit better due to the potential over-fitting problem, LDA will have better performance on the test set.
Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?
Logistic regression. The average error rate of logistic regression and 1-nearest neighbors are 25% and 18% respectively. It seems that the 1-nearest neighbors has the higher accuracy, but this method potentially exists over-fitting problem which may lead to very poor prediction.
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
install.packages("ISLR2",repos = "http://cran.us.r-project.org")
## 程序包'ISLR2'打开成功,MD5和检查也通过
##
## 下载的二进制程序包在
## C:\Users\LBder\AppData\Local\Temp\RtmpKiUbhn\downloaded_packages里
install.packages("MASS",repos = "http://cran.us.r-project.org")
## 程序包'MASS'打开成功,MD5和检查也通过
##
## 下载的二进制程序包在
## C:\Users\LBder\AppData\Local\Temp\RtmpKiUbhn\downloaded_packages里
install.packages("class",repos = "http://cran.us.r-project.org")
## 程序包'class'打开成功,MD5和检查也通过
##
## 下载的二进制程序包在
## C:\Users\LBder\AppData\Local\Temp\RtmpKiUbhn\downloaded_packages里
library(ISLR2)
library(MASS)
library(class)
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
(a). Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
m<-mean(Auto$mpg)
mpg01<-ifelse(Auto$mpg>m,yes=1,no=0)
Auto<-data.frame(Auto,mpg01)
(b). Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
cor(Auto[,-9])#delete the categorical column "name"
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8408220 -0.7340366 -0.7355848 -0.6494910 -0.7445033
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8408220
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7340366
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7355848
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6494910
## weight -0.4168392 -0.3091199 -0.5850054 -0.7445033
## acceleration 1.0000000 0.2903161 0.2127458 0.3209701
## year 0.2903161 1.0000000 0.1815277 0.4510104
## origin 0.2127458 0.1815277 1.0000000 0.5192282
## mpg01 0.3209701 0.4510104 0.5192282 1.0000000
pairs(Auto)
par(mfrow=c(3,3))
boxplot(mpg~mpg01,data=Auto)
boxplot(cylinders~mpg01,data=Auto)
boxplot(displacement~mpg01,data=Auto)
boxplot(horsepower~mpg01,data=Auto)
boxplot(weight~mpg01,data=Auto)
boxplot(acceleration~mpg01,data=Auto)
boxplot(year~mpg01,data=Auto)
boxplot(origin~mpg01,data=Auto)
mpg, cylinders, wight, displacement, and horsepower seem most likely to
be useful in predicting mpg01.
(c). Split the data into a training set and a test set.
set.seed(0)
tn=0.3
sub<-sample(1:nrow(Auto),round(nrow(Auto)*tn),rep=FALSE)
data_train<-Auto[-sub,]
data_test<-Auto[sub,]
dim(data_train)
## [1] 274 10
dim(data_test)
## [1] 118 10
(d). Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
LDA<-lda(mpg01~cylinders+weight+displacement+horsepower,data=data_train)
LDA_pred<-predict(LDA,data_test)
mean(LDA_pred$class!=data_test$mpg01) # mean of a logical variable will be the rate of TRUE
## [1] 0.1440678
table(LDA_pred$class,data_test$mpg01)
##
## 0 1
## 0 51 5
## 1 12 50
The test error is about 0.1441
(e). Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
QDA<-qda(mpg01~cylinders+weight+displacement+horsepower,data=data_train)
QDA_pred<-predict(QDA,data_test)
mean(QDA_pred$class!=data_test$mpg01)
## [1] 0.1355932
table(QDA_pred$class,data_test$mpg01)
##
## 0 1
## 0 54 7
## 1 9 48
The test error is about 0.1356
(f). Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
LR<-glm(mpg01~cylinders+weight+displacement+horsepower,data=data_train,family = binomial)
LR_prob<-predict(LR,data_test,type="response")
LR_pred<-rep(0,length(LR_prob))
LR_pred[LR_prob>0.5]<-1
mean(LR_pred!=data_test$mpg01)
## [1] 0.1186441
The test error is 0.1186
(g). Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
install.packages("e1071",repos = "http://cran.us.r-project.org")
## 程序包'e1071'打开成功,MD5和检查也通过
##
## 下载的二进制程序包在
## C:\Users\LBder\AppData\Local\Temp\RtmpKiUbhn\downloaded_packages里
library(e1071)
NB<-naiveBayes(mpg01~cylinders+weight+displacement+horsepower,data=data_train)
NB_pred<-predict(NB,data_test)
mean(NB_pred!=data_test$mpg01)
## [1] 0.1271186
The test error is 0.1271
(h). Perform KNN on the training data, with several values of K, in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
knn_pred1 <-knn(data_train[,-9], data_test[,-9], data_train$mpg01, k = 1)
mean(knn_pred1!=data_test$mpg01)
## [1] 0.1525424
knn_pred2 <-knn(data_train[,-9], data_test[,-9], data_train$mpg01, k = 2)
mean(knn_pred2!=data_test$mpg01)
## [1] 0.1525424
knn_pred3 <-knn(data_train[,-9], data_test[,-9], data_train$mpg01, k = 3)
mean(knn_pred3!=data_test$mpg01)
## [1] 0.1271186
knn_pred4 <-knn(data_train[,-9], data_test[,-9], data_train$mpg01, k = 5)
mean(knn_pred4!=data_test$mpg01)
## [1] 0.1355932
The test errors are 0.1525, 0.1525, 0.1271, 0.1356 respectively when K is 1,2,3,5. The value of K that seems to perform the best on this data set is 3.
This problem involves writing functions.
(a). Write a function, Power(), that prints out the result of raising 2 to the 3rd power. In other words, your function should compute \(2^3\) and print out the results.
Hint: Recall that x^a raises x to the power a. Use the print() function to output the result.
Power<- function(x=2, a=3) print(x^a)
Power()
## [1] 8
(b). Create a new function, Power2(), that allows you to pass any two numbers, x and a, and prints out the value of x^a. You can do this by beginning your function with the line
Power2 <- function(x, a) print(x^a)
You should be able to call your function by entering, for instance,
Power2(3, 8)
## [1] 6561
on the command line. This should output the value of \(3^8\), namely, 6561.
(c). Using the Power2() function that you just wrote, compute \(10^3\), \(8^{17}\) , and \(131^3\).
Power(10,3)
## [1] 1000
Power(8,17)
## [1] 2.2518e+15
Power(131,3)
## [1] 2248091
(d). Now create a new function, Power3(), that actually returns the result x^a as an R object, rather than simply printing it to the screen. That is, if you store the value x^a in an object called result within your function, then you can simply return() this result, using the following line:
Power3 <- function(x, a) {
result = x^a
return (result)
}
The line above should be the last line in your function, before the } symbol.
(e). Now using the Power3() function, create a plot of \(f(x)=x^2\). The x-axis should display a range of integers from 1 to 10, and the y-axis should display \(x^2\). Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the x-axis, the y-axis, or both on the log-scale. You can do this by using log = “x”, log = “y”, or log = “xy” as arguments to the plot() function.
x=seq(1,10)
y=Power3(x,2)
plot(x,y,type="l",ylab=expression(x^2),main=expression('plot of f(x)='*x^2*''))
(f). Create a function, PlotPower(), that allows you to create a plot of x against \(x^a\) for a fixed a and for a range of values of x. For instance, if you call
PlotPower<-function(c,a){
x=rep(c);
y=x^a;
plot(x,y,type="l",ylab=expression(x^a),main=expression('plot of f(x)='*x^{a}*''));
axis(1,seq(1,10))} # define the interval of x_axis manually
then a plot should be created with an x-axis taking on values 1, 2, …, 10, and a y-axis taking on values \(1^3\), \(2^3\), …, \(10^3\).
PlotPower(1:10, 3)
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore the logistic regression, LDA, naive Bayes and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set.
library(MASS)
library(tidyverse)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
attach(Boston)
Boston<-Boston %>%
mutate(chas=factor(chas),crime_factor=factor(ifelse(crim>median(crim),"H","L"),levels=c("H","L")))
Logistic regression
set.seed(6)
subb<-sample(1:nrow(Boston),round(0.3*nrow(Boston)))
train<-Boston[-subb,]
test<-Boston[subb,]
LR<-glm(crime_factor~.,data=train,family = binomial())
pred<-predict(LR,test,type="response")
LR_pred<-ifelse(pred>=0.5,"L","H")
ac<-mean(LR_pred==test$crime_factor)
ac
## [1] 0.9802632
LDA
LDA<-lda(crime_factor~.,data=train)
LDA_pred<-predict(LDA,test)
mean(LDA_pred$class==test$crime_factor)
## [1] 0.8355263
Naive Bayes
#install.packages("e1071")
library(e1071)
NB<-naiveBayes(crime_factor~.,data=train)
NB_pred<-predict(NB,test)
mean(NB_pred==test$crime_factor)
## [1] 0.9539474
KNN
library(class)
vs<-c("zn","indus","nox","rm","age","dis","rad","tax","ptratio","black","lstat","medv")
acc<-list()
for (i in 1:20) {
knn_pred <- knn(train = train[,vs], test
= test[,vs], cl=
train$crime_factor,
k = i)
acc[as.character(i)] = mean(knn_pred == test$crime_factor)
}
acc
## $`1`
## [1] 0.9144737
##
## $`2`
## [1] 0.9078947
##
## $`3`
## [1] 0.9144737
##
## $`4`
## [1] 0.9078947
##
## $`5`
## [1] 0.8947368
##
## $`6`
## [1] 0.8684211
##
## $`7`
## [1] 0.8684211
##
## $`8`
## [1] 0.8684211
##
## $`9`
## [1] 0.8421053
##
## $`10`
## [1] 0.8552632
##
## $`11`
## [1] 0.8289474
##
## $`12`
## [1] 0.8289474
##
## $`13`
## [1] 0.8223684
##
## $`14`
## [1] 0.8289474
##
## $`15`
## [1] 0.8223684
##
## $`16`
## [1] 0.8289474
##
## $`17`
## [1] 0.8355263
##
## $`18`
## [1] 0.8355263
##
## $`19`
## [1] 0.8355263
##
## $`20`
## [1] 0.8355263
It seems that the best model for this particular problem at this time is a logistic regression model.