(a).Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.
Answer: We can predict the probability using the logistic function,
\(\hat{p}(X)=\dfrac{e^{\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2}}{1+e^{\hat\beta_0+\hat\beta_1X_1+\hat\beta_2X_2}}\) ……(A)
We have \(\hat\beta_0=-6\), \(\hat\beta_1=0.05\), \(\hat\beta_2=1\), \(X_1=40\), and \(X_2=3.5\). Substituting these values into (A),
\(\hat{p}(X)=\dfrac{e^{-6+0.05\times40+1\times3.5}}{1+e^{-6+0.05\times40+1\times3.5}}=\dfrac{e^{-0.5}}{1+e^{-0.5}}=0.3775\)
The probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class is therefore 0.3775.
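A quick numerical check in R, using plogis, the logistic CDF \(e^z/(1+e^z)\) in base R:
beta <- c(-6, 0.05, 1)   # (beta0_hat, beta1_hat, beta2_hat)
x <- c(1, 40, 3.5)       # intercept term, hours studied, undergrad GPA
plogis(sum(beta * x))    # 0.3775407, matching the hand computation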
(b).How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?
We need \(X_1\) such that
\(0.5=\dfrac{e^{-6+0.05X_1+3.5}}{1+e^{-6+0.05X_1+3.5}}\)
Multiplying both sides by the denominator and simplifying gives \(e^{0.05X_1-2.5}=1\). Taking the log of both sides, and using \(\log(1)=0\),
\(0.05X_1-2.5=0\), so \(X_1=2.5/0.05=50\)
Therefore, the student in part (a) would need to study 50 hours to have a 50% chance of getting an A in the class.
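The same algebra can be checked numerically; qlogis is the logit (log-odds) function in base R, with \(\mathrm{qlogis}(0.5)=0\):
# hours needed so that the log-odds reach qlogis(0.5) = 0
(qlogis(0.5) - (-6 + 1 * 3.5)) / 0.05   # = 50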
First consider logistic regression: it has an error rate of 20% on the training data and 30% on the test data, so its average error rate (averaged over the training and test sets) is \((20+30)/2=25\%\), and its test error rate is 30%.
For KNN with \(K=1\), the training error rate is 0%: each training observation is its own nearest neighbor, so \(\hat{P}(Y=j\mid X=x_i)=I(y_i=j)\), which equals 1 if \(y_i=j\) and 0 otherwise, and no training observation is misclassified. Since the average error rate is 18% and the training error rate is 0%, the test error rate for KNN must be 36%, which is greater than the 30% test error rate of logistic regression.
So, in this case, it is better to select logistic regression because of its lower test error rate.
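Explicitly, solving the average-error identity for the test error:
\[\text{test error}=2\times\text{average error}-\text{training error}=2\times18\%-0\%=36\%\]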
Answer 11:
(a).Create a binary variable, “mpg01”, that contains a 1 if “mpg” contains a value above its median, and a 0 if “mpg” contains a value below its median.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.3.2
data("Auto")
mpg01 <- rep(0, length(Auto$mpg))        # start with all zeros
mpg01[Auto$mpg > median(Auto$mpg)] <- 1  # 1 when mpg is above its median
Auto <- data.frame(Auto, mpg01)
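Equivalently, the indicator could have been built in one line (a minor stylistic alternative, same result):
mpg01 <- as.numeric(Auto$mpg > median(Auto$mpg))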
summary(Auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
##
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name mpg01
## amc matador : 5 Min. :0.0
## ford pinto : 5 1st Qu.:0.0
## toyota corolla : 5 Median :0.5
## amc gremlin : 4 Mean :0.5
## amc hornet : 4 3rd Qu.:1.0
## chevrolet chevette: 4 Max. :1.0
## (Other) :365
(b).Explore the data graphically in order to investigate the association between “mpg01” and the other features. Which of the other features seem most likely to be useful in predicting “mpg01”? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
### Correlation
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.2
corrplot::corrplot.mixed(cor(Auto[, -9]), upper="circle")
pairs(Auto[, -9])
par(mfrow=c(2,3))
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "Horsepower vs mpg01")
boxplot(weight ~ mpg01, data = Auto, main = "Weight vs mpg01")
boxplot(acceleration ~ mpg01, data = Auto, main = "Acceleration vs mpg01")
boxplot(year ~ mpg01, data = Auto, main = "Year vs mpg01")
The boxplots indicate a clear association between “mpg01” and “cylinders”, “weight”, “displacement”, and “horsepower”.
(c).Split the data into a training set and a test set.
set.seed(123)
train <- sample(1:dim(Auto)[1], dim(Auto)[1]*.7, rep=FALSE)
test <- -train
training_data<- Auto[train, ]
testing_data <- Auto[test, ]
mpg01.test <- mpg01[test]
(d).Perform LDA on the training data in order to predict “mpg01” using the variables that seemed most associated with “mpg01” in (b). What is the test error of the model obtained?
library(MASS)
## Warning: package 'MASS' was built under R version 3.3.2
lda_model <- lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
lda_model
## Call:
## lda(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data)
##
## Prior probabilities of groups:
## 0 1
## 0.4817518 0.5182482
##
## Group means:
## cylinders weight displacement horsepower
## 0 6.795455 3659.008 274.3333 130.88636
## 1 4.218310 2343.599 117.0528 79.71127
##
## Coefficients of linear discriminants:
## LD1
## cylinders -0.4483062756
## weight -0.0012201696
## displacement 0.0001078142
## horsepower 0.0046253024
lda_pred = predict(lda_model, testing_data)
names(lda_pred)
## [1] "class" "posterior" "x"
## compute the confusion matrix
pred.lda <- predict(lda_model, testing_data)
table(pred.lda$class, mpg01.test)
## mpg01.test
## 0 1
## 0 53 3
## 1 11 51
mean(pred.lda$class != mpg01.test)
## [1] 0.1186441
We conclude that the LDA model has a test error rate of 11.86%.
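The error rate can be read off the confusion matrix as the off-diagonal share of the 118 test observations:
# misclassified test cases / total test cases
(3 + 11) / (53 + 3 + 11 + 51)   # = 0.1186441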
(e).Perform QDA on the training data in order to predict “mpg01” using the variables that seemed most associated with “mpg01” in (b). What is the test error of the model obtained?
qda_model = qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data=training_data)
qda_model
## Call:
## qda(mpg01 ~ cylinders + horsepower + weight + acceleration, data = training_data)
##
## Prior probabilities of groups:
## 0 1
## 0.4817518 0.5182482
##
## Group means:
## cylinders horsepower weight acceleration
## 0 6.795455 130.88636 3659.008 14.65303
## 1 4.218310 79.71127 2343.599 16.46620
# compute the confusion matrix
qda.class=predict(qda_model, testing_data)$class
table(qda.class, testing_data$mpg01)
##
## qda.class 0 1
## 0 56 3
## 1 8 51
mean(qda.class != testing_data$mpg01)
## [1] 0.09322034
We may conclude that the QDA model has a test error rate of 9.32% ((3 + 8)/118 = 0.0932).
(f).Perform logistic regression on the training data in order to predict “mpg01” using the variables that seemed most associated with “mpg01” in (b). What is the test error of the model obtained?
glm_model <- glm(mpg01 ~ cylinders + weight + displacement + horsepower, data = training_data, family = binomial)
summary(glm_model)
##
## Call:
## glm(formula = mpg01 ~ cylinders + weight + displacement + horsepower,
## family = binomial, data = training_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4870 -0.1763 0.1231 0.3633 3.2024
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.6661375 1.9885822 5.867 4.45e-09 ***
## cylinders 0.0365940 0.3817545 0.096 0.92363
## weight -0.0023706 0.0008319 -2.850 0.00438 **
## displacement -0.0106969 0.0096956 -1.103 0.26991
## horsepower -0.0327332 0.0154688 -2.116 0.03434 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 379.48 on 273 degrees of freedom
## Residual deviance: 147.50 on 269 degrees of freedom
## AIC: 157.5
##
## Number of Fisher Scoring iterations: 7
probs <- predict(glm_model, testing_data, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, mpg01.test)
## mpg01.test
## pred.glm 0 1
## 0 54 3
## 1 10 51
mean(pred.glm != mpg01.test)
## [1] 0.1101695
The test error rate for logistic regression is 11.02%.
(g).Perform KNN on the training data, with several values of K, in order to predict “mpg01”. Use only the variables that seemed most associated with “mpg01” in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
str(Auto)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ mpg01 : num 0 0 0 0 0 0 0 0 0 0 ...
data = scale(Auto[,-c(9,10)])  # standardize the numeric columns (drop the name factor and mpg01)
set.seed(1234)
train <- sample(1:dim(Auto)[1], 392*.7, rep=FALSE)
test <- -train
training_data = data[train,c("cylinders","horsepower","weight","acceleration")]
testing_data = data[test, c("cylinders", "horsepower","weight","acceleration")]
## for KNN, keep the training response variable separately
train.mpg01 = Auto$mpg01[train]
## we also need the test response separately for assessing the model later on
test.mpg01= Auto$mpg01[test]
library(class)
set.seed(1234)
knn_pred_y = knn(training_data, testing_data, train.mpg01, k = 1)
table(knn_pred_y, test.mpg01)
## test.mpg01
## knn_pred_y 0 1
## 0 50 4
## 1 6 58
mean(knn_pred_y != test.mpg01)
## [1] 0.08474576
The test error rate is 8.47% for K=1.
knn_pred_y = NULL
error_rate = NULL
for(i in 1:dim(testing_data)[1]){
  set.seed(1234)
  knn_pred_y = knn(training_data, testing_data, train.mpg01, k = i)
  error_rate[i] = mean(test.mpg01 != knn_pred_y)
}
### find the minimum error rate
min_error_rate = min(error_rate)
print(min_error_rate)
## [1] 0.06779661
### get the index of that error rate, which is the k
K = which(error_rate == min_error_rate)
print(K)
## [1] 4
When we train a KNN model with K=4, we get the lowest test error rate, 6.78%.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
qplot(1:dim(testing_data)[1], error_rate, xlab = "K",
ylab = "Error Rate",
geom=c("point", "line"))
Answer 13: Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors.
library(MASS)
data("Boston")
crim01 <- rep(0, length(Boston$crim))           # start with all zeros
crim01[Boston$crim > median(Boston$crim)] <- 1  # 1 when crim is above its median
Boston <- data.frame(Boston, crim01)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv crim01
## Min. : 1.73 Min. : 5.00 Min. :0.0
## 1st Qu.: 6.95 1st Qu.:17.02 1st Qu.:0.0
## Median :11.36 Median :21.20 Median :0.5
## Mean :12.65 Mean :22.53 Mean :0.5
## 3rd Qu.:16.95 3rd Qu.:25.00 3rd Qu.:1.0
## Max. :37.97 Max. :50.00 Max. :1.0
set.seed(1234)
train <- sample(1:dim(Boston)[1], dim(Boston)[1]*.7, rep=FALSE)
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
crim01.test <- crim01[test]
fit.glm1 <- glm(crim01 ~ . - crim01 - crim, data = Boston, family = binomial)  # fit on the full data set to inspect the coefficients
fit.glm1
##
## Call: glm(formula = crim01 ~ . - crim01 - crim, family = binomial,
## data = Boston)
##
## Coefficients:
## (Intercept) zn indus chas nox
## -34.103704 -0.079918 -0.059389 0.785327 48.523782
## rm age dis rad tax
## -0.425596 0.022172 0.691400 0.656465 -0.006412
## ptratio black lstat medv
## 0.368716 -0.013524 0.043862 0.167130
##
## Degrees of Freedom: 505 Total (i.e. Null); 492 Residual
## Null Deviance: 701.5
## Residual Deviance: 211.9 AIC: 239.9
library(corrplot)
corrplot::corrplot.mixed(cor(Boston[, -1]), upper="circle")
fit.glm <- glm(crim01 ~ nox + indus + age + rad, data = Boston, family = binomial)  # note: fit on the full Boston data, so the test rows also enter the fit
probs <- predict(fit.glm, Boston.test, type = "response")
pred.glm <- rep(0, length(probs))
pred.glm[probs > 0.5] <- 1
table(pred.glm, crim01.test)
## crim01.test
## pred.glm 0 1
## 0 65 15
## 1 4 68
mean(pred.glm != crim01.test)
## [1] 0.125
For logistic regression, we have a test error rate of 12.5% ((15 + 4)/152 = 0.125).
fit.lda <- lda(crim01 ~ nox + indus + age + rad, data = Boston)  # also fit on the full data set
pred.lda <- predict(fit.lda, Boston.test)
table(pred.lda$class, crim01.test)
## crim01.test
## 0 1
## 0 66 20
## 1 3 63
mean(pred.lda$class != crim01.test)
## [1] 0.1513158
For the LDA model, we have a test error rate of 15.13%.
data = scale(Boston[,-c(1,15)])  # standardize the predictors (drop crim and crim01)
set.seed(1234)
train <- sample(1:dim(Boston)[1], dim(Boston)[1]*.7, rep=FALSE)
test <- -train
training_data = data[train, c("nox" , "indus" , "age" , "rad")]
testing_data = data[test, c("nox" , "indus" , "age" , "rad")]
## for KNN, keep the training response variable separately
train.crime01 = Boston$crim01[train]
## we also need the test response separately for assessing the model later on
test.crime01= Boston$crim01[test]
library(class)
set.seed(1234)
knn_pred_y = knn(training_data, testing_data, train.crime01, k = 1)
table(knn_pred_y, test.crime01)
## test.crime01
## knn_pred_y 0 1
## 0 62 7
## 1 7 76
mean(knn_pred_y != test.crime01)
## [1] 0.09210526
For KNN with K=1, we have a test error rate of 9.21%.
knn_pred_y = NULL
error_rate = NULL
for(i in 1:dim(testing_data)[1]){
  set.seed(1234)
  knn_pred_y = knn(training_data, testing_data, train.crime01, k = i)
  error_rate[i] = mean(test.crime01 != knn_pred_y)
}
### find the minimum error rate
min_error_rate = min(error_rate)
print(min_error_rate)
## [1] 0.06578947
### get the index of that error rate, which is the k
K = which(error_rate == min_error_rate)
print(K)
## [1] 3
The minimum test error rate is 6.58%, at K=3.
library(ggplot2)
qplot(1:dim(testing_data)[1], error_rate, xlab = "K",
ylab = "Error Rate",
geom=c("point", "line"))