Chapter 04 (page 168): 10, 11, 13
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
library(MASS)
library(class)
data(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
cor(Weekly[ , -9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
pairs(Weekly)
By observing the above output, we can see that Volume and Year are clearly positively correlated. From 1990 to 2010, there appears to be an exponential increase in the volume of trades over time.
Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?log.reg <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume , data = Weekly , family = "binomial")
summary(log.reg)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = "binomial", data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The above output shows that Lag2 is statistically significant as it returned a p-value of 0.0296, which is below our significance level of 0.05.
weekly.prob <- predict(log.reg, Weekly, type = 'response')
weekly.preds <- rep("Down", length(weekly.prob) )
weekly.preds[weekly.prob > 0.5] = "Up"
table(weekly.preds, Weekly$Direction)
##
## weekly.preds Down Up
## Down 54 48
## Up 430 557
The above output shows that the percentage of correct predictions is 56.11% \[((54+557)/1089) = 0.56106\].
The model was highly accurate at predicting Up trends with a score of 92.07% \[(557/(48+557)) = 0.92066\].
However, the model was very inaccurate at predicting Down with a score of 11.16% \[(54/(430+54)) = 0.11157\].
weekly.train <- (Weekly$Year < 2009)
weekly.latest <- Weekly[!weekly.train, ]
weekly.glm <- glm(Direction ~ Lag2, data = Weekly, family = 'binomial', subset = weekly.train)
weekly.prob <- predict(weekly.glm, weekly.latest, type = "response")
weekly.preds <- rep("Down", length(weekly.prob))
weekly.preds[weekly.prob > 0.5] = "Up"
weekly.direction <- Weekly$Direction[!weekly.train]
table(weekly.preds, weekly.direction)
## weekly.direction
## weekly.preds Down Up
## Down 9 5
## Up 34 56
The above output shows that the percentage of correct predictions is 62.50% \[((9+56)/104) = 0.625\].
The model was highly accurate at predicting Up trends with a score of 91.80% \[(56/(56+5)) = 0.9180\].
However, the model was relatively inaccurate at predicting Down with a score of 20.93% \[(9/(9+34)) = 0.2093\].
weekly.lda <- lda(Direction ~ Lag2, data = Weekly, family = 'binomial', subset = weekly.train)
weekly.lda.preds <- predict(weekly.lda, weekly.latest)
table(weekly.lda.preds$class, weekly.direction)
## weekly.direction
## Down Up
## Down 9 5
## Up 34 56
The above output shows that using LDA resulted in identical calculations.
weekly.qda <- qda(Direction ~ Lag2, data = Weekly, subset = weekly.train)
weekly.qda.preds <- predict(weekly.qda, weekly.latest)$class
table(weekly.qda.preds, weekly.direction)
## weekly.direction
## weekly.qda.preds Down Up
## Down 0 0
## Up 43 61
The above output shows that the percentage of correct predictions is 58.65% \[((0+61)/104) = 0.5865\].
The model was perfectly accurate at predicting Up trends with a score of 100% \[(61/(61+0)) = 1\].
However, the model was unable to predict Down with a score of 0% \[(0/(0+43) = 0\].
weekly.train.X <- as.matrix(Weekly$Lag2[weekly.train])
weekly.test.X <- as.matrix(Weekly$Lag2[!weekly.train])
weekly.train.direction = Weekly$Direction[weekly.train]
set.seed(1)
weekly.knn.preds <- knn(weekly.train.X, weekly.test.X, weekly.train.direction, k=1)
table(weekly.knn.preds, weekly.direction)
## weekly.direction
## weekly.knn.preds Down Up
## Down 21 30
## Up 22 31
The above output shows that KNN with k=1 resulted in a model with an accuracy rate of exactly 50% \[((21+31)/104) = 0.5\]
The models with the highest accuracy rates are the Logistic Regression (62.50%) and LDA (62.50%).
weekly.glm.2 <- glm(Direction ~ Lag1:Lag2, data = Weekly, family = "binomial", subset = weekly.train)
weekly.glm.2.prob <- predict(weekly.glm.2, weekly.latest, type = "response")
weekly.glm.2.preds <- rep("Down", length(weekly.glm.2.prob))
weekly.glm.2.preds[weekly.glm.2.prob > 0.5] = "Up"
weekly.direction = Weekly$Direction[!weekly.train]
table(weekly.glm.2.preds, weekly.direction)
## weekly.direction
## weekly.glm.2.preds Down Up
## Down 1 1
## Up 42 60
The above Logistic Regression model shows an accuracy rate of 58.65%.
weekly.lda.2 <- lda(Direction ~ Lag1:Lag2, data = Weekly, family = 'binomial', subset = weekly.train)
weekly.lda.2.preds <- predict(weekly.lda.2, weekly.latest)
table(weekly.lda.2.preds$class, weekly.direction)
## weekly.direction
## Down Up
## Down 0 1
## Up 43 60
The above LDA model shows an accuracy rate of 57.69%.
weekly.qda.2 = qda(Direction ~ Lag1:Lag2, data = Weekly, subset = weekly.train)
weekly.qda.2.preds = predict(weekly.qda.2, weekly.latest)$class
table(weekly.qda.2.preds, weekly.direction)
## weekly.direction
## weekly.qda.2.preds Down Up
## Down 16 32
## Up 27 29
The above QDA model shows an accuracy rate of 43.27%.
set.seed(1)
weekly.knn.preds.2 <- knn(weekly.train.X, weekly.test.X, weekly.train.direction, k=10)
table(weekly.knn.preds.2, weekly.direction)
## weekly.direction
## weekly.knn.preds.2 Down Up
## Down 17 21
## Up 26 40
The above KNN model with k=10 shows an accuracy rate of 54.81%.
set.seed(1)
weekly.knn.preds.3 <- knn(weekly.train.X, weekly.test.X, weekly.train.direction, k=15)
table(weekly.knn.preds.3, weekly.direction)
## weekly.direction
## weekly.knn.preds.3 Down Up
## Down 20 20
## Up 23 41
The above KNN model with k=15 shows an accuracy rate of 58.65%.
After running multiple different models, our highest achieved accuracy rate was tied between Logistic Regression with Lag1 and Lag2 (58.65%) and KNN with k=15 (58.65%).
Auto data set.data(Auto)
mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.mpg01 <- rep(0, length(Auto$mpg))
mpg01[Auto$mpg > median(Auto$mpg)] = 1
Auto <- data.frame(Auto, mpg01)
pairs(Auto)
par(mfrow=c(1,2))
plot(Auto$year, Auto$acceleration, xlab = "Year", ylab = "Acceleration")
plot(Auto$year, Auto$horsepower, xlab = "Year", ylab = "Horsepower")
Our above scatter plots seem to show a positive correlation between acceleration and year. Conversely, we also observe a slight negative correlation between horsepower and year.
par(mfrow=c(1,2))
boxplot(cylinders ~ mpg01, data = Auto, main = "Cylinders")
boxplot(displacement ~ mpg01, data = Auto, main = "Displacement")
The above box plots show a negative correlation between cylinders and mpg01 as a negative correlation between displacement and mpg01.
train <- (Auto$year %% 2 == 0)
train.auto <- Auto[train, ]
test.auto <- Auto[!train, ]
test.mpg01 <- Auto$mpg01[!train]
auto.lda <- lda(mpg01 ~ cylinders + displacement + horsepower + weight + year, data = train.auto)
auto.lda.preds <- predict(auto.lda, test.auto)
table(auto.lda.preds$class, test.mpg01)
## test.mpg01
## 0 1
## 0 87 6
## 1 13 76
The above output of our LDA model returns an error rate of 10.44% \[1 - ((87+76)/182) = 0.1044\]
auto.qda <- qda(mpg01 ~ cylinders + displacement + horsepower + weight + year, data = train.auto)
auto.qda.preds <- predict(auto.qda, test.auto)
table(auto.qda.preds$class, test.auto$mpg01)
##
## 0 1
## 0 89 12
## 1 11 70
The above output of our QDA model returns an error rate of 12.64% \[1 - ((89+70)/182) = 0.1264\]
auto.lm <- glm(mpg01 ~ cylinders + displacement + horsepower + weight + year, data = train.auto, family = 'binomial')
auto.lm.prob <- predict(auto.lm, test.auto, type = "response")
auto.lm.preds = rep(0, length(auto.lm.prob))
auto.lm.preds[auto.lm.prob > 0.5] = 1
table(auto.lm.preds, test.auto$mpg01)
##
## auto.lm.preds 0 1
## 0 87 8
## 1 13 74
The above output of our Logistic Regression model returns an error rate of 11.54% \[1 - ((87+74)/182) = 0.1154\]
attach(Auto)
## The following object is masked _by_ .GlobalEnv:
##
## mpg01
set.seed(1)
#K = 1
auto.train.X = cbind(cylinders, displacement, horsepower, weight, year)[train, ]
auto.test.X = cbind(cylinders, displacement, horsepower, weight, year)[!train, ]
train.mpg01 = mpg01[train]
auto.knn.preds.1 = knn(auto.train.X, auto.test.X, train.mpg01, k = 1)
table(auto.knn.preds.1, test.mpg01)
## test.mpg01
## auto.knn.preds.1 0 1
## 0 83 11
## 1 17 71
#K = 10
auto.knn.preds.10 = knn(auto.train.X, auto.test.X, train.mpg01, k = 10)
table(auto.knn.preds.10, test.mpg01)
## test.mpg01
## auto.knn.preds.10 0 1
## 0 77 7
## 1 23 75
#K = 100
auto.knn.preds.100 = knn(auto.train.X, auto.test.X, train.mpg01, k = 100)
table(auto.knn.preds.100, test.mpg01)
## test.mpg01
## auto.knn.preds.100 0 1
## 0 81 7
## 1 19 75
Observed test error rates:
K=1: 15.38% \[1 - (83+71)/182) = 0.1538\]K=10: 16.48% \[1 - (77+75)/182) = 0.1648\]K=100: 14.29% \[1 - (81+75)/182) = 0.1429\]Our best KNN model was where K=100, which resulted in a test error rate of 14.29%.
Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.data(Boston)
attach(Boston)
# Create variable
crime <- rep(0, length(crim))
crime[crim > median(crim)] <- 1
Boston <- data.frame(Boston, crime)
# Split data
train = 1:(length(crime) / 2)
test = (length(crime) / 2 + 1):length(crime)
Boston.train = Boston[train, ]
Boston.test = Boston[test, ]
crime.test = crime[test]
# Logistic Regression 1
crime.lm.1 = glm(crime ~ . - crime - crim, data = Boston, family = "binomial", subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
crime.prob.1 = predict(crime.lm.1, Boston.test, type = "response")
crime.preds.1 = rep(0, length(crime.prob.1))
crime.preds.1[crime.prob.1 > 0.5] = 1
table(crime.preds.1, crime.test)
## crime.test
## crime.preds.1 0 1
## 0 68 24
## 1 22 139
Our first Logistic Regression model returned an error rate of 18.19% \[1 - ((68+139)/253) = .1819\]
#Logistic Regression 2
crime.lm.2 = glm(crime ~ . - crime - crim - chas - nox, data = Boston, family = 'binomial', subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
crime.prob.2 = predict(crime.lm.2, Boston.test, type = "response")
crime.preds.2 = rep(0, length(crime.prob.2))
crime.preds.2[crime.prob.2 > 0.5] = 1
table(crime.preds.2, crime.test)
## crime.test
## crime.preds.2 0 1
## 0 78 28
## 1 12 135
Our second Logistic Regression model returned an error rate of 15.81% \[1 - ((78+135)/253) = .1581\]
# LDA 1
crime.lda.1 = lda(crime ~ . - crime - crim, data = Boston, subset = train)
crime.lda.preds.1 = predict(crime.lda.1, Boston.test)
table(crime.lda.preds.1$class, crime.test)
## crime.test
## 0 1
## 0 80 24
## 1 10 139
Our first LDA model returned an error rate of 13.44% \[1 - ((80+139)/253) = .1344\]
# LDA 2
crime.lda.2 = lda(crime ~ . - crime - crim - chas - nox - tax, data = Boston, subset = train)
crime.lda.preds.2 = predict(crime.lda.2, Boston.test)
table(crime.lda.preds.2$class, crime.test)
## crime.test
## 0 1
## 0 83 28
## 1 7 135
Our second LDA model returned an error rate of 13.83% \[1 - ((83+135)/253) = .1383\]
#KNN, k = 10
set.seed(1)
Boston.train.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[train, ]
Boston.test.X = cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv)[test, ]
crime.train = crime[train]
crime.knn.preds.10 = knn(Boston.train.X, Boston.test.X, crime.train, k = 10)
table(crime.knn.preds.10, crime.test)
## crime.test
## crime.knn.preds.10 0 1
## 0 83 21
## 1 7 142
Our KNN model with k=10 returned an error rate of 11.07% \[1 - ((83+142)/253) = .1107\]
#KNN, k = 100
crime.knn.preds.100 = knn(Boston.train.X, Boston.test.X, crime.train, k = 100)
table(crime.knn.preds.100, crime.test)
## crime.test
## crime.knn.preds.100 0 1
## 0 86 121
## 1 4 42
Our KNN model with k=100 returned an error rate of 49.01% \[1 - ((86+43))/253 = .4901\]
Our best performing model was KNN with k=10 which resulted in an error rate of approximately 11%