There is a high correlation (about 0.84) between Year and Volume. This is intuitive: over time more companies are listed and more activity occurs on the market, so trading volume rises.
library(ISLR)   # Weekly and Auto data sets
library(MASS)   # lda(), qda(), and the Boston data set
library(class)  # knn()
weekly = Weekly
summary(weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
cor(weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
plot(weekly)
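To look specifically at the Year and Volume relationship noted above, a minimal sketch using the 'weekly' data frame loaded above:
# Correlation and a focused scatter plot of trading volume over time
cor(weekly$Year, weekly$Volume)
plot(weekly$Year, weekly$Volume, xlab = "Year", ylab = "Volume")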
In the full logistic regression below, the only predictor that is statistically significant at the 5% level is Lag2 (p = 0.03).
glm.fits = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = weekly, family = binomial)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
The logistic model is correct 56.1% of the time. This is only slightly better than guessing 'Up' every time, which would give 55.6% accuracy. The model also errs on the side of predicting 'Up', which makes sense given that 'Up' is the more likely outcome. When it does predict the market will go down, it is correct 52.9% of the time, better than the 44.4% base rate of down weeks; however, it rarely makes that prediction.
glm.probs = predict(glm.fits, type = "response")
glm.pred = rep("Up", nrow(weekly))
glm.pred[glm.probs < .5] = "Down"
table(glm.pred, weekly$Direction)
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
print(mean(glm.pred == weekly$Direction))
## [1] 0.5610652
print(mean(weekly$Direction == "Up"))
## [1] 0.5555556
print(54/(54+48))
## [1] 0.5294118
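As a quick check of the 44.4% down-week base rate quoted above, a one-line sketch:
# Proportion of weeks in which the market went down
print(mean(weekly$Direction == "Down"))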
Training on 1990-2008 and predicting the held-out 2009-2010 weeks with Lag2 as the only predictor, the accuracy rises to 62.5%. It isn't perfect, but it is a step in the right direction, especially since it is measured on a held-out data set.
train = (weekly$Year < 2009)
glm.fits2 = glm(Direction ~ Lag2, data = weekly[train, ], family = binomial)
summary(glm.fits2)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = weekly[train,
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
glm.probs2 = predict(glm.fits2, weekly[!train, ], type = "response")
glm.pred2 = rep("Up", length(weekly$Direction[!train]))
glm.pred2[glm.probs2 < .5] = "Down"
table(glm.pred2, weekly$Direction[!train])
##
## glm.pred2 Down Up
## Down 9 5
## Up 34 56
print(mean(glm.pred2 == weekly$Direction[!train]))
## [1] 0.625
With Lag2 as the only predictor, linear discriminant analysis produces essentially the same predictions as the logistic regression on the held-out data.
lda.fit = lda(Direction ~ Lag2 , data = weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
plot(lda.fit)
lda.pred = predict(lda.fit, weekly[!train, ])
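To verify the claim that LDA matches the logistic regression here, a minimal sketch reusing 'lda.pred' and 'glm.pred2' from the steps above:
# Confusion matrix and accuracy for LDA on the 2009-2010 hold-out data
table(lda.pred$class, weekly$Direction[!train])
mean(lda.pred$class == weekly$Direction[!train])
# Proportion of hold-out weeks where LDA and the logistic model agree
mean(lda.pred$class == glm.pred2)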
With only Lag2 as a predictor, quadratic discriminant analysis predicts the market will go up every week. Accuracy decreases to 58.6%.
qda.fit = qda(Direction ~ Lag2 , data = weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.pred = predict(qda.fit, weekly[!train, ])
table(qda.pred$class, weekly$Direction[!train])
##
## Down Up
## Down 0 0
## Up 43 61
print(mean(qda.pred$class == weekly$Direction[!train]))
## [1] 0.5865385
KNN with k = 1 has an accuracy of 50%. This is worse than simply guessing the market will go up every time, and it is wrong more often than it is right when it predicts the market will go down.
train.X = as.data.frame(weekly$Lag2[train])
test.X = as.data.frame(weekly$Lag2[!train])
train.response = weekly$Direction[train]
set.seed(42)
knn.pred = knn(train.X, test.X, train.response, k = 1)
table(knn.pred, weekly$Direction[!train])
##
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred == weekly$Direction[!train])
## [1] 0.5
Of the methods fit on the held-out 2009-2010 data, the linear discriminant model and the logistic regression (which made identical predictions here) provided the best results at 62.5% accuracy, and both leave room for further tuning of predictors and thresholds.
Experimenting further, the best predictive model I could find was KNN using only Lag2 with k = 2. It matches that 62.5% accuracy and is correct more often than not when it predicts down weeks, a prediction it makes far more frequently than the other models.
train.X = as.data.frame(weekly$Lag2[train])
test.X = as.data.frame(weekly$Lag2[!train])
train.response = weekly$Direction[train]
set.seed(42)
knn.pred = knn(train.X, test.X, train.response, k = 2, l = 0)
table(knn.pred, weekly$Direction[!train])
##
## knn.pred Down Up
## Down 26 22
## Up 17 39
mean(knn.pred == weekly$Direction[!train])
## [1] 0.625
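For context, a minimal sketch of the kind of search over k that motivated choosing k = 2, reusing 'train.X', 'test.X', and 'train.response' from above ('knn.pred.k' is just a name for this sketch):
for (i in 1:10) {
set.seed(42)
knn.pred.k = knn(train.X, test.X, train.response, k = i)
print(paste('K: ', i, ' Accuracy: ', mean(knn.pred.k == weekly$Direction[!train])))}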
auto = Auto
mpgmedian = auto$mpg > median(auto$mpg)
auto$mpg01[mpgmedian == TRUE] = 1
auto$mpg01[mpgmedian == FALSE] = 0
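Equivalently, mpg01 can be created in a single step; a minimal sketch that yields the same column:
# One-line construction of the binary response
auto$mpg01 = as.integer(auto$mpg > median(auto$mpg))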
From the scatter plots, the features most likely to help predict mpg01 are displacement, horsepower, weight, and acceleration. From the boxplots of the discrete variables (cylinders, year, origin), it appears that year and origin will also be useful for predicting mpg01.
pairs(mpg01~ displacement + horsepower + weight + acceleration,auto)
boxplot(mpg01 ~ cylinders, auto)
boxplot(mpg01 ~ year, auto)
boxplot(mpg01 ~ origin, auto)
set.seed(42)
train = sample(1:nrow(auto), nrow(auto) * .7)
x.train = auto[train, 2:8]
y.train = auto[train, 10]
x.test = auto[-train, 2:8]
y.test = auto[-train, 10]
The LDA test error is only 7.6%. That's pretty good!
lda.fit.auto = lda(mpg01 ~ displacement + horsepower + weight + acceleration + year + origin, data = auto, subset = train)
lda.fit.auto
## Call:
## lda(mpg01 ~ displacement + horsepower + weight + acceleration +
## year + origin, data = auto, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.5437956 0.4562044
##
## Group means:
## displacement horsepower weight acceleration year origin
## 0 268.7584 127.8121 3573.557 14.7349 74.58389 1.167785
## 1 116.9720 78.3280 2336.640 16.7304 77.56800 1.960000
##
## Coefficients of linear discriminants:
## LD1
## displacement -0.004013814
## horsepower 0.007876578
## weight -0.001343671
## acceleration 0.024734805
## year 0.139744209
## origin 0.198743128
lda.auto.pred = predict(lda.fit.auto, auto[-train, ])
table(lda.auto.pred$class, auto[-train, 'mpg01'])
##
## 0 1
## 0 41 3
## 1 6 68
1 - mean(lda.auto.pred$class == auto[-train, 'mpg01'])
## [1] 0.07627119
Coincidentally, the QDA test error is also 7.6%.
qda.fit.auto = qda(mpg01 ~ displacement + horsepower + weight + acceleration + year + origin, data = auto, subset = train)
qda.fit.auto
## Call:
## qda(mpg01 ~ displacement + horsepower + weight + acceleration +
## year + origin, data = auto, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.5437956 0.4562044
##
## Group means:
## displacement horsepower weight acceleration year origin
## 0 268.7584 127.8121 3573.557 14.7349 74.58389 1.167785
## 1 116.9720 78.3280 2336.640 16.7304 77.56800 1.960000
qda.auto.pred = predict(qda.fit.auto, auto[-train, ])
table(qda.auto.pred$class, auto[-train, 'mpg01'])
##
## 0 1
## 0 43 5
## 1 4 66
1 - mean(qda.auto.pred$class == auto[-train, 'mpg01'])
## [1] 0.07627119
The test error of the logistic model is 8.5%. This is slightly worse than the LDA and QDA models.
glm.fit.auto = glm(mpg01 ~ displacement + horsepower + weight + acceleration + year + origin, auto[train, ], family = binomial)
summary(glm.fit.auto)
##
## Call:
## glm(formula = mpg01 ~ displacement + horsepower + weight + acceleration +
## year + origin, family = binomial, data = auto[train, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.28340 -0.14323 -0.00281 0.22601 3.01004
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.536837 7.456870 -3.291 0.001000 **
## displacement 0.004097 0.009304 0.440 0.659728
## horsepower -0.029734 0.026032 -1.142 0.253366
## weight -0.004983 0.001407 -3.542 0.000397 ***
## acceleration 0.101368 0.150591 0.673 0.500863
## year 0.500225 0.100043 5.000 5.73e-07 ***
## origin 0.452448 0.382441 1.183 0.236789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 377.74 on 273 degrees of freedom
## Residual deviance: 116.05 on 267 degrees of freedom
## AIC: 130.05
##
## Number of Fisher Scoring iterations: 7
glm.prob.auto = predict(glm.fit.auto, auto[-train, ], type = "response")
glm.pred.auto = rep(1, nrow(auto[-train,]))
glm.pred.auto[glm.prob.auto < .5] = 0
table(glm.pred.auto, auto[-train, 'mpg01'])
##
## glm.pred.auto 0 1
## 0 46 9
## 1 1 62
print(1- mean(glm.pred.auto == auto[-train, 'mpg01']))
## [1] 0.08474576
Among k = 1 through 10, the best KNN test error is obtained at k = 1, where the error is 11.0%.
for (i in 1:10) {
set.seed(42)
knn.pred.auto = knn(x.train, x.test, y.train, k = i)
error = 1 - mean(knn.pred.auto == y.test)
print(paste('I: ', i, ' Error: ', error))}
## [1] "I: 1 Error: 0.110169491525424"
## [1] "I: 2 Error: 0.127118644067797"
## [1] "I: 3 Error: 0.11864406779661"
## [1] "I: 4 Error: 0.11864406779661"
## [1] "I: 5 Error: 0.127118644067797"
## [1] "I: 6 Error: 0.127118644067797"
## [1] "I: 7 Error: 0.127118644067797"
## [1] "I: 8 Error: 0.152542372881356"
## [1] "I: 9 Error: 0.152542372881356"
## [1] "I: 10 Error: 0.144067796610169"
A high-crime variable was created, where 1 means the neighborhood's crime rate is above the median and 0 means it is below. The data was also split into 70% training and 30% test subsets, and the correlation between highcrime and each predictor was calculated. Two formulas were then created to evaluate with the LDA, QDA, and logistic models: the first uses only the predictors whose correlation with highcrime exceeds 0.5 in absolute value ('nox', 'dis', 'rad', 'age', 'indus', and 'tax'), while the second uses all predictors. For the logistic regressions, 'tax' is dropped from both formulas because it is highly correlated with 'rad' (0.91).
boston = Boston
crimemedian = boston$crim > median(boston$crim)
boston$highcrime[crimemedian == TRUE] = 1
boston$highcrime[crimemedian == FALSE] = 0
train = sample(1:nrow(boston), .7 * nrow(boston))
cor(boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## highcrime 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## highcrime -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128
## ptratio black lstat medv highcrime
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046 0.40939545
## zn -0.3916785 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252 0.60326017
## chas -0.1215152 0.04878848 -0.0539293 0.1752602 0.07009677
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208 0.72323480
## rm -0.3555015 0.12806864 -0.6138083 0.6953599 -0.15637178
## age 0.2615150 -0.27353398 0.6023385 -0.3769546 0.61393992
## dis -0.2324705 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262 0.61978625
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867 0.25356836
## black -0.1773833 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627 0.45326273
## medv -0.5077867 0.33346082 -0.7376627 1.0000000 -0.26301673
## highcrime 0.2535684 -0.35121093 0.4532627 -0.2630167 1.00000000
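As a quick check of the 'correlation above 0.5' selection rule described above, a small sketch using the 'boston' data frame built above:
# Absolute correlation of every variable with highcrime, largest first
round(sort(abs(cor(boston)[, "highcrime"]), decreasing = TRUE), 2)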
formula1 = highcrime ~ nox + dis + rad + age + indus + tax
formula1L = highcrime ~ nox + dis + rad + age + indus
formula2 = highcrime ~ nox + dis + rad + age + indus + tax + zn + chas + rm + ptratio + black + lstat + medv
formula2L = highcrime ~ nox + dis + rad + age + indus + zn + chas + rm + ptratio + black + lstat + medv
The logistic model performs better when all predictors except tax are included: the test error is 15.1% for the smaller model and 11.8% for the full model.
glm.boston1 = glm(formula1L, boston[train, ], family = binomial)
summary(glm.boston1)
##
## Call:
## glm(formula = formula1L, family = binomial, data = boston[train,
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06060 -0.32197 -0.08894 0.01065 2.68649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -23.311634 4.113145 -5.668 1.45e-08 ***
## nox 38.430111 7.407417 5.188 2.12e-07 ***
## dis 0.199664 0.171105 1.167 0.243248
## rad 0.447298 0.117510 3.806 0.000141 ***
## age 0.004681 0.010151 0.461 0.644705
## indus -0.075341 0.045848 -1.643 0.100328
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.19 on 353 degrees of freedom
## Residual deviance: 181.14 on 348 degrees of freedom
## AIC: 193.14
##
## Number of Fisher Scoring iterations: 8
glm.prob.boston1 = predict(glm.boston1, boston[-train, ], type = "response")
glm.pred.boston1 = rep(1, nrow(boston[-train,]))
glm.pred.boston1[glm.prob.boston1 < .5] = 0
table(glm.pred.boston1, boston[-train, 'highcrime'])
##
## glm.pred.boston1 0 1
## 0 60 14
## 1 9 69
print(1 - mean(glm.pred.boston1 == boston[-train, 'highcrime']))
## [1] 0.1513158
glm.boston2 = glm(formula2L, boston[train, ], family = binomial)
summary(glm.boston2)
##
## Call:
## glm(formula = formula2L, family = binomial, data = boston[train,
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8672 -0.2217 -0.0047 0.0100 3.3451
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.516751 7.306674 -4.724 2.31e-06 ***
## nox 48.512254 8.581395 5.653 1.57e-08 ***
## dis 0.794387 0.248767 3.193 0.00141 **
## rad 0.402876 0.139056 2.897 0.00376 **
## age 0.012271 0.013964 0.879 0.37953
## indus -0.069888 0.049296 -1.418 0.15627
## zn -0.077616 0.039959 -1.942 0.05209 .
## chas 1.487094 1.056864 1.407 0.15940
## rm -0.408475 0.789851 -0.517 0.60505
## ptratio 0.232225 0.136739 1.698 0.08945 .
## black -0.009964 0.005803 -1.717 0.08598 .
## lstat 0.075162 0.053679 1.400 0.16145
## medv 0.198062 0.075480 2.624 0.00869 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.19 on 353 degrees of freedom
## Residual deviance: 154.43 on 341 degrees of freedom
## AIC: 180.43
##
## Number of Fisher Scoring iterations: 9
glm.prob.boston2 = predict(glm.boston2, boston[-train, ], type = "response")
glm.pred.boston2 = rep(1, nrow(boston[-train,]))
glm.pred.boston2[glm.prob.boston2 < .5] = 0
table(glm.pred.boston2, boston[-train, 'highcrime'])
##
## glm.pred.boston2 0 1
## 0 61 10
## 1 8 73
print(1 - mean(glm.pred.boston2 == boston[-train, 'highcrime']))
## [1] 0.1184211
LDA, in contrast, does better with the chosen subset of predictors: the test error improves to 16.4%, compared with 17.1% when all predictors are used.
set.seed(42)
lda.boston1 = lda(formula1, data = boston, subset = train)
lda.boston1
## Call:
## lda(formula1, data = boston, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.519774 0.480226
##
## Group means:
## nox dis rad age indus tax
## 0 0.4717451 5.013656 4.01087 51.71141 7.056467 305.7663
## 1 0.6370941 2.480384 14.46471 85.64588 15.356882 504.5941
##
## Coefficients of linear discriminants:
## LD1
## nox 7.357924376
## dis -0.053402428
## rad 0.084543742
## age 0.008785933
## indus 0.015816315
## tax -0.001469737
lda.pred.boston1 = predict(lda.boston1, boston[-train, ])
table(lda.pred.boston1$class, boston[-train, 'highcrime'])
##
## 0 1
## 0 66 22
## 1 3 61
1 - mean(lda.pred.boston1$class == boston[-train, 'highcrime'])
## [1] 0.1644737
set.seed(42)
lda.boston2 = lda(formula2, data = boston, subset = train)
lda.boston2
## Call:
## lda(formula2, data = boston, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.519774 0.480226
##
## Group means:
## nox dis rad age indus tax zn chas
## 0 0.4717451 5.013656 4.01087 51.71141 7.056467 305.7663 21.035326 0.03260870
## 1 0.6370941 2.480384 14.46471 85.64588 15.356882 504.5941 1.305882 0.08235294
## rm ptratio black lstat medv
## 0 6.384060 18.05380 388.3198 9.341359 24.68696
## 1 6.220671 18.92882 323.2791 15.535059 20.71412
##
## Coefficients of linear discriminants:
## LD1
## nox 8.2571533859
## dis 0.0613105160
## rad 0.0677918403
## age 0.0093078031
## indus 0.0271014719
## tax -0.0004555629
## zn -0.0057270353
## chas -0.1519712801
## rm 0.1167760980
## ptratio 0.0146070536
## black -0.0008069021
## lstat 0.0233850211
## medv 0.0455685064
lda.pred.boston2 = predict(lda.boston2, boston[-train, ])
table(lda.pred.boston2$class, boston[-train, 'highcrime'])
##
## 0 1
## 0 64 21
## 1 5 62
1 - mean(lda.pred.boston2$class == boston[-train, 'highcrime'])
## [1] 0.1710526
QDA performs noticeably better with the subset of predictors than with all of them: the test error is 9.9% with the subset versus 12.5% with all predictors.
set.seed(42)
qda.boston1 = qda(formula1, data = boston, subset = train)
qda.boston1
## Call:
## qda(formula1, data = boston, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.519774 0.480226
##
## Group means:
## nox dis rad age indus tax
## 0 0.4717451 5.013656 4.01087 51.71141 7.056467 305.7663
## 1 0.6370941 2.480384 14.46471 85.64588 15.356882 504.5941
qda.pred.boston1 = predict(qda.boston1, boston[-train, ])
table(qda.pred.boston1$class, boston[-train, 'highcrime'])
##
## 0 1
## 0 69 15
## 1 0 68
1 - mean(qda.pred.boston1$class == boston[-train, 'highcrime'])
## [1] 0.09868421
set.seed(42)
qda.boston2 = qda(formula2, data = boston, subset = train)
qda.boston2
## Call:
## qda(formula2, data = boston, subset = train)
##
## Prior probabilities of groups:
## 0 1
## 0.519774 0.480226
##
## Group means:
## nox dis rad age indus tax zn chas
## 0 0.4717451 5.013656 4.01087 51.71141 7.056467 305.7663 21.035326 0.03260870
## 1 0.6370941 2.480384 14.46471 85.64588 15.356882 504.5941 1.305882 0.08235294
## rm ptratio black lstat medv
## 0 6.384060 18.05380 388.3198 9.341359 24.68696
## 1 6.220671 18.92882 323.2791 15.535059 20.71412
qda.pred.boston2 = predict(qda.boston2, boston[-train, ])
table(qda.pred.boston2$class, boston[-train, 'highcrime'])
##
## 0 1
## 0 65 15
## 1 4 68
1 - mean(qda.pred.boston2$class == boston[-train, 'highcrime'])
## [1] 0.125
KNN reaches its lowest test error, 7.9%, with both feature sets: at k = 2 or 3 for the full model and at k = 8 for the limited model, which does worse than the full model for k of 4 or less but better for k of 5 or more. As KNN seemed the most promising model for this dataset, I tuned the predictors further and found that using only 'nox' with k = 1 or 2 reduces the test error to 3.3%.
x.train1 = boston[train, c('nox', 'dis', 'rad', 'age', 'indus', 'tax')]
x.test1 = boston[-train, c('nox', 'dis', 'rad', 'age', 'indus', 'tax')]
x.train2 = boston[train, 2:14]
x.test2 = boston[-train, 2:14]
x.train3 = as.data.frame(boston[train, 'nox'])
x.test3 = as.data.frame(boston[-train, 'nox'])
y.train = boston[train, 'highcrime']
y.test = boston[-train, 'highcrime']
for (i in 1:10) {
set.seed(42)
knn.pred.boston1 = knn(x.train1, x.test1, y.train, k = i)
error = 1 - mean(knn.pred.boston1 == y.test)
print(paste('I: ', i, ' Error: ', error))}
## [1] "I: 1 Error: 0.131578947368421"
## [1] "I: 2 Error: 0.118421052631579"
## [1] "I: 3 Error: 0.0855263157894737"
## [1] "I: 4 Error: 0.105263157894737"
## [1] "I: 5 Error: 0.0921052631578947"
## [1] "I: 6 Error: 0.0855263157894737"
## [1] "I: 7 Error: 0.0855263157894737"
## [1] "I: 8 Error: 0.0789473684210527"
## [1] "I: 9 Error: 0.0986842105263158"
## [1] "I: 10 Error: 0.0986842105263158"
for (i in 1:10) {
set.seed(42)
knn.pred.boston2 = knn(x.train2, x.test2, y.train, k = i)
error = 1 - mean(knn.pred.boston2 == y.test)
print(paste('I: ', i, ' Error: ', error))}
## [1] "I: 1 Error: 0.0855263157894737"
## [1] "I: 2 Error: 0.0789473684210527"
## [1] "I: 3 Error: 0.0789473684210527"
## [1] "I: 4 Error: 0.0986842105263158"
## [1] "I: 5 Error: 0.111842105263158"
## [1] "I: 6 Error: 0.0986842105263158"
## [1] "I: 7 Error: 0.125"
## [1] "I: 8 Error: 0.125"
## [1] "I: 9 Error: 0.125"
## [1] "I: 10 Error: 0.118421052631579"
for (i in 1:10) {
set.seed(42)
knn.pred.boston3 = knn(x.train3, x.test3, y.train, k = i)
error = 1 - mean(knn.pred.boston3 == y.test)
print(paste('I: ', i, ' Error: ', error))}
## [1] "I: 1 Error: 0.0328947368421053"
## [1] "I: 2 Error: 0.0328947368421053"
## [1] "I: 3 Error: 0.0460526315789473"
## [1] "I: 4 Error: 0.0921052631578947"
## [1] "I: 5 Error: 0.0921052631578947"
## [1] "I: 6 Error: 0.0986842105263158"
## [1] "I: 7 Error: 0.105263157894737"
## [1] "I: 8 Error: 0.105263157894737"
## [1] "I: 9 Error: 0.0723684210526315"
## [1] "I: 10 Error: 0.0789473684210527"
In conclusion, all of the models were able to predict effectively whether a neighborhood's crime rate is above the median. KNN was the most accurate, however, and after tuning it predicted with 96.7% accuracy by comparing the neighborhood of interest to the single known neighborhood with the most similar 'nox' level.