Question 10

A

There is a high correlation (0.84) between Year and Volume. This is intuitive: over time more companies are listed and more activity takes place on the market, so trading volume rises.

library(ISLR)   # Weekly and Auto data sets
library(MASS)   # lda(), qda(), and the Boston data set
library(class)  # knn()
weekly = Weekly
summary(weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
## 
cor(weekly[,-9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
plot(weekly)

B

The only predictor that is statistically significant at the 5% level is Lag2 (p = 0.0296).

glm.fits = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = weekly, family = binomial)
summary(glm.fits)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

C

The logistic model is correct 56.1% of the time. This is only slightly better than simply guessing ‘Up’ every week, which would give 55.6% accuracy. The model also errs on the side of predicting ‘Up’, which makes sense given that it is the more common outcome. When it does predict the market will go down it is correct 53% of the time, which is reasonable given that the market only goes down 44.4% of the time; however, it rarely predicts a down week.

glm.probs = predict(glm.fits, type = "response")
glm.pred = rep("Up", nrow(weekly))   # 1089 weeks in total
glm.pred[glm.probs < .5] = "Down"
table(glm.pred, weekly$Direction)
##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557
print(mean(glm.pred == weekly$Direction))
## [1] 0.5610652
print(mean(weekly$Direction == "Up"))
## [1] 0.5555556
print(54/(54+48))
## [1] 0.5294118

D

Training on 1990–2008 and predicting the held-out 2009–2010 weeks with Lag2 alone, the accuracy improves to 62.5%. It isn’t perfect, but it is a step in the right direction, especially considering it is measured on a hold-out data set.

train = (weekly$Year < 2009)
glm.fits2 = glm(formula = Direction ~ Lag2, data = weekly[train, ], family = binomial )
summary (glm.fits2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = weekly[train, 
##     ])
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
glm.probs2 = predict(glm.fits2, weekly[!train, ], type = "response")
glm.pred2 = rep("Up", length(weekly$Direction[!train]))
glm.pred2[glm.probs2 < .5] = "Down"
table(glm.pred2, weekly$Direction[!train])
##          
## glm.pred2 Down Up
##      Down    9  5
##      Up     34 56
print(mean(glm.pred2 == weekly$Direction[!train]))
## [1] 0.625

E

On this data, linear discriminant analysis gives exactly the same test-set predictions as the logistic regression when only Lag2 is used, so its accuracy is also 62.5%.

lda.fit = lda(Direction ~ Lag2 , data = weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
plot(lda.fit)

lda.pred = predict(lda.fit, weekly[!train, ])
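
To check this, the LDA confusion matrix and accuracy on the held-out data can be tabulated the same way as in part D (output omitted here; per the claim above it matches the logistic regression table exactly):

table(lda.pred$class, weekly$Direction[!train])
mean(lda.pred$class == weekly$Direction[!train])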

F

The quadratic discriminant analysis predicts the market will go up every week when only Lag2 is used as a predictor. Accuracy drops to about 58.7%, from the 62.5% of logistic regression and LDA.

qda.fit = qda(Direction ~ Lag2 , data = weekly, subset = train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = weekly, subset = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda.pred = predict(qda.fit, weekly[!train, ])
table(qda.pred$class, weekly$Direction[!train])
##       
##        Down Up
##   Down    0  0
##   Up     43 61
print(mean(qda.pred$class == weekly$Direction[!train]))
## [1] 0.5865385

G

KNN with k = 1 has an accuracy of 50%. This is worse than simply guessing that the market will go up every week. It is also wrong more often than it is right when it predicts the market will go down.

train.X = as.data.frame(weekly$Lag2[train])
test.X = as.data.frame(weekly$Lag2[!train])
train.response = weekly$Direction[train]
set.seed(42)
knn.pred = knn(train.X, test.X, train.response, k = 1)
table(knn.pred, weekly$Direction[!train])
##         
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred == weekly$Direction[!train])
## [1] 0.5

H

The methods that provided the best results were logistic regression and linear discriminant analysis, which made identical predictions on the held-out data and tied at 62.5% accuracy. These models not only provided the greatest accuracy, but also output class probabilities, so the 0.5 decision threshold can be adjusted to trade off the two kinds of error.
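
As a sketch of that kind of tuning, the held-out probabilities from part D can simply be re-thresholded at a cutoff other than 0.5 (the 0.45 below is an arbitrary illustration, not a tuned value; output omitted):

# re-threshold the part D hold-out probabilities at an illustrative cutoff of 0.45
glm.pred.tuned = rep("Up", length(glm.probs2))
glm.pred.tuned[glm.probs2 < .45] = "Down"
table(glm.pred.tuned, weekly$Direction[!train])
mean(glm.pred.tuned == weekly$Direction[!train])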

I

The best predictive model I could find was KNN using only Lag2 with k = 2. Here the model was 62.5% accurate, and it is even correct more often than not when predicting down weeks. A small search over k, sketched after the code below, is one way to arrive at this choice.

train.X = as.data.frame(weekly$Lag2[train])
test.X = as.data.frame(weekly$Lag2[!train])
train.response = weekly$Direction[train]
set.seed(42)
knn.pred = knn(train.X, test.X, train.response, k = 2, l = 0)
table(knn.pred, weekly$Direction[!train])
##         
## knn.pred Down Up
##     Down   26 22
##     Up     17 39
mean(knn.pred == weekly$Direction[!train])
## [1] 0.625
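
For reference, this is the kind of search that leads to k = 2: loop over a few values of k on the same train/test split and compare accuracies (output omitted; the exact values depend on the seed because knn() breaks ties at random):

for (k in 1:10) {
  set.seed(42)
  pred.k = knn(train.X, test.X, train.response, k = k)
  print(paste('k: ', k, '   Accuracy: ', mean(pred.k == weekly$Direction[!train])))}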

Question 11

A

auto = Auto
mpgmedian = auto$mpg > median(auto$mpg)
auto$mpg01[mpgmedian == TRUE] = 1
auto$mpg01[mpgmedian == FALSE] = 0

B

From the scatter plots, the features most likely to help predict mpg01 are displacement, horsepower, weight, and acceleration. From the boxplots of the discrete variables, it seems that year and origin will also be useful in predicting mpg01.

pairs(mpg01~ displacement + horsepower + weight + acceleration,auto)

boxplot(mpg01 ~ cylinders, auto)

boxplot(mpg01 ~ year, auto)

boxplot(mpg01 ~ origin, auto)

C

set.seed(42)
train = sample(1:nrow(auto), nrow(auto) * .7)
x.train = auto[train, 2:8]
y.train = auto[train, 10]
x.test = auto[-train, 2:8]
y.test = auto[-train, 10]

D

The LDA test error is only 7.6%. That’s pretty good!

lda.fit.auto = lda(mpg01 ~ displacement + horsepower + weight + acceleration + year + origin, data = auto, subset = train)
lda.fit.auto
## Call:
## lda(mpg01 ~ displacement + horsepower + weight + acceleration + 
##     year + origin, data = auto, subset = train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5437956 0.4562044 
## 
## Group means:
##   displacement horsepower   weight acceleration     year   origin
## 0     268.7584   127.8121 3573.557      14.7349 74.58389 1.167785
## 1     116.9720    78.3280 2336.640      16.7304 77.56800 1.960000
## 
## Coefficients of linear discriminants:
##                       LD1
## displacement -0.004013814
## horsepower    0.007876578
## weight       -0.001343671
## acceleration  0.024734805
## year          0.139744209
## origin        0.198743128
lda.auto.pred = predict(lda.fit.auto, auto[-train, ])
table(lda.auto.pred$class, auto[-train, 'mpg01'])
##    
##      0  1
##   0 41  3
##   1  6 68
1 - mean(lda.auto.pred$class == auto[-train, 'mpg01'])
## [1] 0.07627119

E

Coincidentally, the QDA test error is also 7.6%.

qda.fit.auto = qda(mpg01 ~ displacement + horsepower + weight + acceleration + year + origin, data = auto, subset = train)
qda.fit.auto
## Call:
## qda(mpg01 ~ displacement + horsepower + weight + acceleration + 
##     year + origin, data = auto, subset = train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.5437956 0.4562044 
## 
## Group means:
##   displacement horsepower   weight acceleration     year   origin
## 0     268.7584   127.8121 3573.557      14.7349 74.58389 1.167785
## 1     116.9720    78.3280 2336.640      16.7304 77.56800 1.960000
qda.auto.pred = predict(qda.fit.auto, auto[-train, ])
table(qda.auto.pred$class, auto[-train, 'mpg01'])
##    
##      0  1
##   0 43  5
##   1  4 66
1 - mean(qda.auto.pred$class == auto[-train, 'mpg01'])
## [1] 0.07627119

F

The test error of the logistic model is 8.5%. This is a slight decrease in performance relative to the LDA and QDA models.

glm.fit.auto = glm(mpg01 ~ displacement + horsepower + weight + acceleration + year + origin, auto[train, ], family = binomial)
summary(glm.fit.auto)
## 
## Call:
## glm(formula = mpg01 ~ displacement + horsepower + weight + acceleration + 
##     year + origin, family = binomial, data = auto[train, ])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.28340  -0.14323  -0.00281   0.22601   3.01004  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -24.536837   7.456870  -3.291 0.001000 ** 
## displacement   0.004097   0.009304   0.440 0.659728    
## horsepower    -0.029734   0.026032  -1.142 0.253366    
## weight        -0.004983   0.001407  -3.542 0.000397 ***
## acceleration   0.101368   0.150591   0.673 0.500863    
## year           0.500225   0.100043   5.000 5.73e-07 ***
## origin         0.452448   0.382441   1.183 0.236789    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 377.74  on 273  degrees of freedom
## Residual deviance: 116.05  on 267  degrees of freedom
## AIC: 130.05
## 
## Number of Fisher Scoring iterations: 7
glm.prob.auto = predict(glm.fit.auto, auto[-train, ], type = "response")
glm.pred.auto = rep(1, nrow(auto[-train,]))
glm.pred.auto[glm.prob.auto < .5] = 0
table(glm.pred.auto, auto[-train, 'mpg01'])
##              
## glm.pred.auto  0  1
##             0 46  9
##             1  1 62
print(1- mean(glm.pred.auto == auto[-train, 'mpg01']))
## [1] 0.08474576

G

The best test error is obtained when k = 1. The error is 11.0%.

for (i in 1:10) {
  set.seed(42)
  knn.pred.auto = knn(x.train, x.test, y.train, k = i)
  error = 1 - mean(knn.pred.auto == y.test)
  print(paste('I: ', i, '   Error: ', error))}
## [1] "I:  1    Error:  0.110169491525424"
## [1] "I:  2    Error:  0.127118644067797"
## [1] "I:  3    Error:  0.11864406779661"
## [1] "I:  4    Error:  0.11864406779661"
## [1] "I:  5    Error:  0.127118644067797"
## [1] "I:  6    Error:  0.127118644067797"
## [1] "I:  7    Error:  0.127118644067797"
## [1] "I:  8    Error:  0.152542372881356"
## [1] "I:  9    Error:  0.152542372881356"
## [1] "I:  10    Error:  0.144067796610169"

Question 13

Data

The highcrime variable was created, where 1 means the neighborhood crime rate is above the median and 0 means it is below the median. The data were also split into 70% train and 30% test subsets, and the correlation between highcrime and each predictor was examined. ‘tax’ is dropped from the logistic regression formulas (the ‘L’ variants below) because it is highly correlated with ‘rad’ (0.91). From this, two formulas are created and evaluated with the LDA, QDA, and logit models. The first uses only the predictors whose absolute correlation with highcrime exceeds 0.5: ‘nox’, ‘dis’, ‘rad’, ‘age’, ‘indus’, and (except in the logistic version) ‘tax’; a quick check of that cutoff is sketched after the correlation matrix below. The second formula uses all predictors.

boston = Boston
crimemedian = boston$crim > median(boston$crim)
boston$highcrime[crimemedian == TRUE] = 1
boston$highcrime[crimemedian == FALSE] = 0
# note: no seed is set before sample(), so the 70/30 split (and the test errors reported below) will vary between runs
train = sample(1:nrow(boston), .7 * nrow(boston))
cor(boston)
##                  crim          zn       indus         chas         nox
## crim       1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn        -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus      0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas      -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox        0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm        -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age        0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis       -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad        0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax        0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio    0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black     -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat      0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv      -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## highcrime  0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                    rm         age         dis          rad         tax
## crim      -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn         0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus     -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas       0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox       -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm         1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age       -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis        0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad       -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax       -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio   -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black      0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat     -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv       0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
## highcrime -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128
##              ptratio       black      lstat       medv   highcrime
## crim       0.2899456 -0.38506394  0.4556215 -0.3883046  0.40939545
## zn        -0.3916785  0.17552032 -0.4129946  0.3604453 -0.43615103
## indus      0.3832476 -0.35697654  0.6037997 -0.4837252  0.60326017
## chas      -0.1215152  0.04878848 -0.0539293  0.1752602  0.07009677
## nox        0.1889327 -0.38005064  0.5908789 -0.4273208  0.72323480
## rm        -0.3555015  0.12806864 -0.6138083  0.6953599 -0.15637178
## age        0.2615150 -0.27353398  0.6023385 -0.3769546  0.61393992
## dis       -0.2324705  0.29151167 -0.4969958  0.2499287 -0.61634164
## rad        0.4647412 -0.44441282  0.4886763 -0.3816262  0.61978625
## tax        0.4608530 -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio    1.0000000 -0.17738330  0.3740443 -0.5077867  0.25356836
## black     -0.1773833  1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat      0.3740443 -0.36608690  1.0000000 -0.7376627  0.45326273
## medv      -0.5077867  0.33346082 -0.7376627  1.0000000 -0.26301673
## highcrime  0.2535684 -0.35121093  0.4532627 -0.2630167  1.00000000
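The 0.5 cutoff used to build the first formula can be checked directly from the last column of the correlation matrix; a quick way to list the qualifying predictors (output omitted) is:

# predictors whose absolute correlation with highcrime exceeds 0.5
highcor = abs(cor(boston)[, 'highcrime'])
names(highcor)[highcor > 0.5 & names(highcor) != 'highcrime']
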
formula1 = highcrime ~ nox + dis + rad + age + indus + tax    # |cor with highcrime| > 0.5 (used for LDA/QDA)
formula1L = highcrime ~ nox + dis + rad + age + indus          # same subset without tax (used for the logit model)
formula2 = highcrime ~ nox + dis + rad + age + indus + tax + zn + chas + rm + ptratio + black + lstat + medv    # all predictors (LDA/QDA)
formula2L = highcrime ~ nox + dis + rad + age + indus + zn + chas + rm + ptratio + black + lstat + medv         # all predictors without tax (logit)

Analysis

The logit model performs better when all predictors except tax are included: the test error is 15.1% for the smaller model and 11.8% for the full model.

glm.boston1 = glm(formula1L, boston[train, ], family = binomial)
summary(glm.boston1)
## 
## Call:
## glm(formula = formula1L, family = binomial, data = boston[train, 
##     ])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06060  -0.32197  -0.08894   0.01065   2.68649  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -23.311634   4.113145  -5.668 1.45e-08 ***
## nox          38.430111   7.407417   5.188 2.12e-07 ***
## dis           0.199664   0.171105   1.167 0.243248    
## rad           0.447298   0.117510   3.806 0.000141 ***
## age           0.004681   0.010151   0.461 0.644705    
## indus        -0.075341   0.045848  -1.643 0.100328    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.19  on 353  degrees of freedom
## Residual deviance: 181.14  on 348  degrees of freedom
## AIC: 193.14
## 
## Number of Fisher Scoring iterations: 8
glm.prob.boston1 = predict(glm.boston1, boston[-train, ], type = "response")
glm.pred.boston1 = rep(1, nrow(boston[-train,]))
glm.pred.boston1[glm.prob.boston1 < .5] = 0
table(glm.pred.boston1, boston[-train, 'highcrime'])
##                 
## glm.pred.boston1  0  1
##                0 60 14
##                1  9 69
print(1 - mean(glm.pred.boston1 == boston[-train, 'highcrime']))
## [1] 0.1513158
glm.boston2 = glm(formula2L, boston[train, ], family = binomial)
summary(glm.boston2)
## 
## Call:
## glm(formula = formula2L, family = binomial, data = boston[train, 
##     ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8672  -0.2217  -0.0047   0.0100   3.3451  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -34.516751   7.306674  -4.724 2.31e-06 ***
## nox          48.512254   8.581395   5.653 1.57e-08 ***
## dis           0.794387   0.248767   3.193  0.00141 ** 
## rad           0.402876   0.139056   2.897  0.00376 ** 
## age           0.012271   0.013964   0.879  0.37953    
## indus        -0.069888   0.049296  -1.418  0.15627    
## zn           -0.077616   0.039959  -1.942  0.05209 .  
## chas          1.487094   1.056864   1.407  0.15940    
## rm           -0.408475   0.789851  -0.517  0.60505    
## ptratio       0.232225   0.136739   1.698  0.08945 .  
## black        -0.009964   0.005803  -1.717  0.08598 .  
## lstat         0.075162   0.053679   1.400  0.16145    
## medv          0.198062   0.075480   2.624  0.00869 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 490.19  on 353  degrees of freedom
## Residual deviance: 154.43  on 341  degrees of freedom
## AIC: 180.43
## 
## Number of Fisher Scoring iterations: 9
glm.prob.boston2 = predict(glm.boston2, boston[-train, ], type = "response")
glm.pred.boston2 = rep(1, nrow(boston[-train,]))
glm.pred.boston2[glm.prob.boston2 < .5] = 0
table(glm.pred.boston2, boston[-train, 'highcrime'])
##                 
## glm.pred.boston2  0  1
##                0 61 10
##                1  8 73
print(1 - mean(glm.pred.boston2 == boston[-train, 'highcrime']))
## [1] 0.1184211

LDA, by contrast, performs slightly better with the chosen subset of predictors: the test error is 16.4%, compared with 17.1% when all predictors are used.

set.seed(42)
lda.boston1 = lda(formula1, data = boston, subset = train)
lda.boston1
## Call:
## lda(formula1, data = boston, subset = train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.519774 0.480226 
## 
## Group means:
##         nox      dis      rad      age     indus      tax
## 0 0.4717451 5.013656  4.01087 51.71141  7.056467 305.7663
## 1 0.6370941 2.480384 14.46471 85.64588 15.356882 504.5941
## 
## Coefficients of linear discriminants:
##                LD1
## nox    7.357924376
## dis   -0.053402428
## rad    0.084543742
## age    0.008785933
## indus  0.015816315
## tax   -0.001469737
lda.pred.boston1 = predict(lda.boston1, boston[-train, ])
table(lda.pred.boston1$class, boston[-train, 'highcrime'])
##    
##      0  1
##   0 66 22
##   1  3 61
1 - mean(lda.pred.boston1$class == boston[-train, 'highcrime'])
## [1] 0.1644737
set.seed(42)
lda.boston2 = lda(formula2, data = boston, subset = train)
lda.boston2
## Call:
## lda(formula2, data = boston, subset = train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.519774 0.480226 
## 
## Group means:
##         nox      dis      rad      age     indus      tax        zn       chas
## 0 0.4717451 5.013656  4.01087 51.71141  7.056467 305.7663 21.035326 0.03260870
## 1 0.6370941 2.480384 14.46471 85.64588 15.356882 504.5941  1.305882 0.08235294
##         rm  ptratio    black     lstat     medv
## 0 6.384060 18.05380 388.3198  9.341359 24.68696
## 1 6.220671 18.92882 323.2791 15.535059 20.71412
## 
## Coefficients of linear discriminants:
##                   LD1
## nox      8.2571533859
## dis      0.0613105160
## rad      0.0677918403
## age      0.0093078031
## indus    0.0271014719
## tax     -0.0004555629
## zn      -0.0057270353
## chas    -0.1519712801
## rm       0.1167760980
## ptratio  0.0146070536
## black   -0.0008069021
## lstat    0.0233850211
## medv     0.0455685064
lda.pred.boston2 = predict(lda.boston2, boston[-train, ])
table(lda.pred.boston2$class, boston[-train, 'highcrime'])
##    
##      0  1
##   0 64 21
##   1  5 62
1 - mean(lda.pred.boston2$class == boston[-train, 'highcrime'])
## [1] 0.1710526

QDA likewise performs noticeably better with the subset of predictors than with all of them: the test error is 9.9% with the subset versus 12.5% with the full set.

set.seed(42)
qda.boston1 = qda(formula1, data = boston, subset = train)
qda.boston1
## Call:
## qda(formula1, data = boston, subset = train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.519774 0.480226 
## 
## Group means:
##         nox      dis      rad      age     indus      tax
## 0 0.4717451 5.013656  4.01087 51.71141  7.056467 305.7663
## 1 0.6370941 2.480384 14.46471 85.64588 15.356882 504.5941
qda.pred.boston1 = predict(qda.boston1, boston[-train, ])
table(qda.pred.boston1$class, boston[-train, 'highcrime'])
##    
##      0  1
##   0 69 15
##   1  0 68
1 - mean(qda.pred.boston1$class == boston[-train, 'highcrime'])
## [1] 0.09868421
set.seed(42)
qda.boston2 = qda(formula2, data = boston, subset = train)
qda.boston2
## Call:
## qda(formula2, data = boston, subset = train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.519774 0.480226 
## 
## Group means:
##         nox      dis      rad      age     indus      tax        zn       chas
## 0 0.4717451 5.013656  4.01087 51.71141  7.056467 305.7663 21.035326 0.03260870
## 1 0.6370941 2.480384 14.46471 85.64588 15.356882 504.5941  1.305882 0.08235294
##         rm  ptratio    black     lstat     medv
## 0 6.384060 18.05380 388.3198  9.341359 24.68696
## 1 6.220671 18.92882 323.2791 15.535059 20.71412
qda.pred.boston2 = predict(qda.boston2, boston[-train, ])
table(qda.pred.boston2$class, boston[-train, 'highcrime'])
##    
##      0  1
##   0 65 15
##   1  4 68
1 - mean(qda.pred.boston2$class == boston[-train, 'highcrime'])
## [1] 0.125

KNN performed better with the full model than with the limited model. The full model had a test error rate of 7.9% when using k = 2 or 3. The limited model matched that 7.9% only at k = 8 and performed worse for all other values of k. As KNN seemed the most promising model for this dataset, I tuned the predictors to look for a better model: using only ‘nox’ with k = 1 or 2 reduced the test error rate to 3.3%! A confusion-matrix check for this best setting is sketched after the loops below.

x.train1 = boston[train, c('nox', 'dis', 'rad', 'age', 'indus', 'tax')]
x.test1 = boston[-train, c('nox', 'dis', 'rad', 'age', 'indus', 'tax')]

x.train2 = boston[train, 2:14]
x.test2 = boston[-train, 2:14]

x.train3 = as.data.frame(boston[train, 'nox'])
x.test3 = as.data.frame(boston[-train, 'nox'])

y.train = boston[train, 'highcrime']
y.test = boston[-train, 'highcrime'] 


for (i in 1:10) {
  set.seed(42)
  knn.pred.boston1 = knn(x.train1, x.test1, y.train, k = i)
  error = 1 - mean(knn.pred.boston1 == y.test)
  print(paste('I: ', i, '   Error: ', error))}
## [1] "I:  1    Error:  0.131578947368421"
## [1] "I:  2    Error:  0.118421052631579"
## [1] "I:  3    Error:  0.0855263157894737"
## [1] "I:  4    Error:  0.105263157894737"
## [1] "I:  5    Error:  0.0921052631578947"
## [1] "I:  6    Error:  0.0855263157894737"
## [1] "I:  7    Error:  0.0855263157894737"
## [1] "I:  8    Error:  0.0789473684210527"
## [1] "I:  9    Error:  0.0986842105263158"
## [1] "I:  10    Error:  0.0986842105263158"
for (i in 1:10) {
  set.seed(42)
  knn.pred.boston2 = knn(x.train2, x.test2, y.train, k = i)
  error = 1 - mean(knn.pred.boston2 == y.test)
  print(paste('I: ', i, '   Error: ', error))}
## [1] "I:  1    Error:  0.0855263157894737"
## [1] "I:  2    Error:  0.0789473684210527"
## [1] "I:  3    Error:  0.0789473684210527"
## [1] "I:  4    Error:  0.0986842105263158"
## [1] "I:  5    Error:  0.111842105263158"
## [1] "I:  6    Error:  0.0986842105263158"
## [1] "I:  7    Error:  0.125"
## [1] "I:  8    Error:  0.125"
## [1] "I:  9    Error:  0.125"
## [1] "I:  10    Error:  0.118421052631579"
for (i in 1:10) {
  set.seed(42)
  knn.pred.boston3 = knn(x.train3, x.test3, y.train, k = i)
  error = 1 - mean(knn.pred.boston3 == y.test)
  print(paste('I: ', i, '   Error: ', error))}
## [1] "I:  1    Error:  0.0328947368421053"
## [1] "I:  2    Error:  0.0328947368421053"
## [1] "I:  3    Error:  0.0460526315789473"
## [1] "I:  4    Error:  0.0921052631578947"
## [1] "I:  5    Error:  0.0921052631578947"
## [1] "I:  6    Error:  0.0986842105263158"
## [1] "I:  7    Error:  0.105263157894737"
## [1] "I:  8    Error:  0.105263157894737"
## [1] "I:  9    Error:  0.0723684210526315"
## [1] "I:  10    Error:  0.0789473684210527"

Conclusion

In conclusion, all of the models were able to predict effectively whether a neighborhood has above-median crime. KNN, however, was the most accurate: after tuning, it predicted with 96.7% accuracy by comparing the neighborhood of interest to the single known neighborhood with the most similar ‘nox’ level.