10. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR)
library(MASS)
data("Weekly")
  (a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260
pairs(Weekly)

cor(Weekly[, -9])
##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000
attach(Weekly)
plot(Volume)

There are a couple of things worth mentioning after exploring the dataset. My visual interpretation of the scatterplot matrix is that there are no distinguishable relationships between individual predictors, other than between Year and Volume. The correlation matrix says the same thing: aside from the Volume–Year pair, no correlation exceeds a magnitude of 0.1, which is consistent with how the scatterplot matrix turned out. Looking a little closer at Volume, trading volume appears to have increased over time.
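
To see the Volume trend more directly, here is a small sketch (assuming Weekly is loaded as above); lowess() just adds a smoothed trend line to the scatterplot:

# Plot weekly trading volume against year with a smoothed trend line
plot(Weekly$Year, Weekly$Volume, xlab = "Year", ylab = "Volume")
lines(lowess(Weekly$Year, Weekly$Volume), col = "red", lwd = 2)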

  (b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

The only statistically significant predictor appears to be Lag2, with a p-value of about 0.03.
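
One way to confirm this programmatically is to pull the coefficient table out of the summary; a small sketch using the fitted model above:

# Show only the terms with p-values below 0.05 (the intercept also qualifies)
coefs = summary(glm.fit)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]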

  (c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
probs = predict(glm.fit, type = "response")
pred.glm = rep("Down", length(probs))
pred.glm[probs > 0.5] <- "Up"
table(pred.glm, Direction)
##         Direction
## pred.glm Down  Up
##     Down   54  48
##     Up    430 557
# Compute the fraction of days for which the prediction was correct
mean(pred.glm == Direction)
## [1] 0.5610652
# Compute misclassification rate
478/1089
## [1] 0.4389348
# Calculating the percent of times upward movement was predicted correctly
557/(557+48)
## [1] 0.9206612
# Calculating the percent of times downward movement was predicted correctly
54/(54+430)
## [1] 0.1115702

Looking at the confusion matrix, the model predicted the direction of the market correctly about 56.1% of the time, while the error rate was about 43.9%. The model predicted upward movement correctly about 92.1% of the time, but downward movement only about 11.2% of the time. Upward and downward movement here refer to weeks in which the stock market goes Up or Down.
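
The percentages above can also be computed directly from the confusion matrix. A small helper sketch (up_recall and down_recall are my own labels for the fraction of Up and Down weeks classified correctly):

# Accuracy plus per-class recall from a 2x2 confusion table
conf_rates = function(tab) {
  c(accuracy = sum(diag(tab)) / sum(tab),
    up_recall = tab["Up", "Up"] / sum(tab[, "Up"]),
    down_recall = tab["Down", "Down"] / sum(tab[, "Down"]))
}
conf_rates(table(pred.glm, Direction))

The same helper applies to the held-out confusion matrices in the parts that follow.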

  (d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
train = Weekly[Weekly$Year <= 2008, ]
test = Weekly[Weekly$Year > 2008, ]
testdirect = Weekly$Direction[Weekly$Year>2008]
glm.fit2 <- glm(Direction ~ Lag2, data = train, family = "binomial")
summary(glm.fit2)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
#########
prob = predict(glm.fit2, type="response", newdata = test)
pred2 = rep("Down", length(prob))
pred2[prob>0.5] = "Up"
table(pred2, testdirect)
##       testdirect
## pred2  Down Up
##   Down    9  5
##   Up     34 56
# Compute the fraction of days for which the prediction was correct
mean(pred2 == testdirect)
## [1] 0.625
# Compute misclassification rate
39/104
## [1] 0.375
# Calculating the percent of times upward movement was predicted correctly
56/61
## [1] 0.9180328
# Calculating the percent of times downward movement was predicted correctly
9/43
## [1] 0.2093023

This time the percentage of correct predictions on the test data is 62.5%, which makes the test error rate about 37.5%. Compared to the confusion matrix from fitting logistic regression on the full data set (glm.fit), this model (glm.fit2) correctly predicted a higher percentage of outcomes. For weeks when the market goes up, the model is right 91.8% of the time; for weeks when the market goes down, it is right only 20.9% of the time.

  (e) Repeat (d) using LDA.
lda_fit = lda(Direction~Lag2, data = train)
lda_fit
## Call:
## lda(Direction ~ Lag2, data = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
## 
## Coefficients of linear discriminants:
##            LD1
## Lag2 0.4414162
lda_pred = predict(lda_fit, type="response", newdata = test)
names(lda_pred)
## [1] "class"     "posterior" "x"
ldaclass = lda_pred$class
table(ldaclass, testdirect)
##         testdirect
## ldaclass Down Up
##     Down    9  5
##     Up     34 56

The output this time is identical to the logistic regression results. The percentage of correct predictions on the test data is 62.5%, which makes the test error rate about 37.5%. For weeks when the market goes up, the model is right 91.8% of the time; for weeks when the market goes down, it is right 20.9% of the time.
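
Given that the two confusion matrices match exactly, the class predictions themselves should agree week by week; a quick check (this should return 1 if they are identical):

# Fraction of test weeks where LDA and logistic regression agree
mean(as.character(ldaclass) == pred2)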

  (f) Repeat (d) using QDA.
qda_fit = qda(Direction~Lag2, data = train)
qda_fit
## Call:
## qda(Direction ~ Lag2, data = train)
## 
## Prior probabilities of groups:
##      Down        Up 
## 0.4477157 0.5522843 
## 
## Group means:
##             Lag2
## Down -0.03568254
## Up    0.26036581
qda_pred = predict(qda_fit, type="response", newdata = test)
names(qda_pred)
## [1] "class"     "posterior"
qdaclass = qda_pred$class
table(qdaclass, testdirect)
##         testdirect
## qdaclass Down Up
##     Down    0  0
##     Up     43 61
# Compute the fraction of days for which the prediction was correct
61/104
## [1] 0.5865385
# Compute misclassification rate
43/104
## [1] 0.4134615

This time the output is NOT the same as the previous two methods: QDA classifies every test week as Up. The percentage of correct predictions on the test data is 58.7%, which makes the test error rate about 41.3%. For weeks when the market goes up, the model is right 100% of the time; for weeks when the market goes down, it is right 0% of the time.
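
Since QDA classifies every test week as Up, its posterior probability of Up must exceed 0.5 for all 104 weeks; a quick check of that (sketch):

# Smallest posterior probability QDA assigns to "Up" on the test set
min(qda_pred$posterior[, "Up"])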

  (g) Repeat (d) using KNN with K = 1.
library(class)
training1 = cbind(train$Lag2)
# Note: cbind() coerces the Direction factor to its numeric codes (1 = Down, 2 = Up),
# which is why the confusion matrix below is labeled 1 and 2 rather than Down and Up
training2 = cbind(train$Direction)
testing1 = cbind(test$Lag2)
set.seed(1)
pred_K = knn(training1, testing1, training2, k=1)
table(pred_K, testdirect)
##       testdirect
## pred_K Down Up
##      1   21 30
##      2   22 31

The percentage of correct predictions on the test data is 50%, so the test error rate is also 50%. For weeks when the market goes up, the model is right 50.8% of the time; for weeks when the market goes down, 48.8% of the time.
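
As noted above, the 1/2 row labels come from cbind() converting the Direction factor into numeric codes. Passing the factor directly keeps readable labels; with the same seed this sketch should reproduce the same counts:

set.seed(1)
# Supply the class labels as a factor so knn() returns "Down"/"Up"
pred_K_lab = knn(training1, testing1, train$Direction, k = 1)
table(pred_K_lab, testdirect)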

  (h) Which of these methods appears to provide the best results on this data?

LDA and logistic regression seemed to provide the best results, as they had the lowest misclassification rates and the highest rates of correct classification.

  (i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Logistic regression experiment

#adding interaction term to logistic regression example.
INTXNglm.fit = glm(Direction ~ Lag1 + Lag3*Lag4, data = train, family = binomial)
summary(INTXNglm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag3 * Lag4, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.524  -1.251   1.011   1.090   1.559  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.2284079  0.0649830   3.515  0.00044 ***
## Lag1        -0.0604336  0.0291941  -2.070  0.03845 *  
## Lag3        -0.0004513  0.0305567  -0.015  0.98822    
## Lag4        -0.0236111  0.0294339  -0.802  0.42245    
## Lag3:Lag4    0.0124882  0.0081777   1.527  0.12674    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1347.3  on 980  degrees of freedom
## AIC: 1357.3
## 
## Number of Fisher Scoring iterations: 4
#getting probabilities and table
probb = predict(INTXNglm.fit, type="response", newdata = test)
predd = rep("Down", length(probb))
predd[probb>0.5] = "Up"
table(predd, testdirect)
##       testdirect
## predd  Down Up
##   Down    7  6
##   Up     36 55
#error rate
42/104
## [1] 0.4038462

LDA experiment

lda_fit2 = lda(Direction~Lag1 + Lag3*Lag4, data = train)
lda_pred2 = predict(lda_fit2, type="response", newdata = test)
ldaclass2 = lda_pred2$class
table(ldaclass2, testdirect)
##          testdirect
## ldaclass2 Down Up
##      Down    6  6
##      Up     37 55
#error rate
43/104
## [1] 0.4134615

QDA experiment

qda_fit2 = qda(Direction~Lag1 + Lag3*Lag4, data = train)
qda_pred2 = predict(qda_fit2, type="response", newdata = test)
qdaclass2 = qda_pred2$class
table(qdaclass2, testdirect)
##          testdirect
## qdaclass2 Down Up
##      Down   24 31
##      Up     19 30
#error rate
50/104
## [1] 0.4807692

KNN runs with k = 50, k = 20, and k = 4

training3 = cbind(train$Lag4)
training4 = cbind(train$Direction)
testing2 = cbind(test$Lag4)


set.seed(1)
pred_K50 = knn(training3, testing2, training4, k=50)
table(pred_K50, testdirect)
##         testdirect
## pred_K50 Down Up
##        1   18 16
##        2   25 45
#Error Rate k = 50
41/104
## [1] 0.3942308
set.seed(1)
pred_K20 = knn(training3, testing2, training4, k=20)
table(pred_K20, testdirect)
##         testdirect
## pred_K20 Down Up
##        1   19 19
##        2   24 42
#Error Rate k = 20
43/104
## [1] 0.4134615
set.seed(1)
pred_K4 = knn(training3, testing2, training4, k=4)
table(pred_K4, testdirect)
##        testdirect
## pred_K4 Down Up
##       1   20 22
##       2   23 39
#Error Rate k = 4
45/104
## [1] 0.4326923

After experimenting with a couple of different models and methods, the lowest test error rate came from KNN on Lag4 with k = 50, at about 39.4%. The logistic regression on Lag1 + Lag3 * Lag4 was a close second, with an error rate of about 40.4%.
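
Rather than trying a handful of k values by hand, one could scan a grid; a sketch using the Lag4 matrices above (this time passing the Direction factor directly as the class labels):

# Test error rate for KNN on Lag4 over a grid of k values
ks = c(1, 2, 5, 10, 20, 50, 100)
errs = sapply(ks, function(k) {
  set.seed(1)  # reseed before each run, mirroring the pattern above
  mean(knn(training3, testing2, train$Direction, k = k) != testdirect)
})
setNames(errs, ks)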

11. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

  (a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
attach(Auto)
mpg01 <- rep(0, length(mpg))
mpg01[mpg > median(mpg)] <- 1
Auto <- data.frame(Auto, mpg01)
  (b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
pairs(Auto)

cor(Auto[-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000
par(mfrow=c(2,2))

boxplot(cylinders ~ mpg01, data = Auto, main = "cylinders vs mpg01")
boxplot(displacement ~ mpg01, data = Auto, main = "displacement vs mpg01")
boxplot(horsepower ~ mpg01, data = Auto, main = "horsepower vs mpg01")

Looking at the scatterplot matrix and correlation matrix, I noticed a few variables whose correlations with mpg01 were of high enough magnitude to be worth a closer look: cylinders, displacement, and horsepower (weight is comparably strong as well).
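
A compact way to rank the candidates is to sort the correlations with mpg01 by magnitude; a sketch using the matrix above (mpg and mpg01 themselves top the list, of course):

# Predictors ordered by |correlation| with mpg01
sort(abs(cor(Auto[-9])[, "mpg01"]), decreasing = TRUE)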

  (c) Split the data into a training set and a test set.
train <- (year %% 2 == 0)
Autotrain <- Auto[train, ]
Autotest <- Auto[!train, ]
mpg01test <- mpg01[!train]
  (d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
fit.lda = lda(mpg01 ~ cylinders + displacement + horsepower, data = Autotrain)
pred.lda = predict(fit.lda, Autotest)
table(pred.lda$class, mpg01test)
##    mpg01test
##      0  1
##   0 88 10
##   1 12 72
#Error rate
22/182
## [1] 0.1208791

The test error rate appears to be about 12.1% in this case.

  (e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
fit.qda = qda(mpg01 ~ cylinders + displacement + horsepower, data = Autotrain)
pred.qda = predict(fit.qda, Autotest)
table(pred.qda$class, mpg01test)
##    mpg01test
##      0  1
##   0 88 11
##   1 12 71
#Error rate
23/182
## [1] 0.1263736

The test error rate appears to be about 12.6% in this case.

  (f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
fit.glm = glm(mpg01 ~ cylinders + displacement + horsepower, data = Autotrain, family = binomial)
autoprobs = predict(fit.glm, Autotest, type = "response")
autopreds = rep(0, length(autoprobs))
autopreds[autoprobs > 0.5] = 1
table(autopreds, mpg01test)
##          mpg01test
## autopreds  0  1
##         0 88  9
##         1 12 73
#error rate
21/182
## [1] 0.1153846

The test error rate appears to be about 11.5% in this case.

  (g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
train.X = cbind(cylinders, displacement, horsepower)[train, ]
test.X = cbind(cylinders, displacement, horsepower)[!train, ]
train.mpg01 = mpg01[train]


set.seed(1)
pred_knn1 = knn(train.X, test.X, train.mpg01, k = 1)
table(pred_knn1, mpg01test)
##          mpg01test
## pred_knn1  0  1
##         0 90 12
##         1 10 70
#Error Rate k = 1
22/182
## [1] 0.1208791
set.seed(1)
pred_knn5 = knn(train.X, test.X, train.mpg01, k = 5)
table(pred_knn5, mpg01test)
##          mpg01test
## pred_knn5  0  1
##         0 88 13
##         1 12 69
#Error Rate k = 5
25/182
## [1] 0.1373626
set.seed(1)
pred_knn15 = knn(train.X, test.X, train.mpg01, k = 15)
table(pred_knn15, mpg01test)
##           mpg01test
## pred_knn15  0  1
##          0 84  5
##          1 16 77
#Error Rate k = 15
21/182
## [1] 0.1153846
set.seed(1)
pred_knn40 = knn(train.X, test.X, train.mpg01, k = 40)
table(pred_knn40, mpg01test)
##           mpg01test
## pred_knn40  0  1
##          0 84  4
##          1 16 78
#Error Rate k = 40
20/182
## [1] 0.1098901

The test error rates are given for four different values of k. For k = 1, the error rate was about 12.1%. For k = 5, the test error rate was about 13.7%. For k = 15, the error rate was about 11.5%. For k = 40 we get the smallest error rate, about 11%, so with this information I would conclude that KNN with k = 40 performed the best overall.
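
The same grid-search idea from earlier applies here; a sketch that computes the test error over a range of k values in one pass:

# Test error rate for each candidate k
ks = c(1, 5, 15, 40, 100)
knn_err = sapply(ks, function(k) {
  set.seed(1)
  mean(knn(train.X, test.X, train.mpg01, k = k) != mpg01test)
})
setNames(knn_err, ks)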

13. Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

First I’d like to see which variables appear significant in a logistic regression model using all the predictors, aside from crim and my newly created variable crime1. After that, I’ll compute a confusion matrix to keep and compare against the rest of my confusion matrices. Then I’ll repeat the same process using only the variables from the full model that had significant p-values.

attach(Boston)
crime1 = rep(0, length(crim))
crime1[crim > median(crim)] <- 1
Bdf = data.frame(Boston, crime1)



train = 1:(length(crim) / 2)
test = (length(crim) / 2 + 1):length(crim)
Btrain = Bdf[train, ]
Btest = Bdf[test, ]
crim01test = crime1[test]

crime.fglm = glm(crime1~. - crime1 - crim, data = Bdf, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(crime.fglm)
## 
## Call:
## glm(formula = crime1 ~ . - crime1 - crim, family = binomial, 
##     data = Bdf, subset = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.83229  -0.06593   0.00000   0.06181   2.61513  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -91.319906  19.490273  -4.685 2.79e-06 ***
## zn           -0.815573   0.193373  -4.218 2.47e-05 ***
## indus         0.354172   0.173862   2.037  0.04164 *  
## chas          0.167396   0.991922   0.169  0.86599    
## nox          93.706326  21.202008   4.420 9.88e-06 ***
## rm           -4.719108   1.788765  -2.638  0.00833 ** 
## age           0.048634   0.024199   2.010  0.04446 *  
## dis           4.301493   0.979996   4.389 1.14e-05 ***
## rad           3.039983   0.719592   4.225 2.39e-05 ***
## tax          -0.006546   0.007855  -0.833  0.40461    
## ptratio       1.430877   0.359572   3.979 6.91e-05 ***
## black        -0.017552   0.006734  -2.606  0.00915 ** 
## lstat         0.190439   0.086722   2.196  0.02809 *  
## medv          0.598533   0.185514   3.226  0.00125 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 329.367  on 252  degrees of freedom
## Residual deviance:  69.568  on 239  degrees of freedom
## AIC: 97.568
## 
## Number of Fisher Scoring iterations: 10
#error rate for full glm model
Bprobs <- predict(crime.fglm, Btest, type = "response")
Bpreds <- rep(0, length(Bprobs))
Bpreds[Bprobs > 0.5] <- 1
table(Bpreds, crim01test)
##       crim01test
## Bpreds   0   1
##      0  68  24
##      1  22 139
#error rate
46/253
## [1] 0.1818182

Running the first logistic regression model we generate an error rate of about 18.2%. (The warning that fitted probabilities numerically 0 or 1 occurred indicates that some training observations are almost perfectly separated by the predictors.)

Now to create a logistic regression model using only the significant terms. Once it is fit, I’ll generate the next confusion matrix to get the error rate.

crime.fglm2 = glm(crime1~. - crime1 - crim - chas - tax, data = Bdf, family = binomial, subset = train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Bprobs2 <- predict(crime.fglm2, Btest, type = "response")
Bpreds2 <- rep(0, length(Bprobs2))
Bpreds2[Bprobs2 > 0.5] <- 1
table(Bpreds2, crim01test)
##        crim01test
## Bpreds2   0   1
##       0  67  24
##       1  23 139
#error rate
47/253
## [1] 0.1857708

This logistic regression model, containing all the significant predictor variables from crime.fglm, ended up giving an error rate of about 18.6%. Not much different from the first model, but slightly worse.
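
Since crime.fglm2 is nested inside crime.fglm, the two fits can also be compared formally; a sketch using a likelihood-ratio test:

# Likelihood-ratio test comparing the reduced and full logistic models
anova(crime.fglm2, crime.fglm, test = "Chisq")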

I’m wondering if LDA will produce a model that predicts at a lower error rate when applied to the same set of predictor variables as crime.fglm2. Let’s see how it works.

crimefit.lda = lda(crime1 ~ . - crime1 - crim - chas - tax, data = Bdf, subset = train)
crimepred.lda = predict(crimefit.lda, Btest)
table(crimepred.lda$class, crim01test)
##    crim01test
##       0   1
##   0  80  21
##   1  10 142
#error rate
31/253
## [1] 0.1225296

The LDA method gave us a lower error rate than the previous two runs, this time about 12.3%.

I’m going to see how the KNN method does when predicting on the same set of variables used in crime.fglm2 and crimefit.lda. Then I will experiment with a few different values of K to see if I can find a lower error rate.

crimetrain = cbind(zn, indus, nox, rm, age, dis, rad, ptratio, black, lstat, medv)[train, ]
crimetest = cbind(zn, indus, nox, rm, age, dis, rad, ptratio, black, lstat, medv)[test, ]
traincrime = crime1[train]
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 1)
table(Cpredk, traincrime)
##       traincrime
## Cpredk   0   1
##      0 116  49
##      1  47  41
#error rate
96/253
## [1] 0.3794466
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 2)
table(Cpredk, traincrime)
##       traincrime
## Cpredk   0   1
##      0 101  41
##      1  62  49
#error rate
103/253
## [1] 0.4071146
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 4)
table(Cpredk, traincrime)
##       traincrime
## Cpredk  0  1
##      0 99 34
##      1 64 56
#error rate
98/253
## [1] 0.3873518
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 8)
table(Cpredk, traincrime)
##       traincrime
## Cpredk   0   1
##      0 102  32
##      1  61  58
#error rate
93/253
## [1] 0.3675889
set.seed(1)
Cpredk = knn(crimetrain, crimetest, traincrime, k = 16)
table(Cpredk, traincrime)
##       traincrime
## Cpredk  0  1
##      0 99 34
##      1 64 56
#error rate
98/253
## [1] 0.3873518

I ended up using K = 1, 2, 4, 8, and 16. The error rates vary between about 36.8% (k = 8) and about 40.7% (k = 2), much higher than those from the other methods. One important caveat: the tables above compare the test-set predictions against the training labels (traincrime) rather than against crim01test, so these numbers are not true test error rates.
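
A corrected sketch of the k = 1 run, comparing against crim01test and standardizing the predictors first (KNN is sensitive to variable scales):

# Standardize using the training means and standard deviations
crimetrain.s = scale(crimetrain)
crimetest.s = scale(crimetest, center = attr(crimetrain.s, "scaled:center"),
                    scale = attr(crimetrain.s, "scaled:scale"))
set.seed(1)
Cpredk1 = knn(crimetrain.s, crimetest.s, traincrime, k = 1)
# True test error rate
mean(Cpredk1 != crim01test)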

Across the experiments run above on the Boston data set, the best method for predicting whether a given suburb has a crime rate above or below the median was LDA, with an error rate of about 12.3%.