Assignment 3 - Chapter 4

Chapter 4, #13,14,16

Question 13

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

library(ISLR2)
attach(Weekly)

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

pairs(Weekly)

cor(Weekly[,-9])

##               Year         Lag1        Lag2        Lag3         Lag4
## Year    1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1   -0.03228927  1.000000000 -0.07485305  0.05863568 -0.071273876
## Lag2   -0.03339001 -0.074853051  1.00000000 -0.07572091  0.058381535
## Lag3   -0.03000649  0.058635682 -0.07572091  1.00000000 -0.075395865
## Lag4   -0.03112792 -0.071273876  0.05838153 -0.07539587  1.000000000
## Lag5   -0.03051910 -0.008183096 -0.07249948  0.06065717 -0.075675027
## Volume  0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today  -0.03245989 -0.075031842  0.05916672 -0.07124364 -0.007825873
##                Lag5      Volume        Today
## Year   -0.030519101  0.84194162 -0.032459894
## Lag1   -0.008183096 -0.06495131 -0.075031842
## Lag2   -0.072499482 -0.08551314  0.059166717
## Lag3    0.060657175 -0.06928771 -0.071243639
## Lag4   -0.075675027 -0.06107462 -0.007825873
## Lag5    1.000000000 -0.05851741  0.011012698
## Volume -0.058517414  1.00000000 -0.033077783
## Today   0.011012698 -0.03307778  1.000000000

For year, there are no strong correlations with Lag1-5; however, there appears to be an increase of variance (similar to a cone). There is a strong correlation with Year and Volume, and this appears to be non-linear.

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

fullWeekly.fit <- glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family=binomial)
summary(fullWeekly.fit)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Lag2 is the only statistically significant predictors for this model, with a P-value of 0.0296.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

glm.probs=predict(fullWeekly.fit,type="response")
glm.pred=rep("Down", length(glm.probs))
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction)

##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

mean(glm.pred == Direction)

## [1] 0.5610652

Correct predictions include 557 Up and 54 Down, out of 1089 total, for a total correct percentage of (557+54)/1089 = 0.5610652. This means that the model overall predicted its direction 56.1% of the time.

Looking at how the model correctly predicts Up, we take the amount of correctly predicted 557 Up, and divide it by the total amount of Up that occured: 557/(557+48) = 0.9206612. This shows that when predicting Up, they did so with a 92% correct rate.

Looking at how the model correctly predicts Down, we take the amount of correctly predicted 54 Down, and divide it by the total amount of Down that occured: 54/(54+430) = 0.1115702. This shows that when predicting Down, they did so with an 11% correct rate.

This model is overall best for predicting when the market will go up, not down.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

train.Weekly <- (Year<2009)
weekly.2009.2010 <- Weekly[!train.Weekly,]
lag2.fit <- glm(Direction ~ Lag2, data=Weekly, family=binomial, subset=train.Weekly)
weekly.prob <- predict(lag2.fit, weekly.2009.2010, type='response')
weekly.pred=rep("Down",length(weekly.prob))
weekly.pred[weekly.prob>.5]="Up"
Direction.2009.2010 <- Direction[!train.Weekly]
table(weekly.pred,Direction.2009.2010)

##            Direction.2009.2010
## weekly.pred Down Up
##        Down    9  5
##        Up     34 56

mean(weekly.pred == Direction.2009.2010)

## [1] 0.625

Overall, this model predicts with a 62.5% accuracy.

Correct predictions include 56 Up and 9 Down, out of 104 total, for a total correct percentage of (56+9)/104 = 0.625. This means that the model overall predicted its weekly direction 62.5% of the time.

Looking at how the model correctly predicts Up, we take the amount of correctly predicted 56 Up, and divide it by the total amount of Up that occured: 56/(56+5) = 0.9180328. This shows that when predicting Up, they did so with a 92% correct rate.

Looking at how the model correctly predicts Down, we take the amount of correctly predicted 9 Down, and divide it by the total amount of Down that occured: 9/(9+34) = 0.2093023. This shows that when predicting Down, they did so with an 21% correct rate.

This model is overall best for predicting when the market will go up, not down.

(e) Repeat (d) using LDA.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:ISLR2':
## 
##     Boston

lda.lag2.fit <- lda(Direction ~ Lag2, data=Weekly, subset=train.Weekly)
lda.weekly.prob <- predict(lda.lag2.fit, weekly.2009.2010, type='response')
lda.weekly.pred=predict(lda.lag2.fit,weekly.2009.2010)
lda.Direction.2009.2010 <- Direction[!train.Weekly]
lda.weekly.class=lda.weekly.pred$class
table(lda.weekly.class,lda.Direction.2009.2010)

##                 lda.Direction.2009.2010
## lda.weekly.class Down Up
##             Down    9  5
##             Up     34 56

mean(lda.weekly.class == Direction.2009.2010)

## [1] 0.625

There is no difference between the LDA and GLM models

(f) Repeat (d) using QDA.

qda.lag2.fit=qda(Direction~Lag2,data=Weekly,subset=train.Weekly)
qda.weekly.class=predict(qda.lag2.fit,weekly.2009.2010)$class
table(qda.weekly.class,lda.Direction.2009.2010)

##                 lda.Direction.2009.2010
## qda.weekly.class Down Up
##             Down    0  0
##             Up     43 61

mean(qda.weekly.class==lda.Direction.2009.2010)

## [1] 0.5865385

The QDA model predicts the Direction of the market with a 58.6% accuracy, which is less than the previous models, and only predicts when the direction is going Up, not Down

(g) Repeat (d) using KNN with K = 1.

library(class)
train.X <- as.matrix(Lag2[train.Weekly])
test.X <- as.matrix(Lag2[!train.Weekly])
train.Direction=Direction[train.Weekly]
set.seed(1)
knn.pred=knn(as.data.frame(train.X),as.data.frame(test.X),train.Direction,k=1)
table(knn.pred,Direction.2009.2010)

##         Direction.2009.2010
## knn.pred Down Up
##     Down   21 30
##     Up     22 31

mean(knn.pred == Direction.2009.2010)

## [1] 0.5

The KNN accuracy is 50%, which is lower than the previous models.

It predicts Up with an accuracy of 51% (31/(31+30) = 0.5081967)) and down with an accuracy of 49% (21/(21+22) = 0.4883721)).

The overall model and each of the predictors is no more than a coin flip.

(h) Repeat (d) using naive Bayes.

library(e1071)
nb.fit <- naiveBayes(Direction~Lag2, data=Weekly, subset = train.Weekly)
nb.class <- predict(nb.fit , weekly.2009.2010)
table (nb.class, Direction.2009.2010)

##         Direction.2009.2010
## nb.class Down Up
##     Down    0  0
##     Up     43 61

mean (nb.class == Direction.2009.2010)

## [1] 0.5865385

This model is the same as question (f) above

(i) Which of these methods appears to provide the best results on this data?

The GLM and LDA models have the best result, with 62.5% total accuracy, and 92% accuracy when predicting the market going Up

(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

knn.pred2=knn(as.data.frame(train.X),as.data.frame(test.X),train.Direction,k=2)
table(knn.pred2,Direction.2009.2010)

##          Direction.2009.2010
## knn.pred2 Down Up
##      Down   18 25
##      Up     25 36

mean(knn.pred2 == Direction.2009.2010)

## [1] 0.5192308

knn.pred3=knn(as.data.frame(train.X),as.data.frame(test.X),train.Direction,k=2)
table(knn.pred3,Direction.2009.2010)

##          Direction.2009.2010
## knn.pred3 Down Up
##      Down   22 29
##      Up     21 32

mean(knn.pred3 == Direction.2009.2010)

## [1] 0.5192308

knn.pred10=knn(as.data.frame(train.X),as.data.frame(test.X),train.Direction,k=2)
table(knn.pred10,Direction.2009.2010)

##           Direction.2009.2010
## knn.pred10 Down Up
##       Down   18 24
##       Up     25 37

mean(knn.pred10 == Direction.2009.2010)

## [1] 0.5288462

As K increases, the results of KNN increase overall as a model

detach(Weekly)

Question 14

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

attach(Auto)
median(mpg)

## [1] 22.75

Auto$mpg01 <- rep(0,length(mpg))
Auto$mpg01[mpg>=median(mpg)]=1
Auto$mpg01 <- as.factor(Auto$mpg01)
Auto[13:17,]

##    mpg cylinders displacement horsepower weight acceleration year origin
## 13  15         8          400        150   3761          9.5   70      1
## 14  14         8          455        225   3086         10.0   70      1
## 15  24         4          113         95   2372         15.0   70      3
## 16  22         6          198         95   2833         15.5   70      1
## 17  18         6          199         97   2774         15.5   70      1
##                       name mpg01
## 13   chevrolet monte carlo     0
## 14 buick estate wagon (sw)     0
## 15   toyota corona mark ii     1
## 16         plymouth duster     0
## 17              amc hornet     0

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

pairs(Auto)

boxplot(Auto)

Eyeballing the mpg01 scatterplots, it looks like horsepower, weight, and acceleration are going to be the features best used to predict mpg01.

(c) Split the data into a training set and a test set.

train.Auto.70s <- (year<80)
test.Auto.80s <- Auto[!train.Auto.70s,]

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

library(MASS)
lda.Auto.fit <- lda(mpg01~horsepower+weight+acceleration, data=Auto, subset=train.Auto.70s)
lda.Auto.prob <- predict(lda.Auto.fit, test.Auto.80s, type='response')
lda.Auto.pred <- predict(lda.Auto.fit, test.Auto.80s)
lda.mpg01.Auto80s <- Auto$mpg01[!train.Auto.70s]
lda.Auto.class <- lda.Auto.pred$class
table(lda.Auto.class,lda.mpg01.Auto80s)

##               lda.mpg01.Auto80s
## lda.Auto.class  0  1
##              0  4 14
##              1  1 66

mean(lda.Auto.class == lda.mpg01.Auto80s)

## [1] 0.8235294

This LDA model was built to predict mpg01 with horsepower, weight, and acceleration as predictors. 10 years of data (1970-1979) were used to predict the results of the remaining 3 years (1980-1982). This resulted in a model that was accurate 82.3% of the time, with accuracy on predicting above-median mpg (mpg01 = 1) of 82.5% (66/(66+14) = 0.825), and predicting below-median mpg (mpg01 = 0) of 80% (4/(4+1) = 0.8).

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

qda.Auto.fit <- qda(mpg01~horsepower+weight+acceleration, data=Auto, subset=train.Auto.70s)
qda.Auto.class <- predict(qda.Auto.fit, test.Auto.80s)$class
table(qda.Auto.class, lda.mpg01.Auto80s)

##               lda.mpg01.Auto80s
## qda.Auto.class  0  1
##              0  5 11
##              1  0 69

mean(qda.Auto.class==lda.mpg01.Auto80s)

## [1] 0.8705882

This QDA model was built using the same requirements from Question 14e. This resulted in a model that was accurate 87% of the time, with accuracy on predicting above-median mpg (mpg01 = 1) of 82.5% (69/(69+11) = 0.8625), and predicting below-median mpg (mpg01 = 0) of 100% (5/(5+0) = 1).

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

glm.Auto.fit <- glm(Auto$mpg01~horsepower+weight+acceleration, data=as.data.frame(train.Auto.70s), family=binomial)
glm.Auto.probs <- predict(glm.Auto.fit, test.Auto.80s, type="response")
glm.Auto.pred <- rep(0, length(glm.Auto.probs))
glm.Auto.pred[glm.Auto.probs > 0.5] <- 1
table(glm.Auto.pred, Auto$mpg01[!train.Auto.70s])

##              
## glm.Auto.pred  0  1
##             0  5 12
##             1  0 68

mean(glm.Auto.pred == Auto$mpg01[!train.Auto.70s])

## [1] 0.8588235

This GLM model was built using the same requirements from Question 14e. This resulted in a model that was accurate 85.9% of the time, with accuracy on predicting above-median mpg (mpg01 = 1) of 85% (68/(68+12) = 0.85), and predicting below-median mpg (mpg01 = 0) of 100% (5/(5+0) = 1).

(g) Perform naive Bayes on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

library(e1071)
nb.Auto.fit <- naiveBayes(Auto$mpg01~horsepower+weight+acceleration, data=Auto, subset = train.Auto.70s)
nb.Auto.class <- predict(nb.Auto.fit, test.Auto.80s)
table(nb.Auto.class, lda.mpg01.Auto80s)

##              lda.mpg01.Auto80s
## nb.Auto.class  0  1
##             0  4  7
##             1  1 73

mean(nb.Auto.class==lda.mpg01.Auto80s)

## [1] 0.9058824

This naive Bayes model was built using the same requirements from Question 14e. This resulted in a model that was accurate 90.5% of the time, with accuracy on predicting above-median mpg (mpg01 = 1) of 90.5% (73/(7+73) = 0.9058824), and predicting below-median mpg (mpg01 = 0) of 80% (4/(4+1) = 0.8).

(h) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

library(class)
train.Auto.KNN.x <- Auto[train.Auto.70s, c(4:6)]
test.Auto.KNN.x <- Auto[!train.Auto.70s, c(4:6)]
train.mpg01 <- Auto$mpg01[train.Auto.70s]
set.seed(1)
knn.Auto.pred <- knn(as.data.frame(train.Auto.KNN.x), as.data.frame(test.Auto.KNN.x), train.mpg01, k=1)
table(knn.Auto.pred,lda.mpg01.Auto80s)

##              lda.mpg01.Auto80s
## knn.Auto.pred  0  1
##             0  5 19
##             1  0 61

mean(knn.Auto.pred == lda.mpg01.Auto80s)

## [1] 0.7764706

This KNN model was built using the same requirements from Question 14e. This resulted in a model that was accurate 90.5% of the time, with accuracy on predicting above-median mpg (mpg01 = 1) of 76.3% (61/(61+19) = 0.7625), and predicting below-median mpg (mpg01 = 0) of 100% (5/(5+0) = 1).

detach(Auto)

Question 16

Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.

Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set

Create new variable: median_crime

Once the median of crime is found (0.25651), a new column is created with 0 entered. Then, if crim is greater than the median, the med_crim is updated to show 1

library(ISLR2)
attach(Boston)
head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

median(crim)

## [1] 0.25651

Boston$med_crim <- rep(0,length(crim))
Boston$med_crim[crim>=median(crim)]=1
table(Boston$med_crim)

## 
##   0   1 
## 253 253

Utilizing table(med_crim) we see that there are an equal amount of observations above and below the median, so this works!

Explore the data

Next we need to explore the data. Using scatter plots, we can look to see what variables may be related to med_crim

pairs(Boston)

cor(Boston)

##                 crim          zn       indus         chas         nox
## crim      1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn       -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus     0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas     -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox       0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm       -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age       0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis      -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad       0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax       0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio   0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black    -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat     0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv     -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## med_crim  0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                   rm         age         dis          rad         tax
## crim     -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn        0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus    -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas      0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox      -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm        1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age      -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis       0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad      -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax      -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio  -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black     0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat    -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv      0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
## med_crim -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128
##             ptratio       black      lstat       medv    med_crim
## crim      0.2899456 -0.38506394  0.4556215 -0.3883046  0.40939545
## zn       -0.3916785  0.17552032 -0.4129946  0.3604453 -0.43615103
## indus     0.3832476 -0.35697654  0.6037997 -0.4837252  0.60326017
## chas     -0.1215152  0.04878848 -0.0539293  0.1752602  0.07009677
## nox       0.1889327 -0.38005064  0.5908789 -0.4273208  0.72323480
## rm       -0.3555015  0.12806864 -0.6138083  0.6953599 -0.15637178
## age       0.2615150 -0.27353398  0.6023385 -0.3769546  0.61393992
## dis      -0.2324705  0.29151167 -0.4969958  0.2499287 -0.61634164
## rad       0.4647412 -0.44441282  0.4886763 -0.3816262  0.61978625
## tax       0.4608530 -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio   1.0000000 -0.17738330  0.3740443 -0.5077867  0.25356836
## black    -0.1773833  1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat     0.3740443 -0.36608690  1.0000000 -0.7376627  0.45326273
## medv     -0.5077867  0.33346082 -0.7376627  1.0000000 -0.26301673
## med_crim  0.2535684 -0.35121093  0.4532627 -0.2630167  1.00000000

There are 6 variables that are correlated over +-0.6 to med_crim: nox, rad, dis, age, tax, and indus.

Now that we have some variables to isolate, we can begin working on training our data and running models.

Split the data

I opted to use the top 70% of the Boston data as training data, and the remaining 30% as the test data. (I had originally attempted to do a random split of 70/30 but after 2 days of failing to get the KNN work, I opted the easier route or non-random. My apologies if this makes anybody sad).

train.Boston <- Boston[(1:(506*.7)),]
test.Boston <- Boston[(506*.7):506,]
test.med_crim <- Boston$med_crim[(506*.7):506]

This provides us with 354 observations in our training data, and 152 in the test data. We can now begin building our models!

Regression Model (GLM)

Before going forward let’s check to see if all predictors are significant

glm.Boston.fit <- glm(med_crim~nox+rad+dis+age+tax+indus, data=train.Boston, family = binomial)
summary(glm.Boston.fit)$coef

##                  Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -28.270989601 4.318242155 -6.546875 5.875368e-11
## nox          50.296106293 8.470018059  5.938134 2.882835e-09
## rad           0.556876514 0.129065450  4.314683 1.598320e-05
## dis           0.168815284 0.151745667  1.112488 2.659282e-01
## age           0.003096109 0.009206145  0.336309 7.366379e-01
## tax          -0.003905682 0.003161934 -1.235219 2.167489e-01
## indus        -0.088495592 0.049471309 -1.788827 7.364275e-02

It appears that only nox and rad are significant (p<0.05), so the others will be removed from the model going forward.

glm.Boston.fit <- glm(med_crim~nox+rad, data=train.Boston, family = binomial)
glm.Boston.prob <- predict(glm.Boston.fit, test.Boston, type='response')
glm.Boston.pred <- rep(0,length(glm.Boston.prob))
glm.Boston.pred[glm.Boston.prob>.5] <- 1
table(glm.Boston.pred, test.med_crim)

##                test.med_crim
## glm.Boston.pred   0   1
##               0   7   0
##               1  10 135

mean(glm.Boston.pred == test.med_crim)

## [1] 0.9342105

This analysis resulted in a model that was accurate 93.4% of the time, with accuracy on predicting above-median crime rate (med_crim = 1) of 100% (135/(135+0) = 1.0), and predicting below-median crime rate (med_crim = 0) of 41.2% (7/(7+10) = 0.4117647).

LDA

library(MASS)
lda.Boston.fit <- lda(med_crim~nox+rad, data=train.Boston)
lda.Boston.pred <- predict(lda.Boston.fit, test.Boston)
lda.Boston.class <- lda.Boston.pred$class
table(lda.Boston.class, test.med_crim)

##                 test.med_crim
## lda.Boston.class   0   1
##                0   7   0
##                1  10 135

mean(lda.Boston.class==test.med_crim)

## [1] 0.9342105

QDA

qda.Boston.fit <- qda(med_crim~nox+rad, data=train.Boston)
qda.Boston.pred <- predict(qda.Boston.fit, test.Boston)
qda.Boston.class <- qda.Boston.pred$class
table(qda.Boston.class, test.med_crim)

##                 test.med_crim
## qda.Boston.class   0   1
##                0   7 132
##                1  10   3

mean(qda.Boston.class==test.med_crim)

## [1] 0.06578947

This analysis resulted in a model that was accurate 65.8% of the time, with accuracy on predicting above-median crime rate (med_crim = 1) of 2.2% (3/(132+3) = 0.02222222), and predicting below-median crime rate (med_crim = 0) of 41.2% (7/(7+10) = 0.4117647).

KNN

library(class)
train.Boston.KNN.X=cbind(nox, rad)[(1:(506*.7)),]
test.Boston.KNN.X=cbind(nox, rad)[(506*.7):506,]
train.Boston.KNN.medCrime=Boston$med_crim[(1:(506*.7))]
set.seed(1)
knn.Boston.pred=knn(train.Boston.KNN.X, test.Boston.KNN.X, train.Boston.KNN.medCrime,k=1)
table(knn.Boston.pred, test.med_crim)

##                test.med_crim
## knn.Boston.pred   0   1
##               0  12   3
##               1   5 132

knn.Boston.pred3=knn(train.Boston.KNN.X, test.Boston.KNN.X, train.Boston.KNN.medCrime,k=3)
table(knn.Boston.pred3, test.med_crim)

##                 test.med_crim
## knn.Boston.pred3   0   1
##                0  12   3
##                1   5 132

knn.Boston.pred10=knn(train.Boston.KNN.X, test.Boston.KNN.X, train.Boston.KNN.medCrime,k=10)
table(knn.Boston.pred10, test.med_crim)

##                  test.med_crim
## knn.Boston.pred10   0   1
##                 0  12   3
##                 1   5 132

This analysis resulted in a model that was accurate 94.7% ((132+12)/(132+12+3+5)=0.9473684) of the time, with accuracy on predicting above-median crime rate (med_crim = 1) of 98% (132/(132+3) = 0.9777778), and predicting below-median crime rate (med_crim = 0) of 70.6% (12/(12+5) = 0.7058824). the number of K did not change the result, which means I definitely screwed up somewhere, but I’ve been trying to make this work for the last 2 days and I’ll just call it good enough for now…

library(e1071)
nb.Boston.fit <- naiveBayes(med_crim~nox+rad, data=train.Boston)
nb.Boston.class <- predict(nb.Boston.fit, test.Boston)
table(nb.Boston.class, test.med_crim)

##                test.med_crim
## nb.Boston.class   0   1
##               0   7 124
##               1  10  11

mean(nb.Boston.class == test.med_crim)

## [1] 0.1184211

This analysis resulted in a model that was accurate 11.8% of the time, with accuracy on predicting above-median crime rate (med_crim = 1) of 8.1% (11/(11+124) = 0.08148148), and predicting below-median crime rate (med_crim = 0) of 41.1% (7/(7+10) =0.4117647). Obviously this is pretty bad…

Notes

So I feel that I completely screwed up this assignment by overly complicating some areas and making the variable names too similar to each other and putting all 3 questions on one page. I need to find a better way to make these more of a plug and play system (like a function?) that could produce the same info accurately instead of handjamming each individual line and question. But hey, this is learning right? I know some of the numbers above are definitely off, but I’ll learn better in future assignments.