library(ISLR)
attach(Weekly)
(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
pairs(Weekly)
summary(Weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
The pairs plot suggests a relationship between Volume and Year, and the correlation matrix confirms it with a correlation of roughly 0.84. No other clear patterns emerge from the data.
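As an optional extra check (not part of the original output), a simple scatterplot makes the Volume trend over time easy to see:
#plot trading volume against year to visualize the relationship noted above
plot(Year, Volume, main='Weekly trading volume by year')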
(b) Use the full data set to perform a logistic regression with direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
glm.weekly=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family = binomial)
summary(glm.weekly)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only predictor that appears statistically significant, with a p-value of 0.0296.
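If desired, the p-values can also be pulled directly from the coefficient table; a minimal sketch, not part of the original output:
#extract the Pr(>|z|) column from the fitted model's coefficient matrix
coef(summary(glm.weekly))[, "Pr(>|z|)"]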
(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
Create a vector of predicted probabilities using the predict() function.
glm.probs=predict(glm.weekly,type='response')
#this will print the first 10 obs and their probability
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
contrasts(Direction)
## Up
## Down 0
## Up 1
To convert the predicted probabilities to class labels, first use rep() to create a vector of 1,089 "Down" labels, one for each observation.
glm.pred=rep("Down", 1089)
Reassign the label to "Up" for observations whose predicted probability exceeds 0.5.
glm.pred[glm.probs>0.5]="Up"
glm.pred[1:10]
## [1] "Up" "Up" "Up" "Down" "Up" "Up" "Up" "Up" "Up" "Up"
# call the prediction table and compare the predictive accuracy
table(glm.pred, Direction)
## Direction
## glm.pred Down Up
## Down 54 48
## Up 430 557
#accuracy
mean(glm.pred==Direction)
## [1] 0.5610652
#misclassification
mean(glm.pred!=Direction)
## [1] 0.4389348
The confusion matrix gives the accuracy and misclassification rates, and it also shows where the mistakes occur: the model predicts "Up" for nearly every week, so it is right for most weeks when the market actually rises (557 of 605 Up weeks) but wrong for the vast majority of weeks when the market falls (430 of 484 Down weeks are predicted as Up). Accuracy Rate = 56%.
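A minimal sketch (not part of the original output) that breaks the confusion matrix into the two error types; conf is just a local copy of the table above:
conf=table(glm.pred, Direction)
#fraction of true Down weeks predicted as Up (false "Up" calls): 430/484
conf['Up','Down']/sum(conf[,'Down'])
#fraction of true Up weeks predicted as Up: 557/605
conf['Up','Up']/sum(conf[,'Up'])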
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
#Create a logical vector that is TRUE for observations with Year before 2009 (the training period)
train=(Year<2009)
#Create the held-out test set by keeping only the rows where train is FALSE; !train selects the 2009-2010 observations
Weekly.09_10=Weekly[!train ,]
#Store the outcome variable for the held-out test observations
Direction.09_10= Direction[!train]
dim(Weekly)
## [1] 1089 9
dim(Weekly.09_10)
## [1] 104 9
head(Weekly.09_10)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 986 2009 6.760 -1.698 0.926 0.418 -2.251 3.793110 -4.448 Down
## 987 2009 -4.448 6.760 -1.698 0.926 0.418 5.043904 -4.518 Down
## 988 2009 -4.518 -4.448 6.760 -1.698 0.926 5.948758 -2.137 Down
## 989 2009 -2.137 -4.518 -4.448 6.760 -1.698 6.129763 -0.730 Down
## 990 2009 -0.730 -2.137 -4.518 -4.448 6.760 5.602004 5.173 Up
## 991 2009 5.173 -0.730 -2.137 -4.518 -4.448 6.217632 -4.808 Down
head(Direction.09_10)
## [1] Down Down Down Down Up Down
## Levels: Down Up
#create model using train dataset
glm.fits=glm(Direction~Lag2, data=Weekly,family=binomial ,subset=train)
summary(glm.fits)
##
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly,
## subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.536 -1.264 1.021 1.091 1.368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20326 0.06428 3.162 0.00157 **
## Lag2 0.05810 0.02870 2.024 0.04298 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1354.7 on 984 degrees of freedom
## Residual deviance: 1350.5 on 983 degrees of freedom
## AIC: 1354.5
##
## Number of Fisher Scoring iterations: 4
#create a vector to store probabilities of the test data set, using the model from the training data set.
glm.probs=predict(glm.fits,Weekly.09_10, type="response")
To convert the predicted probabilities to class labels, use rep() to create a vector of 104 "Down" labels, one for each held-out observation.
glm.pred=rep("Down",104)
#Reassign the label to "Up" for observations whose predicted probability exceeds 0.5
glm.pred[glm.probs >.5]="Up"
table(glm.pred,Direction.09_10)
## Direction.09_10
## glm.pred Down Up
## Down 9 5
## Up 34 56
#test set error rate
mean(glm.pred!=Direction.09_10)
## [1] 0.375
#test set accuracy rate
mean(glm.pred==Direction.09_10)
## [1] 0.625
Accuracy Rate= 62.5%
(e) Repeat (d) using LDA.
The MASS library must be loaded to use the lda() function. There is no family argument because LDA is its own class of model.
library(MASS)
Create an object called lda.fit and fit the LDA model.
lda.fit=lda(Direction~Lag2, data=Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.4414162
Use the predict() function to run the LDA fit against the held-out test data and return the predictions.
lda.pred=predict(lda.fit, Weekly.09_10)
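The object returned by predict() for an LDA fit is a list; a quick sketch (not part of the original output) of its components:
names(lda.pred)
#per-class posterior probabilities for the first few test weeks
head(lda.pred$posterior)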
Create an object to store the class element of the LDA predictions; this is what goes into the confusion matrix.
lda.class=lda.pred$class
table(lda.class,Direction.09_10)
## Direction.09_10
## lda.class Down Up
## Down 9 5
## Up 34 56
#accuracy rate
mean(lda.class == Direction.09_10)
## [1] 0.625
#misclassification rate
mean(lda.class != Direction.09_10)
## [1] 0.375
Accuracy Rate = 62.5%
(f) Repeat (d) using QDA.
qda() does not report discriminant coefficients because the QDA decision boundary is quadratic rather than linear in the predictors. With this model, prediction is the goal; inference matters less.
qda.fit=qda(Direction~Lag2,data=Weekly, subset=train)
qda.fit
## Call:
## qda(Direction ~ Lag2, data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2
## Down -0.03568254
## Up 0.26036581
qda.class=predict(qda.fit, Weekly.09_10)$class
qda.class
## [1] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [26] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [51] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [76] Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up Up
## [101] Up Up Up Up
## Levels: Down Up
table(qda.class, Direction.09_10)
## Direction.09_10
## qda.class Down Up
## Down 0 0
## Up 43 61
#accuracy rate
mean(qda.class == Direction.09_10)
## [1] 0.5865385
#misclassification rate
mean(qda.class != Direction.09_10)
## [1] 0.4134615
Accuracy Rate = 58.7%. Note that QDA predicts "Up" for every week in the test period, so this accuracy is simply the proportion of Up weeks in 2009-2010.
(g) Repeat (d) using KNN with K = 1.
knn() takes four arguments: 1. a matrix of predictors for the training data, 2. a matrix of predictors for the test data, 3. a vector of class labels for the training observations, and 4. a value of K, the number of nearest neighbors. Here as.matrix() turns Lag2 into one-column matrices for the training and test periods.
library(class)
# one-column matrix of Lag2 for the training period
train.X <- as.matrix(Lag2[train])
# one-column matrix of Lag2 for the test period
test.X <- as.matrix(Lag2[!train])
#create Direction info for train set
train.Direction=Direction[train]
dim(train.X)
## [1] 985 1
dim(test.X)
## [1] 104 1
length(train.Direction)
## [1] 985
Fit the KNN classifier; a seed must be set so the results are reproducible, since ties among nearest neighbors are broken at random.
set.seed(1)
knn.pred=knn(train.X, test.X, train.Direction, k=1)
# show results for prediction vs test data set
table(knn.pred, Direction.09_10)
## Direction.09_10
## knn.pred Down Up
## Down 21 30
## Up 22 31
#accuracy rate
mean(knn.pred == Direction.09_10)
## [1] 0.5
#misclassification rate
mean(knn.pred != Direction.09_10)
## [1] 0.5
Accuracy Rate = 50%
(h) Which of these methods appears to provide the best results on this data?
The logistic regression and LDA models provided the best results, each with an accuracy rate of 62.5%.
(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.
# LDA Trial
lda.fit=lda(Direction~Lag2 + I(Lag1^2), data=Weekly, subset = train)
lda.fit
## Call:
## lda(Direction ~ Lag2 + I(Lag1^2), data = Weekly, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4477157 0.5522843
##
## Group means:
## Lag2 I(Lag1^2)
## Down -0.03568254 4.964397
## Up 0.26036581 5.318940
##
## Coefficients of linear discriminants:
## LD1
## Lag2 0.45006708
## I(Lag1^2) 0.02767883
#This says to take the LDA fit and run it against the test data
lda.pred=predict(lda.fit, Weekly.09_10)
#create a lda class object to store the class element from the lda predictor. This will allow me to make the confusion matrix
lda.class=lda.pred$class
table(lda.class,Direction.09_10)
## Direction.09_10
## lda.class Down Up
## Down 8 2
## Up 35 59
#accuracy Rate/error
mean(lda.class == Direction.09_10)
## [1] 0.6442308
#misclassification rate
mean(lda.class != Direction.09_10)
## [1] 0.3557692
A model that incorporates Lag2 and a transformation of Lag1 produces the best accuracy, at 64.4%. Interestingly, the KNN model came close to the earlier 62.5% baseline with K = 4, reaching 61.5% accuracy at that value.
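A hedged sketch of the KNN check mentioned above, reusing train.X, test.X, and train.Direction from part (g); the exact accuracy can shift slightly with the seed because ties are broken at random:
set.seed(1)
knn.pred4=knn(train.X, test.X, train.Direction, k=4)
#accuracy with K=4, reported above as roughly 61.5%
mean(knn.pred4 == Direction.09_10)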
attach(Auto)
(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and other Auto variables.
mpg01<-ifelse(Auto$mpg > median(Auto$mpg), c(1), c(0))
Auto2<- data.frame(Auto, mpg01)
summary(Auto2)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
## mpg01
## Min. :0.0
## 1st Qu.:0.0
## Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
##
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
attach(Auto2)
pairs(Auto2[,-9])
par(mfrow=c(2,2))
boxplot(mpg~cylinders)
plot(mpg~horsepower)
plot((mpg~acceleration))
Horsepower, cylinders, weight, and displacement look like the most helpful predictors based on the boxplot and pairs outputs; a relationship is visible between those predictors and the mpg01 outcome variable.
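A few additional boxplots against the binary response (a sketch, not part of the original output) make these associations clearer:
par(mfrow=c(2,2))
boxplot(horsepower~mpg01, data=Auto2, main='horsepower by mpg01')
boxplot(weight~mpg01, data=Auto2, main='weight by mpg01')
boxplot(displacement~mpg01, data=Auto2, main='displacement by mpg01')
boxplot(cylinders~mpg01, data=Auto2, main='cylinders by mpg01')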
(c) Split the data into a training set and a test set.
# Create a 70/30 train/test
set.seed(1)
#trainid holds the row indices sampled for the training set
trainid <- sample(1:nrow(Auto2), nrow(Auto2)*0.7 , replace=F) # 70% train, 30% test
train <- Auto2[trainid,]
test <- Auto2[-trainid,]
(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
library(MASS)
lda.fit=lda(mpg01~horsepower+cylinders+ weight+displacement,data=train)
lda.fit
## Call:
## lda(mpg01 ~ horsepower + cylinders + weight + displacement, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4927007 0.5072993
##
## Group means:
## horsepower cylinders weight displacement
## 0 129.13333 6.777778 3611.052 271.9333
## 1 79.27338 4.187050 2342.165 116.8129
##
## Coefficients of linear discriminants:
## LD1
## horsepower 0.0061919395
## cylinders -0.3962357999
## weight -0.0008321338
## displacement -0.0047630097
lda.pred=predict(lda.fit, test)
lda.class=lda.pred$class
table(lda.class,test$mpg01)
##
## lda.class 0 1
## 0 50 3
## 1 11 54
#accuracy Rate
mean(lda.class == test$mpg01)
## [1] 0.8813559
#misclassification rate/Error
mean(lda.class != test$mpg01)
## [1] 0.1186441
Test Error: 11.8%
(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit=qda(mpg01~horsepower+cylinders+ weight+displacement,data=train)
qda.fit
## Call:
## qda(mpg01 ~ horsepower + cylinders + weight + displacement, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.4927007 0.5072993
##
## Group means:
## horsepower cylinders weight displacement
## 0 129.13333 6.777778 3611.052 271.9333
## 1 79.27338 4.187050 2342.165 116.8129
qda.class=predict(qda.fit, test)$class
qda.class
## [1] 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 1 0 1 1 0 1
## [38] 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1
## [75] 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0
## [112] 0 1 1 1 0 1 1
## Levels: 0 1
table(qda.class,test$mpg01)
##
## qda.class 0 1
## 0 52 5
## 1 9 52
#accuracy Rate
mean(qda.class == test$mpg01)
## [1] 0.8813559
#misclassification rate/Error Rate
mean(qda.class != test$mpg01)
## [1] 0.1186441
Test Error Rate: 11.8%
(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?
#creates model from the training data
glm.fit2=glm(mpg01~displacement+horsepower+weight+cylinders, data=train, family=binomial)
#tests the model on the test set
glm.probs=predict(glm.fit2, test, type= 'response')
#head(glm.probs)
glm.pred2=ifelse(glm.probs > 0.5, '1', '0')
table(glm.pred2, test$mpg01)
##
## glm.pred2 0 1
## 0 53 3
## 1 8 54
#accuracy Rate
mean(glm.pred2== test$mpg01)
## [1] 0.9067797
#misclassification rate/Error Rate
mean(glm.pred2!= test$mpg01)
## [1] 0.09322034
Test Error Rate: 9.3%
(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?
knn() takes four arguments: 1. a matrix of predictors for the training data, 2. a matrix of predictors for the test data, 3. a vector of class labels for the training observations, and 4. a value of K, the number of nearest neighbors. Here cbind() binds displacement, horsepower, weight, and cylinders into matrices for the training and test sets.
set.seed(1)
#Re-create the 70/30 split: trainid holds the row indices sampled for the training set
trainid <- sample(1:nrow(Auto2), nrow(Auto2)*0.7 , replace=F) # 70% train, 30% test
train <- Auto2[trainid,]
test <- Auto2[-trainid,]
library(class)
#Create matrix of predictors for training set
train.x=cbind(train$displacement,train$horsepower,train$weight,train$cylinders)
#Create matrix of predictors for testing set
test.x=cbind(test$displacement,test$horsepower,test$weight,test$cylinders)
#create vector of classes for training observations
train.mpg01=train$mpg01
#Check dimensions
dim(train.x)
## [1] 274 4
dim(test.x)
## [1] 118 4
length(train.mpg01)
## [1] 274
Fit the KNN classifier; set a seed so the results are reproducible, since ties among nearest neighbors are broken at random.
set.seed(1)
knn.pred=knn(train.x, test.x, train.mpg01, k=3)
# show results for prediction vs test data set
table(knn.pred, test$mpg01)
##
## knn.pred 0 1
## 0 52 4
## 1 9 53
#accuracy Rate
mean(knn.pred == test$mpg01)
## [1] 0.8898305
#misclassification rate/Error Rate
mean(knn.pred != test$mpg01)
## [1] 0.1101695
The best value of K seems to be 3, producing an error rate of about 11%.
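A sketch of the kind of K search behind this conclusion, reusing train.x, test.x, and train.mpg01 from above; exact error rates depend on the seed because ties are broken at random:
set.seed(1)
for (k in 1:10) {
  pred=knn(train.x, test.x, train.mpg01, k=k)
  cat('K =', k, 'test error =', round(mean(pred != test$mpg01), 4), '\n')
}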
attach(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
#Create a new binary variable indicating whether the crime rate is above the median
crime01<-ifelse(Boston$crim>median(Boston$crim), c(1), c(0))
bostoncrime<-data.frame(Boston, crime01)
summary(bostoncrime$crime01)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.5 0.5 1.0 1.0
cor(bostoncrime)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## crime01 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## crime01 -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128 0.2535684
## black lstat medv crime01
## crim -0.38506394 0.4556215 -0.3883046 0.40939545
## zn 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus -0.35697654 0.6037997 -0.4837252 0.60326017
## chas 0.04878848 -0.0539293 0.1752602 0.07009677
## nox -0.38005064 0.5908789 -0.4273208 0.72323480
## rm 0.12806864 -0.6138083 0.6953599 -0.15637178
## age -0.27353398 0.6023385 -0.3769546 0.61393992
## dis 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad -0.44441282 0.4886763 -0.3816262 0.61978625
## tax -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio -0.17738330 0.3740443 -0.5077867 0.25356836
## black 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat -0.36608690 1.0000000 -0.7376627 0.45326273
## medv 0.33346082 -0.7376627 1.0000000 -0.26301673
## crime01 -0.35121093 0.4532627 -0.2630167 1.00000000
pairs(bostoncrime)
par(mfrow=c(2,2))
plot(crim~tax)
plot(crim~nox)
plot(crim~dis)
plot(crim~age)
par(mfrow=c(2,2))
boxplot(crim~chas)
boxplot((crim~rad))
set.seed(1)
#Create a 70/30 train/test split: trainid holds the row indices sampled for the training set
trainid <- sample(1:nrow(bostoncrime), nrow(bostoncrime)*0.7 , replace=F) # 70% train, 30% test
train <- bostoncrime[trainid,]
test <- bostoncrime[-trainid,]
#Create logistic model
logistic.fit<-glm(crime01~tax + I(rad^2) + nox, data=train, family = binomial)
summary(logistic.fit)
##
## Call:
## glm(formula = crime01 ~ tax + I(rad^2) + nox, family = binomial,
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.87652 -0.32888 -0.06167 0.00000 2.44944
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.727046 2.315947 -7.654 1.94e-14 ***
## tax -0.007050 0.002404 -2.933 0.00336 **
## I(rad^2) 0.058470 0.012196 4.794 1.63e-06 ***
## nox 34.103945 4.684851 7.280 3.35e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.65 on 353 degrees of freedom
## Residual deviance: 176.79 on 350 degrees of freedom
## AIC: 184.79
##
## Number of Fisher Scoring iterations: 9
#Create a probability vector to store the predicted probabilities of the outcome variable
logistic.probs<-predict(logistic.fit, test, type='response')
#Create binary predictions based on the probabilities
logistic.pred=ifelse(logistic.probs > 0.5, '1', '0')
#create confusion matrix
table(logistic.pred, test$crime01)
##
## logistic.pred 0 1
## 0 69 14
## 1 4 65
#test set accuracy rate
mean(logistic.pred==test$crime01)
## [1] 0.8815789
#Test Misclassification/Error Rate
mean(logistic.pred!=test$crime01)
## [1] 0.1184211
After starting with tax, rad, dis, age, nox, chas, and indus, the logistic output showed that tax, rad, dis, and nox were the only significant predictors, so a new model was fit. dis was ultimately dropped and a squared transformation was applied to rad. The final logistic model's test error is 11.8%.
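For reference, a sketch of the fuller starting model described above (a reconstruction, not part of the original output); its summary is what motivated dropping the non-significant terms:
full.fit=glm(crime01~tax+rad+dis+age+nox+chas+indus, data=train, family=binomial)
summary(full.fit)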
#Create LDA Model
lda.fit=lda(crime01 ~ I(rad^2) + nox ,data=train)
lda.fit
## Call:
## lda(crime01 ~ I(rad^2) + nox, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.5084746 0.4915254
##
## Group means:
## I(rad^2) nox
## 0 19.92778 0.4695150
## 1 311.44828 0.6377989
##
## Coefficients of linear discriminants:
## LD1
## I(rad^2) 0.002269416
## nox 9.548628235
lda.pred=predict(lda.fit, test)
lda.class=lda.pred$class
table(lda.class,test$crime01)
##
## lda.class 0 1
## 0 72 20
## 1 1 59
#accuracy Rate
mean(lda.class == test$crime01)
## [1] 0.8618421
#misclassification rate/Error
mean(lda.class != test$crime01)
## [1] 0.1381579
After swapping variables in and out, rad (squared) and nox appear to be the best predictors for the LDA model. The LDA test error is 13.8%.
#Create a QDA model
qda.fit=qda(crime01 ~ dis + I(rad^2) + nox ,data=train)
qda.fit
## Call:
## qda(crime01 ~ dis + I(rad^2) + nox, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.5084746 0.4915254
##
## Group means:
## dis I(rad^2) nox
## 0 5.070315 19.92778 0.4695150
## 1 2.537681 311.44828 0.6377989
qda.class=predict(qda.fit, test)$class
qda.class
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0
## [75] 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
## [149] 0 0 0 0
## Levels: 0 1
table(qda.class,test$crime01)
##
## qda.class 0 1
## 0 73 20
## 1 0 59
#accuracy Rate
mean(qda.class == test$crime01)
## [1] 0.8684211
#misclassification rate/Error Rate
mean(qda.class != test$crime01)
## [1] 0.1315789
It is interesting to note that the QDA model prefers dis over tax. The resulting test error rate is 13.2%.
# Create a knn model
#create a matrix of predictors for the training and test set
train.x<-cbind(train$dis, train$rad, train$nox, train$tax)
test.x<-cbind(test$dis, test$rad, test$nox, test$tax)
#create a vector of classes for training observations
train.crim01<-train$crime01
#Check dimensions
dim(train.x)
## [1] 354 4
dim(test.x)
## [1] 152   4
length(train.crim01)
## [1] 354
set.seed(1)
knn.pred=knn(train.x, test.x, train.crim01, k=1)
# show results for prediction vs test data set
table(knn.pred, test$crime01)
##
## knn.pred 0 1
## 0 68 2
## 1 5 77
#accuracy Rate
mean(knn.pred == test$crime01)
## [1] 0.9539474
#misclassification rate/Error Rate
mean(knn.pred != test$crime01)
## [1] 0.04605263
After fitting various KNN models, K = 1 was found to be the best fit, producing an error rate of 4.6%. This model used dis, rad, nox, and tax.
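A compact sketch of the K comparison referred to above, reusing train.x, test.x, and train.crim01; exact rates depend on the seed because ties are broken at random:
set.seed(1)
#test error for a few candidate values of K
sapply(c(1, 3, 5, 10), function(k) mean(knn(train.x, test.x, train.crim01, k=k) != test$crime01))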