hw_3_data_mining.knit

Question 1

This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years (from the beginning of 1990 to the end of 2010).

a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be some patterns?

weekly <- ISLR::Weekly
weekly <- na.omit(weekly)
summary(weekly)

##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume            Today         
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747   Min.   :-18.1950  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202   1st Qu.: -1.1540  
##  Median :  0.2380   Median :  0.2340   Median :1.00268   Median :  0.2410  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462   Mean   :  0.1499  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373   3rd Qu.:  1.4050  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821   Max.   : 12.0260  
##  Direction 
##  Down:484  
##  Up  :605  
##            
##            
##            
##

boxplot(weekly$Lag1, weekly$Lag2, weekly$Lag3, weekly$Lag4, weekly$Lag5)

volume <- weekly$Volume
hist(volume)

pairs(weekly)

The five lag variables have nearly identical distributions: all share the same minimum (-18.195) and maximum (12.026), and their means and medians differ only slightly from one another. Of the other two variables, Year behaves more like a qualitative (grouping) variable than a continuous predictor, and Volume is skewed to the right (mean 1.57 against a median of 1.00).

b) Use the full data set to perform a logistic regression with direction as the response and the 5 lag variables plus volume as predictors. Use the summary function to print these results. Do any of the predictors appear to be statistically significant? If so, which ones?

original_function <-glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = weekly, family = binomial)
summary(original_function)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

In this regression, the only predictor that is statistically significant at the 5% level is Lag2 (p = 0.0296). The intercept is also significant (p = 0.0019).

c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

coef(original_function)

## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
##  0.26686414 -0.04126894  0.05844168 -0.01606114 -0.02779021 -0.01447206 
##      Volume 
## -0.02274153

summary(original_function)$coef

##                Estimate Std. Error    z value    Pr(>|z|)
## (Intercept)  0.26686414 0.08592961  3.1056134 0.001898848
## Lag1        -0.04126894 0.02641026 -1.5626099 0.118144368
## Lag2         0.05844168 0.02686499  2.1753839 0.029601361
## Lag3        -0.01606114 0.02666299 -0.6023760 0.546923890
## Lag4        -0.02779021 0.02646332 -1.0501409 0.293653342
## Lag5        -0.01447206 0.02638478 -0.5485006 0.583348244
## Volume      -0.02274153 0.03689812 -0.6163330 0.537674762

glm.probs <- predict(original_function , type="response")
glm.probs[1:10]

##         1         2         3         4         5         6         7         8 
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972 
##         9        10 
## 0.5715200 0.5554287

contrasts(weekly$Direction)

##      Up
## Down  0
## Up    1

glm.pred <- rep("Down",1089) # create a vector of 1089 entries (one per week), all "Down"
glm.pred[glm.probs>.5]="Up" # change an entry to "Up" where the predicted probability exceeds 0.5
table(glm.pred, weekly$Direction)

##         
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557

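The confusion matrix shows that (54 + 557) = 611 of the 1089 predictions were correct, so the model was right about 56.11% of the time (611 / 1089 = 0.5611). The mistakes are lopsided: the model predicts Up for 987 of the 1089 weeks, so it classifies 557 of the 605 Up weeks correctly (92.1%) but only 54 of the 484 Down weeks (11.2%). Most of its errors are Down weeks predicted as Up.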
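The accuracy and the two class-wise rates can be read off the table directly. The following is a minimal sketch in base R that re-enters the counts from the confusion matrix above by hand; the object names (conf_mat, accuracy, up_correct, down_correct) are illustrative and not part of the original analysis.

```r
# Confusion matrix from part (c): rows = predicted, columns = actual.
# matrix() fills column by column, so the first column is actual "Down".
conf_mat <- matrix(c(54, 430, 48, 557), nrow = 2,
                   dimnames = list(predicted = c("Down", "Up"),
                                   actual    = c("Down", "Up")))

# Overall fraction of correct predictions: diagonal over total.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Fraction of actual "Up" weeks that were predicted "Up".
up_correct <- conf_mat["Up", "Up"] / sum(conf_mat[, "Up"])

# Fraction of actual "Down" weeks that were predicted "Down".
down_correct <- conf_mat["Down", "Down"] / sum(conf_mat[, "Down"])

round(c(accuracy = accuracy, up = up_correct, down = down_correct), 4)
```

Looking at the per-class rates alongside the overall accuracy makes it easier to see when a classifier is mostly predicting one class.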

d) Now fit the logistic regression model using a training data period from 1990 to 2008, with lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions from the held-out data (that is, the data from 2009-2010).

train_weekly <- split(weekly, weekly$Year<2009)[[2]]
weekly_2008 <- split(weekly, weekly$Year<2009)[[1]]
glm_fit_2008 <- glm(Direction ~ Lag2, data = weekly_2008, family = binomial)
glm.probs.2008 <- predict(glm_fit_2008, type = "response")
glm.pred.2008 <- rep("Down", length(glm.probs.2008))
glm.pred.2008[glm.probs.2008>0.5] = "Up"
table(glm.pred.2008, weekly_2008$Direction)

##              
## glm.pred.2008 Down Up
##          Down    8  4
##          Up     35 57

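The confusion matrix shows (8 + 57) = 65 correct predictions out of 104, an accuracy of 62.5% (65 / 104 = 0.625). Note that split() places the Year < 2009 rows in its second element, so weekly_2008 holds the 2009-2010 weeks. Since glm_fit_2008 is fit on weekly_2008 and predict() is called without new data, this 62.5% is an in-sample accuracy on 2009-2010 rather than a true held-out accuracy.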
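For a held-out estimate, the usual pattern is to fit on the training rows only and then pass the held-out rows to predict() through newdata. The following is a minimal sketch of that pattern on a synthetic stand-in for the data (the toy data frame and the names train, test, fit, probs, and pred are assumptions for illustration; no result below comes from the Weekly data).

```r
set.seed(1)

# Synthetic stand-in for Weekly: 10 years x 20 rows, NOT the real data.
n <- 200
toy <- data.frame(Year = rep(2001:2010, each = 20),
                  Lag2 = rnorm(n))
toy$Direction <- factor(ifelse(toy$Lag2 + rnorm(n) > 0, "Up", "Down"),
                        levels = c("Down", "Up"))

# Explicit logical subsetting avoids having to remember which
# element of split() is TRUE and which is FALSE.
train <- toy[toy$Year < 2009, ]    # fitting period
test  <- toy[toy$Year >= 2009, ]   # held-out period

# Fit on the training rows only.
fit <- glm(Direction ~ Lag2, data = train, family = binomial)

# Predict on the held-out rows via newdata.
probs <- predict(fit, newdata = test, type = "response")
pred  <- ifelse(probs > 0.5, "Up", "Down")

table(pred, test$Direction)      # held-out confusion matrix
mean(pred == test$Direction)     # held-out accuracy
```

With this layout, the rows used to score the model never influence the fitted coefficients.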

e) Repeat using LDA

library(MASS)  # provides lda() and qda()
lda.fit.2008 <- lda(Direction ~ Lag2, data = train_weekly)  # fit on the 1990-2008 weeks only
lda.pred.2008 <- predict(lda.fit.2008, weekly_2008)
lda.class <- lda.pred.2008$class
table(lda.class, weekly_2008$Direction)

##          
## lda.class Down Up
##      Down    9  5
##      Up     34 56

mean(lda.class == weekly_2008$Direction)

## [1] 0.625

The confusion matrix from the linear discriminant analysis shows (9 + 56) = 65 correct predictions out of 104, an accuracy of 62.5% (65 / 104 = 0.625).

f) Repeat using QDA

qda.fit.2008 <- qda(Direction ~ Lag2, data = train_weekly)  # fit on the 1990-2008 weeks only
qda.class <- predict(qda.fit.2008, weekly_2008)$class
table(qda.class, weekly_2008$Direction)

##          
## qda.class Down Up
##      Down    0  0
##      Up     43 61

mean(qda.class == weekly_2008$Direction)

## [1] 0.5865385

The confusion matrix from the quadratic discriminant analysis shows 61 correct predictions out of 104, an accuracy of 58.65% (61 / 104 = 0.5865). QDA predicts Up for every week, so all 43 Down weeks are misclassified.

g) Repeat using KNN with K = 1

library(class)  # provides knn()
train_x <- as.matrix(train_weekly$Lag2)
test_x <- as.matrix(weekly_2008$Lag2)
train_direction <- train_weekly$Direction
set.seed(1)
knn_pred <- knn(train_x, test_x, train_direction, k = 1)
table(knn_pred, weekly_2008$Direction)

##         
## knn_pred Down Up
##     Down   21 30
##     Up     22 31

mean(knn_pred == weekly_2008$Direction)

## [1] 0.5

The confusion matrix from K nearest neighbors with K = 1 shows (21 + 31) = 52 correct predictions out of 104, an accuracy of 50% (52 / 104 = 0.50).

h) Which of these methods appear to provide the best results for this data?

Logistic regression and linear discriminant analysis have the highest accuracy rates, both at 62.5%.

i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier

glm_fit_i <- glm(Direction ~ Lag3 + Volume, data = weekly_2008, family = binomial)
glm.probs.i <- predict(glm_fit_i, type = "response")
glm.pred.i <- rep("Down", length(glm.probs.i))
glm.pred.i[glm.probs.i>0.5] = "Up"
table(glm.pred.i, weekly_2008$Direction)

##           
## glm.pred.i Down Up
##       Down    0  1
##       Up     43 60

For this question, we use Lag3 and Volume to predict Direction. The logistic regression predicts Up for all but one week, so it classifies 60 of the 61 Up weeks correctly and none of the 43 Down weeks. The confusion matrix gives an accuracy of about 57.69% (60 / 104 = 0.5769). Because glm_fit_i is fit on weekly_2008 (the 2009-2010 rows) and evaluated on those same rows, this is an in-sample figure.

lda.fit.i <- lda(Direction ~ Lag3 + Volume, data = weekly)
lda.pred.i <- predict(lda.fit.i, weekly_2008)
lda.class.i <- lda.pred.i$class
table(lda.class.i, weekly_2008$Direction)

##            
## lda.class.i Down Up
##        Down    1  4
##        Up     42 57

mean(lda.class.i == weekly_2008$Direction)

## [1] 0.5576923

As with the logistic regression, the linear discriminant analysis correctly classifies most of the weeks in which the direction is Up, but almost none of those in which it is Down. The confusion matrix shows (1 + 57) = 58 correct predictions, an accuracy of about 55.77% (58 / 104 = 0.5577). Here lda.fit.i is fit on the full weekly data, so the 2009-2010 weeks used for evaluation were also part of the fit.

qda.fit.i <- qda(Direction ~ Lag3 + Volume, data = weekly)
qda.class.i <- predict(qda.fit.i, weekly_2008)$class
table(qda.class.i, weekly_2008$Direction)

##            
## qda.class.i Down Up
##        Down   10 14
##        Up     33 47

mean(qda.class.i == weekly_2008$Direction)

## [1] 0.5480769

The quadratic discriminant analysis classifies more of the Down weeks correctly (10 of 43) but fewer of the Up weeks (47 of 61). The confusion matrix shows (10 + 47) = 57 correct predictions, an accuracy of about 54.81% (57 / 104 = 0.5481). Here qda.fit.i is fit on the full weekly data, which includes the 2009-2010 weeks used for evaluation.

train_x_i <- as.matrix(train_weekly$Lag3, train_weekly$Volume)
test_x_i <- as.matrix(weekly_2008$Lag3, weekly_2008$Volume)
train_direction_i <- train_weekly$Direction
set.seed(1)
knn_pred_i <- knn(train_x_i, test_x_i, train_direction_i, k = 3)
table(knn_pred_i, weekly_2008$Direction)

##           
## knn_pred_i Down Up
##       Down   17 32
##       Up     26 29

mean(knn_pred_i == weekly_2008$Direction)

## [1] 0.4423077

As with QDA, K nearest neighbors with K = 3 classifies more of the Down weeks correctly (17 of 43) but fewer of the Up weeks (29 of 61). Overall, the confusion matrix shows (17 + 29) = 46 correct predictions, an accuracy of about 44.23% (46 / 104 = 0.4423), which is worse than a coin flip. Note that as.matrix() keeps only its first argument, so train_x_i and test_x_i contain Lag3 alone and Volume does not enter this fit.
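The behavior of as.matrix() with more than one vector can be checked on a small example. The following is a minimal sketch in base R with made-up values (lag3, volume, one_col, and two_col are illustrative names): as.matrix() converts its first argument and ignores the rest, while cbind() binds the vectors as columns.

```r
# Made-up values, only to show the shapes that result.
lag3   <- c(0.5, -1.2, 0.3)
volume <- c(1.1, 0.9, 1.4)

# as.matrix() converts its first argument; the second is silently ignored.
one_col <- as.matrix(lag3, volume)

# cbind() binds both vectors as columns, giving one column per predictor.
two_col <- cbind(lag3, volume)

dim(one_col)   # 3 1
dim(two_col)   # 3 2
```

Passing the cbind() result to knn() would be one way to have both predictors take part in the distance calculation.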

Question 2

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

a) Create a binary variable (mpg01) that contains a 1 if mpg is above the median and a 0 if it is below. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

Auto <- read.csv("https://www.statlearning.com/s/Auto.csv", 
                 header = TRUE, na.strings = "?")
Auto <- na.omit(Auto)
median(Auto$mpg)

## [1] 22.75

library(dplyr)  # provides %>%, mutate() and case_when()
Auto <- Auto %>%
        mutate(mpg01 = case_when(mpg < 22.75 ~ 0,
                                 mpg >= 22.75 ~ 1))
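As a possible alternative to typing the median in by hand, the indicator can be computed from median() directly. The following is a minimal sketch in base R on made-up mpg values (the vector and the toy_auto data frame are illustrative, not taken from Auto). It uses a strict > so that 1 means strictly above the median, as the question words it.

```r
# Made-up mpg values, for illustration only.
mpg <- c(18, 15, 36, 26, 22.5, 31)

# 1 if mpg is above the median, 0 otherwise.
mpg01 <- as.integer(mpg > median(mpg))

# Combine the indicator with the original variable in one data frame.
toy_auto <- data.frame(mpg = mpg, mpg01 = mpg01)
toy_auto
```

Computing the cutoff from the data keeps the indicator correct if rows are later added or removed.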

b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem the most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

auto_no_name <- Auto %>%
                dplyr::select(-"name")
cor(auto_no_name)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
## mpg01         0.8369392 -0.7591939   -0.7534766 -0.6670526 -0.7577566
##              acceleration       year     origin      mpg01
## mpg             0.4233285  0.5805410  0.5652088  0.8369392
## cylinders      -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement   -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower     -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight         -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration    1.0000000  0.2903161  0.2127458  0.3468215
## year            0.2903161  1.0000000  0.1815277  0.4299042
## origin          0.2127458  0.1815277  1.0000000  0.5136984
## mpg01           0.3468215  0.4299042  0.5136984  1.0000000

Apart from mpg itself, the variables most strongly correlated with mpg01 are cylinders (-0.76), weight (-0.76), and displacement (-0.75). All three correlations are negative and close to -0.75.

c) Split the data into a training set and a test set.

train_auto <- split(Auto, Auto$year<78)[[2]]
test_auto <- split(Auto, Auto$year<78)[[1]]

d) Perform LDA on the training data in order to predict mpg01 using the variables that seem the most associated with mpg01 in (b). What is the test error of the model obtained?

lda.fit.auto.79 <- lda(mpg01 ~ cylinders + displacement + weight, data = train_auto)
lda.pred.auto.79 <- predict(lda.fit.auto.79, test_auto)
lda.class.auto <- lda.pred.auto.79$class
table(lda.class.auto, test_auto$mpg01)

##               
## lda.class.auto  0  1
##              0 35 14
##              1  4 97

mean(lda.class.auto == test_auto$mpg01)

## [1] 0.88

The confusion matrix from the linear discriminant analysis shows (35 + 97) = 132 correct predictions out of 150, an accuracy of 88% (132 / 150 = 0.88).

The test error is one minus the accuracy: 1 - 0.88 = 0.12, or 12%.

e) Perform QDA on the training data in order to predict mpg01 using the variables that seem the most associated with mpg01 in (b). What is the test error of the model obtained?

qda.fit.auto.79 <- qda(mpg01 ~ cylinders + displacement + weight, data = train_auto)
qda.class <- predict(qda.fit.auto.79, test_auto)$class
table(qda.class, test_auto$mpg01)

##          
## qda.class  0  1
##         0 36 20
##         1  3 91

mean(qda.class ==  test_auto$mpg01)

## [1] 0.8466667

The confusion matrix from the quadratic discriminant analysis shows (36 + 91) = 127 correct predictions out of 150, an accuracy of 84.67% (127 / 150 = 0.8467).

The test error is one minus the accuracy: 1 - 0.8467 = 0.1533, or 15.33%.

f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seem the most associated with mpg01 in (b). What is the test error of the model obtained?

glm_fit_auto_79 <- glm(mpg01 ~ cylinders + displacement + weight, data = train_auto, family = binomial)
glm.probs.auto.79 <- predict(glm_fit_auto_79, test_auto, type = "response")
glm.pred.auto.79 <- rep(0, length(glm.probs.auto.79))
glm.pred.auto.79[glm.probs.auto.79>0.5] = 1
table(glm.pred.auto.79, test_auto$mpg01)

##                 
## glm.pred.auto.79  0  1
##                0 38 34
##                1  1 77

mean(glm.pred.auto.79 == test_auto$mpg01)

## [1] 0.7666667

The confusion matrix from the logistic regression shows (38 + 77) = 115 correct predictions out of 150, an accuracy of 76.67% (115 / 150 = 0.7667).

The test error is one minus the accuracy: 1 - 0.7667 = 0.2333, or 23.33%.

g) Perform KNN on the training data (using several values of K) in order to predict mpg01 using the variables that seem the most associated with mpg01 in (b). What is the test error of the model obtained? Which value of K seems to perform the best on this data set?

train_x_auto <- as.matrix(train_auto$cylinders, train_auto$displacement, train_auto$weight)
test_x_auto <- as.matrix(test_auto$cylinders, test_auto$displacement, test_auto$weight)
train_mpg01 <- train_auto$mpg01
set.seed(1)
knn_pred_auto <- knn(train_x_auto, test_x_auto, train_mpg01, k = 1)
table(knn_pred_auto, test_auto$mpg01)

##              
## knn_pred_auto  0  1
##             0 35 13
##             1  4 98

mean(knn_pred_auto == test_auto$mpg01)

## [1] 0.8866667

knn_pred_auto <- knn(train_x_auto, test_x_auto, train_mpg01, k = 3)
table(knn_pred_auto, test_auto$mpg01)

##              
## knn_pred_auto  0  1
##             0 35 13
##             1  4 98

knn_pred_auto <- knn(train_x_auto, test_x_auto, train_mpg01, k = 8)
table(knn_pred_auto, test_auto$mpg01)

##              
## knn_pred_auto  0  1
##             0 35 12
##             1  4 99

With K = 8, the K nearest neighbors confusion matrix shows (35 + 99) = 134 correct predictions out of 150, an accuracy of 89.33% (134 / 150 = 0.8933). K = 1 and K = 3 are not far behind, each with 133 correct predictions (88.67%). Note that as.matrix() keeps only its first argument, so train_x_auto and test_x_auto contain cylinders alone; these results are for KNN on that single predictor.

The test error for K = 8 is one minus the accuracy: 1 - 0.8933 = 0.1067, or 10.67%, the lowest of the three values of K tried.
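Comparing several values of K can be done in one loop instead of one call per value. The following is a minimal sketch using knn() from the class package on synthetic data (the data, the split at row 150, and the names x, y, ks, and test_error are assumptions for illustration; none of the numbers it produces relate to Auto).

```r
library(class)  # provides knn()

set.seed(1)

# Synthetic two-predictor data, for illustration only.
n <- 200
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(ifelse(x[, "x1"] + x[, "x2"] + rnorm(n, sd = 0.5) > 0, 1, 0))

train_idx <- 1:150
test_idx  <- 151:200

# Test error (fraction misclassified) for each candidate K.
ks <- c(1, 3, 8, 15)
test_error <- sapply(ks, function(k) {
  pred <- knn(x[train_idx, ], x[test_idx, ], y[train_idx], k = k)
  mean(pred != y[test_idx])
})
names(test_error) <- paste0("k=", ks)

test_error                   # one test error per K
ks[which.min(test_error)]    # K with the lowest test error
```

Because knn() breaks ties at random, setting the seed before the loop keeps the comparison reproducible.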

Question 3

Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.

boston_data <- MASS::Boston
boston_data <- boston_data %>%
              mutate(high_crime = case_when(crim < median(crim) ~ 0,
                                            crim > median(crim) ~ 1))
cor(boston_data)

##                   crim          zn       indus         chas         nox
## crim        1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn         -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus       0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas       -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox         0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm         -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age         0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis        -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad         0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax         0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio     0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black      -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat       0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv       -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
## high_crime  0.40939545 -0.43615103  0.60326017  0.070096774  0.72323480
##                     rm         age         dis          rad         tax
## crim       -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn          0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus      -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas        0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox        -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm          1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age        -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis         0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad        -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax        -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio    -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black       0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat      -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv        0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
## high_crime -0.15637178  0.61393992 -0.61634164  0.619786249  0.60874128
##               ptratio       black      lstat       medv  high_crime
## crim        0.2899456 -0.38506394  0.4556215 -0.3883046  0.40939545
## zn         -0.3916785  0.17552032 -0.4129946  0.3604453 -0.43615103
## indus       0.3832476 -0.35697654  0.6037997 -0.4837252  0.60326017
## chas       -0.1215152  0.04878848 -0.0539293  0.1752602  0.07009677
## nox         0.1889327 -0.38005064  0.5908789 -0.4273208  0.72323480
## rm         -0.3555015  0.12806864 -0.6138083  0.6953599 -0.15637178
## age         0.2615150 -0.27353398  0.6023385 -0.3769546  0.61393992
## dis        -0.2324705  0.29151167 -0.4969958  0.2499287 -0.61634164
## rad         0.4647412 -0.44441282  0.4886763 -0.3816262  0.61978625
## tax         0.4608530 -0.44180801  0.5439934 -0.4685359  0.60874128
## ptratio     1.0000000 -0.17738330  0.3740443 -0.5077867  0.25356836
## black      -0.1773833  1.00000000 -0.3660869  0.3334608 -0.35121093
## lstat       0.3740443 -0.36608690  1.0000000 -0.7376627  0.45326273
## medv       -0.5077867  0.33346082 -0.7376627  1.0000000 -0.26301673
## high_crime  0.2535684 -0.35121093  0.4532627 -0.2630167  1.00000000

We create a new variable called high_crime, in which a value of 0 denotes a suburb with a crime rate below the median and a value of 1 denotes a suburb with a crime rate above the median. In the correlation matrix, the predictors most strongly correlated with high_crime are nox (nitrogen oxides concentration, 0.72), rad (index of accessibility to radial highways, 0.62), dis (weighted mean of distances to 5 Boston employment centers, -0.62), age (0.61), and tax (full-value property-tax rate per $10K, 0.61). The models below use nox, tax, and dis.
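Ranking the predictors by absolute correlation makes the comparison easier than scanning the matrix. The following is a minimal sketch in base R that re-enters the high_crime column of the correlation matrix above by hand, rounded to three decimals (the names cors and ranked are illustrative; crim is left out because high_crime is derived from it).

```r
# Correlations with high_crime, copied from the matrix above and rounded.
cors <- c(zn = -0.436, indus = 0.603, chas = 0.070, nox = 0.723,
          rm = -0.156, age = 0.614, dis = -0.616, rad = 0.620,
          tax = 0.609, ptratio = 0.254, black = -0.351,
          lstat = 0.453, medv = -0.263)

# Sort by strength of association, ignoring the sign.
ranked <- cors[order(abs(cors), decreasing = TRUE)]

head(ranked, 5)   # nox, rad, dis, age, tax
```

On the full data frame, the same vector could be obtained with cor(boston_data)[, "high_crime"] before dropping the crim and high_crime entries.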

train_boston_dim <- 1:(dim(boston_data)[1]/2)
test_boston_dim <-  (dim(boston_data)[1]/2 + 1):dim(boston_data)[1]
train_boston <- boston_data[train_boston_dim, ]
test_boston <- boston_data[test_boston_dim, ]

logistic_boston <- glm(high_crime ~ nox + tax + dis, data = train_boston, family = binomial)
glm.probs.boston <- predict(logistic_boston, test_boston, type = "response")
glm.preds.boston <- rep(0, length(glm.probs.boston))
glm.preds.boston[glm.probs.boston>0.5] = 1
table(glm.preds.boston, test_boston$high_crime)

##                 
## glm.preds.boston   0   1
##                0  72   8
##                1  18 155

mean(glm.preds.boston == test_boston$high_crime)

## [1] 0.8972332

The confusion matrix from the logistic regression shows (72 + 155) = 227 correct predictions out of 253, an accuracy of 89.72% (227 / 253 = 0.8972).

The test error is one minus the accuracy: 1 - 0.8972 = 0.1028, or 10.28%.

lda.fit.boston <- lda(high_crime ~ nox + tax + dis, data = train_boston)
lda.pred.boston <- predict(lda.fit.boston, test_boston)
lda.class.boston <- lda.pred.boston$class
table(lda.class.boston, test_boston$high_crime)

##                 
## lda.class.boston   0   1
##                0  80  18
##                1  10 145

mean(lda.class.boston == test_boston$high_crime)

## [1] 0.8893281

The confusion matrix from the linear discriminant analysis shows (80 + 145) = 225 correct predictions out of 253, an accuracy of 88.93% (225 / 253 = 0.8893).

The test error is one minus the accuracy: 1 - 0.8893 = 0.1107, or 11.07%.

qda.fit.boston <- qda(high_crime ~ nox + tax + dis, data = train_boston)
qda.class.boston <- predict(qda.fit.boston, test_boston)$class
table(qda.class.boston, test_boston$high_crime)

##                 
## qda.class.boston  0  1
##                0 83 75
##                1  7 88

mean(qda.class.boston ==  test_boston$high_crime)

## [1] 0.6758893

The confusion matrix from the quadratic discriminant analysis shows (83 + 88) = 171 correct predictions out of 253, an accuracy of 67.59% (171 / 253 = 0.6759).

The test error is one minus the accuracy: 1 - 0.6759 = 0.3241, or 32.41%.

train_x_boston <- as.matrix(train_boston$nox, train_boston$tax, train_boston$dis)
test_x_boston <- as.matrix(test_boston$nox, test_boston$tax, test_boston$dis)
train_high_crime <- train_boston$high_crime
set.seed(1)
knn_pred_boston <- knn(train_x_boston, test_x_boston, train_high_crime, k = 3)
table(knn_pred_boston, test_boston$high_crime)

##                
## knn_pred_boston   0   1
##               0  85  37
##               1   5 126

mean(knn_pred_boston == test_boston$high_crime)

## [1] 0.8339921

The confusion matrix from K nearest neighbors with K = 3 shows (85 + 126) = 211 correct predictions out of 253, an accuracy of 83.40% (211 / 253 = 0.8340). Note that as.matrix() keeps only its first argument, so train_x_boston and test_x_boston contain nox alone; tax and dis do not enter this fit.

The test error is one minus the accuracy: 1 - 0.8340 = 0.166, or 16.6%.