| output: html_document title:hw_3_data_mining |
This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1089 weekly returns for 21 years (from the beginning of 1990 to the end of 2010).
a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be some patterns?
weekly <- ISLR::Weekly
weekly <- na.omit(weekly)
summary(weekly)
## Year Lag1 Lag2 Lag3
## Min. :1990 Min. :-18.1950 Min. :-18.1950 Min. :-18.1950
## 1st Qu.:1995 1st Qu.: -1.1540 1st Qu.: -1.1540 1st Qu.: -1.1580
## Median :2000 Median : 0.2410 Median : 0.2410 Median : 0.2410
## Mean :2000 Mean : 0.1506 Mean : 0.1511 Mean : 0.1472
## 3rd Qu.:2005 3rd Qu.: 1.4050 3rd Qu.: 1.4090 3rd Qu.: 1.4090
## Max. :2010 Max. : 12.0260 Max. : 12.0260 Max. : 12.0260
## Lag4 Lag5 Volume Today
## Min. :-18.1950 Min. :-18.1950 Min. :0.08747 Min. :-18.1950
## 1st Qu.: -1.1580 1st Qu.: -1.1660 1st Qu.:0.33202 1st Qu.: -1.1540
## Median : 0.2380 Median : 0.2340 Median :1.00268 Median : 0.2410
## Mean : 0.1458 Mean : 0.1399 Mean :1.57462 Mean : 0.1499
## 3rd Qu.: 1.4090 3rd Qu.: 1.4050 3rd Qu.:2.05373 3rd Qu.: 1.4050
## Max. : 12.0260 Max. : 12.0260 Max. :9.32821 Max. : 12.0260
## Direction
## Down:484
## Up :605
##
##
##
##
boxplot(weekly$Lag1, weekly$Lag2, weekly$Lag3, weekly$Lag4, weekly$Lag5)
volume <- weekly$Volume
hist(volume)
pairs(weekly)
The five lag variables have nearly identical distributions: they share the same minimum (-18.195) and maximum (12.026), and their means and medians differ only slightly. This is expected, since each lag is the same weekly return series shifted by one to five weeks. As for the other two variables, Year acts more like a categorical time index than a continuous measurement, and Volume is skewed to the right.
b) Use the full data set to perform a logistic regression with direction as the response and the 5 lag variables plus volume as predictors. Use the summary function to print these results. Do any of the predictors appear to be statistically significant? If so, which ones?
original_function <-glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = weekly, family = binomial)
summary(original_function)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
In this regression, the only statistically significant terms at the 5% level are the intercept (p = 0.0019) and Lag2 (p = 0.0296).
c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
coef(original_function)
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.26686414 -0.04126894 0.05844168 -0.01606114 -0.02779021 -0.01447206
## Volume
## -0.02274153
summary(original_function)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686414 0.08592961 3.1056134 0.001898848
## Lag1 -0.04126894 0.02641026 -1.5626099 0.118144368
## Lag2 0.05844168 0.02686499 2.1753839 0.029601361
## Lag3 -0.01606114 0.02666299 -0.6023760 0.546923890
## Lag4 -0.02779021 0.02646332 -1.0501409 0.293653342
## Lag5 -0.01447206 0.02638478 -0.5485006 0.583348244
## Volume -0.02274153 0.03689812 -0.6163330 0.537674762
glm.probs <- predict(original_function , type="response")
glm.probs[1:10]
## 1 2 3 4 5 6 7 8
## 0.6086249 0.6010314 0.5875699 0.4816416 0.6169013 0.5684190 0.5786097 0.5151972
## 9 10
## 0.5715200 0.5554287
contrasts(weekly$Direction)
## Up
## Down 0
## Up 1
glm.pred <- rep("Down",1089) # this creates a vector of 1089 entries (one per week), all set to "Down"
glm.pred[glm.probs>.5]="Up" # this changes an entry to "Up" when the predicted probability exceeds 0.5
table(glm.pred, weekly$Direction)
##
## glm.pred Down Up
## Down 54 48
## Up 430 557
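The confusion matrix shows that (54 + 557) = 611 of the 1089 predictions were correct, so the model was right about 56.11% of the time (611 / 1089 = 0.5611). Most of the mistakes are of one type: the model predicts Up for 987 of the 1089 weeks, so it misclassifies 430 of the 484 Down weeks as Up, while only 48 of the 605 Up weeks are misclassified as Down.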
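The per-class error rates can be read off the confusion matrix directly. The following is a minimal sketch that rebuilds the table from the counts printed above; the names `conf_mat`, `accuracy`, `up_recall`, and `down_recall` are illustrative and not part of the original analysis.

```r
# Confusion matrix from part (c): rows are predictions, columns are the truth.
# matrix() fills column by column, so the first column is the actual "Down" weeks.
conf_mat <- matrix(c(54, 430, 48, 557), nrow = 2,
                   dimnames = list(predicted = c("Down", "Up"),
                                   actual = c("Down", "Up")))

# Overall fraction of correct predictions (diagonal over total).
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Fraction of actual Up weeks predicted as Up, and of actual Down weeks predicted as Down.
up_recall <- conf_mat["Up", "Up"] / sum(conf_mat[, "Up"])
down_recall <- conf_mat["Down", "Down"] / sum(conf_mat[, "Down"])

round(c(accuracy = accuracy, up = up_recall, down = down_recall), 4)
```

The gap between the two per-class rates is what the overall accuracy hides: the model is right on most Up weeks and wrong on most Down weeks.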
d) Now fit the logistic regression model using a training data period from 1990 to 2008, with lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions from the held-out data (that is, the data from 2009-2010).
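# split() orders its pieces FALSE, TRUE, so [[2]] holds Year < 2009 and [[1]] holds 2009-2010
train_weekly <- split(weekly, weekly$Year<2009)[[2]] # training period, 1990-2008
weekly_2008 <- split(weekly, weekly$Year<2009)[[1]] # held-out period, 2009-2010 (despite the name)
# note: the model below is fit on weekly_2008 (the held-out rows), not on train_weekly
glm_fit_2008 <- glm(Direction ~ Lag2, data = weekly_2008, family = binomial)
glm.probs.2008 <- predict(glm_fit_2008, type = "response") # no newdata, so these are fitted values for the same rows
glm.pred.2008 <- rep("Down", length(glm.probs.2008))
glm.pred.2008[glm.probs.2008>0.5] = "Up"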
table(glm.pred.2008, weekly_2008$Direction)
##
## glm.pred.2008 Down Up
## Down 8 4
## Up 35 57
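The confusion matrix shows that (8 + 57) = 65 of the 104 observations from 2009-2010 were predicted correctly, so the model was right about 62.5% of the time (65 / 104 = 0.625). Because the code fits the model on the same 2009-2010 rows it is evaluated on, this is an in-sample figure; fitting on train_weekly and predicting on the 2009-2010 rows would give a true held-out accuracy.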
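For reference, a train-then-predict workflow for this kind of question can be sketched as follows. This is only an illustration on simulated data: the data frame `sim`, its coefficients, and the names `train`, `held_out`, `fit`, `probs`, `pred`, `conf`, and `acc` are assumptions made for the example, not the Weekly data or the original objects.

```r
set.seed(1)

# Simulated stand-in for the Weekly data: 10 "years" of 40 observations each.
n <- 400
sim <- data.frame(Year = rep(2001:2010, each = 40), Lag2 = rnorm(n))
sim$Direction <- factor(
  ifelse(runif(n) < plogis(0.2 + 0.3 * sim$Lag2), "Up", "Down"),
  levels = c("Down", "Up")
)

# Fit on the earlier period only.
train <- sim[sim$Year < 2009, ]
held_out <- sim[sim$Year >= 2009, ]
fit <- glm(Direction ~ Lag2, data = train, family = binomial)

# Predict on rows the model has not seen by passing them as newdata.
probs <- predict(fit, newdata = held_out, type = "response")
pred <- factor(ifelse(probs > 0.5, "Up", "Down"), levels = c("Down", "Up"))

# Confusion matrix and overall fraction of correct predictions.
conf <- table(predicted = pred, actual = held_out$Direction)
acc <- mean(pred == held_out$Direction)
conf
acc
```

The key detail is the `newdata` argument: without it, `predict()` returns fitted values for the rows the model was trained on.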
e) Repeat using LDA
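library(MASS) # provides lda() and qda()
lda.fit.2008 <- lda(Direction ~ Lag2, data = weekly) # note: fit on all of weekly, which includes the held-out years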
lda.pred.2008 <- predict(lda.fit.2008, weekly_2008)
lda.class <- lda.pred.2008$class
table(lda.class, weekly_2008$Direction)
##
## lda.class Down Up
## Down 9 5
## Up 34 56
mean(lda.class == weekly_2008$Direction)
## [1] 0.625
The confusion matrix from the linear discriminant analysis shows that (9 + 56) = 65 of the 104 observations were predicted correctly, so the model was right about 62.5% of the time (65 / 104 = 0.625).
f) Repeat using QDA
qda.fit.2008 <- qda(Direction ~ Lag2, data = weekly) # note: fit on all of weekly, which includes the held-out years
qda.class <- predict(qda.fit.2008, weekly_2008)$class
table(qda.class, weekly_2008$Direction)
##
## qda.class Down Up
## Down 0 0
## Up 43 61
mean(qda.class == weekly_2008$Direction)
## [1] 0.5865385
The confusion matrix from the quadratic discriminant analysis shows that QDA predicts Up for every week, so only the 61 Up weeks were predicted correctly. The model was right about 58.65% of the time (61 / 104 = 0.5865), which is simply the share of Up weeks in 2009-2010.
g) Repeat using KNN with K = 1
library(class) # provides knn()
train_x <- as.matrix(train_weekly$Lag2)
test_x <- as.matrix(weekly_2008$Lag2)
train_direction <- train_weekly$Direction
set.seed(1)
knn_pred <- knn(train_x, test_x, train_direction, k = 1)
table(knn_pred, weekly_2008$Direction)
##
## knn_pred Down Up
## Down 21 30
## Up 22 31
mean(knn_pred == weekly_2008$Direction)
## [1] 0.5
The confusion matrix from the K nearest neighbors classifier with K = 1 shows that (21 + 31) = 52 of the 104 observations were predicted correctly, so the model was right 50% of the time (52 / 104 = 0.50), no better than a coin flip.
h) Which of these methods appear to provide the best results for this data?
Logistic regression and linear discriminant analysis have the highest accuracy, both at 62.5%.
i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier
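# note: as in (d), this model is fit on weekly_2008 (the 2009-2010 rows) and predicts those same rows
glm_fit_i <- glm(Direction ~ Lag3 + Volume, data = weekly_2008, family = binomial)
glm.probs.i <- predict(glm_fit_i, type = "response")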
glm.pred.i <- rep("Down", length(glm.probs.i))
glm.pred.i[glm.probs.i>0.5] = "Up"
table(glm.pred.i, weekly_2008$Direction)
##
## glm.pred.i Down Up
## Down 0 1
## Up 43 60
For this question, we use Lag3 and Volume to predict Direction. The logistic regression predicts Up for all but one week, so it gets nearly all of the Up weeks right (60 of 61) and none of the 43 Down weeks. Overall, the confusion matrix shows that the model is correct about 57.69% of the time (60 / 104 = 0.5769).
lda.fit.i <- lda(Direction ~ Lag3 + Volume, data = weekly)
lda.pred.i <- predict(lda.fit.i, weekly_2008)
lda.class.i <- lda.pred.i$class
table(lda.class.i, weekly_2008$Direction)
##
## lda.class.i Down Up
## Down 1 4
## Up 42 57
mean(lda.class.i == weekly_2008$Direction)
## [1] 0.5576923
As with the logistic regression, the linear discriminant analysis correctly predicts most of the weeks when the direction is Up (57 of 61), but almost none of the weeks when it is Down (1 of 43). The confusion matrix shows that the model is correct about 55.77% of the time (58 / 104 = 0.5577).
qda.fit.i <- qda(Direction ~ Lag3 + Volume, data = weekly)
qda.class.i <- predict(qda.fit.i, weekly_2008)$class
table(qda.class.i, weekly_2008$Direction)
##
## qda.class.i Down Up
## Down 10 14
## Up 33 47
mean(qda.class.i == weekly_2008$Direction)
## [1] 0.5480769
The quadratic discriminant analysis predicts more of the Down weeks correctly (10 of 43), but fewer of the Up weeks (47 of 61). The confusion matrix shows that the model is correct about 54.81% of the time (57 / 104 = 0.5481).
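# note: as.matrix() uses only its first argument, so these matrices hold Lag3 alone;
# cbind(train_weekly$Lag3, train_weekly$Volume) would be needed to include Volume
train_x_i <- as.matrix(train_weekly$Lag3, train_weekly$Volume)
test_x_i <- as.matrix(weekly_2008$Lag3, weekly_2008$Volume)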
train_direction_i <- train_weekly$Direction
set.seed(1)
knn_pred_i <- knn(train_x_i, test_x_i, train_direction_i, k = 3)
table(knn_pred_i, weekly_2008$Direction)
##
## knn_pred_i Down Up
## Down 17 32
## Up 26 29
mean(knn_pred_i == weekly_2008$Direction)
## [1] 0.4423077
As with QDA, the K nearest neighbors classifier with K = 3 predicts more of the Down weeks correctly (17 of 43) at the cost of the Up weeks (29 of 61). Overall, the confusion matrix shows that the model is correct about 44.23% of the time (46 / 104 = 0.4423), which is the worst result among the methods tried here.
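When building a predictor matrix for `knn()` from more than one variable, the way the columns are combined matters. The sketch below uses made-up vectors (`lag3`, `volume`) to show that `as.matrix()` called with two vectors keeps only the first, whereas `cbind()` produces one column per predictor.

```r
# Toy predictor vectors (illustrative values, not taken from Weekly).
lag3 <- c(0.5, -1.2, 0.3, 2.1)
volume <- c(1.1, 0.9, 1.4, 2.0)

# as.matrix() converts its first argument; the second vector is silently ignored.
one_col <- as.matrix(lag3, volume)

# cbind() binds the vectors side by side, giving one column per predictor.
two_col <- cbind(Lag3 = lag3, Volume = volume)

dim(one_col)
dim(two_col)
```

Because KNN is distance-based, it is also common to standardize the columns (for example with `scale()`) so that a predictor on a larger scale does not dominate the distance.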
##Question 2
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
a) Create a binary variable (mpg01) that contains a 1 if mpg is above the median and a 0 if it is below. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
Auto <- read.csv("https://www.statlearning.com/s/Auto.csv",
header = TRUE, na.strings = "?")
Auto <- na.omit(Auto)
median(Auto$mpg)
## [1] 22.75
library(dplyr) # provides %>%, mutate() and case_when()
Auto <- Auto %>%
  mutate(mpg01 = case_when(mpg < 22.75 ~ 0,
                           mpg >= 22.75 ~ 1))
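The same indicator can be built without typing the median by hand. This is a small base-R sketch on a made-up `mpg` vector (not the Auto data); it assumes that values exactly equal to the median should be coded 0, following the wording "above the median".

```r
# Toy mpg values (illustrative only).
mpg <- c(18, 15, 36, 26, 22.5, 31)

# 1 if strictly above the median, 0 otherwise; the median is computed, not hardcoded.
mpg01 <- as.integer(mpg > median(mpg))

data.frame(mpg = mpg, mpg01 = mpg01)
```

Computing the threshold inside the expression keeps the indicator correct if rows are later added or removed.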
b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem the most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.
auto_no_name <- Auto %>%
dplyr::select(-"name")
cor(auto_no_name)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## mpg01 0.8369392 -0.7591939 -0.7534766 -0.6670526 -0.7577566
## acceleration year origin mpg01
## mpg 0.4233285 0.5805410 0.5652088 0.8369392
## cylinders -0.5046834 -0.3456474 -0.5689316 -0.7591939
## displacement -0.5438005 -0.3698552 -0.6145351 -0.7534766
## horsepower -0.6891955 -0.4163615 -0.4551715 -0.6670526
## weight -0.4168392 -0.3091199 -0.5850054 -0.7577566
## acceleration 1.0000000 0.2903161 0.2127458 0.3468215
## year 0.2903161 1.0000000 0.1815277 0.4299042
## origin 0.2127458 0.1815277 1.0000000 0.5136984
## mpg01 0.3468215 0.4299042 0.5136984 1.0000000
Apart from mpg itself, the variables most strongly correlated with mpg01 are cylinders (-0.76), weight (-0.76), and displacement (-0.75). All three correlations are negative, so heavier cars with more cylinders and larger engines tend to fall below the median mpg.
c) Split the data into a training set and a test set.
train_auto <- split(Auto, Auto$year<78)[[2]]
test_auto <- split(Auto, Auto$year<78)[[1]]
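Since `split()` on a logical condition returns its pieces in the order FALSE, TRUE, indexing by position is easy to get backwards. A hedged alternative, shown here on a made-up data frame `cars`, is to index the pieces by name.

```r
# Toy data frame (illustrative values, not the Auto data).
cars <- data.frame(year = c(70, 82, 75, 79, 77, 80),
                   mpg = c(18, 31, 22, 27, 20, 33))

# split() on a logical vector returns a list named "FALSE" and "TRUE", in that order.
parts <- split(cars, cars$year < 78)
names(parts)

# Indexing by name makes the intent explicit.
train <- parts[["TRUE"]]   # rows with year < 78
test <- parts[["FALSE"]]   # rows with year >= 78
```

Plain logical subsetting, such as `cars[cars$year < 78, ]`, is an equally explicit option.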
d) Perform LDA on the training data in order to predict mpg01 using the variables that seem the most associated with mpg01 in (b). What is the test error of the model obtained?
lda.fit.auto.79 <- lda(mpg01 ~ cylinders + displacement + weight, data = train_auto)
lda.pred.auto.79 <- predict(lda.fit.auto.79, test_auto)
lda.class.auto <- lda.pred.auto.79$class
table(lda.class.auto, test_auto$mpg01)
##
## lda.class.auto 0 1
## 0 35 14
## 1 4 97
mean(lda.class.auto == test_auto$mpg01)
## [1] 0.88
The confusion matrix from the linear discriminant analysis shows that (35 + 97) = 132 of the 150 test observations were predicted correctly, so the model was right 88% of the time (132 / 150 = 0.88).
The test error is one minus this accuracy: (1 - 0.88) = 0.12, or 12%.
e) Perform QDA on the training data in order to predict mpg01 using the variables that seem the most associated with mpg01 in (b). What is the test error of the model obtained?
qda.fit.auto.79 <- qda(mpg01 ~ cylinders + displacement + weight, data = train_auto)
qda.class <- predict(qda.fit.auto.79, test_auto)$class
table(qda.class, test_auto$mpg01)
##
## qda.class 0 1
## 0 36 20
## 1 3 91
mean(qda.class == test_auto$mpg01)
## [1] 0.8466667
The confusion matrix from the quadratic discriminant analysis shows that (36 + 91) = 127 of the 150 test observations were predicted correctly, so the model was right 84.67% of the time (127 / 150 = 0.8467).
The test error is one minus this accuracy: (1 - 0.8467) = 0.1533, or 15.33%.
f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seem the most associated with mpg01 in (b). What is the test error of the model obtained?
glm_fit_auto_79 <- glm(mpg01 ~ cylinders + displacement + weight, data = train_auto, family = binomial)
glm.probs.auto.79 <- predict(glm_fit_auto_79, test_auto, type = "response")
glm.pred.auto.79 <- rep(0, length(glm.probs.auto.79))
glm.pred.auto.79[glm.probs.auto.79>0.5] = 1
table(glm.pred.auto.79, test_auto$mpg01)
##
## glm.pred.auto.79 0 1
## 0 38 34
## 1 1 77
mean(glm.pred.auto.79 == test_auto$mpg01)
## [1] 0.7666667
The confusion matrix from the logistic regression shows that (38 + 77) = 115 of the 150 test observations were predicted correctly, so the model was right 76.67% of the time (115 / 150 = 0.7667).
The test error is one minus this accuracy: (1 - 0.7667) = 0.2333, or 23.33%.
g) Perform KNN on the training data (using several values of K) in order to predict mpg01 using the variables that seem the most associated with mpg01 in (b). What is the test error of the model obtained? Which value of K seems to perform the best on this data set?
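# note: as.matrix() uses only its first argument, so these matrices hold cylinders alone;
# cbind() would be needed to include displacement and weight as well
train_x_auto <- as.matrix(train_auto$cylinders, train_auto$displacement, train_auto$weight)
test_x_auto <- as.matrix(test_auto$cylinders, test_auto$displacement, test_auto$weight)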
train_mpg01 <- train_auto$mpg01
set.seed(1)
knn_pred_auto <- knn(train_x_auto, test_x_auto, train_mpg01, k = 1)
table(knn_pred_auto, test_auto$mpg01)
##
## knn_pred_auto 0 1
## 0 35 13
## 1 4 98
mean(knn_pred_auto == test_auto$mpg01)
## [1] 0.8866667
knn_pred_auto <- knn(train_x_auto, test_x_auto, train_mpg01, k = 3)
table(knn_pred_auto, test_auto$mpg01)
##
## knn_pred_auto 0 1
## 0 35 13
## 1 4 98
knn_pred_auto <- knn(train_x_auto, test_x_auto, train_mpg01, k = 8)
table(knn_pred_auto, test_auto$mpg01)
##
## knn_pred_auto 0 1
## 0 35 12
## 1 4 99
With K = 8, the confusion matrix shows that (35 + 99) = 134 of the 150 test observations were predicted correctly, so the model was right 89.33% of the time (134 / 150 = 0.8933). K = 1 and K = 3 produce identical confusion matrices and are not far behind, at 88.67% (133 / 150 = 0.8867). K = 8 therefore performs best among the values tried.
The test error for K = 8 is one minus this accuracy: (1 - 0.8933) = 0.1067, or 10.67%.
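Trying several values of K is easier to read as a loop than as repeated calls. The sketch below runs on simulated data: the predictors `weight` and `displacement`, the response `y`, and the 150/50 split are assumptions made for the example, so the error rates it prints say nothing about the Auto data. It uses `knn()` from the `class` package and standardizes the test set with the training means and standard deviations.

```r
library(class)  # provides knn()
set.seed(1)

# Simulated predictors and a binary response (illustrative only).
n <- 200
x <- cbind(weight = rnorm(n, 3000, 800),
           displacement = rnorm(n, 200, 100))
y <- factor(as.integer(x[, "weight"] + 2 * x[, "displacement"] +
                         rnorm(n, 0, 300) < 3400))

# First 150 rows for training, the remaining 50 for testing.
train_idx <- 1:150
train_x <- scale(x[train_idx, ])
test_x <- scale(x[-train_idx, ],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

# Test error (fraction misclassified) for each candidate K.
ks <- c(1, 3, 8)
test_error <- sapply(ks, function(k) {
  pred <- knn(train_x, test_x, cl = y[train_idx], k = k)
  mean(pred != y[-train_idx])
})
names(test_error) <- paste0("k=", ks)
test_error
```

Reusing the training center and scale for the test set avoids leaking test-set information into the preprocessing.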
Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
boston_data <- MASS::Boston
boston_data <- boston_data %>%
  mutate(high_crime = case_when(crim < median(crim) ~ 0,
                                crim > median(crim) ~ 1))
cor(boston_data)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## high_crime 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## high_crime -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128
## ptratio black lstat medv high_crime
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046 0.40939545
## zn -0.3916785 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252 0.60326017
## chas -0.1215152 0.04878848 -0.0539293 0.1752602 0.07009677
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208 0.72323480
## rm -0.3555015 0.12806864 -0.6138083 0.6953599 -0.15637178
## age 0.2615150 -0.27353398 0.6023385 -0.3769546 0.61393992
## dis -0.2324705 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262 0.61978625
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867 0.25356836
## black -0.1773833 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627 0.45326273
## medv -0.5077867 0.33346082 -0.7376627 1.0000000 -0.26301673
## high_crime 0.2535684 -0.35121093 0.4532627 -0.2630167 1.00000000
We create a new variable called high_crime, where 0 denotes a suburb with a crime rate below the median and 1 denotes a suburb with a crime rate above it. The correlation matrix shows that the variable most strongly correlated with high_crime is nox (nitrogen oxides concentration, 0.72), followed by a group with similar strength: rad (0.62), dis (weighted mean of distances to 5 Boston employment centers, -0.62), age (0.61), tax (full-value property-tax rate per $10,000, 0.61), and indus (0.60). We use nox, tax, and dis as predictors in the models below.
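Reading the strongest correlations off a 15 by 15 matrix is error-prone, so it can help to extract and sort the relevant column. This is a minimal base-R sketch; it builds its own copy of the indicator with `as.integer(crim > median(crim))`, which is an assumption that matches the `case_when()` version only when no suburb sits exactly at the median. The names `boston`, `cors`, and `ranked` are illustrative.

```r
# Boston data from the MASS package, with a 0/1 indicator for above-median crime.
boston <- MASS::Boston
boston$high_crime <- as.integer(boston$crim > median(boston$crim))

# Correlation of every variable with high_crime, dropping high_crime itself.
cors <- cor(boston)[, "high_crime"]
cors <- cors[names(cors) != "high_crime"]

# Sort by absolute value so strong negative correlations rank alongside positive ones.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
head(ranked, 6)
```

Sorting by absolute value matters here because dis is strongly but negatively correlated with high_crime.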
train_boston_dim <- 1:(dim(boston_data)[1]/2)
test_boston_dim <- (dim(boston_data)[1]/2 + 1):dim(boston_data)[1]
train_boston <- boston_data[train_boston_dim, ]
test_boston <- boston_data[test_boston_dim, ]
logistic_boston <- glm(high_crime ~ nox + tax + dis, data = train_boston, family = binomial)
glm.probs.boston <- predict(logistic_boston, test_boston, type = "response")
glm.preds.boston <- rep(0, length(glm.probs.boston))
glm.preds.boston[glm.probs.boston>0.5] = 1
table(glm.preds.boston, test_boston$high_crime)
##
## glm.preds.boston 0 1
## 0 72 8
## 1 18 155
mean(glm.preds.boston == test_boston$high_crime)
## [1] 0.8972332
The confusion matrix from the logistic regression shows that (72 + 155) = 227 of the 253 test observations were predicted correctly, so the model was right 89.72% of the time (227 / 253 = 0.8972).
The test error is one minus this accuracy: (1 - 0.8972) = 0.1028, or 10.28%.
lda.fit.boston <- lda(high_crime ~ nox + tax + dis, data = train_boston)
lda.pred.boston <- predict(lda.fit.boston, test_boston)
lda.class.boston <- lda.pred.boston$class
table(lda.class.boston, test_boston$high_crime)
##
## lda.class.boston 0 1
## 0 80 18
## 1 10 145
mean(lda.class.boston == test_boston$high_crime)
## [1] 0.8893281
The confusion matrix from the linear discriminant analysis shows that (80 + 145) = 225 of the 253 test observations were predicted correctly, so the model was right 88.93% of the time (225 / 253 = 0.8893).
The test error is one minus this accuracy: (1 - 0.8893) = 0.1107, or 11.07%.
qda.fit.boston <- qda(high_crime ~ nox + tax + dis, data = train_boston)
qda.class.boston <- predict(qda.fit.boston, test_boston)$class
table(qda.class.boston, test_boston$high_crime)
##
## qda.class.boston 0 1
## 0 83 75
## 1 7 88
mean(qda.class.boston == test_boston$high_crime)
## [1] 0.6758893
The confusion matrix from the quadratic discriminant analysis shows that (83 + 88) = 171 of the 253 test observations were predicted correctly, so the model was right 67.59% of the time (171 / 253 = 0.6759).
The test error is one minus this accuracy: (1 - 0.6759) = 0.3241, or 32.41%.
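# note: as.matrix() uses only its first argument, so these matrices hold nox alone;
# cbind() would be needed to include tax and dis as well
train_x_boston <- as.matrix(train_boston$nox, train_boston$tax, train_boston$dis)
test_x_boston <- as.matrix(test_boston$nox, test_boston$tax, test_boston$dis)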
train_high_crime <- train_boston$high_crime
set.seed(1)
knn_pred_boston <- knn(train_x_boston, test_x_boston, train_high_crime, k = 3)
table(knn_pred_boston, test_boston$high_crime)
##
## knn_pred_boston 0 1
## 0 85 37
## 1 5 126
mean(knn_pred_boston == test_boston$high_crime)
## [1] 0.8339921
The confusion matrix from the K nearest neighbors classifier with K = 3 shows that (85 + 126) = 211 of the 253 test observations were predicted correctly, so the model was right 83.40% of the time (211 / 253 = 0.8340).
The test error is one minus this accuracy: (1 - 0.8340) = 0.166, or 16.6%.
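To compare the four Boston models side by side, the test errors can be recomputed from the diagonal counts of the confusion matrices printed above. This is a small sketch; the vector names `correct`, `n_test`, and `test_error` are illustrative.

```r
# Correct predictions per model, taken from the diagonals of the tables above.
correct <- c(logistic = 72 + 155,
             lda = 80 + 145,
             qda = 83 + 88,
             knn_k3 = 85 + 126)
n_test <- 253

# Test error is one minus the fraction of correct predictions.
test_error <- 1 - correct / n_test
round(sort(test_error), 4)
```

On this particular split, logistic regression and LDA have the lowest test error, with QDA clearly the weakest of the four.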