This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
library(ISLR2)
attach(Weekly)
pairs(Weekly)
cor(Weekly[,-9])
## Year Lag1 Lag2 Lag3 Lag4
## Year 1.00000000 -0.032289274 -0.03339001 -0.03000649 -0.031127923
## Lag1 -0.03228927 1.000000000 -0.07485305 0.05863568 -0.071273876
## Lag2 -0.03339001 -0.074853051 1.00000000 -0.07572091 0.058381535
## Lag3 -0.03000649 0.058635682 -0.07572091 1.00000000 -0.075395865
## Lag4 -0.03112792 -0.071273876 0.05838153 -0.07539587 1.000000000
## Lag5 -0.03051910 -0.008183096 -0.07249948 0.06065717 -0.075675027
## Volume 0.84194162 -0.064951313 -0.08551314 -0.06928771 -0.061074617
## Today -0.03245989 -0.075031842 0.05916672 -0.07124364 -0.007825873
## Lag5 Volume Today
## Year -0.030519101 0.84194162 -0.032459894
## Lag1 -0.008183096 -0.06495131 -0.075031842
## Lag2 -0.072499482 -0.08551314 0.059166717
## Lag3 0.060657175 -0.06928771 -0.071243639
## Lag4 -0.075675027 -0.06107462 -0.007825873
## Lag5 1.000000000 -0.05851741 0.011012698
## Volume -0.058517414 1.00000000 -0.033077783
## Today 0.011012698 -0.03307778 1.000000000
For year, there are no strong correlations with Lag1-5; however, there appears to be an increase of variance (similar to a cone). There is a strong correlation with Year and Volume, and this appears to be non-linear.
fullWeekly.fit <- glm(Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family=binomial)
summary(fullWeekly.fit)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
Lag2 is the only statistically significant predictors for this model, with a P-value of 0.0296.
glm.probs=predict(fullWeekly.fit,type="response")
glm.pred=rep("Down", length(glm.probs))
glm.pred[glm.probs>.5]="Up"
table(glm.pred,Direction)
## Direction
## glm.pred Down Up
## Down 54 48
## Up 430 557
mean(glm.pred == Direction)
## [1] 0.5610652
Correct predictions include 557 Up and 54
Down, out of 1089 total, for a total correct percentage of
(557+54)/1089 = 0.5610652. This means that the model
overall predicted its direction 56.1% of the time.
Looking at how the model correctly predicts Up, we take
the amount of correctly predicted 557 Up, and divide it by
the total amount of Up that occured:
557/(557+48) = 0.9206612. This shows that when predicting
Up, they did so with a 92% correct rate.
Looking at how the model correctly predicts Down, we
take the amount of correctly predicted 54 Down, and divide
it by the total amount of Down that occured:
54/(54+430) = 0.1115702. This shows that when predicting
Down, they did so with an 11% correct rate.
This model is overall best for predicting when the market will go up, not down.
train.Weekly <- (Year<2009)
weekly.2009.2010 <- Weekly[!train.Weekly,]
lag2.fit <- glm(Direction ~ Lag2, data=Weekly, family=binomial, subset=train.Weekly)
weekly.prob <- predict(lag2.fit, weekly.2009.2010, type='response')
weekly.pred=rep("Down",length(weekly.prob))
weekly.pred[weekly.prob>.5]="Up"
Direction.2009.2010 <- Direction[!train.Weekly]
table(weekly.pred,Direction.2009.2010)
## Direction.2009.2010
## weekly.pred Down Up
## Down 9 5
## Up 34 56
mean(weekly.pred == Direction.2009.2010)
## [1] 0.625
Overall, this model predicts with a 62.5% accuracy.
Correct predictions include 56 Up and 9
Down, out of 104 total, for a total correct percentage of
(56+9)/104 = 0.625. This means that the model overall
predicted its weekly direction 62.5% of the time.
Looking at how the model correctly predicts Up, we take
the amount of correctly predicted 56 Up, and divide it by
the total amount of Up that occured:
56/(56+5) = 0.9180328. This shows that when predicting
Up, they did so with a 92% correct rate.
Looking at how the model correctly predicts Down, we
take the amount of correctly predicted 9 Down, and divide
it by the total amount of Down that occured:
9/(9+34) = 0.2093023. This shows that when predicting
Down, they did so with an 21% correct rate.
This model is overall best for predicting when the market will go up, not down.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
lda.lag2.fit <- lda(Direction ~ Lag2, data=Weekly, subset=train.Weekly)
lda.weekly.prob <- predict(lda.lag2.fit, weekly.2009.2010, type='response')
lda.weekly.pred=predict(lda.lag2.fit,weekly.2009.2010)
lda.Direction.2009.2010 <- Direction[!train.Weekly]
lda.weekly.class=lda.weekly.pred$class
table(lda.weekly.class,lda.Direction.2009.2010)
## lda.Direction.2009.2010
## lda.weekly.class Down Up
## Down 9 5
## Up 34 56
mean(lda.weekly.class == Direction.2009.2010)
## [1] 0.625
There is no difference between the LDA and GLM models
qda.lag2.fit=qda(Direction~Lag2,data=Weekly,subset=train.Weekly)
qda.weekly.class=predict(qda.lag2.fit,weekly.2009.2010)$class
table(qda.weekly.class,lda.Direction.2009.2010)
## lda.Direction.2009.2010
## qda.weekly.class Down Up
## Down 0 0
## Up 43 61
mean(qda.weekly.class==lda.Direction.2009.2010)
## [1] 0.5865385
The QDA model predicts the Direction of the market with a 58.6%
accuracy, which is less than the previous models, and only predicts when
the direction is going Up, not Down
library(class)
train.X <- as.matrix(Lag2[train.Weekly])
test.X <- as.matrix(Lag2[!train.Weekly])
train.Direction=Direction[train.Weekly]
set.seed(1)
knn.pred=knn(as.data.frame(train.X),as.data.frame(test.X),train.Direction,k=1)
table(knn.pred,Direction.2009.2010)
## Direction.2009.2010
## knn.pred Down Up
## Down 21 30
## Up 22 31
mean(knn.pred == Direction.2009.2010)
## [1] 0.5
The KNN accuracy is 50%, which is lower than the previous models.
It predicts Up with an accuracy of 51%
(31/(31+30) = 0.5081967)) and down with an accuracy of 49%
(21/(21+22) = 0.4883721)).
The overall model and each of the predictors is no more than a coin flip.
library(e1071)
nb.fit <- naiveBayes(Direction~Lag2, data=Weekly, subset = train.Weekly)
nb.class <- predict(nb.fit , weekly.2009.2010)
table (nb.class, Direction.2009.2010)
## Direction.2009.2010
## nb.class Down Up
## Down 0 0
## Up 43 61
mean (nb.class == Direction.2009.2010)
## [1] 0.5865385
This model is the same as question (f) above
The GLM and LDA models have the best result, with 62.5% total
accuracy, and 92% accuracy when predicting the market going
Up
knn.pred2=knn(as.data.frame(train.X),as.data.frame(test.X),train.Direction,k=2)
table(knn.pred2,Direction.2009.2010)
## Direction.2009.2010
## knn.pred2 Down Up
## Down 18 25
## Up 25 36
mean(knn.pred2 == Direction.2009.2010)
## [1] 0.5192308
knn.pred3=knn(as.data.frame(train.X),as.data.frame(test.X),train.Direction,k=2)
table(knn.pred3,Direction.2009.2010)
## Direction.2009.2010
## knn.pred3 Down Up
## Down 22 29
## Up 21 32
mean(knn.pred3 == Direction.2009.2010)
## [1] 0.5192308
knn.pred10=knn(as.data.frame(train.X),as.data.frame(test.X),train.Direction,k=2)
table(knn.pred10,Direction.2009.2010)
## Direction.2009.2010
## knn.pred10 Down Up
## Down 18 24
## Up 25 37
mean(knn.pred10 == Direction.2009.2010)
## [1] 0.5288462
As K increases, the results of KNN increase overall as a model
detach(Weekly)
In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.
attach(Auto)
median(mpg)
## [1] 22.75
Auto$mpg01 <- rep(0,length(mpg))
Auto$mpg01[mpg>=median(mpg)]=1
Auto$mpg01 <- as.factor(Auto$mpg01)
Auto[13:17,]
## mpg cylinders displacement horsepower weight acceleration year origin
## 13 15 8 400 150 3761 9.5 70 1
## 14 14 8 455 225 3086 10.0 70 1
## 15 24 4 113 95 2372 15.0 70 3
## 16 22 6 198 95 2833 15.5 70 1
## 17 18 6 199 97 2774 15.5 70 1
## name mpg01
## 13 chevrolet monte carlo 0
## 14 buick estate wagon (sw) 0
## 15 toyota corona mark ii 1
## 16 plymouth duster 0
## 17 amc hornet 0
pairs(Auto)
boxplot(Auto)
Eyeballing the mpg01 scatterplots, it looks like
horsepower, weight, and
acceleration are going to be the features best used to
predict mpg01.
train.Auto.70s <- (year<80)
test.Auto.80s <- Auto[!train.Auto.70s,]
library(MASS)
lda.Auto.fit <- lda(mpg01~horsepower+weight+acceleration, data=Auto, subset=train.Auto.70s)
lda.Auto.prob <- predict(lda.Auto.fit, test.Auto.80s, type='response')
lda.Auto.pred <- predict(lda.Auto.fit, test.Auto.80s)
lda.mpg01.Auto80s <- Auto$mpg01[!train.Auto.70s]
lda.Auto.class <- lda.Auto.pred$class
table(lda.Auto.class,lda.mpg01.Auto80s)
## lda.mpg01.Auto80s
## lda.Auto.class 0 1
## 0 4 14
## 1 1 66
mean(lda.Auto.class == lda.mpg01.Auto80s)
## [1] 0.8235294
This LDA model was built to predict mpg01 with
horsepower, weight, and
acceleration as predictors. 10 years of data (1970-1979)
were used to predict the results of the remaining 3 years (1980-1982).
This resulted in a model that was accurate 82.3% of the time, with
accuracy on predicting above-median mpg (mpg01 = 1) of
82.5% (66/(66+14) = 0.825), and predicting below-median mpg
(mpg01 = 0) of 80% (4/(4+1) = 0.8).
qda.Auto.fit <- qda(mpg01~horsepower+weight+acceleration, data=Auto, subset=train.Auto.70s)
qda.Auto.class <- predict(qda.Auto.fit, test.Auto.80s)$class
table(qda.Auto.class, lda.mpg01.Auto80s)
## lda.mpg01.Auto80s
## qda.Auto.class 0 1
## 0 5 11
## 1 0 69
mean(qda.Auto.class==lda.mpg01.Auto80s)
## [1] 0.8705882
This QDA model was built using the same requirements from Question
14e. This resulted in a model that was accurate 87% of the time, with
accuracy on predicting above-median mpg (mpg01 = 1) of
82.5% (69/(69+11) = 0.8625), and predicting below-median
mpg (mpg01 = 0) of 100% (5/(5+0) = 1).
glm.Auto.fit <- glm(Auto$mpg01~horsepower+weight+acceleration, data=as.data.frame(train.Auto.70s), family=binomial)
glm.Auto.probs <- predict(glm.Auto.fit, test.Auto.80s, type="response")
glm.Auto.pred <- rep(0, length(glm.Auto.probs))
glm.Auto.pred[glm.Auto.probs > 0.5] <- 1
table(glm.Auto.pred, Auto$mpg01[!train.Auto.70s])
##
## glm.Auto.pred 0 1
## 0 5 12
## 1 0 68
mean(glm.Auto.pred == Auto$mpg01[!train.Auto.70s])
## [1] 0.8588235
This GLM model was built using the same requirements from Question
14e. This resulted in a model that was accurate 85.9% of the time, with
accuracy on predicting above-median mpg (mpg01 = 1) of 85%
(68/(68+12) = 0.85), and predicting below-median mpg
(mpg01 = 0) of 100% (5/(5+0) = 1).
library(e1071)
nb.Auto.fit <- naiveBayes(Auto$mpg01~horsepower+weight+acceleration, data=Auto, subset = train.Auto.70s)
nb.Auto.class <- predict(nb.Auto.fit, test.Auto.80s)
table(nb.Auto.class, lda.mpg01.Auto80s)
## lda.mpg01.Auto80s
## nb.Auto.class 0 1
## 0 4 7
## 1 1 73
mean(nb.Auto.class==lda.mpg01.Auto80s)
## [1] 0.9058824
This naive Bayes model was built using the same requirements from
Question 14e. This resulted in a model that was accurate 90.5% of the
time, with accuracy on predicting above-median mpg
(mpg01 = 1) of 90.5% (73/(7+73) = 0.9058824),
and predicting below-median mpg (mpg01 = 0) of 80%
(4/(4+1) = 0.8).
library(class)
train.Auto.KNN.x <- Auto[train.Auto.70s, c(4:6)]
test.Auto.KNN.x <- Auto[!train.Auto.70s, c(4:6)]
train.mpg01 <- Auto$mpg01[train.Auto.70s]
set.seed(1)
knn.Auto.pred <- knn(as.data.frame(train.Auto.KNN.x), as.data.frame(test.Auto.KNN.x), train.mpg01, k=1)
table(knn.Auto.pred,lda.mpg01.Auto80s)
## lda.mpg01.Auto80s
## knn.Auto.pred 0 1
## 0 5 19
## 1 0 61
mean(knn.Auto.pred == lda.mpg01.Auto80s)
## [1] 0.7764706
This KNN model was built using the same requirements from Question
14e. This resulted in a model that was accurate 90.5% of the time, with
accuracy on predicting above-median mpg (mpg01 = 1) of
76.3% (61/(61+19) = 0.7625), and predicting below-median
mpg (mpg01 = 0) of 100% (5/(5+0) = 1).
detach(Auto)
Using the Boston data set, fit classification models in order to predict whether a given census tract has a crime rate above or below the median. Explore logistic regression, LDA, naive Bayes, and KNN models using various subsets of the predictors. Describe your findings.
Hint: You will have to create the response variable yourself, using the variables that are contained in the Boston data set
Once the median of crime is found (0.25651), a new
column is created with 0 entered. Then, if
crim is greater than the median, the med_crim is updated to
show 1
library(ISLR2)
attach(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
median(crim)
## [1] 0.25651
Boston$med_crim <- rep(0,length(crim))
Boston$med_crim[crim>=median(crim)]=1
table(Boston$med_crim)
##
## 0 1
## 253 253
Utilizing table(med_crim) we see that there are an equal
amount of observations above and below the median, so this works!
Next we need to explore the data. Using scatter plots, we can look to
see what variables may be related to med_crim
pairs(Boston)
cor(Boston)
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## med_crim 0.40939545 -0.43615103 0.60326017 0.070096774 0.72323480
## rm age dis rad tax
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## med_crim -0.15637178 0.61393992 -0.61634164 0.619786249 0.60874128
## ptratio black lstat medv med_crim
## crim 0.2899456 -0.38506394 0.4556215 -0.3883046 0.40939545
## zn -0.3916785 0.17552032 -0.4129946 0.3604453 -0.43615103
## indus 0.3832476 -0.35697654 0.6037997 -0.4837252 0.60326017
## chas -0.1215152 0.04878848 -0.0539293 0.1752602 0.07009677
## nox 0.1889327 -0.38005064 0.5908789 -0.4273208 0.72323480
## rm -0.3555015 0.12806864 -0.6138083 0.6953599 -0.15637178
## age 0.2615150 -0.27353398 0.6023385 -0.3769546 0.61393992
## dis -0.2324705 0.29151167 -0.4969958 0.2499287 -0.61634164
## rad 0.4647412 -0.44441282 0.4886763 -0.3816262 0.61978625
## tax 0.4608530 -0.44180801 0.5439934 -0.4685359 0.60874128
## ptratio 1.0000000 -0.17738330 0.3740443 -0.5077867 0.25356836
## black -0.1773833 1.00000000 -0.3660869 0.3334608 -0.35121093
## lstat 0.3740443 -0.36608690 1.0000000 -0.7376627 0.45326273
## medv -0.5077867 0.33346082 -0.7376627 1.0000000 -0.26301673
## med_crim 0.2535684 -0.35121093 0.4532627 -0.2630167 1.00000000
There are 6 variables that are correlated over +-0.6 to
med_crim: nox, rad,
dis, age, tax, and
indus.
Now that we have some variables to isolate, we can begin working on training our data and running models.
I opted to use the top 70% of the Boston data as training data, and the remaining 30% as the test data. (I had originally attempted to do a random split of 70/30 but after 2 days of failing to get the KNN work, I opted the easier route or non-random. My apologies if this makes anybody sad).
train.Boston <- Boston[(1:(506*.7)),]
test.Boston <- Boston[(506*.7):506,]
test.med_crim <- Boston$med_crim[(506*.7):506]
This provides us with 354 observations in our training data, and 152 in the test data. We can now begin building our models!
Before going forward let’s check to see if all predictors are significant
glm.Boston.fit <- glm(med_crim~nox+rad+dis+age+tax+indus, data=train.Boston, family = binomial)
summary(glm.Boston.fit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -28.270989601 4.318242155 -6.546875 5.875368e-11
## nox 50.296106293 8.470018059 5.938134 2.882835e-09
## rad 0.556876514 0.129065450 4.314683 1.598320e-05
## dis 0.168815284 0.151745667 1.112488 2.659282e-01
## age 0.003096109 0.009206145 0.336309 7.366379e-01
## tax -0.003905682 0.003161934 -1.235219 2.167489e-01
## indus -0.088495592 0.049471309 -1.788827 7.364275e-02
It appears that only nox and rad are
significant (p<0.05), so the others will be removed from the model
going forward.
glm.Boston.fit <- glm(med_crim~nox+rad, data=train.Boston, family = binomial)
glm.Boston.prob <- predict(glm.Boston.fit, test.Boston, type='response')
glm.Boston.pred <- rep(0,length(glm.Boston.prob))
glm.Boston.pred[glm.Boston.prob>.5] <- 1
table(glm.Boston.pred, test.med_crim)
## test.med_crim
## glm.Boston.pred 0 1
## 0 7 0
## 1 10 135
mean(glm.Boston.pred == test.med_crim)
## [1] 0.9342105
This analysis resulted in a model that was accurate 93.4% of the
time, with accuracy on predicting above-median crime rate
(med_crim = 1) of 100% (135/(135+0) = 1.0),
and predicting below-median crime rate (med_crim = 0) of
41.2% (7/(7+10) = 0.4117647).
library(MASS)
lda.Boston.fit <- lda(med_crim~nox+rad, data=train.Boston)
lda.Boston.pred <- predict(lda.Boston.fit, test.Boston)
lda.Boston.class <- lda.Boston.pred$class
table(lda.Boston.class, test.med_crim)
## test.med_crim
## lda.Boston.class 0 1
## 0 7 0
## 1 10 135
mean(lda.Boston.class==test.med_crim)
## [1] 0.9342105
This analysis resulted in a model that was accurate 93.4% of the
time, with accuracy on predicting above-median crime rate
(med_crim = 1) of 100% (135/(135+0) = 1.0),
and predicting below-median crime rate (med_crim = 0) of
41.2% (7/(7+10) = 0.4117647).
qda.Boston.fit <- qda(med_crim~nox+rad, data=train.Boston)
qda.Boston.pred <- predict(qda.Boston.fit, test.Boston)
qda.Boston.class <- qda.Boston.pred$class
table(qda.Boston.class, test.med_crim)
## test.med_crim
## qda.Boston.class 0 1
## 0 7 132
## 1 10 3
mean(qda.Boston.class==test.med_crim)
## [1] 0.06578947
This analysis resulted in a model that was accurate 65.8% of the
time, with accuracy on predicting above-median crime rate
(med_crim = 1) of 2.2%
(3/(132+3) = 0.02222222), and predicting below-median crime
rate (med_crim = 0) of 41.2%
(7/(7+10) = 0.4117647).
library(class)
train.Boston.KNN.X=cbind(nox, rad)[(1:(506*.7)),]
test.Boston.KNN.X=cbind(nox, rad)[(506*.7):506,]
train.Boston.KNN.medCrime=Boston$med_crim[(1:(506*.7))]
set.seed(1)
knn.Boston.pred=knn(train.Boston.KNN.X, test.Boston.KNN.X, train.Boston.KNN.medCrime,k=1)
table(knn.Boston.pred, test.med_crim)
## test.med_crim
## knn.Boston.pred 0 1
## 0 12 3
## 1 5 132
knn.Boston.pred3=knn(train.Boston.KNN.X, test.Boston.KNN.X, train.Boston.KNN.medCrime,k=3)
table(knn.Boston.pred3, test.med_crim)
## test.med_crim
## knn.Boston.pred3 0 1
## 0 12 3
## 1 5 132
knn.Boston.pred10=knn(train.Boston.KNN.X, test.Boston.KNN.X, train.Boston.KNN.medCrime,k=10)
table(knn.Boston.pred10, test.med_crim)
## test.med_crim
## knn.Boston.pred10 0 1
## 0 12 3
## 1 5 132
This analysis resulted in a model that was accurate 94.7%
((132+12)/(132+12+3+5)=0.9473684) of the time, with
accuracy on predicting above-median crime rate
(med_crim = 1) of 98%
(132/(132+3) = 0.9777778), and predicting below-median
crime rate (med_crim = 0) of 70.6%
(12/(12+5) = 0.7058824). the number of K did not change the
result, which means I definitely screwed up somewhere, but I’ve been
trying to make this work for the last 2 days and I’ll just call it good
enough for now…
library(e1071)
nb.Boston.fit <- naiveBayes(med_crim~nox+rad, data=train.Boston)
nb.Boston.class <- predict(nb.Boston.fit, test.Boston)
table(nb.Boston.class, test.med_crim)
## test.med_crim
## nb.Boston.class 0 1
## 0 7 124
## 1 10 11
mean(nb.Boston.class == test.med_crim)
## [1] 0.1184211
This analysis resulted in a model that was accurate 11.8% of the
time, with accuracy on predicting above-median crime rate
(med_crim = 1) of 8.1%
(11/(11+124) = 0.08148148), and predicting below-median
crime rate (med_crim = 0) of 41.1%
(7/(7+10) =0.4117647). Obviously this is pretty bad…
So I feel that I completely screwed up this assignment by overly complicating some areas and making the variable names too similar to each other and putting all 3 questions on one page. I need to find a better way to make these more of a plug and play system (like a function?) that could produce the same info accurately instead of handjamming each individual line and question. But hey, this is learning right? I know some of the numbers above are definitely off, but I’ll learn better in future assignments.