library(e1071)
Warning: package ‘e1071’ was built under R version 4.4.2
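Only e1071 is attached explicitly here; the functions used below come from several other packages. A minimal sketch of the library calls assumed for the rest of the analysis (the package choices are an assumption inferred from the functions called — in particular, Weekly and Auto ship with ISLR/ISLR2, and a Boston data frame that still contains the black column ships with MASS):
library(ISLR)      # Weekly and Auto data (assumption; ISLR2 provides them as well)
library(MASS)      # lda(), qda(), and the Boston data (loaded before dplyr so select() is not masked)
library(dplyr)     # select()
library(ggplot2)   # ggplot()
library(GGally)    # ggpairs(), ggscatmat()
library(corrplot)  # corrplot()
library(skimr)     # skim()
library(caret)     # confusionMatrix(), createDataPartition()
library(class)     # knn()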
str(Weekly)
'data.frame': 1089 obs. of 9 variables:
$ Year : num 1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
$ Lag1 : num 0.816 -0.27 -2.576 3.514 0.712 ...
$ Lag2 : num 1.572 0.816 -0.27 -2.576 3.514 ...
$ Lag3 : num -3.936 1.572 0.816 -0.27 -2.576 ...
$ Lag4 : num -0.229 -3.936 1.572 0.816 -0.27 ...
$ Lag5 : num -3.484 -0.229 -3.936 1.572 0.816 ...
$ Volume : num 0.155 0.149 0.16 0.162 0.154 ...
$ Today : num -0.27 -2.576 3.514 0.712 1.178 ...
$ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...
Weekly |>
  ggplot(aes(y = as.factor(Year), fill = Direction)) +
  geom_bar() +
  theme_minimal()
pairs(Weekly[,2:8])
Weekly|>
select(Lag1,Lag2,Lag3,Volume,Today)|>
ggpairs()
Weekly|>
select(Lag4,Lag5, Volume, Today)|>
ggpairs()
Weekly|>
ggscatmat(columns = 2:8, alpha = 0.2)
Week = Weekly |> select(where(is.numeric))
cor.mat = cor(Week)   # avoid masking the cor() function with an object named "cor"
corrplot(cor.mat, method = "number")
skim(Weekly)
── Data Summary ────────────────────────
                           Values
Name                       Weekly
Number of rows             1089
Number of columns          9
_______________________
Column type frequency:
  factor                   1
  numeric                  8
________________________
Group variables            None
Looking at these graphical and numerical summaries, there is little correlation among the lag variables; only Year and Volume show a clear positive correlation with one another. Most of the pairwise scatterplots show no particular pattern, just a diffuse cloud of points. The lag variables and Today look roughly normally distributed, while Volume is right-skewed: its mean exceeds its median and the distribution has a long right tail.
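As a quick numerical check of the skewness claim, a small sketch using the skewness() function from the already-attached e1071 package:
# right skew: mean pulled above the median, positive skewness statistic
with(Weekly, c(mean = mean(Volume),
               median = median(Volume),
               skewness = skewness(Volume)))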
glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = "binomial" )
summary(glm.fit)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = "binomial", data = Weekly)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.26686    0.08593   3.106   0.0019 **
Lag1         -0.04127    0.02641  -1.563   0.1181
Lag2          0.05844    0.02686   2.175   0.0296 *
Lag3         -0.01606    0.02666  -0.602   0.5469
Lag4         -0.02779    0.02646  -1.050   0.2937
Lag5         -0.01447    0.02638  -0.549   0.5833
Volume       -0.02274    0.03690  -0.616   0.5377
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1496.2 on 1088 degrees of freedom
Residual deviance: 1486.4 on 1082 degrees of freedom
AIC: 1500.4
Number of Fisher Scoring iterations: 4
At the \(5\%\) level the only statistically significant predictor is Lag2 (p = 0.0296). For Lag2 we reject the null hypothesis that its coefficient equals zero; for the remaining predictors we fail to reject.
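The p-value for Lag2 reported above comes from a Wald test of the coefficient against zero; a short sketch reproducing it from the fitted model:
# z = estimate / standard error, compared against a standard normal
lag2 = coef(summary(glm.fit))["Lag2", ]
z = lag2["Estimate"] / lag2["Std. Error"]   # 0.05844 / 0.02686 ≈ 2.175
2 * pnorm(abs(z), lower.tail = FALSE)       # two-sided p-value ≈ 0.0296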
pred = predict(glm.fit, Weekly, type = "response")
pred.labels = ifelse(pred > .5, "Up", "Down")
actual.label = factor(Weekly$Direction)
pred.labels = factor(pred.labels)
confusionMatrix(pred.labels, actual.label)
Confusion Matrix and Statistics
Reference
Prediction Down Up
Down 54 48
Up 430 557
Accuracy : 0.5611
95% CI : (0.531, 0.5908)
No Information Rate : 0.5556
P-Value [Acc > NIR] : 0.369
Kappa : 0.035
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.11157
Specificity : 0.92066
Pos Pred Value : 0.52941
Neg Pred Value : 0.56434
Prevalence : 0.44444
Detection Rate : 0.04959
Detection Prevalence : 0.09366
Balanced Accuracy : 0.51612
'Positive' Class : Down
There are many false negatives (430) and far fewer false positives (48), with 'Down' taken as the positive class. Sensitivity is \(\frac{TP}{TP+FN} = \frac{54}{54+430} \approx 0.112\), meaning only about \(11\%\) of the actual Down weeks are correctly classified. Specificity, the model's ability to correctly identify negative instances, is \(\frac{TN}{TN+FP} = \frac{557}{557+48} \approx 0.921\), so about \(92\%\) of the Up weeks are correctly classified. The model predicts 'Up' for almost every week, which suggests that logistic regression on all six predictors is not a good model for predicting whether the S&P 500 will move up or down in a given week.
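A short sketch recovering the sensitivity and specificity directly from the confusion matrix, with 'Down' treated as the positive class as in the caret output:
tab = table(Prediction = pred.labels, Reference = actual.label)
tab["Down", "Down"] / sum(tab[, "Down"])   # sensitivity: 54 / (54 + 430) ≈ 0.112
tab["Up", "Up"] / sum(tab[, "Up"])         # specificity: 557 / (557 + 48) ≈ 0.921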
train = (Weekly$Year < 2009)
test = Weekly[!train,]
dim(test)
[1] 104 9
Direction.2009 = Weekly$Direction[!train]
glm.fit2 = glm(Direction ~ Lag2, data = Weekly, family = "binomial", subset = train)
glm.probs = predict(glm.fit2, test, type = "response")
glm.pred2 = rep("Down", length(glm.probs))
glm.pred2[glm.probs > .5] = "Up"
table(glm.pred2, Direction.2009)
Direction.2009
glm.pred2 Down Up
Down 9 5
Up 34 56
mean(glm.pred2 == Direction.2009)
[1] 0.625
mean(glm.pred2 != Direction.2009)
[1] 0.375
lda.model = lda(Direction ~ Lag2, data = Weekly, subset = train)
lda.probs = predict(lda.model, test)          # predict() for lda ignores a type argument
lda.probs_up = lda.probs$posterior[, "Up"]    # posterior probability of the "Up" class
lda.pred = rep("Down", length(lda.probs_up))
lda.pred[lda.probs_up > .5] = "Up"
table(lda.pred, Direction.2009)
Direction.2009
lda.pred Down Up
Down 9 5
Up 34 56
mean(lda.pred == Direction.2009)
[1] 0.625
mean(lda.pred != Direction.2009)
[1] 0.375
qda.model = qda(Direction ~ Lag2, data = Weekly, subset = train)
qda.pred = predict(qda.model,test)$class
table(qda.pred, Direction.2009)
Direction.2009
qda.pred Down Up
Down 0 0
Up 43 61
mean(qda.pred == Direction.2009)
[1] 0.5865385
mean(qda.pred != Direction.2009)
[1] 0.4134615
week.train = as.matrix(Weekly$Lag2[train])
week.test = as.matrix(Weekly$Lag2[!train])
train.direction = Weekly$Direction[train]
set.seed(1)
knn = knn(week.train,week.test,train.direction,k = 1)
table(knn, Direction.2009)
Direction.2009
knn Down Up
Down 21 30
Up 22 31
mean(knn != Direction.2009)
[1] 0.5
bayes = naiveBayes(Direction ~ Lag2, Weekly, subset = train)
bayes.pred = predict(bayes, test)
table(bayes.pred, Direction.2009)
Direction.2009
bayes.pred Down Up
Down 0 0
Up 43 61
mean(bayes.pred == Direction.2009)
[1] 0.5865385
mean(bayes.pred != Direction.2009)
[1] 0.4134615
Logistic regression and linear discriminant analysis, each with an accuracy of \(62.5\%\) on the held-out 2009-2010 weeks, are the best methods for predicting the direction of the stock market when trained on the 1990-2008 data.
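The two confusion matrices above are identical, which suggests the logistic and LDA classifiers make the same call on the test weeks; a quick sketch to check the week-by-week agreement directly:
mean(glm.pred2 == lda.pred)   # proportion of test weeks where the two classifiers agree
table(glm.pred2, lda.pred)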
holdout = (Weekly$Year > 2009)
glm.fit2 = glm(Direction ~ Lag2 + exp(Lag3) + Lag4 + Volume, data = Weekly, family = "binomial", subset = holdout)
glm.probs = predict(glm.fit2, test, type = "response")
glm.pred2 = rep("Down", length(glm.probs))
glm.pred2[glm.probs > .5] = "Up"
table(glm.pred2, Direction.2009)
Direction.2009
glm.pred2 Down Up
Down 15 13
Up 28 48
mean(glm.pred2 == Direction.2009)
[1] 0.6057692
mean(glm.pred2 != Direction.2009)
[1] 0.3942308
lda.model = lda(Direction ~ Lag1 + Lag2 + Lag5 + Volume, data = Weekly, subset = holdout)
lda.probs = predict(lda.model, test)
lda.probs_up = lda.probs$posterior[, "Up"]
lda.pred = rep("Down", length(lda.probs_up))
lda.pred[lda.probs_up > .5] = "Up"
table(lda.pred, Direction.2009)
Direction.2009
lda.pred Down Up
Down 16 13
Up 27 48
mean(lda.pred == Direction.2009)
[1] 0.6153846
mean(lda.pred != Direction.2009)
[1] 0.3846154
qda.model = qda(Direction ~ Lag3 + Lag1 + Lag5 + Volume, data = Weekly, subset = holdout)
qda.pred = predict(qda.model,test)$class
table(qda.pred, Direction.2009)
Direction.2009
qda.pred Down Up
Down 22 14
Up 21 47
mean(qda.pred == Direction.2009)
[1] 0.6634615
mean(qda.pred != Direction.2009)
[1] 0.3365385
set.seed(1)
knn = knn(week.train,week.test,train.direction,k = 5)
table(knn, Direction.2009)
Direction.2009
knn Down Up
Down 16 21
Up 27 40
mean(knn == Direction.2009)
[1] 0.5384615
bayes = naiveBayes(Direction ~ Lag2 + Volume + Lag1 + Lag3 + Lag4 + Lag5, Weekly, subset = holdout)
bayes.pred = predict(bayes, test)
table(bayes.pred, Direction.2009)
Direction.2009
bayes.pred Down Up
Down 16 16
Up 27 45
mean(bayes.pred == Direction.2009)
[1] 0.5865385
mean(bayes.pred != Direction.2009)
[1] 0.4134615
Of the refitted models, QDA predicts the direction of the S&P 500 on the held-out 2009-2010 weeks most accurately, with an accuracy of about \(66.3\%\).
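For a side-by-side comparison, a small sketch that collects the test accuracies of the refitted models above (it relies on the prediction objects glm.pred2, lda.pred, qda.pred, knn, and bayes.pred as they stand at this point in the script):
data.frame(
  method = c("Logistic", "LDA", "QDA", "KNN (k = 5)", "Naive Bayes"),
  accuracy = c(mean(glm.pred2 == Direction.2009),
               mean(lda.pred == Direction.2009),
               mean(qda.pred == Direction.2009),
               mean(knn == Direction.2009),
               mean(bayes.pred == Direction.2009))
)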
mpg01 = ifelse(Auto$mpg > median(Auto$mpg), 1, 0)
mpg = data.frame(Auto, mpg01)
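Since mpg01 is a median split, the two classes should be close to balanced. A quick sketch to confirm:
table(mpg$mpg01)              # roughly half the cars in each class
prop.table(table(mpg$mpg01))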
Of mpg, cylinders, displacement, horsepower, weight, and acceleration, most are not normally distributed; acceleration, and to a lesser extent mpg, come closest to normal. There is also evidence of both strong positive and strong negative correlations among the predictors. mpg, displacement, horsepower, and weight are right-skewed, so for each of them the mean sits above the median and the distribution has a long right tail, while cylinders takes only a few discrete values and is clearly not normal.
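To see which variables are most associated with mpg01, a short sketch (a rough screen only, since mpg01 is binary) correlating each numeric predictor with the indicator:
num.vars = c("cylinders", "displacement", "horsepower", "weight", "acceleration", "year")
sapply(mpg[, num.vars], function(v) cor(v, mpg$mpg01))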
mpg |>
ggscatmat(columns = 1:6, alpha = .2)
View(mpg)
par(mfrow = c(3,3))
boxplot(mpg$mpg, main = "Boxplot of MPG")
boxplot(mpg$cylinders, main = "Boxplot of Cylinders")
boxplot(mpg$displacement, main = "Boxplot of Displacement")
boxplot(mpg$horsepower, main = "Boxplot of Horsepower")
boxplot(mpg$weight, main = "Boxplot of Weight")
boxplot(mpg$acceleration, main = "Boxplot of Acceleration")
set.seed(123)
train.index = createDataPartition(mpg$mpg01, p = 0.8, list = FALSE)
mpg.train = mpg[train.index,]
mpg.test = mpg[-train.index,]
form14 = mpg01 ~ mpg + cylinders + displacement + weight
The LDA model below gives \(93.59\%\) test accuracy.
mpg.lda = lda(form14, data = mpg)
mpg.lda.pred = predict(mpg.lda,mpg.test)$class
table(mpg.lda.pred, mpg.test$mpg01)
mpg.lda.pred 0 1
0 35 1
1 4 38
mean(mpg.lda.pred == mpg.test$mpg01)
[1] 0.9358974
The QDA model below yields an accuracy of \(92.3\%\).
mpg.qda = qda(form14, data = mpg)
mpg.qda.pred = predict(mpg.qda, mpg.test)$class
table(mpg.qda.pred,mpg.test$mpg01)
mpg.qda.pred 0 1
0 34 1
1 5 38
mean(mpg.qda.pred == mpg.test$mpg01)
[1] 0.9230769
Logistic regression has \(100\%\) accuracy, which is not surprising: mpg01 is defined directly from mpg, and mpg itself is included as a predictor in form14.
mpg.glm = glm(form14, data = mpg, family = "binomial")
mpg.glm.pred = predict(mpg.glm, mpg.test, type = "response")
mpg.glm.pred.class = ifelse(mpg.glm.pred > .5, 1, 0)
table(mpg.glm.pred.class,mpg.test$mpg01)
mean(mpg.glm.pred.class == mpg.test$mpg01)
The naive Bayes model has an accuracy of \(88.46\%\).
mpg.bayes = naiveBayes(form14, data = mpg)
mpg.bayes.pred = predict(mpg.bayes, mpg.test)
table(mpg.bayes.pred, mpg.test$mpg01)
mean(mpg.bayes.pred == mpg.test$mpg01)
KNN with k = 23 nearest neighbors has an accuracy of about \(86\%\).
set.seed(123)
idx = sample(1:nrow(Auto), 0.8*nrow(Auto))
train.auto = Auto[idx,]
test.auto = Auto[-idx,]
y.train = train.auto$mpg
x.train = train.auto[,-1]    # drop the mpg response column
x.test = test.auto[,-1]
x.train = x.train[,-8]       # drop the non-numeric name column
x.test = x.test[,-8]
y.test = ifelse(test.auto$mpg >= 23, 1, 0)   # 23 is roughly the median mpg
high.low = ifelse(y.train >= 23, 1, 0)
knn.pred <- knn (train=x.train, test=x.test, cl=high.low, k=1)
table(knn.pred, y.test)
mean(knn.pred == y.test)
set.seed(7)
knn.pred <- knn (train=x.train, test=x.test, cl=high.low, k=23)
table(knn.pred, y.test)
mean(knn.pred == y.test)
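A value like k = 23 can be screened by looping over a few candidate k values and comparing test accuracy; a small sketch reusing the x.train, x.test, high.low, and y.test objects defined above:
set.seed(7)
ks = c(1, 5, 11, 23, 51)
acc = sapply(ks, function(k) {
  pred = knn(train = x.train, test = x.test, cl = high.low, k = k)
  mean(pred == y.test)
})
data.frame(k = ks, accuracy = acc)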
Crime = ifelse(Boston$crim > median(Boston$crim), 1, 0)
Boston.p2 = data.frame(Boston,Crime)
cor(Boston.p2)
crim zn indus chas
crim 1.00000000 -0.20046922 0.40658341 -0.055891582
zn -0.20046922 1.00000000 -0.53382819 -0.042696719
indus 0.40658341 -0.53382819 1.00000000 0.062938027
chas -0.05589158 -0.04269672 0.06293803 1.000000000
nox 0.42097171 -0.51660371 0.76365145 0.091202807
rm -0.21924670 0.31199059 -0.39167585 0.091251225
age 0.35273425 -0.56953734 0.64477851 0.086517774
dis -0.37967009 0.66440822 -0.70802699 -0.099175780
rad 0.62550515 -0.31194783 0.59512927 -0.007368241
tax 0.58276431 -0.31456332 0.72076018 -0.035586518
ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174
black -0.38506394 0.17552032 -0.35697654 0.048788485
lstat 0.45562148 -0.41299457 0.60379972 -0.053929298
medv -0.38830461 0.36044534 -0.48372516 0.175260177
Crime 0.40939545 -0.43615103 0.60326017 0.070096774
nox rm age dis
crim 0.42097171 -0.21924670 0.35273425 -0.37967009
zn -0.51660371 0.31199059 -0.56953734 0.66440822
indus 0.76365145 -0.39167585 0.64477851 -0.70802699
chas 0.09120281 0.09125123 0.08651777 -0.09917578
nox 1.00000000 -0.30218819 0.73147010 -0.76923011
rm -0.30218819 1.00000000 -0.24026493 0.20524621
age 0.73147010 -0.24026493 1.00000000 -0.74788054
dis -0.76923011 0.20524621 -0.74788054 1.00000000
rad 0.61144056 -0.20984667 0.45602245 -0.49458793
tax 0.66802320 -0.29204783 0.50645559 -0.53443158
ptratio 0.18893268 -0.35550149 0.26151501 -0.23247054
black -0.38005064 0.12806864 -0.27353398 0.29151167
lstat 0.59087892 -0.61380827 0.60233853 -0.49699583
medv -0.42732077 0.69535995 -0.37695457 0.24992873
Crime 0.72323480 -0.15637178 0.61393992 -0.61634164
rad tax ptratio black
crim 0.625505145 0.58276431 0.2899456 -0.38506394
zn -0.311947826 -0.31456332 -0.3916785 0.17552032
indus 0.595129275 0.72076018 0.3832476 -0.35697654
chas -0.007368241 -0.03558652 -0.1215152 0.04878848
nox 0.611440563 0.66802320 0.1889327 -0.38005064
rm -0.209846668 -0.29204783 -0.3555015 0.12806864
age 0.456022452 0.50645559 0.2615150 -0.27353398
dis -0.494587930 -0.53443158 -0.2324705 0.29151167
rad 1.000000000 0.91022819 0.4647412 -0.44441282
tax 0.910228189 1.00000000 0.4608530 -0.44180801
ptratio 0.464741179 0.46085304 1.0000000 -0.17738330
black -0.444412816 -0.44180801 -0.1773833 1.00000000
lstat 0.488676335 0.54399341 0.3740443 -0.36608690
medv -0.381626231 -0.46853593 -0.5077867 0.33346082
Crime 0.619786249 0.60874128 0.2535684 -0.35121093
lstat medv Crime
crim 0.4556215 -0.3883046 0.40939545
zn -0.4129946 0.3604453 -0.43615103
indus 0.6037997 -0.4837252 0.60326017
chas -0.0539293 0.1752602 0.07009677
nox 0.5908789 -0.4273208 0.72323480
rm -0.6138083 0.6953599 -0.15637178
age 0.6023385 -0.3769546 0.61393992
dis -0.4969958 0.2499287 -0.61634164
rad 0.4886763 -0.3816262 0.61978625
tax 0.5439934 -0.4685359 0.60874128
ptratio 0.3740443 -0.5077867 0.25356836
black -0.3660869 0.3334608 -0.35121093
lstat 1.0000000 -0.7376627 0.45326273
medv -0.7376627 1.0000000 -0.26301673
Crime 0.4532627 -0.2630167 1.00000000
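To motivate the predictors used in the models below (indus, nox, age, dis, rad, tax), a short sketch ranking every variable by the absolute value of its correlation with the Crime indicator:
crime.cor = cor(Boston.p2)[, "Crime"]
sort(abs(crime.cor), decreasing = TRUE)   # setting aside Crime and crim themselves, nox, rad, dis, age, tax, and indus lead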
set.seed(123)
Boston.index = createDataPartition(Boston.p2$Crime, p = .7, list = FALSE)
Boston.train = Boston.p2[Boston.index,]   # 70/30 split used for the glm, LDA, and naive Bayes fits
Boston.test = Boston.p2[-Boston.index,]
train = 1:(dim(Boston.p2)[1]/2)           # first-half / second-half split used only for KNN below
test = (dim(Boston.p2)[1]/2 + 1):dim(Boston.p2)[1]
Crime.test = Crime[test]
Logistic regression with indus, nox, age, rad, and tax as predictors has an accuracy of \(90.66\%\).
form16 = Crime ~ indus + nox + age + rad + tax   # formula implied by the predictors named above
boston.glm = glm(form16, data = Boston.train, family = "binomial")
boston.glm.probs = predict(boston.glm,Boston.test,type = "response")
boston.glm.pred = ifelse(boston.glm.probs > .5,1,0)
table(boston.glm.pred,Boston.test$Crime)
mean(boston.glm.pred == Boston.test$Crime)
Linear discriminant analysis with the same predictors (indus, nox, age, rad, tax) has an accuracy of \(85.33\%\).
boston.lda = lda(form16, data = Boston.train)
boston.lda.pred = predict(boston.lda, Boston.test)$class
table(boston.lda.pred,Boston.test$Crime)
mean(boston.lda.pred == Boston.test$Crime)
Naive Bayes with the same predictors yields an accuracy of \(84.66\%\).
boston.naive = naiveBayes(form16, data = Boston.train)
boston.naive.pred = predict(boston.naive, Boston.test)
table(boston.naive.pred, Boston.test$Crime)
mean(boston.naive.pred == Boston.test$Crime)
KNN (with k = 10 in the code below) has an accuracy of \(76.67\%\).
set.seed(2)
knn.vars = c("indus", "nox", "age", "dis", "rad", "tax")
train.knn = as.matrix(Boston.p2[train, knn.vars])
test.knn = as.matrix(Boston.p2[test, knn.vars])
Crime.train = Crime[train]   # knn() needs the training labels, not the test labels
Bostonknn.pred = knn(train.knn, test.knn, Crime.train, k = 10)
table(Bostonknn.pred, Crime.test)
mean(Bostonknn.pred == Crime.test)