library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.2
attach(Weekly)
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
weekly.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)
model.summary = summary(weekly.fit)
model.summary
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = Weekly)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6949 -1.2565 0.9913 1.0849 1.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26686 0.08593 3.106 0.0019 **
## Lag1 -0.04127 0.02641 -1.563 0.1181
## Lag2 0.05844 0.02686 2.175 0.0296 *
## Lag3 -0.01606 0.02666 -0.602 0.5469
## Lag4 -0.02779 0.02646 -1.050 0.2937
## Lag5 -0.01447 0.02638 -0.549 0.5833
## Volume -0.02274 0.03690 -0.616 0.5377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1486.4 on 1082 degrees of freedom
## AIC: 1500.4
##
## Number of Fisher Scoring iterations: 4
model.summary$coefficients[ , 4]
## (Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
## 0.001898848 0.118144368 0.029601361 0.546923890 0.293653342 0.583348244
## Volume
## 0.537674762
The one variable that is statistically significant at \(\alpha = 0.05\) is Lag2. All the other variables fail to reject the null hypothesis; \(\beta = 0\).
logWeekly.prob= predict(weekly.fit, type='response')
logWeekly.pred =rep("Down", length(logWeekly.prob))
logWeekly.pred[logWeekly.prob > 0.5] = "Up"
table(logWeekly.pred, Direction)
## Direction
## logWeekly.pred Down Up
## Down 54 48
## Up 430 557
After computing the confusion matrix, we can determine the percentage of current predictions.
\[ \frac{(54+557)}{(54+48+430+557)} = 0.5611 \] This illustrates that the model predicted the weekly market trend correctly \(56.11\%\) of the time. Even though the model predicted the Up weekly trends \(55748+557=0.9207\); \(92.07\%\) correctly, the down weekly trends were predicted at a much lower rate, \(54430+54=0.1115\); or only \({11.15\%}\) correctly predicted.
train = (Year<2009)
Weekly.0910 <-Weekly[!train,]
weekly.fit2<-glm(Direction~Lag2, data=Weekly,family=binomial, subset=train)
logWeekly.prob= predict(weekly.fit2, Weekly.0910, type = "response")
logWeekly.pred = rep("Down", length(logWeekly.prob))
logWeekly.pred[logWeekly.prob > 0.5] = "Up"
Direction.0910 = Direction[!train]
table(logWeekly.pred, Direction.0910)
## Direction.0910
## logWeekly.pred Down Up
## Down 9 5
## Up 34 56
mean(logWeekly.pred == Direction.0910)
## [1] 0.625
After splitting the Weekly data set into training and test sets, it seems that the new model has a better rate of correctly predicting weekly trends at \(62.5\%\), which is slightly better than the previous model that was based on the entire data set. Also this model, like the previous one, did better at predicting upward trends at a rate of \(91.80\%\) compared to downward trends at a rate of \(20.93\%\).
library(ISLR2)
attach(Auto)
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
df_auto = data.frame(Auto)
#Median of the variable mpg->22.75
mpg_median=median(median(Auto$mpg))
#We use the function "ifelse" to assing the binary based on mpg
df_auto$mpg01 <- ifelse(df_auto$mpg > mpg_median, 1, 0)
#Elimination of "mpg" and "name" to easy visualization
df_auto$mpg <- NULL
df_auto$name <- NULL
mpg01 is a binary variable that signifies 1 when the mpg value is above its median, and 0 otherwise.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'Auto':
##
## mpg
#install.packages('GGally')
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
df_auto$mpg01_temp <- ifelse(df_auto$mpg01 >= 1, "above", "bellow")
ggpairs(df_auto, aes(colour = mpg01_temp))
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## Warning in cor(x, y): the standard deviation is zero
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#install.packages('corrplot')
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(Auto[,-9]), method="square")
The variables that appear to correlate strongly with mpg01 are Cylinders,Displacement, and Weight; these variables appear to correlate negatively with this variable. Also Horsepower and Origin appear to correlate moderately with mpg01
We can see from the distribution plots that Weight and Year seem to have a distinct separation between “Above” and “Below” the median mpg. It is again confirmed using their box plot as well, in addition to some outliers but most of population seem to be centered around the median, meaning that trying to create a decision boundary appears feasible.
For horsepower, displacement and cylinders we see that a decision boundary seems also feasible, however their distribution over the binary variable “below” appears to be far from normal.
However, for the Acceleration variable, it appears to be very difficult to classify, the distribution over the binary variable overlaps each other, which could easily cause an over fitting problem. By looking into pairs plot with acceleration, we realize that its combination with weight, horsepower displacement and cylinders, creates easily spotted values, which may come in handy during modelling.
Finally we have Origin; since this is a categorical variable, by itself, can not be used to predict high or low mpg, nevertheless, its combination with Weight could give us an extra criteria for classification.
total_observations=nrow(df_auto)
# We use 80% of our data for training and 20% for testing.
p_train=80/100;
# 80% for training
df_auto_train=df_auto[1:as.integer(total_observations*p_train),]
# 20% for testing
df_auto_test=df_auto[(as.integer(total_observations*p_train)+1):total_observations,]
glm.fit=glm(mpg01~weight+year ,data=df_auto_train ,family=binomial)
summary (glm.fit)
##
## Call:
## glm(formula = mpg01 ~ weight + year, family = binomial, data = df_auto_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22543 -0.21827 -0.01308 0.25496 3.11123
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.299e+01 5.863e+00 -2.215 0.0268 *
## weight -5.298e-03 6.664e-04 -7.950 1.87e-15 ***
## year 3.648e-01 8.927e-02 4.086 4.39e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 418.57 on 312 degrees of freedom
## Residual deviance: 141.63 on 310 degrees of freedom
## AIC: 147.63
##
## Number of Fisher Scoring iterations: 7
glm.probs=predict(glm.fit,type="response",df_auto_test)
glm.pred=rep(0,79)
glm.pred[glm.probs >.5]=1
#Results
table(glm.pred ,df_auto_test[['mpg01']] )
##
## glm.pred 0 1
## 0 3 4
## 1 2 70
test_error=((6)/(79))*100
test_error
## [1] 7.594937
The test error is 7.6%.