HW-Q 1: This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones? (Hint: Please use specific code to extract specific information)
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.2
attach(Weekly)
head(Weekly)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
## 6 1990  1.178  0.712  3.514 -2.576 -0.270 0.1544440 -1.372      Down
weekly.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data = Weekly, family = binomial)

model.summary <- summary(weekly.fit)
model.summary
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4
model.summary$coefficients[ , 4]
## (Intercept)        Lag1        Lag2        Lag3        Lag4        Lag5 
## 0.001898848 0.118144368 0.029601361 0.546923890 0.293653342 0.583348244 
##      Volume 
## 0.537674762

The only predictor that is statistically significant at \(\alpha = 0.05\) is Lag2 (\(p \approx 0.0296\)). For every other predictor we fail to reject the null hypothesis \(H_0: \beta_j = 0\).
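To extract just the significant predictors programmatically (a small sketch built on the coefficient matrix already stored in model.summary):

pvals <- model.summary$coefficients[, 4]
# Predictors (excluding the intercept) with p-values below 0.05
names(pvals)[pvals < 0.05 & names(pvals) != "(Intercept)"]
# Returns "Lag2", consistent with the table above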

(b) Compute the confusion matrix and overall fraction of correct predictions.
logWeekly.prob <- predict(weekly.fit, type = "response")
logWeekly.pred <- rep("Down", length(logWeekly.prob))
logWeekly.pred[logWeekly.prob > 0.5] <- "Up"
table(logWeekly.pred, Direction)
##               Direction
## logWeekly.pred Down  Up
##           Down   54  48
##           Up    430 557

After computing the confusion matrix, we can determine the overall fraction of correct predictions.
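That fraction can be computed in one line (a quick sketch; the value follows from the table above):

mean(logWeekly.pred == Direction)
# (54 + 557) / 1089 ≈ 0.5611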

(c) Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression. (No code required for this part of the question.)

\[ \frac{54+557}{54+48+430+557} = 0.5611 \] This shows that the model predicted the weekly market trend correctly \(56.11\%\) of the time. However, the two classes fare very differently: the model predicted Up weeks correctly at a rate of \(\frac{557}{48+557} = 0.9207\) (\(92.07\%\)), while Down weeks were predicted correctly at only \(\frac{54}{430+54} \approx 0.1116\) (\(11.16\%\)).

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010). (Hint: referring to Section 4.7.2 of the textbook may be helpful.)
train <- (Year < 2009)
Weekly.0910 <- Weekly[!train, ]
weekly.fit2 <- glm(Direction ~ Lag2, data = Weekly, family = binomial, subset = train)
logWeekly.prob <- predict(weekly.fit2, Weekly.0910, type = "response")
logWeekly.pred <- rep("Down", length(logWeekly.prob))
logWeekly.pred[logWeekly.prob > 0.5] <- "Up"
Direction.0910 <- Direction[!train]
table(logWeekly.pred, Direction.0910)
##               Direction.0910
## logWeekly.pred Down Up
##           Down    9  5
##           Up     34 56
mean(logWeekly.pred == Direction.0910)
## [1] 0.625

After splitting the Weekly data set into training and test sets, the new model correctly predicts the weekly trend \(62.5\%\) of the time on the held-out data, slightly better than the previous model fit to the entire data set. Like the previous model, it is far better at predicting upward trends (\(56/61 = 91.80\%\)) than downward trends (\(9/43 = 20.93\%\)).
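As a sketch, the class-specific rates quoted above can be read directly off the held-out confusion matrix:

conf <- table(logWeekly.pred, Direction.0910)
conf["Up", "Up"] / sum(conf[, "Up"])        # 56 / 61 ≈ 0.9180
conf["Down", "Down"] / sum(conf[, "Down"])  # 9 / 43 ≈ 0.2093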

HW-Q 2: In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
library(ISLR2)
attach(Auto)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
df_auto <- data.frame(Auto)

# Median of mpg: 22.75
mpg_median <- median(Auto$mpg)

# Use ifelse() to assign the binary value based on mpg
df_auto$mpg01 <- ifelse(df_auto$mpg > mpg_median, 1, 0)

# Drop mpg and name to simplify visualization
df_auto$mpg <- NULL
df_auto$name <- NULL

mpg01 is a binary variable that equals 1 when the mpg value is above its median and 0 otherwise.
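As a quick sanity check (a sketch; the counts are not part of the original output), a median split should leave the two classes roughly balanced:

# Class counts for mpg01; ties at the median fall into the 0 class
table(df_auto$mpg01)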

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? (Hint: Scatter plots and box plots may be useful tools to answer this question).
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'Auto':
## 
##     mpg
#install.packages('GGally')
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
df_auto$mpg01_temp <- ifelse(df_auto$mpg01 >= 1, "above", "below")

ggpairs(df_auto, aes(colour = mpg01_temp))
## Warning in cor(x, y): the standard deviation is zero (repeated 14 times)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. (repeated 8 times)

#install.packages('corrplot')
library(corrplot)
## corrplot 0.92 loaded
# Use df_auto so the plot shows correlations with mpg01 itself
corrplot(cor(df_auto[, names(df_auto) != "mpg01_temp"]), method = "square")

The variables that appear to be most strongly associated with mpg01 are cylinders, displacement, and weight, all of which correlate negatively with it. Horsepower and origin also appear moderately correlated with mpg01.

(c) Describe your findings. (Hint: no codes required)

We can see from the distribution plots that weight and year show a distinct separation between "above" and "below" the median mpg. The box plots confirm this: despite some outliers, most of the population is concentrated around the median, so constructing a decision boundary appears feasible.

For horsepower, displacement, and cylinders, a decision boundary also seems feasible; however, their distributions within the "below" class appear far from normal.

The acceleration variable, by contrast, appears very difficult to classify on its own: the distributions of the two classes overlap heavily, which could easily lead to overfitting. Looking at the pairs plots, however, combining acceleration with weight, horsepower, displacement, or cylinders produces more clearly separated groups, which may come in handy during modelling.

Finally, origin: since this is a categorical variable, it cannot by itself be used to predict high or low mpg; nevertheless, combining it with weight could provide an extra criterion for classification.
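As an illustrative sketch of that last point (this plot is an addition, not part of the original analysis), the origin/weight combination can be inspected against the class labels created earlier:

# Weight by origin, coloured by the above/below-median mpg label
ggplot(df_auto, aes(x = factor(origin), y = weight, colour = mpg01_temp)) +
  geom_boxplot()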

(d) Split the data into a training set and a test set. Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). After performing the logistic regression, calculate the test error of the model you have obtained.
total_observations <- nrow(df_auto)

# We use 80% of the data for training and 20% for testing.
p_train <- 0.80

# First 80% of rows for training
df_auto_train <- df_auto[1:as.integer(total_observations * p_train), ]

# Remaining rows (the last 20%) for testing
df_auto_test <- df_auto[(as.integer(total_observations * p_train) + 1):total_observations, ]
glm.fit <- glm(mpg01 ~ weight + year, data = df_auto_train, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = mpg01 ~ weight + year, family = binomial, data = df_auto_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.22543  -0.21827  -0.01308   0.25496   3.11123  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.299e+01  5.863e+00  -2.215   0.0268 *  
## weight      -5.298e-03  6.664e-04  -7.950 1.87e-15 ***
## year         3.648e-01  8.927e-02   4.086 4.39e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 418.57  on 312  degrees of freedom
## Residual deviance: 141.63  on 310  degrees of freedom
## AIC: 147.63
## 
## Number of Fisher Scoring iterations: 7
glm.probs <- predict(glm.fit, df_auto_test, type = "response")
glm.pred <- rep(0, nrow(df_auto_test))
glm.pred[glm.probs > 0.5] <- 1

# Results
table(glm.pred, df_auto_test[["mpg01"]])
##         
## glm.pred  0  1
##        0  3  4
##        1  2 70
test_error <- (4 + 2) / 79 * 100  # misclassified / total test observations
test_error
## [1] 7.594937
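Equivalently, the test error can be computed directly from the predictions instead of hard-coding the counts (a one-line sketch):

mean(glm.pred != df_auto_test[["mpg01"]]) * 100
# ≈ 7.59, matching the manual calculation above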

The test error is 7.6%.
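One caveat on the split used above: the rows of Auto appear in model-year order, so taking the first 80% sequentially leaves the test set dominated by later years. A randomized split would avoid that bias; a minimal sketch (the seed value is an arbitrary choice):

set.seed(1)  # arbitrary seed, for reproducibility only
train_idx <- sample(total_observations, floor(0.8 * total_observations))
df_auto_train <- df_auto[train_idx, ]
df_auto_test <- df_auto[-train_idx, ]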