Logistic Regression Lab

Load library

library(ISLR)  # assumed: the Smarket data ships with the ISLR package

Check dataset

?Smarket
names(Smarket)

## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"

summary(Smarket)

##       Year           Lag1                Lag2                Lag3          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
##  Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
##       Lag4                Lag5              Volume           Today          
##  Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
##  1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
##  Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
##  Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
##  3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
##  Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
##  Direction 
##  Down:602  
##  Up  :648  
##            
##            
##            
##

data <- Smarket

Plot data

pairs(Smarket,col=Smarket$Direction,cex=.5)

Logistic regression

glm.fit <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
            data=Smarket,family=binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3

glm.probs <- predict(glm.fit,type="response")
glm.probs[1:5]

##         1         2         3         4         5 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812

glm.pred <- ifelse(glm.probs>0.5,"Up","Down")
attach(Smarket)
table(glm.pred,Direction)

##         Direction
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507

mean(glm.pred==Direction)

## [1] 0.5216

Make training set and test set

train <- Year<2005
glm.fit <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
            data=Smarket,family=binomial, subset=train)
glm.probs <- predict(glm.fit,newdata=Smarket[!train,],type="response") 
glm.pred <- ifelse(glm.probs >0.5,"Up","Down")
Direction.2005 <- Smarket$Direction[!train]
table(glm.pred,Direction.2005)

##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44

mean(glm.pred==Direction.2005)

## [1] 0.4801587

Here mean() averages a logical vector that is TRUE (counted as 1) when the predicted direction matches the actual direction on the test set and FALSE (counted as 0) otherwise. The value of 0.48 means the prediction is wrong on more than half of the test days. The model's Up predictions are correct more often than its Down predictions.

Check accuracy rate

44/(34+44)

## [1] 0.5641026

77/(77+97)

## [1] 0.4425287
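
The arithmetic above can be reproduced from the confusion matrix itself. The following is a minimal sketch that re-enters the test-set counts printed earlier by hand; the object names `conf`, `accuracy`, and `correct_by_prediction` are illustrative and not part of the lab, and the layout assumes rows are predictions and columns are actual 2005 directions.

```r
# mean() of a logical vector is the proportion of TRUE values
mean(c(TRUE, FALSE, TRUE, TRUE))   # 0.75, because 3 of 4 values are TRUE

# Test-set confusion matrix for the full model, typed in from the output above
# (matrix() fills column by column: actual Down first, then actual Up)
conf <- matrix(c(77, 34, 97, 44), nrow = 2,
               dimnames = list(glm.pred = c("Down", "Up"),
                               Direction.2005 = c("Down", "Up")))

# Overall accuracy: correct predictions (the diagonal) over all test days
accuracy <- sum(diag(conf)) / sum(conf)

# Share of each kind of prediction that turned out to be correct (row-wise)
correct_by_prediction <- diag(conf) / rowSums(conf)

round(accuracy, 4)
round(correct_by_prediction, 4)
```

The row-wise shares match the two hand calculations: 77/174 for Down predictions and 44/78 for Up predictions.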

Fit smaller model

glm.fit <- glm(Direction~Lag1+Lag2,
            data=Smarket,family=binomial, subset=train)
glm.probs <- predict(glm.fit,newdata=Smarket[!train,],type="response") 
glm.pred <- ifelse(glm.probs >0.5,"Up","Down")
table(glm.pred,Direction.2005)

##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106

mean(glm.pred==Direction.2005)

## [1] 0.5595238

Check the accuracy rate for Up predictions: days predicted Up that actually went Up / total days predicted Up

106/(76+106)

## [1] 0.5824176

Interpretation of the results: The model predicts the direction correctly 56% of the time across both Up and Down days. When it predicts Up, the market actually goes up 58% of the time (106/182); when it predicts Down, the market actually goes down 50% of the time (35/70). The model's Up predictions are therefore more reliable than its Down predictions. Depending on the purpose of the model, its creator may prioritize accuracy on either Up or Down predictions.
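
One way to put the 56% in context is to compare it with a naive rule that predicts Up every day. That baseline is not computed in the lab; the sketch below is an illustration using only the counts from the confusion matrix above, with hypothetical object names.

```r
# Test-set confusion matrix for the Lag1 + Lag2 model, typed in from the output above
# (rows are predictions, columns are actual 2005 directions)
conf <- matrix(c(35, 76, 35, 106), nrow = 2,
               dimnames = list(glm.pred = c("Down", "Up"),
                               Direction.2005 = c("Down", "Up")))

# Overall accuracy of the model
accuracy <- sum(diag(conf)) / sum(conf)

# Share of Down and Up predictions that were correct
correct_by_prediction <- diag(conf) / rowSums(conf)

# Accuracy of always predicting Up: the share of test days that actually went Up
baseline_up <- sum(conf[, "Up"]) / sum(conf)

round(c(model = accuracy, always_up = baseline_up), 4)
round(correct_by_prediction, 4)
```

With these counts the model and the always-Up rule are both right on 141 of 252 days, so the overall accuracy alone does not show that the model adds information; the 58% hit rate on Up predictions is the more informative figure.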

Assignment 7 Questions

Review ISLR Chapter 4 and look up answers for the following questions

a. What is/are the requirement(s) of LDA?

Let fk(x) be the density function of X for an observation from class k (fk(x) is large if there is a high probability that an observation in class k has X = x). LDA makes the following assumptions:
- p = 1
  - fk(x) is normal (Gaussian)
  - each class has its own mean
  - there is a shared variance term across all K classes
- p > 1
  - the observations are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a common covariance matrix
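
The assumptions above can be written out as formulas. The notation is an addition to the answer: \mu_k is the class mean, \sigma^2 the shared variance, \Sigma the common covariance matrix, and \pi_k the prior probability of class k (not mentioned in the answer but needed for the classification rule). An observation is assigned to the class with the largest discriminant score \delta_k(x), which is linear in x.

```latex
% p = 1: normal density with class-specific mean and shared variance
f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma}
         \exp\!\left( -\frac{(x - \mu_k)^2}{2\sigma^2} \right)

% Resulting linear discriminant score for class k
\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k

% p > 1: multivariate Gaussian with class-specific mean vector and common covariance
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
              - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k + \log \pi_k
```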

How is LDA different from logistic regression?

- The linear logistic model only specifies the conditional distribution Pr(G = k | X = x) and makes no assumption about Pr(X), while the LDA model specifies the joint distribution of X and G.
- Linear logistic regression is fit by maximizing the conditional likelihood of G given X, Pr(G = k | X = x), while LDA maximizes the joint likelihood of G and X, Pr(X = x, G = k).
- If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently because it uses more information about the data.
- Another advantage of LDA is that samples without class labels can be used under the LDA model. On the other hand, LDA is not robust to gross outliers.
- In practice, the two methods generally give similar results.
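
The claim that the two methods usually agree can be checked on simulated data. The sketch below is an illustration under stated assumptions, not part of the assignment: the data are generated to satisfy the LDA assumptions (one predictor, normal within each class, common variance), the seed and sample sizes are arbitrary, and LDA is coded by hand in base R rather than with a package.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Simulated data: two classes, one normal predictor, common standard deviation
n <- 200
x <- c(rnorm(n, mean = -1, sd = 1), rnorm(n, mean = 1, sd = 1))
g <- factor(rep(c("A", "B"), each = n))

# Logistic regression: models only the conditional probability Pr(G = B | X = x)
logit_fit  <- glm(g ~ x, family = binomial)
logit_pred <- ifelse(predict(logit_fit, type = "response") > 0.5, "B", "A")

# LDA by hand: estimate class means, priors, and the pooled variance,
# then assign each x to the class with the larger discriminant score
mu     <- as.numeric(tapply(x, g, mean))
prior  <- as.numeric(table(g)) / length(g)
sigma2 <- sum((x - mu[as.integer(g)])^2) / (length(x) - 2)
delta  <- sapply(1:2, function(k) {
  x * mu[k] / sigma2 - mu[k]^2 / (2 * sigma2) + log(prior[k])
})
lda_pred <- levels(g)[max.col(delta, ties.method = "first")]

# Proportion of observations that both methods classify the same way
agreement <- mean(logit_pred == lda_pred)
agreement
```

The two fits estimate nearly the same decision boundary here, so they disagree only on observations that fall between the two boundaries.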

What is ROC?

ROC stands for receiver operating characteristic. The ROC curve is used to check a classification model’s performance. Each point on the curve corresponds to one probability threshold. The curve plots the false positive rate (1 - specificity) on the x-axis against the true positive rate (sensitivity) on the y-axis. The area under the ROC curve indicates how accurately the model can distinguish between the groups in classification. The closer the ROC curve is to the upper left corner, the more accurate the model is.

What are sensitivity and specificity? Which is more important in your opinion?

Sensitivity = # of true positives / (# of true positives + # of false negatives); it measures a model’s ability to correctly identify actual positives.
Specificity = # of true negatives / (# of true negatives + # of false positives); it measures a model’s ability to correctly identify actual negatives.
Which matters more depends on the stakes involved in the classification decision. If missing an actual positive (a false negative) is worse than flagging an actual negative (a false positive), prioritize sensitivity; in the opposite case, prioritize specificity.

From the following chart, for the purpose of prediction, which is more critical?

Because there are so few true yes values compared with true no values, it is more important to correctly classify the true yes values than the true no values.
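
The definitions of sensitivity, specificity, and the ROC curve given above can be sketched in base R. Everything in this example is hypothetical: the labels and scores are simulated, the seed is arbitrary, and the helper `rates()` is an illustrative name rather than a library function.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical data: true labels (1 = positive) and model scores,
# where positives tend to receive higher scores
actual <- rep(c(0, 1), times = c(150, 50))
score  <- c(rnorm(150, mean = 0), rnorm(50, mean = 1.5))

# Sensitivity and specificity when "score > cutoff" is classified as positive
rates <- function(cutoff) {
  pred <- as.numeric(score > cutoff)
  tp <- sum(pred == 1 & actual == 1)
  fn <- sum(pred == 0 & actual == 1)
  tn <- sum(pred == 0 & actual == 0)
  fp <- sum(pred == 1 & actual == 0)
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

# Sweep the cutoff from high to low to trace the ROC curve
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE), -Inf)
roc <- t(sapply(cutoffs, rates))
fpr <- 1 - roc[, "specificity"]   # x-axis of the ROC curve
tpr <- roc[, "sensitivity"]       # y-axis of the ROC curve

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

Calling `plot(fpr, tpr, type = "l")` draws the curve. Lowering the cutoff raises sensitivity and lowers specificity, which is the trade-off the curve summarizes.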

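Calculate the prediction error from the following:

Sensitivity = 81/104 = 0.779
Specificity = 9644/9896 = 0.975

These ratios are sensitivity and specificity only if 104 and 9896 are the numbers of actual yes and actual no cases. If they are instead the numbers of predicted yes and predicted no cases, the same ratios are the positive and negative predictive values.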

81/104

## [1] 0.7788462

9644/9896

## [1] 0.9745352
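
The question asks for the prediction error, which the two ratios above do not give directly. The following is a minimal sketch, assuming the chart is a 2 x 2 confusion matrix in which 81 and 9644 are the correctly classified yes and no cases and 104 and 9896 are the totals of the two groups they belong to. The chart itself is not reproduced here, so this layout is an assumption, and the variable names are illustrative.

```r
# Counts quoted in the answer above
correct_yes <- 81
total_yes   <- 104
correct_no  <- 9644
total_no    <- 9896

# Assumption: whatever is not correct within each total is a misclassification
wrong <- (total_yes - correct_yes) + (total_no - correct_no)   # 23 + 252 = 275
n     <- total_yes + total_no                                  # 10000 cases

# Overall prediction error: misclassified cases over all cases
error_rate <- wrong / n
error_rate   # 0.0275
```

Under this assumption the overall error rate is 2.75%, and the result does not depend on whether the totals are counts of actual or of predicted classes.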

Assignment 7

Genna Campain

4/3/2022
