Assignment 2

Q1)

Q1 Solution )

Q2)

A) Create log_mdl

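library(dplyr)
# Weekly is assumed to come from the ISLR2 package (it also ships with ISLR).
library(ISLR2)

# Keep the response, the five lag variables, and Volume.
Mdl_dat <- Weekly |> dplyr::select(Direction, starts_with("Lag"), Volume)
Mdl_dat$Direction <- factor(Mdl_dat$Direction)

# Logistic regression of Direction on all remaining columns.
log_mdl <- glm(Direction ~ ., family = binomial(), data = Mdl_dat)
log_mdl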
## 
## Call:  glm(formula = Direction ~ ., family = binomial(), data = Mdl_dat)
## 
## Coefficients:
## (Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
##     0.26686     -0.04127      0.05844     -0.01606     -0.02779     -0.01447  
##      Volume  
##    -0.02274  
## 
## Degrees of Freedom: 1088 Total (i.e. Null);  1082 Residual
## Null Deviance:       1496 
## Residual Deviance: 1486  AIC: 1500

B) Calculate Standard Metrics :

B_i)

  1. Confusion Matrix :
  2. Accuracy :
  3. Recall :
  4. F1 Score :

Confusion Matrix

##          Actual
## Predicted Down  Up
##      Down   54  48
##      Up    430 557
##          Actual
## Predicted  Down    Up
##      Down  4.96  4.41
##      Up   39.49 51.15

Accuracy is therefore the sum of the diagonal (the trace) of the percentage table:

## [1] 56.11

Interpretation: the model predicts the correct direction about 56% of the time.
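
The metrics listed above follow directly from the four cell counts of the confusion matrix. The code that produced the tables is not shown, so this is only a minimal sketch; the names `tp`, `fp`, `fn`, and `tn` are hypothetical, and Up is treated as the positive class.

```r
# Cell counts copied from the confusion matrix above; "Up" is the positive class.
tn <- 54   # predicted Down, actual Down
fn <- 48   # predicted Down, actual Up
fp <- 430  # predicted Up,   actual Down
tp <- 557  # predicted Up,   actual Up

accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # share of all weeks classified correctly
precision <- tp / (tp + fp)                   # share of predicted Up weeks that were Up
recall    <- tp / (tp + fn)                   # share of actual Up weeks that were caught
f1        <- 2 * precision * recall / (precision + recall)

round(100 * accuracy, 2)  # 56.11, matching the value reported above
```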

B_ii) Does the model uniformly beat random guessing in terms of these performance metrics?

To answer this, we compare the model's precision and recall with the relative frequency of Up weeks in our sample (the base rate). Precision is the share of the model's Up predictions that are correct, so a precision above the base rate means an Up prediction is more reliable than a guess made without the model.

precision > tbl$freq[2]
## [1] TRUE
recall > tbl$freq[2]
## [1] TRUE

Therefore, the model's Up predictions are correct more often than the base rate (precision), and the model captures a larger share of the actual Up weeks than the base rate (recall).
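
The two comparisons can be reproduced from the counts in the B_i confusion matrix. This sketch assumes that `tbl$freq[2]` holds the proportion of Up weeks in the sample; `base_rate` is a hypothetical stand-in for that value, since the code that builds `tbl` is not shown.

```r
# Counts from the confusion matrix in B_i ("Up" is the positive class).
tn <- 54; fn <- 48; fp <- 430; tp <- 557

# Assumed equivalent of tbl$freq[2]: the share of weeks that actually went Up.
base_rate <- (tp + fn) / (tp + tn + fp + fn)

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

precision > base_rate  # TRUE
recall > base_rate     # TRUE
```

A guesser that labels each week Up with probability p, independently of the data, has expected precision equal to the base rate and expected recall equal to p, so comparing recall with the base rate corresponds to a guesser that uses p equal to the base rate.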

C) Plot precision and recall against the threshold (varying over [0, 1] ) used to generate predicted labels from predicted probabilities.

The threshold is the cutoff on the predicted probability above which a week is labelled Up. The plot shows how precision and recall change as that cutoff varies:

## Warning: Removed 22 rows containing missing values or values outside the scale range
## (`geom_line()`).

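The model captures most “Up” weeks at low thresholds (high recall), but recall drops sharply as the threshold increases, showing reduced sensitivity. Precision stays stable and then rises, indicating more confident predictions, before dropping at high thresholds due to overconfident errors. The rows removed in the warning above are most likely thresholds at which no week is predicted “Up”, where precision is 0/0 and therefore undefined.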
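
The plotting code is not shown, so the following is only a sketch of one way to compute the two curves: recompute precision and recall at each threshold on a grid. The function name `pr_by_threshold` and the toy inputs are hypothetical and are not the Weekly predictions.

```r
# Precision and recall at each threshold. `probs` are predicted probabilities
# of the positive class and `actual` is a logical vector (TRUE = positive).
pr_by_threshold <- function(probs, actual, thresholds = seq(0, 1, by = 0.01)) {
  rows <- lapply(thresholds, function(t) {
    pred <- probs > t              # label positive when probability exceeds t
    tp <- sum(pred & actual)
    fp <- sum(pred & !actual)
    fn <- sum(!pred & actual)
    data.frame(
      threshold = t,
      precision = tp / (tp + fp),  # NaN (0/0) when nothing is predicted positive
      recall    = tp / (tp + fn)
    )
  })
  do.call(rbind, rows)
}

# Toy inputs for illustration only.
toy_probs  <- c(0.2, 0.4, 0.6, 0.8)
toy_actual <- c(FALSE, TRUE, FALSE, TRUE)
toy_curve  <- pr_by_threshold(toy_probs, toy_actual, thresholds = c(0, 0.5, 0.9))
toy_curve
```

With the fitted model, the inputs would presumably be `predict(log_mdl, type = "response")` and `Mdl_dat$Direction == "Up"`, and the resulting data frame can be passed to a plotting function.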

D) Fit the logistic regression using only data up to (and including) the year 2008, with Lag2 as the only predictor.

## 
## Call:  glm(formula = Direction ~ Lag2, family = binomial(), data = Up_to_2008)
## 
## Coefficients:
## (Intercept)         Lag2  
##      0.2033       0.0581  
## 
## Degrees of Freedom: 984 Total (i.e. Null);  983 Residual
## Null Deviance:       1355 
## Residual Deviance: 1351  AIC: 1355
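
The fitting code for this step is not shown, so the following is only a sketch of one way to produce such a fit. It assumes the data frame has `Year`, `Lag2`, and `Direction` columns, as `Weekly` does; `fit_lag2` and the toy data frame are hypothetical names introduced for illustration.

```r
# Fit Direction ~ Lag2 using only rows with Year <= last_year.
fit_lag2 <- function(dat, last_year = 2008) {
  train <- dat[dat$Year <= last_year, ]
  glm(Direction ~ Lag2, family = binomial(), data = train)
}

# Toy data for illustration only (not the Weekly data).
set.seed(1)
toy <- data.frame(
  Year = rep(2005:2010, each = 20),
  Lag2 = rnorm(120)
)
toy$Direction <- factor(
  ifelse(runif(120) < plogis(0.2 + 0.5 * toy$Lag2), "Up", "Down")
)

toy_fit <- fit_lag2(toy)
nobs(toy_fit)  # 80: the rows for 2005 to 2008, 20 per year
```

With the real data the call would be `fit_lag2(Weekly)`, which keeps the rows that the output above refers to as `Up_to_2008`.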

E) Metrics & Training

Confusion Matrix :

##          Actual
## Predicted   0   1
##      Down  23  20
##      Up   418 524
##          Actual
## Predicted     0     1
##      Down  2.34  2.03
##      Up   42.44 53.20

Separate Metrics

## [1] "Accuracy"
## [1] 55.54
## [1] "precision"
## [1] 0.5562526
## [1] "recall"
## [1] 0.9632446
## [1] "F1"
## [1] 0.7052429
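
The reported values appear to be computed from the rounded percentage table rather than from the raw counts, which would explain small differences in the trailing digits. The sketch below checks this under that assumption, treating the column labelled 1 as Up; all variable names are hypothetical.

```r
# Rounded percentages from the second table in E.
pct_tn <- 2.34; pct_fn <- 2.03; pct_fp <- 42.44; pct_tp <- 53.20

accuracy  <- pct_tn + pct_tp                    # 55.54, as reported
precision <- pct_tp / (pct_tp + pct_fp)
recall    <- pct_tp / (pct_tp + pct_fn)
f1        <- 2 * precision * recall / (precision + recall)

# Same metrics from the raw counts in the first table, for comparison.
tn <- 23; fn <- 20; fp <- 418; tp <- 524
accuracy_exact  <- 100 * (tp + tn) / (tp + tn + fp + fn)  # about 55.53
precision_exact <- tp / (tp + fp)
recall_exact    <- tp / (tp + fn)

round(c(precision = precision, recall = recall, F1 = f1), 7)
```

The differences are small, but computing the metrics from the raw counts avoids carrying rounding error into later comparisons.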

Did it work?

## [1] "Does the model uniformly beat random guessing in terms of these performance metrics?"
precision > tbl$freq[2]
## [1] FALSE
recall > tbl$freq[2]
## [1] TRUE