Question1 (concept)[10p]

Suppose we collect data for a group of students in a statistics class with variables \(X_1\) = hours studied, \(X_2\) = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficient, \(\hat\beta_0 = −6, \hat\beta_1 = 0.05, \hat\beta_2 = 1\).

  1. Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.

\[P(receive=A|X_1=40,X_2=3.5) = \frac{1}{1+e^{-(-6+40*0.05+3.5)}} = 0.3775407\]

  1. How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?

We can get

\[-6+0.05hours+3.5 = 0\]

\[\Rightarrow hours = 50\]

Question 2 (concept)[10p]

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

We should prioritise the classification of new observations using logistic regression.

Although the kNN classifier here has a lower error rate than the logistic regression model on both the training and test sets, a kNN with k=1 means that the model has a training set error rate of 0. The average error rate for both the training and test sets is 0.18, indicating a test set error rate of 0.36, which is greater than the logistic regression error rate of 0.3.

Question 3 (concept)[10p]

This problem has to do with odds.

\[odds = \frac{p(x)}{1-p(x)}\]

  1. On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?

\[p(default) = \frac{odds}{odds+1}=\frac{0.37}{0.37+1}=0.270073\] we have on average a fraction of 27% of people defaulting on their credit card payment.

  1. Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

\[odds(default) = \frac{0.16}{1-0.16}=0.1904762\]

The odds that she will default is then 19%.

Question 4 (applied)[20p]

This question should be answered using the Weekly data set, which is part of the ISLR2 package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

  1. Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?
library(ISLR2)
table(Weekly$Direction)
## 
## Down   Up 
##  484  605
par(mfrow = c(2,1))
plot(Weekly$Volume,type = 'l',ylab = 'Volume',xlab = '')
plot(Volume~Year,data=Weekly)

Weekly returns rose 605 times and fell 484 times, between 1990 and 2010. Overall, weekly returns trended upwards year on year, peaking in approximately 2008. Weekly returns trended upwards from 1990 to 2002, then slightly downwards until 2004 and then upwards until 2008 when they gradually stabilised.

library(corrplot)
corrplot(cor(Weekly[,-9]))

Only year and volume had a significant linear relationship. The correlation plot does not indicate that any of the other variables are linearly related.

  1. Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?
fit <- glm(Direction ~ ., data = Weekly[,-c(1,8)],family = 'binomial')
summary(fit)
## 
## Call:
## glm(formula = Direction ~ ., family = "binomial", data = Weekly[, 
##     -c(1, 8)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

At the 0.05 significance level, the Lag2 predictor is significant.

  1. Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.
p <- predict(fit,data = Weekly[,-c(1,8)],type = 'response')
pred <- ifelse(p>0.5,'Up','Down')
table(pred,Weekly$Direction)
##       
## pred   Down  Up
##   Down   54  48
##   Up    430 557
# overall fraction of correct predictions

mean(pred == Weekly$Direction) 
## [1] 0.5610652

overall fraction of correct predictions = 0.5610652. While the model correctly predicted the Up weekly trends \(557/(48+557) = 0.9207\). Only \(54/(430+54) = 0.115\) Down weekly trends were predicted.

  1. Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).
d2 = subset(Weekly,Year>=1990&Year<=2008)
fit2 <- glm(Direction ~ Lag2, data = d2,family = 'binomial')


p2 <- predict(fit2,subset(Weekly,Year>=2009&Year<=2010),type = 'response')
pred2 <- ifelse(p2>0.5,'Up','Down')

d3 = subset(Weekly,Year>=2009&Year<=2010)

table(pred2,d3$Direction)
##       
## pred2  Down Up
##   Down    9  5
##   Up     34 56
# overall fraction of correct predictions.
mean(pred2 == d3$Direction) 
## [1] 0.625

When the entire Weekly dataset was split into training and test datasets, the model accuracy was 62.5%, which was a modest improvement over the model using the entire dataset. In addition, the model was better at predicting uptrends (91.80%) than it was at predicting downtrends (20.93%) compared to the previous model; although the model was significantly better at correctly predicting downtrends.