Textbook problems

Q. 1: Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.

Starting with the logistic function representation (4.2):

p(X) = 1/(1 + e^(-β₀ - β₁X))

We want to show this is equivalent to the logit representation (4.3):

log(p(X)/(1-p(X))) = β₀ + β₁X

1. Start with p(X) = 1/(1 + e^(-β₀ - β₁X)).

2. Compute 1 - p(X):
   1 - p(X) = 1 - 1/(1 + e^(-β₀ - β₁X))
            = (1 + e^(-β₀ - β₁X))/(1 + e^(-β₀ - β₁X)) - 1/(1 + e^(-β₀ - β₁X))
            = e^(-β₀ - β₁X)/(1 + e^(-β₀ - β₁X))

3. Form the odds p(X)/(1-p(X)):
   p(X)/(1-p(X)) = [1/(1 + e^(-β₀ - β₁X))] / [e^(-β₀ - β₁X)/(1 + e^(-β₀ - β₁X))]
                 = 1/e^(-β₀ - β₁X)
                 = e^(β₀ + β₁X)

4. Take the natural log of both sides:
   log(p(X)/(1-p(X))) = log(e^(β₀ + β₁X)) = β₀ + β₁X

This is exactly (4.3), so the two representations are equivalent.
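As a quick numerical sanity check (not part of the proof), the two forms can be compared in R; the coefficient and predictor values below are arbitrary, chosen only for illustration:

beta0 = -2; beta1 = 0.3; x = 5       # arbitrary illustrative values
p = 1/(1 + exp(-(beta0 + beta1*x)))  # logistic form (4.2)
log(p/(1 - p))                       # logit form (4.3); returns -0.5
beta0 + beta1*x                      # matches: also -0.5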


Q.6 Suppose we collect data for a group of students in a statistics class with variables X₁ = hours studied, X₂ = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients β̂₀ = −6, β̂₁ = 0.05, β̂₂ = 1. (a) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.

Solution: The log-odds are β̂₀ + β̂₁X₁ + β̂₂X₂ = −6 + 0.05(40) + 1(3.5) = −0.5, so p(X) = e^(−0.5)/(1 + e^(−0.5)) ≈ 0.3775; the probability of getting an A is about 37.75%.

(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class? Solution: Setting p(X) = 0.5 makes the log-odds zero: −6 + 0.05X₁ + 1(3.5) = 0, so 0.05X₁ = 2.5 and X₁ = 50. The student would need to study 50 hours to have a 50% chance of getting an A.
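Both answers can be verified with a short R computation (a minimal sketch using the coefficient values given in the problem):

b0 = -6; b1 = 0.05; b2 = 1
# (a) probability of an A for 40 hours studied and a 3.5 GPA
log_odds = b0 + b1*40 + b2*3.5     # = -0.5
exp(log_odds)/(1 + exp(log_odds))  # ~ 0.3775
# (b) hours needed for a 50% chance: set the log-odds to 0 and solve for X1
(0 - b0 - b2*3.5)/b1               # = 50 hours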

Q.9 This problem has to do with odds. (a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default? (b) Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

Solution: (a) Odds = p/(1−p) = 0.37. Solving for p gives p = 0.37/1.37 ≈ 0.27, or 27%; on average, 27% of people with odds of 0.37 will default on their credit card payments. (b) With p = 0.16, the odds are 0.16/(1−0.16) ≈ 0.19; an individual with a 16% chance of defaulting has odds of about 0.19.
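A minimal R check of both conversions:

odds = 0.37
odds/(1 + odds)  # (a) p = odds/(1 + odds) ~ 0.27
p = 0.16
p/(1 - p)        # (b) odds = p/(1 - p) ~ 0.19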


Problem 1

In this problem you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set from the ISLR2 package.

Create a binary variable, mpg01, which equals 1 if mpg is above its median and 0 otherwise. You can compute the median using the median() function. This variable should be added to the Auto dataset using the $ operator.

library(ISLR2)  # Auto data set

data(Auto)
median_mpg = median(Auto$mpg)
Auto$mpg01 = ifelse(Auto$mpg > median_mpg, 1, 0)  # 1 if mpg is above its median, else 0
Auto$mpg = NULL  # remove mpg so it cannot be used as a predictor

Part A

Create a training and testing set by dividing the data into two parts. There are 392 total observations. Designate the training set as the odd observations, and the testing set as the even observations.

train_dataset = Auto[seq(1, nrow(Auto), by = 2), ]  # odd-numbered observations
test_dataset = Auto[seq(2, nrow(Auto), by = 2), ]   # even-numbered observations

print(paste("Training set size:", nrow(train_dataset)))
## [1] "Training set size: 196"
print(paste("Testing set size:", nrow(test_dataset)))
## [1] "Testing set size: 196"
print(paste("Original size:", nrow(Auto)))
## [1] "Original size: 392"

Part B

Using logistic regression, predict mpg01 using:

  1. Variables horsepower, cylinders, and weight
  2. Variables horsepower, cylinders, weight, and year
  3. Variables horsepower, cylinders, weight, and year plus all two-way interactions
#a. Variables horsepower, cylinders, and weight
model1 = glm(mpg01 ~ horsepower + cylinders + weight, data=train_dataset, family = "binomial")  
summary(model1) 
## 
## Call:
## glm(formula = mpg01 ~ horsepower + cylinders + weight, family = "binomial", 
##     data = train_dataset)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 12.3665998  1.9360643   6.387 1.69e-10 ***
## horsepower  -0.0358216  0.0197315  -1.815  0.06945 .  
## cylinders   -0.5519014  0.3326000  -1.659  0.09704 .  
## weight      -0.0021753  0.0008164  -2.664  0.00771 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.63  on 195  degrees of freedom
## Residual deviance: 111.40  on 192  degrees of freedom
## AIC: 119.4
## 
## Number of Fisher Scoring iterations: 7
#b. Variables horsepower, cylinders, weight, and year
model2 = glm(mpg01 ~ horsepower + cylinders + weight + year, data=train_dataset, family = "binomial")  
summary(model2)
## 
## Call:
## glm(formula = mpg01 ~ horsepower + cylinders + weight + year, 
##     family = "binomial", data = train_dataset)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.576602   6.063947  -2.074 0.038080 *  
## horsepower   -0.041059   0.021660  -1.896 0.058008 .  
## cylinders    -0.063915   0.377010  -0.170 0.865380    
## weight       -0.004056   0.001126  -3.601 0.000317 ***
## year          0.371326   0.093733   3.962 7.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.632  on 195  degrees of freedom
## Residual deviance:  89.971  on 191  degrees of freedom
## AIC: 99.971
## 
## Number of Fisher Scoring iterations: 7
#c. Variables horsepower, cylinders, weight, and year plus all two-way interactions
# (equivalent shorthand: mpg01 ~ (horsepower + cylinders + weight + year)^2)

model3 = glm(mpg01 ~ horsepower + cylinders + weight + year + horsepower * cylinders + horsepower * weight + horsepower * year + cylinders * weight + cylinders * year + weight * year, data=train_dataset, family = "binomial")
summary(model3)
## 
## Call:
## glm(formula = mpg01 ~ horsepower + cylinders + weight + year + 
##     horsepower * cylinders + horsepower * weight + horsepower * 
##     year + cylinders * weight + cylinders * year + weight * year, 
##     family = "binomial", data = train_dataset)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)           3.940e+01  6.401e+01   0.616    0.538
## horsepower           -1.837e-01  6.275e-01  -0.293    0.770
## cylinders            -8.481e+00  9.940e+00  -0.853    0.394
## weight               -6.296e-03  2.497e-02  -0.252    0.801
## year                 -2.045e-01  8.777e-01  -0.233    0.816
## horsepower:cylinders  2.357e-02  1.787e-02   1.319    0.187
## horsepower:weight    -3.967e-05  7.453e-05  -0.532    0.595
## horsepower:year       1.655e-03  8.941e-03   0.185    0.853
## cylinders:weight      4.552e-04  9.802e-04   0.464    0.642
## cylinders:year        6.148e-02  1.403e-01   0.438    0.661
## weight:year           5.124e-05  3.221e-04   0.159    0.874
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.632  on 195  degrees of freedom
## Residual deviance:  85.378  on 185  degrees of freedom
## AIC: 107.38
## 
## Number of Fisher Scoring iterations: 9

Part C

For each model, calculate the misclassification rate and confusion matrix on the training data, assuming a threshold of 0.5. Which model would you prefer?

# Predict probabilities for model1 on training set
phat1 = predict(model1, train_dataset, type = "response")

# Using threshold 0.5
yhat1 =  ifelse(phat1 > 0.5, 1, 0)

y1 = train_dataset$mpg01
MCR1 = mean(y1 != yhat1)
accuracy1 = 1 - MCR1
accuracy1
## [1] 0.9132653
MCR1
## [1] 0.08673469
table(Predicted = yhat1, Actual = y1)
##          Actual
## Predicted  0  1
##         0 90  7
##         1 10 89
# Predict probabilities for model2 on training set
phat2 = predict(model2, train_dataset, type = "response")

# Using threshold 0.5
yhat2 =  ifelse(phat2 > 0.5, 1, 0)

y2 = train_dataset$mpg01
MCR2 = mean(y2 != yhat2)
accuracy2 = 1 - MCR2
accuracy2
## [1] 0.8877551
MCR2
## [1] 0.1122449
table(Predicted = yhat2, Actual = y2)
##          Actual
## Predicted  0  1
##         0 87  9
##         1 13 87
# Predict probabilities for model3 on training set
phat3 = predict(model3, train_dataset, type = "response")

# Using threshold 0.5
yhat3 =  ifelse(phat3 > 0.5, 1, 0)

y3 = train_dataset$mpg01
MCR3 = mean(y3 != yhat3)
accuracy3 = 1 - MCR3
accuracy3
## [1] 0.8979592
MCR3
## [1] 0.1020408
table(Predicted = yhat3, Actual = y3)
##          Actual
## Predicted  0  1
##         0 89  9
##         1 11 87

Answer: Of the three models, Model 1 (horsepower, cylinders, and weight) has the highest training accuracy, 0.913 (91.33%), and the lowest training MCR, 0.087 (8.67%). Based on the training data alone, Model 1 appears to be the most reliable model, with the fewest errors.
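As an aside, the three blocks above repeat the same predict/threshold/table steps. A small helper function (hypothetical, not part of the original solution) could compute both quantities for any fitted model and dataset:

# Hypothetical helper; assumes a fitted glm and a data frame containing mpg01
classify_report = function(model, data, threshold = 0.5) {
  phat = predict(model, data, type = "response")
  yhat = ifelse(phat > threshold, 1, 0)
  list(MCR = mean(data$mpg01 != yhat),
       confusion = table(Predicted = yhat, Actual = data$mpg01))
}
# Example usage: classify_report(model1, train_dataset)$MCR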

Part D

For each model, calculate the misclassification rate and confusion matrix on the testing data, assuming a threshold of 0.5. Which model would you prefer? Is this the same as the prior step?

#Testing Data MCR

# Predict probabilities for model1 on testing set
phat1 = predict(model1, test_dataset, type = "response")

# Using threshold 0.5
yhat1 =  ifelse(phat1 > 0.5, 1, 0)

y1 = test_dataset$mpg01
MCR1 = mean(y1 != yhat1)
accuracy1 = 1 - MCR1
accuracy1
## [1] 0.9081633
MCR1
## [1] 0.09183673
table(Predicted = yhat1, Actual = y1)
##          Actual
## Predicted  0  1
##         0 86  8
##         1 10 92
# Predict probabilities for model2 on testing set
phat2 = predict(model2, test_dataset, type = "response")

# Using threshold 0.5
yhat2 =  ifelse(phat2 > 0.5, 1, 0)

y2 = test_dataset$mpg01
MCR2 = mean(y2 != yhat2)
accuracy2 = 1 - MCR2
accuracy2
## [1] 0.9081633
MCR2
## [1] 0.09183673
table(Predicted = yhat2, Actual = y2)
##          Actual
## Predicted  0  1
##         0 85  7
##         1 11 93
# Predict probabilities for model3 on testing set
phat3 = predict(model3, test_dataset, type = "response")

# Using threshold 0.5
yhat3 =  ifelse(phat3 > 0.5, 1, 0)

y3 = test_dataset$mpg01
MCR3 = mean(y3 != yhat3)
accuracy3 = 1 - MCR3
accuracy3
## [1] 0.9183673
MCR3
## [1] 0.08163265
table(Predicted = yhat3, Actual = y3)
##          Actual
## Predicted  0  1
##         0 88  8
##         1  8 92

Answer: Training and testing results are relatively consistent across the models, particularly for Model 3. On the testing data, Model 3 has the highest accuracy (0.918, or 91.84%), the lowest MCR (0.082, or 8.16%), and the fewest false positives (8, compared with 10 for Model 1 and 11 for Model 2). This suggests Model 3 is not overfitting and generalizes well, so the preference here differs from Part C, where Model 1 looked best on the training data.

Part E

Report the true positive (true value 1, predicted value 1) and false positive (true value 0, predicted value 1) numbers for the testing data.

# TP/FP values

#Model1
cmatrix1 = table(Predicted = yhat1, Actual = y1)

TP1 = cmatrix1[2,2]  # predicted 1, actual 1
FP1 = cmatrix1[2,1]  # predicted 1, actual 0

print(paste("The TP of Model1 is",TP1))
## [1] "The TP of Model1 is 92"
print(paste("The FP of Model1 is",FP1))
## [1] "The FP of Model1 is 10"
#Model2
cmatrix2 = table(Predicted = yhat2, Actual = y2)

TP2 = cmatrix2[2,2]  # predicted 1, actual 1
FP2 = cmatrix2[2,1]  # predicted 1, actual 0

print(paste("The TP of Model2 is",TP2))
## [1] "The TP of Model2 is 93"
print(paste("The FP of Model2 is",FP2))
## [1] "The FP of Model2 is 11"
#Model3
cmatrix3 = table(Predicted = yhat3, Actual = y3)

TP3 = cmatrix3[2,2]  # predicted 1, actual 1
FP3 = cmatrix3[2,1]  # predicted 1, actual 0

print(paste("The TP of Model3 is",TP3))
## [1] "The TP of Model3 is 92"
print(paste("The FP of Model3 is",FP3))
## [1] "The FP of Model3 is 8"

Part F

Provide ROC curves for the three methods (ideally in one graph) on the testing data. Print out the AUCs. Which method has the lowest AUC?

# Create ROC curves with AUC

# NOTE: I used ChatGPT for help with this portion, since I didn't understand the notes from class.

library(pROC)  # provides plot.roc(), roc(), and auc()

true_labels = test_dataset$mpg01

# Model1
phat1 = predict(model1, test_dataset, type = "response")

plot.roc(true_labels ~ phat1,
         print.auc = TRUE, col = "blue", main = "ROC Curves for Models 1, 2, and 3")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Model2 
phat2 = predict(model2, test_dataset, type = "response")

plot.roc(true_labels ~ phat2, 
         print.auc = TRUE, print.auc.y = 0.4, add = TRUE, col = "red")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Model3
phat3 = predict(model3, test_dataset, type = "response")

plot.roc(true_labels ~ phat3, 
         print.auc = TRUE, print.auc.y = 0.3, add = TRUE, col = "green")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Add a legend to the plot
legend("bottomright", legend = c("Model 1", "Model 2", "Model 3"), 
       col = c("blue", "red", "green"), lwd = 2)

Answer: Model 1 has the lowest AUC (0.961) as seen in the plot above.
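To print the AUCs directly rather than reading them off the plot, pROC's roc() and auc() functions can be used; a minimal sketch, assuming library(pROC) is already loaded as above:

# Numeric AUCs; the values should match the labels printed on the plot
auc(roc(true_labels, phat1))
auc(roc(true_labels, phat2))
auc(roc(true_labels, phat3))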

Part G

Based on the results of steps C-F, are there any characteristics of the data you can infer?

Answer:

  1. Training and testing performance are relatively consistent across the models, with no sign of overfitting (as seen for Model 3). This indicates the models capture relationships in the data that are useful for prediction on new observations.
  2. Model 1 is also a strong model; even though it is simpler, it offers good performance with fewer predictors.
  3. More advanced techniques such as the lasso or ridge regression might offer better predictions than logistic regression with simple interaction terms (see the sketch below).
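As an illustration of point 3, a lasso-penalized logistic regression could be fit with the glmnet package. This is a hypothetical sketch, not part of the original solution, and assumes glmnet is installed:

library(glmnet)
# Build a numeric predictor matrix (model.matrix adds an intercept column; drop it)
x_train = model.matrix(mpg01 ~ horsepower + cylinders + weight + year,
                       data = train_dataset)[, -1]
y_train = train_dataset$mpg01
cv_fit = cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)  # alpha = 1 -> lasso
coef(cv_fit, s = "lambda.min")  # coefficients at the CV-chosen penalty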