Textbook problems

Q. 1: Using a little bit of algebra, prove that (4.2) is equivalent to (4.3). In other words, the logistic function representation and logit representation for the logistic regression model are equivalent.

Starting with the logistic function representation (4.2):

p(X) = 1/(1 + e^(-β₀ - β₁X))

We want to show this is equivalent to the logit representation (4.3):

log(p(X)/(1-p(X))) = β₀ + β₁X

1. Start with p(X) = 1/(1 + e^(-β₀ - β₁X)).

2. Compute 1 - p(X):
   1 - p(X) = 1 - 1/(1 + e^(-β₀ - β₁X))
            = (1 + e^(-β₀ - β₁X))/(1 + e^(-β₀ - β₁X)) - 1/(1 + e^(-β₀ - β₁X))
            = e^(-β₀ - β₁X)/(1 + e^(-β₀ - β₁X))

3. Form the odds p(X)/(1-p(X)):
   p(X)/(1-p(X)) = [1/(1 + e^(-β₀ - β₁X))] / [e^(-β₀ - β₁X)/(1 + e^(-β₀ - β₁X))]
                 = 1/e^(-β₀ - β₁X)
                 = e^(β₀ + β₁X)

4. Take the natural log of both sides:
   log(p(X)/(1-p(X))) = log(e^(β₀ + β₁X)) = β₀ + β₁X

This is exactly (4.3), so the two representations are equivalent.
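As a quick numerical sanity check (not part of the proof), the two forms can be compared in R; the coefficient and predictor values below are arbitrary, chosen only for illustration:

beta0 = -2; beta1 = 0.3; x = 5       # arbitrary illustrative values
p = 1/(1 + exp(-(beta0 + beta1*x)))  # logistic form (4.2)
log(p/(1 - p))                       # logit form (4.3); returns -0.5
beta0 + beta1*x                      # matches: also -0.5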


Q.6 Suppose we collect data for a group of students in a statistics class with variables X₁ = hours studied, X₂ = undergrad GPA, and Y = receive an A. We fit a logistic regression and produce estimated coefficients β̂₀ = −6, β̂₁ = 0.05, β̂₂ = 1. (a) Estimate the probability that a student who studies for 40 hours and has an undergrad GPA of 3.5 gets an A in the class.

Solution: The log-odds are β̂₀ + β̂₁X₁ + β̂₂X₂ = −6 + 0.05(40) + 1(3.5) = −0.5, so p(X) = e^(−0.5)/(1 + e^(−0.5)) ≈ 0.3775; the probability of getting an A is about 37.75%.

(b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class? Solution: Setting p(X) = 0.5 makes the log-odds zero: −6 + 0.05X₁ + 1(3.5) = 0, so 0.05X₁ = 2.5 and X₁ = 50. The student would need to study 50 hours to have a 50% chance of getting an A.
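Both answers can be verified with a short R computation (a minimal sketch using the coefficient values given in the problem):

b0 = -6; b1 = 0.05; b2 = 1
# (a) probability of an A for 40 hours studied and a 3.5 GPA
log_odds = b0 + b1*40 + b2*3.5     # = -0.5
exp(log_odds)/(1 + exp(log_odds))  # ~ 0.3775
# (b) hours needed for a 50% chance: set the log-odds to 0 and solve for X1
(0 - b0 - b2*3.5)/b1               # = 50 hours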

Q.9 This problem has to do with odds. (a) On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default? (b) Suppose that an individual has a 16% chance of defaulting on her credit card payment. What are the odds that she will default?

Solution: (a) Odds = p/(1−p) = 0.37. Solving for p gives p = 0.37/1.37 ≈ 0.27, or 27%; on average, 27% of people with odds of 0.37 will default on their credit card payments. (b) With p = 0.16, the odds are 0.16/(1−0.16) ≈ 0.19; an individual with a 16% chance of defaulting has odds of about 0.19.
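A minimal R check of both conversions:

odds = 0.37
odds/(1 + odds)  # (a) p = odds/(1 + odds) ~ 0.27
p = 0.16
p/(1 - p)        # (b) odds = p/(1 - p) ~ 0.19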


Problem 1

In this problem you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set from the ISLR2 package.

Create a binary variable, mpg01, which equals 1 if mpg is above its median and 0 otherwise. You can compute the median using the median() function. This variable should be added to the Auto dataset using the $ operator.

library(ISLR2)  # Auto data set

data(Auto)
median_mpg = median(Auto$mpg)
Auto$mpg01 = ifelse(Auto$mpg > median_mpg, 1, 0)  # 1 if mpg is above its median, else 0
Auto$mpg = NULL  # remove mpg so it cannot be used as a predictor

Part A

Create a training and testing set by dividing the data into two parts. There are 392 total observations. Designate the training set as the odd observations, and the testing set as the even observations.

train_dataset = Auto[seq(1, nrow(Auto), by = 2), ]  # odd-numbered observations
test_dataset = Auto[seq(2, nrow(Auto), by = 2), ]   # even-numbered observations

print(paste("Training set size:", nrow(train_dataset)))
## [1] "Training set size: 196"
print(paste("Testing set size:", nrow(test_dataset)))
## [1] "Testing set size: 196"
print(paste("Original size:", nrow(Auto)))
## [1] "Original size: 392"

Part B

Using logistic regression, predict mpg01 using:

  1. Variables horsepower, cylinders, and weight
  2. Variables horsepower, cylinders, weight, and year
  3. Variables horsepower, cylinders, weight, and year plus all two-way interactions
#a. Variables horsepower, cylinders, and weight
model1 = glm(mpg01 ~ horsepower + cylinders + weight, data=train_dataset, family = "binomial")  
summary(model1) 
## 
## Call:
## glm(formula = mpg01 ~ horsepower + cylinders + weight, family = "binomial", 
##     data = train_dataset)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 12.3665998  1.9360643   6.387 1.69e-10 ***
## horsepower  -0.0358216  0.0197315  -1.815  0.06945 .  
## cylinders   -0.5519014  0.3326000  -1.659  0.09704 .  
## weight      -0.0021753  0.0008164  -2.664  0.00771 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.63  on 195  degrees of freedom
## Residual deviance: 111.40  on 192  degrees of freedom
## AIC: 119.4
## 
## Number of Fisher Scoring iterations: 7
#b. Variables horsepower, cylinders, weight, and year
model2 = glm(mpg01 ~ horsepower + cylinders + weight + year, data=train_dataset, family = "binomial")  
summary(model2)
## 
## Call:
## glm(formula = mpg01 ~ horsepower + cylinders + weight + year, 
##     family = "binomial", data = train_dataset)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.576602   6.063947  -2.074 0.038080 *  
## horsepower   -0.041059   0.021660  -1.896 0.058008 .  
## cylinders    -0.063915   0.377010  -0.170 0.865380    
## weight       -0.004056   0.001126  -3.601 0.000317 ***
## year          0.371326   0.093733   3.962 7.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.632  on 195  degrees of freedom
## Residual deviance:  89.971  on 191  degrees of freedom
## AIC: 99.971
## 
## Number of Fisher Scoring iterations: 7
#c. Variables horsepower, cylinders, weight, and year plus all two-way interactions
# (equivalent shorthand: mpg01 ~ (horsepower + cylinders + weight + year)^2)

model3 = glm(mpg01 ~ horsepower + cylinders + weight + year + horsepower * cylinders + horsepower * weight + horsepower * year + cylinders * weight + cylinders * year + weight * year, data=train_dataset, family = "binomial")
summary(model3)
## 
## Call:
## glm(formula = mpg01 ~ horsepower + cylinders + weight + year + 
##     horsepower * cylinders + horsepower * weight + horsepower * 
##     year + cylinders * weight + cylinders * year + weight * year, 
##     family = "binomial", data = train_dataset)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)           3.940e+01  6.401e+01   0.616    0.538
## horsepower           -1.837e-01  6.275e-01  -0.293    0.770
## cylinders            -8.481e+00  9.940e+00  -0.853    0.394
## weight               -6.296e-03  2.497e-02  -0.252    0.801
## year                 -2.045e-01  8.777e-01  -0.233    0.816
## horsepower:cylinders  2.357e-02  1.787e-02   1.319    0.187
## horsepower:weight    -3.967e-05  7.453e-05  -0.532    0.595
## horsepower:year       1.655e-03  8.941e-03   0.185    0.853
## cylinders:weight      4.552e-04  9.802e-04   0.464    0.642
## cylinders:year        6.148e-02  1.403e-01   0.438    0.661
## weight:year           5.124e-05  3.221e-04   0.159    0.874
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.632  on 195  degrees of freedom
## Residual deviance:  85.378  on 185  degrees of freedom
## AIC: 107.38
## 
## Number of Fisher Scoring iterations: 9

Part C

For each model, calculate the misclassification rate and confusion matrix on the training data, assuming a threshold of 0.5. Which model would you prefer?

# Predict probabilities for model1 on training set
phat1 = predict(model1, train_dataset, type = "response")

# Using threshold 0.5
yhat1 =  ifelse(phat1 > 0.5, 1, 0)

y1 = train_dataset$mpg01
MCR1 = mean(y1 != yhat1)
accuracy1 = 1 - MCR1
accuracy1
## [1] 0.9132653
MCR1
## [1] 0.08673469
table(Predicted = yhat1, Actual = y1)
##          Actual
## Predicted  0  1
##         0 90  7
##         1 10 89
# Predict probabilities for model2 on training set
phat2 = predict(model2, train_dataset, type = "response")

# Using threshold 0.5
yhat2 =  ifelse(phat2 > 0.5, 1, 0)

y2 = train_dataset$mpg01
MCR2 = mean(y2 != yhat2)
accuracy2 = 1 - MCR2
accuracy2
## [1] 0.8877551
MCR2
## [1] 0.1122449
table(Predicted = yhat2, Actual = y2)
##          Actual
## Predicted  0  1
##         0 87  9
##         1 13 87
# Predict probabilities for model3 on training set
phat3 = predict(model3, train_dataset, type = "response")

# Using threshold 0.5
yhat3 =  ifelse(phat3 > 0.5, 1, 0)

y3 = train_dataset$mpg01
MCR3 = mean(y3 != yhat3)
accuracy3 = 1 - MCR3
accuracy3
## [1] 0.8979592
MCR3
## [1] 0.1020408
table(Predicted = yhat3, Actual = y3)
##          Actual
## Predicted  0  1
##         0 89  9
##         1 11 87

Answer: Of the three models, Model 1 (horsepower, cylinders, and weight) has the highest training accuracy, 0.913 (91.33%), and the lowest training MCR, 0.087 (8.67%). Based on the training data alone, Model 1 appears to be the most reliable model, with the fewest errors.
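As an aside, the three blocks above repeat the same predict/threshold/table steps. A small helper function (hypothetical, not part of the original solution) could compute both quantities for any fitted model and dataset:

# Hypothetical helper; assumes a fitted glm and a data frame containing mpg01
classify_report = function(model, data, threshold = 0.5) {
  phat = predict(model, data, type = "response")
  yhat = ifelse(phat > threshold, 1, 0)
  list(MCR = mean(data$mpg01 != yhat),
       confusion = table(Predicted = yhat, Actual = data$mpg01))
}
# Example usage: classify_report(model1, train_dataset)$MCR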

Part D

For each model, calculate the misclassification rate and confusion matrix on the testing data, assuming a threshold of 0.5. Which model would you prefer? Is this the same as the prior step?

#Testing Data MCR

# Predict probabilities for model1 on testing set
phat1 = predict(model1, test_dataset, type = "response")

# Using threshold 0.5
yhat1 =  ifelse(phat1 > 0.5, 1, 0)

y1 = test_dataset$mpg01
MCR1 = mean(y1 != yhat1)
accuracy1 = 1 - MCR1
accuracy1
## [1] 0.9081633
MCR1
## [1] 0.09183673
table(Predicted = yhat1, Actual = y1)
##          Actual
## Predicted  0  1
##         0 86  8
##         1 10 92
# Predict probabilities for model2 on testing set
phat2 = predict(model2, test_dataset, type = "response")

# Using threshold 0.5
yhat2 =  ifelse(phat2 > 0.5, 1, 0)

y2 = test_dataset$mpg01
MCR2 = mean(y2 != yhat2)
accuracy2 = 1 - MCR2
accuracy2
## [1] 0.9081633
MCR2
## [1] 0.09183673
table(Predicted = yhat2, Actual = y2)
##          Actual
## Predicted  0  1
##         0 85  7
##         1 11 93
# Predict probabilities for model3 on testing set
phat3 = predict(model3, test_dataset, type = "response")

# Using threshold 0.5
yhat3 =  ifelse(phat3 > 0.5, 1, 0)

y3 = test_dataset$mpg01
MCR3 = mean(y3 != yhat3)
accuracy3 = 1 - MCR3
accuracy3
## [1] 0.9183673
MCR3
## [1] 0.08163265
table(Predicted = yhat3, Actual = y3)
##          Actual
## Predicted  0  1
##         0 88  8
##         1  8 92

Answer: Training and testing results are relatively consistent across the models, particularly for Model 3. On the testing data, Model 3 has the highest accuracy (0.918, or 91.84%), the lowest MCR (0.082, or 8.16%), and the fewest false positives (8, compared with 10 for Model 1 and 11 for Model 2). This suggests Model 3 is not overfitting and generalizes well, so the preference here differs from Part C, where Model 1 looked best on the training data.

Part E

Report the true positive (true value 1, predicted value 1) and false positive (true value 0, predicted value 1) numbers for the testing data.

# TP/FP values

#Model1
cmatrix1 = table(Predicted = yhat1, Actual = y1)

TP1 = cmatrix1[2,2]  # predicted 1, actual 1
FP1 = cmatrix1[2,1]  # predicted 1, actual 0

print(paste("The TP of Model1 is",TP1))
## [1] "The TP of Model1 is 92"
print(paste("The FP of Model1 is",FP1))
## [1] "The FP of Model1 is 10"
#Model2
cmatrix2 = table(Predicted = yhat2, Actual = y2)

TP2 = cmatrix2[2,2]  # predicted 1, actual 1
FP2 = cmatrix2[2,1]  # predicted 1, actual 0

print(paste("The TP of Model2 is",TP2))
## [1] "The TP of Model2 is 93"
print(paste("The FP of Model2 is",FP2))
## [1] "The FP of Model2 is 11"
#Model3
cmatrix3 = table(Predicted = yhat3, Actual = y3)

TP3 = cmatrix3[2,2]  # predicted 1, actual 1
FP3 = cmatrix3[2,1]  # predicted 1, actual 0

print(paste("The TP of Model3 is",TP3))
## [1] "The TP of Model3 is 92"
print(paste("The FP of Model3 is",FP3))
## [1] "The FP of Model3 is 8"

Part F

Provide ROC curves for the three methods (ideally in one graph) on the testing data. Print out the AUCs. Which method has the lowest AUC?

# Create ROC curves with AUC

# NOTE: I used ChatGPT for help with this portion, since I didn't understand the notes from class.

library(pROC)  # provides plot.roc(), roc(), and auc()

true_labels = test_dataset$mpg01

# Model1
phat1 = predict(model1, test_dataset, type = "response")

plot.roc(true_labels ~ phat1,
         print.auc = TRUE, col = "blue", main = "ROC Curves for Models 1, 2, and 3")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Model2 
phat2 = predict(model2, test_dataset, type = "response")

plot.roc(true_labels ~ phat2, 
         print.auc = TRUE, print.auc.y = 0.4, add = TRUE, col = "red")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Model3
phat3 = predict(model3, test_dataset, type = "response")

plot.roc(true_labels ~ phat3, 
         print.auc = TRUE, print.auc.y = 0.3, add = TRUE, col = "green")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Add a legend to the plot
legend("bottomright", legend = c("Model 1", "Model 2", "Model 3"), 
       col = c("blue", "red", "green"), lwd = 2)

Answer: Model 1 has the lowest AUC (0.961) as seen in the plot above.
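To print the AUCs directly rather than reading them off the plot, pROC's roc() and auc() functions can be used; a minimal sketch, assuming library(pROC) is already loaded as above:

# Numeric AUCs; the values should match the labels printed on the plot
auc(roc(true_labels, phat1))
auc(roc(true_labels, phat2))
auc(roc(true_labels, phat3))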

Part G

Based on the results of steps C-F, are there any characteristics of the data you can infer?

Answer:

  1. Training and testing performance are relatively consistent across the models, with no sign of overfitting (as seen for Model 3). This indicates the models capture relationships in the data that are useful for prediction on new observations.
  2. Model 1 is also a strong model; even though it is simpler, it offers good performance with fewer predictors.
  3. More advanced techniques such as the lasso or ridge regression might offer better predictions than logistic regression with simple interaction terms (see the sketch below).
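As an illustration of point 3, a lasso-penalized logistic regression could be fit with the glmnet package. This is a hypothetical sketch, not part of the original solution, and assumes glmnet is installed:

library(glmnet)
# Build a numeric predictor matrix (model.matrix adds an intercept column; drop it)
x_train = model.matrix(mpg01 ~ horsepower + cylinders + weight + year,
                       data = train_dataset)[, -1]
y_train = train_dataset$mpg01
cv_fit = cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)  # alpha = 1 -> lasso
coef(cv_fit, s = "lambda.min")  # coefficients at the CV-chosen penalty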