Problem 3

# Load necessary library
library(ggplot2)
# Define the range of p1 (ranging from 0 to 1)
p1 <- seq(0, 1, by = 0.01)

# Calculate the Gini index, Classification Error, and Entropy
p2 <- 1 - p1

# Gini index
gini <- 1 - p1^2 - p2^2

# Classification error
classification_error <- 1 - pmax(p1, p2)

# Entropy
entropy <- -p1 * log(p1 + 1e-9) - p2 * log(p2 + 1e-9)  # Add small value to avoid log(0)

# Create a data frame
df <- data.frame(p1, gini, classification_error, entropy)

# Create the plot
ggplot(df, aes(x = p1)) +
  geom_line(aes(y = gini, color = "Gini Index")) +
  geom_line(aes(y = classification_error, color = "Classification Error")) +
  geom_line(aes(y = entropy, color = "Entropy")) +
  labs(
    title = "Gini Index, Classification Error, and Entropy vs. p1",
    x = "p1",
    y = "Value",
    color = "Metric"
  ) +
  theme_minimal()

Interpretation of the Plot:

The plot shows how the Gini index, classification error, and entropy vary with p1, the probability of class 1 (with p2 = 1 - p1).

Gini Index (blue): highest at p1 = 0.5 and decreases as p1 moves toward 0 or 1 (more certainty).

Classification Error (red): highest at p1 = 0.5 and decreases as p1 moves toward 0 or 1.

Entropy (green): highest at p1 = 0.5 and decreases as p1 moves toward 0 or 1.

In summary, all three metrics reflect higher uncertainty and error at p1 = 0.5 and decrease as p1 approaches 0 or 1.
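To make the comparison concrete, the maxima at p1 = 0.5 can be computed directly from the same formulas used above (p here is just a scalar standing in for p1). Because the entropy above uses the natural log, its maximum is log(2) ≈ 0.693, slightly above the 0.5 maximum shared by the Gini index and the classification error:

# Maxima of each impurity measure at p1 = 0.5, using the same formulas as above
p <- 0.5
c(gini = 1 - p^2 - (1 - p)^2,                    # 0.5
  error = 1 - max(p, 1 - p),                     # 0.5
  entropy = -p * log(p) - (1 - p) * log(1 - p))  # log(2), about 0.693 (natural log)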

Problem 8

#Part a

library(ISLR)   # For Carseats data
library(caret)  # For data splitting
# Load the data
data(Carseats)

# Split the data into training and test sets
set.seed(42) # For reproducibility
trainIndex <- createDataPartition(Carseats$Sales, p = 0.7, list = FALSE)
trainData <- Carseats[trainIndex, ]
testData <- Carseats[-trainIndex, ]
#part b
# Load necessary library
library(rpart)
library(rpart.plot)
# Fit a regression tree
treeModel <- rpart(Sales ~ ., data = trainData, method = "anova")

# Plot the tree
rpart.plot(treeModel)

# Evaluate the model's test MSE
predictions <- predict(treeModel, testData)
testMSE <- mean((predictions - testData$Sales)^2)
testMSE
## [1] 4.045978

Interpretation:

The regression tree predicts Sales based on variables like Price, Age, ShelveLoc, and CompPrice:

The root node shows an overall mean Sales of about 7.6; the first split is on ShelveLoc, separating the Bad and Medium shelf locations from Good.

For ShelveLoc = Bad, Medium, the next split is on Price:

If Price >= 126, predicted Sales is 5.3.

If Price < 126, predicted Sales is 7.3.

Further splits refine the prediction based on Age, Income, and CompPrice.

For Price >= 126:

Age >= 68: Predicted Sales is 5.7.

Age < 68: Predicted Sales is 5.3.

For Price < 126:

ShelveLoc = Bad: Predicted Sales is 7.8.

ShelveLoc ≠ Bad: Predicted Sales is 8.1.

For ShelveLoc = Bad, Price < 126:

Income < 99: Predicted Sales is 6.1.

Income ≥ 99: Predicted Sales is 7.4.

The terminal nodes show the predicted Sales values, with percentages indicating the share of training observations falling in each node.
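The list of influential variables can be checked directly: rpart stores an overall importance score for each predictor in the fitted object, so a quick look (using the treeModel fit above) is:

# Overall variable importance accumulated across the tree's splits (and surrogate splits)
round(treeModel$variable.importance, 1)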

#part c
# Use cross-validation to determine optimal cp
cvModel <- rpart(Sales ~ ., data = trainData, method = "anova", cp = 0.001)

# Print the complexity parameter table
cvModel$cptable
##             CP nsplit rel error    xerror       xstd
## 1  0.275001635      0 1.0000000 1.0118327 0.08319897
## 2  0.103377583      1 0.7249984 0.7402417 0.06003469
## 3  0.050272503      2 0.6216208 0.6905991 0.05707541
## 4  0.050228149      3 0.5713483 0.6830760 0.05703691
## 5  0.034055693      4 0.5211201 0.6191846 0.05393866
## 6  0.030911933      5 0.4870644 0.5945931 0.05157623
## 7  0.023292270      7 0.4252406 0.6082236 0.05238928
## 8  0.023284671      8 0.4019483 0.6114730 0.05218612
## 9  0.018885972      9 0.3786636 0.6091390 0.05119559
## 10 0.013323733     10 0.3597777 0.5909360 0.04973932
## 11 0.013317103     11 0.3464539 0.6052161 0.04840655
## 12 0.013275781     12 0.3331368 0.6052161 0.04840655
## 13 0.012766378     13 0.3198610 0.6058079 0.04845679
## 14 0.011849880     14 0.3070947 0.6105308 0.04893088
## 15 0.010260964     15 0.2952448 0.5941053 0.04658594
## 16 0.009280998     16 0.2849838 0.5863806 0.04541625
## 17 0.008880389     17 0.2757028 0.5841165 0.04532165
## 18 0.007204720     18 0.2668224 0.5903610 0.04535992
## 19 0.007092760     19 0.2596177 0.5901738 0.04537146
## 20 0.003228777     20 0.2525250 0.5840901 0.04390220
## 21 0.002745262     21 0.2492962 0.5778866 0.04314708
## 22 0.002143251     22 0.2465509 0.5742499 0.04311011
## 23 0.001000000     23 0.2444077 0.5736309 0.04310261
# Prune the tree to the optimal cp
prunedTree <- prune(cvModel, cp = cvModel$cptable[which.min(cvModel$cptable[, 4]), 1])

# Plot the pruned tree
rpart.plot(prunedTree)

# Evaluate the pruned tree's test MSE
predictionsPruned <- predict(prunedTree, testData)
testMSEPruned <- mean((predictionsPruned - testData$Sales)^2)
testMSEPruned
## [1] 4.064851

Interpretation:

This is the pruned regression tree that has been optimized using cross-validation.

Root Node (Top): The root shows a mean Sales of 7.6 over 100% of the training data, and the first split is on ShelveLoc:

ShelveLoc = Bad, Medium: mean predicted Sales is 6.7, with 76% of the data in this branch. ShelveLoc = Good: the remaining 24% of the data, with a correspondingly higher mean Sales.

Further splits:

  1. For ShelveLoc ≠ Bad, Medium (Price >= 126):

    • Age >= 68: Predicted Sales is 5.3.
    • Age < 68: Predicted Sales is 5.7.
  2. For ShelveLoc = Bad, Medium (Price < 126):

    • Price < 99: Predicted Sales is 6.1.
    • Price >= 99: Predicted Sales is 7.8.
  3. For Price >= 110 (ShelveLoc ≠ Bad, Medium):

    • Price >= 150: Predicted Sales is 9.4.
    • Price < 150: Predicted Sales is 8.8.

Note that the minimum cross-validated error (xerror) in the cptable occurs at the largest tree (cp = 0.001, 23 splits), so pruning at the minimum xerror leaves the tree essentially as complex as the cp = 0.001 fit, and more complex than the default rpart tree from part (b). A stricter criterion, such as the 1-SE rule, would produce a genuinely smaller tree.

Does pruning the tree improve the test MSE? No. The pruned tree's test MSE (4.065) is slightly higher than the unpruned tree's (4.046), so pruning does not improve test performance here.
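An alternative to pruning at the minimum xerror is the 1-SE rule: choose the smallest tree whose cross-validated error is within one standard error of the minimum. The sketch below reuses cvModel from above; cpTab, minRow, cutoff, bestRow, and prunedTree1SE are new names introduced here.

# 1-SE rule: smallest tree whose xerror is within one SE of the minimum xerror
cpTab <- cvModel$cptable
minRow <- which.min(cpTab[, "xerror"])
cutoff <- cpTab[minRow, "xerror"] + cpTab[minRow, "xstd"]
bestRow <- which(cpTab[, "xerror"] <= cutoff)[1]   # first (smallest) tree under the cutoff
prunedTree1SE <- prune(cvModel, cp = cpTab[bestRow, "CP"])
rpart.plot(prunedTree1SE)
mean((predict(prunedTree1SE, testData) - testData$Sales)^2)  # test MSE of the 1-SE tree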

#part d
# Load necessary library
library(randomForest)
# Fit a bagging model (bootstrap aggregation)
baggingModel <- randomForest(Sales ~ ., data = trainData, mtry = ncol(trainData) - 1, importance = TRUE)

# Evaluate test MSE
predictionsBagging <- predict(baggingModel, testData)
testMSEBagging <- mean((predictionsBagging - testData$Sales)^2)
testMSEBagging
## [1] 2.012568
# Variable importance
importance(baggingModel)
##                %IncMSE IncNodePurity
## CompPrice   32.4402753    248.673983
## Income       9.2519199    118.662022
## Advertising 22.8843676    171.377536
## Population  -1.2023248     70.140769
## Price       63.6666533    644.798082
## ShelveLoc   68.5582028    782.353083
## Age         19.4472378    177.672465
## Education    2.6318832     54.679605
## Urban        0.7181827     10.289478
## US           2.9375294      9.974131

Interpretation:

The test MSE for the bagging model is about 2.01, a clear improvement over the single regression tree (test MSE ≈ 4.05).

Based on the importance() output, ShelveLoc and Price are by far the most important variables (%IncMSE ≈ 68.6 and 63.7), followed by CompPrice, Advertising, and Age; Population, Urban, and US contribute little.
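The ranking is easier to see as a plot; varImpPlot() from randomForest shows both measures (%IncMSE and IncNodePurity) side by side:

# Dotcharts of both importance measures for the bagging model
varImpPlot(baggingModel, main = "Variable Importance (Bagging)")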

#part e

# Fit a random forest model
rfModel <- randomForest(Sales ~ ., data = trainData, mtry = 5, importance = TRUE)

# Evaluate test MSE
predictionsRF <- predict(rfModel, testData)
testMSERF <- mean((predictionsRF - testData$Sales)^2)
testMSERF
## [1] 2.160716
# Explore effect of m
mtryValues <- c(2, 5, 7, 10)
mseValues <- sapply(mtryValues, function(m) {
  rfModel <- randomForest(Sales ~ ., data = trainData, mtry = m, importance = TRUE)
  predictions <- predict(rfModel, testData)
  mean((predictions - testData$Sales)^2)
})

# Plot the MSE values against m
plot(mtryValues, mseValues, type = "b", xlab = "m (Number of Variables per Split)", ylab = "Test MSE")

Interpretation:

The test MSE for the random forest model with mtry = 5 is about 2.16, slightly higher than the bagging model's 2.01 on this split.

According to the plot, the test MSE decreases as m increases from 2 and then flattens or rises slightly for the largest values tested (m = 7 and m = 10), suggesting that an intermediate value of m is optimal; with 10 predictors, m = 10 corresponds to bagging.

The importance function identifies the most influential variables, but their specific rankings are not shown here.
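Since the loop above evaluates every m on the same test set, the out-of-bag (OOB) error is a useful alternative for choosing m without touching the test data. A sketch using the same mtryValues; oobMSE and fit are new names introduced here.

# Compare OOB MSE (error on observations left out of each bootstrap sample) across m
oobMSE <- sapply(mtryValues, function(m) {
  fit <- randomForest(Sales ~ ., data = trainData, mtry = m, ntree = 500)
  tail(fit$mse, 1)  # OOB MSE after all 500 trees
})
cbind(mtry = mtryValues, oobMSE = round(oobMSE, 3))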

Problem 9

#part a

library(ISLR2)
# Load the data
data(OJ)

# Set seed for reproducibility
set.seed(42)

# Split the data: 800 observations for training, rest for testing
trainIndex <- sample(1:nrow(OJ), 800)
trainData <- OJ[trainIndex, ]
testData <- OJ[-trainIndex, ]
#part b


library(rpart)

# Fit a classification tree with Purchase as the response variable
treeModel <- rpart(Purchase ~ ., data = trainData, method = "class")

# Print the summary of the tree
summary(treeModel)
## Call:
## rpart(formula = Purchase ~ ., data = trainData, method = "class")
##   n= 800 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.52922078      0 1.0000000 1.0000000 0.04468505
## 2 0.01515152      1 0.4707792 0.4870130 0.03584319
## 3 0.01000000      4 0.4253247 0.4902597 0.03593479
## 
## Variable importance
##        LoyalCH      PriceDiff  ListPriceDiff        PriceMM    SalePriceMM 
##             70              6              5              4              3 
##        PriceCH      PctDiscCH    SalePriceCH         DiscCH         DiscMM 
##              3              2              2              2              2 
##      PctDiscMM WeekofPurchase 
##              2              1 
## 
## Node number 1: 800 observations,    complexity param=0.5292208
##   predicted class=CH  expected loss=0.385  P(node) =1
##     class counts:   492   308
##    probabilities: 0.615 0.385 
##   left son=2 (515 obs) right son=3 (285 obs)
##   Primary splits:
##       LoyalCH   < 0.48285   to the right, improve=142.35420, (0 missing)
##       StoreID   < 3.5       to the right, improve= 38.92912, (0 missing)
##       PriceDiff < 0.015     to the right, improve= 21.29666, (0 missing)
##       Store7    splits as  RL, improve= 20.47828, (0 missing)
##       STORE     < 0.5       to the left,  improve= 20.47828, (0 missing)
##   Surrogate splits:
##       DiscMM      < 0.57      to the left,  agree=0.652, adj=0.025, (0 split)
##       PctDiscMM   < 0.264375  to the left,  agree=0.652, adj=0.025, (0 split)
##       PriceMM     < 1.89      to the right, agree=0.651, adj=0.021, (0 split)
##       SalePriceMM < 1.385     to the right, agree=0.651, adj=0.021, (0 split)
##       PriceDiff   < -0.575    to the right, agree=0.651, adj=0.021, (0 split)
## 
## Node number 2: 515 observations,    complexity param=0.01515152
##   predicted class=CH  expected loss=0.1631068  P(node) =0.64375
##     class counts:   431    84
##    probabilities: 0.837 0.163 
##   left son=4 (313 obs) right son=5 (202 obs)
##   Primary splits:
##       LoyalCH       < 0.705699  to the right, improve=16.736450, (0 missing)
##       PriceDiff     < 0.015     to the right, improve=10.207720, (0 missing)
##       SalePriceMM   < 2.125     to the right, improve= 8.685844, (0 missing)
##       ListPriceDiff < 0.255     to the right, improve= 7.226422, (0 missing)
##       PriceMM       < 2.04      to the right, improve= 5.060421, (0 missing)
##   Surrogate splits:
##       WeekofPurchase < 237.5     to the right, agree=0.674, adj=0.168, (0 split)
##       PriceCH        < 1.755     to the right, agree=0.662, adj=0.139, (0 split)
##       PriceMM        < 2.04      to the right, agree=0.662, adj=0.139, (0 split)
##       SalePriceMM    < 1.64      to the right, agree=0.629, adj=0.054, (0 split)
##       SalePriceCH    < 1.775     to the right, agree=0.621, adj=0.035, (0 split)
## 
## Node number 3: 285 observations
##   predicted class=MM  expected loss=0.2140351  P(node) =0.35625
##     class counts:    61   224
##    probabilities: 0.214 0.786 
## 
## Node number 4: 313 observations
##   predicted class=CH  expected loss=0.06070288  P(node) =0.39125
##     class counts:   294    19
##    probabilities: 0.939 0.061 
## 
## Node number 5: 202 observations,    complexity param=0.01515152
##   predicted class=CH  expected loss=0.3217822  P(node) =0.2525
##     class counts:   137    65
##    probabilities: 0.678 0.322 
##   left son=10 (120 obs) right son=11 (82 obs)
##   Primary splits:
##       ListPriceDiff < 0.235     to the right, improve=11.332540, (0 missing)
##       PriceDiff     < 0.265     to the right, improve=10.907380, (0 missing)
##       SalePriceMM   < 2.125     to the right, improve= 8.724531, (0 missing)
##       PriceMM       < 2.11      to the right, improve= 5.157551, (0 missing)
##       DiscCH        < 0.115     to the right, improve= 4.328572, (0 missing)
##   Surrogate splits:
##       PriceDiff   < 0.235     to the right, agree=0.787, adj=0.476, (0 split)
##       PriceMM     < 1.89      to the right, agree=0.713, adj=0.293, (0 split)
##       PriceCH     < 1.875     to the left,  agree=0.688, adj=0.232, (0 split)
##       SalePriceMM < 2.105     to the right, agree=0.688, adj=0.232, (0 split)
##       SalePriceCH < 1.875     to the left,  agree=0.688, adj=0.232, (0 split)
## 
## Node number 10: 120 observations
##   predicted class=CH  expected loss=0.1833333  P(node) =0.15
##     class counts:    98    22
##    probabilities: 0.817 0.183 
## 
## Node number 11: 82 observations,    complexity param=0.01515152
##   predicted class=MM  expected loss=0.4756098  P(node) =0.1025
##     class counts:    39    43
##    probabilities: 0.476 0.524 
##   left son=22 (12 obs) right son=23 (70 obs)
##   Primary splits:
##       PctDiscCH < 0.052007  to the right, improve=5.469106, (0 missing)
##       DiscCH    < 0.115     to the right, improve=4.875412, (0 missing)
##       PriceDiff < 0.215     to the right, improve=3.597677, (0 missing)
##       PctDiscMM < 0.1961965 to the left,  improve=2.305800, (0 missing)
##       DiscMM    < 0.25      to the left,  improve=2.035928, (0 missing)
##   Surrogate splits:
##       PriceDiff   < 0.265     to the right, agree=0.963, adj=0.750, (0 split)
##       DiscCH      < 0.115     to the right, agree=0.951, adj=0.667, (0 split)
##       PriceCH     < 2.075     to the right, agree=0.878, adj=0.167, (0 split)
##       SalePriceCH < 1.54      to the left,  agree=0.878, adj=0.167, (0 split)
## 
## Node number 22: 12 observations
##   predicted class=CH  expected loss=0.08333333  P(node) =0.015
##     class counts:    11     1
##    probabilities: 0.917 0.083 
## 
## Node number 23: 70 observations
##   predicted class=MM  expected loss=0.4  P(node) =0.0875
##     class counts:    28    42
##    probabilities: 0.400 0.600
# Training error rate (misclassification rate)
trainPredictions <- predict(treeModel, trainData, type = "class")
trainErrorRate <- mean(trainPredictions != trainData$Purchase)
trainErrorRate
## [1] 0.16375
#part c

# Print the tree structure
treeModel
## n= 800 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 308 CH (0.61500000 0.38500000)  
##    2) LoyalCH>=0.48285 515  84 CH (0.83689320 0.16310680)  
##      4) LoyalCH>=0.705699 313  19 CH (0.93929712 0.06070288) *
##      5) LoyalCH< 0.705699 202  65 CH (0.67821782 0.32178218)  
##       10) ListPriceDiff>=0.235 120  22 CH (0.81666667 0.18333333) *
##       11) ListPriceDiff< 0.235 82  39 MM (0.47560976 0.52439024)  
##         22) PctDiscCH>=0.052007 12   1 CH (0.91666667 0.08333333) *
##         23) PctDiscCH< 0.052007 70  28 MM (0.40000000 0.60000000) *
##    3) LoyalCH< 0.48285 285  61 MM (0.21403509 0.78596491) *

Interpretation:

The output is the text representation of the decision tree fit to the OJ training data. Each row corresponds to a node and shows the split, the number of observations, the number misclassified (loss), the predicted class, and the class probabilities; terminal nodes are marked with an asterisk (*).

Example: Node 3 (a terminal node)

LoyalCH < 0.48285: the node contains the observations whose loyalty to Citrus Hill (LoyalCH) is below 0.48285.

285: this node contains 285 observations.

61: 61 of the 285 observations are misclassified (they actually purchased CH).

MM (0.21403509 0.78596491): the predicted class is MM (Minute Maid); the class probabilities are about 0.214 for CH (Citrus Hill) and 0.786 for MM.

Conclusion:

Terminal node 3 indicates that when a customer's Citrus Hill loyalty is below 0.48285, about 78.6% of customers purchase Minute Maid, so the tree predicts MM for this group.
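The conditions leading to any node can also be printed with path.rpart() from the rpart package, which is handy when reading the text output; for terminal node 3:

# Print the split conditions on the path from the root to node 3
path.rpart(treeModel, nodes = 3)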

#part d

library(rpart.plot)

# Plot the tree
rpart.plot(treeModel)

Interpretation:

The decision tree splits primarily on LoyalCH, with further splits on ListPriceDiff and PctDiscCH, to predict Purchase: customers with low Citrus Hill loyalty are predicted to buy MM, while loyal customers are mostly predicted to buy CH.
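For classification trees, rpart.plot can also display the predicted class, the per-class probabilities, and the percentage of observations in each node, which makes the plot easier to read; one option (extra = 104) is:

# Show predicted class, class probabilities, and % of observations at each node
rpart.plot(treeModel, extra = 104)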

#part e

# Make predictions on the test data
testPredictions <- predict(treeModel, testData, type = "class")

# Create a confusion matrix
confusionMatrix <- table(Predicted = testPredictions, Actual = testData$Purchase)
confusionMatrix
##          Actual
## Predicted  CH  MM
##        CH 122  13
##        MM  39  96
# Calculate the test error rate (misclassification rate)
testErrorRate <- mean(testPredictions != testData$Purchase)
testErrorRate
## [1] 0.1925926

From this confusion matrix:

Correctly predicted CH (CH predicted as CH) = 122

MM incorrectly predicted as CH = 13

CH incorrectly predicted as MM = 39

Correctly predicted MM (MM predicted as MM) = 96

To calculate the test error rate, we use the formula:

Test Error Rate = (Misclassified Observations) / (Total Observations)

Substituting the values:

Test Error Rate = (13 + 39) / (122 + 13 + 39 + 96) = 52 / 270 ≈ 0.1926

So, the test error rate is approximately 0.1926, or about 19.3%, matching the value computed above.
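The same arithmetic can be done directly from the confusion matrix object, so the reported rate always stays consistent with the table:

# Misclassifications are the off-diagonal counts of the confusion matrix
1 - sum(diag(confusionMatrix)) / sum(confusionMatrix)  # equals testErrorRate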

# part f 

library(tree)
# Fit a classification tree to the training data
treeModel <- tree(Purchase ~ ., data = trainData)

# Apply cross-validation
cvTree <- cv.tree(treeModel, FUN = prune.misclass)

# Print the result
cvTree
## $size
## [1] 8 5 2 1
## 
## $dev
## [1] 152 152 148 308
## 
## $k
## [1]       -Inf   0.000000   4.666667 163.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

Interpretation:

The output shows the result of applying cv.tree() on the training set to determine the optimal tree size.

This suggests that the optimal tree size is 2, which has the lowest cross-validated deviance (148); sizes 5 and 8 tie at 152, and the single-node tree has a much higher deviance of 308.

#part g
# Plot tree size vs. cross-validated error rate
plot(cvTree$size, cvTree$dev, type = "b", xlab = "Tree Size", ylab = "Cross-Validated Misclassifications")

Interpretation:

The plot shows that the cross-validated misclassification count drops sharply from tree size 1 to 2 (308 to 148). Beyond size 2 it stays essentially flat, so the optimal tree size is 2, as larger trees do not improve the cross-validated error.

#part h

# Find the tree size with the lowest error rate
optimalTreeSize <- cvTree$size[which.min(cvTree$dev)]
optimalTreeSize
## [1] 2
#part (i)

# Prune the tree to the optimal size
prunedTree <- prune.tree(treeModel, best = optimalTreeSize)

# Plot the pruned tree
plot(prunedTree)
text(prunedTree, pretty = 0)

Interpretation:

The pruned tree has a single split, on LoyalCH: if LoyalCH < 0.5036, the prediction is MM; if LoyalCH >= 0.5036, the prediction is CH.

With only two terminal nodes, the pruned tree says that brand loyalty to Citrus Hill alone determines the prediction: loyal customers are predicted to buy CH, and the rest are predicted to buy MM.

Cross-validation selected a tree size of 2, since that size had the lowest cross-validated deviance (148, versus 152 for sizes 5 and 8). The difference is small, however, and as parts (j) and (k) show, the heavily pruned tree actually fits both the training and test data slightly worse than the unpruned tree, so the main benefit of pruning here is interpretability rather than accuracy.
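To confirm how much the tree was simplified, the number of terminal nodes in each fit can be counted from the tree objects' frames (a small check on the objects defined above):

# Count terminal nodes ("<leaf>" rows in the frame) for each tree
sum(treeModel$frame$var == "<leaf>")   # unpruned tree: 8 leaves (the largest size in cvTree$size)
sum(prunedTree$frame$var == "<leaf>")  # pruned tree: 2 leaves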

#part (j)

# Training error for unpruned tree
trainPredUnpruned <- predict(treeModel, trainData, type = "class")
trainErrorUnpruned <- mean(trainPredUnpruned != trainData$Purchase)

# Training error for pruned tree
trainPredPruned <- predict(prunedTree, trainData, type = "class")
trainErrorPruned <- mean(trainPredPruned != trainData$Purchase)

trainErrorUnpruned
## [1] 0.16375
trainErrorPruned
## [1] 0.18125

Interpretation:

The first value, 0.16375, is the training error rate for the unpruned tree.

The second value, 0.18125, is the training error rate for the pruned tree.

Conclusion:

The unpruned tree has the lower training error rate (0.16375 vs. 0.18125). This is expected: pruning removes splits and reduces the tree's flexibility, so the pruned tree cannot fit the training data as closely.

#part k

# Test error for unpruned tree
testPredUnpruned <- predict(treeModel, testData, type = "class")
testErrorUnpruned <- mean(testPredUnpruned != testData$Purchase)

# Test error for pruned tree
testPredPruned <- predict(prunedTree, testData, type = "class")
testErrorPruned <- mean(testPredPruned != testData$Purchase)

testErrorUnpruned
## [1] 0.1888889
testErrorPruned
## [1] 0.2185185

Interpretation:

The first value, 0.1889, is the test error rate for the unpruned tree.

The second value, 0.2185, is the test error rate for the pruned tree.

Conclusion:

The unpruned tree also has the lower test error rate (0.1889 vs. 0.2185). Pruning down to 2 terminal nodes removed splits (on ListPriceDiff and PctDiscCH) that were genuinely useful for prediction, so in this case the larger tree generalizes better despite being more complex.
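A compact way to close the comparison is to gather the four error rates from parts (j) and (k) into a single table, reusing the objects computed above:

# Side-by-side training and test error rates for the two trees
data.frame(
  Tree = c("Unpruned (8 leaves)", "Pruned (2 leaves)"),
  Train = c(trainErrorUnpruned, trainErrorPruned),
  Test = c(testErrorUnpruned, testErrorPruned)
)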