Problem 3
# Load necessary library
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
# Define the range of p1 (ranging from 0 to 1)
p1 <- seq(0, 1, by = 0.01)
# Calculate the Gini index, Classification Error, and Entropy
p2 <- 1 - p1
# Gini index
gini <- 1 - p1^2 - p2^2
# Classification error
classification_error <- 1 - pmax(p1, p2)
# Entropy
entropy <- -p1 * log(p1 + 1e-9) - p2 * log(p2 + 1e-9) # Add small value to avoid log(0)
# Create a data frame
df <- data.frame(p1, gini, classification_error, entropy)
# Create the plot
ggplot(df, aes(x = p1)) +
geom_line(aes(y = gini, color = "Gini Index")) +
geom_line(aes(y = classification_error, color = "Classification Error")) +
geom_line(aes(y = entropy, color = "Entropy")) +
labs(
title = "Gini Index, Classification Error, and Entropy vs. p1",
x = "p1",
y = "Value",
color = "Metric"
) +
theme_minimal()
Explanation:
This script calculates the three metrics for p1 values ranging from 0 to 1.
It then plots the metrics on the same graph using different colors for each measure.
The small constant 1e-9 is added to avoid taking the logarithm of 0 when calculating entropy.
Interpretation of the Plot:
The plot shows how the Gini index, classification error, and entropy vary with p1, the probability of class 1.
Gini Index (blue): highest at p1 = 0.5 and decreasing as p1 moves toward 0 or 1 (more certainty).
Classification Error (red): highest at p1 = 0.5 and decreasing as p1 moves toward 0 or 1.
Entropy (green): highest at p1 = 0.5 and decreasing as p1 moves toward 0 or 1.
In summary, all three metrics reflect higher uncertainty and error at p1 = 0.5 and decrease as p1 approaches 0 or 1.
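As a quick numerical check of this summary, the three impurity measures can be evaluated directly at p1 = 0.5 (a minimal sketch using the same formulas as above; note that the script uses the natural logarithm, so the entropy maximum is log(2) ≈ 0.693 rather than 1):
# Evaluate all three measures at p1 = 0.5, where each reaches its maximum
p <- 0.5
c(
  gini    = 1 - p^2 - (1 - p)^2,                # 0.5
  error   = 1 - max(p, 1 - p),                  # 0.5
  entropy = -p * log(p) - (1 - p) * log(1 - p)  # log(2), about 0.693
)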
Problem 8
#Part a
library(ISLR) # For Carseats data
## Warning: package 'ISLR' was built under R version 4.4.2
library(caret) # For data splitting
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
# Load the data
data(Carseats)
# Split the data into training and test sets
set.seed(42) # For reproducibility
trainIndex <- createDataPartition(Carseats$Sales, p = 0.7, list = FALSE)
trainData <- Carseats[trainIndex, ]
testData <- Carseats[-trainIndex, ]
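As a quick sanity check (a small sketch, not required by the problem), the sizes of the two sets can be inspected; createDataPartition() with p = 0.7 should put roughly 70% of the 400 Carseats rows into the training set:
# Check the split sizes: roughly 280 training and 120 test observations expected
nrow(trainData)
nrow(testData)
nrow(trainData) / nrow(Carseats) # should be close to 0.7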
#part b
# Load necessary library
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
# Fit a regression tree
treeModel <- rpart(Sales ~ ., data = trainData, method = "anova")
# Plot the tree
rpart.plot(treeModel)
# Evaluate the model's test MSE
predictions <- predict(treeModel, testData)
testMSE <- mean((predictions - testData$Sales)^2)
testMSE
## [1] 4.045978
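Since MSE is on a squared scale, it can also help to report the root mean squared error, which is on the same scale as Sales (thousands of units); a one-line sketch:
# RMSE on the Sales scale; with a test MSE of about 4.05 this is roughly 2.0
sqrt(testMSE)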
Interpretation:
The regression tree predicts Sales from predictors such as ShelveLoc, Price, Age, Income, and CompPrice.
The root node contains 100% of the training data and has a mean Sales of about 7.6; the first split is on ShelveLoc (Bad or Medium versus Good).
For ShelveLoc = Bad or Medium, the next split is on Price:
If Price >= 126, predicted Sales is about 5.3.
If Price < 126, predicted Sales is about 7.3.
Further splits refine these predictions using Age, Income, and CompPrice:
For Price >= 126: Age >= 68 gives predicted Sales of about 5.7, and Age < 68 about 5.3.
For Price < 126: ShelveLoc = Bad gives about 7.8, otherwise about 8.1; within the Bad branch, Income < 99 gives about 6.1 and Income >= 99 about 7.4.
The terminal nodes show the predicted Sales values, and the percentage printed in each node is the share of training observations that reach that node.
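To check the split points and node means read off the plot against exact values, the fitted rpart object can also be inspected in text form (a quick sketch using standard rpart helpers):
# Text representation of the tree: split rules, node sizes, and node means
print(treeModel)
# Complexity-parameter table for the default fit
printcp(treeModel)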
#part c
# Use cross-validation to determine optimal cp
cvModel <- rpart(Sales ~ ., data = trainData, method = "anova", cp = 0.001)
# Print the complexity parameter table
cvModel$cptable
## CP nsplit rel error xerror xstd
## 1 0.275001635 0 1.0000000 1.0118327 0.08319897
## 2 0.103377583 1 0.7249984 0.7402417 0.06003469
## 3 0.050272503 2 0.6216208 0.6905991 0.05707541
## 4 0.050228149 3 0.5713483 0.6830760 0.05703691
## 5 0.034055693 4 0.5211201 0.6191846 0.05393866
## 6 0.030911933 5 0.4870644 0.5945931 0.05157623
## 7 0.023292270 7 0.4252406 0.6082236 0.05238928
## 8 0.023284671 8 0.4019483 0.6114730 0.05218612
## 9 0.018885972 9 0.3786636 0.6091390 0.05119559
## 10 0.013323733 10 0.3597777 0.5909360 0.04973932
## 11 0.013317103 11 0.3464539 0.6052161 0.04840655
## 12 0.013275781 12 0.3331368 0.6052161 0.04840655
## 13 0.012766378 13 0.3198610 0.6058079 0.04845679
## 14 0.011849880 14 0.3070947 0.6105308 0.04893088
## 15 0.010260964 15 0.2952448 0.5941053 0.04658594
## 16 0.009280998 16 0.2849838 0.5863806 0.04541625
## 17 0.008880389 17 0.2757028 0.5841165 0.04532165
## 18 0.007204720 18 0.2668224 0.5903610 0.04535992
## 19 0.007092760 19 0.2596177 0.5901738 0.04537146
## 20 0.003228777 20 0.2525250 0.5840901 0.04390220
## 21 0.002745262 21 0.2492962 0.5778866 0.04314708
## 22 0.002143251 22 0.2465509 0.5742499 0.04311011
## 23 0.001000000 23 0.2444077 0.5736309 0.04310261
# Prune the tree to the optimal cp
prunedTree <- prune(cvModel, cp = cvModel$cptable[which.min(cvModel$cptable[, 4]), 1])
# Plot the pruned tree
rpart.plot(prunedTree)
# Evaluate the pruned tree's test MSE
predictionsPruned <- predict(prunedTree, testData)
testMSEPruned <- mean((predictionsPruned - testData$Sales)^2)
testMSEPruned
## [1] 4.064851
Interpretation:
Here which.min() selects the row of the cp table with the smallest cross-validated error, which is the last row (xerror = 0.574 at cp = 0.001 with 23 splits). Pruning at that cp therefore keeps the full 23-split tree, which is larger, not smaller, than the default tree grown in part (b) with rpart's default cp = 0.01.
The structure near the top is the same as before: the root node predicts a mean Sales of about 7.6 with 100% of the data, the first split is on ShelveLoc, and the Bad/Medium branch (about 76% of the observations) predicts about 6.7, with further splits on Price and the other predictors.
Consistent with this, the test MSE of 4.065 is essentially the same as (in fact slightly worse than) the 4.046 from part (b), so the extra depth did not improve test performance. A common way to obtain a genuinely simpler tree is the 1-SE rule, sketched below.
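The following minimal sketch applies the 1-SE rule to the cp table printed above: pick the simplest tree whose cross-validated error (xerror) is within one standard error (xstd) of the minimum. Given that table, this selects a tree with about 5 splits; the resulting test MSE is not part of the output shown above.
# 1-SE rule: smallest tree whose xerror is within one standard error of the minimum
cpTab <- cvModel$cptable
minRow <- which.min(cpTab[, "xerror"])
threshold <- cpTab[minRow, "xerror"] + cpTab[minRow, "xstd"]
oneSeRow <- min(which(cpTab[, "xerror"] <= threshold))
prunedTree1SE <- prune(cvModel, cp = cpTab[oneSeRow, "CP"])
# Test MSE of the 1-SE tree (about 5 splits given the table above)
mean((predict(prunedTree1SE, testData) - testData$Sales)^2)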
#part d
# Load necessary library
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Fit a bagging model (bootstrap aggregation)
baggingModel <- randomForest(Sales ~ ., data = trainData, mtry = ncol(trainData) - 1, importance = TRUE)
# Evaluate test MSE
predictionsBagging <- predict(baggingModel, testData)
testMSEBagging <- mean((predictionsBagging - testData$Sales)^2)
testMSEBagging
## [1] 2.012568
# Variable importance
importance(baggingModel)
## %IncMSE IncNodePurity
## CompPrice 32.4402753 248.673983
## Income 9.2519199 118.662022
## Advertising 22.8843676 171.377536
## Population -1.2023248 70.140769
## Price 63.6666533 644.798082
## ShelveLoc 68.5582028 782.353083
## Age 19.4472378 177.672465
## Education 2.6318832 54.679605
## Urban 0.7181827 10.289478
## US 2.9375294 9.974131
Interpretation:
The test MSE for the bagging model is about 2.01, roughly half the test MSE of the single regression tree from part (b) (about 4.05).
Based on the importance() output, ShelveLoc and Price are by far the most important variables (%IncMSE of about 68.6 and 63.7, and the largest IncNodePurity), followed by CompPrice, Advertising, and Age; Population, Education, Urban, and US contribute little.
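For a visual ranking of these importance scores, the randomForest package provides varImpPlot(); a one-line sketch:
# Dotchart of %IncMSE and IncNodePurity for the bagged model
varImpPlot(baggingModel, main = "Variable importance (bagging)")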
#part e
# Fit a random forest model
rfModel <- randomForest(Sales ~ ., data = trainData, mtry = 5, importance = TRUE)
# Evaluate test MSE
predictionsRF <- predict(rfModel, testData)
testMSERF <- mean((predictionsRF - testData$Sales)^2)
testMSERF
## [1] 2.160716
# Explore effect of m
mtryValues <- c(2, 5, 7, 10)
mseValues <- sapply(mtryValues, function(m) {
rfModel <- randomForest(Sales ~ ., data = trainData, mtry = m, importance = TRUE)
predictions <- predict(rfModel, testData)
mean((predictions - testData$Sales)^2)
})
# Plot the MSE values against m
plot(mtryValues, mseValues, type = "b", xlab = "m (Number of Variables per Split)", ylab = "Test MSE")
Interpretation:
The test MSE for the random forest with m = 5 is about 2.16, close to (here slightly above) the bagging result of about 2.01.
Across the values tried (m = 2, 5, 7, 10), the test MSE typically drops as m increases from 2 and then levels off or fluctuates for larger m; because each call refits the forest, the exact curve varies from run to run, so the plot should be read for the general trend rather than a single "optimal" m.
importance(rfModel) would rank the most influential variables for the random forest, but its output is not shown here.
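Instead of re-using the test set for every value of m, the out-of-bag (OOB) error can be used to compare values of m without touching the test data. A sketch under the assumption that trainData is the Carseats training set defined above (results vary from run to run):
# OOB MSE for m = 1, ..., 10; fit$mse is the running OOB MSE, so its last
# element is the OOB MSE of the full 500-tree forest
oobMSE <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = trainData, mtry = m, ntree = 500)
  tail(fit$mse, 1)
})
plot(1:10, oobMSE, type = "b", xlab = "m (variables per split)", ylab = "OOB MSE")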
Problem 9
#part a
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
# Load the data
data(OJ)
# Set seed for reproducibility
set.seed(42)
# Split the data: 800 observations for training, rest for testing
trainIndex <- sample(1:nrow(OJ), 800)
trainData <- OJ[trainIndex, ]
testData <- OJ[-trainIndex, ]
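As a quick check (a small sketch, not required by the problem), the class balance of Purchase in the training and test sets can be inspected before fitting anything:
# Class balance of the response in the training and test sets
table(trainData$Purchase)
prop.table(table(trainData$Purchase))
prop.table(table(testData$Purchase))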
#part b
library(rpart)
# Fit a classification tree with Purchase as the response variable
treeModel <- rpart(Purchase ~ ., data = trainData, method = "class")
# Print the summary of the tree
summary(treeModel)
## Call:
## rpart(formula = Purchase ~ ., data = trainData, method = "class")
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.52922078 0 1.0000000 1.0000000 0.04468505
## 2 0.01515152 1 0.4707792 0.4870130 0.03584319
## 3 0.01000000 4 0.4253247 0.4902597 0.03593479
##
## Variable importance
## LoyalCH PriceDiff ListPriceDiff PriceMM SalePriceMM
## 70 6 5 4 3
## PriceCH PctDiscCH SalePriceCH DiscCH DiscMM
## 3 2 2 2 2
## PctDiscMM WeekofPurchase
## 2 1
##
## Node number 1: 800 observations, complexity param=0.5292208
## predicted class=CH expected loss=0.385 P(node) =1
## class counts: 492 308
## probabilities: 0.615 0.385
## left son=2 (515 obs) right son=3 (285 obs)
## Primary splits:
## LoyalCH < 0.48285 to the right, improve=142.35420, (0 missing)
## StoreID < 3.5 to the right, improve= 38.92912, (0 missing)
## PriceDiff < 0.015 to the right, improve= 21.29666, (0 missing)
## Store7 splits as RL, improve= 20.47828, (0 missing)
## STORE < 0.5 to the left, improve= 20.47828, (0 missing)
## Surrogate splits:
## DiscMM < 0.57 to the left, agree=0.652, adj=0.025, (0 split)
## PctDiscMM < 0.264375 to the left, agree=0.652, adj=0.025, (0 split)
## PriceMM < 1.89 to the right, agree=0.651, adj=0.021, (0 split)
## SalePriceMM < 1.385 to the right, agree=0.651, adj=0.021, (0 split)
## PriceDiff < -0.575 to the right, agree=0.651, adj=0.021, (0 split)
##
## Node number 2: 515 observations, complexity param=0.01515152
## predicted class=CH expected loss=0.1631068 P(node) =0.64375
## class counts: 431 84
## probabilities: 0.837 0.163
## left son=4 (313 obs) right son=5 (202 obs)
## Primary splits:
## LoyalCH < 0.705699 to the right, improve=16.736450, (0 missing)
## PriceDiff < 0.015 to the right, improve=10.207720, (0 missing)
## SalePriceMM < 2.125 to the right, improve= 8.685844, (0 missing)
## ListPriceDiff < 0.255 to the right, improve= 7.226422, (0 missing)
## PriceMM < 2.04 to the right, improve= 5.060421, (0 missing)
## Surrogate splits:
## WeekofPurchase < 237.5 to the right, agree=0.674, adj=0.168, (0 split)
## PriceCH < 1.755 to the right, agree=0.662, adj=0.139, (0 split)
## PriceMM < 2.04 to the right, agree=0.662, adj=0.139, (0 split)
## SalePriceMM < 1.64 to the right, agree=0.629, adj=0.054, (0 split)
## SalePriceCH < 1.775 to the right, agree=0.621, adj=0.035, (0 split)
##
## Node number 3: 285 observations
## predicted class=MM expected loss=0.2140351 P(node) =0.35625
## class counts: 61 224
## probabilities: 0.214 0.786
##
## Node number 4: 313 observations
## predicted class=CH expected loss=0.06070288 P(node) =0.39125
## class counts: 294 19
## probabilities: 0.939 0.061
##
## Node number 5: 202 observations, complexity param=0.01515152
## predicted class=CH expected loss=0.3217822 P(node) =0.2525
## class counts: 137 65
## probabilities: 0.678 0.322
## left son=10 (120 obs) right son=11 (82 obs)
## Primary splits:
## ListPriceDiff < 0.235 to the right, improve=11.332540, (0 missing)
## PriceDiff < 0.265 to the right, improve=10.907380, (0 missing)
## SalePriceMM < 2.125 to the right, improve= 8.724531, (0 missing)
## PriceMM < 2.11 to the right, improve= 5.157551, (0 missing)
## DiscCH < 0.115 to the right, improve= 4.328572, (0 missing)
## Surrogate splits:
## PriceDiff < 0.235 to the right, agree=0.787, adj=0.476, (0 split)
## PriceMM < 1.89 to the right, agree=0.713, adj=0.293, (0 split)
## PriceCH < 1.875 to the left, agree=0.688, adj=0.232, (0 split)
## SalePriceMM < 2.105 to the right, agree=0.688, adj=0.232, (0 split)
## SalePriceCH < 1.875 to the left, agree=0.688, adj=0.232, (0 split)
##
## Node number 10: 120 observations
## predicted class=CH expected loss=0.1833333 P(node) =0.15
## class counts: 98 22
## probabilities: 0.817 0.183
##
## Node number 11: 82 observations, complexity param=0.01515152
## predicted class=MM expected loss=0.4756098 P(node) =0.1025
## class counts: 39 43
## probabilities: 0.476 0.524
## left son=22 (12 obs) right son=23 (70 obs)
## Primary splits:
## PctDiscCH < 0.052007 to the right, improve=5.469106, (0 missing)
## DiscCH < 0.115 to the right, improve=4.875412, (0 missing)
## PriceDiff < 0.215 to the right, improve=3.597677, (0 missing)
## PctDiscMM < 0.1961965 to the left, improve=2.305800, (0 missing)
## DiscMM < 0.25 to the left, improve=2.035928, (0 missing)
## Surrogate splits:
## PriceDiff < 0.265 to the right, agree=0.963, adj=0.750, (0 split)
## DiscCH < 0.115 to the right, agree=0.951, adj=0.667, (0 split)
## PriceCH < 2.075 to the right, agree=0.878, adj=0.167, (0 split)
## SalePriceCH < 1.54 to the left, agree=0.878, adj=0.167, (0 split)
##
## Node number 22: 12 observations
## predicted class=CH expected loss=0.08333333 P(node) =0.015
## class counts: 11 1
## probabilities: 0.917 0.083
##
## Node number 23: 70 observations
## predicted class=MM expected loss=0.4 P(node) =0.0875
## class counts: 28 42
## probabilities: 0.400 0.600
# Training error rate (misclassification rate)
trainPredictions <- predict(treeModel, trainData, type = "class")
trainErrorRate <- mean(trainPredictions != trainData$Purchase)
trainErrorRate
## [1] 0.16375
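For context, this can be compared with the error rate of a trivial classifier that always predicts the majority class (CH); from the class counts in the summary above (492 CH, 308 MM), that baseline error is 308/800 = 0.385, so the tree roughly halves the training error. A one-line check:
# Baseline training error: always predict CH (the majority class)
mean(trainData$Purchase != "CH")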
#part c
# Print the tree structure
treeModel
## n= 800
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 308 CH (0.61500000 0.38500000)
## 2) LoyalCH>=0.48285 515 84 CH (0.83689320 0.16310680)
## 4) LoyalCH>=0.705699 313 19 CH (0.93929712 0.06070288) *
## 5) LoyalCH< 0.705699 202 65 CH (0.67821782 0.32178218)
## 10) ListPriceDiff>=0.235 120 22 CH (0.81666667 0.18333333) *
## 11) ListPriceDiff< 0.235 82 39 MM (0.47560976 0.52439024)
## 22) PctDiscCH>=0.052007 12 1 CH (0.91666667 0.08333333) *
## 23) PctDiscCH< 0.052007 70 28 MM (0.40000000 0.60000000) *
## 3) LoyalCH< 0.48285 285 61 MM (0.21403509 0.78596491) *
Interpretation:
The output is the text representation of the classification tree fitted to the OJ training data: each row is a node, and terminal nodes are marked with an asterisk (*).
Example: node 3, the terminal node "LoyalCH < 0.48285 285 61 MM (0.214 0.786)":
LoyalCH < 0.48285: this node contains the observations whose LoyalCH value is below 0.48285.
285: the node contains 285 observations.
61: 61 of those observations do not belong to the node's predicted class (they actually bought CH).
MM (0.214 0.786): the predicted class is MM (Minute Maid), with class probabilities of about 0.214 for CH and 0.786 for MM.
Conclusion:
Terminal node 3 indicates that when loyalty to Citrus Hill (LoyalCH) is below 0.48285, about 78.6% of customers are predicted to purchase Minute Maid (MM) and only about 21.4% Citrus Hill (CH).
#part d
library(rpart.plot)
# Plot the tree
rpart.plot(treeModel)
Interpretation:
The plotted tree splits first on LoyalCH (the dominant variable) and then on ListPriceDiff and PctDiscCH to predict Purchase; customers with low LoyalCH are predicted to buy MM, while highly loyal customers are predicted to buy CH.
#part e
# Make predictions on the test data
testPredictions <- predict(treeModel, testData, type = "class")
# Create a confusion matrix
confusionMatrix <- table(Predicted = testPredictions, Actual = testData$Purchase)
confusionMatrix
## Actual
## Predicted CH MM
## CH 122 13
## MM 39 96
# Calculate the test error rate (misclassification rate)
testErrorRate <- mean(testPredictions != testData$Purchase)
testErrorRate
## [1] 0.1925926
From this confusion matrix, treating CH as the positive class:
Correctly predicted CH (true positives) = 122
MM predicted as CH (false positives) = 13
CH predicted as MM (false negatives) = 39
Correctly predicted MM (true negatives) = 96
The test error rate is the proportion of misclassified observations:
Test Error Rate = (False Positives + False Negatives) / Total Observations
Substituting the values:
Test Error Rate = (13 + 39) / (122 + 13 + 39 + 96) = 52 / 270 ≈ 0.1926
So the test error rate is approximately 0.1926, or about 19.3%, matching the value computed above.
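The caret package loaded earlier also provides a confusionMatrix() helper that reports accuracy, sensitivity, specificity, and related statistics in one call; a short sketch, treating CH as the positive class:
# Accuracy, sensitivity, specificity, etc. for the test-set predictions
confusionMatrix(testPredictions, testData$Purchase, positive = "CH")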
# part f
library(tree)
## Warning: package 'tree' was built under R version 4.4.3
# Fit a classification tree to the training data
treeModel <- tree(Purchase ~ ., data = trainData)
# Apply cross-validation
cvTree <- cv.tree(treeModel, FUN = prune.misclass)
# Print the result
cvTree
## $size
## [1] 8 5 2 1
##
## $dev
## [1] 152 152 148 308
##
## $k
## [1] -Inf 0.000000 4.666667 163.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
Interpretation:
The output shows the result of applying cv.tree() to the training fit to determine the optimal tree size.
The size vector [8 5 2 1] gives the number of terminal nodes of each candidate tree in the pruning sequence.
The dev vector [152 152 148 308] gives the cross-validated number of misclassifications for each size (because FUN = prune.misclass was used).
The k values [-Inf 0 4.67 163] are the cost-complexity parameters at which each tree size is obtained; larger k penalizes larger trees more heavily.
The method is "misclass", i.e., misclassification error.
The lowest cross-validated error (148) occurs at size 2, so a tree with 2 terminal nodes gives the best cross-validated classification error; the larger trees of sizes 5 and 8 do slightly worse (152).
#part g
# Plot tree size vs. cross-validated error rate
plot(cvTree$size, cvTree$dev, type = "b", xlab = "Tree Size", ylab = "Cross-validated Classification Error Rate")
Interpretation:
The plot shows that the cross-validated error drops sharply from tree size 1 to size 2 (from 308 to 148 misclassifications) and then stays essentially flat, at 152 for sizes 5 and 8. The optimal tree size is therefore 2, since larger trees do not improve the cross-validated error.
#part h
# Find the tree size with the lowest error rate
optimalTreeSize <- cvTree$size[which.min(cvTree$dev)]
optimalTreeSize
## [1] 2
#part (i)
# Prune the tree to the optimal size
prunedTree <- prune.tree(treeModel, best = optimalTreeSize)
# Plot the pruned tree
plot(prunedTree)
text(prunedTree, pretty = 0)
Interpretation:
Because the tree was pruned to 2 terminal nodes, it contains a single split, on LoyalCH (at about 0.5036 in this fit):
If LoyalCH is below the cutpoint, the prediction is MM.
If LoyalCH is at or above the cutpoint, the prediction is CH.
The additional splits on PriceDiff and the other variables that appear in the unpruned tree have been removed.
In other words, cross-validation suggested pruning the 8-leaf tree all the way down to 2 leaves: the larger trees did not achieve a lower cross-validated misclassification count (152 at sizes 5 and 8 versus 148 at size 2), so brand loyalty (LoyalCH) alone carries most of the predictive power.
#part (j)
# Training error for unpruned tree
trainPredUnpruned <- predict(treeModel, trainData, type = "class")
trainErrorUnpruned <- mean(trainPredUnpruned != trainData$Purchase)
# Training error for pruned tree
trainPredPruned <- predict(prunedTree, trainData, type = "class")
trainErrorPruned <- mean(trainPredPruned != trainData$Purchase)
trainErrorUnpruned
## [1] 0.16375
trainErrorPruned
## [1] 0.18125
Interpretation:
The first value, 0.16375, is the training error rate of the unpruned tree; the second value, 0.18125, is the training error rate of the pruned tree.
Conclusion:
The pruned tree has a higher training error rate (0.18125) than the unpruned tree (0.16375). This is expected: pruning removes splits that were fit to the training data, so training error can only stay the same or increase. Whether pruning helps generalization is assessed with the test error in part (k).
#part k
# Test error for unpruned tree
testPredUnpruned <- predict(treeModel, testData, type = "class")
testErrorUnpruned <- mean(testPredUnpruned != testData$Purchase)
# Test error for pruned tree
testPredPruned <- predict(prunedTree, testData, type = "class")
testErrorPruned <- mean(testPredPruned != testData$Purchase)
testErrorUnpruned
## [1] 0.1888889
testErrorPruned
## [1] 0.2185185
Interpretation:
The first value, 0.1889, is the test error rate of the unpruned tree; the second value, 0.2185, is the test error rate of the pruned tree.
Conclusion:
In this run the unpruned tree has the lower test error rate (0.1889 versus 0.2185), so pruning all the way down to 2 terminal nodes was too aggressive for this particular train/test split, even though cross-validation on the training set favored the smaller tree. The sketch below collects the four error rates for comparison.
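A small sketch, using only objects computed above, that puts the four error rates side by side:
# Side-by-side comparison of training and test error rates
data.frame(
  tree = c("unpruned (8 leaves)", "pruned (2 leaves)"),
  train_error = c(trainErrorUnpruned, trainErrorPruned),
  test_error = c(testErrorUnpruned, testErrorPruned)
)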