Part 1, Classification Tree Method (50pts in total)

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

1. Load the caret package and the GermanCredit dataset. (5pts)

library(lattice)
library(ggplot2)
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
GermanCredit$Class <- as.factor(GermanCredit$Class) #make sure `Class` is a factor as SVM require a factor response,now 1 is good and 0 is bad.
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
#load tree model packages
library(rpart)
library(rpart.plot)
#This is the code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Split the dataset into training and test set with 80-20 split. Please use the random seed as 2024 for reproducibility. (5pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
German.train = GermanCredit[index,]
German.test = GermanCredit[-index,]

3. Fit a classification tree model (without extra parameters) using the training set with linear kernel. Please use all variables, but make sure the variable types (especially the response variable Class) are right. (10pts)

library(rpart)
library(rpart.plot)
# fit the model
fit_German_tree <- rpart(as.factor(Class) ~ ., data=German.train)

Your observation: The code fit_German_tree <- rpart(as.factor(Class) ~ ., data=German.train) fits a decision tree model using the rpart function, where Class is converted into a factor to perform classification, and all other variables in the German.train dataset are used as predictors to learn patterns and predict the target class.

4. Visualized the tree: (5pts)

rpart.plot(fit_German_tree,extra=4, yesno=2)

5. Use the training set to get prediected classes. (5pts)

pred_German_train <- predict(fit_German_tree, German.train, type="class")

6. Obtain confusion matrix and MR on training set. (5pts)

Gmatrix_train = table(true = German.train$Class,
                      pred = pred_German_train)
Gmatrix_train
##     pred
## true   0   1
##    0 145  84
##    1  50 521

Your observation: The confusion matrix, Gmatrix_train reveals that the model performs well overall, achieving an accuracy of 83.25%. It demonstrates strong performance in predicting class 1, with a high recall of 91.25%, meaning it correctly identifies most actual instances of class 1, and a precision of 86.12%, indicating that the majority of its class 1 predictions are correct. This imbalance highlights a potential area for improvement, particularly in reducing false positives, to ensure better performance in identifying both classes accurately.

7. Use the testing set to get prediected classes. (5pts)

pred_German_test <- predict(fit_German_tree, German.test, type="class")

8. Obtain confusion matrix and MR on testing set. (5pts)

Gmatrix_test = table(true = German.test$Class,
                      pred = pred_German_test)
Gmatrix_test
##     pred
## true   0   1
##    0  36  35
##    1  26 103

Your observation: The confusion matrix Gmatrix_test shows that the model has moderate overall performance, with a strong recall of 79.84% for class 1 (correctly identifying 103 out of 129 true class 1 instances), but its precision for class 1 is lower at 74.11% due to 35 false positives. The model does indicate potential issues with false positives and false negatives.

9 Obtain the ROC and AUC for testing data (not training). (5pts)

# obtain predicted probability Testing
pred_prob_German_test = predict(fit_German_tree, German.test, type = "prob")

pred_prob_German_test = pred_prob_German_test[,"1"] 

# Looks familar, right?
library(ROCR)
pred <- prediction(pred_prob_German_test, German.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6742548

10. (optional) use cp or other parameters to prune the tree see if you can get a better testing MR and testing AUC.

# We need to define a cost matrix first, don't change 0 there
German_matrix <- matrix(c(0, 1,  # cost of 1 for FP
                        5, 0),  # cost of 5 for FN
                      byrow = TRUE, nrow = 2)
fit_tree_German_2 <- rpart(as.factor(Class) ~ ., data=German.train, 
                       parms = list(loss = German_matrix))

rpart.plot(fit_tree_German_2,extra=4, yesno=2)

# Fit the tree with a maximum depth of 2
depth_German_control_tree_2 <- rpart(Class ~ ., data = German.train, method = "class",
                            control = rpart.control(maxdepth = 3),
                            parms = list(loss = German_matrix)) 
                            # I am using cost_matrix just for demostration, remove it if needed

rpart.plot(depth_German_control_tree_2,extra=4, yesno=2)

# Fit the tree with a maximum depth of 2
depth_German_control_tree <- rpart(Class ~ ., data = German.train,
                            method = "class",cp = 0.003,
                            parms = list(loss = German_matrix)) 
                            # I am using cost_matrix just for demostration,
                            #remove it if needed

rpart.plot(depth_German_control_tree,extra=4, yesno=2)

# Fit a full tree with very low cp value
full_German_tree <- rpart(Class ~ ., data = German.train,
                   method = "class", cp = 0.001)

# Display CP table to identify the optimal cp
printcp(full_German_tree)
## 
## Classification tree:
## rpart(formula = Class ~ ., data = German.train, method = "class", 
##     cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] Age                            Amount                        
##  [3] CheckingAccountStatus.0.to.200 CheckingAccountStatus.lt.0    
##  [5] Duration                       InstallmentRatePercentage     
##  [7] Job.SkilledEmployee            NumberExistingCredits         
##  [9] OtherInstallmentPlans.Bank     Property.RealEstate           
## [11] Purpose.Business               Purpose.NewCar                
## [13] Purpose.Radio.Television       Purpose.UsedCar               
## [15] ResidenceDuration              SavingsAccountBonds.100.to.500
## [17] SavingsAccountBonds.lt.100    
## 
## Root node error: 229/800 = 0.28625
## 
## n= 800 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0305677      0   1.00000 1.00000 0.055828
## 2  0.0240175      4   0.87773 1.06987 0.056931
## 3  0.0196507      8   0.77293 1.03930 0.056465
## 4  0.0131004     10   0.73362 1.02620 0.056257
## 5  0.0109170     19   0.60699 1.00437 0.055901
## 6  0.0087336     21   0.58515 1.00437 0.055901
## 7  0.0065502     22   0.57642 0.98253 0.055532
## 8  0.0043668     24   0.56332 0.99127 0.055681
## 9  0.0029112     27   0.55022 1.05240 0.056667
## 10 0.0021834     30   0.54148 1.05240 0.056667
## 11 0.0010000     32   0.53712 1.06114 0.056800
# Identify optimal CP munually or use the following codes
optimal_German_cp <- full_German_tree$cptable[which.min(
  full_German_tree$cptable[,"xerror"]),"CP"] 

# Prune the tree
pruned_German_tree <- prune(full_German_tree, cp = optimal_German_cp)

rpart.plot(pruned_German_tree,extra=4, yesno=2)

Part 2, Regression Tree Method on mtcar data (50pts in total)

Starter code for mtcars dataset

We will use the built-in mtcars dataset to predict miles per gallon (mpg) using other car characteristics. The dataset includes information about 32 cars from Motor Trend magazine (1973-74).

0. load the data (5pts)

# Load the mtcars dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

1. Split the dataset into training and test set with 85-15 split. Use set.seed(2024) for reproducibility. (5pts)

set.seed(2024)
index <- sample(1:nrow(mtcars),nrow(mtcars)*0.85)
mtcars.train = mtcars[index,]
mtcars.test = mtcars[-index,]

2. Fit a basic regression tree model using the training set with mpg as the response variable. Set method = “anova”. (10pts)

library(rpart)
library(rpart.plot)
# fit the Regression tree
mtcars_tree <- rpart(mpg ~ ., data = mtcars.train)

2. Visualize the tree using rpart.plot. Interpret the splits. (10pts)

rpart.plot(mtcars_tree,extra=1, yesno=2)

summary(mtcars_tree)
## Call:
## rpart(formula = mpg ~ ., data = mtcars.train)
##   n= 27 
## 
##          CP nsplit rel error    xerror      xstd
## 1 0.6121479      0 1.0000000 1.1010927 0.2728919
## 2 0.0100000      1 0.3878521 0.7898291 0.1866011
## 
## Variable importance
##  cyl disp   hp qsec   vs   wt 
##   20   18   18   14   14   14 
## 
## Node number 1: 27 observations,    complexity param=0.6121479
##   mean=20.50741, MSE=37.20809 
##   left son=2 (17 obs) right son=3 (10 obs)
##   Primary splits:
##       cyl  < 5      to the right, improve=0.6121479, (0 missing)
##       hp   < 118    to the right, improve=0.6068166, (0 missing)
##       wt   < 3.325  to the right, improve=0.5916267, (0 missing)
##       disp < 120.55 to the right, improve=0.5838435, (0 missing)
##       vs   < 0.5    to the left,  improve=0.5158466, (0 missing)
##   Surrogate splits:
##       disp < 142.9  to the right, agree=0.963, adj=0.9, (0 split)
##       hp   < 109.5  to the right, agree=0.963, adj=0.9, (0 split)
##       wt   < 2.5425 to the right, agree=0.889, adj=0.7, (0 split)
##       qsec < 18.41  to the left,  agree=0.889, adj=0.7, (0 split)
##       vs   < 0.5    to the left,  agree=0.889, adj=0.7, (0 split)
## 
## Node number 2: 17 observations
##   mean=16.84706, MSE=10.98484 
## 
## Node number 3: 10 observations
##   mean=26.73, MSE=20.2901

Your observation: The model prioritizes cyl, disp, and hp as key predictors of mpg. The primary split based on cyl significantly reduces error, and surrogate splits ensure robust handling of potential missing data. However, the relatively high MSE in both terminal nodes suggests room for improvement, perhaps by adding more splits or exploring alternative models.

3. Make predictions and calculate MSE and R-squared on training set. (10pts)

# Make predictions on the Training sets
mtcars_pred_train <- predict(mtcars_tree, mtcars.train)

# MSE
mse_German_train <- mean((mtcars.train$mpg - mtcars_pred_train)^2)

print(paste0("mse_German_train:",mse_German_train))
## [1] "mse_German_train:14.4312352941176"

Your observation: The mean squared error (MSE) in mse_German_train the value of 14.43 comes from the German.train dataset indicates that, on average, the squared differences between the predicted class probabilities (or predicted values) and the actual class labels are 14.43. This MSE provides a measure of how well the model is performing in terms of prediction accuracy, with lower values indicating better fit and smaller prediction errors. However, since this value alone doesn’t indicate whether the model is performing adequately, it should be interpreted in the context of the scale of the target variable and compared to other models or a baseline to determine its relative effectiveness.

4. Make predictions and calculate MSE and R-squared on testing set. (10pts)

# Make predictions on the Testing sets
mtcars_pred_test <- predict(mtcars_tree, mtcars.test)

# MSE
mse_German_test <- mean((mtcars.test$mpg - mtcars_pred_test)^2)

print(paste0("mse_German_test:",mse_German_test))
## [1] "mse_German_test:2.61964574394463"

Your observation: The mean squared error (MSE) in mse_German_test is 2.62 on the German.test dataset indicates that the model’s predictions deviate, on average, by a squared difference of 2.62 from the actual class labels in the test set. This relatively low MSE suggests that the model generalizes well to unseen data, producing accurate predictions. When compared to the MSE on the training set, this test MSE can provide insight into whether the model is overfitting or underfitting; a significantly lower test MSE compared to the training MSE could indicate improved generalization or potentially lower variance in the test data.

Part 3 (Optional): Please recall the results from previous homework, how do you compare them? Just discuss.

The comparison between the SVM model with a linear kernel and logistic regression highlights the SVM’s superior performance, achieving accuracies of 79.2% on the training set and 74.75% on the test set, alongside AUC scores of 0.8505 and 0.7353, respectively, which indicate strong class discrimination despite some susceptibility to false positives, while logistic regression, though simpler and more interpretable, performs adequately in inearly separable data but may struggle with more complex relationships; in contrast, the decision tree model built using the rpart function on the German.train dataset exhibits stronger classification performance, with an accuracy of 83.25% and high recall (91.25%) and precision (86.12%) for class 1, though it also faces challenges in reducing false positives. Notably, the decision tree demonstrates better generalization, as reflected by a significantly lower test MSE (2.62) compared to the training MSE (14.43), suggesting effective handling of unseen data, while SVM models excel in high-dimensional spaces and can be fine-tuned to address class imbalances, making them more suitable for complex datasets, whereas the decision tree offers transparency and interpretable splits driven by key predictors such as cyl, disp, and hp.