Part 1, Classification Tree Method (50pts in total)

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for the variable descriptions. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data is provided in that package.

install.packages('caret')

1. Load the caret package and the GermanCredit dataset. (5pts)

library(caret) #this package contains the German credit data in its numeric (dummy-coded) format
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # classification trees require a factor response; now 1 is good and 0 is bad
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
#load tree model packages
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
#Drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
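
As an optional aside (not part of the assignment), caret can flag low-information columns programmatically; note that nearZeroVar()'s default thresholds may select a somewhat different set than the hand-picked indices above, so treat this as a sketch of the idea rather than a drop-in replacement.

# Sketch: list columns caret considers near-zero-variance; the defaults
# (freqCut = 95/5, uniqueCut = 10) may flag more or fewer than the list above
nzv <- nearZeroVar(GermanCredit)
names(GermanCredit)[nzv]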

2. Split the dataset into training and test sets with an 80/20 split. Please use random seed 2024 for reproducibility. (5pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
german.train = GermanCredit[index,]
german.test = GermanCredit[-index,]
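
As a quick sanity check (optional, not part of the question), the class proportions in the two splits should roughly match the full data, since sample() draws rows at random:

# share of bad (0) vs. good (1) credit in each split
prop.table(table(german.train$Class))
prop.table(table(german.test$Class))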

3. Fit a classification tree model (without extra parameters) using the training set. Please use all variables, but make sure the variable types (especially the response variable Class) are correct. (10pts)

german_tree <- rpart(as.factor(Class) ~ ., data = german.train)
german_tree
## n= 800 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 800 229 1 (0.2862500 0.7137500)  
##     2) CheckingAccountStatus.lt.0>=0.5 211 101 1 (0.4786730 0.5213270)  
##       4) Duration>=11.5 178  82 0 (0.5393258 0.4606742)  
##         8) Duration>=31.5 38  10 0 (0.7368421 0.2631579) *
##         9) Duration< 31.5 140  68 1 (0.4857143 0.5142857)  
##          18) Purpose.UsedCar< 0.5 129  61 0 (0.5271318 0.4728682)  
##            36) Purpose.Business< 0.5 121  54 0 (0.5537190 0.4462810)  
##              72) InstallmentRatePercentage>=2.5 82  31 0 (0.6219512 0.3780488)  
##               144) Amount< 1577.5 40  11 0 (0.7250000 0.2750000) *
##               145) Amount>=1577.5 42  20 0 (0.5238095 0.4761905)  
##                 290) Amount>=2135.5 30  11 0 (0.6333333 0.3666667)  
##                   580) SavingsAccountBonds.lt.100>=0.5 22   5 0 (0.7727273 0.2272727) *
##                   581) SavingsAccountBonds.lt.100< 0.5 8   2 1 (0.2500000 0.7500000) *
##                 291) Amount< 2135.5 12   3 1 (0.2500000 0.7500000) *
##              73) InstallmentRatePercentage< 2.5 39  16 1 (0.4102564 0.5897436)  
##               146) Duration>=15.5 26  12 0 (0.5384615 0.4615385)  
##                 292) Amount< 3506.5 12   3 0 (0.7500000 0.2500000) *
##                 293) Amount>=3506.5 14   5 1 (0.3571429 0.6428571) *
##               147) Duration< 15.5 13   2 1 (0.1538462 0.8461538) *
##            37) Purpose.Business>=0.5 8   1 1 (0.1250000 0.8750000) *
##          19) Purpose.UsedCar>=0.5 11   0 1 (0.0000000 1.0000000) *
##       5) Duration< 11.5 33   5 1 (0.1515152 0.8484848) *
##     3) CheckingAccountStatus.lt.0< 0.5 589 128 1 (0.2173175 0.7826825)  
##       6) CheckingAccountStatus.0.to.200>=0.5 210  83 1 (0.3952381 0.6047619)  
##        12) Amount>=9908.5 16   1 0 (0.9375000 0.0625000) *
##        13) Amount< 9908.5 194  68 1 (0.3505155 0.6494845)  
##          26) Property.RealEstate< 0.5 136  57 1 (0.4191176 0.5808824)  
##            52) Age< 25.5 31  11 0 (0.6451613 0.3548387) *
##            53) Age>=25.5 105  37 1 (0.3523810 0.6476190)  
##             106) SavingsAccountBonds.lt.100>=0.5 52  24 1 (0.4615385 0.5384615)  
##               212) NumberExistingCredits< 1.5 32  13 0 (0.5937500 0.4062500)  
##                 424) Age>=32.5 22   6 0 (0.7272727 0.2727273) *
##                 425) Age< 32.5 10   3 1 (0.3000000 0.7000000) *
##               213) NumberExistingCredits>=1.5 20   5 1 (0.2500000 0.7500000) *
##             107) SavingsAccountBonds.lt.100< 0.5 53  13 1 (0.2452830 0.7547170)  
##               214) SavingsAccountBonds.100.to.500>=0.5 24  10 1 (0.4166667 0.5833333)  
##                 428) Job.SkilledEmployee< 0.5 7   1 0 (0.8571429 0.1428571) *
##                 429) Job.SkilledEmployee>=0.5 17   4 1 (0.2352941 0.7647059) *
##               215) SavingsAccountBonds.100.to.500< 0.5 29   3 1 (0.1034483 0.8965517) *
##          27) Property.RealEstate>=0.5 58  11 1 (0.1896552 0.8103448)  
##            54) Duration>=22 7   2 0 (0.7142857 0.2857143) *
##            55) Duration< 22 51   6 1 (0.1176471 0.8823529) *
##       7) CheckingAccountStatus.0.to.200< 0.5 379  45 1 (0.1187335 0.8812665) *

Your observation: Since this is a classification model, the response Class must be a factor (hence the as.factor call); the tree is then fit on all predictors in the training data. The printed output lists every node of the tree: internal decision nodes with their splitting rules, and terminal leaf nodes marked with *.
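
As an optional companion to the node printout above, rpart's printcp() summarizes the tree's complexity-parameter table, showing how the relative and cross-validated errors change as splits are added:

# CP table: one row per candidate tree size, with cross-validated error (xerror)
printcp(german_tree)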

4. Visualize the tree: (5pts)

rpart.plot(german_tree, extra = 1)

5. Use the training set to get predicted classes. (5pts)

german_pred_train <- predict(german_tree, german.train, type = "class")

6. Obtain confusion matrix and MR on training set. (5pts)

#confusion matrix for training
Cmatrix_train = table(true = german.train$Class,
                      pred = german_pred_train)
Cmatrix_train
##     pred
## true   0   1
##    0 145  84
##    1  50 521
#MR
german_MR_train <- 1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
german_MR_train
## [1] 0.1675

Your observation: On the training set, most predictions fall on the diagonal of the confusion matrix, and the model is especially good at identifying borrowers with good credit (521 true positives). However, there are more false positives (84) than false negatives (50), and the false positive is the costlier error here: predicting good credit for a truly bad borrower means extending a loan that may default, whereas a false negative only means turning away a good borrower. The misclassification rate (MR) is 16.75%, which is reasonably low for a single unpruned tree.
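
Because false positives are the costlier error, a cost-weighted error rate can be more informative than the plain MR. The 5:1 cost ratio below is an assumption (it echoes the cost matrix suggested on the UCI page for this dataset); this is an optional sketch, not part of the assignment:

# Assumed costs: predicting good for a truly bad borrower (FP) costs 5,
# rejecting a truly good borrower (FN) costs 1
cost_fp <- 5
cost_fn <- 1
fp <- Cmatrix_train["0", "1"] # truly bad, predicted good
fn <- Cmatrix_train["1", "0"] # truly good, predicted bad
(cost_fp * fp + cost_fn * fn) / sum(Cmatrix_train) # average cost per applicant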

7. Use the testing set to get predicted classes. (5pts)

german_pred_test <- predict(german_tree, german.test, type = "class")

8. Obtain confusion matrix and MR on testing set. (5pts)

#confusion matrix for testing
Cmatrix_test = table(true = german.test$Class,
                      pred = german_pred_test)
Cmatrix_test
##     pred
## true   0   1
##    0  36  35
##    1  26 103
#MR
german_MR_test <- 1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
german_MR_test
## [1] 0.305

Your observation: The test confusion matrix has the same structure as the training one, but performance degrades: the MR rises from 16.75% on training to 30.5% on testing. Some increase is expected when moving to unseen data, but a jump this large suggests the tree is overfitting the training set. One standard remedy is cost-complexity pruning, sketched below.
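
A minimal pruning sketch, using the cptable that rpart stores on the fitted object: pick the CP value with the smallest cross-validated error and prune back to it.

plotcp(german_tree) # cross-validated error versus complexity parameter
best_cp <- german_tree$cptable[which.min(german_tree$cptable[, "xerror"]), "CP"]
german_tree_pruned <- prune(german_tree, cp = best_cp)
rpart.plot(german_tree_pruned, extra = 1) # typically a smaller, simpler tree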

9. Obtain the ROC and AUC for the testing data (not training). (5pts)

# obtain predicted probability
pred_prob_test = predict(german_tree, german.test, type = "prob")
pred_prob_test = pred_prob_test[,"1"] 
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.3.3
#ROC
pred <- prediction(pred_prob_test, german.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6742548

Your observation: The ROC curve rises only gradually above the diagonal, and the AUC of about 0.674, while better than a random guess (AUC = 0.5), still indicates a fairly weak classifier on the test data.
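
For comparison (optional, not asked for by the question), the same ROCR steps applied to the training set indicate how much of this weakness is overfitting; the training AUC would be expected to come out higher than the 0.674 above.

pred_prob_train <- predict(german_tree, german.train, type = "prob")[, "1"]
pred_train_rocr <- prediction(pred_prob_train, german.train$Class)
unlist(slot(performance(pred_train_rocr, "auc"), "y.values")) # training AUC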

Part 2, Regression Tree Method on mtcar data (50pts in total)

Starter code for mtcars dataset

We will use the built-in mtcars dataset to predict miles per gallon (mpg) using other car characteristics. The dataset includes information about 32 cars from Motor Trend magazine (1973-74).

0. load the data (5pts)

# Load the mtcars dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

1. Split the dataset into training and test sets with an 85/15 split. Use set.seed(2024) for reproducibility. (5pts)

set.seed(2024)
index <- sample(1:nrow(mtcars),nrow(mtcars)*0.85)
mtcars.train = mtcars[index,]
mtcars.test = mtcars[-index,]

2. Fit a basic regression tree model using the training set with mpg as the response variable. Set method = "anova". (10pts)

cars_tree <- rpart(mpg~ ., data=mtcars.train, method = "anova")
cars_tree
## n= 27 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 27 1004.6190 20.50741  
##   2) cyl>=5 17  186.7424 16.84706 *
##   3) cyl< 5 10  202.9010 26.73000 *

3. Visualize the tree using rpart.plot. Interpret the splits. (10pts)

rpart.plot(cars_tree, extra = 1)

Your observation: The tree has a single decision node (the root), which splits on cyl at 5, and only two leaf nodes: the 17 cars with 5 or more cylinders average about 16.8 mpg, while the 10 cars with fewer than 5 cylinders average about 26.7 mpg. The training set has 27 observations in total.
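
If a deeper tree were wanted, rpart's stopping rules can be loosened; the minsplit and cp values below are illustrative assumptions, and with only 27 training cars a deeper tree risks overfitting. A sketch:

# Loosen the defaults (minsplit = 20, cp = 0.01) so smaller nodes may split
cars_tree_deep <- rpart(mpg ~ ., data = mtcars.train, method = "anova",
                        control = rpart.control(minsplit = 5, cp = 0.005))
rpart.plot(cars_tree_deep, extra = 1)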

4. Make predictions and calculate MSE and R-squared on the training set. (10pts)

cars_pred_train <- predict(cars_tree, mtcars.train)

# MSE
cars_train <- mean((mtcars.train$mpg - cars_pred_train)^2)
cars_train
## [1] 14.43124

Your observation: MSE measures how close the model's predictions are to the actual values; lower is better. A training MSE of 14.43 (in squared mpg units) is moderate for such a shallow tree, higher than we would like but not unreasonable.
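
Because MSE is in squared units, the RMSE is often easier to interpret: it puts the error back on the mpg scale.

sqrt(cars_train) # RMSE: roughly 3.8 mpg of typical prediction error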

# Calculate SST (Total Sum of Squares)
sst <- sum((mtcars.train$mpg - mean(mtcars.train$mpg))^2)

# Calculate SSR (Residual Sum of Squares)
ssr <- sum((mtcars.train$mpg - cars_pred_train)^2)

# Calculate R-squared
r_squared <- 1 - (ssr / sst)
r_squared
## [1] 0.6121479

Your observation: R-squared tells you how much of the variation in the response the model explains; higher is generally better. Here R^2 is about 0.61, so the single split on cyl explains roughly 61% of the variation in mpg, leaving noticeable room for improvement.
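
Since the same SST/SSR computation is needed again for the test set, a small helper keeps it in one place; the function name r2 is just an illustrative choice.

# R^2 = 1 - SSR/SST, identical to the step-by-step computation above
r2 <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
r2(mtcars.train$mpg, cars_pred_train) # should reproduce 0.6121479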

5. Make predictions and calculate MSE and R-squared on the testing set. (10pts)

cars_pred_test <- predict(cars_tree, mtcars.test)

# MSE
cars_test <- mean((mtcars.test$mpg - cars_pred_test)^2)
cars_test
## [1] 2.619646

Your observation: On the testing set, which has far fewer observations (only 5 cars), the MSE of 2.62 is much lower than the training MSE of 14.43. Rather than the model genuinely predicting better on unseen data, this is most likely an artifact of the tiny test set: with so few cars, the estimate is noisy and depends heavily on which cars happened to land in the split.

# Calculate SST (Total Sum of Squares)
sst_test <- sum((mtcars.test$mpg - mean(mtcars.test$mpg))^2)

# Calculate SSR (Residual Sum of Squares)
ssr_test <- sum((mtcars.test$mpg - cars_pred_test)^2)

# Calculate R-squared
r_squared_test <- 1 - (ssr_test / sst_test)
r_squared_test
## [1] 0.8567122

Your observation: The test R^2 of 0.86 is noticeably higher than the training R^2 of 0.61. As with the MSE, this more likely reflects a lucky 5-car split than genuinely better generalization; a cross-validated error estimate (sketched below) is a more trustworthy gauge.
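
With only 5 test cars, both test metrics are noisy. rpart already runs 10-fold cross-validation while fitting, so its CP table offers a steadier error estimate at no extra cost; shown here as an optional check.

printcp(cars_tree) # the xerror column is the 10-fold cross-validated relative error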