Part 1, Classification Tree Method (50pts in total)

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) for variable descriptions. The response variable is Class; all other variables are predictors.

Run the following code only once to install the caret package. The German credit scoring data is provided in that package.

install.packages('caret')

1. Load the caret package and the GermanCredit dataset. (5pts)

library(caret) # this package provides the German credit data in numeric (dummy-coded) format
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # make `Class` a factor so the tree is fit as a classifier; now 1 is good and 0 is bad
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
#load tree model packages
library(rpart)
library(rpart.plot)
# Drop the variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
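Dropping columns by position depends on the exact column order of the data frame, so it can silently drop the wrong variables if the order ever changes. As a minimal sketch, assuming a small hypothetical data frame `toy` (not the real GermanCredit data), constant columns can be found and dropped by name instead. This covers only the all-constant case; the index list above may also remove columns for other reasons.

```r
# Toy data frame standing in for a dummy-coded dataset (hypothetical values)
toy <- data.frame(
  Amount       = c(1169, 5951, 2096, 7882),
  Purpose.Car  = c(1, 0, 0, 1),
  Purpose.Trip = c(0, 0, 0, 0),  # all zero: carries no information
  Housing.Own  = c(1, 1, 0, 0)
)

# A column is constant when it holds a single distinct value
is_constant   <- vapply(toy, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(toy)[is_constant]
constant_cols

# Drop by name rather than by position
toy_clean <- toy[, !names(toy) %in% constant_cols, drop = FALSE]
names(toy_clean)
```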

2. Split the dataset into training and test sets with an 80-20 split. Use 2024 as the random seed for reproducibility. (5pts)

set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
credit_train = GermanCredit[index,]
credit_test = GermanCredit[-index,]

3. Fit a classification tree model (without extra parameters) using the training set. Please use all variables, but make sure the variable types (especially the response variable Class) are right. (10pts)

fit_tree <- rpart(Class ~ ., data=credit_train, method = "class")

Your observation: The classification tree was fit to the training data using rpart's default settings.
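Fitting "without extra parameters" means relying on the defaults of `rpart.control()`. The sketch below writes the main documented defaults out explicitly (minsplit = 20, cp = 0.01, maxdepth = 30) and checks that the predictions do not change. It uses the small `kyphosis` dataset that ships with rpart as a stand-in for `credit_train`, which is an assumption made only so the example runs without caret.

```r
library(rpart)

# kyphosis is a small example dataset shipped with rpart (stand-in for credit_train)
fit_default <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

# Same call with the main defaults of rpart.control() written out
fit_explicit <- rpart(
  Kyphosis ~ ., data = kyphosis, method = "class",
  control = rpart.control(minsplit = 20, cp = 0.01, maxdepth = 30)
)

pred_default  <- predict(fit_default,  kyphosis, type = "class")
pred_explicit <- predict(fit_explicit, kyphosis, type = "class")

# TRUE when both fits classify every row the same way
same_predictions <- identical(pred_default, pred_explicit)
same_predictions
```

Writing the defaults out makes it easier to see which settings could be changed later when tuning or pruning the tree.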

4. Visualize the tree. (5pts)

rpart.plot(fit_tree,extra=4, yesno=2)

5. Use the training set to get predicted classes. (5pts)

pred_credit_train <- predict(fit_tree, credit_train, type="class")

6. Obtain the confusion matrix and misclassification rate (MR) on the training set. (5pts)

Cmatrix_train = table(true = credit_train$Class,
                      pred = pred_credit_train)
Cmatrix_train
##     pred
## true   0   1
##    0 145  84
##    1  50 521
#MR
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.1675

Your observation: The training misclassification rate is 0.1675 (134 of 800 observations are misclassified).

7. Use the testing set to get predicted classes. (5pts)

pred_credit_test <- predict(fit_tree, credit_test, type="class")

8. Obtain the confusion matrix and misclassification rate (MR) on the testing set. (5pts)

Cmatrix_test = table(true = credit_test$Class,
                     pred = pred_credit_test)
Cmatrix_test
##     pred
## true   0   1
##    0  36  35
##    1  26 103
#MR
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.305

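Your observation: The testing misclassification rate is 0.305 (61 of 200 observations are misclassified). This is well above the training MR of 0.1675, which suggests the default tree overfits the training data.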
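The overall MR hides how the errors split between the two classes. As a minimal sketch, the two confusion matrices printed above can be re-entered by hand and summarized with a small helper; `cm_metrics` is a hypothetical name introduced here, not part of any package.

```r
# Confusion matrices copied from the output above (rows = true, cols = predicted)
cm_train <- matrix(c(145, 50, 84, 521), nrow = 2,
                   dimnames = list(true = c("0", "1"), pred = c("0", "1")))
cm_test  <- matrix(c(36, 26, 35, 103), nrow = 2,
                   dimnames = list(true = c("0", "1"), pred = c("0", "1")))

# Hypothetical helper: misclassification rate plus per-class accuracy
cm_metrics <- function(cm) {
  c(
    MR          = 1 - sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),  # good customers predicted good
    specificity = cm["0", "0"] / sum(cm["0", ])   # bad customers predicted bad
  )
}

metrics <- round(rbind(train = cm_metrics(cm_train),
                       test  = cm_metrics(cm_test)), 4)
metrics
```

On the testing set, 35 of the 71 bad customers are predicted as good, so specificity is much weaker than sensitivity. In credit scoring that is usually the more costly kind of error.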

9. Obtain the ROC curve and AUC for the testing data (not training). (5pts)

pred_prob_test = predict(fit_tree, credit_test, type = "prob")

pred_prob_test = pred_prob_test[,"1"] # keep only the predicted probability of class "1" (good)
library(ROCR)
pred <- prediction(pred_prob_test, credit_test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6742548
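The AUC can be read as the probability that a randomly chosen good customer receives a higher predicted score than a randomly chosen bad one. As a cross-check on the ROCR value, the sketch below computes the AUC from ranks in base R. The helper name `auc_rank` and the toy scores are assumptions made for illustration.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) identity
auc_rank <- function(scores, labels) {
  r  <- rank(scores)        # average ranks handle tied scores
  n1 <- sum(labels == 1)    # number of positives
  n0 <- sum(labels == 0)    # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: 3 positives and 3 negatives
toy_scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
toy_labels <- c(1, 1, 0, 1, 0, 0)

auc_toy <- auc_rank(toy_scores, toy_labels)
auc_toy  # 8 of the 9 positive-negative pairs are ordered correctly
```

Calling `auc_rank(pred_prob_test, credit_test$Class)` on the objects above should agree with the ROCR result. A single tree produces only a few distinct probabilities (one per leaf), so many scores are tied, which is why average ranks are used.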

10. (Optional) Use cp or other parameters to prune the tree and see whether you can get a better testing MR and testing AUC.
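One common way to approach the pruning step is to grow a deliberately large tree, read the cross-validated error from its `cptable`, and prune back to the cp with the lowest `xerror`. The sketch below illustrates this on rpart's built-in `kyphosis` data, used as a stand-in for `credit_train`. The settings `cp = 0` and `minsplit = 5` are illustrative assumptions, not recommended values.

```r
library(rpart)

set.seed(2024)  # the cross-validated xerror column depends on random folds

# Grow a deliberately large tree (stand-in data: rpart's kyphosis dataset)
big_tree <- rpart(Kyphosis ~ ., data = kyphosis, method = "class",
                  control = rpart.control(cp = 0, minsplit = 5))

# cptable lists, for each candidate cp, the number of splits and the CV error
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree associated with the selected cp
pruned_tree <- prune(big_tree, cp = best_cp)

# Count terminal nodes to compare tree sizes
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(full = n_leaves(big_tree), pruned = n_leaves(pruned_tree))
```

The same pattern applies to `credit_train`: refit with a small cp, pick `best_cp` from the table, prune, and then recompute the testing MR and AUC with the pruned tree.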

Part 2, Regression Tree Method on mtcars data (50pts in total)

Starter code for mtcars dataset

We will use the built-in mtcars dataset to predict miles per gallon (mpg) using other car characteristics. The dataset includes information about 32 cars from Motor Trend magazine (1973-74).

0. Load the data. (5pts)

# Load the mtcars dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

1. Split the dataset into training and test sets with an 85-15 split. Use set.seed(2024) for reproducibility. (5pts)

set.seed(2024)
index <- sample(1:nrow(mtcars),nrow(mtcars)*0.85)
cars_train = mtcars[index,]
cars_test = mtcars[-index,]
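With 32 rows, 85% is 27.2, which is not a whole number. The sketch below makes the rounding explicit with `floor()`, so the training size does not depend on how `sample()` treats a fractional size, and then confirms the resulting split sizes. The variable `n_train` is introduced here for illustration.

```r
data(mtcars)

n       <- nrow(mtcars)      # 32 cars
n_train <- floor(n * 0.85)   # 27.2 floored to 27

set.seed(2024)
index      <- sample(seq_len(n), n_train)
cars_train <- mtcars[index, ]
cars_test  <- mtcars[-index, ]

# Sizes of the two parts
c(train = nrow(cars_train), test = nrow(cars_test))
```

Only 5 cars end up in the test set, so any test metric computed from this split rests on very few observations.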

2. Fit a basic regression tree model using the training set with mpg as the response variable. Set method = "anova". (10pts)

fit_tree <- rpart(mpg ~ ., data=cars_train, method = "anova")
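With so few rows, rpart's default `minsplit = 20` can stop the tree after very few splits, because a node needs at least 20 observations before a split is attempted. The sketch below contrasts the default fit with a looser, purely illustrative setting (`minsplit = 10`, `minbucket = 3`) on the full mtcars data. These values are assumptions for demonstration, not tuned choices.

```r
library(rpart)
data(mtcars)

# Default control settings
fit_default <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Hypothetical looser settings for a small sample (illustrative values only)
fit_small <- rpart(mpg ~ ., data = mtcars, method = "anova",
                   control = rpart.control(minsplit = 10, minbucket = 3))

# Count terminal nodes in each tree
n_leaves    <- function(fit) sum(fit$frame$var == "<leaf>")
leaf_counts <- c(default = n_leaves(fit_default), looser = n_leaves(fit_small))
leaf_counts
```

A deeper tree fits the training data more closely, but on a dataset this small it may also overfit, so any change should be judged on held-out data.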

2. Visualize the tree using rpart.plot. Interpret the splits. (10pts)

rpart.plot(fit_tree)

Your observation: The tree splits on cyl: cars with cyl >= 5 go down one branch and the remaining cars go down the other. About 63% of the training cars satisfy cyl >= 5. This rule divides the training data into two parts, and the tree predicts the mean mpg of the part a car falls into.

3. Make predictions and calculate MSE and R-squared on training set. (10pts)

pred_cars_train <- predict(fit_tree, cars_train)
mse <- mean((cars_train$mpg - pred_cars_train)^2)
mse
## [1] 14.43124
r_squared <- 1 - (sum((cars_train$mpg - pred_cars_train)^2) / sum((cars_train$mpg - mean(cars_train$mpg))^2))
r_squared
## [1] 0.6121479

Your observation: On the training set, MSE = 14.43124 and R-squared = 0.6121479, so the tree explains about 61% of the variance in mpg.

4. Make predictions and calculate MSE and R-squared on testing set. (10pts)

pred_cars_test <- predict(fit_tree, cars_test)
mse <- mean((cars_test$mpg - pred_cars_test)^2)
mse
## [1] 2.619646
r_squared <- 1 - (sum((cars_test$mpg - pred_cars_test)^2) / sum((cars_test$mpg - mean(cars_test$mpg))^2))
r_squared
## [1] 0.8567122

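Your observation: On the testing set, MSE = 2.619646 and R-squared = 0.8567122. These look better than the training values (MSE = 14.43124, R-squared = 0.6121479), but the test set is very small (15% of 32 cars), so the estimates are unstable and should be read with caution.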
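The MSE and R-squared formulas are repeated for the training and testing sets above. As a minimal sketch, they can be wrapped in one helper so that both sets are scored the same way. `regression_metrics` is a hypothetical name, and the small vector `y` reuses a few mpg values shown in the `str(mtcars)` output.

```r
# Hypothetical helper: MSE and R-squared from actual and predicted values
regression_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)       # residual sum of squares
  sst <- sum((actual - mean(actual))^2)    # total sum of squares around the mean
  c(MSE = sse / length(actual), R2 = 1 - sse / sst)
}

y <- c(21.0, 22.8, 18.7, 14.3, 24.4)

# Perfect predictions: MSE is 0 and R-squared is 1
perfect <- regression_metrics(y, y)

# Predicting the mean everywhere: R-squared is 0
baseline <- regression_metrics(y, rep(mean(y), length(y)))

rbind(perfect, baseline)
```

With the objects from this section, `regression_metrics(cars_test$mpg, pred_cars_test)` computes the same quantities as the code above. Note that R-squared here is measured against the mean of whichever set is being scored.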

Part 3 (Optional): Recall the results from the previous homework. How do they compare with the results above? Just discuss.