Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all others are predictors.
Run the following line only once to install the
caret package. The German credit scoring data is
provided in that package.
install.packages('caret')
library(lattice)
library(ggplot2)
library(caret) # this package contains the German credit data in its numeric (dummy-coded) format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # convert `Class` to 1 (Good) or 0 (Bad)
GermanCredit$Class <- as.factor(GermanCredit$Class) # make sure `Class` is a factor, since classification with rpart requires a factor response; now 1 is Good and 0 is Bad
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
#load tree model packages
library(rpart)
library(rpart.plot)
# Drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
Split the data into training (80%) and testing (20%) sets, setting the random seed to 2024 for reproducibility. (5pts)
set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
German.train = GermanCredit[index,]
German.test = GermanCredit[-index,]
Fit a classification tree on the training data, making sure the model formula and the response variable (Class) are right. (10pts)
library(rpart)
library(rpart.plot)
# fit the model
fit_German_tree <- rpart(as.factor(Class) ~ ., data=German.train)
Your observation: The code
fit_German_tree <- rpart(as.factor(Class) ~ ., data=German.train)
fits a decision tree with the rpart function. Wrapping
Class in as.factor() guarantees a classification (rather than
regression) tree; Class is already a factor here, so this is just a
safeguard. All other variables in German.train are used as
predictors to learn patterns and predict the target class.
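To see which predictors the fitted tree actually relies on, rpart records an importance score for every variable used in a primary or surrogate split; a quick look, using the fit above:
# Variable importance, accumulated over primary and surrogate splits
round(fit_German_tree$variable.importance, 1)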
rpart.plot(fit_German_tree,extra=4, yesno=2)
pred_German_train <- predict(fit_German_tree, German.train, type="class")
Gmatrix_train = table(true = German.train$Class,
pred = pred_German_train)
Gmatrix_train
## pred
## true 0 1
## 0 145 84
## 1 50 521
Your observation: The confusion matrix Gmatrix_train
shows that the model performs well overall on the training set,
achieving an accuracy of 83.25%. Performance on class 1 is strong,
with a recall of 91.25% (521 of 571 actual class 1 instances
identified) and a precision of 86.12% (521 of 605 class 1 predictions
correct). Performance on class 0 is much weaker: only 145 of 229
class 0 instances are caught, a recall of about 63.3%. Reducing the
84 false positives is therefore the main area for improvement if both
classes are to be identified accurately.
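As a quick check of the figures quoted above, the metrics can be computed directly from the confusion matrix. This is a minimal sketch: the helper name conf_metrics is my own, and it assumes the table(true, pred) layout used above with class "1" as the positive class.
# Accuracy, recall, and precision for class "1" from a table(true, pred) matrix
conf_metrics <- function(cm) {
  accuracy    <- sum(diag(cm)) / sum(cm)
  recall_1    <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FN)
  precision_1 <- cm["1", "1"] / sum(cm[, "1"])  # TP / (TP + FP)
  c(accuracy = accuracy, recall_1 = recall_1, precision_1 = precision_1)
}
round(conf_metrics(Gmatrix_train), 4) # accuracy 0.8325, recall 0.9124, precision 0.8612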
pred_German_test <- predict(fit_German_tree, German.test, type="class")
Gmatrix_test = table(true = German.test$Class,
pred = pred_German_test)
Gmatrix_test
## pred
## true 0 1
## 0 36 35
## 1 26 103
Your observation: The confusion matrix Gmatrix_test
shows moderate out-of-sample performance, with an overall accuracy of
69.5%. Recall for class 1 is still strong at 79.84% (103 of 129 true
class 1 instances identified), but precision falls to about 74.6%
(103 of 138 positive predictions) because of the 35 false positives,
and class 0 recall is only about 50.7% (36 of 71). The drop from
83.25% training accuracy to 69.5% test accuracy suggests the tree
overfits somewhat, and both false positives and false negatives
remain an issue.
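The same helper applied to the test matrix reproduces the test-set figures:
round(conf_metrics(Gmatrix_test), 4) # accuracy 0.6950, recall 0.7984, precision 0.7464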
# Obtain predicted probabilities on the testing set
pred_prob_German_test = predict(fit_German_tree, German.test, type = "prob")
pred_prob_German_test = pred_prob_German_test[,"1"]
# Looks familiar, right?
library(ROCR)
pred <- prediction(pred_prob_German_test, German.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6742548
# We need to define a cost matrix first; keep the 0s on the diagonal
German_matrix <- matrix(c(0, 1, # cost of 1 for FP
5, 0), # cost of 5 for FN
byrow = TRUE, nrow = 2)
fit_tree_German_2 <- rpart(as.factor(Class) ~ ., data=German.train,
parms = list(loss = German_matrix))
rpart.plot(fit_tree_German_2,extra=4, yesno=2)
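One way to see what the loss matrix buys us is to compare the total misclassification cost of the two trees on the test set. This is a sketch: pred_cost is my own helper, and it weights each cell of the confusion matrix by the corresponding entry of German_matrix (rows = true class, columns = predicted class).
# Total misclassification cost on a data set, given a loss matrix
pred_cost <- function(tree, data, loss) {
  cm <- table(true = data$Class, pred = predict(tree, data, type = "class"))
  sum(cm * loss)
}
pred_cost(fit_German_tree, German.test, German_matrix)   # symmetric tree
pred_cost(fit_tree_German_2, German.test, German_matrix) # cost-sensitive tree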
# Fit the tree with a maximum depth of 3
depth_German_control_tree_2 <- rpart(Class ~ ., data = German.train, method = "class",
control = rpart.control(maxdepth = 3),
parms = list(loss = German_matrix))
# The loss matrix (German_matrix) is included just for demonstration; remove it if needed
rpart.plot(depth_German_control_tree_2,extra=4, yesno=2)
# Fit the tree with a smaller complexity parameter (cp = 0.003)
depth_German_control_tree <- rpart(Class ~ ., data = German.train,
method = "class",cp = 0.003,
parms = list(loss = German_matrix))
# The loss matrix (German_matrix) is included just for demonstration;
#remove it if needed
rpart.plot(depth_German_control_tree,extra=4, yesno=2)
# Fit a full tree with very low cp value
full_German_tree <- rpart(Class ~ ., data = German.train,
method = "class", cp = 0.001)
# Display CP table to identify the optimal cp
printcp(full_German_tree)
##
## Classification tree:
## rpart(formula = Class ~ ., data = German.train, method = "class",
## cp = 0.001)
##
## Variables actually used in tree construction:
## [1] Age Amount
## [3] CheckingAccountStatus.0.to.200 CheckingAccountStatus.lt.0
## [5] Duration InstallmentRatePercentage
## [7] Job.SkilledEmployee NumberExistingCredits
## [9] OtherInstallmentPlans.Bank Property.RealEstate
## [11] Purpose.Business Purpose.NewCar
## [13] Purpose.Radio.Television Purpose.UsedCar
## [15] ResidenceDuration SavingsAccountBonds.100.to.500
## [17] SavingsAccountBonds.lt.100
##
## Root node error: 229/800 = 0.28625
##
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.0305677 0 1.00000 1.00000 0.055828
## 2 0.0240175 4 0.87773 1.06987 0.056931
## 3 0.0196507 8 0.77293 1.03930 0.056465
## 4 0.0131004 10 0.73362 1.02620 0.056257
## 5 0.0109170 19 0.60699 1.00437 0.055901
## 6 0.0087336 21 0.58515 1.00437 0.055901
## 7 0.0065502 22 0.57642 0.98253 0.055532
## 8 0.0043668 24 0.56332 0.99127 0.055681
## 9 0.0029112 27 0.55022 1.05240 0.056667
## 10 0.0021834 30 0.54148 1.05240 0.056667
## 11 0.0010000 32 0.53712 1.06114 0.056800
# Identify the optimal cp manually or use the following code
optimal_German_cp <- full_German_tree$cptable[which.min(
full_German_tree$cptable[,"xerror"]),"CP"]
# Prune the tree
pruned_German_tree <- prune(full_German_tree, cp = optimal_German_cp)
rpart.plot(pruned_German_tree,extra=4, yesno=2)
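To check whether pruning actually helps out of sample, the pruned tree can be scored on German.test the same way as before; a minimal sketch reusing the ROCR pattern from above:
# Confusion matrix for the pruned tree on the test set
pred_pruned_test <- predict(pruned_German_tree, German.test, type = "class")
table(true = German.test$Class, pred = pred_pruned_test)
# AUC of the pruned tree, to compare with the 0.674 obtained earlier
prob_pruned_test <- predict(pruned_German_tree, German.test, type = "prob")[, "1"]
pred_pruned <- prediction(prob_pruned_test, German.test$Class)
unlist(slot(performance(pred_pruned, "auc"), "y.values"))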
We will use the built-in mtcars dataset to predict miles per gallon (mpg) using other car characteristics. The dataset includes information about 32 cars from Motor Trend magazine (1973-74).
# Load the mtcars dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
set.seed(2024)
index <- sample(1:nrow(mtcars),nrow(mtcars)*0.85)
mtcars.train = mtcars[index,]
mtcars.test = mtcars[-index,]
library(rpart)
library(rpart.plot)
# fit the Regression tree
mtcars_tree <- rpart(mpg ~ ., data = mtcars.train)
rpart.plot(mtcars_tree,extra=1, yesno=2)
summary(mtcars_tree)
## Call:
## rpart(formula = mpg ~ ., data = mtcars.train)
## n= 27
##
## CP nsplit rel error xerror xstd
## 1 0.6121479 0 1.0000000 1.1010927 0.2728919
## 2 0.0100000 1 0.3878521 0.7898291 0.1866011
##
## Variable importance
## cyl disp hp qsec vs wt
## 20 18 18 14 14 14
##
## Node number 1: 27 observations, complexity param=0.6121479
## mean=20.50741, MSE=37.20809
## left son=2 (17 obs) right son=3 (10 obs)
## Primary splits:
## cyl < 5 to the right, improve=0.6121479, (0 missing)
## hp < 118 to the right, improve=0.6068166, (0 missing)
## wt < 3.325 to the right, improve=0.5916267, (0 missing)
## disp < 120.55 to the right, improve=0.5838435, (0 missing)
## vs < 0.5 to the left, improve=0.5158466, (0 missing)
## Surrogate splits:
## disp < 142.9 to the right, agree=0.963, adj=0.9, (0 split)
## hp < 109.5 to the right, agree=0.963, adj=0.9, (0 split)
## wt < 2.5425 to the right, agree=0.889, adj=0.7, (0 split)
## qsec < 18.41 to the left, agree=0.889, adj=0.7, (0 split)
## vs < 0.5 to the left, agree=0.889, adj=0.7, (0 split)
##
## Node number 2: 17 observations
## mean=16.84706, MSE=10.98484
##
## Node number 3: 10 observations
## mean=26.73, MSE=20.2901
Your observation: The model prioritizes cyl,
disp, and hp as key predictors of
mpg. The primary split based on cyl
significantly reduces error, and surrogate splits ensure robust handling
of potential missing data. However, the relatively high MSE in both
terminal nodes suggests room for improvement, perhaps by adding more
splits or exploring alternative models.
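One way to probe the "more splits" idea is to relax rpart's stopping rules and then let cross-validation decide how much of the extra complexity survives pruning. This is a sketch under assumed control settings (minsplit = 5, cp = 0.001), mirroring the pruning workflow used for the credit tree above:
# Grow a deeper regression tree, then prune back using the CP table
deep_mtcars_tree <- rpart(mpg ~ ., data = mtcars.train,
control = rpart.control(minsplit = 5, cp = 0.001))
printcp(deep_mtcars_tree)
best_mtcars_cp <- deep_mtcars_tree$cptable[which.min(
deep_mtcars_tree$cptable[, "xerror"]), "CP"]
pruned_mtcars_tree <- prune(deep_mtcars_tree, cp = best_mtcars_cp)
rpart.plot(pruned_mtcars_tree, extra = 1, yesno = 2)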
# Make predictions on the training set
mtcars_pred_train <- predict(mtcars_tree, mtcars.train)
# Training MSE for the mtcars regression tree
mse_mtcars_train <- mean((mtcars.train$mpg - mtcars_pred_train)^2)
print(paste0("mse_mtcars_train:",mse_mtcars_train))
## [1] "mse_mtcars_train:14.4312352941176"
Your observation: The training MSE of 14.43 in
mse_mtcars_train indicates that, on average, the squared
difference between the tree's predicted and actual mpg
values on mtcars.train is 14.43, i.e., a
root-mean-squared error of about 3.8 mpg. Lower values indicate a
better fit and smaller prediction errors, but this number alone does
not tell us whether the model is adequate: it should be interpreted
on the scale of the target variable and compared against other models
or a simple baseline, as sketched below.
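For the baseline comparison just mentioned, a natural reference is a model that always predicts the training mean of mpg; the variable names below are my own. The baseline MSE should equal the root-node MSE (about 37.2) reported by summary(mtcars_tree), since the root node predicts the overall mean.
# Baseline: always predict the training-set mean mpg
baseline_pred <- mean(mtcars.train$mpg)
mse_baseline_train <- mean((mtcars.train$mpg - baseline_pred)^2)
mse_baseline_train # ~37.2, versus 14.43 for the tree
# RMSE puts the tree's error back on the mpg scale
sqrt(mse_mtcars_train) # ~3.8 mpg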
# Make predictions on the testing set
mtcars_pred_test <- predict(mtcars_tree, mtcars.test)
# Test MSE for the mtcars regression tree
mse_mtcars_test <- mean((mtcars.test$mpg - mtcars_pred_test)^2)
print(paste0("mse_mtcars_test:",mse_mtcars_test))
## [1] "mse_mtcars_test:2.61964574394463"
Your observation: The test MSE of 2.62 in
mse_mtcars_test indicates that the model's predictions
deviate from the actual mpg values in
mtcars.test by an average squared difference of 2.62,
roughly 1.6 mpg in RMSE terms. This low value suggests the tree
predicts the held-out cars accurately. Comparing test MSE to training
MSE is the usual check for overfitting or underfitting, but a test
MSE this far below the training MSE is more likely an artifact of the
tiny test set (only 5 of the 32 cars) than evidence of improved
generalization, so the comparison should be read with caution.
The comparison between the SVM model with a
linear kernel and logistic regression
highlights the SVM's stronger performance: it achieves
accuracies of 79.2% on the training set
and 74.75% on the test set, with AUC
scores of 0.8505 and 0.7353, respectively, indicating good class
discrimination despite some susceptibility to false positives.
Logistic regression, though simpler and more interpretable, performs
adequately on linearly separable data but may struggle with more
complex relationships. The decision tree built with
rpart on German.train shows strong in-sample
classification performance, with 83.25% training accuracy and high
recall (91.25%) and precision (86.12%) for class 1, but its accuracy
drops to 69.5% on the test set and it still struggles to limit false
positives. Separately, the mtcars regression tree attains a test MSE
(2.62) well below its training MSE (14.43), though with only five
test cars this gap says more about the small test set than about
generalization. Overall, SVMs excel in high-dimensional spaces and
can be tuned to address class imbalance, making them more suitable
for complex datasets, whereas decision trees offer transparency:
interpretable splits driven by key predictors such as
checking-account status, Duration, and Amount for the
credit data, and cyl, disp, and hp for mtcars.