Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
for the variable descriptions. The response variable is Class
and all others are predictors.
Run the following code only once to install the
package caret. The German credit scoring data is
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
## Loading required package: ggplot2
## Loading required package: lattice
library(ROCR)
library(ggplot2)
library(lattice)
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good")
Your observation: Our target variable is ‘Class’, so we use as.numeric(GermanCredit$Class == "Good") to recode it as a 0/1 indicator (1 = Good, 0 = Bad).
#This optional code drops variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
str(GermanCredit)
## 'data.frame': 1000 obs. of 49 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : num 1 0 1 1 0 1 1 1 1 0 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
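Rather than hard-coding the column indices above, the uninformative columns can also be found programmatically. A minimal, non-destructive sketch using caret::nearZeroVar, shown for illustration only (it may flag a slightly different set than the hand-picked indices, and the drop above has already been applied):
nzv <- nearZeroVar(GermanCredit) # indices of near-zero-variance columns
names(GermanCredit)[nzv] # columns a variance filter would drop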
table(GermanCredit$Class)
##
## 0 1
## 300 700
Your observation: “Good” (coded 1) appears 700 of 1000 times, so the Good class makes up 70% of the data.
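The class proportions can also be computed directly; a one-line sketch:
prop.table(table(GermanCredit$Class)) # expected: 0.3 for class 0, 0.7 for class 1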
Split the data into training and test sets. Use the random seed
2023 for reproducibility.
set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
german.train = GermanCredit[index,]
german.test = GermanCredit[-index,]
Your observation: An 80/20 random split yields 800 training observations and 200 test observations.
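A quick sanity check that the split produced the expected sizes:
nrow(german.train) # should be 800
nrow(german.test) # should be 200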
library(rpart)
library(rpart.plot)
# fit the tree
fit_tree <- rpart(as.factor(Class) ~ ., data=german.train)
rpart.plot(fit_tree,extra=4, yesno=2)
Your observation: The rpart package was used to fit a classification tree with all predictors (Class ~ .) against the target variable.
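To see how the tree was grown and whether pruning might help, the complexity table can be inspected; a minimal sketch using rpart's built-in functions:
printcp(fit_tree) # cross-validated error at each complexity parameter
plotcp(fit_tree) # plot xerror against cp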
pred_german_train <- predict(fit_tree, german.train, type="class")
Your observation: The predicted classes of the training set were computed with the fit_tree model.
# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = german.train$Class,
pred = pred_german_train)
Cmatrix_train
## pred
## true 0 1
## 0 146 96
## 1 50 508
# Misclassification rate
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.1825
Your observation: The matrix shows 146 True Negatives, 508 True Positives, 96 False Positives, and 50 False Negatives. The misclassification rate is 18.25%.
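Sensitivity and specificity can be derived from the same matrix; a minimal sketch (indexing the table by its "0"/"1" dimnames):
TN <- Cmatrix_train["0", "0"]; FP <- Cmatrix_train["0", "1"]
FN <- Cmatrix_train["1", "0"]; TP <- Cmatrix_train["1", "1"]
c(sensitivity = TP / (TP + FN), # true positive rate, 508/558
  specificity = TN / (TN + FP)) # true negative rate, 146/242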
# Make predictions on the test set
pred_german_test <- predict(fit_tree, german.test, type="class")
Your observation: The same fit_tree model was used to predict classes on the test set.
# Confusion matrix
Cmatrix_test = table(true = german.test$Class,
pred = pred_german_test)
Cmatrix_test
## pred
## true 0 1
## 0 35 23
## 1 27 115
# Misclassification rate
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.25
Your observation: There were 35 True Negatives, 115 True Positives, 23 False Positives, and 27 False Negatives. This gives a misclassification rate of 25%, which is 6.75 percentage points higher than on the training set.
# We need to define a cost matrix first; keep the zeros on the diagonal
cost_matrix <- matrix(c(0, 2, # cost of 2 for FP
1, 0), # cost of 1 for FN
byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(Class) ~ ., data=german.train,
parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)
Your observation: With the loss set to 2 for a False Positive and 1 for a False Negative, misclassifying a Bad applicant as Good is penalized more heavily, so the tree predicts the positive class only when it is more confident.
#get predictions for training
pred_german_train21 <- predict(fit_tree_asym, german.train, type="class")
Your observation: Predicted classes for the training set were computed using the cost-weighted tree, fit_tree_asym.
Cmatrix_train_weight = table(true = german.train$Class, pred = pred_german_train21)
Cmatrix_train_weight
## pred
## true 0 1
## 0 172 70
## 1 126 432
1 - sum(diag(Cmatrix_train_weight))/sum(Cmatrix_train_weight)
## [1] 0.245
Your observation: There were 172 True Negatives, 432 True Positives, 70 False Positives, and 126 False Negatives. There are fewer False Positives because of the higher FP cost in the loss matrix. The misclassification rate is 24.5%, which is 6.25 percentage points higher than the unweighted training MR.
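Raw accuracy is not the metric the weighted tree targets; under the assumed 2:1 loss, total cost is. A quick sketch comparing both training confusion matrices (this works because the tables and cost_matrix share the same true-by-predicted layout):
sum(Cmatrix_train * cost_matrix) # total cost, unweighted tree
sum(Cmatrix_train_weight * cost_matrix) # total cost, weighted tree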
## Unweighted ROC and AUC
# obtain predicted probability
pred_prob_train = predict(fit_tree, german.train, type = "prob")
pred_prob_train = pred_prob_train[,"1"]
library(ROCR)
pred <- prediction(pred_prob_train, german.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7974133
## Weighted ROC & AUC
# obtain predicted probability
pred_prob_train21 = predict(fit_tree_asym, german.train, type = "prob")
pred_prob_train21 = pred_prob_train21[,"1"]
library(ROCR)
pred <- prediction(pred_prob_train21, german.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7557096
Your observation: The unweighted AUC is higher at 79.74% compared to the weighted AUC of 75.57%. The ROC curve is also better for the unweighted model.
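For a direct visual comparison, both training ROC curves can be overlaid on one plot; a sketch using ROCR (add = TRUE draws the second curve on the existing axes):
pred_u <- prediction(pred_prob_train, german.train$Class)
pred_w <- prediction(pred_prob_train21, german.train$Class)
plot(performance(pred_u, "tpr", "fpr"), col = "blue")
plot(performance(pred_w, "tpr", "fpr"), col = "red", add = TRUE)
abline(0, 1, lty = 2) # reference line for a random classifier
legend("bottomright", c("unweighted", "weighted"), col = c("blue", "red"), lty = 1)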
pred_german_test21 <- predict(fit_tree_asym, german.test, type = "class")
Your observation: The weighted tree, fit_tree_asym, was used to predict classes on the test set.
Cmatrix_test_weight = table( true = german.test$Class, pred = pred_german_test21)
Cmatrix_test_weight
## pred
## true 0 1
## 0 40 18
## 1 43 99
1 - sum(diag(Cmatrix_test_weight))/sum(Cmatrix_test_weight)
## [1] 0.305
Your observation: There are 40 True Negatives, 99 True Positives, 18 False Positives, and 43 False Negatives in the confusion matrix. The misclassification rate is 30.5%, which is 5.5 percentage points higher than the unweighted test MR.
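The same total-cost check as on the training set can be run on the test matrices, to see which tree actually incurs less cost under the assumed 2:1 loss:
sum(Cmatrix_test * cost_matrix) # total cost, unweighted tree
sum(Cmatrix_test_weight * cost_matrix) # total cost, weighted tree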
## Unweighted ROC & AUC
# obtain predicted probability
pred_prob_test = predict(fit_tree, german.test, type = "prob")
# This step is necessary again: predict() for a tree model returns a two-column probability matrix, one column for 0 and one for 1.
pred_prob_test = pred_prob_test[,"1"] #replace "1" with the actual level name if the response variable is a factor
#ROC
pred <- prediction(pred_prob_test, german.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7279626
## Weighted ROC & AUC
# obtain predicted probability
pred_prob_test21 = predict(fit_tree_asym, german.test, type = "prob")
# This step is necessary again: predict() for a tree model returns a two-column probability matrix, one column for 0 and one for 1.
pred_prob_test21 = pred_prob_test21[,"1"] #replace "1" with the actual level name if the response variable is a factor
#ROC
pred <- prediction(pred_prob_test21, german.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7049539
Your observation: Again, the AUC for the unweighted model is higher, 72.8% versus 70.5% for the weighted model. The ROC curve is also better for the unweighted model.
Overall, the unweighted model shows better performance.
Unweighted: train MR 18.25%, test MR 25%, train AUC 79.74%, test AUC 72.8%, and the better ROC curve.
Weighted: train MR 24.5%, test MR 30.5%, train AUC 75.57%, test AUC 70.5%.
The heavier penalty on False Positives makes the weighted tree more conservative about predicting the positive class, which is likely what caused this difference; the weighted model trades raw accuracy for lower expected cost. Using the weighted model is preferred in settings such as medical diagnosis or fraud detection, where the consequences of one type of error are more severe.
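An alternative to refitting with a loss matrix is to keep the unweighted tree and move the classification cutoff. With FP cost 2 and FN cost 1, predicting Good has lower expected cost only when 2*P(Bad) < P(Good), i.e. P(Good) > 2/3. A hedged sketch reusing the test probabilities computed above:
cutoff <- 2 / (2 + 1) # Bayes cutoff for FP cost 2, FN cost 1
pred_cutoff <- as.numeric(pred_prob_test > cutoff)
table(true = german.test$Class, pred = pred_cutoff)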
Decision trees:
Pros: - Easily interpretable - Handles both categorical and numerical features - Minimal data preprocessing - Quick to build
Cons: - Prone to overfitting (see the pruning sketch below) - Unstable - Biased toward the dominant class - Limited expressiveness
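To address the overfitting risk just noted, an rpart tree can be pruned back to the complexity parameter with the lowest cross-validated error; a minimal sketch reusing fit_tree from above:
best_cp <- fit_tree$cptable[which.min(fit_tree$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit_tree, cp = best_cp)
rpart.plot(fit_pruned, extra = 4, yesno = 2)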
Support vector machines:
Pros: - Effective in high-dimensional spaces - Memory efficient for large datasets - Versatile through the choice of kernel
Cons: - Poor performance when the number of features exceeds the number of samples - Sensitive to noise
Logistic regression:
Pros: - Simplicity - Efficiency - Versatility (binary, ordinal, and nominal variables) - Interpretable output
Cons: - Assumes linearity - Sensitive to outliers - High correlation among predictors can affect performance