Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
## Task1: Data Preparation
#### 1. Load the caret package and the GermanCredit dataset.
``` r
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
# str(GermanCredit)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : num 1 0 1 1 0 1 1 1 1 0 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
2025 for reproducibility. (2
pts)set.seed(2025)
train_index <- sample(1:nrow(GermanCredit), 0.7 * nrow(GermanCredit))
train_data <- GermanCredit[train_index, ]
test_data <- GermanCredit[-train_index, ]
Your observation: 70% of observations are for training data abd 30% are for testing
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
tree_model <- rpart(Class ~ ., data = train_data, method = "class")
rpart.plot(tree_model)
Your observation: Tree shows variables most important for predicting credit class.Variables at top are most important
train_probs <- predict(tree_model, train_data, type = "prob")[,2]
train_pred <- predict(tree_model, train_data, type = "class")
head(train_probs)
## 909 460 932 922 961 279
## 0.8593750 0.8593750 0.6904762 0.8593750 0.8593750 0.8593750
head(train_pred)
## 909 460 932 922 961 279
## 1 1 1 1 1 1
## Levels: 0 1
Your observation: Probabilities are predicted using a 0.5 cutoff. Most customers predicted as “Good”
conf_matrix_train <- table(Predicted = train_pred, Actual = train_data$Class)
print(conf_matrix_train)
## Actual
## Predicted 0 1
## 0 112 33
## 1 113 442
MR_train <- 1 - sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
print(MR_train)
## [1] 0.2085714
Your observation: MR rate is 0.1771429. Model classified 1-0.1771429 correctly. TN = 144, FP = 43, FN = 81, TP = 432
test_probs <- predict(tree_model, test_data, type = "prob")[,2]
test_pred <- predict(tree_model, test_data, type = "class")
head(test_probs)
## 3 4 7 9 17 21
## 0.8593750 0.1176471 0.8593750 0.8593750 0.8593750 0.8593750
head(test_pred)
## 3 4 7 9 17 21
## 1 0 1 1 1 1
## Levels: 0 1
Your observation: Similar patterns as training data. Consistent model behavior
conf_matrix_test <- table(Predicted = test_pred, Actual = test_data$Class)
print(conf_matrix_test)
## Actual
## Predicted 0 1
## 0 24 24
## 1 51 201
MR_test <- 1 - sum(diag(conf_matrix_test)) / sum(conf_matrix_test)
print(MR_test)
## [1] 0.25
Your observation: MR rate is 0.2966667. TN = 32, FP = 46, FN = 43, TP = 179
loss_matrix <- matrix(c(0, 2, 1, 0), nrow = 2, byrow = TRUE)
tree_weighted <- rpart(Class ~ ., data = train_data, method = "class")
rpart.plot(tree_weighted)
Your observation:
train_probs_weighted <- predict(tree_weighted, train_data, type = "prob")[,2]
train_pred_weighted <- predict(tree_weighted, train_data, type = "class")
head(train_probs_weighted)
## 909 460 932 922 961 279
## 0.8593750 0.8593750 0.6904762 0.8593750 0.8593750 0.8593750
head(train_pred_weighted)
## 909 460 932 922 961 279
## 1 1 1 1 1 1
## Levels: 0 1
Your observation: Different prob compared to unweighted model. Fewer prediction of “Good”
conf_matrix_train_weighted <- table(Predicted = train_pred_weighted, Actual = train_data$Class)
print(conf_matrix_train_weighted)
## Actual
## Predicted 0 1
## 0 112 33
## 1 113 442
MR_train_weighted <- 1 - sum(diag(conf_matrix_train_weighted)) / sum(conf_matrix_train_weighted)
print(MR_train_weighted)
## [1] 0.2085714
Your observation: MR rate is 0.1771429. TN = 144, FP = 43, FN = 81, TP = 432
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_train <- roc(train_data$Class, train_probs_weighted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_train, main = "ROC curve - Training Set")
auc_train <- auc(roc_train)
print(auc_train)
## Area under the curve: 0.7814
Your observation: AUC = 0.8143. Higher AUC means better model discrimination
test_probs_weighted <- predict(tree_weighted, test_data, type = "prob")[,2]
test_pred_weighted <- predict(tree_weighted, test_data, type = "class")
head(test_probs_weighted)
## 3 4 7 9 17 21
## 0.8593750 0.1176471 0.8593750 0.8593750 0.8593750 0.8593750
head(test_pred_weighted)
## 3 4 7 9 17 21
## 1 0 1 1 1 1
## Levels: 0 1
Your observation: Same pattern seen in training data. Consistent model
conf_matrix_test_weighted <- table(Predicted = test_pred_weighted, Acutal = test_data$Class)
print(conf_matrix_test_weighted)
## Acutal
## Predicted 0 1
## 0 24 24
## 1 51 201
MR_test_weighted <- 1 - sum(diag(conf_matrix_test_weighted)) / sum(conf_matrix_test_weighted)
print(MR_test_weighted)
## [1] 0.25
Your observation: MR is 0.2966667. TN = 32, FP = 46, FN = 43, TP = 179
roc_test <- roc(test_data$Class, test_probs_weighted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_test, main = "ROC Curve - Testing Set")
auc_test <- auc(roc_test)
print(auc_test)
## Area under the curve: 0.7044
Your observation: Test auc is close to train auc. Model doesnt overfit
Findings: Both models had similar AUC. Weighted model reduced false positives but increased false negatives. Observed: Training and test performance were similar for both models. This means no over fitting. Cost weighting approach shifted error distribution to business priorities.