Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
GermanCredit$Class <- as.factor(GermanCredit$Class) #make sure `Class` is a factor as SVM require a factor response,now 1 is good and 0 is bad.
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
#load tree model packages
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
#This is the code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
2024 for
reproducibility. (5pts)set.seed(2024)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
german.train = GermanCredit[index,]
german.test = GermanCredit[-index,]
Class) are right. (10pts)german_tree <- rpart(as.factor(Class) ~ ., data = german.train)
german_tree
## n= 800
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 229 1 (0.2862500 0.7137500)
## 2) CheckingAccountStatus.lt.0>=0.5 211 101 1 (0.4786730 0.5213270)
## 4) Duration>=11.5 178 82 0 (0.5393258 0.4606742)
## 8) Duration>=31.5 38 10 0 (0.7368421 0.2631579) *
## 9) Duration< 31.5 140 68 1 (0.4857143 0.5142857)
## 18) Purpose.UsedCar< 0.5 129 61 0 (0.5271318 0.4728682)
## 36) Purpose.Business< 0.5 121 54 0 (0.5537190 0.4462810)
## 72) InstallmentRatePercentage>=2.5 82 31 0 (0.6219512 0.3780488)
## 144) Amount< 1577.5 40 11 0 (0.7250000 0.2750000) *
## 145) Amount>=1577.5 42 20 0 (0.5238095 0.4761905)
## 290) Amount>=2135.5 30 11 0 (0.6333333 0.3666667)
## 580) SavingsAccountBonds.lt.100>=0.5 22 5 0 (0.7727273 0.2272727) *
## 581) SavingsAccountBonds.lt.100< 0.5 8 2 1 (0.2500000 0.7500000) *
## 291) Amount< 2135.5 12 3 1 (0.2500000 0.7500000) *
## 73) InstallmentRatePercentage< 2.5 39 16 1 (0.4102564 0.5897436)
## 146) Duration>=15.5 26 12 0 (0.5384615 0.4615385)
## 292) Amount< 3506.5 12 3 0 (0.7500000 0.2500000) *
## 293) Amount>=3506.5 14 5 1 (0.3571429 0.6428571) *
## 147) Duration< 15.5 13 2 1 (0.1538462 0.8461538) *
## 37) Purpose.Business>=0.5 8 1 1 (0.1250000 0.8750000) *
## 19) Purpose.UsedCar>=0.5 11 0 1 (0.0000000 1.0000000) *
## 5) Duration< 11.5 33 5 1 (0.1515152 0.8484848) *
## 3) CheckingAccountStatus.lt.0< 0.5 589 128 1 (0.2173175 0.7826825)
## 6) CheckingAccountStatus.0.to.200>=0.5 210 83 1 (0.3952381 0.6047619)
## 12) Amount>=9908.5 16 1 0 (0.9375000 0.0625000) *
## 13) Amount< 9908.5 194 68 1 (0.3505155 0.6494845)
## 26) Property.RealEstate< 0.5 136 57 1 (0.4191176 0.5808824)
## 52) Age< 25.5 31 11 0 (0.6451613 0.3548387) *
## 53) Age>=25.5 105 37 1 (0.3523810 0.6476190)
## 106) SavingsAccountBonds.lt.100>=0.5 52 24 1 (0.4615385 0.5384615)
## 212) NumberExistingCredits< 1.5 32 13 0 (0.5937500 0.4062500)
## 424) Age>=32.5 22 6 0 (0.7272727 0.2727273) *
## 425) Age< 32.5 10 3 1 (0.3000000 0.7000000) *
## 213) NumberExistingCredits>=1.5 20 5 1 (0.2500000 0.7500000) *
## 107) SavingsAccountBonds.lt.100< 0.5 53 13 1 (0.2452830 0.7547170)
## 214) SavingsAccountBonds.100.to.500>=0.5 24 10 1 (0.4166667 0.5833333)
## 428) Job.SkilledEmployee< 0.5 7 1 0 (0.8571429 0.1428571) *
## 429) Job.SkilledEmployee>=0.5 17 4 1 (0.2352941 0.7647059) *
## 215) SavingsAccountBonds.100.to.500< 0.5 29 3 1 (0.1034483 0.8965517) *
## 27) Property.RealEstate>=0.5 58 11 1 (0.1896552 0.8103448)
## 54) Duration>=22 7 2 0 (0.7142857 0.2857143) *
## 55) Duration< 22 51 6 1 (0.1176471 0.8823529) *
## 7) CheckingAccountStatus.0.to.200< 0.5 379 45 1 (0.1187335 0.8812665) *
Your observation: Since this is a classification model, we will need
to use as.factor on our predictor, in this case Class, we
then use all of our variables in the training data to make the tree
model. The output shows multiple values as an output, showing that there
are decision nodes, and subsequently leaf nodes.
rpart.plot(german_tree, extra = 1)
german_pred_train <- predict(german_tree, german.train, type = "class")
#confusion matrix for training
Cmatrix_train = table(true = german.train$Class,
pred = german_pred_train)
Cmatrix_train
## pred
## true 0 1
## 0 145 84
## 1 50 521
#MR
german_MR_train <- 1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
german_MR_train
## [1] 0.1675
Your observation: On the training data set we see the confusion matrix, with higher values on TP (True Positives), meaning we correctly predict/classify people who have good credit. Next we have higher values in the False Positive section, in this instance, it is worse than the False negative. Since we predicted true, they have good credit so we give them money, but in actuality they have bad credit risk, and may default on the payments. Next we have MR (Misclassification Rate) at 16.75%, which is relatively high, but is still low.
german_pred_test <- predict(german_tree, german.test, type = "class")
#confusion matrix for testing
Cmatrix_test = table(true = german.test$Class,
pred = german_pred_test)
Cmatrix_test
## pred
## true 0 1
## 0 36 35
## 1 26 103
#MR
german_MR_test <- 1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
german_MR_test
## [1] 0.305
Your observation: The confusion matrix here reacts the same as the previous one on training data, but now we used testing. The MR has increased from 16.75% on training to 30.5%, which is expected, but still a drastic increase. It would be better if we were to lower this rate, on both ends.
# obtain predicted probability
pred_prob_test = predict(german_tree, german.test, type = "prob")
pred_prob_test = pred_prob_test[,"1"]
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.3.3
#ROC
pred <- prediction(pred_prob_test, german.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6742548
your observation: The ROC curve is very smooth and has a poor slope, the AUC while still predicts higher than a random guess, it is still look making it a poor model of predicting.
We will use the built-in mtcars dataset to predict miles per gallon (mpg) using other car characteristics. The dataset includes information about 32 cars from Motor Trend magazine (1973-74).
# Load the mtcars dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
set.seed(2024)
index <- sample(1:nrow(mtcars),nrow(mtcars)*0.85)
mtcars.train = mtcars[index,]
mtcars.test = mtcars[-index,]
cars_tree <- rpart(mpg~ ., data=mtcars.train, method = "anova")
cars_tree
## n= 27
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 27 1004.6190 20.50741
## 2) cyl>=5 17 186.7424 16.84706 *
## 3) cyl< 5 10 202.9010 26.73000 *
rpart.plot(cars_tree, extra = 1)
Your observation:There are only two leaf nodes, only one decision node, in this case the root node. With 27 observations overall.
cars_pred_train <- predict(cars_tree, mtcars.train)
# MSE
cars_train <- mean((mtcars.train$mpg - cars_pred_train)^2)
cars_train
## [1] 14.43124
your observation: A MSE predicts how well the model predictions are close to the actual results, a lower MSE means the model predicts closer to the truth. A 14.43 MSE is low, not as low as we would like, but still low.
# Calculate SST (Total Sum of Squares)
sst <- sum((mtcars.train$mpg - mean(mtcars.train$mpg))^2)
# Calculate SSR (Residual Sum of Squares)
ssr <- sum((mtcars.train$mpg - cars_pred_train)^2)
# Calculate R-squared
r_squared <- 1 - (ssr / sst)
r_squared
## [1] 0.6121479
Your observation: R squared tells you how much variation can be explained by the model, higher the better in most cases. Here the R^2 is only 0.61 which is both good and ok, it’s still high, but we want it to be higher.
cars_pred_test <- predict(cars_tree, mtcars.test)
# MSE
cars_test <- mean((mtcars.test$mpg - cars_pred_test)^2)
cars_test
## [1] 2.619646
your observation: Now on the testing data set, with a significant amount less of observations, the MSE for testing is significantly better than the one for training, meaning our models predicts even better/closer to the actual values.
# Calculate SST (Total Sum of Squares)
sst_test <- sum((mtcars.test$mpg - mean(mtcars.test$mpg))^2)
# Calculate SSR (Residual Sum of Squares)
ssr_test <- sum((mtcars.test$mpg - cars_pred_test)^2)
# Calculate R-squared
r_squared_test <- 1 - (ssr_test / sst_test)
r_squared_test
## [1] 0.8567122
Your observation: The R^2 value is significantly larger than on the training, meaning we are able to explain the variation better.