We will be examine the Space Shuttle Launch Data recorded after the 1986 explosion of the space shuttle Challenger. An investigation of the explosion was conducted and data is the report the provided after the investigation. From the below table we can see the there are 4 variables, one dependent variable distress_ct and 2 dependent variables, temperature and field check pressure, the last variable is an identifier code for the 23 shuttles launches observed.
launch <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/challenger.csv")
# examine the launch data
str(launch)
'data.frame': 23 obs. of 4 variables:
$ distress_ct : int 0 1 0 0 0 0 0 0 1 1 ...
$ temperature : int 66 70 69 68 67 72 73 70 57 63 ...
$ field_check_pressure: int 50 50 50 50 50 50 100 100 200 200 ...
$ flight_num : int 1 2 3 4 5 6 7 8 9 10 ...
#View(launch)
## Descriptive Statistics
# First recode the distress_ct variable into 0 and 1, making 1 to represent at least
# one failure during a launch.
summary(launch)
distress_ct temperature field_check_pressure flight_num
Min. :0.0000 Min. :53.00 Min. : 50.0 Min. : 1.0
1st Qu.:0.0000 1st Qu.:67.00 1st Qu.: 75.0 1st Qu.: 6.5
Median :0.0000 Median :70.00 Median :200.0 Median :12.0
Mean :0.3913 Mean :69.57 Mean :152.2 Mean :12.0
3rd Qu.:1.0000 3rd Qu.:75.00 3rd Qu.:200.0 3rd Qu.:17.5
Max. :2.0000 Max. :81.00 Max. :200.0 Max. :23.0
launch$distress_ct <- ifelse(launch$distress_ct<1,0,1)
launch$distress_ctL <- factor(launch$distress_ct, levels = c(0,1), labels = c("No Failures", "At least One Failure"))
# Boxplot of MPG by Car Cylinders
boxplot(temperature~distress_ctL,data=launch, main="Distribution of Tempature by Failures",
xlab="Failures", ylab="Temperature")
From the box plot above, we notice that shuttle launches that happened on warmer days were more likely to be successful than those on colder days. We will now run the logistic regression to confirm this observation with statistically certainty.
We will first step up our train data set and verify that there are no missing values.
# set up trainning and test data sets
indx = sample(1:nrow(launch), as.integer(0.9*nrow(launch)))
launch_train = launch[indx,1:4]
launch_test = launch[-indx,1:4]
launch_train_labels = launch[indx,1]
launch_test_labels = launch[-indx,1]
# Check if there are any missing values
#install.packages("Amelia")
library(Amelia)
missmap(launch, main = "Missing values vs observed")
After we setup the test data set, we used the miss map function from the Amelia package to create a graph to show us if we had any missing values. From the graph above it does not seem that we have any missing values but we will take a closer look to verify. It is important to verify that there are no missing values because missing values have a significantly negative effect on the performance of the logistic regression model.
# number of missing values in each column
sapply(launch,function(x) sum(is.na(x)))
distress_ct temperature field_check_pressure flight_num distress_ctL
0 0 0 0 0
# number of unique values in each column
sapply(launch, function(x) length(unique(x)))
distress_ct temperature field_check_pressure flight_num distress_ctL
2 16 3 23 2
Now that the we have verified there are no missing values, we can now fit the model and run the analysis.
# fit the logistic regression model, with all predictor variables
model <- glm(distress_ct ~.,family=binomial(link='logit'),data=launch_train)
model
Call: glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
data = launch_train)
Coefficients:
(Intercept) temperature field_check_pressure flight_num
22.49654 -0.35787 0.02209 -0.23676
Degrees of Freedom: 19 Total (i.e. Null); 16 Residual
Null Deviance: 22.49
Residual Deviance: 11.86 AIC: 19.86
summary(model)
Call:
glm(formula = distress_ct ~ ., family = binomial(link = "logit"),
data = launch_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.11859 -0.48847 -0.21292 -0.00989 2.02954
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 22.49654 14.58126 1.543 0.1229
temperature -0.35787 0.21420 -1.671 0.0948 .
field_check_pressure 0.02209 0.02355 0.938 0.3482
flight_num -0.23676 0.26362 -0.898 0.3691
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 22.493 on 19 degrees of freedom
Residual deviance: 11.862 on 16 degrees of freedom
AIC: 19.862
Number of Fisher Scoring iterations: 7
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: distress_ct
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 19 22.493
temperature 1 9.5950 18 12.898 0.001951 **
field_check_pressure 1 0.0921 17 12.806 0.761494
flight_num 1 0.9447 16 11.862 0.331073
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the above table, we see the output of the model. We want to improve the model so we can use a step wise process to remove insignificant variables.
# drop the insignificant predictors, alpha = 0.10
library(gdata)
drop1(model, test = "Chisq")
Single term deletions
Model:
distress_ct ~ temperature + field_check_pressure + flight_num
Df Deviance AIC LRT Pr(>Chi)
<none> 11.862 19.862
temperature 1 20.352 26.352 8.4908 0.00357 **
field_check_pressure 1 12.859 18.859 0.9975 0.31792
flight_num 1 12.806 18.806 0.9447 0.33107
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the table below we see, field check pressure and flight number are not significant at an alpha of .10. We will drop these variables and rerun the model.
model <- glm(distress_ct ~ temperature,family=binomial(link='logit'),data=launch_train)
summary(model)
Call:
glm(formula = distress_ct ~ temperature, family = binomial(link = "logit"),
data = launch_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.95163 -0.57270 -0.24177 -0.01614 1.97915
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 20.0521 10.4929 1.911 0.056 .
temperature -0.3123 0.1550 -2.014 0.044 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 22.493 on 19 degrees of freedom
Residual deviance: 12.898 on 18 degrees of freedom
AIC: 16.898
Number of Fisher Scoring iterations: 6
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: distress_ct
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 19 22.493
temperature 1 9.595 18 12.898 0.001951 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We see that the model has improve with the deletion of the two other variables. We know what to see check the predictability of the new model.
# check Accuracy
fitted.results <- predict(model,newdata=launch_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != launch_test$distress_ct)
#print(paste('Accuracy',round(1-misClasificError,2)*100,"%"))
#Because this data set is so small, the test data set does not contain both 0 and 1 values and the code will not run. And since the test data set is so small the ROC is not useful here but the code is provided.
#install.packages("ROCR")
#library(ROCR)
#p <- predict(model, newdata=launch_test, type="response")
#pr <- prediction(p, launch_test$distress_ct)
#prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#auc <- performance(pr, measure = "auc")
#auc <- auc@y.values[[1]]
#auc
We see that the accuracy of the model is 72 % which is pretty good considering that we were working with a small data set.
We will be conducting a logistic regression analysis using the credit approval data set. The data set is a collection of credit card applications and the credit approval decisions. In this analysis I will also be including a regression tree (CART) model. We will built a model and test is predictability of the outcomes of credit applications.
Below we will load the data set and use the str() function to get a quick understanding of the data types included in the data set.
library(readr)
cred <- read_csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv")
# examine the launch data
str(cred)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1000 obs. of 17 variables:
$ checking_balance : chr "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
$ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
$ credit_history : chr "critical" "good" "critical" "good" ...
$ purpose : chr "furniture/appliances" "furniture/appliances" "education" "furniture/appliances" ...
$ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ savings_balance : chr "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
$ employment_duration : chr "> 7 years" "1 - 4 years" "4 - 7 years" "4 - 7 years" ...
$ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
$ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
$ age : int 67 22 49 45 53 35 53 35 61 28 ...
$ other_credit : chr "none" "none" "none" "none" ...
$ housing : chr "own" "own" "own" "other" ...
$ existing_loans_count: int 2 1 1 1 2 1 1 1 1 2 ...
$ job : chr "skilled" "skilled" "unskilled" "skilled" ...
$ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
$ phone : chr "yes" "no" "no" "no" ...
$ default : chr "no" "yes" "no" "no" ...
- attr(*, "spec")=List of 2
..$ cols :List of 17
.. ..$ checking_balance : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ months_loan_duration: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ credit_history : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ purpose : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ amount : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ savings_balance : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ employment_duration : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ percent_of_income : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ years_at_residence : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ age : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ other_credit : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ housing : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ existing_loans_count: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ job : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ dependents : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ phone : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ default : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
From the table above we see that dependent variable is character type and we need to transform it to numeric values 0 and 1. Having the data as a character variables is meaningless in terms of the model. We also want to check for missing values and determine how to address them. There are three approaches:remove them, zero them out or estimate a plug value. For this analysis, if we find any missing values we will zero them out.
## Fix the default variable to be 0 or 1
cred$default <- ifelse(cred$default =="yes", 1, 0)
# Check if there are any missing values
library(Amelia)
missmap(cred, main = "Missing values vs observed")
# number of missing values in each column
sapply(cred,function(x) sum(is.na(x)))
checking_balance months_loan_duration credit_history purpose amount
0 0 0 0 0
savings_balance employment_duration percent_of_income years_at_residence age
0 0 0 0 0
other_credit housing existing_loans_count job dependents
0 0 0 0 0
phone default
0 0
# number of unique values in each column
sapply(cred, function(x) length(unique(x)))
checking_balance months_loan_duration credit_history purpose amount
4 33 5 6 921
savings_balance employment_duration percent_of_income years_at_residence age
5 5 4 4 53
other_credit housing existing_loans_count job dependents
3 3 4 4 2
phone default
2 2
From the above graphs and charts, we do not have any missing values thus we can proceed with the analysis.
library(plotrix)
library(sqldf)
Ye <- sum(cred$default)
N <- length(cred$default) - sum(cred$default)
slices <- c(Ye,N)
lbls <- c("Approved", "Not Approved")
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie3D(slices, labels = lbls, explode = .01 , main = "Pie Chart of Credit Application Approval", labelpos = c(1.5,5))
We will now create the train and test data sets and run the logistic regression model.
# logisitic regression
# set up trainning and test data sets
indx = sample(1:nrow(cred), as.integer(0.9*nrow(cred)))
#indx
cred_train = cred[indx,]
cred_test = cred[-indx,]
cred_train_labels = cred[indx,17]
cred_test_labels = cred[-indx,17]
# fit the logistic regression model, with all predictor variables
model <- glm(default ~.,family=binomial(link='logit'),data=cred_train)
model
Call: glm(formula = default ~ ., family = binomial(link = "logit"),
data = cred_train)
Coefficients:
(Intercept) checking_balance> 200 DM checking_balance1 - 200 DM
-1.221e+00 -8.600e-01 -4.755e-01
checking_balanceunknown months_loan_duration credit_historygood
-1.722e+00 2.777e-02 8.061e-01
credit_historyperfect credit_historypoor credit_historyvery good
1.376e+00 6.147e-01 1.306e+00
purposecar purposecar0 purposeeducation
2.796e-01 -1.030e+00 6.856e-01
purposefurniture/appliances purposerenovations amount
-1.303e-01 3.688e-01 9.464e-05
savings_balance> 1000 DM savings_balance100 - 500 DM savings_balance500 - 1000 DM
-1.046e+00 -2.049e-01 -5.771e-01
savings_balanceunknown employment_duration> 7 years employment_duration1 - 4 years
-8.369e-01 -2.788e-01 -2.002e-01
employment_duration4 - 7 years employment_durationunemployed percent_of_income
-8.475e-01 -1.874e-01 2.665e-01
years_at_residence age other_creditnone
1.995e-02 -1.232e-02 -6.470e-01
other_creditstore housingown housingrent
-1.359e-01 -1.696e-01 2.317e-01
existing_loans_count jobskilled jobunemployed
2.322e-01 4.105e-02 -2.606e-01
jobunskilled dependents phoneyes
-4.921e-02 -1.316e-01 -2.069e-01
Degrees of Freedom: 899 Total (i.e. Null); 864 Residual
Null Deviance: 1091
Residual Deviance: 857.3 AIC: 929.3
summary(model)
Call:
glm(formula = default ~ ., family = binomial(link = "logit"),
data = cred_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9050 -0.7517 -0.4221 0.8077 2.5871
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.221e+00 9.349e-01 -1.306 0.19156
checking_balance> 200 DM -8.600e-01 3.720e-01 -2.312 0.02079 *
checking_balance1 - 200 DM -4.755e-01 2.180e-01 -2.181 0.02918 *
checking_balanceunknown -1.722e+00 2.346e-01 -7.342 2.1e-13 ***
months_loan_duration 2.777e-02 9.443e-03 2.940 0.00328 **
credit_historygood 8.061e-01 2.614e-01 3.084 0.00204 **
credit_historyperfect 1.376e+00 4.230e-01 3.254 0.00114 **
credit_historypoor 6.147e-01 3.400e-01 1.808 0.07058 .
credit_historyvery good 1.306e+00 4.347e-01 3.004 0.00266 **
purposecar 2.796e-01 3.214e-01 0.870 0.38427
purposecar0 -1.030e+00 8.004e-01 -1.287 0.19815
purposeeducation 6.856e-01 4.329e-01 1.584 0.11320
purposefurniture/appliances -1.303e-01 3.152e-01 -0.414 0.67920
purposerenovations 3.688e-01 5.941e-01 0.621 0.53477
amount 9.464e-05 4.440e-05 2.132 0.03304 *
savings_balance> 1000 DM -1.046e+00 5.001e-01 -2.092 0.03648 *
savings_balance100 - 500 DM -2.049e-01 2.859e-01 -0.717 0.47354
savings_balance500 - 1000 DM -5.771e-01 4.437e-01 -1.301 0.19342
savings_balanceunknown -8.369e-01 2.575e-01 -3.250 0.00116 **
employment_duration> 7 years -2.788e-01 2.955e-01 -0.943 0.34545
employment_duration1 - 4 years -2.002e-01 2.404e-01 -0.833 0.40491
employment_duration4 - 7 years -8.475e-01 3.045e-01 -2.783 0.00539 **
employment_durationunemployed -1.874e-01 4.469e-01 -0.419 0.67490
percent_of_income 2.665e-01 8.862e-02 3.007 0.00264 **
years_at_residence 1.995e-02 8.756e-02 0.228 0.81981
age -1.232e-02 9.037e-03 -1.363 0.17289
other_creditnone -6.470e-01 2.380e-01 -2.718 0.00657 **
other_creditstore -1.359e-01 4.140e-01 -0.328 0.74275
housingown -1.696e-01 2.981e-01 -0.569 0.56937
housingrent 2.317e-01 3.411e-01 0.679 0.49690
existing_loans_count 2.322e-01 1.923e-01 1.207 0.22738
jobskilled 4.105e-02 2.844e-01 0.144 0.88526
jobunemployed -2.606e-01 6.759e-01 -0.386 0.69979
jobunskilled -4.921e-02 3.486e-01 -0.141 0.88773
dependents -1.316e-01 2.465e-01 -0.534 0.59342
phoneyes -2.069e-01 2.005e-01 -1.032 0.30207
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1090.95 on 899 degrees of freedom
Residual deviance: 857.26 on 864 degrees of freedom
AIC: 929.26
Number of Fisher Scoring iterations: 5
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: default
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 899 1090.95
checking_balance 3 112.232 896 978.72 < 2.2e-16 ***
months_loan_duration 1 38.630 895 940.09 5.123e-10 ***
credit_history 4 23.551 891 916.54 9.827e-05 ***
purpose 5 7.853 886 908.68 0.164539
amount 1 0.414 885 908.27 0.519718
savings_balance 4 15.869 881 892.40 0.003200 **
employment_duration 4 10.037 877 882.36 0.039816 *
percent_of_income 1 8.413 876 873.95 0.003725 **
years_at_residence 1 0.188 875 873.76 0.664428
age 1 2.749 874 871.01 0.097306 .
other_credit 2 7.799 872 863.21 0.020247 *
housing 2 3.023 870 860.19 0.220528
existing_loans_count 1 1.189 869 859.00 0.275509
job 3 0.388 866 858.61 0.942683
dependents 1 0.281 865 858.33 0.596134
phone 1 1.071 864 857.26 0.300799
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We now want to drop the insignificant predictors from the above output. This will help to improve the overall predictability of the model.
# drop the insignificant predictors, alpha = 0.10
model <- glm(default ~ checking_balance + months_loan_duration + credit_history + percent_of_income + age,family=binomial(link='logit'),data=cred_train)
model
Call: glm(formula = default ~ checking_balance + months_loan_duration +
credit_history + percent_of_income + age, family = binomial(link = "logit"),
data = cred_train)
Coefficients:
(Intercept) checking_balance> 200 DM checking_balance1 - 200 DM
-1.38751 -1.04269 -0.59932
checking_balanceunknown months_loan_duration credit_historygood
-1.89752 0.03488 0.54507
credit_historyperfect credit_historypoor credit_historyvery good
1.58205 0.55976 1.25426
percent_of_income age
0.16386 -0.01137
Degrees of Freedom: 899 Total (i.e. Null); 889 Residual
Null Deviance: 1091
Residual Deviance: 909.6 AIC: 931.6
summary(model)
Call:
glm(formula = default ~ checking_balance + months_loan_duration +
credit_history + percent_of_income + age, family = binomial(link = "logit"),
data = cred_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7486 -0.7881 -0.4808 0.8796 2.3638
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.387511 0.429112 -3.233 0.001223 **
checking_balance> 200 DM -1.042690 0.353004 -2.954 0.003139 **
checking_balance1 - 200 DM -0.599317 0.198958 -3.012 0.002593 **
checking_balanceunknown -1.897523 0.217381 -8.729 < 2e-16 ***
months_loan_duration 0.034877 0.006637 5.255 1.48e-07 ***
credit_historygood 0.545071 0.206717 2.637 0.008369 **
credit_historyperfect 1.582047 0.398854 3.966 7.29e-05 ***
credit_historypoor 0.559760 0.318190 1.759 0.078544 .
credit_historyvery good 1.254261 0.379193 3.308 0.000941 ***
percent_of_income 0.163864 0.074465 2.201 0.027768 *
age -0.011367 0.007391 -1.538 0.124059
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1091.0 on 899 degrees of freedom
Residual deviance: 909.6 on 889 degrees of freedom
AIC: 931.6
Number of Fisher Scoring iterations: 4
anova(model, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: default
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 899 1090.95
checking_balance 3 112.232 896 978.72 < 2.2e-16 ***
months_loan_duration 1 38.630 895 940.09 5.123e-10 ***
credit_history 4 23.551 891 916.54 9.827e-05 ***
percent_of_income 1 4.526 890 912.01 0.03339 *
age 1 2.410 889 909.60 0.12054
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now that we we have our final model, we want to check the accuracy of the model.
# check Accuracy
fitted.results <- predict(model,newdata=cred_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != cred_test$default)
print(paste('Accuracy',1-misClasificError))
[1] "Accuracy 0.72"
# ROC
# Because this data set is so small, it is possible that the test data set
# does not contain both 0 and 1 values. If this happens the code will not
# run. And since the test data set is so small the ROC is not useful here
# but the code is provided.
#library(ROCR)
#p <- predict(model, newdata=credit_test, type="response")
#pr <- prediction(p, credit_test$default)
#prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#auc <- performance(pr, measure = "auc")
#auc <- auc@y.values[[1]]
#auc
From the output we see that our accuracy of the model is . And we can see that this very good. But lets see if using a tree base model would give a better prediction.
####Classification Trees ------------------
##Training a model on the data ----
# regression tree using rpart
library(rpart)
m.rpart <- rpart(default ~ ., data = cred_train)
# get basic information about the tree
m.rpart
n= 900
node), split, n, deviance, yval
* denotes terminal node
1) root 900 186.972200 0.2944444
2) checking_balance=> 200 DM,unknown 417 48.479620 0.1342926
4) other_credit=none 346 31.459540 0.1011561 *
5) other_credit=bank,store 71 14.788730 0.2957746 *
3) checking_balance=< 0 DM,1 - 200 DM 483 118.563100 0.4327122
6) months_loan_duration< 31.5 385 90.140260 0.3740260
12) credit_history=critical,good,poor 341 75.888560 0.3343109
24) months_loan_duration< 11.5 73 10.027400 0.1643836 *
25) months_loan_duration>=11.5 268 63.179100 0.3805970
50) amount>=1381.5 194 42.185570 0.3195876 *
51) amount< 1381.5 74 18.378380 0.5405405
102) purpose=business,furniture/appliances,renovations 38 8.842105 0.3684211 *
103) purpose=car,education 36 7.222222 0.7222222 *
13) credit_history=perfect,very good 44 9.545455 0.6818182 *
7) months_loan_duration>=31.5 98 21.887760 0.6632653
14) employment_duration=unemployed 8 0.000000 0.0000000 *
15) employment_duration=< 1 year,> 7 years,1 - 4 years,4 - 7 years 90 18.055560 0.7222222 *
# get more detailed information about the tree
summary(m.rpart)
Call:
rpart(formula = default ~ ., data = cred_train)
n= 900
CP nsplit rel error xerror xstd
1 0.10659048 0 1.0000000 1.0037707 0.03018677
2 0.03495242 1 0.8934095 0.8974933 0.03028442
3 0.02517081 2 0.8584571 0.8892272 0.03333667
4 0.02049609 3 0.8332863 0.8670734 0.03419518
5 0.01434470 4 0.8127902 0.8637588 0.03586516
6 0.01398689 5 0.7984455 0.8782908 0.03761752
7 0.01237644 6 0.7844586 0.8906073 0.03877673
8 0.01193411 7 0.7720822 0.8894670 0.03907911
9 0.01000000 8 0.7601481 0.9065333 0.04035903
Variable importance
checking_balance months_loan_duration credit_history amount employment_duration
35 18 13 8 7
savings_balance purpose other_credit age existing_loans_count
6 5 4 2 1
job
1
Node number 1: 900 observations, complexity param=0.1065905
mean=0.2944444, MSE=0.2077469
left son=2 (417 obs) right son=3 (483 obs)
Primary splits:
checking_balance splits as RLRL, improve=0.10659050, (0 missing)
credit_history splits as LLRLR, improve=0.03940550, (0 missing)
months_loan_duration < 34.5 to the left, improve=0.03657627, (0 missing)
savings_balance splits as RLRLL, improve=0.03318005, (0 missing)
amount < 3913.5 to the left, improve=0.03246751, (0 missing)
Surrogate splits:
savings_balance splits as RLRLL, agree=0.607, adj=0.151, (0 split)
credit_history splits as LRRRR, agree=0.590, adj=0.115, (0 split)
existing_loans_count < 1.5 to the right, agree=0.554, adj=0.038, (0 split)
age < 30.5 to the right, agree=0.552, adj=0.034, (0 split)
months_loan_duration < 6.5 to the left, agree=0.550, adj=0.029, (0 split)
Node number 2: 417 observations, complexity param=0.01193411
mean=0.1342926, MSE=0.1162581
left son=4 (346 obs) right son=5 (71 obs)
Primary splits:
other_credit splits as RLR, improve=0.04602649, (0 missing)
employment_duration splits as RLRLR, improve=0.02260084, (0 missing)
amount < 4158 to the left, improve=0.02157993, (0 missing)
purpose splits as RLLLLR, improve=0.01944512, (0 missing)
months_loan_duration < 16.5 to the left, improve=0.01509109, (0 missing)
Surrogate splits:
credit_history splits as LLLLR, agree=0.847, adj=0.099, (0 split)
months_loan_duration < 45 to the left, agree=0.832, adj=0.014, (0 split)
purpose splits as LLRLLL, agree=0.832, adj=0.014, (0 split)
Node number 3: 483 observations, complexity param=0.03495242
mean=0.4327122, MSE=0.2454724
left son=6 (385 obs) right son=7 (98 obs)
Primary splits:
months_loan_duration < 31.5 to the left, improve=0.05511942, (0 missing)
credit_history splits as LLRLR, improve=0.03813898, (0 missing)
savings_balance splits as RLRLL, improve=0.03212835, (0 missing)
amount < 3998 to the left, improve=0.02978766, (0 missing)
housing splits as RLR, improve=0.01513721, (0 missing)
Surrogate splits:
amount < 6648 to the left, agree=0.847, adj=0.245, (0 split)
Node number 4: 346 observations
mean=0.1011561, MSE=0.09092352
Node number 5: 71 observations
mean=0.2957746, MSE=0.208292
Node number 6: 385 observations, complexity param=0.02517081
mean=0.374026, MSE=0.2341305
left son=12 (341 obs) right son=13 (44 obs)
Primary splits:
credit_history splits as LLRLR, improve=0.05221021, (0 missing)
months_loan_duration < 11.5 to the left, improve=0.02781929, (0 missing)
savings_balance splits as RLRLL, improve=0.02194005, (0 missing)
amount < 8195 to the left, improve=0.01666578, (0 missing)
purpose splits as LRRRLL, improve=0.01580102, (0 missing)
Node number 7: 98 observations, complexity param=0.02049609
mean=0.6632653, MSE=0.2233444
left son=14 (8 obs) right son=15 (90 obs)
Primary splits:
employment_duration splits as RRRRL, improve=0.17508420, (0 missing)
savings_balance splits as RRRLL, improve=0.08808193, (0 missing)
age < 26.5 to the right, improve=0.05203600, (0 missing)
job splits as LRLR, improve=0.04773893, (0 missing)
checking_balance splits as R-L-, improve=0.04371296, (0 missing)
Surrogate splits:
purpose splits as RRLRRR, agree=0.929, adj=0.125, (0 split)
job splits as RRLR, agree=0.929, adj=0.125, (0 split)
Node number 12: 341 observations, complexity param=0.0143447
mean=0.3343109, MSE=0.2225471
left son=24 (73 obs) right son=25 (268 obs)
Primary splits:
months_loan_duration < 11.5 to the left, improve=0.03534210, (0 missing)
age < 25.5 to the right, improve=0.01954279, (0 missing)
amount < 8176 to the left, improve=0.01865354, (0 missing)
savings_balance splits as RLRLL, improve=0.01645713, (0 missing)
purpose splits as LRRRRR, improve=0.01610702, (0 missing)
Surrogate splits:
amount < 527.5 to the left, agree=0.806, adj=0.096, (0 split)
age < 66.5 to the right, agree=0.792, adj=0.027, (0 split)
Node number 13: 44 observations
mean=0.6818182, MSE=0.2169421
Node number 14: 8 observations
mean=0, MSE=0
Node number 15: 90 observations
mean=0.7222222, MSE=0.2006173
Node number 24: 73 observations
mean=0.1643836, MSE=0.1373616
Node number 25: 268 observations, complexity param=0.01398689
mean=0.380597, MSE=0.2357429
left son=50 (194 obs) right son=51 (74 obs)
Primary splits:
amount < 1381.5 to the right, improve=0.04139279, (0 missing)
checking_balance splits as R-L-, improve=0.02362391, (0 missing)
purpose splits as LRRRRR, improve=0.02343242, (0 missing)
savings_balance splits as RLRLL, improve=0.02090320, (0 missing)
credit_history splits as LR-L-, improve=0.01425536, (0 missing)
Surrogate splits:
purpose splits as LLLRLL, agree=0.739, adj=0.054, (0 split)
Node number 50: 194 observations
mean=0.3195876, MSE=0.2174514
Node number 51: 74 observations, complexity param=0.01237644
mean=0.5405405, MSE=0.2483565
left son=102 (38 obs) right son=103 (36 obs)
Primary splits:
purpose splits as LR-RLL, improve=0.12591160, (0 missing)
months_loan_duration < 22.5 to the left, improve=0.11025060, (0 missing)
amount < 1181.5 to the left, improve=0.07398612, (0 missing)
existing_loans_count < 1.5 to the right, improve=0.07026175, (0 missing)
other_credit splits as RLR, improve=0.06891326, (0 missing)
Surrogate splits:
months_loan_duration < 15.5 to the left, agree=0.608, adj=0.194, (0 split)
savings_balance splits as LRLLR, agree=0.608, adj=0.194, (0 split)
amount < 1163 to the left, agree=0.595, adj=0.167, (0 split)
employment_duration splits as LRLRL, agree=0.581, adj=0.139, (0 split)
age < 22.5 to the left, agree=0.568, adj=0.111, (0 split)
Node number 102: 38 observations
mean=0.3684211, MSE=0.232687
Node number 103: 36 observations
mean=0.7222222, MSE=0.2006173
# use the rpart.plot package to create a visualization
library(rpart.plot)
# a basic decision tree diagram
rpart.plot(m.rpart, digits = 3)
# a few adjustments to the diagram
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)
Above we can see that output of the tree base model. And we see that the same predictors we found relevant in the logistic model are also deem important in the tree base model as well. But now lets see what accuracy of this model well be.
## Step 4: Evaluate model performance ----
# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, cred_test)
# compare the distribution of predicted values vs. actual values
summary(p.rpart)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1012 0.1012 0.3196 0.3140 0.3684 0.7222
summary(cred_test$default)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 0.00 0.00 0.35 1.00 1.00
# compare the correlation
cor(p.rpart, as.numeric(cred_test$default))
[1] 0.3930203
# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
mean(abs(actual - predicted))
}
# mean absolute error between predicted and actual values
MAE(p.rpart, as.numeric(cred_test$default))
[1] 0.3608024
# mean absolute error between actual values and mean value
mean(as.numeric(cred_train$default))
[1] 0.2944444
MAE(1.30, as.numeric(cred_train$default))
[1] 1.005556