Logistic Regression

Space Shuttle Launch Data

We will be examine the Space Shuttle Launch Data recorded after the 1986 explosion of the space shuttle Challenger. An investigation of the explosion was conducted and data is the report the provided after the investigation. From the below table we can see the there are 4 variables, one dependent variable distress_ct and 2 dependent variables, temperature and field check pressure, the last variable is an identifier code for the 23 shuttles launches observed.

launch <- read.csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml10/challenger.csv")
# examine the launch data
str(launch)

'data.frame':   23 obs. of  4 variables:
 $ distress_ct         : int  0 1 0 0 0 0 0 0 1 1 ...
 $ temperature         : int  66 70 69 68 67 72 73 70 57 63 ...
 $ field_check_pressure: int  50 50 50 50 50 50 100 100 200 200 ...
 $ flight_num          : int  1 2 3 4 5 6 7 8 9 10 ...

#View(launch)
## Descriptive Statistics
# First recode the distress_ct variable into 0 and 1, making 1 to represent at least
# one failure during a launch.
summary(launch)

  distress_ct      temperature    field_check_pressure   flight_num  
 Min.   :0.0000   Min.   :53.00   Min.   : 50.0        Min.   : 1.0  
 1st Qu.:0.0000   1st Qu.:67.00   1st Qu.: 75.0        1st Qu.: 6.5  
 Median :0.0000   Median :70.00   Median :200.0        Median :12.0  
 Mean   :0.3913   Mean   :69.57   Mean   :152.2        Mean   :12.0  
 3rd Qu.:1.0000   3rd Qu.:75.00   3rd Qu.:200.0        3rd Qu.:17.5  
 Max.   :2.0000   Max.   :81.00   Max.   :200.0        Max.   :23.0

launch$distress_ct <- ifelse(launch$distress_ct<1,0,1)
launch$distress_ctL <- factor(launch$distress_ct, levels = c(0,1), labels = c("No Failures", "At least One Failure"))
# Boxplot of MPG by Car Cylinders 
boxplot(temperature~distress_ctL,data=launch, main="Distribution of Tempature by Failures", 
    xlab="Failures", ylab="Temperature")

From the box plot above, we notice that shuttle launches that happened on warmer days were more likely to be successful than those on colder days. We will now run the logistic regression to confirm this observation with statistically certainty.

We will first step up our train data set and verify that there are no missing values.

# set up trainning and test data sets
indx = sample(1:nrow(launch), as.integer(0.9*nrow(launch)))
launch_train = launch[indx,1:4]
launch_test = launch[-indx,1:4]
launch_train_labels = launch[indx,1]
launch_test_labels = launch[-indx,1]   
# Check if there are any missing values
#install.packages("Amelia")
library(Amelia)
missmap(launch, main = "Missing values vs observed")

After we setup the test data set, we used the miss map function from the Amelia package to create a graph to show us if we had any missing values. From the graph above it does not seem that we have any missing values but we will take a closer look to verify. It is important to verify that there are no missing values because missing values have a significantly negative effect on the performance of the logistic regression model.

# number of missing values in each column
sapply(launch,function(x) sum(is.na(x)))

         distress_ct          temperature field_check_pressure           flight_num         distress_ctL 
                   0                    0                    0                    0                    0

# number of unique values in each column
sapply(launch, function(x) length(unique(x)))

         distress_ct          temperature field_check_pressure           flight_num         distress_ctL 
                   2                   16                    3                   23                    2

Now that the we have verified there are no missing values, we can now fit the model and run the analysis.

# fit the logistic regression model, with all predictor variables
model <- glm(distress_ct ~.,family=binomial(link='logit'),data=launch_train)
model


Call:  glm(formula = distress_ct ~ ., family = binomial(link = "logit"), 
    data = launch_train)

Coefficients:
         (Intercept)           temperature  field_check_pressure            flight_num  
            22.49654              -0.35787               0.02209              -0.23676  

Degrees of Freedom: 19 Total (i.e. Null);  16 Residual
Null Deviance:      22.49 
Residual Deviance: 11.86    AIC: 19.86

summary(model)


Call:
glm(formula = distress_ct ~ ., family = binomial(link = "logit"), 
    data = launch_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.11859  -0.48847  -0.21292  -0.00989   2.02954  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)  
(Intercept)          22.49654   14.58126   1.543   0.1229  
temperature          -0.35787    0.21420  -1.671   0.0948 .
field_check_pressure  0.02209    0.02355   0.938   0.3482  
flight_num           -0.23676    0.26362  -0.898   0.3691  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22.493  on 19  degrees of freedom
Residual deviance: 11.862  on 16  degrees of freedom
AIC: 19.862

Number of Fisher Scoring iterations: 7

anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: distress_ct

Terms added sequentially (first to last)

                     Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
NULL                                    19     22.493            
temperature           1   9.5950        18     12.898 0.001951 **
field_check_pressure  1   0.0921        17     12.806 0.761494   
flight_num            1   0.9447        16     11.862 0.331073   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the above table, we see the output of the model. We want to improve the model so we can use a step wise process to remove insignificant variables.

# drop the insignificant predictors, alpha = 0.10
library(gdata)
drop1(model, test = "Chisq")

Single term deletions

Model:
distress_ct ~ temperature + field_check_pressure + flight_num
                     Df Deviance    AIC    LRT Pr(>Chi)   
<none>                    11.862 19.862                   
temperature           1   20.352 26.352 8.4908  0.00357 **
field_check_pressure  1   12.859 18.859 0.9975  0.31792   
flight_num            1   12.806 18.806 0.9447  0.33107   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the table below we see, field check pressure and flight number are not significant at an alpha of .10. We will drop these variables and rerun the model.

model <- glm(distress_ct ~ temperature,family=binomial(link='logit'),data=launch_train)
summary(model)


Call:
glm(formula = distress_ct ~ temperature, family = binomial(link = "logit"), 
    data = launch_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.95163  -0.57270  -0.24177  -0.01614   1.97915  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  20.0521    10.4929   1.911    0.056 .
temperature  -0.3123     0.1550  -2.014    0.044 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22.493  on 19  degrees of freedom
Residual deviance: 12.898  on 18  degrees of freedom
AIC: 16.898

Number of Fisher Scoring iterations: 6

anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: distress_ct

Terms added sequentially (first to last)

            Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
NULL                           19     22.493            
temperature  1    9.595        18     12.898 0.001951 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We see that the model has improve with the deletion of the two other variables. We know what to see check the predictability of the new model.

# check Accuracy
fitted.results <- predict(model,newdata=launch_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != launch_test$distress_ct)
#print(paste('Accuracy',round(1-misClasificError,2)*100,"%"))
#Because this data set is so small, the test data set does not contain both 0 and 1 values and the code will not run. And since the test data set is so small the ROC is not useful here but the code is provided.
#install.packages("ROCR")
#library(ROCR)
#p <- predict(model, newdata=launch_test, type="response")
#pr <- prediction(p, launch_test$distress_ct)
#prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#auc <- performance(pr, measure = "auc")
#auc <- auc@y.values[[1]]
#auc

We see that the accuracy of the model is 72 % which is pretty good considering that we were working with a small data set.

Logistic Regression and CART with Credit Dataset

We will be conducting a logistic regression analysis using the credit approval data set. The data set is a collection of credit card applications and the credit approval decisions. In this analysis I will also be including a regression tree (CART) model. We will built a model and test is predictability of the outcomes of credit applications.

Below we will load the data set and use the str() function to get a quick understanding of the data types included in the data set.

library(readr)
cred <- read_csv("http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml7/credit.csv")
# examine the launch data
str(cred)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   1000 obs. of  17 variables:
 $ checking_balance    : chr  "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
 $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history      : chr  "critical" "good" "critical" "good" ...
 $ purpose             : chr  "furniture/appliances" "furniture/appliances" "education" "furniture/appliances" ...
 $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings_balance     : chr  "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
 $ employment_duration : chr  "> 7 years" "1 - 4 years" "4 - 7 years" "4 - 7 years" ...
 $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
 $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
 $ other_credit        : chr  "none" "none" "none" "none" ...
 $ housing             : chr  "own" "own" "own" "other" ...
 $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
 $ job                 : chr  "skilled" "skilled" "unskilled" "skilled" ...
 $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
 $ phone               : chr  "yes" "no" "no" "no" ...
 $ default             : chr  "no" "yes" "no" "no" ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 17
  .. ..$ checking_balance    : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ months_loan_duration: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ credit_history      : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ purpose             : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ amount              : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ savings_balance     : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ employment_duration : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ percent_of_income   : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ years_at_residence  : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ age                 : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ other_credit        : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ housing             : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ existing_loans_count: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ job                 : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ dependents          : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ phone               : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ default             : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"

From the table above we see that dependent variable is character type and we need to transform it to numeric values 0 and 1. Having the data as a character variables is meaningless in terms of the model. We also want to check for missing values and determine how to address them. There are three approaches:remove them, zero them out or estimate a plug value. For this analysis, if we find any missing values we will zero them out.

## Fix the default variable to be 0 or 1
cred$default <- ifelse(cred$default =="yes", 1, 0)
# Check if there are any missing values
library(Amelia)
missmap(cred, main = "Missing values vs observed")

# number of missing values in each column
sapply(cred,function(x) sum(is.na(x)))

    checking_balance months_loan_duration       credit_history              purpose               amount 
                   0                    0                    0                    0                    0 
     savings_balance  employment_duration    percent_of_income   years_at_residence                  age 
                   0                    0                    0                    0                    0 
        other_credit              housing existing_loans_count                  job           dependents 
                   0                    0                    0                    0                    0 
               phone              default 
                   0                    0

# number of unique values in each column
sapply(cred, function(x) length(unique(x)))

    checking_balance months_loan_duration       credit_history              purpose               amount 
                   4                   33                    5                    6                  921 
     savings_balance  employment_duration    percent_of_income   years_at_residence                  age 
                   5                    5                    4                    4                   53 
        other_credit              housing existing_loans_count                  job           dependents 
                   3                    3                    4                    4                    2 
               phone              default 
                   2                    2

From the above graphs and charts, we do not have any missing values thus we can proceed with the analysis.

library(plotrix)
library(sqldf)
Ye <- sum(cred$default)
N <- length(cred$default) - sum(cred$default)
slices <- c(Ye,N)
lbls <- c("Approved", "Not Approved")
pct <- round(slices/sum(slices)*100)    
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie3D(slices, labels = lbls,  explode =  .01 , main = "Pie Chart of Credit Application Approval", labelpos = c(1.5,5))

We will now create the train and test data sets and run the logistic regression model.

# logisitic regression
# set up trainning and test data sets
indx = sample(1:nrow(cred), as.integer(0.9*nrow(cred)))
#indx
cred_train = cred[indx,]
cred_test = cred[-indx,]
cred_train_labels = cred[indx,17]
cred_test_labels = cred[-indx,17]   
# fit the logistic regression model, with all predictor variables
model <- glm(default ~.,family=binomial(link='logit'),data=cred_train)
model


Call:  glm(formula = default ~ ., family = binomial(link = "logit"), 
    data = cred_train)

Coefficients:
                   (Intercept)        checking_balance> 200 DM      checking_balance1 - 200 DM  
                    -1.221e+00                      -8.600e-01                      -4.755e-01  
       checking_balanceunknown            months_loan_duration              credit_historygood  
                    -1.722e+00                       2.777e-02                       8.061e-01  
         credit_historyperfect              credit_historypoor         credit_historyvery good  
                     1.376e+00                       6.147e-01                       1.306e+00  
                    purposecar                     purposecar0                purposeeducation  
                     2.796e-01                      -1.030e+00                       6.856e-01  
   purposefurniture/appliances              purposerenovations                          amount  
                    -1.303e-01                       3.688e-01                       9.464e-05  
      savings_balance> 1000 DM     savings_balance100 - 500 DM    savings_balance500 - 1000 DM  
                    -1.046e+00                      -2.049e-01                      -5.771e-01  
        savings_balanceunknown    employment_duration> 7 years  employment_duration1 - 4 years  
                    -8.369e-01                      -2.788e-01                      -2.002e-01  
employment_duration4 - 7 years   employment_durationunemployed               percent_of_income  
                    -8.475e-01                      -1.874e-01                       2.665e-01  
            years_at_residence                             age                other_creditnone  
                     1.995e-02                      -1.232e-02                      -6.470e-01  
             other_creditstore                      housingown                     housingrent  
                    -1.359e-01                      -1.696e-01                       2.317e-01  
          existing_loans_count                      jobskilled                   jobunemployed  
                     2.322e-01                       4.105e-02                      -2.606e-01  
                  jobunskilled                      dependents                        phoneyes  
                    -4.921e-02                      -1.316e-01                      -2.069e-01  

Degrees of Freedom: 899 Total (i.e. Null);  864 Residual
Null Deviance:      1091 
Residual Deviance: 857.3    AIC: 929.3

summary(model)


Call:
glm(formula = default ~ ., family = binomial(link = "logit"), 
    data = cred_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9050  -0.7517  -0.4221   0.8077   2.5871  

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -1.221e+00  9.349e-01  -1.306  0.19156    
checking_balance> 200 DM       -8.600e-01  3.720e-01  -2.312  0.02079 *  
checking_balance1 - 200 DM     -4.755e-01  2.180e-01  -2.181  0.02918 *  
checking_balanceunknown        -1.722e+00  2.346e-01  -7.342  2.1e-13 ***
months_loan_duration            2.777e-02  9.443e-03   2.940  0.00328 ** 
credit_historygood              8.061e-01  2.614e-01   3.084  0.00204 ** 
credit_historyperfect           1.376e+00  4.230e-01   3.254  0.00114 ** 
credit_historypoor              6.147e-01  3.400e-01   1.808  0.07058 .  
credit_historyvery good         1.306e+00  4.347e-01   3.004  0.00266 ** 
purposecar                      2.796e-01  3.214e-01   0.870  0.38427    
purposecar0                    -1.030e+00  8.004e-01  -1.287  0.19815    
purposeeducation                6.856e-01  4.329e-01   1.584  0.11320    
purposefurniture/appliances    -1.303e-01  3.152e-01  -0.414  0.67920    
purposerenovations              3.688e-01  5.941e-01   0.621  0.53477    
amount                          9.464e-05  4.440e-05   2.132  0.03304 *  
savings_balance> 1000 DM       -1.046e+00  5.001e-01  -2.092  0.03648 *  
savings_balance100 - 500 DM    -2.049e-01  2.859e-01  -0.717  0.47354    
savings_balance500 - 1000 DM   -5.771e-01  4.437e-01  -1.301  0.19342    
savings_balanceunknown         -8.369e-01  2.575e-01  -3.250  0.00116 ** 
employment_duration> 7 years   -2.788e-01  2.955e-01  -0.943  0.34545    
employment_duration1 - 4 years -2.002e-01  2.404e-01  -0.833  0.40491    
employment_duration4 - 7 years -8.475e-01  3.045e-01  -2.783  0.00539 ** 
employment_durationunemployed  -1.874e-01  4.469e-01  -0.419  0.67490    
percent_of_income               2.665e-01  8.862e-02   3.007  0.00264 ** 
years_at_residence              1.995e-02  8.756e-02   0.228  0.81981    
age                            -1.232e-02  9.037e-03  -1.363  0.17289    
other_creditnone               -6.470e-01  2.380e-01  -2.718  0.00657 ** 
other_creditstore              -1.359e-01  4.140e-01  -0.328  0.74275    
housingown                     -1.696e-01  2.981e-01  -0.569  0.56937    
housingrent                     2.317e-01  3.411e-01   0.679  0.49690    
existing_loans_count            2.322e-01  1.923e-01   1.207  0.22738    
jobskilled                      4.105e-02  2.844e-01   0.144  0.88526    
jobunemployed                  -2.606e-01  6.759e-01  -0.386  0.69979    
jobunskilled                   -4.921e-02  3.486e-01  -0.141  0.88773    
dependents                     -1.316e-01  2.465e-01  -0.534  0.59342    
phoneyes                       -2.069e-01  2.005e-01  -1.032  0.30207    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1090.95  on 899  degrees of freedom
Residual deviance:  857.26  on 864  degrees of freedom
AIC: 929.26

Number of Fisher Scoring iterations: 5

anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: default

Terms added sequentially (first to last)

                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                   899    1090.95              
checking_balance      3  112.232       896     978.72 < 2.2e-16 ***
months_loan_duration  1   38.630       895     940.09 5.123e-10 ***
credit_history        4   23.551       891     916.54 9.827e-05 ***
purpose               5    7.853       886     908.68  0.164539    
amount                1    0.414       885     908.27  0.519718    
savings_balance       4   15.869       881     892.40  0.003200 ** 
employment_duration   4   10.037       877     882.36  0.039816 *  
percent_of_income     1    8.413       876     873.95  0.003725 ** 
years_at_residence    1    0.188       875     873.76  0.664428    
age                   1    2.749       874     871.01  0.097306 .  
other_credit          2    7.799       872     863.21  0.020247 *  
housing               2    3.023       870     860.19  0.220528    
existing_loans_count  1    1.189       869     859.00  0.275509    
job                   3    0.388       866     858.61  0.942683    
dependents            1    0.281       865     858.33  0.596134    
phone                 1    1.071       864     857.26  0.300799    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We now want to drop the insignificant predictors from the above output. This will help to improve the overall predictability of the model.

# drop the insignificant predictors, alpha = 0.10
model <- glm(default ~ checking_balance + months_loan_duration + credit_history +  percent_of_income + age,family=binomial(link='logit'),data=cred_train)
model


Call:  glm(formula = default ~ checking_balance + months_loan_duration + 
    credit_history + percent_of_income + age, family = binomial(link = "logit"), 
    data = cred_train)

Coefficients:
               (Intercept)    checking_balance> 200 DM  checking_balance1 - 200 DM  
                  -1.38751                    -1.04269                    -0.59932  
   checking_balanceunknown        months_loan_duration          credit_historygood  
                  -1.89752                     0.03488                     0.54507  
     credit_historyperfect          credit_historypoor     credit_historyvery good  
                   1.58205                     0.55976                     1.25426  
         percent_of_income                         age  
                   0.16386                    -0.01137  

Degrees of Freedom: 899 Total (i.e. Null);  889 Residual
Null Deviance:      1091 
Residual Deviance: 909.6    AIC: 931.6

summary(model)


Call:
glm(formula = default ~ checking_balance + months_loan_duration + 
    credit_history + percent_of_income + age, family = binomial(link = "logit"), 
    data = cred_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7486  -0.7881  -0.4808   0.8796   2.3638  

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -1.387511   0.429112  -3.233 0.001223 ** 
checking_balance> 200 DM   -1.042690   0.353004  -2.954 0.003139 ** 
checking_balance1 - 200 DM -0.599317   0.198958  -3.012 0.002593 ** 
checking_balanceunknown    -1.897523   0.217381  -8.729  < 2e-16 ***
months_loan_duration        0.034877   0.006637   5.255 1.48e-07 ***
credit_historygood          0.545071   0.206717   2.637 0.008369 ** 
credit_historyperfect       1.582047   0.398854   3.966 7.29e-05 ***
credit_historypoor          0.559760   0.318190   1.759 0.078544 .  
credit_historyvery good     1.254261   0.379193   3.308 0.000941 ***
percent_of_income           0.163864   0.074465   2.201 0.027768 *  
age                        -0.011367   0.007391  -1.538 0.124059    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1091.0  on 899  degrees of freedom
Residual deviance:  909.6  on 889  degrees of freedom
AIC: 931.6

Number of Fisher Scoring iterations: 4

anova(model, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: default

Terms added sequentially (first to last)

                     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                   899    1090.95              
checking_balance      3  112.232       896     978.72 < 2.2e-16 ***
months_loan_duration  1   38.630       895     940.09 5.123e-10 ***
credit_history        4   23.551       891     916.54 9.827e-05 ***
percent_of_income     1    4.526       890     912.01   0.03339 *  
age                   1    2.410       889     909.60   0.12054    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now that we we have our final model, we want to check the accuracy of the model.

# check Accuracy
fitted.results <- predict(model,newdata=cred_test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != cred_test$default)
print(paste('Accuracy',1-misClasificError))

[1] "Accuracy 0.72"

# ROC
# Because this data set is so small, it is possible that the test data set
# does not contain both 0 and 1 values.  If this happens the code will not
# run.  And since the test data set is so small the ROC is not useful here
# but the code is provided.
#library(ROCR)
#p <- predict(model, newdata=credit_test, type="response")
#pr <- prediction(p, credit_test$default)
#prf <- performance(pr, measure = "tpr", x.measure = "fpr")
#plot(prf)
#auc <- performance(pr, measure = "auc")
#auc <- auc@y.values[[1]]
#auc

From the output we see that our accuracy of the model is . And we can see that this very good. But lets see if using a tree base model would give a better prediction.

####Classification Trees ------------------
##Training a model on the data ----
# regression tree using rpart
library(rpart)
m.rpart <- rpart(default ~ ., data = cred_train)
# get basic information about the tree
m.rpart

n= 900 

node), split, n, deviance, yval
      * denotes terminal node

  1) root 900 186.972200 0.2944444  
    2) checking_balance=> 200 DM,unknown 417  48.479620 0.1342926  
      4) other_credit=none 346  31.459540 0.1011561 *
      5) other_credit=bank,store 71  14.788730 0.2957746 *
    3) checking_balance=< 0 DM,1 - 200 DM 483 118.563100 0.4327122  
      6) months_loan_duration< 31.5 385  90.140260 0.3740260  
       12) credit_history=critical,good,poor 341  75.888560 0.3343109  
         24) months_loan_duration< 11.5 73  10.027400 0.1643836 *
         25) months_loan_duration>=11.5 268  63.179100 0.3805970  
           50) amount>=1381.5 194  42.185570 0.3195876 *
           51) amount< 1381.5 74  18.378380 0.5405405  
            102) purpose=business,furniture/appliances,renovations 38   8.842105 0.3684211 *
            103) purpose=car,education 36   7.222222 0.7222222 *
       13) credit_history=perfect,very good 44   9.545455 0.6818182 *
      7) months_loan_duration>=31.5 98  21.887760 0.6632653  
       14) employment_duration=unemployed 8   0.000000 0.0000000 *
       15) employment_duration=< 1 year,> 7 years,1 - 4 years,4 - 7 years 90  18.055560 0.7222222 *

# get more detailed information about the tree
summary(m.rpart)

Call:
rpart(formula = default ~ ., data = cred_train)
  n= 900 

          CP nsplit rel error    xerror       xstd
1 0.10659048      0 1.0000000 1.0037707 0.03018677
2 0.03495242      1 0.8934095 0.8974933 0.03028442
3 0.02517081      2 0.8584571 0.8892272 0.03333667
4 0.02049609      3 0.8332863 0.8670734 0.03419518
5 0.01434470      4 0.8127902 0.8637588 0.03586516
6 0.01398689      5 0.7984455 0.8782908 0.03761752
7 0.01237644      6 0.7844586 0.8906073 0.03877673
8 0.01193411      7 0.7720822 0.8894670 0.03907911
9 0.01000000      8 0.7601481 0.9065333 0.04035903

Variable importance
    checking_balance months_loan_duration       credit_history               amount  employment_duration 
                  35                   18                   13                    8                    7 
     savings_balance              purpose         other_credit                  age existing_loans_count 
                   6                    5                    4                    2                    1 
                 job 
                   1 

Node number 1: 900 observations,    complexity param=0.1065905
  mean=0.2944444, MSE=0.2077469 
  left son=2 (417 obs) right son=3 (483 obs)
  Primary splits:
      checking_balance     splits as  RLRL,       improve=0.10659050, (0 missing)
      credit_history       splits as  LLRLR,      improve=0.03940550, (0 missing)
      months_loan_duration < 34.5   to the left,  improve=0.03657627, (0 missing)
      savings_balance      splits as  RLRLL,      improve=0.03318005, (0 missing)
      amount               < 3913.5 to the left,  improve=0.03246751, (0 missing)
  Surrogate splits:
      savings_balance      splits as  RLRLL,      agree=0.607, adj=0.151, (0 split)
      credit_history       splits as  LRRRR,      agree=0.590, adj=0.115, (0 split)
      existing_loans_count < 1.5    to the right, agree=0.554, adj=0.038, (0 split)
      age                  < 30.5   to the right, agree=0.552, adj=0.034, (0 split)
      months_loan_duration < 6.5    to the left,  agree=0.550, adj=0.029, (0 split)

Node number 2: 417 observations,    complexity param=0.01193411
  mean=0.1342926, MSE=0.1162581 
  left son=4 (346 obs) right son=5 (71 obs)
  Primary splits:
      other_credit         splits as  RLR,        improve=0.04602649, (0 missing)
      employment_duration  splits as  RLRLR,      improve=0.02260084, (0 missing)
      amount               < 4158   to the left,  improve=0.02157993, (0 missing)
      purpose              splits as  RLLLLR,     improve=0.01944512, (0 missing)
      months_loan_duration < 16.5   to the left,  improve=0.01509109, (0 missing)
  Surrogate splits:
      credit_history       splits as  LLLLR,      agree=0.847, adj=0.099, (0 split)
      months_loan_duration < 45     to the left,  agree=0.832, adj=0.014, (0 split)
      purpose              splits as  LLRLLL,     agree=0.832, adj=0.014, (0 split)

Node number 3: 483 observations,    complexity param=0.03495242
  mean=0.4327122, MSE=0.2454724 
  left son=6 (385 obs) right son=7 (98 obs)
  Primary splits:
      months_loan_duration < 31.5   to the left,  improve=0.05511942, (0 missing)
      credit_history       splits as  LLRLR,      improve=0.03813898, (0 missing)
      savings_balance      splits as  RLRLL,      improve=0.03212835, (0 missing)
      amount               < 3998   to the left,  improve=0.02978766, (0 missing)
      housing              splits as  RLR,        improve=0.01513721, (0 missing)
  Surrogate splits:
      amount < 6648   to the left,  agree=0.847, adj=0.245, (0 split)

Node number 4: 346 observations
  mean=0.1011561, MSE=0.09092352 

Node number 5: 71 observations
  mean=0.2957746, MSE=0.208292 

Node number 6: 385 observations,    complexity param=0.02517081
  mean=0.374026, MSE=0.2341305 
  left son=12 (341 obs) right son=13 (44 obs)
  Primary splits:
      credit_history       splits as  LLRLR,      improve=0.05221021, (0 missing)
      months_loan_duration < 11.5   to the left,  improve=0.02781929, (0 missing)
      savings_balance      splits as  RLRLL,      improve=0.02194005, (0 missing)
      amount               < 8195   to the left,  improve=0.01666578, (0 missing)
      purpose              splits as  LRRRLL,     improve=0.01580102, (0 missing)

Node number 7: 98 observations,    complexity param=0.02049609
  mean=0.6632653, MSE=0.2233444 
  left son=14 (8 obs) right son=15 (90 obs)
  Primary splits:
      employment_duration splits as  RRRRL,      improve=0.17508420, (0 missing)
      savings_balance     splits as  RRRLL,      improve=0.08808193, (0 missing)
      age                 < 26.5   to the right, improve=0.05203600, (0 missing)
      job                 splits as  LRLR,       improve=0.04773893, (0 missing)
      checking_balance    splits as  R-L-,       improve=0.04371296, (0 missing)
  Surrogate splits:
      purpose splits as  RRLRRR, agree=0.929, adj=0.125, (0 split)
      job     splits as  RRLR,   agree=0.929, adj=0.125, (0 split)

Node number 12: 341 observations,    complexity param=0.0143447
  mean=0.3343109, MSE=0.2225471 
  left son=24 (73 obs) right son=25 (268 obs)
  Primary splits:
      months_loan_duration < 11.5   to the left,  improve=0.03534210, (0 missing)
      age                  < 25.5   to the right, improve=0.01954279, (0 missing)
      amount               < 8176   to the left,  improve=0.01865354, (0 missing)
      savings_balance      splits as  RLRLL,      improve=0.01645713, (0 missing)
      purpose              splits as  LRRRRR,     improve=0.01610702, (0 missing)
  Surrogate splits:
      amount < 527.5  to the left,  agree=0.806, adj=0.096, (0 split)
      age    < 66.5   to the right, agree=0.792, adj=0.027, (0 split)

Node number 13: 44 observations
  mean=0.6818182, MSE=0.2169421 

Node number 14: 8 observations
  mean=0, MSE=0 

Node number 15: 90 observations
  mean=0.7222222, MSE=0.2006173 

Node number 24: 73 observations
  mean=0.1643836, MSE=0.1373616 

Node number 25: 268 observations,    complexity param=0.01398689
  mean=0.380597, MSE=0.2357429 
  left son=50 (194 obs) right son=51 (74 obs)
  Primary splits:
      amount           < 1381.5 to the right, improve=0.04139279, (0 missing)
      checking_balance splits as  R-L-,       improve=0.02362391, (0 missing)
      purpose          splits as  LRRRRR,     improve=0.02343242, (0 missing)
      savings_balance  splits as  RLRLL,      improve=0.02090320, (0 missing)
      credit_history   splits as  LR-L-,      improve=0.01425536, (0 missing)
  Surrogate splits:
      purpose splits as  LLLRLL, agree=0.739, adj=0.054, (0 split)

Node number 50: 194 observations
  mean=0.3195876, MSE=0.2174514 

Node number 51: 74 observations,    complexity param=0.01237644
  mean=0.5405405, MSE=0.2483565 
  left son=102 (38 obs) right son=103 (36 obs)
  Primary splits:
      purpose              splits as  LR-RLL,     improve=0.12591160, (0 missing)
      months_loan_duration < 22.5   to the left,  improve=0.11025060, (0 missing)
      amount               < 1181.5 to the left,  improve=0.07398612, (0 missing)
      existing_loans_count < 1.5    to the right, improve=0.07026175, (0 missing)
      other_credit         splits as  RLR,        improve=0.06891326, (0 missing)
  Surrogate splits:
      months_loan_duration < 15.5   to the left,  agree=0.608, adj=0.194, (0 split)
      savings_balance      splits as  LRLLR,      agree=0.608, adj=0.194, (0 split)
      amount               < 1163   to the left,  agree=0.595, adj=0.167, (0 split)
      employment_duration  splits as  LRLRL,      agree=0.581, adj=0.139, (0 split)
      age                  < 22.5   to the left,  agree=0.568, adj=0.111, (0 split)

Node number 102: 38 observations
  mean=0.3684211, MSE=0.232687 

Node number 103: 36 observations
  mean=0.7222222, MSE=0.2006173

# use the rpart.plot package to create a visualization
library(rpart.plot)
# a basic decision tree diagram
rpart.plot(m.rpart, digits = 3)

# a few adjustments to the diagram
rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE, type = 3, extra = 101)

Above we can see that output of the tree base model. And we see that the same predictors we found relevant in the logistic model are also deem important in the tree base model as well. But now lets see what accuracy of this model well be.

## Step 4: Evaluate model performance ----
# generate predictions for the testing dataset
p.rpart <- predict(m.rpart, cred_test)
# compare the distribution of predicted values vs. actual values
summary(p.rpart)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1012  0.1012  0.3196  0.3140  0.3684  0.7222

summary(cred_test$default)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00    0.35    1.00    1.00

# compare the correlation
cor(p.rpart, as.numeric(cred_test$default))

[1] 0.3930203

# function to calculate the mean absolute error
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))  
}
# mean absolute error between predicted and actual values
MAE(p.rpart, as.numeric(cred_test$default))

[1] 0.3608024

# mean absolute error between actual values and mean value
mean(as.numeric(cred_train$default))

[1] 0.2944444

MAE(1.30, as.numeric(cred_train$default))

[1] 1.005556

LS0tCnRpdGxlOiAiU1RBVCA2NjIwIEhXIDYgUGFydCAxIgphdXRob3I6ICJHZW9yZ2UgTWFzb24iCm91dHB1dDoKICBodG1sX25vdGVib29rOiBkZWZhdWx0CiAgd29yZF9kb2N1bWVudDogZGVmYXVsdAotLS0KCiNMb2dpc3RpYyBSZWdyZXNzaW9uCgojIyNTcGFjZSBTaHV0dGxlIExhdW5jaCBEYXRhCgpXZSB3aWxsIGJlIGV4YW1pbmUgdGhlIFNwYWNlIFNodXR0bGUgTGF1bmNoIERhdGEgcmVjb3JkZWQgYWZ0ZXIgdGhlIDE5ODYgZXhwbG9zaW9uIG9mIHRoZSBzcGFjZSBzaHV0dGxlIENoYWxsZW5nZXIuIEFuIGludmVzdGlnYXRpb24gb2YgdGhlIGV4cGxvc2lvbiB3YXMgY29uZHVjdGVkIGFuZCBkYXRhIGlzIHRoZSByZXBvcnQgdGhlIHByb3ZpZGVkIGFmdGVyIHRoZSBpbnZlc3RpZ2F0aW9uLiBGcm9tIHRoZSBiZWxvdyB0YWJsZSB3ZSBjYW4gc2VlIHRoZSB0aGVyZSBhcmUgNCB2YXJpYWJsZXMsIG9uZSBkZXBlbmRlbnQgdmFyaWFibGUgZGlzdHJlc3NfY3QgYW5kIDIgZGVwZW5kZW50IHZhcmlhYmxlcywgdGVtcGVyYXR1cmUgYW5kIGZpZWxkIGNoZWNrIHByZXNzdXJlLCB0aGUgbGFzdCB2YXJpYWJsZSBpcyBhbiBpZGVudGlmaWVyIGNvZGUgZm9yIHRoZSAyMyBzaHV0dGxlcyBsYXVuY2hlcyBvYnNlcnZlZC4KCmBgYHtyIHNwYWNlfQpsYXVuY2ggPC0gcmVhZC5jc3YoImh0dHA6Ly93d3cuc2NpLmNzdWVhc3RiYXkuZWR1L35lc3Vlc3MvY2xhc3Nlcy9TdGF0aXN0aWNzXzY2MjAvUHJlc2VudGF0aW9ucy9tbDEwL2NoYWxsZW5nZXIuY3N2IikKCiMgZXhhbWluZSB0aGUgbGF1bmNoIGRhdGEKc3RyKGxhdW5jaCkKI1ZpZXcobGF1bmNoKQojIyBEZXNjcmlwdGl2ZSBTdGF0aXN0aWNzCiMgRmlyc3QgcmVjb2RlIHRoZSBkaXN0cmVzc19jdCB2YXJpYWJsZSBpbnRvIDAgYW5kIDEsIG1ha2luZyAxIHRvIHJlcHJlc2VudCBhdCBsZWFzdAojIG9uZSBmYWlsdXJlIGR1cmluZyBhIGxhdW5jaC4KCnN1bW1hcnkobGF1bmNoKQoKbGF1bmNoJGRpc3RyZXNzX2N0IDwtIGlmZWxzZShsYXVuY2gkZGlzdHJlc3NfY3Q8MSwwLDEpCmxhdW5jaCRkaXN0cmVzc19jdEwgPC0gZmFjdG9yKGxhdW5jaCRkaXN0cmVzc19jdCwgbGV2ZWxzID0gYygwLDEpLCBsYWJlbHMgPSBjKCJObyBGYWlsdXJlcyIsICJBdCBsZWFzdCBPbmUgRmFpbHVyZSIpKQoKIyBCb3hwbG90IG9mIE1QRyBieSBDYXIgQ3lsaW5kZXJzIApib3hwbG90KHRlbXBlcmF0dXJlfmRpc3RyZXNzX2N0TCxkYXRhPWxhdW5jaCwgbWFpbj0iRGlzdHJpYnV0aW9uIG9mIFRlbXBhdHVyZSBieSBGYWlsdXJlcyIsIAogIAl4bGFiPSJGYWlsdXJlcyIsIHlsYWI9IlRlbXBlcmF0dXJlIikKYGBgCgpGcm9tIHRoZSBib3ggcGxvdCBhYm92ZSwgd2Ugbm90aWNlIHRoYXQgc2h1dHRsZSBsYXVuY2hlcyB0aGF0IGhhcHBlbmVkIG9uIHdhcm1lciBkYXlzIHdlcmUgbW9yZSBsaWtlbHkgdG8gYmUgc3VjY2Vzc2Z1bCB0aGFuIHRob3NlIG9uIGNvbGRlciBkYXlzLiBXZSB3aWxsIG5vdyBydW4gdGhlIGxvZ2lzdGljIHJlZ3Jlc3Npb24gdG8gY29uZmlybSB0aGlzIG9ic2VydmF0aW9uIHdpdGggc3RhdGlzdGljYWxseSBjZXJ0YWludHkuIAoKV2Ugd2lsbCBmaXJzdCBzdGVwIHVwIG91ciB0cmFpbiBkYXRhIHNldCBhbmQgdmVyaWZ5IHRoYXQgdGhlcmUgYXJlIG5vIG1pc3NpbmcgdmFsdWVzLgpgYGB7ciwgbWVzc2FnZT1GQUxTRSwgd2FybmluZz1GQUxTRX0KIyBzZXQgdXAgdHJhaW5uaW5nIGFuZCB0ZXN0IGRhdGEgc2V0cwoKaW5keCA9IHNhbXBsZSgxOm5yb3cobGF1bmNoKSwgYXMuaW50ZWdlcigwLjkqbnJvdyhsYXVuY2gpKSkKCmxhdW5jaF90cmFpbiA9IGxhdW5jaFtpbmR4LDE6NF0KbGF1bmNoX3Rlc3QgPSBsYXVuY2hbLWluZHgsMTo0XQoKbGF1bmNoX3RyYWluX2xhYmVscyA9IGxhdW5jaFtpbmR4LDFdCmxhdW5jaF90ZXN0X2xhYmVscyA9IGxhdW5jaFstaW5keCwxXSAgIAoKIyBDaGVjayBpZiB0aGVyZSBhcmUgYW55IG1pc3NpbmcgdmFsdWVzCiNpbnN0YWxsLnBhY2thZ2VzKCJBbWVsaWEiKQpsaWJyYXJ5KEFtZWxpYSkKbWlzc21hcChsYXVuY2gsIG1haW4gPSAiTWlzc2luZyB2YWx1ZXMgdnMgb2JzZXJ2ZWQiKQpgYGAKCkFmdGVyIHdlIHNldHVwIHRoZSB0ZXN0IGRhdGEgc2V0LCB3ZSB1c2VkIHRoZSBtaXNzIG1hcCBmdW5jdGlvbiBmcm9tIHRoZSBBbWVsaWEgcGFja2FnZSB0byBjcmVhdGUgYSBncmFwaCB0byBzaG93IHVzIGlmIHdlIGhhZCBhbnkgbWlzc2luZyB2YWx1ZXMuIEZyb20gdGhlIGdyYXBoIGFib3ZlIGl0IGRvZXMgbm90IHNlZW0gdGhhdCB3ZSBoYXZlIGFueSBtaXNzaW5nIHZhbHVlcyBidXQgd2Ugd2lsbCB0YWtlIGEgY2xvc2VyIGxvb2sgdG8gdmVyaWZ5LiBJdCBpcyBpbXBvcnRhbnQgdG8gdmVyaWZ5IHRoYXQgdGhlcmUgYXJlIG5vIG1pc3NpbmcgdmFsdWVzIGJlY2F1c2UgbWlzc2luZyB2YWx1ZXMgaGF2ZSBhIHNpZ25pZmljYW50bHkgbmVnYXRpdmUgZWZmZWN0IG9uIHRoZSBwZXJmb3JtYW5jZSBvZiB0aGUgbG9naXN0aWMgcmVncmVzc2lvbiBtb2RlbC4gCgpgYGB7ciB9CiMgbnVtYmVyIG9mIG1pc3NpbmcgdmFsdWVzIGluIGVhY2ggY29sdW1uCgpzYXBwbHkobGF1bmNoLGZ1bmN0aW9uKHgpIHN1bShpcy5uYSh4KSkpCgojIG51bWJlciBvZiB1bmlxdWUgdmFsdWVzIGluIGVhY2ggY29sdW1uCgpzYXBwbHkobGF1bmNoLCBmdW5jdGlvbih4KSBsZW5ndGgodW5pcXVlKHgpKSkKYGBgCgpOb3cgdGhhdCB0aGUgd2UgaGF2ZSB2ZXJpZmllZCB0aGVyZSBhcmUgbm8gbWlzc2luZyB2YWx1ZXMsIHdlIGNhbiBub3cgZml0IHRoZSBtb2RlbCBhbmQgcnVuIHRoZSBhbmFseXNpcy4KCmBgYHtyIH0KIyBmaXQgdGhlIGxvZ2lzdGljIHJlZ3Jlc3Npb24gbW9kZWwsIHdpdGggYWxsIHByZWRpY3RvciB2YXJpYWJsZXMKCm1vZGVsIDwtIGdsbShkaXN0cmVzc19jdCB+LixmYW1pbHk9Ymlub21pYWwobGluaz0nbG9naXQnKSxkYXRhPWxhdW5jaF90cmFpbikKbW9kZWwKCnN1bW1hcnkobW9kZWwpCmFub3ZhKG1vZGVsLCB0ZXN0PSJDaGlzcSIpCmBgYAoKRnJvbSB0aGUgYWJvdmUgdGFibGUsIHdlIHNlZSB0aGUgb3V0cHV0IG9mIHRoZSBtb2RlbC4gV2Ugd2FudCB0byBpbXByb3ZlIHRoZSBtb2RlbCBzbyB3ZSBjYW4gdXNlIGEgc3RlcCB3aXNlIHByb2Nlc3MgdG8gcmVtb3ZlIGluc2lnbmlmaWNhbnQgdmFyaWFibGVzLiAKCmBgYHtyLCBtZXNzYWdlPUZBTFNFLCB3YXJuaW5nPUZBTFNFfQojIGRyb3AgdGhlIGluc2lnbmlmaWNhbnQgcHJlZGljdG9ycywgYWxwaGEgPSAwLjEwCmxpYnJhcnkoZ2RhdGEpCmRyb3AxKG1vZGVsLCB0ZXN0ID0gIkNoaXNxIikKCmBgYAoKRnJvbSB0aGUgdGFibGUgYmVsb3cgd2Ugc2VlLCBmaWVsZCBjaGVjayBwcmVzc3VyZSBhbmQgZmxpZ2h0IG51bWJlciBhcmUgbm90IHNpZ25pZmljYW50IGF0IGFuIGFscGhhIG9mIC4xMC4gV2Ugd2lsbCBkcm9wIHRoZXNlIHZhcmlhYmxlcyBhbmQgcmVydW4gdGhlIG1vZGVsLgoKYGBge3IgfQptb2RlbCA8LSBnbG0oZGlzdHJlc3NfY3QgfiB0ZW1wZXJhdHVyZSxmYW1pbHk9Ymlub21pYWwobGluaz0nbG9naXQnKSxkYXRhPWxhdW5jaF90cmFpbikKc3VtbWFyeShtb2RlbCkKYW5vdmEobW9kZWwsIHRlc3Q9IkNoaXNxIikKYGBgCgpXZSBzZWUgdGhhdCB0aGUgbW9kZWwgaGFzIGltcHJvdmUgd2l0aCB0aGUgZGVsZXRpb24gb2YgdGhlIHR3byBvdGhlciB2YXJpYWJsZXMuIFdlIGtub3cgd2hhdCB0byBzZWUgY2hlY2sgdGhlIHByZWRpY3RhYmlsaXR5IG9mIHRoZSBuZXcgbW9kZWwuCgpgYGB7ciB9CiMgY2hlY2sgQWNjdXJhY3kKCmZpdHRlZC5yZXN1bHRzIDwtIHByZWRpY3QobW9kZWwsbmV3ZGF0YT1sYXVuY2hfdGVzdCx0eXBlPSdyZXNwb25zZScpCmZpdHRlZC5yZXN1bHRzIDwtIGlmZWxzZShmaXR0ZWQucmVzdWx0cyA+IDAuNSwxLDApCgptaXNDbGFzaWZpY0Vycm9yIDwtIG1lYW4oZml0dGVkLnJlc3VsdHMgIT0gbGF1bmNoX3Rlc3QkZGlzdHJlc3NfY3QpCiNwcmludChwYXN0ZSgnQWNjdXJhY3knLHJvdW5kKDEtbWlzQ2xhc2lmaWNFcnJvciwyKSoxMDAsIiUiKSkKCgojQmVjYXVzZSB0aGlzIGRhdGEgc2V0IGlzIHNvIHNtYWxsLCB0aGUgdGVzdCBkYXRhIHNldCBkb2VzIG5vdCBjb250YWluIGJvdGggMCBhbmQgMSB2YWx1ZXMgYW5kIHRoZSBjb2RlIHdpbGwgbm90IHJ1bi4gQW5kIHNpbmNlIHRoZSB0ZXN0IGRhdGEgc2V0IGlzIHNvIHNtYWxsIHRoZSBST0MgaXMgbm90IHVzZWZ1bCBoZXJlIGJ1dCB0aGUgY29kZSBpcyBwcm92aWRlZC4KCiNpbnN0YWxsLnBhY2thZ2VzKCJST0NSIikKI2xpYnJhcnkoUk9DUikKI3AgPC0gcHJlZGljdChtb2RlbCwgbmV3ZGF0YT1sYXVuY2hfdGVzdCwgdHlwZT0icmVzcG9uc2UiKQojcHIgPC0gcHJlZGljdGlvbihwLCBsYXVuY2hfdGVzdCRkaXN0cmVzc19jdCkKI3ByZiA8LSBwZXJmb3JtYW5jZShwciwgbWVhc3VyZSA9ICJ0cHIiLCB4Lm1lYXN1cmUgPSAiZnByIikKI3Bsb3QocHJmKQoKI2F1YyA8LSBwZXJmb3JtYW5jZShwciwgbWVhc3VyZSA9ICJhdWMiKQojYXVjIDwtIGF1Y0B5LnZhbHVlc1tbMV1dCiNhdWMKYGBgCgpXZSBzZWUgdGhhdCB0aGUgYWNjdXJhY3kgb2YgdGhlIG1vZGVsIGlzIGByIHBhc3RlKHJvdW5kKDEtbWlzQ2xhc2lmaWNFcnJvciwyKSoxMDAsIiUiKWAgd2hpY2ggaXMgcHJldHR5IGdvb2QgY29uc2lkZXJpbmcgdGhhdCB3ZSB3ZXJlIHdvcmtpbmcgd2l0aCBhIHNtYWxsIGRhdGEgc2V0LiAKCgoKIyMgTG9naXN0aWMgUmVncmVzc2lvbiBhbmQgQ0FSVCB3aXRoIENyZWRpdCBEYXRhc2V0CgpXZSB3aWxsIGJlIGNvbmR1Y3RpbmcgYSBsb2dpc3RpYyByZWdyZXNzaW9uIGFuYWx5c2lzIHVzaW5nIHRoZSBjcmVkaXQgYXBwcm92YWwgZGF0YSBzZXQuIFRoZSBkYXRhIHNldCBpcyBhIGNvbGxlY3Rpb24gb2YgY3JlZGl0IGNhcmQgYXBwbGljYXRpb25zIGFuZCB0aGUgY3JlZGl0IGFwcHJvdmFsIGRlY2lzaW9ucy4gSW4gdGhpcyBhbmFseXNpcyBJIHdpbGwgYWxzbyBiZSBpbmNsdWRpbmcgYSByZWdyZXNzaW9uIHRyZWUgKENBUlQpIG1vZGVsLiBXZSB3aWxsIGJ1aWx0IGEgbW9kZWwgYW5kIHRlc3QgaXMgcHJlZGljdGFiaWxpdHkgb2YgdGhlIG91dGNvbWVzIG9mIGNyZWRpdCBhcHBsaWNhdGlvbnMuIAoKQmVsb3cgd2Ugd2lsbCBsb2FkIHRoZSBkYXRhIHNldCBhbmQgdXNlIHRoZSBzdHIoKSBmdW5jdGlvbiB0byBnZXQgYSBxdWljayB1bmRlcnN0YW5kaW5nIG9mIHRoZSBkYXRhIHR5cGVzIGluY2x1ZGVkIGluIHRoZSBkYXRhIHNldC4gCgpgYGB7ciBjcmVkLCBtZXNzYWdlPUZBTFNFLCB3YXJuaW5nPUZBTFNFfQpsaWJyYXJ5KHJlYWRyKQpjcmVkIDwtIHJlYWRfY3N2KCJodHRwOi8vd3d3LnNjaS5jc3VlYXN0YmF5LmVkdS9+ZXN1ZXNzL2NsYXNzZXMvU3RhdGlzdGljc182NjIwL1ByZXNlbnRhdGlvbnMvbWw3L2NyZWRpdC5jc3YiKQoKIyBleGFtaW5lIHRoZSBsYXVuY2ggZGF0YQpzdHIoY3JlZCkKYGBgCgpGcm9tIHRoZSB0YWJsZSBhYm92ZSB3ZSBzZWUgdGhhdCBkZXBlbmRlbnQgdmFyaWFibGUgaXMgY2hhcmFjdGVyIHR5cGUgYW5kIHdlIG5lZWQgdG8gdHJhbnNmb3JtIGl0IHRvIG51bWVyaWMgdmFsdWVzIDAgYW5kIDEuIEhhdmluZyB0aGUgZGF0YSBhcyBhIGNoYXJhY3RlciB2YXJpYWJsZXMgaXMgbWVhbmluZ2xlc3MgaW4gdGVybXMgb2YgdGhlIG1vZGVsLiBXZSBhbHNvIHdhbnQgdG8gY2hlY2sgZm9yIG1pc3NpbmcgdmFsdWVzIGFuZCBkZXRlcm1pbmUgaG93IHRvIGFkZHJlc3MgdGhlbS4gVGhlcmUgYXJlIHRocmVlIGFwcHJvYWNoZXM6cmVtb3ZlIHRoZW0sIHplcm8gdGhlbSBvdXQgb3IgZXN0aW1hdGUgYSBwbHVnIHZhbHVlLiBGb3IgdGhpcyBhbmFseXNpcywgaWYgd2UgZmluZCBhbnkgbWlzc2luZyB2YWx1ZXMgd2Ugd2lsbCB6ZXJvIHRoZW0gb3V0LgoKYGBge3IsIG1lc3NhZ2U9RkFMU0UsIHdhcm5pbmc9RkFMU0V9CiMjIEZpeCB0aGUgZGVmYXVsdCB2YXJpYWJsZSB0byBiZSAwIG9yIDEKCmNyZWQkZGVmYXVsdCA8LSBpZmVsc2UoY3JlZCRkZWZhdWx0ID09InllcyIsIDEsIDApCgojIENoZWNrIGlmIHRoZXJlIGFyZSBhbnkgbWlzc2luZyB2YWx1ZXMKCmxpYnJhcnkoQW1lbGlhKQptaXNzbWFwKGNyZWQsIG1haW4gPSAiTWlzc2luZyB2YWx1ZXMgdnMgb2JzZXJ2ZWQiKQoKIyBudW1iZXIgb2YgbWlzc2luZyB2YWx1ZXMgaW4gZWFjaCBjb2x1bW4KCnNhcHBseShjcmVkLGZ1bmN0aW9uKHgpIHN1bShpcy5uYSh4KSkpCgojIG51bWJlciBvZiB1bmlxdWUgdmFsdWVzIGluIGVhY2ggY29sdW1uCgpzYXBwbHkoY3JlZCwgZnVuY3Rpb24oeCkgbGVuZ3RoKHVuaXF1ZSh4KSkpCgoKYGBgCgpGcm9tIHRoZSBhYm92ZSBncmFwaHMgYW5kIGNoYXJ0cywgd2UgZG8gbm90IGhhdmUgYW55IG1pc3NpbmcgdmFsdWVzIHRodXMgd2UgY2FuIHByb2NlZWQgd2l0aCB0aGUgYW5hbHlzaXMuCgpgYGB7ciwgbWVzc2FnZT1GQUxTRSwgd2FybmluZz1GQUxTRX0KbGlicmFyeShwbG90cml4KQpsaWJyYXJ5KHNxbGRmKQoKWWUgPC0gc3VtKGNyZWQkZGVmYXVsdCkKTiA8LSBsZW5ndGgoY3JlZCRkZWZhdWx0KSAtIHN1bShjcmVkJGRlZmF1bHQpCnNsaWNlcyA8LSBjKFllLE4pCmxibHMgPC0gYygiQXBwcm92ZWQiLCAiTm90IEFwcHJvdmVkIikKcGN0IDwtIHJvdW5kKHNsaWNlcy9zdW0oc2xpY2VzKSoxMDApICAgIApsYmxzIDwtIHBhc3RlKGxibHMsIHBjdCkgIyBhZGQgcGVyY2VudHMgdG8gbGFiZWxzCmxibHMgPC0gcGFzdGUobGJscywiJSIsc2VwPSIiKSAjIGFkZCAlIHRvIGxhYmVscwpwaWUzRChzbGljZXMsIGxhYmVscyA9IGxibHMsICBleHBsb2RlID0gIC4wMSAsIG1haW4gPSAiUGllIENoYXJ0IG9mIENyZWRpdCBBcHBsaWNhdGlvbiBBcHByb3ZhbCIsIGxhYmVscG9zID0gYygxLjUsNSkpCgpgYGAKCldlIHdpbGwgbm93IGNyZWF0ZSB0aGUgdHJhaW4gYW5kIHRlc3QgZGF0YSBzZXRzIGFuZCBydW4gdGhlIGxvZ2lzdGljIHJlZ3Jlc3Npb24gbW9kZWwuCgpgYGB7ciB9CiMgbG9naXNpdGljIHJlZ3Jlc3Npb24KCiMgc2V0IHVwIHRyYWlubmluZyBhbmQgdGVzdCBkYXRhIHNldHMKCmluZHggPSBzYW1wbGUoMTpucm93KGNyZWQpLCBhcy5pbnRlZ2VyKDAuOSpucm93KGNyZWQpKSkKI2luZHgKCmNyZWRfdHJhaW4gPSBjcmVkW2luZHgsXQpjcmVkX3Rlc3QgPSBjcmVkWy1pbmR4LF0KCmNyZWRfdHJhaW5fbGFiZWxzID0gY3JlZFtpbmR4LDE3XQpjcmVkX3Rlc3RfbGFiZWxzID0gY3JlZFstaW5keCwxN10gICAKCgoKIyBmaXQgdGhlIGxvZ2lzdGljIHJlZ3Jlc3Npb24gbW9kZWwsIHdpdGggYWxsIHByZWRpY3RvciB2YXJpYWJsZXMKCm1vZGVsIDwtIGdsbShkZWZhdWx0IH4uLGZhbWlseT1iaW5vbWlhbChsaW5rPSdsb2dpdCcpLGRhdGE9Y3JlZF90cmFpbikKbW9kZWwKCnN1bW1hcnkobW9kZWwpCgphbm92YShtb2RlbCwgdGVzdD0iQ2hpc3EiKQpgYGAKCldlIG5vdyB3YW50IHRvIGRyb3AgdGhlIGluc2lnbmlmaWNhbnQgcHJlZGljdG9ycyBmcm9tIHRoZSBhYm92ZSBvdXRwdXQuIFRoaXMgd2lsbCBoZWxwIHRvIGltcHJvdmUgdGhlIG92ZXJhbGwgcHJlZGljdGFiaWxpdHkgb2YgdGhlIG1vZGVsLgoKYGBge3IgfQojIGRyb3AgdGhlIGluc2lnbmlmaWNhbnQgcHJlZGljdG9ycywgYWxwaGEgPSAwLjEwCgptb2RlbCA8LSBnbG0oZGVmYXVsdCB+IGNoZWNraW5nX2JhbGFuY2UgKyBtb250aHNfbG9hbl9kdXJhdGlvbiArIGNyZWRpdF9oaXN0b3J5ICsgIHBlcmNlbnRfb2ZfaW5jb21lICsgYWdlLGZhbWlseT1iaW5vbWlhbChsaW5rPSdsb2dpdCcpLGRhdGE9Y3JlZF90cmFpbikKbW9kZWwKCnN1bW1hcnkobW9kZWwpCgphbm92YShtb2RlbCwgdGVzdD0iQ2hpc3EiKQpgYGAKCk5vdyB0aGF0IHdlIHdlIGhhdmUgb3VyIGZpbmFsIG1vZGVsLCB3ZSB3YW50IHRvIGNoZWNrIHRoZSBhY2N1cmFjeSBvZiB0aGUgbW9kZWwuCgpgYGB7ciB9CiMgY2hlY2sgQWNjdXJhY3kKCmZpdHRlZC5yZXN1bHRzIDwtIHByZWRpY3QobW9kZWwsbmV3ZGF0YT1jcmVkX3Rlc3QsdHlwZT0ncmVzcG9uc2UnKQpmaXR0ZWQucmVzdWx0cyA8LSBpZmVsc2UoZml0dGVkLnJlc3VsdHMgPiAwLjUsMSwwKQoKbWlzQ2xhc2lmaWNFcnJvciA8LSBtZWFuKGZpdHRlZC5yZXN1bHRzICE9IGNyZWRfdGVzdCRkZWZhdWx0KQpwcmludChwYXN0ZSgnQWNjdXJhY3knLDEtbWlzQ2xhc2lmaWNFcnJvcikpCgojIFJPQwojIEJlY2F1c2UgdGhpcyBkYXRhIHNldCBpcyBzbyBzbWFsbCwgaXQgaXMgcG9zc2libGUgdGhhdCB0aGUgdGVzdCBkYXRhIHNldAojIGRvZXMgbm90IGNvbnRhaW4gYm90aCAwIGFuZCAxIHZhbHVlcy4gIElmIHRoaXMgaGFwcGVucyB0aGUgY29kZSB3aWxsIG5vdAojIHJ1bi4gIEFuZCBzaW5jZSB0aGUgdGVzdCBkYXRhIHNldCBpcyBzbyBzbWFsbCB0aGUgUk9DIGlzIG5vdCB1c2VmdWwgaGVyZQojIGJ1dCB0aGUgY29kZSBpcyBwcm92aWRlZC4KCiNsaWJyYXJ5KFJPQ1IpCiNwIDwtIHByZWRpY3QobW9kZWwsIG5ld2RhdGE9Y3JlZGl0X3Rlc3QsIHR5cGU9InJlc3BvbnNlIikKI3ByIDwtIHByZWRpY3Rpb24ocCwgY3JlZGl0X3Rlc3QkZGVmYXVsdCkKI3ByZiA8LSBwZXJmb3JtYW5jZShwciwgbWVhc3VyZSA9ICJ0cHIiLCB4Lm1lYXN1cmUgPSAiZnByIikKI3Bsb3QocHJmKQoKI2F1YyA8LSBwZXJmb3JtYW5jZShwciwgbWVhc3VyZSA9ICJhdWMiKQojYXVjIDwtIGF1Y0B5LnZhbHVlc1tbMV1dCiNhdWMKYGBgCgpGcm9tIHRoZSBvdXRwdXQgd2Ugc2VlIHRoYXQgb3VyIGFjY3VyYWN5IG9mIHRoZSBtb2RlbCBpcyBgciBwcmludChwYXN0ZSgnQWNjdXJhY3knLDEtbWlzQ2xhc2lmaWNFcnJvcikpYC4gQW5kIHdlIGNhbiBzZWUgdGhhdCB0aGlzIHZlcnkgZ29vZC4gQnV0IGxldHMgc2VlIGlmIHVzaW5nIGEgdHJlZSBiYXNlIG1vZGVsIHdvdWxkIGdpdmUgYSBiZXR0ZXIgcHJlZGljdGlvbi4KCmBgYHtyIH0KIyMjI0NsYXNzaWZpY2F0aW9uIFRyZWVzIC0tLS0tLS0tLS0tLS0tLS0tLQoKIyNUcmFpbmluZyBhIG1vZGVsIG9uIHRoZSBkYXRhIC0tLS0KIyByZWdyZXNzaW9uIHRyZWUgdXNpbmcgcnBhcnQKbGlicmFyeShycGFydCkKbS5ycGFydCA8LSBycGFydChkZWZhdWx0IH4gLiwgZGF0YSA9IGNyZWRfdHJhaW4pCgojIGdldCBiYXNpYyBpbmZvcm1hdGlvbiBhYm91dCB0aGUgdHJlZQptLnJwYXJ0CgojIGdldCBtb3JlIGRldGFpbGVkIGluZm9ybWF0aW9uIGFib3V0IHRoZSB0cmVlCnN1bW1hcnkobS5ycGFydCkKCiMgdXNlIHRoZSBycGFydC5wbG90IHBhY2thZ2UgdG8gY3JlYXRlIGEgdmlzdWFsaXphdGlvbgpsaWJyYXJ5KHJwYXJ0LnBsb3QpCgojIGEgYmFzaWMgZGVjaXNpb24gdHJlZSBkaWFncmFtCnJwYXJ0LnBsb3QobS5ycGFydCwgZGlnaXRzID0gMykKCiMgYSBmZXcgYWRqdXN0bWVudHMgdG8gdGhlIGRpYWdyYW0KcnBhcnQucGxvdChtLnJwYXJ0LCBkaWdpdHMgPSA0LCBmYWxsZW4ubGVhdmVzID0gVFJVRSwgdHlwZSA9IDMsIGV4dHJhID0gMTAxKQpgYGAKCkFib3ZlIHdlIGNhbiBzZWUgdGhhdCBvdXRwdXQgb2YgdGhlIHRyZWUgYmFzZSBtb2RlbC4gQW5kIHdlIHNlZSB0aGF0IHRoZSBzYW1lIHByZWRpY3RvcnMgd2UgZm91bmQgcmVsZXZhbnQgaW4gdGhlIGxvZ2lzdGljIG1vZGVsIGFyZSBhbHNvIGRlZW0gaW1wb3J0YW50IGluIHRoZSB0cmVlIGJhc2UgbW9kZWwgYXMgd2VsbC4gQnV0IG5vdyBsZXRzIHNlZSB3aGF0IGFjY3VyYWN5IG9mIHRoaXMgbW9kZWwgd2VsbCBiZS4KCmBgYHtyIH0KIyMgU3RlcCA0OiBFdmFsdWF0ZSBtb2RlbCBwZXJmb3JtYW5jZSAtLS0tCgojIGdlbmVyYXRlIHByZWRpY3Rpb25zIGZvciB0aGUgdGVzdGluZyBkYXRhc2V0CnAucnBhcnQgPC0gcHJlZGljdChtLnJwYXJ0LCBjcmVkX3Rlc3QpCgojIGNvbXBhcmUgdGhlIGRpc3RyaWJ1dGlvbiBvZiBwcmVkaWN0ZWQgdmFsdWVzIHZzLiBhY3R1YWwgdmFsdWVzCnN1bW1hcnkocC5ycGFydCkKc3VtbWFyeShjcmVkX3Rlc3QkZGVmYXVsdCkKCiMgY29tcGFyZSB0aGUgY29ycmVsYXRpb24KY29yKHAucnBhcnQsIGFzLm51bWVyaWMoY3JlZF90ZXN0JGRlZmF1bHQpKQoKIyBmdW5jdGlvbiB0byBjYWxjdWxhdGUgdGhlIG1lYW4gYWJzb2x1dGUgZXJyb3IKTUFFIDwtIGZ1bmN0aW9uKGFjdHVhbCwgcHJlZGljdGVkKSB7CiAgbWVhbihhYnMoYWN0dWFsIC0gcHJlZGljdGVkKSkgIAp9CgojIG1lYW4gYWJzb2x1dGUgZXJyb3IgYmV0d2VlbiBwcmVkaWN0ZWQgYW5kIGFjdHVhbCB2YWx1ZXMKTUFFKHAucnBhcnQsIGFzLm51bWVyaWMoY3JlZF90ZXN0JGRlZmF1bHQpKQoKCiMgbWVhbiBhYnNvbHV0ZSBlcnJvciBiZXR3ZWVuIGFjdHVhbCB2YWx1ZXMgYW5kIG1lYW4gdmFsdWUKbWVhbihhcy5udW1lcmljKGNyZWRfdHJhaW4kZGVmYXVsdCkpIApNQUUoMS4zMCwgYXMubnVtZXJpYyhjcmVkX3RyYWluJGRlZmF1bHQpKQoKYGBgCgoKCg==

STAT 6620 HW 6 Part 1

George Mason

Logistic Regression

Space Shuttle Launch Data

Logistic Regression and CART with Credit Dataset