Executive Summary

Background

Boston Housing - (Problem #1) We have worked with the Boston housing dataset many times, both in class and in labs. The dataset, found in the MASS package, records the median value of owner-occupied homes along with neighborhood factors that may affect that value.

German Credit Data - (Problem #2) This dataset contains information about 1,000 loan applicants, each labeled as a good or bad credit risk. The response variable is the risk label, originally coded 1 = good and 2 = bad.

Approach

Boston Housing - (Problem #1) Randomly sample a training set containing 70% of the original data points. With this sample, fit a linear model, tree models, GAM models, and a neural network. After examining in-sample performance, examine out-of-sample performance as well, and compare the results to find the better model.

German Credit Data - (Problem #2) Randomly sample a training set containing 70% of the original data points. With this sample, fit a logistic regression, a classification tree, generalized additive models, and a neural network.

Major Findings

Boston Housing - (Problem #1) In the model fit on the training set, indus, age, and crim were not significant. The first tree model needed pruning: the cp plot pointed to a tree of about 7 terminal nodes, and the root node error was 91.18. A larger tree was then grown and pruned at a cp value of 0.009. In the GAM, I did not wrap chas and rad in s(), leaving them as parametric terms. The variables with edf closest to 1 were zn, age, and ptratio; since only ptratio had a significant p-value, the model was refit with the near-linear terms entered linearly while ptratio kept its smooth. Finally, a neural network was fit on rescaled data; its test-set MSE was 13.7965.

German Credit Data - (Problem #2) The German credit data were first fit with a logistic regression (logit link), which had a BIC of 552.0723. Using a 1/6 cutoff, the misclassification rate was 0.2627; the accompanying table shows true versus predicted values. A classification tree was then built with an asymmetric loss matrix to further understand the data. In the GAM, the smooth terms for duration and amount were significant, with age marginal; these smooths were plotted before turning to the neural network.

Splitting Boston data into 70:30 training and testing

library(MASS)
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
set.seed(08331816)
# randomly sample 70% of the row indices for the training set
sample_index <- sample(nrow(Boston),nrow(Boston)*0.70)

Boston_train <- Boston[sample_index,]

Boston_test <- Boston[-sample_index,]

Fitting a linear regression (a Gaussian GLM, fit here with lm())

glm1 <- lm(medv~., data = Boston_train)
glm1_summary <- summary(glm1)
glm1_summary
## 
## Call:
## lm(formula = medv ~ ., data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.7576  -2.8794  -0.7294   2.0064  24.2091 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.950719   6.284196   6.198 1.65e-09 ***
## crim         -0.065527   0.057485  -1.140 0.255135    
## zn            0.043956   0.017712   2.482 0.013559 *  
## indus         0.025602   0.077966   0.328 0.742831    
## chas          3.263081   1.038597   3.142 0.001827 ** 
## nox         -17.451522   4.729293  -3.690 0.000261 ***
## rm            3.523302   0.502619   7.010 1.29e-11 ***
## age           0.001765   0.016702   0.106 0.915904    
## dis          -1.586468   0.253926  -6.248 1.24e-09 ***
## rad           0.284664   0.082330   3.458 0.000614 ***
## tax          -0.011169   0.004526  -2.468 0.014078 *  
## ptratio      -0.928464   0.170104  -5.458 9.29e-08 ***
## black         0.009381   0.003285   2.855 0.004564 ** 
## lstat        -0.620984   0.065504  -9.480  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.075 on 340 degrees of freedom
## Multiple R-squared:  0.7287, Adjusted R-squared:  0.7183 
## F-statistic: 70.25 on 13 and 340 DF,  p-value: < 2.2e-16

Reviewing the model, indus, age, and crim are not statistically significant, so I next build a model without these variables.

glm2 <- lm(medv~. -crim -age -indus,data = Boston_train)
glm2_summary <- summary(glm2)
glm2_summary
## 
## Call:
## lm(formula = medv ~ . - crim - age - indus, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6647  -2.7519  -0.6885   2.1961  24.4741 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.301148   6.230022   6.148 2.18e-09 ***
## zn            0.041031   0.017394   2.359 0.018890 *  
## chas          3.376261   1.025091   3.294 0.001092 ** 
## nox         -16.541834   4.369230  -3.786 0.000181 ***
## rm            3.542833   0.489793   7.233 3.10e-12 ***
## dis          -1.585594   0.230455  -6.880 2.85e-11 ***
## rad           0.242366   0.073052   3.318 0.001005 ** 
## tax          -0.010245   0.004037  -2.537 0.011609 *  
## ptratio      -0.909898   0.167416  -5.435 1.04e-07 ***
## black         0.009642   0.003263   2.955 0.003340 ** 
## lstat        -0.634020   0.059720 -10.617  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.064 on 343 degrees of freedom
## Multiple R-squared:  0.7275, Adjusted R-squared:  0.7196 
## F-statistic: 91.58 on 10 and 343 DF,  p-value: < 2.2e-16
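
For later comparison with the tree, GAM, and neural network fits, the in-sample MSE and out-of-sample MSPE of this reduced model can be computed directly. A minimal sketch (these lines were not part of the original run, so no output is shown):

# In-sample MSE and out-of-sample MSPE of the reduced linear model
mean((predict(glm2) - Boston_train$medv)^2)
mean((predict(glm2, newdata = Boston_test) - Boston_test$medv)^2)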

Tree models

library(rpart)
library(rpart.plot)
Boston_tree <- rpart(formula = medv~., data=Boston_train)
Boston_tree
## n= 354 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 354 32276.05000 22.85932  
##    2) rm< 6.945 299 12906.51000 20.04214  
##      4) lstat>=14.395 127  2570.16900 15.19764  
##        8) nox>=0.607 81  1102.95800 13.20494  
##         16) lstat>=19.72 39   378.14310 10.73077 *
##         17) lstat< 19.72 42   264.38980 15.50238 *
##        9) nox< 0.607 46   579.20800 18.70652 *
##      5) lstat< 14.395 172  5154.96700 23.61919  
##       10) dis>=1.6024 165  2269.27200 22.98121  
##         20) rm< 6.5445 127  1033.49600 21.75354 *
##         21) rm>=6.5445 38   404.65050 27.08421 *
##       11) dis< 1.6024 7  1235.55700 38.65714 *
##    3) rm>=6.945 55  4095.94400 38.17455  
##      6) rm< 7.445 30  1119.17400 32.74333 *
##      7) rm>=7.445 25  1029.89800 44.69200  
##       14) ptratio>=16.15 12   609.48920 40.94167 *
##       15) ptratio< 16.15 13    95.83231 48.15385 *
prp(Boston_tree,digits = 4, extra = 1)

plotcp(Boston_tree)

Looking at the complexity-parameter plot, the cross-validated relative error flattens out at a tree size of about 7 terminal nodes, so that is a reasonable size to prune to.

Prediction using regression tree and MSE

# fitted values on the training set and predictions on the test set
boston_train_prediction <- predict(Boston_tree)
boston_test_prediction <- predict(Boston_tree,Boston_test)
# root node error: total sum of squares of medv divided by n
sum((Boston_train$medv - mean(Boston_train$medv))^2)/nrow(Boston_train)
## [1] 91.17529

This is the root node error, 91.18: the mean squared deviation of medv from its training-set mean, i.e. the error of the trivial tree that predicts the overall mean for every observation.
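
The fitted and test-set predictions computed above can be summarized the same way. A sketch (output omitted, since these lines were not part of the original run):

# In-sample and out-of-sample MSE of the regression tree
mean((Boston_train$medv - boston_train_prediction)^2)
mean((Boston_test$medv - boston_test_prediction)^2)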

Pruning

boston.largetree <- rpart(formula = medv ~ ., data = Boston_train, cp = 0.001)
prp(boston.largetree)

plotcp(boston.largetree)

We can further prune the tree.

pruned_tree <- prune(boston.largetree, cp=0.009)
pruned_tree
## n= 354 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 354 32276.05000 22.85932  
##    2) rm< 6.945 299 12906.51000 20.04214  
##      4) lstat>=14.395 127  2570.16900 15.19764  
##        8) nox>=0.607 81  1102.95800 13.20494  
##         16) lstat>=19.72 39   378.14310 10.73077 *
##         17) lstat< 19.72 42   264.38980 15.50238 *
##        9) nox< 0.607 46   579.20800 18.70652 *
##      5) lstat< 14.395 172  5154.96700 23.61919  
##       10) dis>=1.6024 165  2269.27200 22.98121  
##         20) rm< 6.5445 127  1033.49600 21.75354 *
##         21) rm>=6.5445 38   404.65050 27.08421 *
##       11) dis< 1.6024 7  1235.55700 38.65714 *
##    3) rm>=6.945 55  4095.94400 38.17455  
##      6) rm< 7.445 30  1119.17400 32.74333 *
##      7) rm>=7.445 25  1029.89800 44.69200  
##       14) ptratio>=16.15 12   609.48920 40.94167 *
##       15) ptratio< 16.15 13    95.83231 48.15385 *
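
To check whether pruning hurt predictive accuracy, the pruned tree can be scored on the test set as well. A minimal sketch (the variable name is my own):

# Out-of-sample MSE of the pruned tree
pruned_prediction <- predict(pruned_tree, Boston_test)
mean((Boston_test$medv - pruned_prediction)^2)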

Generalized Additive Models

In this model I did not wrap chas and rad in s(): chas is a 0/1 dummy variable and rad is an integer index, so both are left as parametric terms.

library(mgcv)
Boston_gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + s(dis) +  s(tax) + s(ptratio) + s(black) + s(lstat) + chas + rad, data = Boston_train)
summary(Boston_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + s(nox) + s(rm) + s(age) + 
##     s(dis) + s(tax) + s(ptratio) + s(black) + s(lstat) + chas + 
##     rad
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.9869     1.2448  15.253  < 2e-16 ***
## chas          1.6589     0.7513   2.208  0.02799 *  
## rad           0.3835     0.1260   3.045  0.00254 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    2.579  3.193  5.205  0.00149 ** 
## s(zn)      1.000  1.000  0.496  0.48159    
## s(indus)   6.677  7.716  3.022  0.00353 ** 
## s(nox)     9.000  9.000 11.340 8.79e-16 ***
## s(rm)      8.038  8.754 15.524  < 2e-16 ***
## s(age)     1.000  1.000  0.993  0.31987    
## s(dis)     8.806  8.986  7.948 1.45e-10 ***
## s(tax)     2.711  3.299  8.628 8.81e-06 ***
## s(ptratio) 1.000  1.000 21.920 4.25e-06 ***
## s(black)   1.558  1.918  1.596  0.26657    
## s(lstat)   7.540  8.454 18.248  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.881   Deviance explained = 89.8%
## GCV = 12.818  Scale est. = 10.903    n = 354

To see which variables have an approximately linear relationship with medv, we look for edf values closest to 1. Here zn, age, and ptratio all have an edf of exactly 1 (and black is close, at 1.56). Of these, only ptratio has a significant p-value, so it keeps its smooth term; zn, age, and black are entered as ordinary linear terms in the next model.

# Refit, entering the near-linear terms (zn, age, black) as parametric terms
Boston_gam <- gam(medv ~ s(crim) + (zn) + s(indus) + s(nox) + s(rm) + (age) + s(dis) +  s(tax) + s(ptratio) + (black) + s(lstat) + chas + rad, data = Boston_train)
summary(Boston_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + (zn) + s(indus) + s(nox) + s(rm) + (age) + s(dis) + 
##     s(tax) + s(ptratio) + (black) + s(lstat) + chas + rad
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.676194   1.817860  10.274  < 2e-16 ***
## zn           0.012729   0.018979   0.671  0.50292    
## age         -0.014156   0.013864  -1.021  0.30805    
## black        0.003274   0.002441   1.341  0.18089    
## chas         1.654118   0.752288   2.199  0.02865 *  
## rad          0.383287   0.126186   3.037  0.00259 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    2.731  3.374  5.082  0.00144 ** 
## s(indus)   6.670  7.712  2.972  0.00408 ** 
## s(nox)     9.000  9.000 11.385 7.47e-16 ***
## s(rm)      7.946  8.709 15.536  < 2e-16 ***
## s(dis)     8.821  8.988  8.054 9.69e-11 ***
## s(tax)     2.713  3.302  8.618 8.92e-06 ***
## s(ptratio) 1.000  1.000 22.365 3.42e-06 ***
## s(lstat)   7.455  8.398 18.110  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.88   Deviance explained = 89.8%
## GCV = 12.832  Scale est. = 10.935    n = 354

Plotting GAM

plot(Boston_gam, shade = TRUE, seWithMean = TRUE, scale = 0)

Neural Networks

library(MASS)
# scale all variables to [0, 1]; neural networks are sensitive to variable scale
maxs <- apply(Boston, 2, max) 
mins <- apply(Boston, 2, min)

scaled <- as.data.frame(scale(Boston, center = mins, scale = maxs - mins))
# note: this block draws its own 75:25 split of the scaled data
index <- sample(1:nrow(Boston),round(0.75*nrow(Boston)))

train_ <- scaled[index,]
test_ <- scaled[-index,]

library(neuralnet)
# build the formula explicitly; neuralnet() has historically not accepted the '~ .' shorthand
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
# two hidden layers with 5 and 3 neurons; linear output for regression
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
plot(nn)
pr.nn <- compute(nn,test_[,1:13])

# rescale predictions and actual values back to the original medv units
pr.nn_ <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
test.r <- (test_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)

# MSE of testing set
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
MSE.nn
## [1] 13.7965
# In-sample MSE of the final GAM model
(Boston_gam_mse <- mean((predict(Boston_gam) - Boston_train$medv) ^ 2))
## [1] 9.318212
# Out-of-sample MSPE of the final GAM model
(Boston_gam_mspe <- mean((predict(Boston_gam, newdata = Boston_test) - Boston_test$medv) ^ 2))
## [1] 9.362403

German Credit Data

Loading the dataset and recoding the response

#Read German Credit Scoring Data
german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")

#column names
colnames(german_credit)=c("chk_acct","duration","credit_his","purpose","amount","saving_acct","present_emp","installment_rate","sex","other_debtor","present_resid","property","age","other_install","housing","n_credits","job","n_people","telephone","foreign","response")
# original response coding: 1 = good, 2 = bad
# we need 0 = good, 1 = bad
german_credit$response = german_credit$response - 1
# change the datatype of the 'response' variable to factor
german_credit$response <- as.factor(german_credit$response)
str(german_credit)
## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ purpose         : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ amount          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ saving_acct     : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ present_emp     : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ installment_rate: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ other_debtor    : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ present_resid   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ housing         : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ n_credits       : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ n_people        : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign         : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ response        : Factor w/ 2 levels "0","1": 1 2 1 1 2 1 1 1 1 2 ...

Splitting German Credit data into 70:30 training and testing set

sample_index_2 <- sample(nrow(german_credit),nrow(german_credit)*0.70)
german_credit_train <- german_credit[sample_index_2,]
german_credit_test <- german_credit[-sample_index_2,]
#Generalized linear model with logit
german_credit_logit <- glm(response~., family = binomial(link = "logit"), data = german_credit_train)
summary(german_credit_logit)
## 
## Call:
## glm(formula = response ~ ., family = binomial(link = "logit"), 
##     data = german_credit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5869  -0.5667  -0.2260   0.5103   2.5113  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.385e-01  2.018e+00  -0.366  0.71436    
## chk_acctA12        6.460e-01  4.146e-01   1.558  0.11916    
## chk_acctA13       -1.112e+00  8.138e-01  -1.367  0.17163    
## chk_acctA14       -1.967e+00  4.844e-01  -4.061 4.88e-05 ***
## duration           3.398e-02  1.737e-02   1.956  0.05047 .  
## credit_hisA31      1.111e+00  1.356e+00   0.819  0.41250    
## credit_hisA32     -2.712e-01  9.330e-01  -0.291  0.77133    
## credit_hisA33     -6.125e-01  9.736e-01  -0.629  0.52926    
## credit_hisA34     -1.504e+00  9.713e-01  -1.548  0.12162    
## purposeA41        -1.783e+00  7.405e-01  -2.408  0.01603 *  
## purposeA410       -2.490e+00  1.558e+00  -1.598  0.10997    
## purposeA42        -6.424e-01  5.014e-01  -1.281  0.20015    
## purposeA43         1.792e-01  4.768e-01   0.376  0.70706    
## purposeA44        -1.492e+01  1.111e+03  -0.013  0.98928    
## purposeA45        -7.812e-01  9.375e-01  -0.833  0.40466    
## purposeA46         1.148e+00  7.319e-01   1.568  0.11687    
## purposeA48        -1.423e+01  1.676e+03  -0.008  0.99323    
## purposeA49        -4.435e-01  6.643e-01  -0.668  0.50441    
## amount             1.573e-04  8.295e-05   1.896  0.05793 .  
## saving_acctA62    -1.113e+00  6.159e-01  -1.808  0.07061 .  
## saving_acctA63     4.168e-01  7.581e-01   0.550  0.58243    
## saving_acctA64    -1.029e+00  7.756e-01  -1.327  0.18465    
## saving_acctA65    -1.291e+00  5.445e-01  -2.371  0.01774 *  
## present_empA72    -1.473e+00  8.222e-01  -1.792  0.07312 .  
## present_empA73    -6.420e-01  7.844e-01  -0.819  0.41306    
## present_empA74    -1.797e+00  8.732e-01  -2.057  0.03966 *  
## present_empA75    -8.024e-01  7.875e-01  -1.019  0.30820    
## installment_rate   5.678e-01  1.816e-01   3.126  0.00177 ** 
## sexA92            -4.520e-01  7.540e-01  -0.599  0.54885    
## sexA93            -1.858e+00  7.496e-01  -2.478  0.01320 *  
## sexA94            -8.945e-01  8.226e-01  -1.087  0.27683    
## other_debtorA102   3.898e-01  8.341e-01   0.467  0.64027    
## other_debtorA103  -2.052e+00  9.067e-01  -2.263  0.02365 *  
## present_resid      1.297e-01  1.756e-01   0.739  0.46018    
## propertyA122       8.917e-01  5.196e-01   1.716  0.08610 .  
## propertyA123       4.951e-01  4.575e-01   1.082  0.27919    
## propertyA124       1.684e+00  8.329e-01   2.022  0.04315 *  
## age               -2.826e-02  1.906e-02  -1.483  0.13807    
## other_installA142  5.382e-01  8.498e-01   0.633  0.52653    
## other_installA143 -5.808e-01  5.228e-01  -1.111  0.26657    
## housingA152        8.955e-02  4.682e-01   0.191  0.84832    
## housingA153       -1.294e+00  9.401e-01  -1.376  0.16872    
## n_credits          6.883e-01  3.656e-01   1.883  0.05975 .  
## jobA172           -4.720e-01  1.154e+00  -0.409  0.68244    
## jobA173           -6.501e-01  1.092e+00  -0.595  0.55179    
## jobA174           -3.559e-01  1.053e+00  -0.338  0.73534    
## n_people           8.536e-01  5.370e-01   1.590  0.11191    
## telephoneA192     -4.894e-01  4.031e-01  -1.214  0.22468    
## foreignA202        1.585e-01  1.045e+00   0.152  0.87947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 428.70  on 353  degrees of freedom
## Residual deviance: 264.48  on 305  degrees of freedom
## AIC: 362.48
## 
## Number of Fisher Scoring iterations: 15
BIC(german_credit_logit)
## [1] 552.0723

Misclassification

The 1/6 cutoff reflects an asymmetric 5:1 cost ratio: classifying a truly bad applicant as good is treated as five times as costly as the reverse, and the cost-minimizing cutoff is 1/(1+5) = 1/6.

# predicted default probabilities on the training set
pred_glm_gtrain_glm0 <- predict(german_credit_logit, type = "response")
pcut <- 1/6  # cost-based cutoff: 1/(1+5)
class_pred_glm0 <- (pred_glm_gtrain_glm0 > pcut)*1
mis_rate <- mean(german_credit_train$response!=class_pred_glm0)
mis_rate
## [1] 0.2627119
table(german_credit_train$response,class_pred_glm0, dnn = c("True","Predicted"))
##     Predicted
## True   0   1
##    0 166  84
##    1   9  95
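
Because the cutoff encodes a 5:1 cost ratio, the weighted cost is a natural companion to the raw misclassification rate. A sketch (the cost calculation here is my own addition):

# average asymmetric cost: a bad applicant predicted good costs 5, the reverse costs 1
obs <- as.numeric(as.character(german_credit_train$response))
mean(ifelse(obs == 1 & class_pred_glm0 == 0, 5,
            ifelse(obs == 0 & class_pred_glm0 == 1, 1, 0)))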

Classification Tree

german_credit_rpart <- rpart(formula = response ~ ., data = german_credit_train, method = "class", parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))
pred0<- predict(german_credit_rpart, type="class")
table(german_credit_train$response, pred0, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0 157  93
##    1   1 103
prp(german_credit_rpart,extra = 1)

# note: this refits a new tree on the test data rather than scoring the test set
# with the training-set tree; see the sketch below for the usual out-of-sample check
german_credit_rpart_test <- rpart(formula = response ~ ., data = german_credit_test, method = "class", parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))
pred1<- predict(german_credit_rpart_test, type="class")
table(german_credit_test$response, pred1, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0 265 185
##    1   5 191
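
The usual out-of-sample check scores the test set with the tree fitted on the training data instead of refitting. A minimal sketch (output omitted, since this was not part of the original run):

# score the test set with the training-set tree
pred_test_cart <- predict(german_credit_rpart, german_credit_test, type = "class")
table(german_credit_test$response, pred_test_cart, dnn = c("True", "Pred"))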
german_credit_rpart
## n= 354 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 354 250 1 (0.70621469 0.29378531)  
##     2) chk_acct=A13,A14 160  80 0 (0.90000000 0.10000000)  
##       4) other_install=A143 137  40 0 (0.94160584 0.05839416)  
##         8) age>=30.5 85   5 0 (0.98823529 0.01176471) *
##         9) age< 30.5 52  35 0 (0.86538462 0.13461538)  
##          18) purpose=A41,A42,A45 17   0 0 (1.00000000 0.00000000) *
##          19) purpose=A40,A43,A46,A49 35  28 1 (0.80000000 0.20000000)  
##            38) credit_his=A32 22  10 0 (0.90909091 0.09090909)  
##              76) present_emp=A72,A73,A75 15   0 0 (1.00000000 0.00000000) *
##              77) present_emp=A71,A74 7   5 1 (0.71428571 0.28571429) *
##            39) credit_his=A31,A33,A34 13   8 1 (0.61538462 0.38461538) *
##       5) other_install=A141,A142 23  15 1 (0.65217391 0.34782609) *
##     3) chk_acct=A11,A12 194 106 1 (0.54639175 0.45360825)  
##       6) purpose=A41,A410,A44,A48 28  23 1 (0.82142857 0.17857143)  
##        12) amount< 3155 13   0 0 (1.00000000 0.00000000) *
##        13) amount>=3155 15  10 1 (0.66666667 0.33333333) *
##       7) purpose=A40,A42,A43,A45,A46,A49 166  83 1 (0.50000000 0.50000000)  
##        14) property=A121 44  32 1 (0.72727273 0.27272727)  
##          28) duration< 22.5 35  29 1 (0.82857143 0.17142857)  
##            56) sex=A93,A94 26  10 0 (0.92307692 0.07692308)  
##             112) saving_acct=A61,A62,A64 19   0 0 (1.00000000 0.00000000) *
##             113) saving_acct=A63,A65 7   5 1 (0.71428571 0.28571429) *
##            57) sex=A91,A92 9   5 1 (0.55555556 0.44444444) *
##          29) duration>=22.5 9   3 1 (0.33333333 0.66666667) *
##        15) property=A122,A123,A124 122  51 1 (0.41803279 0.58196721)  
##          30) present_emp=A74 21  16 1 (0.76190476 0.23809524)  
##            60) credit_his=A30,A33,A34 9   0 0 (1.00000000 0.00000000) *
##            61) credit_his=A31,A32 12   7 1 (0.58333333 0.41666667) *
##          31) present_emp=A71,A72,A73,A75 101  35 1 (0.34653465 0.65346535) *

Generalized Additive Models

gam_formula <- response ~ s(duration) + s(amount) + s(age)

fitted_gam <- gam(formula = gam_formula, family = binomial, 
              data = german_credit_train)
summary(fitted_gam)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## response ~ s(duration) + s(amount) + s(age)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.9556     0.1262  -7.575 3.59e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df Chi.sq  p-value    
## s(duration) 1.550  1.928 13.382 0.000858 ***
## s(amount)   3.977  4.939 12.832 0.027785 *  
## s(age)      1.000  1.000  3.281 0.070117 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0954   Deviance explained = 8.95%
## UBRE = 0.14511  Scale est. = 1         n = 354
plot(fitted_gam, shade = TRUE,seWithMean = TRUE, scale = 0)
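
The GAM can also be evaluated out of sample with the same 1/6 cutoff used for the logit model. A sketch (no output shown, as these lines were not part of the original run):

# out-of-sample misclassification rate of the GAM at the 1/6 cutoff
prob_gam_test <- predict(fitted_gam, newdata = german_credit_test, type = "response")
pred_gam_test <- (prob_gam_test > pcut)*1
mean(german_credit_test$response != pred_gam_test)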

Neural Network

First we build the neural network model, using caret's train() with method = "nnet".

library(nnet)
library(caret)
library(NeuralNetTools)

Plotting the fitted network

german_nnet <- train(response ~., data=german_credit_train, method = "nnet")
plotnet(german_nnet)

In-sample and out-of-sample performance

prob_nnet= predict(german_nnet,type='prob')
pred_nnet = (prob_nnet[,2] >=pcut)*1
table(german_credit_train$response,pred_nnet, dnn=c("Observed","Predicted"))
##         Predicted
## Observed   1
##        0 250
##        1 104
prob_nnet_test= predict(german_nnet,german_credit_test,type='prob')
pred_nnet_test = as.numeric(prob_nnet_test[,2] > pcut)
table(german_credit_test$response,pred_nnet_test, dnn=c("Observed","Predicted"))
##         Predicted
## Observed   1
##        0 450
##        1 196
mean(ifelse(german_credit_test$response != pred_nnet_test, 1, 0))
## [1] 0.6965944

With the 1/6 cutoff, the network classifies every test observation as bad, so the out-of-sample misclassification rate is simply the share of truly good applicants in the test set (450/646 ≈ 0.697).