Homework 7

library(tidyverse)
library(ggplot2)
library(randomForest)
library(tree)
library(caret)

Question 10.1

Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you can using

(a) a regression tree model, and

(b) a random forest model.

In R, you can use the tree package or the rpart package, and the randomForest package. For each model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t just stop when you have a good model, but interpret it too).

setwd("C:/Users/tarra/Documents/Intro to Analytics Modeling/Fall2020hw7/data 10.1")

uscrime_data <- read.table("uscrime.txt", header = TRUE)

# Build the regression tree model
uscrime_treemod <- tree(Crime~., data = uscrime_data)
summary(uscrime_treemod)

## 
## Regression tree:
## tree(formula = Crime ~ ., data = uscrime_data)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF"  "NW" 
## Number of terminal nodes:  7 
## Residual mean deviance:  47390 = 1896000 / 40 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -573.900  -98.300   -1.545    0.000  110.600  490.100

# Let's see how exactly the tree was split
uscrime_treemod$frame

##       var  n        dev      yval splits.cutleft splits.cutright
## 1     Po1 47 6880927.66  905.0851          <7.65           >7.65
## 2     Pop 23  779243.48  669.6087          <22.5           >22.5
## 4      LF 12  243811.00  550.5000        <0.5675         >0.5675
## 8  <leaf>  7   48518.86  466.8571                               
## 9  <leaf>  5   77757.20  667.6000                               
## 5  <leaf> 11  179470.73  799.5455                               
## 3      NW 24 3604162.50 1130.7500          <7.65           >7.65
## 6     Pop 10  557574.90  886.9000          <21.5           >21.5
## 12 <leaf>  5  146390.80 1049.2000                               
## 13 <leaf>  5  147771.20  724.6000                               
## 7     Po1 14 2027224.93 1304.9286          <9.65           >9.65
## 14 <leaf>  6  170828.00 1041.0000                               
## 15 <leaf>  8 1124984.88 1502.8750

plot(uscrime_treemod)
text(uscrime_treemod)

# Now let's prune the tree
termnodes <- 5
prune_uscrime_treemod <- prune.tree(uscrime_treemod, best = termnodes)
plot(prune_uscrime_treemod)
text(prune_uscrime_treemod)

summary(prune_uscrime_treemod)

## 
## Regression tree:
## snip.tree(tree = uscrime_treemod, nodes = c(4L, 6L))
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "NW" 
## Number of terminal nodes:  5 
## Residual mean deviance:  54210 = 2277000 / 42 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -573.9  -107.5    15.5     0.0   122.8   490.1

# pruning the tree is showing that it's probably a worse fit, and there might be some overfitting of the model. We should just go ahead and use the 7 nodes as it has less error. 

termnodes <- 7
prune_uscrime_treemod2 <- prune.tree(uscrime_treemod, best = termnodes)
plot(prune_uscrime_treemod2)
text(prune_uscrime_treemod2)

summary(prune_uscrime_treemod2)

## 
## Regression tree:
## tree(formula = Crime ~ ., data = uscrime_data)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF"  "NW" 
## Number of terminal nodes:  7 
## Residual mean deviance:  47390 = 1896000 / 40 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -573.900  -98.300   -1.545    0.000  110.600  490.100

# Calculate quaility of fit
uscrime_treepred <- predict(prune_uscrime_treemod2, data = uscrime_data[,1:15])
RSS <- sum((uscrime_treepred - uscrime_data[,16])^2)
TSS <- sum((uscrime_data[,16] - mean(uscrime_data[,16]))^2)
R2 <- 1 - RSS/TSS
R2

## [1] 0.7244962

Our R^2, at 0.7245 is high. We can also see that over fitting may be a problem with this model.

# Let's move onto the random forest model
# creating the baseline model. 
uscrime_rf <- randomForest(Crime ~ ., data=uscrime_data, importance = TRUE, nodesize = 5)
uscrime_rfpred <- predict(uscrime_rf, data=uscrime_data[,-16])
RSS <- sum((uscrime_rfpred - uscrime_data[,16])^2)
R2 <- 1 - RSS/TSS
R2

## [1] 0.4122103

With the random forest model, we get a worse R^2. With this information we can see that the random forest model is not as accurate as the tree.

Question 10.2

Describe a situation or problem from your job, everyday life, current events, etc., for which a logistic regression model would be appropriate. List some (up to 5) predictors that you might use.

My close friend works in IT security at The University of Alabama at Birmingham. One of the biggest problems they deal with is spam emails. Logistic regression can be used to tell whether or not an email is a spam. Some identifies could be the email sender, occurrences of certain key words, and occurrences of special links in the email.

Question 10.3

1.Using the GermanCredit data set germancredit.txt from http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german / (description at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 ), use logistic regression to find a good predictive model for whether credit applicants are good credit risks or not. Show your model (factors used and their coefficients), the software output, and the quality of fit. You can use the glm function in R. To get a logistic regression (logit) model on data where the response is either zero or one, use family=binomial(link=“logit”) in your glm function call.

2. Because the model gives a result between 0 and 1, it requires setting a threshold probability to separate between “good” and “bad” answers. In this data set, they estimate that incorrectly identifying a bad customer as good, is 5 times worse than incorrectly classifying a good customer as bad. Determine a good threshold probability based on your model.

setwd("C:/Users/tarra/Documents/Intro to Analytics Modeling/Fall2020hw7/data 10.3")

germ_cred <- read.table("germancredit.txt", header = FALSE)

#with this we can see what holds what. V21 holds the responses. 
str(germ_cred)

## 'data.frame':    1000 obs. of  21 variables:
##  $ V1 : chr  "A11" "A12" "A14" "A11" ...
##  $ V2 : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ V3 : chr  "A34" "A32" "A34" "A32" ...
##  $ V4 : chr  "A43" "A43" "A46" "A42" ...
##  $ V5 : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ V6 : chr  "A65" "A61" "A61" "A61" ...
##  $ V7 : chr  "A75" "A73" "A74" "A74" ...
##  $ V8 : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ V9 : chr  "A93" "A92" "A93" "A93" ...
##  $ V10: chr  "A101" "A101" "A101" "A103" ...
##  $ V11: int  4 2 3 4 4 4 4 2 4 2 ...
##  $ V12: chr  "A121" "A121" "A121" "A122" ...
##  $ V13: int  67 22 49 45 53 35 53 35 61 28 ...
##  $ V14: chr  "A143" "A143" "A143" "A143" ...
##  $ V15: chr  "A152" "A152" "A152" "A153" ...
##  $ V16: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ V17: chr  "A173" "A173" "A172" "A173" ...
##  $ V18: int  1 1 2 2 2 2 1 1 1 1 ...
##  $ V19: chr  "A192" "A191" "A191" "A191" ...
##  $ V20: chr  "A201" "A201" "A201" "A201" ...
##  $ V21: int  1 2 1 1 2 1 1 1 1 2 ...

table(germ_cred$V21)

## 
##   1   2 
## 700 300

#we want to replace 1 and 2 with 0 and 1 
germ_cred$V21[germ_cred$V21==1] <- 0
germ_cred$V21[germ_cred$V21==2] <- 1

#what we want to do is measure the model using truePositive + trueNegative /(total predictions + actual values)
#so lets split the data into a training set and a validation set

germcredit_split <- createDataPartition(germ_cred$V21, times = 1, p = 0.7, list=FALSE)
head(germcredit_split)

##      Resample1
## [1,]         2
## [2,]         3
## [3,]         4
## [4,]         5
## [5,]         6
## [6,]         7

germcredit_train <- germ_cred[germcredit_split,] 
germcredit_val <- germ_cred[-germcredit_split,]
table(germcredit_train$V21)

## 
##   0   1 
## 486 214

table(germcredit_val$V21)

## 
##   0   1 
## 214  86

germcred_logmod <- glm(V21 ~ ., data = germcredit_train, family=binomial(link="logit"))
summary(germcred_logmod)

## 
## Call:
## glm(formula = V21 ~ ., family = binomial(link = "logit"), data = germcredit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2960  -0.7145  -0.3640   0.7161   2.5515  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.669e-01  1.307e+00   0.204 0.838152    
## V1A12       -3.171e-01  2.614e-01  -1.213 0.225104    
## V1A13       -1.434e+00  5.016e-01  -2.859 0.004252 ** 
## V1A14       -1.606e+00  2.757e-01  -5.824 5.73e-09 ***
## V2           2.346e-02  1.178e-02   1.991 0.046454 *  
## V3A31        4.163e-01  6.637e-01   0.627 0.530515    
## V3A32       -2.644e-01  5.204e-01  -0.508 0.611423    
## V3A33       -7.901e-01  5.659e-01  -1.396 0.162703    
## V3A34       -1.316e+00  5.350e-01  -2.461 0.013871 *  
## V4A41       -2.129e+00  4.675e-01  -4.553 5.28e-06 ***
## V4A410      -1.003e+00  1.003e+00  -1.000 0.317545    
## V4A42       -7.033e-01  3.125e-01  -2.251 0.024401 *  
## V4A43       -1.110e+00  2.945e-01  -3.769 0.000164 ***
## V4A44       -2.341e-01  8.866e-01  -0.264 0.791764    
## V4A45        1.704e-01  6.909e-01   0.247 0.805221    
## V4A46       -7.129e-02  4.914e-01  -0.145 0.884651    
## V4A48       -1.575e+01  4.735e+02  -0.033 0.973461    
## V4A49       -7.666e-01  4.094e-01  -1.872 0.061157 .  
## V5           1.138e-04  5.580e-05   2.039 0.041430 *  
## V6A62       -2.642e-01  3.512e-01  -0.752 0.451860    
## V6A63       -6.222e-02  4.288e-01  -0.145 0.884646    
## V6A64       -1.288e+00  6.086e-01  -2.117 0.034297 *  
## V6A65       -8.564e-01  3.122e-01  -2.743 0.006090 ** 
## V7A72        6.433e-01  5.099e-01   1.262 0.207118    
## V7A73        2.517e-01  4.872e-01   0.517 0.605438    
## V7A74       -4.155e-01  5.315e-01  -0.782 0.434433    
## V7A75        4.195e-01  4.897e-01   0.857 0.391656    
## V8           2.634e-01  1.045e-01   2.520 0.011743 *  
## V9A92       -1.342e-01  4.829e-01  -0.278 0.781112    
## V9A93       -9.991e-01  4.775e-01  -2.092 0.036401 *  
## V9A94       -3.581e-01  5.760e-01  -0.622 0.534142    
## V10A102      6.158e-01  4.900e-01   1.257 0.208866    
## V10A103     -6.144e-01  4.952e-01  -1.241 0.214713    
## V11         -7.023e-02  1.056e-01  -0.665 0.505908    
## V12A122      2.408e-01  3.087e-01   0.780 0.435344    
## V12A123      2.955e-01  2.784e-01   1.062 0.288398    
## V12A124      7.883e-01  4.804e-01   1.641 0.100782    
## V13         -1.209e-02  1.122e-02  -1.077 0.281299    
## V14A142      2.694e-01  5.231e-01   0.515 0.606486    
## V14A143     -5.978e-01  2.919e-01  -2.048 0.040532 *  
## V15A152     -3.062e-01  2.833e-01  -1.081 0.279852    
## V15A153     -2.719e-01  5.561e-01  -0.489 0.624875    
## V16          4.637e-01  2.316e-01   2.002 0.045305 *  
## V17A172      1.974e-01  8.268e-01   0.239 0.811313    
## V17A173      1.781e-01  7.986e-01   0.223 0.823497    
## V17A174      5.796e-01  8.157e-01   0.711 0.477367    
## V18          6.567e-02  3.195e-01   0.206 0.837132    
## V19A192     -4.408e-01  2.529e-01  -1.743 0.081291 .  
## V20A202     -1.313e+00  7.811e-01  -1.681 0.092783 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 861.88  on 699  degrees of freedom
## Residual deviance: 623.89  on 651  degrees of freedom
## AIC: 721.89
## 
## Number of Fisher Scoring iterations: 14

#now lets do the prediction

germcred_predict <- predict(germcred_logmod, newdata=germcredit_val[,-21], type="response")
table(germcredit_val$V21, round(germcred_predict))

##    
##       0   1
##   0 179  35
##   1  43  43

So our baseline prediction model correctly classifies a good bit of the good borrowers, but heavily misclassifies the bad borrowers. To deal with the fact that the cost of bad borrowers is 5x that of misclassifying good borrowers, lets look for ways to minimize the misclassifying of bad borrowers. We can do this by only including variables that had a signficance of at least p <0.1

# Because there are multiple cat variables in a single column lets manually remove the non-significant variables. 
germcredit_train$V1A14[germcredit_train$V1 == "A14"] <- 1
germcredit_train$V1A14[germcredit_train$V1 != "A14"] <- 0

germcredit_train$V3A34[germcredit_train$V3 == "A34"] <- 1
germcredit_train$V3A34[germcredit_train$V3 != "A34"] <- 0

germcredit_train$V4A41[germcredit_train$V4 == "A41"] <- 1
germcredit_train$V4A41[germcredit_train$V4 != "A41"] <- 0

germcredit_train$V4A43[germcredit_train$V4 == "A43"] <- 1
germcredit_train$V4A43[germcredit_train$V4 != "A43"] <- 0

germcred_logmod2 <- glm(V21 ~ V1A14+V2+V3A34+V4A41+V4A43, data = germcredit_train, family=binomial(link="logit"))
summary(germcred_logmod2)

## 
## Call:
## glm(formula = V21 ~ V1A14 + V2 + V3A34 + V4A41 + V4A43, family = binomial(link = "logit"), 
##     data = germcredit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6221  -0.8019  -0.4871   1.0727   2.5475  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.50046    0.20328  -2.462 0.013821 *  
## V1A14       -1.43483    0.21168  -6.778 1.22e-11 ***
## V2           0.03132    0.00770   4.068 4.74e-05 ***
## V3A34       -0.78889    0.22460  -3.512 0.000444 ***
## V4A41       -1.35802    0.36507  -3.720 0.000199 ***
## V4A43       -0.85684    0.21638  -3.960 7.50e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 861.88  on 699  degrees of freedom
## Residual deviance: 733.45  on 694  degrees of freedom
## AIC: 745.45
## 
## Number of Fisher Scoring iterations: 4

# Now lets do that with the validation data
germcredit_val$V1A14[germcredit_val$V1 == "A14"] <- 1
germcredit_val$V1A14[germcredit_val$V1 != "A14"] <- 0

germcredit_val$V3A34[germcredit_val$V3 == "A34"] <- 1
germcredit_val$V3A34[germcredit_val$V3 != "A34"] <- 0

germcredit_val$V4A41[germcredit_val$V4 == "A41"] <- 1
germcredit_val$V4A41[germcredit_val$V4 != "A41"] <- 0

germcredit_val$V4A43[germcredit_val$V4 == "A43"] <- 1
germcredit_val$V4A43[germcredit_val$V4 != "A43"] <- 0

# Now lets create a confusion matrix of predicted vs observed on the validation data set
# This will help to calculate the accuracy and specificity 
germcred_predict2 <- predict(germcred_logmod2, newdata=germcredit_val[,-21], type="response")
m <- as.matrix(table(round(germcred_predict2), germcredit_val$V21))
names(dimnames(m)) <- c("Predicted", "Observed")
m

##          Observed
## Predicted   0   1
##         0 186  51
##         1  28  35

threshold <- 0.7
m2 <- as.matrix(table(round(germcred_predict2 > threshold), germcredit_val$V21))
names(dimnames(m2)) <- c("Predicted", "Observed")
m2

##          Observed
## Predicted   0   1
##         0 210  74
##         1   4  12

accuracy <- (m2[1,1]+m2[2,2])/(m2[1,1]+m2[1,2]+m2[2,1]+m2[2,2])
accuracy

## [1] 0.74

specificity <- (m2[1,1])/(m2[1,1]+m2[2,1])
specificity

## [1] 0.9813084

After messing around with the threshold, I concluded that 0.7 was a good fit, providing a specificity of 98.54% and an accuracy of 68.67%, which overall is much better.