In order to apply principal component analysis (PCA) to the US crime data, I first loaded the data and then centered and scaled it within prcomp, as instructed in the homework. I then summarized the principal components to get a feel for the proportion of variance explained by each.
# Clearing environment
rm(list = ls())
# Loading necessary packages
#install.packages('DAAG')
library(DAAG)
## Loading required package: lattice
# Setting a seed in order to make results reproducible
set.seed(113)
# Reading in the US crime dataset and scaling
crime <- read.table('uscrime.txt', stringsAsFactors = FALSE, header = TRUE)
prc.crime <- prcomp(crime[,-16], center = TRUE, scale. = TRUE)
# Summarizing the principal components
summary(prc.crime)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4534 1.6739 1.4160 1.07806 0.97893 0.74377 0.56729
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688 0.02145
## Cumulative Proportion 0.4013 0.5880 0.7217 0.79920 0.86308 0.89996 0.92142
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55444 0.48493 0.44708 0.41915 0.35804 0.26333 0.2418
## Proportion of Variance 0.02049 0.01568 0.01333 0.01171 0.00855 0.00462 0.0039
## Cumulative Proportion 0.94191 0.95759 0.97091 0.98263 0.99117 0.99579 0.9997
## PC15
## Standard deviation 0.06793
## Proportion of Variance 0.00031
## Cumulative Proportion 1.00000
The summary above ranks the proportion of variance explained by each principal component; principal components 1-3 capture a substantially larger share of the variation than principal components 4-15 (also confirmed via the plot of proportional variance below).
I then estimated the eigenvalues and assessed the proportional variance of each principal component by dividing each eigenvalue by the sum of eigenvalues.
# Estimating eigenvalues
eigen <- prc.crime$sdev^2
# Estimating proportional variance
propvar <- eigen/sum(eigen)
plot(propvar,
xlab = "Principal Component",
ylab = "Proportional Variance",
ylim = c(0,1), type = "b")
Next, I wanted to determine which principal components are important. The Kaiser criterion suggests retaining components whose variance (eigenvalue) is greater than 1, so I developed a scree plot to visualize which principal components had variances greater than 1.
# Building the Scree Plot
screeplot(prc.crime, main = "Scree Plot", type = "line")
abline(h = 1, col = "pink")
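As a quick numeric companion to the scree plot (my own addition, not part of the assigned code), the Kaiser criterion can be checked directly from the component standard deviations stored in prc.crime:
# Kaiser criterion check: components whose variance (squared standard deviation) exceeds 1
which(prc.crime$sdev^2 > 1)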
Looking at the scree plot, the first 4 principal components have variances greater than 1, with PC5 falling just below the cutoff (variance of roughly 0.96); these would appear to be the most important. However, I decided to go one step further and calculate the R-squared and cross-validated R-squared values of regressions built on increasing numbers of principal components (i.e., 1-15). This will either confirm the findings from above or give a better sense of which principal components to include for the most robust model.
# Looping through each of the principal components to calculate an R-squared and a cross-validated R-squared
r2 <- numeric(15)
r2cross <- numeric(15)
for (i in 1:15){
  # Regress Crime on the first i principal components
  pc.list <- prc.crime$x[,1:i]
  pcc <- cbind(crime[,16], pc.list)
  model <- lm(V1~., data = as.data.frame(pcc))
  # Calculate R-squared, store in list
  r2[i] <- 1 - sum(model$residuals^2)/sum((crime$Crime - mean(crime$Crime))^2)
  # Calculate cross-validated R-squared via 5-fold CV, store in list
  cv.model <- cv.lm(as.data.frame(pcc), model, m = 5, plotit = FALSE, printit = FALSE)
  r2cross[i] <- 1 - attr(cv.model,"ms")*nrow(crime) / sum((crime$Crime - mean(crime$Crime))^2)
}
# Plotting the R-squared results
plot(r2,
xlab = "Principal Component",
ylab = "R^2",
ylim = c(0,1), type = "b", col = "dark green")
# Plotting the cross-validated R-squared results
plot(r2cross,
xlab = "Principal Component",
ylab = "Cross-Validation R^2",
ylim = c(0,1), type = "b", col = "dark green")
# Storing the R-squared results in a summary table
results <- data.frame(r2, r2cross, PC = 1:length(r2))
results
##       r2 r2cross PC
## 1 0.171 0.0711 1
## 2 0.263 0.1228 2
## 3 0.272 0.0963 3
## 4 0.309 0.0392 4
## 5 0.645 0.5068 5
## 6 0.659 0.5218 6
## 7 0.688 0.5306 7
## 8 0.690 0.4706 8
## 9 0.692 0.4299 9
## 10 0.696 0.4085 10
## 11 0.697 0.2768 11
## 12 0.769 0.3808 12
## 13 0.772 0.3461 13
## 14 0.791 0.4172 14
## 15 0.803 0.4198 15
The plots and table show a large jump in both R-squared and cross-validated R-squared at 5 principal components, with the cross-validated R-squared peaking around 7 components; components beyond that mostly add in-sample fit without improving cross-validated performance. Because the goal of this exercise is to reduce model complexity by using fewer variables, I used 5 principal components in my final model.
k = 5
# Combining principal components 1:k with the crime response from the original dataset
combine.crime <- cbind(prc.crime$x[,1:k], crime[,16])
# Using the combined data to create a linear regression
model <- lm(V6~., data = as.data.frame(combine.crime))
summary(model)
##
## Call:
## lm(formula = V6 ~ ., data = as.data.frame(combine.crime))
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.8 -185.0 12.2 146.2 447.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.1 35.6 25.43 < 2e-16 ***
## PC1 65.2 14.7 4.45 6.5e-05 ***
## PC2 -70.1 21.5 -3.26 0.0022 **
## PC3 25.2 25.4 0.99 0.3272
## PC4 69.4 33.4 2.08 0.0437 *
## PC5 -229.0 36.8 -6.23 2.0e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 244 on 41 degrees of freedom
## Multiple R-squared: 0.645, Adjusted R-squared: 0.602
## F-statistic: 14.9 on 5 and 41 DF, p-value: 2.45e-08
In order to determine how accurate my model was, I transformed the coefficients from the scaled principal-component space back onto the original variables via the following:
# Extracting the intercept and the principal-component coefficients from the regression
b0 <- model$coefficients[1]
b <- model$coefficients[2:(k+1)]
# Mapping the PC coefficients back to coefficients on the scaled predictors
a <- prc.crime$rotation[,1:k] %*% b
# Recovering coefficients on the original (unscaled) predictors by dividing by sigma
mu <- sapply(crime[,1:15], mean)
sigma <- sapply(crime[,1:15], sd)
orig.a <- a/sigma
# Recovering the original intercept
orig.b0 <- b0 - sum(a*mu / sigma)
# Generating fitted values on the original scale
estimates <- as.matrix(crime[,1:15]) %*% orig.a + orig.b0
# Using estimates to calculate R-squared values to determine accuracy of the model
sse = sum((estimates - crime[,16])^2)
sse.tot = sum((crime[,16] - mean(crime[,16]))^2)
R2 <- 1 - sse/sse.tot
R2
## [1] 0.645
R2_adjusted <- R2 - (1 - R2) * k / (nrow(crime) - k - 1)
R2_adjusted
## [1] 0.602
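As an extra sanity check (my own addition), the unscaled coefficients should reproduce the PC regression's fitted values exactly, up to floating-point error:
# The unscaled-coefficient predictions should match the PC regression's fitted values
max(abs(estimates - model$fitted.values))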
Then I used the new_city data from last week to estimate the crime rate given the “improved” model.
new_city <- data.frame(M= 14.0, So = 0, Ed = 10.0,
Po1 = 12.0, Po2 = 15.5,LF = 0.640,
M.F = 94.0, Pop = 150, NW = 1.1,
U1 = 0.120, U2 = 3.6, Wealth = 3200,
Ineq = 20.1, Prob = 0.040,Time = 39.0)
# Applying the PCA data onto the new city data and predicting the crime rate
p.crime <- data.frame(predict(prc.crime, new_city))
p <- predict(model, p.crime)
p
## 1
## 1389
My model predicted a crime rate of 1389. While this value is plausible given the other crime rates in the data, compared with last week's prediction (a crime rate of 1304 with an R-squared of 0.671) the PCA model appears slightly less accurate.
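As a consistency check (my own addition), the same prediction can be reproduced directly from the unscaled coefficients computed above:
# Predicting the new city's crime rate from the original-scale coefficients;
# this should match the chained predict() calls above
as.numeric(as.matrix(new_city) %*% orig.a + orig.b0)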
In order to find the best regression tree and random forest models for the crime data, I first cleared my workspace and reimported the data.
# Clearing environment
rm(list = ls())
# Loading necessary packages
#install.packages('randomForest')
#install.packages('tree')
#install.packages('caret')
library(tree)
library(caret)
library(randomForest)
# Setting a seed in order to make results reproducible
set.seed(113)
# Reading in the US crime dataset
crime <- read.table('uscrime.txt', header = TRUE)
I then built a regression tree model using the tree package and took an initial look at how the tree was split via the plot below.
# Building the regression tree model
crimeTree <- tree(Crime ~ ., data = crime)
summary(crimeTree)
##
## Regression tree:
## tree(formula = Crime ~ ., data = crime)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF" "NW"
## Number of terminal nodes: 7
## Residual mean deviance: 47400 = 1900000 / 40
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -574 -98 -2 0 111 490
# Viewing the split
crimeTree$frame
## var n dev yval splits.cutleft splits.cutright
## 1 Po1 47 6880928 905 <7.65 >7.65
## 2 Pop 23 779243 670 <22.5 >22.5
## 4 LF 12 243811 550 <0.5675 >0.5675
## 8 <leaf> 7 48519 467
## 9 <leaf> 5 77757 668
## 5 <leaf> 11 179471 800
## 3 NW 24 3604162 1131 <7.65 >7.65
## 6 Pop 10 557575 887 <21.5 >21.5
## 12 <leaf> 5 146391 1049
## 13 <leaf> 5 147771 725
## 7 Po1 14 2027225 1305 <9.65 >9.65
## 14 <leaf> 6 170828 1041
## 15 <leaf> 8 1124985 1503
# Generating a visual of the tree
plot(crimeTree)
text(crimeTree)
title("Crime Training Set Classification Tree")
Looking at the residual mean deviance above, the regression tree as is was not a good fit. Based on this, I tried pruning the tree in order to see if I could lower the residual mean deviance.
# Pruning the tree to see if we can get a better fit
nodes <- 5
pru.crimeTree <- prune.tree(crimeTree, best = nodes)
plot(pru.crimeTree)
text(pru.crimeTree)
title("Pruned Tree")
# Generating a summary of the pruned tree in order to check the residual mean deviance
summary(pru.crimeTree)
##
## Regression tree:
## snip.tree(tree = crimeTree, nodes = c(4L, 6L))
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "NW"
## Number of terminal nodes: 5
## Residual mean deviance: 54200 = 2280000 / 42
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -574 -108 16 0 123 490
Based on the output above, the residual mean deviance increased with pruning (i.e., the pruned model is a worse fit). I then ran cross-validation and plotted the deviance against tree size to help pick the number of terminal nodes. The cross-validated deviances are noisy and do not point clearly to a single size: the smallest values occur at 2 and 3 terminal nodes, while among the fuller trees the unpruned 7-node tree has the lowest deviance. Because a tree with only one or two splits would be too simple to be useful, I kept 7 terminal nodes, which also gives the lowest in-sample deviance.
crimeCV <- cv.tree(crimeTree)
prune.tree(crimeTree)$size
## [1] 7 6 5 4 3 2 1
prune.tree(crimeTree)$dev
## [1] 1895722 2013257 2276670 2632631 3364043 4383406 6880928
crimeCV$dev
## [1] 8133096 8308707 8430820 8451299 7891202 6947370 8399729
plot(crimeCV$size, crimeCV$dev, type = "b")
more.nodes <- 7
pru.crimeTree2 <- prune.tree(crimeTree, best = more.nodes)
plot(pru.crimeTree2)
text(pru.crimeTree2)
title("Pruned Tree 2")
summary(pru.crimeTree2)
##
## Regression tree:
## tree(formula = Crime ~ ., data = crime)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF" "NW"
## Number of terminal nodes: 7
## Residual mean deviance: 47400 = 1900000 / 40
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -574 -98 -2 0 111 490
I then looked at the quality of the fit for my regression tree model.
# Estimating the quality of the (in-sample) fit for the regression tree model
p.crimeTree <- predict(pru.crimeTree2, newdata = crime[,1:15])
rss <- sum((p.crimeTree - crime[,16])^2)
tss <- sum((crime[,16] - mean(crime[,16]))^2)
r2<- 1 - rss/tss
r2
## [1] 0.724
Key Takeaways: R2 is high (0.724) with the regression tree model, but this is an in-sample measure and I'm not convinced overfitting isn't an issue. The dataset is split almost evenly at the first Po1 split (23 observations below 7.65 and 24 above). Additionally, because pruning dropped the LF split and one of the Pop splits while keeping NW, NW appears to be a stronger predictor of crime than LF or Pop.
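To get a rough read on the overfitting concern (my own addition, not part of the original analysis), one could leave-one-out cross-validate the 7-node tree; a minimal sketch, with the caveat that the tree grown on each fold may have fewer than 7 leaves (prune.tree will warn and return the unpruned fold tree in that case):
# Leave-one-out cross-validation of the pruned regression tree
cv.pred <- numeric(nrow(crime))
for (j in 1:nrow(crime)) {
  # Grow and prune a tree without observation j, then predict that observation
  fold.tree <- prune.tree(tree(Crime ~ ., data = crime[-j, ]), best = more.nodes)
  cv.pred[j] <- predict(fold.tree, newdata = crime[j, ])
}
# Cross-validated R-squared, for comparison with the in-sample 0.724
1 - sum((cv.pred - crime$Crime)^2) / tss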
Random forest model time…
I first created a baseline random forest model in order to assess the baseline R2 and determine what I might change to get a better fit.
# Creating baseline random forest model
crimeRF <- randomForest(Crime ~ ., data = crime, importance = TRUE, nodesize = 5)
# With no newdata supplied, predict() returns the out-of-bag predictions,
# so this R-squared is an out-of-bag rather than an in-sample measure
p.crimeRF <- predict(crimeRF)
rssRF <- sum((p.crimeRF - crime[,16])^2)
r2RF <- 1 - rssRF/tss
r2RF
## [1] 0.41
The R2 for the baseline random forest model was lower than the R2 for the final regression tree model, although the comparison is not entirely fair: the random forest value is based on out-of-bag predictions, while the tree's 0.724 is an in-sample fit. I decided to play around with mtry and nodesize to see if I could get a better fit, using a loop to run through combinations of mtry and nodesize and record which one produced the best R2.
# Looping through different nodesize and mtry values
testRF <- data.frame(matrix(nrow = 14*20, ncol = 3))
colnames(testRF) <- c("NodeSize", "mtry", "R2")
i = 1
suppressWarnings(for (nodesize in 2:15) {
  for (m in 1:20) {
    testModel <- randomForest(Crime ~ ., data = crime, importance = TRUE, nodesize = nodesize, mtry = m)
    # Out-of-bag predictions again (no newdata supplied)
    p.test <- predict(testModel)
    rssTest <- sum((p.test - crime[,16])^2)
    tssTest <- sum((crime[,16] - mean(crime[,16]))^2)
    r2Test <- 1 - rssTest/tssTest
    testRF[i,1] <- nodesize
    testRF[i,2] <- m
    testRF[i,3] <- r2Test
    i = i + 1
  }
})
head(testRF)
## NodeSize mtry R2
## 1 2 1 0.392
## 2 2 2 0.430
## 3 2 3 0.438
## 4 2 4 0.429
## 5 2 5 0.395
## 6 2 6 0.436
# Reporting which combination gives us the highest R2 value
testRF[which.max(testRF[,3]),]
## NodeSize mtry R2
## 63 5 3 0.455
Based on the output from looping through combinations of mtry and nodesize values, a nodesize of 5 with an mtry of 3 produced the highest out-of-bag R2 in this run (about 0.455). Random forest results vary somewhat from run to run, so I fit the final model with a nearby combination, a nodesize of 4 and an mtry of 2.
crimeRF.final <- randomForest(Crime ~ ., data = crime, importance = TRUE, nodesize = 4, mtry = 2)
# Checking the importance of variables in the final random forest model
importance(crimeRF.final)
## %IncMSE IncNodePurity
## M 0.955 224726
## So 2.016 47896
## Ed 3.260 357977
## Po1 9.833 1060007
## Po2 8.531 733099
## LF 4.269 377348
## M.F 0.961 412577
## Pop 0.152 409669
## NW 8.425 521683
## U1 0.422 222610
## U2 3.603 254964
## Wealth 4.239 632035
## Ineq 0.331 279822
## Prob 8.175 684659
## Time 2.215 329479
varImpPlot(crimeRF.final)
The grid search above topped out at an out-of-bag R2 of roughly 0.45, and the final model uses a nodesize of 4 and an mtry of 2. Looking at the plot above, Po1, Po2, NW, and Prob appear to be the most influential variables in the model; this aligns with my findings from the regression tree model, where Po1 and NW were the most influential variables. Comparing the two approaches, the random forest model appears somewhat less accurate than the single regression tree (keeping in mind that the tree's R2 is an in-sample figure), and increasing mtry, the number of variables considered at each split, beyond small values did not improve accuracy.
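As a quick cross-check (my own addition), randomForest tracks its own out-of-bag pseudo R-squared, which should land in the same neighborhood as the values computed manually above:
# Out-of-bag pseudo R-squared reported by the final random forest (1 - OOB MSE / Var(y))
tail(crimeRF.final$rsq, 1)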
I work for a healthcare analytics company; pharmaceutical companies subscribe to our software platform to conduct data analytics through a point-and-click user interface. These subscriptions make up 90-95% of our annual revenue, and our business development and client support staff work hard to ensure clients are happy with the platform. One area where logistic regression could be useful is predicting which clients are at high risk of not renewing their subscriptions. Super-users are always in the platform, and their companies pose very little non-renewal risk; clients who hardly ever have users logged in are the at-risk accounts. Our company could use (1) number of total logins per month, (2) number of unique users per month, (3) number of customer support requests per month, (4) number of 1:1 client trainings per month, and (5) number of new data projects created per month to predict which clients are at risk of non-renewal.
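To make the idea concrete, here is a minimal sketch of what that model could look like; the data frame, column names, and simulated values are all hypothetical and invented purely for illustration:
# Hypothetical client-usage data; in practice these counts would come from platform logs
set.seed(1)
client_usage <- data.frame(
  logins    = rpois(200, 40),  # total logins per month
  users     = rpois(200, 8),   # unique users per month
  support   = rpois(200, 3),   # customer support requests per month
  trainings = rpois(200, 1),   # 1:1 client trainings per month
  projects  = rpois(200, 2)    # new data projects created per month
)
# Simulated response: lower usage -> higher chance of non-renewal (purely illustrative)
client_usage$non_renewal <- rbinom(200, 1, plogis(2 - 0.05*client_usage$logins - 0.2*client_usage$users))
# Logistic regression flagging clients at risk of non-renewal
churn.glm <- glm(non_renewal ~ logins + users + support + trainings + projects,
                 data = client_usage, family = binomial(link = "logit"))
summary(churn.glm)
Clients whose predicted probability of non-renewal exceeds some cutoff could then be flagged for proactive outreach by the client support team.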
First, I read in the credit data and looked at how responses were coded.
# Clearing environment
rm(list = ls())
# Loading necessary packages
#install.packages('dplyr')
library(dplyr)
library(caret) # for createDataPartition, used below
# Setting a seed in order to make results reproducible
set.seed(113)
# Reading in the German credit dataset
credit <- read.table('germancredit.txt', header = FALSE)
# Getting a sense for the data structure
str(credit)
# What do the responses look like
table(credit$V21)
I then recoded the responses from 1/2 to 0/1; this is not required for the analysis, but I prefer working with 0/1 responses when cleaning data.
# Recoding responses from 1 and 2 to 0 and 1
credit$V21[credit$V21==1] <- 0
credit$V21[credit$V21==2] <- 1
# Making sure the recode worked
table(credit$V21)
##
## 0 1
## 700 300
I measured model quality via accuracy on held-out data; to do this, I split the data into training (70%) and validation (30%) sets, fit a logistic regression on the training set, and assessed the importance of each predictor in the model.
# Splitting the data into the training and validation subsets
credit.partition <- createDataPartition(credit$V21, times = 1, p = 0.7, list = FALSE)
head(credit.partition)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 6
## [6,] 7
train <- credit[credit.partition,]
validation <- credit[-credit.partition,]
table(train$V21)
##
## 0 1
## 500 200
table(validation$V21)
##
## 0 1
## 200 100
# Creating the GLM
credit.glm <- glm(V21 ~ ., data = train, family = binomial(link = "logit"))
# Looking at the importance of each predictor in the GLM
summary(credit.glm)
##
## Call:
## glm(formula = V21 ~ ., family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.554 -0.649 -0.339 0.594 2.713
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.35e+00 1.36e+00 -0.99 0.31974
## V1A12 -3.50e-01 2.77e-01 -1.26 0.20729
## V1A13 -1.31e+00 5.06e-01 -2.59 0.00970 **
## V1A14 -1.68e+00 2.95e-01 -5.70 1.2e-08 ***
## V2 2.37e-02 1.16e-02 2.03 0.04219 *
## V3A31 2.17e-01 6.58e-01 0.33 0.74154
## V3A32 -7.10e-01 5.21e-01 -1.36 0.17309
## V3A33 -1.33e+00 5.86e-01 -2.27 0.02338 *
## V3A34 -2.03e+00 5.40e-01 -3.77 0.00017 ***
## V4A41 -1.65e+00 4.63e-01 -3.56 0.00037 ***
## V4A410 -1.07e+00 9.64e-01 -1.11 0.26761
## V4A42 -9.94e-01 3.34e-01 -2.98 0.00289 **
## V4A43 -9.79e-01 3.13e-01 -3.13 0.00175 **
## V4A44 -4.41e-01 1.01e+00 -0.44 0.66236
## V4A45 3.15e-01 6.97e-01 0.45 0.65183
## V4A46 -6.78e-01 4.95e-01 -1.37 0.17049
## V4A48 -1.53e+01 5.11e+02 -0.03 0.97608
## V4A49 -6.73e-01 4.39e-01 -1.53 0.12480
## V5 1.66e-04 5.70e-05 2.90 0.00368 **
## V6A62 -2.32e-01 3.55e-01 -0.65 0.51270
## V6A63 -3.33e-01 5.34e-01 -0.62 0.53270
## V6A64 -1.19e+00 6.31e-01 -1.88 0.06040 .
## V6A65 -8.59e-01 3.18e-01 -2.70 0.00692 **
## V7A72 5.40e-02 5.27e-01 0.10 0.91854
## V7A73 -1.70e-01 5.03e-01 -0.34 0.73474
## V7A74 -1.00e+00 5.48e-01 -1.83 0.06761 .
## V7A75 -4.66e-01 5.16e-01 -0.90 0.36675
## V8 3.21e-01 1.15e-01 2.80 0.00508 **
## V9A92 2.49e-02 4.86e-01 0.05 0.95922
## V9A93 -6.62e-01 4.76e-01 -1.39 0.16451
## V9A94 -2.98e-01 5.73e-01 -0.52 0.60345
## V10A102 5.28e-01 4.97e-01 1.06 0.28782
## V10A103 -1.13e+00 5.40e-01 -2.09 0.03688 *
## V11 8.44e-02 1.11e-01 0.76 0.44639
## V12A122 1.62e-01 3.32e-01 0.49 0.62489
## V12A123 4.99e-01 2.94e-01 1.70 0.08990 .
## V12A124 1.28e+00 5.17e-01 2.47 0.01335 *
## V13 -1.10e-02 1.12e-02 -0.98 0.32510
## V14A142 1.12e-01 5.38e-01 0.21 0.83546
## V14A143 -1.20e-01 3.02e-01 -0.40 0.69020
## V15A152 -3.76e-01 2.93e-01 -1.29 0.19831
## V15A153 -8.25e-01 5.84e-01 -1.41 0.15785
## V16 4.37e-01 2.36e-01 1.85 0.06467 .
## V17A172 1.30e+00 8.69e-01 1.50 0.13345
## V17A173 1.30e+00 8.35e-01 1.56 0.11974
## V17A174 1.32e+00 8.44e-01 1.57 0.11688
## V18 1.67e-01 3.17e-01 0.53 0.59877
## V19A192 -4.00e-01 2.60e-01 -1.54 0.12379
## V20A202 -1.74e+00 9.37e-01 -1.86 0.06322 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 837.58 on 699 degrees of freedom
## Residual deviance: 583.60 on 651 degrees of freedom
## AIC: 681.6
##
## Number of Fisher Scoring iterations: 14
I then estimated a baseline prediction using my GLM and all predictors in order to determine how the model was performing. I assumed a threshold of 0.5.
# Generating the baseline prediction using the GLM and all predictors
p.credit <- predict(credit.glm, newdata = validation[,-21], type = "response")
table(validation$V21, round(p.credit))
##
## 0 1
## 0 175 25
## 1 59 41
Based on the results above, my initial GLM does a decent job classifying the good borrowers (coded 0; 175 of 200 correct), but many bad borrowers (coded 1) are misclassified as good (59 of 100). Given that the cost of misclassifying a bad borrower is 5x that of misclassifying a good borrower, I decided to investigate reducing the misclassification of bad borrowers, first by considering how the cost ratio could guide the threshold choice (see the sketch below) and then by refitting the model with only the most significant predictors.
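This is a minimal sketch (my own addition, not part of the assigned analysis) of using the 5:1 cost ratio to pick a classification threshold on the validation set, applied to the full-model predictions p.credit from above:
# Sweep thresholds and compute total misclassification cost on the validation set,
# counting each bad borrower predicted as good as 5 and each good borrower predicted as bad as 1
thresholds <- seq(0.05, 0.95, by = 0.05)
costs <- sapply(thresholds, function(t) {
  pred <- as.integer(p.credit > t)
  5*sum(pred == 0 & validation$V21 == 1) + 1*sum(pred == 1 & validation$V21 == 0)
})
# Threshold with the lowest weighted cost
thresholds[which.min(costs)]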
# Creating 0/1 indicator variables in the training set for the significant predictor levels
train$V1A14 <- as.numeric(train$V1 == "A14")
train$V3A34 <- as.numeric(train$V3 == "A34")
train$V4A41 <- as.numeric(train$V4 == "A41")
train$V4A43 <- as.numeric(train$V4 == "A43")
# Rerunning the GLM
credit.glm2 <- glm(V21 ~ V1A14+V2+V3A34+V4A41+V4A43, data = train, family = binomial(link = "logit"))
summary(credit.glm2)
##
## Call:
## glm(formula = V21 ~ V1A14 + V2 + V3A34 + V4A41 + V4A43, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.620 -0.797 -0.476 0.960 2.409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.85642 0.20982 -4.08 4.5e-05 ***
## V1A14 -1.48802 0.22587 -6.59 4.5e-11 ***
## V2 0.03865 0.00751 5.15 2.7e-07 ***
## V3A34 -0.96499 0.23596 -4.09 4.3e-05 ***
## V4A41 -0.70293 0.35575 -1.98 0.0482 *
## V4A43 -0.68179 0.22008 -3.10 0.0019 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 837.58 on 699 degrees of freedom
## Residual deviance: 703.03 on 694 degrees of freedom
## AIC: 715
##
## Number of Fisher Scoring iterations: 5
# Creating the same indicator variables in the validation set
validation$V1A14 <- as.numeric(validation$V1 == "A14")
validation$V3A34 <- as.numeric(validation$V3 == "A34")
validation$V4A41 <- as.numeric(validation$V4 == "A41")
validation$V4A43 <- as.numeric(validation$V4 == "A43")
# Using GLM to predict on validation data
p.credit2 <- predict(credit.glm2, newdata = validation[,-21], type = "response")
x <- as.matrix(table(round(p.credit2), validation$V21))
names(dimnames(x)) <- c("Predicted", "Observed")
x
## Observed
## Predicted 0 1
## 0 184 70
## 1 16 30
# Calculating accuracy and specificity at a higher threshold
threshold <- 0.7
x2 <- as.matrix(table(as.integer(p.credit2 > threshold), validation$V21))
names(dimnames(x2)) <- c("Predicted", "Observed")
x2
##          Observed
## Predicted   0   1
##         0 197  93
##         1   3   7
# Accuracy
accuracy <- (x2[1,1]+x2[2,2])/(x2[1,1]+x2[1,2]+x2[2,1]+x2[2,2])
accuracy
## [1] 0.68
# Specificity
specificity <- (x2[1,1])/(x2[1,1]+x2[2,1])
specificity
## [1] 0.985
Assuming a threshold of 0.7, I observed a specificity of 98.5% and an accuracy of 68%. In other words, only about 1.5% of good borrowers are misclassified as bad; however, most bad borrowers (93 of 100 in the validation set) are still classified as good at this threshold, and that is the costly error given the 5:1 cost ratio, so in practice the threshold would need to be moved lower rather than higher.
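To quantify that trade-off (my own addition), the 5:1-weighted misclassification cost of the reduced model can be compared at the 0.5 and 0.7 thresholds; from the confusion matrices above this works out to 5*70 + 16 = 366 at 0.5 versus 5*93 + 3 = 468 at 0.7.
# Total weighted cost on the validation set: bad-as-good errors count 5, good-as-bad errors count 1
cost_at <- function(t) {
  pred <- as.integer(p.credit2 > t)
  5*sum(pred == 0 & validation$V21 == 1) + sum(pred == 1 & validation$V21 == 0)
}
c(cost_0.5 = cost_at(0.5), cost_0.7 = cost_at(0.7))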