In order to apply principal component analysis (PCA) to the US crime data, I first loaded the data and then centered and scaled it within prcomp, as instructed in the homework. I then summarized the principal components to get a feel for the proportion of variance explained by each.
# Clearing environment
rm(list = ls())
# Loading necessary packages
#install.packages('DAAG')
library(DAAG)
## Loading required package: lattice
# Setting a seed in order to make results reproducible
set.seed(113)
# Reading in the US crime dataset and scaling
crime <- read.table('uscrime.txt', stringsAsFactors = FALSE, header = TRUE)
prc.crime <- prcomp(crime[,-16], center = TRUE, scale. = TRUE)
# Summarizing the principal components
summary(prc.crime)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4534 1.6739 1.4160 1.07806 0.97893 0.74377 0.56729
## Proportion of Variance 0.4013 0.1868 0.1337 0.07748 0.06389 0.03688 0.02145
## Cumulative Proportion 0.4013 0.5880 0.7217 0.79920 0.86308 0.89996 0.92142
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.55444 0.48493 0.44708 0.41915 0.35804 0.26333 0.2418
## Proportion of Variance 0.02049 0.01568 0.01333 0.01171 0.00855 0.00462 0.0039
## Cumulative Proportion 0.94191 0.95759 0.97091 0.98263 0.99117 0.99579 0.9997
## PC15
## Standard deviation 0.06793
## Proportion of Variance 0.00031
## Cumulative Proportion 1.00000
The summary above ranks the proportion of variance explained by each principal component; principal components 1-3 capture a substantially larger share of the variation than principal components 4-15 (also confirmed via the plot of proportional variance below).
I then estimated the eigenvalues and assessed the proportional variance of each principal component by dividing each eigenvalue by the sum of eigenvalues.
# Estimating eigenvalues
eigen <- prc.crime$sdev^2
# Estimating proportional variance
propvar <- eigen/sum(eigen)
plot(propvar,
xlab = "Principal Component",
ylab = "Proportional Variance",
ylim = c(0,1), type = "b")
Next, I wanted to determine which principal components are important. The Kaiser criterion suggests retaining components whose variance (eigenvalue) is greater than 1, so I developed a scree plot to visualize which principal components had variances greater than 1.
# Building the Scree Plot
screeplot(prc.crime, main = "Scree Plot", type = "line")
abline(h = 1, col = "pink")
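As a quick numeric companion to the scree plot (my own addition, not part of the assigned code), the Kaiser criterion can be checked directly from the component standard deviations stored in prc.crime:
# Kaiser criterion check: components whose variance (squared standard deviation) exceeds 1
which(prc.crime$sdev^2 > 1)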
Looking at the scree plot, the first 4 principal components have variances greater than 1, with PC5 falling just below the cutoff (variance of roughly 0.96); these would appear to be the most important. However, I decided to go one step further and calculate the R-squared and cross-validated R-squared values of regressions built on increasing numbers of principal components (i.e., 1-15). This will either confirm the findings from above or give a better sense of which principal components to include for the most robust model.
# Looping through each of the principal components to calculate an R-squared and a cross-validated R-squared
r2 <- numeric(15)
r2cross <- numeric(15)
for (i in 1:15){
  # Regress Crime on the first i principal components
  pc.list <- prc.crime$x[,1:i]
  pcc <- cbind(crime[,16], pc.list)
  model <- lm(V1~., data = as.data.frame(pcc))
  # Calculate R-squared, store in list
  r2[i] <- 1 - sum(model$residuals^2)/sum((crime$Crime - mean(crime$Crime))^2)
  # Calculate cross-validated R-squared via 5-fold CV, store in list
  cv.model <- cv.lm(as.data.frame(pcc), model, m = 5, plotit = FALSE, printit = FALSE)
  r2cross[i] <- 1 - attr(cv.model,"ms")*nrow(crime) / sum((crime$Crime - mean(crime$Crime))^2)
}
# Plotting the R-squared results
plot(r2,
xlab = "Principal Component",
ylab = "R^2",
ylim = c(0,1), type = "b", col = "dark green")
# Plotting the cross-validated R-squared results
plot(r2cross,
xlab = "Principal Component",
ylab = "Cross-Validation R^2",
ylim = c(0,1), type = "b", col = "dark green")
# Storing the R-squared results in a summary table
results <- data.frame(r2, r2cross, PC = 1:length(r2))
results
##       r2 r2cross PC
## 1 0.171 0.0711 1
## 2 0.263 0.1228 2
## 3 0.272 0.0963 3
## 4 0.309 0.0392 4
## 5 0.645 0.5068 5
## 6 0.659 0.5218 6
## 7 0.688 0.5306 7
## 8 0.690 0.4706 8
## 9 0.692 0.4299 9
## 10 0.696 0.4085 10
## 11 0.697 0.2768 11
## 12 0.769 0.3808 12
## 13 0.772 0.3461 13
## 14 0.791 0.4172 14
## 15 0.803 0.4198 15
The plots and table show a large jump in both R-squared and cross-validated R-squared at 5 principal components, with the cross-validated R-squared peaking around 7 components; components beyond that mostly add in-sample fit without improving cross-validated performance. Because the goal of this exercise is to reduce model complexity by using fewer variables, I used 5 principal components in my final model.
k = 5
# Combining principal components 1:k with the crime response from the original dataset
combine.crime <- cbind(prc.crime$x[,1:k], crime[,16])
# Using the combined data to create a linear regression
model <- lm(V6~., data = as.data.frame(combine.crime))
summary(model)
##
## Call:
## lm(formula = V6 ~ ., data = as.data.frame(combine.crime))
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.8 -185.0 12.2 146.2 447.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 905.1 35.6 25.43 < 2e-16 ***
## PC1 65.2 14.7 4.45 6.5e-05 ***
## PC2 -70.1 21.5 -3.26 0.0022 **
## PC3 25.2 25.4 0.99 0.3272
## PC4 69.4 33.4 2.08 0.0437 *
## PC5 -229.0 36.8 -6.23 2.0e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 244 on 41 degrees of freedom
## Multiple R-squared: 0.645, Adjusted R-squared: 0.602
## F-statistic: 14.9 on 5 and 41 DF, p-value: 2.45e-08
In order to determine how accurate my model was, I transformed the coefficients from the scaled principal-component space back onto the original variables via the following:
# Extracting the intercept and the principal-component coefficients from the regression
b0 <- model$coefficients[1]
b <- model$coefficients[2:(k+1)]
# Mapping the PC coefficients back to coefficients on the scaled predictors
a <- prc.crime$rotation[,1:k] %*% b
# Recovering coefficients on the original (unscaled) predictors by dividing by sigma
mu <- sapply(crime[,1:15], mean)
sigma <- sapply(crime[,1:15], sd)
orig.a <- a/sigma
# Recovering the original intercept
orig.b0 <- b0 - sum(a*mu / sigma)
# Generating fitted values on the original scale
estimates <- as.matrix(crime[,1:15]) %*% orig.a + orig.b0
# Using estimates to calculate R-squared values to determine accuracy of the model
sse = sum((estimates - crime[,16])^2)
sse.tot = sum((crime[,16] - mean(crime[,16]))^2)
R2 <- 1 - sse/sse.tot
R2
## [1] 0.645
R2_adjusted <- R2 - (1 - R2) * k / (nrow(crime) - k - 1)
R2_adjusted
## [1] 0.602
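As an extra sanity check (my own addition), the unscaled coefficients should reproduce the PC regression's fitted values exactly, up to floating-point error:
# The unscaled-coefficient predictions should match the PC regression's fitted values
max(abs(estimates - model$fitted.values))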
Then I used the new_city data from last week to estimate the crime rate given the “improved” model.
new_city <- data.frame(M= 14.0, So = 0, Ed = 10.0,
Po1 = 12.0, Po2 = 15.5,LF = 0.640,
M.F = 94.0, Pop = 150, NW = 1.1,
U1 = 0.120, U2 = 3.6, Wealth = 3200,
Ineq = 20.1, Prob = 0.040,Time = 39.0)
# Applying the PCA data onto the new city data and predicting the crime rate
p.crime <- data.frame(predict(prc.crime, new_city))
p <- predict(model, p.crime)
p
## 1
## 1389
My model predicted a crime rate of 1389. While this value is plausible given the other crime rates in the data, compared with last week's prediction (a crime rate of 1304 with an R-squared of 0.671) the PCA model appears slightly less accurate.
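As a consistency check (my own addition), the same prediction can be reproduced directly from the unscaled coefficients computed above:
# Predicting the new city's crime rate from the original-scale coefficients;
# this should match the chained predict() calls above
as.numeric(as.matrix(new_city) %*% orig.a + orig.b0)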
In order to find the best regression tree and random forest models for the crime data, I first cleared my workspace and reimported the data.
# Clearing environment
rm(list = ls())
# Loading necessary packages
#install.packages('randomForest')
#install.packages('tree')
#install.packages('caret')
library(tree)
library(caret)
library(randomForest)
# Setting a seed in order to make results reproducible
set.seed(113)
# Reading in the US crime dataset
crime <- read.table('uscrime.txt', header = TRUE)
I then built a regression tree model using the tree package and took an initial look at how the tree was split via the plot below.
# Building the regression tree model
crimeTree <- tree(Crime ~ ., data = crime)
summary(crimeTree)
##
## Regression tree:
## tree(formula = Crime ~ ., data = crime)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF" "NW"
## Number of terminal nodes: 7
## Residual mean deviance: 47400 = 1900000 / 40
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -574 -98 -2 0 111 490
# Viewing the split
crimeTree$frame
## var n dev yval splits.cutleft splits.cutright
## 1 Po1 47 6880928 905 <7.65 >7.65
## 2 Pop 23 779243 670 <22.5 >22.5
## 4 LF 12 243811 550 <0.5675 >0.5675
## 8 <leaf> 7 48519 467
## 9 <leaf> 5 77757 668
## 5 <leaf> 11 179471 800
## 3 NW 24 3604162 1131 <7.65 >7.65
## 6 Pop 10 557575 887 <21.5 >21.5
## 12 <leaf> 5 146391 1049
## 13 <leaf> 5 147771 725
## 7 Po1 14 2027225 1305 <9.65 >9.65
## 14 <leaf> 6 170828 1041
## 15 <leaf> 8 1124985 1503
# Generating a visual of the tree
plot(crimeTree)
text(crimeTree)
title("Crime Training Set Classification Tree")
Looking at the residual mean deviance above, the regression tree as is was not a good fit. Based on this, I tried pruning the tree in order to see if I could lower the residual mean deviance.
# Pruning the tree to see if we can get a better fit
nodes <- 5
pru.crimeTree <- prune.tree(crimeTree, best = nodes)
plot(pru.crimeTree)
text(pru.crimeTree)
title("Pruned Tree")
# Generating a summary of the pruned tree in order to check the residual mean deviance
summary(pru.crimeTree)
##
## Regression tree:
## snip.tree(tree = crimeTree, nodes = c(4L, 6L))
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "NW"
## Number of terminal nodes: 5
## Residual mean deviance: 54200 = 2280000 / 42
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -574 -108 16 0 123 490
Based on the output above, the residual mean deviance increased with pruning (i.e., the pruned model is a worse fit). I then ran cross-validation and plotted the deviance against tree size to help pick the number of terminal nodes. The cross-validated deviances are noisy and do not point clearly to a single size: the smallest values occur at 2 and 3 terminal nodes, while among the fuller trees the unpruned 7-node tree has the lowest deviance. Because a tree with only one or two splits would be too simple to be useful, I kept 7 terminal nodes, which also gives the lowest in-sample deviance.
crimeCV <- cv.tree(crimeTree)
prune.tree(crimeTree)$size
## [1] 7 6 5 4 3 2 1
prune.tree(crimeTree)$dev
## [1] 1895722 2013257 2276670 2632631 3364043 4383406 6880928
crimeCV$dev
## [1] 8133096 8308707 8430820 8451299 7891202 6947370 8399729
plot(crimeCV$size, crimeCV$dev, type = "b")
more.nodes <- 7
pru.crimeTree2 <- prune.tree(crimeTree, best = more.nodes)
plot(pru.crimeTree2)
text(pru.crimeTree2)
title("Pruned Tree 2")
summary(pru.crimeTree2)
##
## Regression tree:
## tree(formula = Crime ~ ., data = crime)
## Variables actually used in tree construction:
## [1] "Po1" "Pop" "LF" "NW"
## Number of terminal nodes: 7
## Residual mean deviance: 47400 = 1900000 / 40
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -574 -98 -2 0 111 490
I then looked at the quality of the fit for my regression tree model.
# Estimating the quality of the (in-sample) fit for the regression tree model
p.crimeTree <- predict(pru.crimeTree2, newdata = crime[,1:15])
rss <- sum((p.crimeTree - crime[,16])^2)
tss <- sum((crime[,16] - mean(crime[,16]))^2)
r2<- 1 - rss/tss
r2
## [1] 0.724
Key Takeaways: R2 is high (0.724) with the regression tree model, but this is an in-sample measure and I'm not convinced overfitting isn't an issue. The dataset is split almost evenly at the first Po1 split (23 observations below 7.65 and 24 above). Additionally, because pruning dropped the LF split and one of the Pop splits while keeping NW, NW appears to be a stronger predictor of crime than LF or Pop.
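To get a rough read on the overfitting concern (my own addition, not part of the original analysis), one could leave-one-out cross-validate the 7-node tree; a minimal sketch, with the caveat that the tree grown on each fold may have fewer than 7 leaves (prune.tree will warn and return the unpruned fold tree in that case):
# Leave-one-out cross-validation of the pruned regression tree
cv.pred <- numeric(nrow(crime))
for (j in 1:nrow(crime)) {
  # Grow and prune a tree without observation j, then predict that observation
  fold.tree <- prune.tree(tree(Crime ~ ., data = crime[-j, ]), best = more.nodes)
  cv.pred[j] <- predict(fold.tree, newdata = crime[j, ])
}
# Cross-validated R-squared, for comparison with the in-sample 0.724
1 - sum((cv.pred - crime$Crime)^2) / tss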
Random forest model time…
I first created a baseline random forest model in order to assess the baseline R2 and determine what I might change to get a better fit.
# Creating baseline random forest model
crimeRF <- randomForest(Crime ~ ., data = crime, importance = TRUE, nodesize = 5)
# With no newdata supplied, predict() returns the out-of-bag predictions,
# so this R-squared is an out-of-bag rather than an in-sample measure
p.crimeRF <- predict(crimeRF)
rssRF <- sum((p.crimeRF - crime[,16])^2)
r2RF <- 1 - rssRF/tss
r2RF
## [1] 0.41
The R2 for the baseline random forest model was lower than the R2 for the final regression tree model, although the comparison is not entirely fair: the random forest value is based on out-of-bag predictions, while the tree's 0.724 is an in-sample fit. I decided to play around with mtry and nodesize to see if I could get a better fit, using a loop to run through combinations of mtry and nodesize and record which one produced the best R2.
# Looping through different nodesize and mtry values
testRF <- data.frame(matrix(nrow = 14*20, ncol = 3))
colnames(testRF) <- c("NodeSize", "mtry", "R2")
i = 1
suppressWarnings(for (nodesize in 2:15) {
  for (m in 1:20) {
    testModel <- randomForest(Crime ~ ., data = crime, importance = TRUE, nodesize = nodesize, mtry = m)
    # Out-of-bag predictions again (no newdata supplied)
    p.test <- predict(testModel)
    rssTest <- sum((p.test - crime[,16])^2)
    tssTest <- sum((crime[,16] - mean(crime[,16]))^2)
    r2Test <- 1 - rssTest/tssTest
    testRF[i,1] <- nodesize
    testRF[i,2] <- m
    testRF[i,3] <- r2Test
    i = i + 1
  }
})
head(testRF)
## NodeSize mtry R2
## 1 2 1 0.392
## 2 2 2 0.430
## 3 2 3 0.438
## 4 2 4 0.429
## 5 2 5 0.395
## 6 2 6 0.436
# Reporting which combination gives us the highest R2 value
testRF[which.max(testRF[,3]),]
## NodeSize mtry R2
## 63 5 3 0.455
Based on the output from looping through combinations of mtry and nodesize values, a nodesize of 5 with an mtry of 3 produced the highest out-of-bag R2 in this run (about 0.455). Random forest results vary somewhat from run to run, so I fit the final model with a nearby combination, a nodesize of 4 and an mtry of 2.
crimeRF.final <- randomForest(Crime ~ ., data = crime, importance = TRUE, nodesize = 4, mtry = 2)
# Checking the importance of variables in the final random forest model
importance(crimeRF.final)
## %IncMSE IncNodePurity
## M 0.955 224726
## So 2.016 47896
## Ed 3.260 357977
## Po1 9.833 1060007
## Po2 8.531 733099
## LF 4.269 377348
## M.F 0.961 412577
## Pop 0.152 409669
## NW 8.425 521683
## U1 0.422 222610
## U2 3.603 254964
## Wealth 4.239 632035
## Ineq 0.331 279822
## Prob 8.175 684659
## Time 2.215 329479
varImpPlot(crimeRF.final)
The grid search above topped out at an out-of-bag R2 of roughly 0.45, and the final model uses a nodesize of 4 and an mtry of 2. Looking at the plot above, Po1, Po2, NW, and Prob appear to be the most influential variables in the model; this aligns with my findings from the regression tree model, where Po1 and NW were the most influential variables. Comparing the two approaches, the random forest model appears somewhat less accurate than the single regression tree (keeping in mind that the tree's R2 is an in-sample figure), and increasing mtry, the number of variables considered at each split, beyond small values did not improve accuracy.
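As a quick cross-check (my own addition), randomForest tracks its own out-of-bag pseudo R-squared, which should land in the same neighborhood as the values computed manually above:
# Out-of-bag pseudo R-squared reported by the final random forest (1 - OOB MSE / Var(y))
tail(crimeRF.final$rsq, 1)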
I work for a healthcare analytics company; pharmaceutical companies subscribe to our software platform to conduct data analytics through a point-and-click user interface. These subscriptions make up 90-95% of our annual revenue, and our business development and client support staff work hard to ensure clients are happy with the platform. One area where logistic regression could be useful is predicting which clients are at high risk of not renewing their subscriptions. Super-users are always in the platform, and their companies pose very little non-renewal risk; clients who hardly ever have users logged in are the at-risk accounts. Our company could use (1) number of total logins per month, (2) number of unique users per month, (3) number of customer support requests per month, (4) number of 1:1 client trainings per month, and (5) number of new data projects created per month to predict which clients are at risk of non-renewal.
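To make the idea concrete, here is a minimal sketch of what that model could look like; the data frame, column names, and simulated values are all hypothetical and invented purely for illustration:
# Hypothetical client-usage data; in practice these counts would come from platform logs
set.seed(1)
client_usage <- data.frame(
  logins    = rpois(200, 40),  # total logins per month
  users     = rpois(200, 8),   # unique users per month
  support   = rpois(200, 3),   # customer support requests per month
  trainings = rpois(200, 1),   # 1:1 client trainings per month
  projects  = rpois(200, 2)    # new data projects created per month
)
# Simulated response: lower usage -> higher chance of non-renewal (purely illustrative)
client_usage$non_renewal <- rbinom(200, 1, plogis(2 - 0.05*client_usage$logins - 0.2*client_usage$users))
# Logistic regression flagging clients at risk of non-renewal
churn.glm <- glm(non_renewal ~ logins + users + support + trainings + projects,
                 data = client_usage, family = binomial(link = "logit"))
summary(churn.glm)
Clients whose predicted probability of non-renewal exceeds some cutoff could then be flagged for proactive outreach by the client support team.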
First, I read in the credit data and looked at how responses were coded.
# Clearing environment
rm(list = ls())
# Loading necessary packages
#install.packages('dplyr')
library(dplyr)
library(caret) # for createDataPartition, used below
# Setting a seed in order to make results reproducible
set.seed(113)
# Reading in the German credit dataset
credit <- read.table('germancredit.txt', header = FALSE)
# Getting a sense for the data structure
str(credit)
# What do the responses look like
table(credit$V21)
I then recoded the responses from 1/2 to 0/1; this is not required for the analysis, but I prefer working with 0/1 responses when cleaning data.
# Recoding responses from 1 and 2 to 0 and 1
credit$V21[credit$V21==1] <- 0
credit$V21[credit$V21==2] <- 1
# Making sure the recode worked
table(credit$V21)
##
## 0 1
## 700 300
I measured model quality via accuracy on held-out data; to do this, I split the data into training (70%) and validation (30%) sets, fit a logistic regression on the training set, and assessed the importance of each predictor in the model.
# Splitting the data into the training and validation subsets
credit.partition <- createDataPartition(credit$V21, times = 1, p = 0.7, list = FALSE)
head(credit.partition)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 6
## [6,] 7
train <- credit[credit.partition,]
validation <- credit[-credit.partition,]
table(train$V21)
##
## 0 1
## 500 200
table(validation$V21)
##
## 0 1
## 200 100
# Creating the GLM
credit.glm <- glm(V21 ~ ., data = train, family = binomial(link = "logit"))
# Looking at the importance of each predictor in the GLM
summary(credit.glm)
##
## Call:
## glm(formula = V21 ~ ., family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.554 -0.649 -0.339 0.594 2.713
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.35e+00 1.36e+00 -0.99 0.31974
## V1A12 -3.50e-01 2.77e-01 -1.26 0.20729
## V1A13 -1.31e+00 5.06e-01 -2.59 0.00970 **
## V1A14 -1.68e+00 2.95e-01 -5.70 1.2e-08 ***
## V2 2.37e-02 1.16e-02 2.03 0.04219 *
## V3A31 2.17e-01 6.58e-01 0.33 0.74154
## V3A32 -7.10e-01 5.21e-01 -1.36 0.17309
## V3A33 -1.33e+00 5.86e-01 -2.27 0.02338 *
## V3A34 -2.03e+00 5.40e-01 -3.77 0.00017 ***
## V4A41 -1.65e+00 4.63e-01 -3.56 0.00037 ***
## V4A410 -1.07e+00 9.64e-01 -1.11 0.26761
## V4A42 -9.94e-01 3.34e-01 -2.98 0.00289 **
## V4A43 -9.79e-01 3.13e-01 -3.13 0.00175 **
## V4A44 -4.41e-01 1.01e+00 -0.44 0.66236
## V4A45 3.15e-01 6.97e-01 0.45 0.65183
## V4A46 -6.78e-01 4.95e-01 -1.37 0.17049
## V4A48 -1.53e+01 5.11e+02 -0.03 0.97608
## V4A49 -6.73e-01 4.39e-01 -1.53 0.12480
## V5 1.66e-04 5.70e-05 2.90 0.00368 **
## V6A62 -2.32e-01 3.55e-01 -0.65 0.51270
## V6A63 -3.33e-01 5.34e-01 -0.62 0.53270
## V6A64 -1.19e+00 6.31e-01 -1.88 0.06040 .
## V6A65 -8.59e-01 3.18e-01 -2.70 0.00692 **
## V7A72 5.40e-02 5.27e-01 0.10 0.91854
## V7A73 -1.70e-01 5.03e-01 -0.34 0.73474
## V7A74 -1.00e+00 5.48e-01 -1.83 0.06761 .
## V7A75 -4.66e-01 5.16e-01 -0.90 0.36675
## V8 3.21e-01 1.15e-01 2.80 0.00508 **
## V9A92 2.49e-02 4.86e-01 0.05 0.95922
## V9A93 -6.62e-01 4.76e-01 -1.39 0.16451
## V9A94 -2.98e-01 5.73e-01 -0.52 0.60345
## V10A102 5.28e-01 4.97e-01 1.06 0.28782
## V10A103 -1.13e+00 5.40e-01 -2.09 0.03688 *
## V11 8.44e-02 1.11e-01 0.76 0.44639
## V12A122 1.62e-01 3.32e-01 0.49 0.62489
## V12A123 4.99e-01 2.94e-01 1.70 0.08990 .
## V12A124 1.28e+00 5.17e-01 2.47 0.01335 *
## V13 -1.10e-02 1.12e-02 -0.98 0.32510
## V14A142 1.12e-01 5.38e-01 0.21 0.83546
## V14A143 -1.20e-01 3.02e-01 -0.40 0.69020
## V15A152 -3.76e-01 2.93e-01 -1.29 0.19831
## V15A153 -8.25e-01 5.84e-01 -1.41 0.15785
## V16 4.37e-01 2.36e-01 1.85 0.06467 .
## V17A172 1.30e+00 8.69e-01 1.50 0.13345
## V17A173 1.30e+00 8.35e-01 1.56 0.11974
## V17A174 1.32e+00 8.44e-01 1.57 0.11688
## V18 1.67e-01 3.17e-01 0.53 0.59877
## V19A192 -4.00e-01 2.60e-01 -1.54 0.12379
## V20A202 -1.74e+00 9.37e-01 -1.86 0.06322 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 837.58 on 699 degrees of freedom
## Residual deviance: 583.60 on 651 degrees of freedom
## AIC: 681.6
##
## Number of Fisher Scoring iterations: 14
I then estimated a baseline prediction using my GLM and all predictors in order to determine how the model was performing. I assumed a threshold of 0.5.
# Generating the baseline prediction using the GLM and all predictors
p.credit <- predict(credit.glm, newdata = validation[,-21], type = "response")
table(validation$V21, round(p.credit))
##
## 0 1
## 0 175 25
## 1 59 41
Based on the results above, my initial GLM does a decent job classifying the good borrowers (coded 0; 175 of 200 correct), but many bad borrowers (coded 1) are misclassified as good (59 of 100). Given that the cost of misclassifying a bad borrower is 5x that of misclassifying a good borrower, I decided to investigate reducing the misclassification of bad borrowers, first by considering how the cost ratio could guide the threshold choice (see the sketch below) and then by refitting the model with only the most significant predictors.
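This is a minimal sketch (my own addition, not part of the assigned analysis) of using the 5:1 cost ratio to pick a classification threshold on the validation set, applied to the full-model predictions p.credit from above:
# Sweep thresholds and compute total misclassification cost on the validation set,
# counting each bad borrower predicted as good as 5 and each good borrower predicted as bad as 1
thresholds <- seq(0.05, 0.95, by = 0.05)
costs <- sapply(thresholds, function(t) {
  pred <- as.integer(p.credit > t)
  5*sum(pred == 0 & validation$V21 == 1) + 1*sum(pred == 1 & validation$V21 == 0)
})
# Threshold with the lowest weighted cost
thresholds[which.min(costs)]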
# Creating 0/1 indicator variables in the training set for the significant predictor levels
train$V1A14 <- as.numeric(train$V1 == "A14")
train$V3A34 <- as.numeric(train$V3 == "A34")
train$V4A41 <- as.numeric(train$V4 == "A41")
train$V4A43 <- as.numeric(train$V4 == "A43")
# Rerunning the GLM
credit.glm2 <- glm(V21 ~ V1A14+V2+V3A34+V4A41+V4A43, data = train, family = binomial(link = "logit"))
summary(credit.glm2)
##
## Call:
## glm(formula = V21 ~ V1A14 + V2 + V3A34 + V4A41 + V4A43, family = binomial(link = "logit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.620 -0.797 -0.476 0.960 2.409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.85642 0.20982 -4.08 4.5e-05 ***
## V1A14 -1.48802 0.22587 -6.59 4.5e-11 ***
## V2 0.03865 0.00751 5.15 2.7e-07 ***
## V3A34 -0.96499 0.23596 -4.09 4.3e-05 ***
## V4A41 -0.70293 0.35575 -1.98 0.0482 *
## V4A43 -0.68179 0.22008 -3.10 0.0019 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 837.58 on 699 degrees of freedom
## Residual deviance: 703.03 on 694 degrees of freedom
## AIC: 715
##
## Number of Fisher Scoring iterations: 5
# Creating the same indicator variables in the validation set
validation$V1A14 <- as.numeric(validation$V1 == "A14")
validation$V3A34 <- as.numeric(validation$V3 == "A34")
validation$V4A41 <- as.numeric(validation$V4 == "A41")
validation$V4A43 <- as.numeric(validation$V4 == "A43")
# Using GLM to predict on validation data
p.credit2 <- predict(credit.glm2, newdata = validation[,-21], type = "response")
x <- as.matrix(table(round(p.credit2), validation$V21))
names(dimnames(x)) <- c("Predicted", "Observed")
x
## Observed
## Predicted 0 1
## 0 184 70
## 1 16 30
# Calculating accuracy and specificity at a higher threshold
threshold <- 0.7
x2 <- as.matrix(table(as.integer(p.credit2 > threshold), validation$V21))
names(dimnames(x2)) <- c("Predicted", "Observed")
x2
##          Observed
## Predicted   0   1
##         0 197  93
##         1   3   7
# Accuracy
accuracy <- (x2[1,1]+x2[2,2])/(x2[1,1]+x2[1,2]+x2[2,1]+x2[2,2])
accuracy
## [1] 0.68
# Specificity
specificity <- (x2[1,1])/(x2[1,1]+x2[2,1])
specificity
## [1] 0.985
Assuming a threshold of 0.7, I observed a specificity of 98.5% and an accuracy of 68%. In other words, only about 1.5% of good borrowers are misclassified as bad; however, most bad borrowers (93 of 100 in the validation set) are still classified as good at this threshold, and that is the costly error given the 5:1 cost ratio, so in practice the threshold would need to be moved lower rather than higher.
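To quantify that trade-off (my own addition), the 5:1-weighted misclassification cost of the reduced model can be compared at the 0.5 and 0.7 thresholds; from the confusion matrices above this works out to 5*70 + 16 = 366 at 0.5 versus 5*93 + 3 = 468 at 0.7.
# Total weighted cost on the validation set: bad-as-good errors count 5, good-as-bad errors count 1
cost_at <- function(t) {
  pred <- as.integer(p.credit2 > t)
  5*sum(pred == 0 & validation$V21 == 1) + sum(pred == 1 & validation$V21 == 0)
}
c(cost_0.5 = cost_at(0.5), cost_0.7 = cost_at(0.7))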