Qu 1. Does a persons race affect there chances of gtting a loan? What percent of Black, Hispanic, and White applicants got their loan applications approved. Qu 2. Does a loan applicants mortgage history like late payments affect their chances of getting a loan Qu 3. What percentage of an applicants income spent on housing. Does monthly housing expenses exceed 30% of household income. Qu 4. All other things being equal, does level of education improve ones changes of getting a mortgage loan. Qu 5. Does appraised values of homes correlate with purchase price
library(magrittr)
library(ggplot2)
library(lattice)
library(RColorBrewer)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Read the data
loandata <- read.csv('loanapp.csv', header = TRUE)
# Display 20 rows of the loan data
head(loandata, n = 20)
## id occ loanamt action msa suffolk appinc typur unit married dep emp yjob
## 1 300 1 25 1 1120 1 48 0 1 1 0 0 9
## 2 1893 1 36 1 1120 0 36 0 1 1 0 0 1
## 3 767 1 42 1 1120 0 45 0 1 1 0 0 0
## 4 82 1 40 3 1120 0 50 0 1 0 4 0 0
## 5 258 1 110 3 1120 0 60 0 1 1 1 0 1
## 6 1121 1 45 1 1120 0 21 0 1 0 0 0 0
## 7 1150 1 67 3 1120 1 72 0 1 1 1 0 0
## 8 223 1 44 1 1120 0 22 0 1 0 0 0 0
## 9 631 1 51 1 1120 0 34 0 1 0 0 0 0
## 10 119 1 47 1 1120 0 36 0 1 0 0 0 1
## 11 716 1 50 1 1120 0 29 0 1 0 0 0 0
## 12 1115 1 46 3 1120 0 28 0 1 0 0 0 0
## 13 1839 1 9 3 1120 0 28 0 1 0 0 0 0
## 14 120 1 48 3 1120 0 42 0 1 1 1 0 0
## 15 29 1 17 1 1120 0 31 0 1 1 2 0 0
## 16 698 1 59 1 1120 0 29 0 1 0 0 0 1
## 17 534 1 50 1 1120 0 39 1 1 0 0 0 0
## 18 652 1 48 1 1120 0 59 5 1 0 0 0 0
## 19 180 1 26 1 1120 0 60 0 1 0 0 0 0
## 20 1022 1 28 1 1120 0 35 0 1 1 0 0 1
## self atotinc cototinc hexp price other liq rep gdlin lines mortg cons
## 1 0 4030 0 720.0 25.0 0 6.500 1 1 4 1 1
## 2 0 3166 0 577.0 45.0 0 19.000 1 1 20 1 1
## 3 0 3716 0 672.0 47.0 0 14.000 4 1 9 2 1
## 4 0 4166 0 429.0 50.0 0 17.000 3 0 18 3 3
## 5 1 1000 3022 1130.0 50.0 0 26.000 2 1 21 2 1
## 6 0 1776 0 454.0 51.0 0 7.000 1 1 1 2 4
## 7 0 3875 0 746.0 96.0 0 38.000 1 1 10 2 1
## 8 0 1833 0 458.0 55.0 0 18.000 0 1 2 2 1
## 9 0 2873 0 629.0 57.0 0 30.000 2 1 14 2 1
## 10 0 1800 1213 486.0 67.5 0 5.152 1 1 15 2 2
## 11 0 2463 0 618.0 56.0 0 9.000 2 1 19 2 1
## 12 0 2378 0 636.0 61.0 410 9.000 1 1 11 1 2
## 13 0 2378 0 636.0 61.0 82 9.000 1 1 11 1 1
## 14 0 1888 1620 571.1 62.9 0 34.000 1 1 15 2 2
## 15 0 1299 1086 342.0 57.0 0 50.000 1 1 1 2 1
## 16 0 2479 0 699.0 65.0 0 10.000 2 1 9 2 1
## 17 0 3254 0 574.0 63.0 0 14.000 1 1 5 1 3
## 18 0 4885 0 564.0 60.0 0 99.000 1 1 10 2 1
## 19 0 2600 2388 464.4 29.0 0 7.000 2 1 19 2 2
## 20 0 3016 0 450.9 31.0 0 141.000 1 1 19 2 1
## pubrec hrat obrat fixadj term apr prop inss inson gift cosign unver review
## 1 0 18.00 47.0 0 120 25.0 2 0 0 0 0 0 1
## 2 0 16.00 28.0 1 360 45.0 1 0 0 0 0 0 2
## 3 0 18.00 23.0 0 180 47.0 1 0 0 0 0 0 2
## 4 1 13.00 47.0 0 180 50.0 2 0 0 0 0 0 2
## 5 0 29.00 45.0 1 360 50.0 2 0 0 1 0 0 1
## 6 0 25.38 25.4 1 360 50.0 1 0 0 0 0 0 2
## 7 0 19.00 37.0 0 360 50.0 1 0 0 0 0 0 1
## 8 0 25.00 33.0 1 360 55.0 1 0 0 0 0 0 1
## 9 0 22.00 26.0 0 360 57.0 1 0 0 0 0 0 2
## 10 0 19.00 32.0 1 360 58.5 1 1 0 0 0 0 4
## 11 0 27.00 39.0 0 360 60.0 1 0 0 0 0 0 4
## 12 0 28.00 35.0 0 360 62.0 1 0 0 1 0 0 1
## 13 0 28.00 35.0 0 360 62.0 1 0 0 1 0 0 1
## 14 0 16.00 19.0 0 360 62.9 2 0 0 0 0 1 2
## 15 0 14.00 14.0 0 120 65.0 2 0 0 0 0 0 1
## 16 0 28.89 32.9 1 360 65.0 1 0 0 0 0 0 1
## 17 0 18.00 33.0 0 360 66.0 2 0 0 0 0 0 1
## 18 0 11.00 16.0 0 360 66.0 1 0 0 0 0 0 1
## 19 0 10.00 19.0 1 300 67.0 1 0 0 0 0 0 2
## 20 0 15.00 33.0 1 300 67.0 1 0 0 0 0 0 2
## netw unem min30 bd mi old vr sch black hispan male reject approve mortno
## 1 108.250 10.6 1 1 0 1 1 0 0 0 1 0 1 1
## 2 25.000 3.2 0 1 1 1 1 1 0 0 0 0 1 1
## 3 39.000 3.9 0 1 1 0 1 1 0 0 1 0 1 0
## 4 102.000 3.1 0 0 1 1 1 1 0 0 1 1 0 0
## 5 33.000 10.6 0 0 0 1 1 1 0 0 1 1 0 0
## 6 8.000 2.0 0 1 1 1 1 0 0 0 1 0 1 0
## 7 37.000 3.2 0 0 0 1 1 1 0 1 1 1 0 0
## 8 17.000 5.3 0 0 1 0 0 0 0 1 0 0 1 0
## 9 74.000 2.0 0 1 1 1 1 1 0 0 0 0 1 0
## 10 -11.310 4.3 0 0 1 0 0 1 0 0 1 0 1 0
## 11 37.000 3.2 0 1 1 0 1 1 0 0 1 0 1 0
## 12 43.000 3.2 0 0 1 1 1 1 0 0 0 1 0 1
## 13 43.000 3.2 0 0 1 1 1 1 0 0 0 1 0 1
## 14 42.883 3.1 0 0 1 1 0 0 0 1 1 1 0 0
## 15 53.000 3.2 0 0 1 1 1 1 0 0 1 0 1 0
## 16 37.000 3.6 0 1 1 0 1 1 0 0 1 0 1 0
## 17 -95.000 3.1 0 0 1 0 1 1 0 0 1 0 1 1
## 18 189.000 10.6 0 1 1 1 1 0 0 0 1 0 1 0
## 19 67.000 3.2 0 0 1 0 0 1 0 0 0 0 1 0
## 20 147.000 1.8 0 0 1 1 0 0 0 0 0 0 1 0
## mortperf mortlat1 mortlat2 chist multi loanprc thick white
## 1 0 0 0 1 0 1.0000000 0 1
## 2 0 0 0 1 0 0.8000000 0 1
## 3 1 0 0 1 0 0.8936170 1 1
## 4 0 1 0 1 0 0.8000000 1 1
## 5 1 0 0 1 0 2.2000000 0 1
## 6 1 0 0 1 0 0.8823529 0 1
## 7 1 0 0 1 0 0.6979167 0 0
## 8 1 0 0 1 0 0.8000000 0 0
## 9 1 0 0 1 0 0.8947368 0 1
## 10 1 0 0 1 0 0.6962963 0 1
## 11 1 0 0 1 0 0.8928571 0 1
## 12 0 0 0 1 0 0.7540984 0 1
## 13 0 0 0 1 0 0.1475410 0 1
## 14 1 0 0 1 0 0.7631161 0 0
## 15 1 0 0 1 0 0.2982456 0 1
## 16 1 0 0 1 0 0.9076923 0 1
## 17 0 0 0 1 0 0.7936508 0 1
## 18 1 0 0 1 0 0.8000000 0 1
## 19 1 0 0 1 0 0.8965517 0 1
## 20 1 0 0 1 0 0.9032258 0 1
Used the summary function in the base R package to gain an overview of the data
# Generate an overview of the data
summary(loandata)
## id occ loanamt action msa
## Min. : 1 Min. :1.000 Min. : 2.0 Min. :1.000 Min. :1120
## 1st Qu.: 498 1st Qu.:1.000 1st Qu.:100.0 1st Qu.:1.000 1st Qu.:1120
## Median : 995 Median :1.000 Median :126.0 Median :1.000 Median :1120
## Mean : 995 Mean :1.032 Mean :143.2 Mean :1.276 Mean :1120
## 3rd Qu.:1492 3rd Qu.:1.000 3rd Qu.:165.0 3rd Qu.:1.000 3rd Qu.:1120
## Max. :1989 Max. :3.000 Max. :980.0 Max. :3.000 Max. :1120
##
## suffolk appinc typur unit
## Min. :0.0000 Min. : 0.00 Min. :0.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.: 48.00 1st Qu.:0.000 1st Qu.:1.000
## Median :0.0000 Median : 64.00 Median :0.000 Median :1.000
## Mean :0.1543 Mean : 84.68 Mean :1.531 Mean :1.122
## 3rd Qu.:0.0000 3rd Qu.: 88.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :1.0000 Max. :972.00 Max. :9.000 Max. :4.000
## NA's :4
## married dep emp yjob
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :1.0000 Median :0.0000 Median :0.0000 Median :0.000
## Mean :0.6586 Mean :0.7709 Mean :0.2097 Mean :0.449
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.000
## Max. :1.0000 Max. :8.0000 Max. :9.0000 Max. :9.000
## NA's :3 NA's :3
## self atotinc cototinc hexp
## Min. :0.0000 Min. : 0 Min. : 0 Min. : 154
## 1st Qu.:0.0000 1st Qu.: 2876 1st Qu.: 0 1st Qu.: 1054
## Median :0.0000 Median : 3813 Median : 1145 Median : 1317
## Mean :0.1292 Mean : 5196 Mean : 1547 Mean : 1505
## 3rd Qu.:0.0000 3rd Qu.: 5596 3rd Qu.: 2417 3rd Qu.: 1715
## Max. :1.0000 Max. :81000 Max. :41667 Max. :10798
##
## price other liq rep
## Min. : 25.0 Min. : 0.00 Min. : 0 Min. :0.000
## 1st Qu.: 129.0 1st Qu.: 0.00 1st Qu.: 20 1st Qu.:1.000
## Median : 163.0 Median : 0.00 Median : 38 Median :1.000
## Mean : 196.3 Mean : 2.37 Mean : 4618 Mean :1.503
## 3rd Qu.: 225.0 3rd Qu.: 0.00 3rd Qu.: 83 3rd Qu.:2.000
## Max. :1535.0 Max. :1020.00 Max. :1000000 Max. :9.000
## NA's :9
## gdlin lines mortg cons
## Min. : 0.000 Min. : 0.0 Min. :1.000 Min. :1.00
## 1st Qu.: 1.000 1st Qu.: 7.0 1st Qu.:1.000 1st Qu.:1.00
## Median : 1.000 Median : 12.0 Median :2.000 Median :1.00
## Mean : 1.583 Mean : 516.4 Mean :1.708 Mean :2.11
## 3rd Qu.: 1.000 3rd Qu.: 19.0 3rd Qu.:2.000 3rd Qu.:2.00
## Max. :666.000 Max. :999999.4 Max. :4.000 Max. :6.00
##
## pubrec hrat obrat fixadj
## Min. :0.00000 Min. : 1.00 Min. : 0.00 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:21.00 1st Qu.:28.00 1st Qu.:0.0000
## Median :0.00000 Median :25.77 Median :33.00 Median :0.0000
## Mean :0.06888 Mean :24.79 Mean :32.39 Mean :0.3082
## 3rd Qu.:0.00000 3rd Qu.:29.00 3rd Qu.:37.00 3rd Qu.:1.0000
## Max. :1.00000 Max. :72.00 Max. :95.00 Max. :1.0000
##
## term apr prop inss
## Min. : 6 Min. : 25.0 Min. :1.000 Min. :0.0000
## 1st Qu.: 360 1st Qu.: 135.0 1st Qu.:2.000 1st Qu.:0.0000
## Median : 360 Median : 169.0 Median :2.000 Median :0.0000
## Mean : 2352 Mean : 205.1 Mean :1.861 Mean :0.2001
## 3rd Qu.: 360 3rd Qu.: 230.0 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :999999 Max. :4316.0 Max. :3.000 Max. :1.0000
##
## inson gift cosign unver
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.01508 Mean :0.1599 Mean :0.02866 Mean :0.04274
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000
##
## review netw unem min30
## Min. : 0.0 Min. :-7919.0 Min. : 1.800 Min. :0.00000
## 1st Qu.: 1.0 1st Qu.: 43.0 1st Qu.: 3.100 1st Qu.:0.00000
## Median : 2.0 Median : 95.0 Median : 3.200 Median :0.00000
## Mean :113.7 Mean : 266.6 Mean : 3.882 Mean :0.05703
## 3rd Qu.: 3.0 3rd Qu.: 229.6 3rd Qu.: 3.900 3rd Qu.:0.00000
## Max. :999.0 Max. :28023.0 Max. :10.600 Max. :1.00000
## NA's :183
## bd mi old vr
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.4208 Mean :0.8728 Mean :0.4676 Mean :0.4098
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## sch black hispan male
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.0000
## Median :1.0000 Median :0.00000 Median :0.00000 Median :1.0000
## Mean :0.7717 Mean :0.09904 Mean :0.05581 Mean :0.8131
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## NA's :15
## reject approve mortno mortperf
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :1.0000 Median :0.0000 Median :1.0000
## Mean :0.1227 Mean :0.8773 Mean :0.3318 Mean :0.6385
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## mortlat1 mortlat2 chist multi
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :1.0000 Median :0.00000
## Mean :0.01911 Mean :0.01056 Mean :0.8376 Mean :0.08615
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## NA's :4
## loanprc thick white
## Min. :0.02105 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.70000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.80000 Median :0.0000 Median :1.0000
## Mean :0.77064 Mean :0.1051 Mean :0.8451
## 3rd Qu.:0.89894 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :2.57143 Max. :1.0000 Max. :1.0000
## NA's :9
Let find out if there are outlier in the purchased and appraised values
purchased.appraised.prices <- data.frame(cbind(loandata$price, loandata$apr))
names(purchased.appraised.prices) <- c('Purchased Price', 'Appraised Prices')
boxplot(purchased.appraised.prices)
There are outliers in the data. Let us identify them by calculating the
quantile of values.
Q.appraised.value <- quantile(loandata$apr, na.rm = FALSE)
Q.purchased.price <- quantile(loandata$price, na.rm = FALSE)
Quantiles <- data.frame(cbind(Q.purchased.price, Q.appraised.value))
names(Quantiles) <- c('Purchased Price', 'Appraised Value')
Quantiles
## Purchased Price Appraised Value
## 0% 25 25
## 25% 129 135
## 50% 163 169
## 75% 225 230
## 100% 1535 4316
Let us calculate the interquantile range of the values
iqr.appraised.value <- IQR(loandata$apr)
iqr.purchased.prices <- IQR(loandata$price)
iqr <- data.frame(cbind(iqr.purchased.prices,iqr.appraised.value))
names(iqr) <- c('Purchased Price', 'Appraised Value')
iqr
## Purchased Price Appraised Value
## 1 96 95
We will find the cut-off ranges beyond which all data points are outliers.
up.purchased.price <- Q.purchased.price[2]+1.5*iqr.purchased.prices # Upper Range
low.purchased.price<- Q.purchased.price[1]-1.5*iqr.purchased.prices # Lower Range
up.appraised.value <- Q.appraised.value[2]+1.5*iqr.appraised.value # Upper Range
low.appraised.value<- Q.appraised.value[1]-1.5*iqr.appraised.value # Lower Range
purchased.price.cut.off.values <- c(low = low.purchased.price, up = up.purchased.price)
appraise.value.cut.off.values <- c(low = low.appraised.value, up = up.appraised.value)
cut.off.values <- data.frame(cbind(purchased.price.cut.off.values,appraise.value.cut.off.values))
names(cut.off.values) <- c('Purchased Price', 'Appraised Value')
cut.off.values
## Purchased Price Appraised Value
## low.0% -119 -117.5
## up.25% 273 277.5
Eliminate the extreme values
loandata.cleaned <- subset(loandata, loandata$apr > (Q.appraised.value[2] - 1.5*iqr.appraised.value) & loandata$apr < (Q.appraised.value[3]+1.5*iqr.appraised.value))
loandata.cleaned <- subset(loandata.cleaned, loandata.cleaned$price > (Q.purchased.price[2] - 1.5*iqr.purchased.prices) & loandata.cleaned$price < (Q.purchased.price[3]+1.5*iqr.purchased.prices))
head(loandata.cleaned)
## id occ loanamt action msa suffolk appinc typur unit married dep emp yjob
## 1 300 1 25 1 1120 1 48 0 1 1 0 0 9
## 2 1893 1 36 1 1120 0 36 0 1 1 0 0 1
## 3 767 1 42 1 1120 0 45 0 1 1 0 0 0
## 4 82 1 40 3 1120 0 50 0 1 0 4 0 0
## 5 258 1 110 3 1120 0 60 0 1 1 1 0 1
## 6 1121 1 45 1 1120 0 21 0 1 0 0 0 0
## self atotinc cototinc hexp price other liq rep gdlin lines mortg cons pubrec
## 1 0 4030 0 720 25 0 6.5 1 1 4 1 1 0
## 2 0 3166 0 577 45 0 19.0 1 1 20 1 1 0
## 3 0 3716 0 672 47 0 14.0 4 1 9 2 1 0
## 4 0 4166 0 429 50 0 17.0 3 0 18 3 3 1
## 5 1 1000 3022 1130 50 0 26.0 2 1 21 2 1 0
## 6 0 1776 0 454 51 0 7.0 1 1 1 2 4 0
## hrat obrat fixadj term apr prop inss inson gift cosign unver review netw
## 1 18.00 47.0 0 120 25 2 0 0 0 0 0 1 108.25
## 2 16.00 28.0 1 360 45 1 0 0 0 0 0 2 25.00
## 3 18.00 23.0 0 180 47 1 0 0 0 0 0 2 39.00
## 4 13.00 47.0 0 180 50 2 0 0 0 0 0 2 102.00
## 5 29.00 45.0 1 360 50 2 0 0 1 0 0 1 33.00
## 6 25.38 25.4 1 360 50 1 0 0 0 0 0 2 8.00
## unem min30 bd mi old vr sch black hispan male reject approve mortno mortperf
## 1 10.6 1 1 0 1 1 0 0 0 1 0 1 1 0
## 2 3.2 0 1 1 1 1 1 0 0 0 0 1 1 0
## 3 3.9 0 1 1 0 1 1 0 0 1 0 1 0 1
## 4 3.1 0 0 1 1 1 1 0 0 1 1 0 0 0
## 5 10.6 0 0 0 1 1 1 0 0 1 1 0 0 1
## 6 2.0 0 1 1 1 1 0 0 0 1 0 1 0 1
## mortlat1 mortlat2 chist multi loanprc thick white
## 1 0 0 1 0 1.0000000 0 1
## 2 0 0 1 0 0.8000000 0 1
## 3 0 0 1 0 0.8936170 1 1
## 4 1 0 1 0 0.8000000 1 1
## 5 0 0 1 0 2.2000000 0 1
## 6 0 0 1 0 0.8823529 0 1
Now we will display a boxplot of our data without the outliers.
purchased.appraised.prices.cleaned <- data.frame(cbind(loandata.cleaned$price, loandata.cleaned$apr))
names(purchased.appraised.prices.cleaned) <- c('Purchased Price', 'Appraised Prices')
boxplot(purchased.appraised.prices.cleaned)
Below are the summary statistics of applicant’s annual income, monthly
income, co applicants monthly income, loan amount, appraised value, and
purchased price, housing expense to income
#Calculate the mean
Mean <- c('Appraised Value' = mean(loandata.cleaned$apr, na.rm = TRUE)*1000, 'Loan Amount' = mean(loandata.cleaned$loanamt, na.rm = TRUE)*1000, 'Purchased Price' = mean(loandata.cleaned$price, na.rm = TRUE)*1000, 'Total Income' = mean(loandata.cleaned$appinc, na.rm = TRUE)*1000, 'Monthly Income' = mean(loandata.cleaned$atotinc, na.rm = TRUE), 'Proposed Housing Expenses' = mean(loandata.cleaned$hexp, na.rm = TRUE), 'Housing Expense-Income Ratio' = mean(loandata.cleaned$hrat, na.rm = TRUE))
#Calculate the median
Median <- c('Appraised Value' = median(loandata.cleaned$apr, na.rm = TRUE)*1000, 'Loan Amount' = median(loandata.cleaned$loanamt, na.rm = TRUE)*1000, 'Purchased Price' = median(loandata.cleaned$price, na.rm = TRUE)*1000, 'Total Income' = median(loandata.cleaned$appinc, na.rm = TRUE)*1000, 'Monthly Income' = median(loandata.cleaned$atotinc, na.rm = TRUE), 'Proposed Housing Expenses' = median(loandata.cleaned$hexp, na.rm = TRUE), 'Housing Expense-Income Ratio' = mean(loandata.cleaned$hrat, na.rm = TRUE))
#Calculate the quartiles.
quartiles.appraised <- quantile(loandata.cleaned$apr, na.rm = TRUE)*1000
quartiles.loan.ammount <- quantile(loandata.cleaned$loanamt, na.rm = TRUE)*1000
quartiles.purchased.price <- quantile(loandata.cleaned$price, na.rm = TRUE)*1000
quartiles.total.income <- quantile(loandata.cleaned$appinc, na.rm = TRUE)*1000
quartiles.monthly.income <- quantile(loandata.cleaned$atotinc, na.rm = TRUE)
quartiles.proposed.housing.expenses <- quantile(loandata.cleaned$hexp, na.rm = TRUE)
quartiles.mortgage.to.income.ratio <- quantile(loandata.cleaned$hrat, na.rm = TRUE)
quartiles <- rbind('Appraised Value' = quartiles.appraised, 'Loan Amount' = quartiles.loan.ammount, 'Purchased Price' = quartiles.purchased.price, 'Total Income' = quartiles.total.income, 'Monthly Income' = quartiles.monthly.income, 'Proposed Housing Expenses' =quartiles.proposed.housing.expenses, 'Housing Expense-Income Ratio' = quartiles.mortgage.to.income.ratio)
# X0 = '1st Quartile', X25 = '2nd Quartile', X75 = '3rd Quartile', X100 = '4th Quartile'))
summary.data <- data.frame(cbind(Mean, Median), quartiles)
summary.data
## Mean Median X0. X25. X50.
## Appraised Value 168210.13164 160000.00000 25000 130850 160000.0
## Loan Amount 124153.40584 122000.00000 2000 98000 122000.0
## Purchased Price 161993.89808 154000.00000 25000 125000 154000.0
## Total Income 74176.87464 60000.00000 4000 48000 60000.0
## Monthly Income 4228.00515 3579.00000 0 2750 3579.0
## Proposed Housing Expenses 1302.19284 1256.00000 156 1011 1256.0
## Housing Expense-Income Ratio 24.85077 24.85077 1 21 25.7
## X75. X100.
## Appraised Value 200000 310000
## Loan Amount 150000 267000
## Purchased Price 194500 305000
## Total Income 77000 666000
## Monthly Income 4798 39147
## Proposed Housing Expenses 1541 4080
## Housing Expense-Income Ratio 29 72
Based on the above summary statistics, on average, housing expense to monthly income ratio is below 30% which is recommended. The data also shows that, on average, the purchased price is higher than the loan amount. Home buyers are using other sources of financing.
First, we will make a selection of columns of interest
loandataDF <- loandata.cleaned %>% data.frame() %>% select(loanamt, action, appinc, atotinc, cototinc, hexp, price, other, liq, apr )
head(loandataDF, 10)
## loanamt action appinc atotinc cototinc hexp price other liq apr
## 1 25 1 48 4030 0 720 25.0 0 6.500 25.0
## 2 36 1 36 3166 0 577 45.0 0 19.000 45.0
## 3 42 1 45 3716 0 672 47.0 0 14.000 47.0
## 4 40 3 50 4166 0 429 50.0 0 17.000 50.0
## 5 110 3 60 1000 3022 1130 50.0 0 26.000 50.0
## 6 45 1 21 1776 0 454 51.0 0 7.000 50.0
## 7 67 3 72 3875 0 746 96.0 0 38.000 50.0
## 8 44 1 22 1833 0 458 55.0 0 18.000 55.0
## 9 51 1 34 2873 0 629 57.0 0 30.000 57.0
## 10 47 1 36 1800 1213 486 67.5 0 5.152 58.5
Now, we will add a new columns Race. We will also replace the values in action column as follows: 1 = approved, 2 = approved, 3 = rejected
loandataDF$Race[loandata.cleaned$black == 1] <- 'Black'
loandataDF$Race[loandata.cleaned$hispan == 1] <- 'Hispanic'
loandataDF$Race[loandata.cleaned$white == 1] <- 'White'
loandataDF$action[loandataDF$action == 1] <- 'Approved'
loandataDF$action[loandataDF$action == 2] <- 'Approved'
loandataDF$action[loandataDF$action == 3] <- 'Rejected'
head(loandataDF, n = 5)
## loanamt action appinc atotinc cototinc hexp price other liq apr Race
## 1 25 Approved 48 4030 0 720 25 0 6.5 25 White
## 2 36 Approved 36 3166 0 577 45 0 19.0 45 White
## 3 42 Approved 45 3716 0 672 47 0 14.0 47 White
## 4 40 Rejected 50 4166 0 429 50 0 17.0 50 White
## 5 110 Rejected 60 1000 3022 1130 50 0 26.0 50 White
Next, we will rename the columns. But first divide the Total Monthly Income and Co-applicants Monthly Income by 1000 to bring the finances to the same denominator.
loandataDF$atotinc <- loandataDF$atotinc/1000
loandataDF$cototinc <- loandataDF$cototinc/1000
names(loandataDF) <- c('Loan Amount', 'Action', 'Applicant Income', 'Total Monthly Income', 'Co-applicants Monthly Income', 'Propose Housing Expense', 'Purchase Price', 'Other Financing', 'Liquid Asset', 'Appraised Value', 'Race')
head(loandataDF, n = 20)
## Loan Amount Action Applicant Income Total Monthly Income
## 1 25 Approved 48 4.030
## 2 36 Approved 36 3.166
## 3 42 Approved 45 3.716
## 4 40 Rejected 50 4.166
## 5 110 Rejected 60 1.000
## 6 45 Approved 21 1.776
## 7 67 Rejected 72 3.875
## 8 44 Approved 22 1.833
## 9 51 Approved 34 2.873
## 10 47 Approved 36 1.800
## 11 50 Approved 29 2.463
## 12 46 Rejected 28 2.378
## 13 9 Rejected 28 2.378
## 14 48 Rejected 42 1.888
## 15 17 Approved 31 1.299
## 16 59 Approved 29 2.479
## 17 50 Approved 39 3.254
## 18 48 Approved 59 4.885
## 19 26 Approved 60 2.600
## 20 28 Approved 35 3.016
## Co-applicants Monthly Income Propose Housing Expense Purchase Price
## 1 0.000 720.0 25.0
## 2 0.000 577.0 45.0
## 3 0.000 672.0 47.0
## 4 0.000 429.0 50.0
## 5 3.022 1130.0 50.0
## 6 0.000 454.0 51.0
## 7 0.000 746.0 96.0
## 8 0.000 458.0 55.0
## 9 0.000 629.0 57.0
## 10 1.213 486.0 67.5
## 11 0.000 618.0 56.0
## 12 0.000 636.0 61.0
## 13 0.000 636.0 61.0
## 14 1.620 571.1 62.9
## 15 1.086 342.0 57.0
## 16 0.000 699.0 65.0
## 17 0.000 574.0 63.0
## 18 0.000 564.0 60.0
## 19 2.388 464.4 29.0
## 20 0.000 450.9 31.0
## Other Financing Liquid Asset Appraised Value Race
## 1 0 6.500 25.0 White
## 2 0 19.000 45.0 White
## 3 0 14.000 47.0 White
## 4 0 17.000 50.0 White
## 5 0 26.000 50.0 White
## 6 0 7.000 50.0 White
## 7 0 38.000 50.0 Hispanic
## 8 0 18.000 55.0 Hispanic
## 9 0 30.000 57.0 White
## 10 0 5.152 58.5 White
## 11 0 9.000 60.0 White
## 12 410 9.000 62.0 White
## 13 82 9.000 62.0 White
## 14 0 34.000 62.9 Hispanic
## 15 0 50.000 65.0 White
## 16 0 10.000 65.0 White
## 17 0 14.000 66.0 White
## 18 0 99.000 66.0 White
## 19 0 7.000 67.0 White
## 20 0 141.000 67.0 White
Below, we will analysed the loan approval rate by race
Black <- c(Applications = sum(loandataDF$Race == 'Black'), Approved = sum(loandataDF$Action == 'Approved' & loandataDF$Race == 'Black'), Denied = sum(loandataDF$Action == 'Rejected' & loandataDF$Race == 'Black'), 'Percent Denied' = sum(loandataDF$Race == 'Black' & loandataDF$Action == 'Rejected')/sum(loandataDF$Race == 'Black')*100)
Hispanic <- c(Applications = sum(loandataDF$Race == 'Hispanic'), Approved = sum(loandataDF$Action == 'Approved' & loandataDF$Race == 'Hispanic'), Denied = sum(loandataDF$Action == 'Rejected' & loandataDF$Race == 'Hispanic'), 'Percent' = sum(loandataDF$Race == 'Hispanic' & loandataDF$Action == 'Rejected')/sum(loandataDF$Race == 'Hispanic')*100)
White <- c(Applications = sum(loandataDF$Race == 'White'), Approved = sum(loandataDF$Action == 'Approved' & loandataDF$Race == 'White'), Denied = sum(loandataDF$Action == 'Rejected' & loandataDF$Race == 'White'), 'Percent' = sum(loandataDF$Race == 'White' & loandataDF$Action == 'Rejected')/sum(loandataDF$Race == 'White')*100)
loan.action.by.race <- data.frame(rbind(White, Black, Hispanic))
loan.action.by.race
## Applications Approved Denied Percent
## White 1457 1321 136 9.334248
## Black 184 122 62 33.695652
## Hispanic 106 81 25 23.584906
The data above indicates that whites have a lower denial rate while blacks have the highest denial rate.
We can achieve the grouping result above by using the the aggregate function documents.
# Rename our data fram the we cleaned from outliers
loandata.new = loandata.cleaned
# Substitute values for the action column
loandata.new$action[loandata.new$action == 1] <- 'Approved'
loandata.new$action[loandata.new$action == 2] <- 'Approved'
loandata.new$action[loandata.new$action == 3] <- 'Rejected'
# Add a 'race' column to loandata.new
loandata.new$race[loandata.new$black == 1] <- 'Black'
loandata.new$race[loandata.new$hispan == 1] <- 'Hispanic'
loandata.new$race[loandata.new$white == 1] <- 'White'
# Aggregate the 'action' and 'race' column column by the black, hispan, and white columns
gg <- aggregate(cbind(Black = black, Hispanic = hispan, White = white) ~ action + race, loandata.new, length)
gg
## action race Black Hispanic White
## 1 Approved Black 122 122 122
## 2 Rejected Black 62 62 62
## 3 Approved Hispanic 81 81 81
## 4 Rejected Hispanic 25 25 25
## 5 Approved White 1321 1321 1321
## 6 Rejected White 136 136 136
Let us group our by mortgage history
# Add history column
loandata.new$history[loandata.new$mortno == 1] <- 'No history'
loandata.new$history[loandata.new$mortperf == 1] <- 'No late payment'
loandata.new$history[loandata.new$mortlat1 == 1] <- '1 to 2 Months late'
loandata.new$history[loandata.new$mortlat2 == 1] <- 'Two late payments'
tt <- aggregate(cbind(action) ~ action + history, loandata.new, length)
tt
## action history action
## 1 Approved 1 to 2 Months late 25
## 2 Rejected 1 to 2 Months late 10
## 3 Approved No history 468
## 4 Rejected No history 34
## 5 Approved No late payment 1021
## 6 Rejected No late payment 175
## 7 Approved Two late payments 10
## 8 Rejected Two late payments 4
The data above shows that applicants with prior mortgage history form the majority. Among those, the majority were applicants with no late payment. These applicants mostly got their loans approved. Similarly, most applicants with no mortgage history had their loans approved. There were fewer applicants with history of late mortgage payment and a significant amount of their loans were rejected.
Next let us analyse the data by level of education
# Add education column
loandata.new$education[loandata.new$sch == 1] <- 'More than 12 years schooling'
loandata.new$education[loandata.new$sch == 0] <- 'Less than 12 years schooling'
ss <- aggregate(cbind(action) ~ action + education, loandata.new, length)
ss
## action education action
## 1 Approved Less than 12 years schooling 375
## 2 Rejected Less than 12 years schooling 68
## 3 Approved More than 12 years schooling 1149
## 4 Rejected More than 12 years schooling 155
From the above summary, most applicants have above high school education and mostly had the loans approved. however, a significant number of applicants have below high school education but also mostly had their loans approved. This analysis implies that while level of education impact the number of loan applications, it does not impact loan decision.
Next, we will look at income and loan decision
# Add education column
loandata.new$education[loandata.new$sch == 1] <- 'More than 12 years schooling'
loandata.new$education[loandata.new$sch == 0] <- 'Less than 12 years schooling'
inc <- aggregate(cbind(appinc, liq) ~ action + education, loandata.new, mean)
names(inc) <- c('Decision', 'Education', 'Liquid Asset', 'Income x 1000')
inc
## Decision Education Liquid Asset Income x 1000
## 1 Approved Less than 12 years schooling 65.48267 5390.91363
## 2 Rejected Less than 12 years schooling 83.58824 14731.27168
## 3 Approved More than 12 years schooling 75.73803 4421.09285
## 4 Rejected More than 12 years schooling 79.50968 98.25071
The above summary shows that, on average, applicants with higher monthly income are less likely to be rejected than those with lower monthly income. However, liquid asset does not appear to impact loan decision.
Income <- loandataDF %>% group_by(Race = loandataDF$Race) %>% summarize('Average Total Income' = mean(`Applicant Income`), 'Average Monthly Income' = mean(`Total Monthly Income`), 'Average Liquid Asset' = mean(`Liquid Asset`))
Income
## # A tibble: 3 × 4
## Race `Average Total Income` Average Monthly Incom…¹ `Average Liquid Asset`
## <chr> <dbl> <dbl> <dbl>
## 1 Black 65.4 3.63 38.9
## 2 Hispanic 66.2 3.89 45.6
## 3 White 75.9 4.33 5564.
## # ℹ abbreviated name: ¹​`Average Monthly Income`
From the income analysis above, one could infer that the loan approval rate among whites is higher because they have significantly higher liquid asset.
Below is a scatter plot between the appraised value of homes and purchase price. The plot shows that their is a positive correlation between the appraised value and the purchased price.
library(ggplot2)
p <- ggplot(loandataDF, aes(x = loandataDF$`Purchase Price`, y=loandataDF$`Appraised Value`, )) + geom_point() + ggtitle('Appraised Value versus Purchase Price') +
xlab('Purchase Price') + ylab('Appraised Value')
p
p + theme(
plot.title = element_text(color="black", size=14, hjust = 0.5),
axis.title.x = element_text(color="black", size=14),
axis.title.y = element_text(color="black", size=14),
panel.background = element_rect(fill = "lightblue",
colour = "lightblue",
size = 0.5, linetype = "solid"),
panel.grid.major = element_line(size = 0.5, linetype = 'solid',
colour = "white"),
panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
colour = "white")
)
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The scatter plot shows that there is a positive correlation between purchase price and appraised value.
# Rename the data.frame
loandata.new = loandata.cleaned
# Substitute values for the action column
loandata.new$action[loandata.new$action == 1] <- 'Approved'
loandata.new$action[loandata.new$action == 2] <- 'Approved'
loandata.new$action[loandata.new$action == 3] <- 'Rejected'
# Add a 'race' column to loandata.new
loandata.new$race[loandata.new$black == 1] <- 'Black'
loandata.new$race[loandata.new$hispan == 1] <- 'Hispanic'
loandata.new$race[loandata.new$white == 1] <- 'White'
# Aggregate the 'action' and 'race' column column by the black, hispan, and white columns
gg <- aggregate(cbind(Black = black, Hispanic = hispan, White = white) ~ race, loandata.new, length)
gg
## race Black Hispanic White
## 1 Black 184 184 184
## 2 Hispanic 106 106 106
## 3 White 1457 1457 1457
black <- gg$Black
hispanic <- gg$Hispanic
white <- gg$White
gg1 <- gg[, 2:4]
gg1
## Black Hispanic White
## 1 184 184 184
## 2 106 106 106
## 3 1457 1457 1457
loan.decision.matrix <- data.matrix(gg1)
loan.decision.matrix
## Black Hispanic White
## [1,] 184 184 184
## [2,] 106 106 106
## [3,] 1457 1457 1457
barplot(loan.decision.matrix, beside = TRUE, legend = gg$race, col = heat.colors(3), border = 'white')
Read from github
gitURL = 'https://raw.githubusercontent.com/hawa1983/R_Bridge/main/loanapp.csv'
loandataGIT <- read.csv(file = gitURL, header = TRUE, sep = ",")
# Display 10 rows
head(loandataGIT, n = 10)
## X occ loanamt action msa suffolk appinc typur unit married dep emp yjob
## 1 1 1 89 1 1120 0 72 0 1 0 0 0 0
## 2 2 1 128 3 1120 0 74 0 1 1 1 0 0
## 3 3 1 128 1 1120 0 84 3 1 0 0 1 1
## 4 4 1 66 1 1120 0 36 0 1 1 0 0 0
## 5 5 1 120 1 1120 0 59 8 1 1 0 0 0
## 6 6 1 111 1 1120 0 63 9 1 0 0 0 0
## 7 7 1 141 1 1120 1 72 1 1 0 0 0 1
## 8 8 1 276 1 1120 0 90 6 1 0 0 0 1
## 9 9 1 100 1 1120 0 72 1 1 1 1 1 1
## 10 10 1 267 1 1120 0 144 0 1 1 2 0 0
## self atotinc cototinc hexp price other liq rep gdlin lines mortg cons
## 1 0 5849 0 1031 118.0 0 34.5 1 1 15 2 1
## 2 0 4583 1508 1391 160.0 0 52.0 3 1 19 2 2
## 3 0 2666 4416 1371 143.0 0 37.0 6 1 18 2 2
## 4 1 3000 0 839 110.0 0 19.0 1 1 25 2 6
## 5 0 2583 2358 1341 134.0 0 31.0 1 1 15 2 1
## 6 0 2208 2959 1122 138.0 0 169.0 2 1 10 2 6
## 7 0 6000 0 1870 157.0 0 38.0 1 1 9 2 1
## 8 0 7500 0 2706 346.0 0 142.0 1 1 5 2 1
## 9 0 3234 2524 1112 99.7 0 24.5 1 1 2 2 1
## 10 1 5417 4196 2959 422.0 0 111.0 3 1 15 1 1
## pubrec hrat obrat fixadj term apr prop inss inson gift cosign unver review
## 1 0 17.63 34.5 0 360 118 1 0 0 0 0 0 1
## 2 0 22.54 34.1 1 360 175 2 0 0 0 0 0 999
## 3 0 19.00 26.0 0 180 145 2 0 0 0 0 0 3
## 4 1 24.00 37.0 0 360 110 2 0 0 1 0 0 2
## 5 0 25.10 32.1 0 360 135 1 1 0 0 0 0 2
## 6 0 21.00 33.0 0 360 144 2 0 0 0 0 0 1
## 7 0 32.00 36.0 0 360 162 1 1 0 0 0 0 1
## 8 0 36.00 37.0 1 360 365 2 0 0 0 0 0 1
## 9 0 19.31 30.7 0 360 105 2 1 0 0 0 0 3
## 10 0 26.51 49.0 0 360 422 2 0 0 0 0 0 2
## netw unem min30 bd mi old vr sch black hispan male reject approve mortno
## 1 99.6 3.2 0 0 1 0 1 1 0 0 NA 0 1 0
## 2 847.0 3.2 0 0 1 0 1 1 0 0 1 1 0 0
## 3 40.0 3.9 0 1 1 0 0 1 0 0 1 0 1 0
## 4 158.0 3.1 0 0 1 1 1 1 0 0 1 0 1 0
## 5 69.0 4.3 0 1 1 0 0 0 0 0 1 0 1 0
## 6 262.0 3.2 0 1 1 0 0 0 0 0 1 0 1 0
## 7 46.0 3.2 0 0 0 1 1 1 0 0 1 0 1 0
## 8 214.0 3.2 0 0 1 1 1 1 0 0 1 0 1 0
## 9 38.8 1.8 0 1 1 0 0 1 0 0 1 0 1 0
## 10 374.0 3.1 0 0 1 1 1 1 0 0 1 0 1 1
## mortperf mortlat1 mortlat2 chist multi loanprc thick white
## 1 1 0 0 1 0 0.7542373 0 1
## 2 1 0 0 1 0 0.8000000 1 1
## 3 1 0 0 1 0 0.8951049 1 1
## 4 1 0 0 0 0 0.6000000 0 1
## 5 1 0 0 1 0 0.8955224 0 1
## 6 1 0 0 0 0 0.8043478 0 1
## 7 1 0 0 1 0 0.8980892 0 1
## 8 1 0 0 1 0 0.7976879 0 1
## 9 1 0 0 1 0 1.0030091 0 1
## 10 0 0 0 1 0 0.6327014 1 1