Qu 1. Does a persons race affect there chances of gtting a loan? What percent of Black, Hispanic, and White applicants got their loan applications approved. Qu 2. Does a loan applicants mortgage history like late payments affect their chances of getting a loan Qu 3. What percentage of an applicants income spent on housing. Does monthly housing expenses exceed 30% of household income. Qu 4. All other things being equal, does level of education improve ones changes of getting a mortgage loan. Qu 5. Does appraised values of homes correlate with purchase price

  1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
library(magrittr)
library(ggplot2)
library(lattice)
library(RColorBrewer)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Read the data
loandata <- read.csv('loanapp.csv', header = TRUE)

# Display 20 rows of the loan data
head(loandata, n = 20)
##      id occ loanamt action  msa suffolk appinc typur unit married dep emp yjob
## 1   300   1      25      1 1120       1     48     0    1       1   0   0    9
## 2  1893   1      36      1 1120       0     36     0    1       1   0   0    1
## 3   767   1      42      1 1120       0     45     0    1       1   0   0    0
## 4    82   1      40      3 1120       0     50     0    1       0   4   0    0
## 5   258   1     110      3 1120       0     60     0    1       1   1   0    1
## 6  1121   1      45      1 1120       0     21     0    1       0   0   0    0
## 7  1150   1      67      3 1120       1     72     0    1       1   1   0    0
## 8   223   1      44      1 1120       0     22     0    1       0   0   0    0
## 9   631   1      51      1 1120       0     34     0    1       0   0   0    0
## 10  119   1      47      1 1120       0     36     0    1       0   0   0    1
## 11  716   1      50      1 1120       0     29     0    1       0   0   0    0
## 12 1115   1      46      3 1120       0     28     0    1       0   0   0    0
## 13 1839   1       9      3 1120       0     28     0    1       0   0   0    0
## 14  120   1      48      3 1120       0     42     0    1       1   1   0    0
## 15   29   1      17      1 1120       0     31     0    1       1   2   0    0
## 16  698   1      59      1 1120       0     29     0    1       0   0   0    1
## 17  534   1      50      1 1120       0     39     1    1       0   0   0    0
## 18  652   1      48      1 1120       0     59     5    1       0   0   0    0
## 19  180   1      26      1 1120       0     60     0    1       0   0   0    0
## 20 1022   1      28      1 1120       0     35     0    1       1   0   0    1
##    self atotinc cototinc   hexp price other     liq rep gdlin lines mortg cons
## 1     0    4030        0  720.0  25.0     0   6.500   1     1     4     1    1
## 2     0    3166        0  577.0  45.0     0  19.000   1     1    20     1    1
## 3     0    3716        0  672.0  47.0     0  14.000   4     1     9     2    1
## 4     0    4166        0  429.0  50.0     0  17.000   3     0    18     3    3
## 5     1    1000     3022 1130.0  50.0     0  26.000   2     1    21     2    1
## 6     0    1776        0  454.0  51.0     0   7.000   1     1     1     2    4
## 7     0    3875        0  746.0  96.0     0  38.000   1     1    10     2    1
## 8     0    1833        0  458.0  55.0     0  18.000   0     1     2     2    1
## 9     0    2873        0  629.0  57.0     0  30.000   2     1    14     2    1
## 10    0    1800     1213  486.0  67.5     0   5.152   1     1    15     2    2
## 11    0    2463        0  618.0  56.0     0   9.000   2     1    19     2    1
## 12    0    2378        0  636.0  61.0   410   9.000   1     1    11     1    2
## 13    0    2378        0  636.0  61.0    82   9.000   1     1    11     1    1
## 14    0    1888     1620  571.1  62.9     0  34.000   1     1    15     2    2
## 15    0    1299     1086  342.0  57.0     0  50.000   1     1     1     2    1
## 16    0    2479        0  699.0  65.0     0  10.000   2     1     9     2    1
## 17    0    3254        0  574.0  63.0     0  14.000   1     1     5     1    3
## 18    0    4885        0  564.0  60.0     0  99.000   1     1    10     2    1
## 19    0    2600     2388  464.4  29.0     0   7.000   2     1    19     2    2
## 20    0    3016        0  450.9  31.0     0 141.000   1     1    19     2    1
##    pubrec  hrat obrat fixadj term  apr prop inss inson gift cosign unver review
## 1       0 18.00  47.0      0  120 25.0    2    0     0    0      0     0      1
## 2       0 16.00  28.0      1  360 45.0    1    0     0    0      0     0      2
## 3       0 18.00  23.0      0  180 47.0    1    0     0    0      0     0      2
## 4       1 13.00  47.0      0  180 50.0    2    0     0    0      0     0      2
## 5       0 29.00  45.0      1  360 50.0    2    0     0    1      0     0      1
## 6       0 25.38  25.4      1  360 50.0    1    0     0    0      0     0      2
## 7       0 19.00  37.0      0  360 50.0    1    0     0    0      0     0      1
## 8       0 25.00  33.0      1  360 55.0    1    0     0    0      0     0      1
## 9       0 22.00  26.0      0  360 57.0    1    0     0    0      0     0      2
## 10      0 19.00  32.0      1  360 58.5    1    1     0    0      0     0      4
## 11      0 27.00  39.0      0  360 60.0    1    0     0    0      0     0      4
## 12      0 28.00  35.0      0  360 62.0    1    0     0    1      0     0      1
## 13      0 28.00  35.0      0  360 62.0    1    0     0    1      0     0      1
## 14      0 16.00  19.0      0  360 62.9    2    0     0    0      0     1      2
## 15      0 14.00  14.0      0  120 65.0    2    0     0    0      0     0      1
## 16      0 28.89  32.9      1  360 65.0    1    0     0    0      0     0      1
## 17      0 18.00  33.0      0  360 66.0    2    0     0    0      0     0      1
## 18      0 11.00  16.0      0  360 66.0    1    0     0    0      0     0      1
## 19      0 10.00  19.0      1  300 67.0    1    0     0    0      0     0      2
## 20      0 15.00  33.0      1  300 67.0    1    0     0    0      0     0      2
##       netw unem min30 bd mi old vr sch black hispan male reject approve mortno
## 1  108.250 10.6     1  1  0   1  1   0     0      0    1      0       1      1
## 2   25.000  3.2     0  1  1   1  1   1     0      0    0      0       1      1
## 3   39.000  3.9     0  1  1   0  1   1     0      0    1      0       1      0
## 4  102.000  3.1     0  0  1   1  1   1     0      0    1      1       0      0
## 5   33.000 10.6     0  0  0   1  1   1     0      0    1      1       0      0
## 6    8.000  2.0     0  1  1   1  1   0     0      0    1      0       1      0
## 7   37.000  3.2     0  0  0   1  1   1     0      1    1      1       0      0
## 8   17.000  5.3     0  0  1   0  0   0     0      1    0      0       1      0
## 9   74.000  2.0     0  1  1   1  1   1     0      0    0      0       1      0
## 10 -11.310  4.3     0  0  1   0  0   1     0      0    1      0       1      0
## 11  37.000  3.2     0  1  1   0  1   1     0      0    1      0       1      0
## 12  43.000  3.2     0  0  1   1  1   1     0      0    0      1       0      1
## 13  43.000  3.2     0  0  1   1  1   1     0      0    0      1       0      1
## 14  42.883  3.1     0  0  1   1  0   0     0      1    1      1       0      0
## 15  53.000  3.2     0  0  1   1  1   1     0      0    1      0       1      0
## 16  37.000  3.6     0  1  1   0  1   1     0      0    1      0       1      0
## 17 -95.000  3.1     0  0  1   0  1   1     0      0    1      0       1      1
## 18 189.000 10.6     0  1  1   1  1   0     0      0    1      0       1      0
## 19  67.000  3.2     0  0  1   0  0   1     0      0    0      0       1      0
## 20 147.000  1.8     0  0  1   1  0   0     0      0    0      0       1      0
##    mortperf mortlat1 mortlat2 chist multi   loanprc thick white
## 1         0        0        0     1     0 1.0000000     0     1
## 2         0        0        0     1     0 0.8000000     0     1
## 3         1        0        0     1     0 0.8936170     1     1
## 4         0        1        0     1     0 0.8000000     1     1
## 5         1        0        0     1     0 2.2000000     0     1
## 6         1        0        0     1     0 0.8823529     0     1
## 7         1        0        0     1     0 0.6979167     0     0
## 8         1        0        0     1     0 0.8000000     0     0
## 9         1        0        0     1     0 0.8947368     0     1
## 10        1        0        0     1     0 0.6962963     0     1
## 11        1        0        0     1     0 0.8928571     0     1
## 12        0        0        0     1     0 0.7540984     0     1
## 13        0        0        0     1     0 0.1475410     0     1
## 14        1        0        0     1     0 0.7631161     0     0
## 15        1        0        0     1     0 0.2982456     0     1
## 16        1        0        0     1     0 0.9076923     0     1
## 17        0        0        0     1     0 0.7936508     0     1
## 18        1        0        0     1     0 0.8000000     0     1
## 19        1        0        0     1     0 0.8965517     0     1
## 20        1        0        0     1     0 0.9032258     0     1

Used the summary function in the base R package to gain an overview of the data

# Generate an overview of the data
summary(loandata)
##        id            occ           loanamt          action           msa      
##  Min.   :   1   Min.   :1.000   Min.   :  2.0   Min.   :1.000   Min.   :1120  
##  1st Qu.: 498   1st Qu.:1.000   1st Qu.:100.0   1st Qu.:1.000   1st Qu.:1120  
##  Median : 995   Median :1.000   Median :126.0   Median :1.000   Median :1120  
##  Mean   : 995   Mean   :1.032   Mean   :143.2   Mean   :1.276   Mean   :1120  
##  3rd Qu.:1492   3rd Qu.:1.000   3rd Qu.:165.0   3rd Qu.:1.000   3rd Qu.:1120  
##  Max.   :1989   Max.   :3.000   Max.   :980.0   Max.   :3.000   Max.   :1120  
##                                                                               
##     suffolk           appinc           typur            unit      
##  Min.   :0.0000   Min.   :  0.00   Min.   :0.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.: 48.00   1st Qu.:0.000   1st Qu.:1.000  
##  Median :0.0000   Median : 64.00   Median :0.000   Median :1.000  
##  Mean   :0.1543   Mean   : 84.68   Mean   :1.531   Mean   :1.122  
##  3rd Qu.:0.0000   3rd Qu.: 88.00   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :972.00   Max.   :9.000   Max.   :4.000  
##                                                    NA's   :4      
##     married            dep              emp              yjob      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :1.0000   Median :0.0000   Median :0.0000   Median :0.000  
##  Mean   :0.6586   Mean   :0.7709   Mean   :0.2097   Mean   :0.449  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :8.0000   Max.   :9.0000   Max.   :9.000  
##  NA's   :3        NA's   :3                                        
##       self           atotinc         cototinc          hexp      
##  Min.   :0.0000   Min.   :    0   Min.   :    0   Min.   :  154  
##  1st Qu.:0.0000   1st Qu.: 2876   1st Qu.:    0   1st Qu.: 1054  
##  Median :0.0000   Median : 3813   Median : 1145   Median : 1317  
##  Mean   :0.1292   Mean   : 5196   Mean   : 1547   Mean   : 1505  
##  3rd Qu.:0.0000   3rd Qu.: 5596   3rd Qu.: 2417   3rd Qu.: 1715  
##  Max.   :1.0000   Max.   :81000   Max.   :41667   Max.   :10798  
##                                                                  
##      price            other              liq               rep       
##  Min.   :  25.0   Min.   :   0.00   Min.   :      0   Min.   :0.000  
##  1st Qu.: 129.0   1st Qu.:   0.00   1st Qu.:     20   1st Qu.:1.000  
##  Median : 163.0   Median :   0.00   Median :     38   Median :1.000  
##  Mean   : 196.3   Mean   :   2.37   Mean   :   4618   Mean   :1.503  
##  3rd Qu.: 225.0   3rd Qu.:   0.00   3rd Qu.:     83   3rd Qu.:2.000  
##  Max.   :1535.0   Max.   :1020.00   Max.   :1000000   Max.   :9.000  
##                                                       NA's   :9      
##      gdlin             lines              mortg            cons     
##  Min.   :  0.000   Min.   :     0.0   Min.   :1.000   Min.   :1.00  
##  1st Qu.:  1.000   1st Qu.:     7.0   1st Qu.:1.000   1st Qu.:1.00  
##  Median :  1.000   Median :    12.0   Median :2.000   Median :1.00  
##  Mean   :  1.583   Mean   :   516.4   Mean   :1.708   Mean   :2.11  
##  3rd Qu.:  1.000   3rd Qu.:    19.0   3rd Qu.:2.000   3rd Qu.:2.00  
##  Max.   :666.000   Max.   :999999.4   Max.   :4.000   Max.   :6.00  
##                                                                     
##      pubrec             hrat           obrat           fixadj      
##  Min.   :0.00000   Min.   : 1.00   Min.   : 0.00   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:21.00   1st Qu.:28.00   1st Qu.:0.0000  
##  Median :0.00000   Median :25.77   Median :33.00   Median :0.0000  
##  Mean   :0.06888   Mean   :24.79   Mean   :32.39   Mean   :0.3082  
##  3rd Qu.:0.00000   3rd Qu.:29.00   3rd Qu.:37.00   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :72.00   Max.   :95.00   Max.   :1.0000  
##                                                                    
##       term             apr              prop            inss       
##  Min.   :     6   Min.   :  25.0   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:   360   1st Qu.: 135.0   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :   360   Median : 169.0   Median :2.000   Median :0.0000  
##  Mean   :  2352   Mean   : 205.1   Mean   :1.861   Mean   :0.2001  
##  3rd Qu.:   360   3rd Qu.: 230.0   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :999999   Max.   :4316.0   Max.   :3.000   Max.   :1.0000  
##                                                                    
##      inson              gift            cosign            unver        
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.00000  
##  Mean   :0.01508   Mean   :0.1599   Mean   :0.02866   Mean   :0.04274  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##                                                                        
##      review           netw              unem            min30        
##  Min.   :  0.0   Min.   :-7919.0   Min.   : 1.800   Min.   :0.00000  
##  1st Qu.:  1.0   1st Qu.:   43.0   1st Qu.: 3.100   1st Qu.:0.00000  
##  Median :  2.0   Median :   95.0   Median : 3.200   Median :0.00000  
##  Mean   :113.7   Mean   :  266.6   Mean   : 3.882   Mean   :0.05703  
##  3rd Qu.:  3.0   3rd Qu.:  229.6   3rd Qu.: 3.900   3rd Qu.:0.00000  
##  Max.   :999.0   Max.   :28023.0   Max.   :10.600   Max.   :1.00000  
##                                                     NA's   :183      
##        bd               mi              old               vr        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.4208   Mean   :0.8728   Mean   :0.4676   Mean   :0.4098  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##       sch             black             hispan             male       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000  
##  Median :1.0000   Median :0.00000   Median :0.00000   Median :1.0000  
##  Mean   :0.7717   Mean   :0.09904   Mean   :0.05581   Mean   :0.8131  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##                                                       NA's   :15      
##      reject          approve           mortno          mortperf     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.1227   Mean   :0.8773   Mean   :0.3318   Mean   :0.6385  
##  3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##     mortlat1          mortlat2           chist            multi        
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :1.0000   Median :0.00000  
##  Mean   :0.01911   Mean   :0.01056   Mean   :0.8376   Mean   :0.08615  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##                                                       NA's   :4        
##     loanprc            thick            white       
##  Min.   :0.02105   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.70000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.80000   Median :0.0000   Median :1.0000  
##  Mean   :0.77064   Mean   :0.1051   Mean   :0.8451  
##  3rd Qu.:0.89894   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :2.57143   Max.   :1.0000   Max.   :1.0000  
##                    NA's   :9

Let find out if there are outlier in the purchased and appraised values

purchased.appraised.prices <- data.frame(cbind(loandata$price, loandata$apr))
names(purchased.appraised.prices) <- c('Purchased Price', 'Appraised Prices')
boxplot(purchased.appraised.prices)

There are outliers in the data. Let us identify them by calculating the quantile of values.

Q.appraised.value <- quantile(loandata$apr, na.rm = FALSE)
Q.purchased.price <- quantile(loandata$price, na.rm = FALSE)
Quantiles <- data.frame(cbind(Q.purchased.price, Q.appraised.value))
names(Quantiles) <- c('Purchased Price', 'Appraised Value')
Quantiles
##      Purchased Price Appraised Value
## 0%                25              25
## 25%              129             135
## 50%              163             169
## 75%              225             230
## 100%            1535            4316

Let us calculate the interquantile range of the values

iqr.appraised.value <- IQR(loandata$apr)
iqr.purchased.prices <- IQR(loandata$price)
iqr <- data.frame(cbind(iqr.purchased.prices,iqr.appraised.value))
names(iqr) <- c('Purchased Price', 'Appraised Value')
iqr
##   Purchased Price Appraised Value
## 1              96              95

We will find the cut-off ranges beyond which all data points are outliers.

up.purchased.price <-  Q.purchased.price[2]+1.5*iqr.purchased.prices # Upper Range  
low.purchased.price<- Q.purchased.price[1]-1.5*iqr.purchased.prices # Lower Range
up.appraised.value <-  Q.appraised.value[2]+1.5*iqr.appraised.value # Upper Range  
low.appraised.value<- Q.appraised.value[1]-1.5*iqr.appraised.value # Lower Range

purchased.price.cut.off.values <- c(low = low.purchased.price, up = up.purchased.price)
appraise.value.cut.off.values <- c(low = low.appraised.value, up = up.appraised.value)
cut.off.values <- data.frame(cbind(purchased.price.cut.off.values,appraise.value.cut.off.values))
names(cut.off.values) <- c('Purchased Price', 'Appraised Value')

cut.off.values
##        Purchased Price Appraised Value
## low.0%            -119          -117.5
## up.25%             273           277.5

Eliminate the extreme values

loandata.cleaned <- subset(loandata, loandata$apr > (Q.appraised.value[2] - 1.5*iqr.appraised.value) & loandata$apr < (Q.appraised.value[3]+1.5*iqr.appraised.value))

loandata.cleaned <- subset(loandata.cleaned, loandata.cleaned$price > (Q.purchased.price[2] - 1.5*iqr.purchased.prices) & loandata.cleaned$price < (Q.purchased.price[3]+1.5*iqr.purchased.prices))

head(loandata.cleaned)
##     id occ loanamt action  msa suffolk appinc typur unit married dep emp yjob
## 1  300   1      25      1 1120       1     48     0    1       1   0   0    9
## 2 1893   1      36      1 1120       0     36     0    1       1   0   0    1
## 3  767   1      42      1 1120       0     45     0    1       1   0   0    0
## 4   82   1      40      3 1120       0     50     0    1       0   4   0    0
## 5  258   1     110      3 1120       0     60     0    1       1   1   0    1
## 6 1121   1      45      1 1120       0     21     0    1       0   0   0    0
##   self atotinc cototinc hexp price other  liq rep gdlin lines mortg cons pubrec
## 1    0    4030        0  720    25     0  6.5   1     1     4     1    1      0
## 2    0    3166        0  577    45     0 19.0   1     1    20     1    1      0
## 3    0    3716        0  672    47     0 14.0   4     1     9     2    1      0
## 4    0    4166        0  429    50     0 17.0   3     0    18     3    3      1
## 5    1    1000     3022 1130    50     0 26.0   2     1    21     2    1      0
## 6    0    1776        0  454    51     0  7.0   1     1     1     2    4      0
##    hrat obrat fixadj term apr prop inss inson gift cosign unver review   netw
## 1 18.00  47.0      0  120  25    2    0     0    0      0     0      1 108.25
## 2 16.00  28.0      1  360  45    1    0     0    0      0     0      2  25.00
## 3 18.00  23.0      0  180  47    1    0     0    0      0     0      2  39.00
## 4 13.00  47.0      0  180  50    2    0     0    0      0     0      2 102.00
## 5 29.00  45.0      1  360  50    2    0     0    1      0     0      1  33.00
## 6 25.38  25.4      1  360  50    1    0     0    0      0     0      2   8.00
##   unem min30 bd mi old vr sch black hispan male reject approve mortno mortperf
## 1 10.6     1  1  0   1  1   0     0      0    1      0       1      1        0
## 2  3.2     0  1  1   1  1   1     0      0    0      0       1      1        0
## 3  3.9     0  1  1   0  1   1     0      0    1      0       1      0        1
## 4  3.1     0  0  1   1  1   1     0      0    1      1       0      0        0
## 5 10.6     0  0  0   1  1   1     0      0    1      1       0      0        1
## 6  2.0     0  1  1   1  1   0     0      0    1      0       1      0        1
##   mortlat1 mortlat2 chist multi   loanprc thick white
## 1        0        0     1     0 1.0000000     0     1
## 2        0        0     1     0 0.8000000     0     1
## 3        0        0     1     0 0.8936170     1     1
## 4        1        0     1     0 0.8000000     1     1
## 5        0        0     1     0 2.2000000     0     1
## 6        0        0     1     0 0.8823529     0     1

Now we will display a boxplot of our data without the outliers.

purchased.appraised.prices.cleaned <- data.frame(cbind(loandata.cleaned$price, loandata.cleaned$apr))
names(purchased.appraised.prices.cleaned) <- c('Purchased Price', 'Appraised Prices')
boxplot(purchased.appraised.prices.cleaned)

Below are the summary statistics of applicant’s annual income, monthly income, co applicants monthly income, loan amount, appraised value, and purchased price, housing expense to income

#Calculate the mean
Mean <- c('Appraised Value' = mean(loandata.cleaned$apr, na.rm = TRUE)*1000, 'Loan Amount' = mean(loandata.cleaned$loanamt, na.rm = TRUE)*1000, 'Purchased Price' = mean(loandata.cleaned$price, na.rm = TRUE)*1000, 'Total Income' = mean(loandata.cleaned$appinc, na.rm = TRUE)*1000, 'Monthly Income' = mean(loandata.cleaned$atotinc, na.rm = TRUE), 'Proposed Housing Expenses' = mean(loandata.cleaned$hexp, na.rm = TRUE), 'Housing Expense-Income Ratio' = mean(loandata.cleaned$hrat, na.rm = TRUE))

#Calculate the median
Median <- c('Appraised Value' = median(loandata.cleaned$apr, na.rm = TRUE)*1000, 'Loan Amount' = median(loandata.cleaned$loanamt, na.rm = TRUE)*1000, 'Purchased Price' = median(loandata.cleaned$price, na.rm = TRUE)*1000, 'Total Income' = median(loandata.cleaned$appinc, na.rm = TRUE)*1000, 'Monthly Income' = median(loandata.cleaned$atotinc, na.rm = TRUE), 'Proposed Housing Expenses' = median(loandata.cleaned$hexp, na.rm = TRUE),  'Housing Expense-Income Ratio' = mean(loandata.cleaned$hrat, na.rm = TRUE))

#Calculate the quartiles.
quartiles.appraised <- quantile(loandata.cleaned$apr, na.rm = TRUE)*1000
quartiles.loan.ammount <- quantile(loandata.cleaned$loanamt, na.rm = TRUE)*1000
quartiles.purchased.price <- quantile(loandata.cleaned$price, na.rm = TRUE)*1000
quartiles.total.income <- quantile(loandata.cleaned$appinc, na.rm = TRUE)*1000
quartiles.monthly.income <- quantile(loandata.cleaned$atotinc, na.rm = TRUE)
quartiles.proposed.housing.expenses <- quantile(loandata.cleaned$hexp, na.rm = TRUE)
quartiles.mortgage.to.income.ratio <- quantile(loandata.cleaned$hrat, na.rm = TRUE)


quartiles <- rbind('Appraised Value' = quartiles.appraised, 'Loan Amount' = quartiles.loan.ammount, 'Purchased Price' = quartiles.purchased.price, 'Total Income' = quartiles.total.income, 'Monthly Income' = quartiles.monthly.income, 'Proposed Housing Expenses' =quartiles.proposed.housing.expenses, 'Housing Expense-Income Ratio' = quartiles.mortgage.to.income.ratio)

# X0 = '1st Quartile', X25 = '2nd Quartile', X75 = '3rd Quartile', X100 = '4th Quartile'))

summary.data <- data.frame(cbind(Mean, Median), quartiles)
summary.data
##                                      Mean       Median   X0.   X25.     X50.
## Appraised Value              168210.13164 160000.00000 25000 130850 160000.0
## Loan Amount                  124153.40584 122000.00000  2000  98000 122000.0
## Purchased Price              161993.89808 154000.00000 25000 125000 154000.0
## Total Income                  74176.87464  60000.00000  4000  48000  60000.0
## Monthly Income                 4228.00515   3579.00000     0   2750   3579.0
## Proposed Housing Expenses      1302.19284   1256.00000   156   1011   1256.0
## Housing Expense-Income Ratio     24.85077     24.85077     1     21     25.7
##                                X75.  X100.
## Appraised Value              200000 310000
## Loan Amount                  150000 267000
## Purchased Price              194500 305000
## Total Income                  77000 666000
## Monthly Income                 4798  39147
## Proposed Housing Expenses      1541   4080
## Housing Expense-Income Ratio     29     72

Based on the above summary statistics, on average, housing expense to monthly income ratio is below 30% which is recommended. The data also shows that, on average, the purchased price is higher than the loan amount. Home buyers are using other sources of financing.

  1. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

First, we will make a selection of columns of interest

loandataDF <- loandata.cleaned %>% data.frame() %>% select(loanamt, action, appinc, atotinc, cototinc, hexp, price, other, liq, apr )
head(loandataDF, 10)
##    loanamt action appinc atotinc cototinc hexp price other    liq  apr
## 1       25      1     48    4030        0  720  25.0     0  6.500 25.0
## 2       36      1     36    3166        0  577  45.0     0 19.000 45.0
## 3       42      1     45    3716        0  672  47.0     0 14.000 47.0
## 4       40      3     50    4166        0  429  50.0     0 17.000 50.0
## 5      110      3     60    1000     3022 1130  50.0     0 26.000 50.0
## 6       45      1     21    1776        0  454  51.0     0  7.000 50.0
## 7       67      3     72    3875        0  746  96.0     0 38.000 50.0
## 8       44      1     22    1833        0  458  55.0     0 18.000 55.0
## 9       51      1     34    2873        0  629  57.0     0 30.000 57.0
## 10      47      1     36    1800     1213  486  67.5     0  5.152 58.5

Now, we will add a new columns Race. We will also replace the values in action column as follows: 1 = approved, 2 = approved, 3 = rejected

loandataDF$Race[loandata.cleaned$black == 1] <- 'Black'
loandataDF$Race[loandata.cleaned$hispan == 1] <- 'Hispanic'
loandataDF$Race[loandata.cleaned$white == 1] <- 'White'

loandataDF$action[loandataDF$action == 1] <- 'Approved'
loandataDF$action[loandataDF$action == 2] <- 'Approved'
loandataDF$action[loandataDF$action == 3] <- 'Rejected'
head(loandataDF, n = 5)
##   loanamt   action appinc atotinc cototinc hexp price other  liq apr  Race
## 1      25 Approved     48    4030        0  720    25     0  6.5  25 White
## 2      36 Approved     36    3166        0  577    45     0 19.0  45 White
## 3      42 Approved     45    3716        0  672    47     0 14.0  47 White
## 4      40 Rejected     50    4166        0  429    50     0 17.0  50 White
## 5     110 Rejected     60    1000     3022 1130    50     0 26.0  50 White

Next, we will rename the columns. But first divide the Total Monthly Income and Co-applicants Monthly Income by 1000 to bring the finances to the same denominator.

loandataDF$atotinc <- loandataDF$atotinc/1000 
loandataDF$cototinc <- loandataDF$cototinc/1000 

names(loandataDF) <- c('Loan Amount', 'Action', 'Applicant Income', 'Total Monthly Income', 'Co-applicants Monthly Income', 'Propose Housing Expense', 'Purchase Price', 'Other Financing', 'Liquid Asset', 'Appraised Value', 'Race')

head(loandataDF, n = 20)
##    Loan Amount   Action Applicant Income Total Monthly Income
## 1           25 Approved               48                4.030
## 2           36 Approved               36                3.166
## 3           42 Approved               45                3.716
## 4           40 Rejected               50                4.166
## 5          110 Rejected               60                1.000
## 6           45 Approved               21                1.776
## 7           67 Rejected               72                3.875
## 8           44 Approved               22                1.833
## 9           51 Approved               34                2.873
## 10          47 Approved               36                1.800
## 11          50 Approved               29                2.463
## 12          46 Rejected               28                2.378
## 13           9 Rejected               28                2.378
## 14          48 Rejected               42                1.888
## 15          17 Approved               31                1.299
## 16          59 Approved               29                2.479
## 17          50 Approved               39                3.254
## 18          48 Approved               59                4.885
## 19          26 Approved               60                2.600
## 20          28 Approved               35                3.016
##    Co-applicants Monthly Income Propose Housing Expense Purchase Price
## 1                         0.000                   720.0           25.0
## 2                         0.000                   577.0           45.0
## 3                         0.000                   672.0           47.0
## 4                         0.000                   429.0           50.0
## 5                         3.022                  1130.0           50.0
## 6                         0.000                   454.0           51.0
## 7                         0.000                   746.0           96.0
## 8                         0.000                   458.0           55.0
## 9                         0.000                   629.0           57.0
## 10                        1.213                   486.0           67.5
## 11                        0.000                   618.0           56.0
## 12                        0.000                   636.0           61.0
## 13                        0.000                   636.0           61.0
## 14                        1.620                   571.1           62.9
## 15                        1.086                   342.0           57.0
## 16                        0.000                   699.0           65.0
## 17                        0.000                   574.0           63.0
## 18                        0.000                   564.0           60.0
## 19                        2.388                   464.4           29.0
## 20                        0.000                   450.9           31.0
##    Other Financing Liquid Asset Appraised Value     Race
## 1                0        6.500            25.0    White
## 2                0       19.000            45.0    White
## 3                0       14.000            47.0    White
## 4                0       17.000            50.0    White
## 5                0       26.000            50.0    White
## 6                0        7.000            50.0    White
## 7                0       38.000            50.0 Hispanic
## 8                0       18.000            55.0 Hispanic
## 9                0       30.000            57.0    White
## 10               0        5.152            58.5    White
## 11               0        9.000            60.0    White
## 12             410        9.000            62.0    White
## 13              82        9.000            62.0    White
## 14               0       34.000            62.9 Hispanic
## 15               0       50.000            65.0    White
## 16               0       10.000            65.0    White
## 17               0       14.000            66.0    White
## 18               0       99.000            66.0    White
## 19               0        7.000            67.0    White
## 20               0      141.000            67.0    White

Below, we will analysed the loan approval rate by race

Black <- c(Applications = sum(loandataDF$Race == 'Black'), Approved = sum(loandataDF$Action == 'Approved' & loandataDF$Race == 'Black'), Denied = sum(loandataDF$Action == 'Rejected' & loandataDF$Race == 'Black'), 'Percent Denied' = sum(loandataDF$Race == 'Black' & loandataDF$Action == 'Rejected')/sum(loandataDF$Race == 'Black')*100)

Hispanic <- c(Applications = sum(loandataDF$Race == 'Hispanic'), Approved = sum(loandataDF$Action == 'Approved' & loandataDF$Race == 'Hispanic'), Denied = sum(loandataDF$Action == 'Rejected' & loandataDF$Race == 'Hispanic'), 'Percent' = sum(loandataDF$Race == 'Hispanic' & loandataDF$Action == 'Rejected')/sum(loandataDF$Race == 'Hispanic')*100)

White <- c(Applications = sum(loandataDF$Race == 'White'), Approved = sum(loandataDF$Action == 'Approved' & loandataDF$Race == 'White'), Denied = sum(loandataDF$Action == 'Rejected' & loandataDF$Race == 'White'), 'Percent' = sum(loandataDF$Race == 'White' & loandataDF$Action == 'Rejected')/sum(loandataDF$Race == 'White')*100)

loan.action.by.race <- data.frame(rbind(White, Black, Hispanic))

loan.action.by.race
##          Applications Approved Denied   Percent
## White            1457     1321    136  9.334248
## Black             184      122     62 33.695652
## Hispanic          106       81     25 23.584906

The data above indicates that whites have a lower denial rate while blacks have the highest denial rate.

We can achieve the grouping result above by using the the aggregate function documents.

# Rename our data fram the we cleaned from outliers
loandata.new = loandata.cleaned

# Substitute values for the action column
loandata.new$action[loandata.new$action == 1] <- 'Approved'
loandata.new$action[loandata.new$action == 2] <- 'Approved'
loandata.new$action[loandata.new$action == 3] <- 'Rejected'

# Add a 'race' column to loandata.new
loandata.new$race[loandata.new$black == 1] <- 'Black'
loandata.new$race[loandata.new$hispan == 1] <- 'Hispanic'
loandata.new$race[loandata.new$white == 1] <- 'White'

# Aggregate the 'action' and 'race' column column by the black, hispan, and white columns
gg <- aggregate(cbind(Black = black, Hispanic = hispan, White = white) ~ action + race, loandata.new, length)
gg
##     action     race Black Hispanic White
## 1 Approved    Black   122      122   122
## 2 Rejected    Black    62       62    62
## 3 Approved Hispanic    81       81    81
## 4 Rejected Hispanic    25       25    25
## 5 Approved    White  1321     1321  1321
## 6 Rejected    White   136      136   136

Let us group our by mortgage history

# Add history column
loandata.new$history[loandata.new$mortno == 1] <- 'No history'
loandata.new$history[loandata.new$mortperf == 1] <- 'No late payment'
loandata.new$history[loandata.new$mortlat1 == 1] <- '1 to 2 Months late'
loandata.new$history[loandata.new$mortlat2 == 1] <- 'Two late payments'


tt <- aggregate(cbind(action) ~ action + history, loandata.new, length)
tt
##     action            history action
## 1 Approved 1 to 2 Months late     25
## 2 Rejected 1 to 2 Months late     10
## 3 Approved         No history    468
## 4 Rejected         No history     34
## 5 Approved    No late payment   1021
## 6 Rejected    No late payment    175
## 7 Approved  Two late payments     10
## 8 Rejected  Two late payments      4

The data above shows that applicants with prior mortgage history form the majority. Among those, the majority were applicants with no late payment. These applicants mostly got their loans approved. Similarly, most applicants with no mortgage history had their loans approved. There were fewer applicants with history of late mortgage payment and a significant amount of their loans were rejected.

Next let us analyse the data by level of education

# Add education column
loandata.new$education[loandata.new$sch == 1] <- 'More than 12 years schooling'
loandata.new$education[loandata.new$sch == 0] <- 'Less than 12 years schooling'
ss <- aggregate(cbind(action) ~ action + education, loandata.new, length)
ss
##     action                    education action
## 1 Approved Less than 12 years schooling    375
## 2 Rejected Less than 12 years schooling     68
## 3 Approved More than 12 years schooling   1149
## 4 Rejected More than 12 years schooling    155

From the above summary, most applicants have above high school education and mostly had the loans approved. however, a significant number of applicants have below high school education but also mostly had their loans approved. This analysis implies that while level of education impact the number of loan applications, it does not impact loan decision.

Next, we will look at income and loan decision

# Add education column
loandata.new$education[loandata.new$sch == 1] <- 'More than 12 years schooling'
loandata.new$education[loandata.new$sch == 0] <- 'Less than 12 years schooling'

inc <- aggregate(cbind(appinc, liq) ~ action + education, loandata.new, mean)

names(inc) <- c('Decision', 'Education', 'Liquid Asset', 'Income x 1000')
inc
##   Decision                    Education Liquid Asset Income x 1000
## 1 Approved Less than 12 years schooling     65.48267    5390.91363
## 2 Rejected Less than 12 years schooling     83.58824   14731.27168
## 3 Approved More than 12 years schooling     75.73803    4421.09285
## 4 Rejected More than 12 years schooling     79.50968      98.25071

The above summary shows that, on average, applicants with higher monthly income are less likely to be rejected than those with lower monthly income. However, liquid asset does not appear to impact loan decision.

Income <- loandataDF %>% group_by(Race = loandataDF$Race) %>% summarize('Average Total Income' = mean(`Applicant Income`), 'Average Monthly Income' = mean(`Total Monthly Income`), 'Average Liquid Asset' = mean(`Liquid Asset`))

Income
## # A tibble: 3 × 4
##   Race     `Average Total Income` Average Monthly Incom…¹ `Average Liquid Asset`
##   <chr>                     <dbl>                   <dbl>                  <dbl>
## 1 Black                      65.4                    3.63                   38.9
## 2 Hispanic                   66.2                    3.89                   45.6
## 3 White                      75.9                    4.33                 5564. 
## # ℹ abbreviated name: ¹​`Average Monthly Income`

From the income analysis above, one could infer that the loan approval rate among whites is higher because they have significantly higher liquid asset.

Below is a scatter plot between the appraised value of homes and purchase price. The plot shows that their is a positive correlation between the appraised value and the purchased price.

library(ggplot2)
p <- ggplot(loandataDF, aes(x = loandataDF$`Purchase Price`, y=loandataDF$`Appraised Value`, )) + geom_point() + ggtitle('Appraised Value versus Purchase Price') +
  xlab('Purchase Price') + ylab('Appraised Value')

p

p + theme(
plot.title = element_text(color="black", size=14, hjust = 0.5),
axis.title.x = element_text(color="black", size=14),
axis.title.y = element_text(color="black", size=14),
panel.background = element_rect(fill = "lightblue",
                                colour = "lightblue",
                                size = 0.5, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")
)
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The scatter plot shows that there is a positive correlation between purchase price and appraised value.

# Rename the data.frame
loandata.new = loandata.cleaned

# Substitute values for the action column
loandata.new$action[loandata.new$action == 1] <- 'Approved'
loandata.new$action[loandata.new$action == 2] <- 'Approved'
loandata.new$action[loandata.new$action == 3] <- 'Rejected'

# Add a 'race' column to loandata.new
loandata.new$race[loandata.new$black == 1] <- 'Black'
loandata.new$race[loandata.new$hispan == 1] <- 'Hispanic'
loandata.new$race[loandata.new$white == 1] <- 'White'

# Aggregate the 'action' and 'race' column column by the black, hispan, and white columns
gg <- aggregate(cbind(Black = black, Hispanic = hispan, White = white) ~ race, loandata.new, length)
gg
##       race Black Hispanic White
## 1    Black   184      184   184
## 2 Hispanic   106      106   106
## 3    White  1457     1457  1457
black <- gg$Black
hispanic <- gg$Hispanic
white <- gg$White






gg1 <- gg[, 2:4]
gg1
##   Black Hispanic White
## 1   184      184   184
## 2   106      106   106
## 3  1457     1457  1457
loan.decision.matrix <- data.matrix(gg1)
loan.decision.matrix
##      Black Hispanic White
## [1,]   184      184   184
## [2,]   106      106   106
## [3,]  1457     1457  1457
barplot(loan.decision.matrix, beside = TRUE, legend = gg$race, col = heat.colors(3), border = 'white')

Read from github

gitURL = 'https://raw.githubusercontent.com/hawa1983/R_Bridge/main/loanapp.csv'

loandataGIT <- read.csv(file = gitURL, header = TRUE, sep = ",")

# Display 10 rows
head(loandataGIT, n = 10)
##     X occ loanamt action  msa suffolk appinc typur unit married dep emp yjob
## 1   1   1      89      1 1120       0     72     0    1       0   0   0    0
## 2   2   1     128      3 1120       0     74     0    1       1   1   0    0
## 3   3   1     128      1 1120       0     84     3    1       0   0   1    1
## 4   4   1      66      1 1120       0     36     0    1       1   0   0    0
## 5   5   1     120      1 1120       0     59     8    1       1   0   0    0
## 6   6   1     111      1 1120       0     63     9    1       0   0   0    0
## 7   7   1     141      1 1120       1     72     1    1       0   0   0    1
## 8   8   1     276      1 1120       0     90     6    1       0   0   0    1
## 9   9   1     100      1 1120       0     72     1    1       1   1   1    1
## 10 10   1     267      1 1120       0    144     0    1       1   2   0    0
##    self atotinc cototinc hexp price other   liq rep gdlin lines mortg cons
## 1     0    5849        0 1031 118.0     0  34.5   1     1    15     2    1
## 2     0    4583     1508 1391 160.0     0  52.0   3     1    19     2    2
## 3     0    2666     4416 1371 143.0     0  37.0   6     1    18     2    2
## 4     1    3000        0  839 110.0     0  19.0   1     1    25     2    6
## 5     0    2583     2358 1341 134.0     0  31.0   1     1    15     2    1
## 6     0    2208     2959 1122 138.0     0 169.0   2     1    10     2    6
## 7     0    6000        0 1870 157.0     0  38.0   1     1     9     2    1
## 8     0    7500        0 2706 346.0     0 142.0   1     1     5     2    1
## 9     0    3234     2524 1112  99.7     0  24.5   1     1     2     2    1
## 10    1    5417     4196 2959 422.0     0 111.0   3     1    15     1    1
##    pubrec  hrat obrat fixadj term apr prop inss inson gift cosign unver review
## 1       0 17.63  34.5      0  360 118    1    0     0    0      0     0      1
## 2       0 22.54  34.1      1  360 175    2    0     0    0      0     0    999
## 3       0 19.00  26.0      0  180 145    2    0     0    0      0     0      3
## 4       1 24.00  37.0      0  360 110    2    0     0    1      0     0      2
## 5       0 25.10  32.1      0  360 135    1    1     0    0      0     0      2
## 6       0 21.00  33.0      0  360 144    2    0     0    0      0     0      1
## 7       0 32.00  36.0      0  360 162    1    1     0    0      0     0      1
## 8       0 36.00  37.0      1  360 365    2    0     0    0      0     0      1
## 9       0 19.31  30.7      0  360 105    2    1     0    0      0     0      3
## 10      0 26.51  49.0      0  360 422    2    0     0    0      0     0      2
##     netw unem min30 bd mi old vr sch black hispan male reject approve mortno
## 1   99.6  3.2     0  0  1   0  1   1     0      0   NA      0       1      0
## 2  847.0  3.2     0  0  1   0  1   1     0      0    1      1       0      0
## 3   40.0  3.9     0  1  1   0  0   1     0      0    1      0       1      0
## 4  158.0  3.1     0  0  1   1  1   1     0      0    1      0       1      0
## 5   69.0  4.3     0  1  1   0  0   0     0      0    1      0       1      0
## 6  262.0  3.2     0  1  1   0  0   0     0      0    1      0       1      0
## 7   46.0  3.2     0  0  0   1  1   1     0      0    1      0       1      0
## 8  214.0  3.2     0  0  1   1  1   1     0      0    1      0       1      0
## 9   38.8  1.8     0  1  1   0  0   1     0      0    1      0       1      0
## 10 374.0  3.1     0  0  1   1  1   1     0      0    1      0       1      1
##    mortperf mortlat1 mortlat2 chist multi   loanprc thick white
## 1         1        0        0     1     0 0.7542373     0     1
## 2         1        0        0     1     0 0.8000000     1     1
## 3         1        0        0     1     0 0.8951049     1     1
## 4         1        0        0     0     0 0.6000000     0     1
## 5         1        0        0     1     0 0.8955224     0     1
## 6         1        0        0     0     0 0.8043478     0     1
## 7         1        0        0     1     0 0.8980892     0     1
## 8         1        0        0     1     0 0.7976879     0     1
## 9         1        0        0     1     0 1.0030091     0     1
## 10        0        0        0     1     0 0.6327014     1     1