This document presents an analysis of HMDA mortgage data, focusing on how mortgage applications are handled. The analysis is based on 500000 mortgage applications, each recording specific characteristics of the applicant and whether the application was accepted or denied. The goal is to check whether the final status of an application can be predicted from the information available. Exploring the data with descriptive statistics identified several potential relationships (associations) between applicant characteristics and application outcome. A model to predict the outcome of an application was then built and later tuned to see how accurately it could predict on new data.
This R Markdown document, however, provides only a glimpse of some of the steps I went through in writing my final report. The full report, covering the entire analysis, can be found here: http://rmarkdown.rstudio.com.
The following R packages were installed and loaded to help with the analysis. The HMDA data set was loaded as well.
hmdadata <- read.csv("hmda_train.csv")
library(ggplot2)
library(Hmisc)
library(corrplot)
library(caret)
library(dplyr)
library(magrittr)
library(pwr)
library(dummies)
library(janitor)
library(ROSE)
We begin by looking through the data to understand what the data set looks like and to establish the nature of its features. We then split the data set in two for exploratory data analysis: one part containing the numeric features and the other containing the categorical features.
str(hmdadata)
## 'data.frame': 500000 obs. of 24 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ row_id : num 0 1 2 3 4 5 6 7 8 9 ...
## $ loan_type : int 3 1 2 1 1 1 3 2 1 1 ...
## $ property_type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ loan_purpose : int 1 3 3 1 1 3 1 1 3 3 ...
## $ occupancy : int 1 1 1 1 1 1 1 1 2 1 ...
## $ loan_amount : int 70 178 163 155 305 133 240 210 209 197 ...
## $ preapproval : int 3 3 3 1 3 3 3 3 3 3 ...
## $ msa_md : int 18 369 16 305 24 221 374 322 24 194 ...
## $ state_code : int 37 52 10 47 37 13 28 37 37 9 ...
## $ county_code : int 246 299 306 180 20 55 131 35 20 20 ...
## $ applicant_ethnicity : int 2 1 2 2 2 2 1 1 2 2 ...
## $ applicant_race : int 5 5 5 5 3 5 5 5 5 5 ...
## $ applicant_sex : int 1 1 1 1 2 2 2 1 1 1 ...
## $ applicant_income : int 24 57 67 105 71 51 104 55 244 86 ...
## $ population : int 6203 5774 6094 6667 6732 6078 6068 6030 5151 7916 ...
## $ minority_population_pct : num 44.23 15.9 61.27 6.25 100 ...
## $ ffiecmedian_family_income : num 60588 54821 67719 78439 63075 ...
## $ tract_to_msa_md_income_pct : num 50.9 100 100 100 82.2 ...
## $ number_of_owner.occupied_units: int 716 1622 760 2025 1464 1827 1863 969 411 1861 ...
## $ number_of_1_to_4_family_units : int 2642 2108 1048 2299 1847 2340 2560 1601 481 2123 ...
## $ lender : int 4536 2458 5710 5888 289 964 5488 2442 2118 3507 ...
## $ co_applicant : Factor w/ 2 levels "False","True": 1 1 1 2 1 1 1 2 2 1 ...
## $ accepted : int 1 0 1 1 1 1 1 1 1 0 ...
dim(hmdadata)
## [1] 500000 24
nums2 <- c("loan_amount", "applicant_income","population",
"minority_population_pct","ffiecmedian_family_income",
"tract_to_msa_md_income_pct","number_of_owner.occupied_units","number_of_1_to_4_family_units","accepted")
ct <- c("loan_type","property_type","loan_purpose","occupancy","preapproval",
"msa_md","state_code","county_code","applicant_ethnicity","applicant_race",
"applicant_sex","co_applicant","lender","accepted")
hmdanums <- hmdadata[nums2]
hmdact <- hmdadata[ct]
summary(hmdanums,na.rm=TRUE)
## loan_amount applicant_income population
## Min. : 1.0 Min. : 1.0 Min. : 14
## 1st Qu.: 93.0 1st Qu.: 47.0 1st Qu.: 3744
## Median : 162.0 Median : 74.0 Median : 4975
## Mean : 221.8 Mean : 102.4 Mean : 5417
## 3rd Qu.: 266.0 3rd Qu.: 117.0 3rd Qu.: 6467
## Max. :100878.0 Max. :10139.0 Max. :37097
## NA's :39948 NA's :22465
## minority_population_pct ffiecmedian_family_income
## Min. : 0.534 Min. : 17858
## 1st Qu.: 10.700 1st Qu.: 59731
## Median : 22.901 Median : 67526
## Mean : 31.617 Mean : 69236
## 3rd Qu.: 46.020 3rd Qu.: 75351
## Max. :100.000 Max. :125248
## NA's :22466 NA's :22440
## tract_to_msa_md_income_pct number_of_owner.occupied_units
## Min. : 3.981 Min. : 4
## 1st Qu.: 88.067 1st Qu.: 944
## Median :100.000 Median :1327
## Mean : 91.833 Mean :1428
## 3rd Qu.:100.000 3rd Qu.:1780
## Max. :100.000 Max. :8771
## NA's :22514 NA's :22565
## number_of_1_to_4_family_units accepted
## Min. : 1 Min. :0.0000
## 1st Qu.: 1301 1st Qu.:0.0000
## Median : 1753 Median :1.0000
## Mean : 1886 Mean :0.5002
## 3rd Qu.: 2309 3rd Qu.:1.0000
## Max. :13623 Max. :1.0000
## NA's :22530
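The NA's rows in the summary above show that several numeric columns contain missing values. One way to see this per column, with counts and percentages side by side, is sketched below. It is a minimal illustration on a small toy data frame (`toy_nums`, a hypothetical stand-in for `hmdanums` with made-up values), not the procedure used in the full report.

```r
# Toy stand-in for hmdanums; the values are illustrative only and are
# not taken from hmda_train.csv.
toy_nums <- data.frame(
  loan_amount      = c(70, 178, 163, 155, 305),
  applicant_income = c(24, NA, 67, 105, NA),
  population       = c(6203, 5774, NA, 6667, 6732),
  accepted         = c(1, 0, 1, 1, 1)
)

# Number of missing values in each column.
na_count <- colSums(is.na(toy_nums))

# Percentage of rows that are missing in each column.
na_pct <- round(100 * colMeans(is.na(toy_nums)), 1)

# One row per column, with both measures side by side.
missing_summary <- data.frame(na_count, na_pct)
print(missing_summary)
```

Applied to the real data, the same two calls on `hmdanums` would give a compact table to consult before deciding how to handle missing values.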
summary(hmdact, na.rm=TRUE)
## loan_type property_type loan_purpose occupancy
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00
## Median :1.000 Median :1.000 Median :2.000 Median :1.00
## Mean :1.366 Mean :1.048 Mean :2.067 Mean :1.11
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:1.00
## Max. :4.000 Max. :3.000 Max. :3.000 Max. :3.00
## preapproval msa_md state_code county_code
## Min. :1.000 Min. : -1.0 Min. :-1.00 Min. : -1.0
## 1st Qu.:3.000 1st Qu.: 25.0 1st Qu.: 6.00 1st Qu.: 57.0
## Median :3.000 Median :192.0 Median :26.00 Median :131.0
## Mean :2.765 Mean :181.6 Mean :23.73 Mean :144.5
## 3rd Qu.:3.000 3rd Qu.:314.0 3rd Qu.:37.00 3rd Qu.:246.0
## Max. :3.000 Max. :408.0 Max. :52.00 Max. :324.0
## applicant_ethnicity applicant_race applicant_sex co_applicant
## Min. :1.000 Min. :1.000 Min. :1.000 False:299974
## 1st Qu.:2.000 1st Qu.:5.000 1st Qu.:1.000 True :200026
## Median :2.000 Median :5.000 Median :1.000
## Mean :2.036 Mean :4.787 Mean :1.462
## 3rd Qu.:2.000 3rd Qu.:5.000 3rd Qu.:2.000
## Max. :4.000 Max. :7.000 Max. :4.000
## lender accepted
## Min. : 0 Min. :0.0000
## 1st Qu.:2442 1st Qu.:0.0000
## Median :3731 Median :1.0000
## Mean :3720 Mean :0.5002
## 3rd Qu.:5436 3rd Qu.:1.0000
## Max. :6508 Max. :1.0000
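Because most of these columns are stored as integer codes, `summary()` reports means and quartiles that have no real interpretation for categories. A possible alternative is to convert the codes to factors and tabulate them. The sketch below does this on a small toy data frame (`toy_ct`, a hypothetical stand-in for `hmdact` with made-up values); it is an illustration, not the approach taken in the full report.

```r
# Toy stand-in for hmdact; the values are illustrative only.
toy_ct <- data.frame(
  loan_type     = c(3, 1, 2, 1, 1, 1),
  applicant_sex = c(1, 1, 2, 2, 1, 1),
  accepted      = c(1, 0, 1, 1, 0, 1)
)

# Treat every integer code as a category rather than a number.
toy_ct[] <- lapply(toy_ct, factor)

# Frequency of each loan_type code.
loan_type_counts <- table(toy_ct$loan_type)
print(loan_type_counts)

# Share of denied (0) and accepted (1) applications within each
# loan_type code; margin = 1 makes each row sum to 1.
accept_by_type <- prop.table(
  table(loan_type = toy_ct$loan_type, accepted = toy_ct$accepted),
  margin = 1
)
print(accept_by_type)
```

After conversion, `summary()` on the same data frame prints level counts instead of quartiles, which is usually easier to read for coded features.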