To predict what a home sells for after foreclosure, this project uses data from Fannie Mae. Fannie Mae is a government-sponsored enterprise in the US that buys mortgage loans from other lenders. It then bundles these loans into mortgage-backed securities and resells them. This enables lenders to make more mortgage loans and creates more liquidity in the market. In theory, this leads to more homeownership and better loan terms. From a borrower's perspective, though, things stay largely the same.
The Loan Performance Data site provides access to loan-level performance data on a portion of Fannie Mae’s single-family mortgages. Fannie Mae releases two types of data: data on the loans it acquires, and data on how those loans perform over time, including cases where borrowers miss multiple payments and the loan ends in foreclosure. Fannie Mae tracks which loans have missed payments and which loans need to be foreclosed on.
Acquisition data contains information on the borrower and is published when the loan is acquired by Fannie Mae. Performance data contains information on the payments made by the borrower and the foreclosure status, if any, and is published every quarter after the loan is acquired.
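As a minimal sketch of how the two files relate, the base R snippet below joins a toy acquisition table to a toy performance table on the loan identifier. The values are invented, and the `PERIOD` column name is a hypothetical placeholder; only `LOAN_ID`, `ORIG_AMT`, `CSCORE_B`, and `NS_PROCS` are names that appear in this report's data summary.

```r
# Toy stand-in for the acquisition file: one row per loan (values invented).
acquisition <- data.frame(
  LOAN_ID  = c("A1", "A2", "A3"),
  ORIG_AMT = c(125000, 86500, 186000),
  CSCORE_B = c(722, 686, 763),
  stringsAsFactors = FALSE
)

# Toy stand-in for the performance file: one row per loan per reporting period.
# PERIOD is a hypothetical column name. NS_PROCS stays NA until a foreclosed
# property has been sold.
performance <- data.frame(
  LOAN_ID  = c("A1", "A1", "A2", "A3", "A3"),
  PERIOD   = as.Date(c("2013-01-01", "2013-04-01", "2013-01-01",
                       "2013-01-01", "2013-04-01")),
  NS_PROCS = c(NA, 101665, NA, NA, 65961),
  stringsAsFactors = FALSE
)

# Keep only the performance rows that report sale proceeds.
foreclosed <- performance[!is.na(performance$NS_PROCS), ]

# Attach the origination attributes to each foreclosed loan by LOAN_ID.
loans <- merge(acquisition, foreclosed, by = "LOAN_ID")
print(loans)
```

Loans that never report sale proceeds (here `A2`) drop out of the merged table, which is why the modelling data can be far smaller than the raw files.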
In this project, we predict the net sale proceeds of a loan after foreclosure. Net sale proceeds are the total cash received from the sale of the property, net of any applicable selling expenses (such as fees and commissions) allowable for inclusion under the terms of the property sale.
## LOAN_ID            ORIG_CHN           Seller.Name        ORIG_RT
## Length:367         Length:367         Length:367         Min.   :3.125
## Class :character   Class :character   Class :character   1st Qu.:3.990
## Mode  :character   Mode  :character   Mode  :character   Median :4.375
##                                                          Mean   :4.328
##                                                          3rd Qu.:4.750
##                                                          Max.   :6.000
##
## ORIG_AMT           ORIG_TRM           ORIG_DTE           FRST_DTE
## Min.   : 23000     Min.   :360        Length:367         Length:367
## 1st Qu.: 86500     1st Qu.:360        Class :character   Class :character
## Median :125000     Median :360        Mode  :character   Mode  :character
## Mean   :152275     Mean   :360
## 3rd Qu.:186000     3rd Qu.:360
## Max.   :472000     Max.   :360
##
## OLTV               OCLTV              NUM_BO             DTI
## Min.   :28.00      Min.   : 28.0      Length:367         Min.   :12.00
## 1st Qu.:75.00      1st Qu.: 75.5      Class :character   1st Qu.:32.00
## Median :80.00      Median : 80.0      Mode  :character   Median :38.00
## Mean   :81.11      Mean   : 81.9                         Mean   :36.05
## 3rd Qu.:90.00      3rd Qu.: 92.0                         3rd Qu.:42.00
## Max.   :97.00      Max.   :100.0                         Max.   :50.00
##
## CSCORE_B           FTHB_FLG           PURPOSE            PROP_TYP
## Min.   :621.0      Length:367         Length:367         Length:367
## 1st Qu.:686.5      Class :character   Class :character   Class :character
## Median :722.0      Mode  :character   Mode  :character   Mode  :character
## Mean   :724.2
## 3rd Qu.:763.0
## Max.   :819.0
##
## NUM_UNIT           OCC_STAT           STATE
## Length:367         Length:367         Length:367
## Class :character   Class :character   Class :character
## Mode  :character   Mode  :character   Mode  :character
##
## ZIP_3              MI_PCT             Product.Type       CSCORE_C
## Length:367         Min.   :12.0       Length:367         Min.   :620.0
## Class :character   1st Qu.:25.0       Class :character   1st Qu.:691.0
## Mode  :character   Median :30.0       Mode  :character   Median :737.0
##                    Mean   :25.6                          Mean   :732.1
##                    3rd Qu.:30.0                          3rd Qu.:765.0
##                    Max.   :35.0                          Max.   :820.0
##                    NA's   :220                           NA's   :270
##
## FCC_DTE               NS_PROCS
## Min.   :2012-06-01    Min.   : 11954
## 1st Qu.:2013-11-01    1st Qu.: 65961
## Median :2014-06-01    Median :101665
## Mean   :2014-05-19    Mean   :129196
## 3rd Qu.:2014-12-01    3rd Qu.:160277
## Max.   :2015-12-01    Max.   :653503
##
The summary of the 367 cleaned loans shows that foreclosure dates run from June 2012 to December 2015, net sale proceeds range from $11,954 to $653,503, and borrower credit scores (CSCORE_B) range from 621 to 819.
As we can see in the box plot above, single-family homes are the most frequently foreclosed type of residence, while co-ops are the least. This makes intuitive sense, as single-family homes are, on average, more expensive than other types of residence and thus have higher foreclosure counts. The difference between condos and manufactured housing is small, so their counts are almost the same.
The following graph shows the total number of foreclosures by state. It is a choropleth map in which light green corresponds to fewer foreclosures and darker green to more foreclosures. Hovering over a state shows the exact number of foreclosures in that state. For example, Tennessee (TN) has 20 foreclosures, while Maryland (MD) has 4.
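The value mapped to color in such a choropleth is simply a per-state count of foreclosed loans. A minimal base R sketch of that aggregation follows; the six rows are invented for illustration, and only the `STATE` column name comes from the data summary in this report.

```r
# Invented sample: one row per foreclosed loan.
loans <- data.frame(
  STATE = c("TN", "TN", "MD", "TN", "OH", "MD"),
  stringsAsFactors = FALSE
)

# Count loans per state and return a two-column data frame.
counts <- as.data.frame(
  table(STATE = loans$STATE),
  responseName = "foreclosures",
  stringsAsFactors = FALSE
)

# Sort so the states with the most foreclosures come first.
counts <- counts[order(-counts$foreclosures), ]
print(counts)
```

A mapping library would then take the `STATE` codes as locations and `foreclosures` as the fill value.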
The bar chart below shows a breakdown of foreclosed property type by state, providing another dimension of granularity for the data.
A multivariable regression model is used to predict the net sale proceeds of foreclosed homes.
## [1] 259 25
## [1] 108 25
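The two dimension lines above (259 × 25 and 108 × 25) are consistent with splitting the 367 cleaned loans into a training set and a test set of roughly 70% and 30%. The report does not show how the split was made, so the following is only a sketch using base R's `sample()`. The seed and the placeholder data frame are assumptions; the training size of 259 is taken from the output above.

```r
set.seed(42)  # arbitrary seed, used only to make this sketch reproducible

# Placeholder for the 367 cleaned loans (the real table has 25 columns).
loans <- data.frame(
  LOAN_ID  = seq_len(367),
  NS_PROCS = runif(367, min = 11954, max = 653503)
)

# Draw 259 row indices without replacement for the training set;
# the remaining 108 rows form the test set.
n_train   <- 259
train_idx <- sample(nrow(loans), size = n_train)
train     <- loans[train_idx, ]
test      <- loans[-train_idx, ]

print(dim(train))
print(dim(test))
```

Holding out the test rows before any model fitting keeps the later error estimate from being flattered by data the model has already seen.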
Next, we fit a multivariable model using FCC_DTE, PROP_TYP, and ORIG_AMT to predict NS_PROCS.
## Linear Regression
##
## 259 samples
## 24 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 259, 259, 259, 259, 259, 259, ...
## Resampling results:
##
##   RMSE     Rsquared
##   46325.2  0.7801634
##
##
R-squared measures how successful the fit is in explaining the variation in the data. Put another way, R-squared is the square of the correlation between the observed response values and the predicted response values. It is also called the square of the multiple correlation coefficient, or the coefficient of multiple determination. For a least-squares fit with an intercept, R-squared lies between 0 and 1, with a value closer to 1 indicating that a greater proportion of the variance is accounted for by the model. For example, the R-squared of 0.78 reported above means that the fit explains about 78% of the total variation in the data about the average.
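The two descriptions of R-squared given above (squared correlation, and share of variation explained) can be checked numerically. The sketch below fits an ordinary least-squares model with `lm()` on synthetic data and computes R-squared both ways. The data, coefficients, and seed are invented; only the column names come from this report, and the report's own model was fit with resampling, which this sketch does not reproduce.

```r
set.seed(1)  # arbitrary seed for the synthetic data

n <- 200
# Synthetic stand-in for the cleaned loans. SF, CO and PU are used here as
# example property-type codes (single-family, condo, planned unit development).
loans <- data.frame(
  ORIG_AMT = runif(n, min = 23000, max = 472000),
  PROP_TYP = factor(sample(c("SF", "CO", "PU"), n, replace = TRUE)),
  FCC_DTE  = as.Date("2012-06-01") + sample(0:1200, n, replace = TRUE)
)
# Invented relationship between the predictors and net sale proceeds.
loans$NS_PROCS <- 10000 + 0.8 * loans$ORIG_AMT +
  5000 * (loans$PROP_TYP == "SF") + rnorm(n, sd = 40000)

# Ordinary least squares; the date enters as days since 1970-01-01.
fit  <- lm(NS_PROCS ~ as.numeric(FCC_DTE) + PROP_TYP + ORIG_AMT, data = loans)
pred <- fitted(fit)

# Definition 1: squared correlation between observed and predicted values.
r2_cor <- cor(loans$NS_PROCS, pred)^2

# Definition 2: share of total variation explained, 1 - SSE / SST.
sse   <- sum((loans$NS_PROCS - pred)^2)
sst   <- sum((loans$NS_PROCS - mean(loans$NS_PROCS))^2)
r2_ss <- 1 - sse / sst

# Both agree with the value reported by summary(), up to rounding error.
print(c(r2_cor = r2_cor, r2_ss = r2_ss, r2_lm = summary(fit)$r.squared))
```

The agreement holds for least-squares fits that include an intercept and are evaluated on the data they were fit to; on held-out data the two definitions can differ.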
Below is the diagnostic plot that is typically examined when building a regression model. Fitted values are the model's predictions on the training set, and residuals are the differences between the observed values and those predictions.
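As a small illustration of these two quantities, the sketch below extracts fitted values and residuals from an `lm()` fit and draws a basic residuals-versus-fitted plot. The data are synthetic and the variable names `x` and `y` are placeholders, not columns from the report.

```r
set.seed(2)  # arbitrary seed for the synthetic data

x <- runif(100, min = 23000, max = 472000)       # stand-in for a predictor
y <- 10000 + 0.8 * x + rnorm(100, sd = 40000)    # stand-in for the response

fit <- lm(y ~ x)

fitted_vals <- fitted(fit)     # model predictions on the training data
resid_vals  <- residuals(fit)  # observed values minus fitted values

# Check that residuals really are observed minus fitted.
print(all.equal(unname(resid_vals), unname(y - fitted_vals)))

# For least squares with an intercept, the residuals average to zero.
print(mean(resid_vals))

# Residuals-versus-fitted plot: a well-behaved fit shows a patternless
# cloud scattered around the horizontal zero line.
plot(fitted_vals, resid_vals, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Curvature or a funnel shape in this plot would suggest a missing nonlinear term or non-constant error variance, respectively.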
Below, the residuals are plotted against the fitted values, broken out by variable.
We took a close look at the Fannie Mae foreclosure data to see what insights it could reveal. We first performed some exploratory analysis by cleaning and plotting the data. We then built a multivariable regression model that was fairly reliable in predicting the net sale proceeds of foreclosed homes (resampled R-squared of about 0.78). However, the full Fannie Mae dataset is massive and far exceeded the computing capacity of a single machine. If the full dataset could be processed, perhaps by spreading the work across multiple processors, a more reliable model might be possible.