Introduction

This project uses data from Fannie Mae to predict the sale price of a home after foreclosure. Fannie Mae is a government-sponsored enterprise in the US that buys mortgage loans from other lenders. It then bundles these loans into mortgage-backed securities and resells them. This enables lenders to make more mortgage loans and creates more liquidity in the market. In theory, this leads to more homeownership and better loan terms. From a borrower's perspective, though, things stay largely the same.

The Loan Performance Data site provides access to loan-level performance data on a portion of Fannie Mae’s single-family mortgages. Fannie Mae releases two types of data: data on the loans it acquires, and data on how those loans perform over time, including cases where a borrower misses multiple payments and the loan ends in foreclosure. Fannie Mae tracks which loans have missed payments and which loans need to be foreclosed on.

Acquisition data contains information on the borrower and is published when Fannie Mae acquires the loan. Performance data contains information on the payments made by the borrower and the foreclosure status, if any, and is published every quarter after the loan is acquired.
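
The two files describe the same loans, so they can be combined on the loan identifier. The following is a minimal base R sketch of that idea, not the code used in this project: the tables are made-up, and the `PERIOD` and `DELQ_MONTHS` columns are hypothetical names chosen for illustration (only `LOAN_ID`, `ORIG_AMT`, and `ORIG_RT` appear in the data summarized in this report).

```r
# Hypothetical acquisition table: one row per loan (made-up values).
acquisition <- data.frame(
  LOAN_ID  = c("A1", "A2", "A3"),
  ORIG_AMT = c(125000, 86500, 186000),
  ORIG_RT  = c(4.375, 3.990, 4.750),
  stringsAsFactors = FALSE
)

# Hypothetical performance table: one row per loan per reporting period.
# PERIOD and DELQ_MONTHS are illustrative column names, not Fannie Mae's.
performance <- data.frame(
  LOAN_ID     = c("A1", "A1", "A2", "A3", "A3"),
  PERIOD      = as.Date(c("2013-01-01", "2013-02-01", "2013-01-01",
                          "2013-01-01", "2013-02-01")),
  DELQ_MONTHS = c(0, 1, 0, 3, 4),
  stringsAsFactors = FALSE
)

# Keep only the most recent performance record for each loan.
performance <- performance[order(performance$LOAN_ID, performance$PERIOD), ]
latest <- performance[!duplicated(performance$LOAN_ID, fromLast = TRUE), ]

# Attach the origination attributes to each loan's latest status.
loans <- merge(acquisition, latest, by = "LOAN_ID")
print(loans)
```

Keeping one performance row per loan before merging avoids repeating the acquisition attributes once per reporting period.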

This project predicts the net sale proceeds of a loan after foreclosure. Net sale proceeds are the total cash received from the sale of the property, net of any applicable selling expenses, such as fees and commissions, allowable for inclusion under the terms of the property sale.

##    LOAN_ID            ORIG_CHN         Seller.Name           ORIG_RT     
##  Length:367         Length:367         Length:367         Min.   :3.125  
##  Class :character   Class :character   Class :character   1st Qu.:3.990  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.375  
##                                                           Mean   :4.328  
##                                                           3rd Qu.:4.750  
##                                                           Max.   :6.000  
##                                                                          
##     ORIG_AMT         ORIG_TRM     ORIG_DTE           FRST_DTE        
##  Min.   : 23000   Min.   :360   Length:367         Length:367        
##  1st Qu.: 86500   1st Qu.:360   Class :character   Class :character  
##  Median :125000   Median :360   Mode  :character   Mode  :character  
##  Mean   :152275   Mean   :360                                        
##  3rd Qu.:186000   3rd Qu.:360                                        
##  Max.   :472000   Max.   :360                                        
##                                                                      
##       OLTV           OCLTV          NUM_BO               DTI       
##  Min.   :28.00   Min.   : 28.0   Length:367         Min.   :12.00  
##  1st Qu.:75.00   1st Qu.: 75.5   Class :character   1st Qu.:32.00  
##  Median :80.00   Median : 80.0   Mode  :character   Median :38.00  
##  Mean   :81.11   Mean   : 81.9                      Mean   :36.05  
##  3rd Qu.:90.00   3rd Qu.: 92.0                      3rd Qu.:42.00  
##  Max.   :97.00   Max.   :100.0                      Max.   :50.00  
##                                                                    
##     CSCORE_B       FTHB_FLG           PURPOSE            PROP_TYP        
##  Min.   :621.0   Length:367         Length:367         Length:367        
##  1st Qu.:686.5   Class :character   Class :character   Class :character  
##  Median :722.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :724.2                                                           
##  3rd Qu.:763.0                                                           
##  Max.   :819.0                                                           
##                                                                          
##    NUM_UNIT           OCC_STAT            STATE          
##  Length:367         Length:367         Length:367        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     ZIP_3               MI_PCT     Product.Type          CSCORE_C    
##  Length:367         Min.   :12.0   Length:367         Min.   :620.0  
##  Class :character   1st Qu.:25.0   Class :character   1st Qu.:691.0  
##  Mode  :character   Median :30.0   Mode  :character   Median :737.0  
##                     Mean   :25.6                      Mean   :732.1  
##                     3rd Qu.:30.0                      3rd Qu.:765.0  
##                     Max.   :35.0                      Max.   :820.0  
##                     NA's   :220                       NA's   :270    
##     FCC_DTE              NS_PROCS     
##  Min.   :2012-06-01   Min.   : 11954  
##  1st Qu.:2013-11-01   1st Qu.: 65961  
##  Median :2014-06-01   Median :101665  
##  Mean   :2014-05-19   Mean   :129196  
##  3rd Qu.:2014-12-01   3rd Qu.:160277  
##  Max.   :2015-12-01   Max.   :653503  
## 

The summary of the cleaned data above covers 367 loans. Foreclosure dates (FCC_DTE) range from June 2012 to December 2015, net sale proceeds (NS_PROCS) range from $11,954 to $653,503, and borrower credit scores (CSCORE_B) range from 621 to 819; co-borrower scores (CSCORE_C) range from 620 to 820.

FORECLOSED PROPERTY TYPE VS NET SALE PROCEEDS


As the box plot above shows, single-family homes are the most frequently foreclosed type of residence, while co-ops are the least. This makes intuitive sense, as single-family homes are, on average, more expensive than other types of residence and thus have higher foreclosure counts. The difference between condos and manufactured housing is small.
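
The comparison behind a box plot like this can be sketched in base R. This is an illustration on made-up data, not the project's plotting code; the `PROP_TYP` codes used here (`SF`, `CO`, `MH`) are assumed values for the example.

```r
# Made-up loans; the PROP_TYP codes are illustrative values only.
loans <- data.frame(
  PROP_TYP = c("SF", "SF", "SF", "SF", "CO", "CO", "MH"),
  NS_PROCS = c(120000, 95000, 210000, 160000, 80000, 110000, 60000)
)

# Number of foreclosures per property type.
counts <- table(loans$PROP_TYP)

# Five-number summaries behind a box plot; drop plot = FALSE to draw it.
bp <- boxplot(NS_PROCS ~ PROP_TYP, data = loans, plot = FALSE)

# Row 3 of bp$stats holds the median of each group.
medians <- setNames(bp$stats[3, ], bp$names)

print(counts)
print(medians)
```

Note that a box plot summarizes the distribution of proceeds within each type, while the counts come from the separate `table()` call.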

FORECLOSED PROPERTY VS STATE

The following graph shows the total number of foreclosures by state. It is a choropleth map in which light green corresponds to fewer foreclosures and darker green to more. Hovering over a state shows the exact number of foreclosures in that state. For example, Tennessee (TN) has 20 foreclosures, while Maryland (MD) has 4.
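
The per-state totals that such a map colors by can be computed with a simple tabulation. A minimal base R sketch on made-up rows follows; the column name `foreclosures` is chosen here for illustration, and the mapping step itself is omitted.

```r
# Made-up loans, one row per foreclosed loan.
loans <- data.frame(
  STATE = c("TN", "TN", "MD", "FL", "FL", "FL", "TN"),
  stringsAsFactors = FALSE
)

# Count foreclosures per state and return a two-column data frame.
by_state <- as.data.frame(
  table(STATE = loans$STATE),
  responseName = "foreclosures",
  stringsAsFactors = FALSE
)

# Sort so the states with the most foreclosures come first.
by_state <- by_state[order(-by_state$foreclosures), ]
print(by_state)
```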


FORECLOSED PROPERTY TYPE VS STATE

The bar chart below shows a breakdown of foreclosed property type by state, providing another dimension of granularity for the data.
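
A breakdown like this amounts to a two-way table of property type by state. The sketch below uses made-up rows and illustrative `PROP_TYP` codes; it is one way to build the counts in base R, not necessarily how the chart in this report was produced.

```r
# Made-up loans; PROP_TYP codes are illustrative values only.
loans <- data.frame(
  STATE    = c("TN", "TN", "TN", "MD", "FL", "FL"),
  PROP_TYP = c("SF", "SF", "CO", "SF", "CO", "MH"),
  stringsAsFactors = FALSE
)

# Rows are property types, columns are states.
type_by_state <- table(loans$PROP_TYP, loans$STATE)
print(type_by_state)

# Passing this table to barplot() with legend.text = TRUE would draw
# one stacked bar per state, split by property type.
```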






Prediction Algorithm

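A multivariable regression model is used to predict the net sale proceeds of foreclosed homes. The 367 loans are split into a training set of 259 rows and a held-out set of 108 rows, each with 25 columns, as the dimensions printed below show.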

## [1] 259  25
## [1] 108  25
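
The report does not show how the split was made. The sketch below reproduces the same 259/108 proportions with base R's `sample()` on synthetic rows; the seed and the data are arbitrary, and the original analysis may have used a different method (for example, a stratified partition from a modeling package).

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Synthetic stand-in for the 367 cleaned loans.
n <- 367
loans <- data.frame(
  LOAN_ID  = seq_len(n),
  NS_PROCS = runif(n, min = 12000, max = 650000)
)

# Draw 259 row indices at random for training; the rest are held out.
train_idx <- sample(n, size = 259)
training  <- loans[train_idx, ]
testing   <- loans[-train_idx, ]

print(dim(training))
print(dim(testing))
```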

FCC_DTE Vs NS_PROCS using the training data

Next, we fit a multivariable model that uses FCC_DTE, PROP_TYP, and ORIG_AMT to predict NS_PROCS.

## Linear Regression 
## 
## 259 samples
##  24 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 259, 259, 259, 259, 259, 259, ... 
## Resampling results:
## 
##   RMSE     Rsquared 
##   46325.2  0.7801634
## 
## 
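
The output above reports error estimates from 25 bootstrap resamples. The sketch below imitates that procedure in base R: fit on a bootstrap sample, score on the rows left out of it, and average over 25 repetitions. The data are synthetic and the relationship between `ORIG_AMT` and `NS_PROCS` is invented, so the numbers it prints say nothing about the real model; the helper `boot_once` is a hypothetical name.

```r
set.seed(1)  # arbitrary seed

# Synthetic stand-in data; value ranges loosely follow the summary above.
n <- 200
train <- data.frame(
  ORIG_AMT = runif(n, 23000, 472000),
  PROP_TYP = sample(c("SF", "CO", "MH"), n, replace = TRUE),
  FCC_DTE  = as.Date("2012-06-01") + sample(0:1278, n, replace = TRUE)
)
# Invented relationship, for illustration only.
train$NS_PROCS <- 0.8 * train$ORIG_AMT + rnorm(n, sd = 30000)

# One bootstrap repetition: fit on resampled rows, score on left-out rows.
boot_once <- function(data) {
  idx  <- sample(nrow(data), replace = TRUE)
  oob  <- setdiff(seq_len(nrow(data)), idx)
  m    <- lm(NS_PROCS ~ as.numeric(FCC_DTE) + PROP_TYP + ORIG_AMT,
             data = data[idx, ])
  pred <- predict(m, newdata = data[oob, ])
  obs  <- data$NS_PROCS[oob]
  c(RMSE = sqrt(mean((obs - pred)^2)), Rsquared = cor(obs, pred)^2)
}

# 25 repetitions, matching the number of reps in the output above.
results <- replicate(25, boot_once(train))
print(rowMeans(results))
```

Scoring on the left-out rows gives a less optimistic estimate than scoring on the rows used for fitting.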

R-squared measures how successful the fit is in explaining the variation in the data. Put another way, R-squared is the square of the correlation between the observed response values and the predicted response values. It is also called the square of the multiple correlation coefficient and the coefficient of multiple determination. R-squared takes values between 0 and 1, with a value closer to 1 indicating that the model accounts for a greater proportion of the variance. Here, the resampled R-squared of 0.780 means that the fit explains about 78% of the total variation in the data about the average. The RMSE of 46,325 is the typical size of a prediction error, in dollars.
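
For reference, the two descriptions of R-squared can be written out as follows, with $y_i$ the observed net sale proceeds, $\hat{y}_i$ the model's prediction, and $\bar{y}$ the average of the observed values (the symbols are introduced here for illustration):

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
\qquad
R^2 = \operatorname{cor}(y, \hat{y})^2 .
$$

The two forms agree for a least-squares fit with an intercept when evaluated on the data used for fitting. On resampled or held-out data they can differ, and it is an assumption here that the reported value uses the squared-correlation form.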

Diagnostic plot

Below is the diagnostic plot, a standard check when building a regression model. Fitted values are the model's predictions on the training set, and residuals are the differences between the observed values and those predictions.

Below, the residuals are plotted against the fitted values, by variable.
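
A residuals-versus-fitted plot of this kind can be produced from any `lm` fit. The following is a minimal sketch on synthetic data with a single predictor, not the model from this report; the variable names are placeholders.

```r
set.seed(2)  # arbitrary seed

# Synthetic data with an invented linear relationship.
x <- runif(100, 23000, 472000)
y <- 0.8 * x + rnorm(100, sd = 30000)

fit <- lm(y ~ x)

# Fitted values are the model's predictions on the data it was trained on;
# residuals are observed minus fitted.
fitted_vals <- fitted(fit)
resids      <- residuals(fit)

# Residuals-versus-fitted scatter with a reference line at zero.
plot(fitted_vals, resids, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

If the model is adequate, the points should scatter evenly around the zero line with no visible trend or funnel shape.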

Conclusion

We took a close look at the Fannie Mae foreclosure data to see what insights it could reveal. We first cleaned the data and explored it through plots. We then built a multivariable regression model that was fairly reliable in predicting the net sale proceeds of foreclosed homes, with an R-squared of about 0.78. However, the full Fannie Mae dataset is massive and far exceeds the computational capacity of a single machine. If the full dataset could be processed, perhaps by distributing the work across multiple processors, a more reliable model might be built.