Find out Sydney mortgage stress areas with logistic regression analysis

Introduction

In our previous research of “Micro factors affecting monthly household mortgage payments in Sydney Metro,” we successfully built a model that could predict average monthly mortgage repayments with a high level of confidence (error rate of 8%). It was also in the findings of our previous analyses, we noticed an increasing trend of mortgage stress that is disproportionate to the growth of household income. In this report, I am trying to quantify this stress and build a model to predict if the mortgage stress of a chosen Statistical Areas Level 2 (SA2) fell to a level that could be considered as ‘SAFE’. The model would benefit government and commercial organizations such banks and mortgage insurance providers for setting area specific policies.

Prepare the data and environment

As mentioned in our previous research, we collected a dataset from the ABS website which contains 902 variables for each of SA2s in Sydney metropolitan area. In addition to the raw data collected from the ABS, we were also able to find latitude and longitude of each SA2s and calculate a distance of each SA2s to the Sydney CBD. The combined dataset was then cleaned, transformed and normalised into a CSV file named “model_df_new.csv” for later analysis. During the process of data transformation, we adopted a number of data cleansing techniques such as removing NA values, multicollinearity analyses, and intuitive selection to reduce the number of variables to 45 in a new dataset named model_df_load.

# Prepare the environment
rm( list = ls() ) #Gets rid of variables in the global environment
#dev.off() #Remove plots
gc() # Runs the garbage collector to clear memory.
cat("\014") #Clears the console
setwd("~/RWorkingDirectory")
getwd()


library(datasets)
library(ggplot2)
library(dplyr)
library(ISLR)
library(glmnet)
library(caret)
library(dplyr)
library(tidyverse) #Loads dplyr, ggplot2, readr, tidyr, rvest, lubridate etc

#Load the data 
model_df_load <- read_csv("model_df_new.csv")

Exam the dataset

Upon the new dataset is loaded, we further exam its integrity to ensure the new dataset is fit for the purpose of the statistical analyses. In addition, we added two calculated variables to present average and SA2 specific mortgage to income ration of the year 2013. By summarizing the statistics of the new dataset, we noticed average household mortgage to income ratio of Sydney metropolitan area is at 0.23 which falls into a safe zone of 0.25 as nominated by many industry experts (“What Percentage Of Income Can You Afford For A Mortgage?,” 2016). That is if household mortgage repayment is no more than 25 percent of household income, then it could be considered as safe. Vice, the situation is not safe in term of mortgage stress.

To ensure these statistics were not mispresented, we plotted all SA2s household mortgage to income ratio observations of the year 2013 into a number of different forms. By examining these plots, we noticed the distribution of SA2 specific household mortgage to income ratio of the year 2013 has a tendency of staying around the theoretical safety threshold of 0.25 and slightly left weighted. This indicates in the year of 2013 although the high mortgage stress existed in some areas, the average mortgage stress of Sydney metropolitan area was still within in the safe zone.

I am sure the result surprised a lot of us, but it could be explained as the perception of general high mortgage stress in Sydney was created by a specific group of people that cry loudest.

#Remove lon and lat
model_df <- model_df_load[, -(39:40)]

# Check for NA
na_columns <- model_df %>%
  summarise_each(funs(sum(is.na(.)))) %>%
  gather() %>%
  arrange(desc(value)) %>%
  rename( measure_year = key, na_count = value)
na_columns

## # A tibble: 40 × 2
##    measure_year na_count
##           <chr>    <int>
## 1      Distance        1
## 2    sa2_number        0
## 3    ERP_6_2013        0
## 4    ERP_7_2013        0
## 5    ERP_8_2013        0
## 6    ERP_9_2013        0
## 7   ERP_10_2013        0
## 8   ERP_11_2013        0
## 9   ERP_12_2013        0
## 10  ERP_13_2013        0
## # ... with 30 more rows

#Add varibles to represent average household income and monthly household mortage payment of the whole Sydney metropolitan area of year 2013 

model_df <- model_df %>% 
  mutate(INCOME_17_AVG_2013 = mean((INCOME_17_2013 * HHTYPE_6_2011)))%>%
  mutate(RENT_AVG_2013 = mean(RENT_3_2013*12)) 

#Add a variable to represent average household annual income to household annual mortage payment ratio of of the whole Sydney metropolitan area of year 2013   

model_df <- model_df %>% 
  mutate(MORT__INCOME_AVG_2013 = RENT_AVG_2013/INCOME_17_AVG_2013) %>%
#Add a variable to represent average household annual income to household annual mortage payment ratio of of the specific SA2 of year 2013   
  mutate(MORT__INCOME_2013 = (RENT_3_2013*12)/(INCOME_17_2013 * HHTYPE_6_2011))    

 
#print the statical summary of model_df$MORT__INCOME_2013
summary(model_df$MORT__INCOME_2013)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1536  0.1964  0.2365  0.2388  0.2686  0.4010

#visualise mortgage ration to income with boxplot 
ggplot(model_df,aes(x='',y=MORT__INCOME_2013)) +
  geom_boxplot()

#visualise mortgage ration to income with density plot 
ggplot(model_df, aes(MORT__INCOME_2013)) +
  geom_density()

#visualise mortgage ration to income with histogram plot 
ggplot(model_df, aes(MORT__INCOME_2013)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Run logistic regression to predict mortgage stress with training dataset

My next question is for those areas where the mortgage to income ratio is greater than 0.25, can we find out common characteristics and build a model to predict if a specific SA2 area falls into a safe zone in term of mortgage stress. The model could help us to predict mortgage stress for a new SA2 area where either household mortgage repayment or income is unknown.

From our previous research, we had successfully identified some very good leading indicators to predict average household mortgage repayment. Let us first to see if these indicators could also help us model if a specific area is safe from mortgage stress. For the ease of logistic regression analyses, we first added a binomial factor of SAFE_2013 to the dataset to present if the mortgage stress in a specific SA2 area falls into the safe zone. That is if the household mortgage repayment to income ratio is less or equal to 0.25, then the factor of SAFE_2013 is “Yes”, vice the SAFE_2013 equals to “No”.

To simulate missing data from household mortgage repayment or income, we created two logistic regression models, one with household income excluded and the other with household mortgage repayment excluded. We then run a logistic regression with each of these two models on a training data set which represents 70% of the population of the dataset loaded. The intent of data segregation here is to best simulate real-world environment and to minimize possible model overfit since the training error often too optimistic. In this case, we reserved the remaining 30% of the loaded dataset as test dataset to test the model later on.

Surprisingly, both models gave us the similar error rate of about 10%. That is about 90% of observations in training dataset could fit into one of the chosen models. The low error rate of training models gave us a high confidence that there is a strong correlation between the chosen indicators to the response binomial factor if mortgage stress is in a safe zone.

#Add a variable to represent is the SA2 specific mortgage ration to income less than 0.25   
model_df <- model_df %>% 
  mutate(SAFE_2013 = ifelse(MORT__INCOME_2013 <=0.25,"Yes","No"))
  model_df$SAFE_2013 <- as.factor(model_df$SAFE_2013)
  is.factor(model_df$SAFE_2013)

## [1] TRUE

# List interested vraibles to front
model_df <-model_df %>%
  select(MORT__INCOME_AVG_2013,MORT__INCOME_2013,SAFE_2013,everything())


# create data partition row list
set.seed(42)
train = createDataPartition(y = model_df$HOUSES_3_2013, p = 0.7, list = F)

# partition default data - remove the variable
training_model_df = model_df[train, ]
testing_model_df = model_df[-train, ]



#Run logistic regression to predict mortgage stress with training dataset while average of household income is unknown.    

glm.fit1 = glm(SAFE_2013 ~ ERP_7_2013 + 
                          ERP_8_2013 +
                          RENT_3_2013+
                          HOUSES_3_2013 +
                          LF_4_2011 + 
                          SCHOOL_2_2011 +
                          MIG_4_2013_ADJ_2013 + 
                          #INCOME_17_2013 + 
                          Distance,
                        data=training_model_df,family = binomial(link = "logit"))

#Summaries the chosen model 
summary(glm.fit1)

## 
## Call:
## glm(formula = SAFE_2013 ~ ERP_7_2013 + ERP_8_2013 + RENT_3_2013 + 
##     HOUSES_3_2013 + LF_4_2011 + SCHOOL_2_2011 + MIG_4_2013_ADJ_2013 + 
##     Distance, family = binomial(link = "logit"), data = training_model_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2248  -0.2012   0.0705   0.2433   2.4793  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          7.299e+00  6.620e+00   1.103  0.27020   
## ERP_7_2013           2.418e-01  1.477e-01   1.637  0.10162   
## ERP_8_2013          -1.482e-01  7.266e-02  -2.039  0.04142 * 
## RENT_3_2013         -1.412e-03  2.163e-03  -0.653  0.51396   
## HOUSES_3_2013       -1.138e-05  3.683e-06  -3.089  0.00201 **
## LF_4_2011            1.509e-02  3.539e-01   0.043  0.96599   
## SCHOOL_2_2011        7.517e-02  6.661e-02   1.128  0.25914   
## MIG_4_2013_ADJ_2013 -4.923e+00  2.729e+01  -0.180  0.85681   
## Distance             1.521e-05  4.404e-05   0.345  0.72980   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.639  on 158  degrees of freedom
## Residual deviance:  70.375  on 150  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 88.375
## 
## Number of Fisher Scoring iterations: 7

#Exame the model 
glm.probs=predict(glm.fit1,type="response")

glm.pred=rep("No",160)
glm.pred[glm.probs>0.5]="Yes"


table(glm.pred,training_model_df$SAFE_2013)

##         
## glm.pred No Yes
##      No  63   9
##      Yes  6  82

mean(glm.pred==training_model_df$SAFE_2013)

## [1] 0.90625

#Run logistic regression to predict mortgage stress with training dataset while average of mortgage repayment is unknown.    

glm.fit2 = glm(SAFE_2013 ~ ERP_7_2013 + 
                          ERP_8_2013 +
                          #RENT_3_2013+
                          HOUSES_3_2013 +
                          LF_4_2011 + 
                          SCHOOL_2_2011 +
                          MIG_4_2013_ADJ_2013 + 
                          INCOME_17_2013 + 
                          Distance,
                        data=training_model_df,family = binomial(link = "logit"))

#Summaries the chosen model 
summary(glm.fit2)

## 
## Call:
## glm(formula = SAFE_2013 ~ ERP_7_2013 + ERP_8_2013 + HOUSES_3_2013 + 
##     LF_4_2011 + SCHOOL_2_2011 + MIG_4_2013_ADJ_2013 + INCOME_17_2013 + 
##     Distance, family = binomial(link = "logit"), data = training_model_df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.63612  -0.18167   0.02452   0.14127   2.01977  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -2.515e+00  5.456e+00  -0.461   0.6449    
## ERP_7_2013           2.784e-01  1.470e-01   1.894   0.0582 .  
## ERP_8_2013          -1.493e-01  7.532e-02  -1.982   0.0475 *  
## HOUSES_3_2013       -1.318e-05  3.372e-06  -3.908  9.3e-05 ***
## LF_4_2011            4.446e-01  3.627e-01   1.226   0.2202    
## SCHOOL_2_2011       -2.837e-02  7.899e-02  -0.359   0.7195    
## MIG_4_2013_ADJ_2013 -3.206e+01  3.013e+01  -1.064   0.2874    
## INCOME_17_2013       2.232e-04  9.368e-05   2.383   0.0172 *  
## Distance             4.144e-05  4.661e-05   0.889   0.3740    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 217.639  on 158  degrees of freedom
## Residual deviance:  64.534  on 150  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 82.534
## 
## Number of Fisher Scoring iterations: 8

#Exame the model 
glm.probs2=predict(glm.fit2,type="response")

glm.pred2=rep("No",160)
glm.pred2[glm.probs2>0.5]="Yes"


table(glm.pred2,training_model_df$SAFE_2013)

##          
## glm.pred2 No Yes
##       No  62  10
##       Yes  7  81

mean(glm.pred2==training_model_df$SAFE_2013)

## [1] 0.89375

Prove the model with test dataset

With the model found, we could now test the model with the test dataset to see if our model is able to confidently predict unsafe mortgage stress areas by using R function of predict().The test error rate found with model glm.fit1 and glm.fit2 were at 0.12 and 0.17 respectively. Both results are very promising, as it suggests we could confidently predict if mortgage stress is within the safe zone using the chosen indicators in the model. The model would be very useful for setting area specific policies to tackle mortgage stress issues. By taking a closer look at the results, it was noticed most of the areas that fall out of the mortgage stress safe zone are also Sydney’s wealthiest. Areas like “Bondi Beach - North Bondi” and “Paddington - Moore Park” which are considered as the country’s most affluent are also on the list.

#Test the model glm.fit1 with test dataset 
glm.probs3=predict(glm.fit1,testing_model_df,type="response")

glm.pred3=rep("No",67)
glm.pred3[glm.probs3>0.5]="Yes"

table(glm.pred3,testing_model_df$SAFE_2013)

##          
## glm.pred3 No Yes
##       No  22   4
##       Yes  4  37

mean(glm.pred3==testing_model_df$SAFE_2013)

## [1] 0.880597

#Test the model glm.fit2 with test dataset 
glm.probs4=predict(glm.fit2,testing_model_df,type="response")

glm.pred4=rep("No",67)
glm.pred4[glm.probs4>0.5]="Yes"

table(glm.pred4,testing_model_df$SAFE_2013)

##          
## glm.pred4 No Yes
##       No  21   6
##       Yes  5  35

mean(glm.pred4==testing_model_df$SAFE_2013)

## [1] 0.8358209

# Display SA2 where the mortgage stress is high (No safe)  
NO_SAFE_2013 <- subset(testing_model_df,SAFE_2013=="No",select=sa2_number)

model_df_description <- read_csv("model_df_description.csv")

## Parsed with column specification:
## cols(
##   sa4 = col_character(),
##   sa3 = col_character(),
##   sa2 = col_character(),
##   sa2_number = col_integer(),
##   X5 = col_character(),
##   X6 = col_character()
## )

DATA_2013 = merge(NO_SAFE_2013,model_df_description, by ="sa2_number")
DATA_2013$sa2

##  [1] "Dural - Kenthurst - Wisemans Ferry"
##  [2] "Waterloo - Beaconsfield"           
##  [3] "Bondi Beach - North Bondi"         
##  [4] "Bondi Junction - Waverly"          
##  [5] "Paddington - Moore Park"           
##  [6] "Coogee - Clovelly"                 
##  [7] "Kensington - Kingsford"            
##  [8] "Belmore - Belfield"                
##  [9] "South Hurstville - Blakehurst"     
## [10] "Kingsgrove (South) - Bardwell Park"
## [11] "Five Dock - Abbotsford"            
## [12] "Ashfield"                          
## [13] "St Leonards - Naremburn"           
## [14] "Gordon - Killara"                  
## [15] "Wahroonga - Warrawee"              
## [16] "Avalon - Palm Beach"               
## [17] "Newport - Bilgola"                 
## [18] "Cromer"                            
## [19] "Freshwater - Brookvale"            
## [20] "Terrey Hills - Duffys Forest"      
## [21] "Katoomba - Leura"                  
## [22] "Eastwood - Denistone"              
## [23] "Caringbah - Lilli Pilli"           
## [24] "Miranda - Yowie Bay"               
## [25] "Sylvania - Taren Point"            
## [26] "Sutherland - Kirrawee"

Conclusion

It was found from our statistical analyses that in the year of 2013, average mortgage stress level of Sydney metropolitan areas was within a safe zone. However, wealthy areas are among the most vulnerable SA2s from mortgage stress perspective. This finding is also corresponding to the findings from our previous research that areas with extremely high monthly mortgage repayments have a tendency to take even bigger mortgage un-proportionally than the all areas average with similar characteristics. Interestingly, a similar conclusion was made by David Taylor (2017) in the ‘THE WORLD TODAY’ that ‘Australia’s wealthiest suburbs among most vulnerable to mortgage stress’.(“Wealthiest suburbs among most vulnerable to mortgage stress - ABC News (Australian Broadcasting Corporation),” 2017)

References

Wealthiest suburbs among most vulnerable to mortgage stress - ABC News (Australian Broadcasting Corporation). (n.d.). Retrieved June 3, 2017, from http://mobile.abc.net.au/news/2017-01-09/wealthiest-suburbs-among-most-vulnerable-postcodes-for-mortgage/8170338

What Percentage Of Income Can You Afford For A Mortgage? (2016, October 4). Retrieved June 3, 2017, from https://www.moneyunder30.com/percentage-income-mortgage-payments