Describe the general area or issue you want to investigate in your data analysis project. = The general area I want to investigate is the housing price index and mortgage rates. I want to show the relation between housing price index and the mortgage rate. My analysis will also consider the recessions. I will show the ways combine these data.
Definition: Mortgage Rate: A mortgage rate is the rate of interest charged on a mortgage.
Housing Price Index: - The House Price Index (HPI) is a broad measure of the movement of single-family property prices in the United States. Aside from serving as an indicator of house price trends, it also functions as an analytical tool for estimating changes in the rates of mortgage defaults, prepayments, and housing affordability - It is published by the Federal Housing Finance Agency (FHFA), using monthly and quarterly data supplied by Fannie Mae and Freddie Mac. - The HPI is one of many economic indicators that investors use to keep a pulse on broader economic trends and potential shifts in the stock market.(source: Investopedia)
Define at least two specific questions you would like to attempt to answer. = The questions are as follows: 1) What are the impacts of mortgage interest rate on house prices Index?
Which state and county in the US has highest growth in housing price Index year over year?
What is the impact of recession on housing price Index and mortgage rates?
Identify the data source(s) you used for your analysis. Provide a URL if possibl. = The data sources i’ll be using is as follows:
Provide a link to the documentation for the data or the documentation itself. Is there a data dictionary? =The technical documentation can be found here :
Freddie Mac publishes the monthly index values of the Freddie Mac House Price Index (FMHPISM) each quarter. Index values are available for the nation, the 50 states and the District of Columbia, and the more than 380 metropolitan statistical areas (MSAs) in the U.S.
Use the tools in R such as str() and summary() to describe the original dataset you imported.
#install the tidyTuesdayR package.
library(tidytuesdayR)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#load tidy tuesday datas
HPandMR<-tt_load('2019-02-05')
## --- Compiling #TidyTuesday Information for 2019-02-05 ----
## --- There are 3 files available ---
## --- Starting Download ---
##
## Downloading file 1 of 3: `mortgage.csv`
## Downloading file 2 of 3: `recessions.csv`
## Downloading file 3 of 3: `state_hpi.csv`
## --- Download complete ---
#separate individual tables of the housing index, recession and mortgage rate.
#mortgage.csv data
mortgage_rate <- HPandMR$mortgage
#recessions.csv data
recessions <- HPandMR$recessions
#state_hpi.csv data
state_hpi <- HPandMR$state_hpi
#summary() and str()
summary(mortgage_rate)
## date fixed_rate_30_yr fees_and_pts_30_yr fixed_rate_15_yr
## Min. :1971-04-02 Min. : 3.310 Min. :0.300 Min. :2.560
## 1st Qu.:1983-03-09 1st Qu.: 5.850 1st Qu.:0.700 1st Qu.:3.765
## Median :1995-02-13 Median : 7.645 Median :1.100 Median :5.680
## Mean :1995-02-13 Mean : 8.084 Mean :1.271 Mean :5.479
## 3rd Qu.:2007-01-19 3rd Qu.: 9.780 3rd Qu.:1.800 3rd Qu.:6.840
## Max. :2018-12-27 Max. :18.630 Max. :2.700 Max. :8.890
## NA's :32 NA's :1065
## fees_and_pts_15_yr adjustable_rate_5_1_hybrid fees_and_pts_5_1_hybrid
## Min. :0.3000 Min. :2.560 Min. :0.20
## 1st Qu.:0.6000 1st Qu.:2.970 1st Qu.:0.50
## Median :0.7000 Median :3.540 Median :0.50
## Mean :0.9025 Mean :4.045 Mean :0.53
## 3rd Qu.:1.0000 3rd Qu.:5.260 3rd Qu.:0.60
## Max. :2.0000 Max. :6.390 Max. :0.90
## NA's :1065 NA's :1762 NA's :1762
## adjustable_margin_5_1_hybrid spread_30_yr_fixed_and_5_1_adjustable
## Min. :2.70 Min. :0.5000
## 1st Qu.:2.74 1st Qu.:0.6800
## Median :2.74 Median :0.7600
## Mean :2.75 Mean :0.7633
## 3rd Qu.:2.76 3rd Qu.:0.8225
## Max. :2.80 Max. :1.0200
## NA's :1762 NA's :2336
str(mortgage_rate)
## spec_tbl_df [2,492 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ date : Date[1:2492], format: "1971-04-02" "1971-04-09" ...
## $ fixed_rate_30_yr : num [1:2492] 7.33 7.31 7.31 7.31 7.29 7.38 7.42 7.44 7.46 7.52 ...
## $ fees_and_pts_30_yr : num [1:2492] NA NA NA NA NA NA NA NA 1 NA ...
## $ fixed_rate_15_yr : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
## $ fees_and_pts_15_yr : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
## $ adjustable_rate_5_1_hybrid : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
## $ fees_and_pts_5_1_hybrid : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
## $ adjustable_margin_5_1_hybrid : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
## $ spread_30_yr_fixed_and_5_1_adjustable: num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. date = col_date(format = ""),
## .. fixed_rate_30_yr = col_double(),
## .. fees_and_pts_30_yr = col_double(),
## .. fixed_rate_15_yr = col_double(),
## .. fees_and_pts_15_yr = col_double(),
## .. adjustable_rate_5_1_hybrid = col_double(),
## .. fees_and_pts_5_1_hybrid = col_double(),
## .. adjustable_margin_5_1_hybrid = col_double(),
## .. spread_30_yr_fixed_and_5_1_adjustable = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(recessions)
## name period_range duration_months
## Length:14 Length:14 Length:14
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## time_since_previous_recession_months peak_unemploy_ment
## Length:14 Length:14
## Class :character Class :character
## Mode :character Mode :character
## gdp_decline_peak_to_trough characteristics
## Length:14 Length:14
## Class :character Class :character
## Mode :character Mode :character
summary(state_hpi)
## year month state price_index
## Min. :1975 Min. : 1.00 Length:26877 Min. : 15.36
## 1st Qu.:1985 1st Qu.: 3.00 Class :character 1st Qu.: 57.52
## Median :1996 Median : 6.00 Mode :character Median : 84.49
## Mean :1996 Mean : 6.49 Mean : 94.09
## 3rd Qu.:2007 3rd Qu.: 9.00 3rd Qu.:126.17
## Max. :2018 Max. :12.00 Max. :381.41
## us_avg
## Min. : 23.46
## 1st Qu.: 52.94
## Median : 77.51
## Mean : 93.13
## 3rd Qu.:131.91
## Max. :189.54
str(state_hpi)
## spec_tbl_df [26,877 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ year : num [1:26877] 1975 1975 1975 1975 1975 ...
## $ month : num [1:26877] 1 2 3 4 5 6 7 8 9 10 ...
## $ state : chr [1:26877] "AK" "AK" "AK" "AK" ...
## $ price_index: num [1:26877] 34.6 35.1 35.5 35.9 36.3 ...
## $ us_avg : num [1:26877] 23.5 23.6 23.8 24.1 24.2 ...
## - attr(*, "spec")=
## .. cols(
## .. year = col_double(),
## .. month = col_double(),
## .. state = col_character(),
## .. price_index = col_double(),
## .. us_avg = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Describe the steps you took to get from your original dataset to the final dataset you used for your analysis. Include the R code in chunks.
#check if any missing data in any of the dataframe's required cols by boolean table.
table(is.na(mortgage_rate$date))
##
## FALSE
## 2492
table(mortgage_rate$date=="")
## < table of extent 0 >
#do the same for fixed_rate_30_yr colum; double check for missing data
table(is.na(mortgage_rate$fixed_rate_30_yr))
##
## FALSE
## 2492
table(mortgage_rate$fixed_rate_30_yr=="")
##
## FALSE
## 2492
#no missing data in these two cols.
#check for fees_and_pts_30_yr column
table(is.na(mortgage_rate$fees_and_pts_30_yr))
##
## FALSE TRUE
## 2460 32
table(mortgage_rate$fees_and_pts_30_yr=="")
##
## FALSE
## 2460
#we see 32 rows missing with missing values. we will fix this by replacing with mean of this column.
mean_fees_and_pts_30_yr = mean(mortgage_rate$fees_and_pts_30_yr, na.rm = TRUE)
mean_fees_and_pts_30_yr
## [1] 1.270776
#replace the missing value for fees_and_pts_30_yr column with the mean
mortgage_rate[is.na(mortgage_rate$fees_and_pts_30_yr), "fees_and_pts_30_yr"] = mean_fees_and_pts_30_yr
#chekc again to see if any missing values in this colume. use either summary or table
table(is.na(mortgage_rate$fees_and_pts_30_yr))
##
## FALSE
## 2492
#summary for this column
summary(mortgage_rate$fees_and_pts_30_yr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.300 0.700 1.100 1.271 1.800 2.700
no missing values for other tables as we can see in the summary above.
#cleaning the recession datafram to readable output for R.
#converting the recession date from character vector to start and end date.
library(stringr)
recession_dates <- recessions %>%
mutate(duration_months = substring(duration_months, 3),
period_range = substring(period_range, 5),
time_since_previous_recession_months = substring(time_since_previous_recession_months, 4),
peak_unemploy_ment = substring(peak_unemploy_ment, 5),
gdp_decline_peak_to_trough = substring(gdp_decline_peak_to_trough, 5),
period_range = case_when(name == "Great Depression" ~ "Aug 1929-Mar 1933",
name == "Great Recession" ~ "Dec 2007-June 2009",
TRUE ~ period_range))
#extract(period_range, c("started_from","lasted_to"), ("([[^-]]+)-([[^-]]+)"))
#sperate the period_range into start and end month and year using lubridate and stringr packages
recession_dates$from <- my(stringr::str_extract(recession_dates$period_range,"^.*?(?=-)"))
recession_dates$to <- my(stringr::str_extract(recession_dates$period_range,"(?=-).*?$"))
Show how you approached the questions you posed at the beginning. Describe how much you were able to accomplish. There should be both graphical and numerical results produced by R code included in chunks. Explain what you did and what it means.
#ggplot for the state_hpi for quick overview
state_hpi %>%
ggplot(aes(year, price_index, color = state, group = state)) +
geom_point(size = 0.3, alpha = 0.7) +
ggtitle("Year and House Price Index chart by state")
#ggplotly(ggplot_state_hpi)
This chart shows that something happened in 2000 with the HPI. so i did a little research on what happened that year for HPI i found that the HPI is scaled so that each state has a value of 100 in Dec 2000. So each data point is the percent of the Dec 2000 price in that state.
#reindexing the price_index and us_average so that the HPI starts from Jan 1975 instead of 2000.
state_hpi %>%
mutate(date = ymd(year*1e4 + month*1e2 + 1)) %>%
group_by(state) %>%
mutate(price_reindex = price_index /price_index[1] * 100,
us_reAvg_hpi = us_avg /us_avg[1] *100) %>%
ungroup() -> state_hpi
#replotting the above chart with the reindex and reaverage
#ggplot for the state_hpifor quick overview
state_hpi %>%
ggplot(aes(year, price_reindex, color = state, group = state))+
geom_point(size = 0.3, alpha = 0.7) +
ylab("HPI_reindexed to Jan 1975 = 100") +
ggtitle("date vs HPI reindexed to jan 1975 by state")
#ggplotly(ggplot_state_hpi2)
recession_dates %>%
extract(period_range, c("from_c","to_c"), "(^.*?(?=-))-((?=-).*?$)") %>%
mutate(from = as.Date(parse_date_time(from_c, "%b %Y")),
to = as.Date(parse_date_time(to_c, "%b %Y")))
##This chart answers the question of: 3) What is the impact of recession on housing price Index and mortgage rates?
#putting all the charts together: house price index, recession, and us_avg index. we get
state_hpi %>%
ggplot() +
geom_line(aes(date,price_index, group = state, color = state)) +
geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf),
alpha = 0.25,
fill = "black",
data = recession_dates) +
xlim(ymd(19750101), NA) +
ylab("HPI / US avg HPI\nreindexed to Jan 1975 = 100")
## Warning: Removed 9 rows containing missing values (geom_rect).
= We see that in every recession, the housing price index has significantlty come down before before it continues its uptrend.
##Chart of us_average house price index.
state_hpi %>%
ggplot()+
geom_line(aes(x = date, y = us_avg))+
geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf),
alpha = 0.25,
fill = "black",
data = recession_dates) +
xlim(ymd(19750101), NA) +
ylab("US avg HPI Jan 1975 = 100") +
ggtitle("US Average house price Index vs date, including recession periods")
## Warning: Removed 9 rows containing missing values (geom_rect).
#first only the mortgage rate chart with time
mortgage_rate %>%
ggplot(aes(date, fixed_rate_30_yr)) +
geom_line(color = "blue") +
geom_line(aes(date, fees_and_pts_30_yr),
alpha = 0.3) +
ggtitle("Mortgage rate chart, fees and points vs time")
#ggplotly(mortgage_rate_chart1)
This chart answers the questio of:
state_hpi %>%
ggplot()+
geom_line(aes(x = date, y = us_avg))+
geom_line(aes(x = date, y = fixed_rate_30_yr),
alpha = 0.25,
color = "blue",
data = mortgage_rate) +
xlim(ymd(19750101), NA) +
ylab("house price index + mortgage rate") +
ggtitle("US Average house price Index with relation to 30 years fixed mortgage rate")
## Warning: Removed 196 row(s) containing missing values (geom_path).
= The housing price index and mortgage seems to have negetive corelation. We see the mortgage rate is in the down trend whereas the housing price index is uptrending with time
group_by_year <- state_hpi %>%
group_by(year) %>%
top_n(1, price_index) %>%
ungroup() %>%
arrange(year)
group_by_year
##plot the graph of the group by year for HPI
This chart answers the question of: 2) Which state and county in the US has highest growth in housing price Index year over year?
group_by_year %>%
ggplot(aes(x= year, y = price_index, color = state))+
geom_point(size = 0.5) +
ggtitle("top state with Highest Housing Price Index by each year since 1975")
#ggplotly(p1)
= WE can see the top state with highest housing price index each year.
Hence, These answers all of the questions i was trying to answer with this project about houseing price index, mortgage rate and recession.