Data Analysis Final

Rabin Gora

4/29/2022

Issue Description

Describe the general area or issue you want to investigate in your data analysis project. = The general area I want to investigate is the housing price index and mortgage rates. I want to show the relation between housing price index and the mortgage rate. My analysis will also consider the recessions. I will show the ways combine these data.

Definition: Mortgage Rate: A mortgage rate is the rate of interest charged on a mortgage.

Housing Price Index: - The House Price Index (HPI) is a broad measure of the movement of single-family property prices in the United States. Aside from serving as an indicator of house price trends, it also functions as an analytical tool for estimating changes in the rates of mortgage defaults, prepayments, and housing affordability - It is published by the Federal Housing Finance Agency (FHFA), using monthly and quarterly data supplied by Fannie Mae and Freddie Mac. - The HPI is one of many economic indicators that investors use to keep a pulse on broader economic trends and potential shifts in the stock market.(source: Investopedia)

Questions

Define at least two specific questions you would like to attempt to answer. = The questions are as follows: 1) What are the impacts of mortgage interest rate on house prices Index?

  1. Which state and county in the US has highest growth in housing price Index year over year?

  2. What is the impact of recession on housing price Index and mortgage rates?

Data Source

Identify the data source(s) you used for your analysis. Provide a URL if possibl. = The data sources i’ll be using is as follows:

  1. https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-02-05

Documentation

Provide a link to the documentation for the data or the documentation itself. Is there a data dictionary? =The technical documentation can be found here :

https://github.com/rfordatascience/tidytuesday/blob/master/data/2019/2019-02-05/FM_HPI_Technical_Description.docx.

Freddie Mac publishes the monthly index values of the Freddie Mac House Price Index (FMHPISM) each quarter. Index values are available for the nation, the 50 states and the District of Columbia, and the more than 380 metropolitan statistical areas (MSAs) in the U.S.

Description of the Data

Use the tools in R such as str() and summary() to describe the original dataset you imported.

#install the tidyTuesdayR package.
library(tidytuesdayR)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#load tidy tuesday datas
HPandMR<-tt_load('2019-02-05')
## --- Compiling #TidyTuesday Information for 2019-02-05 ----
## --- There are 3 files available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 3: `mortgage.csv`
##  Downloading file 2 of 3: `recessions.csv`
##  Downloading file 3 of 3: `state_hpi.csv`
## --- Download complete ---
#separate individual tables of the housing index, recession and mortgage rate.

#mortgage.csv data
mortgage_rate <- HPandMR$mortgage

#recessions.csv data
recessions <- HPandMR$recessions

#state_hpi.csv data
state_hpi <- HPandMR$state_hpi
#summary() and str()
summary(mortgage_rate)
##       date            fixed_rate_30_yr fees_and_pts_30_yr fixed_rate_15_yr
##  Min.   :1971-04-02   Min.   : 3.310   Min.   :0.300      Min.   :2.560   
##  1st Qu.:1983-03-09   1st Qu.: 5.850   1st Qu.:0.700      1st Qu.:3.765   
##  Median :1995-02-13   Median : 7.645   Median :1.100      Median :5.680   
##  Mean   :1995-02-13   Mean   : 8.084   Mean   :1.271      Mean   :5.479   
##  3rd Qu.:2007-01-19   3rd Qu.: 9.780   3rd Qu.:1.800      3rd Qu.:6.840   
##  Max.   :2018-12-27   Max.   :18.630   Max.   :2.700      Max.   :8.890   
##                                        NA's   :32         NA's   :1065    
##  fees_and_pts_15_yr adjustable_rate_5_1_hybrid fees_and_pts_5_1_hybrid
##  Min.   :0.3000     Min.   :2.560              Min.   :0.20           
##  1st Qu.:0.6000     1st Qu.:2.970              1st Qu.:0.50           
##  Median :0.7000     Median :3.540              Median :0.50           
##  Mean   :0.9025     Mean   :4.045              Mean   :0.53           
##  3rd Qu.:1.0000     3rd Qu.:5.260              3rd Qu.:0.60           
##  Max.   :2.0000     Max.   :6.390              Max.   :0.90           
##  NA's   :1065       NA's   :1762               NA's   :1762           
##  adjustable_margin_5_1_hybrid spread_30_yr_fixed_and_5_1_adjustable
##  Min.   :2.70                 Min.   :0.5000                       
##  1st Qu.:2.74                 1st Qu.:0.6800                       
##  Median :2.74                 Median :0.7600                       
##  Mean   :2.75                 Mean   :0.7633                       
##  3rd Qu.:2.76                 3rd Qu.:0.8225                       
##  Max.   :2.80                 Max.   :1.0200                       
##  NA's   :1762                 NA's   :2336
str(mortgage_rate)
## spec_tbl_df [2,492 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date                                 : Date[1:2492], format: "1971-04-02" "1971-04-09" ...
##  $ fixed_rate_30_yr                     : num [1:2492] 7.33 7.31 7.31 7.31 7.29 7.38 7.42 7.44 7.46 7.52 ...
##  $ fees_and_pts_30_yr                   : num [1:2492] NA NA NA NA NA NA NA NA 1 NA ...
##  $ fixed_rate_15_yr                     : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
##  $ fees_and_pts_15_yr                   : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
##  $ adjustable_rate_5_1_hybrid           : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
##  $ fees_and_pts_5_1_hybrid              : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
##  $ adjustable_margin_5_1_hybrid         : num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
##  $ spread_30_yr_fixed_and_5_1_adjustable: num [1:2492] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   fixed_rate_30_yr = col_double(),
##   ..   fees_and_pts_30_yr = col_double(),
##   ..   fixed_rate_15_yr = col_double(),
##   ..   fees_and_pts_15_yr = col_double(),
##   ..   adjustable_rate_5_1_hybrid = col_double(),
##   ..   fees_and_pts_5_1_hybrid = col_double(),
##   ..   adjustable_margin_5_1_hybrid = col_double(),
##   ..   spread_30_yr_fixed_and_5_1_adjustable = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(recessions)
##      name           period_range       duration_months   
##  Length:14          Length:14          Length:14         
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##  time_since_previous_recession_months peak_unemploy_ment
##  Length:14                            Length:14         
##  Class :character                     Class :character  
##  Mode  :character                     Mode  :character  
##  gdp_decline_peak_to_trough characteristics   
##  Length:14                  Length:14         
##  Class :character           Class :character  
##  Mode  :character           Mode  :character
summary(state_hpi)
##       year          month          state            price_index    
##  Min.   :1975   Min.   : 1.00   Length:26877       Min.   : 15.36  
##  1st Qu.:1985   1st Qu.: 3.00   Class :character   1st Qu.: 57.52  
##  Median :1996   Median : 6.00   Mode  :character   Median : 84.49  
##  Mean   :1996   Mean   : 6.49                      Mean   : 94.09  
##  3rd Qu.:2007   3rd Qu.: 9.00                      3rd Qu.:126.17  
##  Max.   :2018   Max.   :12.00                      Max.   :381.41  
##      us_avg      
##  Min.   : 23.46  
##  1st Qu.: 52.94  
##  Median : 77.51  
##  Mean   : 93.13  
##  3rd Qu.:131.91  
##  Max.   :189.54
str(state_hpi)
## spec_tbl_df [26,877 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ year       : num [1:26877] 1975 1975 1975 1975 1975 ...
##  $ month      : num [1:26877] 1 2 3 4 5 6 7 8 9 10 ...
##  $ state      : chr [1:26877] "AK" "AK" "AK" "AK" ...
##  $ price_index: num [1:26877] 34.6 35.1 35.5 35.9 36.3 ...
##  $ us_avg     : num [1:26877] 23.5 23.6 23.8 24.1 24.2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   year = col_double(),
##   ..   month = col_double(),
##   ..   state = col_character(),
##   ..   price_index = col_double(),
##   ..   us_avg = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Cleaning and Preparation

Describe the steps you took to get from your original dataset to the final dataset you used for your analysis. Include the R code in chunks.

#check if any missing data in any of the dataframe's required cols by boolean table.
table(is.na(mortgage_rate$date)) 
## 
## FALSE 
##  2492
table(mortgage_rate$date=="")       
## < table of extent 0 >
#do the same for fixed_rate_30_yr colum; double check for missing data
table(is.na(mortgage_rate$fixed_rate_30_yr))
## 
## FALSE 
##  2492
table(mortgage_rate$fixed_rate_30_yr=="")
## 
## FALSE 
##  2492
#no missing data in these two cols.
#check for fees_and_pts_30_yr column
table(is.na(mortgage_rate$fees_and_pts_30_yr))
## 
## FALSE  TRUE 
##  2460    32
table(mortgage_rate$fees_and_pts_30_yr=="")
## 
## FALSE 
##  2460
#we see 32 rows missing with missing values. we will fix this by replacing with mean of this column.
mean_fees_and_pts_30_yr = mean(mortgage_rate$fees_and_pts_30_yr, na.rm = TRUE)
mean_fees_and_pts_30_yr
## [1] 1.270776
#replace the missing value for fees_and_pts_30_yr column with the mean
mortgage_rate[is.na(mortgage_rate$fees_and_pts_30_yr), "fees_and_pts_30_yr"] = mean_fees_and_pts_30_yr

#chekc again to see if any missing values in this colume. use either summary or table
table(is.na(mortgage_rate$fees_and_pts_30_yr))
## 
## FALSE 
##  2492
#summary for this column
summary(mortgage_rate$fees_and_pts_30_yr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.300   0.700   1.100   1.271   1.800   2.700

no missing values for other tables as we can see in the summary above.

#cleaning the recession datafram to readable output for R.
#converting the recession date from character vector to start and end date.
library(stringr)

recession_dates <- recessions %>% 
  mutate(duration_months = substring(duration_months, 3),
         period_range = substring(period_range, 5),
         time_since_previous_recession_months = substring(time_since_previous_recession_months, 4),
         peak_unemploy_ment = substring(peak_unemploy_ment, 5),
         gdp_decline_peak_to_trough = substring(gdp_decline_peak_to_trough, 5),
         period_range = case_when(name == "Great Depression" ~ "Aug 1929-Mar 1933",
                                  name == "Great Recession" ~ "Dec 2007-June 2009",
                                  TRUE ~ period_range))
  #extract(period_range, c("started_from","lasted_to"), ("([[^-]]+)-([[^-]]+)")) 

#sperate the period_range into start and end month and year using lubridate and stringr packages
recession_dates$from <- my(stringr::str_extract(recession_dates$period_range,"^.*?(?=-)"))  
recession_dates$to <- my(stringr::str_extract(recession_dates$period_range,"(?=-).*?$"))

Final Results

Show how you approached the questions you posed at the beginning. Describe how much you were able to accomplish. There should be both graphical and numerical results produced by R code included in chunks. Explain what you did and what it means.

#ggplot for the state_hpi for quick overview
ggplot_state_hpi <- state_hpi %>%
  ggplot(aes(year, price_index, color = state, group = state)) +
  geom_point(size = 0.3, alpha = 0.7) +
  ggtitle("Year and House Price Index chart by state")
  
ggplotly(ggplot_state_hpi)

This chart shows that something happened in 2000 with the HPI. so i did a little research on what happened that year for HPI i found that the HPI is scaled so that each state has a value of 100 in Dec 2000. So each data point is the percent of the Dec 2000 price in that state.

#reindexing the price_index and us_average so that the HPI starts from Jan 1975 instead of 2000.

state_hpi %>%
  mutate(date = ymd(year*1e4 + month*1e2 + 1)) %>%
  group_by(state) %>%
  mutate(price_reindex = price_index /price_index[1] * 100, 
         us_reAvg_hpi = us_avg /us_avg[1] *100) %>%
  ungroup() -> state_hpi
#replotting the above chart with the reindex and reaverage
#ggplot for the state_hpifor quick overview
ggplot_state_hpi2 <- state_hpi %>%
  ggplot(aes(year, price_reindex, color = state, group = state))+
  geom_point(size = 0.3, alpha = 0.7) +
  ylab("HPI_reindexed to Jan 1975 = 100") +
  ggtitle("date vs HPI reindexed to jan 1975 by state")

  
ggplotly(ggplot_state_hpi2)
recession_dates %>%
   extract(period_range, c("from_c","to_c"), "(^.*?(?=-))-((?=-).*?$)") %>% 
  mutate(from = as.Date(parse_date_time(from_c, "%b %Y")), 
         to = as.Date(parse_date_time(to_c, "%b %Y")))
## # A tibble: 14 x 10
##    name           from_c to_c  duration_months time_since_prev~ peak_unemploy_m~
##    <chr>          <chr>  <chr> <chr>           <chr>            <chr>           
##  1 Great Depress~ <NA>   <NA>  3 years7 months "1 year9 months" "21.3%(1932)[45~
##  2 Recession of ~ <NA>   <NA>  1 year1 month   "4 years2 month~ "17.8%[45]-\n19~
##  3 Recession of ~ <NA>   <NA>  8 months        "6 years8 month~ "5.2%[52](1946)"
##  4 Recession of ~ <NA>   <NA>  11 months       "3 years1 month" "7.9%(Oct 1949)"
##  5 Recession of ~ <NA>   <NA>  10 months       "3 years9 month~ "6.1%(Sep 1954)"
##  6 Recession of ~ <NA>   <NA>  8 months        "3 years3 month~ "7.5%(July 1958~
##  7 Recession of ~ <NA>   <NA>  10 months       "2 years"        "7.1%(May 1961)"
##  8 Recession of ~ <NA>   <NA>  11 months       "8 years10 mont~ " 6.1%(Dec 1970~
##  9 1973-75 reces~ <NA>   <NA>  1 year4 months  "3 years"        " 9.0%(May 1975~
## 10 1980 recession <NA>   <NA>  6 months        "4 years10 mont~ " 7.8%(July 198~
## 11 1981-1982 rec~ <NA>   <NA>  1 year4 months  "1 year"         " 10.8%(Nov 198~
## 12 Early 1990s r~ <NA>   <NA>  8 months        "7 years8 month~ " 7.8%(June 199~
## 13 Early 2000s r~ <NA>   <NA>  8 months        "10 years"       " 6.3%(June 200~
## 14 Great Recessi~ <NA>   <NA>  1 year6 months  "6 years\n1 mon~ "10.0%(October ~
## # ... with 4 more variables: gdp_decline_peak_to_trough <chr>,
## #   characteristics <chr>, from <date>, to <date>

##This chart answers the question of: 3) What is the impact of recession on housing price Index and mortgage rates?

#putting all the charts together: house price index, recession, and us_avg index. we get 
state_hpi %>% 
  ggplot() +
  geom_line(aes(date,price_index, group = state, color = state)) +
  geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf), 
            alpha = 0.25, 
            fill = "black", 
            data = recession_dates) +
  xlim(ymd(19750101), NA) +
  ylab("HPI / US avg HPI\nreindexed to Jan 1975 = 100")
## Warning: Removed 9 rows containing missing values (geom_rect).

= We see that in every recession, the housing price index has significantlty come down before before it continues its uptrend.

##Chart of us_average house price index.

state_hpi %>%
  ggplot()+
  geom_line(aes(x = date, y = us_avg))+
   geom_rect(aes(xmin = from, xmax = to, ymin = -Inf, ymax = Inf), 
             alpha = 0.25, 
             fill = "black",
             data = recession_dates) +
  xlim(ymd(19750101), NA) +
  ylab("US avg HPI Jan 1975 = 100") +
  ggtitle("US Average house price Index vs date, including recession periods")
## Warning: Removed 9 rows containing missing values (geom_rect).

Mortgage rate 30 year with fees and points chart

#first only the mortgage rate chart with time
mortgage_rate_chart1 <- mortgage_rate %>%
  ggplot(aes(date, fixed_rate_30_yr)) +
  geom_line(color = "blue") +
  geom_line(aes(date, fees_and_pts_30_yr), 
            alpha = 0.3) +
  ggtitle("Mortgage rate chart, fees and points vs time")

ggplotly(mortgage_rate_chart1)

combing mortgage rate and average house price index. It let us see what effect the mortgage rate has on average house price index

This chart answers the questio of:

  1. What are the impacts of mortgage interest rate on house prices Index?
state_hpi %>%
  ggplot()+
  geom_line(aes(x = date, y = us_avg))+
   geom_line(aes(x = date, y = fixed_rate_30_yr), 
             alpha = 0.25, 
             color = "blue",
             data = mortgage_rate) +
  xlim(ymd(19750101), NA) +
  ylab("house price index + mortgage rate") +
  ggtitle("US Average house price Index with relation to 30 years fixed mortgage rate")
## Warning: Removed 196 row(s) containing missing values (geom_path).

= The housing price index and mortgage seems to have negetive corelation. We see the mortgage rate is in the down trend whereas the housing price index is uptrending with time

Highest HPI by year and state

group_by_year <- state_hpi %>% 
  group_by(year) %>% 
  top_n(1, price_index) %>% 
  ungroup() %>% 
  arrange(year) 

group_by_year
## # A tibble: 44 x 8
##     year month state price_index us_avg date       price_reindex us_reAvg_hpi
##    <dbl> <dbl> <chr>       <dbl>  <dbl> <date>             <dbl>        <dbl>
##  1  1975    11 WV           48.6   24.8 1975-11-01          120.         106.
##  2  1976    12 ND           49.1   27.2 1976-12-01          122.         116.
##  3  1977    12 WV           53.0   31.2 1977-12-01          130.         133.
##  4  1978    12 WV           59.3   35.4 1978-12-01          146.         151.
##  5  1979    12 OK           67.0   39.9 1979-12-01          179.         170.
##  6  1980    11 OK           71.2   42.8 1980-11-01          190.         182.
##  7  1981    12 OK           77.4   44.7 1981-12-01          206.         190.
##  8  1982    10 OK           82.6   45.0 1982-10-01          220.         192.
##  9  1983    12 AK           85.8   47.8 1983-12-01          248.         204.
## 10  1984     1 AK           85.8   48.0 1984-01-01          248.         204.
## # ... with 34 more rows

##plot the graph of the group by year for HPI

This chart answers the question of: 2) Which state and county in the US has highest growth in housing price Index year over year?

p1 <- group_by_year %>%
  ggplot(aes(x= year, y = price_index,  color = state))+
  geom_point(size = 0.5) +
  ggtitle("top state with Highest Housing Price Index by each year since 1975")


ggplotly(p1)

= WE can see the top state with highest housing price index each year.

Hence, These answers all of the questions i was trying to answer with this project about houseing price index, mortgage rate and recession.