Data Preparation

The Housing Affordability Data System (HADS) is a set of files derived from the 1985 and later national American Housing Survey (AHS) and the 2002 and later Metro AHS. This system categorizes housing units by affordability and households by income, with respect to the Adjusted Median Income, Fair Market Rent (FMR), and poverty income. It also includes housing cost burden for owner and renter households.

# load data from github

house <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA606/master/housing_affordability.csv", header=TRUE, sep=",")

summary(house)
##     ï..AGE1           METRO3          REGION           LMED       
##  Min.   :13.00   Min.   :1.000   Min.   :1.000   Min.   : 38500  
##  1st Qu.:38.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 60300  
##  Median :52.00   Median :2.000   Median :2.000   Median : 64600  
##  Mean   :52.18   Mean   :2.227   Mean   :2.394   Mean   : 68110  
##  3rd Qu.:65.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.: 74008  
##  Max.   :93.00   Max.   :5.000   Max.   :4.000   Max.   :115300  
##  NA's   :4438                                                    
##       IPOV           BEDRMS         BUILT           TYPE      
##  Min.   :11057   Min.   :0.00   Min.   :1919   Min.   :1.000  
##  1st Qu.:12036   1st Qu.:2.00   1st Qu.:1950   1st Qu.:1.000  
##  Median :15470   Median :3.00   Median :1970   Median :1.000  
##  Mean   :17168   Mean   :2.66   Mean   :1966   Mean   :1.065  
##  3rd Qu.:18639   3rd Qu.:3.00   3rd Qu.:1985   3rd Qu.:1.000  
##  Max.   :51635   Max.   :7.00   Max.   :2013   Max.   :9.000  
##  NA's   :4438                                                 
##      VALUE             ROOMS            ZINC2             ZSMHC      
##  Min.   :      1   Min.   : 1.000   Min.   :   -117   Min.   :    0  
##  1st Qu.: 100000   1st Qu.: 4.000   1st Qu.:  19974   1st Qu.:  510  
##  Median : 180000   Median : 5.000   Median :  44973   Median :  899  
##  Mean   : 246763   Mean   : 5.631   Mean   :  65887   Mean   : 1140  
##  3rd Qu.: 300000   3rd Qu.: 7.000   3rd Qu.:  85600   3rd Qu.: 1454  
##  Max.   :2520000   Max.   :15.000   Max.   :1061921   Max.   :10667  
##  NA's   :27389                      NA's   :4438      NA's   :4438   
##      TOTSAL              FMTMETRO3           FMTBUILT    
##  Min.   :     0   Central City:21493    Pre 1940 :10058  
##  1st Qu.:     0   Nonmetro    :11255   1940-1959 :11078  
##  Median : 28000   Suburb      :31787   1960-1979 :19685  
##  Mean   : 48228                        1980-1989 : 8234  
##  3rd Qu.: 70000                        1990-1999 : 7533  
##  Max.   :698886                        2000-2009 : 7176  
##  NA's   :4438                          After 2010:  771  
##         FMTSTRUCTURETYPE    FMTBEDRMS        FMTOWNRENT   
##                .:    2   0 Studio:  622   1 Owner :37146  
##  1 Single Family:41271   1 1BR   : 9821   2 Renter:27389  
##  2 2-4 units    : 6257   2 2BR   :16401                   
##  3 5-19 units   : 7273   3 3BR   :24850                   
##  4 20-49 units  : 2719   4 4BR+  :12841                   
##  5 50+ units    : 4570                                    
##  6 Mobile Home  : 2443                                    
##      FMTREGION        FMTSTATUS    
##  Midwest  :17400   Occupied:60097  
##  Northeast:16519   Vacant  : 4438  
##  South    :19260                   
##  West     :11356                   
##                                    
##                                    
## 
names(house)
##  [1] "ï..AGE1"           "METRO3"           "REGION"          
##  [4] "LMED"             "IPOV"             "BEDRMS"          
##  [7] "BUILT"            "TYPE"             "VALUE"           
## [10] "ROOMS"            "ZINC2"            "ZSMHC"           
## [13] "TOTSAL"           "FMTMETRO3"        "FMTBUILT"        
## [16] "FMTSTRUCTURETYPE" "FMTBEDRMS"        "FMTOWNRENT"      
## [19] "FMTREGION"        "FMTSTATUS"

The main purpose of this project is to examine housing affordability. How are people able to buy a house? Is it because several members of the household each earn an income, or because houses are passed from parents to the next generation?

Owning a house is a priority for most people. The more working family members who can contribute money toward a purchase, the sooner the family can own a house.

Being able to afford a house means having a source of income, usually a job. It also suggests that the economy of the state or country is doing well, because a job makes it possible to support a family; otherwise you have to rent an apartment and share it with other family members.

So owning a house is related to some extent to a good economy. It also relates to education: with a good college degree you can get a job that pays a higher salary, and the more money you have, the more easily you can afford to buy a house for yourself and your family.

The original file is very large, so I used IBM SPSS to convert the xpt file to a csv file and then used Excel to keep only the information needed for my project. I have uploaded the converted file to GitHub.
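The column selection done in Excel could also be done in R. The following is only a minimal sketch under assumptions: the toy data frame `full`, the extra column `EXTRA`, and the vector `keep` are made up for illustration and are not the real HADS file.

```r
# Toy stand-in for the full converted table (values are made up)
full <- data.frame(
  AGE1 = c(45, 62, NA),
  ZINC2 = c(52000, 31000, NA),
  FMTOWNRENT = c("1 Owner", "2 Renter", "2 Renter"),
  EXTRA = c(1, 2, 3),               # hypothetical column we do not need
  stringsAsFactors = FALSE
)

# Keep only the columns needed for the project
keep <- c("AGE1", "ZINC2", "FMTOWNRENT")
subset_df <- full[, keep]

# Write the reduced table to a csv file and read it back to check it
out_file <- tempfile(fileext = ".csv")
write.csv(subset_df, out_file, row.names = FALSE)
back <- read.csv(out_file)
names(back)
```

Doing the selection in code keeps a record of exactly which columns were kept.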

# The first column is AGE1; its name picked up stray encoding characters on import, so rename it by position
colnames(house)[1] <- "AGE"

names(house)
##  [1] "AGE"              "METRO3"           "REGION"          
##  [4] "LMED"             "IPOV"             "BEDRMS"          
##  [7] "BUILT"            "TYPE"             "VALUE"           
## [10] "ROOMS"            "ZINC2"            "ZSMHC"           
## [13] "TOTSAL"           "FMTMETRO3"        "FMTBUILT"        
## [16] "FMTSTRUCTURETYPE" "FMTBEDRMS"        "FMTOWNRENT"      
## [19] "FMTREGION"        "FMTSTATUS"
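Stray characters in front of the first column name are typical of a csv file saved with a byte-order mark (BOM), which Excel often adds. Assuming that is the cause here, the sketch below shows how `read.csv` can strip the mark on import through its `fileEncoding` argument. The temporary file and its two columns are made up for illustration.

```r
# Build a small csv file that starts with a UTF-8 byte-order mark
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)                # the BOM bytes
writeBin(charToRaw("AGE1,METRO3\n45,1\n62,2\n"), con)     # made-up rows
close(con)

# "UTF-8-BOM" tells R to drop the mark while reading
clean <- read.csv(tmp, header = TRUE, sep = ",", fileEncoding = "UTF-8-BOM")
names(clean)
```

With the mark removed at import, the first column arrives as `AGE1` and can be renamed by name.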

Data collection

The purpose of these datasets is to provide housing analysts with consistent measures of affordability and burdens over a long period. The datasets are based on the American Housing Survey (AHS) national files from 1985 through 2009 and the metropolitan files from 2002 through 2009.

The American Housing Survey tracks housing structures across the United States.

A collection of tables, most with one row per housing unit.

A complex sample survey designed to generalize to both occupied and vacant housing units across the United States and also for about twenty-five metropolitan areas.

Released more or less biennially since 1973.

Sponsored by the Department of Housing and Urban Development (HUD) and conducted by the U.S. Census Bureau.

Data Source

I got the data from the government website.

http://asdfree.com/american-housing-survey-ahs.html

https://www.huduser.gov/portal/datasets/hads/hads.html

http://www.census.gov/hhes/poverty/threshld.html

Cases

Each case (record) represents one housing unit; for occupied units the record also describes the household living in it.

# check how many cases there are
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
count(house, vars = "AGE")
## # A tibble: 1 x 2
##   vars      n
##   <chr> <int>
## 1 AGE   64535

There are 64,535 cases in this dataset: 60,097 occupied units and 4,438 vacant units, which have no age recorded.
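The number of cases can also be read with base R, without dplyr. This is a small sketch on a made-up data frame `toy` that reuses the column names `AGE` and `FMTSTATUS`; the values are invented.

```r
# Made-up rows: vacant units have no age recorded
toy <- data.frame(
  AGE = c(34, NA, 58, 71, NA),
  FMTSTATUS = c("Occupied", "Vacant", "Occupied", "Occupied", "Vacant")
)

n_cases <- nrow(toy)                    # one row per case
n_missing_age <- sum(is.na(toy$AGE))    # rows with no age
status_counts <- table(toy$FMTSTATUS)   # occupied versus vacant

n_cases
n_missing_age
status_counts
```

On the real data, `nrow(house)` and `sum(is.na(house$AGE))` would give the same kind of counts.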

hist(house$AGE)

Most of the people in this dataset are between 30 and 70 years old. There are certainly people over 70 and under 30, but I focus on ages 30 to 70 because this is generally the working age in a family.
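The focus on ages 30 to 70 can be expressed as a filter. The sketch below uses a made-up data frame `toy`; the cut-offs 30 and 70 come from the text, everything else is illustrative.

```r
# Made-up ages, including one missing value
toy <- data.frame(AGE = c(22, 30, 45, 70, 83, NA))

# Keep rows with a recorded age between 30 and 70 (inclusive)
in_range <- !is.na(toy$AGE) & toy$AGE >= 30 & toy$AGE <= 70
working_age <- toy[in_range, , drop = FALSE]

# Share of recorded ages that fall in the range
share <- nrow(working_age) / sum(!is.na(toy$AGE))
share
```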

Type of study

This data is collected by the government to track the factors behind rising and falling housing and rental costs. It is similar to the census that the government conducts every 10 years or so, and it shows what is happening in the real world of housing ownership.

The main concern is the ups and downs of ownership and renting for a particular house or apartment. If more people are renting, it suggests that people are moving into the area from other places; if more people are buying, it suggests that they have a stable job and want to settle down with their family.

This is an observational study. The government is not running an experiment.

Dependent Variable

Own or rent the house or apartment

In this project, my main concern is whether houses and apartments are owned or rented. If possible, I would also like to find out whether certain factors make people rent or own.

Independent Variable

Age, Number of People at home, Income, Monthly housing cost

There are many independent variables in this dataset, but I will mostly look at the age, the number of people at home, and the income of the family, which make them good candidates to buy a house.

# compare age and own or rent house or apartment
plot(house$AGE ~ house$FMTOWNRENT)

According to the plot, the typical (median) age of people who own their home is around 55, and the typical age of people who rent is around 42.
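Plotting a numeric variable against a factor draws box plots, whose centre line is the median. The values read from the plot can be checked numerically with `tapply`. This is a sketch on made-up rows, not the real figures.

```r
# Made-up ages by tenure; one renter has no age recorded
toy <- data.frame(
  AGE = c(60, 52, 48, 35, 41, 29, NA),
  FMTOWNRENT = c("1 Owner", "1 Owner", "1 Owner",
                 "2 Renter", "2 Renter", "2 Renter", "2 Renter")
)

# Median and mean age within each tenure group, ignoring missing ages
median_by_tenure <- tapply(toy$AGE, toy$FMTOWNRENT, median, na.rm = TRUE)
mean_by_tenure <- tapply(toy$AGE, toy$FMTOWNRENT, mean, na.rm = TRUE)

median_by_tenure
mean_by_tenure
```

On the real data the same call would be `tapply(house$AGE, house$FMTOWNRENT, median, na.rm = TRUE)`.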

# to see the type of house owned or rent
plot(house$FMTSTRUCTURETYPE ~ house$FMTOWNRENT)

According to the plot, single-family houses are owned at a much higher rate than apartments. More than 90% of single-family houses are owned, whereas large apartment complexes with many units are mostly rented on a monthly basis.
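The owned and rented shares within each structure type can be computed with a two-way table. The sketch below uses made-up rows with the same column names; the resulting shares are illustrative only.

```r
# Made-up rows: structure type and tenure
toy <- data.frame(
  FMTSTRUCTURETYPE = c("1 Single Family", "1 Single Family", "1 Single Family",
                       "1 Single Family", "5 50+ units", "5 50+ units"),
  FMTOWNRENT = c("1 Owner", "1 Owner", "1 Owner",
                 "2 Renter", "2 Renter", "2 Renter")
)

# Counts per structure type (rows) and tenure (columns)
counts <- table(toy$FMTSTRUCTURETYPE, toy$FMTOWNRENT)

# margin = 1 makes each row sum to 1, giving shares within a structure type
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```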

#house structure
plot(house$FMTSTRUCTURETYPE)

About 64% of the housing units in the dataset (41,271 of 64,535) are single-family houses. About 32% are in apartment buildings ranging from 2 to 50 or more units, and about 4% are mobile homes.
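The shares can be checked directly from the FMTSTRUCTURETYPE counts printed in the summary output above. The counts below are copied from that output; only the vector name `structure_counts` is introduced here.

```r
# Counts copied from the summary(house) output for FMTSTRUCTURETYPE
structure_counts <- c(
  "." = 2,
  "1 Single Family" = 41271,
  "2 2-4 units" = 6257,
  "3 5-19 units" = 7273,
  "4 20-49 units" = 2719,
  "5 50+ units" = 4570,
  "6 Mobile Home" = 2443
)

total <- sum(structure_counts)

# Percentage of all cases in each structure type, to one decimal place
pct <- round(100 * structure_counts / total, 1)
pct
```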

#income vs ownership
plot(house$LMED ~ house$FMTOWNRENT)

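According to the plot, there is no visible relation between income and ownership: the distribution of LMED is almost the same for owners and renters, so at the same income level people buy and rent in similar proportions. In the HADS documentation, LMED is the median income of the surrounding area, while the household's own income is recorded in ZINC2.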

#BUILT VS OWN OR RENT
plot(house$BUILT ~ house$FMTOWNRENT)

According to the plot, a large share of the houses and apartments that are owned or rented were built around the 1970s.

After that, the economy weakened during the recession and high inflation of the 1970s (https://www.investopedia.com/articles/economics/09/1970s-great-inflation.asp), which might be one of the reasons people bought fewer houses afterwards.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
ggplot(house, aes(x=AGE)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4438 rows containing non-finite values (stat_bin).

scatter145 <- ggplot(data=house, aes(AGE, LMED)) + geom_point(size=2) + 
  xlab("AGE") +  
  ylab("INCOME")+ 
  ggtitle("DOES INCOME LEAD TO OWNING OR RENTING")+ 
  geom_smooth(method = "lm")

scatter145
## Warning: Removed 4438 rows containing non-finite values (stat_smooth).
## Warning: Removed 4438 rows containing missing values (geom_point).

This scatter plot shows income (LMED) against age, with a fitted linear trend line.
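The trend line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit, so its slope can be read by fitting the same model with `lm()`. The sketch below uses made-up values chosen to lie on a straight line; it does not reproduce the real data.

```r
# Made-up ages and incomes; the last row is missing, like the vacant units
toy <- data.frame(
  AGE = c(25, 35, 45, 55, 65, NA),
  LMED = c(60000, 62000, 64000, 66000, 68000, NA)
)

# lm() drops rows with missing values by default
fit <- lm(LMED ~ AGE, data = toy)

# Change in income per additional year of age
slope <- coef(fit)[["AGE"]]
intercept <- coef(fit)[["(Intercept)"]]
slope
```

A slope near zero on the real data would mean income changes little with age.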

Conclusion

In conclusion, people aged 50 and over own houses in higher numbers than younger people, most likely because they have more money. But income is not the only route to owning a house. There might be other reasons, such as a family house or a relative's house being passed to the nearest living relative. The dataset does not record anything like that, but the plots suggest that an individual's income alone does not lead to owning a house.

All the data and information are collected and referenced from https://www.huduser.gov/portal/datasets/hads/hads.html