DATA 606 Data Project Proposal

Data Preparation

The Housing Affordability Data System (HADS) is a set of files derived from the 1985 and later national American Housing Survey (AHS) and the 2002 and later Metro AHS. This system categorizes housing units by affordability and households by income, with respect to the Adjusted Median Income, Fair Market Rent (FMR), and poverty income. It also includes housing cost burden for owner and renter households.

# load data

house <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA606/master/housing_affordability.csv", header=TRUE, sep=",")

summary(house)

##     ï..AGE1          METRO3          REGION           LMED       
##  Min.   :13.00   Min.   :1.000   Min.   :1.000   Min.   : 38500  
##  1st Qu.:38.00   1st Qu.:1.000   1st Qu.:1.000   1st Qu.: 60300  
##  Median :52.00   Median :2.000   Median :2.000   Median : 64600  
##  Mean   :52.18   Mean   :2.227   Mean   :2.394   Mean   : 68110  
##  3rd Qu.:65.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.: 74008  
##  Max.   :93.00   Max.   :5.000   Max.   :4.000   Max.   :115300  
##  NA's   :4438                                                    
##       IPOV           BEDRMS         BUILT           TYPE      
##  Min.   :11057   Min.   :0.00   Min.   :1919   Min.   :1.000  
##  1st Qu.:12036   1st Qu.:2.00   1st Qu.:1950   1st Qu.:1.000  
##  Median :15470   Median :3.00   Median :1970   Median :1.000  
##  Mean   :17168   Mean   :2.66   Mean   :1966   Mean   :1.065  
##  3rd Qu.:18639   3rd Qu.:3.00   3rd Qu.:1985   3rd Qu.:1.000  
##  Max.   :51635   Max.   :7.00   Max.   :2013   Max.   :9.000  
##  NA's   :4438                                                 
##      VALUE             ROOMS            ZINC2             ZSMHC      
##  Min.   :      1   Min.   : 1.000   Min.   :   -117   Min.   :    0  
##  1st Qu.: 100000   1st Qu.: 4.000   1st Qu.:  19974   1st Qu.:  510  
##  Median : 180000   Median : 5.000   Median :  44973   Median :  899  
##  Mean   : 246763   Mean   : 5.631   Mean   :  65887   Mean   : 1140  
##  3rd Qu.: 300000   3rd Qu.: 7.000   3rd Qu.:  85600   3rd Qu.: 1454  
##  Max.   :2520000   Max.   :15.000   Max.   :1061921   Max.   :10667  
##  NA's   :27389                      NA's   :4438      NA's   :4438   
##      TOTSAL              FMTMETRO3           FMTBUILT    
##  Min.   :     0   Central City:21493    Pre 1940 :10058  
##  1st Qu.:     0   Nonmetro    :11255   1940-1959 :11078  
##  Median : 28000   Suburb      :31787   1960-1979 :19685  
##  Mean   : 48228                        1980-1989 : 8234  
##  3rd Qu.: 70000                        1990-1999 : 7533  
##  Max.   :698886                        2000-2009 : 7176  
##  NA's   :4438                          After 2010:  771  
##         FMTSTRUCTURETYPE    FMTBEDRMS        FMTOWNRENT   
##                .:    2   0 Studio:  622   1 Owner :37146  
##  1 Single Family:41271   1 1BR   : 9821   2 Renter:27389  
##  2 2-4 units    : 6257   2 2BR   :16401                   
##  3 5-19 units   : 7273   3 3BR   :24850                   
##  4 20-49 units  : 2719   4 4BR+  :12841                   
##  5 50+ units    : 4570                                    
##  6 Mobile Home  : 2443                                    
##      FMTREGION        FMTSTATUS    
##  Midwest  :17400   Occupied:60097  
##  Northeast:16519   Vacant  : 4438  
##  South    :19260                   
##  West     :11356                   
##                                    
##                                    
##

names(house)

##  [1] "ï..AGE1"          "METRO3"           "REGION"          
##  [4] "LMED"             "IPOV"             "BEDRMS"          
##  [7] "BUILT"            "TYPE"             "VALUE"           
## [10] "ROOMS"            "ZINC2"            "ZSMHC"           
## [13] "TOTSAL"           "FMTMETRO3"        "FMTBUILT"        
## [16] "FMTSTRUCTURETYPE" "FMTBEDRMS"        "FMTOWNRENT"      
## [19] "FMTREGION"        "FMTSTATUS"

# The first column name carries stray leading characters (likely a byte order mark); rename it to AGE.
names(house)[names(house) == "ï..AGE1"] <- "AGE"
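
The `ï..` prefix on the first column name is typically the residue of a UTF-8 byte order mark at the start of a CSV, which Excel often writes when saving. As a minimal sketch, assuming that is the cause here, reading with `fileEncoding = "UTF-8-BOM"` strips the mark so the column arrives with a clean name. The temporary file and its two rows are made up purely to demonstrate the effect; they are not rows from the HADS data.

```r
# Build a tiny CSV that starts with a UTF-8 byte order mark (EF BB BF).
# The contents are illustrative only.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("AGE1,METRO3\n52,2\n38,1\n"), con)
close(con)

# "UTF-8-BOM" tells R to drop the mark while reading.
toy <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
names(toy)
```

With the real file, the same argument could be added to the `read.csv()` call; the first column would then be named `AGE1` rather than the garbled form.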

The original file is very large, so I used IBM SPSS to convert the .xpt file to CSV and then used Excel to keep only the variables needed for this project. I have uploaded the converted file to GitHub.
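
The conversion and column selection could also be scripted in R so the preparation is reproducible. The sketch below assumes the original is a SAS transport (.xpt) file that `foreign::read.xport()` can read. The helper `subset_to_csv()`, the placeholder path, and the two-row data frame standing in for the real file are all hypothetical names and values introduced for illustration.

```r
# Hypothetical helper: keep only the listed columns and write them to CSV.
subset_to_csv <- function(df, keep, csv_path) {
  missing_cols <- setdiff(keep, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  out <- df[, keep, drop = FALSE]
  write.csv(out, csv_path, row.names = FALSE)
  invisible(out)
}

# With the real data, the input would come from the transport file, e.g.:
#   full <- foreign::read.xport("path/to/file.xpt")  # placeholder path
# Here a made-up frame stands in for it.
full <- data.frame(AGE1 = c(52, 38), ZINC2 = c(44973, 19974), OTHER = c(1, 2))
csv_path <- tempfile(fileext = ".csv")
subset_to_csv(full, keep = c("AGE1", "ZINC2"), csv_path = csv_path)

# Read the result back to confirm only the requested columns were written.
back <- read.csv(csv_path)
names(back)
```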

Research question

Are age, number of people in the household, household income, and monthly housing cost associated with whether a household owns or rents its home?

Cases

Each case or record represents one housing unit. The file contains 64,535 units: 60,097 occupied and 4,438 vacant.
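
A quick way to check the unit of observation is to count rows and cross-tabulate occupancy status against missing householder age. The sketch below uses a made-up five-row data frame named `toy` with the same column names as `house`; on the real data one would substitute `house`.

```r
# Made-up rows for illustration; column names mirror the real data.
toy <- data.frame(
  AGE = c(52, NA, 38, 65, NA),
  FMTSTATUS = c("Occupied", "Vacant", "Occupied", "Occupied", "Vacant")
)

# Number of cases (one row per housing unit).
n_cases <- nrow(toy)

# Rows: occupancy status. Columns: whether AGE is missing.
status_by_missing_age <- table(status = toy$FMTSTATUS,
                               age_missing = is.na(toy$AGE))
status_by_missing_age
```

If every missing age falls in the vacant row, the missing values reflect units with no household rather than nonresponse.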

Data collection

The purpose of these datasets is to provide housing analysts with consistent measures of affordability and burdens over a long period. The datasets are based on the American Housing Survey (AHS) national files from 1985 through 2009 and the metropolitan files from 2002 through 2009.

Type of study

This is an observational study. The data come from a government survey of housing units, not from an experiment, and are collected to track the factors behind rising and falling housing and rental costs.

Data Source

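I got the data from the HUD User website of the U.S. Department of Housing and Urban Development. The Census Bureau poverty thresholds page is listed as a related source.

https://www.huduser.gov/portal/datasets/hads/hads.html

http://www.census.gov/hhes/poverty/threshld.html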

Dependent Variable

Whether the household owns or rents the house or apartment (FMTOWNRENT, categorical: owner or renter).

Independent Variable

Age of the householder (AGE), number of people in the household, household income (ZINC2), and monthly housing cost (ZSMHC). All are numerical.
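
Because the dependent variable has two categories, one possible approach is logistic regression of ownership on the numeric predictors. This is an assumption for illustration; the proposal does not commit to a method. The sketch below runs on simulated data (the data frame `sim` and the rule generating ownership are invented), using the column names AGE, ZINC2, ZSMHC, and FMTOWNRENT from the data set.

```r
# Simulated data for illustration only; not drawn from the HADS file.
set.seed(606)
n <- 200
sim <- data.frame(
  AGE = round(runif(n, 20, 90)),
  ZINC2 = round(runif(n, 10000, 150000)),
  ZSMHC = round(runif(n, 300, 3000))
)

# Invented rule: older, higher-income households are more likely to own.
p_own <- plogis(-3 + 0.04 * sim$AGE + 0.00002 * sim$ZINC2)
sim$FMTOWNRENT <- ifelse(runif(n) < p_own, "1 Owner", "2 Renter")

# Code the outcome as 1 = owner, 0 = renter, then fit a logistic model.
sim$owner <- as.integer(sim$FMTOWNRENT == "1 Owner")
fit <- glm(owner ~ AGE + ZINC2 + ZSMHC, data = sim, family = binomial)
summary(fit)$coefficients
```

On the real data, vacant units would need to be excluded first, since they have no householder age or income.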

Relevant summary statistics


library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.1

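# AGE is missing for 4438 rows (the same count as vacant units); na.rm = TRUE drops them without a warning.
# An explicit binwidth replaces the default of 30 bins.
ggplot(house, aes(x = AGE)) + geom_histogram(binwidth = 5, na.rm = TRUE) + labs(x = "Age of householder", y = "Number of housing units")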
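
Beyond a single histogram, summary statistics by tenure relate more directly to the own-versus-rent outcome. The following is a minimal sketch using base R's `aggregate()` on a made-up six-row data frame (`toy` and its values are invented for illustration); on the real data one would pass `house` instead.

```r
# Made-up rows for illustration; column names mirror the real data.
toy <- data.frame(
  FMTOWNRENT = c("1 Owner", "1 Owner", "1 Owner",
                 "2 Renter", "2 Renter", "2 Renter"),
  AGE = c(60, 50, 40, 30, 25, NA),
  ZSMHC = c(1500, 1200, 900, 800, 1000, NA)
)

# Median age and monthly housing cost per tenure group.
# The formula interface drops rows with missing values by default.
medians <- aggregate(cbind(AGE, ZSMHC) ~ FMTOWNRENT, data = toy, FUN = median)
medians
```

Medians are used here rather than means because income and cost variables in the summary output are strongly right-skewed.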

All the data and information are collected and referenced from

https://www.huduser.gov/portal/datasets/hads/hads.html