Dataset

My dataset is from the Housing Affordability Data System(HADS). This dataset measures the affordability of housing units and the housing cost burdens of households. The dataset is derived from the 1985 and later national American Housing Survey(AHS) and the 2002 and later Metro AHS. This dataset is very large. It contains 99 varibles and 145531 observations.

http://catalog.data.gov/dataset/housing-affordability-data-system-hads

data <- read.table("houseafford.csv",header = T, sep =',')
colnames(data) <- c("bedrooms","houseincome","cost_to_AMI")
data$cost_to_AMI <- as.numeric(data$cost_to_AMI)
str(data)
## 'data.frame':    145531 obs. of  3 variables:
##  $ bedrooms   : int  4 3 5 3 4 1 3 3 5 5 ...
##  $ houseincome: int  159972 156772 1488496 124944 149972 129972 34944 258644 310916 561216 ...
##  $ cost_to_AMI: num  1 1 1 1 1 1 1 1 1 1 ...

The dataset is very big, so I do the power analysis to select the sample size. This part will be explained in the four issues.

N <- 485
M <- nrow(data)
set.seed(99)
indicies <- sample(M,N,replace = FALSE)
data <- data[indicies,]

Independent variables

My independent variables are number of bedrooms in unit and household income($).

Dependent categorical variable

My dependent categorical variable is whether the household income is higher or lower than the area median income(AMI)

This categorical variable has 7 categories originally, which are:

Category

  1. LTE 30% AMI

  2. 30.1-50% AMI

  3. 50.1-60% AMI

  4. 60.1-80% AMI

  5. 80.1-100% AMI

  6. 100.1-120% AMI

  7. higher than 120% AMI

I define these categories into 2 categories: lower than 100% AMI is 0, and higher than 100% AMI is 1.

Null Hypothesis and Alternate Hypothesis

My null hypothesis \(H_0\) is that whether the household income is higher or lower than the area median income cannot be explained by number of bedroom in the unit and total household income.

Build the model

cor(data)
##              bedrooms houseincome cost_to_AMI
## bedrooms    1.0000000   0.3446487   0.3463912
## houseincome 0.3446487   1.0000000   0.3820464
## cost_to_AMI 0.3463912   0.3820464   1.0000000

I think my model should be a explanatory model. I try to use number of bedrooms and household income to explain the household income lower or higher than the area median income.

I decide to use step-wise method to build my model. So, according to the correlation matrix, I should first add number of bedrooms in the unit in the model. Second, I add the houseincome in the the model.

attach(data)
logit1 <- glm(cost_to_AMI~bedrooms,data,family = "binomial")
summary(logit1)
## 
## Call:
## glm(formula = cost_to_AMI ~ bedrooms, family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0806  -1.1116  -0.6001   1.2447   2.2121  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3561     0.3049  -7.727 1.10e-14 ***
## bedrooms      0.7331     0.1026   7.145 8.99e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 659.43  on 484  degrees of freedom
## Residual deviance: 597.32  on 483  degrees of freedom
## AIC: 601.32
## 
## Number of Fisher Scoring iterations: 4
logit2 <- glm(cost_to_AMI~bedrooms+houseincome,data,family = "binomial")
summary(logit2)
## 
## Call:
## glm(formula = cost_to_AMI ~ bedrooms + houseincome, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1103  -0.8662  -0.5759   1.0074   2.3325  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.652e+00  3.208e-01  -8.268  < 2e-16 ***
## bedrooms     5.241e-01  1.076e-01   4.870 1.12e-06 ***
## houseincome  1.385e-05  2.295e-06   6.035 1.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 659.43  on 484  degrees of freedom
## Residual deviance: 549.76  on 482  degrees of freedom
## AIC: 555.76
## 
## Number of Fisher Scoring iterations: 4

Plot

Standardized residuals as a function of the fitted model

standresidual <- rstandard(logit2)
plot(fitted(logit2),standresidual, pch = 20, bg = "blue", main = "Standardized plot of fitted value vs Residual", xlab = "fitted values of model", ylab = "standardized Residuals")
abline(0,0,lwd = 2,col = "red")

LINE analysis

1. The mean of the response, E(Yi), at each set of values of the predictor, is a Linear functon of the predictors.

To test this assumption, I first look at the residuals vs fitted plot.The plot shows that the residuals are not random across the dynamic range. Also, the plot does not look linear.

mean(rstandard(logit2))
## [1] -0.05248861

Then, I compute the mean of standardized residual.The mean is -0.05348861. So, this assumption is violated.

2. The error terms are Independent

This assumption means that there should 0 autocorrelation in the residuals.

plot(standresidual)

Looking at the residual plot, I think there is a pattern in the residuals. So, this assumption is violated.

3. The errors at each set of values of the predictor are Normally distributed.

To test this assumption, I need to look at the histogram, boxplot and Q-Q plot of residual.

Histogram plot

hist(resid(logit1), breaks = 15, main = "Histograms of Residuals", xlab = "Residuals", ylab = "Density")

The histogram shows the residuals are not normal distributed. There are two peak in the plot. #### Boxplot

boxplot(residuals(logit2), main = "Boxplot of Residuals")

The boxplot shows the mean of residual is not at 0, so the residuals are left skewed. There is not outliers in the plot.

Q-Q plot

par(mfrow = c(1,1))
qqnorm(residuals(logit2),main = "Q-Q Plot")
qqline(residuals(logit2),col = 'red')

The Q-Q plot also suggests the residuals are not normal distributed.

4. The errors at each set of values of the predictor have Equal variances.

Breusch-Pagan test

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(cost_to_AMI~bedrooms)
## 
##  studentized Breusch-Pagan test
## 
## data:  cost_to_AMI ~ bedrooms
## BP = 12.9715, df = 1, p-value = 0.0003163
bptest(cost_to_AMI~houseincome)
## 
##  studentized Breusch-Pagan test
## 
## data:  cost_to_AMI ~ houseincome
## BP = 12.258, df = 1, p-value = 0.0004633

The p-value of these two Breusch-Pagan Test are all less than 0.05, so we can’t reject the null hypothese. Therefore, there is heteroscaditicity in the model and this assumption is not satisfied.

Four issues

To be finished in the final version.

1. Causality

2. Sample Sizes

3. Collinearity

4. Measurement Error

Test Interaction effects between the IVs

interactmodel <- glm(cost_to_AMI~bedrooms*houseincome,data,family = "binomial")
summary(interactmodel)
## 
## Call:
## glm(formula = cost_to_AMI ~ bedrooms * houseincome, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1064  -0.8753  -0.5549   1.0092   2.4462  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.940e+00  4.922e-01  -5.974 2.31e-09 ***
## bedrooms              6.275e-01  1.700e-01   3.690 0.000224 ***
## houseincome           1.910e-05  7.031e-06   2.717 0.006588 ** 
## bedrooms:houseincome -1.777e-06  2.216e-06  -0.802 0.422711    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 659.43  on 484  degrees of freedom
## Residual deviance: 549.12  on 481  degrees of freedom
## AIC: 557.12
## 
## Number of Fisher Scoring iterations: 4

Conclusion