My dataset is from the Housing Affordability Data System(HADS). This dataset measures the affordability of housing units and the housing cost burdens of households. The dataset is derived from the 1985 and later national American Housing Survey(AHS) and the 2002 and later Metro AHS. This dataset is very large. It contains 99 varibles and 145531 observations.
http://catalog.data.gov/dataset/housing-affordability-data-system-hads
data <- read.table("houseafford.csv",header = T, sep =',')
colnames(data) <- c("bedrooms","houseincome","cost_to_AMI")
data$cost_to_AMI <- as.numeric(data$cost_to_AMI)
str(data)
## 'data.frame': 145531 obs. of 3 variables:
## $ bedrooms : int 4 3 5 3 4 1 3 3 5 5 ...
## $ houseincome: int 159972 156772 1488496 124944 149972 129972 34944 258644 310916 561216 ...
## $ cost_to_AMI: num 1 1 1 1 1 1 1 1 1 1 ...
The dataset is very big, so I do the power analysis to select the sample size. This part will be explained in the four issues.
N <- 485
M <- nrow(data)
set.seed(99)
indicies <- sample(M,N,replace = FALSE)
data <- data[indicies,]
My independent variables are number of bedrooms in unit and household income($).
My dependent categorical variable is whether the household income is higher or lower than the area median income(AMI)
This categorical variable has 7 categories originally, which are:
Category
LTE 30% AMI
30.1-50% AMI
50.1-60% AMI
60.1-80% AMI
80.1-100% AMI
100.1-120% AMI
higher than 120% AMI
I define these categories into 2 categories: lower than 100% AMI is 0, and higher than 100% AMI is 1.
My null hypothesis \(H_0\) is that whether the household income is higher or lower than the area median income cannot be explained by number of bedroom in the unit and total household income.
cor(data)
## bedrooms houseincome cost_to_AMI
## bedrooms 1.0000000 0.3446487 0.3463912
## houseincome 0.3446487 1.0000000 0.3820464
## cost_to_AMI 0.3463912 0.3820464 1.0000000
I think my model should be a explanatory model. I try to use number of bedrooms and household income to explain the household income lower or higher than the area median income.
I decide to use step-wise method to build my model. So, according to the correlation matrix, I should first add number of bedrooms in the unit in the model. Second, I add the houseincome in the the model.
attach(data)
logit1 <- glm(cost_to_AMI~bedrooms,data,family = "binomial")
summary(logit1)
##
## Call:
## glm(formula = cost_to_AMI ~ bedrooms, family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0806 -1.1116 -0.6001 1.2447 2.2121
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3561 0.3049 -7.727 1.10e-14 ***
## bedrooms 0.7331 0.1026 7.145 8.99e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 659.43 on 484 degrees of freedom
## Residual deviance: 597.32 on 483 degrees of freedom
## AIC: 601.32
##
## Number of Fisher Scoring iterations: 4
logit2 <- glm(cost_to_AMI~bedrooms+houseincome,data,family = "binomial")
summary(logit2)
##
## Call:
## glm(formula = cost_to_AMI ~ bedrooms + houseincome, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1103 -0.8662 -0.5759 1.0074 2.3325
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.652e+00 3.208e-01 -8.268 < 2e-16 ***
## bedrooms 5.241e-01 1.076e-01 4.870 1.12e-06 ***
## houseincome 1.385e-05 2.295e-06 6.035 1.59e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 659.43 on 484 degrees of freedom
## Residual deviance: 549.76 on 482 degrees of freedom
## AIC: 555.76
##
## Number of Fisher Scoring iterations: 4
standresidual <- rstandard(logit2)
plot(fitted(logit2),standresidual, pch = 20, bg = "blue", main = "Standardized plot of fitted value vs Residual", xlab = "fitted values of model", ylab = "standardized Residuals")
abline(0,0,lwd = 2,col = "red")
To test this assumption, I first look at the residuals vs fitted plot.The plot shows that the residuals are not random across the dynamic range. Also, the plot does not look linear.
mean(rstandard(logit2))
## [1] -0.05248861
Then, I compute the mean of standardized residual.The mean is -0.05348861. So, this assumption is violated.
This assumption means that there should 0 autocorrelation in the residuals.
plot(standresidual)
Looking at the residual plot, I think there is a pattern in the residuals. So, this assumption is violated.
To test this assumption, I need to look at the histogram, boxplot and Q-Q plot of residual.
hist(resid(logit1), breaks = 15, main = "Histograms of Residuals", xlab = "Residuals", ylab = "Density")
The histogram shows the residuals are not normal distributed. There are two peak in the plot. #### Boxplot
boxplot(residuals(logit2), main = "Boxplot of Residuals")
The boxplot shows the mean of residual is not at 0, so the residuals are left skewed. There is not outliers in the plot.
par(mfrow = c(1,1))
qqnorm(residuals(logit2),main = "Q-Q Plot")
qqline(residuals(logit2),col = 'red')
The Q-Q plot also suggests the residuals are not normal distributed.
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(cost_to_AMI~bedrooms)
##
## studentized Breusch-Pagan test
##
## data: cost_to_AMI ~ bedrooms
## BP = 12.9715, df = 1, p-value = 0.0003163
bptest(cost_to_AMI~houseincome)
##
## studentized Breusch-Pagan test
##
## data: cost_to_AMI ~ houseincome
## BP = 12.258, df = 1, p-value = 0.0004633
The p-value of these two Breusch-Pagan Test are all less than 0.05, so we can’t reject the null hypothese. Therefore, there is heteroscaditicity in the model and this assumption is not satisfied.
To be finished in the final version.
interactmodel <- glm(cost_to_AMI~bedrooms*houseincome,data,family = "binomial")
summary(interactmodel)
##
## Call:
## glm(formula = cost_to_AMI ~ bedrooms * houseincome, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1064 -0.8753 -0.5549 1.0092 2.4462
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.940e+00 4.922e-01 -5.974 2.31e-09 ***
## bedrooms 6.275e-01 1.700e-01 3.690 0.000224 ***
## houseincome 1.910e-05 7.031e-06 2.717 0.006588 **
## bedrooms:houseincome -1.777e-06 2.216e-06 -0.802 0.422711
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 659.43 on 484 degrees of freedom
## Residual deviance: 549.12 on 481 degrees of freedom
## AIC: 557.12
##
## Number of Fisher Scoring iterations: 4