My dataset is from the Housing Affordability Data System (HADS). This dataset measures the affordability of housing units and the housing cost burdens of households. The dataset is derived from the 1985 and later national American Housing Survey (AHS) and the 2002 and later Metro AHS. This dataset is very large. It contains 99 variables and 145531 observations.
http://catalog.data.gov/dataset/housing-affordability-data-system-hads
Based on the concept in HADS, it says housing cost relative to AMI (area median income) is perhaps the most common affordability standard. So, I want to take a deep look into AMI. AMI is used for the following two purposes: to classify households on the basis of income received, and to classify housing units on the basis of the income need to afford them (“affordability”). Also, because the number of bedrooms in the unit reflects the likely household size, AMI needs to be adjusted for number of bedrooms.
So, I want to build a model that using monthly housing cost and number of bedrooms to explain whether a household income is lower or higher(include equal) than the AMI.
AMI is in per household.
data <- read.table("houseafford.csv",header = T, sep =',')
colnames(data) <- c("bedrooms","housingcost","cost_to_AMI")
data$cost_to_AMI <- as.numeric(data$cost_to_AMI)
str(data)
## 'data.frame': 145531 obs. of 3 variables:
## $ bedrooms : int 4 3 5 3 4 1 3 3 5 5 ...
## $ housingcost: int 4240 3502 5014 4609 4891 1798 4919 4095 5310 6866 ...
## $ cost_to_AMI: num 1 1 1 1 1 1 1 1 1 1 ...
The dataset is very big, so I do the power analysis to select the sample size. This part will be explained in the four issues part.
N <- 326
M <- nrow(data)
set.seed(23)
indicies <- sample(M,N,replace = FALSE)
data <- data[indicies,]
My independent variables are number of bedrooms in unit and monthly housing cost ($).
My dependent categorical variable is whether the household income is lower or equal and higher than the area median income (AMI).
This categorical variable has 7 categories originally, which are:
Category
LTE 30% AMI
30.1-50% AMI
50.1-60% AMI
60.1-80% AMI
80.1-100% AMI
100.1-120% AMI
higher than 120% AMI
I define these categories into 2 categories: lower than 100% AMI is 0, and equal or higher than 100% AMI is 1.
My null hypothesis \(H_0\) is that whether the household income is higher or lower than the area median income cannot be explained by number of bedroom in the unit and monthly housing cost.
My alternate hypothesis \(H_1\) is that whether the household income is higher or lower than the area median income can be explained by number of bedroom in the unit and monthly housing cost.
I think my model should be an explanatory model. I try to use number of bedrooms and monthly housing cost to explain the household income lower or higher than the area median income.
cor(data)
## bedrooms housingcost cost_to_AMI
## bedrooms 1.0000000 0.4013657 0.3890666
## housingcost 0.4013657 1.0000000 0.4966925
## cost_to_AMI 0.3890666 0.4966925 1.0000000
I decide to use step-wise method to build my model. So, according to the correlation matrix, I should first add monthly housing cost in the model, with the correlation of 0.4698. Second, I add number of bedroom in the unit into the model, with the correlation of 0.3463. There may be suppression when add number of bedrooms and monthly housing cost both into the model, because they have the correlation of 0.4045.
attach(data)
logit1 <- glm(cost_to_AMI~housingcost,data,family = "binomial")
summary(logit1)
##
## Call:
## glm(formula = cost_to_AMI ~ housingcost, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2447 -0.8249 -0.5165 0.8223 2.1965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3083633 0.2733010 -8.446 < 2e-16 ***
## housingcost 0.0016840 0.0002178 7.733 1.05e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 437.64 on 325 degrees of freedom
## Residual deviance: 339.49 on 324 degrees of freedom
## AIC: 343.49
##
## Number of Fisher Scoring iterations: 5
bo of this model is -2.3083, which means when monthly housing cost is 0, the log odds of income equal and higher than AMI is -2.3083. b1 of this model is 0.001684 and shows that with one dollar increases in monthly housing cost, the log odds of income equal and higher than AMI increases by 0.001684.
logit2 <- glm(cost_to_AMI~housingcost+bedrooms,data,family = "binomial")
summary(logit2)
##
## Call:
## glm(formula = cost_to_AMI ~ housingcost + bedrooms, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1416 -0.7582 -0.4080 0.7833 2.2482
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5720331 0.4544679 -7.860 3.85e-15 ***
## housingcost 0.0014391 0.0002214 6.500 8.02e-11 ***
## bedrooms 0.5683120 0.1478145 3.845 0.000121 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 437.64 on 325 degrees of freedom
## Residual deviance: 322.77 on 323 degrees of freedom
## AIC: 328.77
##
## Number of Fisher Scoring iterations: 5
bo of this model is -3.5720, which means when monthly housing cost and unit of bedroom are 0, the log odds of income equal and higher than AMI is -3.5720. b1 of this model is 0.0014391 and shows that with one dollar increases in monthly housing cost, the log odds of income equal and higher than AMI increases 0.0014391. b2 of this model is 0.5683, which means that with one unit increases in the number of bedrooms, the log odds of equal and higher than AMI increases by 0.5683.
library(aod)
## Warning: package 'aod' was built under R version 3.1.3
wald.test(b=coef(logit2),Sigma = vcov(logit2),Terms =2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 42.3, df = 1, P(> X2) = 8e-11
wald.test(b=coef(logit2),Sigma = vcov(logit2),Terms =3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 14.8, df = 1, P(> X2) = 0.00012
wald.test(b=coef(logit2),Sigma = vcov(logit2),Terms =2:3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 68.6, df = 2, P(> X2) = 1.2e-15
Doing Chi-squared test, we see that p-value of are all less than 0.05. So, we can reject the null hypothesis and say that whether the household income is lower or higher (and equal) than the area median income (AMI) can be explained by monthly housing cost and number of bedrooms in the unit.
standresidual <- rstandard(logit2)
plot(fitted(logit2),standresidual, pch = 20, bg = "blue", main = "Standardized plot of fitted value vs Residual", xlab = "fitted values of model", ylab = "standardized Residuals")
abline(0,0,lwd = 2,col = "red")
To test this assumption, I first look at the residuals vs fitted plot. There are more points below the line and the residuals are not random across the dynamic range. Also, the plot does not look linear.
mean(rstandard(logit2))
## [1] -0.05828612
Then, I compute the mean of standardized residual. The mean is -0.05778776. So, this assumption is violated.
This assumption means that there should be zero autocorrelation in the residuals and all cases should be independent of one another.
plot(standresidual)
Looking at the residual plot, I think there is not a pattern in the residuals. Also, my data were collected household by household. I think they should not be correlate with one another. So, this assumption is satisfied.
To test this assumption, I need to look at the histogram, boxplot and Q-Q plot of residual.
hist(resid(logit2), breaks = 15, main = "Histograms of Residuals", xlab = "Residuals", ylab = "Density")
The histogram plot is not a bell-shape. I think the plot has kurtosis and it is left skew. So the residuals are not normal distributed.
boxplot(residuals(logit2), main = "Boxplot of Residuals")
The boxplot shows the mean of residual is not at 0 and the residuals are left skew. There is no outliers in the plot.
par(mfrow = c(1,1))
qqnorm(residuals(logit2),main = "Q-Q Plot")
qqline(residuals(logit2),col = 'red')
The Q-Q plot is not linear in the lower, middle, and upper part of the plot. So, it also suggests the residuals are not normal distributed.
By looking through histogram plot, Boxplot, and Q-Q plot, I think the distribution of residuals is not normal and this assumption is violated.
To test this assumption, I first look at the standardized residual plot. The plot looks like there is heteroscaditicity. Then, I do the Breusch-Pagan test. The null hypothesis here is that there is homoscaditicity.
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(cost_to_AMI~housingcost+bedrooms)
##
## studentized Breusch-Pagan test
##
## data: cost_to_AMI ~ housingcost + bedrooms
## BP = 32.3435, df = 2, p-value = 9.477e-08
The p-value of Breusch-Pagan Test is 8.946e-08 and it less than 0.05. So, we can reject the null hypothesis and say there is heteroscaditicity. So, this assumption is not satisfied.
After doing the LINE analysis, I find 3 of 4 assumption are violated. However, because this is logistic regression, this is fine. Because logistic regression has other assumptions that this model need to satisfy.
Monthly housing cost and number of bedroom in the unit should be proximal and probabilistic causes of whether income equal and higher or lower than the AMI. They are proximal causes because we can easily know the direct and immediate cause of something, but it’s very had to find the start point of the process. They are not determinate because everything has exception. Not every time when monthly housing cost and number of bedroom increase, the chance that income is equal and higher than the AMI increases.
Monthly housing cost and number of bedroom in the unit are related to whether income equal and higher or lower than the AMI because they are correlated with each other. Then, I think direction of influence should be monthly when housing cost and number of bedroom in the unit increase, the chance that income is equal and higher than the AMI increases. However, their relation is not isolate. Other factors also could increase the chance that income is equal and higher than the AMI.
My dataset is very large. So, it may be waste time to do the regression with the whole data and also could increase the probability of Type I error. So, I need to reduce the sample size.
First, I determine the odds ratio of this model. I input Pr(Y=1|X=1)H1 is 0.5 and Pr(Y=1|X=1)H0 is 0.4. I get Odds ratio should be 1.5. Then, I use G * Power to compute I need how many observations. I choose alpha equal 0.05 and Power is 0.95. G * Power says that I need 326 observations.
lm <- lm(bedrooms~housingcost)
R2 <- summary(lm)$r.squared
plot(housingcost,bedrooms, pch = 21, bg = 'blue', main = "monthly housing cost vs number of bedrooms", xlab = "monthly housing cost", ylab = "number of bedrooms")
Tolerance <- 1-R2
Independent variables in my model have collinearity, but they are not perfect collinearity. According to the correlation matrix, their correlation is 0.4013. Then, looking at the plot, I think IVs are not high correlated. In addition, R^2 is 0.1510 for using monthly housing cost as IV and air number of bedrooms as DV. Tolerance is 0.8389. So, IVs in my model don’t have high collinearity.
I think my model have measurement error. Though the dataset is collected from survey, my dependent variable is calculated from the result of the survey. Also, because we can’t measure any variable perfectly, there is attenuation of correlation. The correlation we compute is lower than the actual correlation.
interactmodel <- glm(cost_to_AMI~bedrooms*housingcost,data,family = "binomial")
summary(interactmodel)
##
## Call:
## glm(formula = cost_to_AMI ~ bedrooms * housingcost, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1593 -0.7690 -0.3468 0.7864 2.3838
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.8394944 0.8336528 -5.805 6.43e-09 ***
## bedrooms 1.0345205 0.2825299 3.662 0.000251 ***
## housingcost 0.0025718 0.0006221 4.134 3.57e-05 ***
## bedrooms:housingcost -0.0004012 0.0001916 -2.095 0.036210 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 437.64 on 325 degrees of freedom
## Residual deviance: 319.20 on 322 degrees of freedom
## AIC: 327.2
##
## Number of Fisher Scoring iterations: 5
According to the summary of this model, the interaction term is significant. Also, the correlation of my independent variables is 0.4013. Also, I think there is suppression between my independent variables. So, there is interaction effect..
Overall, monthly housing cost and number of bedroom in the unit seems can explain whether household income is equal and higher or lower than the area median income. However, I think my model may not be a very good model. My two independent variables have suppression with each other and make the model not accurate. Also, I think other variables may explain whether household income is higher (and equal) or lower than the AMI better.