The dataset I have chosen this week is the American Housing Survery (AHS) from 2015. The dataset contains information regarding housing/renting issues in the United States. I will be focusing on the variables:
I want to determine if there is any relationship between income, gender,and education level. I hypothesize that men will have a higher income than women, and that those with a higher education will also have a higher income.
library(dplyr)
library(Zelig)
library(lme4)
library(nlme)
library(texreg)
library(visreg)
library(ggplot2)
library(readr)
ahs <- read.csv("/Users/rachel_ramphal/Documents/Data Sets/ahs.csv")
ahs1 <- select(ahs, HINCP, TENURE, HHGRAD, HHSEX)
head(ahs1)
## HINCP TENURE HHGRAD HHSEX
## 1 113000 1 44 1
## 2 29000 1 44 2
## 3 45000 2 39 2
## 4 69900 1 47 1
## 5 49200 3 39 1
## 6 162700 1 44 2
ahs1$HHSEX[ahs1$HHSEX == 1] <- "Male"
ahs1$HHSEX[ahs1$HHSEX == 2] <- "Female"
This model will be used to determine how income varies by gender.
cpooling <- lm(HINCP ~ HHSEX, data = ahs1)
summary(cpooling)
##
## Call:
## lm(formula = HINCP ~ HHSEX, data = ahs1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87916 -42916 -11322 9488 4494338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6 1603 -0.004 0.997
## HHSEX3 6 90740 0.000 1.000
## HHSEXFemale 57668 1859 31.028 <2e-16 ***
## HHSEXMale 82922 1848 44.875 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90730 on 22251 degrees of freedom
## Multiple R-squared: 0.08357, Adjusted R-squared: 0.08344
## F-statistic: 676.3 on 3 and 22251 DF, p-value: < 2.2e-16
The model shows that for males the average income is 56972.8, and women make 8383.0 more than males on average.
This model will show if there is an effect on income based on the education level of the respondent. This graph is showing the results for males.
dcoef <- ahs1 %>%
group_by(HHGRAD) %>%
do(mod = lm(HINCP ~ HHSEX, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram(fill = 'dodgerblue')
This graph is showing that the mean is around $56-57,000 which is close to the average income determine from the complete-pooling model above.
dcoef <- ahs1 %>%
group_by(HHGRAD) %>%
do(mod = lm(HINCP ~ HHSEX, data = .))
coef <- dcoef %>% do(data.frame(sexc = coef(.$mod)[2]))
ggplot(coef, aes(x = sexc)) + geom_histogram(fill = 'mediumpurple2')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Between the intercept and slope models we can see that there is a lot of variation. This means that my complete-pooling model results of 56972.8 and 8383.0 are the results of competing forces.
This model will show the relationship between income and education levels of the respondents. However, this model doesn’t account for the gender differences of the respondents.
m1_lme <- lme(HINCP ~ HHSEX, data = ahs1, random = ~1|HHGRAD, method = "ML")
summary(m1_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: ahs1
## AIC BIC logLik
## 568983.2 569031.3 -284485.6
##
## Random effects:
## Formula: ~1 | HHGRAD
## (Intercept) Residual
## StdDev: 39730 85996.69
##
## Fixed effects: HINCP ~ HHSEX
## Value Std.Error DF t-value p-value
## (Intercept) -6.00 39762.62 22234 -0.0001509 0.9999
## HHSEX3 6.00 86017.85 22234 0.0000698 0.9999
## HHSEXFemale 52620.82 40935.01 22234 1.2854723 0.1986
## HHSEXMale 73703.96 40934.37 22234 1.8005400 0.0718
## Correlation:
## (Intr) HHSEX3 HHSEXF
## HHSEX3 -0.001
## HHSEXFemale -0.971 0.001
## HHSEXMale -0.971 0.001 1.000
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.93411185 -0.38846669 -0.07545409 0.12178391 51.83939462
##
## Number of Observations: 22255
## Number of Groups: 18
The standard deviation in this model is 63349.45. While this is a large number for standard deviation, it is not entirely unexpected since there are large differences between the minimum and maximum income amounts.
This model will show the relationship between income and education level of the respondents while also allowing for gender differences.
m2_lme <- lme(HINCP ~ HHSEX, data = ahs1, random = ~ HHSEX|HHGRAD, method = "ML")
summary(m2_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: ahs1
## AIC BIC logLik
## 568977.4 569097.6 -284473.7
##
## Random effects:
## Formula: ~HHSEX | HHGRAD
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 6.343953e-04 (Intr) HHSEX3 HHSEXF
## HHSEX3 4.300995e-03 -0.172
## HHSEXFemale 3.689448e+04 0.826 -0.207
## HHSEXMale 4.387619e+04 0.838 -0.207 0.997
## Residual 8.595426e+04
##
## Fixed effects: HINCP ~ HHSEX
## Value Std.Error DF t-value p-value
## (Intercept) -6.00 1518.90 22234 -0.003950 0.9968
## HHSEX3 6.00 85975.41 22234 0.000070 0.9999
## HHSEXFemale 52952.85 9172.46 22234 5.773023 0.0000
## HHSEXMale 72561.92 10840.62 22234 6.693518 0.0000
## Correlation:
## (Intr) HHSEX3 HHSEXF
## HHSEX3 -0.018
## HHSEXFemale -0.166 0.003
## HHSEXMale -0.140 0.002 0.986
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.98492152 -0.38432592 -0.07790437 0.12142830 51.89147765
##
## Number of Observations: 22255
## Number of Groups: 18
The model shows that the average income for men is 85522.86 while women make 18177.67 less than men.
Using the AIC we will be able to determine which is the best model.
AIC(cpooling, m1_lme, m2_lme)
## df AIC
## cpooling 5 571271.3
## m1_lme 6 568983.2
## m2_lme 15 568977.4
From this table, we can see that the third model (m2_lme - The Random Slope Model) is the best because it has the lowest AIC value. This shows that the model that takes gender and education level into account (rather than just one of the two) produces the most reliable information of the three.
This table will determine if income varies by education level.
m3_lme <- lme(HINCP ~ 1, random = ~1|HHGRAD, data = ahs1, method = "ML")
summary(m3_lme)
## Linear mixed-effects model fit by maximum likelihood
## Data: ahs1
## AIC BIC logLik
## 569261.1 569285.1 -284627.5
##
## Random effects:
## Formula: ~1 | HHGRAD
## (Intercept) Residual
## StdDev: 43366.46 86541.26
##
## Fixed effects: HINCP ~ 1
## Value Std.Error DF t-value p-value
## (Intercept) 59923.11 10280.53 22237 5.828794 0
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.84903970 -0.40054688 -0.07598248 0.13013712 51.37574968
##
## Number of Observations: 22255
## Number of Groups: 18
ICC <- 43366.46/(43366.46 + 86541.26)
ICC
## [1] 0.3338251
The ICC value calculated shows that about 33% of the variation among income levels is due to the varying education levels of the respondents.
intervals(m3_lme)
## Approximate 95% confidence intervals
##
## Fixed effects:
## lower est. upper
## (Intercept) 39772.99 59923.11 80073.22
## attr(,"label")
## [1] "Fixed effects:"
##
## Random Effects:
## Level: HHGRAD
## lower est. upper
## sd((Intercept)) 31161.97 43366.46 60350.79
##
## Within-group standard error:
## lower est. upper
## 85740.69 86541.26 87349.31
The confidence interval above shows that the larger percentage of income variaton is based upon gender rather than education levels.
Overall, from this analysis I am able to determine that my hypothesis was correct. There is a great deal of variation among income based on the respondents education level. There is also a large difference in income amounts based on gender, which is unsurprising due to the gender pay gap in the United States.