The dataset I have chosen this week is the American Housing Survery (AHS) from 2015. The dataset contains information regarding housing/renting issues in the United States. I will be focusing on the variables:

  1. HHSEX: Sex of respondent (1 = Male, 2 = Female)
  2. HINCP: Household income of the respondent.
  3. HHGRAD: Education level of respondent.

I want to determine if there is any relationship between income, gender,and education level. I hypothesize that men will have a higher income than women, and that those with a higher education will also have a higher income.

library(dplyr)
library(Zelig)
library(lme4)
library(nlme)
library(texreg)
library(visreg)
library(ggplot2)

Reading in Data

library(readr)
ahs <- read.csv("/Users/rachel_ramphal/Documents/Data Sets/ahs.csv")
ahs1 <- select(ahs, HINCP, TENURE, HHGRAD, HHSEX)
head(ahs1)
##    HINCP TENURE HHGRAD HHSEX
## 1 113000      1     44     1
## 2  29000      1     44     2
## 3  45000      2     39     2
## 4  69900      1     47     1
## 5  49200      3     39     1
## 6 162700      1     44     2

Categorizing Sex

ahs1$HHSEX[ahs1$HHSEX == 1] <- "Male"
ahs1$HHSEX[ahs1$HHSEX == 2] <- "Female"

Complete-Pooling Model

This model will be used to determine how income varies by gender.

cpooling <- lm(HINCP ~ HHSEX, data = ahs1)
summary(cpooling)
## 
## Call:
## lm(formula = HINCP ~ HHSEX, data = ahs1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -87916  -42916  -11322    9488 4494338 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -6       1603  -0.004    0.997    
## HHSEX3             6      90740   0.000    1.000    
## HHSEXFemale    57668       1859  31.028   <2e-16 ***
## HHSEXMale      82922       1848  44.875   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90730 on 22251 degrees of freedom
## Multiple R-squared:  0.08357,    Adjusted R-squared:  0.08344 
## F-statistic: 676.3 on 3 and 22251 DF,  p-value: < 2.2e-16

The model shows that for males the average income is 56972.8, and women make 8383.0 more than males on average.

No Pooling Model

This model will show if there is an effect on income based on the education level of the respondent. This graph is showing the results for males.

The Intercept
dcoef <- ahs1 %>%
    group_by(HHGRAD) %>%
    do(mod = lm(HINCP ~ HHSEX, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram(fill = 'dodgerblue')

This graph is showing that the mean is around $56-57,000 which is close to the average income determine from the complete-pooling model above.

The Slope
dcoef <- ahs1 %>%
  group_by(HHGRAD) %>%
  do(mod = lm(HINCP ~ HHSEX, data = .))
coef <- dcoef %>% do(data.frame(sexc = coef(.$mod)[2]))
ggplot(coef, aes(x = sexc)) + geom_histogram(fill = 'mediumpurple2')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Between the intercept and slope models we can see that there is a lot of variation. This means that my complete-pooling model results of 56972.8 and 8383.0 are the results of competing forces.

Partial-Pooling Model

Random Intercept

This model will show the relationship between income and education levels of the respondents. However, this model doesn’t account for the gender differences of the respondents.

m1_lme <- lme(HINCP ~ HHSEX, data = ahs1, random = ~1|HHGRAD, method = "ML")
summary(m1_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: ahs1 
##        AIC      BIC    logLik
##   568983.2 569031.3 -284485.6
## 
## Random effects:
##  Formula: ~1 | HHGRAD
##         (Intercept) Residual
## StdDev:       39730 85996.69
## 
## Fixed effects: HINCP ~ HHSEX 
##                Value Std.Error    DF    t-value p-value
## (Intercept)    -6.00  39762.62 22234 -0.0001509  0.9999
## HHSEX3          6.00  86017.85 22234  0.0000698  0.9999
## HHSEXFemale 52620.82  40935.01 22234  1.2854723  0.1986
## HHSEXMale   73703.96  40934.37 22234  1.8005400  0.0718
##  Correlation: 
##             (Intr) HHSEX3 HHSEXF
## HHSEX3      -0.001              
## HHSEXFemale -0.971  0.001       
## HHSEXMale   -0.971  0.001  1.000
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.93411185 -0.38846669 -0.07545409  0.12178391 51.83939462 
## 
## Number of Observations: 22255
## Number of Groups: 18

The standard deviation in this model is 63349.45. While this is a large number for standard deviation, it is not entirely unexpected since there are large differences between the minimum and maximum income amounts.

Random Slope

This model will show the relationship between income and education level of the respondents while also allowing for gender differences.

m2_lme <- lme(HINCP ~ HHSEX, data = ahs1, random = ~ HHSEX|HHGRAD, method = "ML")
summary(m2_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: ahs1 
##        AIC      BIC    logLik
##   568977.4 569097.6 -284473.7
## 
## Random effects:
##  Formula: ~HHSEX | HHGRAD
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev       Corr                
## (Intercept) 6.343953e-04 (Intr) HHSEX3 HHSEXF
## HHSEX3      4.300995e-03 -0.172              
## HHSEXFemale 3.689448e+04  0.826 -0.207       
## HHSEXMale   4.387619e+04  0.838 -0.207  0.997
## Residual    8.595426e+04                     
## 
## Fixed effects: HINCP ~ HHSEX 
##                Value Std.Error    DF   t-value p-value
## (Intercept)    -6.00   1518.90 22234 -0.003950  0.9968
## HHSEX3          6.00  85975.41 22234  0.000070  0.9999
## HHSEXFemale 52952.85   9172.46 22234  5.773023  0.0000
## HHSEXMale   72561.92  10840.62 22234  6.693518  0.0000
##  Correlation: 
##             (Intr) HHSEX3 HHSEXF
## HHSEX3      -0.018              
## HHSEXFemale -0.166  0.003       
## HHSEXMale   -0.140  0.002  0.986
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.98492152 -0.38432592 -0.07790437  0.12142830 51.89147765 
## 
## Number of Observations: 22255
## Number of Groups: 18

The model shows that the average income for men is 85522.86 while women make 18177.67 less than men.

Model Selection

Using the AIC we will be able to determine which is the best model.

AIC(cpooling, m1_lme, m2_lme)
##          df      AIC
## cpooling  5 571271.3
## m1_lme    6 568983.2
## m2_lme   15 568977.4

From this table, we can see that the third model (m2_lme - The Random Slope Model) is the best because it has the lowest AIC value. This shows that the model that takes gender and education level into account (rather than just one of the two) produces the most reliable information of the three.

Intra-Class Correlation

This table will determine if income varies by education level.

m3_lme <- lme(HINCP ~ 1, random = ~1|HHGRAD, data = ahs1, method = "ML")
summary(m3_lme)
## Linear mixed-effects model fit by maximum likelihood
##  Data: ahs1 
##        AIC      BIC    logLik
##   569261.1 569285.1 -284627.5
## 
## Random effects:
##  Formula: ~1 | HHGRAD
##         (Intercept) Residual
## StdDev:    43366.46 86541.26
## 
## Fixed effects: HINCP ~ 1 
##                Value Std.Error    DF  t-value p-value
## (Intercept) 59923.11  10280.53 22237 5.828794       0
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.84903970 -0.40054688 -0.07598248  0.13013712 51.37574968 
## 
## Number of Observations: 22255
## Number of Groups: 18
ICC <- 43366.46/(43366.46 + 86541.26)
ICC
## [1] 0.3338251

The ICC value calculated shows that about 33% of the variation among income levels is due to the varying education levels of the respondents.

Confidence Intervals

intervals(m3_lme)
## Approximate 95% confidence intervals
## 
##  Fixed effects:
##                lower     est.    upper
## (Intercept) 39772.99 59923.11 80073.22
## attr(,"label")
## [1] "Fixed effects:"
## 
##  Random Effects:
##   Level: HHGRAD 
##                    lower     est.    upper
## sd((Intercept)) 31161.97 43366.46 60350.79
## 
##  Within-group standard error:
##    lower     est.    upper 
## 85740.69 86541.26 87349.31

The confidence interval above shows that the larger percentage of income variaton is based upon gender rather than education levels.

Overall, from this analysis I am able to determine that my hypothesis was correct. There is a great deal of variation among income based on the respondents education level. There is also a large difference in income amounts based on gender, which is unsurprising due to the gender pay gap in the United States.