SOC 712

04/08/2018

Part 1

Model Comparison of Lecture #8 Slide 30

In the below table, both models tell the same story and both have the same AIC and BIC results. The only difference is that in Model 2, the teacher experience variable “texp” was transformed to create a new variable for mean of teachers experience “ctexp” by deducting the constant values of “texp” by " mean of texp“. Because all the values were reduced to the equal amount of the difference, when running the analysis, the variable”ctexp" did not have a major effect on the dependent variable.

Statistical models
Model 1 Model 2
(Intercept) 3.31*** 4.89***
(0.16) (0.07)
sex 1.33*** 0.84***
(0.13) (0.06)
texp 0.11***
(0.01)
sex:texp -0.03***
(0.01)
ctexp 0.11***
(0.01)
sex:ctexp -0.03***
(0.01)
AIC 4261.85 4261.85
BIC 4306.66 4306.66
Log Likelihood -2122.92 -2122.92
Num. obs. 2000 2000
Num. groups 100 100
p < 0.001, p < 0.01, p < 0.05

Part 2 - “Income distribution of White poulation by State”.

Introduction.

I extracted data from American Census Survey (ACS) 2015 from Social Explorer. The data includes of demographic information such as race, age, sex, geographic information such as location, and other information such as income, employment status and marital status of people surveyed.

For this assignment, I will be conducting my analysis to explore “Income distribution of White poulation by State”.

census <- read.csv("/users/sharanbhamra/Desktop/SOC 712/R11646581_SL050.csv") 
head(census)
Renaming and Selecting variables.
c1 <- census %>%
  rename(County_Name = Geo_NAME,
        State = Geo_STUSAB,
        Median_income = SE_T057_001, 
        White_Population= PCT_SE_T013_002,
        Male = PCT_SE_T004_002, 
        Female = PCT_SE_T004_003) %>%
dplyr::select(County_Name, State, Median_income, White_Population, Male, Female)
length(unique(c1$State))
## [1] 52
ggplot(data = c1, aes(x = Median_income)) + geom_histogram( fill = "blue") 

From the above histogram, most population earn a median income between $40,000 to $60,000 in all 52 states. A few percentage of the population earns income above $125,000, which will be considered as outliers in this study.

Note: Median income has been adjusted for 2015 inflation adjusted dollars.

Complete Pooling model

c2<-lm(Median_income~White_Population, data=c1)
htmlreg(list(c2))
Statistical models
Model 1
(Intercept) 34852.21***
(1136.77)
White_Population 135.53***
(13.40)
R2 0.03
Adj. R2 0.03
Num. obs. 3220
RMSE 12711.00
p < 0.001, p < 0.01, p < 0.05

Complete pooling model assumes that each unit is the same ignoring the group. In the above case, I ran a simple linear model of predicting median income of all states equally if there is an increase in the white population. From the above, I can say that one percent increase in White population in all states, median Income is likely to increase by $135.5. This model is not appropriate for us to predict the median income for all states equally, this is because:

  • Every state has a different minimum wage rate.
  • According to Nasdaq, Alaska, Florida, Nevada, South Dakota, Texas, Washington, and Wyoming. Tennessee and New Hampshire have no income tax for any residents.
  • Income tax and standard of living expenses across all states is different which may also influence the income earned by people. For example, the standard of living in New York is expensive as compared to standard of living in Tennessee, as well as income tax in New York is high as compared to Tennessee that has no income tax.

No-pooling Model

No-pooling model is where you conduct 52 regression models for each one State. This method can be cumbersome, but with the dplyr package, it is possible to run all 52 regression models.

Intercept Model
dcoef <- c1 %>% 
    group_by(State) %>% 
    do(mod = lm(Median_income~White_Population, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram(fill="red")

The above histogram of the intercept model suggests the average start income of the white population in 52 states.

Slope Model: White Population Difference
dcoef <- c1 %>% 
    group_by(State) %>% 
    do(mod = lm(Median_income~White_Population, data = .))
coef <- dcoef %>% do(data.frame(whitec = coef(.$mod)[2]))
ggplot(coef, aes(x = whitec)) + geom_histogram(fill="green")

The above slope model histogram gives the variation of the median income across the States.

Partial Pooling

Random Intercept

The response variable, the median income and I am attempting to explain part of the variation in median income through fitting white population as a fixed effect. But the response variable has some residual variation (i.e. unexplained variation) associated with States. By using random effects, I am modeling for that unexplained variation through variance.

c1_lme <- lme(Median_income~White_Population, data = c1, random = ~1|State, method = "ML")
htmlreg(list(c1_lme))
Statistical models
Model 1
(Intercept) 41456.15***
(1838.33)
White_Population 97.32***
(14.15)
AIC 68716.10
BIC 68740.41
Log Likelihood -34354.05
Num. obs. 3220
Num. groups 52
p < 0.001, p < 0.01, p < 0.05
plot(c1_lme)

In the above random intercept model, I am examining a random effect for each State and this effect is nested within the intercept. Therefore, instead of running 52 different intercepts for 52 states, I am fitting one intercept line for all 52 States. The model above suggests, the fixed affect among all 52 States, the median income of the white population in each State starts at $41456.15, for every one percent increase in white population in each State, median income will increase by $97.32 for each State. As you can also notice from the plot, majority of the white populations median income for each State falls between $40,000 and $55,000.

Random Slope
c2_lme<- lme(Median_income~White_Population, data = c1, random = ~ White_Population|State, method = "ML")
htmlreg(list(c2_lme))
Statistical models
Model 1
(Intercept) 47100.51***
(3250.07)
White_Population 35.18
(35.75)
AIC 68635.78
BIC 68672.24
Log Likelihood -34311.89
Num. obs. 3220
Num. groups 52
p < 0.001, p < 0.01, p < 0.05
plot(c2_lme)

In the above slope model, I am determining if there is a variation of percentage white population by State. The model above does not give the same results as the intercept model. The slope model suggests that there is a variation of the white population by State.

Model Selection

htmlreg(list(c1_lme,c2_lme ))
Statistical models
Model 1 Model 2
(Intercept) 41456.15*** 47100.51***
(1838.33) (3250.07)
White_Population 97.32*** 35.18
(14.15) (35.75)
AIC 68716.10 68635.78
BIC 68740.41 68672.24
Log Likelihood -34354.05 -34311.89
Num. obs. 3220 3220
Num. groups 52 52
p < 0.001, p < 0.01, p < 0.05

Although, the slope model has the slightest lower AIC than the intercept model. Therefore, I can assume that both models are efficient enough to know the variation of median income of the white population in each State.

Mertools: Random Effect

m1 <- lme4::lmer (Median_income~White_Population + (1|State), data=c1)
htmlreg(list(m1))
Statistical models
Model 1
(Intercept) 41460.52***
(1849.70)
White_Population 97.36***
(14.15)
AIC 68692.59
BIC 68716.90
Log Likelihood -34342.29
Num. obs. 3220
Num. groups: State 52
Var: State (Intercept) 103505312.11
Var: Residual 101879276.09
p < 0.001, p < 0.01, p < 0.05
reEx <- REsim(m1)
p1 <- plotREsim(reEx)
p1

Using the Mertools package, random effect simulations were conducted. From the above plot, there is a variation of income for white population in different States.

Conclusion

The majority of the U.S. population is white, from the above analysis concludes that on average the median income of a person being White in any State in the U.S. ranges from $40,000 to $60,000. Further analysis can be conducted, if one wants to look at the difference between the median income of White population by Gender in every State. Also one can also ran a comparison of the median income between all the races in all States.