In the below table, both models tell the same story and both have the same AIC and BIC results. The only difference is that in Model 2, the teacher experience variable “texp” was transformed to create a new variable for mean of teachers experience “ctexp” by deducting the constant values of “texp” by " mean of texp“. Because all the values were reduced to the equal amount of the difference, when running the analysis, the variable”ctexp" did not have a major effect on the dependent variable.
Model 1 | Model 2 | ||
---|---|---|---|
(Intercept) | 3.31*** | 4.89*** | |
(0.16) | (0.07) | ||
sex | 1.33*** | 0.84*** | |
(0.13) | (0.06) | ||
texp | 0.11*** | ||
(0.01) | |||
sex:texp | -0.03*** | ||
(0.01) | |||
ctexp | 0.11*** | ||
(0.01) | |||
sex:ctexp | -0.03*** | ||
(0.01) | |||
AIC | 4261.85 | 4261.85 | |
BIC | 4306.66 | 4306.66 | |
Log Likelihood | -2122.92 | -2122.92 | |
Num. obs. | 2000 | 2000 | |
Num. groups | 100 | 100 | |
p < 0.001, p < 0.01, p < 0.05 |
I extracted data from American Census Survey (ACS) 2015 from Social Explorer. The data includes of demographic information such as race, age, sex, geographic information such as location, and other information such as income, employment status and marital status of people surveyed.
For this assignment, I will be conducting my analysis to explore “Income distribution of White poulation by State”.
census <- read.csv("/users/sharanbhamra/Desktop/SOC 712/R11646581_SL050.csv")
head(census)
c1 <- census %>%
rename(County_Name = Geo_NAME,
State = Geo_STUSAB,
Median_income = SE_T057_001,
White_Population= PCT_SE_T013_002,
Male = PCT_SE_T004_002,
Female = PCT_SE_T004_003) %>%
dplyr::select(County_Name, State, Median_income, White_Population, Male, Female)
length(unique(c1$State))
## [1] 52
ggplot(data = c1, aes(x = Median_income)) + geom_histogram( fill = "blue")
From the above histogram, most population earn a median income between $40,000 to $60,000 in all 52 states. A few percentage of the population earns income above $125,000, which will be considered as outliers in this study.
Note: Median income has been adjusted for 2015 inflation adjusted dollars.
c2<-lm(Median_income~White_Population, data=c1)
htmlreg(list(c2))
Model 1 | ||
---|---|---|
(Intercept) | 34852.21*** | |
(1136.77) | ||
White_Population | 135.53*** | |
(13.40) | ||
R2 | 0.03 | |
Adj. R2 | 0.03 | |
Num. obs. | 3220 | |
RMSE | 12711.00 | |
p < 0.001, p < 0.01, p < 0.05 |
Complete pooling model assumes that each unit is the same ignoring the group. In the above case, I ran a simple linear model of predicting median income of all states equally if there is an increase in the white population. From the above, I can say that one percent increase in White population in all states, median Income is likely to increase by $135.5. This model is not appropriate for us to predict the median income for all states equally, this is because:
No-pooling model is where you conduct 52 regression models for each one State. This method can be cumbersome, but with the dplyr package, it is possible to run all 52 regression models.
dcoef <- c1 %>%
group_by(State) %>%
do(mod = lm(Median_income~White_Population, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram(fill="red")
The above histogram of the intercept model suggests the average start income of the white population in 52 states.
dcoef <- c1 %>%
group_by(State) %>%
do(mod = lm(Median_income~White_Population, data = .))
coef <- dcoef %>% do(data.frame(whitec = coef(.$mod)[2]))
ggplot(coef, aes(x = whitec)) + geom_histogram(fill="green")
The above slope model histogram gives the variation of the median income across the States.
The response variable, the median income and I am attempting to explain part of the variation in median income through fitting white population as a fixed effect. But the response variable has some residual variation (i.e. unexplained variation) associated with States. By using random effects, I am modeling for that unexplained variation through variance.
c1_lme <- lme(Median_income~White_Population, data = c1, random = ~1|State, method = "ML")
htmlreg(list(c1_lme))
Model 1 | ||
---|---|---|
(Intercept) | 41456.15*** | |
(1838.33) | ||
White_Population | 97.32*** | |
(14.15) | ||
AIC | 68716.10 | |
BIC | 68740.41 | |
Log Likelihood | -34354.05 | |
Num. obs. | 3220 | |
Num. groups | 52 | |
p < 0.001, p < 0.01, p < 0.05 |
plot(c1_lme)
In the above random intercept model, I am examining a random effect for each State and this effect is nested within the intercept. Therefore, instead of running 52 different intercepts for 52 states, I am fitting one intercept line for all 52 States. The model above suggests, the fixed affect among all 52 States, the median income of the white population in each State starts at $41456.15, for every one percent increase in white population in each State, median income will increase by $97.32 for each State. As you can also notice from the plot, majority of the white populations median income for each State falls between $40,000 and $55,000.
c2_lme<- lme(Median_income~White_Population, data = c1, random = ~ White_Population|State, method = "ML")
htmlreg(list(c2_lme))
Model 1 | ||
---|---|---|
(Intercept) | 47100.51*** | |
(3250.07) | ||
White_Population | 35.18 | |
(35.75) | ||
AIC | 68635.78 | |
BIC | 68672.24 | |
Log Likelihood | -34311.89 | |
Num. obs. | 3220 | |
Num. groups | 52 | |
p < 0.001, p < 0.01, p < 0.05 |
plot(c2_lme)
In the above slope model, I am determining if there is a variation of percentage white population by State. The model above does not give the same results as the intercept model. The slope model suggests that there is a variation of the white population by State.
htmlreg(list(c1_lme,c2_lme ))
Model 1 | Model 2 | ||
---|---|---|---|
(Intercept) | 41456.15*** | 47100.51*** | |
(1838.33) | (3250.07) | ||
White_Population | 97.32*** | 35.18 | |
(14.15) | (35.75) | ||
AIC | 68716.10 | 68635.78 | |
BIC | 68740.41 | 68672.24 | |
Log Likelihood | -34354.05 | -34311.89 | |
Num. obs. | 3220 | 3220 | |
Num. groups | 52 | 52 | |
p < 0.001, p < 0.01, p < 0.05 |
Although, the slope model has the slightest lower AIC than the intercept model. Therefore, I can assume that both models are efficient enough to know the variation of median income of the white population in each State.
m1 <- lme4::lmer (Median_income~White_Population + (1|State), data=c1)
htmlreg(list(m1))
Model 1 | ||
---|---|---|
(Intercept) | 41460.52*** | |
(1849.70) | ||
White_Population | 97.36*** | |
(14.15) | ||
AIC | 68692.59 | |
BIC | 68716.90 | |
Log Likelihood | -34342.29 | |
Num. obs. | 3220 | |
Num. groups: State | 52 | |
Var: State (Intercept) | 103505312.11 | |
Var: Residual | 101879276.09 | |
p < 0.001, p < 0.01, p < 0.05 |
reEx <- REsim(m1)
p1 <- plotREsim(reEx)
p1
Using the Mertools package, random effect simulations were conducted. From the above plot, there is a variation of income for white population in different States.
The majority of the U.S. population is white, from the above analysis concludes that on average the median income of a person being White in any State in the U.S. ranges from $40,000 to $60,000. Further analysis can be conducted, if one wants to look at the difference between the median income of White population by Gender in every State. Also one can also ran a comparison of the median income between all the races in all States.