Assignment 8

Statistical models
	Model 1	Model 2
(Intercept)	3.31^***	4.89^***
	(0.16)	(0.07)
sex	1.33^***	0.84^***
	(0.13)	(0.06)
texp	0.11^***
	(0.01)
sex:texp	-0.03^***
	(0.01)
ctexp		0.11^***
		(0.01)
sex:ctexp		-0.03^***
		(0.01)
AIC	4261.85	4261.85
BIC	4306.66	4306.66
Log Likelihood	-2122.92	-2122.92
Num. obs.	2000	2000
Num. groups	100	100
p < 0.001, p < 0.01, p < 0.05

Part 2 - “Income distribution of White poulation by State”.

Introduction.

I extracted data from American Census Survey (ACS) 2015 from Social Explorer. The data includes of demographic information such as race, age, sex, geographic information such as location, and other information such as income, employment status and marital status of people surveyed.

For this assignment, I will be conducting my analysis to explore “Income distribution of White poulation by State”.

census <- read.csv("/users/sharanbhamra/Desktop/SOC 712/R11646581_SL050.csv") 
head(census)

Renaming and Selecting variables.

c1 <- census %>%
  rename(County_Name = Geo_NAME,
        State = Geo_STUSAB,
        Median_income = SE_T057_001, 
        White_Population= PCT_SE_T013_002,
        Male = PCT_SE_T004_002, 
        Female = PCT_SE_T004_003) %>%
dplyr::select(County_Name, State, Median_income, White_Population, Male, Female)
length(unique(c1$State))

## [1] 52

ggplot(data = c1, aes(x = Median_income)) + geom_histogram( fill = "blue")

From the above histogram, most population earn a median income between $40,000 to $60,000 in all 52 states. A few percentage of the population earns income above $125,000, which will be considered as outliers in this study.

Note: Median income has been adjusted for 2015 inflation adjusted dollars.

Complete Pooling model

c2<-lm(Median_income~White_Population, data=c1)
htmlreg(list(c2))

Statistical models
	Model 1
(Intercept)	34852.21^***
	(1136.77)
White_Population	135.53^***
	(13.40)
R²	0.03
Adj. R²	0.03
Num. obs.	3220
RMSE	12711.00
p < 0.001, p < 0.01, p < 0.05

Complete pooling model assumes that each unit is the same ignoring the group. In the above case, I ran a simple linear model of predicting median income of all states equally if there is an increase in the white population. From the above, I can say that one percent increase in White population in all states, median Income is likely to increase by $135.5. This model is not appropriate for us to predict the median income for all states equally, this is because:

Every state has a different minimum wage rate.
According to Nasdaq, Alaska, Florida, Nevada, South Dakota, Texas, Washington, and Wyoming. Tennessee and New Hampshire have no income tax for any residents.
Income tax and standard of living expenses across all states is different which may also influence the income earned by people. For example, the standard of living in New York is expensive as compared to standard of living in Tennessee, as well as income tax in New York is high as compared to Tennessee that has no income tax.

No-pooling Model

No-pooling model is where you conduct 52 regression models for each one State. This method can be cumbersome, but with the dplyr package, it is possible to run all 52 regression models.

Intercept Model

dcoef <- c1 %>% 
    group_by(State) %>% 
    do(mod = lm(Median_income~White_Population, data = .))
coef <- dcoef %>% do(data.frame(intc = coef(.$mod)[1]))
ggplot(coef, aes(x = intc)) + geom_histogram(fill="red")

The above histogram of the intercept model suggests the average start income of the white population in 52 states.

Slope Model: White Population Difference

dcoef <- c1 %>% 
    group_by(State) %>% 
    do(mod = lm(Median_income~White_Population, data = .))
coef <- dcoef %>% do(data.frame(whitec = coef(.$mod)[2]))
ggplot(coef, aes(x = whitec)) + geom_histogram(fill="green")

The above slope model histogram gives the variation of the median income across the States.

Partial Pooling

Random Intercept

The response variable, the median income and I am attempting to explain part of the variation in median income through fitting white population as a fixed effect. But the response variable has some residual variation (i.e. unexplained variation) associated with States. By using random effects, I am modeling for that unexplained variation through variance.

c1_lme <- lme(Median_income~White_Population, data = c1, random = ~1|State, method = "ML")
htmlreg(list(c1_lme))

Statistical models
	Model 1
(Intercept)	41456.15^***
	(1838.33)
White_Population	97.32^***
	(14.15)
AIC	68716.10
BIC	68740.41
Log Likelihood	-34354.05
Num. obs.	3220
Num. groups	52
p < 0.001, p < 0.01, p < 0.05

plot(c1_lme)

In the above random intercept model, I am examining a random effect for each State and this effect is nested within the intercept. Therefore, instead of running 52 different intercepts for 52 states, I am fitting one intercept line for all 52 States. The model above suggests, the fixed affect among all 52 States, the median income of the white population in each State starts at $41456.15, for every one percent increase in white population in each State, median income will increase by $97.32 for each State. As you can also notice from the plot, majority of the white populations median income for each State falls between $40,000 and $55,000.

Random Slope

c2_lme<- lme(Median_income~White_Population, data = c1, random = ~ White_Population|State, method = "ML")
htmlreg(list(c2_lme))

Statistical models
	Model 1
(Intercept)	47100.51^***
	(3250.07)
White_Population	35.18
	(35.75)
AIC	68635.78
BIC	68672.24
Log Likelihood	-34311.89
Num. obs.	3220
Num. groups	52
p < 0.001, p < 0.01, p < 0.05

plot(c2_lme)

In the above slope model, I am determining if there is a variation of percentage white population by State. The model above does not give the same results as the intercept model. The slope model suggests that there is a variation of the white population by State.

Model Selection

htmlreg(list(c1_lme,c2_lme ))

Statistical models
	Model 1	Model 2
(Intercept)	41456.15^***	47100.51^***
	(1838.33)	(3250.07)
White_Population	97.32^***	35.18
	(14.15)	(35.75)
AIC	68716.10	68635.78
BIC	68740.41	68672.24
Log Likelihood	-34354.05	-34311.89
Num. obs.	3220	3220
Num. groups	52	52
p < 0.001, p < 0.01, p < 0.05

Although, the slope model has the slightest lower AIC than the intercept model. Therefore, I can assume that both models are efficient enough to know the variation of median income of the white population in each State.

Mertools: Random Effect

m1 <- lme4::lmer (Median_income~White_Population + (1|State), data=c1)
htmlreg(list(m1))

Statistical models
	Model 1
(Intercept)	41460.52^***
	(1849.70)
White_Population	97.36^***
	(14.15)
AIC	68692.59
BIC	68716.90
Log Likelihood	-34342.29
Num. obs.	3220
Num. groups: State	52
Var: State (Intercept)	103505312.11
Var: Residual	101879276.09
p < 0.001, p < 0.01, p < 0.05

reEx <- REsim(m1)
p1 <- plotREsim(reEx)
p1

Using the Mertools package, random effect simulations were conducted. From the above plot, there is a variation of income for white population in different States.

Conclusion

The majority of the U.S. population is white, from the above analysis concludes that on average the median income of a person being White in any State in the U.S. ranges from $40,000 to $60,000. Further analysis can be conducted, if one wants to look at the difference between the median income of White population by Gender in every State. Also one can also ran a comparison of the median income between all the races in all States.