Using Zelig, I will use extracted data from the 1994 US Census Data set to estimate several statistical models and obtain quantities of interest through simulation and present the results graphically. The count data model will be used to look at the effects of age and years of education on the number of hours worked weekly. Finally, the Gamma model will be used to look at whether someone earning more/less than 50K per year influences the number of hours worked weekly.
library(Zelig)
library(ZeligChoice)
library(tidyverse)
library(tidyr)
library(readr)
library(dplyr)
library(visreg)
library(sjmisc)
library(ggplot2)
library(radiant.data)
library(pander)
In this step I loaded the required packages to analyze and visualize the data.
Income<- read_csv("C:/Users/Papa/Desktop/Soc 712 -R/adult.csv", col_names = TRUE)
head(Income)
In this step, I loaded my data set and saved it as “Income”. Currently, we only see the first 6 observations.
Income2 <- Income %>% mutate(newincome = ifelse(income == ">50K",1,0))
head(Income2)
I generated a new variable called newincome from the existing income variable with 1 being earning more than 50K per year and 0 being earning less than or equal to 50K per year.
The count data model is good for depedent variables that are counted and whose values are greater than 0. This is good model for looking at the number of hours worked weekly (which is a continous variable)
hrsperweek <- zelig(hours.per.week ~ age + education.num, model = "poisson", data = Income, cite = F)
In this step, we first estimated the count data model.
The second step is to set up a counterfactual (which is the quantity of interest) and run the simulation. For this step, I used two counterfactuals, age and number of years of education.
age.range = min(Income$age):max(Income$age)
x <- setx(hrsperweek, age = age.range)
s <- sim(hrsperweek, x = x)
ci.plot(s)
From this graph we see a positive relationship between number of hours worked weekly and age. As age increases the number of hours worked weekly increases also.
x1 <- setx(hrsperweek, age = mean(Income$age))
x2 <- setx(hrsperweek, age = mean(Income$age) + sd(Income$age))
s <- sim(hrsperweek, x = x1, x1 = x2)
summary(s)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 40.38727 0.03524233 40.3887 40.31339 40.45192
pv
mean sd 50% 2.5% 97.5%
[1,] 40.531 6.40344 40 28 53
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 41.16917 0.05017619 41.17047 41.07354 41.26561
pv
mean sd 50% 2.5% 97.5%
[1,] 41.339 6.568807 42 29 54
fd
mean sd 50% 2.5% 97.5%
[1,] 0.7819002 0.03618149 0.7810677 0.7086848 0.8526171
In this step we look at the first difference (which is the difference between the expected values for x1 and x2) between average age and the average age with one added standard deviation. We see that for 1 standard deviation increase in age we see an 0.8 hour increase in the number of hours worked weekly.
xl <- setx(hrsperweek, age = quantile(Income$age, .25))
xl1 <- setx(hrsperweek, age = quantile(Income$age, .25) + sd(Income$age))
sl <- sim(hrsperweek, x = xl, x1 = xl1)
fdl <- sl$get_qi(xvalue="x1", qi="fd")
xm <- setx(hrsperweek, age = quantile(Income$age, .50))
xm1 <- setx(hrsperweek, age = quantile(Income$age, .50) + sd(Income$age))
sm <- sim(hrsperweek, x = xm, x1 = xm1)
fdm <- sm$get_qi(xvalue="x1", qi="fd")
xh <- setx(hrsperweek, age = quantile(Income$age, .75))
xh1 <- setx(hrsperweek, age = quantile(Income$age, .75) + sd(Income$age))
sh <- sim(hrsperweek, x = xh, x1 = xh1)
fdh <- sh$get_qi(xvalue="x1", qi="fd")
In this step we created a counterfactual for age for each quartile. For each quartile we also created a line in which we added 1 standard deviation to the average age.
d <- as.data.frame(cbind(fdl, fdm, fdh))
head(d)
tidd <- d %>%
gather(quantile, dif, 1:3)
head(tidd)
tidd %>%
group_by(quantile) %>%
summarise(mean = mean(dif), sd = sd(dif))
In this step we looked at the differences of the effect of age on hours worked weekly for earch of the quartiles. We see very small differences in number of hours worked (all less than 1 hour).
ggplot(tidd, aes(dif)) + geom_histogram() + facet_grid(~ quantile)
Using ggplot2, I visualized the differences in hours worked weekly by the 3 quartiles and we see that for the 25% quartile the distribution is normal whereas for the 50% and 75% quartiles we have a bimodal distribution, with the 75% quartile being slighly skewed to the right. Overall, we don’t see large differences in the effect of age on the number of hours worked weekly.
yrs.range = min(Income$education.num):max(Income$education.num)
x <- setx(hrsperweek, education.num = yrs.range)
s <- sim(hrsperweek, x = x)
ci.plot(s)
From this graph we see a positive relationship between number of hours worked weekly and years of education. As years of education increases the number of hours worked weekly increases also.
x1 <- setx(hrsperweek, education.num = mean(Income$education.num))
x2 <- setx(hrsperweek, education.num = mean(Income$education.num) + sd(Income$education.num))
s <- sim(hrsperweek, x = x1, x1 = x2)
summary(s)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 40.3884 0.0357912 40.38829 40.31494 40.45821
pv
mean sd 50% 2.5% 97.5%
[1,] 40.509 6.281192 40 29 54
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 42.23348 0.05061428 42.23464 42.13361 42.3334
pv
mean sd 50% 2.5% 97.5%
[1,] 42.366 6.516173 42 30 55
fd
mean sd 50% 2.5% 97.5%
[1,] 1.845077 0.03576884 1.844289 1.775922 1.913383
In this step we look at the first difference (which is the difference between the expected values for x1 and x2) between average years of education and the average years of education with one added standard deviation. We see that for 1 standard deviation increase in age we see an 1.85 hour increase in the number of hours worked weekly. The differences in hours worked weekly seem to be larger with respect to years of education.
xl <- setx(hrsperweek, education.num = quantile(Income$education.num, .25))
xl1 <- setx(hrsperweek, education.num = quantile(Income$education.num, .25) + sd(Income$education.num))
sl <- sim(hrsperweek, x = xl, x1 = xl1)
fdl <- sl$get_qi(xvalue="x1", qi="fd")
xm <- setx(hrsperweek, education.num = quantile(Income$education.num, .50))
xm1 <- setx(hrsperweek, education.num = quantile(Income$education.num, .50) + sd(Income$education.num))
sm <- sim(hrsperweek, x = xm, x1 = xm1)
fdm <- sm$get_qi(xvalue="x1", qi="fd")
xh <- setx(hrsperweek, education.num = quantile(Income$education.num, .75))
xh1 <- setx(hrsperweek, education.num = quantile(Income$education.num, .75) + sd(Income$education.num))
sh <- sim(hrsperweek, x = xh, x1 = xh1)
fdh <- sh$get_qi(xvalue="x1", qi="fd")
In this step we created a counterfactual for years of education for each quartile. For each quartile we also created a line in which we added 1 standard deviation to the average years of education.
d <- as.data.frame(cbind(fdl, fdm, fdh))
head(d)
tidd <- d %>%
gather(quantile, dif, 1:3)
head(tidd)
tidd %>%
group_by(quantile) %>%
summarise(mean = mean(dif), sd = sd(dif))
In this step we looked at the differences of the effect of years of education on hours worked weekly for earch of the quartiles. We see some differences in number of hours worked (the third quartile shows almost a difference of 2 hours in hours worked weekly).
ggplot(tidd, aes(dif)) + geom_histogram() + facet_grid(~ quantile)
Using ggplot2 again, I visualized the differences in hours worked weekly by the 3 quartiles and we see that for the 25% quartile the distribution is postively skewed whereas for the 50% quartile we have a normal distribution. For the 75% quartile we have a negatively skewed distribution. Overall, we see some slightly larger differences in the effect of the number of years of education on the number of hours worked weekly.
Overall, there are slightly larger differences in hours worked weekly with respect to number years of education than with age.
The Gamma model is good for depedent variables that are continuous and whose values are greater than 0. In addition, there is no time event. This is good model for looking at the number of hours worked weekly (which is a continous variable) and a categorical variable such as the whether someone earns more or less than 50K.
hrsperweek.gamma <- zelig(hours.per.week ~ newincome + age, model = "gamma", data = Income2, cite = F)
summary(hrsperweek.gamma)
Model:
Call:
z5$zelig(formula = hours.per.week ~ newincome + age, data = Income2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.38639 -0.12348 0.02846 0.09233 1.11616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.026077638 0.000125844 207.223 < 0.0000000000000002
newincome -0.003688582 0.000092412 -39.915 < 0.0000000000000002
age -0.000008985 0.000003137 -2.864 0.00418
(Dispersion parameter for Gamma family taken to be 0.0903869)
Null deviance: 4151.5 on 32560 degrees of freedom
Residual deviance: 3998.8 on 32558 degrees of freedom
AIC: 261578
Number of Fisher Scoring iterations: 5
Next step: Use 'setx' method
In this step, we first estimated the Gamma model.
For this step I decided to use the income variable as the counterfactual value and then did a simulation.
x.less <- setx(hrsperweek.gamma, newincome = 0)
x.more <- setx(hrsperweek.gamma, newincome = 1)
hrs.out <- sim(hrsperweek.gamma, x = x.less, x1 = x.more)
summary(hrs.out)
sim x :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 38.86506 0.07185251 38.86515 38.71635 39.00946
pv
mean sd 50% 2.5% 97.5%
[1,] 39.17821 13.39876 38.51414 17.09025 71.11752
sim x1 :
-----
ev
mean sd 50% 2.5% 97.5%
[1,] 45.36663 0.1518994 45.36826 45.0815 45.67178
pv
mean sd 50% 2.5% 97.5%
[1,] 45.28354 15.67114 42.72768 20.18519 78.68954
fd
mean sd 50% 2.5% 97.5%
[1,] 6.501568 0.17322 6.498916 6.16782 6.833301
Looking at the first difference in hours worked weekly with respect to income, we see that earning more than 50K results in a 6.5 hours increase in the humber of hours worked compared to those that earn less than 50K.
plot(hrs.out)
Looking at the results graphically we see that the differences in the number of hours worked weekly for individuals earning more than 50K have a positively skewed distribution, whereas the he differences in the number of hours worked weekly for individuals earning less than 50K have a a normal distribution. The differences in hours worked weekly seemed to be large when looking at the first difference between those earning more than 50K.