It sounds like folks are looking to get hands on with some real data. That’s awesome. We’ll step back a bit from the conceptual parts of statistical inference now and dig into the NLSS 2022 data set. This is a moderately complex data set, which means we’ll need to do some data wrangling before we can do any interesting analysis. Data wrangling is needed for pretty much anything but very simple data sets; this is the work that R is very good at.
I’m not going to explain every nuance of the syntax, but if you’re feeling bewildered, check out the book “R for Data Science” by Wickham et al. (link in the chat). Just walk through the examples and you’ll feel comfortable in no time. R will always do exactly what you tell it to do. That also means that everything has to be just right: if you’ve got a comma in the wrong place, you’ll likely get an error. If you’re not sure what an error message means, post a copy of your code and the error to the chat and somebody will help out.
Here, rather than try to explain everything, I’ll just highlight what each step is doing in general terms.
First, we’ll import our libraries. The rethinking package gives us quap for model fitting, and tidyverse is our library for data manipulation and visualization. The haven, scales, and modelr libraries are all made by the same team as the tidyverse, and they assist with data import, graph scales, and modeling, respectively.
The two other libraries, tidybayes and tidybayes.rethinking, are convenience libraries that bridge the data formats of quap and the tidyverse. In the regression example file, we did all the conversions manually, but the functions here are quicker and more convenient.
Down the road, we’ll stop using quap and start using Stan (MCMC) directly. For that, a new library called cmdstanr will be needed. For the moment, though, we don’t need it, so we’ll keep it commented out.
This next bit of code is a bit ugly, but it allows us to import all of the NLSS sections in one go. If you get an error, you may need to set your working directory in the Files pane (usually in the bottom-right quarter of RStudio).
options(scipen = 999, digits = 3)
library(rethinking)
library(tidyverse)
library(haven)
library(scales)
library(tidymodels)
library(tidybayes)
library(tidybayes.rethinking)
# library(cmdstanr)
# Loop over every Stata file in the NLSS data folder and load each one
# as a data frame named after the file (e.g. "S10.dta" becomes S10).
for (.file in list.files('nlss2022/data/stata format')) {
  .name  <- strsplit(.file, ".", fixed = TRUE)[[1]][1]    # file name without the extension
  .path  <- paste0('nlss2022/data/stata format/', .file)  # full path to the file
  .table <- read_dta(.path) |> as_factor()                # import, converting labelled columns to factors
  assign(.name, .table)                                   # store the table under that name
}
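Once the loop has run, each NLSS section should be available as a data frame named after its file. A quick sanity check (assuming the files follow the S01, S07, S10, ... naming used later in this document) might look like this:
ls(pattern = "^S[0-9]")  # list the section tables created by the loop
glimpse(S10)             # peek at the column names and types of Section 10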
To transform our raw data into a format we can use, we’re going to use “pipes” |> to connect commands together in a series. Pipes allow us to take the output of one command and feed it into the input of another command, like water flowing through a pipe.
# The pipe series:
1:10 |> mean() |> round()
## [1] 6
# is exactly the same as this:
numbers <- 1:10
average <- mean(numbers)
round(average)
## [1] 6
# and also this:
round( mean( 1:10 ) )
## [1] 6
We’ll use the pipe syntax a lot because it allows us to break our data transformations down into simple steps. If you’ve ever used a language like python, this is very similar to the method chaining you can do on objects.
For this first analysis, let’s compare income patterns for different types of jobs. This data comes from Section 10 of the NLSS 2022.
We’ll also need the “weight” table. This table isn’t part of the survey itself but rather provides metadata about each of the 9600 households in the survey. Most importantly, it includes the hhs_wt variable, which tells us how to weight each of the sampling units in our stratified sample.
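To get a feel for what a sampling weight does, here is a toy example (made-up numbers, not NLSS data): a household with a larger weight stands in for more households in the population, so it pulls any weighted summary toward its own value.
# Two sampled households; the first represents three times as many real households
incomes <- c(100000, 300000)
wts <- c(3, 1)
mean(incomes)               # unweighted mean: 200000
weighted.mean(incomes, wts) # weighted mean: 150000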
We’ll start our pipe with the ‘weight’ table as a base, join it with the ‘S10’ table, perform some manipulations, and store the results in the new variable “df”. This is a very common pattern for designing an analysis. Take a look at the example below to make sure you can see that pattern.
A few things to note:
- The right_join function combines our weight table and our S10 table. In order to do that, R needs to know how to identify different households so that data in different tables can be matched together. We haven’t specified that key, so right_join will look at both tables and try to find columns with matching names. In this case, it will find two: psu_number and hh_number. Together, these two numbers uniquely identify every one of the 9600 households in the NLSS.
- The mutate function allows us to change the values of a column or create new columns. Here, we create a new variable for each of the daily-, contract-, and salary-based work structures. All this data is in the NLSS, but it’s organized in different ways. Look at Section 10, questions 6-11 for more details.
- The pivot_longer function allows us to “rotate” columns into separate rows. This makes our data easier to plot.
- Missing values show up as NA. We’ll filter our data set, keeping only rows where the income column is not (!) NA.
- Finally, we relevel the work_type variable to change the order of the different factor values. Factor order can be important for certain kinds of analyses. If we had a factor variable for months, for example, it might be important to know that may comes between apr and jun. In this case, though, the only effect of releveling our factors is to change the order that the different values have in our graph key.
df <- weight |>
  right_join(S10) |>
  mutate(
    daily = q10_06 * (q10_07 + q10_08),
    contract = q10_11_a + q10_11_b,
    salary = q10_09_a + q10_09_b + q10_09_c + q10_09_d + q10_10
  ) |>
  pivot_longer(cols = c(daily, contract, salary), names_to = "work_type", values_to = "income") |>
  filter(!is.na(income)) |>
  mutate(
    work_type = fct_relevel(work_type, "daily", "contract", "salary")
  )
## Joining with `by = join_by(psu_number, hh_number)`
Next, let’s use the summary function to count the number of observations we have for each work type.
summary(df$work_type)
## daily contract salary
## 8872 953 3283
Finally, let’s graph our results. We’ll take the df variable we created, pipe it into ggplot with our aesthetic mappings, then add our various graphic layers.
Notice the aesthetic mappings we create: we’ll map income to the x-axis and work_type to color. For many graphs, we can also map a weight variable to account for our stratified sampling strategy. This is extremely convenient! It won’t always be this easy.
For our visualization, we’ll use a geom_density layer. The area under a density curve always equals 1. That means that each of our different work_types will take up the same amount of space on our graph. This makes it easy to compare income distributions for different work types, but it removes our ability to see how common a work_type is in our data set. If you want to see the distribution as absolute counts rather than standardized densities, try using geom_histogram or geom_area instead.
Also, note that we’re using a logarithmic scale for our x-axis. You can try switching to a continuous (linear) scale, but notice how difficult it becomes to read the data. Statistics by itself can’t really explain why income tends to present better on a log scale. The way that income distributions work is a social fact, not a mathematical fact. Nevertheless, it is very common in social science research to represent income on a log scale. The world would probably be a better place if it didn’t work that way, but it does.
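As a tiny numeric illustration (made-up values, not NLSS data), incomes that differ by a factor of ten land an equal distance apart on a log10 axis, which keeps the long right tail from squashing everyone else against the left edge:
incomes <- c(5000, 50000, 500000, 5000000)
log10(incomes)  # about 3.7, 4.7, 5.7, 6.7 -- evenly spaced on a log10 scale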
Finally, the labs layer lets us define the labels on our graph.
df |>
  ggplot(mapping = aes(x = income, color = work_type, weight = hhs_wt)) +
  geom_density() +
  # geom_histogram() +
  # geom_area(aes(y = ..count..), stat = "bin", fill = "transparent") +
  scale_x_log10(labels = unit_format(unit = "Rs", big.mark = ",")) +
  # scale_x_continuous(labels = unit_format(unit = "Rs.", big.mark = ",")) +
  labs(title = "Income Patterns", x = "Annual Income (NPR)")
## Warning in scale_x_log10(labels = unit_format(unit = "Rs", big.mark = ",")):
## log-10 transformation introduced infinite values.
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_density()`).
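Those warnings come from rows where the computed income is zero or negative: log10 of those values isn’t finite, so they can’t be placed on the log axis and are dropped. If you want to see how many rows are affected, a quick check (not part of the original pipeline) is:
df |> filter(income <= 0) |> count()  # rows that can't be shown on a log scale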
There’s a problem, though! Sometimes, people have more than one job. In the visualization above, individuals with more than one job show up multiple times, each time representing only part of their income. Let’s combine those rows so we get a picture of each individual’s total earnings. I won’t go through this line-by-line, but take a look and see if you can understand what’s going on.
The most notable feature is the group_by function. By itself, group_by doesn’t actually do anything, but it’s important because it tells the summarize function how to cluster data together. Here, most people have only one type of work, but a few have more than one. We want to group all rows belonging to the same person so we can merge them together using the summarize function. Once we’ve run summarize, we’ll drop the grouping with the .groups = 'drop' parameter. Groupings can be computationally expensive, so dropping them will make the rest of our operations run faster.
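Here’s a toy version of that pattern (made-up data, not the NLSS) so you can see what group_by and summarize do before reading the real pipeline:
# Three rows, two people: person "a" has two jobs, person "b" has one
tibble(person = c("a", "a", "b"), income = c(10, 20, 5)) |>
  group_by(person) |>
  summarize(income = sum(income), .groups = "drop")
# person "a" collapses to a single row with income 30; person "b" keeps income 5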
df2 <- df |>
  group_by(psu_number, hh_number, idcode) |>
  summarize(
    income = sum(income),
    work_type = if (n_distinct(work_type) > 1) "mixed" else unique(work_type),
    hhs_wt = unique(hhs_wt),
    .groups = "drop"
  ) |>
  mutate(
    work_type = fct_relevel(work_type, "daily", "contract", "salary", "mixed")
  )
summary(df2$work_type)
## daily contract salary mixed
## 5440 572 2956 493
df2 |>
  ggplot(mapping = aes(x = income, color = work_type, weight = hhs_wt)) +
  geom_density() +
  scale_x_log10(labels = unit_format(unit = "Rs.", big.mark = ",")) +
  labs(title = "Income Patterns", x = "Annual Income (NPR)")
## Warning in scale_x_log10(labels = unit_format(unit = "Rs.", big.mark = ",")):
## log-10 transformation introduced infinite values.
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_density()`).
# Convert NLSS education levels into an approximate number of years of schooling.
# "DK" (don't know) becomes NA; levels with no formal schooling map to 0.
edu_years <- function(level) {
  case_match(
    level,
    "DK" ~ NA,
    "Illiterate" ~ 0,
    "Literate (levelless)" ~ 0,
    "Pre-school/kindergarden" ~ 0,
    "Class 1" ~ 1,
    "Class 2" ~ 2,
    "Class 3" ~ 3,
    "Class 4" ~ 4,
    "Class 5" ~ 5,
    "Class 6" ~ 6,
    "Class 7" ~ 7,
    "Class 8" ~ 8,
    "Class 9" ~ 9,
    "Class 10" ~ 10,
    "SEE/SLC" ~ 10,
    "SLC" ~ 10,
    "Intermediate/Class 12" ~ 12,
    "Professional degree" ~ 12,
    "Bachelor level" ~ 15,
    "Master level or higher" ~ 17
  )
}
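As a quick sanity check (these calls aren’t part of the original analysis), the helper is vectorized, so we can feed it a few levels at once:
edu_years(c("Illiterate", "Class 5", "SLC", "Bachelor level", "DK"))
# should return 0, 5, 10, 15, NA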
df3 <- df2 |>
  # Section 1 supplies each person's age (q01_03) plus the parents' id codes and education levels
  left_join(S01, by = join_by(psu_number, hh_number, idcode)) |>
  # Join Section 7 twice to pull in the father's (q01_11) and mother's (q01_14) own education records
  left_join(S07, by = join_by(psu_number, hh_number, q01_11 == idcode)) |>
  left_join(S07, by = join_by(psu_number, hh_number, q01_14 == idcode)) |>
  mutate(
    income = ifelse(is.finite(income), income, 0),
    age = q01_03,
    # Prefer the education level reported in Section 1; otherwise fall back to the
    # parent's own Section 7 record, and finally to their literacy status
    edu.father = coalesce(
      q01_12,
      q07_06.x,
      ifelse(q07_02.x == "YES", "Literate (levelless)", NA),
      ifelse(q07_02.x == "NO", "Illiterate", NA)
    ) |> edu_years(),
    edu.mother = coalesce(
      q01_15,
      q07_06.y,
      ifelse(q07_02.y == "YES", "Literate (levelless)", NA),
      ifelse(q07_02.y == "NO", "Illiterate", NA)
    ) |> edu_years(),
  ) |>
  filter(!is.na(edu.mother) & !is.na(edu.father)) |>
  mutate(
    # Standardized versions of (log) income, age, and each parent's education
    I = log10(income + 1000) |> standardize(),
    A = standardize(age),
    EF = standardize(edu.father),
    EM = standardize(edu.mother),
  ) |>
  # Drop the raw questionnaire columns now that we've extracted what we need
  select(-starts_with("q"))
df3 |>
  ggplot(mapping = aes(x = EF, y = I, color = work_type)) +
  geom_jitter(shape = 1, width = .4)
model <- quap(
  alist(
    I ~ dnorm(mu, sigma),
    mu <- a + bEF * EF,
    a ~ dnorm(0, 2),
    bEF ~ dnorm(0, 2),
    sigma ~ dunif(0, 5)
  ),
  data = df3
)
precis(model)
## mean sd 5.5% 94.5%
## a -0.00000086 0.01027 -0.0164 0.0164
## bEF 0.16558435 0.01027 0.1492 0.1820
## sigma 0.98614503 0.00726 0.9745 0.9978
summary <- tibble(
  EF = seq(min(df3$EF), max(df3$EF), length.out = 51)
) |>
  add_linpred_draws(model) |>
  median_qi() |>
  select(
    EF,
    mu = .linpred,
    mu.lower = .lower,
    mu.upper = .upper
  ) |>
  add_predicted_draws(model) |>
  median_qi() |>
  select(
    EF, mu, mu.lower, mu.upper,
    pred.lower = .lower,
    pred.upper = .upper
  )
summary |>
  ggplot(aes(x = EF)) +
  geom_jitter(data = df3, aes(y = I, color = work_type), shape = 1, width = .4) +
  geom_ribbon(aes(ymin = pred.lower, ymax = pred.upper), color = "lightgray", alpha = 0.4) +
  geom_ribbon(aes(ymin = mu.lower, ymax = mu.upper), color = "darkgray", alpha = 0.4) +
  geom_line(aes(y = mu))
model <- quap(
  alist(
    I ~ dnorm(mu, sigma),
    mu <- a[work_type] + bEF[work_type] * EF,
    a[work_type] ~ dnorm(0, 2),
    bEF[work_type] ~ dnorm(0, 2),
    sigma ~ dunif(0, 5)
  ),
  data = df3
) |> recover_types(df3)
precis(model, depth=2)
## mean sd 5.5% 94.5%
## a[1] -0.4460 0.0123 -0.4656 -0.4263
## a[2] 0.2158 0.0364 0.1577 0.2740
## a[3] 0.6545 0.0168 0.6277 0.6812
## a[4] 0.3731 0.0388 0.3111 0.4352
## bEF[1] -0.0641 0.0158 -0.0894 -0.0388
## bEF[2] 0.0328 0.0388 -0.0292 0.0948
## bEF[3] 0.0680 0.0126 0.0478 0.0881
## bEF[4] 0.1222 0.0421 0.0549 0.1895
## sigma 0.8557 0.0063 0.8456 0.8658
summary <- expand_grid(
  work_type = unique(df3$work_type),
  EF = seq(min(df3$EF), max(df3$EF), length.out = 51),
) |>
  add_linpred_draws(model) |>
  median_qi() |>
  rename_with(\(x) paste0("mu", x), starts_with(".")) |>
  add_predicted_draws(model) |>
  median_qi() |>
  rename_with(\(x) paste0("pred", x), starts_with("."))
summary |>
  ggplot(aes(x = EF, color = work_type, fill = work_type)) +
  geom_jitter(data = df3, aes(y = I), shape = 1, width = .4) +
  # geom_ribbon(aes(ymin = pred.lower, ymax = pred.upper), alpha = 0.4) +
  geom_ribbon(aes(ymin = mu.lower, ymax = mu.upper), alpha = 0.4) +
  geom_line(aes(y = mu.linpred))
model <- quap(
  alist(
    I ~ dnorm(mu, sigma),
    mu <- a + bEF * EF + bEM * EM,
    a ~ dnorm(0, 2),
    bEF ~ dnorm(0, 2),
    bEM ~ dnorm(0, 2),
    sigma ~ dunif(0, 5)
  ),
  data = df3
)
precis(model)
## mean sd 5.5% 94.5%
## a -0.0000081 0.01026 -0.0164 0.0164
## bEF 0.1387186 0.01200 0.1195 0.1579
## bEM 0.0518357 0.01200 0.0327 0.0710
## sigma 0.9851453 0.00726 0.9735 0.9967