It sounds like folks are looking to get hands-on with some real data. That’s awesome. We’ll step back a bit from the conceptual parts of statistical inference now and dig into the NLSS 2022 (Nepal Living Standards Survey) data set. This is a moderately complex data set, which means we’ll need to do some data wrangling before we can do any interesting analysis. Data wrangling is necessary for pretty much anything beyond very simple data sets, and it’s exactly the kind of work R is good at.

I’m not going to explain every nuance of the syntax, but if you’re feeling bewildered, check out the book “R for Data Science” by Wickham et al., linked in the chat. Just walk through the examples and you’ll feel comfortable in no time. R will always do exactly what you tell it to do, which also means that everything has to be just right: if you’ve got a comma in the wrong place, you’ll likely get an error. If you’re not sure what an error message means, post a copy of your code and the error to the chat and somebody will help out.

Here, rather than try to explain everything, I’ll just highlight what each step is doing in general terms.

Setup

First, we’ll import our libraries. The rethinking package gives us quap for model fitting, and tidyverse is our library for data manipulation and visualization. The haven, scales, and modelr libraries are all made by the same people as tidyverse, and they provide data import, graph scaling, and modeling utilities, respectively.

The two other libraries, tidybayes and tidybayes.rethinking, are convenience libraries that bridge the data formats of quap and the tidyverse. In the regression example file, we did all the conversions manually; the functions here are quicker and more convenient.

Down the road, we’ll stop using quap and start using Stan (MCMC) directly. For that, we’ll need a new library called cmdstanr. For the moment, though, we don’t need it, so we’ll keep it commented out.

This next bit of code is a bit ugly, but it allows us to import all of the NLSS sections in one go. If you get an error, you may need to set your working directory in the Files pane (usually in the bottom-right quarter of RStudio).

options(scipen = 999, digits = 3)
library(rethinking)
library(tidyverse)
library(haven)
library(scales)

library(modelr)
library(tidybayes)
library(tidybayes.rethinking)

# library(cmdstanr)

for( .file in list.files('nlss2022/data/stata format') ) {
  # strip the extension to get the table name (e.g. "S10.dta" becomes "S10")
  .name <- strsplit(.file, ".", fixed=TRUE)[[1]][1]
  .path <- paste0('nlss2022/data/stata format/', .file)
  # read the Stata file and convert labelled columns to R factors
  .table <- read_dta(.path) |> as_factor()
  # store the table in a variable named after the file (S01, S07, S10, weight, ...)
  assign( .name , .table )
}

A note on “pipes”

To transform our raw data into a format we can use, we’re going to use “pipes” |> to connect commands together in a series. Pipes allow us to take the output of one command and feed it into the input of another command, like water flowing through a pipe.

# The pipe series:
1:10 |> mean() |> round()
## [1] 6
# is exactly the same as this:
numbers <- 1:10
average <- mean(numbers)
round(average)
## [1] 6
# and also this:
round( mean( 1:10 ) )
## [1] 6

We’ll use the pipe syntax a lot because it allows us to break our data transformations down into simple steps. If you’ve ever used a language like Python, this is very similar to the method chaining you can do on objects.

Jobs data

For this first analysis, let’s compare income patterns for different types of jobs. This data comes from Section 10 of the NLSS 2022.

We’ll also need the “weight” table. This table isn’t part of the survey itself; rather, it provides metadata about each of the 9,600 households in the survey. Most importantly, it includes the hhs_wt variable, which tells us how much weight to give each sampling unit in our stratified sample.
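
Here’s a toy illustration of what a sampling weight does (made-up numbers, not NLSS values): households that the design under-samples get larger weights and count for more in any weighted summary.

x <- c(1, 0, 1)        # an outcome for three hypothetical households
w <- c(2.0, 0.5, 1.5)  # hypothetical hhs_wt values
mean(x)                # unweighted: 2/3, about 0.667
weighted.mean(x, w)    # weighted: (2.0 + 1.5) / 4.0 = 0.875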

We’ll start our pipe with the ‘weight’ table as a base, join it with the ‘S10’ table, perform some manipulations, and store the results in the new variable “df”. This is a very common pattern for designing an analysis. Take a look at the example below to make sure you can see that pattern.

A few things to note: right_join keeps every row of S10 and attaches the matching household weights; mutate combines the questionnaire items into three income measures (daily, contract, and salary); pivot_longer reshapes those three columns into a single income column with a work_type label; filter drops the rows where no income of that type was recorded; and fct_relevel sets a sensible display order for the work types.

df <- weight |>
  right_join(S10) |> 
  mutate(
    daily = q10_06 * (q10_07 + q10_08),
    contract = q10_11_a + q10_11_b,
    salary = q10_09_a + q10_09_b + q10_09_c + q10_09_d + q10_10
  ) |>
  pivot_longer(cols=c(daily, contract, salary), names_to="work_type", values_to="income") |>
  filter(!is.na(income)) |>
  mutate(
    work_type = fct_relevel(work_type, "daily", "contract", "salary")
  )
## Joining with `by = join_by(psu_number, hh_number)`

Next, let’s use the summary function to count the number of observations we have for each work type.

summary(df$work_type)
##    daily contract   salary 
##     8872      953     3283

Finally, let’s graph our results. We’ll take the df variable we created, pipe it into ggplot with our aesthetic mappings, then add our various graphic layers.

Notice the aesthetic mappings we create. We’ll map income to the x-axis and work_type to color. For many graphs, we can also map a weight variable to account for our stratified sampling strategy. This is extremely convenient! It won’t always be this easy.

For our visualization, we’ll use a geom_density layer. The area under a density curve always equals 1. That means that each of our different work_types will take up the same amount of space on our graph. This makes it easy to compare income distributions for different work types, but it removes our ability to see how common a work_type is in our data set. If you want to see the distribution as absolute counts rather than standardized densities, try using geom_histogram or geom_area instead.

Also, note that we’re using a logarithmic scale for our x-axis. You can try switching to a continuous (linear) scale, but notice how difficult it becomes to read the data. Statistics by itself can’t really explain why income tends to present better on a log scale: the way income distributions work is a social fact, not a mathematical fact. Nevertheless, it is very common in social science research to represent income on a log scale. The world would probably be a better place if it didn’t work that way, but it does.

Finally, the labs layer lets us define the labels on our graph.

df |>
  ggplot(mapping=aes(x=income, color=work_type, weight=hhs_wt)) +
    geom_density() +
    # geom_histogram() +
    # geom_area(aes(y = after_stat(count)), stat = "bin", fill="transparent") +
    scale_x_log10(labels=unit_format(unit="Rs", big.mark=",")) +
    # scale_x_continuous(labels=unit_format(unit="Rs.", big.mark=",")) +
    labs(title="Income Patterns", x = "Annual Income (NPR)")
## Warning in scale_x_log10(labels = unit_format(unit = "Rs", big.mark = ",")):
## log-10 transformation introduced infinite values.
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_density()`).

Fixing our data

There’s a problem, though! Sometimes people have more than one job. In the visualization above, individuals with more than one job show up multiple times, each time representing only part of their income. Let’s combine those rows so we get a picture of each individual’s total earnings. I won’t go through this line-by-line, but take a look and see if you can understand what’s going on.

The most notable feature is the group_by function. By itself, group_by doesn’t actually do anything, but it’s important because it tells the summarize function how to cluster rows together. Here, most people have only one type of work, but a few have more than one; we want to group all rows belonging to the same person so summarize can merge them. Once we’ve run summarize, we drop the grouping with the .groups = "drop" argument: grouped operations can be computationally expensive, so dropping the groups makes the rest of our operations run faster.
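
If this pattern is new to you, here’s a minimal toy example (made-up data, not NLSS) showing the division of labor: group_by only attaches grouping metadata, and summarize does the actual collapsing, producing one row per group.

tibble(person = c("a", "a", "b"), income = c(10, 5, 7)) |>
  group_by(person) |>                                 # mark rows by person; no computation yet
  summarize(income = sum(income), .groups = "drop")   # person "a" becomes 15, person "b" stays 7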

df2 <- df |>
  group_by(psu_number, hh_number, idcode) |>   # one group per individual
  summarize(
    income = sum(income),     # total income across all of a person's jobs
    work_type = if (n_distinct(work_type) > 1) "mixed" else unique(work_type),
    hhs_wt = unique(hhs_wt),  # the weight is constant within a group
    .groups = "drop"
  ) |>
  mutate(
    work_type = fct_relevel(work_type, "daily", "contract", "salary", "mixed")
  )

summary(df2$work_type)
##    daily contract   salary    mixed 
##     5440      572     2956      493
df2 |>
  ggplot(mapping=aes(x=income, color=work_type, weight=hhs_wt)) +
    geom_density() +
    scale_x_log10(labels=unit_format(unit="Rs.", big.mark=",")) +
    labs(title="Income Patterns", x = "Annual Income (NPR)")
## Warning in scale_x_log10(labels = unit_format(unit = "Rs.", big.mark = ",")):
## log-10 transformation introduced infinite values.
## Warning: Removed 7 rows containing non-finite outside the scale range
## (`stat_density()`).

Exploring causes

What might explain these income differences? One candidate we can examine is parental education. To use it, we first need to convert the survey’s education levels into an approximate number of years of schooling. The helper function below does that mapping (the SEE/SLC exam is taken at the end of class 10, so it counts as ten years).

edu_years <- function(level) {
  # map the survey's education labels to approximate years of schooling;
  # the strings must match the factor levels in the data exactly
  case_match(
    level,
    "DK" ~ NA,  # "don't know"
    "Illiterate" ~ 0,
    "Literate (levelless)" ~ 0,
    "Pre-school/kindergarden" ~ 0,
    "Class 1" ~ 1,
    "Class 2" ~ 2,
    "Class 3" ~ 3,
    "Class 4" ~ 4,
    "Class 5" ~ 5,
    "Class 6" ~ 6,
    "Class 7" ~ 7,
    "Class 8" ~ 8,
    "Class 9" ~ 9,
    "Class 10" ~ 10,
    "SEE/SLC" ~ 10,
    "SLC" ~ 10,
    "Intermediate/Class 12" ~ 12,
    "Professional degree" ~ 12,
    "Bachelor level" ~ 15,
    "Master level or higher" ~ 17
  )
}
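
As a quick sanity check, you can call the helper on a few example levels:

edu_years(c("Class 5", "SEE/SLC", "Bachelor level", "DK"))
## [1]  5 10 15 NA

With the helper in place, we can assemble our modeling table. The pipeline below joins the household roster (S01) to pick up each worker’s age and their parents’ ID codes, joins the education section (S07) twice (the .x suffix marks the father’s columns and .y the mother’s), uses coalesce to take the first non-missing education report for each parent, and then standardizes everything for modeling.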

df3 <- df2 |>
  # add age and the parents' ID codes (and roster-reported education) from S01
  left_join(S01, by=join_by(psu_number, hh_number, idcode)) |>
  # join the education section twice: once for the father (suffix .x), once for the mother (.y)
  left_join(S07, by=join_by(psu_number, hh_number, q01_11 == idcode)) |>
  left_join(S07, by=join_by(psu_number, hh_number, q01_14 == idcode)) |>
  mutate(
    income = ifelse(is.finite(income), income, 0),
    age = q01_03,
    # take the first non-missing report of the father's education
    edu.father = coalesce(
      q01_12,
      q07_06.x,
      ifelse(q07_02.x == "YES", "Literate (levelless)", NA),
      ifelse(q07_02.x == "NO", "Illiterate", NA)
    ) |> edu_years(),
    edu.mother = coalesce(
      q01_15,
      q07_06.y,
      ifelse(q07_02.y == "YES", "Literate (levelless)", NA),
      ifelse(q07_02.y == "NO", "Illiterate", NA)
    ) |> edu_years(),
  ) |>
  filter(!is.na(edu.mother) & !is.na(edu.father)) |>
  mutate(
    # the +1000 offset keeps zero incomes finite on the log scale
    I = log10(income + 1000) |> standardize(),
    A = standardize(age),
    EF = standardize(edu.father),
    EM = standardize(edu.mother),
  ) |>
  select(-starts_with("q"))
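
If coalesce is unfamiliar: it works element by element, returning the first non-missing value across its arguments. A toy example:

coalesce(c(1, NA, NA), c(9, 2, NA), c(0, 0, 3))
## [1] 1 2 3

With df3 built, a quick scatter of standardized father’s education (EF) against standardized income (I), colored by work type, gives us a first look at the relationship: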

df3 |>
  ggplot(mapping = aes(x=EF, y=I, color=work_type)) +
    geom_jitter(shape=1, width=.4)

Simple linear regression with visuals
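
We’ll start simple: standardized income I is modeled as normal, with a mean that is linear in standardized father’s education EF. Because both variables are standardized, the Normal(0, 2) priors on the intercept and slope are weakly informative.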

model <- quap(
  alist(
    I ~ dnorm(mu , sigma) ,
    mu <- a + bEF * EF,
    a ~ dnorm(0 , 2) ,
    bEF ~ dnorm(0 , 2) ,
    sigma ~ dunif(0 , 5)
  ) ,
  data = df3
)
precis(model)
##              mean      sd    5.5%  94.5%
## a     -0.00000086 0.01027 -0.0164 0.0164
## bEF    0.16558435 0.01027  0.1492 0.1820
## sigma  0.98614503 0.00726  0.9745 0.9978
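
To draw the fit with uncertainty bands, we build a grid of 51 EF values, use add_linpred_draws to get posterior draws of the mean, and add_predicted_draws to get draws of simulated observations. After each step, median_qi collapses the draws into a median and a 95% interval (its default). The mu columns describe our uncertainty about the average income at each level of EF, while the pred columns describe where individual incomes should fall.
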
summary <- tibble(
  EF = seq(min(df3$EF), max(df3$EF), length.out=51)
) |>
  add_linpred_draws(model) |>
  median_qi() |>
  select(
    EF,
    mu = .linpred,
    mu.lower = .lower,
    mu.upper = .upper
  ) |>
  add_predicted_draws(model) |>
  median_qi() |>
  select(
    EF, mu, mu.lower, mu.upper,
    pred.lower = .lower,
    pred.upper = .upper
  )

summary |>
  ggplot(aes(x=EF)) +
    geom_jitter(data=df3, aes(y=I, color=work_type), shape=1, width=.4) +
    geom_ribbon(aes(ymin=pred.lower, ymax=pred.upper), fill="lightgray", alpha=0.4) +
    geom_ribbon(aes(ymin=mu.lower, ymax=mu.upper), fill="darkgray", alpha=0.4) +
    geom_line(aes(y=mu))

Complex linear regression with visuals
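
This time, each work type gets its own intercept and slope, using index-variable notation: a[work_type] means one a parameter per level of work_type. Note the recover_types call at the end: it stores the factor levels from df3 in the model object so that tidybayes can translate the indexed parameters a[1] through a[4] back into named work types when we extract draws.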

model <- quap(
  alist(
    I ~ dnorm(mu , sigma) ,
    mu <- a[work_type] + bEF[work_type] * EF,
    a[work_type] ~ dnorm(0 , 2) ,
    bEF[work_type] ~ dnorm(0 , 2) ,
    sigma ~ dunif(0 , 5)
  ) ,
  data = df3
) |> recover_types(df3)

precis(model, depth=2)
##           mean     sd    5.5%   94.5%
## a[1]   -0.4460 0.0123 -0.4656 -0.4263
## a[2]    0.2158 0.0364  0.1577  0.2740
## a[3]    0.6545 0.0168  0.6277  0.6812
## a[4]    0.3731 0.0388  0.3111  0.4352
## bEF[1] -0.0641 0.0158 -0.0894 -0.0388
## bEF[2]  0.0328 0.0388 -0.0292  0.0948
## bEF[3]  0.0680 0.0126  0.0478  0.0881
## bEF[4]  0.1222 0.0421  0.0549  0.1895
## sigma   0.8557 0.0063  0.8456  0.8658
summary <- expand_grid(
  work_type = unique(df3$work_type),
  EF = seq(min(df3$EF), max(df3$EF), length.out=51),
) |>
  add_linpred_draws(model)  |>
  median_qi() |>
  rename_with(\(x) paste0("mu", x), starts_with(".")) |>
  add_predicted_draws(model) |>
  median_qi() |>
  rename_with(\(x) paste0("pred", x), starts_with("."))

summary |>
  ggplot(aes(x=EF, color=work_type, fill=work_type)) +
    geom_jitter(data=df3, aes(y=I), shape=1, width=.4) +
    # geom_ribbon(aes(ymin=pred.lower, ymax=pred.upper), alpha=0.4) +
    geom_ribbon(aes(ymin=mu.lower, ymax=mu.upper), alpha=0.4) +
    geom_line(aes(y=mu.linpred))

Multiple regression
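
Finally, let’s put both parents’ education into the same model, adding mother’s education (EM) alongside father’s (EF).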

model <- quap(
  alist(
    I ~ dnorm(mu , sigma) ,
    mu <- a + bEF * EF + bEM * EM,
    a ~ dnorm(0 , 2) ,
    bEF ~ dnorm(0 , 2) ,
    bEM ~ dnorm(0 , 2) ,
    sigma ~ dunif(0 , 5)
  ) ,
  data = df3
)
precis(model)
##             mean      sd    5.5%  94.5%
## a     -0.0000081 0.01026 -0.0164 0.0164
## bEF    0.1387186 0.01200  0.1195 0.1579
## bEM    0.0518357 0.01200  0.0327 0.0710
## sigma  0.9851453 0.00726  0.9735 0.9967
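
Notice how the coefficient on father’s education changes: bEF was about 0.17 when it was the only predictor, but it drops to about 0.14 once bEM enters the model. The two education variables carry overlapping information about income.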