For this presentation, I’ll be borrrowing extensively from Nick Eubank’s Computational Methods for Social Sciences.

Assertive programming, what is it?

A type of defensive programming.

Think of defensive programming like defensive driving. We go out of our way to position ourselves to respond well when mistakes (inevitably) happen. When coding, we often run into mistakes when we assume things about our code or data that do not actually hold. Consider the genes that have needed to be renamed because excel automatically converts them to dates.

When beginning an analysis, we have certain expectations on the data we’re looking at, whether they’re stated or not. ‘I thought I was dealing with percentages not decimals’, ‘I reverse coded my liberal-conservative scale’, ‘Why are there 50 observations for the same county when my analysis is at the county level?’, ‘Did we get a reasonably close number of treatment/control observations?’, ‘we have how many NAs?’ type questions are vitally important, but may not rise to the level of awareness. Think: The stuff a manager in a private job won’t bother checking because they’re thinking about the important modeling questions, but would he huge if wrong.

To put it more concretely:

y <- 5

assert_that(y > 3)
## [1] TRUE
# commenting this out because the document wouldn't knit:
# assert_that(y < 4)

We can think of an assertion as a very confident statement about these types of foundational questions (or anything really). We’re basically telling the computer, ‘I’m about to make a statement that evaluates to TRUE or FALSE. If it’s true, cool don’t do anything. If it’s false, stop what you’re doing. Throw errors, flashing lights whatever. Do not continue until this evaluates to true.’ Assertive programming just that: feeding the computer statements that must evaluate to TRUE for the code to continue to run.

Consider the code below. What assumptions am I implicitly making about the underlying data in my code?

# making a toy dataframe... imagine this came from the real world lol. For pedagogical reasons, don't read too much into it 

set.seed(861)

df_from_the_world <- tibble(
  
  party = sample(c("R", "D"), 100, .5),
  resp_income = rnorm(100, mean = 50000, sd = 10000),
  msa_income =  rnorm(100, mean = 50, sd = 10),
  ideo_scale = if_else(
      party == "D",
      sample(c(1, 2, 3), 100, .3),
      sample(c(3, 4, 5), 100, .3),
  ),
  ft_tax_liberalization = if_else(
    party == "D",
    rnorm(100, mean = 65, sd = 20),
    rnorm(100, mean = 45, sd = 20)
  )
  
)


# let's say we're interested in feelings toward a proposal that increases taxes. Our main variable of interest is the respondent's RELATIVE income within their MSA. We may do something like this

df_from_the_world <- df_from_the_world %>% 
  mutate(
    relative_income = resp_income / msa_income
  )

lm(ft_tax_liberalization ~ relative_income + party + ideo_scale, data = df_from_the_world) %>% 
  modelsummary(coef_rename = c("relative_income" = "Personal Income / MSA Income", 
                               "partyR" = "PID, R = 1",
                               "ideo_scale" = "Ideology: 1 - V.Conservative, 5 = V.Liberal"),
               stars = TRUE)
tinytable_qay7t3pn7621n3b235sb
(1)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 76.363***
(8.083)
Personal Income / MSA Income -0.013+
(0.007)
PID, R = 1 -19.256**
(6.994)
Ideology: 1 - V.Conservative, 5 = V.Liberal 0.371
(2.579)
Num.Obs. 100
R2 0.193
R2 Adj. 0.167
AIC 897.1
BIC 910.1
Log.Lik. -443.548
F 7.629
RMSE 20.42

Not too bad… probably generally meets expectations right? Probably wouldn’t think too much more about it?

Tax is driven mostly by partisanship, weak (substantively and statistically) relationship with relative economic standing, weird coefficient on partisanship but voters are weird… Political Behavior here we come??

The analysis is completely off. The survey firm made a couple mistakes that we left undetected.

df_from_the_world %>% 
  select(
    resp_income,
    msa_income
    ) %>% 
  summary(
    fivenum()
    )
##   resp_income      msa_income   
##  Min.   :15656   Min.   :27.19  
##  1st Qu.:41988   1st Qu.:44.89  
##  Median :49013   Median :53.09  
##  Mean   :49080   Mean   :51.25  
##  3rd Qu.:55339   3rd Qu.:57.83  
##  Max.   :76408   Max.   :70.78
df_from_the_world %>% 
  group_by(party) %>% 
  summarise(mean_deo = mean(ideo_scale))
## # A tibble: 2 × 2
##   party mean_deo
##   <chr>    <dbl>
## 1 D         1.98
## 2 R         4.15

Uh oh. They put the respondent’s income in dollars and the MSA income in thousands. They also misdocumented the direction of the scale.

How could we use assertions to catch these errors?

library(assertthat)

assert_that(
  between(
    mean(df_from_the_world$resp_income),
    mean(df_from_the_world$msa_income) - 10000,
    mean(df_from_the_world$msa_income) + 10000
    )
  )

# annoyingly, we can't pass an assert_that() argument through a filter pipe
assert_that(mean(df_from_the_world$ideo_scale[df_from_the_world$party == "D"]) > 3)


# would need to do this
df_from_the_world %>%
  filter(party == "D") %>%
  {assert_that(mean(.$ideo_scale, na.rm = TRUE) > 3)}

Now, fixing…

df_from_the_world <- df_from_the_world %>% 
  mutate(
    correct_relative_income = resp_income / (msa_income * 1000)
  )

# assertion passes
assert_that(
  between(
    mean(df_from_the_world$correct_relative_income),
    mean(df_from_the_world$msa_income) - 10000,
    mean(df_from_the_world$msa_income) + 10000
    )
  )
## [1] TRUE
lm(ft_tax_liberalization ~ correct_relative_income + party + ideo_scale, data = df_from_the_world) %>% 
  modelsummary(
    coef_rename = c(
      "correct_relative_income" = "CORRECT Personal Income / MSA Income", 
      "partyR" = "PID, R = 1",
      "ideo_scale" = "Ideology: 1 - V.Liberal, 5 = V.Conservative"
      ),
    stars = TRUE)
tinytable_k9kqjna9z9n92nt6chuc
(1)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept) 76.363***
(8.083)
CORRECT Personal Income / MSA Income -12.873+
(6.888)
PID, R = 1 -19.256**
(6.994)
Ideology: 1 - V.Liberal, 5 = V.Conservative 0.371
(2.579)
Num.Obs. 100
R2 0.193
R2 Adj. 0.167
AIC 897.1
BIC 910.1
Log.Lik. -443.548
F 7.629
RMSE 20.42

Picture looks a bit different!

How does assertive programming help us?

It allows us to validate that data structures stay similar in repeated datasets

Imagine you cyclically analyze a dataset that is released quarterly. You will probably set about automating the analyses, using the same set of code (what’s the phrase that laziness is a sign of a good developer?) To ensure your code works on subsequent datasets, you’ll want to test the assumptions the code makes. Enter, assertions!

We can check for logical consistency between different fields

Perhaps we have a new dataset. We’re armed with a codebook and ready to start the analysis. Should we dive into our lm(y ~ x, data)? Probably not. We should check whether the data generating process is internally valid, especially if any hand coding/stitching data from multiple sources was involved.

  • We could use the first 2 digits of a zip code to identify a state, helping us validate geocoding.

  • People identified as Republicans in one question should tend to be conservative in another.

  • Perhaps there are legal thresholds that you can use to test the validity of fields. Apologies in going back to housing finance, but Fannie has eligibility thresholds on their behavior. If you find loans outside of these thresholds owned by Fannie, you probably have a data issue.

We can answer questions someone less familar with the data may have, before they ask the question

This is more of a ‘impress your boss/a reviewer/a professional adversary’s’ point. If you assert distributions/rates of categorical variables proactively, you’ll be able to confidently answer questions on the fly. Also, people are less likely to read their console/won’t run the code that produces it. Assertions, however, put the answers right in their face and don’t need to be run to be shown

  • How many missing observations are in the dataset?
  • What is the distribution of your continuous variables?
set.seed(871)

a_survey_people_didnt_like <- tibble(
  
  party = sample(
    c("R", "D", NA), 
    size = 100, 
    replace = TRUE,
    c(.2, .1, .7)
    ),
  
  ft_outparty = rnorm(
    100, 
    mean = 20,
    sd = 3
    )
  
)

# "?? This is a a blank line? What does it mean? Why do I care?" - A cranky boss
sum(is.na(a_survey_people_didnt_like$party))
## [1] 75
# "good catch... this is a lot of NAs!" - the same boss
assert_that(sum(is.na(a_survey_people_didnt_like$party)) == 75)
## [1] TRUE

Activity

I’m making a big dataframe below. Split the class into 2 groups. Group 1 writes a useful assertion. Group 2, manipulate the dataframe in some way that causes an issue, but the assertion doesn’t detect. Group 1 modify the assertion so that issue is detected, back to group 2, etc. Think of each cycle corresponding to a new survey release

activity_df <- tibble(
  party = sample(
    c("R", "D"), 
    size = 100, 
    .5
    ),
  
  state = sample(
    
    c("OH", "ID", "MI", "CA", "TX", "PA"),
    size = 100,
    replace = TRUE,
    prob = rep(100/6, 6)
  ),
  
  income = rnorm(100, mean = 50, sd = 4),
  
  ft_outparty = rnorm(
    100, 
    mean = 20,
    sd = 3
    ),
  
  ft_gov = rnorm(
    100, 
    mean = 20,
    sd = 3
    ),
  
  date_of_survey = sample(
    c("10/3/2023", "11/4/2023", "6/3/2022"),
    size = 100,
    replace = TRUE,
    prob = rep(100/3, 3)
    
  )
  
)