For this presentation, I’ll be borrrowing extensively from Nick Eubank’s Computational Methods for Social Sciences.
A type of defensive programming.
Think of defensive programming like defensive driving. We go out of our way to position ourselves to respond well when mistakes (inevitably) happen. When coding, we often run into mistakes when we assume things about our code or data that do not actually hold. Consider the genes that have needed to be renamed because excel automatically converts them to dates.
When beginning an analysis, we have certain expectations on the data we’re looking at, whether they’re stated or not. ‘I thought I was dealing with percentages not decimals’, ‘I reverse coded my liberal-conservative scale’, ‘Why are there 50 observations for the same county when my analysis is at the county level?’, ‘Did we get a reasonably close number of treatment/control observations?’, ‘we have how many NAs?’ type questions are vitally important, but may not rise to the level of awareness. Think: The stuff a manager in a private job won’t bother checking because they’re thinking about the important modeling questions, but would he huge if wrong.
To put it more concretely:
y <- 5
assert_that(y > 3)
## [1] TRUE
# commenting this out because the document wouldn't knit:
# assert_that(y < 4)
We can think of an assertion as a very confident statement about these types of foundational questions (or anything really). We’re basically telling the computer, ‘I’m about to make a statement that evaluates to TRUE or FALSE. If it’s true, cool don’t do anything. If it’s false, stop what you’re doing. Throw errors, flashing lights whatever. Do not continue until this evaluates to true.’ Assertive programming just that: feeding the computer statements that must evaluate to TRUE for the code to continue to run.
Consider the code below. What assumptions am I implicitly making about the underlying data in my code?
# making a toy dataframe... imagine this came from the real world lol. For pedagogical reasons, don't read too much into it
set.seed(861)
df_from_the_world <- tibble(
party = sample(c("R", "D"), 100, .5),
resp_income = rnorm(100, mean = 50000, sd = 10000),
msa_income = rnorm(100, mean = 50, sd = 10),
ideo_scale = if_else(
party == "D",
sample(c(1, 2, 3), 100, .3),
sample(c(3, 4, 5), 100, .3),
),
ft_tax_liberalization = if_else(
party == "D",
rnorm(100, mean = 65, sd = 20),
rnorm(100, mean = 45, sd = 20)
)
)
# let's say we're interested in feelings toward a proposal that increases taxes. Our main variable of interest is the respondent's RELATIVE income within their MSA. We may do something like this
df_from_the_world <- df_from_the_world %>%
mutate(
relative_income = resp_income / msa_income
)
lm(ft_tax_liberalization ~ relative_income + party + ideo_scale, data = df_from_the_world) %>%
modelsummary(coef_rename = c("relative_income" = "Personal Income / MSA Income",
"partyR" = "PID, R = 1",
"ideo_scale" = "Ideology: 1 - V.Conservative, 5 = V.Liberal"),
stars = TRUE)
| (1) | |
|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
| (Intercept) | 76.363*** |
| (8.083) | |
| Personal Income / MSA Income | -0.013+ |
| (0.007) | |
| PID, R = 1 | -19.256** |
| (6.994) | |
| Ideology: 1 - V.Conservative, 5 = V.Liberal | 0.371 |
| (2.579) | |
| Num.Obs. | 100 |
| R2 | 0.193 |
| R2 Adj. | 0.167 |
| AIC | 897.1 |
| BIC | 910.1 |
| Log.Lik. | -443.548 |
| F | 7.629 |
| RMSE | 20.42 |
Not too bad… probably generally meets expectations right? Probably wouldn’t think too much more about it?
Tax is driven mostly by partisanship, weak (substantively and statistically) relationship with relative economic standing, weird coefficient on partisanship but voters are weird… Political Behavior here we come??
The analysis is completely off. The survey firm made a couple mistakes that we left undetected.
df_from_the_world %>%
select(
resp_income,
msa_income
) %>%
summary(
fivenum()
)
## resp_income msa_income
## Min. :15656 Min. :27.19
## 1st Qu.:41988 1st Qu.:44.89
## Median :49013 Median :53.09
## Mean :49080 Mean :51.25
## 3rd Qu.:55339 3rd Qu.:57.83
## Max. :76408 Max. :70.78
df_from_the_world %>%
group_by(party) %>%
summarise(mean_deo = mean(ideo_scale))
## # A tibble: 2 × 2
## party mean_deo
## <chr> <dbl>
## 1 D 1.98
## 2 R 4.15
Uh oh. They put the respondent’s income in dollars and the MSA income in thousands. They also misdocumented the direction of the scale.
How could we use assertions to catch these errors?
library(assertthat)
assert_that(
between(
mean(df_from_the_world$resp_income),
mean(df_from_the_world$msa_income) - 10000,
mean(df_from_the_world$msa_income) + 10000
)
)
# annoyingly, we can't pass an assert_that() argument through a filter pipe
assert_that(mean(df_from_the_world$ideo_scale[df_from_the_world$party == "D"]) > 3)
# would need to do this
df_from_the_world %>%
filter(party == "D") %>%
{assert_that(mean(.$ideo_scale, na.rm = TRUE) > 3)}
Now, fixing…
df_from_the_world <- df_from_the_world %>%
mutate(
correct_relative_income = resp_income / (msa_income * 1000)
)
# assertion passes
assert_that(
between(
mean(df_from_the_world$correct_relative_income),
mean(df_from_the_world$msa_income) - 10000,
mean(df_from_the_world$msa_income) + 10000
)
)
## [1] TRUE
lm(ft_tax_liberalization ~ correct_relative_income + party + ideo_scale, data = df_from_the_world) %>%
modelsummary(
coef_rename = c(
"correct_relative_income" = "CORRECT Personal Income / MSA Income",
"partyR" = "PID, R = 1",
"ideo_scale" = "Ideology: 1 - V.Liberal, 5 = V.Conservative"
),
stars = TRUE)
| (1) | |
|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
| (Intercept) | 76.363*** |
| (8.083) | |
| CORRECT Personal Income / MSA Income | -12.873+ |
| (6.888) | |
| PID, R = 1 | -19.256** |
| (6.994) | |
| Ideology: 1 - V.Liberal, 5 = V.Conservative | 0.371 |
| (2.579) | |
| Num.Obs. | 100 |
| R2 | 0.193 |
| R2 Adj. | 0.167 |
| AIC | 897.1 |
| BIC | 910.1 |
| Log.Lik. | -443.548 |
| F | 7.629 |
| RMSE | 20.42 |
Picture looks a bit different!
Imagine you cyclically analyze a dataset that is released quarterly. You will probably set about automating the analyses, using the same set of code (what’s the phrase that laziness is a sign of a good developer?) To ensure your code works on subsequent datasets, you’ll want to test the assumptions the code makes. Enter, assertions!
Perhaps we have a new dataset. We’re armed with a codebook and ready to start the analysis. Should we dive into our lm(y ~ x, data)? Probably not. We should check whether the data generating process is internally valid, especially if any hand coding/stitching data from multiple sources was involved.
People identified as Republicans in one question should tend to be conservative in another.
Perhaps there are legal thresholds that you can use to test the validity of fields. Apologies in going back to housing finance, but Fannie has eligibility thresholds on their behavior. If you find loans outside of these thresholds owned by Fannie, you probably have a data issue.
This is more of a ‘impress your boss/a reviewer/a professional adversary’s’ point. If you assert distributions/rates of categorical variables proactively, you’ll be able to confidently answer questions on the fly. Also, people are less likely to read their console/won’t run the code that produces it. Assertions, however, put the answers right in their face and don’t need to be run to be shown
set.seed(871)
a_survey_people_didnt_like <- tibble(
party = sample(
c("R", "D", NA),
size = 100,
replace = TRUE,
c(.2, .1, .7)
),
ft_outparty = rnorm(
100,
mean = 20,
sd = 3
)
)
# "?? This is a a blank line? What does it mean? Why do I care?" - A cranky boss
sum(is.na(a_survey_people_didnt_like$party))
## [1] 75
# "good catch... this is a lot of NAs!" - the same boss
assert_that(sum(is.na(a_survey_people_didnt_like$party)) == 75)
## [1] TRUE
I’m making a big dataframe below. Split the class into 2 groups. Group 1 writes a useful assertion. Group 2, manipulate the dataframe in some way that causes an issue, but the assertion doesn’t detect. Group 1 modify the assertion so that issue is detected, back to group 2, etc. Think of each cycle corresponding to a new survey release
activity_df <- tibble(
party = sample(
c("R", "D"),
size = 100,
.5
),
state = sample(
c("OH", "ID", "MI", "CA", "TX", "PA"),
size = 100,
replace = TRUE,
prob = rep(100/6, 6)
),
income = rnorm(100, mean = 50, sd = 4),
ft_outparty = rnorm(
100,
mean = 20,
sd = 3
),
ft_gov = rnorm(
100,
mean = 20,
sd = 3
),
date_of_survey = sample(
c("10/3/2023", "11/4/2023", "6/3/2022"),
size = 100,
replace = TRUE,
prob = rep(100/3, 3)
)
)