Introduction

The aim of this guide is to suggest which packages and functions are most appropriate for common tasks faced by social scientists working in R. If you’ve discovered this page, I hope you find it useful! I’ve written this guide for my own reference, but if other people discover it and would like to contribute or provide feedback, please email me.

General Advice

  • The tools best suited for creating publication-ready tables and plots may lie in a different package than the ones you use to initially explore and analyze the data

  • Initial data cleaning is best done in a regular R script, but use Rmarkdown when sharing results with colleagues. Replication files should use Rmarkdown unless there is a good reason not to (e.g., lengthy calculations) since it makes comments easy to read. Your data cleaning can live in a separate R script.

  • Variable names should be easy to decipher, but don’t try to make them publication worthy. That is, feel free to leave in underscores. To produce the variable names that will appear in your published tables and plots, either create a named vector of the form c(working_name_1 = "New Name 1", working_name_2 = "New Name 2") which you can use repeatedly to rename these variables, or else use labelled data. Try to hold off on doing this till you’re almost ready to circulate a draft of your paper so that you don’t waste time customizing plots and figures to fit names you don’t end up using.

Review of Important Packages

  • I start every R session by loading tidyverse. It contains dplyr and ggplot2 so you don’t need to load these separately, plus some other useful packages like forcats. Its most distinctive feature is the ability to use pipes %>% to unwrap functions so that mean(x) becomes x %>% mean(), allowing you to avoid a tangle of nested parentheses. In fact, the authors suggest having one function per line for clarity. Note that as of version 4.1, R has its own native pipe |> which the tidyverse folks recommend adopting, but old habits die hard (and the percent-sign pipe has slightly more functionality).

  • fixest was designed for fixed effects regressions, but it’s become my go-to for nearly all regressions because of the tables and convenient syntax. The tables look great in an RMarkdown or Quarto output and can be exported to Latex to insert in a document, though complications sometimes ensue. Its biggest advantage for analysis is its ability to run several regressions in one line without cluttering up your code with multiple objects. For instance, you can use split = ~ gender to split the sample by gender, sw(age, generation) to swap out alternate predictors, csw(race, gender, age) to add each of these regressors one by one to each subsequent model, and c(outcome1, outcome2, ...) ~ for comparing different outcomes. You can create a list of control variables you want to use repeatedly like setFixest_fml(..controls = ~ age + gender + race) and then reference them each time by putting ..controls in your regression formula. It handles robust standard errors and clustering with ease (you can even set these to be defaults). One drawback is that not many packages accept fixest model objects for further processing, though this is starting to change. Another potential downside is that its built-in plotting functionality uses base R graphics, but you can use the broom package to tidy a fixest object into a tibble for plotting in ggplot2.

  • I used to use data.table a lot, but whenever I’d wander away from it for a few weeks, I’d forget much of the syntax. I think it’s probably best for people familiar with SQL. The package is unusual for R because it modifies objects in place without having to resave them. For instance, dt[x > 3, y := 2*z] will change y’s values to 2 times z, but only in rows where x > 3. You don’t need dt <- in front! I’ve found this useful for complex randomization schemes and simulation because it’s very fast and much more compact than tidyverse. However, for more complex renaming and grouping operations, the syntax is hard to remember. Thus, I recommend the dtplyr package, which integrates it into dplyr, allowing you to take advantage of the speed without the syntax. data.table does offer the %like% operator, which works like dplyr::contains() but for character vectors instead of variable names (e.g., c("tame", "meat", "I") %like% "me" yields T, T, F).

  • jtools is a handy package for regression tables and plots. For plotting, it builds on ggplot2 so you can use those tools for further customization. However, as I’ve gotten more comfortable using ggplot, I often find it at least as efficient to create my plots from scratch. jtools is quick if you like the defaults, but the more you have to customize, the more it makes sense to just create everything in ggplot.

  • Use visreg or easystats for contrast plots that show the differences between various levels of a factor (and whether those differences are significant). visreg has a lot of promise for plotting, but I’ve scarcely tried it. I’ve read all the vignettes though. It can do both base R graphics and ggplot.

  • ggforestplot will produce nice regression plots once you’ve used broom to tidy your regression output. That said, you can also reproduce its features yourself in ggplot without much difficulty if you want more customization.

  • easystats looks very rich in features. It’s tidyverse compliant. Probably good for plotting but not for tables because it has no latex or Rmarkdown output options. It can do robust standard errors.

  • sjmisc is a grab bag package (as is jtools), but it’s mainly useful for data manipulation and exploration. It provides modest though not huge improvements over what you can already get out of dplyr, forcats, stringr, etc. The main advantage of these functions is that they can incorporate labelled data into tables and plots. However, it often gives unexpected results without warning you and that, in my opinion, makes it not worth the trouble except for a few functions, which you should call using “::” without loading the whole package, since it tends to override a lot of existing functions.

  • sjPlot seems to have a lot going for it. The figures look great and the regression tables supposedly can be output to latex with the help of an auxiliary function from github.

  • If you want confidence intervals and p-values for predicted probabilities, marginal means, and other quantities based on regression coefficients, the clarify package is your friend. If formulas for these quantities don’t exist (or are unreliable), and bootstrapping is too slow, clarify provides a good solution. After your regression produces an estimate \(\hat{\beta}\) and standard error \(\hat{\sigma}\) for your favorite coefficient, clarify’s functions will sample additional estimates from \(N(\hat{\beta},\hat{\sigma})\) (or some non-normal distribution with those parameters). You can then use these simulated coefficient estimates to calculate CIs/p-values for more complex quantities like predicted values and marginal effects. This package is meant to replicate Gary King’s clarify package in Stata and replaces the zelig package.
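
Here is a minimal sketch of the workflow (the data and variable names are hypothetical; double-check the clarify vignette for the current function signatures):

library(clarify)

# Hypothetical logistic regression
fit <- glm(voted ~ treatment + age, data = d, family = binomial)

# Draw simulated coefficient vectors from N(beta_hat, vcov_hat)
s <- sim(fit, n = 1000)

# Predicted probabilities at chosen covariate values, with simulation-based CIs
est <- sim_setx(s, x = list(treatment = c(0, 1), age = 30))
summary(est)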

Ecosystems

Here are some sets of packages by a given author (or group) that often tend to play well together. Some of them are in fact dependent on each other.

  • tidyverse (Hadley Wickham and co-authors)
    • core packages automatically loaded when you run library(tidyverse)
      • ggplot2 to plot
      • dplyr to manipulate and summarize data
      • tidyr to reshape data
      • readr to import CSVs and other files
      • purrr to do functional programming instead of sapply
      • tibble stays in the background
      • stringr to modify strings
      • forcats to modify factors
    • some additional packages to load separately
      • lubridate to convert characters to date-times
      • hms to convert characters to times
      • glue to improve upon the base R paste
      • dtplyr to run data.table in the back-end for speed
      • googlesheets4 to import from Google sheets
      • haven to import from SPSS
      • readxl to import from individual sheets in excel
  • Daniel Lüdecke packages
    • sjPlot
    • sjmisc
    • sjlabelled
    • sjstats
    • ggeffects
    • this author is also one of three co-creators of the easystats ecosystem. This includes (among others):
      • see
      • parameters
      • modelbased to estimate effects and contrasts between groups
      • effectsize
      • report
      • insight
      • bayestestR for analyzing Bayesian models
  • Claus O. Wilke packages
    • ggtext provides a way to do formatting with markdown and html inside of plot text appearing in captions, titles, legends, ticks, and axes. See website.
    • cowplot allows you to combine graphs into a single figure quite elegantly. It also lets you annotate plots and provides some themes.
    • ggridges cool ridgeline plots. See gallery
  • Alboukadel Kassambara packages
    • ggpubr allows you to combine plots into a single figure
    • ggcorrplot lets you visualize correlations between a bunch of variables and test their significance
  • Declare Design Packages
    • fabricatr allows you to generate fake data
    • estimatr allows you to estimate effect sizes
    • randomizr allows you to randomize treatment assignments, even in complex designs
    • DesignLibrary is a collection of research design templates. There’s also an online wizard to generate code and data for you.
    • All these packages are focused on the preregistration stage, though estimatr can be useful beyond that thanks to its difference-in-means and robust linear regression tools. Even so, I would not suggest using it outside of the study design context since it produces a different type of model object that many other functions can’t handle. That said, this package does have built-in export to Latex, so maybe it has everything it needs. All plotting takes place in ggplot with the help of tidy()
  • Gregory Demin packages
    • expss is for tables and labels - check it out!
    • maditr provides an alternative to dtplyr, letting you use pipes with data.table
    • excel.link is to facilitate data exchange between R and Excel - check it out!
    • comprehenr is for list comprehensions in R - check it out!
  • Thomas Leeper packages
  • Vincent Arel-Bundock packages

To-Do List

Packages to Review

  • table1
  • modelsummary package with functions
    • modelsummary()
    • modelplot()
    • datasummary()
  • Table packages including gt, gtsummary, kableExtra, and xtable

Setup

Here I load some libraries that will be used in examples throughout.

library(tidyverse)
library(data.table)

RMarkdown Tips

Rmarkdown documents often have trouble finding file paths because knitr sets the working directory to the document’s folder. The here package is great for that. After loading here, run here("topmost_folder", "file_name") to retrieve the path to your file. You can put that statement directly inside of load or fread or whatever loading function you’re using.

Internal links can be achieved a few ways, but my preferred method is to put {#my_bookmark} at the end of a heading and then reference it by writing [text_to_display](#my_bookmark). Hyperlinks use a similar approach: [text_to_display](https://www.mysite.com).

You can set the date to automatically update by putting inline R code in the YAML header, e.g., date: "`r format(Sys.Date(), '%B %d, %Y')`", which renders as something like “October 10, 2024”.

Here’s the code to create a table of contents. Feel free to adjust the depth:

output:
  html_document:
    toc: true
    toc_depth: 3

This is a good first line to include:

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

You can display regression tables nicely using the kableExtra package, as described here.

Importing Data

Pre-existing R objects

The readRDS() function is used to load a .rds file directly into a variable.

# Save
saveRDS(dat, "mydataframe.Rds")

# Open
d <- readRDS("mydataframe.Rds")

In contrast, load() will load a .Rdata file containing multiple objects.

# Save
save(dat1, dat2, file = "mydata_objects.Rdata")

# Open
load("mydata_objects.Rdata")

CSV files

For small files, the readr package offers the read_csv() function. For large files, use the data.table package’s fread() function for speed.

Stata and SPSS

Use the haven package’s read_stata() and read_spss() functions.

Qualtrics

The qualtRics package is great for pulling data directly from a Qualtrics survey. If you are using a Qualtrics account for the first time (or maybe when starting a new project), be sure to register your API credentials within RStudio.

qualtrics_api_credentials(api_key = "adsfadfsvavf",
                          base_url = "iad1.qualtrics.com",
                          install = TRUE,
                          overwrite = TRUE)

Here’s an example:

svy_id <- "SV_bOUmVpTISZiIJMO"
fname_raw <- "raw_data_file"
fname_clean <- "clean_data_file"

get_new_data <- FALSE

if(get_new_data){
  cat("getting new data")
  mysurvey <- fetch_survey(
    surveyID = svy_id, 
    label = FALSE,
    convert = FALSE,
    unanswer_recode_multi = 0, 
    force_request = T,
    col_types = readr::cols(StartDate = readr::col_datetime(),
                            EndDate = readr::col_datetime(),
                            RecordedDate = readr::col_datetime()))
  Qs <- survey_questions(svy_id)
  cmap <- extract_colmap(mysurvey)
  
  save(mysurvey, file = paste0(fname_raw, ".Rda"))
} else {
  cat("loading existing data")
  load(paste0(fname_raw, ".Rda"))
}

Data Cleaning

Cleaning data most often involves the following tasks:

  1. Renaming variables
    • Individually
    • Systematically
  2. Recoding values
    • Individually
    • Systematically
  3. Reorganizing factors
    • Reordering the levels
    • Combining levels
  4. Creating new variables
    • From one variable, conditional on its own values
    • From one variable, conditional on another variable
    • From a set of variables
  5. Reshaping data sets
  6. Combining data sets

We will review the best approaches to these tasks here.

In general, the tidyverse packages are preferable to data.table for data cleaning and inspection. Not only do more people understand tidyverse, but the syntax is more readable and, unless you’re really proficient, requires less looking up how to do things. However, there are contexts in which data.table is superior, the most common of which is when you need to change certain values of variable_1 contingent on the values of variable_2. For instance, dt[age < 18, employment := NA] tends to be simpler than the dplyr equivalent:

d <- d %>% mutate(employment = case_when(age < 18 ~ NA_character_,
                                         TRUE ~ employment))

If you have to do this rigmarole for multiple variables, it becomes a real pain, particularly when you could just be copying and pasting your one-line data.table code.

Another use for data.table is the %like% operator, which saves you from having to do a bunch of complicated regex mischief. Finally, data.tables are pretty convenient for randomizing: d[ , new_value := sample(values)], randomizing within groups: d[ , new_value := sample(values), by = group ], and bootstrapping (which requires sampling with replacement): d_boot = d[ sample(.N, replace = TRUE), ]

While it may be tempting to use the sjmisc and sjlabelled packages for data cleaning—indeed, that’s what the former is primarily intended for—at this point the package tends to suck up more of your time than it will save. The primary advantage of these packages is that they preserve labels inherited from SPSS, Stata, or Qualtrics, both for the variables (usually the question text in a survey) and the values (e.g., the response options). However, this package tends to overwrite existing base R functions such as as_factor, making them behave differently than you expect, and it rarely (if ever) issues warnings when something goes wrong—it will just silently fail. You’ll spend a bunch of time looking up how to do things, and in the end, you’ve introduced a greater risk of error. There are also functions in other packages that will not work on labelled data.

Recoding Values

The most commonly used functions in recoding might just be na_if and replace_na. They work like you imagine they should: simply stick in the value to replace (or to use as a replacement). More generally, case_match will rename values while preserving the variable type. It uses old ~ new syntax.

d %>% mutate(capital = case_match(country, 
                              "Brazil" ~ "Brasilia",
                              "Afghanistan" ~ "Kabul",
                              "China" ~ "Beijing"))

recode_factor will transform your variable into a factor and reorder the levels to match the order you wrote, all in one fell swoop. The data.frame itself will not be re-sorted.

d %>% mutate(capital = recode_factor(country, 
                              Brazil = "Brasilia",
                              Afghanistan = "Kabul",
                              China = "Beijing"))

In either case, numeric values need to be surrounded by backticks.

d %>% mutate(year = recode(year, 
                           `1999` = 0,
                           `2000` = 1))

fct_recode from the forcats package (loaded in tidyverse) does much the same thing as recode_factor but with new = old format. Note that quotes are still needed to the right of the equals sign but not the left. fct_recode will accept a character or factor as its input, but the output will always be a factor.

d %>% mutate(capital = fct_recode(country, 
                              Brasilia = "Brazil",
                              Kabul = "Afghanistan",
                              Beijing = "China"))

Overall, recode_factor is preferable to fct_recode. The latter does not reorder the levels of the factor, which is usually something you want to do when recoding. Also, it cannot accept numerics. The new = old syntax does nicely match up with the mutate, rename, and select functions, but not with other renaming functions found elsewhere in R. On the other hand, the tidyverse team considers this argument order problematic and says they will likely deprecate this function once they come up with a workaround.

In both cases you can use !!! to pass a renaming vector to mutate()

recoding_vec <- c(Brazil = "Brasilia", Afghanistan = "Kabul", China = "Beijing")

d %>% mutate(capital = recode_factor(country, !!!recoding_vec))

If needed, you can turn a new = old name vector into an old = new naming vector

old_equals_new <- names(new_equals_old)
names(old_equals_new) <- new_equals_old

One instance in which the sjmisc package seems reasonably safe and there doesn’t seem to be a good alternative is if you need to reverse code a bunch of variables at once. Note that this works outside of mutate() (as well as within).

df %>% sjmisc::rec(starts_with(my_prefix), 
                   rec = "rev", 
                   suffix = "")

Recoding Multiple Columns

The across() function is useful for recoding multiple columns the same way. For instance, we can replace 999 with NA across all the columns that begin with “my_prefix”

df %>% mutate(
  across(starts_with("my_prefix"), .fns = ~ na_if(.x, 999)))

This can also be used to change all logicals to numerics at the end of a cleaning:

df %>% mutate(
  across(where(is.logical), .fns = ~ as.numeric(.x)))

Use the base R droplevels() function at the end of a cleaning session to get rid of lingering unused levels in factors.

Recoding a Range of Values

To recode a range of numeric values into a factor (or into fewer numeric values) use case_when

df <- df %>% mutate(new_factor = case_when(nums < 5 ~ "level 1",
                                           nums >=5 ~ "level 2") %>% factor())

The function if_else is harder to read

df <- df %>% mutate(new_factor = if_else(nums < 5, "level 1", "level 2") 
                    %>% factor())

while the base R function ifelse is less safe: it does not check that both outputs are the same type and silently drops attributes (it can turn factors and dates into integers):

df <- df %>% mutate(new_factor = ifelse(nums < 5, "level 1", "level 2") 
                    %>% factor())

In all of these, a condition like nums < 5 evaluates to NA when nums is NA, so the result stays NA; be careful with a catch-all clause like T ~ "level 2" in case_when, which would recode an NA as "level 2" rather than leaving it as NA. The %in% operator likewise does not respect NAs: for instance 2 %in% NA will return FALSE even though 2 == NA will return NA.

When recoding multiple factor levels into fewer factor levels, an alternative from the forcats package is to use fct_collapse(). This will respect the order and leave omitted levels untouched.

df <- df %>% mutate(new_factor = fct_collapse(
  old_factor,
  "level a" = c("level 1", "level 2"),
  "level b" = c("level 4", "level 5")))

Assuming there was a level 3, it will remain in the new factor. You can use the other_level argument to collapse all unnamed levels into a single level. As an added bonus, the order of the levels will be the order in which you wrote them. Alternatively, if you have a lot of levels to combine and one to leave alone, you can use fct_other(f, keep = c("level 1", "level 2"), other_level = "Other").

Reordering Factor Levels

If you just have one level that you want to make the baseline (reference level), use the base R function relevel(). If you want to explicitly rearrange multiple levels, use the forcats function fct_relevel(). To reverse the order of the levels (often useful in ggplot), use fct_rev()
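
For example (the factor arm and its levels here are hypothetical):

# Base R: make "control" the reference (baseline) level
d$arm <- relevel(d$arm, ref = "control")

# forcats: explicitly rearrange several levels at once
d <- d %>% mutate(arm = fct_relevel(arm, "control", "low_dose", "high_dose"))

# forcats: reverse the level order (often useful before plotting)
d <- d %>% mutate(arm = fct_rev(arm))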

datawizard

The datawizard package was created as an alternative to tidyverse with fewer dependencies. However, it also has a few useful functions that can be used in conjunction with tidyverse (a brief sketch follows the list below).

  • The rownames_as_column() function is self-explanatory. I know of no equivalent.

  • The standardize() function will center and rescale by standard deviation. Additional arguments allow for robust standardization (center at median, divided by MAD) or Gelman standardization (divided by 2 SDs). Can be applied to individual variables or all the numeric variables in a dataset. Works within groups (though I’m not sure if you can use the groups specified by dplyr).

  • normalize() will scale variables between 0 and 1

  • demean() will subtract the means, by group if desired

  • rescale() will rescale them to a range you specify

  • reverse() will reverse code them

  • ranktransform() converts values to ranks

  • slide() can shift values up or down uniformly

  • adjust() will, rather amazingly, partial out other variables using a choice of regression models including OLS, random effects, fixed effects, GAMs, or Bayesian models. It does not appear to be able to accommodate logistic models and I’m unsure about interactions. Nevertheless, it’s very impressive.

  • winsorize() will censor extreme values above/below a percentile, z-score, or pair of thresholds you specify
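
Here is a brief sketch of a couple of these, assuming a data frame d with a numeric income column (check the argument names against the datawizard documentation):

library(datawizard)

d$income_z   <- standardize(d$income)                  # center and divide by SD
d$income_01  <- normalize(d$income)                    # rescale to the 0-1 range
d$income_win <- winsorize(d$income, threshold = 0.05)  # censor the top and bottom 5%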

Renaming Variables and Creating New Ones

Variable names should follow a consistent set of rules. I prefer all lower case letters, words separated by underscores, and prioritize suffixes over prefixes (since it makes it easier to find variables you want using auto-complete when typing). Here are some useful suffixes for when multiple versions of a variable are included:

  • .cat for categorical (factor)
  • .cat.3 for three levels (assuming the original had a different number of levels)
  • .num for numeric (if the default version is categorical)
  • .01 for binary or rescaled 0 to 1
  • .imp for imputed
  • .dm for demeaned
  • .z for standardized
  • .1, .2, .3 for multiple variables for the same type like if each person is asked to list three friends

You might prefer to use underscores, though I do find it helpful to reserve underscores for separating parts of a name.

Make existing variable names lowercase

d <- d %>% rename_with(tolower, starts_with("my_prefix"))

Replace periods with underscores

d <- d %>% rename_with(~ gsub(".", "_", .x, fixed = TRUE))

The fixed argument ensures that the period and underscore are interpreted as literal strings and not as regular expression symbols.

I prioritize easy-to-remember names over short ones or pretty ones, on the assumption that any variable appearing in a publication will be renamed before going into a table or plot.

In order to rename multiple variables with a custom function, use the following dplyr syntax:

rename_columns <- function(df, new_name, old_name){
  df %>% rename({{new_name}} := .data[[old_name]])
}

mutate_columns <- function(df, new_name, old_name){
  df %>% mutate({{new_name}} := get(old_name))
}

I don’t really understand why get() (which comes from base R) doesn’t work with rename(), but for some reason it throws an error. However, it can be very useful if you want to make the RHS more complex:

mutate_columns <- function(df, new_name, old_name){
  df %>% mutate({{new_name}} := case_when(get(old_name) < 1 ~ 1,
                                          get(old_name) >= 1 ~ 0))
}

Variables that Summarize Rows

To summarize rows it is often helpful to use rowwise(), which treats each row as a group. The function c_across() is then used to select columns inside of the mutate() or transmute() function.

Find the average, sum, or max of a row (among columns with the prefix “my_prefix”)

df %>% 
  rowwise() %>% 
  mutate(mean_value = mean(c_across(starts_with("my_prefix"))),
         sum_value = sum(c_across(starts_with("my_prefix"))),
         max_value = max(c_across(starts_with("my_prefix")))
         )

Alternatively:

df %>% 
  mutate(mean_value = rowMeans(cur_data() %>%
                                 select(starts_with("my_prefix"))),
         sum_value = rowSums(cur_data() %>%
                               select(starts_with("my_prefix"))),
         max_value = do.call(pmax, cur_data() %>% 
                            select(starts_with("my_prefix")))
  )

I personally think the rowwise() approach is simpler, since you don’t need to remember the alternative functions for mean, sum, and max.

Here is a way to create a variable indicating whether a certain value (in this case 1) appears anywhere in a row:

df %>% mutate(row_with_one = if_any(starts_with("my_prefix"), ~ . == 1))

Likewise, if_all() can be used to see if all values in a row are NA:

df %>% mutate(row_all_na = if_all(starts_with("my_prefix"), is.na))

String Handling

Tidyverse’s stringr package is your friend. To cite but one example, str_wrap() is a much better way to handle line breaks than trying to predict ahead of time where they should go. Here is a useful cheatsheet that also serves as a handy guide to RegEx.

The base R sprintf() function is a far more flexible version of paste() that makes it easy to insert numbers (or other variables) into a string. You can also use it for rounding.
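
A couple of toy examples:

# Insert a rounded number and a count into a string
sprintf("Mean income: %.1f (n = %d)", 52345.678, 1200)  # "Mean income: 52345.7 (n = 1200)"

# Wrap a long plot title at roughly 30 characters per line
str_wrap("A very long plot title that would otherwise run off the edge", width = 30)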

Be careful when using any type of special character in a string. You may be invoking a RegEx command unknowingly.

Data Exploration

Try using describe_distribution() and data_tabulate() from the datawizard package. For weighted surveys, you might want weighted_mean(), which can be applied to the entire dataset at once (or a subset of it).
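
For example (region is a hypothetical variable):

library(datawizard)

describe_distribution(df)  # mean, SD, range, skewness, etc. for each numeric column
data_tabulate(df$region)   # frequency table for a single variable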

The sjmisc package offers descr(), which allows you to customize what is included and will include variable labels (though irritatingly, not if the variable is a factor; it also has mysterious errors sometimes).

The readr package from tidyverse offers the spec() function, but it only works on tibbles that have been read in by readr functions

Summary Stats & Balance Tables

The gtsummary package has tbl_summary which I’ve used before. I forget what the tradeoffs are.

The modelsummary package offers the datasummary_balance function, which can create neat summary and balance tables. It can even save them as latex files.

datasummary_balance( ~ 1, 
                     data = df, 
                     fmt = 2, 
                     title = "Descriptive Statistics\\label{tab:desc_stats}",
                     output = "latex"
)

datasummary_balance( ~ treated,
                     data = df,
                     fmt = 2, 
                     dinm_statistic = "p.value",
                     title = "Balance Table\\label{tab:bal_tab}", 
                     output = "latex")

The fmt argument specifies the number of digits to display.

Prep Work

  • tibbles should be converted to data.frames

  • You might want to convert factors to numeric dummies for each level

  • You might want to first replace underscores with spaces using rename_with(~ gsub("_", " ", .x, fixed = TRUE)) so that it looks nicer. Factors are printed separately at the bottom and lack some statistics that you might want. They also make the table wider.

You can save each table as a string and then write several at once to the same latex file:

cat(summary.table, "\n\n", balance.table, file = "appendix_tables.tex")

Regressions

Formulas

Suppose you have a bunch of variable names in a character vector c("var1", "var2", "var3"). Here’s a useful function for turning them into a formula:

make_formula <- function(char_vec){
  char_vec %>% paste(collapse = " + ") %>% paste("~", .) %>% as.formula()
}
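
For example:

make_formula(c("age", "gender", "income"))
# ~age + gender + income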

The fixest package

The fixest package was designed for fast estimation of fixed effects, but it has a much wider variety of uses. The feglm() function is the main workhorse—you can use it for anything you would normally use lm() or glm() for. The generic version has identical syntax to those base R functions: mod = feglm(y ~ x1 + x2, data = mydata). However, it’s much more flexible. For instance, you can run multiple regressions in one call:

mod <- feglm(c(height, weight) ~ treatment, data = d)

The object mod will contain two fixest models—one for height and one for weight. If you’re going to use the same outcomes again, you might want to pre-specify them like this:

outcomes <- c("voted", "protested")

linear_mod <- feglm(.[outcomes] ~ treatment, data = d)
logistic_mod <- feglm(.[outcomes] ~ treatment, data = d, 
              family = binomial(link = "logit"))

Variables on the right-hand side can be pre-specified as well:

outcomes <- c("voted", "protested")
controls <- c("district", "age", "male")

linear_mod <- feglm(.[outcomes] ~ treatment + .[controls], 
                    data = d)

Or, more commonly, using a macro:

setFixest_fml(..ctrls = ~ district + age + male)

mod <- feglm(.[outcomes] ~ treatment + ..ctrls, data = d)

However, one downside of the macro is that if you forget what exactly ..ctrls contains, you cannot just type it into the command line; you have to find it in your document. It also cannot easily be reused in non-fixest functions.

One of the most powerful uses of feglm() is the stepwise argument:

mod <- d %>% 
  feglm( data = . , 
         .[outcomes] ~ treatment + sw(age, income))

For each outcome, we will get one model that includes age as a control and one that includes income. The function sw0() will also include a model with neither. The functions csw() and csw0() will add their arguments cumulatively.

The following setup code is helpful to run at the beginning of your script:

setFixest_vcov(no_FE = "hetero")

This tells fixest to use HC1 standard errors by default. If you include fixed effects, fixest will cluster standard errors on your fixed effects unless you tell it not to with the vcov argument inside feglm(). In the absence of fixed effects, you can specify what variable to cluster on using the cluster argument. If you have multiple fixed effects, you might specify vcov = "twoway". Note that these arguments can either be set universally with setFixest_vcov(), at the model level inside feglm(), or in the model summary functions summary() and etable().

You should also include a setup line for what you want etable() to display:

setFixest_etable(drop = c("(Intercept)"), 
                 fitstat = "n", 
                 digits = 2, 
                 depvar = T)

This tells etable() to never display the intercept, drop all summary statistics from the bottom except for sample size, and limit the number of digits to 2. Of course, in many cases we do want other fit statistics, so you can modify fitstat= with a vector of names of statistics to include. You can use keep= rather than drop= if, say, you only want to display the treatment variable by default. The arguments drop= and keep= use regular expressions, so they will also keep/drop variables with longer names that contain the string you provided. This is often helpful since it includes interactions, but sometimes if you’re using a variable like age there are other variable names containing it that you don’t want to include/exclude. In that case, the regex symbols ^ and $ can get you an exact match: ^age$.

Bayesian Models

To run on all available cores by default, your script should begin with options(mc.cores = parallel::detectCores())

There are two convenient packages for working with Bayesian statistics: brms is more flexible, but rstanarm has pre-compiled Stan code which can speed things up. Both use the same syntax. I think I’d lean toward using rstanarm by default since speed is a big issue in these models and I probably don’t know enough to want extreme customization. However, some of the examples below are in brms for now since that’s what I was using most recently at the time I wrote this section.

The brmstools package appears to extend brms with some useful plotting features.

Here is a brms example of a Bayesian multilevel model with country and year varying intercepts.

mod <- brm(outcome ~ tr + ( 1 | country) + (1|year),
               data = dat,
               seed = 12345,
               control = list(adapt_delta = .95),
               prior = set_prior("normal(0,1)", class = "b"),
               cores = 4)

Note the use of a weakly informative prior for the coefficient estimates. The adapt_delta parameter can be increased towards 1 if the model is not behaving (the readout will prompt you to do so).

The following function is useful for getting a point estimate and inner and outer confidence intervals for a given parameter:

library(parameters) # for get_parameters()
library(bayestestR) # for hdi()

get_est_and_ci <- function(mod, param){
  posterior.df <- get_parameters(mod, parameters = param)
  posteriors <- posterior.df[,1]
  bind_cols(estimate = median(posteriors),
            hdi(posteriors),
            hdi(posteriors,ci = .9) %>%
              transmute(ci_low_90=CI_low, ci_high_90=CI_high))
  }

The [ ,1] extracts the first column of the data frame returned by get_parameters()—the vector of posterior draws for the requested parameter—but it’s worth inspecting this function before using it.

List Experiments

To analyze list experiments, Blair and Imai’s list package is recommended. Here is an example:

mod <- ictreg(list_expt_count ~ witness_vio + age + male, 
              treat = "list_expt_treated", J = 3, 
              data = df, 
              method = "ml",
              robust = T) 

The argument J specifies the number of non-sensitive items in the list. The ml method (maximum likelihood) is considered the most efficient, though lm is more interpretable. If there are multiple sensitive items (i.e., some respondents get one sensitive item, others get another), use this approach:

mod <- ictreg(list_expt_count ~ age + male, 
       treat = "list_expt_3_arms", J = 3, 
       data = df, 
       method = "ml", multi.condition = "level") 

If there are no controls included, the formula list_expt_count ~ 1 will suffice.

To measure social desirability bias, you can ask control subjects the sensitive item after the experiment, and then add this variable to their count. Then just use linear regression (no special packages are needed) to see if the list experiment made a significant difference in the count people reported.

d <- d %>% 
  mutate(count_combined = 
           case_when(list_expt_treated == "control" ~ 
                       list_expt_count + sensitive_question,
                     list_expt_treated == "treated" ~ list_expt_count))

lm(count_combined ~ list_expt_treated, data = d)

Regression Tables

stargazer

For most regression functions, the stargazer package and its one-stop-shop stargazer() function provide everything you need. The style argument even allows you to set the style to a particular journal such as AJPS, IO, or APSR. The out argument allows you to save the table directly to a Latex file. You can replace the standard errors with ones you supply using se. You can view the model in the console using type = "text".
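
Here is a minimal sketch using the mpg data that ships with ggplot2 (adjust the style and file name to taste):

library(stargazer)

mod1 <- lm(hwy ~ displ, data = mpg)
mod2 <- lm(hwy ~ displ + cyl, data = mpg)

# View in the console while working...
stargazer(mod1, mod2, type = "text", style = "apsr")

# ...then write straight to a .tex file for the paper
stargazer(mod1, mod2, style = "apsr", out = "results_table.tex")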

kableExtra: kable_classic()

The kableExtra package provides a good alternative. It is particularly useful for displaying results within an RMarkdown document. The following pretty_print function makes the results look nice.

library(kableExtra)
pretty_print <- function(reg_table_df){
  reg_table_df %>%
    kbl() %>% 
    kable_classic(full_width = F, html_font = "Cambria") %>%
    scroll_box(width = "100%")
}

fixest: etable

When using the fixest package, the etable() function produces nice latex tables and can also print a table to the console as text (set tex = F). You can save the table directly to a Latex file with the file argument. Be sure to set replace = T if you want it to overwrite the existing file rather than just appending, though appending is nice if you want to display multiple tables in a row, say, in an appendix.
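
For instance, something along these lines:

# Print to the console as text while exploring
etable(mod, tex = FALSE)

# Write (and overwrite) a Latex table for the paper
etable(mod, tex = TRUE, file = "results_table.tex", replace = TRUE)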

jtools: summ() and export_summs()

The summ function from jtools provides a number of useful features

  • Robust standard errors. Note that Stata uses HC1 standard errors while the authors of the sandwich package, used throughout R, recommend HC3. The default in jtools is HC3, but you can adjust it.

  • Scale all predictors by their SD, or as Gelman suggests, by dividing again by 2.

  • can accommodate lm, glm, svyglm, merMod (lme4), and rq (quantreg or quantile regression) models

  • set defaults with set_summ_defaults

  • round to your desired precision or set it globally with options("jtools-digits" = 2). This is purely for display; the underlying summ object retains the original precision of the model. This will globally affect all jtools functions.

  • set_summ_defaults() to set defaults for summ

  • will output both CIs and p-values if requested. CI width can be adjusted. However, it does NOT display stars

  • can exponentiate coefficients

export_summs is a wrapper for huxtable::huxreg used to export these tables to Latex, RMarkdown, Word, PDF, HTML, or Excel. In RMarkdown, specify results = 'asis' in the code block. It provides a lot of customization, including providing CIs and p-values simultaneously if desired. You may need to wrap the result in huxtable::to_latex.
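
A sketch of both functions (the to.file and file.name arguments follow the jtools documentation; double-check them against your version):

library(jtools)

mod <- lm(hwy ~ displ + cyl, data = mpg)

# Console/RMarkdown summary with HC3 robust SEs, CIs, and 2 digits
summ(mod, robust = "HC3", confint = TRUE, digits = 2)

# Export one or more models to a Word document
export_summs(mod, robust = "HC3", to.file = "docx", file.name = "model_table.docx")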

Other packages

The sjPlot package has no native way to export tables to PDF or Latex. But this guy came up with a way. The function is html2latex::html2pdf().

Plotting

Getting the Variable Order Right

guide = guide_legend(reverse = T) inside of scale_XXX() function will get legend order to match the order of the colors in the plot. You will likely have to use fct_rev() beforehand though as the items will be listed bottom-to-top.

Colors

You can assign specific colors using scale_color_manual(values = c("treated" = "blue", "control" = "red"))

If you want colors that are colorblind safe and black-and-white printer safe, the viridisLite package offers a good option: scale_color_viridis_d(option = "magma", end =0.8, direction = -1)

The option parameter allows for several different sets of colors. Set the end argument at less than 1 so that the brightest color won’t be too faint. The direction argument reverses the direction of the colors. scale_fill_viridis_c() is for continuous data and scale_fill_viridis_d() is for discrete data. Replace the word fill with color depending on the type of plot you’re using it for.

Themes

Below are the three standard themes from base ggplot2 and their closest equivalents from cowplot. There are also theme_minimal_vgrid and theme_minimal_hgrid themes if you just want grid lines going vertically or horizontally. For the most part, the cowplot versions look similar albeit with larger font sizes.

The grid of multiple plots below, created with the + and / operators, is enabled by the patchwork package.

library(ggplot2)
library(patchwork)
library(cowplot)
p <- ggplot(mpg, aes(displ, hwy, color = class)) + geom_point()

p1 <- p + theme_minimal() + ggtitle("theme_minimal")
p2 <- p + theme_bw() + ggtitle("theme_bw")
p3 <- p + theme_classic() + ggtitle("theme_classic")
p4 <- p + theme_minimal_grid() + ggtitle("theme_minimal_grid")
p5 <- p + theme_cowplot() + background_grid() + panel_border() + ggtitle("grid and border")
p6 <- p + theme_cowplot() + ggtitle("theme_cowplot")

(p1 + p4) / (p2 + p5) / (p3 + p6)

Note that some of these themes are less desirable when used with faceting

pr <- p + facet_wrap(~drv)

pr1 <- pr + theme_minimal() + ggtitle("theme_minimal")
pr2 <- pr + theme_bw() + ggtitle("theme_bw")
pr3 <- pr + theme_classic() + ggtitle("theme_classic")
pr4 <- pr + theme_minimal_grid() + ggtitle("theme_minimal_grid")
pr5 <- pr + theme_cowplot() + background_grid() + panel_border() + ggtitle("grid and border")
pr6 <- pr + theme_cowplot() + ggtitle("theme_cowplot")

pr1 / pr2 / pr3 / pr4 / pr5 / pr6

theme_bw() is probably best here. The cowplot website has additional suggestions.

The ggforestplot package provides theme_forest() and geom_stripes(), which give nice alternating stripe colors. A couple of other packages also have their own theme_forest() function.

Exporting

  • ggsave is the base ggplot option (see the sketch after this list). Don’t forget to specify dimensions. It will save the most recent plot you generated, which can be problematic if you’re stitching individual plots together into a multiplot. However, you can always specify a particular plot you’ve named using the plot= argument

  • ggpubr::ggexport can be used together with ggarrange() to both make a multiplot and save it.

  • cowplot::save_plot supposedly has good default dimensions for theme_cowplot(). It also claims to be better for figures containing multiple plots.
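
A typical ggsave() call with explicit dimensions and a named plot:

ggsave("figure1.pdf", plot = p1, width = 6, height = 4, units = "in")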

Multiple Plots in One Figure

There are two approaches to including multiple plots in one figure. One is to use ggplot2’s “facet” layers. This is best for situations where there is a variable that you could in theory show using colors, but you’ve already used colors for something else. Your panels will share a legend, and you can easily keep the axis scales the same for both if desired. The other approach is to use one of several packages for combining separate plot objects into a single image. This is the more flexible approach, which may or may not be a good thing depending on your needs. If you are combining two different types of plots (like a box plot and a scatter plot), then this is the approach to use. If you are combining two scatter plots from different studies, you might be better off using facets unless you need to customize each plot.

Facets

Let’s begin with a base plot to modify:

p <- ggplot(mpg, aes(x = displ, y = hwy, color= class)) + geom_point() 

The ggplot2 package uses facet as a verb and refers to the individual plots as panels. Here we will use facet_wrap to facet the variable drv, creating two panels.

p + facet_wrap("drv")

Unfortunately, you can’t just type drv into the argument like you can for the variables inside aes(). Rather, you must put it in quotes, put it in a formula (facet_wrap(~ drv)), or wrap it in vars() (p + facet_wrap(vars(drv))).

The order of values in drv is “4”, “f”, “r” (alphabetical, with numbers coming before letters). You can change this by turning drv into a factor with fct_relevel() and listing the levels:

mpg %>% mutate(drv = fct_relevel(drv, "f", "4", "r")) %>% 
  ggplot(aes(x = displ, y = hwy, color= class)) + 
  geom_point() +
  facet_wrap("drv")

Similarly, if you want to change the names of the panels, recode the factor levels with fct_recode() or case_match():

mpg %>% mutate(drv = fct_recode(drv, 
                                front = "f",
                                "4-wheel" = "4",
                                rear = "r")) %>% 
  ggplot(aes(x = displ, y = hwy, color= class)) + 
  geom_point() +
  facet_wrap("drv")

You can specify the number of rows and columns with nrow= and ncol=:

p + facet_wrap("drv", ncol = 2)

By default, the panels will be laid out row by row. If you’d prefer to lay them out column by column, set dir="v" for vertical:

p + facet_wrap("drv", ncol = 2, dir = "v")

Either way, the layout begins in the top left (like a table). To begin in the bottom left (like a plot), set as.table=F.

p + facet_wrap("drv", ncol = 2, as.table=F)

The term “strip” refers to the little strip attached to each panel where the panel’s title lives. The argument strip.position= controls on which side of the subplots the subplot names appear.

p + facet_wrap("drv", strip.position = "bottom")

You can customize the strip more using the various “strip” arguments in a theme():

p + facet_wrap("drv", strip.position = "bottom") +
  theme(strip.background = 
          element_rect(colour = "red", fill = "gray"),
        strip.text.x = 
          element_text(colour = "white", face = "bold"),
        strip.placement = "inside")

The strip.placement = "inside" line determines whether the strips are inside or outside the axis labels. The most useful strip theme tweak may simply be to remove the background. As usual, the element_blank() function is the key to removing stuff in ggplot.

p + facet_wrap("drv", strip.position = "bottom") +
  theme(strip.background = element_blank())

You can let the axis scales range freely with scales:

p + facet_wrap("drv", scales = "free_y")

p + facet_wrap("drv", scales = "free")

Letting the scales vary is particularly helpful in regression plots where some outcomes only apply to one sample or the other:

ggplot(mpg, aes(x = hwy , y = manufacturer, color= class)) + 
geom_point() +
facet_wrap("drv", scales = "free_y")

If you have more than one variable you’d like to facet on, you may be better off using facet_grid. This allows you to efficiently graph four variables.

p + facet_grid(fl~drv)

Combining Plots with Patchwork

The patchwork package is terrific in the slightly more complicated scenario where you want to stitch together multiple plots that have different axis labels or even different types of plots.

p1 <- ggplot(mtcars) + 
  geom_point(aes(mpg, disp)) + 
  ggtitle('Plot 1')

p2 <- ggplot(mtcars) + 
  geom_boxplot(aes(gear, disp, group = gear)) + 
  ggtitle('Plot 2')

p3 <- ggplot(mtcars) + 
  geom_point(aes(hp, wt, colour = mpg)) + 
  ggtitle('Plot 3')

p4 <- ggplot(mtcars) + 
  geom_bar(aes(gear)) + 
  facet_wrap(~cyl) + 
  ggtitle('Plot 4')

p1a <- ggplot(mtcars) + 
  geom_point(aes(mpg, disp, colour = mpg, size = wt)) + 
  ggtitle('Plot 1a')

(p1 + p2) / p3

p1 + p4 | (p2 / p3)

Create insets

p1 + theme_classic() + inset_element(p2, left = 0.6, bottom = 0.6, right = 1, top = 1) 

You can give all the plots a common legend.

(p1a | (p2 / p3)) + plot_layout(guides = 'collect')

Annotate the metaplot as well as tag the individual subplots with letters or roman numerals.

(p1 + p2) / p3 + 
  plot_annotation(
  tag_levels = 'A',
  title = 'The surprising truth about mtcars',
  subtitle = 'These 3 plots will reveal yet-untold secrets about our beloved data-set',
  caption = 'Disclaimer: None of these plots are insightful')

Alternative Packages: cowplot and ggpubr

The cowplot package can also combine multiple plots with its plot_grid() function. It appears to be less flexible than patchwork and I have yet to find a context where it’s preferable. If you want a common legend, you’ll need to use the ggpubr package. Its ggarrange() function is a wrapper for plot_grid() which includes a common.legend= argument but gets rid of some other arguments that are potentially useful. In both cases, the panel titles are a bit hard to use since they tend to exceed the margins or overlap the plots. Better to just make titles for each plot. Better yet, just use the patchwork package.

Coefficient Plots

fixest

The fixest package has a coefplot() function, but it uses base R graphics rather than ggplot2 and thus is best avoided unless you are fluent in base R graphics. However, it’s not too hard to extract the features you want into a tibble and then plot them with ggplot2.

The following function will extract outcome names, regressor names, coefficient estimates, and inner/outer confidence intervals from a feglm model for plotting. It is set up to use HC1 standard errors.

est_and_ci <- function(model, parameters=NULL, ci.level = 0.95, inner.ci.level = 0.9){
  outer_cis <- confint(model, 
                       parm = parameters, 
                       level = ci.level, 
                       vcov = "hetero")
  names(outer_cis) <- c("outer_low", "outer_high")
  
  inner_cis <- confint(model, 
                       parm = parameters, 
                       level = inner.ci.level, 
                       vcov = "hetero")
  names(inner_cis) <- c("inner_low", "inner_high")
  
  cbind(estimate = coef(model), 
        outer_cis, 
        inner_cis,
        dv = etable(model)[1,1]) %>% 
    as.data.table(keep.rownames = "regressor")
}

If your input is a fixest_multi object—for instance, if you had multiple outcomes, a split sample, or a stepwise comparison—then use the following:

map(mods, est_and_ci) %>% rbindlist()

You can enter other arguments you want after est_and_ci:

map(mods, est_and_ci, parameters = c("age", "sex")) %>% rbindlist()

ggplot2

Here is an example:

df_to_plot %>%
  ggplot() + 
  geom_pointrange(aes(y = dv, 
                      x = estimate, 
                      xmin = outer_low, 
                      xmax = outer_high, 
                      color = group),
                  position = position_dodge(0.5)) +
  geom_linerange(aes( y = dv,
                      xmin = inner_low,
                      xmax = inner_high,
                      color = group), 
                 size = 1,
                 position = position_dodge(0.5)) +
  facet_wrap(~framework) +
  geom_vline(xintercept = 0, lty = "dotted") + 
  scale_color_grey(
    name = "Group",
    guide = guide_legend(reverse = T)) + 
  theme_minimal() + 
  theme(text = element_text(family = "serif"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = xtext, 
       y = "outcome variable" , 
       caption = "95% CIs with robust standard errors")

Packages that Build on ggplot2

  • jtools::plot_summs is a real blessing (see the sketch after this list). Features:

    • inner and outer confidence intervals
    • robust standard errors (any type you wish)
    • scaled coefficients
    • multiple models with legend in correct order
    • log odds exponentiation
    • unsupported model types that can be coerced to a tidy output with broom will still display, but without some of the extra features. The more general function is called plot_coefs, but there’s no reason to ever use it since plot_summs will alias it when needed
  • ggforestplot::forestplot is also great. Features:

    • hollow dots when p < alpha
    • multiple models with legend in correct order
    • nonstandard evaluation (no need to quote variable names)
    • alternating background colors
    • log odds exponentiation
    • can accept any model type provided you’ve converted the output to a tibble with variable names, estimates, and standard errors. It’s up to you to make these standard errors robust. However, you don’t need to calculate the CIs. See the website
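
Here is a sketch of plot_summs() with two models (the robust and inner_ci_level arguments follow the jtools documentation):

library(jtools)

mod1 <- lm(hwy ~ displ + cyl, data = mpg)
mod2 <- lm(hwy ~ displ + cyl + year, data = mpg)

plot_summs(mod1, mod2,
           robust = "HC3",         # robust standard errors
           inner_ci_level = 0.9)   # adds thicker inner 90% intervals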

Other Packages

The easystats ecosystem features the see package, which I have yet to use. Might be worth investigating.

Jared Lander has a package called coefplot which looks tempting but hasn’t been updated in many years and lacks a webpage. I seem to recall it getting the colors in the legend backwards with no way to fix it. Avoid! Its main redeeming feature is its ability to easily produce secret weapon plots.

Histograms and Bar Charts

There are three scenarios to consider. If your data frame contains the underlying data points and you want R to bin and count them, use geom_histogram(). If your data points are already binned or sorted into categories, use geom_bar(). If your data frame contains counts for each bin/category rather than data points, then use geom_col().

# from unbinned data points
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 20)

# from binned/categorized data points
ggplot(mpg, aes(class)) + geom_bar()

# from bins/categories with counts
data.frame(value = c(.3, .4, .5, .6), count = 1:4) %>% 
  ggplot(aes(x=value, y=count)) + geom_col()

Marginal Effects and Predictions

ggplot2

If you have a dataset of predictions already created, this code provides a good way to visualize it.

Example data:

df <- tibble(player = rep(c("infielder", "outfielder","DH"), 8),
             bats = rep(c("lefty", "righty"), 12),
             mean = 100*runif(24),
             low = mean*0.92,
             high = mean*1.04,
             value_label = sprintf("%.1f%%", mean),
             outcome = c(rep("singles", 6), rep("doubles",6),
                         rep("triples", 6), rep("homeruns",6))
             )

The code in the following four examples is identical except that a) certain parts are commented out and b) the point label text size and dodge width are adjusted to fit better.

Two variables
df %>% filter(bats == "lefty", 
              outcome == "triples"
              ) %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label,
             #color = bats
             )) + 
  ggstance::geom_pointrangeh(shape = 1) + 
  geom_text(size = 3, 
            nudge_y = 0.2,
            #position = position_dodge(width = 0.5)
            ) +
  labs(title = "Example Title",
       subtitle = "Example Sub",
       x = "Example x axis",
       y = element_blank(),
       caption = "Example Caption") + 
  theme_bw() #+

  #theme(legend.position = "bottom", 
        #axis.text.y = element_text(size=11), 
        #plot.title.position = "plot"
        #) +
  # scale_color_manual(values = c("black", "gray50"),
  #                  name = "Batting Stance",
  #                  guide = guide_legend(reverse = TRUE)) +
  #expand_limits(x = c(0, 100)) + 
  #facet_wrap(~ outcome)
Three variables using faceting
df %>% filter(bats == "lefty", 
              #outcome == "triples"
              ) %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label,
             #color = bats
             )) + 
  ggstance::geom_pointrangeh(shape = 1) + 
  geom_text(size = 3, 
            nudge_y = 0.25,
            #position = position_dodge(width = 0.5)
            ) +
  labs(title = "Example Title",
       subtitle = "Example Sub",
       x = "Example x axis",
       y = element_blank(),
       caption = "Example Caption") + 
  theme_bw() +
  #theme(legend.position = "bottom", 
        #axis.text.y = element_text(size=11), 
        #plot.title.position = "plot"
        #) +
  # scale_color_manual(values = c("black", "gray50"),
  #                  name = "Batting Stance",
  #                  guide = guide_legend(reverse = TRUE)) +
  expand_limits(x = c(0, 100)) +
  facet_wrap(~ outcome)

Three variables using color
df %>% filter(#bats == "lefty", 
              outcome == "triples"
              ) %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label,
             color = bats
             )) + 
  ggstance::geom_pointrangeh(shape = 1) + 
  geom_text(size = 3, 
            #nudge_y = 0.2,
            position = position_dodge(width = 0.5)) +
  labs(title = "Example Title",
       subtitle = "Example Sub",
       x = "Example x axis",
       y = element_blank(),
       caption = "Example Caption") + 
  theme_bw() +
  theme(legend.position = "bottom",
  axis.text.y = element_text(size=11),
  plot.title.position = "plot"
  ) +
  scale_color_manual(values = c("black", "gray50"),
                   name = "Batting Stance",
                   guide = guide_legend(reverse = TRUE)) #+

  # expand_limits(x = c(0, 100)) +
  # facet_wrap(~ outcome)
Four variables using faceting and color
df %>% #filter(#bats == "lefty", 
              #outcome == "triples"
              #) %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label,
             color = bats
             )) + 
  ggstance::geom_pointrangeh(shape = 1) + 
  geom_text(size = 2, 
            #nudge_y = 0.2,
            position = position_dodge(width = 1)) +
  labs(title = "Example Title",
       subtitle = "Example Sub",
       x = "Example x axis",
       y = element_blank(),
       caption = "Example Caption") + 
  theme_bw() +
  theme(legend.position = "bottom",
  axis.text.y = element_text(size=11),
  plot.title.position = "plot"
  ) +
  scale_color_manual(values = c("black", "gray50"),
                   name = "Batting Stance",
                   guide = guide_legend(reverse = TRUE)) +
  expand_limits(x = c(0, 100)) +
  facet_wrap(~ outcome)

Note:

  • value_label = sprintf("%.1f%%", mean) creates the data labels. This can be done ahead of time in the dataset (as in this example) or at the time it is called in ggplot; see the one-line snippet after these notes. The sprintf command is a more flexible version of paste(). The first argument is the output template, and the % sign marks where the value is to be inserted. The .1 indicates that the value should be rounded to 1 decimal place. The %% inserts a literal percent sign.
  • ggstance::geom_pointrangeh() gives you horizontal plots without needing to flip the axes with coord_flip().
    • The shape = 1 argument provides hollow dots, which is helpful if the CIs are ever small.
  • geom_text() allows you to attach value labels to the points. There are two ways to adjust the position of the text
    • The nudge_y = 0.2 argument will move everything up by the specified amount to avoid overlapping the text with the points (a nudge_x argument is also available).
    • The position = position_dodge(width = 0.5) will move one group of labels up and the other group down by the specified amount to avoid overlapping with each other. Therefore, it should be used when there are two colors in play.
    • If there are more than two colors, you’ll need to call geom_text() multiple times (once for each color) and specify a nudge_y amount for each, something like geom_text(data = df %>% filter(bats == "righty"), nudge_y = 0.2).
  • scale_color_manual allows us to manipulate the colors
    • values = c("black", "gray50") allows us to pick our own colors, which may be friendlier to black-and-white printing and color-blind readers. Higher numbers appended to “gray” are actually lighter, not darker (which may seem counterintuitive)
    • guide = guide_legend(reverse = TRUE) corrects the order of the color labels to match the plot
  • y = element_blank() (inside labs()) provides an example of how to get rid of a labeling element in ggplot
  • The themes have several components
    • theme_bw() is a neat, simple-looking theme for publication. However, it must come before your theme() adjustments or it will override them.
    • legend.position = "bottom" moves the legend to the bottom, which is sometimes a more efficient use of space on the page
    • axis.text.y = element_text(size=11) provides an example of how to manually adjust the font size for a particular element relative to the others
    • plot.title.position = "plot" aligns the title and subtitle with the y-axis labels rather than the subplots
  • expand_limits(x = c(0, 100)) ensures that the scale of the x-axis goes from 0 to 100, or whatever we want it to
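
A minimal sketch of creating value_label ahead of time; the mutate() call here is just illustrative, with column names matching the example above:

df <- df %>%
  mutate(value_label = sprintf("%.1f%%", mean))  # e.g., 12.34 becomes "12.3%"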

jtools

effect_plot will plot your chosen predictor against the outcome as a line/curve with a shaded confidence band using robust standard errors. It’s great! A minimal sketch appears after the feature list below.

Features:

  • scatter plot of data points or partial residuals in background if desired
  • for categorical predictors it can make error bars, connected error bars, or dynamite bar plots. The default cross ties are thick and ugly, but this can be adjusted internally
  • rug plot in the margins
  • force a numeric with 0 and 1 (or in theory, more levels) to be treated as categorical
  • the at parameter allows you to set the other predictors to given levels. By default they are mean-centered
  • colors
    • “jtools provides 6 color palettes designed for qualitative data. 4 of the 6 are based on Paul Tol’s suggestions (see references) and are meant to both optimize your ability to quickly differentiate the colors and to be distinguishable to colorblind people.”
    • “Rainbow” is Paul Tol’s compromise rainbow color scheme that is fairly differentiable for colorblind people and when rendered in gray scale.
    • “You may also provide any color palette supported by RColorBrewer…jtools automatically…drops the lightest color.”
    • “If you want something a little non-standard, I’d suggest taking a look at ‘blue2’ or ‘seagreen’.”
    • useful site: https://personal.sron.nl/~pault/
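
A minimal sketch, assuming a fitted model object fit with a continuous predictor age and a factor party (all of these names are hypothetical):

library(jtools)

# continuous predictor: fitted line with a robust confidence band,
# raw data points in the background, and a rug along the margins
effect_plot(fit, pred = age, interval = TRUE, robust = TRUE,
            plot.points = TRUE, rug = TRUE)

# partial residuals instead of raw points, holding party at a chosen level
effect_plot(fit, pred = age, interval = TRUE,
            partial.residuals = TRUE, at = list(party = "Democrat"))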

visreg

This package’s eponymous visreg() function can create plots with either base R graphics or ggplot2 (set gg = TRUE). A minimal sketch appears after the feature list below.

Features:

  • scatter plot of partial residuals
  • uncertainty band included by default (but not robust!)
  • unlike jtools, other variables are held at their median (continuous) or mode (factor)
  • rug plots, including paired top-and-bottom rug plots for logistic regressions
  • can customize a lot of features
  • can produce contrast plots where one factor level is set as the reference and the other levels are compared to it. The y-axis shows how the expected value of the outcome at each of the other levels differs from the expected value at the reference level. Great for comparing treatments, since the reference CI collapses to zero and the others widen
  • be sure to specify scale="response" with logistic regressions
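
A minimal sketch, assuming a fitted logistic model fit with hypothetical predictors age and party:

library(visreg)

# ggplot2 output, with the fitted curve on the probability scale
visreg(fit, "age", gg = TRUE, scale = "response")

# contrast plot: each level of party compared to the reference level
visreg(fit, "party", type = "contrast", gg = TRUE)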

Interactions

interactions package

interactions, a Jacob Long package, is great here. It’s a spinoff from jtools and produces ggplot objects; a minimal sketch appears after the list below. See https://interactions.jacob-long.com/articles/interactions.html for details.

  • interact_plot generates an xy-plot (predictor v. outcome) with lines corresponding to various levels of your moderator.
    • You can include robust confidence bands at whatever level you set, and you can project a scatter plot of the original data points or their partial residuals in the background.
    • The moderator can be a factor or numeric.
    • All other variables are held at their means unless you specify otherwise. It can handle squared terms too and check the linearity assumption.
    • It can handle 3-way interactions too, showing side-by-side panels
  • Simple slopes analysis is pretty useful too
    • sim_slopes will tell you the slope and significance of your predictors at various moderator levels
    • plot a sim_slopes object with plot() to obtain a nice horizontal coefficient plot showing the slope at various moderator levels. You can see which are significant and which differ from each other.
    • apply huxtable::as_huxtable to your sim slopes object to get a regression-table-style table of the slope at various moderator levels ready for publication
    • set johnson_neyman = TRUE to see over what interval of moderator values the predictor slopes are significant. You may wish to set control.fdr = TRUE to deal with multiple comparisons implicit in this approach, but it risks being overly conservative
    • johnson_neyman() will plot a curve with a confidence band showing the slope of the predictor (y-axis) against the moderator value (x-axis), coloring the regions where the slope is significant
    • use probe_interaction to run the simple slopes analysis and the interaction plot all at once. Good for exploring models.
  • If your interaction contains two categorical variables, use cat_plot
    • the default gives you a vertical dot-and-errorbar plot with the different moderator levels differentiated by color
    • you can connect the point estimates with lines
    • you can turn this into a “dynamite” plot: a bar graph with error bars
    • see https://interactions.jacob-long.com/articles/categorical.html for details
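
A minimal sketch, assuming a fitted model fit with a hypothetical predictor age and a continuous moderator experience:

library(interactions)

# predictor-by-moderator plot with robust confidence bands and data points
interact_plot(fit, pred = age, modx = experience,
              interval = TRUE, robust = TRUE, plot.points = TRUE)

# simple slopes at representative moderator values, plus a Johnson-Neyman interval
ss <- sim_slopes(fit, pred = age, modx = experience, johnson_neyman = TRUE)
plot(ss)                   # horizontal plot of the slopes
huxtable::as_huxtable(ss)  # publication-style table

# run the simple slopes analysis and the interaction plot in one call
probe_interaction(fit, pred = age, modx = experience)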

visreg

The eponymous visreg function described above also does interactions. The key advantage is that you can specify whether to display the moderator levels in side-by-side panels or overlay the lines.
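
A minimal sketch with hypothetical variable names:

# overlaid lines, one per moderator level
visreg(fit, "age", by = "party", overlay = TRUE, gg = TRUE)

# side-by-side panels instead
visreg(fit, "age", by = "party", gg = TRUE)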

Imputation

The Amelia package is likely the best for imputation, though it can be a handful. Its documentation is mediocre: Gary King has a rather general page about it with links to published articles, and there’s also a Github page with some very brief explanation, but there’s no vignette to walk you through how to use it. The clarify package has a couple of good vignettes about it here and here.

At its most basic, Amelia works like this:

imputed.dfs <- amelia(pre_imputed_data, 
             noms = c("district", "party", "parent"))

The noms argument specifies nominal (that is, categorical) variables. Everything else, including binary variables, is treated as having a multivariate normal distribution, which according to the package developers tends to work pretty well. There is also an ords argument for ordinal variables, though this is less important, and an idvars argument for id variables or any other variable that should not affect the distribution of the variables you are imputing. The pre_imputed_data must be a data.frame, not a tibble, and you probably want to drop columns you won’t be using in your subsequent regressions. If it’s a big data set, you should either set parallel = "multicore" and ncpus = 4 in the call or globally set options(amelia.parallel = "multicore", amelia.ncpus = 4) at the top of your script (4 is just an example).
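Putting those arguments together, a slightly fuller call might look like this (the variable names are hypothetical, and m = 5 imputations is the default):

options(amelia.parallel = "multicore", amelia.ncpus = 4)

imputed.dfs <- amelia(as.data.frame(pre_imputed_data),
                      m      = 5,
                      noms   = c("district", "party", "parent"),
                      ords   = c("education"),
                      idvars = c("respondent_id"))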

The output will be an object containing 5 imputed datasets, and you will want to run your regressions on all of them using the with() method that Amelia provides:

imp.mods <- with(imputed.dfs, lm(y ~ x1 + x2))

You now have five different models, so you’ll want to aggregate the results. The mi.combine() function will do so, returning a tibble and, if desired, confidence intervals.4

mi.combine(imp.mods, conf.int = TRUE)

From there you can graph the averaged coefficients in ggplot2, create a table, or generate simulated predictions with the clarify package (which is built to accommodate Amelia).
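
For instance, a quick coefficient plot of the pooled estimates, assuming mi.combine() returns broom-style term, estimate, conf.low, and conf.high columns (check the column names of your own output):

mi.combine(imp.mods, conf.int = TRUE) %>%
  ggplot(aes(x = estimate, xmin = conf.low, xmax = conf.high, y = term)) +
  ggstance::geom_pointrangeh(shape = 1) +
  theme_bw()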

I last worked with imputed data before the clarify package came online, when I was using the mi.meld() function instead of mi.combine(). I wrote a useful wrapper around mi.meld() for use with fixest models (it probably works fine with many other model objects). I’m not sure whether mi.combine() renders it superfluous, but I’m leaving it here for now.

library(Amelia)
library(tidyverse)  # for map(), bind_cols(), and the pipe
library(sandwich)   # for vcovHC()

combine_models <- function(imp_regs){
  
  # This is a low-level function that extracts the coefficient estimates
  # and standard errors from a list of model objects and combines them
  # using Rubin's rules. It returns the model for the first imputed data set,
  # the combined coefficient estimates, and the combined standard errors.
  
  # Arguments:
  #   imp_regs : list of model objects, presumably derived from a list of
  #              imputations of a dataset
  
  model.obj <- imp_regs[[1]]
  term.names <- names(model.obj$coefficients)
  
  coef.mtx <- map(imp_regs, coef) %>% bind_cols()
  se.mtx <- 
    map(imp_regs, ~ lmtest::coeftest(., vcov = vcovHC(., type="HC1"))[,2]) %>%
    bind_cols()
  
  combo.list <- mi.meld(q = coef.mtx, se = se.mtx, byrow = FALSE)
  
  coefs <- as.numeric(combo.list$q.mi) %>% setNames(term.names)
  SEs <- as.numeric(combo.list$se.mi) %>% setNames(term.names)
  
  return(list(reg = model.obj, coefs = coefs, SEs = SEs))
}
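
A hypothetical usage sketch (the formula and variable names are placeholders):

imp.mods2 <- purrr::map(imputed.dfs$imputations,
                        ~ fixest::feols(y ~ x1 + x2, data = .x))

combined <- combine_models(imp.mods2)
combined$coefs  # pooled point estimates
combined$SEs    # pooled standard errors (Rubin's rules)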

  1. The generic recode() function still works, but has been superseded by case_match().↩︎

  2. The base R relevel() function only allows you to move a single level into the first position. The reorder() and fct_reorder() functions are for reordering a variable according to the values of some other variable.↩︎

  3. I almost said “strip stuff away” rather than “remove stuff.” Remember that strip is used as a noun, not a verb in this ggplot.↩︎

  4. While point estimates can simply be averaged across the imputed datasets, one needs to combine the standard errors in a way that accounts for the increased uncertainty caused by imputation. The Amelia package uses a method the authors call the “Rubin rules,” after Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.↩︎