Introduction

Welcome to my guide to R! I originally created this document as a place to store functions and code snippets I found useful, and as a note to my future self, but I figured, “Why not share it with colleagues?” A lot of you, it seems, have found it useful. I am a political scientist (site: www.msimonson.com), so this guide reflects the type of statistics I use most often. I hope that it will also be relevant to folks in other fields. Some parts of this guide get updated more often than others, so if you find something outdated or have a suggestion for something to add, please email me at matthew.simonson@mail.huji.ac.il. If you’re looking to learn R for the first time, I recommend R for Data Science (2e), a free online textbook from the creators of the Tidyverse packages. Start with sections 1-4, 6, 8, 27, and 28 for the essentials.

General Advice

  • The tools best suited for creating publication-ready tables and plots may lie in a different package than the ones you use to initially explore and analyze the data

  • Initial data cleaning is best done in a regular R script, but use RMarkdown when sharing results with colleagues. Replication files should use RMarkdown unless there is a good reason not to (e.g., lengthy calculations) since it makes comments easy to read. Your data cleaning can live in a separate R script.

  • Variable names should be easy to decipher, but don’t try to make them publication-worthy. That is, feel free to leave in underscores. To produce the variable names that will appear in your published tables and plots, either create a named vector of the form c(working_name_1 = "New Name 1", working_name_2 = "New Name 2") which you can use repeatedly to rename these variables (a minimal sketch follows this list), or else use labelled data. Try to hold off on doing this until you’re almost ready to circulate a draft of your paper so that you don’t waste time customizing plots and figures to fit names you don’t end up using.
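
Here is a minimal sketch of the renaming-vector idea, with hypothetical column names. The vector maps working names to display names; since dplyr::rename() expects new = old pairs, the vector gets flipped with setNames() first.

pretty_names <- c(working_name_1 = "New Name 1", working_name_2 = "New Name 2")

# flip to new = old and rename just before making the final table or plot
d_pretty <- d %>% rename(!!!setNames(names(pretty_names), pretty_names))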

Review of Important Packages

  • jtools seems like the all-around best package for regression tables and plots, though it does have a few drawbacks. For plotting it builds on ggplot2, so you can use those tools for further customization.

  • Use visreg or easystats for contrast plots that show the differences between various levels of a factor (and whether those differences are significant)

  • ggforestplot will produce nice regression plots once you’ve used broom to tidy your regression output. That said, you can also reproduce its features yourself in ggplot2 without much difficulty if you want more customization

  • fixest is great for analysis but not for plots, because it uses base R graphics. Its tables look nice in a report and can be exported to LaTeX, though complications sometimes ensue. Its biggest advantage for analysis is its ability to run several regressions in one line without cluttering up your code with multiple objects. For instance, you can use split= to split the sample, sw() for stepwise changes in parameters, csw() for nesting, and c(outcome1, outcome2, ...) ~ for comparing different outcomes. The main drawback of fixest is that almost no other packages accept fixest model objects, so you can rarely use it in conjunction with other packages. This will hopefully change over time (the package was first released in June 2021). One workaround may be to use the broom package to tidy up the models for plotting.

  • visreg has a lot of promise for plotting, but I’ve scarcely tried it. I’ve read all the vignettes though. It can do both base R graphics and ggplot.

  • easystats looks very rich in features. It’s tidyverse compliant. Probably good for plotting but not for tables because it has no latex or Rmarkdown output options. It can do robust standard errors.

  • sjmisc is a grab-bag package (as is jtools), but it’s mainly useful for data manipulation and exploration. It provides modest though not huge improvements over what you can already get out of dplyr, forcats, stringr, etc. The main advantage of its functions is that they can incorporate labelled data into tables and plots. However, it often gives unexpected results without warning you, and that, in my opinion, makes it not worth the trouble except for a few functions, which you should call using “::” without loading the whole package, since it tends to override a lot of existing functions.

  • sjPlot seems to have a lot going for it. The figures look great and the regression tables supposedly can be output to latex with the help of an auxiliary function from github.

  • Avoid zelig unless the authors decide to update/fix it. See website. I do need to find another R equivalent of clarify though. One alternative is to use bootstrapping from the sandwich package. Another is to roll your own function (I have such a function in Dropbox).

Ecosystems

  • Daniel Lüdecke packages
    • sjPlot
    • sjmisc
    • sjlabelled
    • sjstats
    • ggeffects
    • this author is also one of three co-creators of the easystats ecosystem. This includes (among others):
      • see
      • parameters
      • modelbased to estimate effects and contrasts between groups
      • effectsize
      • report
      • insight
      • bayestestR for analyzing Bayesian models
  • Claus O. Wilke packages
    • ggtext provides a way to do formatting with markdown and html inside of plot text appearing in captions, titles, legends, ticks, and axes. See website.
    • cowplot allows you to combine graphs into a single figure quite elegantly. It also allows you to annotate plots and provides some themes
    • ggridges creates cool ridgeline plots. See the gallery
  • Alboukadel Kassambara packages
    • ggpubr allows you to combine plots into a single figure
    • ggcorrplot lets you visualize correlations between a bunch of variables and test their significance
  • Declare Design Packages
    • fabricatr allows you to generate fake data
    • estimatr allows you to estimate effect sizes
    • randomizr allows you to randomize treatment assignments, even in complex designs
    • DesignLibrary is a collection of research design templates. There’s also an online wizard to generate code and data for you.
    • All these packages are focused on the preregistration stage, though estimatr can be useful beyond that thanks to its difference-in-means and robust linear regression tools. Still, I would not suggest using it outside of the study design context since it produces a different type of model object that’s hard to use with other functions. On the other hand, this package does have built-in export to LaTeX, so maybe it has everything it needs. All plotting takes place in ggplot with the help of tidy()
  • tidyverse (Hadley Wickham and co-authors)
    • core packages automatically loaded when you run library(tidyverse)
      • ggplot2 to plot
      • dplyr to manipulate and summarize data
      • tidyr to reshape data
      • readr to import CSVs and other files
      • purrr to do functional programming instead of sapply
      • tibble stays in the background
      • stringr to modify strings
      • forcats to modify factors
    • some additional packages to load separately
      • lubridate to convert characters to date-times
      • hms to convert characters to times
      • glue to improve upon the base R paste
      • dtplyr to run data.table in the back-end for speed
      • googlesheets4 to import from Google sheets
      • haven to import from SPSS
      • readxl to import from individual sheets in excel
  • Matt Dowle packages
    • data.table is fast but not easy to read. Still, it sometimes has functions that are super convenient, such as the %like% operator. dtplyr combines data.table’s speed with dplyr’s syntax and is a good alternative when you need the computational efficiency
  • Gregory Demin packages
    • expss is for tables and labels - check it out!
    • maditr provides an alternative to dtplyr, letting you use pipes with data.table
    • excel.link is to facilitate data exchange between R and Excel - check it out!
    • comprehenr is for list comprehensions in R - check it out!
  • Thomas Leeper packages
  • Vincent Arel-Bundock packages

To-Do List

Packages to Review

  • table1
  • modelsummary package with functions
    • modelsummary()
    • modelplot()
    • datasummary()
  • Table packages including gt, gtsummary, kableExtra, and xtable

Setup

Here I load some libraries that will be used in examples throughout.

library(tidyverse)
library(data.table)

RMarkdown Tips

RMarkdown sometimes has trouble finding file paths. The here package is great for that. After loading here, run here("topmost_folder", "file_name") to retrieve the path to your file. You can put that statement directly inside of load(), fread(), or whatever loading function you’re using.
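
For instance, a minimal sketch (the folder and file names are hypothetical):

library(here)

d <- readr::read_csv(here("data", "survey.csv"))  # resolves to <project root>/data/survey.csv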

Internal links can be achieved a few ways, but my preferred method is to put {#my_bookmark} at the end of a heading and then reference it by writing [text_to_display](#my_bookmark). Hyperlinks use a similar approach: [text_to_display](www.mysite.com).

You can set the date to automatically update by putting an inline R expression in the YAML header, for example date: "`r format(Sys.Date(), '%B %d, %Y')`"

Here’s the code to create a table of contents. Feel free to adjust the depth:

output:
  html_document:
    toc: true
    toc_depth: 3

This is a good first line to include:

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

You can display regression tables nicely using kableExtra package as described here

Assorted functions and code snippets to keep handy

This code will standardize a bunch of variables (that is, center them at the mean and divide by the standard deviation):

mutate(across(all_of(c(# list your variables here
  )), 
  ~ scale(.) %>% as.vector()))

This function adjusts p-values to minimize the false discovery rate using the Benjamini & Hochberg method

adjust <- function(results){
  results %>%
    mutate(p_adjusted = p.adjust(`p.value`, method = "BH") %>% round(3),
           `p.value` = `p.value` %>% round(3),
           stars = case_when(p_adjusted<0.001 ~ "***",
                             p_adjusted<0.01 ~ "**",
                             p_adjusted<0.05 ~ "*",
                             p_adjusted<0.10 ~ ".",
                             .default = "")) %>%
    select(study:estimate, `p.value`, p_adjusted:stars)
}

This function adjusts p-values according to four different methods, listed in order of strictness. The Bonferroni method is considered overly conservative and may be dominated by Holm, but it is better known.

adjust_multi <- function(results){
  results %>%
    mutate(
      Bonferroni = p.adjust(`p.value`, method = "bonferroni") %>% round(4),
      Holm = p.adjust(`p.value`, method = "holm") %>% round(4),
      Hommel = p.adjust(`p.value`, method = "hommel") %>% round(4),
      BH = p.adjust(`p.value`, method = "BH") %>% round(4),
      `p.value` = `p.value` %>% round(3)
    ) %>%
    rename(Original = `p.value`) %>%
    select(!estimate:statistic)
}

For functions to create regression plots, see the Coefficient Plots section for fixest and ggplot2

Importing Data

Pre-existing R objects

The readRDS() function is used to load a .rds file directly into a variable.

# Save
saveRDS(dat, "mydataframe.Rds")

# Open
d <- readRDS("mydataframe.Rds")

In contrast, load() will load a .Rdata file containing multiple objects.

# Save
save(dat1, dat2, file = "mydata_objects.Rdata")

# Open
load("mydata_objects.Rdata")

CSV files

For small files, the readr package offers the read_csv() function. For large files, use the data.table package’s fread() function for speed.

Stata and SPSS

Use the haven package’s read_stata and read_spss

Qualtrics

The qualtRics package is great for pulling data directly from a Qualtrics survey. If you are using a Qualtrics account for the first time (or maybe when starting a new project), be sure to register your account within RStudio:

qualtrics_api_credentials(api_key = "adsfadfsvavf",
                          base_url = "iad1.qualtrics.com",
                          install = TRUE,
                          overwrite = TRUE)

Here’s an example:

svy_id <- "SV_bOUmVpTISZiIJMO"
fname_raw <- "raw_data_file"
fname_clean <- "clean_data_file"

get_new_data <- FALSE

if(get_new_data){
  cat("getting new data")
  mysurvey <- fetch_survey(
    surveyID = svy_id, 
    label = FALSE,
    convert = FALSE,
    unanswer_recode_multi = 0, 
    force_request = T,
    col_types = readr::cols(StartDate = readr::col_datetime(),
                            EndDate = readr::col_datetime(),
                            RecordedDate = readr::col_datetime()))
  Qs <- survey_questions(svy_id)
  cmap <- extract_colmap(mysurvey)
  
  save(mysurvey, file = paste0(fname_raw, ".Rda"))
} else {
  cat("loading existing data")
  load(paste0(fname_raw, ".Rda"))
}

Data Cleaning

Cleaning data most often involves the following tasks

  1. Renaming variables
    • Individually
    • Systematically
  2. Recoding values
    • Individually
    • Systematically
  3. Reorganizing factors
    • Reordering the levels
    • Combining levels
  4. Creating new variables
    • From one variable, conditional on its own values
    • From one variable, conditional on another variable
    • From a set of variables
  5. Reshaping data sets
  6. Combining data sets

We will review the best approaches to these tasks here.

In general, the tidyverse packages are preferable to data.table for data cleaning and inspection. Not only do more people understand the tidyverse, but the syntax is more readable and, unless you’re really proficient, requires less looking up how to do stuff. However, there are contexts in which data.table is superior, the most common of which is when you need to change certain values of variable_1 contingent on the values of variable_2. For instance, dt[age < 18, employment := NA] tends to be simpler than the dplyr equivalent:

d <- d %>% mutate(employment = case_when(age < 18 ~ NA_character_,
                                         TRUE ~ employment))

If you have to do this rigmarole for multiple variables, it becomes a real pain, particularly when you could just be copying and pasting your one-line data.table code.

Another use for data.table is the %like% operator, which saves you from having to do a bunch of complicated regex mischief. Finally, data.tables are pretty convenient for randomizing, d[ , new_value := sample(values)], randomizing within groups, d[ , new_value := sample(values), by = group], and bootstrapping, d_boot <- d[sample(.N, replace = TRUE)] (sampling rows with replacement).
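
A few quick sketches of these uses, with hypothetical column names (occupation, school, treatment):

library(data.table)
dt <- as.data.table(d)

dt[occupation %like% "teach"]                           # rows where occupation contains "teach"
dt[ , treatment_perm := sample(treatment)]              # permute treatment across all rows
dt[ , treatment_perm := sample(treatment), by = school] # permute within schools
d_boot <- dt[sample(.N, replace = TRUE)]                # one bootstrap resample of the rows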

While it may be tempting to use the sjmisc and sjlabelled packages for data cleaning (indeed, that’s what the former is primarily intended for), at this point these packages tend to suck up more of your time than they will save. Their primary advantage is that they preserve labels inherited from SPSS, Stata, or Qualtrics, both for the variables (usually the question text in a survey) and the values (e.g., the response options). However, they tend to overwrite existing functions such as as_factor, making them behave differently than you expect, and they rarely (if ever) issue warnings when something goes wrong; they will just silently fail. You’ll spend a bunch of time looking up how to do stuff, and in the end, you’ve introduced a greater risk of error. There are also functions in other packages that will not work on labelled data.

Recoding Values

The most commonly used functions in recoding might just be na_if and replace_na. They work like you imagine they should: simply stick in the value to replace (or to use as a replacement). More generally, recode will rename values while preserving the variable type. It uses old = new syntax:

d %>% mutate(capital = recode(country, 
                              Brazil = "Brasilia",
                              Afghanistan = "Kabul",
                              China = "Beijing"))

recode_factor will transform your variable into a factor and reorder the levels to match the order you wrote all in one fell swoop. The data.frame itself will not be resorted.

d %>% mutate(capital = recode_factor(country, 
                              Brazil = "Brasilia",
                              Afghanistan = "Kabul",
                              China = "Beijing"))

In either case, numeric values need to be surrounded by backticks.

d %>% mutate(year = recode(year, 
                           `1999` = 0,
                           `2000` = 1))

fct_recode from the forcats package (loaded in tidyverse) does much the same thing as recode_factor but with new = old format. Note that quotes are still needed to the right of the equals sign but not the left. fct_recode will accept a character or factor as its input, but the output will always be a factor.

d %>% mutate(capital = fct_recode(country, 
                              Brasilia = "Brazil",
                              Kabul = "Afghanistan",
                              Beijing = "China"))

Overall, recode and recode_factor are preferable to fct_recode. The latter does not reorder the levels of the factor, which is usually something you want to do when recoding. Also, it cannot accept numerics. The new = old syntax does nicely match up with the mutate, rename, and select functions, but not with other renaming functions found elsewhere in R. On the other hand, the tidyverse team considers this argument order problematic and says they will likely deprecate this function once they come up with a workaround.

In both cases you can use !!! to pass a renaming vector to mutate()

recoding_vec <- c(Brazil = "Brasilia", Afghanistan = "Kabul", China = "Beijing")

d %>% mutate(capital = recode(country, !!!recoding_vec))

If needed, you can turn a new = old name vector into an old = new naming vector

old_equals_new <- names(new_equals_old)
names(old_equals_new) <- new_equals_old

One instance in which sjmisc seems reasonably safe, and for which there doesn’t seem to be a good alternative, is if you need to reverse-code a bunch of variables at once. Note that this works outside of mutate (as well as within it).

df %>% rec(starts_with(my_prefix), rec = "rev", suffix = "")

Recoding Multiple Columns

The across function is useful for recoding multiple columns the same way. For instance, we can replace 999 with NA across all the columns that begin with “my_prefix”

df %>% mutate(
  across(starts_with("my_prefix"), .fns = ~ na_if(.x, 999)))

This can also be used to change all logicals to numerics at the end of a cleaning:

df %>% mutate(
  across(where(is.logical), as.numeric))

Use droplevels at the end of a cleaning session to get rid of lingering unused levels in factors.

Recoding a Range of Values

To recode a range of numeric values into a factor (or into fewer numeric values) use case_when

df <- df %>% mutate(new_factor = case_when(nums < 5 ~ "level 1",
                                           nums >=5 ~ "level 2") %>% factor())

The function if_else is harder to read

df <- df %>% mutate(new_factor = if_else(nums < 5, "level 1", "level 2") 
                    %>% factor())

while the base R function ifelse is best avoided because it is not type-stable and silently drops attributes (dates, for example, come out as numbers):

df <- df %>% mutate(new_factor = ifelse(nums < 5, "level 1", "level 2") 
                    %>% factor())

Separately, note that the %in% operator does not respect NAs: for instance 2 %in% NA will return FALSE even though 2 == NA will return NA.

When recoding multiple factor levels into fewer factor levels, an alternative from the forcats package is to use fct_collapse(). This will respect the order and leave omitted levels untouched.

df <- df %>% mutate(new_factor = fct_collapse(
  old_factor,  # the existing factor you're collapsing
  "level a" = c("level 1", "level 2"),
  "level b" = c("level 4", "level 5")))

Assuming there was a level 3, it will remain in the new factor. You can use the other_level argument to collapse all unnamed levels into a single level. As an added bonus, the order of the levels will be the order in which you wrote them. Alternatively, if you have a lot of levels to combine and one to leave alone, you can use fct_other(f, keep = c("level 1", "level 2"), other_level = "Other").

Reordering Factor Levels

If you just have one level that you want to make the baseline (reference level), use the base R function relevel(). If you want to explicitly rearrange multiple levels, use the forcats function fct_relevel(). To reverse the order of the levels (often useful in ggplot), use fct_rev()
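A quick sketch with a hypothetical factor party:

d <- d %>% mutate(party = relevel(party, ref = "Independent"))           # just set the baseline
d <- d %>% mutate(party = fct_relevel(party, "Left", "Center", "Right")) # spell out the full order
d <- d %>% mutate(party = fct_rev(party))                                # reverse the order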

Other Tricks

The datawizard package was created as an alternative to the tidyverse with fewer dependencies. However, it also has a few useful functions that can be used in conjunction with the tidyverse.

The rownames_as_column() function is self-explanatory. I know of no equivalent.

The standardize() function will center and rescale by standard deviation. Additional arguments allow for robust standardization (center at the median, divide by the MAD) or Gelman standardization (divide by 2 SDs). It can be applied to individual variables or all the numeric variables in a dataset, and it works within groups (though I’m not sure if you can use the groups specified by dplyr). A short sketch follows the list below.

Similarly,

  • normalize() will scale variables between 0 and 1
  • demean() will subtract the means, by group if desired
  • rescale() will rescale them to a range you specify
  • reverse() will reverse code them
  • ranktransform() converts values to ranks
  • slide() can shift values up or down uniformly
  • adjust() will, rather amazingly, partial out other variables using a choice of regression models including OLS, random effects, fixed effects, GAMs, or Bayesian models. It does not appear to be able to accommodate logistic models and I’m unsure about interactions. Nevertheless, it’s very impressive.
  • winsorize() will censor extreme values above/below a percentile, z-score, or pair of thresholds you specify
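
Here is the promised sketch, using a hypothetical numeric column income (standardize() and normalize() as described above):

library(datawizard)

d <- d %>% mutate(income_z   = standardize(income),                 # (x - mean) / sd
                  income_rob = standardize(income, robust = TRUE),  # (x - median) / MAD
                  income_01  = normalize(income))                   # rescaled to [0, 1]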

Renaming Variables and Creating New Ones

Variable names should follow a consistent set of rules. I prefer all lowercase letters, words separated by underscores, and suffixes rather than prefixes (since suffixes make it easier to find the variables you want using auto-complete when typing). Here are some useful suffixes for when multiple versions of a variable are included:

  • .cat for categorical (factor)
  • .cat.3 for three levels (assuming the original had a different number of levels)
  • .num for numeric (if the default version is categorical)
  • .01 for binary or rescaled 0 to 1
  • .imp for imputed
  • .dm for demeaned
  • .z for standardized
  • .1, .2, .3 for multiple variables for the same type like if each person is asked to list three friends

You might prefer to use underscores, though I do find it helpful to reserve underscores for separating parts of a name.

Make existing variable names lowercase

d <- d %>% rename_with(tolower, starts_with("my_prefix"))

Replace periods with underscores

d <- d %>% rename_with(~ gsub(".", "_", .x, fixed = TRUE))

The fixed argument ensures that the period and underscore are interpreted as literal strings and not as symbols in regular expressions.

If the majority of variables in your dataset are going to be modified, I like to use transmute to drop all unused variables; any variable you wish to retain without renaming or rename without changing should be listed inside the transmute function. This makes it easy to look stuff up later.
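
A small sketch with hypothetical variable names:

d <- d %>% transmute(
  id,                        # retained as-is
  age,                       # retained as-is
  income_log = log(income),  # new variable
  educ_cat = factor(educ)    # recoded variable
)                            # everything not listed is dropped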

I prioritize easy-to-remember names over short ones or pretty ones, on the assumption that any variable appearing in a publication will be renamed before going into a table or plot.

In order to rename multiple variables with a custom function, use the following dplyr syntax:

rename_columns <- function(df, new_name, old_name){
  df %>% rename({{new_name}} := .data[[old_name]])
}

mutate_columns <- function(df, new_name, old_name){
  df %>% mutate({{new_name}} := get(old_name))
}

I don’t really understand why get (which comes from base R) doesn’t work with rename, but for some reason it throws an error. However, it can be very useful if you want to make the RHS more complex:

mutate_columns <- function(df, new_name, old_name){
  df %>% mutate({{new_name}} := case_when(get(old_name) < 1 ~ 1,
                                          get(old_name) >= 1 ~ 0))
}
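
Hypothetical usage of the two helpers above:

df %>% rename_columns(age_years, "age")  # renames age to age_years
df %>% mutate_columns(short, "height")   # short = 1 if height < 1, else 0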

Variables that Summarize Rows

To summarize rows it is often helpful to use rowwise(), which treats each row as a group. The function c_across() is then used to select columns inside of the mutate() or transmute() function.

Find the average, sum, or max of a row (among columns with the prefix “my_prefix”)

df %>% 
  rowwise() %>% 
  mutate(mean_value = mean(c_across(starts_with("my_prefix"))),
         sum_value = sum(c_across(starts_with("my_prefix"))),
         max_value = max(c_across(starts_with("my_prefix")))
         )

Alternatively:

df %>% 
  mutate(mean_value = rowMeans(cur_data() %>%
                                 select(starts_with("my_prefix"))),
         sum_value = rowSums(cur_data() %>%
                               select(starts_with("my_prefix"))),
         max_value = do.call(pmax, cur_data() %>%  # pmax() needs each column as its own argument
                               select(starts_with("my_prefix")))
  )

I personally think the rowwise() approach is simpler since you don’t need to remember the alternative functions for mean, sum, and max.

Here is a way to create a variable indicating whether a certain value (in this case 1) appears anywhere in a row:

df %>% mutate(row_with_one = if_any(starts_with("my_prefix"), ~ . == 1))

Likewise, if_all() can be used to see if all values in a row are NA:

df %>% mutate(row_all_na = if_all(starts_with("my_prefix"), is.na))

Data Exploration

Try using describe_distribution() and data_tabulate() from the datawizard package. For weighted surveys, you might want weighted_mean() which can be applied to the entire dataset at once (or a subset of it)
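
For example (party is a hypothetical variable):

library(datawizard)

describe_distribution(df)   # mean, SD, range, skewness, etc. for the numeric variables
data_tabulate(df$party)     # frequency table for one variable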

The sjmisc package offers descr(), which allows you to customize what is included and will display variable labels (though irritatingly, not if the variable is a factor; it also throws mysterious errors sometimes).

The readr package from tidyverse offers the spec() function, but it only works on tibbles that have been read in by readr functions

Summary Stats & Balance Tables

The gtsummary package has tbl_summary which I’ve used before. I forget what the tradeoffs are.

The modelsummary package offers the datasummary_balance function, which can create neat summary and balance tables. It can even save them as latex files.

datasummary_balance( ~ 1, 
                     data = df, 
                     fmt = 2, 
                     title = "Descriptive Statistics\\label{tab:desc_stats}",
                     output = "latex"
)

datasummary_balance( ~ treated,
                     data = df,
                     fmt = 2, 
                     dinm_statistic = "p.value",
                     title = "Balance Table\\label{tab:bal_tab}", 
                     output = "latex")

The fmt argument specifies the number of digits to display.

Prep Work

  • tibbles should be converted to data.frames

  • You might want to convert factors to numeric dummies for each level

  • You might want to first replace underscores with spaces using rename_with(~ gsub("_", " ", .x, fixed = TRUE)) so that the table looks nicer. Factors are printed separately at the bottom and lack some statistics that you might want. They also make the table wider.

You can save a table as a string and then print several at once to the same LaTeX file:

cat(summary.table, "\n\n", balance.table, file = "appendix_tables.tex")

Crosstabs

Regressions

Here’s a useful function for building up a formula from a character vector:

make_formula <- function(char_vec){
  char_vec %>% paste(collapse = " + ") %>% paste("~", .) %>% as.formula()
  }
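
For example, the resulting one-sided formula can be combined with an outcome via update():

controls <- c("age", "income", "education")
f <- make_formula(controls)            # ~ age + income + education

lm(update(f, height ~ .), data = d)    # fits height ~ age + income + education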

Fixest

The fixest package was designed for fast estimation of fixed effects models, but it has a much wider variety of uses. You can use it for anything you would normally use lm or glm for. The main times when I don’t use it are for random effects models (I go Bayesian for those) or multinomial logit models for categorical outcomes (use the mlogit package).

Here is an example

outcomes <- c("height", "weight")
controls <- c("age", "income")

mod <- d %>% 
  filter( age < 18) %>%
  feglm( data = . , 
         .[outcomes] ~ treatment + .[controls] | region,
         split = ~ sex)

This example will produce four models: two for each outcome, of which one is fit to the subsample of males and the other to females. You can use fsplit to also include models for the full sample. Fixed effects for region will be included but will not be shown in the summary since they are assumed to be nuisance parameters.

Variables on the RHS can also be prepared using a macro:

setFixest_fml(..ctrls = ~ district + age + male)

However, this is less ideal because one cannot call ..ctrls to inspect it; you have to scroll back up and find where you typed it to see what’s inside. It also cannot easily be reused in non-fixest functions. If you want to use multiple outcomes without pre-specifying them, simply use c() to list them:

mod <- d %>% 
  feglm( data = . , 
         c(height, weight) ~ treatment)

One of the most powerful uses of feglm is the stepwise argument:

mod <- d %>% 
  feglm( data = . , 
         .[outcomes] ~ treatment + sw(age, income))

For each outcome, we will get one model that includes age as a control and one that includes income. The function sw0() will also include a model with neither. The functions csw() and csw0() will add their arguments cumulatively.

When dealing with categorical variables, use the i() function to pre-specify what your baseline category will be:

mod <- d %>% 
  feglm( data = . , 
         .[outcomes] ~ i(race, ref="white"))

You can also use it to interact two variables, the second of which can be categorical or numeric. The following code will create dummy variables for each race and education combination:

mod <- d %>% 
  feglm( data = . , 
         .[outcomes] ~ i(race, ref="white", educ, ref2 = "BA"))

Alternatively, you may wish to include the un-interacted terms as well:

mod <- d %>% 
  feglm( data = . , 
         .[outcomes] ~ i(race, ref="white") + 
           i(educ, ref="BA") +
           i(race, ref="white", educ, ref2 = "BA"))

In the first example, the “race = black x educ = MA” coefficient represents the difference between a Black person with an MA and a White person with a BA (the default categories, represented by the intercept). Thus, it essentially merges race and education into a single categorical variable that includes every combination of the two. In the second example, the “race = black x educ = MA” coefficient represents the additional gain or loss that comes from the Black-MA combination, on top of the effects that MA and Black exert separately. The latter is usually what political scientists are referring to when they talk about interactions.

The following setup code is helpful to run at the beginning of your script:

setFixest_vcov(no_FE = "hetero")

This tells fixest to use HC1 standard errors by default. If you include fixed effects, fixest will cluster standard errors on your fixed effects unless you tell it not to with the vcov argument inside feglm(). In the absence of fixed effects, you can specify what variable to cluster on using the cluster argument. If you have multiple fixed effects, you might specify vcov = "twoway". Note that these arguments can be set universally with setFixest_vcov(), at the model level inside feglm(), or in the model summary functions summary() and etable().
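
A sketch of the three levels at which this can be set (the variable names are hypothetical):

setFixest_vcov(no_FE = "hetero")                  # global default

mod <- feglm(y ~ treatment | region + year,
             data = d,
             vcov = "twoway")                     # at estimation time

etable(mod, vcov = ~state)                        # cluster on state at table time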

You should also include a setup line for what you want etable() to display:

setFixest_etable(drop = c("(Intercept)"), 
                 fitstat = "n", 
                 digits = 2, 
                 depvar = T)

This tells etable() to never display the intercept, drop all summary statistics from the bottom except for sample size, and limit the number of digits to 2. Of course, in many cases we do want other fit statistics, so you can modify fitstat with a vector of names of statistics to include. You can use keep rather than drop if, say, you only want to display the treatment variable by default. The arguments drop and keep use regular expressions, so they will also keep/drop variables with longer names that contain the string you provided. This is often helpful since it includes interactions, but sometimes if you’re using a variable like age there are other variable names containing it that you don’t want to include/exclude. In that case, the regex symbols ^ and $ can get you an exact match: ^age$.

Bayesian

To run on all available cores by default, your script should begin with options(mc.cores = parallel::detectCores())

There are two convenient packages for working with Bayesian statistics: brms is more flexible, but rstanarm has precompiled Stan code, which can speed things up. Both use the same syntax. I think I’d lean toward using rstanarm by default since speed is a big issue in these models and I probably don’t know enough to want extreme customization. However, some of the examples below are in brms for now since that’s what I was using most recently at the time I wrote this section.

The brmstools package appears to extend brms with some useful plotting features.

Here is a brms example of a Bayesian multilevel model with country and year varying intercepts.

mod <- brm(outcome ~ tr + ( 1 | country) + (1|year),
               data = dat,
               seed = 12345,
               control = list(adapt_delta = .95),
               prior = set_prior("normal(0,1)", class = "b"),
               cores = 4)

Note the use of a weakly informative prior for the coefficient estimates. The adapt_delta parameter can be increased towards 1 if the model is misbehaving (the readout will prompt you to do so).

The following function is useful for getting a point estimate and inner and outer confidence intervals for a given parameter:

library(parameters) # for get_parameters()
library(bayestestR) # for hdi()

get_est_and_ci <- function(mod, param){
  posterior.df <- get_parameters(mod, parameters = param)
  posteriors <- posterior.df[,1]
  bind_cols(estimate = median(posteriors),
            hdi(posteriors),
            hdi(posteriors,ci = .9) %>%
              transmute(ci_low_90=CI_low, ci_high_90=CI_high))
  }

The [ , 1] extracts the first column of the data frame returned by get_parameters() (the posterior draws for the requested parameter) as a numeric vector. It is still worth inspecting this function before using it.

List Experiments

To analyze list experiments, Blair and Imai’s list package is recommended. Here is an example:

mod <- ictreg(list_expt_count ~ witness_vio + age + male, 
              treat = "list_expt_treated", J = 3, 
              data = df, 
              method = "ml",
              robust = T) 

The argument J specifies the number of non-sensitive items in the list. The ml method (maximum likelihood) is considered the most efficient, though lm is more interpretable. If there are multiple sensitive items (i.e., some respondents get one sensitive item, others get another), use this approach:

mod <- ictreg(list_expt_count ~ age + male, 
       treat = "list_expt_3_arms", J = 3, 
       data = df, 
       method = "ml", multi.condition = "level") 

If there are no controls included, the formula list_expt_count ~ 1 will suffice.

To measure social desirability bias, you can ask control subjects the sensitive item after the experiment, and then add this variable to their count. Then just use linear regression (no special packages are needed) to see if the list experiment made a significant difference in the count people reported.

d <- d %>% 
  mutate(count_combined = 
           case_when(list_expt_treated == "control" ~ 
                       list_expt_count + sensitive_question,
                     list_expt_treated == "treated" ~ list_expt_count))

lm(count_combined ~ list_expt_treated, data = d)

Multinomial Logit

Regression Tables

stargazer

For most regression functions, the stargazer package and its one-stop-shop stargazer() function provide everything you need. The style argument even allows you to set the style to a particular journal such as AJPS, IO, or APSR. The out argument allows you to save the table directly to a LaTeX file. You can replace the standard errors with ones you supply using se. You can view the model in the console using type = "text".
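
A minimal sketch (mod1 and mod2 are hypothetical fitted models):

library(stargazer)

stargazer(mod1, mod2,
          style = "apsr",
          type = "text",        # preview in the console
          out = "models.tex")   # also write a LaTeX file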

kableExtra: kable_classic()

The kableExtra package provides a good alternative. It is particularly useful for displaying results within an RMarkdown document. The following pretty_print function makes the results look nice.

library(kableExtra)
pretty_print <- function(reg_table_df){
  reg_table_df %>%
    kbl() %>% 
    kable_classic(full_width = F, html_font = "Cambria") %>%
    scroll_box(width = "100%")
}

fixest: etable

When using the fixest package, the etable() function produces nice LaTeX tables and can also print a table to the console as text (set tex = FALSE). You can save the table directly to a LaTeX file with the file argument. Be sure to set replace = TRUE if you want it to overwrite the existing file rather than just appending, though appending is nice if you want to display multiple tables in a row, say, in an appendix.
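
For example, with a hypothetical fixest model mod:

etable(mod, tex = FALSE)                                               # print to the console
etable(mod, tex = TRUE, file = "appendix_tables.tex", replace = TRUE)  # write/overwrite a .tex file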

jtools: summ() and export_summs()

The summ function from jtools provides a number of useful features (a sketch follows this list)

  • Robust standard errors. Note that Stata uses HC1 standard errors while the authors of the sandwich package, used throughout R, recommend HC3. The default in jtools is HC3, but you can adjust it.

  • Scale all predictors by their SD, or as Gelman suggests, by dividing again by 2.

  • can accommodate lm, glm, svyglm, merMod (lme4), and rq (quantreg or quantile regression) models

  • set defaults with set_summ_defaults

  • round to your desired precision or set it globally with options("jtools-digits" = 2). This is purely for display; the underlying summ object retains the original precision of the model. This will globally affect all jtools functions.

  • set_summ_defaults() to set defaults for summ

  • will output both CIs and p-values if requested. CI width can be adjusted. However, it does NOT display stars

  • can exponentiate coefficients
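
Here is the sketch promised above, with a hypothetical model mod:

library(jtools)

summ(mod,
     robust = "HC3",   # robust standard errors
     scale = TRUE,     # standardize the predictors
     confint = TRUE,   # report confidence intervals
     digits = 2)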

export_summs is a wrapper for huxtable::huxreg used to export these tables to LaTeX, RMarkdown, Word, PDF, HTML, or Excel. In RMarkdown, specify results = 'asis' in the code chunk. It provides a lot of customization, including providing CIs and p-values simultaneously if desired. You may need to wrap the result in huxtable::to_latex.

Other packages

The sjPlot package has no native way to export tables to PDF or Latex. But this guy came up with a way. The function is html2latex::html2pdf().

Plots

String Handling

Tidyverse’s stringr package is your friend. To cite but one example, str_wrap() is a much better way to handle line breaks than trying to predict ahead of time where they should go. Here is a useful cheatsheet that also serves as a handy guide to RegEx.

The base R sprintf() function is a far more flexible version of paste() that makes it easy to insert numbers (or other variables) into a string. You can also use it for rounding.
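
Two quick illustrations of str_wrap() and sprintf() (the title text and numbers are made up):

library(stringr)

str_wrap("A needlessly long plot title that would otherwise run off the page", width = 30)

sprintf("Turnout was %.1f%% (n = %d)", 63.4567, 1203L)
# "Turnout was 63.5% (n = 1203)"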

Be careful when using any type of special character in a string. You may be invoking a RegEx command unknowingly.

Themes

Below are the three standard themes from base ggplot2 and their closest equivalents from cowplot. There are also theme_minimal_vgrid and theme_minimal_hgrid themes if you just want grid lines going vertically or horizontally. For the most part, the cowplot versions look similar albeit with larger font sizes.

The grid of multiple plots created below with the + and / operators is enabled by the patchwork package.

library(ggplot2)
library(patchwork)
library(cowplot)
p <- ggplot(mpg, aes(displ, hwy, color = class)) + geom_point()

p1 <- p + theme_minimal() + ggtitle("theme_minimal")
p2 <- p + theme_bw() + ggtitle("theme_bw")
p3 <- p + theme_classic() + ggtitle("theme_classic")
p4 <- p + theme_minimal_grid() + ggtitle("theme_minimal_grid")
p5 <- p + theme_cowplot() + background_grid() + panel_border() + ggtitle("grid and border")
p6 <- p + theme_cowplot() + ggtitle("theme_cowplot")

(p1 + p4) / (p2 + p5) / (p3 + p6)

Note that some of these themes are less desirable when used with faceting

pr <- p + facet_wrap(~drv)

pr1 <- pr + theme_minimal() + ggtitle("theme_minimal")
pr2 <- pr + theme_bw() + ggtitle("theme_bw")
pr3 <- pr + theme_classic() + ggtitle("theme_classic")
pr4 <- pr + theme_minimal_grid() + ggtitle("theme_minimal_grid")
pr5 <- pr + theme_cowplot() + background_grid() + panel_border() + ggtitle("grid and border")
pr6 <- pr + theme_cowplot() + ggtitle("theme_cowplot")

pr1 / pr2 / pr3 / pr4 / pr5 / pr6

theme_bw() is probably best here. The cowplot website has additional suggestions.

The ggforestplot package provides theme_forest() and geom_stripes(), which look nice for alternating colors. A couple of other packages also have their own theme_forest() function.

Multiplots

ggplot2

p <- ggplot(mpg, aes(displ, hwy)) + geom_point() 

The function facet_wrap() can be used to display an additional variable by creating multiple plots, either in a row, a column, or a grid.

p + facet_wrap(~drv)

Equivalently, you can write p + facet_wrap(vars(drv)). You can specify the number of rows and columns with nrow= and ncol= and the direction (“right then down” or “down then right”) with dir= (as.table = F will set it to “up then right”). The arguments switch= and strip.position= control on which side of the subplots the subplot names appear. To change the order or names of the subplots, adjust the levels of the factor you’re faceting on. You can let the axis scales range freely with scales = "free".

If you have more than one variable you’d like to facet on, you may be better off using facet_grid. This allows you to efficiently graph four variables.

p + facet_grid(rows = vars(drv), cols = vars(fl))

patchwork

The patchwork package is terrific in the slightly more complicated scenario where you want to stitch together multiple plots that have different axis labels or even different types of plots.

p1 <- ggplot(mtcars) + 
  geom_point(aes(mpg, disp)) + 
  ggtitle('Plot 1')

p2 <- ggplot(mtcars) + 
  geom_boxplot(aes(gear, disp, group = gear)) + 
  ggtitle('Plot 2')

p3 <- ggplot(mtcars) + 
  geom_point(aes(hp, wt, colour = mpg)) + 
  ggtitle('Plot 3')

p4 <- ggplot(mtcars) + 
  geom_bar(aes(gear)) + 
  facet_wrap(~cyl) + 
  ggtitle('Plot 4')

p1a <- ggplot(mtcars) + 
  geom_point(aes(mpg, disp, colour = mpg, size = wt)) + 
  ggtitle('Plot 1a')

patchwork <- (p1 + p2) / p3
p1 + p4 | (p2 / p3)

Create insets

p1 + theme_classic() + inset_element(p2, left = 0.6, bottom = 0.6, right = 1, top = 1) 

You can give all the plots a common legend.

(p1a | (p2 / p3)) + plot_layout(guides = 'collect')

Annotate the metaplot as well as tag the individual subplots with letters or roman numerals.

patchwork + plot_annotation(
  tag_levels = 'A',
  title = 'The surprising truth about mtcars',
  subtitle = 'These 3 plots will reveal yet-untold secrets about our beloved data-set',
  caption = 'Disclaimer: None of these plots are insightful')

If you have the same axes for all your graphs, just use facet_wrap() since you will clutter the plot by repeating the axes labels on every subplot.

cowplot and ggpubr

The cowplot function plot_grid can also combine multiple plots. It lacks the ability to create a common legend and it’s less flexible. The ggpubr function ggarrange() is a wrapper for plot_grid that adds a common.legend= argument but gets rid of some other arguments that were useful. In both cases, overall plot titles are a bit hard to use since they tend to exceed the margins or overlap the plots; better to just give each plot its own title. Better yet, just use the patchwork package.

Exporting Plots

  • ggsave is the base ggplot option. Don’t forget to specify dimensions. It will save the most recent plot you generated, which can be problematic if you’re stitching individual plots together into a multiplot. However, you can always specify a particular plot you’ve named using the plot= argument (see the sketch after this list).

  • ggpubr::ggexport is used in place of ggarrange() to both make a multiplot and save it.

  • cowplot::save_plot supposedly has good default dimensions for theme_cowplot(). It also claims to be better for multiplots.
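
Here is the ggsave sketch mentioned above (p is a hypothetical named plot):

ggsave("figure_1.pdf", plot = p, width = 7, height = 5, units = "in")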

Coefficient Plots

fixest

The fixest package has a coefplot() function, but it uses base R graphics rather than ggplot2 and thus is best avoided unless you are fluent in base R graphics. However, it’s not too hard to extract the features you want into a data.frame and then plot them with ggplot2.

The following function will extract outcome names, regressor names, coefficient estimates, and inner/outer confidence intervals from a feglm model for plotting. It is set up to use HC1 standard errors.

est_and_ci <- function(model, parameters=NULL, ci.level = 0.95, inner.ci.level = 0.9){
  outer_cis <- confint(model, 
                       parm = parameters, 
                       level = ci.level, 
                       vcov = "hetero")
  names(outer_cis) <- c("outer_low", "outer_high")
  
  inner_cis <- confint(model, 
                       parm = parameters, 
                       level = inner.ci.level, 
                       vcov = "hetero")
  names(inner_cis) <- c("inner_low", "inner_high")
  
  cbind(estimate = coef(model), 
        outer_cis, 
        inner_cis,
        dv = etable(model)[1,1]) %>% 
    as.data.table(keep.rownames = "regressor")
}

If your input is a fixest_multi object (for instance, if you had multiple outcomes, a split sample, or a stepwise comparison), then use the following:

map(mods, est_and_ci) %>% rbindlist()

You can enter other arguments you want after est_and_ci:

map(mods, est_and_ci, parameters = c("age", "sex")) %>% rbindlist()

Here is an alternative function that you could customize to your needs:

tidy_fixest_multi <- function(fixest_multi_object){
  fixest_multi_object %>% purrr::map_dfr(broom::tidy, .id = "outcome") %>%
    filter() %>%   # list any filter criteria here
    mutate() %>%   # rename terms or outcomes as needed
    drop_na()      # drop NAs if desired
}

If you’re splitting the sample with the split argument, add this code after mutate:

separate(outcome,
         into = c("splitter", "study", "outcome"),
         sep = "; ",
         remove = T) %>%
  mutate(across(splitter:outcome, ~ sub(".*: ", "", .)),
         outcome = rename_outcomes(outcome))

ggplot2

Here is an example:

df_to_plot %>%
  ggplot() + 
  geom_pointrange(aes(y = dv, 
                      x = estimate, 
                      xmin = outer_low, 
                      xmax = outer_high, 
                      color = group),
                  position = position_dodge(0.5)) +
  geom_linerange(aes( y = dv,
                      xmin = inner_low,
                      xmax = inner_high,
                      color = group), 
                 size = 1,
                 position = position_dodge(0.5)) +
  facet_wrap(~framework) +
  geom_vline(xintercept = 0, lty = "dotted") + 
  scale_color_grey(
    name = "Group",
    guide = guide_legend(reverse = T)) + 
  theme_minimal() + 
  theme(text = element_text(family = "serif"),
        plot.title = element_text(size = 16, face = "bold")) +
  labs(x = xtext, 
       y = "outcome variable" , 
       caption = "95% CIs with robust standard errors")

Note that guide = guide_legend(reverse = T) inside of scale_XXX() function will get the legend order to match the order of the colors in the plot. You will likely have to use fct_rev() beforehand though as the items will be listed bottom-to-top.

That said, you might want to turn this code into a function to avoid having to do much copying and pasting. Here is an example that doesn’t assume your data frame has the confidence intervals in it yet, along with some useful code for plotting:

# these variables are used in regression plots
dots <- geom_point(stat = "identity", position = position_dodge(width = 0.7))

ci95 <- geom_errorbar(aes(ymin = estimate - 1.96 * std.error, 
                         ymax = estimate + 1.96 * std.error),
                     width = 0.2, position = position_dodge(width = 0.7))

ci90 <- geom_errorbar(aes(ymin = estimate - 1.645 * std.error, 
                  ymax = estimate + 1.645 * std.error),
              width = 0.2, position = position_dodge(width = 0.7), 
              linetype = "dashed")  # 90% CI

# this is our go-to function for regression plots, which uses the variables
# saved above. In rare circumstances, we need to use dots, ci90, ci95 without
# this function to achieve greater customization.
# this function assume you've already created a ggplot2 object from a tibble
# it adds the desired layers (formatting)
decorate <- function(plt, 
                     title = element_blank(), 
                     caption = element_blank(), 
                     xlab = "Outcome", 
                     ylab = "Treatment Effect", 
                     scales = "fixed",
                     facet_var = "study"){
  plt + dots + ci95 + ci90 +
    guides(col = guide_legend(reverse = TRUE)) + 
    geom_hline(yintercept = 0, linetype = "dashed", color = "black") + 
    coord_flip() +
    theme_minimal() + labs(
      title = title,
      caption = caption,
      x = xlab,
      y = ylab
    ) + 
    theme(legend.position = "bottom") +
    facet_wrap(facet_var, scales = scales)
}

If you want colors that are colorblind-safe and black-and-white-printer-safe, the viridisLite package offers a good option. Just be sure to set the end argument to less than 1 or the brightest color will often be too faint.

library(viridisLite)
scale_colour_viridis_d(option = "magma", end =0.8, direction = -1)

Packages that Build on ggplot2

  • jtools::plot_summs is a real blessing. Features:

    • inner and outer confidence intervals
    • robust standard errors (any type you wish)
    • scaled coefficients
    • multiple models with legend in correct order
    • log odds exponentiation
    • unsupported model types that can be coerced to a tidy output with broom will still display, but without some of the extra features. The more general function is called plot_coefs, but there’s no reason to ever use it since plot_summs will alias it when needed
  • ggforestplot::forestplot is also great. Features:

    • hollow dots when p < alpha
    • multiple models with legend in correct order
    • nonstandard evaluation (no need to quote variable names)
    • alternating background colors
    • log odds exponentiation
    • can accept any model type provided you’ve converted the output to a tibble with variable names, estimates, and standard errors. It’s up to you to make these standard errors robust. However, you don’t need to calculate the CIs. See website

Other Packages

The easystats ecosystem features the see package, which I have yet to use. Might be worth investigating.

Jared Lander has a package called coefplot which looks tempting but hasn’t been updated in many years and lacks a webpage. I seem to recall it getting the colors in the legend backwards with no way to fix it. Avoid! Its main redeeming feature is its ability to easily produce secret weapon plots.

Histograms and Bar Charts

There are three scenarios to consider. If your data frame contains the underlying data points and you want R to bin and count them, use geom_histogram(). If your data points are already binned or sorted into categories, use geom_bar(). If your data frame contains counts for each bin/category rather than data points, then use geom_col().

# from unbinned data points
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 20)

# from binned/categorized data points
ggplot(mpg, aes(class)) + geom_bar()

# from bins/categories with counts
data.frame(value = c(.3, .4, .5, .6), count = 1:4) %>% 
  ggplot(aes(x=value, y=count)) + geom_col()

Marginal Effects and Predictions

ggplot2

If you have a dataset of predictions already created, this code provides a good way to visualize it.

Example data:

df <- tibble(player = rep(c("infielder", "outfielder","DH"), 8),
             bats = rep(c("lefty", "righty"), 12),
             mean = 100*runif(24),
             low = mean*0.92,
             high = mean*1.04,
             value_label = sprintf("%.1f%%", mean),
             outcome = c(rep("singles", 6), rep("doubles",6),
                         rep("triples", 6), rep("homeruns",6))
             )

The code in the following four examples is identical except that a) certain parts are commented out and b) the point label text size and dodge width are adjusted to fit better.

Two variables
df %>% filter(bats == "lefty", 
              outcome == "triples"
              ) %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label,
             #color = bats
             )) + 
  ggstance::geom_pointrangeh(shape = 1) + 
  geom_text(size = 3, 
            nudge_y = 0.2,
            #position = position_dodge(width = 0.5)
            ) +
  labs(title = "Example Title",
       subtitle = "Example Sub",
       x = "Example x axis",
       y = element_blank(),
       caption = "Example Caption") + 
  theme_bw() #+

  #theme(legend.position = "bottom", 
        #axis.text.y = element_text(size=11), 
        #plot.title.position = "plot"
        #) +
  # scale_color_manual(values = c("black", "gray50"),
  #                  name = "Batting Stance",
  #                  guide = guide_legend(reverse = TRUE)) +
  #expand_limits(x = c(0, 100)) + 
  #facet_wrap(~ outcome)
Three variables using faceting
df %>% filter(bats == "lefty", 
              #outcome == "triples"
              ) %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label,
             #color = bats
             )) + 
  ggstance::geom_pointrangeh(shape = 1) + 
  geom_text(size = 3, 
            nudge_y = 0.25,
            #position = position_dodge(width = 0.5)
            ) +
  labs(title = "Example Title",
       subtitle = "Example Sub",
       x = "Example x axis",
       y = element_blank(),
       caption = "Example Caption") + 
  theme_bw() +
  #theme(legend.position = "bottom", 
        #axis.text.y = element_text(size=11), 
        #plot.title.position = "plot"
        #) +
  # scale_color_manual(values = c("black", "gray50"),
  #                  name = "Batting Stance",
  #                  guide = guide_legend(reverse = TRUE)) +
  expand_limits(x = c(0, 100)) +
  facet_wrap(~ outcome)

Three variables using color
df %>% filter(#bats == "lefty", 
              outcome == "triples"
              ) %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label,
             color = bats
             )) + 
  ggstance::geom_pointrangeh(shape = 1) + 
  geom_text(size = 3, 
            #nudge_y = 0.2,
            position = position_dodge(width = 0.5)) +
  labs(title = "Example Title",
       subtitle = "Example Sub",
       x = "Example x axis",
       y = element_blank(),
       caption = "Example Caption") + 
  theme_bw() +
  theme(legend.position = "bottom",
  axis.text.y = element_text(size=11),
  plot.title.position = "plot"
  ) +
  scale_color_manual(values = c("black", "gray50"),
                   name = "Batting Stance",
                   guide = guide_legend(reverse = TRUE)) #+

  # expand_limits(x = c(0, 100)) +
  # facet_wrap(~ outcome)
Four variables using faceting and color
df %>% #filter(#bats == "lefty", 
              #outcome == "triples"
              #) %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label,
             color = bats
             )) + 
  ggstance::geom_pointrangeh(shape = 1) + 
  geom_text(size = 2, 
            #nudge_y = 0.2,
            position = position_dodge(width = 1)) +
  labs(title = "Example Title",
       subtitle = "Example Sub",
       x = "Example x axis",
       y = element_blank(),
       caption = "Example Caption") + 
  theme_bw() +
  theme(legend.position = "bottom",
  axis.text.y = element_text(size=11),
  plot.title.position = "plot"
  ) +
  scale_color_manual(values = c("black", "gray50"),
                   name = "Batting Stance",
                   guide = guide_legend(reverse = TRUE)) +
  expand_limits(x = c(0, 100)) +
  facet_wrap(~ outcome)

Note:

  • value_label = sprintf("%.1f%%", mean) creates the data labels. It can be done ahead of time in the dataset (as in this example) or at the time it is called in ggplot. The sprintf command is a more flexible version of paste(). The first argument is the format string, and the initial % sign marks where the value is to be inserted. The .1f indicates that it should be rounded to 1 decimal place. The %% inserts a literal percent sign.
  • ggstance::geom_pointrangeh() gives you horizontal plots without needing to flip the axes with coord_flip().
    • The shape=1 argument provides hollow dots which is helpful if the CIs are ever small
  • geom_text() allows you to attach value labels to the points. There are two ways to adjust the position of the text
    • The nudge_y=0.2 argument will move everything up by the specified amount to avoid overlapping the text with the points (a nudge_x argument is also available).
    • The position = position_dodge(width = 0.5) will move one group of labels up and the other group down by the specified amount to avoid overlapping with each other. Therefore, it should be used when there are two colors in play.
    • If there are more than two colors, you’ll need to call geom_text() multiple times, once for each color, and specify a nudge_y amount for each. Something like geom_text(data = df %>% filter(bats == "righty"), nudge_y = 0.2). See the sketch after this list.
  • scale_color_manual allows us to manipulate the colors
    • values=c("black", "gray50") allows us to pick our own colors, which may be friendlier to black-and-white printing and color-blind readers. Higher numbers appended to “gray” are actually lighter, not darker (which may seem counterintuitive)
    • y = element_blank() (inside labs()) provides an example of how to get rid of a labeling element in ggplot; labs(y = NULL) does the same thing
    • guide = guide_legend(reverse = TRUE) corrects the order of the color labels to match the plot
  • The themes have several components
    • theme_bw() is a neat, simple-looking theme for publication. However, it must come before the theme() call or it will override those adjustments.
    • legend.position = "bottom" moves the legend to the bottom, which is sometimes a more efficient use of space on the page
    • axis.text.y = element_text(size=11) provides an example of how to manually adjust the font size for a particular element relative to the others
    • plot.title.position = "plot" aligns the title and subtitle with the y-axis labels rather than the subplots
  • expand_limits(x = c(0, 100)) ensures that the scale of the x-axis goes from 0 to 100, or whatever we want it to
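
Here is a minimal sketch of the multiple-geom_text approach mentioned above, assuming the same df with bats taking the values "lefty" and "righty" and with the tidyverse and ggstance loaded:

df <- df %>% mutate(value_label = sprintf("%.1f%%", mean))  # create the data labels ahead of time

df %>%
  ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
             label = value_label, color = bats)) +
  ggstance::geom_pointrangeh(shape = 1) +
  # one geom_text() call per color, each with its own nudge
  geom_text(data = df %>% filter(bats == "righty"), nudge_y =  0.25, size = 3) +
  geom_text(data = df %>% filter(bats == "lefty"),  nudge_y = -0.25, size = 3)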

jtools

effect_plot will plot your chosen predictor against the outcome as a line/curve with a shaded confidence band using robust standard errors. It’s great! A minimal sketch follows the feature list below.

Features:

  • scatter plot of data points or partial residuals in background if desired
  • for categorical predictors it can make error bars, connected error bars, or dynamite bar plots. The default cross ties are thick and ugly, but this can be adjusted internally
  • rug plot in the margins
  • force a numeric with 0 and 1 (or in theory, more levels) to be treated as categorical
  • the at parameter allows you to set other predictors to given levels. By default they are mean-centered
  • colors
    • “jtools provides 6 color palettes designed for qualitative data. 4 of the 6 are based on Paul Tol’s suggestions (see references) and are meant to both optimize your ability to quickly differentiate the colors and to be distinguishable to colorblind people.”
    • “Rainbow” is Paul Tol’s compromise rainbow color scheme that is fairly differentiable for colorblind people and when rendered in gray scale.
    • “You may also provide any color palette supported by RColorBrewer…jtools automatically…drops the lightest color.”
    • “If you want something a little non-standard, I’d suggest taking a look at ‘blue2’ or ‘seagreen’.”
    • useful site: https://personal.sron.nl/~pault/
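
Here is a minimal sketch of effect_plot; the model and the variables hits, age, and bats are made up for illustration:

library(jtools)

fit <- lm(hits ~ age + bats, data = df)  # hypothetical model
effect_plot(fit,
            pred = age,          # predictor to plot on the x-axis
            interval = TRUE,     # shaded confidence band
            robust = TRUE,       # robust standard errors for the band
            plot.points = TRUE,  # scatter the raw data in the background
            rug = TRUE)          # rug plot in the margins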

visreg

This package’s eponymous visreg() function can create plots with either base R graphics or ggplot2 (set gg=TRUE). A brief sketch follows the feature list below.

Features:

  • scatter plot of partial residuals
  • uncertainty band included by default (but not robust!)
  • unlike jtools, other variables are held at their median (continuous) or mode (factor)
  • rug plots, including paired top-and-bottom rug plots for logistic regressions
  • can customize a lot of features
  • can produce contrast plots where one factor level is set as the reference and the other factor levels are compared to it. The y-axis shows how the expected value of the outcome at these other predictor levels differs from the EV at the reference level. Great for comparing treatments, since the reference level’s CI shrinks to zero and the others widen
  • be sure to specify scale="response" with logistic regressions
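
A minimal sketch; the models and the variables hits, age, bats, and on_base are made up for illustration:

library(visreg)

fit <- lm(hits ~ age + bats, data = df)            # hypothetical linear model
visreg(fit, "age", gg = TRUE)                      # fitted line with (non-robust) band and partial residuals
visreg(fit, "bats", type = "contrast", gg = TRUE)  # contrast plot against the reference level

fit_logit <- glm(on_base ~ age + bats, data = df, family = binomial)  # hypothetical logistic model
visreg(fit_logit, "age", scale = "response", gg = TRUE)  # predicted probabilities, with top/bottom rugs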

Interactions

interactions package

interactions, a Jacob Long package, is great here. It’s a spinoff from jtools and produces ggplot objects. See https://interactions.jacob-long.com/articles/interactions.html for details, and the sketch after the list below.

  • interact_plot generates an xy-plot (predictor v. outcome) with lines corresponding to various levels of your moderator.
    • You can include robust confidence bands at whatever level you set, and you can project a scatter plot of the original data points or their partial residuals in the background.
    • The moderator can be a factor or numeric.
    • All other variables are held at their means unless you specify otherwise. It can handle squared terms too, and can check linearity assumptions.
    • It can handle 3-way interactions too, showing side-by-side panels
  • Simple slopes analysis is pretty useful too
    • sim_slopes will tell you the slope and significance of your predictors at various moderator levels
    • plot a sim slopes object with plot to obtain a nice horizontal coef plot showing the slope at various moderator levels. You can see which are significant and which are different from each other.
    • apply huxtable::as_huxtable to your sim slopes object to get a regression-table-style table of the slope at various moderator levels ready for publication
    • set johnson_neyman = TRUE to see over what interval of moderator values the predictor slopes are significant. You may wish to set control.fdr = TRUE to deal with multiple comparisons implicit in this approach, but it risks being overly conservative
    • johnson_neyman() will plot a curve with a confidence band showing the slope of the predictor on the y-axis versus the moderator value on the x-axis. It will color the region where the slope is significant
    • use probe_interaction to run the simple slopes analysis and interaction plot all at once. Good for exploring models.
  • If your interaction contains two categorical variables, use cat_plot
    • the default gives you a vertical dot-and-errorbar plot with the different moderator levels differentiated by color
    • you can connect the point estimates with lines
    • you can turn this into a “dynamite” plot: a bar graph with error bars
    • see https://interactions.jacob-long.com/articles/categorical.html for details
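
A minimal sketch with a continuous moderator; the model and the variables hits, age, and weight are made up for illustration:

library(interactions)

fit <- lm(hits ~ age * weight, data = df)          # hypothetical model with an interaction
interact_plot(fit, pred = age, modx = weight,
              interval = TRUE, robust = TRUE,      # robust confidence bands
              plot.points = TRUE)                  # original data points in the background

ss <- sim_slopes(fit, pred = age, modx = weight, johnson_neyman = TRUE)
plot(ss)                   # horizontal plot of the slope of age at various weight levels
huxtable::as_huxtable(ss)  # regression-table-style version of the same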

visreg

The eponymous visreg() function described above also does interactions. The key advantage is that you can specify whether to display the moderator levels in side-by-side panels or overlay the lines.

Imputation

The Amelia package is likely the best for imputation, though it can be a handful. At its most basic, it works like this:

am <- amelia(pre_imputed_data, 
             noms = c("district", "party", "parent"))

The noms argument specifies nominal (that is, categorical) variables. Everything else, including binary variables, is treated as having a multivariate normal distribution, which according to the package developers tends to work pretty well. There is also an ords argument for ordinal variables, though this is less important, and an idvars argument for ID variables or any other variable that should not affect the distribution of the variables you are imputing. The pre_imputed_data must be a data.frame, not a tibble, and you probably want to drop columns you won’t be using in your subsequent regressions. If it’s a big data set, you should either set parallel="multicore" and ncpus=4 in the call, or globally set options(amelia.parallel = "multicore", amelia.ncpus = 4) at the top of your script (note that 4 is just an example).
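
Putting those arguments together, a fuller hypothetical call might look like this ("education" and "respondent_id" are made-up column names):

am <- amelia(pre_imputed_data,
             m        = 5,                                   # number of imputed datasets (5 is the default)
             noms     = c("district", "party", "parent"),    # nominal variables
             ords     = c("education"),                      # hypothetical ordinal variable
             idvars   = c("respondent_id"),                  # excluded from the imputation model
             parallel = "multicore", ncpus = 4)              # parallelize for big data sets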

The output will be 5 imputed datasets (by default) and you will want to run your regressions on all of them. This used to be relatively easy thanks to Zelig, but that package appears to no longer be maintained and many of its functions are broken. While point estimates can simply be averaged across the imputed datasets, aggregating the standard errors is much more difficult and requires what the authors refer to as Rubin’s rules, after Don Rubin’s 1987 book Multiple Imputation for Nonresponse in Surveys. The Amelia package provides a function called mi.meld() which accepts a data frame of point estimates and a data frame of standard errors and returns the combined quantities you want. It’s unclear to me how one might aggregate confidence intervals directly; maybe through bootstrapping.

Here’s a useful wrapper for the mi.meld() function. I developed it with fixest regression in mind, but it probably works fine with a bunch of other model objects.

library(Amelia)
library(purrr)    # map()
library(dplyr)    # %>% and bind_cols()
library(sandwich) # vcovHC()

combine_models <- function(imp_regs){
  
  # this is a low-level function that extracts the coefficient estimates
   # and standard errors from a list of model objects and combines them
   # using Rubin's rules. It returns the model for the first imputed data set,
   # the combined coefficient estimates, and the combined standard errors
  
  # arguments
     # imp_regs : list of model objects, presumably derived from a list of
     # imputations of a dataset
  
  model.obj <- imp_regs[[1]]
  term.names <- names(model.obj$coefficients)
  
  coef.mtx <- map(imp_regs, coef) %>% bind_cols()
  se.mtx <- 
    map(imp_regs, ~ lmtest::coeftest(., vcov = vcovHC(., type="HC1"))[,2]) %>%
    bind_cols()
  
  combo.list <- mi.meld(q = coef.mtx, se = se.mtx, byrow = F)
  
  coefs <- as.numeric(combo.list$q.mi) %>% setNames(term.names)
  SEs <- as.numeric(combo.list$se.mi) %>% setNames(term.names)
  
  return(list(reg = model.obj, coefs = coefs, SEs = SEs))
}
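
A hypothetical usage sketch, assuming am is the Amelia object from above; outcome, treatment, and party are made-up variable names, and lm is used here since vcovHC() is known to accept lm objects:

# fit the same regression on each imputed dataset, then meld the results
imp_regs <- map(am$imputations, ~ lm(outcome ~ treatment + party, data = .x))
combined <- combine_models(imp_regs)
combined$coefs  # melded point estimates
combined$SEs    # melded robust standard errors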

Sections to Add

Random Forests

ML: Lasso and Ridge Regressions

Bootstrapping

Model Selection

Regex