Welcome to my guide to R! I originally created this document as a place to store functions and code snippets I found useful, and notes to my future self, but I figured, “Why not share it with colleagues?” A lot of you, it seems, have found it useful. I am a political scientist (site: www.msimonson.com), so this guide reflects the type of statistics I use most often. I hope that it will also be relevant to folks in other fields. Some parts of this guide get updated more often than others, so if you find something outdated or have a suggestion for something to add, please email me at matthew.simonson@mail.huji.ac.il. If you’re looking to learn R for the first time, I recommend R for Data Science (2e), a free online textbook from the creators of the Tidyverse packages. Start with sections 1-4, 6, 8, 27, and 28 for the essentials.
The tools best suited for creating publication-ready tables and plots may lie in a different package from the ones you use to initially explore and analyze the data.
Initial data cleaning is best done in a regular R script, but use Rmarkdown when sharing results with colleagues. Replication files should use Rmarkdown unless there is a good reason not to (e.g., lengthy calculations), since it makes comments easy to read. Your data cleaning can live in a separate R script.
Variable names should be easy to decipher, but don’t try to make them publication worthy. That is, feel free to leave in underscores. To produce the variable names that will appear in your published tables and plots, either create a named vector of the form
c(working_name_1 = "New Name 1", working_name_2 = "New Name 2")
which you can use repeatedly to rename these variables, or else use labelled data. Try to hold off on doing this until you’re almost ready to circulate a draft of your paper, so that you don’t waste time customizing plots and figures to fit names you don’t end up using.
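For example, a vector like this can be spliced into recode() with !!! right before you build a table or plot (a minimal sketch; display_names and plot_df are hypothetical):
display_names <- c(age_years = "Age (years)", income_log = "Log income")
plot_df <- plot_df %>%
  mutate(term = recode(term, !!!display_names))  # working names on the left, display names on the right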
jtools seems like the all-around best package for regression tables and plots, though it does have a few drawbacks. For plotting it builds on ggplot2, so you can use those tools for further customization.
Use visreg or easystats for contrast plots that show the differences between various levels of a factor (and whether those differences are significant).
ggforestplot will produce nice regression plots once you’ve used broom to tidy your regression output. That said, you can also reproduce its features yourself in ggplot without much difficulty if you want more customization.
fixest is great for analysis but not for plots, because it uses base R graphics. Its tables look nice in a report and can be exported to latex, though complications sometimes ensue. Its biggest advantage for analysis is its ability to run several regressions in one line without cluttering up your code with multiple objects. For instance, you can use split= to split the sample, sw() for stepwise changes in parameters, csw() for nesting, and c(outcome, outcome2, ...) ~ for comparing different outcomes.
The main drawback of fixest is that almost no other packages accept fixest model objects, so you can rarely use it in conjunction with other packages. This will hopefully change over time (the package was first released in June 2021). One workaround may be to use the broom package to tidy up the models for plotting.
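For instance, broom has a tidier for fixest objects, so something like the following should give you a plottable data frame (a sketch assuming a fitted fixest model mod):
library(broom)
tidy(mod, conf.int = TRUE)  # one row per term: estimate, std.error, conf.low, conf.high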
visreg has a lot of promise for plotting, but I’ve scarcely tried it. I’ve read all the vignettes, though. It can do both base R graphics and ggplot.
easystats looks very rich in features. It’s tidyverse compliant. It is probably good for plotting but not for tables, because it has no latex or Rmarkdown output options. It can do robust standard errors.
sjmisc is a grab-bag package (as is jtools), but it’s mainly useful for data manipulation and exploration. It provides modest though not huge improvements over what you can already get out of dplyr, forcats, stringr, etc. The main advantage of these functions is that they can incorporate labeled data into tables and plots. However, it often gives unexpected results without warning you, and that, in my opinion, makes it not worth the trouble except for a few functions, which you should call using “::” without loading the whole package, since it tends to override a lot of existing functions.
sjPlot
seems to have a lot going for it. The figures
look great and the regression tables supposedly can be output to latex
with the help of an auxiliary function from github.
Avoid zelig
unless the authors decide to update/fix
it. See website.
I do need to find another R equivalent of clarify though. One
alternative is to use bootstrapping from the sandwich package. Another
is to roll your own function (I have such a function in
Dropbox).
- sjPlot
- sjmisc
- sjlabelled
- sjstats
- ggeffects

The easystats ecosystem includes (among others):
- see
- parameters
- modelbased, to estimate effects and contrasts between groups
- effectsize
- report
- insight
- bayestestR, for analyzing Bayesian models

Plotting helpers:
- ggtext provides a way to do formatting with markdown and html inside of plot text appearing in captions, titles, legends, ticks, and axes. See website.
- cowplot allows you to combine graphs into a single figure quite elegantly. Also allows you to annotate, and offers some themes.
- ggridges: cool ridgeline plots. See gallery.
- ggpubr allows you to combine plots into a single figure.
- ggcorrplot lets you visualize correlations between a bunch of variables and test their significance.

Packages for study design:
- fabricatr allows you to generate fake data.
- estimatr allows you to estimate effect sizes.
- randomizr allows you to randomize treatment assignments, even in complex designs.
- DesignLibrary is a collection of research design templates. There’s also an online wizard to generate code and data for you.

estimatr can be useful beyond that thanks to its diff-in-mean and robust linear regression tools. Still, I would not suggest using it outside of the study design context, since it produces a different type of model object that’s hard to use other functions on. Then again, this package does have built-in export to latex, so maybe the package has everything it needs. All plotting takes place in ggplot with the help of tidy().
tidyverse (Hadley Wickham and co-authors)
library(tidyverse)
- ggplot2 to plot
- dplyr to manipulate and summarize data
- tidyr to reshape data
- readr to import CSVs and other files
- purrr to do functional programming instead of sapply
- tibble stays in the background
- stringr to modify strings
- forcats to modify factors
- lubridate to convert characters to date-times
- hms to convert characters to times
- glue to improve upon the base R paste
- dtplyr to run data.table in the back-end for speed
- googlesheets4 to import from Google sheets
- haven to import from SPSS
- readxl to import from individual sheets in Excel

data.table is fast but not easy to read. Still, it sometimes has functions that are super convenient, such as the %like% operator. dtplyr combines data.table’s speed with dplyr’s syntax and is a good alternative when you need the computational efficiency.
- expss is for tables and labels - check it out!
- maditr provides an alternative to dtplyr, letting you use pipes with data.table
- excel.link is to facilitate data exchange between R and Excel - check it out!
- comprehenr is for list comprehensions in R - check it out!

Packages to Review
- table1
- modelsummary, a package with the functions modelsummary(), modelplot(), and datasummary()
- gt, gtsummary, kableExtra, and xtable
Here I load some libraries that will be used in examples throughout.
library(tidyverse)
library(data.table)
Rmarkdown has trouble finding filepaths. The here package is great for that. After loading here, run here("topmost_folder", "file_name") to retrieve the path to your file. You can put that statement directly inside of load or fread or whatever loading function you’re using.
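A minimal sketch (the folder and file names are hypothetical):
library(here)
d <- readr::read_csv(here("data", "survey.csv"))  # builds the path relative to the project root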
Internal links can be achieved a few ways, but my preferred method is to put {#my_bookmark} at the end of a heading and then reference it by writing [text_to_display](#my_bookmark). Hyperlinks use a similar approach: [text_to_display](www.mysite.com).
You can set the date to automatically update by putting inline R code in the YAML header, e.g.:
date: "`r format(Sys.Date(), '%B %d, %Y')`"
Here’s the code to create a table of contents. Feel free to adjust the depth:
output:
html_document:
toc: true
toc_depth: 3
This is a good first line to include:
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
You can display regression tables nicely using
kableExtra
package as described here
This code will standardize a bunch of variables (that is, center at the mean and divide by the standard deviation):
df %>% mutate(across(all_of(c( # list your variables here
                    )),
                    ~ scale(.) %>% as.vector()))
This function adjusts p-values to minimize the false discovery rate using the Benjamini & Hochberg method
adjust <- function(results){
results %>%
mutate(p_adjusted = p.adjust(`p.value`, method = "BH") %>% round(3),
`p.value` = `p.value` %>% round(3),
stars = case_when(p_adjusted<0.001 ~ "***",
p_adjusted<0.01 ~ "**",
p_adjusted<0.05 ~ "*",
p_adjusted<0.10 ~ ".",
.default = "")) %>%
select(study:estimate, `p.value`, p_adjusted:stars)
}
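Hypothetical usage: results is a tidy data frame with columns study, term, estimate, and p.value (e.g., built by stacking broom::tidy() output from several models and adding a study label):
results %>% adjust()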
This function adjusts p-values according to four different methods, listed in order of strictness. The Bonferroni method is considered overly conservative and may be dominated by Holm, but it is better known.
adjust_multi <- function(results){
results %>%
mutate(
Bonferroni = p.adjust(`p.value`, method = "bonferroni") %>% round(4),
Holm = p.adjust(`p.value`, method = "holm") %>% round(4),
Hommel = p.adjust(`p.value`, method = "hommel") %>% round(4),
BH = p.adjust(`p.value`, method = "BH") %>% round(4),
`p.value` = `p.value` %>% round(3)
) %>%
rename(Original = `p.value`) %>%
select(!estimate:statistic)
}
For functions to create regression plots, see the Coefficient Plots section for fixest and ggplot2
The readRDS()
function is used to load a
.rds
file directly into a variable.
# Save
saveRDS(dat, "mydataframe.Rds")
# Open
d <- readRDS("mydataframe.Rds")
In contrast, load() will load a .Rdata file containing multiple objects.
# Save
save(dat1, dat2, file = "mydata_objects.Rdata")
# Open
load("mydata_objects.Rdata")
For small files, the readr package offers the read_csv() function. For large files, use the data.table package’s fread() function for speed.
Use the haven package’s read_stata and read_spss functions to import Stata and SPSS files.
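A minimal sketch (the file names are hypothetical):
library(haven)
d_stata <- read_stata("mydata.dta")  # Stata
d_spss  <- read_spss("mydata.sav")   # SPSS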
The qualtRics package is great for pulling data directly from a Qualtrics survey. If you are using a Qualtrics account for the first time (or maybe when using a new project), be sure to register your account within RStudio.
qualtrics_api_credentials(api_key = "adsfadfsvavf",
base_url = "iad1.qualtrics.com",
install = TRUE,
overwrite = TRUE)
Here’s an example:
svy_id <- "SV_bOUmVpTISZiIJMO"
fname_raw <- "raw_data_file"
fname_clean <- "clean_data_file"
get_new_data <- FALSE
if(get_new_data){
cat("getting new data")
mysurvey <- fetch_survey(
surveyID = svy_id,
label = FALSE,
convert = FALSE,
unanswer_recode_multi = 0,
force_request = T,
col_types = readr::cols(StartDate = readr::col_datetime(),
EndDate = readr::col_datetime(),
RecordedDate = readr::col_datetime()))
Qs <- survey_questions(svy_id)
cmap <- extract_colmap(mysurvey)
save(mysurvey, file = paste0(fname_raw, ".Rda"))
} else {
cat("loading existing data")
load(paste0(fname_raw, ".Rda"))
}
Cleaning data most often involves the following tasks
We will review the best approaches to these tasks here.
In general, the tidyverse packages are preferable to data.table for data cleaning and inspection. Not only do more people understand tidyverse, but the syntax is more readable and, unless you’re really proficient, requires less looking up how to do stuff. However, there are contexts in which data.table is superior, the most common of which is when you need to change certain values of variable_1, contingent on the values of variable_2. For instance,
dt[age < 18, employment := NA]
tends to be simpler than the dplyr equivalent:
d <- d %>% mutate(employment = case_when(age < 18 ~ NA_character_,
                                         TRUE ~ employment))
If you have to do this rigmarole for multiple variables, it becomes a real pain, particularly when you could just be copying and pasting your one-line data.table code.
Another use for data.table is the %like% operator, which saves you from having to do a bunch of complicated regex mischief. Finally, data.tables are pretty convenient for randomizing:
d[ , new_value := sample(values)]
randomizing within groups:
d[ , new_value := sample(values), by = group]
and bootstrapping (sampling rows with replacement):
d_boot <- d[sample(.N, replace = TRUE), ]
While it may be tempting to use the sjmisc and sjlabelled packages for data cleaning—indeed, that’s what the former is primarily intended for—at this point the package tends to suck up more of your time than it will save. The primary advantage of these packages is that they preserve labels inherited from SPSS, Stata, or Qualtrics, both for the variables (usually the question text in a survey) and the values (e.g., the response options). However, this package tends to overwrite existing base R functions such as as_factor, making them behave differently than you expect, and it rarely (if ever) issues warnings when something goes wrong—it will just silently fail. You’ll spend a bunch of time looking up how to do stuff, and in the end, you’ve introduced a greater risk of error. There are also functions in other packages that will not work on labelled data.
The most commonly used functions in recoding might just be na_if and replace_na. They work like you imagine they should: simply stick in the value to replace (or to use as a replacement). More generally, recode will rename values while preserving the variable type. It uses old = new syntax:
d %>% mutate(capital = recode(country,
Brazil = "Brasilia",
Afghanistan = "Kabul",
China = "Beijing"))
recode_factor
will transform your variable into a factor
and reorder the levels to match the order you wrote all in one fell
swoop. The data.frame itself will not be resorted.
d %>% mutate(capital = recode_factor(country,
Brazil = "Brasilia",
Afghanistan = "Kabul",
China = "Beijing"))
In either case, numeric values need to be surrounded by backticks.
d %>% mutate(year = recode(year,
`1999` = 0,
`2000` = 1))
fct_recode
from the forcats
package (loaded
in tidyverse
) does much the same thing as
recode_factor
but with new = old
format. Note
that quotes are still needed to the right of the equals sign but not the
left. fct_recode
will accept a character or factor as its
input, but the output will always be a factor.
d %>% mutate(capital = fct_recode(country,
Brasilia = "Brazil",
Kabul = "Afghanistan",
Beijing = "China"))
Overall, recode and recode_factor are preferable to fct_recode. The latter does not reorder the levels of the factor, which is usually something you want to do when recoding. Also, it cannot accept numerics. The new = old syntax does nicely match up with the mutate, rename, and select functions, but not with other renaming functions found elsewhere in R. On the other hand, the tidyverse team considers this order problematic and says they will likely deprecate this function once they come up with a workaround.
In both cases you can use !!!
to pass a renaming vector
to mutate()
recoding_vec <- c(Brazil = "Brasilia", Afghanistan = "Kabul", China = "Beijing")
d %>% mutate(capital = recode(country, !!!recoding_vec))
If needed, you can turn a new = old
name vector into an
old = new
naming vector
old_equals_new <- names(new_equals_old)
names(old_equals_new) <- new_equals_old
One instance in which sjmisc seems reasonably safe, and there doesn’t seem to be a good alternative, is if you need to reverse code a bunch of variables at once. Note that this works outside of mutate (as well as within).
df %>% rec(starts_with(my_prefix), rec = "rev", suffix = "")
The across
function is useful for recoding multiple
columns the same way. For instance, we can replace 999
with
NA
across all the columns that begin with “my_prefix”
df %>% mutate(
across(starts_with("my_prefix"), .fns = ~ na_if(.x, 999)))
This can also be used to change all logicals to numerics at the end of a cleaning:
df %>% mutate(
  across(where(is.logical), .fns = as.numeric))
Use droplevels
at the end of a cleaning session to get
rid of lingering unused levels in factors.
To recode a range of numeric values into a factor (or into fewer
numeric values) use case_when
df <- df %>% mutate(new_factor = case_when(nums < 5 ~ "level 1",
nums >=5 ~ "level 2") %>% factor())
The function if_else
is harder to read
df <- df %>% mutate(new_factor = if_else(nums < 5, "level 1", "level 2")
%>% factor())
while the base R function ifelse is riskier: it silently coerces types and strips attributes (factors become integers), and unlike if_else it has no missing argument for handling NA values explicitly.
df <- df %>% mutate(new_factor = ifelse(nums < 5, "level 1", "level 2")
                    %>% factor())
The %in% operator, for its part, does not respect NAs: for instance, 2 %in% NA will return FALSE even though 2 == NA will return NA.
When recoding multiple factor levels into fewer factor levels, an alternative from the forcats package is to use fct_collapse(). This will respect the order and leave omitted factors untouched.
df <- df %>% mutate(new_factor = fct_collapse(old_factor,
                                              "level a" = c("level 1", "level 2"),
                                              "level b" = c("level 4", "level 5")))
Assuming there was a level 3, it will remain in the new factor. You can use the other_level argument to collapse all unnamed levels into a single level. As an added bonus, the order of the levels will be the order in which you wrote them. Alternatively, if you have a lot of levels to combine and one to leave alone, you can use
fct_other(f, keep = c("level 1", "level 2"), other_level = "Other")
If you just have one level that you want to make the baseline
(reference level), use the base R function relevel()
. If
you want to explicitly rearrange multiple levels, use the
forcats
function fct_relevel()
. To reverse the
order of the levels (often useful in ggplot
), use
fct_rev().
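A minimal sketch, assuming a hypothetical factor column educ:
df <- df %>%
  mutate(educ = relevel(educ, ref = "HS"),            # make "HS" the baseline
         educ = fct_relevel(educ, "HS", "BA", "MA"),  # or rearrange several levels at once
         educ_rev = fct_rev(educ))                    # reversed order, handy for ggplot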
The datawizard package was created as an alternative to tidyverse with fewer dependencies. However, it also has a few useful functions that can be used in conjunction with tidyverse.
The rownames_as_column() function is self-explanatory; tibble::rownames_to_column() is a close equivalent.
The standardize() function will center and rescale by standard deviation. Additional arguments allow for robust standardization (center at the median, divide by the MAD) or Gelman standardization (divide by 2 SDs). It can be applied to individual variables or to all the numeric variables in a dataset. It works within groups (though I’m not sure if you can use the groups specified by dplyr).
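A minimal sketch (column names are hypothetical):
library(datawizard)
df$income_z <- standardize(df$income)          # one variable
df_z        <- standardize(df)                 # all numeric variables
df_robust   <- standardize(df, robust = TRUE)  # center at the median, scale by the MAD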
Similarly,
- normalize() will scale variables between 0 and 1
- demean() will subtract the means, by group if desired
- rescale() will rescale them to a range you specify
- reverse() will reverse code them
- ranktransform() converts values to ranks
- slide() can shift values up or down uniformly
- adjust() will, rather amazingly, partial out other variables using a choice of regression models including OLS, random effects, fixed effects, GAMs, or Bayesian. It does not appear to be able to accommodate logistic models, and I’m unsure about interactions. Nevertheless, it’s very impressive.
- winsorize() will censor extreme values above/below a percentile, z-score, or pair of thresholds you specify

Variable names should follow a consistent set of rules. I prefer all lowercase letters, words separated by underscores, and prioritizing suffixes over prefixes (since that makes it easier to find the variables you want using auto-complete when typing). Here are some useful suffixes for when multiple versions of a variable are included:
- .cat for categorical (factor)
- .cat.3 for three levels (assuming the original had a different number of levels)
- .num for numeric (if the default version is categorical)
- .01 for binary or rescaled 0 to 1
- .imp for imputed
- .dm for demeaned
- .z for standardized
- .1, .2, .3 for multiple variables of the same type, as when each person is asked to list three friends

You might prefer to use underscores, though I do find it helpful to reserve underscores for separating parts of a name.
Make existing variable names lowercase
d <- d %>% rename_with(tolower, starts_with("my_prefix"))
Replace periods with underscores
d <- d %>% rename_with(~ gsub(".", "_", .x, fixed = TRUE))
The argument fixed ensures that the period and underscore are interpreted as plain strings and not as symbols in regular expressions.
If the majority of variables in your dataset are going to be modified, I like to use transmute to drop all unused variables; any variable you wish to retain without renaming, or rename without changing, should be listed inside the transmute function. This makes it easy to look stuff up later.
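A minimal sketch with hypothetical variables:
d_clean <- d %>%
  transmute(id,                        # kept as-is
            age,                       # kept as-is
            female = sex == "female",  # recoded
            income_log = log(income))  # transformed; everything else is dropped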
I prioritize easy-to-remember names over short ones or pretty ones, on the assumption that any variable appearing in a publication will be renamed before going into a table or plot.
In order to rename multiple variables with a custom function, use the
following dplyr
syntax:
rename_columns <- function(df, new_name, old_name){
df %>% rename({{new_name}} := .data[[old_name]])
}
mutate_columns <- function(df, new_name, old_name){
df %>% mutate({{new_name}} := get(old_name))
}
I don’t really understand why get (which comes from base R) doesn’t work with rename, but for some reason it throws an error. However, it can be very useful if you want to make the RHS more complex:
mutate_columns <- function(df, new_name, old_name){
df %>% mutate({{new_name}} := case_when(get(old_name) < 1 ~ 1,
get(old_name) >= 1 ~ 0))
}
To summarize rows it is often helpful to use rowwise(), which treats each row as a group. The function c_across() is then used to select columns inside of the mutate() or transmute() function.
Find the average, sum, or max of a row (among columns with the prefix “my_prefix”)
df %>%
rowwise() %>%
mutate(mean_value = mean(c_across(starts_with("my_prefix"))),
sum_value = sum(c_across(starts_with("my_prefix"))),
max_value = max(c_across(starts_with("my_prefix")))
)
Alternatively:
df %>%
  mutate(mean_value = rowMeans(cur_data() %>%
                                 select(starts_with("my_prefix"))),
         sum_value = rowSums(cur_data() %>%
                               select(starts_with("my_prefix"))),
         max_value = do.call(pmax, cur_data() %>%
                               select(starts_with("my_prefix")))
  )
I personally think the rowwise() approach is simpler, since you don’t need to remember the alternative functions for mean, sum, and max.
Here is a way to create a variable indicating whether a certain value (in this case 1) appears anywhere in a row:
df %>% mutate(row_with_one = if_any(starts_with("my_prefix"), ~ . == 1))
Likewise, if_all() can be used to see if all values in a row are NA:
df %>% mutate(row_all_na = if_all(starts_with("my_prefix"), ~ is.na(.)))
Try using describe_distribution()
and
data_tabulate()
from the datawizard
package.
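A minimal sketch (party is a hypothetical column):
library(datawizard)
describe_distribution(df)  # mean, SD, range, skewness, etc. for the numeric columns
data_tabulate(df$party)    # frequency table for a single variable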
For weighted surveys, you might want weighted_mean(), which can be applied to the entire dataset at once (or a subset of it).
The sjmisc package offers descr(), which allows you to customize what is included and will include variable labels (though irritatingly, not if the variable is a factor; it also has mysterious errors sometimes).
The readr package from tidyverse offers the spec() function, but it only works on tibbles that have been read in by readr functions.
The gtsummary
package has tbl_summary
which
I’ve used before. I forget what the tradeoffs are.
The modelsummary
package offers the
datasummary_balance
function, which can create neat summary
and balance tables. It can even save them as latex files.
datasummary_balance( ~ 1,
data = df,
fmt = 2,
title = "Descriptive Statistics\\label{tab:desc_stats}",
output = "latex"
)
datasummary_balance( ~ treated,
data = df,
fmt = 2,
dinm_statistic = "p.value",
title = "Balance Table\\label{tab:bal_tab}",
output = "latex")
The fmt
argument specifies the number of digits to
display.
Prep Work
- tibbles should be converted to data.frames
- You might want to convert factors to numeric dummies for each level
- You might want to first replace underscores with spaces using rename_with(~ gsub("_", " ", .x, fixed = TRUE)) so that it looks nicer. Factors are printed separately at the bottom and lack some statistics that you might want. They also make the table wider.
You can save a table as a string and then print several at once to the same latex file:
cat(summary.table, "\n\n", balance.table, file = "appendix_tables.tex")
Here’s a useful function for building up a formula from a character vector:
make_formula <- function(char_vec){
char_vec %>% paste(collapse = " + ") %>% paste("~", .) %>% as.formula()
}
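Hypothetical usage:
controls <- c("age", "income", "male")
make_formula(controls)  # returns the one-sided formula ~ age + income + male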
The fixest package was designed for fast estimation of fixed effects, but it has a much wider variety of uses. You can use it for anything you would normally use lm or glm for. The main times when I don’t use it are for random effects models (I go Bayesian for those) or multinomial logit models for categorical outcomes (use the mlogit package).
Here is an example
outcomes <- c("height", "weight")
controls <- c("age", "income")
mod <- d %>%
filter( age < 18) %>%
feglm( data = . ,
.[outcomes] ~ treatment + .[controls] | region,
split = ~ sex)
This example will produce four models: two for each outcome, of which one is applied to the subsample of males, the other to females. You can use fsplit to also include models for the full sample. Fixed effects for region will be included but will not be shown in the summary, since they are assumed to be nuisance parameters.
Variables on the RHS can also be prepared using a macro:
setFixest_fml(..ctrls = ~ district + age + male)
However, this is less ideal because one cannot call ..ctrls to inspect it; you have to scroll back up and find where you typed it to see what’s inside. It also cannot easily be reused in non-fixest functions. If you want to use multiple
outcomes without pre-specifying them, simply use c()
to
list them:
mod <- d %>%
feglm( data = . ,
c(height, weight) ~ treatment)
One of the most powerful uses of feglm
is the stepwise
argument:
mod <- d %>%
feglm( data = . ,
.[outcomes] ~ treatment + sw(age, income))
For each outcome, we will get one model that includes age as a control and one that includes income. The function sw0() will also include a model with neither. The functions csw() and csw0() will add their arguments cumulatively.
When dealing with categorical variables, use the i()
function to pre-specify what your baseline category will be:
mod <- d %>%
feglm( data = . ,
.[outcomes] ~ i(race, ref="white"))
You can also use it to interact two variables, the second of which can be categorical or numeric. The following code will create dummy variables for each race and education combination:
mod <- d %>%
feglm( data = . ,
.[outcomes] ~ i(race, ref="white", educ, ref2 = "BA"))
Alternatively, you may wish to include the un-interacted terms as well:
mod <- d %>%
feglm( data = . ,
.[outcomes] ~ i(race, ref="white") +
i(educ, ref="BA") +
i(race, ref="white", educ, ref2 = "BA"))
In the first example, the “race = black x educ = MA” coefficient represents the difference between a Black person with an MA and a White person with a BA (the default categories, represented by the intercept). Thus, it essentially merges race and education into a single categorical variable that includes every combination of the two. In the second example, the “race = black x educ = MA” coefficient represents the additional gain or loss that comes from the Black-MA combination, on top of the effects that MA and Black exert separately. The latter is usually what political scientists are referring to when they talk about interactions.
The following setup code is helpful to run at the beginning of your script:
setFixest_vcov(no_FE = "hetero")
This tells fixest to use HC1 standard errors by default. If you include fixed effects, fixest will cluster standard errors on your fixed effects unless you tell it not to with the vcov argument inside feglm(). In the absence of fixed effects, you can specify what variable to cluster on using the cluster argument. If you have multiple fixed effects, you might specify vcov = "twoway". Note that these arguments can either be set universally with setFixest_vcov(), at the model level inside feglm(), or in the model summary functions summary() and etable().
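A sketch of the per-model and per-table versions (region is a hypothetical clustering variable):
mod <- feglm(y ~ treatment, data = d, cluster = ~ region)  # cluster-robust SEs for this model
etable(mod, vcov = "hetero")                               # or override the vcov at the table stage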
You should also include a setup line for what you want
etable()
to display:
setFixest_etable(drop = c("(Intercept)"),
fitstat = "n",
digits = 2,
depvar = T)
This tells etable() to never display the intercept, drop all summary statistics from the bottom except for sample size, and limit the number of digits to 2. Of course, in many cases we do want other fit statistics, so you can modify fitstat with a vector of names of statistics to include. You can use keep rather than drop if, say, you only want to display the treatment variable by default. The arguments drop and keep use regular expressions, so they will also keep/drop variables with longer names that contain the string you provided. This is often helpful since it includes interactions, but sometimes, if you’re using a variable like age, there are other variable names containing it that you don’t want to include/exclude. In that case, the regex symbols ^ and $ can get you an exact match: ^age$.
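For example (a one-line sketch):
etable(mod, keep = "^age$")  # exact match: show only the coefficient named "age"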
To run on all available cores by default, your script should begin with options(mc.cores = parallel::detectCores()).
There are two convenient packages for working with Bayesian statistics: brms is more flexible, but rstanarm has pre-compiled Stan code, which can speed things up. Both use the same syntax. I think I’d lean toward using rstanarm by default, since speed is a big issue in these models and I probably don’t know enough to want extreme customization. However, some of the examples below are in brms for now, since that’s what I was using most recently at the time I wrote this section.
The brmstools
package appears to extend
brms
with some useful plotting features.
Here is a brms
example of a Bayesian multilevel model
with country and year varying intercepts.
mod <- brm(outcome ~ tr + ( 1 | country) + (1|year),
data = dat,
seed = 12345,
control = list(adapt_delta = .95),
prior = set_prior("normal(0,1)", class = "b"),
cores = 4)
Note the use of a weakly informative prior for the coefficient estimates. The adapt_delta parameter here is set above the brms default of 0.8; it can be increased further toward 1 if the model is not behaving (the readout will prompt you to do so).
The following function is useful for getting a point estimate and inner and outer confidence intervals for a given parameter:
library(parameters) # for get_parameters()
library(bayestestR) # for hdi()
get_est_and_ci <- function(mod, param){
posterior.df <- get_parameters(mod, parameters = param)
posteriors <- posterior.df[,1]
bind_cols(estimate = median(posteriors),
hdi(posteriors),
hdi(posteriors,ci = .9) %>%
transmute(ci_low_90=CI_low, ci_high_90=CI_high))
}
The ,1 selects the first column of the data frame returned by get_parameters()—that is, the vector of posterior draws for the requested parameter—but it is still worth inspecting this function before using it.
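Hypothetical usage (brms names fixed-effect draws with a b_ prefix):
get_est_and_ci(mod, "b_tr")  # median, 95% HDI, and 90% HDI for the tr coefficient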
To analyze list experiments, Blair and Imai’s list
package is recommended. Here is an example:
mod <- ictreg(list_expt_count ~ witness_vio + age + male,
treat = "list_expt_treated", J = 3,
data = df,
method = "ml",
robust = T)
The argument J specifies the number of non-sensitive items in the list. The ml method (maximum likelihood) is considered the most efficient, though lm is more interpretable. If there are multiple sensitive items (i.e., some respondents get one sensitive item, others get another), use this approach:
mod <- ictreg(list_expt_count ~ age + male,
treat = "list_expt_3_arms", J = 3,
data = df,
method = "ml", multi.condition = "level")
If there are no controls included, the formula list_expt_count ~ 1 will suffice.
To measure social desirability bias, you can ask control subjects the sensitive item after the experiment, and then add this variable to their count. Then just use linear regression (no special packages are needed) to see if the list experiment made a significant difference in the count people reported.
d <- d %>%
  mutate(count_combined =
           case_when(list_expt_treated == "control" ~
                       list_expt_count + sensitive_question,
                     list_expt_treated == "treated" ~ list_expt_count))
lm(count_combined ~ list_expt_treated, data = d)
For most regression functions, the stargazer package and its one-stop-shop stargazer() function provides everything you need. The style argument even allows you to set the style to a particular journal such as AJPS, IO, or APSR. The out parameter allows you to save the table directly to a Latex file. You can replace the standard errors with ones you supply using se. You can view the model in the console using type = "text".
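A minimal sketch, assuming two fitted lm models m1 and m2:
library(stargazer)
stargazer(m1, m2, style = "apsr", type = "text")        # preview in the console
stargazer(m1, m2, style = "apsr", out = "results.tex")  # write a LaTeX file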
The kableExtra
package provides a good
alternative. It is particularly useful for displaying results within an
RMarkdown document. The following pretty_print
function
makes the results look nice.
library(kableExtra)
pretty_print <- function(reg_table_df){
reg_table_df %>%
kbl() %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
scroll_box(width = "100%")
}
When using the fixest
package, the etable()
function produces nice latex tables and can also print a table to the
console as text (set tex = F
). You can save the table
directly to a Latex file with the file
argument. Be sure to
set replace = T if you want it to overwrite the existing file rather
than just appending, though appending is nice if you want to display
multiple tables in a row, say, in an appendix.
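For example (a sketch assuming a fitted model mod):
etable(mod, file = "appendix_tables.tex", replace = TRUE)  # overwrite rather than append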
The summ function from jtools provides a number of useful features:
- Robust standard errors. Note that Stata uses HC1 standard errors, while the authors of the sandwich package, used throughout R, recommend HC3. The default in jtools is HC3, but you can adjust it.
- Scale all predictors by their SD, or, as Gelman suggests, by dividing again by 2.
- It can accommodate lm, glm, svyglm, merMod (lme4), and rq (quantreg, i.e., quantile regression) models.
- Set defaults with set_summ_defaults().
- Round to your desired precision, or set it globally with options("jtools-digits" = 2). This is purely for display; the underlying summ object retains the original precision of the model. This will globally affect all jtools functions.
- summ will output both CIs and p-values if requested. CI width can be adjusted. However, it does NOT display stars.
- It can exponentiate coefficients.
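A minimal sketch, assuming a fitted model m:
library(jtools)
summ(m, robust = "HC3", confint = TRUE, scale = TRUE, digits = 2)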
export_summs is a wrapper for huxtable::huxreg; use it to export these tables to Latex, RMarkdown, Word, PDF, HTML, or Excel. In RMarkdown, specify results = 'asis' in the code block. It provides a lot of customization, including providing CIs and p-values simultaneously if desired. You may need to wrap the output in huxtable::to_latex.
The sjPlot
package has no native way to export tables to
PDF or Latex. But this
guy came up with a way. The function is
html2latex::html2pdf()
.
Tidyverse’s stringr package is your friend. To cite but one example, str_wrap() is a much better way to handle line breaks than trying to predict ahead of time where they should go. Here is a useful cheatsheet that also serves as a handy guide to RegEx.
The base R sprintf() function is a far more flexible version of paste() that makes it easy to insert numbers (or other variables) into a string. You can also use it for rounding.
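Two quick sketches:
stringr::str_wrap("A very long axis label that needs to wrap", width = 20)  # inserts line breaks
sprintf("Turnout was %.1f%% in %d", 63.456, 2020)  # "Turnout was 63.5% in 2020"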
Be careful when using any type of special character in a string. You may be invoking a RegEx command unknowingly.
Below are the three standard themes from base ggplot2
and their closest equivalents from cowplot
. There are also
theme_minimal_vgrid
and theme_minimal_hgrid
themes if you just want grid lines going vertically or horizontally. For
the most part, the cowplot versions look similar albeit with larger font
sizes.
The grid of multiple plots created with + and / signs below is enabled by the patchwork package.
library(ggplot2)
library(patchwork)
library(cowplot)
p <- ggplot(mpg, aes(displ, hwy, color = class)) + geom_point()
p1 <- p + theme_minimal() + ggtitle("theme_minimal")
p2 <- p + theme_bw() + ggtitle("theme_bw")
p3 <- p + theme_classic() + ggtitle("theme_classic")
p4 <- p + theme_minimal_grid() + ggtitle("theme_minimal_grid")
p5 <- p + theme_cowplot() + background_grid() + panel_border() + ggtitle("grid and border")
p6 <- p + theme_cowplot() + ggtitle("theme_cowplot")
(p1 + p4) / (p2 + p5) / (p3 + p6)
Note that some of these themes are less desirable when used with faceting
pr <- p + facet_wrap(~drv)
pr1 <- pr + theme_minimal() + ggtitle("theme_minimal")
pr2 <- pr + theme_bw() + ggtitle("theme_bw")
pr3 <- pr + theme_classic() + ggtitle("theme_classic")
pr4 <- pr + theme_minimal_grid() + ggtitle("theme_minimal_grid")
pr5 <- pr + theme_cowplot() + background_grid() + panel_border() + ggtitle("grid and border")
pr6 <- pr + theme_cowplot() + ggtitle("theme_cowplot")
pr1 / pr2 / pr3 / pr4 / pr5 / pr6
theme_bw()
is probably best here. The
cowplot
website has additional suggestions.
The ggforestplot package provides theme_forest() and geom_stripes(), which look nice for alternating colors. A couple of other packages also have their own theme_forest() function.
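A minimal sketch, assuming a data frame df_to_plot with columns term, estimate, conf.low, and conf.high (hypothetical names):
library(ggforestplot)
ggplot(df_to_plot, aes(x = estimate, y = term)) +
  geom_stripes() +                                            # alternating background stripes
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linetype = "dotted") +
  theme_forest()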
p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
The function facet_wrap() can be used to display an additional variable by creating multiple plots, either in a line, column, or grid.
p + facet_wrap(~drv)
Equivalently, you can write p + facet_wrap(vars(drv)). You can specify the number of rows and columns with nrow= and ncol=, and the direction (“right then down” or “down then right”) with dir= (as.table=F will set it to “up then right”). The arguments switch= and strip.position= control on which side of the subplots the subplot names appear. To change the order or names of the subplots, adjust the levels of the factor you’re faceting on. You can let the axis scales range freely with the scales argument.
If you have more than one variable you’d like to facet on, you may be better off using facet_grid(). This allows you to efficiently graph four variables.
p + facet_grid(rows = vars(drv), cols = vars(fl))
The patchwork
package is terrific in the slightly more
complicated scenario where you want to stitch together multiple plots
that have different axis labels or even different types of plots.
p1 <- ggplot(mtcars) +
geom_point(aes(mpg, disp)) +
ggtitle('Plot 1')
p2 <- ggplot(mtcars) +
geom_boxplot(aes(gear, disp, group = gear)) +
ggtitle('Plot 2')
p3 <- ggplot(mtcars) +
geom_point(aes(hp, wt, colour = mpg)) +
ggtitle('Plot 3')
p4 <- ggplot(mtcars) +
geom_bar(aes(gear)) +
facet_wrap(~cyl) +
ggtitle('Plot 4')
p1a <- ggplot(mtcars) +
geom_point(aes(mpg, disp, colour = mpg, size = wt)) +
ggtitle('Plot 1a')
patchwork <- (p1 + p2) / p3
p1 + p4 | (p2 / p3)
Create insets
p1 + theme_classic() + inset_element(p2, left = 0.6, bottom = 0.6, right = 1, top = 1)
You can give all the plots a common legend.
(p1a | (p2 / p3)) + plot_layout(guides = 'collect')
Annotate the metaplot as well as tag the individual subplots with letters or roman numerals.
patchwork + plot_annotation(
tag_levels = 'A',
title = 'The surprising truth about mtcars',
subtitle = 'These 3 plots will reveal yet-untold secrets about our beloved data-set',
caption = 'Disclaimer: None of these plots are insightful')
If you have the same axes for all your graphs, just use facet_wrap(); stitching separate plots together will clutter the figure by repeating the axis labels on every subplot.
cowplot and ggpubr
The cowplot function plot_grid can also combine multiple plots. It lacks the ability to create a common legend and it’s less flexible. The ggpubr function ggarrange() is a wrapper for plot_grid that adds a common.legend= argument but gets rid of some of the other arguments that were useful. In both cases, the labels for individual plots are a bit hard to use since they tend to exceed the margins or overlap the plots; better to just make titles for each plot. Better yet, just use the patchwork package.
ggsave is the base ggplot option. Don’t forget to specify dimensions. It will save the most recent plot you generated, which can be problematic if you’re stitching individual plots together into a multiplot. However, you can always specify a particular plot you’ve named using the plot= argument.
ggpubr::ggexport is used in place of ggarrange() to both make a multiplot and save it.
cowplot::save_plot supposedly has good default dimensions for theme_cowplot(). It also claims to be better for multiplots.
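A minimal sketch:
ggsave("figure1.pdf", plot = p1, width = 7, height = 5, units = "in")  # save a named plot with explicit dimensions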
The fixest package has a coefplot() function, but it uses base R graphics rather than ggplot2 and thus is best avoided unless you are fluent in base R graphics. However, it’s not too hard to extract the features you want into a data.frame and then plot them with ggplot2.
The following function will extract outcome names, regressor names, coefficient estimates, and inner/outer confidence intervals from a feglm model for plotting. It is set up to use HC1 standard errors.
est_and_ci <- function(model, parameters = NULL, ci.level = 0.95, inner.ci.level = 0.9){
  outer_cis <- confint(model,
                       parm = parameters,
                       level = ci.level,
                       vcov = "hetero")
  names(outer_cis) <- c("outer_low", "outer_high")
  inner_cis <- confint(model,
                       parm = parameters,
                       level = inner.ci.level,
                       vcov = "hetero")
  names(inner_cis) <- c("inner_low", "inner_high")
cbind(estimate = coef(model),
outer_cis,
inner_cis,
dv = etable(model)[1,1]) %>%
as.data.table(keep.rownames = "regressor")
}
If your input is a fixest_multi object—for instance, if you had multiple outcomes, a split sample, or a stepwise comparison—then use the following:
map(mods, est_and_ci) %>% rbindlist()
You can enter any other arguments you want after est_and_ci:
map(mods, est_and_ci, parameters = c("age", "sex")) %>% rbindlist()
Here is an alternative function that you could customize to your needs:
tidy_fixest_multi <- function(fixest_multi_object){
  fixest_multi_object %>% purrr::map_dfr(broom::tidy, .id = "outcome") %>%
    filter() %>%  # list any filter criteria here
    mutate() %>%  # rename terms or outcomes as needed
    drop_na()     # drop NAs if desired
}
If you’re splitting the sample with the split
argument,
add this code after mutate:
separate(outcome,
into = c("splitter", "study", "outcome"),
sep = "; ",
remove = T) %>%
mutate(across(splitter:outcome, ~ sub(".*: ", "", .)),
outcome = rename_outcomes(outcome)
)
Here is an example:
df_to_plot %>%
ggplot() +
geom_pointrange(aes(y = dv,
x = estimate,
xmin = outer_low,
xmax = outer_high,
color = group),
position = position_dodge(0.5)) +
geom_linerange(aes( y = dv,
xmin = inner_low,
xmax = inner_high,
color = group),
size = 1,
position = position_dodge(0.5)) +
facet_wrap(~framework) +
geom_vline(xintercept = 0, lty = "dotted") +
scale_color_grey(
name = "Group",
guide = guide_legend(reverse = T)) +
theme_minimal() +
theme(text = element_text(family = "serif"),
plot.title = element_text(size = 16, face = "bold")) +
labs(x = xtext,
y = "outcome variable" ,
caption = "95% CIs with robust standard errors")
Note that guide = guide_legend(reverse = T)
inside of
scale_XXX()
function will get the legend order to match the
order of the colors in the plot. You will likely have to use
fct_rev()
beforehand though as the items will be listed
bottom-to-top.
That said, you might want to turn this code into a function to avoid having to do much copying and pasting. Here is an example that doesn’t assume your data frame has the confidence intervals in it yet, along with some useful code for plotting:
# these variables are used in regression plots
dots <- geom_point(stat = "identity", position = position_dodge(width = 0.7))
ci95 <- geom_errorbar(aes(ymin = estimate - 1.96 * std.error,
ymax = estimate + 1.96 * std.error),
width = 0.2, position = position_dodge(width = 0.7))
ci90 <- geom_errorbar(aes(ymin = estimate - 1.645 * std.error,
ymax = estimate + 1.645 * std.error),
width = 0.2, position = position_dodge(width = 0.7),
linetype = "dashed") # 90% CI
# this is our go-to function for regression plots, which uses the variables
# saved above. In rare circumstances, we need to use dots, ci90, ci95 without
# this function to achieve greater customization.
# this function assume you've already created a ggplot2 object from a tibble
# it adds the desired layers (formatting)
decorate <- function(plt,
title = element_blank(),
caption = element_blank(),
xlab = "Outcome",
ylab = "Treatment Effect",
scales = "fixed",
facet_var = "study"){
plt + dots + ci95 + ci90 +
guides(col = guide_legend(reverse = TRUE)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
coord_flip() +
theme_minimal() + labs(
title = title,
caption = caption,
x = xlab,
y = ylab
) +
theme(legend.position = "bottom") +
facet_wrap(facet_var, scales = scales)
}
If you want colors that are colorblind safe and black-and-white printer safe, the viridisLite package offers a good option. Just be sure to set the end argument at less than 1, or the brightest color will often be too faint.
library(viridisLite)
scale_colour_viridis_d(option = "magma", end =0.8, direction = -1)
jtools::plot_summs is a real blessing. There is also plot_coefs, but there’s no reason to ever use it, since plot_summs will alias it when needed.
ggforestplot::forestplot is also great.
The easystats ecosystem features the see package, which I have yet to use. Might be worth investigating.
Jared Lander has a package called coefplot, which looks tempting but hasn’t been updated in many years and lacks a webpage. I seem to recall it getting the colors in the legend backwards with no way to fix them. Avoid! Its main redeeming feature is its ability to easily produce secret weapon plots.
There are three scenarios to consider. If your data frame contains the underlying data points and you want R to bin and count them, use geom_histogram(). If your data points are already binned or sorted into categories, use geom_bar(). If your data frame contains counts for each bin/category rather than data points, then use geom_col().
# from unbinned data points
ggplot(diamonds, aes(x = carat)) +
geom_histogram(bins = 20)
# from binned/categorized data points
ggplot(mpg, aes(class)) + geom_bar()
# from bins/categories with counts
data.frame(value = c(.3, .4, .5, .6), count = 1:4) %>%
ggplot(aes(x=value, y=count)) + geom_col()
If you have a dataset of predictions already created, this code provides a good way to visualize it.
Example data:
df <- tibble(player = rep(c("infielder", "outfielder","DH"), 8),
bats = rep(c("lefty", "righty"), 12),
mean = 100*runif(24),
low = mean*0.92,
high = mean*1.04,
value_label = sprintf("%.1f%%", mean),
outcome = c(rep("singles", 6), rep("doubles",6),
rep("triples", 6), rep("homeruns",6))
)
The code in the following four examples is identical except that (a) certain parts are commented out and (b) the point label text size and dodge width are adjusted to fit better.
df %>% filter(bats == "lefty",
outcome == "triples"
) %>%
ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
label = value_label,
#color = bats
)) +
ggstance::geom_pointrangeh(shape = 1) +
geom_text(size = 3,
nudge_y = 0.2,
#position = position_dodge(width = 0.5)
) +
labs(title = "Example Title",
subtitle = "Example Sub",
x = "Example x axis",
y = element_blank(),
caption = "Example Caption") +
theme_bw() #+
#theme(legend.position = "bottom",
#axis.text.y = element_text(size=11),
#plot.title.position = "plot"
#) +
# scale_color_manual(values = c("black", "gray50"),
# name = "Batting Stance",
# guide = guide_legend(reverse = TRUE)) +
#expand_limits(x = c(0, 100)) +
#facet_wrap(~ outcome)
df %>% filter(bats == "lefty",
#outcome == "triples"
) %>%
ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
label = value_label,
#color = bats
)) +
ggstance::geom_pointrangeh(shape = 1) +
geom_text(size = 3,
nudge_y = 0.25,
#position = position_dodge(width = 0.5)
) +
labs(title = "Example Title",
subtitle = "Example Sub",
x = "Example x axis",
y = element_blank(),
caption = "Example Caption") +
theme_bw() +
#theme(legend.position = "bottom",
#axis.text.y = element_text(size=11),
#plot.title.position = "plot"
#) +
# scale_color_manual(values = c("black", "gray50"),
# name = "Batting Stance",
# guide = guide_legend(reverse = TRUE)) +
expand_limits(x = c(0, 100)) +
facet_wrap(~ outcome)
df %>% filter(#bats == "lefty",
outcome == "triples"
) %>%
ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
label = value_label,
color = bats
)) +
ggstance::geom_pointrangeh(shape = 1) +
geom_text(size = 3,
#nudge_y = 0.2,
position = position_dodge(width = 0.5)) +
labs(title = "Example Title",
subtitle = "Example Sub",
x = "Example x axis",
y = element_blank(),
caption = "Example Caption") +
theme_bw() +
theme(legend.position = "bottom",
axis.text.y = element_text(size=11),
plot.title.position = "plot"
) +
scale_color_manual(values = c("black", "gray50"),
name = "Batting Stance",
guide = guide_legend(reverse = TRUE)) #+
# expand_limits(x = c(0, 100)) +
# facet_wrap(~ outcome)
df %>% #filter(#bats == "lefty",
#outcome == "triples"
#) %>%
ggplot(aes(x = mean, xmin = low, xmax = high, y = player,
label = value_label,
color = bats
)) +
ggstance::geom_pointrangeh(shape = 1) +
geom_text(size = 2,
#nudge_y = 0.2,
position = position_dodge(width = 1)) +
labs(title = "Example Title",
subtitle = "Example Sub",
x = "Example x axis",
y = element_blank(),
caption = "Example Caption") +
theme_bw() +
theme(legend.position = "bottom",
axis.text.y = element_text(size=11),
plot.title.position = "plot"
) +
scale_color_manual(values = c("black", "gray50"),
name = "Batting Stance",
guide = guide_legend(reverse = TRUE)) +
expand_limits(x = c(0, 100)) +
facet_wrap(~ outcome)
Notes:
- value_label = sprintf("%.1f%%", mean) creates the data labels. It can be done ahead of time in the dataset, as in this example, or at the time it is called in ggplot. The sprintf command is a more flexible version of paste(). The first argument is the format string: the initial % sign marks where the value is to be inserted, the .1 indicates that it should be rounded to 1 decimal place, and the %% inserts a percent sign.
- ggstance::geom_pointrangeh() gives you horizontal plots without needing to flip the axes with coord_flip().
- The shape=1 argument provides hollow dots, which is helpful if the CIs are ever small.
- geom_text() allows you to attach value labels to the points. There are two ways to adjust the position of the text: the nudge_y=0.2 argument will move everything up by the specified amount to avoid overlapping the text with the points (a nudge_x argument is also available), while position = position_dodge(width = 0.5) will move one group of labels up and the other group down by the specified amount to avoid overlapping with each other, so it should be used when there are two colors in play. You can also supply a separate layer per group, e.g. geom_text(data = df %>% filter(bats == "righty"), nudge_y = 0.2).
- scale_color_manual allows us to manipulate the colors: values=c("black", "gray50") lets us pick our own colors, which may be more black-and-white-printer and color-blind-reader friendly. Higher numbers appended to “gray” are actually lighter, not darker (which may seem counterintuitive).
- y = element_blank() provides an example of how to get rid of a labeling element in ggplot.
- guide = guide_legend(reverse = TRUE) corrects the order of the color labels to match the plot.
- theme_bw() is a neat and simple-looking theme for publication. However, it must come first or it will override the others.
- legend.position = "bottom" moves the legend to the bottom, which is sometimes a more efficient use of space on the page.
- axis.text.y = element_text(size=11) provides an example of how to manually adjust the font size for a particular element relative to the others.
- plot.title.position = "plot" aligns the title and subtitle with the y-axis labels rather than the subplots.
- expand_limits(x = c(0, 100)) ensures that the scale of the x-axis goes from 0 to 100, or whatever we want it to.

effect_plot will plot your chosen predictor against the outcome as a line/curve with a shaded confidence band using robust standard errors. It’s great!
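Before the feature list, here is a minimal sketch, assuming a fitted model m with a continuous predictor age:
library(jtools)
effect_plot(m, pred = age, interval = TRUE, robust = TRUE)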
Features:
- The at parameter allows you to set the other predictors to a given level; by default, they are mean-centered.

The visreg package’s eponymous visreg() function can create plots with either base R graphics or ggplot2
plots with either base R graphics or ggplot2
(set
gg=TRUE
).
Features:
jtools
, other variables are held to their median
(continuous) or mode (factor)scale="response"
with logistic
regressionsinteractions
, a Jacob Long package, is great here. It’s
a spinoff from jtools
and produces ggplot
objects. See https://interactions.jacob-long.com/articles/interactions.html
for details
interact_plot
generates an xy-plot (predictor v.
outcome) with lines corresponding to various levels of your moderator.
sim_slopes
will tell you the slope and significance of
your predictors at various moderator levelsplot
to obtain a nice
horizontal coef plot showing the slope at various moderator levels. You
can see which are significant and which are different from each
other.huxtable::as_huxtable
to your sim slopes object
to get a regression-table-style table of the slope at various moderator
levels ready for publicationjohnson_neyman = TRUE
to see over what interval of
moderator values the predictor slopes are significant. You may wish to
set control.fdr = TRUE
to deal with multiple comparisons
implicit in this approach, but it risks being overly conservativejohnson_neyman()
will give plot a curve with confidence
band showing the slope of the predictor on the y-axis versus the
moderator value on the x-axis. It will color where it is
significantprobe_interaction
to simple slopes analysis and
interaction plot all at once. Good for exploring models.cat_plot
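A minimal sketch, assuming a fitted model m with an age-by-income interaction (hypothetical names):
library(interactions)
interact_plot(m, pred = age, modx = income, interval = TRUE)
ss <- sim_slopes(m, pred = age, modx = income, johnson_neyman = TRUE)
plot(ss)                   # coefficient-style plot of the simple slopes
huxtable::as_huxtable(ss)  # publication-style table of the simple slopes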
The eponymous visreg function described above also does interactions. The key advantage is that you can specify whether to display levels in side-by-side panels or
overlay the lines.
The Amelia package is likely the best for imputation, though it can be a handful. At its most basic, it works like this:
am <- amelia(pre_imputed_data,
noms = c("district", "party", "parent"))
The noms
argument specifies nominal (that is,
categorical) variables. Everything else including binary variables is
treated as having a multivariate normal distribution, which according to
the package developers tends to work pretty well. There is also an
ords
argument for ordinal variables, though this is less
important, and an idvars
argument for id variables or any
other variable that should not affect the distribution of the variables
you are imputing. The pre_imputed_data
must be
data.frame
, not a tibble
, and you probably
want to drop columns you won’t be using in your subsequent regressions.
If it’s a big data set, you should either set
parallel="multicore"
and ncpus=4
, or globally
set
options(amelia.parallel = "multicore", amelia.ncpus = 4)
at
the top of your script (note that 4 is just an example).
The output will be 5 datasets and you will want to run your
regressions on all of them. This used to be relatively easy thanks to
Zelig
, but that package appears to no longer be maintained
and many of the functions are broken. While point estimates can simply
be averaged across the imputed datasets, aggregating the standard errors
is much more difficult and requires what the authors refer to as Rubin’s rules, after Don Rubin’s 1987 book “Multiple Imputation for
Nonresponse in Surveys”. The Amelia
package provides a
function called mi.meld()
which accepts a data frame of
point estimates and a data frame of standard errors and will return the
quantities you want. It’s unclear to me whether how one might aggregated
confidence intervals directly; maybe through bootstrapping.
Here’s a useful wrapper for the mi.meld()
function. I
developed it with fixest regression in mind, but it probably works fine
with a bunch of other model objects.
library(Amelia)
combine_models <- function(imp_regs){
# this is a low-level function that extracts the coefficient estimates
# and standard errors from a list of model objects and combines them
# using Rubin's rules. It returns the model for the first imputed data set,
# the combined coefficient estimates, and the combined standard errors
# arguments
# imp_regs : list of model objects, presumably derived from a list of
# imputations of a dataset
model.obj <- imp_regs[[1]]
term.names <- names(model.obj$coefficients)
coef.mtx <- map(imp_regs, coef) %>% bind_cols()
se.mtx <-
    map(imp_regs, ~ lmtest::coeftest(., vcov = sandwich::vcovHC(., type = "HC1"))[,2]) %>%
bind_cols()
combo.list <- mi.meld(q = coef.mtx, se = se.mtx, byrow = F)
coefs <- as.numeric(combo.list$q.mi) %>% setNames(term.names)
SEs <- as.numeric(combo.list$se.mi) %>% setNames(term.names)
return(list(reg = model.obj, coefs = coefs, SEs = SEs))
}