This notebook walks through the manual generation of summary statistics in \(\LaTeX\) that are broken out by treatment status. It uses the stargazer package, but it’s a bit more advanced than the previous walkthrough.
library(stargazer)
library(readxl) # I'm using this to read my data from an Excel file
library(tidyverse) # A suite of packages for manipulating data
library(stats)
I’ll be using updated survey response data from the Colombia-LfP project.
colombia_LfP_raw <- read_xlsx("../colombia_LfP/HH7.7.23.xlsx")
print(dim(colombia_LfP_raw)) # Just confirming the import worked correctly
## [1] 2024 1823
Every new summary statistics table, even if you’ve worked with the source data before, probably requires a somehow unique data cleaning process. That applies here, too, so this notebook unfortunately won’t be able to walk you through that step of things, but stargazer can cut down the amount of this work if you let it.
In short, you want your data cleaning to take your original dataset and shrink it to only the rows and columns you’re interested in. That’s it for now. No additional group-bys or aggregations, just filter()s and select()s, mostly. The aggregations happen later. For now, I’ll apply a toy version of some data cleaning I’ve performed on this dataset before here.
The main change from last time is the addition of a binary column, treat, that indicates treatment status.
# Define vereda geography subset in hh
vereda_status <- c("control-vereda","treatment-vereda")
# NEW: Define a binary column indicating treatment status
colombia_LfP_raw$treat <- ifelse(
colombia_LfP_raw$treatment_status == "treatment-vereda",
1,0
)
# Make a subset of six columns including only responses in "Vereda" geographies
hh_vereda <- colombia_LfP_raw %>%
filter(colombia_LfP_raw$treatment_status %in% vereda_status) %>%
# NEW: include 'treat' in the select() below
select(resp_sex, born, born_head, econstatus, hhmem_size, fhhconsume, treat)
print(dim(hh_vereda)) # Confirming data trimming worked
## [1] 1033 7
stargazer only recognizes numeric fields in its calculation of summary statistics, so ensure that every column that will eventually be included in the data.frame passed to the stargazer() function (which we’ll see soon) is numeric. One example of a process to do this conversion is demonstrated here.
First, find out the types of each column using the str() function
str(hh_vereda)
## tibble [1,033 x 7] (S3: tbl_df/tbl/data.frame)
## $ resp_sex : chr [1:1033] "1" "2" "1" "2" ...
## $ born : chr [1:1033] "0" "0" "1" "1" ...
## $ born_head : chr [1:1033] NA NA NA NA ...
## $ econstatus: num [1:1033] 1 1 1 3 3 2 1 2 1 7 ...
## $ hhmem_size: num [1:1033] 5 1 3 4 5 1 3 3 2 2 ...
## $ fhhconsume: chr [1:1033] "1" "2" "3" "3" ...
## $ treat : num [1:1033] 1 1 1 1 1 1 1 1 1 1 ...
# Define all character columns that you want to convert to numeric
# ... as items in a list of strings
convert_to_numeric <- c(
"resp_sex","born","born_head","fhhconsume"
)
# Once the list of columns to convert is defined, use the code below to apply
# ... the conversion to your data.frame
hh_vereda <- hh_vereda %>%
mutate(across(
all_of(convert_to_numeric),
as.numeric
))
str(hh_vereda) # Confirm changes were applied
## tibble [1,033 x 7] (S3: tbl_df/tbl/data.frame)
## $ resp_sex : num [1:1033] 1 2 1 2 1 1 1 1 1 2 ...
## $ born : num [1:1033] 0 0 1 1 0 0 0 0 0 0 ...
## $ born_head : num [1:1033] NA NA NA NA NA NA NA NA NA NA ...
## $ econstatus: num [1:1033] 1 1 1 3 3 2 1 2 1 7 ...
## $ hhmem_size: num [1:1033] 5 1 3 4 5 1 3 3 2 2 ...
## $ fhhconsume: num [1:1033] 1 2 3 3 2 1 3 2 2 3 ...
## $ treat : num [1:1033] 1 1 1 1 1 1 1 1 1 1 ...
I have one more thing to do: convert a categorical column representing respondent gender to a boolean field representing whether or not the respondent is female.
# Convert resp_sex categorical responses to binary
hh_vereda$resp_f <- ifelse(
hh_vereda$resp_sex == 2,
1,0
)
# Drop resp_sex
hh_vereda <- hh_vereda %>%
select(resp_f, born, born_head, econstatus, hhmem_size, fhhconsume, treat)
stargazer tableThis is the final step, and at the end of it you’ll have a .tex file with a \(\LaTeX\) table of summary statistics broken out by treatment status.
My current dataset includes column titles, like born and fhhconsume, whose meanings might be unclear even to someone who has worked closely on the project. Let’s choose better names for our columns.
NB: No need to include treat in the renaming, because we’re not representing treatment status as a row in our summary. We’re representing it as a horizontal subset. (If that’s a confusing phrase, you’ll see what I mean later.)
# Define a list for renaming columns: "New Name" = "old_name"
renaming_list <- c(
"Respondent Female" = "resp_f",
"Born Locally" = "born",
"Head of HH Born Locally" = "born_head",
"Economic Status" = "econstatus",
"HH Size" = "hhmem_size",
"Forest Covers Basic Needs" = "fhhconsume"
)
# Create a new dataframe with renamed columns
stargazer_prep <- hh_vereda %>%
data.frame() %>% # Include this line to ensure formatting is correct later
rename(all_of(renaming_list))
This is where things get more complicated compared to last time. stargazer has no mechanism to easily define groups (like treatment groups) and display summary statistics in separate horizontal groupings on that basis. So, we need to create our own data.frame object that looks like this and pass it through the stargazer() function with the argument summary=FALSE, as you’ll see below.
This is a pretty long for-loop, but it’s doing simple stuff. For each column from your data you want to summarize (names(renaming_list) above), you want to create a row in your summary table. To do this, we’re going to create a list of vectors (summary_cols below) where each vector will eventually represent a column, and the length of those vectors will eventually be the number of rows. So for each data column you’re summarizing, you’ll tack on (aka append()) one element (a summary statistic) to each vector.
# Define empty list with summary table column titles
# This might look a bit different each time based on what stats you want.
summary_cols = list(
covariate = c(),
control.N = c(),
control.mean = c(),
control.sd = c(),
control.min = c(),
control.max = c(),
treat.N = c(),
treat.mean = c(),
treat.sd = c(),
treat.min = c(),
treat.max = c()
)
# Iterate through the columns we're summarizing to create our table row-by-row
# Notice that each list item above has its own little code chunk in this loop
for (col in names(renaming_list)) {
summary_cols$covariate <- summary_cols$covariate %>%
append(col) # Covariate name
summary_cols$control.N <- summary_cols$control.N %>%
append( # Add to our summary column
stargazer_prep %>% # From our data
filter(treat == 0) %>% # Restrict to treatment_status group
.[, col] %>% # Grab column as a vector
is.na() %>% `!` %>% # Convert to T/F where NA is FALSE
sum() # Control count -- sum up the TRUEs
)
summary_cols$control.mean <- summary_cols$control.mean %>%
append(
stargazer_prep %>%
filter(treat == 0) %>%
.[, col] %>%
mean(na.rm=T) # Control mean
)
summary_cols$control.sd <- summary_cols$control.sd %>%
append(
stargazer_prep %>%
filter(treat == 0) %>%
.[, col] %>%
sd(na.rm=T) # Control standard deviation
)
summary_cols$control.min <- summary_cols$control.min %>%
append(
stargazer_prep %>%
filter(treat == 0) %>%
.[, col] %>%
min(na.rm=T) # Control min
)
summary_cols$control.max <- summary_cols$control.max %>%
append(
stargazer_prep %>%
filter(treat == 0) %>%
.[, col] %>%
max(na.rm=T) # Control max
)
summary_cols$treat.N <- summary_cols$treat.N %>%
append(
stargazer_prep %>%
filter(treat == 1) %>%
.[, col] %>%
is.na() %>% `!` %>%
sum() # Treat count
)
summary_cols$treat.mean <- summary_cols$treat.mean %>%
append(
stargazer_prep %>%
filter(treat == 1) %>%
.[, col] %>%
mean(na.rm=T) # Treat mean
)
summary_cols$treat.sd <- summary_cols$treat.sd %>%
append(
stargazer_prep %>%
filter(treat == 1) %>%
.[, col] %>%
sd(na.rm=T) # Treat standard deviation
)
summary_cols$treat.min <- summary_cols$treat.min %>%
append(
stargazer_prep %>%
filter(treat == 1) %>%
.[, col] %>%
min(na.rm=T) # Treat min
)
summary_cols$treat.max <- summary_cols$treat.max %>%
append(
stargazer_prep %>%
filter(treat == 1) %>%
.[, col] %>%
max(na.rm=T) # Treat max
)
}
# Convert the list of vectors to a data.frame so it prints out nicely
summary_final <- summary_cols %>%
data.frame()
# Display a pretty version of our table
summary_final %>% knitr::kable()
| covariate | control.N | control.mean | control.sd | control.min | control.max | treat.N | treat.mean | treat.sd | treat.min | treat.max |
|---|---|---|---|---|---|---|---|---|---|---|
| Respondent Female | 509 | 0.4047151 | 0.4913197 | 0 | 1 | 513 | 0.3625731 | 0.4812123 | 0 | 1 |
| Born Locally | 509 | 0.1925344 | 0.3946783 | 0 | 1 | 513 | 0.2534113 | 0.4353889 | 0 | 1 |
| Head of HH Born Locally | 114 | 0.1578947 | 0.3662522 | 0 | 1 | 92 | 0.2608696 | 0.4415150 | 0 | 1 |
| Economic Status | 509 | 1.9214145 | 1.2524437 | 1 | 9 | 513 | 2.0350877 | 1.4424970 | 1 | 10 |
| HH Size | 509 | 3.3418468 | 1.7937548 | 1 | 11 | 513 | 3.1598441 | 1.6833820 | 1 | 10 |
| Forest Covers Basic Needs | 509 | 1.9508841 | 0.7909081 | 1 | 3 | 513 | 2.0155945 | 0.7829672 | 1 | 3 |
stargazer tableTime to make the table! Call the stargazer() function with a data.frame containing only the columns you’re interested in as the first argument. Use the out= argument to define the file path for your .tex file
stargazer(
summary_final,
# Note the "rownames=F" argument! Try taking it away to see why I use it
summary= FALSE, rownames= FALSE,
out= "colombia_summary.tex",
title= "Summary Statistics from Vereda Areas"
)
##
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: Tue, Jul 18, 2023 - 1:30:04 PM
## \begin{table}[!htbp] \centering
## \caption{Summary Statistics from Vereda Areas}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}} ccccccccccc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## covariate & control.N & control.mean & control.sd & control.min & control.max & treat.N & treat.mean & treat.sd & treat.min & treat.max \\
## \hline \\[-1.8ex]
## Respondent Female & $509$ & $0.405$ & $0.491$ & $0$ & $1$ & $513$ & $0.363$ & $0.481$ & $0$ & $1$ \\
## Born Locally & $509$ & $0.193$ & $0.395$ & $0$ & $1$ & $513$ & $0.253$ & $0.435$ & $0$ & $1$ \\
## Head of HH Born Locally & $114$ & $0.158$ & $0.366$ & $0$ & $1$ & $92$ & $0.261$ & $0.442$ & $0$ & $1$ \\
## Economic Status & $509$ & $1.921$ & $1.252$ & $1$ & $9$ & $513$ & $2.035$ & $1.442$ & $1$ & $10$ \\
## HH Size & $509$ & $3.342$ & $1.794$ & $1$ & $11$ & $513$ & $3.160$ & $1.683$ & $1$ & $10$ \\
## Forest Covers Basic Needs & $509$ & $1.951$ & $0.791$ & $1$ & $3$ & $513$ & $2.016$ & $0.783$ & $1$ & $3$ \\
## \hline \\[-1.8ex]
## \end{tabular}
## \end{table}
To check your table’s appearance, copy it into Overleaf somewhere or use an online interpreter like quicklatex.com. Here is my table, directly exported from stargazer:
You’ll probably notice here that the table is too wide. Like, there’s no way it’ll fit on a page properly. This happens because we needed to attach control. or treat. to every column name in our R code to maintain unique column references. Let’s directly edit the \(\LaTeX\) code to make something that looks better! Here’s the original, too-wide header from this export:
\begin{table}[!htbp] \centering
\caption{Summary Statistics from Vereda Areas}
\label{}
\begin{tabular}{@{\extracolsep{5pt}} ccccccccccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
covariate & control.N & control.mean & control.sd & control.min & control.max & treat.N & treat.mean & treat.sd & treat.min & treat.max \\
\hline \\[-1.8ex]
We want to do a few things with this:
cs in the \begin{tabular} line, change the c on the left to lTo add the new “Control/Treatment” row to the header, copy the following lines below the two \hlines and above the new column titles. Remember to edit the number ranges in the \cline commands if you need to realign the treatment groups!
& \multicolumn{5}{l}{Control} & \multicolumn{5}{l}{Treatment}
& \cline{2-6} \cline{7-11} \\[-1.8ex]
And here is the new table, along with the new code for the full header:
\begin{table}[!htbp] \centering
\caption{Summary Statistics from Vereda Areas}
\label{}
\begin{tabular}{@{\extracolsep{5pt}} lcccccccccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{5}{l}{Control} & \multicolumn{5}{l}{Treatment}
& \cline{2-6} \cline{7-11} \\[-1.8ex]
Item & N & Mean & St.Dev & Min & Max & N & Mean & St.Dev & Min & Max \\
\hline \\[-1.8ex]
It’s not as easy as the default stargazer tables, and you’ll need to use trial and error to make your \(\LaTeX\) tables look good, but it’s not impossible! Good luck!