Working Better with R Markdown

Tony Chen¹

2020-12-11

Introduction

Recently, I completed an assignment that applied basic statistical methodology and data science. With some tweaks, I thought this would also be a good opportunity to share some introductory pointers to regressions, R, and some helpful packages.

If you found this interesting, check out my page on RPubs or LinkedIn.

Key takeaways

Using Rmarkdown to record & publish your work
- Using the “rmdformats” package to add some flair to your finished product
Installing & Loading Packages
Loading & preparing data with Tidyverse
Plotting with ggplot2
An example of a basic linear regression
Using stargazer to publish regression results

Writing in Rmarkdown

R Markdown in an incredibly useful and powerful tool for writing out easily reproducible work. It comes with R Studio, and new markdown files can be created in the top left corner of the program (click the white page with the plus sign, then select R Markdown).

I usually do my write-ups with the R Markdown cheat sheet from RStudio (Barnier 2020) opened in the background. The platform lets me write out code, html elements, and mathematical symbols with ease (and can also output in pdf and word!).

For example, with regards to our example regression:

Let’s ask: “how do key political socio-economic conditions relate to female life expectancy?”

We present our first model in response to this question:

Life expectancy at birth, female ~ Mean years of schooling (females aged 25 years and above) + Share of seats in parliament (% held by women) + Adolescent birth rate (births per 1,000 women ages 15-19) + Population, urban (%) + Labour force participation rate, female (% ages 15 and older) + HIV prevalence, adult (% ages 15-49) + Maternal mortality ratio (deaths per 100,000 live births) + Maternal mortality ratio (deaths per 100,000 live births) + Suicide rate, female (per 100,000 people) + Antenatal care coverage, at least one visit (%) + Current health expenditure (% of GDP)

\(\hat{lifeexp}\) = \(\hat{\alpha}\) + \(\hat{\beta}_{educyrs}\)educyrs + \(\hat{\beta}_{femparl}\)femparl + \(\hat{\beta}_{adobirth}\)adobirth + \(\hat{\beta}_{urban}\)urban + \(\hat{\beta}_{femlabor}\)femlabor + \(\hat{\beta}_{HIV}\)HIV + \(\hat{\beta}_{matmort}\)matmort + \(\hat{\beta}_{femsui}\)femsui + \(\hat{\beta}_{antcare}\)antcare + \(\hat{\beta}_{healthexp}\)healthexp

For a secondary research question (and to add an interaction variable to our model), we can ask “how does mean years of education and the adolescent birth rate conjointly affect our dependent variable, female life expectancy at birth?”

So we present our second model:

This sort of question suggests that there are reasonable or theoretical grounds to suspect that two different variables, mean years of education and the adolescent birth rate, causes an additional effect when these two independent variables interact with each other.

Adding some flair to your HTML with “rmdformats”

In addition, I use the package “rmdformats” (Barnier 2020) to make my work more readable. This publication, for example, uses the downcute format.

To install this package, either install the latest stable version from CRAN:

install.packages("rmdformats")

or the latest development from GitHub:

if (!require(remotes)) install.packages("remotes")
remotes::install_github("juba/rmdformats")

To add a format to an existing document, add:

---
output: 
  rmdformats::<template name>
---

to the document’s YAML preamble (the first few lines that are nested between 2 ’- - -’s).

Alternatively in R Studio, you can create a new R Markdown file pre-loaded with these options in the “From Template” window after selecting ‘R Markdown…’ in the add document drop-down list, located in the top left corner.

To add Table of Content options without rmdformats, you instead add the following to the preamble:

---
output: 
  html_document:
    toc: TRUE #adds a Table of Contents
    number_sections: TRUE #number your headings/sections
    toc_float: TRUE #let your ToC follow you as you scroll
    theme: cerulean #select a different theme from the default
---

For more themes that are available by default, see see Zieffler (2018) for an excellent list of base Markdown themes and examples.

Installing and Loading Packages

Now we’ll begin the usual preliminary work, starting with installing and loading the packages we’ll need for our analysis: tidyverse, ggplot2, and stargazer.

There’s mainly 3 ways to approach this:

Load packages only, which will simply load the packages if they’re already installed (if not installed, users will receive a prompt to do so).

library(tidyverse)
library(ggplot2)
library(stargazer)

Install & load.

install.packages(tidyverse)
install.packages(ggplot2)
install.packages(stargazer)

library(tidyverse)
library(ggplot2)
library(stargazer)

If package is missing, install & load.

if (!require(tidyverse)) install.packages(tidyverse)
if (!require(ggplot2)) install.packages(ggplot2)
if (!require(stargazer)) install.packages(stargazer)

library(tidyverse)
library(ggplot2)
library(stargazer)

Loading & preparing data with Tidyverse

First, let’s clear our workplace, set our working directory, and load our data.

# Clearing workspace
rm(list = ls())

# Set working directory
setwd("~/Documents/RPubs/Template")

# If your dataset is, for example, a .csv file
hdro <- read.csv("hdro2015.csv")

Let’s take a look at our dataset:

names(hdro)

 [1] "country_code" "year"         "value.47906"  "value.110806" "value.21806" 
 [6] "value.23806"  "value.23906"  "value.24006"  "value.24106"  "value.24206" 
[11] "value.31706"  "value.36806"  "value.43006"  "value.44206"  "value.45106" 
[16] "value.46006"  "value.48706"  "value.48806"  "value.49006"  "value.52606" 
[21] "value.63106"  "value.63406"  "value.64306"  "value.64406"  "value.68606" 
[26] "value.69206"  "value.69706"  "value.71406"  "value.71506"  "value.101606"
[31] "value.101806" "value.103006" "value.103206" "value.103606" "value.103706"
[36] "value.110906" "value.120606" "value.121106" "value.121206" "value.122006"
[41] "value.123306" "value.123406" "value.123506" "value.123606" "value.127606"
[46] "value.132706" "value.132806" "value.136706" "value.136906" "value.137006"
[51] "value.137506" "value.137906" "value.140606" "value.141706" "value.143306"
[56] "value.146206" "value.148206" "value.148306" "value.150606" "value.150706"
[61] "value.153706" "value.169706" "value.169806" "value.175106" "value.181206"
[66] "value.181606" "value.53506"  "value.71606"  "value.73506"  "value.101706"
[71] "value.111306" "value.135006" "value.138806" "value.38406"  "value.38506" 
 [ reached getOption("max.print") -- omitted 73 entries ]

Most of our variables need an index for interpretation (not posted here). What we need to do is select the variables that are relevant to our research questions, rename those variables for ease of interpretation, and finally attach our data frame for analysis.

df <-hdro %>% # select the loaded dataset, hdro
# select the variables we want to keep and rename them by select(new variable name = variable we want to keep)
  select(country = country_code, lifeexp = value.120606, educyrs = value.24106, femparl = value.31706, adobirth = value.36806, urban = value.45106, femlabor = value.48706, HIV = value.58006, matmort = value.89006, femsui = value.112606, antcare = value.175206, healthexp = value.181806) 

attach(df)

Plotting with ggplot2

Probably the most popular package for data visualization is ggplot2, which we’ll use here to see some of the relationships between our variables.

For example, we can see the distribution of a single continuous variable with a histogram:

df %>% # this is one way of selecting the data that the plot will draw from
  ggplot(aes(lifeexp)) + # aes() chooses the variable we'll visualize
  geom_histogram(fill = "lightblue") + # specify the visual element you'd like to incorporate, with other options
  labs(title="Number of Countries per Female Life Expectancy Age", x="Age", y="Count") # set labels

Or, let’s plot two numeric variables:

ggplot(df, aes(lifeexp, educyrs)) + # we write the data we want to refer to before the 'aes' if it isn't specified prior
  geom_point(color = "red") + # specify the visual element you'd like to incorporate, with other options
  labs(title="Education for Females and Their Life Expectancy", x="Age", y="Mean Years of Education") # set labels

Again, it can be helpful to refer to the respective cheat sheeet from RStudio.

An example of a basic linear regression

Linear Model 1

First, let’s run the model with all variables included.

# assign lm.1 as our first linear regression model in the order lm(dependent
# variable ~ explanatory variables)
lm.1 <- lm(df$lifeexp ~ df$educyrs + df$femparl + df$adobirth + df$urban + df$femlabor + 
    df$HIV + df$matmort + df$femsui + df$antcare + df$healthexp)
# produce a summary of the results
summary(lm.1)


Call:
lm(formula = df$lifeexp ~ df$educyrs + df$femparl + df$adobirth + 
    df$urban + df$femlabor + df$HIV + df$matmort + df$femsui + 
    df$antcare + df$healthexp)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3275 -0.7021 -0.4015  1.3136  2.5154 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  94.216432  16.863612   5.587   0.0014 **
df$educyrs   -0.225593   0.707360  -0.319   0.7606   
df$femparl    0.045443   0.072245   0.629   0.5525   
df$adobirth  -0.073725   0.051254  -1.438   0.2004   
df$urban      0.121222   0.069284   1.750   0.1308   
df$femlabor   0.039359   0.058502   0.673   0.5261   
df$HIV        0.098224   0.492666   0.199   0.8486   
df$matmort   -0.017014   0.007154  -2.378   0.0549 . 
df$femsui    -0.564977   0.264306  -2.138   0.0764 . 
df$antcare   -0.204238   0.117381  -1.740   0.1325   
df$healthexp -0.011857   0.483530  -0.025   0.9812   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.486 on 6 degrees of freedom
  (178 observations deleted due to missingness)
Multiple R-squared:  0.9652,    Adjusted R-squared:  0.9072 
F-statistic: 16.65 on 10 and 6 DF,  p-value: 0.001327

Our first model yielded no statistically significant variables at the 5% \(\alpha\) level. This is likely the result of over-fitting our model with too many variables, a conclusion supported by our high R² score.

Linear Model 2

lm.2 <- lm(df$lifeexp ~ df$educyrs + df$adobirth + df$urban + df$matmort + df$healthexp)
summary(lm.2)


Call:
lm(formula = df$lifeexp ~ df$educyrs + df$adobirth + df$urban + 
    df$matmort + df$healthexp)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.8478  -1.7181   0.3648   2.0191   5.7733 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  71.603875   1.302496  54.974  < 2e-16 ***
df$educyrs    0.172758   0.124322   1.390    0.167    
df$adobirth  -0.040566   0.008927  -4.544 1.09e-05 ***
df$urban      0.056817   0.013379   4.247 3.69e-05 ***
df$matmort   -0.020176   0.001918 -10.521  < 2e-16 ***
df$healthexp  0.511097   0.099791   5.122 8.72e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.988 on 158 degrees of freedom
  (31 observations deleted due to missingness)
Multiple R-squared:  0.8752,    Adjusted R-squared:  0.8713 
F-statistic: 221.7 on 5 and 158 DF,  p-value: < 2.2e-16

In the actual analysis I ran, I performed a visual inspection of different plots between the dependent & independent variables (not shown here) and eliminated variables that both appear unlikely to have an effect on our dependent variable and variables that lack adequate theoretical foundations.

We reject the null hypothesis at the 1% \(\alpha\) level for the following variables: adolescent births, % of population living in urban areas, maternal mortality, and health expenditures.

Surprisingly, we fail to reject the null hypothesis that the mean years of education (female) has no effect on our response variable. However, the effect direction corresponds with our expectations.

Linear Model 3

lm.3 <- lm(df$lifeexp ~ df$educyrs + df$adobirth + df$educyrs * df$adobirth + df$urban + 
    df$matmort + df$healthexp)
summary(lm.3)


Call:
lm(formula = df$lifeexp ~ df$educyrs + df$adobirth + df$educyrs * 
    df$adobirth + df$urban + df$matmort + df$healthexp)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8889 -1.7574  0.3841  2.0641  6.9334 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            70.719505   1.327098  53.289  < 2e-16 ***
df$educyrs              0.346604   0.140094   2.474   0.0144 *  
df$adobirth            -0.011545   0.014409  -0.801   0.4242    
df$urban                0.053460   0.013220   4.044 8.22e-05 ***
df$matmort             -0.021740   0.001983 -10.960  < 2e-16 ***
df$healthexp            0.486738   0.098581   4.937 2.01e-06 ***
df$educyrs:df$adobirth -0.004527   0.001782  -2.540   0.0121 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.937 on 157 degrees of freedom
  (31 observations deleted due to missingness)
Multiple R-squared:  0.8802,    Adjusted R-squared:  0.8756 
F-statistic: 192.2 on 6 and 157 DF,  p-value: < 2.2e-16

Here we test our second research question. We now reject the null hypothesis that educyrs has no effect on our response variable, and that the interaction variable between educyrs:adobirth have no effect. However, we fail to reject the null hypothesis that adobirth has no effect on our response variable.

The effect direction of the interaction variable is confusing. This is due to the fact that we multiply the coefficient of educyrs, which is naturally positive, against the coefficient of the adobirth, which is negative, which means the interaction coefficient will always be negative. Time permitting, we would make a new variable (perhaps the inverse of the adolescent birth rate) so its coefficient would be >0. We would use this variable for our interaction variable instead.

Note: there’s many other issues with our models, like the high number of missing variables, testing for heteroskedasticity and multicolinearity, etc. We won’t go into these here, because this is just meant to be an illustration of the ways we can make our work easier to produce & pleasing to the eye.

Using stargazer to publish regression results

Now, we’d like to compare the results between are different regressions. My favourite tool to accomplish this is stargazer (Hlavac 2018).

With stargazer, we can summarize data sets like so:

stargazer(df, type = "text")


=================================================================
Statistic  N   Mean   St. Dev.  Min   Pctl(25) Pctl(75)    Max   
-----------------------------------------------------------------
lifeexp   185 73.931   8.221   52.800  67.800   79.700   87.000  
educyrs   172  8.172   3.488   1.000   5.075    11.200   13.700  
femparl   189 20.729   11.468  0.100   12.700   27.400   57.500  
adobirth  185 53.099   43.550  0.300   15.300   78.700   200.700 
urban     195 57.993   23.361  12.100  39.350   77.200   100.000 
femlabor  180 51.629   15.559  6.100   45.075   61.150   84.100  
HIV       136  1.990   4.541   0.100   0.100    1.425    28.300  
matmort   182 169.549 232.733  3.000   14.250  229.000  1,360.000
femsui    183  5.192   3.867   0.300   2.650    6.300    32.100  
antcare   20  89.215   13.020  54.700  88.100   97.200   99.400  
healthexp 186  6.728   2.979   2.000   4.725    8.200    22.600  
-----------------------------------------------------------------

And we can compare our three regressions like so:

stargazer(lm.1, lm.2, lm.3, type = "text")


============================================================================================
                                              Dependent variable:                           
                    ------------------------------------------------------------------------
                                                    lifeexp                                 
                             (1)                     (2)                      (3)           
--------------------------------------------------------------------------------------------
educyrs                     -0.226                  0.173                   0.347**         
                           (0.707)                 (0.124)                  (0.140)         
                                                                                            
femparl                     0.045                                                           
                           (0.072)                                                          
                                                                                            
adobirth                    -0.074                -0.041***                  -0.012         
                           (0.051)                 (0.009)                  (0.014)         
                                                                                            
urban                       0.121                  0.057***                 0.053***        
                           (0.069)                 (0.013)                  (0.013)         
                                                                                            
femlabor                    0.039                                                           
                           (0.059)                                                          
                                                                                            
HIV                         0.098                                                           
                           (0.493)                                                          
                                                                                            
matmort                    -0.017*                -0.020***                -0.022***        
                           (0.007)                 (0.002)                  (0.002)         
                                                                                            
femsui                     -0.565*                                                          
                           (0.264)                                                          
                                                                                            
antcare                     -0.204                                                          
                           (0.117)                                                          
                                                                                            
healthexp                   -0.012                 0.511***                 0.487***        
                           (0.484)                 (0.100)                  (0.099)         
                                                                                            
adobirth                                                                    -0.005**        
                                                                            (0.002)         
                                                                                            
Constant                  94.216***               71.604***                70.720***        
                           (16.864)                (1.302)                  (1.327)         
                                                                                            
--------------------------------------------------------------------------------------------
Observations                  17                     164                      164           
R2                          0.965                   0.875                    0.880          
Adjusted R2                 0.907                   0.871                    0.876          
Residual Std. Error     2.486 (df = 6)         2.988 (df = 158)         2.937 (df = 157)    
F Statistic         16.645*** (df = 10; 6) 221.697*** (df = 5; 158) 192.195*** (df = 6; 157)
============================================================================================
Note:                                                            *p<0.1; **p<0.05; ***p<0.01

We can further change our labels:

stargazer(lm.1, lm.2, lm.3, dep.var.labels = "Life expectancy at birth, female", 
    covariate.labels = c("Mean years of schooling (females aged ≥ 25)", "Parliament seats, female (%)", 
        "Adolescent birth rate (births/1,000 women, ages 15-19)", "Population, urban (%)", 
        "Labour force participation rate, female (%, ages ≥ 15)", "HIV prevalence, adult (%, ages 15-49)", 
        "Maternal mortality ratio (deaths/100,000 live births)", "Suicide rate, female (per 100,000 people)", 
        "Antenatal care coverage, ≥ one visit (%)", "Current health expenditure (% of GDP)", 
        "Adolescent Birth Rate:Education"), type = "text")


===============================================================================================================================
                                                                                 Dependent variable:                           
                                                       ------------------------------------------------------------------------
                                                                           Life expectancy at birth, female                    
                                                                (1)                     (2)                      (3)           
-------------------------------------------------------------------------------------------------------------------------------
Mean years of schooling (females aged ≥ 25)                    -0.226                  0.173                   0.347**         
                                                              (0.707)                 (0.124)                  (0.140)         
                                                                                                                               
Parliament seats, female (%)                                   0.045                                                           
                                                              (0.072)                                                          
                                                                                                                               
Adolescent birth rate (births/1,000 women, ages 15-19)         -0.074                -0.041***                  -0.012         
                                                              (0.051)                 (0.009)                  (0.014)         
                                                                                                                               
Population, urban (%)                                          0.121                  0.057***                 0.053***        
                                                              (0.069)                 (0.013)                  (0.013)         
                                                                                                                               
Labour force participation rate, female (%, ages ≥ 15)         0.039                                                           
                                                              (0.059)                                                          
                                                                                                                               
HIV prevalence, adult (%, ages 15-49)                          0.098                                                           
                                                              (0.493)                                                          
                                                                                                                               
Maternal mortality ratio (deaths/100,000 live births)         -0.017*                -0.020***                -0.022***        
                                                              (0.007)                 (0.002)                  (0.002)         
                                                                                                                               
Suicide rate, female (per 100,000 people)                     -0.565*                                                          
                                                              (0.264)                                                          
                                                                                                                               
Antenatal care coverage, ≥ one visit (%)                       -0.204                                                          
                                                              (0.117)                                                          
                                                                                                                               
Current health expenditure (% of GDP)                          -0.012                 0.511***                 0.487***        
                                                              (0.484)                 (0.100)                  (0.099)         
                                                                                                                               
Adolescent Birth Rate:Education                                                                                -0.005**        
                                                                                                               (0.002)         
                                                                                                                               
Constant                                                     94.216***               71.604***                70.720***        
                                                              (16.864)                (1.302)                  (1.327)         
                                                                                                                               
-------------------------------------------------------------------------------------------------------------------------------
Observations                                                     17                     164                      164           
R2                                                             0.965                   0.875                    0.880          
Adjusted R2                                                    0.907                   0.871                    0.876          
Residual Std. Error                                        2.486 (df = 6)         2.988 (df = 158)         2.937 (df = 157)    
F Statistic                                            16.645*** (df = 10; 6) 221.697*** (df = 5; 158) 192.195*** (df = 6; 157)
===============================================================================================================================
Note:                                                                                               *p<0.1; **p<0.05; ***p<0.01

Now we have a comparison of our different models as well as descriptive variable names that will lend to easier interpretation.

Please note: as far as I’m aware, direct outputs to HTML in R Markdown is only compatible with the setting ‘type = “text”’. If you set the type to Latex or HTML, it’ll generate the code as opposed to the finished result. If you’d like your stargazer output in another format (say, to copy to a word document), simply add an “out = [file name].[format]” setting, like so:

stargazer(lm.1, lm.2, lm.3, dep.var.labels = "Life expectancy at birth, female", 
    covariate.labels = c("Mean years of schooling (females aged ≥ 25)", "Parliament seats, female (%)", 
        "Adolescent birth rate (births/1,000 women, ages 15-19)", "Population, urban (%)", 
        "Labour force participation rate, female (%, ages ≥ 15)", "HIV prevalence, adult (%, ages 15-49)", 
        "Maternal mortality ratio (deaths/100,000 live births)", "Suicide rate, female (per 100,000 people)", 
        "Antenatal care coverage, ≥ one visit (%)", "Current health expenditure (% of GDP)", 
        "Adolescent Birth Rate:Education"), type = "html", out = "regressions.html")

And your table will be ready in your working directory: regressions.html

References

Barnier, Julien. 2020. Rmdformats: HTML Output Formats and Templates for Rmarkdown Documents. https://github.com/juba/rmdformats.

Hlavac, Marek. 2018. stargazer: beautiful latex, html and ASCII tables from R statistical output. https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf.

Zieffler, Andrew. 2018. R Markdown Theme Gallery. https://www.datadreaming.org/post/r-markdown-theme-gallery/.

The London School of Economics and Political Science, School of Government↩︎