Introduction
Recently, I completed an assignment that applied basic statistical methodology and data science. With some tweaks, I thought this would also be a good opportunity to share some introductory pointers to regressions, R, and some helpful packages.
If you found this interesting, check out my page on RPubs or LinkedIn.
Key takeaways
- Using Rmarkdown to record & publish your work
- Using the “rmdformats” package to add some flair to your finished product
- Installing & Loading Packages
- Loading & preparing data with Tidyverse
- Plotting with ggplot2
- An example of a basic linear regression
- Using stargazer to publish regression results
Writing in Rmarkdown
R Markdown in an incredibly useful and powerful tool for writing out easily reproducible work. It comes with R Studio, and new markdown files can be created in the top left corner of the program (click the white page with the plus sign, then select R Markdown).
I usually do my write-ups with the R Markdown cheat sheet from RStudio (Barnier 2020) opened in the background. The platform lets me write out code, html elements, and mathematical symbols with ease (and can also output in pdf and word!).
For example, with regards to our example regression:
Let’s ask: “how do key political socio-economic conditions relate to female life expectancy?”
We present our first model in response to this question:
Life expectancy at birth, female ~ Mean years of schooling (females aged 25 years and above) + Share of seats in parliament (% held by women) + Adolescent birth rate (births per 1,000 women ages 15-19) + Population, urban (%) + Labour force participation rate, female (% ages 15 and older) + HIV prevalence, adult (% ages 15-49) + Maternal mortality ratio (deaths per 100,000 live births) + Maternal mortality ratio (deaths per 100,000 live births) + Suicide rate, female (per 100,000 people) + Antenatal care coverage, at least one visit (%) + Current health expenditure (% of GDP)
OR
\(\hat{lifeexp}\) = \(\hat{\alpha}\) + \(\hat{\beta}_{educyrs}\)educyrs + \(\hat{\beta}_{femparl}\)femparl + \(\hat{\beta}_{adobirth}\)adobirth + \(\hat{\beta}_{urban}\)urban + \(\hat{\beta}_{femlabor}\)femlabor + \(\hat{\beta}_{HIV}\)HIV + \(\hat{\beta}_{matmort}\)matmort + \(\hat{\beta}_{femsui}\)femsui + \(\hat{\beta}_{antcare}\)antcare + \(\hat{\beta}_{healthexp}\)healthexp
For a secondary research question (and to add an interaction variable to our model), we can ask “how does mean years of education and the adolescent birth rate conjointly affect our dependent variable, female life expectancy at birth?”
So we present our second model:
Life expectancy at birth, female ~ Mean years of schooling (females aged 25 years and above) + Share of seats in parliament (% held by women) + Adolescent birth rate (births per 1,000 women ages 15-19) + Population, urban (%) + Labour force participation rate, female (% ages 15 and older) + HIV prevalence, adult (% ages 15-49) + Maternal mortality ratio (deaths per 100,000 live births) + Maternal mortality ratio (deaths per 100,000 live births) + Suicide rate, female (per 100,000 people) + Antenatal care coverage, at least one visit (%) + Current health expenditure (% of GDP) + Mean years of schooling (females aged 25 years and above) * Adolescent birth rate (births per 1,000 women ages 15-19)
OR
\(\hat{lifeexp}\) = \(\hat{\alpha}\) + \(\hat{\beta}_{educyrs}\)educyrs + \(\hat{\beta}_{femparl}\)femparl + \(\hat{\beta}_{adobirth}\)adobirth + \(\hat{\beta}_{urban}\)urban + \(\hat{\beta}_{femlabor}\)femlabor + \(\hat{\beta}_{HIV}\)HIV + \(\hat{\beta}_{matmort}\)matmort + \(\hat{\beta}_{femsui}\)femsui + \(\hat{\beta}_{antcare}\)antcare + \(\hat{\beta}_{healthexp}\)healthexp + \(\hat{\beta}_{educyrs}\)educyrs * \(\hat{\beta}_{adobirth}\)adobirth
This sort of question suggests that there are reasonable or theoretical grounds to suspect that two different variables, mean years of education and the adolescent birth rate, causes an additional effect when these two independent variables interact with each other.
Adding some flair to your HTML with “rmdformats”
In addition, I use the package “rmdformats” (Barnier 2020) to make my work more readable. This publication, for example, uses the downcute format.
To install this package, either install the latest stable version from CRAN:
install.packages("rmdformats")
or the latest development from GitHub:
if (!require(remotes)) install.packages("remotes")
remotes::install_github("juba/rmdformats")
To add a format to an existing document, add:
---
output:
rmdformats::<template name>
---
to the document’s YAML preamble (the first few lines that are nested between 2 ’- - -’s).
Alternatively in R Studio, you can create a new R Markdown file pre-loaded with these options in the “From Template” window after selecting ‘R Markdown…’ in the add document drop-down list, located in the top left corner.
To add Table of Content options without rmdformats, you instead add the following to the preamble:
---
output:
html_document:
toc: TRUE #adds a Table of Contents
number_sections: TRUE #number your headings/sections
toc_float: TRUE #let your ToC follow you as you scroll
theme: cerulean #select a different theme from the default
---
For more themes that are available by default, see see Zieffler (2018) for an excellent list of base Markdown themes and examples.
Installing and Loading Packages
Now we’ll begin the usual preliminary work, starting with installing and loading the packages we’ll need for our analysis: tidyverse, ggplot2, and stargazer.
There’s mainly 3 ways to approach this:
Load packages only, which will simply load the packages if they’re already installed (if not installed, users will receive a prompt to do so).
library(tidyverse)
library(ggplot2)
library(stargazer)
Install & load.
install.packages(tidyverse)
install.packages(ggplot2)
install.packages(stargazer)
library(tidyverse)
library(ggplot2)
library(stargazer)
If package is missing, install & load.
if (!require(tidyverse)) install.packages(tidyverse)
if (!require(ggplot2)) install.packages(ggplot2)
if (!require(stargazer)) install.packages(stargazer)
library(tidyverse)
library(ggplot2)
library(stargazer)
Loading & preparing data with Tidyverse
First, let’s clear our workplace, set our working directory, and load our data.
# Clearing workspace
rm(list = ls())
# Set working directory
setwd("~/Documents/RPubs/Template")
# If your dataset is, for example, a .csv file
hdro <- read.csv("hdro2015.csv")
Let’s take a look at our dataset:
names(hdro)
[1] "country_code" "year" "value.47906" "value.110806" "value.21806"
[6] "value.23806" "value.23906" "value.24006" "value.24106" "value.24206"
[11] "value.31706" "value.36806" "value.43006" "value.44206" "value.45106"
[16] "value.46006" "value.48706" "value.48806" "value.49006" "value.52606"
[21] "value.63106" "value.63406" "value.64306" "value.64406" "value.68606"
[26] "value.69206" "value.69706" "value.71406" "value.71506" "value.101606"
[31] "value.101806" "value.103006" "value.103206" "value.103606" "value.103706"
[36] "value.110906" "value.120606" "value.121106" "value.121206" "value.122006"
[41] "value.123306" "value.123406" "value.123506" "value.123606" "value.127606"
[46] "value.132706" "value.132806" "value.136706" "value.136906" "value.137006"
[51] "value.137506" "value.137906" "value.140606" "value.141706" "value.143306"
[56] "value.146206" "value.148206" "value.148306" "value.150606" "value.150706"
[61] "value.153706" "value.169706" "value.169806" "value.175106" "value.181206"
[66] "value.181606" "value.53506" "value.71606" "value.73506" "value.101706"
[71] "value.111306" "value.135006" "value.138806" "value.38406" "value.38506"
[ reached getOption("max.print") -- omitted 73 entries ]
Most of our variables need an index for interpretation (not posted here). What we need to do is select the variables that are relevant to our research questions, rename those variables for ease of interpretation, and finally attach our data frame for analysis.
df <-hdro %>% # select the loaded dataset, hdro
# select the variables we want to keep and rename them by select(new variable name = variable we want to keep)
select(country = country_code, lifeexp = value.120606, educyrs = value.24106, femparl = value.31706, adobirth = value.36806, urban = value.45106, femlabor = value.48706, HIV = value.58006, matmort = value.89006, femsui = value.112606, antcare = value.175206, healthexp = value.181806)
attach(df)
Plotting with ggplot2
Probably the most popular package for data visualization is ggplot2, which we’ll use here to see some of the relationships between our variables.
For example, we can see the distribution of a single continuous variable with a histogram:
df %>% # this is one way of selecting the data that the plot will draw from
ggplot(aes(lifeexp)) + # aes() chooses the variable we'll visualize
geom_histogram(fill = "lightblue") + # specify the visual element you'd like to incorporate, with other options
labs(title="Number of Countries per Female Life Expectancy Age", x="Age", y="Count") # set labels
Or, let’s plot two numeric variables:
ggplot(df, aes(lifeexp, educyrs)) + # we write the data we want to refer to before the 'aes' if it isn't specified prior
geom_point(color = "red") + # specify the visual element you'd like to incorporate, with other options
labs(title="Education for Females and Their Life Expectancy", x="Age", y="Mean Years of Education") # set labels
Again, it can be helpful to refer to the respective cheat sheeet from RStudio.
An example of a basic linear regression
Linear Model 1
First, let’s run the model with all variables included.
# assign lm.1 as our first linear regression model in the order lm(dependent
# variable ~ explanatory variables)
lm.1 <- lm(df$lifeexp ~ df$educyrs + df$femparl + df$adobirth + df$urban + df$femlabor +
df$HIV + df$matmort + df$femsui + df$antcare + df$healthexp)
# produce a summary of the results
summary(lm.1)
Call:
lm(formula = df$lifeexp ~ df$educyrs + df$femparl + df$adobirth +
df$urban + df$femlabor + df$HIV + df$matmort + df$femsui +
df$antcare + df$healthexp)
Residuals:
Min 1Q Median 3Q Max
-2.3275 -0.7021 -0.4015 1.3136 2.5154
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 94.216432 16.863612 5.587 0.0014 **
df$educyrs -0.225593 0.707360 -0.319 0.7606
df$femparl 0.045443 0.072245 0.629 0.5525
df$adobirth -0.073725 0.051254 -1.438 0.2004
df$urban 0.121222 0.069284 1.750 0.1308
df$femlabor 0.039359 0.058502 0.673 0.5261
df$HIV 0.098224 0.492666 0.199 0.8486
df$matmort -0.017014 0.007154 -2.378 0.0549 .
df$femsui -0.564977 0.264306 -2.138 0.0764 .
df$antcare -0.204238 0.117381 -1.740 0.1325
df$healthexp -0.011857 0.483530 -0.025 0.9812
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.486 on 6 degrees of freedom
(178 observations deleted due to missingness)
Multiple R-squared: 0.9652, Adjusted R-squared: 0.9072
F-statistic: 16.65 on 10 and 6 DF, p-value: 0.001327
Our first model yielded no statistically significant variables at the 5% \(\alpha\) level. This is likely the result of over-fitting our model with too many variables, a conclusion supported by our high R2 score.
Linear Model 2
lm.2 <- lm(df$lifeexp ~ df$educyrs + df$adobirth + df$urban + df$matmort + df$healthexp)
summary(lm.2)
Call:
lm(formula = df$lifeexp ~ df$educyrs + df$adobirth + df$urban +
df$matmort + df$healthexp)
Residuals:
Min 1Q Median 3Q Max
-10.8478 -1.7181 0.3648 2.0191 5.7733
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 71.603875 1.302496 54.974 < 2e-16 ***
df$educyrs 0.172758 0.124322 1.390 0.167
df$adobirth -0.040566 0.008927 -4.544 1.09e-05 ***
df$urban 0.056817 0.013379 4.247 3.69e-05 ***
df$matmort -0.020176 0.001918 -10.521 < 2e-16 ***
df$healthexp 0.511097 0.099791 5.122 8.72e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.988 on 158 degrees of freedom
(31 observations deleted due to missingness)
Multiple R-squared: 0.8752, Adjusted R-squared: 0.8713
F-statistic: 221.7 on 5 and 158 DF, p-value: < 2.2e-16
In the actual analysis I ran, I performed a visual inspection of different plots between the dependent & independent variables (not shown here) and eliminated variables that both appear unlikely to have an effect on our dependent variable and variables that lack adequate theoretical foundations.
We reject the null hypothesis at the 1% \(\alpha\) level for the following variables: adolescent births, % of population living in urban areas, maternal mortality, and health expenditures.
Surprisingly, we fail to reject the null hypothesis that the mean years of education (female) has no effect on our response variable. However, the effect direction corresponds with our expectations.
Linear Model 3
lm.3 <- lm(df$lifeexp ~ df$educyrs + df$adobirth + df$educyrs * df$adobirth + df$urban +
df$matmort + df$healthexp)
summary(lm.3)
Call:
lm(formula = df$lifeexp ~ df$educyrs + df$adobirth + df$educyrs *
df$adobirth + df$urban + df$matmort + df$healthexp)
Residuals:
Min 1Q Median 3Q Max
-9.8889 -1.7574 0.3841 2.0641 6.9334
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.719505 1.327098 53.289 < 2e-16 ***
df$educyrs 0.346604 0.140094 2.474 0.0144 *
df$adobirth -0.011545 0.014409 -0.801 0.4242
df$urban 0.053460 0.013220 4.044 8.22e-05 ***
df$matmort -0.021740 0.001983 -10.960 < 2e-16 ***
df$healthexp 0.486738 0.098581 4.937 2.01e-06 ***
df$educyrs:df$adobirth -0.004527 0.001782 -2.540 0.0121 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.937 on 157 degrees of freedom
(31 observations deleted due to missingness)
Multiple R-squared: 0.8802, Adjusted R-squared: 0.8756
F-statistic: 192.2 on 6 and 157 DF, p-value: < 2.2e-16
Here we test our second research question. We now reject the null hypothesis that educyrs has no effect on our response variable, and that the interaction variable between educyrs:adobirth have no effect. However, we fail to reject the null hypothesis that adobirth has no effect on our response variable.
The effect direction of the interaction variable is confusing. This is due to the fact that we multiply the coefficient of educyrs, which is naturally positive, against the coefficient of the adobirth, which is negative, which means the interaction coefficient will always be negative. Time permitting, we would make a new variable (perhaps the inverse of the adolescent birth rate) so its coefficient would be >0. We would use this variable for our interaction variable instead.
Note: there’s many other issues with our models, like the high number of missing variables, testing for heteroskedasticity and multicolinearity, etc. We won’t go into these here, because this is just meant to be an illustration of the ways we can make our work easier to produce & pleasing to the eye.
Using stargazer to publish regression results
Now, we’d like to compare the results between are different regressions. My favourite tool to accomplish this is stargazer (Hlavac 2018).
With stargazer, we can summarize data sets like so:
stargazer(df, type = "text")
=================================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
-----------------------------------------------------------------
lifeexp 185 73.931 8.221 52.800 67.800 79.700 87.000
educyrs 172 8.172 3.488 1.000 5.075 11.200 13.700
femparl 189 20.729 11.468 0.100 12.700 27.400 57.500
adobirth 185 53.099 43.550 0.300 15.300 78.700 200.700
urban 195 57.993 23.361 12.100 39.350 77.200 100.000
femlabor 180 51.629 15.559 6.100 45.075 61.150 84.100
HIV 136 1.990 4.541 0.100 0.100 1.425 28.300
matmort 182 169.549 232.733 3.000 14.250 229.000 1,360.000
femsui 183 5.192 3.867 0.300 2.650 6.300 32.100
antcare 20 89.215 13.020 54.700 88.100 97.200 99.400
healthexp 186 6.728 2.979 2.000 4.725 8.200 22.600
-----------------------------------------------------------------
And we can compare our three regressions like so:
stargazer(lm.1, lm.2, lm.3, type = "text")
============================================================================================
Dependent variable:
------------------------------------------------------------------------
lifeexp
(1) (2) (3)
--------------------------------------------------------------------------------------------
educyrs -0.226 0.173 0.347**
(0.707) (0.124) (0.140)
femparl 0.045
(0.072)
adobirth -0.074 -0.041*** -0.012
(0.051) (0.009) (0.014)
urban 0.121 0.057*** 0.053***
(0.069) (0.013) (0.013)
femlabor 0.039
(0.059)
HIV 0.098
(0.493)
matmort -0.017* -0.020*** -0.022***
(0.007) (0.002) (0.002)
femsui -0.565*
(0.264)
antcare -0.204
(0.117)
healthexp -0.012 0.511*** 0.487***
(0.484) (0.100) (0.099)
adobirth -0.005**
(0.002)
Constant 94.216*** 71.604*** 70.720***
(16.864) (1.302) (1.327)
--------------------------------------------------------------------------------------------
Observations 17 164 164
R2 0.965 0.875 0.880
Adjusted R2 0.907 0.871 0.876
Residual Std. Error 2.486 (df = 6) 2.988 (df = 158) 2.937 (df = 157)
F Statistic 16.645*** (df = 10; 6) 221.697*** (df = 5; 158) 192.195*** (df = 6; 157)
============================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
We can further change our labels:
stargazer(lm.1, lm.2, lm.3, dep.var.labels = "Life expectancy at birth, female",
covariate.labels = c("Mean years of schooling (females aged ≥ 25)", "Parliament seats, female (%)",
"Adolescent birth rate (births/1,000 women, ages 15-19)", "Population, urban (%)",
"Labour force participation rate, female (%, ages ≥ 15)", "HIV prevalence, adult (%, ages 15-49)",
"Maternal mortality ratio (deaths/100,000 live births)", "Suicide rate, female (per 100,000 people)",
"Antenatal care coverage, ≥ one visit (%)", "Current health expenditure (% of GDP)",
"Adolescent Birth Rate:Education"), type = "text")
===============================================================================================================================
Dependent variable:
------------------------------------------------------------------------
Life expectancy at birth, female
(1) (2) (3)
-------------------------------------------------------------------------------------------------------------------------------
Mean years of schooling (females aged ≥ 25) -0.226 0.173 0.347**
(0.707) (0.124) (0.140)
Parliament seats, female (%) 0.045
(0.072)
Adolescent birth rate (births/1,000 women, ages 15-19) -0.074 -0.041*** -0.012
(0.051) (0.009) (0.014)
Population, urban (%) 0.121 0.057*** 0.053***
(0.069) (0.013) (0.013)
Labour force participation rate, female (%, ages ≥ 15) 0.039
(0.059)
HIV prevalence, adult (%, ages 15-49) 0.098
(0.493)
Maternal mortality ratio (deaths/100,000 live births) -0.017* -0.020*** -0.022***
(0.007) (0.002) (0.002)
Suicide rate, female (per 100,000 people) -0.565*
(0.264)
Antenatal care coverage, ≥ one visit (%) -0.204
(0.117)
Current health expenditure (% of GDP) -0.012 0.511*** 0.487***
(0.484) (0.100) (0.099)
Adolescent Birth Rate:Education -0.005**
(0.002)
Constant 94.216*** 71.604*** 70.720***
(16.864) (1.302) (1.327)
-------------------------------------------------------------------------------------------------------------------------------
Observations 17 164 164
R2 0.965 0.875 0.880
Adjusted R2 0.907 0.871 0.876
Residual Std. Error 2.486 (df = 6) 2.988 (df = 158) 2.937 (df = 157)
F Statistic 16.645*** (df = 10; 6) 221.697*** (df = 5; 158) 192.195*** (df = 6; 157)
===============================================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Now we have a comparison of our different models as well as descriptive variable names that will lend to easier interpretation.
Please note: as far as I’m aware, direct outputs to HTML in R Markdown is only compatible with the setting ‘type = “text”’. If you set the type to Latex or HTML, it’ll generate the code as opposed to the finished result. If you’d like your stargazer output in another format (say, to copy to a word document), simply add an “out = [file name].[format]” setting, like so:
stargazer(lm.1, lm.2, lm.3, dep.var.labels = "Life expectancy at birth, female",
covariate.labels = c("Mean years of schooling (females aged ≥ 25)", "Parliament seats, female (%)",
"Adolescent birth rate (births/1,000 women, ages 15-19)", "Population, urban (%)",
"Labour force participation rate, female (%, ages ≥ 15)", "HIV prevalence, adult (%, ages 15-49)",
"Maternal mortality ratio (deaths/100,000 live births)", "Suicide rate, female (per 100,000 people)",
"Antenatal care coverage, ≥ one visit (%)", "Current health expenditure (% of GDP)",
"Adolescent Birth Rate:Education"), type = "html", out = "regressions.html")
And your table will be ready in your working directory:
References
Barnier, Julien. 2020. Rmdformats: HTML Output Formats and Templates for Rmarkdown Documents. https://github.com/juba/rmdformats.
Hlavac, Marek. 2018. stargazer: beautiful latex, html and ASCII tables from R statistical output. https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf.
Zieffler, Andrew. 2018. R Markdown Theme Gallery. https://www.datadreaming.org/post/r-markdown-theme-gallery/.
The London School of Economics and Political Science, School of Government↩︎