Learning Objective:

You will be able to apply a common set of tools for data analysis in the course and interpret the analysis through clear written and visual output.1

Introduction

In this tutorial, we will cover reading data into R, producing descriptive statistics, visualizing data, and running statistical models. Several of the topics covered in this tutorial should be review; some topics, materials, and methods may be new. Please read the tutorial and examine the code closely to get a sense of the steps that are covered.

We have structured this RMarkdown document such that the code chunks are ready for you to run. You can run individual code chunks by clicking on the green right-facing arrow or by using the Ctrl+Shift+Enter key combination. For more information on RMarkdown documents, see the RMarkdown chapter from R for Data Science.

Interspersed through the tutorial, we have included sections where your input is required. These are in the form of comprehension checks and assignment questions.


Comprehension Check

Here is where we will ask you to interpret or reflect on the data, analysis, or models used.

[DISCUSS HERE]


R Packages

To promote collaborative, reproducible work we will be following a set of guidelines and using a prescribed set of R packages throughout the course. We will make extensive use of the tidyverse, “a set of packages that work in harmony because they share common data representations and API design.” Check out the Tidyverse packages. And for an excellent introduction to R, the tidyverse, and data science, we recommend R for Data Science.

The following is an RMarkdown “code chunk” to load the tidyverse package (sometimes called a library).
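A minimal version of such a chunk might look like this (assuming the tidyverse is already installed):

```r
# Load the tidyverse family of packages
# (install first with install.packages("tidyverse") if needed)
library(tidyverse)
```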

Coding Style

From the start, we want to establish a set of common good practices in writing interpretable, well-commented code in R. If you have not done so already, please review the tidyverse style guide by Hadley Wickham (A shorter, similar style guide can be found at Google’s R Style Guide).

Data

The data we will be working with come from Kevin Arceneaux, Alan S. Gerber, and Donald P. Green's (2006) paper "Comparing Experimental and Matching Methods Using a Large-Scale Voter Mobilization Experiment" (see article). These data come from a randomized experiment conducted in Iowa and Michigan before the 2002 US midterm elections. Households with one or two registered voters were randomly assigned to treatment and control groups. The treatment involved calling the selected individuals by phone for a get-out-the-vote (GOTV) campaign. The study involved over 1.9 million households (1,845,348 in the control group and 59,972 in the group that was assigned to receive the GOTV phone call).

The data that we will use includes information about voting records in elections prior to 2002 as well as several demographic and location variables.

This tutorial assumes that the data is in a folder named "data", with a .here file located at the project root. We will use the here package (see Practice safe paths from Jennifer Bryan and Jim Hester's book) to assist with finding files. We encourage you to work within an RStudio Project (see Project-oriented workflows).

Let’s read in the data:

Take a look at the data. The glimpse function from the tibble package is a handy way to check the data types of variables and get a sense of the values that one is working with.
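For example:

```r
# Inspect column types and example values
glimpse(gotv_df)
```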

## Observations: 2,474,927
## Variables: 29
## $ persons         <dbl> 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,…
## $ competiv        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ vote00          <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ treatment       <dbl> -1, 0, 0, -1, 0, -1, 0, 0, 0, -1, -1, 0, 0, -1, 0, 0,…
## $ county          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ st_sen          <dbl> 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 2…
## $ st_hse          <dbl> 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 5…
## $ vote98          <dbl> 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ newreg          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pid_maj         <dbl> 0, 0, 1, 2, 2, 2, 0, 2, 2, 2, 1, 0, 0, 2, 2, 2, 2, 0,…
## $ age             <dbl> 44, 39, 82, 73, 44, 75, 78, 74, 46, 87, 59, 78, 42, 4…
## $ female          <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ vote02          <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,…
## $ contact         <dbl> NA, 0, 0, NA, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0,…
## $ pseudo_c        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ treat_real      <dbl> NA, 0, 0, NA, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0,…
## $ treat_pseudo    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ state           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ comp_mi         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ comp_ia         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fem_miss        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ female2         <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ lc              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ hc              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pseudo_portion  <dbl> 0, NA, NA, 0, NA, 0, NA, NA, NA, 1, 0, NA, NA, 0, NA,…
## $ W               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ female_decoded  <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ pid_maj_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ d_unlisted      <dbl> 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1,…

As you can see, the data include 2,474,927 observations and 29 variables.

Can you start to get a sense of what each variable represents?

Our data is unlabelled, so to help understand some of the variables, here’s a list of some key variables that we will work with:

Variable    Description
---------   ---------------------------------------------------------------
age         Age
persons     Number of registered voters in household
newreg      Newly registered voter
female      Indicator variable where \(= 1\) indicates female
st_sen      Index of state senate district
st_hse      Index of state house district
vote00      Indicator variable where \(= 1\) indicates voted in the 2000 general election
vote98      Indicator variable where \(= 1\) indicates voted in the 1998 midterm election
county      County index
state       US State indicator variable where \(= 0\) indicates Michigan and \(= 1\) indicates Iowa
comp_mi     Indicator of whether the congressional district was classified as competitive in Michigan
comp_ia     Indicator of whether the congressional district was classified as competitive in Iowa
vote02      Indicator variable where \(= 1\) indicates voted in the 2002 midterm election
treatment   Indicator variable specifying the treatment in the experiment
contact     Indicator of whether the person was contacted (NA indicates an observation with an unlisted phone number)

For this tutorial, we will focus our analysis on the control group, i.e. observations from people who did not receive the GOTV campaign messaging. We will, thus, put aside the observations from people who were included in the campaign. Additionally, we are going to restrict our analysis to the subset of variables listed above. For computational reasons, we will restrict our analysis to the largest county.
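A sketch of that subsetting step is below. The coding of the treatment variable for control observations (taken here to be treatment == 0) and the way of identifying the largest county are our assumptions; check them against the data.

```r
# Identify the county with the most observations
largest_county <- gotv_df %>%
  count(county, sort = TRUE) %>%
  slice(1) %>%
  pull(county)

# Keep control-group observations in that county, plus the variables above.
# We assume control observations are coded treatment == 0.
ctrl_df <- gotv_df %>%
  filter(treatment == 0, county == largest_county) %>%
  select(vote02, age, persons, newreg, vote00, vote98, female, st_sen, st_hse)
```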

Descriptive Statistics

To get a quick understanding of the variables, it’s often useful to summarize the data. One way to do so is the summary function.
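For example:

```r
# Quartiles, min/max, and means for every selected variable
summary(ctrl_df)
```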

##      vote02            age            persons          newreg      
##  Min.   :0.0000   Min.   : 18.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 37.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :1.0000   Median : 49.00   Median :1.000   Median :0.0000  
##  Mean   :0.5268   Mean   : 52.08   Mean   :1.423   Mean   :0.1221  
##  3rd Qu.:1.0000   3rd Qu.: 67.00   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :113.00   Max.   :2.000   Max.   :1.0000  
##      vote00           vote98           female           st_sen     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :1.0000   Median :0.0000   Median :1.0000   Median :6.000  
##  Mean   :0.5382   Mean   :0.1657   Mean   :0.5685   Mean   :5.144  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:7.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :8.000  
##      st_hse     
##  Min.   : 1.00  
##  1st Qu.: 8.00  
##  Median :15.00  
##  Mean   :13.57  
##  3rd Qu.:19.00  
##  Max.   :23.00

That’s a bit overwhelming when we have lots of variables. Instead, we can get the mean and standard deviation of individual variables.

## Mean household size:1.4232
## Standard deviation:0.4941

We can also use the stargazer function to produce easy-to-read summary statistics tables.
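Something like the following would produce the table below (stargazer expects a plain data.frame rather than a tibble):

```r
library(stargazer)

# Text-format summary table; stargazer's defaults include N, mean, sd,
# min/max, and the 25th/75th percentiles
stargazer(as.data.frame(ctrl_df), type = "text")
```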

## 
## ===========================================================
## Statistic    N     Mean  St. Dev. Min Pctl(25) Pctl(75) Max
## -----------------------------------------------------------
## vote02    243,042 0.527   0.499    0     0        1      1 
## age       243,042 52.082  18.573  18     37       67    113
## persons   243,042 1.423   0.494    1     1        2      2 
## newreg    243,042 0.122   0.327    0     0        0      1 
## vote00    243,042 0.538   0.499    0     0        1      1 
## vote98    243,042 0.166   0.372    0     0        0      1 
## female    243,042 0.568   0.495    0     0        1      1 
## st_sen    243,042 5.144   2.265    1     3        7      8 
## st_hse    243,042 13.566  6.545    1     8        19    23 
## -----------------------------------------------------------

Data Visualization

A good way to get to know the data is to produce some data visualizations. Let's take a look at a few of the variables. First, we'll look at univariate visualizations.

Relying on the tidyverse, let’s produce a histogram of age in 2002 (age) with ggplot2.
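A sketch of the plotting chunk:

```r
# Histogram of age; a one-year binwidth is a natural choice here
ggplot(ctrl_df, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Age in 2002", y = "Count")
```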

We can compare the distribution of age among those that voted in the 2002 midterm election by gender.
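One way to draw that comparison (the overlaid-histogram design is our choice; facets would work as well):

```r
# Age distribution of 2002 voters, by gender
ctrl_df %>%
  filter(vote02 == 1) %>%
  ggplot(aes(x = age, fill = factor(female))) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5) +
  labs(x = "Age in 2002", y = "Count", fill = "Female")
```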


T-tests

Let’s formally test for differences in age. We want to conduct a hypothesis test of whether the age of those who voted in the 2002 midterm election are older than those that did not vote.

Formally, what we are doing is setting up a null hypothesis, \(H_{0}\), under which the mean ages of voters and non-voters are equivalent. Our alternative hypothesis, \(H_{1}\), is that the mean ages of the two populations are not equal.

  • \(H_{0}: \mu_{1} = \mu_{2}\)
  • \(H_{1}: \mu_{1} \neq \mu_{2}\)

We will use Welch’s t-test, under which we assume that the data from the two groups follow a normal distribution. However, we do not require that the data from the two populations have the same variance.

Welch’s t-test:

  • Test statistic: \(t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}}\)
  • Degrees of Freedom (DOF): \(\frac{\left(\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}\right)^2}{\frac{\left(s_{1}^2/n_{1}\right)^2}{n_{1} - 1} + \frac{\left(s_{2}^2/n_{2}\right)^2}{n_{2} - 1}}\)

First, we will test if the mean age of people who voted is different from the mean age of people that did not vote.
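In R this is a single call to t.test, which applies the Welch correction by default (var.equal = FALSE). Judging from the -Inf lower confidence bound in the table below, the reported test is one-sided; a sketch:

```r
# Mean age by 2002 vote status; alternative = "less" tests whether the
# first group (vote02 == 0, non-voters) is younger on average.
# For the by-gender versions below, filter to female == 1 or female == 0 first.
t.test(age ~ vote02, data = ctrl_df, alternative = "less")
```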

Didn’t Vote Voted Difference Low High p.value T-Stat DOF
47.36 56.33 -8.97 -Inf -8.85 0 -122.05 236481.3

We reject the null hypothesis that the mean age of those that voted in the 2002 midterm election is equal to the mean age of non-voters. We find that voters are, on average, older than non-voters.

Now let’s test for equality of mean voting rates by gender. This time we will use a one-sided test, where our hypothesis is that women voters in the 2002 midterm election were older than women non-voters.

Difference in Mean Age (Women), Welch Two Sample One-sided t-test
(columns: Mean Age; 95% Confidence Interval; Test Statistics)

Didn't Vote   Voted   Difference   Low    High    p.value   T-Stat   DOF
48.13         57.55   -9.43        -Inf   -9.26   0         -93.25   133797.4

Similarly, for men:

Difference in Mean Age (Men), Welch Two Sample One-sided t-test
(columns: Mean Age; 95% Confidence Interval; Test Statistics)

Didn't Vote   Voted   Difference   Low    High   p.value   T-Stat   DOF
46.34         54.71   -8.37        -Inf   -8.2   0         -79.37   102628.6

Comprehension Check

Using the code above as a template, test the hypothesis that women voted at a higher rate, on average, than men in the 2002 midterm election. (Hint: use a one-sided test).

Difference in Mean between Males and Females, Welch Two Sample One-sided t-test
(columns: Voting Proportion; 95% Confidence Interval; Test Statistics)

Male   Female   Difference   Low    High   p.value   T-Stat   DOF
0.53   0.53     0            -Inf   0.01   0.85      1.05     225814.8

We don’t have enough evidence to reject the null hypothesis that the mean of the proportion of women voters in the 2002 midterm election is higher to the mean age of the proportion of men.


Correlation

Next, we want to explore linear correlation across the variables.

We can visualize the correlation coefficients with a correlation plot using the ggcorrplot function in tandem with the cor function that produces a correlation matrix.
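A sketch of that chunk:

```r
library(ggcorrplot)

# Pearson correlations across the selected variables
corr_mat <- cor(ctrl_df)
ggcorrplot(corr_mat, type = "lower", lab = TRUE)
```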

As we can see, there are a few variables that are highly correlated. When we come to the statistical modeling section of the tutorial, high correlation between variables is something that we want to be cognizant of.

What variables correlate with whether a person voted in the 2002 midterm election?

Correlation with vote02

variable   corr_coef
vote00     0.4553
age        0.2410
vote98     0.2342
newreg     0.1197
persons    0.1193
st_hse     0.0093
st_sen     0.0064
female     0.0021

Comprehension Check

In 2-3 sentences, discuss the correlation coefficients observed in the table above. Do they match your intuition?

[DISCUSS HERE] Not really. In particular, I expected a higher correlation with vote00 and with newreg: if you voted previously, you may be used to doing it and want to continue the habit, and if you are newly registered, you may want to start the habit. I know that a high correlation with both of these variables might be hard to observe, because one implies that the other is not occurring, so the combined effect is not clear.

For age, it is hard for me to know whether this matches what I expected, since I don't think the relationship is linear but rather parabolic: at a young age you are likely to be less engaged with voting, as you get older you become more concerned with voting, and then you lose interest as you approach your last years. However, if the election is on a weekday and you work, then the relationship might be the inverse of what I assumed.

I didn't expect much correlation with the number of registered voters in the household: if the other registered voters don't vote, that might push you not to vote as well, while if they do vote, they might push you to vote.


Linear Regression

Let’s now move to linear regression to check for correlation between the voting outcome and covariates. First, we’ll use the lm function from the stats package. Here is a bivariate regression where we regress an indicator of a person’s gender on whether or not that person voted in the 2002 midterm election.

The underlying model that we are using is:

\(Y_{i} = \beta_{0} + \beta_{1} X_{i} + u_{i}\)

where \(Y_{i}\) is our outcome of interest, “Voted in 2002 midterm election” and \(\beta_{1}\), the coefficient of interest, gives us a measure of how much \(X\) and \(Y\) covary, adjusting for the variance of \(X\) in our population. We take a data-driven approach to estimating that relationship.
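A sketch of the estimation chunk (tidy from the broom package arranges the output as in the table below):

```r
# Bivariate linear probability model of 2002 turnout on age
fit_lm <- lm(vote02 ~ age, data = ctrl_df)
broom::tidy(fit_lm)
```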

term          estimate   std.error   statistic   p.value
(Intercept)   0.1893     0.0029      64.7001     0
age           0.0065     0.0001      122.4382    0

Based on this bivariate regression, we estimate that for each additional year of age, an individual is 0.65 percentage points more likely to have voted in the 2002 election. This is consistent with what we observed in the t-tests.

We can move to a more general linear regression model, the multivariate regression.

\(Y = X\beta + u\)

where \(X\) is a matrix of variables (or covariates) and \(\beta\) is a vector of coefficients.

Building on our previous regression, let’s add gender to the regression.

Usually, we will be concerned about heteroskedasticity in our data. That is, the variance of the error term may depend on the covariates. To adjust for heteroskedasticity, there are a number of ways to calculate heteroskedasticity-robust standard errors. Here, we'll use the lmtest package in tandem with the sandwich package to report heteroskedasticity-robust standard errors.
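A sketch, with the choice of HC variant (HC1 here) as our assumption:

```r
library(lmtest)
library(sandwich)

fit_lm2 <- lm(vote02 ~ age + female, data = ctrl_df)

# Robust standard errors: vcovHC replaces the homoskedastic covariance matrix
coeftest(fit_lm2, vcov = vcovHC(fit_lm2, type = "HC1"))
```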

term          estimate   std.error   statistic   p.value
(Intercept)   0.1977     0.0031      63.8366     0
age           0.0065     0.0001      121.2115    0
female        -0.0173    0.0020      -8.6853     0

Logistic Regression

Because our outcome is a binary variable, we can approach the modeling as a classification problem. As such, let’s switch to logistic regression.
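A sketch of the chunk:

```r
# Logistic regression: binomial family with the default logit link
fit_glm <- glm(vote02 ~ age + female, family = binomial(), data = ctrl_df)
broom::tidy(fit_glm)
```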

term          estimate   std.error   statistic   p.value
(Intercept)   -1.2683    0.0136      -92.9639    0
age           0.0274     0.0002      110.9083    0
female        -0.0721    0.0084      -8.5323     0

Recall that the coefficients from the logistic regression model are not interpreted in the same way as coefficients from the linear model. Whereas coefficients from the lm model are interpreted as the expected change in Y for a unit change in X, the coefficients from the glm model are the change in the "log odds" of Y as X changes.

We can get an analogous average marginal effect from the glm coefficients using the margins function. You can confirm that the values for the average marginal effect resemble those from the linear regression.
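For example:

```r
library(margins)

# Average marginal effects of age and female on the probability of voting
margins(fit_glm)
```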

##      age   female
##  0.00643 -0.01691

We’ve seen that age is correlated with whether or not a person voted in the 2002 midterm election. However, it’s likely that a person’s propensity to vote diminishes at some point with age. So, we’ll include a quadratic term for age.

term          estimate   std.error   statistic   p.value
(Intercept)   -3.9097    0.0390      -100.1602   0e+00
age           0.1332     0.0015      90.8825     0e+00
I(age^2)      -0.0010    0.0000      -74.0238    0e+00
female        -0.0301    0.0086      -3.5056     5e-04

We see from the table above and the following figure that the fitted values of the likelihood that a person voted in 2002 follow an "inverted U" shape as age increases.

term              estimate   std.error   statistic   p.value
(Intercept)       -3.8691    0.0604      -64.0806    0.0000
female            -0.0992    0.0786      -1.2625     0.2068
age               0.1317     0.0023      56.9492     0.0000
I(age^2)          -0.0009    0.0000      -45.6899    0.0000
female:age        0.0025     0.0030      0.8491      0.3958
female:I(age^2)   0.0000     0.0000      -0.7912     0.4288

We can see from the regression table and the figure that older people were observed to vote at a higher rate. However, we do not observe a difference in voting rates between women and men.


Comprehension Check

We have described the relationship between covariates and the outcome as correlation. We want to be cautious about interpreting the relationships as causal. One primary reason is that there are likely variables omitted from our model that are correlated with the likelihood of voting and the covariates included in the regression above. In 3-5 sentences, interpret the table and describe some of the factors that may be omitted from these simple regressions that we want to be concerned about when interpreting the estimates from the regressions above.

[DISCUSS HERE]

In the table above (considering the logit model with the most regressors) we found that female is negatively correlated with voting, but it is not significant, so we cannot say it plays an important role. Age is positively correlated with voting, so one extra year increases the log odds of voting by 0.13; however, the coefficient on age squared is negative. From the graphs above we can see that the young and elderly populations, regardless of whether they are male or female, are less likely to vote than the middle-aged population. One of the factors that may be affecting the likelihood of voting is whether the election was on a weekday and the person worked that day. Education may also be a factor that is positively correlated with voting. Whether people voted or not in previous elections may also matter; however, the effect here may be less clear because people may be less likely to engage in midterm elections.


Machine Learning

Now, let’s bring in some steps from machine learning to set up a predictive model of whether or not a person voted in the 2002 midterm election.

Train-Test Split

First, we will split our data into a training set and a test set. We will hold out 20% of the data as a test dataset to evaluate our "best" model at the final step, training on the remaining 80%. The test set is often referred to as the holdout data.
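A sketch of the split, using the rsample package (attached with tidymodels). The seed and object names are ours, and the original may also have stratified on the outcome:

```r
library(tidymodels)  # attaches rsample, recipes, parsnip, tune, yardstick, ...

# Classification models downstream expect a factor outcome
ctrl_df <- ctrl_df %>% mutate(vote02 = factor(vote02))

set.seed(1234)  # arbitrary seed for reproducibility
data_split <- initial_split(ctrl_df, prop = 0.8)
data_split
```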

## <194434/48608/243042>

Create a training set and a holdout set.
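For example:

```r
train_df <- training(data_split)
test_df  <- testing(data_split)
```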

Preprocessing

We will do some data preprocessing to prepare for our models.
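Judging from the summary table below (a squared age term plus dummy-coded district indices), the recipe might look roughly like this sketch:

```r
vote_recipe <- recipe(vote02 ~ ., data = train_df) %>%
  # Add a squared age term
  step_mutate(age2 = age^2,
              # Treat the district indices as categorical before dummy coding
              st_sen = factor(st_sen),
              st_hse = factor(st_hse)) %>%
  step_dummy(st_sen, st_hse)
```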

After all of the steps above, we can see that we have added several new variables to our data.

## 
## ===================================================================
## Statistic     N      Mean    St. Dev.  Min Pctl(25) Pctl(75)  Max  
## -------------------------------------------------------------------
## age        194,434  52.076    18.573   18     37       67     113  
## persons    194,434   1.424     0.494    1     1        2       2   
## newreg     194,434   0.122     0.328    0     0        0       1   
## vote00     194,434   0.538     0.499    0     0        1       1   
## vote98     194,434   0.165     0.371    0     0        0       1   
## female     194,434   0.568     0.495    0     0        1       1   
## age2       194,434 3,056.847 2,088.571 324  1,369    4,489   12,769
## st_sen_X2  194,434   0.097     0.297    0     0        0       1   
## st_sen_X3  194,434   0.106     0.308    0     0        0       1   
## st_sen_X4  194,434   0.081     0.272    0     0        0       1   
## st_sen_X5  194,434   0.104     0.306    0     0        0       1   
## st_sen_X6  194,434   0.181     0.385    0     0        0       1   
## st_sen_X7  194,434   0.173     0.379    0     0        0       1   
## st_sen_X8  194,434   0.176     0.380    0     0        0       1   
## st_hse_X2  194,434   0.023     0.150    0     0        0       1   
## st_hse_X3  194,434   0.025     0.156    0     0        0       1   
## st_hse_X4  194,434   0.033     0.179    0     0        0       1   
## st_hse_X5  194,434   0.028     0.165    0     0        0       1   
## st_hse_X6  194,434   0.032     0.175    0     0        0       1   
## st_hse_X7  194,434   0.031     0.173    0     0        0       1   
## st_hse_X8  194,434   0.034     0.182    0     0        0       1   
## st_hse_X9  194,434   0.031     0.173    0     0        0       1   
## st_hse_X10 194,434   0.030     0.171    0     0        0       1   
## st_hse_X11 194,434   0.029     0.167    0     0        0       1   
## st_hse_X12 194,434   0.020     0.139    0     0        0       1   
## st_hse_X13 194,434   0.067     0.250    0     0        0       1   
## st_hse_X14 194,434   0.060     0.238    0     0        0       1   
## st_hse_X15 194,434   0.049     0.216    0     0        0       1   
## st_hse_X16 194,434   0.055     0.227    0     0        0       1   
## st_hse_X17 194,434   0.057     0.232    0     0        0       1   
## st_hse_X18 194,434   0.057     0.232    0     0        0       1   
## st_hse_X19 194,434   0.067     0.249    0     0        0       1   
## st_hse_X20 194,434   0.064     0.245    0     0        0       1   
## st_hse_X21 194,434   0.055     0.227    0     0        0       1   
## st_hse_X22 194,434   0.052     0.222    0     0        0       1   
## st_hse_X23 194,434   0.050     0.218    0     0        0       1   
## -------------------------------------------------------------------

Models

We are going to use a few models to predict whether or not a person voted in the 2002 midterm election: logistic regression, lasso, and classification trees.

Lasso

One of the central tools of machine learning is regularization. When we are working with a large number of variables to predict an outcome, we want to be especially careful about extrapolating results from the data sample that we have to a data sample that we do not have. This is often referred to as out-of-sample performance. Namely, we want to set up data-driven models in a manner that is sensitive to the variance of our analysis when we try to apply it in a new setting. Regularization is a step included in machine learning to penalize our models for complexity (that is, for including so many variables that we overfit to our training sample).

A direct link between linear/logistic regression and regularization can be made using the lasso algorithm. The lasso is a modification of linear and logistic regression that includes a penalty term, "lambda", which is used to enumerate a number of candidate models from which we can select the best-performing model.

In R, we can run a lasso using the glmnet function.
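A sketch of such a chunk, fitted on the raw training covariates (the tutorial's own chunk presumably used the preprocessed design matrix):

```r
library(glmnet)

x <- model.matrix(vote02 ~ . - 1, data = train_df)  # covariate matrix
y <- train_df$vote02

# alpha = 1 is the lasso; glmnet fits the whole lambda path at once
lasso_fit <- glmnet(x, y, family = "binomial", alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE)  # coefficient paths
```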

The figure above shows the path that the coefficient for each covariate follows as the value of the penalty term, lambda, decreases.

On the left-hand side we see that the coefficients are all zero. That is the point at which the penalty term (lambda) fully drowns out the relationship between the covariates and the outcome variable.

On the right-hand side is our model with no penalty, i.e. lambda equals zero. This is analogous to the non-regularized logistic regression model that we saw in the previous section.

Decision Tree

Another common foundation in machine learning models is the decision tree, which is a hierarchical approach to modelling the relationship between inputs (our covariates) and outputs. The decision tree algorithm implements a series of steps to split the data into groups that are most similar. Starting from the top of the hierarchy, the algorithm makes splits, dividing the data based on a covariate. The algorithm continues until it reaches final nodes (often called leaf nodes or terminal nodes). We can run a classification tree with our training data using the rpart function. For demonstration, we'll set the complexity parameter, cp, at a fixed value.
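A sketch (the cp value here is arbitrary):

```r
library(rpart)
library(rpart.plot)

# Classification tree with a fixed complexity parameter
tree_fit <- rpart(vote02 ~ ., data = train_df, method = "class", cp = 0.001)
rpart.plot(tree_fit)
```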

Cross Validation

Ok, so we’ve seen how to fit three different models to our training data, logistic regression, lasso, and decision trees. Let’s bring them all together into a framework for evaluating models.

Another tool that we implement through machine learning is cross validation. The motivation behind cross validation is that we want to train models that have good out-of-sample prediction. By "good", we could mean that the model has low variance, high accuracy, or a number of other measures. We'll come back to deciding on the appropriate metrics, but the way that we derive quantitative measures of the goodness of model fit is to split our data into "folds." The data are randomly split to increase the likelihood that we are generating a representative sample of the full data within each fold.

Cross validation is the concept of holding aside one fold, training our model on the remaining folds, and then measuring goodness of fit on the fold that was held out. This held out fold is often referred to as the validation fold. It’s not to be confused with the test data, which was a portion of our data that was put aside and never touched during the model training phase.

Let’s apply cross-validation to our data. We will use 5-fold cross validation (you are welcome to change the number of folds; 3, 5, and 10-fold are common).

## #  5-fold cross-validation using stratification 
## # A tibble: 5 x 2
##   splits                 id   
##   <named list>           <chr>
## 1 <split [155.5K/38.9K]> Fold1
## 2 <split [155.5K/38.9K]> Fold2
## 3 <split [155.5K/38.9K]> Fold3
## 4 <split [155.5K/38.9K]> Fold4
## 5 <split [155.5K/38.9K]> Fold5

Now, we are going to use a unifying framework to train our models. For this tutorial, we will work with the tidymodels framework for machine learning. tidymodels is a "meta-package" that incorporates a number of other packages that make the machine learning pipeline consistent for many commonly-used machine learning algorithms.

We will work with the same models that we have seen already in this tutorial: logistic regression, lasso, and decision trees.

First, we initialize the workflow by assigning the preprocessing recipe that we created earlier.
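For example:

```r
# Start a workflow and attach the preprocessing recipe from earlier
vote_workflow <- workflow() %>%
  add_recipe(vote_recipe)
```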

Next, we declare the three models that we want to train using the parsnip package. We will use the glm package for logistic regression, glmnet package for lasso, and rpart package for decision trees. Declare these packages as the “engine” for each specification. For all three models, we will be using classification.
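A sketch of the three specifications (the fixed penalty and cost-complexity values are placeholders):

```r
logit_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lasso_spec <- logistic_reg(penalty = 0.001, mixture = 1) %>%  # mixture = 1 is the lasso
  set_engine("glmnet") %>%
  set_mode("classification")

tree_spec <- decision_tree(cost_complexity = 0.001) %>%
  set_engine("rpart") %>%
  set_mode("classification")
```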

Next, run the cross-validation! In order to do so, we are using the fit_resamples function from the tune package. This allows us to extract the performance metrics for each fold of each model.
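A sketch for one model, using the workflow interface; repeating it with lasso_spec and tree_spec produces the three result tables further below.

```r
logit_res <- vote_workflow %>%
  add_model(logit_spec) %>%
  fit_resamples(resamples = folds,
                metrics = metric_set(accuracy, roc_auc))
```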

Performance metrics are the measures of goodness of fit that we previously mentioned. Deciding on the performance metric is a key decision in the machine learning pipeline. It is something that we will come back to a number of times throughout the course. In our setup, we will evaluate model performance using the following performance metrics:

  • Accuracy - the proportion of observations that are predicted correctly
  • Area Under the ROC Curve (AUC) - this is a common method of assessing the performance of binary classification, as it balances the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity). Values close to 0.50 are little better than picking at random, while values close to 1 maximize the true positive rate without inducing many false positives. See Receiver operating characteristic

The true value of cross validation comes into play when we are evaluating performance metrics. In effect, what we are doing (and what happens under the hood in the following code) is to train each model on 4 of the 5 folds, then calculate the performance metrics on the held-out validation fold, repeating this for each fold. We don't want to over-interpret the performance metrics from the training folds; instead, the validation folds give us a more reliable estimate of out-of-sample performance.

Let’s check the average accuracy and AUC (area under the curve) metrics from the cross-validated models.

## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.737     5 0.00166
## 2 roc_auc  binary     0.800     5 0.00144
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.732     5 0.00210
## 2 roc_auc  binary     0.793     5 0.00149
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.736     5 0.00183
## 2 roc_auc  binary     0.754     5 0.00577

As we can see above, accuracy and AUC are fairly similar across the three models. Note that for the lasso and tree models, we did not do any model tuning. Let’s see if we can improve performance using model tuning.

Model Tuning

Let’s go back to the lasso model. Instead of setting the penalty parameter at a fixed value, we can use a data-driven approach to selecting the prefered value of our parameters. This is whene model tuning comes into play.

Tuning involves supplying a list of values, or a "grid", and training the models with each of the values. It is important to note that for each value (or set of values, if tuning multiple parameters), we implement cross validation.

With our same set up, let's tune the penalty parameter. Note that we could also tune the mixture parameter if we wanted to use the full elastic net regularization that is possible with the glmnet package.

Note: the following code takes some time to run; we set up parallel processing to help cut down on the required time.
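A sketch of the tuning chunk. The grid range mirrors the penalty values (down to 1e-10) that appear in the tables below, and the parallel backend uses the doFuture package from the session info; the details are our assumptions.

```r
library(doFuture)   # also attaches foreach and future
registerDoFuture()
plan(multisession)  # parallel workers on the local machine

# Mark the penalty for tuning
tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# 50 penalty values, log-spaced between 1e-10 and 1
lasso_grid <- grid_regular(penalty(range = c(-10, 0)), levels = 50)

lasso_tuned <- vote_workflow %>%
  add_model(tune_spec) %>%
  tune_grid(resamples = folds,
            grid = lasso_grid,
            metrics = metric_set(accuracy, roc_auc))
```

The next two tables can then be pulled out with show_best, e.g. show_best(lasso_tuned, metric = "accuracy").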

Take a look at the penalty values with the highest accuracy values.

## # A tibble: 5 x 4
##    penalty .metric   mean std_err
##      <dbl> <chr>    <dbl>   <dbl>
## 1 2.12e- 4 accuracy 0.737 0.00169
## 2 3.39e- 4 accuracy 0.737 0.00166
## 3 1.00e-10 accuracy 0.737 0.00168
## 4 1.60e-10 accuracy 0.737 0.00168
## 5 2.56e-10 accuracy 0.737 0.00168

And the highest AUC values.

## # A tibble: 5 x 4
##    penalty .metric  mean std_err
##      <dbl> <chr>   <dbl>   <dbl>
## 1 1.00e-10 roc_auc 0.800 0.00145
## 2 1.60e-10 roc_auc 0.800 0.00145
## 3 2.56e-10 roc_auc 0.800 0.00145
## 4 4.09e-10 roc_auc 0.800 0.00145
## 5 6.55e-10 roc_auc 0.800 0.00145

Let’s look at the full path of the performance metrics as the penalty term varied.

We find that tuning provides some improvement in the performance of the lasso model. Given that our models have only a few covariates, the models with very low penalty terms tend to perform better. From the figure above, we see that accuracy levels off at penalty values below 0.15. However, the AUC metric continues to increase until the non-penalized model is reached.

We will leave it as an exercise for you to explore more complex models with interaction terms and polynomial terms. To implement these data transformations, explore recipes::step_interact for interactions between covariates, and recipes::step_bs and recipes::step_poly for adding basis functions or polynomial terms.

Now that we’ve tune our model, we can extract the best model and apply it to the full training data.

Take a look at the coefficients from the best model.

## # A tibble: 33 x 2
##    term       estimate
##    <chr>         <dbl>
##  1 vote00       1.79  
##  2 vote98       1.11  
##  3 age          0.0815
##  4 persons      0.237 
##  5 st_sen_X3    0.105 
##  6 st_sen_X8   -0.327 
##  7 st_hse_X19   0.184 
##  8 st_hse_X5   -0.717 
##  9 st_hse_X16  -0.628 
## 10 st_hse_X18  -0.419 
## # … with 23 more rows

Comprehension Check

Now it’s your turn, try out a few different values for the penalty term. How do the variables and coefficients from the lasso model change as you change the penalty.val? Try some values close to 0.1 and some values below 0.01. Provide 3-5 sentence as comment below.

## # A tibble: 2 x 2
##   term   estimate
##   <chr>     <dbl>
## 1 vote00 1.06    
## 2 age    0.000869
## # A tibble: 2 x 2
##   term   estimate
##   <chr>     <dbl>
## 1 vote00  1.12   
## 2 age     0.00266
## # A tibble: 3 x 2
##   term   estimate
##   <chr>     <dbl>
## 1 vote00   1.39  
## 2 age      0.0102
## 3 vote98   0.285
## # A tibble: 15 x 2
##    term       estimate
##    <chr>         <dbl>
##  1 vote00       1.72  
##  2 age          0.0201
##  3 vote98       0.829 
##  4 persons      0.155 
##  5 st_sen_X3    0.112 
##  6 st_hse_X19   0.140 
##  7 st_sen_X8   -0.115 
##  8 st_hse_X5   -0.219 
##  9 st_hse_X16  -0.131 
## 10 st_hse_X18  -0.0974
## 11 newreg       0.0544
## 12 st_hse_X8    0.0494
## 13 st_hse_X14  -0.0460
## 14 st_hse_X9    0.0229
## 15 st_sen_X7   -0.0138


[DISCUSS HERE]

The bigger the penalty, the fewer the terms that "are relevant for the analysis". This is caused by the fact that one estimator absorbs all the effects that other variables bring to the model, and that is why we only use the lasso for the purpose of prediction and not for interpreting each one of the estimators.

Finally, let’s apply best model from our model tuning to our holdout set. We can use the last_fit function from the tune package to carry out this step.

## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.738
## 2 roc_auc  binary         0.797

Comprehension Check

Let’s say that you are contacted by a non-partisan organization that is seeking to identify voters that are less likely to vote in the 2020 US federal general election. The organization wants to focus its resources on encouraging unlikely voters to go to the polls in November. Based on the analysis above, what recommendations would you provide? What additional data would would you recommend that the organization collect to accurately predict whether a person is likely to vote?

[DISCUSS HERE] Based on the analysis above, I would recommend focusing on the population younger than 50 years old, whose likelihood of voting is smaller; the younger, the better. In addition, I would recommend focusing on people who have not voted in previous elections. I would also recommend getting information on education, work status, the day of the election, and income level, and controlling for whether or not a person lives in a swing state.


References

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] vroom_1.2.0        doFuture_0.9.0     iterators_1.0.12   foreach_1.4.8     
##  [5] future_1.16.0      globals_0.12.5     patchwork_1.0.0    ggthemes_4.2.0    
##  [9] hrbrthemes_0.8.0   rpart_4.1-15       plotmo_3.5.6       TeachingDemos_2.10
## [13] plotrix_3.7-7      Formula_1.2-3      glmnet_3.0-2       Matrix_1.2-18     
## [17] yardstick_0.0.6    workflows_0.1.1    tune_0.0.1         rsample_0.0.5     
## [21] recipes_0.1.10     parsnip_0.0.5      infer_0.5.1        dials_0.0.4       
## [25] scales_1.1.0       broom_0.5.5        tidymodels_0.1.0   sandwich_2.5-1    
## [29] lmtest_0.9-37      zoo_1.8-7          margins_0.3.23     ggcorrplot_0.1.3  
## [33] corrr_0.4.2        stargazer_5.2.2    glue_1.3.2         here_0.1          
## [37] forcats_0.5.0      stringr_1.4.0      dplyr_0.8.5        purrr_0.3.3       
## [41] readr_1.3.1        tidyr_1.0.2        tibble_2.1.3       ggplot2_3.3.0     
## [45] tidyverse_1.3.0    pacman_0.5.1      
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.1.4          tidyselect_1.0.0    lme4_1.1-21        
##   [4] htmlwidgets_1.5.1   grid_3.6.3          pROC_1.16.1        
##   [7] munsell_0.5.0       codetools_0.2-16    DT_0.12            
##  [10] miniUI_0.1.1.1      withr_2.1.2         colorspace_1.4-1   
##  [13] highr_0.8           knitr_1.28          rstudioapi_0.11    
##  [16] stats4_3.6.3        Rttf2pt1_1.3.8      bayesplot_1.7.1    
##  [19] listenv_0.8.0       labeling_0.3        rstan_2.19.3       
##  [22] farver_2.0.3        bit64_0.9-7         DiceDesign_1.8-1   
##  [25] rprojroot_1.3-2     vctrs_0.2.4         generics_0.0.2     
##  [28] ipred_0.9-9         xfun_0.12           R6_2.4.1           
##  [31] markdown_1.1        rstanarm_2.19.3     lhs_1.0.1          
##  [34] assertthat_0.2.1    promises_1.1.0      nnet_7.3-13        
##  [37] gtable_0.3.0        processx_3.4.2      timeDate_3043.102  
##  [40] rlang_0.4.5         systemfonts_0.1.1   splines_3.6.3      
##  [43] extrafontdb_1.0     inline_0.3.15       yaml_2.2.1         
##  [46] reshape2_1.4.3      prediction_0.3.14   modelr_0.1.6       
##  [49] tidytext_0.2.3      threejs_0.3.3       crosstalk_1.1.0.1  
##  [52] backports_1.1.5     httpuv_1.5.2        rsconnect_0.8.16   
##  [55] tokenizers_0.2.1    extrafont_0.17      tools_3.6.3        
##  [58] lava_1.6.7          kableExtra_1.1.0    ellipsis_0.3.0     
##  [61] ggridges_0.5.2      Rcpp_1.0.4          plyr_1.8.6         
##  [64] base64enc_0.1-3     ps_1.3.2            prettyunits_1.1.1  
##  [67] haven_2.2.0         fs_1.3.2            furrr_0.1.0        
##  [70] magrittr_1.5        data.table_1.12.8   colourpicker_1.0   
##  [73] reprex_0.3.0        GPfit_1.0-8         SnowballC_0.6.0    
##  [76] matrixStats_0.56.0  tidyposterior_0.0.2 hms_0.5.3          
##  [79] shinyjs_1.1         mime_0.9            evaluate_0.14      
##  [82] xtable_1.8-4        tidypredict_0.4.5   shinystan_2.5.0    
##  [85] readxl_1.3.1        gridExtra_2.3       shape_1.4.4        
##  [88] rstantools_2.0.0    compiler_3.6.3      rpart.plot_3.0.8   
##  [91] crayon_1.3.4        minqa_1.2.4         StanHeaders_2.19.2 
##  [94] htmltools_0.4.0     later_1.0.0         lubridate_1.7.4    
##  [97] DBI_1.1.0           dbplyr_1.4.2        MASS_7.3-51.5      
## [100] boot_1.3-24         cli_2.0.2           gower_0.2.1        
## [103] igraph_1.2.4.2      pkgconfig_2.0.3     xml2_1.2.5         
## [106] dygraphs_1.1.1.6    hardhat_0.1.2       webshot_0.5.2      
## [109] prodlim_2019.11.13  rvest_0.3.5         janeaustenr_0.1.5  
## [112] callr_3.4.2         digest_0.6.25       rmarkdown_2.1      
## [115] cellranger_1.1.0    gdtools_0.2.1       shiny_1.4.0.2      
## [118] gtools_3.8.1        nloptr_1.2.2.1      lifecycle_0.2.0    
## [121] nlme_3.1-144        jsonlite_1.6.1      viridisLite_0.3.0  
## [124] fansi_0.4.1         pillar_1.4.3        lattice_0.20-40    
## [127] loo_2.2.0           fastmap_1.0.1       httr_1.4.1         
## [130] pkgbuild_1.0.6      survival_3.1-8      xts_0.12-0         
## [133] shinythemes_1.1.2   bit_1.1-15.2        class_7.3-15       
## [136] stringi_1.4.6

  1. This tutorial was originally developed by Susan Athey, Niall Keleher, and Eray Turkel for the Spring 2020 course, ALP301 Data-Driven Impact. We are grateful to Grant R. McDermott, Ed Rubin, Matthew Taddy, Kieran Healy, Garrett Grolemund and Hadley Wickham, and Bruno Rodrigues and Julia Silge. This tutorial draws from the examples that these colleagues made publicly available.