You will be able to apply a common set of tools for data analysis in the course and interpret the analysis through clear written and visual output.
In this tutorial, we will cover reading data into R, producing descriptive statistics, visualizing data, and running statistical models. Several of the topics covered should be review; some topics, materials, and methods may be new. Please read through the tutorial and examine the code closely to get a sense of the steps that are covered.
We have structured this RMarkdown document such that the code chunks are ready for you to run. You can run individual code chunks by clicking on the green right-facing arrow or by using the Ctrl+Shift+Enter key combination. For more information on RMarkdown documents, see the RMarkdown chapter from R for Data Science.
Interspersed through the tutorial, we have included sections where your input is required. These are in the form of comprehension checks and assignment questions.
Here is where we will ask you to interpret or reflect on the data, analysis, or models used.
[DISCUSS HERE]
To promote collaborative, reproducible work we will be following a set of guidelines and using a prescribed set of R packages throughout the course. We will make extensive use of the tidyverse, “a set of packages that work in harmony because they share common data representations and API design.” Check out the Tidyverse packages. And for an excellent introduction to R, the tidyverse, and data science, we recommend R for Data Science.
The following is an RMarkdown “code chunk” to load the tidyverse package (sometimes called a library).
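A minimal version of that chunk, assuming the pacman package is installed (this guard is our addition, not in the original):

# install pacman if it is missing, then load the tidyverse
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse)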
From the start, we want to establish a set of common good practices in writing interpretable, well-commented code in R. If you have not done so already, please review the tidyverse style guide by Hadley Wickham (A shorter, similar style guide can be found at Google’s R Style Guide).
# for working directories
pacman::p_load(here)
# for working with strings
pacman::p_load(glue)
# for tables of summary statistics
pacman::p_load(stargazer)
# for calculating correlations
pacman::p_load(corrr)
# for correlation plots
pacman::p_load(ggcorrplot)
# for marginal effects from linear regressions
pacman::p_load(margins)
# tests for linear regression models
pacman::p_load(lmtest)
# for robust standard errors
pacman::p_load(sandwich)
# meta-framework for model training and testing
pacman::p_load(tidymodels)
# for elastic net and lasso
pacman::p_load(glmnet)
# for plotting glmnet
pacman::p_load(plotmo)
# for classification and regression trees
pacman::p_load(rpart)
# for updated ggplot2 theme
pacman::p_load(hrbrthemes)
# for updated ggplot2 colorblind-friendly scheme
pacman::p_load(ggthemes)
# for combining ggplot2 figures
pacman::p_load(patchwork)
# for parallel processing
pacman::p_load(doFuture)
# for fast reading and writing of delimited files
pacman::p_load(vroom)

The data we will be working with come from Kevin Arceneaux, Alan S. Gerber, and Donald P. Green (2006)'s paper "Comparing Experimental and Matching Methods Using a Large-Scale Voter Mobilization Experiment" (see article). These data come from a randomized experiment conducted in Iowa and Michigan before the 2002 US midterm elections. Households with one or two registered voters were randomly assigned to treatment and control groups; the treatment involved calling the selected individuals by phone as part of a get-out-the-vote (GOTV) campaign. The study involved over 1.9 million households (1,845,348 in the control group and 59,972 in the group assigned to receive the GOTV phone call).
The data that we will use includes information about voting records in elections prior to 2002 as well as several demographic and location variables.
This tutorial assumes that the data is in a folder named "data" inside a project directory that contains a .here file. We will use the here package (see Practice safe paths from Jennifer Bryan and Jim Hester's book) to assist with finding files. We encourage you to work within an RStudio Project (see Project-oriented workflows).
Let’s read in the data:
# # Download the data from GitHub
# temp <- tempfile()
# download.file("https://raw.githubusercontent.com/gsbDBI/ExperimentData/master/Mobilization/ProcessedData/mobilization_with_unlisted.zip", temp)
#
# # Unzip the data file and load into memory
# df <- readr::read_csv(unz(temp, "mobilization_with_unlisted.csv"))
# unlink(temp)
# # Save the data
# vroom_write(df, here::here("data", "mobilization.csv"), delim = ",")
# Read in the data
df <- readr::read_csv(here::here("data", "mobilization.csv"))

Take a look at the data. The glimpse function from the tibble package is a handy way to check the data types of the variables and get a sense of the values that one is working with.
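A minimal call (the chunk that produced the output below is not shown):

tibble::glimpse(df)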
## Observations: 2,474,927
## Variables: 29
## $ persons <dbl> 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,…
## $ competiv <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ vote00 <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ treatment <dbl> -1, 0, 0, -1, 0, -1, 0, 0, 0, -1, -1, 0, 0, -1, 0, 0,…
## $ county <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ st_sen <dbl> 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 2…
## $ st_hse <dbl> 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 5…
## $ vote98 <dbl> 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ newreg <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pid_maj <dbl> 0, 0, 1, 2, 2, 2, 0, 2, 2, 2, 1, 0, 0, 2, 2, 2, 2, 0,…
## $ age <dbl> 44, 39, 82, 73, 44, 75, 78, 74, 46, 87, 59, 78, 42, 4…
## $ female <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ vote02 <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,…
## $ contact <dbl> NA, 0, 0, NA, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0,…
## $ pseudo_c <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ treat_real <dbl> NA, 0, 0, NA, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0,…
## $ treat_pseudo <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ state <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ comp_mi <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ comp_ia <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fem_miss <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ female2 <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ lc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ hc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pseudo_portion <dbl> 0, NA, NA, 0, NA, 0, NA, NA, NA, 1, 0, NA, NA, 0, NA,…
## $ W <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ female_decoded <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ pid_maj_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ d_unlisted <dbl> 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1,…
As you can see, the data include 2,474,927 observations and 29 variables.
Can you start to get a sense of what each variable represents?
Our data is unlabelled, so to help understand some of the variables, here’s a list of some key variables that we will work with:
| Variable | Description |
|---|---|
| age | Age |
| persons | Number of registered voters in household |
| newreg | Newly registered voter |
| female | Indicator variable where \(= 1\) indicates female |
| st_sen | Index of state senate district |
| st_hse | Index of state congressional district |
| vote00 | Indicator variable where \(= 1\) indicates voted in the 2000 general election |
| vote98 | Indicator variable where \(= 1\) indicates voted in the 1998 midterm election |
| county | County index |
| state | US State indicator variable where \(= 0\) indicates Michigan and \(= 1\) indicates Iowa |
| comp_mi | Indicator of whether the congressional district was classified as competitive in Michigan |
| comp_ia | Indicator of whether the congressional district was classified as competitive in Iowa |
| vote02 | Indicator variable where \(= 1\) indicates voted in the 2002 midterm election |
| treatment | Indicator variable specifying the treatment in the experiment |
| contact | Indicator of whether the person was contacted (NA - indicates an observation with an unlisted phone number) |
For this tutorial, we will focus our analysis on the control group, i.e., observations from people that did not receive the GOTV campaign messaging. We will, thus, put aside the observations from people who were included in the campaign. Additionally, we are going to restrict our analysis to the subset of variables listed above. For computational reasons, we will also restrict our analysis to the largest county.
# We will retain a subset of the variables for our analysis
covariate_names <- c("age", "persons", "newreg", "vote00", "vote98", "female", "st_sen", "st_hse")
mobilize_df <- df %>%
# Keep the control group
filter(treatment == 0) %>%
# Keep the largest county
filter(county == 82) %>%
# select the variables of interest
select(vote02, all_of(covariate_names)) %>%
# drop observations missing female indicator
drop_na(female)
# Remove the full data from memory
rm(df)

To get a quick understanding of the variables, it's often useful to summarize the data. One way to do so is the summary function.
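For example (the chunk that produced the output below is not shown):

summary(mobilize_df)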
## vote02 age persons newreg
## Min. :0.0000 Min. : 18.00 Min. :1.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 37.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :1.0000 Median : 49.00 Median :1.000 Median :0.0000
## Mean :0.5268 Mean : 52.08 Mean :1.423 Mean :0.1221
## 3rd Qu.:1.0000 3rd Qu.: 67.00 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :113.00 Max. :2.000 Max. :1.0000
## vote00 vote98 female st_sen
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:3.000
## Median :1.0000 Median :0.0000 Median :1.0000 Median :6.000
## Mean :0.5382 Mean :0.1657 Mean :0.5685 Mean :5.144
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:7.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :8.000
## st_hse
## Min. : 1.00
## 1st Qu.: 8.00
## Median :15.00
## Mean :13.57
## 3rd Qu.:19.00
## Max. :23.00
That’s a bit overwhelming when we have lots of variables. Instead, we can get the mean and standard deviation of individual variables.
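One way to produce the lines below, using the glue package (a sketch on our part; the original chunk is not shown):

glue::glue("Mean household size:{round(mean(mobilize_df$persons), 4)}")
glue::glue("Standard deviation:{round(sd(mobilize_df$persons), 4)}")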
## Mean household size:1.4232
## Standard deviation:0.4941
We can also use the stargazer function to produce easy-to-read summary statistics tables.
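A call along these lines (mirroring the chunk used later in this tutorial) produces the table below:

stargazer::stargazer(as.data.frame(mobilize_df), summary = TRUE, type = "text")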
##
## ===========================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -----------------------------------------------------------
## vote02 243,042 0.527 0.499 0 0 1 1
## age 243,042 52.082 18.573 18 37 67 113
## persons 243,042 1.423 0.494 1 1 2 2
## newreg 243,042 0.122 0.327 0 0 0 1
## vote00 243,042 0.538 0.499 0 0 1 1
## vote98 243,042 0.166 0.372 0 0 0 1
## female 243,042 0.568 0.495 0 0 1 1
## st_sen 243,042 5.144 2.265 1 3 7 8
## st_hse 243,042 13.566 6.545 1 8 19 23
## -----------------------------------------------------------
A good way to get to know the data is to produce some data visualizations. Let's take a look at a few of the variables. First, we'll look at univariate visualizations.
Relying on the tidyverse, let’s produce a histogram of age in 2002 (age) with ggplot2.
# Try out a few different ggplot themes to see which one you find most accessible
theme_set(hrbrthemes::theme_ipsum())
# theme_set(ggplot2::theme_minimal())
# theme_set(ggthemes::theme_base())
# theme_set(ggthemes::theme_tufte())
ggplot2::ggplot(data = mobilize_df, aes(x = age)) +
geom_histogram()

We can compare the distribution of age by gender and by whether a person voted in the 2002 midterm election.
# Labels for voting
vote_labs <- c("Didn't Vote", "Voted")
names(vote_labs) <- c("0", "1")
# Labels for gender
gender_labs <- c("Male", "Female")
names(gender_labs) <- c("0", "1")
mobilize_df %>%
# convert female variable to factor variable for plotting
mutate(female = as.factor(female)) %>%
ggplot(aes(x = age, color = female, fill = female)) +
# set number of bins
geom_histogram(bins = 20) +
# grid by vote02 and female
facet_grid(vote02 ~ female,
labeller = labeller(vote02 = vote_labs, female = gender_labs)
) +
# use a colorblind friendly color palette
scale_color_colorblind() +
scale_fill_colorblind() +
# drop the legend
guides(color = FALSE, fill = FALSE) +
labs(
title = "Age",
subtitle = "By gender and voting record in the 2002 midterm election"
)

Using the code for the previous figure, create a 2-by-2 grid of figures showing the age distribution where the columns correspond to whether or not a person voted in 2000 (vote00) and the rows correspond to whether or not the person voted in 2002 (vote02).
# Labels for voting
vote_labs <- c("Didn't Vote", "Voted")
names(vote_labs) <- c("0", "1")
mobilize_df %>%
# convert vote00 variable to factor variable for plotting
mutate(vote00 = as.factor(vote00)) %>%
ggplot(aes(x = age, color = vote00, fill = vote00)) +
# set number of bins
geom_histogram(bins = 20) +
# [YOUR CODE HERE]
facet_grid(vote02 ~ vote00,
labeller = labeller(vote02 = vote_labs, vote00 = vote_labs)
) +
# use a colorblind friendly color palette
scale_color_colorblind() +
scale_fill_colorblind() +
# drop the legend
guides(color = FALSE, fill = FALSE) +
labs(
title = "Age",
subtitle = "By voting in 2000 and 2002"
)

Let's formally test for differences in age. We want to conduct a hypothesis test of whether those who voted in the 2002 midterm election are, on average, older than those who did not vote.
Formally, what we are doing is setting up a null hypothesis, \(H_{0}\), under which the mean ages of voters and non-voters are equal. Our alternative hypothesis, \(H_{1}\), is that voters are, on average, older than non-voters (a one-sided alternative, matching the alternative = "less" option in the code below).
We will use Welch’s t-test, under which we assume that the data from the two groups follow a normal distribution. However, we do not require that the data from the two populations have the same variance.
Welch’s t-test:
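With sample means \(\bar{X}_{1}\) and \(\bar{X}_{2}\), sample variances \(s_{1}^{2}\) and \(s_{2}^{2}\), and sample sizes \(n_{1}\) and \(n_{2}\), the test statistic is

\[ t = \frac{\bar{X}_{1} - \bar{X}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}} \]

with degrees of freedom given by the Welch-Satterthwaite approximation.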
First, we will test whether the mean age of people who voted is greater than the mean age of people who did not vote.
tidy(t.test(age ~ vote02,
data = mobilize_df,
var.equal = FALSE,
alternative = "less",
conf.level = 0.95
)) %>%
select(
estimate1, estimate2, estimate, conf.low,
conf.high, p.value, statistic, parameter
) %>%
rename(
`Didn't Vote` = estimate1,
`Voted` = estimate2,
`Difference` = estimate,
`Low` = conf.low,
`High` = conf.high,
`T-Stat` = statistic,
`DOF` = parameter,
) %>%
knitr::kable(digits = 2) %>%
kableExtra::kable_styling(bootstrap_options = "striped") %>%
kableExtra::add_header_above(c(
"Mean Age" = 3,
"95% Confidence Interval" = 2,
"Test Statistics" = 3
)) %>%
kableExtra::add_header_above(c("Difference in Mean Age,
Welch Two Sample,
One-sided t-test" = 8),
align = "left"
)

| Didn't Vote | Voted | Difference | Low | High | p.value | T-Stat | DOF |
|---|---|---|---|---|---|---|---|
| 47.36 | 56.33 | -8.97 | -Inf | -8.85 | 0 | -122.05 | 236481.3 |
We reject the null hypothesis that the mean age of those who voted in the 2002 midterm election is equal to the mean age of non-voters. We find that voters are, on average, older than non-voters.
Now let's run the same test within gender groups. First, for women: our hypothesis is that women who voted in the 2002 midterm election were, on average, older than women who did not vote.
tidy(t.test(age ~ vote02,
data = filter(mobilize_df, female == 1),
var.equal = FALSE,
alternative = "less",
conf.level = 0.95
)) %>%
select(
estimate1, estimate2, estimate, conf.low,
conf.high, p.value, statistic, parameter
) %>%
rename(
`Didn't Vote` = estimate1,
`Voted` = estimate2,
`Difference` = estimate,
`Low` = conf.low,
`High` = conf.high,
`T-Stat` = statistic,
`DOF` = parameter,
) %>%
knitr::kable(digits = 2) %>%
kableExtra::kable_styling(bootstrap_options = "striped") %>%
kableExtra::add_header_above(c(
"Mean Age" = 3,
"95% Confidence Interval" = 2,
"Test Statistics" = 3
)) %>%
kableExtra::add_header_above(c("Difference in Mean Age (Women),
Welch Two Sample,
One-sided t-test" = 8),
align = "left"
)

| Didn't Vote | Voted | Difference | Low | High | p.value | T-Stat | DOF |
|---|---|---|---|---|---|---|---|
| 48.13 | 57.55 | -9.43 | -Inf | -9.26 | 0 | -93.25 | 133797.4 |
Similarly, for men:
tidy(t.test(age ~ vote02,
data = filter(mobilize_df, female == 0),
var.equal = FALSE,
alternative = "less",
conf.level = 0.95
)) %>%
select(estimate1, estimate2, estimate, conf.low, conf.high, p.value, statistic, parameter) %>%
rename(
`Didn't Vote` = estimate1,
`Voted` = estimate2,
`Difference` = estimate,
`Low` = conf.low,
`High` = conf.high,
`T-Stat` = statistic,
`DOF` = parameter,
) %>%
knitr::kable(digits = 2) %>%
kableExtra::kable_styling(bootstrap_options = "striped") %>%
kableExtra::add_header_above(c(
"Mean Age" = 3,
"95% Confidence Interval" = 2,
"Test Statistics" = 3
)) %>%
kableExtra::add_header_above(c("Difference in Mean Age (Men),
Welch Two Sample,
One-sided t-test" = 8),
align = "left"
)

| Didn't Vote | Voted | Difference | Low | High | p.value | T-Stat | DOF |
|---|---|---|---|---|---|---|---|
| 46.34 | 54.71 | -8.37 | -Inf | -8.2 | 0 | -79.37 | 102628.6 |
Using the code above as a template, test the hypothesis that women voted at a higher rate, on average, than men in the 2002 midterm election. (Hint: use a one-sided test).
# [YOUR CODE HERE]
# Test whether the mean voting rate of women in the 2002 midterm election was higher than the mean voting rate of men.
tidy(t.test(vote02 ~ female,
data = filter(mobilize_df),
var.equal = FALSE,
alternative = "less",
conf.level = 0.95
)) %>%
select(
estimate1, estimate2, estimate, conf.low,
conf.high, p.value, statistic, parameter
) %>%
rename(
`Male` = estimate1,
`Female` = estimate2,
`Difference` = estimate,
`Low` = conf.low,
`High` = conf.high,
`T-Stat` = statistic,
`DOF` = parameter,
) %>%
knitr::kable(digits = 2) %>%
kableExtra::kable_styling(bootstrap_options = "striped") %>%
kableExtra::add_header_above(c(
"Voting Proportion" = 3,
"95% Confidence Interval" = 2,
"Test Statistics" = 3
)) %>%
kableExtra::add_header_above(c("Difference in Mean between Males and Females,
Welch Two Sample,
One-sided t-test" = 8),
align = "left"
)

| Male | Female | Difference | Low | High | p.value | T-Stat | DOF |
|---|---|---|---|---|---|---|---|
| 0.53 | 0.53 | 0 | -Inf | 0.01 | 0.85 | 1.05 | 225814.8 |
We do not have enough evidence to reject the null hypothesis; the data do not show that women voted at a higher rate than men in the 2002 midterm election.
Next, we want to explore linear correlation across the variables.
We can visualize the correlation coefficients with a correlation plot using the ggcorrplot function in tandem with the cor function that produces a correlation matrix.
covariate_names <- names(mobilize_df)
corr <- cor(mobilize_df,
use = "pairwise.complete.obs"
)
# Replace NA values with zero
corr[is.na(corr)] <- 0
# Organize variables by hierarchical clustering
ggcorrplot::ggcorrplot(corr,
hc.order = TRUE,
type = "full",
lab = FALSE,
legend.title = "Correlation Coefficient",
colors = c("#053061", "white", "#67001f"),
ggtheme = ggplot2::theme_void,
outline.col = "white"
) +
theme(
legend.position = "top",
axis.title.x = element_blank(),
axis.text.x = element_text(size = 10),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 10),
axis.title.y = element_blank(),
axis.ticks.y = element_blank()
)

As we can see, there are a few variables that are highly correlated. When we come to the statistical modeling section of the tutorial, high correlation between variables is something that we want to be cognizant of.
What variables correlate with whether a person voted in the 2002 midterm election?
# Convert correlation matrix to data frame
corrr::as_cordf(corr) %>%
# Focus on the 2002 vote variable
corrr::focus(vote02) %>%
# Get the absolute value of the correlation coefficient
mutate(vote02 = abs(vote02)) %>%
# Sort variables by absolute value of correlation coefficient
arrange(-vote02) %>%
# Clean up headers
rename(`Correlation with vote02` = rowname) %>%
rename(corr_coef = vote02) %>%
knitr::kable(digits = 4) %>%
kableExtra::kable_styling(bootstrap_options = "striped")

| Correlation with vote02 | corr_coef |
|---|---|
| vote00 | 0.4553 |
| age | 0.2410 |
| vote98 | 0.2342 |
| newreg | 0.1197 |
| persons | 0.1193 |
| st_hse | 0.0093 |
| st_sen | 0.0064 |
| female | 0.0021 |
In 2-3 sentences, discuss the correlation coefficients observed in the table above. Do they match your intuition?
[DISCUSS HERE] Not really. In particular, I expected a higher correlation with vote00 and with newreg: if you voted previously, voting may have become a habit that you continue, and if you are newly registered, you may want to start that habit. I recognize that a high correlation with both variables at once would be hard to observe, since being newly registered implies not having a prior vote, so the combined effect is not clear.
For age, it is hard to say whether this matches my expectation, since I suspect the relationship is parabolic rather than linear: at a young age you are likely to be less engaged with voting, engagement grows as you get older, and interest declines again in the last years of life. However, if the election is on a weekday and you work, the relationship could be the inverse of what I assumed.
I did not expect much correlation with the number of registered voters in the household: if the other registered voters in the household do not vote, that might push you not to vote, while if they do vote, they might push you to vote.
Let's now move to linear regression to check for correlation between the voting outcome and covariates. First, we'll use the lm function from the stats package. Here is a bivariate regression where we regress whether or not a person voted in the 2002 midterm election on the person's age.
The underlying model that we are using is:
\(Y_{i} = \beta_{0} + \beta_{1} X_{i} + u_{i}\)
where \(Y_{i}\) is our outcome of interest, “Voted in 2002 midterm election” and \(\beta_{1}\), the coefficient of interest, gives us a measure of how much \(X\) and \(Y\) covary, adjusting for the variance of \(X\) in our population. We take a data-driven approach to estimating that relationship.
ols_lm1 <- lm(vote02 ~ age,
data = mobilize_df
)
broom::tidy(ols_lm1) %>%
# round values
knitr::kable(digits = 4) %>%
# Clean table formatting
kableExtra::kable_styling(bootstrap_options = "striped")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.1893 | 0.0029 | 64.7001 | 0 |
| age | 0.0065 | 0.0001 | 122.4382 | 0 |
Based on this bivariate regression, we estimate that for each additional year of age, an individual is 0.65 percentage points more likely to have voted in the 2002 election. This is consistent with what we observed in the t-tests.
We can move to a more general linear regression model: multiple regression, with several covariates.
\(Y_{i} = X_{i}\beta + u_{i}\)
where \(X_{i}\) is a row vector of variables (or covariates) for observation \(i\) and \(\beta\) is a vector of coefficients.
Building on our previous regression, let's add the gender indicator as a covariate.
Usually, we will be concerned about heteroskedasticity in our data. That is, the variance of our error term may depend on covariates. To adjust for heteroskedasticity, there are a number of ways to calculate heteroskedasticity-robust standard errors. Here, we'll use the lmtest package in tandem with the sandwich package to report heteroskedasticity-robust standard errors.
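The chunk defining ols_lm2 is not shown; based on the output below, it is presumably the previous regression with the gender indicator added:

ols_lm2 <- lm(vote02 ~ age + female, data = mobilize_df)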
# report heteroskedasticity-robust standard errors
lmtest::coeftest(ols_lm2,
vcov = sandwich::vcovHC(ols_lm2, type = "HC2")
) %>%
broom::tidy() %>%
knitr::kable(digits = 4) %>%
kableExtra::kable_styling(bootstrap_options = "striped")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.1977 | 0.0031 | 63.8366 | 0 |
| age | 0.0065 | 0.0001 | 121.2115 | 0 |
| female | -0.0173 | 0.0020 | -8.6853 | 0 |
Because our outcome is a binary variable, we can approach the modeling as a classification problem. As such, let’s switch to logistic regression.
logit_age_fem <- glm(vote02 ~ age + female,
data = mobilize_df,
family = binomial(link = "logit")
)
lmtest::coeftest(logit_age_fem,
vcov = sandwich::vcovHC(logit_age_fem, type = "HC2")
) %>%
broom::tidy() %>%
knitr::kable(digits = 4) %>%
kableExtra::kable_styling(bootstrap_options = "striped")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.2683 | 0.0136 | -92.9639 | 0 |
| age | 0.0274 | 0.0002 | 110.9083 | 0 |
| female | -0.0721 | 0.0084 | -8.5323 | 0 |
Recall that the coefficients from the logistic regression model are not interpreted in the same way as coefficients from the linear model. Whereas coefficients from the lm model are interpreted as the expected change in Y for a unit change in X, the coefficients from the glm model are the change in the "log odds" as X changes.
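For example, we can exponentiate the log-odds coefficients to obtain odds ratios (a quick check of our own, not part of the original code):

# odds ratios implied by the logit coefficients
exp(coef(logit_age_fem))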
We can get an analogous average marginal effect from the glm coefficients using the margins function. You can confirm that the values for the average marginal effect resemble those from the linear regression.
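A minimal call (the chunk that produced the output below is not shown):

margins::margins(logit_age_fem)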
## age female
## 0.00643 -0.01691
We’ve seen that age is correlated with whether or not a person voted in the 2002 midterm election. However, it’s likely that a person’s propensity to vote diminishes at some point with age. So, we’ll include a quadratic term for age.
logit_age2 <- glm(vote02 ~ age + I(age^2) + female,
data = mobilize_df,
family = binomial(link = "logit")
)
lmtest::coeftest(logit_age2,
vcov = sandwich::vcovHC(logit_age2, type = "HC2")
) %>%
broom::tidy() %>%
knitr::kable(digits = 4) %>%
kableExtra::kable_styling(bootstrap_options = "striped")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.9097 | 0.0390 | -100.1602 | 0e+00 |
| age | 0.1332 | 0.0015 | 90.8825 | 0e+00 |
| I(age^2) | -0.0010 | 0.0000 | -74.0238 | 0e+00 |
| female | -0.0301 | 0.0086 | -3.5056 | 5e-04 |
We see from the table above and the following figure that the fitted values of the likelihood that a person voted in 2002 follow an "inverted U" shape as age increases.
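The chunk that builds margins_age is not shown. One way to construct it (an assumption on our part) is with margins::cplot, which with draw = FALSE returns a data frame containing the xvals, yvals, upper, and lower columns used in the plotting code below:

# predicted probability of voting across age, with confidence bounds
margins_age <- margins::cplot(logit_age2, x = "age", what = "prediction", draw = FALSE)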
ggplot(margins_age, aes(x = xvals)) +
geom_line(aes(y = yvals)) +
geom_line(aes(y = upper), linetype = 2) +
geom_line(aes(y = lower), linetype = 2) +
geom_hline(yintercept = 0) +
labs(
title = "Relationship between age and voting in 2002",
x = "Age", y = "Predicted likelihood of voting"
)

Next, let's interact gender with the age terms to allow the age profile of voting to differ between women and men.

logit_interaction <- glm(vote02 ~ female * age + female * I(age^2),
data = mobilize_df,
family = binomial(link = "logit")
)
lmtest::coeftest(logit_interaction, vcov = sandwich::vcovHC(logit_interaction, type = "HC2")) %>%
broom::tidy() %>%
knitr::kable(digits = 4) %>%
kableExtra::kable_styling(bootstrap_options = "striped")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.8691 | 0.0604 | -64.0806 | 0.0000 |
| female | -0.0992 | 0.0786 | -1.2625 | 0.2068 |
| age | 0.1317 | 0.0023 | 56.9492 | 0.0000 |
| I(age^2) | -0.0009 | 0.0000 | -45.6899 | 0.0000 |
| female:age | 0.0025 | 0.0030 | 0.8491 | 0.3958 |
| female:I(age^2) | 0.0000 | 0.0000 | -0.7912 | 0.4288 |
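The chunks that build margins_female and margins_male are likewise not shown; a sketch under the same assumption, computing predicted probabilities separately for women and men:

margins_female <- margins::cplot(logit_interaction,
  x = "age", what = "prediction",
  data = subset(mobilize_df, female == 1), draw = FALSE
)
margins_male <- margins::cplot(logit_interaction,
  x = "age", what = "prediction",
  data = subset(mobilize_df, female == 0), draw = FALSE
)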
cfemale <- ggplot(margins_female, aes(x = xvals)) +
geom_line(aes(y = yvals), color = "red") +
geom_line(aes(y = upper), linetype = 2) +
geom_line(aes(y = lower), linetype = 2) +
geom_hline(yintercept = 0) +
labs(
title = "Women",
subtitle = "",
x = "Age", y = "Likelihood of voting"
)
cmale <- ggplot(margins_male, aes(x = xvals)) +
geom_line(aes(y = yvals), color = "blue") +
geom_line(aes(y = upper), linetype = 2) +
geom_line(aes(y = lower), linetype = 2) +
geom_hline(yintercept = 0) +
labs(
title = "Men",
# subtitle = "Men",
x = "Age", y = ""
)
cfemale + cmale

We can see from the regression table and the figures that older people were observed to vote at higher rates, up to a point. However, we do not observe a meaningful difference in voting rates between women and men.
We have described the relationship between covariates and the outcome as correlation. We want to be cautious about interpreting these relationships as causal. One primary reason is that there are likely variables omitted from our model that are correlated both with the likelihood of voting and with the covariates included in the regressions above. In 3-5 sentences, interpret the table and describe some of the factors that may be omitted from these simple regressions and that we should be concerned about when interpreting the estimates.
[DISCUSS HERE]
In the table above (considering the logit model with the most regressors), we found that female is negatively correlated with voting, but the coefficient is not significant, so we cannot say gender plays an important role. Age is positively correlated with voting (a coefficient of 0.13 on the log-odds scale), while the coefficient on age squared is negative. From the graphs above we can see that the young and the elderly, regardless of gender, are less likely to vote than the middle-aged population. One factor that may affect the likelihood of voting is whether the election was on a weekday and the person worked that day. Education may also be positively correlated with voting. Whether people voted in previous elections may also matter, though the effect here may be less clear because people may be less likely to engage in midterm elections.
Now, let’s bring in some steps from machine learning to set up a predictive model of whether or not a person voted in the 2002 midterm election.
First, we will split our data into a training set and a test set. We will train on 80% of the data and hold out the remaining 20% to evaluate our "best" model at the final step. The test set is often referred to as the holdout data.
# Set the seed and run the full code chunk
set.seed(200004)
# Keep 80% of the data for training; hold out the rest for final validation
mobilize_split <- rsample::initial_split(mobilize_df, prop = 0.8)
mobilize_split

## <194434/48608/243042>
Create a training set and a holdout set.
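The chunk for this step is not shown; using rsample's accessor functions (the object names are inferred from their use later in the tutorial):

mobilize_train <- rsample::training(mobilize_split)
mobilize_test <- rsample::testing(mobilize_split)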
We will do some data preprocessing to prepare for our models.
mobilize_rec <- recipes::recipe(vote02 ~ .,
data = mobilize_train
) %>%
# convert outcome to factor
recipes::step_mutate(
vote02 = as.factor(vote02),
role = "outcome"
) %>%
# Impute mean for all numeric variables with missing values
recipes::step_meanimpute(recipes::all_numeric()) %>%
# remove any variables with near-zero variance
recipes::step_nzv(recipes::all_predictors()) %>%
# remove variables with 0.8 correlation coefficient or higher
recipes::step_corr(recipes::all_predictors(),
threshold = 0.8,
use = "pairwise.complete.obs",
method = "pearson"
) %>%
recipes::step_mutate(
# convert senate district to a factor
st_sen = as.factor(st_sen),
# convert congressional district to a factor
st_hse = as.factor(st_hse),
# create age-squared variable
age2 = age^2
) %>%
# Convert all factor variables to binary/dummy variables
recipes::step_dummy(recipes::all_nominal(), -recipes::all_outcomes(),
one_hot = FALSE, preserve = FALSE
)
mobilize_prep <- mobilize_rec %>%
recipes::prep(retain = TRUE)
mobilize_train_prepped <- recipes::juice(mobilize_prep)
mobilize_test_prepped <- recipes::bake(mobilize_prep,
new_data = mobilize_test
)

After all of the steps above, we can see that we have added several new variables to our data.
# vote02 is dropped from the summary table because the outcome variable is now a factor.
stargazer::stargazer(as.data.frame(mobilize_train_prepped),
summary = TRUE, type = "text"
)

##
## ===================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------
## age 194,434 52.076 18.573 18 37 67 113
## persons 194,434 1.424 0.494 1 1 2 2
## newreg 194,434 0.122 0.328 0 0 0 1
## vote00 194,434 0.538 0.499 0 0 1 1
## vote98 194,434 0.165 0.371 0 0 0 1
## female 194,434 0.568 0.495 0 0 1 1
## age2 194,434 3,056.847 2,088.571 324 1,369 4,489 12,769
## st_sen_X2 194,434 0.097 0.297 0 0 0 1
## st_sen_X3 194,434 0.106 0.308 0 0 0 1
## st_sen_X4 194,434 0.081 0.272 0 0 0 1
## st_sen_X5 194,434 0.104 0.306 0 0 0 1
## st_sen_X6 194,434 0.181 0.385 0 0 0 1
## st_sen_X7 194,434 0.173 0.379 0 0 0 1
## st_sen_X8 194,434 0.176 0.380 0 0 0 1
## st_hse_X2 194,434 0.023 0.150 0 0 0 1
## st_hse_X3 194,434 0.025 0.156 0 0 0 1
## st_hse_X4 194,434 0.033 0.179 0 0 0 1
## st_hse_X5 194,434 0.028 0.165 0 0 0 1
## st_hse_X6 194,434 0.032 0.175 0 0 0 1
## st_hse_X7 194,434 0.031 0.173 0 0 0 1
## st_hse_X8 194,434 0.034 0.182 0 0 0 1
## st_hse_X9 194,434 0.031 0.173 0 0 0 1
## st_hse_X10 194,434 0.030 0.171 0 0 0 1
## st_hse_X11 194,434 0.029 0.167 0 0 0 1
## st_hse_X12 194,434 0.020 0.139 0 0 0 1
## st_hse_X13 194,434 0.067 0.250 0 0 0 1
## st_hse_X14 194,434 0.060 0.238 0 0 0 1
## st_hse_X15 194,434 0.049 0.216 0 0 0 1
## st_hse_X16 194,434 0.055 0.227 0 0 0 1
## st_hse_X17 194,434 0.057 0.232 0 0 0 1
## st_hse_X18 194,434 0.057 0.232 0 0 0 1
## st_hse_X19 194,434 0.067 0.249 0 0 0 1
## st_hse_X20 194,434 0.064 0.245 0 0 0 1
## st_hse_X21 194,434 0.055 0.227 0 0 0 1
## st_hse_X22 194,434 0.052 0.222 0 0 0 1
## st_hse_X23 194,434 0.050 0.218 0 0 0 1
## -------------------------------------------------------------------
We are going to use a few models to predict whether or not a person voted in the 2002 midterm election: logistic regression, lasso, and classification trees.
One of the central tools in machine learning is regularization. When we are working with a large number of variables to predict an outcome, we want to be especially careful about extrapolating results from the data sample that we have to a data sample that we do not have. This is often referred to as out-of-sample performance. Namely, we want to set up data-driven models in a manner that is sensitive to the variance of our analysis when we try to apply it in a new setting. Regularization is a step included in machine learning to penalize our models for complexity (or for including so many variables that we overfit to our training sample).
A direct link between linear/logistic regression and regularization can be made using the lasso algorithm. The lasso is a modification to linear and logistic regression that includes a penalty term, "lambda", that is used to enumerate a number of candidate models from which we can select the best performing one.
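Concretely, for the linear case the lasso solves

\[ \hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n}\left(Y_{i} - X_{i}\beta\right)^{2} + \lambda \sum_{j=1}^{p}\left|\beta_{j}\right| \]

where larger values of the penalty \(\lambda\) shrink more coefficients exactly to zero.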
In R, we can run a lasso using the glmnet function.
# The glmnet function requires that we input a matrix of
# our covariates and an array of the outcome variable
x <- model.matrix(~., data = select(
mobilize_train_prepped,
-vote02
))
y <- as.matrix(select(mobilize_train_prepped, vote02))
# Run the lasso using glmnet
lasso_fit <- glmnet::glmnet(x, y,
family = "binomial",
alpha = 1
)
# Remove the x and y matrix
rm(x, y)
# Plot the regularization paths
plotmo::plot_glmnet(lasso_fit)

The figure above shows the path that the coefficient for each covariate follows as the value of the penalty term, lambda, decreases.
On the left-hand side, we see that the coefficients are all zero. That is the point at which the penalty term (lambda) fully drowns out the relationship between the covariates and the outcome variable.
On the right-hand side is our model with no penalty, i.e., lambda equals zero. This is analogous to the non-regularized logistic regression model that we saw in the previous section.
Another common foundation in machine learning models is the decision tree, which is a hierarchical approach to modelling the relationship between inputs (our covariates) and outputs. The decision tree algorithm implements a series of steps to split the data into groups that are most similar. Starting from the top of the hierarchy, the algorithm makes splits, dividing the data based on a covariate. The algorithm continues until it reaches final nodes (often called leaf nodes or terminal nodes). We can run a classification tree with our training data using the rpart function. For demonstration, we'll set the complexity parameter, cp, at a fixed value.
# Set the seed and run the full code chunk
set.seed(200004)
# Fit the classification tree
tree_fit <- rpart::rpart(vote02 ~ .,
data = mobilize_train,
cp = 0.001
)
# Plot the tree
rpart.plot::rpart.plot(tree_fit)

Ok, so we've seen how to fit three different models to our training data: logistic regression, lasso, and decision trees. Let's bring them all together into a framework for evaluating models.
Another tool that we implement through machine learning is cross validation. The motivation behind cross validation is that we want to train models that have good out-of-sample prediction. By "good", we could mean that the model has low variance, high accuracy, or a number of other measures. We'll come back to deciding on the appropriate metrics, but the way that we derive quantitative measures of model fit is to split our data into "folds." The data are randomly split to increase the likelihood that we are generating a representative sample of the full data within each fold.
Cross validation is the concept of holding aside one fold, training our model on the remaining folds, and then measuring goodness of fit on the fold that was held out. This held out fold is often referred to as the validation fold. It’s not to be confused with the test data, which was a portion of our data that was put aside and never touched during the model training phase.
Let’s apply cross-validation to our data. We will use 5-fold cross validation (you are welcome to change the number of folds; 3, 5, and 10-fold are common).
# Set the seed and run the full code chunk
set.seed(202004)
# Split the data into cross validation folds,
# stratify by congressional district
mobilize_cv_splits <- rsample::vfold_cv(mobilize_train, v = 5, strata = st_hse)
# print the object to confirm
mobilize_cv_splits

## # 5-fold cross-validation using stratification
## # A tibble: 5 x 2
## splits id
## <named list> <chr>
## 1 <split [155.5K/38.9K]> Fold1
## 2 <split [155.5K/38.9K]> Fold2
## 3 <split [155.5K/38.9K]> Fold3
## 4 <split [155.5K/38.9K]> Fold4
## 5 <split [155.5K/38.9K]> Fold5
Now, we are going to use a unifying framework to train our models. For this tutorial, we will work with the tidymodels framework for machine learning. tidymodels is a "meta-package" that incorporates a number of other packages that make the machine learning pipeline consistent across many commonly-used machine learning algorithms.
We will work with the same models that we have seen already in this tutorial: logistic regression, lasso, and decision trees.
First, we initialize the workflow by assigning the preprocessing recipe that we created earlier.
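That chunk is not shown; a sketch using the workflows package (the name mobilize_workflow is inferred from its use later in the tutorial):

mobilize_workflow <- workflows::workflow() %>%
  workflows::add_recipe(mobilize_rec)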
Next, we declare the three models that we want to train using the parsnip package. We will use the glm engine for logistic regression, the glmnet package for lasso, and the rpart package for decision trees, declaring each as the "engine" for its specification. For all three models, the mode is classification.
# Logistic Regression
glm_spec <- parsnip::logistic_reg() %>%
parsnip::set_engine("glm") %>%
parsnip::set_mode("classification")
# Lasso
lasso_spec <- parsnip::logistic_reg(
penalty = 0.01,
mixture = 1
) %>%
parsnip::set_engine("glmnet") %>%
parsnip::set_mode("classification")
# Decision Tree
tree_spec <- parsnip::decision_tree(cost_complexity = 0.001) %>%
parsnip::set_engine("rpart") %>%
parsnip::set_mode("classification")

Next, run the cross-validation! In order to do so, we are using the fit_resamples function from the tune package. This allows us to extract the performance metrics for each fold of each model.
Performance metrics are the measures of goodness of fit that we previously mentioned. Deciding on the performance metric is a key decision in the machine learning pipeline. It is something that we will come back to a number of times throughout the course. In our setup, we will evaluate model performance using two metrics: accuracy and the area under the ROC curve (AUC).
The true value of cross validation comes into play when we are evaluating performance metrics. In effect, what we are doing (and what happens under the hood in the following code) is to train each model on 4 folds of the data, calculate the performance metrics on those 4 folds, and then calculate the error on the validation fold. We don't want to over-interpret the performance metrics from the training folds; instead, the validation fold gives us a more reliable estimate of out-of-sample performance.
# Set the seed and run the full code chunk
set.seed(202004)
# Cross validate logistic regression model
glm_cv <- tune::fit_resamples(
mobilize_rec,
glm_spec,
mobilize_cv_splits,
metrics = yardstick::metric_set(accuracy, roc_auc),
control = tune::control_resamples(save_pred = TRUE)
)
# Cross validate lasso model
lasso_cv <- tune::fit_resamples(
mobilize_rec,
lasso_spec,
mobilize_cv_splits,
metrics = yardstick::metric_set(accuracy, roc_auc),
control = tune::control_resamples(save_pred = TRUE)
)
# Cross validate decision tree model
tree_cv <- tune::fit_resamples(
mobilize_rec,
tree_spec,
mobilize_cv_splits,
metrics = yardstick::metric_set(accuracy, roc_auc),
control = tune::control_resamples(save_pred = TRUE)
)

Let's check the average accuracy and AUC (area under the curve) metrics from the cross-validated models.
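The chunk that produced the three tables below is not shown; collect_metrics from the tune package is the standard way to retrieve them, one call per model:

tune::collect_metrics(glm_cv)
tune::collect_metrics(lasso_cv)
tune::collect_metrics(tree_cv)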
## # A tibble: 2 x 5
## .metric .estimator mean n std_err
## <chr> <chr> <dbl> <int> <dbl>
## 1 accuracy binary 0.737 5 0.00166
## 2 roc_auc binary 0.800 5 0.00144
## # A tibble: 2 x 5
## .metric .estimator mean n std_err
## <chr> <chr> <dbl> <int> <dbl>
## 1 accuracy binary 0.732 5 0.00210
## 2 roc_auc binary 0.793 5 0.00149
## # A tibble: 2 x 5
## .metric .estimator mean n std_err
## <chr> <chr> <dbl> <int> <dbl>
## 1 accuracy binary 0.736 5 0.00183
## 2 roc_auc binary 0.754 5 0.00577
As we can see above, accuracy and AUC are fairly similar across the three models. Note that for the lasso and tree models, we did not do any model tuning. Let’s see if we can improve performance using model tuning.
Let's go back to the lasso model. Instead of setting the penalty parameter at a fixed value, we can use a data-driven approach to selecting the preferred value of our parameters. This is where model tuning comes into play.
Tuning involves supplying a list of values, or a "grid", and training the models with each of the values. Important to notice is that for each value (or set of values, if tuning multiple parameters), we implement cross validation.
With our same setup, let's tune the penalty parameter. Note that we could also tune the mixture parameter if we wanted to use the full elastic net regularization that is possible with the glmnet package.
Note: the following code takes some time to run; we set up parallel processing to help cut down on the required time.
# Set the seed and run the full code chunk
set.seed(202004)
# Set up parallel processing
all_cores <- parallel::detectCores(logical = FALSE)
doFuture::registerDoFuture()
cl <- parallel::makeCluster(all_cores)
future::plan("cluster", workers = cl)
# tune penalty parameter
tune_lasso <- parsnip::logistic_reg(
penalty = tune(),
mixture = 1
) %>%
set_engine("glmnet") %>%
set_mode("classification")
# Hyperparameter grid
lambda_grid <- dials::grid_regular(penalty(), levels = 50)
lasso_tuned <- tune::tune_grid(
mobilize_workflow %>% add_model(tune_lasso),
resamples = mobilize_cv_splits,
grid = lambda_grid,
metrics = yardstick::metric_set(accuracy, roc_auc)
)

Take a look at the penalty values with the highest accuracy values.
# Best tuned models for accuracy
tune::show_best(lasso_tuned, "accuracy") %>%
select(penalty, .metric, mean, std_err)

## # A tibble: 5 x 4
## penalty .metric mean std_err
## <dbl> <chr> <dbl> <dbl>
## 1 2.12e- 4 accuracy 0.737 0.00169
## 2 3.39e- 4 accuracy 0.737 0.00166
## 3 1.00e-10 accuracy 0.737 0.00168
## 4 1.60e-10 accuracy 0.737 0.00168
## 5 2.56e-10 accuracy 0.737 0.00168
And the highest AUC values.
# Best tuned models for AUC
tune::show_best(lasso_tuned, "roc_auc") %>%
select(penalty, .metric, mean, std_err)

## # A tibble: 5 x 4
## penalty .metric mean std_err
## <dbl> <chr> <dbl> <dbl>
## 1 1.00e-10 roc_auc 0.800 0.00145
## 2 1.60e-10 roc_auc 0.800 0.00145
## 3 2.56e-10 roc_auc 0.800 0.00145
## 4 4.09e-10 roc_auc 0.800 0.00145
## 5 6.55e-10 roc_auc 0.800 0.00145
Let’s look at the full path of the performance metrics as the penalty term varied.
# Plot metrics
tune::collect_metrics(lasso_tuned) %>%
ggplot(aes(x = penalty, y = mean, color = .metric)) +
geom_line() +
scale_color_colorblind() +
theme(legend.position = "top") +
labs(color = "Performance Metric")

We find that tuning provides some improvement in the performance of the lasso model. Given that our models have only a few covariates, the models with very low penalty terms tend to perform better. From the figure above, we see that accuracy levels off at penalty values below 0.15. However, the AUC metric continues to increase until the non-penalized model is reached.
We will leave it as an exercise for you to explore more complex models with interaction terms and polynomial terms. To implement these data transformations, explore recipes::step_interact for interactions between covariates, and recipes::step_bs and recipes::step_poly for adding basis functions or polynomial terms.
Now that we've tuned our model, we can extract the best model and apply it to the full training data.
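The selection step is not shown; the highest_accuracy object used below can presumably be obtained with select_best from the tune package:

# select the penalty value with the best cross-validated accuracy
highest_accuracy <- tune::select_best(lasso_tuned, "accuracy")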
output_lasso <- tune::finalize_workflow(
mobilize_workflow %>% add_model(tune_lasso),
highest_accuracy
)

lasso_fitted <- output_lasso %>%
# fit the final model on the full training set
fit(data = mobilize_train)

Take a look at the coefficients from the best model.
lasso_coefs <- lasso_fitted$fit$fit$fit %>%
broom::tidy() %>%
# filter out the intercept term
filter(term != "(Intercept)") %>%
select(-step, -dev.ratio)
delta <- abs(lasso_coefs$lambda - highest_accuracy$penalty)
lambda_opt <- lasso_coefs$lambda[which.min(delta)]
labelled_coefs <- lasso_coefs %>%
mutate(abs_estimate = abs(estimate)) %>%
filter(abs_estimate >= 0.01) %>%
distinct(term) %>%
inner_join(lasso_coefs, by = "term") %>%
filter(lambda == lambda_opt) %>%
select(term, estimate)
labelled_coefs

## # A tibble: 33 x 2
## term estimate
## <chr> <dbl>
## 1 vote00 1.79
## 2 vote98 1.11
## 3 age 0.0815
## 4 persons 0.237
## 5 st_sen_X3 0.105
## 6 st_sen_X8 -0.327
## 7 st_hse_X19 0.184
## 8 st_hse_X5 -0.717
## 9 st_hse_X16 -0.628
## 10 st_hse_X18 -0.419
## # … with 23 more rows
Now it's your turn: try out a few different values for the penalty term. How do the variables and coefficients from the lasso model change as you change penalty.val? Try some values close to 0.1 and some values below 0.01. Provide a 3-5 sentence comment below.
##############################
# Change the penalty value
penalty.val <- 0.1
##############################
check_lasso <- tune::finalize_workflow(
mobilize_workflow %>% add_model(tune_lasso),
list(penalty = penalty.val)
)
lasso_fitted <- check_lasso %>%
# fit the final model on the full training set
fit(data = mobilize_train)
lasso_coefs <- lasso_fitted$fit$fit$fit %>%
broom::tidy() %>%
# filter out the intercept term
filter(term != "(Intercept)") %>%
select(-step, -dev.ratio)
delta <- abs(lasso_coefs$lambda - penalty.val)
lambda_opt <- lasso_coefs$lambda[which.min(delta)]
labelled_coefs <- lasso_coefs %>%
distinct(term) %>%
inner_join(lasso_coefs, by = "term") %>%
filter(lambda == lambda_opt) %>%
select(term, estimate)
labelled_coefs

## # A tibble: 2 x 2
## term estimate
## <chr> <dbl>
## 1 vote00 1.06
## 2 age 0.000869
##############################
# Change the penalty value
penalty.val <- 0.09
##############################
check_lasso <- tune::finalize_workflow(
mobilize_workflow %>% add_model(tune_lasso),
list(penalty = penalty.val)
)
lasso_fitted <- check_lasso %>%
# fit the final model on the full training set
fit(data = mobilize_train)
lasso_coefs <- lasso_fitted$fit$fit$fit %>%
broom::tidy() %>%
# filter out the intercept term
filter(term != "(Intercept)") %>%
select(-step, -dev.ratio)
delta <- abs(lasso_coefs$lambda - penalty.val)
lambda_opt <- lasso_coefs$lambda[which.min(delta)]
labelled_coefs_1 <- lasso_coefs %>%
distinct(term) %>%
inner_join(lasso_coefs, by = "term") %>%
filter(lambda == lambda_opt) %>%
select(term, estimate)
labelled_coefs_1

## # A tibble: 2 x 2
## term estimate
## <chr> <dbl>
## 1 vote00 1.12
## 2 age 0.00266
##############################
# Change the penalty value
penalty.val <- 0.05
##############################
check_lasso <- tune::finalize_workflow(
mobilize_workflow %>% add_model(tune_lasso),
list(penalty = penalty.val)
)
lasso_fitted <- check_lasso %>%
# fit the final model on the full training set
fit(data = mobilize_train)
lasso_coefs <- lasso_fitted$fit$fit$fit %>%
broom::tidy() %>%
# filter out the intercept term
filter(term != "(Intercept)") %>%
select(-step, -dev.ratio)
delta <- abs(lasso_coefs$lambda - penalty.val)
lambda_opt <- lasso_coefs$lambda[which.min(delta)]
labelled_coefs_2 <- lasso_coefs %>%
distinct(term) %>%
inner_join(lasso_coefs, by = "term") %>%
filter(lambda == lambda_opt) %>%
select(term, estimate)
labelled_coefs_2

## # A tibble: 3 x 2
## term estimate
## <chr> <dbl>
## 1 vote00 1.39
## 2 age 0.0102
## 3 vote98 0.285
##############################
# Change the penalty value
penalty.val <- 0.009
##############################
check_lasso <- tune::finalize_workflow(
mobilize_workflow %>% add_model(tune_lasso),
list(penalty = penalty.val)
)
lasso_fitted <- check_lasso %>%
# fit the final model on the full training set
fit(data = mobilize_train)
lasso_coefs <- lasso_fitted$fit$fit$fit %>%
broom::tidy() %>%
# filter out the intercept term
filter(term != "(Intercept)") %>%
select(-step, -dev.ratio)
delta <- abs(lasso_coefs$lambda - penalty.val)
lambda_opt <- lasso_coefs$lambda[which.min(delta)]
labelled_coefs_3 <- lasso_coefs %>%
distinct(term) %>%
inner_join(lasso_coefs, by = "term") %>%
filter(lambda == lambda_opt) %>%
select(term, estimate)
labelled_coefs_3

## # A tibble: 15 x 2
## term estimate
## <chr> <dbl>
## 1 vote00 1.72
## 2 age 0.0201
## 3 vote98 0.829
## 4 persons 0.155
## 5 st_sen_X3 0.112
## 6 st_hse_X19 0.140
## 7 st_sen_X8 -0.115
## 8 st_hse_X5 -0.219
## 9 st_hse_X16 -0.131
## 10 st_hse_X18 -0.0974
## 11 newreg 0.0544
## 12 st_hse_X8 0.0494
## 13 st_hse_X14 -0.0460
## 14 st_hse_X9 0.0229
## 15 st_sen_X7 -0.0138
[DISCUSS HERE]
Finally, let's apply the best model from our model tuning to our holdout set. We can use the last_fit function from the tune package to carry out this step.
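A sketch of this step (the original chunk is not shown):

# fit the finalized workflow on the training data, evaluate on the holdout
tune::last_fit(output_lasso, mobilize_split,
  metrics = yardstick::metric_set(accuracy, roc_auc)
) %>%
  tune::collect_metrics()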
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.738
## 2 roc_auc binary 0.797
Let's say that you are contacted by a non-partisan organization that is seeking to identify voters that are less likely to vote in the 2020 US federal general election. The organization wants to focus its resources on encouraging unlikely voters to go to the polls in November. Based on the analysis above, what recommendations would you provide? What additional data would you recommend that the organization collect to accurately predict whether a person is likely to vote?
[DISCUSS HERE] Based on the analysis above, I would recommend focusing on the population younger than 50 years old, whose likelihood of voting is smaller (the younger, the better). In addition, I would recommend focusing on people who have not voted in previous elections. I would also recommend collecting information on education, work status, income level, and whether the election falls on a workday, and controlling for whether the person lives in a swing state.
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] vroom_1.2.0 doFuture_0.9.0 iterators_1.0.12 foreach_1.4.8
## [5] future_1.16.0 globals_0.12.5 patchwork_1.0.0 ggthemes_4.2.0
## [9] hrbrthemes_0.8.0 rpart_4.1-15 plotmo_3.5.6 TeachingDemos_2.10
## [13] plotrix_3.7-7 Formula_1.2-3 glmnet_3.0-2 Matrix_1.2-18
## [17] yardstick_0.0.6 workflows_0.1.1 tune_0.0.1 rsample_0.0.5
## [21] recipes_0.1.10 parsnip_0.0.5 infer_0.5.1 dials_0.0.4
## [25] scales_1.1.0 broom_0.5.5 tidymodels_0.1.0 sandwich_2.5-1
## [29] lmtest_0.9-37 zoo_1.8-7 margins_0.3.23 ggcorrplot_0.1.3
## [33] corrr_0.4.2 stargazer_5.2.2 glue_1.3.2 here_0.1
## [37] forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5 purrr_0.3.3
## [41] readr_1.3.1 tidyr_1.0.2 tibble_2.1.3 ggplot2_3.3.0
## [45] tidyverse_1.3.0 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] utf8_1.1.4 tidyselect_1.0.0 lme4_1.1-21
## [4] htmlwidgets_1.5.1 grid_3.6.3 pROC_1.16.1
## [7] munsell_0.5.0 codetools_0.2-16 DT_0.12
## [10] miniUI_0.1.1.1 withr_2.1.2 colorspace_1.4-1
## [13] highr_0.8 knitr_1.28 rstudioapi_0.11
## [16] stats4_3.6.3 Rttf2pt1_1.3.8 bayesplot_1.7.1
## [19] listenv_0.8.0 labeling_0.3 rstan_2.19.3
## [22] farver_2.0.3 bit64_0.9-7 DiceDesign_1.8-1
## [25] rprojroot_1.3-2 vctrs_0.2.4 generics_0.0.2
## [28] ipred_0.9-9 xfun_0.12 R6_2.4.1
## [31] markdown_1.1 rstanarm_2.19.3 lhs_1.0.1
## [34] assertthat_0.2.1 promises_1.1.0 nnet_7.3-13
## [37] gtable_0.3.0 processx_3.4.2 timeDate_3043.102
## [40] rlang_0.4.5 systemfonts_0.1.1 splines_3.6.3
## [43] extrafontdb_1.0 inline_0.3.15 yaml_2.2.1
## [46] reshape2_1.4.3 prediction_0.3.14 modelr_0.1.6
## [49] tidytext_0.2.3 threejs_0.3.3 crosstalk_1.1.0.1
## [52] backports_1.1.5 httpuv_1.5.2 rsconnect_0.8.16
## [55] tokenizers_0.2.1 extrafont_0.17 tools_3.6.3
## [58] lava_1.6.7 kableExtra_1.1.0 ellipsis_0.3.0
## [61] ggridges_0.5.2 Rcpp_1.0.4 plyr_1.8.6
## [64] base64enc_0.1-3 ps_1.3.2 prettyunits_1.1.1
## [67] haven_2.2.0 fs_1.3.2 furrr_0.1.0
## [70] magrittr_1.5 data.table_1.12.8 colourpicker_1.0
## [73] reprex_0.3.0 GPfit_1.0-8 SnowballC_0.6.0
## [76] matrixStats_0.56.0 tidyposterior_0.0.2 hms_0.5.3
## [79] shinyjs_1.1 mime_0.9 evaluate_0.14
## [82] xtable_1.8-4 tidypredict_0.4.5 shinystan_2.5.0
## [85] readxl_1.3.1 gridExtra_2.3 shape_1.4.4
## [88] rstantools_2.0.0 compiler_3.6.3 rpart.plot_3.0.8
## [91] crayon_1.3.4 minqa_1.2.4 StanHeaders_2.19.2
## [94] htmltools_0.4.0 later_1.0.0 lubridate_1.7.4
## [97] DBI_1.1.0 dbplyr_1.4.2 MASS_7.3-51.5
## [100] boot_1.3-24 cli_2.0.2 gower_0.2.1
## [103] igraph_1.2.4.2 pkgconfig_2.0.3 xml2_1.2.5
## [106] dygraphs_1.1.1.6 hardhat_0.1.2 webshot_0.5.2
## [109] prodlim_2019.11.13 rvest_0.3.5 janeaustenr_0.1.5
## [112] callr_3.4.2 digest_0.6.25 rmarkdown_2.1
## [115] cellranger_1.1.0 gdtools_0.2.1 shiny_1.4.0.2
## [118] gtools_3.8.1 nloptr_1.2.2.1 lifecycle_0.2.0
## [121] nlme_3.1-144 jsonlite_1.6.1 viridisLite_0.3.0
## [124] fansi_0.4.1 pillar_1.4.3 lattice_0.20-40
## [127] loo_2.2.0 fastmap_1.0.1 httr_1.4.1
## [130] pkgbuild_1.0.6 survival_3.1-8 xts_0.12-0
## [133] shinythemes_1.1.2 bit_1.1-15.2 class_7.3-15
## [136] stringi_1.4.6
This tutorial was originally developed by Susan Athey, Niall Keleher, and Eray Turkel for the Spring 2020 course, ALP301 Data-Driven Impact. We are grateful to Grant R. McDermott, Ed Rubin, Matthew Taddy, Kieran Healy, Garrett Grolemund and Hadley Wickham, and Bruno Rodrigues and Julia Silge. This tutorial draws from the examples that these colleagues made publicly available.