Learning Objective:

You will be able to apply a common set of tools for data analysis in the course and interpret the analysis through clear written and visual output.1

Introduction

In this tutorial, we will cover reading data into R, producing descriptive statistics, visualizing data, and running statistical models. Several of the topics covered in this tutorial should be review; some topics, materials, and methods may be new. Please read the tutorial and examine the code closely to get a sense of the steps that are covered.

We have structured this RMarkdown document such that the code chunks are ready for you to run. You can run individual code chunks by clicking on the green right-facing arrow or by using the Ctrl+Shift+Enter key combination. For more information on RMarkdown documents, see the RMarkdown chapter from R for Data Science.

Interspersed through the tutorial, we have included sections where your input is required. These are in the form of comprehension checks and assignment questions.


Comprehension Check

Here is where we will ask you to interpret or reflect on the data, analysis, or models used.

[DISCUSS HERE]


R Packages

To promote collaborative, reproducible work we will be following a set of guidelines and using a prescribed set of R packages throughout the course. We will make extensive use of the tidyverse, “a set of packages that work in harmony because they share common data representations and API design.” Check out the Tidyverse packages. And for an excellent introduction to R, the tidyverse, and data science, we recommend R for Data Science.

The following is an RMarkdown “code chunk” to load the tidyverse package (sometimes called a library).
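A minimal version of such a chunk might look like this (assuming the tidyverse is already installed):

```r
# Load the tidyverse family of packages
# (install first with install.packages("tidyverse") if needed)
library(tidyverse)
```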

Coding Style

From the start, we want to establish a set of common good practices in writing interpretable, well-commented code in R. If you have not done so already, please review the tidyverse style guide by Hadley Wickham (A shorter, similar style guide can be found at Google’s R Style Guide).

Data

The data we will be working with come from Kevin Arceneaux, Alan S. Gerber, and Donald P. Green's (2006) paper "Comparing Experimental and Matching Methods Using a Large-Scale Voter Mobilization Experiment" (see article). These data come from a randomized experiment conducted in Iowa and Michigan before the 2002 US midterm elections. Households with one or two registered voters were randomly assigned to treatment and control groups. The treatment involved calling the selected individuals by phone for a get-out-the-vote (GOTV) campaign. The study involved over 1.9 million households (1,845,348 in the control group and 59,972 in the group that was assigned to receive the GOTV phone call).

The data that we will use includes information about voting records in elections prior to 2002 as well as several demographic and location variables.

This tutorial assumes that the data is in a folder named "data", with a .here file located at the project root. We will use the here package (see Practice safe paths from Jennifer Bryan and Jim Hester's book) to assist with finding files. We encourage you to work within an RStudio Project (see Project-oriented workflows).

Let’s read in the data:

Take a look at the data. The glimpse function from the tibble package is a handy way to check the data types of variables and get a sense of the values that one is working with.
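For example:

```r
# Inspect column types and example values
glimpse(gotv_df)
```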

## Observations: 2,474,927
## Variables: 29
## $ persons         <dbl> 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,…
## $ competiv        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ vote00          <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ treatment       <dbl> -1, 0, 0, -1, 0, -1, 0, 0, 0, -1, -1, 0, 0, -1, 0, 0,…
## $ county          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ st_sen          <dbl> 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 2…
## $ st_hse          <dbl> 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 58, 5…
## $ vote98          <dbl> 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ newreg          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pid_maj         <dbl> 0, 0, 1, 2, 2, 2, 0, 2, 2, 2, 1, 0, 0, 2, 2, 2, 2, 0,…
## $ age             <dbl> 44, 39, 82, 73, 44, 75, 78, 74, 46, 87, 59, 78, 42, 4…
## $ female          <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ vote02          <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0,…
## $ contact         <dbl> NA, 0, 0, NA, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0,…
## $ pseudo_c        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ treat_real      <dbl> NA, 0, 0, NA, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0,…
## $ treat_pseudo    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ state           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ comp_mi         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ comp_ia         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fem_miss        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ female2         <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ lc              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ hc              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pseudo_portion  <dbl> 0, NA, NA, 0, NA, 0, NA, NA, NA, 1, 0, NA, NA, 0, NA,…
## $ W               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ female_decoded  <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,…
## $ pid_maj_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ d_unlisted      <dbl> 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1,…

As you can see, the data include 2,474,927 observations and 29 variables.

Can you start to get a sense of what each variable represents?

Our data is unlabelled, so to help understand some of the variables, here’s a list of some key variables that we will work with:

Variable    Description
---------   ---------------------------------------------------------------
age         Age
persons     Number of registered voters in household
newreg      Newly registered voter
female      Indicator variable where \(= 1\) indicates female
st_sen      Index of state senate district
st_hse      Index of state house district
vote00      Indicator variable where \(= 1\) indicates voted in the 2000 general election
vote98      Indicator variable where \(= 1\) indicates voted in the 1998 midterm election
county      County index
state       US State indicator variable where \(= 0\) indicates Michigan and \(= 1\) indicates Iowa
comp_mi     Indicator of whether the congressional district was classified as competitive in Michigan
comp_ia     Indicator of whether the congressional district was classified as competitive in Iowa
vote02      Indicator variable where \(= 1\) indicates voted in the 2002 midterm election
treatment   Indicator variable specifying the treatment in the experiment
contact     Indicator of whether the person was contacted (NA indicates an observation with an unlisted phone number)

For this tutorial, we will focus our analysis on the control group, i.e. observations from people who did not receive the GOTV campaign messaging. We will, thus, put aside the observations from people who were included in the campaign. Additionally, we are going to restrict our analysis to the subset of variables listed above. For computational reasons, we will restrict our analysis to the largest county.
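A sketch of that subsetting step is below. The coding of the treatment variable for control observations (taken here to be treatment == 0) and the way of identifying the largest county are our assumptions; check them against the data.

```r
# Identify the county with the most observations
largest_county <- gotv_df %>%
  count(county, sort = TRUE) %>%
  slice(1) %>%
  pull(county)

# Keep control-group observations in that county, plus the variables above.
# We assume control observations are coded treatment == 0.
ctrl_df <- gotv_df %>%
  filter(treatment == 0, county == largest_county) %>%
  select(vote02, age, persons, newreg, vote00, vote98, female, st_sen, st_hse)
```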

Descriptive Statistics

To get a quick understanding of the variables, it’s often useful to summarize the data. One way to do so is the summary function.
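For example:

```r
# Quartiles, min/max, and means for every selected variable
summary(ctrl_df)
```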

##      vote02            age            persons          newreg      
##  Min.   :0.0000   Min.   : 18.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 37.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :1.0000   Median : 49.00   Median :1.000   Median :0.0000  
##  Mean   :0.5268   Mean   : 52.08   Mean   :1.423   Mean   :0.1221  
##  3rd Qu.:1.0000   3rd Qu.: 67.00   3rd Qu.:2.000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :113.00   Max.   :2.000   Max.   :1.0000  
##      vote00           vote98           female           st_sen     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000  
##  Median :1.0000   Median :0.0000   Median :1.0000   Median :6.000  
##  Mean   :0.5382   Mean   :0.1657   Mean   :0.5685   Mean   :5.144  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:7.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :8.000  
##      st_hse     
##  Min.   : 1.00  
##  1st Qu.: 8.00  
##  Median :15.00  
##  Mean   :13.57  
##  3rd Qu.:19.00  
##  Max.   :23.00

That’s a bit overwhelming when we have lots of variables. Instead, we can get the mean and standard deviation of individual variables.

## Mean household size:1.4232
## Standard deviation:0.4941

We can also use the stargazer function to produce easy-to-read summary statistics tables.
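Something like the following would produce the table below (stargazer expects a plain data.frame rather than a tibble):

```r
library(stargazer)

# Text-format summary table; stargazer's defaults include N, mean, sd,
# min/max, and the 25th/75th percentiles
stargazer(as.data.frame(ctrl_df), type = "text")
```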

## 
## ===========================================================
## Statistic    N     Mean  St. Dev. Min Pctl(25) Pctl(75) Max
## -----------------------------------------------------------
## vote02    243,042 0.527   0.499    0     0        1      1 
## age       243,042 52.082  18.573  18     37       67    113
## persons   243,042 1.423   0.494    1     1        2      2 
## newreg    243,042 0.122   0.327    0     0        0      1 
## vote00    243,042 0.538   0.499    0     0        1      1 
## vote98    243,042 0.166   0.372    0     0        0      1 
## female    243,042 0.568   0.495    0     0        1      1 
## st_sen    243,042 5.144   2.265    1     3        7      8 
## st_hse    243,042 13.566  6.545    1     8        19    23 
## -----------------------------------------------------------

Data Visualization

A good way to get to know the data is to produce some data visualizations. Let's take a look at a few of the variables. First, we'll look at univariate visualizations.

Relying on the tidyverse, let’s produce a histogram of age in 2002 (age) with ggplot2.
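A sketch of the plotting chunk:

```r
# Histogram of age; a one-year binwidth is a natural choice here
ggplot(ctrl_df, aes(x = age)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Age in 2002", y = "Count")
```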

We can compare the distribution of age among those that voted in the 2002 midterm election by gender.
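One way to draw that comparison (the overlaid-histogram design is our choice; facets would work as well):

```r
# Age distribution of 2002 voters, by gender
ctrl_df %>%
  filter(vote02 == 1) %>%
  ggplot(aes(x = age, fill = factor(female))) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5) +
  labs(x = "Age in 2002", y = "Count", fill = "Female")
```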


T-tests

Let’s formally test for differences in age. We want to conduct a hypothesis test of whether the age of those who voted in the 2002 midterm election are older than those that did not vote.

Formally, what we are doing is setting up a null hypothesis, \(H_{0}\), under which the mean ages of voters and non-voters are equivalent. Our alternative hypothesis, \(H_{1}\), is that the mean ages of the two populations are not equal.

  • \(H_{0}: \mu_{1} = \mu_{2}\)
  • \(H_{1}: \mu_{1} \neq \mu_{2}\)

We will use Welch’s t-test, under which we assume that the data from the two groups follow a normal distribution. However, we do not require that the data from the two populations have the same variance.

Welch’s t-test:

  • Test statistic: \(t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}}\)
  • Degrees of Freedom (DOF): \(\frac{\left(\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}\right)^2}{\frac{\left(s_{1}^2/n_{1}\right)^2}{n_{1} - 1} + \frac{\left(s_{2}^2/n_{2}\right)^2}{n_{2} - 1}}\)

First, we will test if the mean age of people who voted is different from the mean age of people that did not vote.
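In R this is a single call to t.test, which applies the Welch correction by default (var.equal = FALSE). Judging from the -Inf lower confidence bound in the table below, the reported test is one-sided; a sketch:

```r
# Mean age by 2002 vote status; alternative = "less" tests whether the
# first group (vote02 == 0, non-voters) is younger on average.
# For the by-gender versions below, filter to female == 1 or female == 0 first.
t.test(age ~ vote02, data = ctrl_df, alternative = "less")
```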

Didn’t Vote Voted Difference Low High p.value T-Stat DOF
47.36 56.33 -8.97 -Inf -8.85 0 -122.05 236481.3

We reject the null hypothesis that the mean age of those that voted in the 2002 midterm election is equal to the mean age of non-voters. We find that voters are, on average, older than non-voters.

Now let’s test for equality of mean voting rates by gender. This time we will use a one-sided test, where our hypothesis is that women voters in the 2002 midterm election were older than women non-voters.

Difference in Mean Age (Women), Welch Two Sample One-sided t-test
(columns: Mean Age; 95% Confidence Interval; Test Statistics)

Didn't Vote   Voted   Difference   Low    High    p.value   T-Stat   DOF
48.13         57.55   -9.43        -Inf   -9.26   0         -93.25   133797.4

Similarly, for men:

Difference in Mean Age (Men), Welch Two Sample One-sided t-test
(columns: Mean Age; 95% Confidence Interval; Test Statistics)

Didn't Vote   Voted   Difference   Low    High   p.value   T-Stat   DOF
46.34         54.71   -8.37        -Inf   -8.2   0         -79.37   102628.6

Comprehension Check

Using the code above as a template, test the hypothesis that women voted at a higher rate, on average, than men in the 2002 midterm election. (Hint: use a one-sided test).

Difference in Mean between Males and Females, Welch Two Sample One-sided t-test
(columns: Voting Proportion; 95% Confidence Interval; Test Statistics)

Male   Female   Difference   Low    High   p.value   T-Stat   DOF
0.53   0.53     0            -Inf   0.01   0.85      1.05     225814.8

We don’t have enough evidence to reject the null hypothesis that the mean of the proportion of women voters in the 2002 midterm election is higher to the mean age of the proportion of men.


Correlation

Next, we want to explore linear correlation across the variables.

We can visualize the correlation coefficients with a correlation plot using the ggcorrplot function in tandem with the cor function that produces a correlation matrix.
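A sketch of that chunk:

```r
library(ggcorrplot)

# Pearson correlations across the selected variables
corr_mat <- cor(ctrl_df)
ggcorrplot(corr_mat, type = "lower", lab = TRUE)
```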

As we can see, there are a few variables that are highly correlated. When we come to the statistical modeling section of the tutorial, high correlation between variables is something that we want to be cognizant of.

What variables correlate with whether a person voted in the 2002 midterm election?

Correlation with vote02

variable   corr_coef
vote00     0.4553
age        0.2410
vote98     0.2342
newreg     0.1197
persons    0.1193
st_hse     0.0093
st_sen     0.0064
female     0.0021

Comprehension Check

In 2-3 sentences, discuss the correlation coefficients observed in the table above. Do they match your intuition?

[DISCUSS HERE] Not really. In particular, I expected a higher correlation with vote00 and with newreg: if you voted previously, you may be used to doing it and want to continue the habit, and if you are newly registered, you may want to start the habit. I know that a high correlation with both of these variables might be hard to observe, because one implies that the other is not occurring, so the combined effect is not clear.

For age, it is hard for me to know whether this matches what I expected, since I don't think the relationship is linear but rather parabolic: at a young age you are likely to be less engaged with voting, as you get older you become more concerned with voting, and then you lose interest as you approach your last years. However, if the election is on a weekday and you work, then the relationship might be the inverse of what I assumed.

I didn't expect much correlation with the number of registered voters in the household: if the other registered voters don't vote, that might push you not to vote as well, while if they do vote, they might push you to vote.


Linear Regression

Let’s now move to linear regression to check for correlation between the voting outcome and covariates. First, we’ll use the lm function from the stats package. Here is a bivariate regression where we regress an indicator of a person’s gender on whether or not that person voted in the 2002 midterm election.

The underlying model that we are using is:

\(Y_{i} = \beta_{0} + \beta_{1} X_{i} + u_{i}\)

where \(Y_{i}\) is our outcome of interest, “Voted in 2002 midterm election” and \(\beta_{1}\), the coefficient of interest, gives us a measure of how much \(X\) and \(Y\) covary, adjusting for the variance of \(X\) in our population. We take a data-driven approach to estimating that relationship.
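A sketch of the estimation chunk (tidy from the broom package arranges the output as in the table below):

```r
# Bivariate linear probability model of 2002 turnout on age
fit_lm <- lm(vote02 ~ age, data = ctrl_df)
broom::tidy(fit_lm)
```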

term          estimate   std.error   statistic   p.value
(Intercept)   0.1893     0.0029      64.7001     0
age           0.0065     0.0001      122.4382    0

Based on this bivariate regression, we estimate that for each additional year of age, an individual is 0.65 percentage points more likely to have voted in the 2002 election. This is consistent with what we observed in the t-tests.

We can move to a more general linear regression model, the multivariate regression.

\(Y = X\beta + u\)

where \(X\) is a matrix of variables (or covariates) and \(\beta\) is a vector of coefficients.

Building on our previous regression, let’s add gender to the regression.

Usually, we will be concerned about heteroskedasticity in our data. That is, the variance of the error term may depend on the covariates. To adjust for heteroskedasticity, there are a number of ways to calculate heteroskedasticity-robust standard errors. Here, we'll use the lmtest package in tandem with the sandwich package to report heteroskedasticity-robust standard errors.
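A sketch, with the choice of HC variant (HC1 here) as our assumption:

```r
library(lmtest)
library(sandwich)

fit_lm2 <- lm(vote02 ~ age + female, data = ctrl_df)

# Robust standard errors: vcovHC replaces the homoskedastic covariance matrix
coeftest(fit_lm2, vcov = vcovHC(fit_lm2, type = "HC1"))
```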

term          estimate   std.error   statistic   p.value
(Intercept)   0.1977     0.0031      63.8366     0
age           0.0065     0.0001      121.2115    0
female        -0.0173    0.0020      -8.6853     0

Logistic Regression

Because our outcome is a binary variable, we can approach the modeling as a classification problem. As such, let’s switch to logistic regression.
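A sketch of the chunk:

```r
# Logistic regression: binomial family with the default logit link
fit_glm <- glm(vote02 ~ age + female, family = binomial(), data = ctrl_df)
broom::tidy(fit_glm)
```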

term          estimate   std.error   statistic   p.value
(Intercept)   -1.2683    0.0136      -92.9639    0
age           0.0274     0.0002      110.9083    0
female        -0.0721    0.0084      -8.5323     0

Recall that the coefficients from the logistic regression model are not interpreted in the same way as coefficients from the linear model. Whereas coefficients from the lm model are interpreted as the expected change in Y for a unit change in X, the coefficients from the glm model are the change in the "log odds" of Y as X changes.

We can get an analogous average marginal effect from the glm coefficients using the margins function. You can confirm that the values for the average marginal effect resemble those from the linear regression.
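For example:

```r
library(margins)

# Average marginal effects of age and female on the probability of voting
margins(fit_glm)
```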

##      age   female
##  0.00643 -0.01691

We’ve seen that age is correlated with whether or not a person voted in the 2002 midterm election. However, it’s likely that a person’s propensity to vote diminishes at some point with age. So, we’ll include a quadratic term for age.

term          estimate   std.error   statistic   p.value
(Intercept)   -3.9097    0.0390      -100.1602   0e+00
age           0.1332     0.0015      90.8825     0e+00
I(age^2)      -0.0010    0.0000      -74.0238    0e+00
female        -0.0301    0.0086      -3.5056     5e-04

We see from the table above and the following figure that the fitted values of the likelihood that a person voted in 2002 follow an "inverted U" shape as age increases.

term              estimate   std.error   statistic   p.value
(Intercept)       -3.8691    0.0604      -64.0806    0.0000
female            -0.0992    0.0786      -1.2625     0.2068
age               0.1317     0.0023      56.9492     0.0000
I(age^2)          -0.0009    0.0000      -45.6899    0.0000
female:age        0.0025     0.0030      0.8491      0.3958
female:I(age^2)   0.0000     0.0000      -0.7912     0.4288

We can see from the regression table and the figure that older people were observed to vote at a higher rate. However, we do not observe a difference in voting rates between women and men.


Comprehension Check

We have described the relationship between covariates and the outcome as correlation. We want to be cautious about interpreting the relationships as causal. One primary reason is that there are likely variables omitted from our model that are correlated with the likelihood of voting and the covariates included in the regression above. In 3-5 sentences, interpret the table and describe some of the factors that may be omitted from these simple regressions that we want to be concerned about when interpreting the estimates from the regressions above.

[DISCUSS HERE]

In the table above (considering the logit model with the most regressors) we found that female is negatively correlated with voting, but it is not significant, so we cannot say it plays an important role. Age is positively correlated with voting, so one extra year increases the log odds of voting by 0.13; however, the coefficient on age squared is negative. From the graphs above we can see that the young and elderly populations, regardless of whether they are male or female, are less likely to vote than the middle-aged population. One of the factors that may be affecting the likelihood of voting is whether the election was on a weekday and the person worked that day. Education may also be a factor that is positively correlated with voting. Whether people voted or not in previous elections may also matter; however, the effect here may be less clear because people may be less likely to engage in midterm elections.


Machine Learning

Now, let’s bring in some steps from machine learning to set up a predictive model of whether or not a person voted in the 2002 midterm election.

Train-Test Split

First, we will split our data into a training set and a test set. We will hold out 20% of the data as a test dataset to evaluate our "best" model at the final step, training on the remaining 80%. The test set is often referred to as the holdout data.
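A sketch of the split, using the rsample package (attached with tidymodels). The seed and object names are ours, and the original may also have stratified on the outcome:

```r
library(tidymodels)  # attaches rsample, recipes, parsnip, tune, yardstick, ...

# Classification models downstream expect a factor outcome
ctrl_df <- ctrl_df %>% mutate(vote02 = factor(vote02))

set.seed(1234)  # arbitrary seed for reproducibility
data_split <- initial_split(ctrl_df, prop = 0.8)
data_split
```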

## <194434/48608/243042>

Create a training set and a holdout set.
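For example:

```r
train_df <- training(data_split)
test_df  <- testing(data_split)
```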

Preprocessing

We will do some data preprocessing to prepare for our models.
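Judging from the summary table below (a squared age term plus dummy-coded district indices), the recipe might look roughly like this sketch:

```r
vote_recipe <- recipe(vote02 ~ ., data = train_df) %>%
  # Add a squared age term
  step_mutate(age2 = age^2,
              # Treat the district indices as categorical before dummy coding
              st_sen = factor(st_sen),
              st_hse = factor(st_hse)) %>%
  step_dummy(st_sen, st_hse)
```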

After all of the steps above, we can see that we have added several new variables to our data.

## 
## ===================================================================
## Statistic     N      Mean    St. Dev.  Min Pctl(25) Pctl(75)  Max  
## -------------------------------------------------------------------
## age        194,434  52.076    18.573   18     37       67     113  
## persons    194,434   1.424     0.494    1     1        2       2   
## newreg     194,434   0.122     0.328    0     0        0       1   
## vote00     194,434   0.538     0.499    0     0        1       1   
## vote98     194,434   0.165     0.371    0     0        0       1   
## female     194,434   0.568     0.495    0     0        1       1   
## age2       194,434 3,056.847 2,088.571 324  1,369    4,489   12,769
## st_sen_X2  194,434   0.097     0.297    0     0        0       1   
## st_sen_X3  194,434   0.106     0.308    0     0        0       1   
## st_sen_X4  194,434   0.081     0.272    0     0        0       1   
## st_sen_X5  194,434   0.104     0.306    0     0        0       1   
## st_sen_X6  194,434   0.181     0.385    0     0        0       1   
## st_sen_X7  194,434   0.173     0.379    0     0        0       1   
## st_sen_X8  194,434   0.176     0.380    0     0        0       1   
## st_hse_X2  194,434   0.023     0.150    0     0        0       1   
## st_hse_X3  194,434   0.025     0.156    0     0        0       1   
## st_hse_X4  194,434   0.033     0.179    0     0        0       1   
## st_hse_X5  194,434   0.028     0.165    0     0        0       1   
## st_hse_X6  194,434   0.032     0.175    0     0        0       1   
## st_hse_X7  194,434   0.031     0.173    0     0        0       1   
## st_hse_X8  194,434   0.034     0.182    0     0        0       1   
## st_hse_X9  194,434   0.031     0.173    0     0        0       1   
## st_hse_X10 194,434   0.030     0.171    0     0        0       1   
## st_hse_X11 194,434   0.029     0.167    0     0        0       1   
## st_hse_X12 194,434   0.020     0.139    0     0        0       1   
## st_hse_X13 194,434   0.067     0.250    0     0        0       1   
## st_hse_X14 194,434   0.060     0.238    0     0        0       1   
## st_hse_X15 194,434   0.049     0.216    0     0        0       1   
## st_hse_X16 194,434   0.055     0.227    0     0        0       1   
## st_hse_X17 194,434   0.057     0.232    0     0        0       1   
## st_hse_X18 194,434   0.057     0.232    0     0        0       1   
## st_hse_X19 194,434   0.067     0.249    0     0        0       1   
## st_hse_X20 194,434   0.064     0.245    0     0        0       1   
## st_hse_X21 194,434   0.055     0.227    0     0        0       1   
## st_hse_X22 194,434   0.052     0.222    0     0        0       1   
## st_hse_X23 194,434   0.050     0.218    0     0        0       1   
## -------------------------------------------------------------------

Models

We are going to use a few models to predict whether or not a person voted in the 2002 midterm election: logistic regression, lasso, and classification trees.

Lasso

One of the central tools of machine learning is regularization. When we are working with a large number of variables to predict an outcome, we want to be especially careful about extrapolating results from the data sample that we have to a data sample that we do not have. This is often referred to as out-of-sample performance. Namely, we want to set up data-driven models in a manner that is sensitive to the variance of our analysis when we try to apply it in a new setting. Regularization is a step included in machine learning to penalize our models for complexity (that is, for including so many variables that we overfit to our training sample).

A direct link between linear/logistic regression and regularization can be made using the lasso algorithm. The lasso is a modification of linear and logistic regression that includes a penalty term, "lambda", which is used to enumerate a number of candidate models from which we can select the best-performing model.

In R, we can run a lasso using the glmnet function.
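A sketch of such a chunk, fitted on the raw training covariates (the tutorial's own chunk presumably used the preprocessed design matrix):

```r
library(glmnet)

x <- model.matrix(vote02 ~ . - 1, data = train_df)  # covariate matrix
y <- train_df$vote02

# alpha = 1 is the lasso; glmnet fits the whole lambda path at once
lasso_fit <- glmnet(x, y, family = "binomial", alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE)  # coefficient paths
```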

The figure above shows the path that the coefficient for each covariate follows as the value of the penalty term, lambda, decreases.

On the left-hand side we see that the coefficients are all zero. That is the point at which the penalty term (lambda) fully drowns out the relationship between the covariates and the outcome variable.

On the right-hand side is our model with no penalty, i.e. lambda equals zero. This is analogous to the non-regularized logistic regression model that we saw in the previous section.

Decision Tree

Another common foundation in machine learning models is the decision tree, which is a hierarchical approach to modelling the relationship between inputs (our covariates) and outputs. The decision tree algorithm implements a series of steps to split the data into groups that are most similar. Starting from the top of the hierarchy, the algorithm makes splits, dividing the data based on a covariate. The algorithm continues until it reaches final nodes (often called leaf nodes or terminal nodes). We can run a classification tree with our training data using the rpart function. For demonstration, we'll set the complexity parameter, cp, at a fixed value.
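A sketch (the cp value here is arbitrary):

```r
library(rpart)
library(rpart.plot)

# Classification tree with a fixed complexity parameter
tree_fit <- rpart(vote02 ~ ., data = train_df, method = "class", cp = 0.001)
rpart.plot(tree_fit)
```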

Cross Validation

Ok, so we’ve seen how to fit three different models to our training data, logistic regression, lasso, and decision trees. Let’s bring them all together into a framework for evaluating models.

Another tool that we implement through machine learning is cross validation. The motivation behind cross validation is that we want to train models that have good out-of-sample prediction. By "good", we could mean that the model has low variance, high accuracy, or a number of other measures. We'll come back to deciding on the appropriate metrics, but the way that we derive quantitative measures of the goodness of model fit is to split our data into "folds." The data are randomly split to increase the likelihood that we are generating a representative sample of the full data within each fold.

Cross validation is the concept of holding aside one fold, training our model on the remaining folds, and then measuring goodness of fit on the fold that was held out. This held out fold is often referred to as the validation fold. It’s not to be confused with the test data, which was a portion of our data that was put aside and never touched during the model training phase.

Let’s apply cross-validation to our data. We will use 5-fold cross validation (you are welcome to change the number of folds; 3, 5, and 10-fold are common).

## #  5-fold cross-validation using stratification 
## # A tibble: 5 x 2
##   splits                 id   
##   <named list>           <chr>
## 1 <split [155.5K/38.9K]> Fold1
## 2 <split [155.5K/38.9K]> Fold2
## 3 <split [155.5K/38.9K]> Fold3
## 4 <split [155.5K/38.9K]> Fold4
## 5 <split [155.5K/38.9K]> Fold5

Now, we are going to use a unifying framework to train our models. For this tutorial, we will work with the tidymodels framework for machine learning. tidymodels is a "meta-package" that incorporates a number of other packages that make the machine learning pipeline consistent for many commonly-used machine learning algorithms.

We will work with the same models that we have seen already in this tutorial: logistic regression, lasso, and decision trees.

First, we initialize the workflow by assigning the preprocessing recipe that we created earlier.
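For example:

```r
# Start a workflow and attach the preprocessing recipe from earlier
vote_workflow <- workflow() %>%
  add_recipe(vote_recipe)
```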

Next, we declare the three models that we want to train using the parsnip package. We will use the glm package for logistic regression, glmnet package for lasso, and rpart package for decision trees. Declare these packages as the “engine” for each specification. For all three models, we will be using classification.
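A sketch of the three specifications (the fixed penalty and cost-complexity values are placeholders):

```r
logit_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lasso_spec <- logistic_reg(penalty = 0.001, mixture = 1) %>%  # mixture = 1 is the lasso
  set_engine("glmnet") %>%
  set_mode("classification")

tree_spec <- decision_tree(cost_complexity = 0.001) %>%
  set_engine("rpart") %>%
  set_mode("classification")
```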

Next, run the cross-validation! In order to do so, we are using the fit_resamples function from the tune package. This allows us to extract the performance metrics for each fold of each model.
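A sketch for one model, using the workflow interface; repeating it with lasso_spec and tree_spec produces the three result tables further below.

```r
logit_res <- vote_workflow %>%
  add_model(logit_spec) %>%
  fit_resamples(resamples = folds,
                metrics = metric_set(accuracy, roc_auc))
```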

Performance metrics are the measures of goodness of fit that we previously mentioned. Deciding on the performance metric is a key decision in the machine learning pipeline. It is something that we will come back to a number of times throughout the course. In our setup, we will evaluate model performance using the following performance metrics:

  • Accuracy - the proportion of observations that are predicted correctly
  • Area Under the ROC Curve (AUC) - this is a common method of assessing the performance of binary classification, as it balances the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity). Values close to 0.50 are little better than picking at random, while values close to 1 maximize the true positive rate without inducing many false positives. See Receiver operating characteristic

The true value of cross validation comes into play when we are evaluating performance metrics. In effect, what we are doing (and what happens under the hood in the following code) is to train each model on 4 of the 5 folds, then calculate the performance metrics on the held-out validation fold, repeating this for each fold. We don't want to over-interpret the performance metrics from the training folds; instead, the validation folds give us a more reliable estimate of out-of-sample performance.

Let’s check the average accuracy and AUC (area under the curve) metrics from the cross-validated models.

## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.737     5 0.00166
## 2 roc_auc  binary     0.800     5 0.00144
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.732     5 0.00210
## 2 roc_auc  binary     0.793     5 0.00149
## # A tibble: 2 x 5
##   .metric  .estimator  mean     n std_err
##   <chr>    <chr>      <dbl> <int>   <dbl>
## 1 accuracy binary     0.736     5 0.00183
## 2 roc_auc  binary     0.754     5 0.00577

As we can see above, accuracy and AUC are fairly similar across the three models. Note that for the lasso and tree models, we did not do any model tuning. Let’s see if we can improve performance using model tuning.

Model Tuning

Let’s go back to the lasso model. Instead of setting the penalty parameter at a fixed value, we can use a data-driven approach to selecting the prefered value of our parameters. This is whene model tuning comes into play.

Tuning involves supplying a list of values, or a "grid", and training the models with each of the values. It is important to note that for each value (or set of values, if tuning multiple parameters), we implement cross validation.

With our same set up, let's tune the penalty parameter. Note that we could also tune the mixture parameter if we wanted to use the full elastic net regularization that is possible with the glmnet package.

Note: the following code takes some time to run; we set up parallel processing to help cut down on the required time.
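A sketch of the tuning chunk. The grid range mirrors the penalty values (down to 1e-10) that appear in the tables below, and the parallel backend uses the doFuture package from the session info; the details are our assumptions.

```r
library(doFuture)   # also attaches foreach and future
registerDoFuture()
plan(multisession)  # parallel workers on the local machine

# Mark the penalty for tuning
tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# 50 penalty values, log-spaced between 1e-10 and 1
lasso_grid <- grid_regular(penalty(range = c(-10, 0)), levels = 50)

lasso_tuned <- vote_workflow %>%
  add_model(tune_spec) %>%
  tune_grid(resamples = folds,
            grid = lasso_grid,
            metrics = metric_set(accuracy, roc_auc))
```

The next two tables can then be pulled out with show_best, e.g. show_best(lasso_tuned, metric = "accuracy").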

Take a look at the penalty values with the highest accuracy values.

## # A tibble: 5 x 4
##    penalty .metric   mean std_err
##      <dbl> <chr>    <dbl>   <dbl>
## 1 2.12e- 4 accuracy 0.737 0.00169
## 2 3.39e- 4 accuracy 0.737 0.00166
## 3 1.00e-10 accuracy 0.737 0.00168
## 4 1.60e-10 accuracy 0.737 0.00168
## 5 2.56e-10 accuracy 0.737 0.00168

And the highest AUC values.

## # A tibble: 5 x 4
##    penalty .metric  mean std_err
##      <dbl> <chr>   <dbl>   <dbl>
## 1 1.00e-10 roc_auc 0.800 0.00145
## 2 1.60e-10 roc_auc 0.800 0.00145
## 3 2.56e-10 roc_auc 0.800 0.00145
## 4 4.09e-10 roc_auc 0.800 0.00145
## 5 6.55e-10 roc_auc 0.800 0.00145

Let’s look at the full path of the performance metrics as the penalty term varied.

We find that tuning provides some improvement in the performance of the lasso model. Given that our models have only a few covariates, the models with very low penalty terms tend to perform better. From the figure above, we see that accuracy levels off at penalty values below 0.15. However, the AUC metric continues to increase until the non-penalized model is reached.

We will leave it as an exercise for you to explore more complex models with interaction terms and polynomial terms. To implement these data transformations, explore recipes::step_interact for interactions between covariates, and recipes::step_bs and recipes::step_poly for adding basis functions or polynomial terms.

Now that we’ve tune our model, we can extract the best model and apply it to the full training data.

Take a look at the coefficients from the best model.

## # A tibble: 33 x 2
##    term       estimate
##    <chr>         <dbl>
##  1 vote00       1.79  
##  2 vote98       1.11  
##  3 age          0.0815
##  4 persons      0.237 
##  5 st_sen_X3    0.105 
##  6 st_sen_X8   -0.327 
##  7 st_hse_X19   0.184 
##  8 st_hse_X5   -0.717 
##  9 st_hse_X16  -0.628 
## 10 st_hse_X18  -0.419 
## # … with 23 more rows

Comprehension Check

Now it’s your turn, try out a few different values for the penalty term. How do the variables and coefficients from the lasso model change as you change the penalty.val? Try some values close to 0.1 and some values below 0.01. Provide 3-5 sentence as comment below.

## # A tibble: 2 x 2
##   term   estimate
##   <chr>     <dbl>
## 1 vote00 1.06    
## 2 age    0.000869
## # A tibble: 2 x 2
##   term   estimate
##   <chr>     <dbl>
## 1 vote00  1.12   
## 2 age     0.00266
## # A tibble: 3 x 2
##   term   estimate
##   <chr>     <dbl>
## 1 vote00   1.39  
## 2 age      0.0102
## 3 vote98   0.285
## # A tibble: 15 x 2
##    term       estimate
##    <chr>         <dbl>
##  1 vote00       1.72  
##  2 age          0.0201
##  3 vote98       0.829 
##  4 persons      0.155 
##  5 st_sen_X3    0.112 
##  6 st_hse_X19   0.140 
##  7 st_sen_X8   -0.115 
##  8 st_hse_X5   -0.219 
##  9 st_hse_X16  -0.131 
## 10 st_hse_X18  -0.0974
## 11 newreg       0.0544
## 12 st_hse_X8    0.0494
## 13 st_hse_X14  -0.0460
## 14 st_hse_X9    0.0229
## 15 st_sen_X7   -0.0138


[DISCUSS HERE]

The bigger the penalty, the fewer the terms that "are relevant for the analysis". This is caused by the fact that one estimator absorbs all the effects that other variables bring to the model, and that is why we only use the lasso for the purpose of prediction and not for interpreting each one of the estimators.

Finally, let’s apply best model from our model tuning to our holdout set. We can use the last_fit function from the tune package to carry out this step.

## # A tibble: 2 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.738
## 2 roc_auc  binary         0.797

Comprehension Check

Let’s say that you are contacted by a non-partisan organization that is seeking to identify voters that are less likely to vote in the 2020 US federal general election. The organization wants to focus its resources on encouraging unlikely voters to go to the polls in November. Based on the analysis above, what recommendations would you provide? What additional data would would you recommend that the organization collect to accurately predict whether a person is likely to vote?

[DISCUSS HERE] Based on the analysis above, I would recommend focusing on the population younger than 50 years old, whose likelihood of voting is smaller; the younger, the better. In addition, I would recommend focusing on people who have not voted in previous elections. I would also recommend getting information on education, work status, the day of the election, and income level, and controlling for whether or not a person lives in a swing state.


References

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] vroom_1.2.0        doFuture_0.9.0     iterators_1.0.12   foreach_1.4.8     
##  [5] future_1.16.0      globals_0.12.5     patchwork_1.0.0    ggthemes_4.2.0    
##  [9] hrbrthemes_0.8.0   rpart_4.1-15       plotmo_3.5.6       TeachingDemos_2.10
## [13] plotrix_3.7-7      Formula_1.2-3      glmnet_3.0-2       Matrix_1.2-18     
## [17] yardstick_0.0.6    workflows_0.1.1    tune_0.0.1         rsample_0.0.5     
## [21] recipes_0.1.10     parsnip_0.0.5      infer_0.5.1        dials_0.0.4       
## [25] scales_1.1.0       broom_0.5.5        tidymodels_0.1.0   sandwich_2.5-1    
## [29] lmtest_0.9-37      zoo_1.8-7          margins_0.3.23     ggcorrplot_0.1.3  
## [33] corrr_0.4.2        stargazer_5.2.2    glue_1.3.2         here_0.1          
## [37] forcats_0.5.0      stringr_1.4.0      dplyr_0.8.5        purrr_0.3.3       
## [41] readr_1.3.1        tidyr_1.0.2        tibble_2.1.3       ggplot2_3.3.0     
## [45] tidyverse_1.3.0    pacman_0.5.1      
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.1.4          tidyselect_1.0.0    lme4_1.1-21        
##   [4] htmlwidgets_1.5.1   grid_3.6.3          pROC_1.16.1        
##   [7] munsell_0.5.0       codetools_0.2-16    DT_0.12            
##  [10] miniUI_0.1.1.1      withr_2.1.2         colorspace_1.4-1   
##  [13] highr_0.8           knitr_1.28          rstudioapi_0.11    
##  [16] stats4_3.6.3        Rttf2pt1_1.3.8      bayesplot_1.7.1    
##  [19] listenv_0.8.0       labeling_0.3        rstan_2.19.3       
##  [22] farver_2.0.3        bit64_0.9-7         DiceDesign_1.8-1   
##  [25] rprojroot_1.3-2     vctrs_0.2.4         generics_0.0.2     
##  [28] ipred_0.9-9         xfun_0.12           R6_2.4.1           
##  [31] markdown_1.1        rstanarm_2.19.3     lhs_1.0.1          
##  [34] assertthat_0.2.1    promises_1.1.0      nnet_7.3-13        
##  [37] gtable_0.3.0        processx_3.4.2      timeDate_3043.102  
##  [40] rlang_0.4.5         systemfonts_0.1.1   splines_3.6.3      
##  [43] extrafontdb_1.0     inline_0.3.15       yaml_2.2.1         
##  [46] reshape2_1.4.3      prediction_0.3.14   modelr_0.1.6       
##  [49] tidytext_0.2.3      threejs_0.3.3       crosstalk_1.1.0.1  
##  [52] backports_1.1.5     httpuv_1.5.2        rsconnect_0.8.16   
##  [55] tokenizers_0.2.1    extrafont_0.17      tools_3.6.3        
##  [58] lava_1.6.7          kableExtra_1.1.0    ellipsis_0.3.0     
##  [61] ggridges_0.5.2      Rcpp_1.0.4          plyr_1.8.6         
##  [64] base64enc_0.1-3     ps_1.3.2            prettyunits_1.1.1  
##  [67] haven_2.2.0         fs_1.3.2            furrr_0.1.0        
##  [70] magrittr_1.5        data.table_1.12.8   colourpicker_1.0   
##  [73] reprex_0.3.0        GPfit_1.0-8         SnowballC_0.6.0    
##  [76] matrixStats_0.56.0  tidyposterior_0.0.2 hms_0.5.3          
##  [79] shinyjs_1.1         mime_0.9            evaluate_0.14      
##  [82] xtable_1.8-4        tidypredict_0.4.5   shinystan_2.5.0    
##  [85] readxl_1.3.1        gridExtra_2.3       shape_1.4.4        
##  [88] rstantools_2.0.0    compiler_3.6.3      rpart.plot_3.0.8   
##  [91] crayon_1.3.4        minqa_1.2.4         StanHeaders_2.19.2 
##  [94] htmltools_0.4.0     later_1.0.0         lubridate_1.7.4    
##  [97] DBI_1.1.0           dbplyr_1.4.2        MASS_7.3-51.5      
## [100] boot_1.3-24         cli_2.0.2           gower_0.2.1        
## [103] igraph_1.2.4.2      pkgconfig_2.0.3     xml2_1.2.5         
## [106] dygraphs_1.1.1.6    hardhat_0.1.2       webshot_0.5.2      
## [109] prodlim_2019.11.13  rvest_0.3.5         janeaustenr_0.1.5  
## [112] callr_3.4.2         digest_0.6.25       rmarkdown_2.1      
## [115] cellranger_1.1.0    gdtools_0.2.1       shiny_1.4.0.2      
## [118] gtools_3.8.1        nloptr_1.2.2.1      lifecycle_0.2.0    
## [121] nlme_3.1-144        jsonlite_1.6.1      viridisLite_0.3.0  
## [124] fansi_0.4.1         pillar_1.4.3        lattice_0.20-40    
## [127] loo_2.2.0           fastmap_1.0.1       httr_1.4.1         
## [130] pkgbuild_1.0.6      survival_3.1-8      xts_0.12-0         
## [133] shinythemes_1.1.2   bit_1.1-15.2        class_7.3-15       
## [136] stringi_1.4.6

  1. This tutorial was originally developed by Susan Athey, Niall Keleher, and Eray Turkel for the Spring 2020 course, ALP301 Data-Driven Impact. We are grateful to Grant R. McDermott, Ed Rubin, Matthew Taddy, Kieran Healy, Garrett Grolemund and Hadley Wickham, and Bruno Rodrigues and Julia Silge. This tutorial draws from the examples that these colleagues made publicly available.