Introduction to Cross-Validation in R

In this recitation, we will: (1) load a dataset and (2) perform cross-validation of two linear regression models predicting vote choice in the 2016 Brexit referendum.

The dataset we will be working on today emanates from the 2016 EU Referendum in the United Kingdom. On June 23, 2016, 51.9% of the British electorate voted in favour of the United Kindgom leaving the European Union. Results are displayed below.

Brexit

Results of the Brexit Referendum (source: Statista)

1. Loading the dataset

First, we are loading a the BrexitData.csv file into our library using read.csv(). This file contains sociodemographic information and official voting data from the Brexit referendum.

# Loading the data using read.csv()

myData <- read.csv("/Users/evelynebrie/Dropbox/TA/RPubs/BrexitData.csv")

# Looking at the content of our myData dataset

colnames(myData)

##  [1] "X"                      "ID"                    
##  [3] "Region.Code"            "Region"                
##  [5] "Code"                   "Area.x"                
##  [7] "Electorate"             "Expected.Ballots"      
##  [9] "Verified.Ballot.Papers" "Percent.Turnout"       
## [11] "Votes.Cast"             "Valid.Votes"           
## [13] "Remain"                 "Leave"                 
## [15] "Rejected.Ballots"       "No.Official.Mark"      
## [17] "Multiple.Marks"         "Writing.or.Mark"       
## [19] "Unmarked.or.Void"       "Percent.Remain"        
## [21] "Percent.Leave"          "Percent.Rejected"      
## [23] "Type"                   "Area.y"                
## [25] "All.Residents"          "Age.0.to.4"            
## [27] "Age.5.to.9"             "Age.10.to.14"          
## [29] "Age.15.to.19"           "Age.20.to.24"          
## [31] "Age.25.to.29"           "Age.30.to.34"          
## [33] "Age.35.to.39"           "Age.40.to.44"          
## [35] "Age.45.to.49"           "Age.50.to.54"          
## [37] "Age.55.to.59"           "Age.60.to.64"          
## [39] "Age.65.to.69"           "Age.70.to.74"          
## [41] "Age.75.to.79"           "Age.80.to.84"          
## [43] "Age.85.to.89"           "Age.90.and.Over"

We will use cross-validation to identify the variables that allow us to predict the outcome of the Brexit referendum in each area code. To do so, we first need to create a variable for the percentage of votes in favor of Brexit. Each row represents a different area code.

# Creating a variable for the percentage of votes in favor of Brexit

myData$LeaveVotesPCT <- myData$Leave*100/myData$Valid.Votes

2. Cross-validation

In this recitation, we will be using cross-validation to predict support for the Leave option (in favor of Brexit) in the 2016 UK Referendum.

2.1 Creating a Training Set and a Test Set

The first step in performing cross-validation is to divide the dataset into a training set and a test set. Because we want to avoid over-fitting our model on specific data, we will start by running our linear regression model on the test set. We will then use the regression coefficients to predict the value of the dependent variable in the training set using the predict() function. We’ll never be able to perfectly predict the DV in this second dataset, but if our model is good, there should be only a small difference between our prediction and the actual data.

# Set the seed for replication purposes

set.seed(123)

# Then, randomly sampling from the full dataset to generate a training set and a test set

rs <- sample(1:dim(myData)[1],size=round(dim(myData)[1]/5),replace=F)

length(rs) # Number of integers

## [1] 76

table(rs) # To be clear, this is really just a bunch of row numbers!

## rs
##   1   9  16  18  31  38  41  42  47  49  52  68  75  79  80  88  89  91 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 103 110 120 121 123 126 138 141 142 143 147 156 157 166 169 171 184 194 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 196 199 207 211 212 215 221 231 235 236 242 245 250 251 253 256 260 263 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 264 278 286 290 301 316 317 323 327 331 335 339 347 353 356 357 365 367 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 370 378 379 382 
##   1   1   1   1

# Dividing data into two datasets

training.set <- myData[-rs,] # Bigger, to give a better chance to our model...
test.set <- myData[rs,] # Smaller, to test our model

# Sanity check: verifying the dimensions of both datasets

dim(training.set) # 4/5 of the total dataset

## [1] 306  45

dim(test.set) # 1/5 of the total dataset

## [1] 76 45

2.2 Running and Testing the Simple Linear Regression Model

The next step is simply (1) running the linear model in the training set and (2) predicting the value of our dependent variable in the test set using the predict() function. Performing cross-validation is as simple as that!

# Generating a simple linear regression model

model <- lm(LeaveVotesPCT ~ All.Residents, data=training.set) # Population size of area code

summary(model)

## 
## Call:
## lm(formula = LeaveVotesPCT ~ All.Residents, data = training.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.187  -6.293   1.166   6.775  21.220 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.528e+01  8.953e-01  61.749  < 2e-16 ***
## All.Residents -1.451e-05  4.052e-06  -3.581 0.000399 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.3 on 298 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.04127,    Adjusted R-squared:  0.03805 
## F-statistic: 12.83 on 1 and 298 DF,  p-value: 0.0003988

# Predicting the DV in the test set using the intercept and coefficient from the training set

pout <- predict(model,newdata=test.set)

We find a coefficient of -0.00001451 for the population size.

Now, what we are really looking for is the answer to this question: was cross-validation effective? Let’s take a look at (1) the mean difference between our prediction and the actual data and (2) how strongly correlated our prediction is with the value of actual data.

# Displaying the main difference between the expected DV values from our regression on the training set and the actual DV values within the training set

mean(abs(pout-test.set$LeaveVotesPCT),na.rm=T)

## [1] 8.041857

# How strongly correlated are our expected values with the actual values? 

cor(pout,test.set$LeaveVotesPCT,use="pairwise.complete.obs")

## [1] 0.1265215

2.3 Running and Testing the Multiple Linear Regression Model

Should we simply add more independent variables if we want to improve our model? Let’s see if this helps…

# Generating a multiple linear regression model

model <- lm(LeaveVotesPCT ~ All.Residents + Percent.Turnout, data=training.set)
summary(model)

## 
## Call:
## lm(formula = LeaveVotesPCT ~ All.Residents + Percent.Turnout, 
##     data = training.set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.9398  -6.2169   0.8913   6.7767  20.8410 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.222e+01  9.691e+00   4.357 1.82e-05 ***
## All.Residents   -1.203e-05  4.443e-06  -2.707  0.00717 ** 
## Percent.Turnout  1.718e-01  1.270e-01   1.353  0.17703    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.29 on 297 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.04714,    Adjusted R-squared:  0.04073 
## F-statistic: 7.347 on 2 and 297 DF,  p-value: 0.0007685

# Predicting the test set using the intercept and coefficient from the training set

pout <- predict(model,newdata=test.set)

# Displaying the main difference between the expected DV values from our regression on the training set and the actual DV values within the training set

mean(abs(pout-test.set$LeaveVotesPCT),na.rm=T)

## [1] 8.210852

# How strongly correlated are our expected values with the actual values? 

cor(pout,test.set$LeaveVotesPCT,use="pairwise.complete.obs")

## [1] 0.08004575

Turns out, adding one additional independent variable even makes our prediction slightly worse. The key is to select variables that are strong and reliable predictors of the relationship within the training set (p-value can help us make an educated guess) AND/OR the variables that are not dependent on the specificity of the training set (this can be trickier).

Exercises

Step 1

Please create a variable called RemainVotesPCT for the percentage of votes against Brexit in a given area code. Then, fit a linear regression in which RemainVotesPCT is the dependent variable and “All.Residents” is the lone independent variable. Report the basic regression output using the summary() function.

Step 2

In their forthcoming book, Pippa Norris and Ronald Ingelhart (two very prominent political scientists!) argue that millennials were far more likely to have voted Remain in the Brexit referendum. You want to test their argument. If millenials were between 15 and 29 at the time of the 2016 Brexit referendum, please create a variable called MILL_PCT and run a simple linear regression model using the % of millenials as the sole independent variable.

Step 3

How good of a predictor of support for the Remain option was the % of millenials in terms of explaining overall variance in this DV? Please extract the r squared of the model from Step 2. Using only code, please confirm or infirm that it is higher than than 0.1. What should we conclude from this?

Step 4

Please set your seed at 123 and divide the dataset into a training set (4/5 of the data) and a test set (1/5 of the data). What is the main difference between the expected DV value in the training set and in the test set when using the % of millenials as the sole independent variable?

Introduction to Cross-Validation in R

Evelyne Brie

Results of the Brexit Referendum (source: Statista)

1. Loading the dataset

2. Cross-validation

2.1 Creating a Training Set and a Test Set

2.2 Running and Testing the Simple Linear Regression Model

2.3 Running and Testing the Multiple Linear Regression Model

Exercises

Step 1

Step 2

Step 3

Step 4

Happy Thanksgiving! (yes, this turkey was made using `ggplot`!)

Introduction to Cross-Validation in R

Evelyne Brie

Results of the Brexit Referendum (source: Statista)

1. Loading the dataset

2. Cross-validation

2.1 Creating a Training Set and a Test Set

2.2 Running and Testing the Simple Linear Regression Model

2.3 Running and Testing the Multiple Linear Regression Model

Exercises

Step 1

Step 2

Step 3

Step 4

Happy Thanksgiving! (yes, this turkey was made using ggplot!)

Happy Thanksgiving! (yes, this turkey was made using `ggplot`!)