In this recitation, we will: (1) load a dataset and (2) perform cross-validation of two linear regression models predicting vote choice in the 2016 Brexit referendum.
The dataset we will be working on today emanates from the 2016 EU Referendum in the United Kingdom. On June 23, 2016, 51.9% of the British electorate voted in favour of the United Kindgom leaving the European Union. Results are displayed below.
First, we are loading a the BrexitData.csv file into our library using read.csv(). This file contains sociodemographic information and official voting data from the Brexit referendum.
# Loading the data using read.csv()
myData <- read.csv("/Users/evelynebrie/Dropbox/TA/RPubs/BrexitData.csv")
# Looking at the content of our myData dataset
colnames(myData)
## [1] "X" "ID"
## [3] "Region.Code" "Region"
## [5] "Code" "Area.x"
## [7] "Electorate" "Expected.Ballots"
## [9] "Verified.Ballot.Papers" "Percent.Turnout"
## [11] "Votes.Cast" "Valid.Votes"
## [13] "Remain" "Leave"
## [15] "Rejected.Ballots" "No.Official.Mark"
## [17] "Multiple.Marks" "Writing.or.Mark"
## [19] "Unmarked.or.Void" "Percent.Remain"
## [21] "Percent.Leave" "Percent.Rejected"
## [23] "Type" "Area.y"
## [25] "All.Residents" "Age.0.to.4"
## [27] "Age.5.to.9" "Age.10.to.14"
## [29] "Age.15.to.19" "Age.20.to.24"
## [31] "Age.25.to.29" "Age.30.to.34"
## [33] "Age.35.to.39" "Age.40.to.44"
## [35] "Age.45.to.49" "Age.50.to.54"
## [37] "Age.55.to.59" "Age.60.to.64"
## [39] "Age.65.to.69" "Age.70.to.74"
## [41] "Age.75.to.79" "Age.80.to.84"
## [43] "Age.85.to.89" "Age.90.and.Over"
We will use cross-validation to identify the variables that allow us to predict the outcome of the Brexit referendum in each area code. To do so, we first need to create a variable for the percentage of votes in favor of Brexit. Each row represents a different area code.
# Creating a variable for the percentage of votes in favor of Brexit
myData$LeaveVotesPCT <- myData$Leave*100/myData$Valid.Votes
In this recitation, we will be using cross-validation to predict support for the Leave option (in favor of Brexit) in the 2016 UK Referendum.
The first step in performing cross-validation is to divide the dataset into a training set and a test set. Because we want to avoid over-fitting our model on specific data, we will start by running our linear regression model on the test set. We will then use the regression coefficients to predict the value of the dependent variable in the training set using the predict() function. We’ll never be able to perfectly predict the DV in this second dataset, but if our model is good, there should be only a small difference between our prediction and the actual data.
# Set the seed for replication purposes
set.seed(123)
# Then, randomly sampling from the full dataset to generate a training set and a test set
rs <- sample(1:dim(myData)[1],size=round(dim(myData)[1]/5),replace=F)
length(rs) # Number of integers
## [1] 76
table(rs) # To be clear, this is really just a bunch of row numbers!
## rs
## 1 9 16 18 31 38 41 42 47 49 52 68 75 79 80 88 89 91
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 103 110 120 121 123 126 138 141 142 143 147 156 157 166 169 171 184 194
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 196 199 207 211 212 215 221 231 235 236 242 245 250 251 253 256 260 263
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 264 278 286 290 301 316 317 323 327 331 335 339 347 353 356 357 365 367
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 370 378 379 382
## 1 1 1 1
# Dividing data into two datasets
training.set <- myData[-rs,] # Bigger, to give a better chance to our model...
test.set <- myData[rs,] # Smaller, to test our model
# Sanity check: verifying the dimensions of both datasets
dim(training.set) # 4/5 of the total dataset
## [1] 306 45
dim(test.set) # 1/5 of the total dataset
## [1] 76 45
The next step is simply (1) running the linear model in the training set and (2) predicting the value of our dependent variable in the test set using the predict() function. Performing cross-validation is as simple as that!
# Generating a simple linear regression model
model <- lm(LeaveVotesPCT ~ All.Residents, data=training.set) # Population size of area code
summary(model)
##
## Call:
## lm(formula = LeaveVotesPCT ~ All.Residents, data = training.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.187 -6.293 1.166 6.775 21.220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.528e+01 8.953e-01 61.749 < 2e-16 ***
## All.Residents -1.451e-05 4.052e-06 -3.581 0.000399 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.3 on 298 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.04127, Adjusted R-squared: 0.03805
## F-statistic: 12.83 on 1 and 298 DF, p-value: 0.0003988
# Predicting the DV in the test set using the intercept and coefficient from the training set
pout <- predict(model,newdata=test.set)
We find a coefficient of -0.00001451 for the population size.
Now, what we are really looking for is the answer to this question: was cross-validation effective? Let’s take a look at (1) the mean difference between our prediction and the actual data and (2) how strongly correlated our prediction is with the value of actual data.
# Displaying the main difference between the expected DV values from our regression on the training set and the actual DV values within the training set
mean(abs(pout-test.set$LeaveVotesPCT),na.rm=T)
## [1] 8.041857
# How strongly correlated are our expected values with the actual values?
cor(pout,test.set$LeaveVotesPCT,use="pairwise.complete.obs")
## [1] 0.1265215
Should we simply add more independent variables if we want to improve our model? Let’s see if this helps…
# Generating a multiple linear regression model
model <- lm(LeaveVotesPCT ~ All.Residents + Percent.Turnout, data=training.set)
summary(model)
##
## Call:
## lm(formula = LeaveVotesPCT ~ All.Residents + Percent.Turnout,
## data = training.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.9398 -6.2169 0.8913 6.7767 20.8410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.222e+01 9.691e+00 4.357 1.82e-05 ***
## All.Residents -1.203e-05 4.443e-06 -2.707 0.00717 **
## Percent.Turnout 1.718e-01 1.270e-01 1.353 0.17703
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.29 on 297 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.04714, Adjusted R-squared: 0.04073
## F-statistic: 7.347 on 2 and 297 DF, p-value: 0.0007685
# Predicting the test set using the intercept and coefficient from the training set
pout <- predict(model,newdata=test.set)
# Displaying the main difference between the expected DV values from our regression on the training set and the actual DV values within the training set
mean(abs(pout-test.set$LeaveVotesPCT),na.rm=T)
## [1] 8.210852
# How strongly correlated are our expected values with the actual values?
cor(pout,test.set$LeaveVotesPCT,use="pairwise.complete.obs")
## [1] 0.08004575
Turns out, adding one additional independent variable even makes our prediction slightly worse. The key is to select variables that are strong and reliable predictors of the relationship within the training set (p-value can help us make an educated guess) AND/OR the variables that are not dependent on the specificity of the training set (this can be trickier).
Please create a variable called RemainVotesPCT for the percentage of votes against Brexit in a given area code. Then, fit a linear regression in which RemainVotesPCT is the dependent variable and “All.Residents” is the lone independent variable. Report the basic regression output using the summary() function.
In their forthcoming book, Pippa Norris and Ronald Ingelhart (two very prominent political scientists!) argue that millennials were far more likely to have voted Remain in the Brexit referendum. You want to test their argument. If millenials were between 15 and 29 at the time of the 2016 Brexit referendum, please create a variable called MILL_PCT and run a simple linear regression model using the % of millenials as the sole independent variable.
How good of a predictor of support for the Remain option was the % of millenials in terms of explaining overall variance in this DV? Please extract the r squared of the model from Step 2. Using only code, please confirm or infirm that it is higher than than 0.1. What should we conclude from this?
Please set your seed at 123 and divide the dataset into a training set (4/5 of the data) and a test set (1/5 of the data). What is the main difference between the expected DV value in the training set and in the test set when using the % of millenials as the sole independent variable?
ggplot!)