First, we’ll look at some basics about how R code works.
Then, we’ll dissect an “Election Data Fetch” script that crunches Rutherford County precinct results for the 2022 Tennessee governor’s race.
This section borrows examples from “R programming for ABSOLUTE beginners,” an excellent introduction to R code from Dr. Greg Martin’s R Programming 101 YouTube channel.
5 + 6
## [1] 11
a <- 5
b <- 6
a + b
## [1] 11
sum(a,b)
## [1] 11
name <- c("Greg", "Gill")
name
## [1] "Greg" "Gill"
name <- c("Greg", "Paul", "Kim")
age <- c(47,52,34)
gender <- c("M","M","F")
name
## [1] "Greg" "Paul" "Kim"
age
## [1] 47 52 34
gender
## [1] "M" "M" "F"
I used the head() function to show the data frame in the output here. Martin simply clicks on the data frame in RStudio and shows it to you there.
friends <- data.frame(name, age, gender)
head(friends)
## name age gender
## 1 Greg 47 M
## 2 Paul 52 M
## 3 Kim 34 F
An aside: One concept Martin doesn’t explain: A data frame is a grid. Horizontal sections of the grid are called “rows.” Vertical sections of the grid are called “columns.” Each data point gets stored in the intersection of a row and a column. In the friends data frame, for example, Martin’s first name, “Greg,” is stored in the intersection of the first row and the first column.
Another aside: When you are using a spreadsheet, like Excel, these intersections are called “cells.” Each has a “cell address,” like A1 for the cell at the intersection of Column A and Row 1. Most of what you do in a spreadsheet involves working with data in cells. R is designed to work with whole columns at once. It’s a little different approach. But, as you’ll see, it’s often much, much faster.
One last aside: In data journalism, it’s rare to build a data frame manually like this. Most of the time, you import data that someone else has already assembled.
Selecting various pieces of the data frame using base R code. Note: This is the hard way to do things. It comes in handy sometimes. But you’ll see an easier way in a moment.
# Show all rows in the data frame's "name" column:
friends$name
## [1] "Greg" "Paul" "Kim"
# Show all rows and columns in the data frame:
friends[ , ]
## name age gender
## 1 Greg 47 M
## 2 Paul 52 M
## 3 Kim 34 F
# Show all columns in the data frame's first row:
friends[1, ]
## name age gender
## 1 Greg 47 M
# Show the first row of the first column:
friends[1,1]
## [1] "Greg"
# Show rows 1 through 3 of the first column (Same result as friends$name):
friends[1:3,1]
## [1] "Greg" "Paul" "Kim"
# Show column 1 of the first three rows (Same result as friends[1, ]):
friends[1,1:3]
## name age gender
## 1 Greg 47 M
# Show all rows in the first two columns for which age is less than 50:
friends[friends$age<50,1:2]
## name age
## 1 Greg 47
## 3 Kim 34
A much easier way to select pieces of the data frame: Use the tidyverse package. The tidyverse package makes many other things in R easier to do, too.
if (!require("tidyverse"))
install.packages("tidyverse")
library(tidyverse)
friends %>%
select(name,age) %>%
filter(age < 50) %>%
arrange(age)
## name age
## 1 Kim 34
## 2 Greg 47
Now, let’s look at something truly useful. This script will:
Retrieve statewide, Tennessee-precinct-level November 2022 election results from the Tennessee Secretary of State’s web site’s “Historical Election Results” page, (https://sos.tn.gov/elections/results).
Filter the data for Rutherford County, Tennessee, and the precinct-level vote counts for Republican incumbent Bill Lee and Democratic challenger Jason Martin.
Calculate the total votes cast for either Lee or Martin and the percentage of those votes case for each candidate.
Compare the vote totals and indicate which candidate, Lee or Martin, won the precinct.
Sort the data in descending order by the percentage of votes for Lee.
Show the precinct-level data as an HTML table
Aggregate the precinct-level data to produce county-level data
Show the county-level data as an HTML table
if (!require("tidyverse"))
install.packages("tidyverse")
if (!require("gtExtras"))
install.packages("gtExtras")
library(tidyverse)
library(readxl)
library(gtExtras)
# Download and import election data
# from TN Secretary of State web site:
# https://sos.tn.gov/elections/results
download.file(
"https://sos-prod.tnsosgovfiles.com/s3fs-public/document/20221108AllbyPrecinct.xlsx",
"RawElectionData.xlsx",
quiet = TRUE,
mode = "wb")
RawElectionData <- read_xlsx("RawElectionData.xlsx", sheet = "SOFFICELso")
# Filter, calculate, and select
# to get data of interest
# then import results in a dataframe called MyData
MyData <- RawElectionData %>%
filter(COUNTY == "Rutherford",
CANDGROUP == "1") %>%
mutate(
Lee = PVTALLY1,
Martin = PVTALLY2,
Total = PVTALLY1 + PVTALLY2,
Lee_Pct = round(PVTALLY1 / (PVTALLY1 + PVTALLY2), 2),
Martin_Pct = round(PVTALLY2 / (PVTALLY1 + PVTALLY2), 2),
Winner = case_when(Lee > Martin ~ "Lee (R)",
Martin > Lee ~ "Martin (D)",
.default = "Tie")) %>%
arrange(desc(Lee)) %>%
select(COUNTY,
PRECINCT,
Total,
Lee,
Martin,
Lee_Pct,
Martin_Pct,
Winner)
# Show MyData as an HTML table
gt(MyData) %>%
tab_header("Precinct results") %>%
cols_align(align = "left") %>%
gt_theme_538
# Show county totals as an HTML table
MySummary <- MyData %>%
summarize(sum(Lee), sum(Martin), sum(Total)) %>%
rename("Lee" = "sum(Lee)",
"Martin" = "sum(Martin)",
"Total" = "sum(Total)") %>%
mutate(Pct_Lee = round(Lee / Total, digits = 2)) %>%
mutate(Pct_Martin = round(Martin / Total, digits = 2))
gt(MySummary) %>%
tab_header("County results") %>%
cols_align(align = "left") %>%
gt_theme_538
Precinct results | |||||||
COUNTY | PRECINCT | Total | Lee | Martin | Lee_Pct | Martin_Pct | Winner |
---|---|---|---|---|---|---|---|
Rutherford | 8-1 | 5013 | 3778 | 1235 | 0.75 | 0.25 | Lee (R) |
Rutherford | 9-1 | 5024 | 3425 | 1599 | 0.68 | 0.32 | Lee (R) |
Rutherford | 19-1 | 4644 | 3046 | 1598 | 0.66 | 0.34 | Lee (R) |
Rutherford | 14-2 | 4570 | 3002 | 1568 | 0.66 | 0.34 | Lee (R) |
Rutherford | 2-1 | 3397 | 2647 | 750 | 0.78 | 0.22 | Lee (R) |
Rutherford | 4-1 | 3727 | 2465 | 1262 | 0.66 | 0.34 | Lee (R) |
Rutherford | 7-1 | 3959 | 2416 | 1543 | 0.61 | 0.39 | Lee (R) |
Rutherford | 6-2 | 3357 | 2409 | 948 | 0.72 | 0.28 | Lee (R) |
Rutherford | 3-2 | 3341 | 2312 | 1029 | 0.69 | 0.31 | Lee (R) |
Rutherford | 13-1 | 3119 | 1836 | 1283 | 0.59 | 0.41 | Lee (R) |
Rutherford | 20-1 | 3084 | 1782 | 1302 | 0.58 | 0.42 | Lee (R) |
Rutherford | 15-1 | 2804 | 1716 | 1088 | 0.61 | 0.39 | Lee (R) |
Rutherford | 18-2 | 2142 | 1384 | 758 | 0.65 | 0.35 | Lee (R) |
Rutherford | 15-2 | 2111 | 1380 | 731 | 0.65 | 0.35 | Lee (R) |
Rutherford | 10-2 | 1952 | 1122 | 830 | 0.57 | 0.43 | Lee (R) |
Rutherford | 5-1 | 2042 | 1080 | 962 | 0.53 | 0.47 | Lee (R) |
Rutherford | 21-2 | 1951 | 1020 | 931 | 0.52 | 0.48 | Lee (R) |
Rutherford | 10-1 | 1772 | 997 | 775 | 0.56 | 0.44 | Lee (R) |
Rutherford | 12-1 | 1613 | 946 | 667 | 0.59 | 0.41 | Lee (R) |
Rutherford | 1-1 | 1911 | 909 | 1002 | 0.48 | 0.52 | Martin (D) |
Rutherford | 6-1 | 1082 | 840 | 242 | 0.78 | 0.22 | Lee (R) |
Rutherford | 11-2 | 1463 | 835 | 628 | 0.57 | 0.43 | Lee (R) |
Rutherford | 16-2 | 1397 | 813 | 584 | 0.58 | 0.42 | Lee (R) |
Rutherford | 2-2 | 1104 | 793 | 311 | 0.72 | 0.28 | Lee (R) |
Rutherford | 17-1 | 1504 | 705 | 799 | 0.47 | 0.53 | Martin (D) |
Rutherford | 11-1 | 1197 | 672 | 525 | 0.56 | 0.44 | Lee (R) |
Rutherford | 12-2 | 914 | 654 | 260 | 0.72 | 0.28 | Lee (R) |
Rutherford | 21-1 | 903 | 501 | 402 | 0.55 | 0.45 | Lee (R) |
Rutherford | 16-1 | 1017 | 406 | 611 | 0.40 | 0.60 | Martin (D) |
Rutherford | 18-1 | 680 | 207 | 473 | 0.30 | 0.70 | Martin (D) |
Rutherford | 14-1 | 338 | 187 | 151 | 0.55 | 0.45 | Lee (R) |
Rutherford | 3-1 | 192 | 157 | 35 | 0.82 | 0.18 | Lee (R) |
Rutherford | 4-2 | 199 | 133 | 66 | 0.67 | 0.33 | Lee (R) |
County results | ||||
Lee | Martin | Total | Pct_Lee | Pct_Martin |
---|---|---|---|---|
46575 | 26948 | 73523 | 0.63 | 0.37 |
You might wonder how to filter for two or more counties at the same time, rather than just for one county at a time. It’s pretty easy.
Suppose, for example, you want data for Rutherford, Davidson, and Williamson counties. Find the script’s code block that filters for Rutherford County:
MyData <- RawElectionData %>%
filter(COUNTY == "Rutherford",
CANDGROUP == "1") %>%
… and change it to this:
MyData <- RawElectionData %>%
filter(COUNTY == "Rutherford" |
COUNTY == "Davidson" |
COUNTY == "Williamson",
CANDGROUP == "1") %>%
… then run the script. You’ll find that the MyData data frame contains data for Rutherford, Davidson, and Williamson counties. You could add as many Tennessee counties as you wanted. All you have to do is be sure you follow the code pattern shown.
To make things a little neater in the table, you might also want to sort the data hierarchically - that is, first by county, then by total votes for Lee. Changing:
arrange(desc(Lee)) %>%
… to …
arrange(COUNTY, desc(Lee)) %>%
… would do the trick.