Overview

First, we’ll look at some basics about how R code works.

Then, we’ll dissect an “Election Data Fetch” script that crunches Rutherford County precinct results for the 2022 Tennessee governor’s race.

How R code works

This section borrows examples from “R programming for ABSOLUTE beginners,” an excellent introduction to R code from Dr. Greg Martin’s R Programming 101 YouTube channel.

Assigning values to variables.

5 + 6 
## [1] 11
a <- 5 
b <- 6 
a + b 
## [1] 11
sum(a,b)  
## [1] 11
name <- c("Greg", "Gill") 
name  
## [1] "Greg" "Gill"
name <- c("Greg", "Paul", "Kim") 
age <- c(47,52,34) 
gender <- c("M","M","F")

name
## [1] "Greg" "Paul" "Kim"
age
## [1] 47 52 34
gender
## [1] "M" "M" "F"

Creating a data frame using the data.frame() function.

I used the head() function to show the data frame in the output here. Martin simply clicks on the data frame in RStudio and shows it to you there.

friends <- data.frame(name, age, gender)
head(friends)
##   name age gender
## 1 Greg  47      M
## 2 Paul  52      M
## 3  Kim  34      F
  • An aside: One concept Martin doesn’t explain: A data frame is a grid. Horizontal sections of the grid are called “rows.” Vertical sections of the grid are called “columns.” Each data point gets stored in the intersection of a row and a column. In the friends data frame, for example, Martin’s first name, “Greg,” is stored in the intersection of the first row and the first column.

  • Another aside: When you are using a spreadsheet, like Excel, these intersections are called “cells.” Each has a “cell address,” like A1 for the cell at the intersection of Column A and Row 1. Most of what you do in a spreadsheet involves working with data in cells. R is designed to work with whole columns at once. It’s a little different approach. But, as you’ll see, it’s often much, much faster.

  • One last aside: In data journalism, it’s rare to build a data frame manually like this. Most of the time, you import data that someone else has already assembled.

Subsetting (the hard way)

Selecting various pieces of the data frame using base R code. Note: This is the hard way to do things. It comes in handy sometimes. But you’ll see an easier way in a moment.

# Show all rows in the data frame's "name" column:
friends$name  
## [1] "Greg" "Paul" "Kim"
# Show all rows and columns in the data frame: 
friends[ , ]  
##   name age gender
## 1 Greg  47      M
## 2 Paul  52      M
## 3  Kim  34      F
# Show all columns in the data frame's first row: 
friends[1, ] 
##   name age gender
## 1 Greg  47      M
# Show the first row of the first column: 
friends[1,1]
## [1] "Greg"
# Show rows 1 through 3 of the first column (Same result as friends$name):
friends[1:3,1]
## [1] "Greg" "Paul" "Kim"
# Show column 1 of the first three rows (Same result as friends[1, ]): 
friends[1,1:3]
##   name age gender
## 1 Greg  47      M
# Show all rows in the first two columns for which age is less than 50: 
friends[friends$age<50,1:2]
##   name age
## 1 Greg  47
## 3  Kim  34

Subsetting (the easy way)

A much easier way to select pieces of the data frame: Use the tidyverse package. The tidyverse package makes many other things in R easier to do, too.

if (!require("tidyverse"))
  install.packages("tidyverse")

library(tidyverse)

friends %>%
  select(name,age) %>%
  filter(age < 50) %>%
  arrange(age)
##   name age
## 1  Kim  34
## 2 Greg  47

Election data fetch

Now, let’s look at something truly useful. This script will:

if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("gtExtras"))
  install.packages("gtExtras")

library(tidyverse)
library(readxl)
library(gtExtras)

# Download and import election data
# from TN Secretary of State web site:
# https://sos.tn.gov/elections/results

download.file(
  "https://sos-prod.tnsosgovfiles.com/s3fs-public/document/20221108AllbyPrecinct.xlsx",
  "RawElectionData.xlsx",
  quiet = TRUE,
  mode = "wb")

RawElectionData <- read_xlsx("RawElectionData.xlsx", sheet = "SOFFICELso")

# Filter, calculate, and select
# to get data of interest
# then import results in a dataframe called MyData

MyData <- RawElectionData %>%
  filter(COUNTY == "Rutherford",
         CANDGROUP == "1") %>%
  mutate(
    Lee = PVTALLY1,
    Martin = PVTALLY2,
    Total = PVTALLY1 + PVTALLY2,
    Lee_Pct = round(PVTALLY1 / (PVTALLY1 + PVTALLY2), 2),
    Martin_Pct = round(PVTALLY2 / (PVTALLY1 + PVTALLY2), 2),
    Winner = case_when(Lee > Martin ~ "Lee (R)",
                       Martin > Lee ~ "Martin (D)",
                       .default = "Tie")) %>%
  arrange(desc(Lee)) %>%
  select(COUNTY,
         PRECINCT,
         Total,
         Lee,
         Martin,
         Lee_Pct,
         Martin_Pct,
         Winner)

# Show MyData as an HTML table

gt(MyData) %>%
  tab_header("Precinct results") %>%
  cols_align(align = "left") %>%
  gt_theme_538

# Show county totals as an HTML table

MySummary <- MyData %>%
  summarize(sum(Lee), sum(Martin), sum(Total)) %>%
  rename("Lee" = "sum(Lee)",
         "Martin" = "sum(Martin)",
         "Total" = "sum(Total)") %>%
  mutate(Pct_Lee = round(Lee / Total, digits = 2)) %>%
  mutate(Pct_Martin = round(Martin / Total, digits = 2))

gt(MySummary) %>%
  tab_header("County results") %>%
  cols_align(align = "left") %>%
  gt_theme_538

Here’s the output the code produces:

Precinct results
COUNTY PRECINCT Total Lee Martin Lee_Pct Martin_Pct Winner
Rutherford 8-1 5013 3778 1235 0.75 0.25 Lee (R)
Rutherford 9-1 5024 3425 1599 0.68 0.32 Lee (R)
Rutherford 19-1 4644 3046 1598 0.66 0.34 Lee (R)
Rutherford 14-2 4570 3002 1568 0.66 0.34 Lee (R)
Rutherford 2-1 3397 2647 750 0.78 0.22 Lee (R)
Rutherford 4-1 3727 2465 1262 0.66 0.34 Lee (R)
Rutherford 7-1 3959 2416 1543 0.61 0.39 Lee (R)
Rutherford 6-2 3357 2409 948 0.72 0.28 Lee (R)
Rutherford 3-2 3341 2312 1029 0.69 0.31 Lee (R)
Rutherford 13-1 3119 1836 1283 0.59 0.41 Lee (R)
Rutherford 20-1 3084 1782 1302 0.58 0.42 Lee (R)
Rutherford 15-1 2804 1716 1088 0.61 0.39 Lee (R)
Rutherford 18-2 2142 1384 758 0.65 0.35 Lee (R)
Rutherford 15-2 2111 1380 731 0.65 0.35 Lee (R)
Rutherford 10-2 1952 1122 830 0.57 0.43 Lee (R)
Rutherford 5-1 2042 1080 962 0.53 0.47 Lee (R)
Rutherford 21-2 1951 1020 931 0.52 0.48 Lee (R)
Rutherford 10-1 1772 997 775 0.56 0.44 Lee (R)
Rutherford 12-1 1613 946 667 0.59 0.41 Lee (R)
Rutherford 1-1 1911 909 1002 0.48 0.52 Martin (D)
Rutherford 6-1 1082 840 242 0.78 0.22 Lee (R)
Rutherford 11-2 1463 835 628 0.57 0.43 Lee (R)
Rutherford 16-2 1397 813 584 0.58 0.42 Lee (R)
Rutherford 2-2 1104 793 311 0.72 0.28 Lee (R)
Rutherford 17-1 1504 705 799 0.47 0.53 Martin (D)
Rutherford 11-1 1197 672 525 0.56 0.44 Lee (R)
Rutherford 12-2 914 654 260 0.72 0.28 Lee (R)
Rutherford 21-1 903 501 402 0.55 0.45 Lee (R)
Rutherford 16-1 1017 406 611 0.40 0.60 Martin (D)
Rutherford 18-1 680 207 473 0.30 0.70 Martin (D)
Rutherford 14-1 338 187 151 0.55 0.45 Lee (R)
Rutherford 3-1 192 157 35 0.82 0.18 Lee (R)
Rutherford 4-2 199 133 66 0.67 0.33 Lee (R)
County results
Lee Martin Total Pct_Lee Pct_Martin
46575 26948 73523 0.63 0.37

Filtering for multiple counties

You might wonder how to filter for two or more counties at the same time, rather than just for one county at a time. It’s pretty easy.

Suppose, for example, you want data for Rutherford, Davidson, and Williamson counties. Find the script’s code block that filters for Rutherford County:

MyData <- RawElectionData %>%
  filter(COUNTY == "Rutherford",
         CANDGROUP == "1") %>%

… and change it to this:

MyData <- RawElectionData %>%
  filter(COUNTY == "Rutherford" |
           COUNTY == "Davidson" |
           COUNTY == "Williamson",
         CANDGROUP == "1") %>%

… then run the script. You’ll find that the MyData data frame contains data for Rutherford, Davidson, and Williamson counties. You could add as many Tennessee counties as you wanted. All you have to do is be sure you follow the code pattern shown.

To make things a little neater in the table, you might also want to sort the data hierarchically - that is, first by county, then by total votes for Lee. Changing:

  arrange(desc(Lee)) %>%

… to …

  arrange(COUNTY, desc(Lee)) %>%

… would do the trick.