R Basics & Election Data Fetch

Overview

First, we’ll look at some basics about how R code works.

Then, we’ll dissect an “Election Data Fetch” script that crunches Rutherford County precinct results for the 2022 Tennessee governor’s race.

How R code works

This section borrows examples from “R programming for ABSOLUTE beginners,” an excellent introduction to R code from Dr. Greg Martin’s R Programming 101 YouTube channel.

Assigning values to variables.

5 + 6

## [1] 11

a <- 5 
b <- 6 
a + b

## [1] 11

sum(a,b)

## [1] 11

name <- c("Greg", "Gill") 
name

## [1] "Greg" "Gill"

name <- c("Greg", "Paul", "Kim") 
age <- c(47,52,34) 
gender <- c("M","M","F")

name

## [1] "Greg" "Paul" "Kim"

age

## [1] 47 52 34

gender

## [1] "M" "M" "F"

Creating a data frame using the data.frame() function.

I used the head() function to show the data frame in the output here. Martin simply clicks on the data frame in RStudio and shows it to you there.

friends <- data.frame(name, age, gender)
head(friends)

##   name age gender
## 1 Greg  47      M
## 2 Paul  52      M
## 3  Kim  34      F

An aside: One concept Martin doesn’t explain: A data frame is a grid. Horizontal sections of the grid are called “rows.” Vertical sections of the grid are called “columns.” Each data point gets stored in the intersection of a row and a column. In the friends data frame, for example, Martin’s first name, “Greg,” is stored in the intersection of the first row and the first column.
Another aside: When you are using a spreadsheet, like Excel, these intersections are called “cells.” Each has a “cell address,” like A1 for the cell at the intersection of Column A and Row 1. Most of what you do in a spreadsheet involves working with data in cells. R is designed to work with whole columns at once. It’s a little different approach. But, as you’ll see, it’s often much, much faster.
One last aside: In data journalism, it’s rare to build a data frame manually like this. Most of the time, you import data that someone else has already assembled.

Subsetting (the hard way)

Selecting various pieces of the data frame using base R code. Note: This is the hard way to do things. It comes in handy sometimes. But you’ll see an easier way in a moment.

# Show all rows in the data frame's "name" column:
friends$name

## [1] "Greg" "Paul" "Kim"

# Show all rows and columns in the data frame: 
friends[ , ]

##   name age gender
## 1 Greg  47      M
## 2 Paul  52      M
## 3  Kim  34      F

# Show all columns in the data frame's first row: 
friends[1, ]

##   name age gender
## 1 Greg  47      M

# Show the first row of the first column: 
friends[1,1]

## [1] "Greg"

# Show rows 1 through 3 of the first column (Same result as friends$name):
friends[1:3,1]

## [1] "Greg" "Paul" "Kim"

# Show column 1 of the first three rows (Same result as friends[1, ]): 
friends[1,1:3]

##   name age gender
## 1 Greg  47      M

# Show all rows in the first two columns for which age is less than 50: 
friends[friends$age<50,1:2]

##   name age
## 1 Greg  47
## 3  Kim  34

Subsetting (the easy way)

A much easier way to select pieces of the data frame: Use the tidyverse package. The tidyverse package makes many other things in R easier to do, too.

if (!require("tidyverse"))
  install.packages("tidyverse")

library(tidyverse)

friends %>%
  select(name,age) %>%
  filter(age < 50) %>%
  arrange(age)

##   name age
## 1  Kim  34
## 2 Greg  47

Election data fetch

Now, let’s look at something truly useful. This script will:

Retrieve statewide, Tennessee-precinct-level November 2022 election results from the Tennessee Secretary of State’s web site’s “Historical Election Results” page, (https://sos.tn.gov/elections/results).
Filter the data for Rutherford County, Tennessee, and the precinct-level vote counts for Republican incumbent Bill Lee and Democratic challenger Jason Martin.
Calculate the total votes cast for either Lee or Martin and the percentage of those votes case for each candidate.
Compare the vote totals and indicate which candidate, Lee or Martin, won the precinct.
Sort the data in descending order by the percentage of votes for Lee.
Show the precinct-level data as an HTML table
Aggregate the precinct-level data to produce county-level data
Show the county-level data as an HTML table

if (!require("tidyverse"))
  install.packages("tidyverse")
if (!require("gtExtras"))
  install.packages("gtExtras")

library(tidyverse)
library(readxl)
library(gtExtras)

# Download and import election data
# from TN Secretary of State web site:
# https://sos.tn.gov/elections/results

download.file(
  "https://sos-prod.tnsosgovfiles.com/s3fs-public/document/20221108AllbyPrecinct.xlsx",
  "RawElectionData.xlsx",
  quiet = TRUE,
  mode = "wb")

RawElectionData <- read_xlsx("RawElectionData.xlsx", sheet = "SOFFICELso")

# Filter, calculate, and select
# to get data of interest
# then import results in a dataframe called MyData

MyData <- RawElectionData %>%
  filter(COUNTY == "Rutherford",
         CANDGROUP == "1") %>%
  mutate(
    Lee = PVTALLY1,
    Martin = PVTALLY2,
    Total = PVTALLY1 + PVTALLY2,
    Lee_Pct = round(PVTALLY1 / (PVTALLY1 + PVTALLY2), 2),
    Martin_Pct = round(PVTALLY2 / (PVTALLY1 + PVTALLY2), 2),
    Winner = case_when(Lee > Martin ~ "Lee (R)",
                       Martin > Lee ~ "Martin (D)",
                       .default = "Tie")) %>%
  arrange(desc(Lee)) %>%
  select(COUNTY,
         PRECINCT,
         Total,
         Lee,
         Martin,
         Lee_Pct,
         Martin_Pct,
         Winner)

# Show MyData as an HTML table

gt(MyData) %>%
  tab_header("Precinct results") %>%
  cols_align(align = "left") %>%
  gt_theme_538

# Show county totals as an HTML table

MySummary <- MyData %>%
  summarize(sum(Lee), sum(Martin), sum(Total)) %>%
  rename("Lee" = "sum(Lee)",
         "Martin" = "sum(Martin)",
         "Total" = "sum(Total)") %>%
  mutate(Pct_Lee = round(Lee / Total, digits = 2)) %>%
  mutate(Pct_Martin = round(Martin / Total, digits = 2))

gt(MySummary) %>%
  tab_header("County results") %>%
  cols_align(align = "left") %>%
  gt_theme_538

Here’s the output the code produces:

COUNTY	PRECINCT	Total	Lee	Martin	Lee_Pct	Martin_Pct	Winner
Precinct results
Rutherford	8-1	5013	3778	1235	0.75	0.25	Lee (R)
Rutherford	9-1	5024	3425	1599	0.68	0.32	Lee (R)
Rutherford	19-1	4644	3046	1598	0.66	0.34	Lee (R)
Rutherford	14-2	4570	3002	1568	0.66	0.34	Lee (R)
Rutherford	2-1	3397	2647	750	0.78	0.22	Lee (R)
Rutherford	4-1	3727	2465	1262	0.66	0.34	Lee (R)
Rutherford	7-1	3959	2416	1543	0.61	0.39	Lee (R)
Rutherford	6-2	3357	2409	948	0.72	0.28	Lee (R)
Rutherford	3-2	3341	2312	1029	0.69	0.31	Lee (R)
Rutherford	13-1	3119	1836	1283	0.59	0.41	Lee (R)
Rutherford	20-1	3084	1782	1302	0.58	0.42	Lee (R)
Rutherford	15-1	2804	1716	1088	0.61	0.39	Lee (R)
Rutherford	18-2	2142	1384	758	0.65	0.35	Lee (R)
Rutherford	15-2	2111	1380	731	0.65	0.35	Lee (R)
Rutherford	10-2	1952	1122	830	0.57	0.43	Lee (R)
Rutherford	5-1	2042	1080	962	0.53	0.47	Lee (R)
Rutherford	21-2	1951	1020	931	0.52	0.48	Lee (R)
Rutherford	10-1	1772	997	775	0.56	0.44	Lee (R)
Rutherford	12-1	1613	946	667	0.59	0.41	Lee (R)
Rutherford	1-1	1911	909	1002	0.48	0.52	Martin (D)
Rutherford	6-1	1082	840	242	0.78	0.22	Lee (R)
Rutherford	11-2	1463	835	628	0.57	0.43	Lee (R)
Rutherford	16-2	1397	813	584	0.58	0.42	Lee (R)
Rutherford	2-2	1104	793	311	0.72	0.28	Lee (R)
Rutherford	17-1	1504	705	799	0.47	0.53	Martin (D)
Rutherford	11-1	1197	672	525	0.56	0.44	Lee (R)
Rutherford	12-2	914	654	260	0.72	0.28	Lee (R)
Rutherford	21-1	903	501	402	0.55	0.45	Lee (R)
Rutherford	16-1	1017	406	611	0.40	0.60	Martin (D)
Rutherford	18-1	680	207	473	0.30	0.70	Martin (D)
Rutherford	14-1	338	187	151	0.55	0.45	Lee (R)
Rutherford	3-1	192	157	35	0.82	0.18	Lee (R)
Rutherford	4-2	199	133	66	0.67	0.33	Lee (R)

Lee	Martin	Total	Pct_Lee	Pct_Martin
County results
46575	26948	73523	0.63	0.37

Filtering for multiple counties

You might wonder how to filter for two or more counties at the same time, rather than just for one county at a time. It’s pretty easy.

Suppose, for example, you want data for Rutherford, Davidson, and Williamson counties. Find the script’s code block that filters for Rutherford County:

MyData <- RawElectionData %>%
  filter(COUNTY == "Rutherford",
         CANDGROUP == "1") %>%

… and change it to this:

MyData <- RawElectionData %>%
  filter(COUNTY == "Rutherford" |
           COUNTY == "Davidson" |
           COUNTY == "Williamson",
         CANDGROUP == "1") %>%

… then run the script. You’ll find that the MyData data frame contains data for Rutherford, Davidson, and Williamson counties. You could add as many Tennessee counties as you wanted. All you have to do is be sure you follow the code pattern shown.

To make things a little neater in the table, you might also want to sort the data hierarchically - that is, first by county, then by total votes for Lee. Changing:

  arrange(desc(Lee)) %>%

… to …

  arrange(COUNTY, desc(Lee)) %>%

… would do the trick.