Data Transformations Using the North Carolina Voter File

As usual, we will do setup. This time I’ll go as far as to scrape the Alamance County, North Carolina voter file from the web and load it into R, as we did in a previous class.

library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
working.dir <- "D:/DropBox/Dropbox/Election Science/Class Data Transformation"
url <- "https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvoter1.zip"
dest.file <- paste(working.dir, "/ncvoter1.zip", sep="")

download.file(url, dest.file, method="wininet")

unzip(dest.file, exdir = working.dir, overwrite = T)

voter.file <- paste(working.dir, "/ncvoter1.txt", sep="")

voters.NC <- read_tsv(voter.file, col_names = T)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   county_id = col_double(),
##   absent_ind = col_logical(),
##   name_prefx_cd = col_logical(),
##   zip_code = col_double(),
##   mail_addr3 = col_logical(),
##   mail_addr4 = col_logical(),
##   birth_age = col_double(),
##   ward_abbrv = col_logical(),
##   ward_desc = col_logical(),
##   nc_senate_abbrv = col_double(),
##   county_commiss_abbrv = col_logical(),
##   county_commiss_desc = col_logical(),
##   township_abbrv = col_logical(),
##   township_desc = col_logical(),
##   school_dist_abbrv = col_logical(),
##   school_dist_desc = col_logical(),
##   fire_dist_abbrv = col_logical(),
##   fire_dist_desc = col_logical(),
##   water_dist_abbrv = col_logical(),
##   water_dist_desc = col_logical()
##   # ... with 10 more columns
## )
## See spec(...) for full column specifications.
## Warning: 15 parsing failures.
##   row        col           expected                actual                                                                         file
##  4010 mail_addr3 1/0/T/F/TRUE/FALSE CH-2900 PORRENTRUY    'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
##  4010 mail_addr4 1/0/T/F/TRUE/FALSE SWITZERLAND           'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
##  5652 mail_addr4 1/0/T/F/TRUE/FALSE 28213 BREMEN, GERMANY 'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
## 22547 mail_addr3 1/0/T/F/TRUE/FALSE UKYO-KU, KYOTO, JAPAN 'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
## 22547 mail_addr4 1/0/T/F/TRUE/FALSE 616-8184              'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
## ..... .......... .................. ..................... ............................................................................
## See problems(...) for more details.

Data Transformation Basics

There are five functions of a R package called dplyr that we will cover today. dplyr is included in the tidyverse package.

These five functions are:

Selecting Observations with filter()

Filtering is selecting observations – or rows – from a data frame based on a logical expression, which compares two values and returns a TRUE or FALSE.

Since logical expressions are common to programming languages, let’s first understand how they work.

Frequently, you might want to check if values are the same. Most programming languages use double equal signs to ask the logical question if value A equals value B, rather than a single equals sign.

To illustrate, let’s consider active and inactive registered voters. Active voters those registered voters who either recently registered or re-registered, or had some other contact with election officials – including voting. Conversely, “inactive” voters are those who have not had a recent contact with election officials or voted in a recent election (at least two general federal elections according to the National Voter Registration Act of 1993, although states and localities may allow voters to skip more elections).

For the discussion of logical expressions, I create an R object to identify “active”" voters.

Active <- "ACTIVE"

Let’s see what happens when I invoke the logical expression Active == "ACTIVE"

Active == "ACTIVE"
## [1] TRUE

R returns TRUE from this logical expression. This is sometimes known as a boolean operator in that R will only return a TRUE or FALSE.

Let’s see what happens when I invoke the logical expression Active == "active"

Active == "active"
## [1] FALSE

Now, R returns FALSE because the contents of the R object Active does not exactly match the string stored in Active, which is all capitals.

Here are some common logical expressions

Operator Function
== Equal
!= Not Equal
> Greater Than
< Less Than
>= Greater Than or Equal
<= Less Than or Equal

Filter() returns an R object that filters based on the logical expressions passed to it.

The variable that contains the status of the voter is called voter_status_desc. This creates a new R object voters_active.NC that contains records of all active voters in the voters.NC R object.

voters_active.NC <- filter(voters.NC, voter_status_desc == "ACTIVE")

One of the interesting features of North Carolina’s voter registration file is that a record exists of every person who has ever been registered to vote in the state, including all people who have ever been removed. The reason for the removal is recorded in the variable voter_status_reason_desc. I can thus identify all the registered voters whose voter registrations have been removed because they had felony convictions, thusly:

voters_removed_felons.NC <- filter(voters.NC, voter_status_desc == "REMOVED", voter_status_reason_desc == "FELONY CONVICTION")

This statement is implictly performing an AND logical operation. Both voter_status_desc == "REMOVED" and voter_status_reason_desc == "FELONY CONVICTION" must be TRUE for filter to return a TRUE and pass on the record to voters_removed_felons.NC.

You can also combine logical expressions with the AND operator & and the OR operator |. But you have to be careful since the syntax sometimes doesn’t follow what you might expect from normal English.

Suppose I wanted to identify everyone who is age 18 or 19. This (might) surprisingly return all the records, not just those age 18 or 19.

voters_age.NC <- filter(voters.NC, birth_age == 18|19)

Why? R first evaluates the statement 18|19, which is a logical expression that returns TRUE, and then R looks to see if birth_age equals TRUE, or if data exists.

This gives us what we want:

voters_age.NC <- filter(voters.NC, birth_age == 18 | birth_age == 19)

I can also modify this statement to find all voters who are not age 18 or 19, using !, which is essentially a not operator, which is applied to the logical expression found in within the parentheses (birth_age == 18 | birth_age == 19).

voters_age.NC <- filter(voters.NC, !(birth_age == 18 | birth_age == 19))

Again, you need to be careful! It is all too easy to do something unexpected, especially when you write complicated logical expressions.

Missing Values NA

Sometimes a dataset will have explicit missing values for data that should be ignored. This is particularly true with surveys, where a variable might be have a missing value like -1 or -9.

In R missing data is represented by NA or “not available”.

An logical operation on a missing value returns a missing value, even such that:

NA == NA
## [1] NA

Returns NA, not TRUE, as one might expect.

Arrange()

If you want to sort data, instead of selecting data, use arrange(). If I want to see the age of the removed felons in ascending order of their age:

voters_removed_felons.NC <- arrange(voters_removed_felons.NC, birth_age)

If I wanted to see the removed felons in descending order of their age, use desc().

voters_removed_felons.NC <- arrange(voters_removed_felons.NC, desc(birth_age))

Missing values NA always appear sorted at the end, regardless if the data are sorted by ascending or descending.

In most cases, you won’t need to sort large data frames. This comes in handy sometimes if you want to order data for display in a table or graphic plot.

Dropping Columns with Select()

Voter files are often very large, with many variables. One technique to pare down a large data frame is to select columns to work with and remove the rest. A better way is to read in only the columns of data that you want to work with in the first place - and we will cover this later. However, there are times when you wish to read in a full file for exploratory analysis and then drop variables you don’t need later.

Select() works as you might imagine. Identify a data frame and the variables you want, and select() will pass back a pared-down data frame.

voters_temp.NC <- select(voters_removed_felons.NC, birth_age)

The columns of a data frame are in an order, and you can use this feature to select multiple variables. For example, this will select all the name fields.

voters_temp.NC <- select(voters_removed_felons.NC, name_prefx_cd:name_suffix_lbl)

Given the discussion about logical expressions, you might expect to use ! to select all but the names, but R doesn’t work that way. The way to do that trick is to use a minus sign, thusly:

voters_temp.NC <- select(voters_removed_felons.NC, -(name_prefx_cd:name_suffix_lbl))

There are more options. Look them up if you want, but for most use cases, you won’t need them.

Renaming a Variable with Rename()

Perhaps you have a very obtuse variable name in the header field of the data file you read in and you want to use an easier name to remember. You can rename variable names with rename(). The syntax is new name = old name.

voters_temp.NC <- rename(voters_removed_felons.NC, name_prefex = name_prefx_cd)

Adding New Variables with Mutate()

Perhaps one of the most frequent tasks that you will need to do is creating a new variable from a calculation or transformation of other variables. The way to do this in the tidyverse is with mutate(). As with all programming, there is often more than one way to skin a cat. With new variable creation this is particularly true, and many R programmers use a different approach that existed before the tidyverse module was created. We’ll cover both approaches.

Suppose I want to figure out the year someone was born (assuming we are in the year 2019). The following with create a new variable birth_year with the calculation for every record.

voters_removed_felons.NC <- mutate(voters_removed_felons.NC, birth_year = 2019 - birth_age)

This will do the same thing, but without the tidyverse

voters_removed_felons.NC$birth_year <- 2019 - voters_removed_felons.NC$birth_age

For this approach, note that the $ identifies a named column in the data frame. It is also possible to directly access specific columns and rows in a data frame, and perform calculations on them, but we will set that aside for now.

Which is the correct way? Both approaches are valid. Mutate has some nice features, as we will see, but many people use the legacy method since it is native in R and does not require loading the tidyverse package. Plus, many existing programs were written before the tidyverse existed.

If you want to keep only the columns you are mutating, you can do that, too, with transmute()

voters_temp.NC$birth_year <- transmute(voters_removed_felons.NC, birth_year = 2019 - birth_age)

R has the typical arithmetic operators like +, -, and /. If you aren’t familar with programming, multiplication is * and raising to a power is ^.

Two other useful operators are %/% which returns the integer from a division, and %% with returns the fraction, thusly:

104%/%100
## [1] 1
104%%100
## [1] 4

Other mathematical functions supported in mutate() (and its dependency module dplyr) include logs, lags, summary statistics like sum() and mean(), and so on.

Summarizing data with Summarize()

Suppose I want to find the mean age. I can do with with summarize().

summarize(voters_removed_felons.NC, mean(birth_age, na.rm = T))
## # A tibble: 1 x 1
##   `mean(birth_age, na.rm = T)`
##                          <dbl>
## 1                         41.5
summarize(voters.NC, mean(birth_age, na.rm = T))
## # A tibble: 1 x 1
##   `mean(birth_age, na.rm = T)`
##                          <dbl>
## 1                         51.6

This returns the mean age of the registered voters who were removed for being felons, and all voters. We can see that the felons are on average younger by about 10 years.

Why do I use na.rm = T? Recall that any calculation that encounters a NA will result in an NA. If I have just one person with a missing birth_age then summarize will return a mean of NA. Sometimes, though, it is important to check your data for missing values since there could have been an error reading in the data or at some other point.

Summarize is more powerful with combined with group_by which can get the same info for everyone

voters_by_reason.NC <- group_by(voters.NC, voter_status_desc, voter_status_reason_desc)
summarize(voters_by_reason.NC, voters = n(), mean_age = mean(birth_age, na.rm = T))
## # A tibble: 24 x 4
## # Groups:   voter_status_desc [5]
##    voter_status_desc voter_status_reason_desc            voters mean_age
##    <chr>             <chr>                                <int>    <dbl>
##  1 ACTIVE            CONFIRMATION NOT RETURNED               10     39.1
##  2 ACTIVE            CONFIRMATION PENDING                   852     42.8
##  3 ACTIVE            CONFIRMATION RETURNED UNDELIVERABLE      6     37.3
##  4 ACTIVE            UNVERIFIED                             111     37.3
##  5 ACTIVE            UNVERIFIED NEW                         360     26.2
##  6 ACTIVE            VERIFICATION PENDING                   841     34.5
##  7 ACTIVE            VERIFIED                             85427     51.3
##  8 DENIED            MOVED FROM STATE                         1     36  
##  9 DENIED            UNAVAILABLE ESSENTIAL INFORMATION      342     51.9
## 10 DENIED            VERIFICATION RETURNED UNDELIVERABLE   1637     38.1
## # ... with 14 more rows

Notice that unlike the prior command, I needed to provide a variable name for the mean. This is because the output from a grouped data frame returns the variables that the data frame has been grouped by, and the new data frame created by summarize needs variable names to associate with the calculations you wish to perform.

If at some point you want to ungroup your data, use ungroup().

Frequency tables are so useful, dplyr has a routine to produce them. so, if you want a quick frequency table you can use count.

count(voters.NC, voter_status_reason_desc)
## # A tibble: 21 x 2
##    voter_status_reason_desc                n
##    <chr>                               <int>
##  1 ADMINISTRATIVE                         23
##  2 CONFIRMATION NOT RETURNED            8287
##  3 CONFIRMATION PENDING                  852
##  4 CONFIRMATION RETURNED UNDELIVERABLE  3634
##  5 DECEASED                             6938
##  6 DUPLICATE                               2
##  7 FELONY CONVICTION                     341
##  8 MILITARY                                9
##  9 MOVED FROM COUNTY                    1103
## 10 MOVED FROM STATE                     1204
## # ... with 11 more rows

Now I can plot these data out, too!

voters_by_reason_stats.NC <- summarize(voters_by_reason.NC, voters = n(), mean_age = mean(birth_age, na.rm = T))
ggplot(data = voters_by_reason_stats.NC) +
 geom_bar(aes(x = voter_status_reason_desc, y = voters), stat="identity") +
 theme(axis.text.x = element_text(angle = 90, hjust = 1))

Which, of course, is nearly impossible to read because of the terrible labels.

Multiple Operations With Pipe

Okay, suppose you want to show how much of a programming pro you are by making your code sleeker and more compact (and harder to read, but that’s beside the point of showing off). You can do this by piping data using %>%.

voters_by_reason.NC <- voters.NC %>%
  group_by(voter_status_desc, voter_status_reason_desc) %>%
  summarize(voters = n(), mean_age = mean(birth_age, na.rm = T))
voters_by_reason.NC
## # A tibble: 24 x 4
## # Groups:   voter_status_desc [5]
##    voter_status_desc voter_status_reason_desc            voters mean_age
##    <chr>             <chr>                                <int>    <dbl>
##  1 ACTIVE            CONFIRMATION NOT RETURNED               10     39.1
##  2 ACTIVE            CONFIRMATION PENDING                   852     42.8
##  3 ACTIVE            CONFIRMATION RETURNED UNDELIVERABLE      6     37.3
##  4 ACTIVE            UNVERIFIED                             111     37.3
##  5 ACTIVE            UNVERIFIED NEW                         360     26.2
##  6 ACTIVE            VERIFICATION PENDING                   841     34.5
##  7 ACTIVE            VERIFIED                             85427     51.3
##  8 DENIED            MOVED FROM STATE                         1     36  
##  9 DENIED            UNAVAILABLE ESSENTIAL INFORMATION      342     51.9
## 10 DENIED            VERIFICATION RETURNED UNDELIVERABLE   1637     38.1
## # ... with 14 more rows

Basically, I’m telling R that I want to do a sequential number of actions on voters.NC and deposit the results in the R object voters_by_reason.NC. By piping the data, I don’t need to restate the data frame that I am working with since the data frame is being passed to each of the intermediate operations.

I generally don’t like piping of this sort since I like we be able to check my code for errors at each step. Piping makes error checking of intermediate steps more difficult. But, to each their own. Many programmers like making their code more compact. Another benefit in this case is that I don’t need to create intermediate R objects that can take up memory.

Other Useful Summary Functions

Here are some other useful summary functions

Function What It Does
n() Count of the observations
sum() Sum
sum!(is.na(x)) Number of missing values in x
mean() Mean
median() Median
sd() Standard deviation
min(), max() Minimum and maximum
first(), last() Value for first and last rows