As usual, we will do setup. This time I’ll go as far as to scrape the Alamance County, North Carolina voter file from the web and load it into R, as we did in a previous class.
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
working.dir <- "D:/DropBox/Dropbox/Election Science/Class Data Transformation"
url <- "https://s3.amazonaws.com/dl.ncsbe.gov/data/ncvoter1.zip"
dest.file <- paste(working.dir, "/ncvoter1.zip", sep="")
download.file(url, dest.file, method="wininet")
unzip(dest.file, exdir = working.dir, overwrite = T)
voter.file <- paste(working.dir, "/ncvoter1.txt", sep="")
voters.NC <- read_tsv(voter.file, col_names = T)
## Parsed with column specification:
## cols(
## .default = col_character(),
## county_id = col_double(),
## absent_ind = col_logical(),
## name_prefx_cd = col_logical(),
## zip_code = col_double(),
## mail_addr3 = col_logical(),
## mail_addr4 = col_logical(),
## birth_age = col_double(),
## ward_abbrv = col_logical(),
## ward_desc = col_logical(),
## nc_senate_abbrv = col_double(),
## county_commiss_abbrv = col_logical(),
## county_commiss_desc = col_logical(),
## township_abbrv = col_logical(),
## township_desc = col_logical(),
## school_dist_abbrv = col_logical(),
## school_dist_desc = col_logical(),
## fire_dist_abbrv = col_logical(),
## fire_dist_desc = col_logical(),
## water_dist_abbrv = col_logical(),
## water_dist_desc = col_logical()
## # ... with 10 more columns
## )
## See spec(...) for full column specifications.
## Warning: 15 parsing failures.
## row col expected actual file
## 4010 mail_addr3 1/0/T/F/TRUE/FALSE CH-2900 PORRENTRUY 'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
## 4010 mail_addr4 1/0/T/F/TRUE/FALSE SWITZERLAND 'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
## 5652 mail_addr4 1/0/T/F/TRUE/FALSE 28213 BREMEN, GERMANY 'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
## 22547 mail_addr3 1/0/T/F/TRUE/FALSE UKYO-KU, KYOTO, JAPAN 'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
## 22547 mail_addr4 1/0/T/F/TRUE/FALSE 616-8184 'D:/DropBox/Dropbox/Election Science/Class Data Transformation/ncvoter1.txt'
## ..... .......... .................. ..................... ............................................................................
## See problems(...) for more details.
There are five functions of an R package called dplyr that we will cover today. dplyr is included in the tidyverse package.
These five functions are:
filter()
arrange()
select()
mutate()
summarise()
Filter()
Filtering is selecting observations – or rows – from a data frame based on a logical expression, which compares two values and returns a TRUE or FALSE.
Since logical expressions are common to programming languages, let’s first understand how they work.
Frequently, you might want to check whether two values are the same. Most programming languages use a double equals sign (==) to ask the logical question of whether value A equals value B, rather than a single equals sign.
To illustrate, let’s consider active and inactive registered voters. Active voters are those registered voters who either recently registered or re-registered, or had some other contact with election officials – including voting. Conversely, “inactive” voters are those who have not had recent contact with election officials or voted in a recent election (at least two general federal elections according to the National Voter Registration Act of 1993, although states and localities may allow voters to skip more elections).
For the discussion of logical expressions, I create an R object to identify “active” voters.
Active <- "ACTIVE"
Let’s see what happens when I invoke the logical expression Active == "ACTIVE"
Active == "ACTIVE"
## [1] TRUE
R returns TRUE from this logical expression. This is sometimes called a boolean expression, in that R will only return a TRUE or FALSE.
Let’s see what happens when I invoke the logical expression Active == "active"
Active == "active"
## [1] FALSE
Now, R returns FALSE because the string "active" does not exactly match the contents of the R object Active, which is all capitals.
Here are some common logical operators:
Operator | Function |
---|---|
== | Equal |
!= | Not Equal |
> | Greater Than |
< | Less Than |
>= | Greater Than or Equal |
<= | Less Than or Equal |
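For instance, here is a quick sketch of a few of these operators at work, using the Active object created above and some plain numbers:
Active != "INACTIVE" # TRUE: the strings differ
18 >= 19 # FALSE
18 <= 19 # TRUE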
Filter() returns a data frame containing only the rows for which the logical expressions passed to it are TRUE.
The variable that contains the status of the voter is called voter_status_desc. The following creates a new R object, voters_active.NC, that contains the records of all active voters in the voters.NC R object.
voters_active.NC <- filter(voters.NC, voter_status_desc == "ACTIVE")
One of the interesting features of North Carolina’s voter registration file is that a record exists of every person who has ever been registered to vote in the state, including all people who have ever been removed. The reason for the removal is recorded in the variable voter_status_reason_desc. I can thus identify all the registered voters whose voter registrations have been removed because they had felony convictions, thusly:
voters_removed_felons.NC <- filter(voters.NC, voter_status_desc == "REMOVED", voter_status_reason_desc == "FELONY CONVICTION")
This statement is implicitly performing an AND logical operation. Both voter_status_desc == "REMOVED" and voter_status_reason_desc == "FELONY CONVICTION" must be TRUE for filter() to pass the record on to voters_removed_felons.NC.
You can also combine logical expressions with the AND operator & and the OR operator |. But you have to be careful, since the syntax sometimes doesn’t follow what you might expect from normal English.
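For example, the felony filter above could also be written with an explicit & instead of a comma; this is a sketch, and the result should be identical:
# Explicit AND: both conditions must be TRUE for a record to be kept
voters_removed_felons.NC <- filter(voters.NC, voter_status_desc == "REMOVED" & voter_status_reason_desc == "FELONY CONVICTION")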
Suppose I wanted to identify everyone who is age 18 or 19. This, perhaps surprisingly, returns all the records, not just those of people age 18 or 19.
voters_age.NC <- filter(voters.NC, birth_age == 18|19)
Why? Because == is evaluated before |, R reads this expression as (birth_age == 18) | 19. The lone 19 is treated as TRUE, so the whole expression is TRUE for every record, and every record passes the filter.
This gives us what we want:
voters_age.NC <- filter(voters.NC, birth_age == 18 | birth_age == 19)
I can also modify this statement to find all voters who are not age 18 or 19, using !, which is essentially a not operator, applied to the logical expression within the parentheses (birth_age == 18 | birth_age == 19).
voters_age.NC <- filter(voters.NC, !(birth_age == 18 | birth_age == 19))
Again, you need to be careful! It is all too easy to do something unexpected, especially when you write complicated logical expressions.
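One way to keep expressions like this readable is the %in% operator, which checks whether a value matches any element of a vector. A sketch of the age example (voters_not_age.NC is just an illustrative name):
# Equivalent to birth_age == 18 | birth_age == 19
voters_age.NC <- filter(voters.NC, birth_age %in% c(18, 19))
# And its negation
voters_not_age.NC <- filter(voters.NC, !(birth_age %in% c(18, 19)))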
NA
Sometimes a dataset will have explicit missing values for data that should be ignored. This is particularly true with surveys, where a variable might have a missing value like -1 or -9.
In R, missing data is represented by NA, or “not available”.
A logical operation on a missing value returns a missing value, such that even:
NA == NA
## [1] NA
returns NA, not the TRUE one might expect.
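If you want to test whether a value is missing, use is.na() rather than ==. A sketch, again using birth_age (the object names here are just for illustration):
is.na(NA) # TRUE
# Records with a missing age, and records with a recorded age
voters_no_age.NC <- filter(voters.NC, is.na(birth_age))
voters_with_age.NC <- filter(voters.NC, !is.na(birth_age))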
Arrange()
If you want to sort data, instead of selecting data, use arrange(). If I want to see the removed felons in ascending order of their age:
voters_removed_felons.NC <- arrange(voters_removed_felons.NC, birth_age)
If I wanted to see the removed felons in descending order of their age, I would use desc().
voters_removed_felons.NC <- arrange(voters_removed_felons.NC, desc(birth_age))
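arrange() also accepts multiple columns: it sorts by the first one and breaks ties with the ones that follow. A sketch (voters_sorted.NC is just an illustrative name):
# Sort by registration status first, then by age within each status
voters_sorted.NC <- arrange(voters.NC, voter_status_desc, birth_age)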
Missing values (NA) always appear sorted at the end, regardless of whether the data are sorted in ascending or descending order.
In most cases, you won’t need to sort large data frames. Sorting comes in handy when you want to order data for display in a table or plot.
Select()
Voter files are often very large, with many variables. One technique to pare down a large data frame is to select columns to work with and remove the rest. A better way is to read in only the columns of data that you want to work with in the first place - and we will cover this later. However, there are times when you wish to read in a full file for exploratory analysis and then drop variables you don’t need later.
Select() works as you might imagine. Identify a data frame and the variables you want, and select() will pass back a pared-down data frame.
voters_temp.NC <- select(voters_removed_felons.NC, birth_age)
The columns of a data frame have a fixed order, and you can use this feature, together with :, to select a range of variables. For example, this will select all the name fields.
voters_temp.NC <- select(voters_removed_felons.NC, name_prefx_cd:name_suffix_lbl)
Given the discussion about logical expressions, you might expect to use ! to select all but the names, but R doesn’t work that way. The way to do that trick is to use a minus sign, thusly:
voters_temp.NC <- select(voters_removed_felons.NC, -(name_prefx_cd:name_suffix_lbl))
There are more options. Look them up if you want, but for most use cases, you won’t need them.
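For instance, select() understands helper functions such as starts_with(), ends_with(), and contains(). A sketch using the mailing-address columns in this file:
# Keep every column whose name starts with "mail_"
voters_temp.NC <- select(voters.NC, starts_with("mail_"))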
Rename()
Perhaps the data file you read in has an obscure variable name in its header, and you want to use a name that is easier to remember. You can rename variables with rename(). The syntax is new_name = old_name.
voters_temp.NC <- rename(voters_removed_felons.NC, name_prefex = name_prefx_cd)
Mutate()
Perhaps one of the most frequent tasks you will need to do is creating a new variable from a calculation or transformation of other variables. The way to do this in the tidyverse is with mutate(). As with all programming, there is often more than one way to skin a cat. With new variable creation this is particularly true, and many R programmers use a different approach that existed before the tidyverse package was created. We’ll cover both approaches.
Suppose I want to figure out the year someone was born (assuming we are in the year 2019). The following will create a new variable, birth_year, with the calculation applied to every record.
voters_removed_felons.NC <- mutate(voters_removed_felons.NC, birth_year = 2019 - birth_age)
This will do the same thing, but without the tidyverse:
voters_removed_felons.NC$birth_year <- 2019 - voters_removed_felons.NC$birth_age
For this approach, note that the $ identifies a named column in the data frame. It is also possible to directly access specific columns and rows in a data frame, and perform calculations on them, but we will set that aside for now.
Which is the correct way? Both approaches are valid. mutate() has some nice features, as we will see, but many people use the legacy method since it is native to R and does not require loading the tidyverse package. Plus, many existing programs were written before the tidyverse existed.
If you want to keep only the columns you are creating, you can do that, too, with transmute():
voters_temp.NC <- transmute(voters_removed_felons.NC, birth_year = 2019 - birth_age)
R has the typical arithmetic operators like +, -, and /. If you aren’t familiar with programming, multiplication is * and raising to a power is ^.
Two other useful operators are %/%, which returns the integer part of a division, and %%, which returns the remainder, thusly:
104%/%100
## [1] 1
104%%100
## [1] 4
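These operators work inside mutate() as well. As a sketch, here I recreate birth_year from above and add a made-up birth_decade variable; note that mutate() lets you refer to a column created earlier in the same call:
voters_removed_felons.NC <- mutate(voters_removed_felons.NC,
  birth_year = 2019 - birth_age,
  birth_decade = birth_year - birth_year %% 10) # e.g., 1987 becomes 1980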
Other mathematical functions supported in mutate() (which comes from the dplyr package) include logs, lags, summary statistics like sum() and mean(), and so on.
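For example, a sketch of using a summary statistic inside mutate(), here expressing each voter’s age relative to the overall mean (age_diff is just an illustrative name):
voters_temp.NC <- mutate(voters.NC, age_diff = birth_age - mean(birth_age, na.rm = T))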
Summarize()
Suppose I want to find the mean age. I can do this with summarize().
summarize(voters_removed_felons.NC, mean(birth_age, na.rm = T))
## # A tibble: 1 x 1
## `mean(birth_age, na.rm = T)`
## <dbl>
## 1 41.5
summarize(voters.NC, mean(birth_age, na.rm = T))
## # A tibble: 1 x 1
## `mean(birth_age, na.rm = T)`
## <dbl>
## 1 51.6
This returns the mean age of the registered voters who were removed for being felons, and all voters. We can see that the felons are on average younger by about 10 years.
Why do I use na.rm = T? Recall that any calculation that encounters an NA will result in an NA. If even one person has a missing birth_age, then summarize() will return a mean of NA. Sometimes, though, it is important to check your data for missing values, since there could have been an error reading in the data or at some other point.
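A quick sketch of such a check, counting missing ages alongside the mean (the column names are just for illustration):
summarize(voters.NC, mean_age = mean(birth_age, na.rm = T), n_missing_age = sum(is.na(birth_age)))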
summarize() is more powerful when combined with group_by(), which lets us compute the same information for every group:
voters_by_reason.NC <- group_by(voters.NC, voter_status_desc, voter_status_reason_desc)
summarize(voters_by_reason.NC, voters = n(), mean_age = mean(birth_age, na.rm = T))
## # A tibble: 24 x 4
## # Groups: voter_status_desc [5]
## voter_status_desc voter_status_reason_desc voters mean_age
## <chr> <chr> <int> <dbl>
## 1 ACTIVE CONFIRMATION NOT RETURNED 10 39.1
## 2 ACTIVE CONFIRMATION PENDING 852 42.8
## 3 ACTIVE CONFIRMATION RETURNED UNDELIVERABLE 6 37.3
## 4 ACTIVE UNVERIFIED 111 37.3
## 5 ACTIVE UNVERIFIED NEW 360 26.2
## 6 ACTIVE VERIFICATION PENDING 841 34.5
## 7 ACTIVE VERIFIED 85427 51.3
## 8 DENIED MOVED FROM STATE 1 36
## 9 DENIED UNAVAILABLE ESSENTIAL INFORMATION 342 51.9
## 10 DENIED VERIFICATION RETURNED UNDELIVERABLE 1637 38.1
## # ... with 14 more rows
Notice that unlike the prior command, I needed to provide a variable name for the mean. This is because the output from a grouped data frame returns the variables that the data frame has been grouped by, and the new data frame created by summarize() needs variable names to associate with the calculations you wish to perform.
If at some point you want to ungroup your data, use ungroup().
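A minimal sketch:
# Drop the grouping so later operations treat the data frame as a whole
voters_by_reason.NC <- ungroup(voters_by_reason.NC)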
Frequency tables are so useful that dplyr has a routine to produce them. So, if you want a quick frequency table, you can use count().
count(voters.NC, voter_status_reason_desc)
## # A tibble: 21 x 2
## voter_status_reason_desc n
## <chr> <int>
## 1 ADMINISTRATIVE 23
## 2 CONFIRMATION NOT RETURNED 8287
## 3 CONFIRMATION PENDING 852
## 4 CONFIRMATION RETURNED UNDELIVERABLE 3634
## 5 DECEASED 6938
## 6 DUPLICATE 2
## 7 FELONY CONVICTION 341
## 8 MILITARY 9
## 9 MOVED FROM COUNTY 1103
## 10 MOVED FROM STATE 1204
## # ... with 11 more rows
Now I can plot these data out, too!
voters_by_reason_stats.NC <- summarize(voters_by_reason.NC, voters = n(), mean_age = mean(birth_age, na.rm = T))
ggplot(data = voters_by_reason_stats.NC) +
geom_bar(aes(x = voter_status_reason_desc, y = voters), stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Which, of course, is nearly impossible to read because of the terrible labels.
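One quick fix, as a sketch, is to flip the axes with coord_flip() so the long category labels run down the side of the plot:
ggplot(data = voters_by_reason_stats.NC) +
  geom_bar(aes(x = voter_status_reason_desc, y = voters), stat="identity") +
  coord_flip()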
Okay, suppose you want to show how much of a programming pro you are by making your code sleeker and more compact (and harder to read, but that’s beside the point of showing off). You can do this by piping data using %>%.
voters_by_reason.NC <- voters.NC %>%
group_by(voter_status_desc, voter_status_reason_desc) %>%
summarize(voters = n(), mean_age = mean(birth_age, na.rm = T))
voters_by_reason.NC
## # A tibble: 24 x 4
## # Groups: voter_status_desc [5]
## voter_status_desc voter_status_reason_desc voters mean_age
## <chr> <chr> <int> <dbl>
## 1 ACTIVE CONFIRMATION NOT RETURNED 10 39.1
## 2 ACTIVE CONFIRMATION PENDING 852 42.8
## 3 ACTIVE CONFIRMATION RETURNED UNDELIVERABLE 6 37.3
## 4 ACTIVE UNVERIFIED 111 37.3
## 5 ACTIVE UNVERIFIED NEW 360 26.2
## 6 ACTIVE VERIFICATION PENDING 841 34.5
## 7 ACTIVE VERIFIED 85427 51.3
## 8 DENIED MOVED FROM STATE 1 36
## 9 DENIED UNAVAILABLE ESSENTIAL INFORMATION 342 51.9
## 10 DENIED VERIFICATION RETURNED UNDELIVERABLE 1637 38.1
## # ... with 14 more rows
Basically, I’m telling R that I want to perform a sequence of actions on voters.NC and deposit the results in the R object voters_by_reason.NC. By piping the data, I don’t need to restate the data frame that I am working with, since the data frame is passed along to each of the intermediate operations.
I generally don’t like piping of this sort since I like to be able to check my code for errors at each step. Piping makes error checking of intermediate steps more difficult. But, to each their own. Many programmers like making their code more compact. Another benefit in this case is that I don’t need to create intermediate R objects that can take up memory.
Here are some other useful summary functions:
Function | What It Does |
---|---|
n() | Count of the observations |
sum() | Sum |
sum(is.na(x)) | Number of missing values in x |
mean() | Mean |
median() | Median |
sd() | Standard deviation |
min(), max() | Minimum and maximum |
first(), last() | Value for first and last rows |
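As a sketch, several of these can be combined in one piped summarize() call (the column names here are just for illustration):
voters.NC %>%
  group_by(voter_status_desc) %>%
  summarize(voters = n(),
    median_age = median(birth_age, na.rm = T),
    youngest = min(birth_age, na.rm = T),
    oldest = max(birth_age, na.rm = T))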