Module 4: Enter the Tidyverse
Workshop 4: Importing and Tidying Data
Open this Project on RStudio.Cloud!
Up until now, we’ve only used data from packages, so we could always a) load it from our packages, and b) it came in a very specific format great for data analysis - “tidy” format. More on that later.
This workshop is going to:
introduce several ways of importing data.
introduce techniques for tidying data.
This way, you can visualize it, collect summary statistics, etc. No joke - these data tidying techniques save you dozens of hours compared to doing it by hand.
Why are we using a website? So that you can practice typing in the code yourself now. I recommend opening this in one website and typing it in by yourself, if possible. Most of the code output below is hidden, so you will need to type the code below into your RStudio.Cloud session to find out what it does! It helps build muscle memory.
0. Load Packages
Please load the tidyverse packages. This will load dplyr, ggplot2, tidyr, and readr. Congratulations! Now you can just load tidyverse instead of all the others. Please also load the viridis package, a new package you will need today.
library(tidyverse)
library(viridis)
1. Importing Data
There are three main ways to import data.
Load it from a package.
Load it from a spreadsheet online.
Load it from a spreadsheet file.
1.1. Loading Data from a Package
For example, you can load it from a package like we did last week. Try to access the diamonds dataset in ggplot2 below.
library(ggplot2)
diamondsBut wouldn’t it be nice if we could see what datasets are in a package?
Try typing ggplot2:: on it’s own line, put your cursor at the end of this code, and then press TAB.
You’ll see all the functions and datasets stored in a package. You can scroll down the list and select the one you want, by clicking on it or pressing tab or enter.
ggplot2::
Learning Check 1.1
glimpse() it to see what variables are there. select() just three variables to look at. filter() by at least one variable.
1.2. Loading Data from a Spreadsheet Online
Using the readr package from the tidyverse package, load the data!
japan <- read_csv("http://bit.ly/japanese_prefectural_elections")This is a dataset of candidate election results made by Yusaku Horuichi and Ryota Natori at Dartmouth College. It looks at candidates who ran for Prefectural Assembly (like State Legislature) in every town in Japan, from 2003 - 2017.
Take a look.
japan %>% glimpse()## Rows: 31,022
## Columns: 19
## $ candidate_name <chr> "千葉英守", "藤川雅司", "小林郁子", "越智美智子", "…
## $ candidate_age <dbl> 65, 58, 67, 45, 38, 50, 39, 49, 48, 39, 59, 53, 53, …
## $ candidate_party <chr> "Liberal Democratic Party", "Democratic Party of Jap…
## $ candidate_status <chr> "incumbent", "new", "incumbent", "new", "incumbent",…
## $ candidate_rank <dbl> 1, 2, 3, 4, 1, 2, 5, 3, 4, 6, 7, 8, 3, 1, 2, 4, 5, 3…
## $ election_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015…
## $ pref_code <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ muni_code <chr> "01100", "01100", "01100", "01100", "01100", "01100"…
## $ district_code <chr> "01101", "01101", "01101", "01101", "01102", "01102"…
## $ muni_name <chr> "札幌市", "札幌市", "札幌市", "札幌市", "札幌市", "…
## $ election_district <chr> "中央区", "中央区", "中央区", "中央区", "北区", "北…
## $ district_name <chr> "中央区", "中央区", "中央区", "中央区", "北区", "北…
## $ eligible_voters <dbl> 188968, 188968, 188968, 188968, 229593, 229593, 2295…
## $ turnout <dbl> 105389, 105389, 105389, 105389, 134431, 134431, 1344…
## $ valid_votes_cast <dbl> 99375, 99375, 99375, 99375, 129638, 129638, 129638, …
## $ quota <dbl> 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3…
## $ num_candidates <dbl> 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 4…
## $ votes_received <dbl> 43610.00, 28000.00, 22729.00, 5036.00, 24158.00, 111…
## $ votes_total <dbl> 118161.00, 111890.00, 94029.00, 34568.00, 146889.00,…
Notice especially the following vectors, which have easy-to-interpret names:
"candidate_name""candidate_age""candidate_party""election_year""votes_received""valid_votes_cast"
IT’S IN JAPANESE! That’s right! You can work with data in foreign languages, and our process doesn’t really change! R IS SO NEAT!
Learning Check 1.2
filter() this dataset so that you only see candidates from elections in 2015, and arrange() the rows in terms of how many votes each candidate received.
But, that’s a lot of data, so let’s remove it from our memory for the time being.
remove(japan)
1.3. Loading data from a Spreadsheet file
Alternatively, you can provide the data.
Navigate to the Files menu (lower right hand corner), and find
"us_county_voteshare_2016.csv".Now, upload it using
read_csv()from thereadrpackage, like so:
counties <- read_csv("https://bit.ly/us_county_voteshare_2016")This contains the percentage of democrats who voted for Clinton and the percentage of Republicans who voted for Trump in 2016, for each US county. fips is a unique code per county.
countiesNow you type out the code to import it and try it out!
Learning Check 1.3
filter()!
2. Tidying Data
2.1 Identifying Long vs. Wide Data
We’ve got at least two good data.frames to work with now, but they’re not exactly in “tidy” format. Tidy format looks like this:
Let’s make a tidy data.frame of how many pets three people want (hypothetically).
dogs_longer <- data.frame(
person = c("Tim", "Tim", "Melanie", "Jon"),
type = c("corgi", "dalmatian", "corgi", "greyhound"),
puppies = c(9, 1, 2, 3)
)Now view it!
dogs_longer## person type puppies
## 1 Tim corgi 9
## 2 Tim dalmatian 1
## 3 Melanie corgi 2
## 4 Jon greyhound 3
There’s an “index” - a.k.a. a unique identifier - "person". There’s a “class” - in this case, dog "type". There’s a “value” - in this case, "puppies"!
You’ll notice that each person-type pair only occurs once, and it has a value (puppies) associated.
Think about this.
Read this again.
Read this one more time, please.
Is this the way you expect to see data?
Or do you expect data to look like this?
dogs_wider <- data.frame(
person = c("Tim", "Melanie", "Jon"),
corgi = c(9, 2, 0),
dalmatian = c(1, 0, 0),
greyhound = c(0, 0, 3))Now view it!
dogs_wider## person corgi dalmatian greyhound
## 1 Tim 9 1 0
## 2 Melanie 2 0 0
## 3 Jon 0 0 3
These are equivalent representations of the same data.
The first, dogs_longer puts it in a “tidy”, “long” list. The second, dogs_wider puts it in a “wide” matrix.
Why would any of this matter? Well, it’s really easy to do calculations or visualization on tidy data, but not matrix data. For example, count the total puppies each person has:
Using the dataframe dogs_longer:
dogs_longer %>%
# For each person
group_by(person) %>%
# Calculate the total number of puppies
summarize(total_puppies = sum(puppies))Using the dataframe dogs_wider:
# We can get the number of dogs that Tim wants (10)
dogs_wider %>%
filter(person == "Tim") %>%
select(corgi) +
# the number of dalmatians Tim wants
dogs_wider %>%
filter(person == "Tim") %>%
select(dalmatian) +
# and the number of greyhounds Tim wants
dogs_wider %>%
filter(person == "Tim") %>%
select(greyhound) But we haven’t even gotten the totals for Melanie or Jon yet! Using wide data takes so much time and extra code. Using “tidy” long-format data is way more efficient.
Learning Check 2.1:
Does “tidy data” refer to long or wide data? Why? Check in with your group-mates and make up a hypothetical dataset. Sketch out what a long version would look like. Sketch out what a wide version would look like.
2.2 Pivoting Data
The tidyr package contains two mindblowing functions that convert between wide and long data.
pivot_wider()pivot_longer()
While not earth-shattering, they save me hours every day and increase your productivity like nobody’s business.
Examine these two chunks of code that use pivot_wider().
dogs_longer %>%
# Using pivot_wider
pivot_wider(
id_cols = c(person), # make values in vector "person" the rows
names_from = type, # make values in vector "type" the columns
values_from = puppies) # make values from vector "puppies" the cell valuesNow try this instead.
dogs_longer %>%
pivot_wider(
id_cols = c(person),
names_from = type,
values_from = puppies,
values_fill = list(puppies = 0))Learning Check 2.2:
Next, please examine the following three code chunks that use pivot_longer().
Code Chunk A:
dogs_longerCode Chunk B:
dogs_wider %>%
pivot_longer(
cols = c(corgi, dalmatian, greyhound), # pivot all columns inside c()
names_to = "type", # stick column names in a column called type
values_to = "puppies") # stick values in a column called puppiesCode Chunk C:
dogs_wider %>%
pivot_longer(
cols = c(corgi, dalmatian, greyhound),
names_to = "type",
values_to = "puppies") %>%
filter(puppies != 0)Learning Check 2.3:
dogs_longer (Code Chunk A). What is it? What did we do to fix it? Why did it work?
3. County Elections Data
The counties dataset we used before can tell us what percentage of county residents supported a party’s candidate for president in 2016, for every county in the US.
Using the counties dataset before, let’s calculate the average rate of support statewide.
counties <- read_csv("https://bit.ly/us_county_elections_2016")Check out the first six rows with head().
counties %>% head()## # A tibble: 6 × 6
## fips name state democrats republicans region
## <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 1001 Autauga County AL 23.8 72.8 East South Central
## 2 1003 Baldwin County AL 19.4 76.6 East South Central
## 3 1005 Barbour County AL 46.5 52.1 East South Central
## 4 1007 Bibb County AL 21.2 76.4 East South Central
## 5 1009 Blount County AL 8.43 89.3 East South Central
## 6 1011 Bullock County AL 75.0 24.2 East South Central
3.1. Pivot!
counties_long <- counties %>%
# pivot to tidy format
pivot_longer(cols = c(democrats, republicans), # pivot just these columns
names_to = "party", # sticking column names in this variable
values_to = "percent") # sticking values here3.2: Summarize!
county_means <- counties_long %>%
# let's add to that by grouping by state AND party and summarizing
# (ooh, yes, you can group by two variables at once!)
group_by(state, party) %>%
# calculate the mean and label the result "mean_percent"
summarize(mean_percent = mean(percent, na.rm = TRUE)) %>%
# and sorting the data from highest to lowest!
arrange(desc(mean_percent))And let’s view it.
county_means3.3 Pivot and Visualize!
Alternatively, we can pivot and visualize our data. In this case, we don’t have to summarize our dataset into statistics like the mean; we can just use the raw data. That’s because ggplot() loves tidy data. Whenever we use pivot_longer(), we create a categorical variable that we can apply to our mapping = aes() function, creating new components to our visualizations!
Here’s one example.
counties_long %>%
# Then we visualize it!
ggplot(mapping = aes(x = percent,
fill = party)) +
geom_histogram(alpha = 0.5, color = "white") +
theme_bw() +
labs(x = "% of Vote per County",
y = "Frequency (% of County)",
fill = "Party",
title = "The Shape of Electoral Outcomes in 2016")Although, I’m noticing that in this visual, Democrats are red and Republicans are blue. I’d like to switch that order. How would I do that?
A few tricks:
Let’s try out a new function, factor(). This turns our character vector into a factor, which has a fixed order. It’s kind of like R’s version of an ordinal variable. It’s helpful with ggplot, because ggplot recognizes what to do with it.
counties_long %>%
# Let's add this new function, where we say, hey, please put our categories in this order
mutate(party = factor(party, levels = c("republicans", "democrats"))) %>%
ggplot(mapping = aes(x = percent,
fill = party)) +
geom_histogram(alpha = 0.5, color = "white") +
theme_bw() +
labs(x = "% of Vote per County",
y = "Number of Counties",
fill = "Party",
title = "The Shape of Electoral Outcomes in 2016")Or, we can go a step further and adjust the category names themselves. Let’s add this new function, recode_factor() from the dplyr package, where we say, hey, please rename our categories, and put them in this order.
counties_long %>%
# Let's add this new function, where we say, hey, please rename our categories
# and put them in this order.
mutate(party = party %>% recode_factor(
"republicans" = "Republicans",
"democrats" = "Democrats")) %>%
ggplot(mapping = aes(x = percent,
fill = party)) +
geom_histogram(alpha = 0.5, color = "white") +
theme_bw() +
labs(x = "% of Vote per County",
y = "Number of Counties",
fill = "Party",
title = "The Shape of Electoral Outcomes in 2016")Learning Check 3.1
recode_factor() function? Try updating the new labels too.
4. Color
Another way to handle our color issue is to learn a little more about color!
Color is a key way to innovate in your visualizations, and it fits well with tools like pivot_longer(), which naturally created categories for us in our data.
4.1 Pick you own colors with scale_color_manual()
R has a number of built in colors, which you can access using the colors() function. Try it! They have great names, like "steelblue", "firebrick", "cornsilk", "mistyrose", etc.
colors()Let’s try assigning some color manually from this list, using "steelblue" and "firebrick".
counties_long %>%
ggplot(mapping = aes(x = region, y = percent, color = party)) +
geom_jitter(alpha = 0.5) +
coord_flip() + # Flip on its side
facet_wrap(~party) + # split into panels
theme_bw() +
# Add a manual color scale, which is always categorical!
scale_color_manual(values = c("steelblue", "firebrick"))Alternatively, you could use color-blind friendly hexadecimal codes, like the IBM Design Library’s scale. These are written below in ‘hexadecimal’ code. (All colors can be represented by six numbers or letters, called a ‘hexadecimal code.’)
c("#648FFF", # Blue
"#785EF0", # Purple
"#DC267F", # Red
"#FE6100", # Orange
"#FFB000") # YellowLet’s try two color-blind friendly colors to our visual below. They really pop!
counties_long %>%
ggplot(mapping = aes(x = region,
y = percent,
color = party)) +
geom_jitter(alpha = 0.5) +
coord_flip() +
facet_wrap(~party) +
theme_bw() +
# Add our color-blind friendly blue/red colors!
scale_color_manual(values = c("#648FFF","#DC267F"))Learning Check 4.1
colors() function or our color-friendly hexadecimal codes. Which did you choose? Try switching the order in which they occur in the values = c() argument.
4.2 Divergent Color Scales with scale_color_gradient2()
Alternatively, you could make a divergent color scale, perhaps by what percentage of the population voted Democrat. We can split the colors at 50, where if less than 50% voted Democrat, we’ll call that an increasingly Red county, while if more than 50% voted Democrat, we’ll call that an increasingly blue county.
counties_long %>%
# Let's zoom into just data on Democrats
filter(party == "democrats") %>%
ggplot(mapping = aes(x = region,
y = percent,
color = percent)) +
geom_jitter(alpha = 0.5) +
coord_flip() +
facet_wrap(~party) +
theme_dark() + # aha! finally, a use for dark themes!
# Add a DIVERGENT color scale, splitting at 50.
scale_color_gradient2(low = "#DC267F", mid = "white", high = "#648FFF", midpoint = 50)
4.3 Viridis Color Scales
However, to be honest, the easiest and most effective solution I have found is the viridis package. This package was designed by scholars at Berkeley, and includes several beautiful palettes that are just super useful. They are color-blind friendly, they are grey-scale friendly (meaning they print well in black and white), and they just look cool.
To check out the names of available palettes, use viridis:: to look inside the package at the names of options.
viridis::Then, apply them in visuals like these!
You can use continuous color scales with viridis by default. Try scale_color_viridis() below, using the "plasma" palette. It’s my favorite so far.
counties_long %>%
filter(party == "democrats") %>%
ggplot(mapping = aes(x = region, y = percent,
# changed color variable to percent
color = percent)) +
geom_jitter() +
coord_flip() +
theme_bw() +
scale_color_viridis(option = "plasma")
Other cool options include viridis (green-to-purple), inferno (orange-red-black), cividis (yellow-to-blue), mako (green-blue-black), rocket (tan-red-black, and turbo (rainbow). Try them out!
You might also want to use categorical color scales with viridis. We call this a ‘discrete’ color scale, and we flag that for ggplot by saying, discrete = TRUE.
counties_long %>%
filter(party == "democrats") %>%
ggplot(mapping = aes(x = region, y = percent,
# changed color variable to region
color = region)) +
geom_jitter() +
coord_flip() +
theme_bw() +
scale_color_viridis(option = "plasma", discrete = TRUE)When applying colors to a geometry’s fill, all the same rules apply. We just use slightly different functions. For example:
| purpose | color | fill |
|---|---|---|
| Categories | scale_color_manual() |
scale_fill_manual() |
| Divergent colors | scale_color_gradient2() |
scale_fill_gradient2() |
| Viridis | scale_color_viridis() |
scale_fill_viridis() |
| Viridis (categories) | scale_color_viridis(discrete = TRUE) |
scale_fill_viridis(discrete = TRUE) |
Learning Check 4.2
"plasma". Now try another. Which one do you like best? Remember to include discrete = TRUE, or it will not render.
5. Visualization
You’ve learned several helpful skills today:
pivot_longer()factor()/recode_factor()scale_color_viridis()/scale_fill_viridis()
Using these skills, we can start making some really nice barplots.
5.1. Data prep
Let’s take our tidy counties dataframe, 1) calculate the average share of the vote won for Democrats by region, 2) reorder our regions from lowest to highest in terms of average voteshare won, and 3) visualize it with a viridis color scale.
countybars <- counties_long %>%
# Filter to democrats
filter(party == "democrats") %>%
# calculate average share of the vote won by region
group_by(region) %>%
summarize(mean_percent = mean(percent, na.rm = TRUE)) %>%
# Tell R to remember a specific order for the regions as a factor
mutate(region = factor(region, levels = c(
"West North Central",
"West South Central",
"Mountain",
"East South Central",
"East North Central",
"South Atlantic",
"Middle Atlantic",
"Pacific", "New England")))Let’s look at it.
countybars
5.2 geom_col()
Unlike past bar charts, this isn’t just one variable (geom_bar() only takes an x variable.); it’s two! We need a geom_ function that can handle 2 variables. We’re going to learn geom_col() below! It works the same way as geom_point(), with an x and a y coordinate for each shape.
countybars %>%
ggplot(mapping = aes(x = region, y = mean_percent,
# make the fill reflect different regions
fill = region)) +
# let's use geom_col(), since we have both an x and y variable
geom_col(color = "black") +
coord_flip() + # flip it!
scale_fill_viridis(option = "plasma", discrete = TRUE) +
theme_bw() +
labs(y = "Average Voteshare", x = "Census Region")5.3 Radial Bar Plot
Although, now that we look at it, there are a few interesting options here!
countybars %>%
ggplot(mapping = aes(x = region, y = mean_percent,
fill = region)) +
geom_col(color = "black") +
scale_fill_viridis(option = "plasma", discrete = TRUE) +
theme_bw() +
labs(y = "Average Voteshare", x = "Census Region") +
# Add a circular, 'polar' layout.
coord_polar()5.4 Lollipop Plot
countybars %>%
ggplot(mapping = aes(x = region, y = mean_percent,
ymin = 0, ymax = mean_percent,
color = region)) +
geom_linerange(size = 2) +
geom_point(size = 5) +
coord_flip() + # flip it!
scale_color_viridis(option = "plasma", discrete = TRUE) +
theme_bw() +
labs(y = "Average Voteshare", x = "Census Region")Learning Check 5.1
viridis color scale (anything other than plasma.)
Great work! You’re done!