Module 4: Enter the Tidyverse

Workshop 4: Importing and Tidying Data

Open this Project on RStudio.Cloud!

Up until now, we’ve only used data from packages, so we could always a) load it straight from those packages, and b) count on it arriving in a very specific format that’s great for data analysis - “tidy” format. More on that later.

This workshop is going to:

  1. introduce several ways of importing data.

  2. introduce techniques for tidying data.

Once your data is tidy, you can visualize it, collect summary statistics, and more. No joke - these data tidying techniques will save you dozens of hours compared to doing it all by hand.

Why are we using a website? So that you can practice typing in the code yourself. I recommend keeping this page open in one window and your RStudio.Cloud session in another, and typing the code in as you go. Most of the code output below is hidden, so you will need to run the code in your RStudio.Cloud session to find out what it does! It helps build muscle memory.

0. Load Packages

Please load the tidyverse package. This will load dplyr, ggplot2, tidyr, and readr all at once - congratulations, you can now just load tidyverse instead of each of those separately! Please also load the viridis package, a new package you will need today.

library(tidyverse)
library(viridis)




1. Importing Data

There are three main ways to import data.

  1. Load it from a package.

  2. Load it from a spreadsheet online.

  3. Load it from a spreadsheet file.




1.1. Loading Data from a Package

First, you can load data from a package, like we did last week. Try accessing the diamonds dataset from ggplot2 below.

library(ggplot2)

diamonds

But wouldn’t it be nice if we could see what datasets are in a package?

Try typing ggplot2:: on its own line, put your cursor at the end of that code, and then press TAB.

You’ll see all the functions and datasets stored in a package. You can scroll down the list and select the one you want, by clicking on it or pressing tab or enter.

ggplot2::
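If the TAB trick feels fiddly, base R’s data() function lists just the datasets (not the functions) bundled in a package. A quick sketch:

data(package = "ggplot2")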

Learning Check 1.1

Load three ggplot2 datasets other than the diamonds dataset.

For the first dataset, glimpse() it to see what variables are there. For the second dataset, select() just three variables to look at. For the third dataset, filter() by at least one variable.

Write down in comments (#) what you did. What kinds of data did you find?




1.2. Loading Data from a Spreadsheet Online

Using read_csv() from the readr package (part of the tidyverse), load the data!

japan <- read_csv("http://bit.ly/japanese_prefectural_elections")

This is a dataset of candidate election results compiled by Yusaku Horiuchi and Ryota Natori at Dartmouth College. It covers candidates who ran for Prefectural Assembly (similar to a US state legislature) in every town in Japan from 2003 to 2017.

Take a look.

japan %>% glimpse()
## Rows: 31,022
## Columns: 19
## $ candidate_name    <chr> "千葉英守", "藤川雅司", "小林郁子", "越智美智子", "…
## $ candidate_age     <dbl> 65, 58, 67, 45, 38, 50, 39, 49, 48, 39, 59, 53, 53, …
## $ candidate_party   <chr> "Liberal Democratic Party", "Democratic Party of Jap…
## $ candidate_status  <chr> "incumbent", "new", "incumbent", "new", "incumbent",…
## $ candidate_rank    <dbl> 1, 2, 3, 4, 1, 2, 5, 3, 4, 6, 7, 8, 3, 1, 2, 4, 5, 3…
## $ election_year     <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015…
## $ pref_code         <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ muni_code         <chr> "01100", "01100", "01100", "01100", "01100", "01100"…
## $ district_code     <chr> "01101", "01101", "01101", "01101", "01102", "01102"…
## $ muni_name         <chr> "札幌市", "札幌市", "札幌市", "札幌市", "札幌市", "…
## $ election_district <chr> "中央区", "中央区", "中央区", "中央区", "北区", "北…
## $ district_name     <chr> "中央区", "中央区", "中央区", "中央区", "北区", "北…
## $ eligible_voters   <dbl> 188968, 188968, 188968, 188968, 229593, 229593, 2295…
## $ turnout           <dbl> 105389, 105389, 105389, 105389, 134431, 134431, 1344…
## $ valid_votes_cast  <dbl> 99375, 99375, 99375, 99375, 129638, 129638, 129638, …
## $ quota             <dbl> 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3…
## $ num_candidates    <dbl> 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 5, 5, 5, 5, 5, 4…
## $ votes_received    <dbl> 43610.00, 28000.00, 22729.00, 5036.00, 24158.00, 111…
## $ votes_total       <dbl> 118161.00, 111890.00, 94029.00, 34568.00, 146889.00,…

Notice especially the following vectors, which have easy-to-interpret names:

  • "candidate_name"
  • "candidate_age"
  • "candidate_party"
  • "election_year"
  • "votes_received"
  • "valid_votes_cast"

IT’S IN JAPANESE! That’s right! You can work with data in foreign languages, and our process doesn’t really change! R IS SO NEAT!

Learning Check 1.2

Please filter() this dataset so that you only see candidates from elections in 2015, and arrange() the rows in terms of how many votes each candidate received.

But, that’s a lot of data, so let’s remove it from our memory for the time being.

remove(japan)




1.3. Loading Data from a Spreadsheet File

Alternatively, you can supply the data file yourself.

  1. Navigate to the Files menu (lower right hand corner), and find "us_county_voteshare_2016.csv".

  2. Now, import it using read_csv() from the readr package, like so:

counties <- read_csv("https://bit.ly/us_county_voteshare_2016")
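(Since "us_county_voteshare_2016.csv" also appears in your project’s Files pane, you could read the local copy instead. A minimal sketch, assuming the file sits in your working directory:)

counties <- read_csv("us_county_voteshare_2016.csv")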

This contains, for each US county, the percentage of the 2016 presidential vote won by Clinton (Democrat) and by Trump (Republican). fips is a unique identification code for each county.

counties

Now type out the code to import it and try it yourself!



Learning Check 1.3

What percentage of residents voted Democrat in Suffolk County, MA? (That’s where Boston is.) What percentage of residents voted Republican?

Hint: Use filter()!




2. Tidying Data

2.1 Identifying Long vs. Wide Data

We’ve got at least two good data.frames to work with now, but they’re not exactly in “tidy” format. So what does tidy format look like?

Let’s make a tidy data.frame of how many pets three people want (hypothetically).

dogs_longer <- data.frame(
  person = c("Tim", "Tim", "Melanie", "Jon"),
  type = c("corgi", "dalmatian", "corgi", "greyhound"),
  puppies = c(9, 1, 2, 3)
)

Now view it!

dogs_longer
##    person      type puppies
## 1     Tim     corgi       9
## 2     Tim dalmatian       1
## 3 Melanie     corgi       2
## 4     Jon greyhound       3

There’s an “index” - a.k.a. a unique identifier - "person". There’s a “class” - in this case, dog "type". There’s a “value” - in this case, "puppies"!

You’ll notice that each person-type pair occurs only once, and each pair has a value (puppies) associated with it.

Think about this.

Read this again.

Read this one more time, please.

Is this the way you expect to see data?

Or do you expect data to look like this?

dogs_wider <- data.frame(
  person = c("Tim", "Melanie", "Jon"),
  corgi = c(9, 2, 0),
  dalmatian = c(1, 0, 0),
  greyhound = c(0, 0, 3))

Now view it!

dogs_wider
##    person corgi dalmatian greyhound
## 1     Tim     9         1         0
## 2 Melanie     2         0         0
## 3     Jon     0         0         3

These are equivalent representations of the same data.

The first, dogs_longer, puts it in a “tidy”, “long” list. The second, dogs_wider, puts it in a “wide” matrix.

Why would any of this matter? Well, it’s really easy to do calculations or visualizations on tidy data, but not on matrix-style data. For example, let’s count the total puppies each person wants:

Using the dataframe dogs_longer:

dogs_longer %>%
  # For each person
  group_by(person) %>%
  # Calculate the total number of puppies
  summarize(total_puppies = sum(puppies))

Using the dataframe dogs_wider:

# We can get the number of dogs that Tim wants (10)
dogs_wider %>% 
  filter(person == "Tim") %>%
  select(corgi) +
  # the number of dalmatians Tim wants
  dogs_wider %>% 
  filter(person == "Tim") %>% 
  select(dalmatian) +
  # and the number of greyhounds Tim wants
  dogs_wider %>% 
  filter(person == "Tim") %>% 
  select(greyhound) 

But we haven’t even gotten the totals for Melanie or Jon yet! Using wide data takes so much time and extra code. Using “tidy” long-format data is way more efficient.

Learning Check 2.1:

  • Does “tidy data” refer to long or wide data? Why?

  • Check in with your group-mates and make up a hypothetical dataset.

  • Sketch out what a long version would look like.

  • Sketch out what a wide version would look like.




2.2 Pivoting Data

The tidyr package contains two mindblowing functions that convert between wide and long data.

  1. pivot_wider()

  2. pivot_longer()

While not earth-shattering, they save me hours every day and increase your productivity like nobody’s business.

Examine these two chunks of code that use pivot_wider().

dogs_longer %>%
  # Using pivot_wider
  pivot_wider(
    id_cols = c(person), # make values in vector "person" the rows
    names_from = type, # make values in vector "type" the columns
    values_from = puppies) # make values from vector "puppies" the cell values

Now try this instead.

dogs_longer %>%
  pivot_wider(
    id_cols = c(person), 
    names_from = type, 
    values_from = puppies,
    values_fill = list(puppies = 0))

Learning Check 2.2:

Something’s different from our wide matrix, “dogs_wider”. What is it? What did we do to fix it? Try replacing 0 with other numbers. What happens?




Next, please examine the following three code chunks. Code Chunk A is just dogs_longer again, for reference; Code Chunks B and C use pivot_longer().

Code Chunk A:

dogs_longer

Code Chunk B:

dogs_wider %>%
  pivot_longer(
    cols = c(corgi, dalmatian, greyhound), # pivot all columns inside c()
    names_to = "type", # stick column names in a column called type
    values_to = "puppies") # stick values in a column called puppies

Code Chunk C:

dogs_wider %>%
  pivot_longer(
    cols = c(corgi, dalmatian, greyhound), 
    names_to = "type", 
    values_to = "puppies") %>%
  filter(puppies != 0)

Learning Check 2.3:

Code Chunks B and C look different from dogs_longer (Code Chunk A), and from each other. What changed? What did we do in Code Chunk C to fix it? Why did it work?

3. County Elections Data

The counties dataset we used before tells us what percentage of each county’s voters supported each party’s candidate for president in 2016, for every county in the US.

Using that counties dataset, let’s calculate each party’s average rate of support by state.

counties <- read_csv("https://bit.ly/us_county_elections_2016")

Check out the first six rows with head().

counties %>% head()
## # A tibble: 6 × 6
##    fips name           state democrats republicans region            
##   <dbl> <chr>          <chr>     <dbl>       <dbl> <chr>             
## 1  1001 Autauga County AL        23.8         72.8 East South Central
## 2  1003 Baldwin County AL        19.4         76.6 East South Central
## 3  1005 Barbour County AL        46.5         52.1 East South Central
## 4  1007 Bibb County    AL        21.2         76.4 East South Central
## 5  1009 Blount County  AL         8.43        89.3 East South Central
## 6  1011 Bullock County AL        75.0         24.2 East South Central

3.1. Pivot!

counties_long <- counties %>%
  # pivot to tidy format
  pivot_longer(cols = c(democrats, republicans), # pivot just these columns
               names_to = "party", # sticking column names in this variable
               values_to = "percent") # sticking values here

3.2. Summarize!

county_means <- counties_long %>%
  # let's add to that by grouping by state AND party and summarizing
  # (ooh, yes, you can group by two variables at once!)
  group_by(state, party) %>% 
  # calculate the mean and label the result "mean_percent"
  summarize(mean_percent = mean(percent, na.rm = TRUE)) %>%
  # and sorting the data from highest to lowest! 
  arrange(desc(mean_percent))

And let’s view it.

county_means

3.3 Pivot and Visualize!

Alternatively, we can pivot and visualize our data. In this case, we don’t have to summarize our dataset into statistics like the mean; we can just use the raw data. That’s because ggplot() loves tidy data. Whenever we use pivot_longer(), we create a categorical variable that we can map inside mapping = aes(), adding new components to our visualizations!

Here’s one example.

counties_long %>%
  # Then we visualize it!
  ggplot(mapping = aes(x = percent, 
                       fill = party)) + 
  geom_histogram(alpha = 0.5, color = "white") +
  theme_bw() +
  labs(x = "% of Vote per County", 
       y = "Frequency (% of County)",
       fill = "Party", 
       title = "The Shape of Electoral Outcomes in 2016")

Although, I’m noticing that in this visual, Democrats are red and Republicans are blue. I’d like to switch that order. How would I do that?

A few tricks:

Let’s try out a new function, factor(). This turns our character vector into a factor, which has a fixed order of levels. It’s kind of like R’s version of an ordinal variable. It’s helpful with ggplot, because ggplot respects that order when assigning colors and building the legend.
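Here’s a tiny standalone sketch of what factor() does (the example vector is just illustrative), before we drop it into the full pipeline below:

# without levels =, a factor defaults to alphabetical order
party_example <- factor(c("democrats", "republicans", "democrats"),
                        levels = c("republicans", "democrats"))
levels(party_example) # "republicans" then "democrats" - the order we set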

counties_long %>%
  # Let's add this new function, where we say, hey, please put our categories in this order
  mutate(party = factor(party, levels = c("republicans", "democrats"))) %>%
  ggplot(mapping = aes(x = percent, 
                       fill = party)) + 
  geom_histogram(alpha = 0.5, color = "white") +
  theme_bw() +
  labs(x = "% of Vote per County", 
       y = "Number of Counties",
       fill = "Party",
       title = "The Shape of Electoral Outcomes in 2016")

Or, we can go a step further and adjust the category names themselves. Let’s add this new function, recode_factor() from the dplyr package, where we say, hey, please rename our categories, and put them in this order.

counties_long %>%
  # Let's add this new function, where we say, hey, please rename our categories
  # and put them in this order.
  mutate(party = party %>% recode_factor(
    "republicans" = "Republicans",
    "democrats" = "Democrats")) %>%
  ggplot(mapping = aes(x = percent, 
                       fill = party)) + 
  geom_histogram(alpha = 0.5, color = "white") +
  theme_bw() +
  labs(x = "% of Vote per County", 
       y = "Number of Counties",
       fill = "Party", 
       title = "The Shape of Electoral Outcomes in 2016")

Learning Check 3.1

What happens when you switch around the Republicans and Democrats in the recode_factor() function? Try updating the new labels too.




4. Color

Another way to handle our color issue is to learn a little more about color!

Color is a key way to innovate in your visualizations, and it fits well with tools like pivot_longer(), which naturally creates categorical variables in our data.

4.1 Pick Your Own Colors with scale_color_manual()

R has a number of built-in colors, which you can access using the colors() function. Try it! They have great names, like "steelblue", "firebrick", "cornsilk", "mistyrose", etc.

colors()
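If that full list is overwhelming, you can search it by name; a quick sketch using base R’s grep():

grep("blue", colors(), value = TRUE) # every built-in color with "blue" in its name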

Let’s try assigning some color manually from this list, using "steelblue" and "firebrick".

counties_long %>%
  ggplot(mapping = aes(x = region, y = percent, color = party)) +
  geom_jitter(alpha = 0.5) +
  coord_flip() +   # Flip on its side
  facet_wrap(~party) + # split into panels
  theme_bw() + 
  # Add a manual color scale, which is always categorical!
  scale_color_manual(values = c("steelblue", "firebrick"))

Alternatively, you could use color-blind friendly hexadecimal codes, like those from the IBM Design Library’s palette, written below. (Any color can be represented by a six-character code of numbers and letters prefixed with #, called a ‘hexadecimal code.’)

c("#648FFF", # Blue
  "#785EF0", # Purple
  "#DC267F", # Red
  "#FE6100", # Orange
  "#FFB000") # Yellow

Let’s apply two of these color-blind friendly colors to the visual below. They really pop!

counties_long %>%
  ggplot(mapping = aes(x = region, 
                       y = percent, 
                       color = party)) +
  geom_jitter(alpha = 0.5) +
  coord_flip() +   
  facet_wrap(~party) + 
  theme_bw() + 
  # Add our color-blind friendly blue/red colors!
  scale_color_manual(values = c("#648FFF","#DC267F"))

Learning Check 4.1

Remake the visual above using two new colors you selected from either the colors() function or our color-friendly hexadecimal codes. Which did you choose? Try switching the order in which they occur in the values = c() argument.




4.2 Divergent Color Scales with scale_color_gradient2()

Alternatively, you could make a divergent color scale, perhaps based on what percentage of the county voted Democrat. We can split the colors at 50: if less than 50% voted Democrat, we’ll shade that county increasingly red, and if more than 50% voted Democrat, we’ll shade it increasingly blue.

counties_long %>%
  # Let's zoom into just data on Democrats
  filter(party == "democrats") %>%
  ggplot(mapping = aes(x = region, 
                       y = percent, 
                       color = percent)) +
  geom_jitter(alpha = 0.5) +
  coord_flip() +   
  facet_wrap(~party) + 
  theme_dark() +  # aha! finally, a use for dark themes!
  # Add a DIVERGENT color scale, splitting at 50.
  scale_color_gradient2(low = "#DC267F", mid = "white", high = "#648FFF", midpoint = 50)



4.3 Viridis Color Scales

However, to be honest, the easiest and most effective solution I have found is the viridis package. This package was designed by scholars at Berkeley, and includes several beautiful palettes that are just super useful. They are color-blind friendly, they are grey-scale friendly (meaning they print well in black and white), and they just look cool.

To check out the names of available palettes, use viridis:: to look inside the package at the names of options.

viridis::

Then, apply them in visuals like these!

You can use continuous color scales with viridis by default. Try scale_color_viridis() below, using the "plasma" palette. It’s my favorite so far.

counties_long %>%
  filter(party == "democrats") %>%
  ggplot(mapping = aes(x = region, y = percent, 
                       # changed color variable to percent
                       color = percent)) +
  geom_jitter() +
  coord_flip() +
  theme_bw() +
  scale_color_viridis(option = "plasma")



Other cool options include viridis (green-to-purple), inferno (orange-red-black), cividis (yellow-to-blue), mako (green-blue-black), rocket (tan-red-black), and turbo (rainbow). Try them out!

You might also want to use categorical color scales with viridis. We call this a ‘discrete’ color scale, and we flag that for ggplot by saying, discrete = TRUE.

counties_long %>%
  filter(party == "democrats") %>%
  ggplot(mapping = aes(x = region, y = percent, 
                       # changed color variable to region
                       color = region)) +
  geom_jitter() +
  coord_flip() +
  theme_bw() +
  scale_color_viridis(option = "plasma", discrete = TRUE)

When applying colors to a geometry’s fill, all the same rules apply. We just use slightly different functions. For example:

Purpose                Color                                   Fill
Categories             scale_color_manual()                    scale_fill_manual()
Divergent colors       scale_color_gradient2()                 scale_fill_gradient2()
Viridis                scale_color_viridis()                   scale_fill_viridis()
Viridis (categories)   scale_color_viridis(discrete = TRUE)    scale_fill_viridis(discrete = TRUE)
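For example, here’s a quick sketch that reuses the counties_long histogram from section 3.3 and swaps the default fills for two of the IBM hex colors with scale_fill_manual():

counties_long %>%
  ggplot(mapping = aes(x = percent, fill = party)) +
  geom_histogram(alpha = 0.5, color = "white") +
  theme_bw() +
  # the fill version of the manual scale from section 4.1
  scale_fill_manual(values = c("#648FFF", "#DC267F"))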


Learning Check 4.2

Remake the categorical visual above using a viridis color scale other than "plasma". Now try another. Which one do you like best? Remember to include discrete = TRUE, or it will not render.




5. Visualization

You’ve learned several helpful skills today:

  • pivot_longer()

  • factor() / recode_factor()

  • scale_color_viridis() / scale_fill_viridis()

Using these skills, we can start making some really nice bar plots.

5.1. Data prep

Let’s take our tidy counties dataframe and 1) calculate the average share of the vote won by Democrats in each region, 2) reorder our regions from lowest to highest average voteshare, and 3) visualize the result with a viridis color scale.

countybars <- counties_long %>%
  # Filter to democrats
  filter(party == "democrats") %>%
  # calculate average share of the vote won by region
  group_by(region) %>%
  summarize(mean_percent = mean(percent, na.rm = TRUE)) %>%
  # Tell R to remember a specific order for the regions as a factor
  mutate(region = factor(region, levels = c(
    "West North Central", 
    "West South Central",
    "Mountain", 
    "East South Central",
    "East North Central",
    "South Atlantic", 
    "Middle Atlantic",
    "Pacific", "New England")))

Let’s look at it.

countybars



5.2 geom_col()

Unlike past bar charts, this isn’t just one variable (geom_bar() only takes an x variable); it’s two! We need a geom_ function that can handle two variables, so we’re going to learn geom_col() below. It works the same way as geom_point(), with an x and a y coordinate for each shape.

countybars %>%
  ggplot(mapping = aes(x = region, y = mean_percent, 
                       # make the fill reflect different regions
                       fill = region)) +
  # let's use geom_col(), since we have both an x and y variable
  geom_col(color = "black") +
  coord_flip() + # flip it!
  scale_fill_viridis(option = "plasma", discrete = TRUE) +
  theme_bw() +
  labs(y = "Average Voteshare", x = "Census Region")


5.3 Radial Bar Plot

Now that we look at it, there are a few other interesting layouts we could try!

countybars %>%
  ggplot(mapping = aes(x = region, y = mean_percent, 
                       fill = region)) +
  geom_col(color = "black") +
  scale_fill_viridis(option = "plasma", discrete = TRUE) +
  theme_bw() +
  labs(y = "Average Voteshare", x = "Census Region") +
  # Add a circular, 'polar' layout.
  coord_polar()


5.4 Lollipop Plot

countybars %>%
  ggplot(mapping = aes(x = region, y = mean_percent, 
                       ymin = 0, ymax = mean_percent,
                       color = region)) +
  geom_linerange(size = 2) +
  geom_point(size = 5) +
  coord_flip() + # flip it!
  scale_color_viridis(option = "plasma", discrete = TRUE) +
  theme_bw() +
  labs(y = "Average Voteshare", x = "Census Region")

Learning Check 5.1

Make a bar plot, lollipop chart, or radial bar plot, but this time, evaluate it for Republicans, not Democrats. Be sure to choose a nice viridis color scale (anything other than plasma).



Great work! You’re done!