Tidyverse Create

Universe? Tidyverse!

Today, we’ll demonstrate some uses of the tidyverse. We’ll use dplyr, tidyr, stringr, and ggplot2 to take a dataset and perform some exploratory analysis. The tidyverse is a wonderful toolkit for taking a dataset of almost any shape, transforming it, and presenting findings. Because it is flexible and powerful, users can focus on answering layers of questions and deepening the analysis without getting bogged down in technical tasks.

Personally, the packages we’re using today are my favorites, notably dplyr with its pipe operator (%>%, which dplyr re-exports from magrittr) and ggplot2.

The Raw Data

How Baby Boomers Get High

Let’s look at a dataset that comes from FiveThirtyEight. The article referenced can be found here and discusses drug use among baby boomers and other age groups. There is one visualization in the article, but we’ll create more to explore the data found on GitHub.

Visualizations are one of the easiest ways for people to digest data.

Packages

We can load each of the packages separately, or we can just load the tidyverse:

library(tidyverse)

Loading

Let’s pull the data directly from GitHub into a new dataframe object, then take a quick peek at what we’re working with:

df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv',
               stringsAsFactors = FALSE)
str(df)

## 'data.frame':    17 obs. of  28 variables:
##  $ age                    : chr  "12" "13" "14" "15" ...
##  $ n                      : int  2798 2757 2792 2956 3058 3038 2469 2223 2271 2354 ...
##  $ alcohol.use            : num  3.9 8.5 18.1 29.2 40.1 49.3 58.7 64.6 69.7 83.2 ...
##  $ alcohol.frequency      : num  3 6 5 6 10 13 24 36 48 52 ...
##  $ marijuana.use          : num  1.1 3.4 8.7 14.5 22.5 28 33.7 33.4 34 33 ...
##  $ marijuana.frequency    : num  4 15 24 25 30 36 52 60 60 52 ...
##  $ cocaine.use            : num  0.1 0.1 0.1 0.5 1 2 3.2 4.1 4.9 4.8 ...
##  $ cocaine.frequency      : chr  "5.0" "1.0" "5.5" "4.0" ...
##  $ crack.use              : num  0 0 0 0.1 0 0.1 0.4 0.5 0.6 0.5 ...
##  $ crack.frequency        : chr  "-" "3.0" "-" "9.5" ...
##  $ heroin.use             : num  0.1 0 0.1 0.2 0.1 0.1 0.4 0.5 0.9 0.6 ...
##  $ heroin.frequency       : chr  "35.5" "-" "2.0" "1.0" ...
##  $ hallucinogen.use       : num  0.2 0.6 1.6 2.1 3.4 4.8 7 8.6 7.4 6.3 ...
##  $ hallucinogen.frequency : num  52 6 3 4 3 3 4 3 2 4 ...
##  $ inhalant.use           : num  1.6 2.5 2.6 2.5 3 2 1.8 1.4 1.5 1.4 ...
##  $ inhalant.frequency     : chr  "19.0" "12.0" "5.0" "5.5" ...
##  $ pain.releiver.use      : num  2 2.4 3.9 5.5 6.2 8.5 9.2 9.4 10 9 ...
##  $ pain.releiver.frequency: num  36 14 12 10 7 9 12 12 10 15 ...
##  $ oxycontin.use          : num  0.1 0.1 0.4 0.8 1.1 1.4 1.7 1.5 1.7 1.3 ...
##  $ oxycontin.frequency    : chr  "24.5" "41.0" "4.5" "3.0" ...
##  $ tranquilizer.use       : num  0.2 0.3 0.9 2 2.4 3.5 4.9 4.2 5.4 3.9 ...
##  $ tranquilizer.frequency : num  52 25.5 5 4.5 11 7 12 4.5 10 7 ...
##  $ stimulant.use          : num  0.2 0.3 0.8 1.5 1.8 2.8 3 3.3 4 4.1 ...
##  $ stimulant.frequency    : num  2 4 12 6 9.5 9 8 6 12 10 ...
##  $ meth.use               : num  0 0.1 0.1 0.3 0.3 0.6 0.5 0.4 0.9 0.6 ...
##  $ meth.frequency         : chr  "-" "5.0" "24.0" "10.5" ...
##  $ sedative.use           : num  0.2 0.1 0.2 0.4 0.2 0.5 0.4 0.3 0.5 0.3 ...
##  $ sedative.frequency     : num  13 19 16.5 30 3 6.5 10 6 4 9 ...
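
Several frequency columns came in as chr because the file uses a dash where no value is recorded. As a minimal sketch, base R’s na.strings argument can turn those dashes into NA at read time, which keeps such a column numeric. The inline CSV is a two-row stand-in that mirrors the crack values shown above, and the object names are made up for illustration. This walkthrough does not take that route; it handles the dashes later.

```r
# Two-row stand-in for the real file (values copied from the str output above).
csv_text <- "age,crack.use,crack.frequency
12,0.0,-
13,0.0,3.0"

# Default read: the "-" forces the whole column to character.
as_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Treat "-" as missing at read time so the column stays numeric.
with_na <- read.csv(text = csv_text, stringsAsFactors = FALSE, na.strings = "-")

str(as_read$crack.frequency)
str(with_na$crack.frequency)
```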

Evaluating the Quality of Data

The dataset can be summarized in a few points:

- Each row is an age group.
- Each column is a variable.
  - There are multiple drug classes, each with two summary statistics for the past 12 months:
    - Use: percent of the age group who used the drug
    - Frequency: median number of times a user took the drug
- The dataset is a summarized view, not unit-level data.

Ideally, we would have the raw unit-level data, showing the responses of every person surveyed. Fortunately, with the tidyverse, we can still make good use of this summary data.

Preparing the Data

Method Explained

Since we’ll be visualizing the data with ggplot2, we’ll want this dataset in a different structure. We’ll set aside the sample sizes (n) for now and build a new dataframe with four fields:

Age [group]
Drug
Use
Frequency

We’ll make use of the following packages to do this:

tidyr to unpivot the data
stringr to extract information from the field names
dplyr to manage the data
ggplot2 to visualize the data

tidyr::gather

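Let’s start by taking this wide dataset and making it tall with the gather function from tidyr. Because some of the gathered columns are character, gather stores every value as character; we’ll convert back to numbers at the end.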

tall <- gather(df, field_name, value, -age)
head(tall,3)

##   age field_name value
## 1  12          n  2798
## 2  13          n  2757
## 3  14          n  2792
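
gather still works, but tidyr’s documentation marks it as superseded in favor of pivot_longer. The sketch below shows a roughly equivalent call on a small stand-in dataframe. The name toy is hypothetical, and its values are copied from the first two rows shown above. pivot_longer refuses to combine numeric and character columns on its own, so values_transform converts everything to character first.

```r
library(tidyr)

# Small stand-in for df: one integer, one numeric and one character column.
toy <- data.frame(
  age = c("12", "13"),
  n = c(2798L, 2757L),
  alcohol.use = c(3.9, 8.5),
  crack.frequency = c("-", "3.0"),
  stringsAsFactors = FALSE
)

# Unpivot every column except age; coerce values to character so they can share a column.
tall_toy <- pivot_longer(
  toy,
  cols = -age,
  names_to = "field_name",
  values_to = "value",
  values_transform = list(value = as.character)
)

print(tall_toy)
```

One difference to keep in mind: pivot_longer returns the rows grouped by age, whereas gather stacks them one original column at a time, so positional indexing would not carry over unchanged.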

dplyr::filter

Here’s the powerhouse of the tidyverse, in my opinion. Let’s split the tall table into two: one for survey counts and another for drug usage. We’ll use the filter function in dplyr.

counts <- filter(tall, field_name == 'n')

usage <- filter(tall, field_name != 'n')
head(usage,3)

##   age  field_name value
## 1  12 alcohol.use   3.9
## 2  13 alcohol.use   8.5
## 3  14 alcohol.use  18.1
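
As a quick sanity check on a split like this, the two pieces should add back up to the original. The sketch below uses a hypothetical four-row table (tall_toy and the other names are made up for illustration). Note that filter drops rows where the condition is NA, so the check would fail if field_name had missing values.

```r
library(dplyr)

# Hypothetical tall table with two count rows and two usage rows.
tall_toy <- data.frame(
  age = c("12", "13", "12", "13"),
  field_name = c("n", "n", "alcohol.use", "alcohol.use"),
  value = c("2798", "2757", "3.9", "8.5"),
  stringsAsFactors = FALSE
)

counts_toy <- filter(tall_toy, field_name == "n")
usage_toy  <- filter(tall_toy, field_name != "n")

# Every row should land in exactly one of the two tables.
split_is_complete <- nrow(counts_toy) + nrow(usage_toy) == nrow(tall_toy)
print(split_is_complete)
```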

The counts dataframe looks ready to use. Let’s continue building out the usage dataframe.

stringr::str_extract

Now that we’re looking only at drug use and frequency, let’s determine which row is which. We can use str_extract to find a specified pattern in a string, and we’ll add the match as a new field:

usage$type <- str_extract(usage$field_name, 'use|frequency')
head(usage,3)

##   age  field_name value type
## 1  12 alcohol.use   3.9  use
## 2  13 alcohol.use   8.5  use
## 3  14 alcohol.use  18.1  use
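
The pattern 'use|frequency' works here because none of the drug names contains either word. A slightly more defensive sketch anchors the pattern to the end of the string with $. The field name used.needles.frequency below is hypothetical, invented only to show what the unanchored pattern would get wrong.

```r
library(stringr)

# The last name is made up: it contains "use" before the real suffix.
fields <- c("alcohol.use", "crack.frequency", "used.needles.frequency")

# Unanchored: returns the first match anywhere in the string.
loose <- str_extract(fields, "use|frequency")

# Anchored: only matches the suffix at the end of the string.
strict <- str_extract(fields, "(use|frequency)$")

print(data.frame(fields, loose, strict))
```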

Now that we have use and frequency extracted into a separate field, we can use base R substr to clean up the field_name.

usage$field_name <- substr(usage$field_name, 1, nchar(usage$field_name) - nchar(usage$type))
head(usage[239:241,])

##     age     field_name value type
## 239  12 pain.releiver.     2  use
## 240  13 pain.releiver.   2.4  use
## 241  14 pain.releiver.   3.9  use

The third argument of substr is the stop position. Here it is the length of the field name minus the length of the type, so the trailing use or frequency is dropped.
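
To make the arithmetic concrete, here is the same calculation on a single field name, followed by a one-step alternative using stringr’s str_remove. The alternative is only a sketch of another option, not what this walkthrough uses; it also removes the trailing period in the same pass.

```r
library(stringr)

field <- "pain.releiver.use"
type  <- "use"

# 17 characters in the field name minus 3 in the type gives a stop position of 14.
stop_at <- nchar(field) - nchar(type)
prefix  <- substr(field, 1, stop_at)

# Alternative: strip the suffix together with the period in front of it.
alt <- str_remove(field, "\\.(use|frequency)$")

print(c(stop_at = stop_at))
print(c(prefix = prefix, alt = alt))
```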

stringr::str_replace_all

This step is optional, but let’s clean up the extra periods: a tidy final product removes distractions when reading the visualization.

usage$field_name <- str_replace_all(usage$field_name, '\\.', ' ')
head(usage[239:241,])

##     age     field_name value type
## 239  12 pain releiver      2  use
## 240  13 pain releiver    2.4  use
## 241  14 pain releiver    3.9  use

# Let's also fix the spelling error in the dataset
usage$field_name <- str_replace_all(usage$field_name, 'releiver', 'reliever')
head(usage[239:241,])

##     age     field_name value type
## 239  12 pain reliever      2  use
## 240  13 pain reliever    2.4  use
## 241  14 pain reliever    3.9  use

# And replace the dash placeholders with zeros
# (anchored with ^ and $ so only a value that is exactly "-" is changed)
usage$value  <- str_replace_all(usage$value, '^-$', '0')
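
Turning the dash into zero is one choice; treating it as missing is another, and which one is right depends on what the dash means in the source survey. The sketch below compares the two on a hypothetical vector of frequency values, using dplyr’s na_if. The names values, as_zero and as_missing are made up for illustration.

```r
library(dplyr)

# Hypothetical frequency values, with "-" marking an unrecorded median.
values <- c("5.0", "-", "9.5")

# Option used in this walkthrough: a dash becomes zero.
as_zero <- as.numeric(ifelse(values == "-", "0", values))

# Alternative: a dash becomes NA and is skipped explicitly in summaries.
as_missing <- as.numeric(na_if(values, "-"))

print(mean(as_zero))
print(mean(as_missing, na.rm = TRUE))
```

The zeros pull summaries such as the mean downward, while NA leaves them based only on the recorded values.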

stringr::str_trim

One last string cleanup: let’s trim the white space at the end of the field name. Also, since we’ve completely transformed the field, let’s rename it.

usage$field_name <- str_trim(usage$field_name, side = 'right')
names(usage)[2] <- 'drug'

tidyr::spread

Now that we have all the data points separated, we can spread the data back out so that the two variables, use and frequency, each get their own column.

drugs <- spread(usage, type, value)
str(drugs)

## 'data.frame':    221 obs. of  4 variables:
##  $ age      : chr  "12" "12" "12" "12" ...
##  $ drug     : chr  "alcohol" "cocaine" "crack" "hallucinogen" ...
##  $ frequency: chr  "3" "5.0" "0" "52" ...
##  $ use      : chr  "3.9" "0.1" "0" "0.2" ...
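
Like gather, spread is marked as superseded in tidyr’s documentation; pivot_wider is the newer counterpart. A minimal sketch on a hypothetical four-row table (usage_toy is made up, with values taken from the alcohol rows above):

```r
library(tidyr)

# Hypothetical tall table: one use row and one frequency row per age.
usage_toy <- data.frame(
  age   = c("12", "12", "13", "13"),
  drug  = "alcohol",
  type  = c("use", "frequency", "use", "frequency"),
  value = c("3.9", "3", "8.5", "6"),
  stringsAsFactors = FALSE
)

# Each distinct value of type becomes its own column, filled from value.
drugs_toy <- pivot_wider(usage_toy, names_from = type, values_from = value)

print(drugs_toy)
```

The values stay character here as well, so the numeric conversion in the next step is still needed.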

Final Touches

Let’s convert frequency and use to numeric values:

drugs$frequency <- as.numeric(drugs$frequency)
drugs$use <- as.numeric(drugs$use)
str(drugs)

## 'data.frame':    221 obs. of  4 variables:
##  $ age      : chr  "12" "12" "12" "12" ...
##  $ drug     : chr  "alcohol" "cocaine" "crack" "hallucinogen" ...
##  $ frequency: num  3 5 0 52 35.5 19 4 0 24.5 36 ...
##  $ use      : num  3.9 0.1 0 0.2 0.1 1.6 1.1 0 0.1 2 ...

Since age actually holds age groups, we can leave it as a string. Luckily, the labels sort alphabetically in the right order. If they didn’t, we would convert this field into a factor and define the order of the levels.
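
For reference, the factor fallback could look like the sketch below. The level labels are an assumption: they follow the age groups in the FiveThirtyEight file as I understand them, so check them against unique(drugs$age) before relying on them. The ages vector is a made-up, deliberately shuffled sample.

```r
# Assumed age-group labels, in the order they should appear on an axis.
age_levels <- c("12", "13", "14", "15", "16", "17", "18", "19", "20", "21",
                "22-23", "24-25", "26-29", "30-34", "35-49", "50-64", "65+")

# Hypothetical shuffled sample of labels.
ages <- c("65+", "30-34", "12", "22-23")

# Sorting a factor follows the declared level order, not the alphabet.
age_factor <- factor(ages, levels = age_levels)
sorted_ages <- as.character(sort(age_factor))

print(sorted_ages)
```

ggplot2 orders a discrete axis by factor levels, so this is the usual way to pin down the order of categories in a plot.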

Visualizations

dplyr %>%

First of all, I’m curious about my own age group. I’m 34 right now, so I fit into the 30-34 age group. Let’s filter for my age group and then use the result in a visualization. We could do this in separate steps, but the pipe lets us pass the result of one function straight into the next.

# percent of this age group that used each listed drug
filter(drugs, age == '30-34') %>%
  ggplot(aes(x = reorder(drug, use), y = use)) +
  geom_col() +    # shorthand for geom_bar(stat = 'identity')
  coord_flip() +
  labs(x = NULL,  # NULL removes the axis title
       y = NULL,
       title = "Percent of People Who Have Used Drugs",
       subtitle = 'Age Group: 30-34')
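
To show what the pipe is doing, the sketch below builds the same kind of plot twice, once with nested calls and once piped. The dataframe drugs_toy and its values are made up for illustration. Because %>% binds more tightly than +, the filtered data reaches ggplot first and the layers are added afterwards.

```r
library(dplyr)
library(ggplot2)

# Made-up rows; only the structure matches the drugs dataframe.
drugs_toy <- data.frame(
  age  = c("26-29", "30-34", "30-34"),
  drug = c("alcohol", "alcohol", "marijuana"),
  use  = c(80, 75, 15),
  stringsAsFactors = FALSE
)

# Separate steps: filter first, then hand the result to ggplot.
thirty <- filter(drugs_toy, age == "30-34")
nested <- ggplot(thirty, aes(x = drug, y = use)) +
  geom_col()  # geom_col() is shorthand for geom_bar(stat = 'identity')

# Piped: the filtered data becomes ggplot's first argument.
piped <- drugs_toy %>%
  filter(age == "30-34") %>%
  ggplot(aes(x = drug, y = use)) +
  geom_col()
```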

Facet Wrap

One of the most powerful features of ggplot2 is faceting: splitting a plot into panels by another variable, so that the same plot structure is repeated for each value of that variable.

Instead of filtering and plotting each age group separately, we’ll facet on age group and see them all at once.

ggplot(drugs, aes(x = reorder(drug, use), y = use)) +
  geom_col() +    # shorthand for geom_bar(stat = 'identity')
  coord_flip() +
  labs(x = NULL,  # NULL removes the axis title
       y = NULL,
       title = "Percent of People Who Have Used Drugs") +
  facet_wrap(~age) +
  theme(text = element_text(size = 7))

We can also look at the distribution of usage by age for each drug:

ggplot(drugs, aes(x = age, y = use)) +
  geom_col() +    # shorthand for geom_bar(stat = 'identity')
  labs(x = NULL,  # NULL removes the axis title
       y = NULL,
       title = "Distribution of Age by Drug") +
  facet_wrap(~drug) +
  theme(text = element_text(size = 7),
        axis.text.x = element_text(angle = 90, hjust = 1))

Looking at this at a high level, we can easily determine the most popular drugs, but seeing the trends in all drugs is difficult. For that, we’ll need to free the y scale for each drug:

ggplot(drugs, aes(x = age, y = use)) +
  geom_col() +    # shorthand for geom_bar(stat = 'identity')
  labs(x = NULL,  # NULL removes the axis title
       y = NULL,
       title = "Distribution of Age by Drug") +
  facet_wrap(~drug, scales = 'free_y') +  # each panel gets its own y axis
  theme(text = element_text(size = 7),
        axis.text.x = element_text(angle = 90, hjust = 1))

Now this is interesting! Here are a few things we can easily see in this visualization:

- Alcohol usage peaks right around the legal age
- Cocaine usage peaks around 20 and doesn’t seem to be very popular at any age
- Not many people use crack, and usage appears to persist into the older age groups (though summary data can’t show whether the same individuals keep using)
- Hallucinogen usage peaks just under 20 years old and doesn’t seem very popular at any age
- Inhalant usage is highest among teenagers
- Generally, across the board, drug use seems to peak around 20.
  - This might be the age where people are willing to take the most risk while simultaneously not being able to understand the full risk involved.
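
Readings like these can be checked against the numbers by asking dplyr for the age with the highest use within each drug. The sketch below does that for two drugs, using only the first ten use values printed in the str output earlier. It assumes those values correspond to the single-year ages 12 through 21, and the names drugs_toy and peaks are made up for illustration.

```r
library(dplyr)

# First ten use values per drug from the str output; ages 12 to 21 are assumed.
drugs_toy <- data.frame(
  age  = rep(as.character(12:21), times = 2),
  drug = rep(c("cocaine", "hallucinogen"), each = 10),
  use  = c(0.1, 0.1, 0.1, 0.5, 1, 2, 3.2, 4.1, 4.9, 4.8,
           0.2, 0.6, 1.6, 2.1, 3.4, 4.8, 7, 8.6, 7.4, 6.3),
  stringsAsFactors = FALSE
)

# Keep the row with the highest use within each drug.
peaks <- drugs_toy %>%
  group_by(drug) %>%
  slice_max(use, n = 1) %>%
  ungroup()

print(peaks)
```

Within these rows, cocaine is highest at age 20 and hallucinogens at age 19. Running the same pipeline on the full drugs dataframe would cover the older age groups as well.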

Conclusion

The tidyverse really makes R easy to use once you get the hang of it. Thanks, Hadley!