Today, we’ll demonstrate some uses of the tidyverse. We’ll use dplyr, tidyr, stringr, and ggplot2 to take a dataset and perform some exploratory analysis. The tidyverse is a wonderful toolkit for taking a dataset of almost any type, transforming it, and presenting findings. Because it is flexible and powerful, users can focus on answering layers of questions and deepening their analysis without getting bogged down in technical tasks.
Personally, the packages we’re using today are my favorites, notably ggplot2 and the pipe operator (%>%) that dplyr provides.
Let’s look at a dataset that comes from FiveThirtyEight. The referenced article can be found here and discusses drug use among baby boomers and other age groups. There is one visualization in the article, but we’ll create more to explore the data, which is available on GitHub.
Visualizations are one of the easiest ways for people to digest data.
We can load each of the packages separately, or we can just load the tidyverse:
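A minimal sketch of both options, assuming the packages are already installed (for example with install.packages('tidyverse')):

```r
# Option 1: attach only the packages used in this post
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

# Option 2: attach the core tidyverse, which includes all four of the above
library(tidyverse)
```

Either option makes the functions used in the rest of this post available.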
Let’s pull the data directly from GitHub into a new dataframe object, then take a quick peek at what we’re working with:
df <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/drug-use-by-age/drug-use-by-age.csv',
               stringsAsFactors = FALSE)
str(df)
## 'data.frame': 17 obs. of 28 variables:
## $ age : chr "12" "13" "14" "15" ...
## $ n : int 2798 2757 2792 2956 3058 3038 2469 2223 2271 2354 ...
## $ alcohol.use : num 3.9 8.5 18.1 29.2 40.1 49.3 58.7 64.6 69.7 83.2 ...
## $ alcohol.frequency : num 3 6 5 6 10 13 24 36 48 52 ...
## $ marijuana.use : num 1.1 3.4 8.7 14.5 22.5 28 33.7 33.4 34 33 ...
## $ marijuana.frequency : num 4 15 24 25 30 36 52 60 60 52 ...
## $ cocaine.use : num 0.1 0.1 0.1 0.5 1 2 3.2 4.1 4.9 4.8 ...
## $ cocaine.frequency : chr "5.0" "1.0" "5.5" "4.0" ...
## $ crack.use : num 0 0 0 0.1 0 0.1 0.4 0.5 0.6 0.5 ...
## $ crack.frequency : chr "-" "3.0" "-" "9.5" ...
## $ heroin.use : num 0.1 0 0.1 0.2 0.1 0.1 0.4 0.5 0.9 0.6 ...
## $ heroin.frequency : chr "35.5" "-" "2.0" "1.0" ...
## $ hallucinogen.use : num 0.2 0.6 1.6 2.1 3.4 4.8 7 8.6 7.4 6.3 ...
## $ hallucinogen.frequency : num 52 6 3 4 3 3 4 3 2 4 ...
## $ inhalant.use : num 1.6 2.5 2.6 2.5 3 2 1.8 1.4 1.5 1.4 ...
## $ inhalant.frequency : chr "19.0" "12.0" "5.0" "5.5" ...
## $ pain.releiver.use : num 2 2.4 3.9 5.5 6.2 8.5 9.2 9.4 10 9 ...
## $ pain.releiver.frequency: num 36 14 12 10 7 9 12 12 10 15 ...
## $ oxycontin.use : num 0.1 0.1 0.4 0.8 1.1 1.4 1.7 1.5 1.7 1.3 ...
## $ oxycontin.frequency : chr "24.5" "41.0" "4.5" "3.0" ...
## $ tranquilizer.use : num 0.2 0.3 0.9 2 2.4 3.5 4.9 4.2 5.4 3.9 ...
## $ tranquilizer.frequency : num 52 25.5 5 4.5 11 7 12 4.5 10 7 ...
## $ stimulant.use : num 0.2 0.3 0.8 1.5 1.8 2.8 3 3.3 4 4.1 ...
## $ stimulant.frequency : num 2 4 12 6 9.5 9 8 6 12 10 ...
## $ meth.use : num 0 0.1 0.1 0.3 0.3 0.6 0.5 0.4 0.9 0.6 ...
## $ meth.frequency : chr "-" "5.0" "24.0" "10.5" ...
## $ sedative.use : num 0.2 0.1 0.2 0.4 0.2 0.5 0.4 0.3 0.5 0.3 ...
## $ sedative.frequency : num 13 19 16.5 30 3 6.5 10 6 4 9 ...
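Looking at the structure, the dataset can be summarized in a few points: each of the 17 rows is an age group, n is the number of people surveyed in that group, and each of the 13 drugs has a use column and a frequency column. Several of the frequency columns were read in as character strings because missing values are recorded as "-".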
Ideally, we would have the raw, unit-level data showing the results for every person surveyed. Fortunately, with the tidyverse, we can still make good use of this summary data.
Since we’ll be visualizing the data with ggplot2, we’ll want this dataset in a different structure. We’ll ignore the survey counts for now and build a new dataframe with four fields: age, drug, frequency, and use.
We’ll make use of the following packages to do this: tidyr, dplyr, and stringr.
Let’s start by taking this wide dataset and making it tall with the gather function from tidyr:
## age field_name value
## 1 12 n 2798
## 2 13 n 2757
## 3 14 n 2792
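The reshaping step above can be sketched on a toy subset of the data. The values are copied from the str output earlier; the object names wide and tall are assumptions made for this illustration, not necessarily the names used originally.

```r
library(tidyr)

# Toy subset of the wide data: three age groups and a handful of columns
wide <- data.frame(
  age = c("12", "13", "14"),
  n = c(2798L, 2757L, 2792L),
  alcohol.use = c(3.9, 8.5, 18.1),
  alcohol.frequency = c(3, 6, 5),
  crack.frequency = c("-", "3.0", "-"),
  stringsAsFactors = FALSE
)

# Stack every column except age into field_name/value pairs.
# The gathered columns mix numbers and strings, so value comes back as character.
tall <- gather(wide, key = "field_name", value = "value", -age)
head(tall, 3)
```

In current versions of tidyr, gather() still works but has been superseded by pivot_longer().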
Here’s the powerhouse of the tidyverse, in my opinion: dplyr. Let’s split the tall table into two, one for survey counts and another for drug usage. We’ll use the filter function in dplyr.
## age field_name value
## 1 12 alcohol.use 3.9
## 2 13 alcohol.use 8.5
## 3 14 alcohol.use 18.1
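As a rough sketch of the split, assuming the tall table is called tall (a name made up for this example; counts and usage are the names used in the rest of this post):

```r
library(dplyr)

# Toy tall table in the same shape as the output shown above
tall <- data.frame(
  age = c("12", "13", "12", "13", "12", "13"),
  field_name = c("n", "n", "alcohol.use", "alcohol.use",
                 "alcohol.frequency", "alcohol.frequency"),
  value = c("2798", "2757", "3.9", "8.5", "3", "6"),
  stringsAsFactors = FALSE
)

counts <- filter(tall, field_name == "n")  # survey counts only
usage  <- filter(tall, field_name != "n")  # drug use and frequency rows
```

Filtering on the same condition and its negation guarantees that every row lands in exactly one of the two tables.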
The counts dataframe looks ready to use. Let’s continue building out the usage dataframe.
Now that we’re looking only at drug use and frequency, let’s determine which row is which. We can use str_extract from stringr to find a specified pattern in each string, and we’ll add the result as a new field:
## age field_name value type
## 1 12 alcohol.use 3.9 use
## 2 13 alcohol.use 8.5 use
## 3 14 alcohol.use 18.1 use
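One way this extraction might look on a few sample rows. The exact pattern used originally isn’t shown, so the regular expression below is an assumption:

```r
library(stringr)

usage <- data.frame(
  age = c("12", "12", "12"),
  field_name = c("alcohol.use", "alcohol.frequency", "pain.releiver.use"),
  value = c("3.9", "3", "2"),
  stringsAsFactors = FALSE
)

# Pull the trailing "use" or "frequency" out of each field name;
# the $ anchor restricts the match to the end of the string
usage$type <- str_extract(usage$field_name, "(use|frequency)$")
usage$type
```

Anchoring the pattern to the end of the string is a small safeguard in case a drug name ever contains the letters "use".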
Now that we have use and frequency extracted into a separate field, we can use base R substr to clean up the field_name.
usage$field_name <- substr(usage$field_name, 1, nchar(usage$field_name) - nchar(usage$type))
head(usage[239:241,])
## age field_name value type
## 239 12 pain.releiver. 2 use
## 240 13 pain.releiver. 2.4 use
## 241 14 pain.releiver. 3.9 use
The third argument to substr is the stop position: the length of the field name minus the length of the type, which drops the trailing use or frequency.
We could skip this step, but let’s clean up the extra periods, since a tidy final product removes distractions when reading the visualization.
## age field_name value type
## 239 12 pain releiver 2 use
## 240 13 pain releiver 2.4 use
## 241 14 pain releiver 3.9 use
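A small sketch of the period cleanup on two sample values, using stringr; this is one possible approach rather than necessarily the original code:

```r
library(stringr)

field_name <- c("pain.releiver.", "alcohol.")

# "." is a regex wildcard, so wrap it in fixed() to match a literal period
cleaned <- str_replace_all(field_name, fixed("."), " ")
cleaned
```

Each period becomes a space, which leaves a trailing space on every name, matching the output above.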
# Let's also fix the spelling error in the dataset
usage$field_name <- str_replace_all(usage$field_name, 'releiver', 'reliever')
head(usage[239:241,])
## age field_name value type
## 239 12 pain reliever 2 use
## 240 13 pain reliever 2.4 use
## 241 14 pain reliever 3.9 use
One last string cleanup to do here: let’s trim the white space at the end of the field name. Also, since we’ve completely transformed the field, let’s rename it.
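A minimal sketch of the trim and rename on two sample rows, assuming the new column is called drug (the name that appears in the later output):

```r
library(dplyr)
library(stringr)

usage <- data.frame(
  age = c("12", "12"),
  field_name = c("alcohol ", "pain reliever "),
  value = c("3.9", "2"),
  type = c("use", "use"),
  stringsAsFactors = FALSE
)

usage <- usage %>%
  # drop the trailing space left over from the period cleanup
  mutate(field_name = str_trim(field_name, side = "right")) %>%
  # rename() takes new_name = old_name
  rename(drug = field_name)
```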
Now that we have all the data points separated, we can spread the data back out so that the two variables, use and frequency, each get their own column.
## 'data.frame': 221 obs. of 4 variables:
## $ age : chr "12" "12" "12" "12" ...
## $ drug : chr "alcohol" "cocaine" "crack" "hallucinogen" ...
## $ frequency: chr "3" "5.0" "0" "52" ...
## $ use : chr "3.9" "0.1" "0" "0.2" ...
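The spread step can be sketched on a toy version of the usage table. The values are taken from the output above; the call itself is an assumption about how the step was done:

```r
library(tidyr)

usage <- data.frame(
  age = c("12", "12", "12", "12"),
  drug = c("alcohol", "alcohol", "cocaine", "cocaine"),
  value = c("3.9", "3", "0.1", "5.0"),
  type = c("use", "frequency", "use", "frequency"),
  stringsAsFactors = FALSE
)

# Each distinct value of type becomes its own column, filled from value
drugs <- spread(usage, key = type, value = value)
str(drugs)
```

As with gather(), spread() still works but has been superseded in current tidyr, in this case by pivot_wider().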
Let’s convert frequency and use to numeric values:
## 'data.frame': 221 obs. of 4 variables:
## $ age : chr "12" "12" "12" "12" ...
## $ drug : chr "alcohol" "cocaine" "crack" "hallucinogen" ...
## $ frequency: num 3 5 0 52 35.5 19 4 0 24.5 36 ...
## $ use : num 3.9 0.1 0 0.2 0.1 1.6 1.1 0 0.1 2 ...
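A minimal sketch of the conversion with dplyr’s mutate, on two sample rows from the output above:

```r
library(dplyr)

drugs <- data.frame(
  age = c("12", "12"),
  drug = c("alcohol", "cocaine"),
  frequency = c("3", "5.0"),
  use = c("3.9", "0.1"),
  stringsAsFactors = FALSE
)

# Overwrite both character columns with their numeric equivalents
drugs <- mutate(drugs,
                frequency = as.numeric(frequency),
                use = as.numeric(use))
str(drugs)
```

Note that as.numeric() turns any string that isn’t a number into NA with a warning, so it’s worth checking for leftover placeholders before converting.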
Since age actually holds age groups, we can leave it as a string. Luckily, the groups happen to sort alphabetically in the right order. If they didn’t, we would convert this field into a factor and define the order of the levels.
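For illustration, here is how the factor approach might look. The labels below are hypothetical, chosen so that alphabetical order comes out wrong:

```r
# Hypothetical age labels (not the real ones from this dataset)
age <- c("30-34", "9", "12")
sort(age)  # character sort puts "12" before "9"

# Declare the intended order explicitly through the factor levels
age_f <- factor(age, levels = c("9", "12", "30-34"))
sort(age_f)
```

ggplot2 orders a discrete axis by factor levels, so the plots would then follow the declared order rather than the alphabetical one.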
First of all, I’m curious about my own age group. I’m 34 right now, so I fit into the 30-34 age group. Let’s filter for my age group, then use that in a visualization. We could do this in separate steps, but dplyr’s pipe allows us to take the result of one function and pass it straight to another.
# percent of this age group that uses listed drug
filter(drugs, age == '30-34') %>%
  ggplot(aes(x = reorder(drug, use), y = use)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  labs(x = element_blank(),
       y = element_blank(),
       title = "Percent of People Who Have Used Drugs",
       subtitle = 'Age Group: 30-34')
One of the most powerful features of ggplot2 is faceting, which repeats the same plot structure across the values of an additional variable.
Instead of filtering and plotting each age group separately, we’ll facet on age group and see them all at once.
ggplot(drugs, aes(x = reorder(drug, use), y = use)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  labs(x = element_blank(),
       y = element_blank(),
       title = "Percent of People Who Have Used Drugs") +
  facet_wrap(~age) +
  theme(text = element_text(size = 7))
We can also look at the distribution of usage by age for each drug:
ggplot(drugs, aes(x = age, y = use)) +
  geom_bar(stat = 'identity') +
  labs(x = element_blank(),
       y = element_blank(),
       title = "Distribution of Age by Drug") +
  facet_wrap(~drug) +
  theme(text = element_text(size = 7),
        axis.text.x = element_text(angle = 90, hjust = 1))
At a high level, we can easily pick out the most popular drugs, but the shared y scale makes the trends in the less common drugs hard to see. For that, we’ll need to free the y scale for each drug:
ggplot(drugs, aes(x = age, y = use)) +
  geom_bar(stat = 'identity') +
  labs(x = element_blank(),
       y = element_blank(),
       title = "Distribution of Age by Drug") +
  facet_wrap(~drug, scales = 'free_y') +
  theme(text = element_text(size = 7),
        axis.text.x = element_text(angle = 90, hjust = 1))
Now this is interesting! Here are a few things we can easily determine from this visualization:
The tidyverse really makes R easy to use once you get the hang of it. Thanks, Hadley!