Whether you are new to R or a grizzly veteran, you might had heard the term “tidydata” or “tidyverse”. What is Tidyverse you ask? Is it fun? Will it make your life easier? Is it a cult?
Yes, yes and probably!
In short “Tidy Data” is a methodology championed by Hadley Wickham [aka R Jesus] in his seminal journalarticel that lays out a framework to organize, clean and transform data sets so one can spend less time munging and more time analyzing!
Tidyverse has three simple rules:
Tidy
Tidying up your data might seem strange at first but there are advantages to using said framework.
Tidy data is consistent.
Tidy data is vectorized and is easy to use in R
The Tidy Universe is actually a collection R packages that are helpful for data munging/manipulation. From the main reference page–the “core” Tidyverse packages are:
I also like to use tidyr
We will be showcasing a few select packages and how they can useful for data analysis!
We will be manipulating the Master Candy Power Rankings from 538–the complete data set can be found via this GitHub–tooth brush’s optional!
Tidyverse and tidyr can be loaded easily as follows
library(tidyverse)
library(tidyr)
Loaded!
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.2.1 v purrr 0.3.3
v tibble 2.1.3 v stringr 1.4.0
v readr 1.3.1 v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
> library(tidyr)
First let’s load the data into a data frame and explore.
df=read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv')
And let’s just so some basic exploration.
str(df)
## 'data.frame': 85 obs. of 13 variables:
## $ competitorname : Factor w/ 85 levels "100 Grand","3 Musketeers",..: 1 2 45 46 3 4 5 6 7 8 ...
## $ chocolate : int 1 1 0 0 0 1 1 0 0 0 ...
## $ fruity : int 0 0 0 0 1 0 0 0 0 1 ...
## $ caramel : int 1 0 0 0 0 0 1 0 0 1 ...
## $ peanutyalmondy : int 0 0 0 0 0 1 1 1 0 0 ...
## $ nougat : int 0 1 0 0 0 0 1 0 0 0 ...
## $ crispedricewafer: int 1 0 0 0 0 0 0 0 0 0 ...
## $ hard : int 0 0 0 0 0 0 0 0 0 0 ...
## $ bar : int 1 1 0 0 0 1 1 0 0 0 ...
## $ pluribus : int 0 0 0 0 0 0 0 1 1 0 ...
## $ sugarpercent : num 0.732 0.604 0.011 0.011 0.906 ...
## $ pricepercent : num 0.86 0.511 0.116 0.511 0.511 ...
## $ winpercent : num 67 67.6 32.3 46.1 52.3 ...
Alright–13 columns[variables] that composed of factors, numbers and integers. No need to change any variable type–thank you 538!
Now let’s look at how it’s structured.
head(df)
From the data we have a 85 different candies, each described by either a single or mulitple attribute, with a unque sugarpercent, pricepercent and winpercent.
When working in the tidyverse it’s often useful to convert your dataframe to a tibble which is a type of dataframe with a tbl_df class. Tibbles are essentially standardized dataframes, with various properties, one of which is a visually appealing rectangular format.
as_tibble(.data)
as_tibble() will convert a passed dataframe to a tibble
as_tibble(df)
# A tibble: 85 x 13
competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar pluribus sugarpercent
<fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 100 Grand 1 0 1 0 0 1 0 1 0 0.732
2 3 Musketeers 1 0 0 0 1 0 0 1 0 0.604
3 One dime 0 0 0 0 0 0 0 0 0 0.011
4 One quarter 0 0 0 0 0 0 0 0 0 0.011
5 Air Heads 0 1 0 0 0 0 0 0 0 0.906
6 Almond Joy 1 0 0 1 0 0 0 1 0 0.465
7 Baby Ruth 1 0 1 1 1 0 0 1 0 0.604
8 Boston Baked ~ 0 0 0 1 0 0 0 0 1 0.313
9 Candy Corn 0 0 0 0 0 0 0 0 1 0.906
10 Caramel Apple~ 0 1 1 0 0 0 0 0 0 0.604
Compare this to the default head(df) print and we can see the tibble is cleaner, providers the datatypes below each column and has a rectanguar structure.
Let’s say we have a certain sweet-tooth and just want to look at candys and their sugar-content column, and not deal with the others.
We can use select() from the dplyr package!
select(.data,…)
select() allows you to choose or rename variables/columns in a passed table/dataframe. One can isolate, drop, rename or slice a dataframe. Select is very powerful and has a host of different functions
Let’s say we have a sweet-tooth and want to only isolate candy’s and their sugarpercent
select(as_tibble(df),competitorname,sugarpercent)
From our last example we isolated the candy columns and sugar content but they aren’t sorted! What if we had a diabetes and want to make sure we had a list of candies with the lowest sugar?
We can use dplyr: arrange
arrange() accepts a table and reorganizes(sorts) it by a passed column–by default it uses ascending order. Wrapping desc()
arrange(.data,…)
arrange(select(as_tibble(df),competitorname,sugarpercent),sugarpercent)
desc(.data,…)
arrange(select(as_tibble(df),competitorname,sugarpercent),desc(sugarpercent))
By now you noticed that to do these steps of tidy, the tibble, selecting and then arranging, I either have to pass the function and dataframe through each step or would need to make intermediate frames which would take up needless memory. One powerful way in the tidyverse to chain functions/commands is piping via %>%.
Note: %>% is actually part of the magrittr package but it’s loade dwith dplyr.
%>% takes whatever is left to it and forward (or pipes) it to an expression on the right.–this can be an expression, function, action
x %>% f(x
df %>% as_tibble() %>% select(competitorname,sugarpercent) %>% arrange(sugarpercent)
Compare that with the previous used code
arrange(select(as_tibble(df),competitorname,sugarpercent),desc(sugarpercent))
Assume we have a love for choclate candy. We might want to filter for candy that has chocolate in it
We can use dplyr::filter
Use filter() find rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.
filter(.data, …)
filter(as_tibble(df),chocolate ==1)
If we are worried about high chocolate amount in chocoalte treats we could want to have a look at average sugar content in choclate versus non-choclate treats:
as_tibble(df) %>%
dplyr::group_by(chocolate) %>%
dplyr::summarize(avgsugarpercent = mean(sugarpercent, na.rm = TRUE))
As we can see chocolate candy only has 6% more sugar on average so nothing too worrying.
The piping %>% operator is very useful in it’s ability to pass a chain of commands to a function. Let’s say we want to see if there is a relationship between winpercentage and sugarpercentage, we can follow the previous steps (selecting the winpercentage column in addition) and sending it all to ggplot [another core tidyverse package but beyond the scope of this notebook]
df %>% as_tibble() %>% select(competitorname,winpercent,sugarpercent) %>% ggplot(aes(x=winpercent,y=sugarpercent))+geom_point()
The data looks evenly distributed so there doesn’t seem to be a linear relation but let’s add a best fit regression line to show off %>%
df %>% as_tibble() %>% select(competitorname,winpercent,sugarpercent) %>% ggplot(aes(x=winpercent,y=sugarpercent))+geom_point()+geom_smooth(method='lm')
Finally–let’s label the candy’s for our diabetic friend that have a win percentage above 60% but have a sugar content less than 50%
df %>% as_tibble() %>% select(competitorname,winpercent,sugarpercent) %>% ggplot(aes(x=winpercent,y=sugarpercent),label=competitorname)+geom_point()+geom_smooth(method='lm')+geom_text(aes(label=ifelse((sugarpercent<0.50 & winpercent>60),as.character(competitorname),'')))
As you can see the tidyverse has a myriad of great tools to quickly analyze data and find insight–and most of this was possible due to the data already being in tidy form! We were able to quickly do a linear regression on a sorted, selected and now labeled group of data with one line of code; had it been a messy data set, most of our time would be devoted to cleaning!