NO CODING ZONE
tidyr and ggplot2 packages for easy data
pivoting and layer-based visualizationslike_thisTHE 6 DPLYR FUNCTIONS COVERED:
group_by(): as the name implies - use this to determine
what you want to group your data bysummarise(): used in tandem with group_by - use this to
determine what to summarise within selected groupsmutate(): use this to create a new variableinner_join(): use this to join tablesfilter(): use this to filter rowsselect(): use this to select columnsEND OF BACKGROUND INFO. NOW LET’S BEGIN OUR SCRIPT
NOW YOU ARE READY TO START CODING
Before we start coding - you should save the following files from the git repository to your working directory:
Most important is the csv file. This will be used in a later section.
#install.packages('dplyr')
library(ggplot2)
library(dplyr)
library(tidyr)
library(data.table)
library(plotly)
DECONSTRUCT THE CODE ABOVE:
install.packages() as “save a local copy of
(this external CRAN package)”library() as “for this working session of R,
activate (this locally saved package)”function(argument(s))You only need to install packages once. You need to activate selected
packages every session. Best practice is to begin any script with the
packages you will need for the task at hand. For this exercise we need
the dplyr package (no surprise) and the
data.table package (explained below).
install.packages('package_name'): makes a local copy
of a package that resides on CRAN
library(package_name): makes all functions contained
within selected package available during the current R working
session
Base R functions are accessible without the need to load any packages
Fun fact: Base R was used to develop many of the
dplyr functions
There are several datasets that are part of base R or part of the various packages you might install. In order to make this demonstration self contained, we will be using one of these datasets.
IMPORT INTERNAL DATASET
# READ IN AN INTERNAL DATASET
df.diamonds <- diamonds
IMPORT EXTERNAL CSV
Generally speaking, you will not be using internal datasets for your
projects. It is more common to import external files, such as csv files.
This section demonstrates how to import an external csv file:
# READ IN CSV
df.mapping <- read.csv('clarity_map.csv')
df.mapping <- fread('clarity_map.csv')
DECONSTRUCTING CODE ABOVE:
<- should be interpreted as “Create a
new object”fread() and read.csv()read.csv('file.csv') AS “import into this
active R session (‘file.csv’)fread.csv('file.csv') AS “import into this
active R session (‘file.csv’) - and do it much faster than read.csv
would.fread() every time? It is
not internal - requires us to load data.table - read.csv()
is internal - no external packages required.TRANSLATE CODE TO ENGLISH
df.mapping by not so quickly
importing the file ‘clarity_map.csv’.df.mapping by much more
quickly importing the file ‘clarity_map.csv’Once you import the data - you will see your 2 new objects in the upper right hand console:
df.diamonds1df.diamonds2NOTE: We did not have to specify a directory to export to or import from. This is because we set our working directory - and R knows to look here.
Calculate total price by table:
df.agg <- df.diamonds %>%
group_by(clarity) %>%
summarise(Total.Price = sum(price), Observations = n())
dim(df.agg)
## [1] 8 3
DECONSTRUCT THE CODE ABOVE
%>% is called a pipe.group_by(column) as: ‘combine your dataset
(into levels of selected column)’.summarise() as: ‘Sum totals within each
group_by level (these are the values to sum)’.group_by() and summarise() are
almost always used in tandem.TRANSLATE THE CODE TO ENGLISH
df.group as follows:df.diamonds THENNotice there is a new object in our upper left hand panel called
df.agg. This object has 127 rows and 2 columns. Click on
object for further inspection reveals the columns are ‘clarity’ and
‘Total.Price’ as expected.
df.agg <- df.agg %>%
mutate(average.price = Total.Price/Observations)
dim(df.agg)
## [1] 8 4
DECONSTRUCT THE CODE ABOVE
mutate(variable.name) as “create new variable
and that variable name is”variable.name”.Map the 8 clarity levels to 3 categories as follows:
| clarity | clarity.map |
|---|---|
| I1 | Bad |
| SI2 | Bad |
| SI1 | Medium |
| VS2 | Medium |
| VS1 | Medium |
| VVS2 | Medium |
| VVS1 | Good |
| IF | Good |
Use the following snippet:
df.agg <- df.agg %>%
inner_join(df.mapping, by = 'clarity')
dim(df.agg)
## [1] 8 5
DECONSTRUCT THE CODE ABOVE:
inner_join() as “perform an inner
join on previous table (with new table named here).df.agg table.TRANSLATE CODE TO ENGLISH:
df.aggdf.agg THENdf.mapping table by the column
‘clarity’As with everything, there is more than one way to accomplish this
task. The snippet below demonstrates an alternative method using
dplyr::if_else(). The end result is identical to the
inner_join() method - so both methods are correct. A table
makes sense when you have a lot of levels. if_else() is
easier if you have just a few levels. Ultimately, this is a personal
choice and depends on multiple factors that are different for each
individual and each project.
df.agg <- df.agg %>%
mutate(clarity.map2 = if_else(clarity %in% c('I1','SI2'),'Bad',
if_else(clarity %in% c('SI1','VS2','VS1','VVS2'),'Medium',
if_else(clarity %in% c('VVS1','IF'),'Good','Undefined'))))
Nested if_else() functions (or “if_else()
functions within if_else() functions”), as shown above, can
be difficult to digest. A third alternative is using
dplyr::case_when() - If you’re comfortable with SQL, this
function will feel right at home. Below is an example of
case_when() within a mutate() function:
df.agg <- df.agg %>%
mutate(clarity.map3 = case_when(clarity %in% c('I1','SI2') ~ 'Bad',
clarity %in% c('SI1','VS2','VS1','VVS2') ~ 'Medium',
clarity %in% c('VVS1','IF') ~ 'Good',
TRUE ~ 'Undefined'))
DECONSTRUCT THE CODE ABOVE:
%in%: this is a boolean based on
being an element of a list.c(): this is base R function to
concatenate strings and other R objectsTRANSLATE CODE TO ENGLISH:
Only look at diamonds mapped to “Good”.
df.agg <- df.agg %>%
filter(clarity.map == 'Good')
DECONSTRUCT THE CODE ABOVE:
filter() as: “only look at rows
where (this is true)”.TRANSLATE THE CODE TO ENGLISH:
df.aggdf.agg THENFilter will result in less than or equal to the original number of rows. Filter will not impact the number of columns.
Remove the columns “clarity.map2” and “clarity.map3” for housekeeping
df.agg1 <- df.agg %>%
select('clarity', 'Total.Price', 'Observations', 'average.price', 'clarity.map')
DECONSTRUCT THE CODE ABOVE:
select() as “select the specific columns
being stated”TRANSLATE CODE TO ENGLISH:
df.agg1df.agg THENselect() functionALTERNATE METHOD FOR TASK 7:
df.agg2 <- df.agg %>%
select(-c('clarity.map2', 'clarity.map3'))
df.agg <- df.agg1
rm(df.agg1, df.agg2)
DECONSTRUCT THE CODE ABOVE:
c()TRANSLATE THE CODE TO ENGLISH:
df.agg2df.agg THENWe have now taken the diamonds dataset THEN
df.mapping table THENThis was all done sequentially on the same dataset. This is a good representation of how code is created. Very piecemeal and iterative. It is very rare that you go into a project knowing all the tasks you need to complete sequentially. Ultimately, however, you do get your final result - like in our simplified example above. At this point, you can use the pipe operator to clean up and chain your code together as follows:
df.chained <- df.diamonds %>%
group_by(clarity) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations) %>%
inner_join(df.mapping, by = 'clarity') %>%
mutate(clarity.map2 = if_else(clarity %in% c('I1','SI2'),'Bad',
if_else(clarity %in% c('SI1','VS2','VS1','VVS2'),'Medium',
if_else(clarity %in% c('VVS1','1F'),'Good','Undefined'))),
clarity.map3 = case_when(clarity %in% c('I1','SI2') ~ 'Bad',
clarity %in% c('SI1','VS2','VS1','VVS2') ~ 'Medium',
clarity %in% c('VVS1','IF') ~ 'Good',
TRUE ~ 'Undefined')) %>%
filter(clarity.map == 'Good') %>%
select(c('clarity', 'average.price'))
The two main functions that we’ll learn about from the
tidyr package are pivot_longer() and
pivot_wider(). Pivoting your data longer will increase the
number of observations (rows) but decrease the number of variables
(columns). Pivoting your data longer will decrease the number of
observations but increase the number of variables.
For this tidyr overview, we’re going to step away from
the df.diamonds object and load / assign a new object,
tidyr_data, and let’s take a glimpse of the data:
# The relig_income dataset is part of the tidyr package
tidyr_data <- relig_income
glimpse(tidyr_data)
## Rows: 18
## Columns: 11
## $ religion <chr> "Agnostic", "Atheist", "Buddhist", "Catholic", "D…
## $ `<$10k` <dbl> 27, 12, 27, 418, 15, 575, 1, 228, 20, 19, 289, 29…
## $ `$10-20k` <dbl> 34, 27, 21, 617, 14, 869, 9, 244, 27, 19, 495, 40…
## $ `$20-30k` <dbl> 60, 37, 30, 732, 15, 1064, 7, 236, 24, 25, 619, 4…
## $ `$30-40k` <dbl> 81, 52, 34, 670, 11, 982, 9, 238, 24, 25, 655, 51…
## $ `$40-50k` <dbl> 76, 35, 33, 638, 10, 881, 11, 197, 21, 30, 651, 5…
## $ `$50-75k` <dbl> 137, 70, 58, 1116, 35, 1486, 34, 223, 30, 95, 110…
## $ `$75-100k` <dbl> 122, 73, 62, 949, 21, 949, 47, 131, 15, 69, 939, …
## $ `$100-150k` <dbl> 109, 59, 39, 792, 17, 723, 48, 81, 11, 87, 753, 4…
## $ `>150k` <dbl> 84, 74, 53, 633, 18, 414, 54, 78, 6, 151, 634, 42…
## $ `Don't know/refused` <dbl> 96, 76, 54, 1489, 116, 1529, 37, 339, 37, 162, 13…
After review, we can group the above data points into three buckets:
While the “out of the box” format of the tidyr_data is
nice on the eyes, it makes it more difficult to model or visualize. For
this, we’ll explore the tidyr::pivot_longer() function.
The pivot_longer() function is structured as follows
with four main arguments:
df %>%
pivot_longer(
cols = c(col_name_a, col_name_b),
names_to = "new_col to capture all variables being pivoted",
values_to = "new col for values"
)
Lets put the above to practice! The goal of this exercise will be to
pivot long the tidyr_data object and end up with a new
tibble that only has 3 columns (instead of 11): religion,
income, and count
tidyr_long <- tidyr_data %>%
pivot_longer(
cols = !religion,
names_to = "income",
values_to = "count"
)
glimpse(tidyr_long)
## Rows: 180
## Columns: 3
## $ religion <chr> "Agnostic", "Agnostic", "Agnostic", "Agnostic", "Agnostic", "…
## $ income <chr> "<$10k", "$10-20k", "$20-30k", "$30-40k", "$40-50k", "$50-75k…
## $ count <dbl> 27, 34, 60, 81, 76, 137, 122, 109, 84, 96, 12, 27, 37, 52, 35…
DECONSTRUCT THE CODE ABOVE:
pivot_longer() describes which
columns need to be reshaped
names_to argument gives the name of the variable
that will be created from the data stored in the column names,
i.e. “income”values_to argument gives the name of the variable
that will be created from the data stored in the cells,
i.e. “count”Now, lets say your boss is “OK” with what you’ve created with the
tidyr_long object, but they want to see a different cut of
the data… This time, your boss wants to see “income” as the core
identifier and wants to see “religion” as the columns to capture the
“count” values… What to do?!?
We’ve learned about pivoting data long but now lets pivot
the data wide as the boss requested. The
pivot_wider() function is structured as follows with three
main arguments:
df %>%
pivot_wider(
names_from = "column name containing the values you want to spread to new columns",
names_sort = TRUE, # the default is FALSE to sort by first appearance
values_from = "values that will populate the cell values"
)
Since the tidyr_long object has 18 rows, we should
expect to see 19 columns after we pivot wide, i.e., 1 column for the
identifier (income) + 18 new columns created from pivoting wide
(religion). Putting this code to practice, let’s create a new object
called tidyr_wide with the pivot_wider()
function:
tidyr_wide <- tidyr_long %>%
pivot_wider(
names_from = religion,
names_sort = TRUE,
values_from = count
)
glimpse(tidyr_wide)
## Rows: 10
## Columns: 19
## $ income <chr> "<$10k", "$10-20k", "$20-30k", "$30-40k", "$…
## $ Agnostic <dbl> 27, 34, 60, 81, 76, 137, 122, 109, 84, 96
## $ Atheist <dbl> 12, 27, 37, 52, 35, 70, 73, 59, 74, 76
## $ Buddhist <dbl> 27, 21, 30, 34, 33, 58, 62, 39, 53, 54
## $ Catholic <dbl> 418, 617, 732, 670, 638, 1116, 949, 792, 633…
## $ `Don’t know/refused` <dbl> 15, 14, 15, 11, 10, 35, 21, 17, 18, 116
## $ `Evangelical Prot` <dbl> 575, 869, 1064, 982, 881, 1486, 949, 723, 41…
## $ Hindu <dbl> 1, 9, 7, 9, 11, 34, 47, 48, 54, 37
## $ `Historically Black Prot` <dbl> 228, 244, 236, 238, 197, 223, 131, 81, 78, 3…
## $ `Jehovah's Witness` <dbl> 20, 27, 24, 24, 21, 30, 15, 11, 6, 37
## $ Jewish <dbl> 19, 19, 25, 25, 30, 95, 69, 87, 151, 162
## $ `Mainline Prot` <dbl> 289, 495, 619, 655, 651, 1107, 939, 753, 634…
## $ Mormon <dbl> 29, 40, 48, 51, 56, 112, 85, 49, 42, 69
## $ Muslim <dbl> 6, 7, 9, 10, 9, 23, 16, 8, 6, 22
## $ Orthodox <dbl> 13, 17, 23, 32, 32, 47, 38, 42, 46, 73
## $ `Other Christian` <dbl> 9, 7, 11, 13, 13, 14, 18, 14, 12, 18
## $ `Other Faiths` <dbl> 20, 33, 40, 46, 49, 63, 46, 40, 41, 71
## $ `Other World Religions` <dbl> 5, 2, 3, 4, 2, 7, 3, 4, 4, 8
## $ Unaffiliated <dbl> 217, 299, 374, 365, 341, 528, 407, 321, 258,…
DECONSTRUCT THE CODE ABOVE:
names_from, identifies the column
from which we’ll be spreading the values contained in it to new columns
names_sort argument is optional (defaults to
FALSE) - by setting it to TRUE, we ensure the
new columns spread wide are in alphabetical order vs the order of first
appearancevalues_from argument supplies the column name with
the values that we want to use to populate the values in our reshaped
wide datasetSometimes there is a need to pivot your data long then wide to get your desired results. Sometimes there is a need to summarize the data using your own custom functions. Below, we’ll aim to do the following:
tidyr_data object and pivot it
long using pivot_longer()pivot_wider() function but
using different criteria to spread the data widepivot_wider() example in section 13, this
time, we’re going to use a fourth argument in the
pivot_wider() function, values_fn
values_fn, we’ll be able to replace the values in
our resulting table with categorical buckets representing the sample
sizevalues_fn = mean, but we can also pass through custom
functionscustom_sample_size_fn(), that will replace the numeric cell
values with categorical representations of sample sizecustom_sample_size_fn <- function(value) {
case_when(value < 50 ~ "S",
value < 250 ~ "M",
value < 750 ~ "L",
TRUE ~ "XL")
}
tidyr_wide2 <- tidyr_data %>%
pivot_longer(
cols = !religion,
names_to = "income",
values_to = "count"
) %>%
pivot_wider(
names_from = religion,
names_sort = TRUE,
values_from = count,
values_fn = ~ custom_sample_size_fn(.x)
)
glimpse(tidyr_wide2)
## Rows: 10
## Columns: 19
## $ income <chr> "<$10k", "$10-20k", "$20-30k", "$30-40k", "$…
## $ Agnostic <chr> "S", "S", "M", "M", "M", "M", "M", "M", "M",…
## $ Atheist <chr> "S", "S", "S", "M", "S", "M", "M", "M", "M",…
## $ Buddhist <chr> "S", "S", "S", "S", "S", "M", "M", "S", "M",…
## $ Catholic <chr> "L", "L", "L", "L", "L", "XL", "XL", "XL", "…
## $ `Don’t know/refused` <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S",…
## $ `Evangelical Prot` <chr> "L", "XL", "XL", "XL", "XL", "XL", "XL", "L"…
## $ Hindu <chr> "S", "S", "S", "S", "S", "S", "S", "S", "M",…
## $ `Historically Black Prot` <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M",…
## $ `Jehovah's Witness` <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S",…
## $ Jewish <chr> "S", "S", "S", "S", "S", "M", "M", "M", "M",…
## $ `Mainline Prot` <chr> "L", "L", "L", "L", "L", "XL", "XL", "XL", "…
## $ Mormon <chr> "S", "S", "S", "M", "M", "M", "M", "S", "S",…
## $ Muslim <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S",…
## $ Orthodox <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S",…
## $ `Other Christian` <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S",…
## $ `Other Faiths` <chr> "S", "S", "S", "S", "S", "M", "S", "S", "S",…
## $ `Other World Religions` <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S",…
## $ Unaffiliated <chr> "M", "L", "L", "L", "L", "L", "L", "L", "L",…
DECONSTRUCT THE CODE ABOVE:
case_when() function to bucket sample
size counts into categorical buckets,tidyr_data long,Initial Data Prep
Before we begin our data visualization - let’s do some data prep using
the tools above. First, create 2 new variables: “carat.group”, and
“table.group”. Then we will create 4 aggregated datasets as follows:
df.diamonds <- df.diamonds %>%
# Create new variable carat.group
mutate(carat.group = case_when(carat < 1 ~ 0.5,
carat < 2 ~ 1.5,
carat < 3 ~ 2.5,
carat < 4 ~ 3.5,
TRUE ~ 4.5),
# create new variable table.group
table.group = case_when(table < 60 ~ 60,
table < 70 ~ 70,
TRUE ~ 95))
# Create dataset for 1 variable visualizations
df.agg1 <- df.diamonds %>%
group_by(carat.group) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations)
# Create dataset for 2 variable visualizations
df.agg2 <- df.diamonds %>%
group_by(carat.group, cut) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations)
# Create dataset for 3 variable visualizations
df.agg3 <- df.diamonds %>%
group_by(carat.group, cut, color) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations)
# Create datset for 4 variable visualizations
df.agg4 <- df.diamonds %>%
group_by(carat.group, cut, color, table.group) %>%
summarise(Total.Price = sum(price), Observations = n()) %>%
mutate(average.price = Total.Price/Observations)
Basic GGPLOT2 GRAPH:
Graphs created with ggplot2 use the following form:
ggplot(data,(aes(x,y)) + geom_XXXX()
DECONSTRUCT THE LINE ABOVE:
geom_point(): scatter plotBegin by creating your ggplot object as described above:
graph.object <- ggplot(df.agg1, aes(x = carat.group, y = average.price, group = 1))
graph.object
scatter1 <- graph.object + geom_point()
line1 <- graph.object + geom_line()
bar1 <- graph.object + geom_bar(stat = "identity")
scatterline1 <- scatter1 + geom_line()
scatter1
bar1
line1
scatterline1
As always with ggplot2 - our first step is to create our ggplot object.
graph.task2 <- ggplot(df.agg2,aes(x = carat.group, y = average.price, color = cut, fill = cut))
Once we build our ggplot object, adding geometries is the same as in Task 1:
scatter2 <- graph.task2 + geom_point()
line2 <- graph.task2 + geom_line()
stackbar2 <- graph.task2 + geom_bar(stat = "identity",position = "stack")
fillbar2 <- graph.task2 + geom_bar(stat = "identity",position = "fill")
dodgebar2 <- graph.task2 + geom_bar(stat = "identity",position = "dodge")
scatterline2 <- scatter2 + geom_line()
scatter2
line2
stackbar2
fillbar2
dodgebar2
scatterline2
There is another powerful layer you can add to your graphs which allows you to analyze higher order dimensions (more variables). This is called faceting. Faceting means making copies of your graphs. We are going to make a bunch of copies of the graphs we built up to now. Each copy will represent a single level of our designated faceting variable(s).
Once again - start with your ggplot object:
graph.task3 <- ggplot(df.agg3,aes(x = carat.group, y = average.price, color = cut, fill = cut))
facet_wrap()
We will start with facet_wrap(). Facet_wrap() allows faceting (making copies of your graph) split by 1 variable.
Next - add our geometries like in previous tasks. We also add an additional facet_wrap() layer to our ggplot object. Adding the facet_wrap() layer to our graphs above simply results in several copies of your prior graphs, each one filtered on a single level of your faceting variable.
scatterline3 <- graph.task3 + geom_line() + geom_point() + facet_wrap(~color)
scatterline3
facet_grid()
Facet_grid() is very similar to facet_wrap(). Where facet_wrap() allows you to facet on one variable, facet_grid() allows you to facet on 2 variables. One variable for the columns, and one for the rows.
graph.task4 <- ggplot(df.agg4,aes(x = carat.group, y = average.price, color = cut, fill =cut))
scatterline4 <- graph.task4 + geom_line() + geom_point() + facet_grid(color ~ table.group) + xlab("Banded Ages") + ylab("Actual to Expected") + ggtitle("Dollar Weighted Actual to Expected Analysis Using 7580E")
scatterline4
Plotly is a wrapper you can put around your ggplot graphs. This wrapper greatly enhances ggplot graphs and is extremely simple to implement. For these reasons - I use it as a default with all my graphs.
In order to use plotly, simply add the ggplotly() function around any of the graph objects we have already created. This will create cleaner graphs as well as additional interactivity:
ggplotly(scatter1)
ggplotly(scatter2)
ggplotly(bar1)
ggplotly(stackbar2)
ggplotly(fillbar2)
ggplotly(dodgebar2)
ggplotly(line1)
ggplotly(line2)
ggplotly(scatterline1)
ggplotly(scatterline2)
ggplotly(scatterline3)
ggplotly(scatterline4)