Welcome to the…

## * __  _    __   .    o           *  . 
##  / /_(_)__/ /_ ___  _____ _______ ___ 
## / __/ / _  / // / |/ / -_) __(_-</ -_)
## \__/_/\_,_/\_, /|___/\__/_/ /___/\__/ 
##      *  . /___/      o      .       *

Tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. ~Hadley Wickham

The tidyverse is a group of packages with a common design philosophy that uses a concise syntax to help you clean, organize, analyze, and visualize large data sets with ease. The syntax was popularized by "R for Data Science" by Hadley Wickham and Garrett Grolemund, but it's rooted in the idea that workflows should be both readable and reproducible. Tidyverse packages help your code read left to right, more like a sentence: in base R you'd write h(g(f(x))), but in tidyverse syntax you'd write x %>% f %>% g %>% h.
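For example, here is the same small calculation written both ways (an illustrative sketch with a made-up vector, assuming the tidyverse is loaded so the %>% operator is available; we load it below):

# Nested base R call, read from the inside out
sqrt(mean(log(c(10, 100, 1000))))

# The same calculation with the pipe, read left to right
c(10, 100, 1000) %>% log() %>% mean() %>% sqrt()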

Here is the opinion part:

“Programs must be written for people to read and only incidentally for machines to execute”. ~Hal Abelson

If you think about it, it really does make more sense to read your code like you'd read a book rather than reading from the inside out. As more of a writer than a mathematician myself, this structure inherently made more sense to me than dollar sign or function syntax. Learning ggplot and other tidy commands transformed me from a reluctant and deficient coder into an enthusiastic (and hopefully proficient) one!

The tidyverse is widely used because it is logical, but also because it has packages for every step of your data's journey from import to output. Each package uses a consistent grammar and data structure.

1) Import: readr

2) Tidy: tidyr

3) Transform: dplyr

4) Visualize: ggplot2

5) Model: modelr and broom

6) Program: purrr

There are many more great packages that are tidy-friendly, but we will focus on this core group, and more specifically on tidyr, dplyr, and ggplot2. Fear not, you don't need to install all of these packages individually, just load the tidyverse!

install.packages("tidyverse")
library(tidyverse)

Grammar

Before we start coding, there are a few pieces of tidyverse jargon we need to define:

tidy data - In the framework of tidy data, every row is an observation, every column is a variable, and every cell contains a single value. As you might expect, the tidyverse aims to create, visualize, and analyze data in a tidy format.
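For instance, here is a small made-up tibble in tidy form (hypothetical numbers, purely for illustration):

# Each row is one observation (a site-year), each column is one variable,
# and each cell holds exactly one value
tidy_example <- tibble(
  site  = c("A", "A", "B", "B"),
  year  = c(2019, 2020, 2019, 2020),
  count = c(12, 15, 7, 9)
)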

tibble - Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors). More on tibbles later.

%>%, also known as a pipe - This infix operator passes whatever is on its left-hand side as the first argument to the function on its right-hand side. Thus, iris %>% head() is equivalent to head(iris). The pipe is convenient because you can call it multiple times to "chain" functions together (the equivalent of nesting in base R). The pipe operator is not required to use tidyverse functions, but it does make them more convenient.

Readr

To read in a dataset, use the readr package. readr::read_csv replaces read.csv and reads data faster. read_csv also preserves column names and does not coerce characters to factors (i.e., no more header = TRUE, stringsAsFactors = FALSE). Yay!

cetaceans<-read_csv("https://raw.githubusercontent.com/LGCarlson/tidytuesday/master/data/2018/2018-12-18/allCetaceanData.csv")

cetaceans %>% class()
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Base R equivalent

# Base R equivalent:

# cetaceans<-read.csv("https://raw.githubusercontent.com/LGCarlson/tidytuesday/master/data/2018/2018-12-18/allCetaceanData.csv",header = TRUE, stringsAsFactors = FALSE)

# class(cetaceans)
# [1] "data.frame"

Tibble

As shown by calling “class” above, readr functions automatically read your dataset as a tibble. Let’s see what that looks like by calling head() and asking for the first 10 observations.

cetaceans %>%
  head(10)
## # A tibble: 10 x 22
##       X1 species id    name  sex   accuracy birthYear acquisition
##    <dbl> <chr>   <chr> <chr> <chr> <chr>    <chr>     <chr>      
##  1     1 Bottle~ NOA0~ Dazz~ F     a        1989      Born       
##  2     2 Bottle~ NOA0~ Tursi F     a        1973      Born       
##  3     3 Bottle~ NOA0~ Star~ M     a        1978      Born       
##  4     4 Bottle~ NOA0~ Sandy F     a        1979      Born       
##  5     5 Bottle~ NOA0~ Sandy M     a        1979      Born       
##  6     6 Bottle~ NOA0~ Nacha F     a        1980      Born       
##  7     7 Bottle~ NOA0~ Kama  M     a        1981      Born       
##  8     8 Bottle~ NOA0~ Jene~ F     a        1981      Born       
##  9     9 Bottle~ NOA0~ Duffy M     a        1982      Born       
## 10    10 Bottle~ NOA0~ Astra F     a        1983      Born       
## # ... with 14 more variables: originDate <date>, originLocation <chr>,
## #   mother <chr>, father <chr>, transfers <chr>, currently <chr>,
## #   region <chr>, status <chr>, statusDate <date>, COD <chr>, notes <chr>,
## #   transferDate <date>, transfer <chr>, entryDate <date>

Base R equivalent

# Base R equivalent:

# head(cetaceans)

When you preview a tibble, it always prints the class of each column, but you can get more information about the tibble by calling glimpse. This is a good function to know. As a wise colleague once advised me… "always check the %$#*ing structure!"

cetaceans %>% 
  glimpse()
## Observations: 2,194
## Variables: 22
## $ X1             <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ species        <chr> "Bottlenose", "Bottlenose", "Bottlenose", "Bott...
## $ id             <chr> "NOA0004614, AZA 428, MLF-428", "NOA0004386, AZ...
## $ name           <chr> "Dazzle", "Tursi", "Starbuck", "Sandy", "Sandy"...
## $ sex            <chr> "F", "F", "M", "F", "M", "F", "M", "F", "M", "F...
## $ accuracy       <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a...
## $ birthYear      <chr> "1989", "1973", "1978", "1979", "1979", "1980",...
## $ acquisition    <chr> "Born", "Born", "Born", "Born", "Born", "Born",...
## $ originDate     <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03...
## $ originLocation <chr> "Marineland Florida", "Dolphin Research Center"...
## $ mother         <chr> "Betty III", "Little Bit", "Cindy (T.t. gilli)"...
## $ father         <chr> "Davy II", "Mr. Gipper", "Sambo", NA, NA, "Jeth...
## $ transfers      <chr> NA, NA, "SeaWorld San Diego to SeaWorld Aurora ...
## $ currently      <chr> "Marineland Florida", "Dolphin Research Center"...
## $ region         <chr> "US", "US", "US", "US", "US", "US", "US", "US",...
## $ status         <chr> "Alive", "Alive", "Alive", "Alive", "Alive", "A...
## $ statusDate     <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ COD            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ notes          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Sunny ...
## $ transferDate   <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ transfer       <chr> "US", "US", "US", "US", "US", "US", "US", "US",...
## $ entryDate      <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03...

Base R equivalent

# Base R equivalent:

# str(cetaceans)

Tidyr

Is this a tidy dataset as it is? It is! But could it be…. dare I say, tidyr?

Here are a few tidyr functions that may be useful.

separate - separate one column into several

The separate function is telling R to separate the "originDate" column into "originYear", "originMonth", and "originDay".

The sep = argument tells the function what separates each element. Unfortunately, it cannot separate lowercase from capital letters when there is no symbol in between (i.e., it can handle m.ABDU but not mABDU… bonus points if you can tell me what ABDU is). sep = can also take a regular expression such as [^[:alnum:]]+ (any run of non-alphanumeric characters). For separating at capital letters, you'll have to use "extract."
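Here is a minimal sketch of the extract() approach (the codes below are made up for illustration; the regular expression captures a run of lowercase letters followed by a run of capitals):

# separate() needs a delimiter, but extract() can split "mABDU" using capture groups
tibble(code = c("mABDU", "fABDU")) %>%
  extract(code, into = c("sex", "species"), regex = "([a-z]+)([A-Z]+)")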

Finally, remove = TRUE deletes the original column, but remove = FALSE retains it.

The select line just says that we want to ignore all the other data columns and keep only originDate, originYear, originMonth, and originDay.

cetaceans %>% 
  separate(originDate, into = c("originYear","originMonth", "originDay"), sep = "-", remove = FALSE) %>%
  select(originDate, originYear, originMonth, originDay) %>%
  head(10)
## # A tibble: 10 x 4
##    originDate originYear originMonth originDay
##    <date>     <chr>      <chr>       <chr>    
##  1 1989-04-07 1989       04          07       
##  2 1973-11-26 1973       11          26       
##  3 1978-05-13 1978       05          13       
##  4 1979-02-03 1979       02          03       
##  5 1979-08-15 1979       08          15       
##  6 1980-10-10 1980       10          10       
##  7 1981-03-27 1981       03          27       
##  8 1981-10-20 1981       10          20       
##  9 1982-10-16 1982       10          16       
## 10 1983-03-07 1983       03          07

Base R equivalent

# Base R equivalent:

# originDate<-as.character(cetaceans$originDate)

# YMD<-c()
# for(i in 1:length(originDate)){
# if(is.na(originDate[i])){
#    YMD<-rbind(YMD,rep(NA,3))
#    next
# }
# YMD<-rbind(YMD,unlist(strsplit(originDate[i],"-")))
# }

# Dates<-data.frame(originDate=cetaceans$originDate,
#                   originYear=YMD[,1],
#                   originMonth=YMD[,2],
#                   originDay=YMD[,3])

# head(Dates)

gather - gather columns into rows (make a long dataset)

There isn’t a variable I would actually want to gather by in this dataset, but we’ll pretend.

Explanation by line:

  1. For the first time, we are going to actually save the edits we make to the dataframe as a new object (parentlong) rather than just printing them.

  2. Next, we will gather columns 11 and 12 so that we have a long (less tidy) dataset. Each individual could now have two rows: one row for the mother, one for the father. The "key" column, called parentgender, will tell us whether the parent in the "value" column is the mother or the father. The "value" column will provide the parent's name.

  3. Then, we will select the columns id, name, and the new columns we just created.

  4. We will filter out the rows where “parentname” is NA for easier example-viewing purposes.

  5. Then, we will order the rows in descending order by ID

  6. We will select the first 40 cases

parentlong<-cetaceans %>% 
  gather(key = "parentgender", value = "parentname", 11:12) %>%
  select(id, name, parentgender, parentname) %>%
  filter(!is.na(parentname)) %>% 
  arrange(desc(id)) %>%
  head(40)

parentlong %>%
  head(10)
## # A tibble: 10 x 4
##    id                               name          parentgender parentname
##    <chr>                            <chr>         <chr>        <chr>     
##  1 SWT-00-1776                      Takara's Calf mother       Takara    
##  2 SWT-00-1776                      Takara's Calf father       Kyuquot   
##  3 SWF-DL-9901                      Spooky’s Calf mother       Spooky    
##  4 SWF-DL-9901                      Spooky’s Calf father       Luke      
##  5 NOA006628, AZA 1396, SWF-TT-1001 Hurlee        mother       Thelma    
##  6 NOA006628, AZA 1396, SWF-TT-1001 Hurlee        father       Akai      
##  7 NOA006536, AZA 1281, SWF-TT-0903 Brigg         mother       Clipper   
##  8 NOA006536, AZA 1281, SWF-TT-0903 Brigg         father       Capricorn 
##  9 NOA0010381, 1116F1               Pele's Calf   mother       Pele      
## 10 NOA0010379, 916M1                Kekoa         mother       Kona

  • In these examples, I've included the argument names "key=", "value=", etc. Note that just like in base R, you don't have to include argument names so long as you put them in the correct order.

Base R equivalent

# Base R equivalent:

# parentlong<-cetaceans[,c(3,4,11,12)]

# parentlong<-parentlong[complete.cases(parentlong),]

# new<-c()
# for(i in 1:nrow(parentlong)){
#   new<-rbind(new,rbind(c(parentlong$id[i],parentlong$name[i],colnames(parentlong)[3],parentlong$mother[i]),
#                   c(parentlong$id[i],parentlong$name[i],colnames(parentlong)[4],parentlong$father[i])))
# }
# new<-as.data.frame(new)
# colnames(new)<-c("id","name","parentgender","parentname")

# parentlong<-new

# parentlong<-parentlong[order(parentlong$id,decreasing=TRUE),]

# head(parentlong[complete.cases(parentlong),],n=10)

spread - the inverse of gather: create a wide dataset by spreading columns

Now, we will spread the tibble back to wide form (one row per unique individual). The values in "parentgender" will become the new column names, and "parentname" will provide the values for those columns. If a value is not present, it will be filled with NA.

parentlong %>% 
  tidyr::spread(key = parentgender,value = parentname, fill = NA) %>% 
  arrange(desc(id))
## # A tibble: 25 x 4
##    id                                    name          father    mother 
##    <chr>                                 <chr>         <chr>     <chr>  
##  1 SWT-00-1776                           Takara's Calf Kyuquot   Takara 
##  2 SWF-DL-9901                           Spooky’s Calf Luke      Spooky 
##  3 NOA006628, AZA 1396, SWF-TT-1001      Hurlee        Akai      Thelma 
##  4 NOA006536, AZA 1281, SWF-TT-0903      Brigg         Capricorn Clipper
##  5 NOA0010381, 1116F1                    Pele's Calf   <NA>      Pele   
##  6 NOA0010379, 916M1                     Kekoa         <NA>      Kona   
##  7 NOA0010378, AZA ????, SWF-TT-1608     Storm         <NA>      Haley  
##  8 NOA0010377, M16009 (Georgia Aquarium) Roxy's Calf   <NA>      Roxy   
##  9 NOA0010375, M150001                   Maris' Calf   Beethoven Maris  
## 10 NOA0010373, AZA ????, SWF-TT-1607     Star          <NA>      Stella 
## # ... with 15 more rows

Base R equivalent

# Base R equivalent:
  
# parentlong<-parentlong[order(as.numeric(row.names(parentlong))),]

# parentlong<-data.frame(id=subset(parentlong,parentgender=="father")[,1],
#               name=subset(parentlong,parentgender=="father")[,2],
#               father=subset(parentlong,parentgender=="father")[,4],
#               mother=subset(parentlong,parentgender=="mother")[,4])

# parentlong<-parentlong[order(parentlong$id,decreasing=T),]

# head(parentlong[complete.cases(parentlong),],n=10)

Dplyr

dplyr is maybe the most useful package in all of R. It provides a few functions that are absolutely essential for data wrangling/transformation.

There is a handy cheat sheet available here: Data Wrangling Cheat Sheet

I’ve already had to use some of the dplyr commands above to accomplish what I wanted to, but let’s look at them individually.

select - keep (or drop) columns by name

When working with a large dataframe, sometimes you need to reduce the number of variables or remove a specific one. select is an easy way to do that.

Here, I removed the pesky (and unnecessary) ID column that read_csv created (one bad feature of readr). The minus sign before the column name denotes that you wish to remove that column. In dplyr, you can refer to columns by their names much more easily. The base equivalent requires you to know the positions of each variable you wish to select or remove, which is easy in this case, but that isn’t always true.

dplyr: select all columns except X1

cetaceans %>%
  select(-X1) %>%
  head(6)
## # A tibble: 6 x 21
##   species id    name  sex   accuracy birthYear acquisition originDate
##   <chr>   <chr> <chr> <chr> <chr>    <chr>     <chr>       <date>    
## 1 Bottle~ NOA0~ Dazz~ F     a        1989      Born        1989-04-07
## 2 Bottle~ NOA0~ Tursi F     a        1973      Born        1973-11-26
## 3 Bottle~ NOA0~ Star~ M     a        1978      Born        1978-05-13
## 4 Bottle~ NOA0~ Sandy F     a        1979      Born        1979-02-03
## 5 Bottle~ NOA0~ Sandy M     a        1979      Born        1979-08-15
## 6 Bottle~ NOA0~ Nacha F     a        1980      Born        1980-10-10
## # ... with 13 more variables: originLocation <chr>, mother <chr>,
## #   father <chr>, transfers <chr>, currently <chr>, region <chr>,
## #   status <chr>, statusDate <date>, COD <chr>, notes <chr>,
## #   transferDate <date>, transfer <chr>, entryDate <date>

Base R equivalent: select all columns except the one in position 1

# Base R equivalent:

# head(cetaceans[,2:22])

Here, I just selected the species, id, and name of each dolphin. Again, it is more apparent how you’re actually transforming the data when you use dplyr.

dplyr: select all columns between the columns “species” and “name”

cetaceans %>% 
  select(species:name) %>%
  head(6)
## # A tibble: 6 x 3
##   species    id                                                     name   
##   <chr>      <chr>                                                  <chr>  
## 1 Bottlenose NOA0004614, AZA 428, MLF-428                           Dazzle 
## 2 Bottlenose NOA0004386, AZA 138, IDR-73-1                          Tursi  
## 3 Bottlenose NOA0002137, SWC-TTG-7816                               Starbu~
## 4 Bottlenose NOA0002690, SWF-TT-7903                                Sandy  
## 5 Bottlenose NOA0004418, AZA 242, SWF-TT-7904, MH-82-36-TT (New En~ Sandy  
## 6 Bottlenose NOA0002725, SWC-TT-8014                                Nacha

Base R equivalent: select columns 2 through 4

# Base R equivalent:

# head(cetaceans[,2:4])

select - keep (or drop) columns by a conditional statement

There are also a variety of useful helper functions for select that you can use to make conditional statements.

dplyr: select the column “name,” and any column that ends with the word “Date.”

cetaceans %>%
  select(name, ends_with("Date")) %>%
  head(6)
## # A tibble: 6 x 5
##   name     originDate statusDate transferDate entryDate 
##   <chr>    <date>     <date>     <date>       <date>    
## 1 Dazzle   1989-04-07 NA         NA           1989-04-07
## 2 Tursi    1973-11-26 NA         NA           1973-11-26
## 3 Starbuck 1978-05-13 NA         NA           1978-05-13
## 4 Sandy    1979-02-03 NA         NA           1979-02-03
## 5 Sandy    1979-08-15 NA         NA           1979-08-15
## 6 Nacha    1980-10-10 NA         NA           1980-10-10

You can also use select to rearrange columns. Let's say you make another ID column called "nameID" (created with unite_(), rather than tidyr's unite(), which I don't like as well).

Perhaps you want to rearrange your columns so that your new ID is in the first column, followed by sex, followed by acquisition, followed by "everything()" to add all other columns in the original order. So you're not deleting any columns, you're just moving them around.

rename(): you can also rename columns in dplyr. The new name goes first and the original name second.

cetaceans %>%
  unite_("nameID",c("name","birthYear"),sep = "_") %>% 
  select(nameID, sex, acquisition, everything()) %>%
  rename("originType" = "acquisition")
## # A tibble: 2,194 x 21
##    nameID sex   originType    X1 species id    accuracy originDate
##    <chr>  <chr> <chr>      <dbl> <chr>   <chr> <chr>    <date>    
##  1 Dazzl~ F     Born           1 Bottle~ NOA0~ a        1989-04-07
##  2 Tursi~ F     Born           2 Bottle~ NOA0~ a        1973-11-26
##  3 Starb~ M     Born           3 Bottle~ NOA0~ a        1978-05-13
##  4 Sandy~ F     Born           4 Bottle~ NOA0~ a        1979-02-03
##  5 Sandy~ M     Born           5 Bottle~ NOA0~ a        1979-08-15
##  6 Nacha~ F     Born           6 Bottle~ NOA0~ a        1980-10-10
##  7 Kama_~ M     Born           7 Bottle~ NOA0~ a        1981-03-27
##  8 Jenev~ F     Born           8 Bottle~ NOA0~ a        1981-10-20
##  9 Duffy~ M     Born           9 Bottle~ NOA0~ a        1982-10-16
## 10 Astra~ F     Born          10 Bottle~ NOA0~ a        1983-03-07
## # ... with 2,184 more rows, and 13 more variables: originLocation <chr>,
## #   mother <chr>, father <chr>, transfers <chr>, currently <chr>,
## #   region <chr>, status <chr>, statusDate <date>, COD <chr>, notes <chr>,
## #   transferDate <date>, transfer <chr>, entryDate <date>

Base R equivalent

# Base R equivalent:

# cetaceans<-data.frame(nameID=paste(cetaceans$name,cetaceans$birthYear,sep="_"),
#     sex=cetaceans$sex,
#     acquisition=cetaceans$acquisition,
#     cetaceans[,-which(colnames(cetaceans) %in%  c("name","birthYear","sex","acquisition"))])

# colnames(cetaceans)[which(colnames(cetaceans)=="acquisition")]<-"originType"

filter - filters rows by their value

Important to remember: select is for columns, filter is for rows. Also important to remember: you can't use logical rules in select.

The objective here is to reduce the rows/observations by a value criterion or other condition. You can apply any of the logical rules in filter. For example:

Possible operators:

  <      Less than                     !=       Not equal to
  >      Greater than                  %in%     Group membership
  ==     Equal to                      is.na    Is NA
  <=     Less than or equal to         !is.na   Is not NA
  >=     Greater than or equal to      &, |, !  Boolean operators

Explanation by line:

  1. First, we are going to repeat the command we created in the select() example to select only the dolphin’s name and all four possible date values.

  2. Next, we will filter out all individuals who don’t have a status date. !is.na(statusDate)

  3. Next, we will filter out all individuals whose transfer date is earlier than 1990 (keep only transfers after Jan 1, 1990). The "filter" command actually works with date values!

cetaceans %>%
  select(name, ends_with("Date")) %>%
  filter(!is.na(statusDate)) %>%
  filter(transferDate >= "1990-01-01")
## # A tibble: 8 x 5
##   name      originDate statusDate transferDate entryDate 
##   <chr>     <date>     <date>     <date>       <date>    
## 1 Nea       2007-06-03 2011-09-05 2010-09-04   2010-09-04
## 2 Somers    1998-05-22 2010-04-23 2010-03-03   2010-03-03
## 3 Gasper    1997-01-01 2007-01-02 2005-10-17   2005-10-17
## 4 Nootka Iv 1982-10-01 1994-09-13 1993-01-07   1993-01-07
## 5 Nanuq     1990-08-13 2015-02-09 1997-07-27   1997-07-27
## 6 Haida Ii  1982-10-01 2001-08-01 1993-01-08   1993-01-08
## 7 Nico      1996-01-01 2009-10-31 2005-10-17   2005-10-17
## 8 Yogi      1978-12-18 2004-11-20 2004-09-12   2004-09-12

Base R equivalent

# Base R equivalent:

# cetaceans<-cetaceans[,c(which(colnames(cetaceans)=="name"),
#                         which(endsWith(colnames(cetaceans),"Date")))]

# cetaceans<-cetaceans[which(!is.na(cetaceans$statusDate)),]

# cetaceans[which(cetaceans$transferDate>="1990-01-01"),]

group_by - groups data by categorical levels

summarize or summarise - summarise data by functions of choice

We will talk about these together because there isn’t much use to grouping data by a categorical variable if you’re not going to transform or summarize it in some way.

group_by allows us to create/nest categorical groupings of data by factor levels and perform analysis at the group as well as the individual level

summarize allows us to easily calculate summary statistics. You can use functions such as min, median, var, sd, n and many more

Explanation by line:

  1. We’ll talk more about the mutate function later, but for now, all you need to know is that we want to convert birthYear to a numeric variable (double) because it was read in as a character for some reason

  2. Next, use filter to consider only those dolphins which were "born" or "captured"

  3. We group by acquisition and sex, pretty self-explanatory

  4. We can use the variety of functions in summarize to create a summary dataframe from our original dataset. Note, this dataframe will “overwrite” your original dataset if you save it as the same object name. For example, you’d want to name this acq_summary_table or something. We are telling summarize to count (n) the number in each group and take the mean of the birth years for each group. Note that we passed “na.rm” to the mean function (just like you normally would) so that it doesn’t return NA values.

  5. Finally, we used mutate_at to round to the nearest whole number (because partial years aren’t very informative).

cetaceans %>%
  mutate(birthYear = as.double(birthYear)) %>% 
  filter(acquisition == "Born" | acquisition == "Capture") %>% 
  group_by(acquisition, sex) %>%
  summarize(n = n(), avgBirthYear = mean(birthYear,na.rm = TRUE)) %>%
  mutate_at("avgBirthYear", round, 0)
## # A tibble: 6 x 4
## # Groups:   acquisition [2]
##   acquisition sex       n avgBirthYear
##   <chr>       <chr> <int>        <dbl>
## 1 Born        F       369         1999
## 2 Born        M       382         1998
## 3 Born        U        25         2001
## 4 Capture     F       703         1973
## 5 Capture     M       440         1972
## 6 Capture     U        71         1971

arrange - orders observations by value of interest

Sometimes it is helpful to rank observations or summaries by the value of a variable. The arrange function allows us to order data by variables in ascending or descending order.

count does the same thing as summarize(n = n()). However, count takes the grouping variable as its argument, whereas n() doesn't take any arguments and relies on group_by to know what to count.

cetaceans %>%
  filter(!is.na(birthYear)) %>% 
  count(birthYear) %>%
  arrange(desc(n)) %>%
  head(10)
## # A tibble: 10 x 2
##    birthYear     n
##    <chr>     <int>
##  1 1976         42
##  2 1985         40
##  3 1970         39
##  4 1980         39
##  5 1981         39
##  6 1968         38
##  7 1975         35
##  8 1972         34
##  9 1984         34
## 10 1973         32

Ignore this step. It makes a dataframe that contains each transfer location by ID.
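For the curious, here is a rough sketch of how such a dataframe could be built; the actual code isn't shown, so details like the " to " separator and the number of transfer columns (t1-t14) are assumptions:

# Hypothetical reconstruction only -- not the original code
transfersdf <- cetaceans %>%
  filter(!is.na(transfers)) %>%
  select(id, transfers) %>%
  mutate(transfers = tolower(transfers)) %>%
  separate(transfers, into = paste0("t", 1:14), sep = " to ", fill = "right") %>%
  group_by(id)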

## # A tibble: 10 x 15
## # Groups:   id [10]
##    id     t1    t2    t3    t4    t5    t6    t7    t8    t9    t10   t11  
##    <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 UNK00~ flor~ u.s.~ new ~ seaw~ seaw~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  2 UNK00~ unkn~ mont~ new ~ seaw~ disc~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  3 UNK00~ hawa~ sea ~ u.s.~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  4 TT-670 miss~ hold~ u.s.~ gulf~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  5 TT-669 miss~ hold~ u.s.~ gulf~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  6 SWF-T~ east~ seaw~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  7 SWF-P~ key ~ sea ~ seaw~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  8 SWC-D~ huds~ seaw~ seaw~ seaw~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  9 NOAA0~ miss~ u.s.~ dolp~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 10 NOA00~ disc~ dolp~ <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## # ... with 3 more variables: t12 <chr>, t13 <chr>, t14 <chr>

join - join two datasets together

The join functions are very helpful for joining two dataframes that may have a different structure or different variables, but observations for the same individuals, etc. You can use the "join" functions to combine them by a common value or group of values.

dplyr has several types of join; here are four common ones, with a toy illustration after the list:

  • inner_join(): Include only rows in both x and y that have a matching value

  • left_join(): Include all of x, and matching rows of y

  • semi_join(): Include rows of x that match y but only keep the columns from x

  • anti_join(): Opposite of semi_join
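Here is a toy illustration of the difference, using two tiny made-up tibbles:

# Hypothetical data just to show what each join returns
x <- tibble(id = c(1, 2, 3), val_x = c("a", "b", "c"))
y <- tibble(id = c(2, 3, 4), val_y = c("B", "C", "D"))

inner_join(x, y, by = "id")   # ids 2 and 3: only rows present in both
left_join(x, y, by = "id")    # all rows of x; val_y is NA where there is no match
semi_join(x, y, by = "id")    # rows of x that have a match in y, keeping only x's columns
anti_join(x, y, by = "id")    # rows of x with no match in y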

cetaceans %>%
  left_join(transfersdf,by = "id") %>%
  select(id,species,name,sex,starts_with("t")) %>% 
  head(6)
## # A tibble: 6 x 21
##   id    species name  sex   transfers transferDate transfer t1    t2   
##   <chr> <chr>   <chr> <chr> <chr>     <date>       <chr>    <chr> <chr>
## 1 NOA0~ Bottle~ Dazz~ F     <NA>      NA           US       <NA>  <NA> 
## 2 NOA0~ Bottle~ Tursi F     <NA>      NA           US       <NA>  <NA> 
## 3 NOA0~ Bottle~ Star~ M     SeaWorld~ NA           US       seaw~ seaw~
## 4 NOA0~ Bottle~ Sandy F     SeaWorld~ NA           US       seaw~ seaw~
## 5 NOA0~ Bottle~ Sandy M     SeaWorld~ NA           US       seaw~ new ~
## 6 NOA0~ Bottle~ Nacha F     SeaWorld~ NA           US       seaw~ seaw~
## # ... with 12 more variables: t3 <chr>, t4 <chr>, t5 <chr>, t6 <chr>,
## #   t7 <chr>, t8 <chr>, t9 <chr>, t10 <chr>, t11 <chr>, t12 <chr>,
## #   t13 <chr>, t14 <chr>

mutate - create new variables

Mutate is an extremely useful function. You can use it to create a new variable that is a function of the current variables, add a new variable, etc.

In this example, we will calculate a variable containing the dolphin’s age.

Explanation by line:

  1. Only include individuals whose status == “Died”

  2. Only include individuals with a birthYear and statusDate

  3. Select “id”, “status_date”, and “birthYear” columns

  4. Convert the column “birthYear” to a double

  5. Create a new column called “deathYear” that uses the lubridate package to extract “year” from “statusDate”

  6. Create a new column called “age” that = difference between death year and birth year

cetaceans %>%
  filter(status == "Died") %>%
  filter(!is.na(birthYear), !is.na(statusDate)) %>%
  select(id,statusDate,birthYear) %>% 
  mutate(birthYear = as.double(birthYear)) %>% 
  mutate(deathYear = year(statusDate)) %>%
  mutate(age = deathYear - birthYear) %>%
  head(10)
## # A tibble: 10 x 5
##    id                            statusDate birthYear deathYear   age
##    <chr>                         <date>         <dbl>     <dbl> <dbl>
##  1 NOA0003077, SWC-CC-9327       2014-04-15      1993      2014    21
##  2 NOA0005793, SWC-CC-9827       2014-01-08      1998      2014    16
##  3 NOA0000663, 22196             1978-04-20      1947      1978    31
##  4 NOA0000661, 22198             1974-09-23      1952      1974    22
##  5 NOA0000669, AZA 1019          1990-01-26      1956      1990    34
##  6 NOA0000662, 22708             1975-09-10      1968      1975     7
##  7 NOA0000683, ISIS 900317, TT09 1995-01-05      1969      1995    26
##  8 NOA0000664, 22709             1978-07-29      1973      1978     5
##  9 NOA0000666, 23037             1987-11-10      1973      1987    14
## 10 NOA0000682, AZA 1041, 900222  1990-09-07      1990      1990     0

top_n - select the top n rows by a value (here, the most common causes of death)

Explanation by line:

  1. Remove the “-” and NA values from cause of death column

  2. Convert the COD column to all lowercase text

  3. Count the number in each COD group

  4. Select the top 10 rows

  5. Arrange in descending order by n

cetaceans %>%
  filter(!is.na(COD), 
         COD != "-") %>%
  mutate(COD = tolower(COD)) %>%
  count(COD) %>%
  top_n(10) %>%
  arrange(desc(n))
## # A tibble: 10 x 2
##    COD                                                                    n
##    <chr>                                                              <int>
##  1 pneumonia                                                             84
##  2 septicemia                                                            26
##  3 euthanasia                                                            18
##  4 "euthanasia: life threatening condition involving\rpain/suffering"    18
##  5 undetermined                                                          18
##  6 bronchopneumonia                                                      16
##  7 drowning                                                              15
##  8 premature/still birth                                                 15
##  9 hepatitis                                                             14
## 10 lung abscess                                                          14

ggplot2

To learn ggplot visualizations, we will use the gapminder dataset.
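The gapminder data comes from the gapminder package, which is not part of the tidyverse, so load it separately (install it first if you haven't):

# install.packages("gapminder")
library(gapminder)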

gapminder %>%
  glimpse()
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

The tidyverse relies on the ggplot2 package for data visualization. The package, based on "The Grammar of Graphics", embodies a philosophy of building graphics declaratively: after providing the data, you tell ggplot2 how to map variables to aesthetics, then add layers, scales, faceting specifications, or coordinate systems. Not only is ggplot more concise than base graphics, it also allows you more creative freedom and greater control over your visualizations.

Here is an example of the superior qualities of ggplot.

This plot took approximately 2 minutes

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_jitter(shape = 1, aes(color = continent)) +
  stat_smooth(method = "lm", size = 1, color = "black") +
  scale_x_log10() + 
  xlab("Per Capita GDP") + 
  ylab("Life Expectancy (yrs)") +
  facet_wrap(~continent) +
  theme_few() + 
  guides(color = FALSE)

This (slightly inferior) plot took approximately 30 minutes

gapminder <- as.data.frame(gapminder)
conts <- sort(unique(gapminder[,"continent"]),decreasing = F)
cols <- scales::hue_pal()(length(conts))
par(mfrow = c(2,3))
counter <- 1
for (i in conts) {
  plot(gapminder[which(gapminder$continent == i), "gdpPercap"],
       gapminder[which(gapminder$continent == i), "lifeExp"], col = cols[counter],
       xlab = "Per Capita GDP", ylab = "Life Expectancy (yrs)",
       main = i, las = 1, log = "x")
  fit <- lm(gapminder[which(gapminder$continent == i), "lifeExp"] ~ log(gapminder[which(gapminder$continent == i), "gdpPercap"]))
  pred <- predict(fit, interval = "confidence")
  lines(sort(gapminder[which(gapminder$continent == i), "gdpPercap"]), sort(pred[,1]))
  lines(sort(gapminder[which(gapminder$continent == i), "gdpPercap"]), sort(pred[,2]), lty = 2)
  lines(sort(gapminder[which(gapminder$continent == i), "gdpPercap"]), sort(pred[,3]), lty = 2)
  counter <- counter + 1
}

Grammar

  • data - your data must be a dataframe or a tibble

  • aesthetics - the mapping that defines how your data is represented visually (x, y, color, size, shape, transparency)

  • geometries - the objects added to the plot in layers (points, bars, lines)

  • stats - statistical transformations/data summaries

  • facets - subsetting and automatic plotting by a factor

  • scales - control color mapping and other aesthetic alterations

  • themes - themes allow you to customize every aspect of the plot

  • coordinates - there are a few different coordinate systems you can use

grammar        prefix      example
data           ggplot()    ggplot()
aesthetics     aes()       ggplot(data, aes(x, y))
geometries     geom_       geom_point()
stats          stat_       stat_boxplot()
facets         facet_      facet_wrap()
scales         scale_      scale_color_brewer()
themes         theme_      theme_bw()
coordinates    coord_      coord_polar()

Step 1: Call ggplot and define the “global” settings

  • Specify the data and variables inside the ggplot function

  • If you only call the ggplot function without adding any geometries, it will create a blank plot (much like calling type = “n” in base plotting).

  • The aesthetics set inside ggplot() are "global aesthetics," which means they will be applied to the entire plot (including all geometries/stats/facets). However, they will not be visible until you add those geoms, etc.

Base equivalent: plot(gapminder$year, gapminder$pop, type = "n")

ggplot(data = gapminder, aes(x = year, y = pop))

Step 2: Add geometries

You can add a variety of geometries to create different types of plots. Check out the ggplot() Cheat Sheet for helpful functions.

If you define the aesthetics in the ggplot() command, the geoms don’t require any arguments, but you can always add layer-specific aesthetics (see size = 2).

p1<-ggplot(data = gapminder, aes(x = year, y = pop, color = continent)) + geom_point(size = 2) + 
  theme(legend.position = "bottom")

p2<-ggplot(data = gapminder, aes(x = year, y = pop, color = continent)) + 
  geom_smooth(method = "lm",se = FALSE) + theme(legend.position = "bottom")

gridExtra::grid.arrange(p1,p2, ncol = 2)

If you define the aesthetics in the ggplot command, they will be applied to any geometries you add (like in the above plots). You can also define variables and aesthetics inside the individual geoms, but these settings will only be applied to that layer.

In this example, we have added a "smooth" line, but because there are no global aesthetics and no local arguments, there is nothing for this layer to do.

Here is an atrocious plot to demonstrate:

ggplot() + geom_point(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, shape = continent)) + scale_x_log10() + 
  geom_smooth() 

Popular geometries

  • geom_histogram(aes(x))

  • geom_bar(aes(x,y), stat = "identity")

  • geom_point(aes(x,y)) or geom_jitter(aes(x,y))

  • geom_line(aes(x,y))

  • geom_smooth(method = "lm")

  • geom_boxplot(aes(x,y)) and geom_errorbar()


Step 3: Add stats

Some plots visualize a transformation of the original data set. Use a stat to choose a common transformation to visualize.

Because ggplot boxplots don't automatically come with capped whiskers, I've added "stat_boxplot(geom = 'errorbar')" to the plot first to create those caps.

Then, I layered on a regular stat_boxplot. Note that I used "fill" rather than "color." The "color" argument controls lines and points and the "fill" argument controls areas. Note that I can also control the width of the errorbar and the boxplot separately because I didn't put width in the global aesthetics.

Base equivalent: boxplot(gapminder$lifeExp ~ gapminder$year)

ggplot(data = gapminder, aes(x = as.factor(year), y = lifeExp)) + 
  stat_boxplot(geom = 'errorbar', width = 0.4) + stat_boxplot(fill = "lightgray", width = 0.6)

Step 4: Add facets to visualize differences between categorical variables

We will use some of our previous dplyr skills to wrangle this data before we plot it.

I am only interested in looking at North America right now, so we will filter out all countries except Can, USA, and Mex.

Because we are using the dplyr pipe to pass in the data, we don't have to include the "data" argument, but we will pass x = year, y = pop to the global aesthetics and layer on our geometries. Note that if we want to "group by" without changing the colors, we can call "group = factorlevel" in the global aesthetics.

Finally, we want to add a facet so each country has its own plot area.

  • facet_wrap() - wraps facets by one factor level into a rectangular layout (can still specify the number of rows/columns desired)

  • facet_grid() - can facet into both rows and columns by two different factor levels (perhaps continent rows, country columns?); see the short facet_grid() sketch after the plot below

gapminder %>%
  filter(country %in% c("Canada","United States","Mexico")) %>% 
  group_by(country) %>% 
  ggplot(aes(year,pop, group = country)) + 
  geom_smooth(method = "lm",se = FALSE, color = "lightgray") + geom_point() + 
  facet_wrap(~country)
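facet_grid() isn't demonstrated above, so here is a minimal sketch; the "recent" variable is invented here purely to provide a second faceting factor:

# Facet into rows by continent and columns by a (made-up) before/after 1980 indicator
gapminder %>%
  filter(continent %in% c("Americas", "Europe")) %>%
  mutate(recent = ifelse(year >= 1980, "1980 and later", "before 1980")) %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  geom_point(size = 0.8) +
  scale_x_log10() +
  facet_grid(continent ~ recent)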

Step 5: Use themes and scales to adjust settings and make plots beautiful!

Themes:

Functions I use most for formatting:

  • theme_bw(), theme_classic(), theme_few(), theme_light() are all good ways to get rid of the majority of “annoying” ggplot formatting

  • theme(panel.grid = element_blank()) this is how you get rid of the gray gridlines. Anytime you assign something to element_blank(), it is “deleted/removed/blank”

  • labs(x = "", y = "", title = "", color/fill/shape/etc = "") change the axis labels all in one command

  • theme(axis.text = element_text(size = XX)) change the size of the axis labels for pub-ready plots

topemitters<-c("China", "United States","India","Japan","Germany", "Korea, Dem. Rep.")

topemittersdf<- gapminder %>%
  filter(country %in% topemitters) %>% 
  group_by(country)

ggplot(topemittersdf, aes(year, gdpPercap, color = country)) + 
  geom_smooth(se = FALSE, color = "lightgray") + 
  geom_point(size = 1.4) +  facet_wrap(~forcats::fct_reorder2(country, year, gdpPercap)) +  
  theme_light() + scale_x_continuous(breaks = pretty_breaks(n = 3)) +
  theme(panel.grid = element_blank()) + scale_colour_brewer(palette = "RdBu")  + 
  theme(legend.position = "none") + theme(strip.text = element_text(size = 12, color = "black")) +
  theme(strip.background = element_blank()) +
  theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14)) + 
  labs(x = "\n Year", y = "Per capita GDP \n ")

Scales:

Use with any aesthetic: alpha, color, fill, linetype, shape, size:

  • scale_*_continuous() - map continuous values to visual values

  • scale_*_discrete() - map discrete values to visual values

  • scale_*_identity() - use data values as visual values

  • scale_*_manual(values = c()) - map discrete values to manually-chosen visual values

Color and fill scales:

  • scale_fill/color_brewer(palette = "Greys") - use RColorBrewer palettes

  • scale_fill/color_gradient(low = "blue", high = "yellow") - use a gradient between specified values (usually for continuous variables only)

Location scales:

  • scale_x_date - x values as dates

  • scale_x_log10 or scale_x_sqrt() - transform axis

  • scale_x/y_continuous(limits = c()) - define limits with clipping

Find a complete compilation of R color palettes here

Most importantly, you can preview and subsequently use Wes Anderson palettes.

#install.packages("wesanderson")
library(wesanderson) 
wes_palette("Moonrise3")

Here is an example of a few different scales. You can put variables on a log scale without modifying them in your dataframe. You can set the limits of your plot. You can even color continuous variables by defining a gradient.

gapminder %>% 
  filter(continent == "Africa") %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = lifeExp)) + 
  geom_point() + scale_x_log10() + scale_y_continuous(limits = c(30,70)) + 
  scale_color_continuous(low = wes_palette("Zissou1")[1], high = wes_palette("Zissou1")[4])

Step: don’t try this at home/use only if you absolutely must….

Disclaimer: The author of this document does not condone the use of pie charts.

You can use different coordinate systems. But… maybe just stick to coord_cartesian() and coord_flip() and forget about the other coordinate systems?

However, here is an example of how to manually color items in ggplot. I wanted to color each country by the primary color of their flag, so I created a vector of colors that I named “Nordicflags.” I then called scale_fill_manual and used “Nordicflags” as the value. Note that when assigning colors manually, your vector needs to be either length = 1 or the same length as the number of factor levels you’re grouping by.

Nordicflags<-c("#C60C30","#002F6C","#006AA7","#EF2B2D","#FECC00")

gapminder %>%
  filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>% 
  filter(year == 2007) %>% 
  mutate(proportion = pop/sum(pop)) %>% 
  ggplot(aes(x = "", y = proportion, fill = country)) + 
  geom_bar(stat = "identity") + 
  coord_polar("y", start=0) + scale_fill_manual(values = Nordicflags) + 
  theme_minimal() + theme(axis.text = element_blank()) +
  labs(title = "Nordic Countries", x = "", y = "Proportion of population by country", fill = "") 

Other things to know: Barplots require attention.

Stacking option: use stat = "identity" to allow stacking.

gapminder %>%
  filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>% 
  ggplot(aes(x = as.factor(year), y = pop, fill = country)) + geom_bar(stat = "identity") + 
  scale_fill_manual(values = Nordicflags) 

Dodging option: use stat = "identity", position = "dodge" to give each factor level its own bar

gapminder %>%
  filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>% 
  filter(year < 1955 | year > 2005) %>% 
  ggplot(aes(x = as.factor(year), y = pop, fill = country)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  scale_fill_manual(values = Nordicflags) 

Lack of summary problem

Let’s talk about what is happening here: because we have an unplotted factor level/repeated measure, the barplots associated with these values are being layered below and you’re only observing the maximum value. We can see this here because I’ve made the bar color almost totally transparent (alpha).

When making barplots, it is always best to summarize your data first.

gapminder %>%
  filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>% 
  ggplot(aes(x = country, y = pop)) + 
  geom_bar(stat = "identity", position = "dodge",color = "black", alpha = 0.01) 

This isn't exactly the right situation for this type of plot, but we will pretend for example's sake.

Once you’ve summarized the values, you can use geom_col() rather than geom_bar(). R documentation says:

There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.

Please note that I also left the transparency intact so you could see that, with the summarized data, the bars are no longer layered.

In this example, I also used the forcats::fct_reorder() function to order the bars by the value of another variable (mean population, in this case).

gapminder %>%
  filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>% 
  group_by(country) %>% 
  summarize(popmean = mean(pop), sd = sd(pop)) %>% 
  ggplot(aes(x = fct_reorder(country,popmean), y = popmean)) + 
  geom_col(position = "dodge",color = "black", alpha = 0.01) + 
  geom_errorbar(aes(ymin = popmean - sd, ymax = popmean + sd), width = 0.3) + 
  theme(panel.grid = element_blank()) + 
  labs(x = "", y = "Population by country (1952 - 2007)")

Other things to know: Ribbons for TS.

Ribbon is a great geom to know for time series analyses.

ribbon<-read_csv("https://raw.githubusercontent.com/LGCarlson/Intro-to-Tidyverse/master/ribbon_example.csv") %>%  glimpse()
## Observations: 298
## Variables: 3
## $ time         <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ value        <dbl> -0.4442503, -1.0992872, -1.8173539, -1.9129397, -...
## $ variablility <dbl> 0.07238277, 0.17910953, 0.29610589, 0.31167993, 0...

Much like the errorbar geoms, geom_ribbon requires ymin and ymax arguments (which you must supply).

ggplot(ribbon,aes(time,value)) + 
  geom_ribbon(aes(ymin = value - variablility , ymax = value + variablility ), 
              fill = "#2171b5", alpha = 0.2) + geom_line(color = "#08519c")

You can also do the same thing with lines, but the fill that ribbon provides looks nicer.

ggplot(ribbon, aes(time, value)) +
  geom_line(aes(y = value - variablility, x = time), color="grey", linetype=2) +
  geom_line(aes(y = value + variablility, x = time), color="grey", linetype=2) +
  geom_line(color = "black") + theme(panel.grid = element_blank())

Well, that’s all folks! You can find the ultimate tidyverse cheat sheet here and a variety of great documentation all around the web.