## * __ _ __ . o * .
## / /_(_)__/ /_ ___ _____ _______ ___
## / __/ / _ / // / |/ / -_) __(_-</ -_)
## \__/_/\_,_/\_, /|___/\__/_/ /___/\__/
## * . /___/ o . *
The tidyverse is an opinionated collection of R packages designed for data science. ~Hadley Wickham
The tidyverse is a group of packages with a common design philosophy that uses a concise syntax to help you clean, organize, analyze, and visualize large data sets with ease. The syntax was popularized by “R for Data Science” by Hadley Wickham and Garrett Grolemund, but it’s rooted in the idea that workflows should be both readable and reproducible. Tidyverse packages help your code read left to right, more like a sentence: in base R, you’d write h(g(f(x))), but in tidyverse syntax, you’d write x %>% f %>% g %>% h.
Here is the opinion part:
“Programs must be written for people to read and only incidentally for machines to execute”. ~Hal Abelson
If you think about it, it really does make more sense to read your code like you’d read a book rather than reading from the inside out. As more of a writer than a mathematician myself, this structure inherently made more sense to me than dollar sign or function syntax. Learning ggplot and other tidy commands transformed me from a reluctant and deficient coder into an enthusiastic (and hopefully proficient) one!
The tidyverse is widely used because it is logical, but also because it has packages for every step of your data’s journey from import to output. Each package uses a consistent grammar and data structure.
1) Import:
readr
2) Tidy:
tibble
tidyr
3) Transform:
dplyr
forcats
lubridate
stringr
4) Visualize:
ggplot2
5) Model:
broom
modelr
6) Program:
purrr
magrittr.. ceci n’est pas une pipe!
There are many more great packages that are tidy-friendly, but we will focus on this core group, and more specifically on tidyr, dplyr, and ggplot2. Fear not, you don’t need to install all of these packages individually; just install and load the tidyverse!
install.packages("tidyverse")
library(tidyverse)
Before we start coding, there are a few pieces of tidyverse jargon we need to define:
tidy data - In the framework of tidy data, every row is an observation, every column is a variable, and every cell holds a single value. As you might expect, the tidyverse aims to create, visualize, and analyze data in a tidy format.
tibble - Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (e.g., converting character vectors to factors). More on tibbles later.
%>% (also known as the pipe) - An infix operator that passes the left-hand side of the operator as the first argument of the function on the right-hand side. Thus, iris %>% head() is equivalent to head(iris). This operator is convenient because you can call the pipe multiple times to “chain” functions together (rather than nesting them, as in base R). The pipe operator is not required to use tidyverse functions, but it does make them more convenient.
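To make the pipe definition concrete, here is a minimal sketch comparing a nested call with its piped twin. The vector x and the choice of sqrt, mean, and round are made up for illustration; the only assumption is that the magrittr package (where %>% comes from) is installed.

```r
library(magrittr)  # provides %>%; the pipe is also available after library(tidyverse)

# A made-up numeric vector, purely for illustration
x <- c(4, 9, 16)

# Nested base R call: you have to read it from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped version: reads left to right, one step at a time
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both approaches give the same answer
identical(nested, piped)
```

Each step in the piped version receives the result of the previous step as its first argument, which is why round(1) only needs the number of digits.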
To read in a dataset, use the readr package. readr::read_csv replaces read.csv and reads data faster. read_csv also preserves column names as written and does not coerce characters to factors (i.e., no more header = TRUE, stringsAsFactors = FALSE, yay!).
cetaceans<-read_csv("https://raw.githubusercontent.com/LGCarlson/tidytuesday/master/data/2018/2018-12-18/allCetaceanData.csv")
cetaceans %>% class()
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Base R equivalent
# Base R equivalent:
# cetaceans<-read.csv("https://raw.githubusercontent.com/LGCarlson/tidytuesday/master/data/2018/2018-12-18/allCetaceanData.csv",header = TRUE, stringsAsFactors = FALSE)
# class(cetaceans)
# [1] "data.frame"
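If you want to see the character-versus-factor difference without downloading anything, here is a small sketch that writes a tiny made-up CSV to a temporary file and reads it both ways. The three-column file is hypothetical (it only borrows a few column names from the cetacean data), and stringsAsFactors = TRUE is set explicitly to mimic the old read.csv default.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,sex,birthYear",
             "Dazzle,F,1989",
             "Tursi,F,1973"), tmp)

# readr: returns a tibble and leaves text columns as character
tidy_df <- read_csv(tmp)

# base R with the old default behavior: text columns become factors
base_df <- read.csv(tmp, stringsAsFactors = TRUE)

class(tidy_df$name)  # "character"
class(base_df$name)  # "factor"
```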
As shown by calling “class” above, readr functions automatically read your dataset as a tibble. Let’s see what that looks like by calling head() and asking for the first 10 observations.
cetaceans %>%
head(10)
## # A tibble: 10 x 22
## X1 species id name sex accuracy birthYear acquisition
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Bottle~ NOA0~ Dazz~ F a 1989 Born
## 2 2 Bottle~ NOA0~ Tursi F a 1973 Born
## 3 3 Bottle~ NOA0~ Star~ M a 1978 Born
## 4 4 Bottle~ NOA0~ Sandy F a 1979 Born
## 5 5 Bottle~ NOA0~ Sandy M a 1979 Born
## 6 6 Bottle~ NOA0~ Nacha F a 1980 Born
## 7 7 Bottle~ NOA0~ Kama M a 1981 Born
## 8 8 Bottle~ NOA0~ Jene~ F a 1981 Born
## 9 9 Bottle~ NOA0~ Duffy M a 1982 Born
## 10 10 Bottle~ NOA0~ Astra F a 1983 Born
## # ... with 14 more variables: originDate <date>, originLocation <chr>,
## # mother <chr>, father <chr>, transfers <chr>, currently <chr>,
## # region <chr>, status <chr>, statusDate <date>, COD <chr>, notes <chr>,
## # transferDate <date>, transfer <chr>, entryDate <date>
Base R equivalent
# Base R equivalent:
# head(cetaceans, 10)
When you preview a tibble, it always prints the type of each column, but you can get more information about the tibble by calling glimpse. This is a good function to know. As a wise colleague once advised me… "always check the %$#*ing structure!"
cetaceans %>%
glimpse()
## Observations: 2,194
## Variables: 22
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ species <chr> "Bottlenose", "Bottlenose", "Bottlenose", "Bott...
## $ id <chr> "NOA0004614, AZA 428, MLF-428", "NOA0004386, AZ...
## $ name <chr> "Dazzle", "Tursi", "Starbuck", "Sandy", "Sandy"...
## $ sex <chr> "F", "F", "M", "F", "M", "F", "M", "F", "M", "F...
## $ accuracy <chr> "a", "a", "a", "a", "a", "a", "a", "a", "a", "a...
## $ birthYear <chr> "1989", "1973", "1978", "1979", "1979", "1980",...
## $ acquisition <chr> "Born", "Born", "Born", "Born", "Born", "Born",...
## $ originDate <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03...
## $ originLocation <chr> "Marineland Florida", "Dolphin Research Center"...
## $ mother <chr> "Betty III", "Little Bit", "Cindy (T.t. gilli)"...
## $ father <chr> "Davy II", "Mr. Gipper", "Sambo", NA, NA, "Jeth...
## $ transfers <chr> NA, NA, "SeaWorld San Diego to SeaWorld Aurora ...
## $ currently <chr> "Marineland Florida", "Dolphin Research Center"...
## $ region <chr> "US", "US", "US", "US", "US", "US", "US", "US",...
## $ status <chr> "Alive", "Alive", "Alive", "Alive", "Alive", "A...
## $ statusDate <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ COD <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ notes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Sunny ...
## $ transferDate <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ transfer <chr> "US", "US", "US", "US", "US", "US", "US", "US",...
## $ entryDate <date> 1989-04-07, 1973-11-26, 1978-05-13, 1979-02-03...
Base R equivalent
# Base R equivalent:
# str(cetaceans)
Is this a tidy dataset as it is? It is! But could it be…. dare I say, tidyr?
Here are a few tidyr functions that may be useful.
The separate function is telling R to separate the “originDate” column into “originYear”, “originMonth”, and “originDay”.
The sep = argument tells the function what separates each element. Unfortunately, it cannot separate lowercase from capital letters without a symbol in between (i.e., it can handle m.ABDU but not mABDU… bonus points if you can tell me what ABDU is). By default, sep is the regular expression [^[:alnum:]]+, which matches any run of non-alphanumeric characters. For separating capital letters, you’ll have to use “extract.”
Finally, remove = TRUE deletes the original column, but remove = FALSE retains it.
The select call is just saying we want to ignore all the other data columns except originDate, originYear, originMonth, and originDay.
cetaceans %>%
separate(originDate, into = c("originYear","originMonth", "originDay"), sep = "-", remove = FALSE) %>%
select(originDate, originYear, originMonth, originDay) %>%
head(10)
## # A tibble: 10 x 4
## originDate originYear originMonth originDay
## <date> <chr> <chr> <chr>
## 1 1989-04-07 1989 04 07
## 2 1973-11-26 1973 11 26
## 3 1978-05-13 1978 05 13
## 4 1979-02-03 1979 02 03
## 5 1979-08-15 1979 08 15
## 6 1980-10-10 1980 10 10
## 7 1981-03-27 1981 03 27
## 8 1981-10-20 1981 10 20
## 9 1982-10-16 1982 10 16
## 10 1983-03-07 1983 03 07
Base R equivalent
# Base R equivalent:
# originDate<-as.character(cetaceans$originDate)
# YMD<-c()
# for(i in 1:length(originDate)){
# if(is.na(originDate[i])){
# YMD<-rbind(YMD,rep(NA,3))
# next
# }
# YMD<-rbind(YMD,unlist(strsplit(originDate[i],"-")))
# }
# Dates<-data.frame(originDate=cetaceans$originDate,
# originYear=YMD[,1],
# originMonth=YMD[,2],
# originDay=YMD[,3])
# head(Dates)
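Here is a hedged sketch of the separate versus extract point from the explanation above. The codes data frame is made up for illustration, and the regular expression passed to extract (lowercase letters followed by uppercase letters) is my own assumption about how such codes might be split; adjust it to your data.

```r
library(tidyr)

# Made-up codes: one column with a "." between the parts, one without
codes <- data.frame(code   = c("m.ABDU", "f.MALL"),
                    joined = c("mABDU", "fMALL"),
                    stringsAsFactors = FALSE)

# separate() works when a symbol sits between the pieces
with_symbol <- separate(codes, code, into = c("sex", "species"), sep = "\\.")

# extract() uses regex capture groups, so no symbol is needed
no_symbol <- tidyr::extract(codes, joined, into = c("sex", "species"),
                            regex = "^([a-z]+)([A-Z]+)$")

with_symbol
no_symbol
```

I call tidyr::extract explicitly because magrittr also exports a function named extract, and which one you get depends on the order in which packages were loaded.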
There isn’t a variable I would actually want to gather by in this dataset, but we’ll pretend.
Explanation by line:
For the first time, we are going to actually save the edits we make to the data frame as a new object (parentlong) rather than just printing them.
Next, we will gather columns 11 and 12 (mother and father) so that we have a long (less tidy) dataset. Each individual could now have two rows: one row for the mother, one for the father. The “key” column called parentgender will tell us if the parent in the “value” column is the mother or father. The “value” column will provide the parent name.
Then, we will select the columns id, name, and the new columns we just created.
We will filter out the rows where “parentname” is NA for easier example-viewing purposes.
Then, we will order the rows in descending order by id.
Finally, we will keep the first 40 rows.
parentlong<-cetaceans %>%
gather(key = "parentgender", value = "parentname", 11:12) %>%
select(id, name, parentgender, parentname) %>%
filter(!is.na(parentname)) %>%
arrange(desc(id)) %>%
head(40)
parentlong %>%
head(10)
## # A tibble: 10 x 4
## id name parentgender parentname
## <chr> <chr> <chr> <chr>
## 1 SWT-00-1776 Takara's Calf mother Takara
## 2 SWT-00-1776 Takara's Calf father Kyuquot
## 3 SWF-DL-9901 Spooky’s Calf mother Spooky
## 4 SWF-DL-9901 Spooky’s Calf father Luke
## 5 NOA006628, AZA 1396, SWF-TT-1001 Hurlee mother Thelma
## 6 NOA006628, AZA 1396, SWF-TT-1001 Hurlee father Akai
## 7 NOA006536, AZA 1281, SWF-TT-0903 Brigg mother Clipper
## 8 NOA006536, AZA 1281, SWF-TT-0903 Brigg father Capricorn
## 9 NOA0010381, 1116F1 Pele's Calf mother Pele
## 10 NOA0010379, 916M1 Kekoa mother Kona
Base R equivalent
# Base R equivalent:
# parentlong<-cetaceans[,c(3,4,11,12)]
# parentlong<-parentlong[complete.cases(parentlong),]
# new<-c()
# for(i in 1:nrow(parentlong)){
# new<-rbind(new,rbind(c(parentlong$id[i],parentlong$name[i],colnames(parentlong)[3],parentlong$mother[i]),
# c(parentlong$id[i],parentlong$name[i],colnames(parentlong)[4],parentlong$father[i])))
# }
# new<-as.data.frame(new)
# colnames(new)<-c("id","name","parentgender","parentname")
# parentlong<-new
# parentlong<-parentlong[order(parentlong$id,decreasing=TRUE),]
# head(parentlong[complete.cases(parentlong),],n=10)
Now, we will spread the tibble back to wide form (one row per unique individual). “Parentgender” will become the column names and “parentname” will provide values to those columns. If a value is not present, it will be filled with NA.
parentlong %>%
tidyr::spread(key = parentgender,value = parentname, fill = NA) %>%
arrange(desc(id))
## # A tibble: 25 x 4
## id name father mother
## <chr> <chr> <chr> <chr>
## 1 SWT-00-1776 Takara's Calf Kyuquot Takara
## 2 SWF-DL-9901 Spooky’s Calf Luke Spooky
## 3 NOA006628, AZA 1396, SWF-TT-1001 Hurlee Akai Thelma
## 4 NOA006536, AZA 1281, SWF-TT-0903 Brigg Capricorn Clipper
## 5 NOA0010381, 1116F1 Pele's Calf <NA> Pele
## 6 NOA0010379, 916M1 Kekoa <NA> Kona
## 7 NOA0010378, AZA ????, SWF-TT-1608 Storm <NA> Haley
## 8 NOA0010377, M16009 (Georgia Aquarium) Roxy's Calf <NA> Roxy
## 9 NOA0010375, M150001 Maris' Calf Beethoven Maris
## 10 NOA0010373, AZA ????, SWF-TT-1607 Star <NA> Stella
## # ... with 15 more rows
Base R equivalent
# Base R equivalent:
# parentlong<-parentlong[order(as.numeric(row.names(parentlong))),]
# parentlong<-data.frame(id=subset(parentlong,parentgender=="father")[,1],
# name=subset(parentlong,parentgender=="father")[,2],
# father=subset(parentlong,parentgender=="father")[,4],
# mother=subset(parentlong,parentgender=="mother")[,4])
# parentlong<-parentlong[order(parentlong$id,decreasing=T),]
# head(parentlong[complete.cases(parentlong),],n=10)
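The gather and spread round trip is easier to follow on a table small enough to read in full, so here is a minimal sketch. The two-row wide data frame is made up (it only borrows a few parent names from the output above). Newer tidyr releases also offer pivot_longer and pivot_wider for the same reshaping, but this sketch sticks with the functions used in this tutorial.

```r
library(tidyr)

# Made-up wide data: one row per individual
wide <- data.frame(id     = c("A1", "A2"),
                   mother = c("Betty", "Cindy"),
                   father = c("Davy", NA),
                   stringsAsFactors = FALSE)

# Wide to long: one row per individual per parent
long <- gather(wide, key = "parentgender", value = "parentname", mother, father)

# Long back to wide: the key values become column names again
back <- spread(long, key = parentgender, value = parentname)

long
back
```

Note that spread orders the new columns alphabetically, so father comes before mother in the result, just like in the cetacean output above.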
dplyr is perhaps one of the most useful packages in all of R. It provides a few functions that are absolutely essential for data wrangling/transformation.
select() selecting variables
filter() provides basic filtering capabilities
group_by() groups data by categorical levels
summarise() summarise data by functions of choice
arrange() ordering data
*_join() joining separate dataframes (inner_join(), left_join(), etc.)
mutate() create new variables
There is a handy cheat sheet available here: Data Wrangling Cheat Sheet
I’ve already had to use some of the dplyr commands above to accomplish what I wanted to, but let’s look at them individually.
When working with a large dataframe, sometimes you need to reduce the number of variables or remove a specific one. Select is an easy way to do that.
Here, I removed the pesky (and unnecessary) X1 column. It holds row numbers, and read_csv appears to have named it X1 because the first column of the CSV has no header. The minus sign before the column name denotes that you wish to remove that column. In dplyr, you can refer to columns by their names much more easily. The base equivalent shown here requires you to know the position of each variable you wish to select or remove, which is easy in this case, but that isn’t always true.
dplyr: select all columns except X1
cetaceans %>%
select(-X1) %>%
head(6)
## # A tibble: 6 x 21
## species id name sex accuracy birthYear acquisition originDate
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <date>
## 1 Bottle~ NOA0~ Dazz~ F a 1989 Born 1989-04-07
## 2 Bottle~ NOA0~ Tursi F a 1973 Born 1973-11-26
## 3 Bottle~ NOA0~ Star~ M a 1978 Born 1978-05-13
## 4 Bottle~ NOA0~ Sandy F a 1979 Born 1979-02-03
## 5 Bottle~ NOA0~ Sandy M a 1979 Born 1979-08-15
## 6 Bottle~ NOA0~ Nacha F a 1980 Born 1980-10-10
## # ... with 13 more variables: originLocation <chr>, mother <chr>,
## # father <chr>, transfers <chr>, currently <chr>, region <chr>,
## # status <chr>, statusDate <date>, COD <chr>, notes <chr>,
## # transferDate <date>, transfer <chr>, entryDate <date>
Base R equivalent base: select all columns except the one in position 1
# Base R equivalent:
# head(cetaceans[,2:22])
Here, I just selected the species, id, and name of each dolphin. Again, it is more apparent how you’re actually transforming the data when you use dplyr.
dplyr: select all columns between the columns “species” and “name”
cetaceans %>%
select(species:name) %>%
head(6)
## # A tibble: 6 x 3
## species id name
## <chr> <chr> <chr>
## 1 Bottlenose NOA0004614, AZA 428, MLF-428 Dazzle
## 2 Bottlenose NOA0004386, AZA 138, IDR-73-1 Tursi
## 3 Bottlenose NOA0002137, SWC-TTG-7816 Starbu~
## 4 Bottlenose NOA0002690, SWF-TT-7903 Sandy
## 5 Bottlenose NOA0004418, AZA 242, SWF-TT-7904, MH-82-36-TT (New En~ Sandy
## 6 Bottlenose NOA0002725, SWC-TT-8014 Nacha
Base R equivalent base: select columns 2 through 4
# Base R equivalent:
# head(cetaceans[,2:4])
There are also a variety of useful helper functions for select that you can use to make conditional selections.
dplyr: select the column “name,” and any column that ends with the word “Date.”
cetaceans %>%
select(name, ends_with("Date")) %>%
head(6)
## # A tibble: 6 x 5
## name originDate statusDate transferDate entryDate
## <chr> <date> <date> <date> <date>
## 1 Dazzle 1989-04-07 NA NA 1989-04-07
## 2 Tursi 1973-11-26 NA NA 1973-11-26
## 3 Starbuck 1978-05-13 NA NA 1978-05-13
## 4 Sandy 1979-02-03 NA NA 1979-02-03
## 5 Sandy 1979-08-15 NA NA 1979-08-15
## 6 Nacha 1980-10-10 NA NA 1980-10-10
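Here is a small sketch of a few more select helpers on a made-up four-column tibble (toy is hypothetical and only borrows column names from the cetacean data). It shows ends_with, starts_with, contains, and dropping a column by name.

```r
library(dplyr)

# Made-up tibble with a couple of date columns
toy <- tibble(
  name       = c("Dazzle", "Tursi"),
  sex        = c("F", "F"),
  originDate = as.Date(c("1989-04-07", "1973-11-26")),
  entryDate  = as.Date(c("1989-04-07", "1973-11-26"))
)

# Keep "name" plus every column whose name ends with "Date"
date_cols <- toy %>% select(name, ends_with("Date"))

# Keep every column whose name starts with "origin"
origin_cols <- toy %>% select(starts_with("origin"))

# Keep every column whose name contains "try"
try_cols <- toy %>% select(contains("try"))

# Drop a single column by name
no_sex <- toy %>% select(-sex)

names(date_cols)
names(no_sex)
```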
You can also use select to rearrange columns. Let’s say you make another ID column called “nameID” (created here with unite_ rather than unite, which I don’t like as well; both are tidyr functions, not dplyr ones, and unite_ is deprecated in current tidyr releases, where unite("nameID", name, birthYear, sep = "_") does the same job).
Perhaps you want to rearrange your columns so that your new ID is in the first column, followed by sex, followed by acquisition, followed by “everything()” to add all other columns in the original order. So you’re not deleting any columns, you’re just moving them around.
rename(): you can also rename columns in dplyr. The new name goes on the left of the = and the original name goes on the right.
cetaceans %>%
unite_("nameID",c("name","birthYear"),sep = "_") %>%
select(nameID, sex, acquisition, everything()) %>%
rename("originType" = "acquisition")
## # A tibble: 2,194 x 21
## nameID sex originType X1 species id accuracy originDate
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <date>
## 1 Dazzl~ F Born 1 Bottle~ NOA0~ a 1989-04-07
## 2 Tursi~ F Born 2 Bottle~ NOA0~ a 1973-11-26
## 3 Starb~ M Born 3 Bottle~ NOA0~ a 1978-05-13
## 4 Sandy~ F Born 4 Bottle~ NOA0~ a 1979-02-03
## 5 Sandy~ M Born 5 Bottle~ NOA0~ a 1979-08-15
## 6 Nacha~ F Born 6 Bottle~ NOA0~ a 1980-10-10
## 7 Kama_~ M Born 7 Bottle~ NOA0~ a 1981-03-27
## 8 Jenev~ F Born 8 Bottle~ NOA0~ a 1981-10-20
## 9 Duffy~ M Born 9 Bottle~ NOA0~ a 1982-10-16
## 10 Astra~ F Born 10 Bottle~ NOA0~ a 1983-03-07
## # ... with 2,184 more rows, and 13 more variables: originLocation <chr>,
## # mother <chr>, father <chr>, transfers <chr>, currently <chr>,
## # region <chr>, status <chr>, statusDate <date>, COD <chr>, notes <chr>,
## # transferDate <date>, transfer <chr>, entryDate <date>
Base R equivalent
# Base R equivalent:
# cetaceans<-data.frame(nameID=paste(cetaceans$name,cetaceans$birthYear,sep="_"),
# sex=cetaceans$sex,
# acquisition=cetaceans$acquisition,
# cetaceans[,-which(colnames(cetaceans) %in% c("name","birthYear","sex","acquisition"))])
# colnames(cetaceans)[which(colnames(cetaceans)=="acquisition")]<-"originType"
Important to remember: select is for columns, filter is for rows.
Important to remember: you can’t use logical rules in select.
The objective here is to reduce the rows/observations by a value criterion or other condition. You can apply any of the logical rules in filter. For example:
| Operator | Meaning | Operator | Meaning |
|---|---|---|---|
| < | Less than | != | Not equal to |
| > | Greater than | %in% | Group membership |
| == | Equal to | is.na | is NA |
| <= | Less than or equal to | !is.na | is not NA |
| >= | Greater than or equal to | &, \|, ! | Boolean operators |
Explanation by line:
First, we are going to repeat the command we created in the select() example to select only the dolphin’s name and all four possible date values.
Next, we will filter out all individuals who don’t have a status date: !is.na(statusDate).
Next, we will filter out all individuals whose transfer date is earlier than 1990 (keep only transfers on or after Jan 1, 1990). The “filter” command actually works with date values!
cetaceans %>%
select(name, ends_with("Date")) %>%
filter(!is.na(statusDate)) %>%
filter(transferDate >= "1990-01-01")
## # A tibble: 8 x 5
## name originDate statusDate transferDate entryDate
## <chr> <date> <date> <date> <date>
## 1 Nea 2007-06-03 2011-09-05 2010-09-04 2010-09-04
## 2 Somers 1998-05-22 2010-04-23 2010-03-03 2010-03-03
## 3 Gasper 1997-01-01 2007-01-02 2005-10-17 2005-10-17
## 4 Nootka Iv 1982-10-01 1994-09-13 1993-01-07 1993-01-07
## 5 Nanuq 1990-08-13 2015-02-09 1997-07-27 1997-07-27
## 6 Haida Ii 1982-10-01 2001-08-01 1993-01-08 1993-01-08
## 7 Nico 1996-01-01 2009-10-31 2005-10-17 2005-10-17
## 8 Yogi 1978-12-18 2004-11-20 2004-09-12 2004-09-12
Base R equivalent
# Base R equivalent:
# cetaceans<-cetaceans[,c(which(colnames(cetaceans)=="name"),
# which(endsWith(colnames(cetaceans),"Date")))]
# cetaceans<-cetaceans[which(!is.na(cetaceans$statusDate)),]
# cetaceans[which(cetaceans$transferDate>="1990-01-01"),]
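To try out a few of the operators from the table, here is a minimal sketch on a made-up tibble (toy is hypothetical; the names are borrowed from the output above but the dates and statuses are invented). I wrap the cutoff in as.Date to be explicit about the comparison, which is a stylistic choice on my part.

```r
library(dplyr)

# Made-up data: one early transfer and one missing transfer date
toy <- tibble(
  name         = c("Nea", "Somers", "Yogi", "Kama"),
  status       = c("Died", "Died", "Died", "Alive"),
  transferDate = as.Date(c("2010-09-04", "2010-03-03", "1985-09-12", NA))
)

# Conditions separated by commas are combined with AND
recent <- toy %>%
  filter(!is.na(transferDate),
         transferDate >= as.Date("1990-01-01"))

# %in% tests group membership; | means OR
either <- toy %>%
  filter(status %in% c("Alive") | name == "Yogi")

recent$name
either$name
```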
We will talk about these together because there isn’t much use to grouping data by a categorical variable if you’re not going to transform or summarize it in some way.
group_by allows us to create/nest categorical groupings of data by factor levels and perform analysis at the group as well as the individual level
summarize allows us to easily calculate summary statistics. You can use functions such as min, median, var, sd, n and many more
Explanation by line:
We’ll talk more about the mutate function later, but for now, all you need to know is that we want to convert birthYear to a numeric variable (double) because it was read in as a character for some reason.
Next, use filter to consider only those dolphins whose acquisition is “Born” or “Capture”.
We group by acquisition and sex, pretty self-explanatory.
We can use the variety of functions in summarize to create a summary dataframe from our original dataset. Note, this dataframe will “overwrite” your original dataset if you save it as the same object name. For example, you’d want to name this acq_summary_table or something. We are telling summarize to count (n) the number in each group and take the mean of the birth years for each group. Note that we passed “na.rm” to the mean function (just like you normally would) so that it doesn’t return NA values.
Finally, we used mutate_at to round to the nearest whole number (because partial years aren’t very informative).
cetaceans %>%
mutate(birthYear = as.double(birthYear)) %>%
filter(acquisition == "Born" | acquisition == "Capture") %>%
group_by(acquisition, sex) %>%
summarize(n = n(), avgBirthYear = mean(birthYear,na.rm = TRUE)) %>%
mutate_at("avgBirthYear", round, 0)
## # A tibble: 6 x 4
## # Groups: acquisition [2]
## acquisition sex n avgBirthYear
## <chr> <chr> <int> <dbl>
## 1 Born F 369 1999
## 2 Born M 382 1998
## 3 Born U 25 2001
## 4 Capture F 703 1973
## 5 Capture M 440 1972
## 6 Capture U 71 1971
Sometimes it is helpful to rank observations or summaries by the value of a variable. The arrange function allows us to order data by variables in ascending or descending order.
count does the same thing as summarize(n = n()). However, count takes the grouping variable as its argument, whereas n() doesn’t take any arguments and relies on group_by to know how to count.
cetaceans %>%
filter(!is.na(birthYear)) %>%
count(birthYear) %>%
arrange(desc(n)) %>%
head(10)
## # A tibble: 10 x 2
## birthYear n
## <chr> <int>
## 1 1976 42
## 2 1985 40
## 3 1970 39
## 4 1980 39
## 5 1981 39
## 6 1968 38
## 7 1975 35
## 8 1972 34
## 9 1984 34
## 10 1973 32
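The claim that count and group_by plus summarize(n = n()) do the same thing can be checked directly. Here is a minimal sketch on a made-up column of sex codes (toy is hypothetical).

```r
library(dplyr)

# Made-up data: five individuals with a sex code each
toy <- tibble(sex = c("F", "M", "F", "U", "F"))

# count() takes the grouping variable as its argument
by_count <- toy %>%
  count(sex)

# n() takes no arguments and relies on group_by() to define the groups
by_summarize <- toy %>%
  group_by(sex) %>%
  summarize(n = n())

by_count
by_summarize
```

Both versions return one row per sex code with the same counts, so which one you use is mostly a matter of how much else you want to compute inside summarize.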
Ignore this step. It makes a dataframe that contains each transfer location by ID.
## # A tibble: 10 x 15
## # Groups: id [10]
## id t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 UNK00~ flor~ u.s.~ new ~ seaw~ seaw~ <NA> <NA> <NA> <NA> <NA> <NA>
## 2 UNK00~ unkn~ mont~ new ~ seaw~ disc~ <NA> <NA> <NA> <NA> <NA> <NA>
## 3 UNK00~ hawa~ sea ~ u.s.~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 TT-670 miss~ hold~ u.s.~ gulf~ <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 TT-669 miss~ hold~ u.s.~ gulf~ <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 SWF-T~ east~ seaw~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 7 SWF-P~ key ~ sea ~ seaw~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 8 SWC-D~ huds~ seaw~ seaw~ seaw~ <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 9 NOAA0~ miss~ u.s.~ dolp~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 10 NOA00~ disc~ dolp~ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## # ... with 3 more variables: t12 <chr>, t13 <chr>, t14 <chr>
The join functions are very helpful for joining two dataframes that may have a different structure or different variables, but observations for the same individuals, etc. You can use the “join” functions to combine them by a common value or group of values.
Here are four of the join types dplyr provides:
inner_join(): Include only rows in both x and y that have a matching value
left_join(): Include all of x, and matching rows of y
semi_join(): Include rows of x that match y but only keep the columns from x
anti_join(): Opposite of semi_join; include rows of x that have no match in y
cetaceans %>%
left_join(transfersdf,by = "id") %>%
select(id,species,name,sex,starts_with("t")) %>%
head(6)
## # A tibble: 6 x 21
## id species name sex transfers transferDate transfer t1 t2
## <chr> <chr> <chr> <chr> <chr> <date> <chr> <chr> <chr>
## 1 NOA0~ Bottle~ Dazz~ F <NA> NA US <NA> <NA>
## 2 NOA0~ Bottle~ Tursi F <NA> NA US <NA> <NA>
## 3 NOA0~ Bottle~ Star~ M SeaWorld~ NA US seaw~ seaw~
## 4 NOA0~ Bottle~ Sandy F SeaWorld~ NA US seaw~ seaw~
## 5 NOA0~ Bottle~ Sandy M SeaWorld~ NA US seaw~ new ~
## 6 NOA0~ Bottle~ Nacha F SeaWorld~ NA US seaw~ seaw~
## # ... with 12 more variables: t3 <chr>, t4 <chr>, t5 <chr>, t6 <chr>,
## # t7 <chr>, t8 <chr>, t9 <chr>, t10 <chr>, t11 <chr>, t12 <chr>,
## # t13 <chr>, t14 <chr>
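Since the transfers dataframe above is built in a hidden step, here is a self-contained sketch of the four joins on two tiny made-up tibbles (animals and moves are hypothetical, as are the ids and locations).

```r
library(dplyr)

# Made-up tables that share an "id" column but not all of the same ids
animals <- tibble(id = c("A1", "A2", "A3"),
                  name = c("Dazzle", "Tursi", "Sandy"))
moves <- tibble(id = c("A2", "A3", "A4"),
                t1 = c("seaworld", "marineland", "aquarium"))

inner <- inner_join(animals, moves, by = "id")  # only ids found in both tables
left  <- left_join(animals, moves, by = "id")   # all animals; t1 is NA when unmatched
semi  <- semi_join(animals, moves, by = "id")   # matching animals, animals' columns only
anti  <- anti_join(animals, moves, by = "id")   # animals with no match in moves

inner
left
semi
anti
```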
Mutate is an extremely useful function. You can use it to create a new variable that is a function of the current variables, add a new variable, etc.
In this example, we will calculate a variable containing the dolphin’s age.
Explanation by line:
Only include individuals whose status == “Died”
Only include individuals with a birthYear and statusDate
Select “id”, “statusDate”, and “birthYear” columns
Convert the column “birthYear” to a double
Create a new column called “deathYear” that uses the lubridate package to extract “year” from “statusDate” (run library(lubridate) first if your version of the tidyverse does not attach it)
Create a new column called “age” that = difference between death year and birth year
cetaceans %>%
filter(status == "Died") %>%
filter(!is.na(birthYear), !is.na(statusDate)) %>%
select(id,statusDate,birthYear) %>%
mutate(birthYear = as.double(birthYear)) %>%
mutate(deathYear = year(statusDate)) %>%
mutate(age = deathYear - birthYear) %>%
head(10)
## # A tibble: 10 x 5
## id statusDate birthYear deathYear age
## <chr> <date> <dbl> <dbl> <dbl>
## 1 NOA0003077, SWC-CC-9327 2014-04-15 1993 2014 21
## 2 NOA0005793, SWC-CC-9827 2014-01-08 1998 2014 16
## 3 NOA0000663, 22196 1978-04-20 1947 1978 31
## 4 NOA0000661, 22198 1974-09-23 1952 1974 22
## 5 NOA0000669, AZA 1019 1990-01-26 1956 1990 34
## 6 NOA0000662, 22708 1975-09-10 1968 1975 7
## 7 NOA0000683, ISIS 900317, TT09 1995-01-05 1969 1995 26
## 8 NOA0000664, 22709 1978-07-29 1973 1978 5
## 9 NOA0000666, 23037 1987-11-10 1973 1987 14
## 10 NOA0000682, AZA 1041, 900222 1990-09-07 1990 1990 0
Explanation by line:
Remove the “-” and NA values from cause of death column
Convert the COD column to all lowercase text
Count the number in each COD group
Select the top 10 rows
Arrange in descending order by n
cetaceans %>%
filter(!is.na(COD),
COD != "-") %>%
mutate(COD = tolower(COD)) %>%
count(COD) %>%
top_n(10) %>%
arrange(desc(n))
## # A tibble: 10 x 2
## COD n
## <chr> <int>
## 1 pneumonia 84
## 2 septicemia 26
## 3 euthanasia 18
## 4 "euthanasia: life threatening condition involving\rpain/suffering" 18
## 5 undetermined 18
## 6 bronchopneumonia 16
## 7 drowning 15
## 8 premature/still birth 15
## 9 hepatitis 14
## 10 lung abscess 14
To learn ggplot visualizations, we will use the gapminder dataset (from the gapminder package).
gapminder %>%
glimpse()
## Observations: 1,704
## Variables: 6
## $ country <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
The tidyverse relies upon the package ggplot2 for data visualization. The package, based on “The Grammar of Graphics”, embodies a deep philosophy of visualization to declaratively create graphics. After providing the data, you tell ggplot2 how to map variables to aesthetics, then add layers, scales, faceting specifications, or coordinate systems. Not only is ggplot more concise than base graphics, it also allows you more creative freedom and greater control over your visualizations.
Here is an example of the superior qualities of ggplot.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_jitter(shape = 1, aes(color = continent)) +
stat_smooth(method = "lm", size = 1, color = "black") +
scale_x_log10() +
xlab("Per Capita GDP") +
ylab("Life Expectancy (yrs)") +
facet_wrap(~continent) +
theme_few() +
guides(color = "none")
gapminder <- as.data.frame(gapminder)
conts <- sort(unique(gapminder[,"continent"]),decreasing = F)
cols <- scales::hue_pal()(length(conts))
par(mfrow = c(2,3))
counter <- 1
for (i in conts) {
plot(gapminder[which(gapminder$continent == i), "gdpPercap"],
gapminder[which(gapminder$continent == i), "lifeExp"], col = cols[counter],
xlab = "Per Capita GDP", ylab = "Life Expectancy (yrs)",
main = i, las = 1, log = "x")
fit <- lm(gapminder[which(gapminder$continent == i), "lifeExp"] ~ log(gapminder[which(gapminder$continent == i), "gdpPercap"]))
pred <- predict(fit, interval = "confidence")
lines(sort(gapminder[which(gapminder$continent == i), "gdpPercap"]), sort(pred[,1]))
lines(sort(gapminder[which(gapminder$continent == i), "gdpPercap"]), sort(pred[,2]), lty = 2)
lines(sort(gapminder[which(gapminder$continent == i), "gdpPercap"]), sort(pred[,3]), lty = 2)
counter <- counter + 1
}
data - your data must be a dataframe or a tibble
aesthetics - the mapping that defines how your data is represented visually (x, y, color, size, shape, transparency)
geometries - the objects added to the plot in layers (points, bars, lines)
stats - statistical transformations/data summaries
facets - subsetting and automatic plotting by a factor
scales - control color mapping and other aesthetic alterations
themes - themes allow you to customize every aspect of the plot
coordinates - there are a few different coordinate systems you can use
| grammar | prefix | example |
|---|---|---|
| data | ggplot() | ggplot() |
| aesthetics | aes() | ggplot(data,aes(x,y)) |
| geometries | geom | geom_point() |
| stats | stat | stat_boxplot() |
| facets | facet | facet_wrap() |
| scales | scale | scale_color_brewer() |
| themes | theme | theme_bw() |
| coordinates | coord | coord_polar() |
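To see how the grammar components stack, here is a minimal sketch that uses one function for each row of the table above. It uses the built-in mtcars dataset as a stand-in for gapminder so that it runs without extra packages; that swap, and the particular variables chosen, are my own assumptions for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # data + aesthetics
  geom_point() +                             # geometry
  stat_smooth(method = "lm") +               # stat
  facet_wrap(~cyl) +                         # facets
  scale_x_log10() +                          # scale
  theme_bw() +                               # theme
  coord_cartesian()                          # coordinates

# Printing the object draws the plot
p
```

Because every component is added with +, you can comment out any single line to see what that layer contributes.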
Specify the data and variables inside the ggplot function
If you only call the ggplot function without adding any geometries, it will create a blank plot (much like calling type = “n” in base plotting).
Everything in the aesthetics inside ggplot() is a “global aesthetic,” which means it will be applied to the entire plot (including all geometries/stats/facets). However, global aesthetics will not be visible until you add those geoms, etc.
Base equivalent: plot(gapminder$year, gapminder$pop, type = "n")
ggplot(data = gapminder, aes(x = year, y = pop))
You can add a variety of geometries to create different types of plots. Check out the ggplot() Cheat Sheet for helpful functions.
If you define the aesthetics in the ggplot() command, the geoms don’t require any arguments, but you can always add layer-specific settings (see size = 2).
p1<-ggplot(data = gapminder, aes(x = year, y = pop, color = continent)) + geom_point(size = 2) +
theme(legend.position = "bottom")
p2<-ggplot(data = gapminder, aes(x = year, y = pop, color = continent)) +
geom_smooth(method = "lm",se = FALSE) + theme(legend.position = "bottom")
gridExtra::grid.arrange(p1,p2, ncol = 2)
If you define the aesthetics in the ggplot command, they will be applied to any geometries you add (like in the above plots). You can also define variables and aesthetics inside the individual geoms, but these settings will only be applied to that layer.
In this example, we have added a “smooth” line, but because there are no global aesthetics and no local arguments, there is nothing for this layer to do.
Here is an atrocious plot to demonstrate:
ggplot() + geom_point(data = gapminder, aes(x = gdpPercap, y = lifeExp, color = continent, shape = continent)) + scale_x_log10() +
geom_smooth()
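For contrast, here is a hedged sketch of the two ways to give every layer something to work with: define the data and aesthetics once globally, or repeat them locally in each geom. It uses the built-in mtcars dataset rather than gapminder (my assumption, to keep the sketch free of extra packages).

```r
library(ggplot2)

# Global: data and aesthetics are set once in ggplot() and inherited by both layers
global_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

# Local: ggplot() is empty, so each layer must bring its own data and aesthetics
local_plot <- ggplot() +
  geom_point(data = mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(data = mtcars, aes(x = wt, y = mpg), method = "lm")

global_plot
local_plot
```

The two plots should look the same; the global version is simply less repetitive when every layer shares the same variables.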
Popular geometries
geom_histogram(aes(x))
geom_bar(aes(x,y),stat = "identity")
geom_point(aes(x,y)) or geom_jitter(aes(x,y))
geom_line(aes(x,y))
geom_smooth(method = "lm")
geom_boxplot(aes(x,y)) and geom_errorbar()
Some plots visualize a transformation of the original data set. Use a stat to choose a common transformation to visualize.
Because ggplot boxplots don’t automatically come with caps on the ends of their whiskers, I’ve added “stat_boxplot(geom = ‘errorbar’)” to the plot first to create those.
Then, I layered on a regular stat_boxplot. Note that I used “fill” rather than “color.” The “color” argument controls lines and points and the “fill” argument controls areas. Note that I can also control the width of the errorbar and the boxplot separately because I didn’t put width in the global aesthetics.
Base equivalent: boxplot(gapminder$lifeExp ~ gapminder$year)
ggplot(data = gapminder, aes(x = as.factor(year), y = lifeExp)) +
stat_boxplot(geom = 'errorbar', width = 0.4) + stat_boxplot(fill = "lightgray", width = 0.6)
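Stats are not limited to boxplots. As a rough sketch, stat_summary() can compute a summary for each x value and draw it with a geom of your choice. Here I overlay the yearly mean on the same boxplot; the name p_box_mean and the point styling are illustrative assumptions, and the fun argument assumes ggplot2 3.3.0 or newer.

```r
library(ggplot2)
library(gapminder)

p_box_mean <- ggplot(gapminder, aes(x = as.factor(year), y = lifeExp)) +
  stat_boxplot(geom = "errorbar", width = 0.4) +
  stat_boxplot(fill = "lightgray", width = 0.6) +
  # compute mean(lifeExp) for each year and draw it as a diamond
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3,
               color = "firebrick")

p_box_mean
```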
We will use some of our previous dplyr skills to wrangle this data before we plot it.
I am only interested in looking at North America right now, so we will filter out all countries except Canada, the United States, and Mexico.
Because we are using the dplyr pipe to call on the data, we don’t have to have the “data” argument, but we will pass x = year, y = pop to the global aesthetics and layer on our geometries. Note that if we want to “group by” without changing the colors, we can call “group = factorlevel” in the global aesthetics.
Finally, we want to add a facet so each country has its own plot area.
facet_wrap() - wraps facets by one factor level into a rectangular layout (can still specify the number of rows/columns desired)
facet_grid() - can facet into both rows and columns by two different factor levels (perhaps continent rows, country columns?)
gapminder %>%
filter(country %in% c("Canada","United States","Mexico")) %>%
group_by(country) %>%
ggplot(aes(year,pop, group = country)) +
geom_smooth(method = "lm",se = FALSE, color = "lightgray") + geom_point() +
facet_wrap(~country)
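The plot above uses facet_wrap(). For comparison, here is a small sketch of facet_grid(), which lays panels out by two variables at once. The subset (two continents, two years) and the object names two_by_two and p_grid are assumptions I picked to keep the grid small.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# keep two continents and two years so the grid is 2 rows x 2 columns
two_by_two <- gapminder %>%
  filter(continent %in% c("Americas", "Europe"),
         year %in% c(1952, 2007))

p_grid <- ggplot(two_by_two, aes(gdpPercap, lifeExp)) +
  geom_point() +
  scale_x_log10() +
  # rows ~ columns: one row per continent, one column per year
  facet_grid(continent ~ year)

p_grid
```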
Functions I use most for formatting:
theme_bw(), theme_classic(), theme_few(), theme_light() are all good ways to get rid of the majority of “annoying” ggplot formatting (theme_few() comes from the ggthemes package; the others ship with ggplot2)
theme(panel.grid = element_blank()) this is how you get rid of the gray gridlines. Anytime you assign something to element_blank(), it is “deleted/removed/blank”
labs(x = "", y = "", title = "", color/fill/shape/etc = "") change the axis labels all in one command
theme(axis.text = element_text(size = XX)) change the size of the axis labels for pub-ready plots
topemitters<-c("China", "United States","India","Japan","Germany", "Korea, Dem. Rep.")
topemittersdf<- gapminder %>%
filter(country %in% topemitters) %>%
group_by(country)
ggplot(topemittersdf, aes(year, gdpPercap, color = country)) +
geom_smooth(se = FALSE, color = "lightgray") +
geom_point(size = 1.4) + facet_wrap(~forcats::fct_reorder2(country, year, gdpPercap)) +
theme_light() + scale_x_continuous(breaks = scales::pretty_breaks(n = 3)) +
theme(panel.grid = element_blank()) + scale_colour_brewer(palette = "RdBu") +
theme(legend.position = "none") + theme(strip.text = element_text(size = 12, color = "black")) +
theme(strip.background = element_blank()) +
theme(axis.text = element_text(size = 12), axis.title = element_text(size = 14)) +
labs(x = "\n Year", y = "Per capita GDP \n ")
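If you find yourself repeating the same theme() calls, one option is to store them in an object and add that object to each plot. This is only a sketch: theme_pub is a hypothetical name, and the settings are simply the ones used in the plot above.

```r
library(ggplot2)
library(gapminder)

# a complete theme plus a set of tweaks, saved for reuse
theme_pub <- theme_light() +
  theme(panel.grid = element_blank(),
        legend.position = "none",
        strip.background = element_blank(),
        strip.text = element_text(size = 12, color = "black"),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14))

p_pub <- ggplot(gapminder, aes(year, lifeExp)) +
  geom_point(size = 1) +
  facet_wrap(~continent) +
  theme_pub +
  labs(x = "Year", y = "Life expectancy")

p_pub
```

Keeping the theme in one place means a later change (say, a larger axis title) only has to be made once.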
Use with any aesthetic: alpha, color, fill, linetype, shape, size:
scale_*_continuous() - map continuous values to visual values
scale_*_discrete() - map discrete values to visual values
scale_*_identity() - use data values as visual values
scale_*_manual(values = c()) - map discrete values to manually-chosen visual values
Color and fill scales:
scale_fill/color_brewer(palette = "Greys") - use RColorBrewer
scale_fill/color_gradient(low = "blue", high = "yellow") - use a gradient between specified values (*usually for continuous vars only)
Location scales:
scale_x_date() - x values as dates
scale_x_log10() or scale_x_sqrt() - transform axis
scale_x/y_continuous(limits = c()) - define limits with clipping
Find a complete compilation of R color palettes here
Most importantly, you can preview and subsequently use Wes Anderson palettes.
#install.packages("wesanderson")
library(wesanderson)
wes_palette("Moonrise3")
Here is an example of a few different scales. You can put variables on a log scale without modifying them in your dataframe. You can set the limits of your plot. You can even color continuous variables by defining a gradient.
gapminder %>%
filter(continent == "Africa") %>%
ggplot(aes(x = gdpPercap, y = lifeExp, color = lifeExp)) +
geom_point() + scale_x_log10() + scale_y_continuous(limits = c(30,70)) +
scale_color_continuous(low = wes_palette("Zissou1")[1], high = wes_palette("Zissou1")[4])
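One thing to keep in mind with scale limits: values outside the limits are dropped before anything is drawn (ggplot2 warns about removed rows), while coord_cartesian() only zooms the view. A minimal sketch of the difference, with object names (africa, p_clipped, p_zoomed, n_outside) that are my own:

```r
library(dplyr)
library(ggplot2)
library(gapminder)

africa <- gapminder %>% filter(continent == "Africa")

# scale limits: points with lifeExp outside 30-70 are removed from the plot
p_clipped <- ggplot(africa, aes(gdpPercap, lifeExp)) +
  geom_point() + scale_x_log10() +
  scale_y_continuous(limits = c(30, 70))

# coord_cartesian: same window, but no data are discarded
p_zoomed <- ggplot(africa, aes(gdpPercap, lifeExp)) +
  geom_point() + scale_x_log10() +
  coord_cartesian(ylim = c(30, 70))

# how many observations fall outside the window
n_outside <- sum(africa$lifeExp < 30 | africa$lifeExp > 70)
n_outside
```

The distinction matters most when a layer computes something from the data, such as geom_smooth() or a boxplot, because dropped rows change the result.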
Disclaimer: The author of this document does not condone the use of pie charts.
You can use different coordinate systems. But… maybe just stick to coord_cartesian() and coord_flip() and forget about other coordinate systems?
However, here is an example of how to manually color items in ggplot. I wanted to color each country by the primary color of their flag, so I created a vector of colors that I named “Nordicflags.” I then called scale_fill_manual and used “Nordicflags” as the value. Note that when assigning colors manually, your vector needs at least as many colors as there are factor levels you’re grouping by, and unnamed colors are matched to the levels in order.
Nordicflags<-c("#C60C30","#002F6C","#006AA7","#EF2B2D","#FECC00")
gapminder %>%
filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>%
filter(year == 2007) %>%
mutate(proportion = pop/sum(pop)) %>%
ggplot(aes(x = "", y = proportion, fill = country)) +
geom_bar(stat = "identity") +
coord_polar("y", start=0) + scale_fill_manual(values = Nordicflags) +
theme_minimal() + theme(axis.text = element_blank()) +
labs(title = "Nordic Countries", x = "", y = "Proportion of population by country", fill = "")
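A slightly safer variation on the manual scale, sketched here, is to name the color vector so each color is tied to a level by name rather than by position. The vector nordic_named is hypothetical; its pairings just reproduce the positional order that the unnamed vector gets when the countries sort alphabetically.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# names must match the factor levels exactly
nordic_named <- c(Denmark = "#C60C30", Finland = "#002F6C",
                  Iceland = "#006AA7", Norway  = "#EF2B2D",
                  Sweden  = "#FECC00")

p_named <- gapminder %>%
  filter(country %in% names(nordic_named), year == 2007) %>%
  ggplot(aes(x = country, y = pop, fill = country)) +
  geom_col() +
  scale_fill_manual(values = nordic_named)

p_named
```

With a named vector, reordering the bars or dropping a country does not shuffle the colors.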
Stacking option: use stat = "identity" so the bar heights come from your y values; bars that share an x position are stacked by default.
gapminder %>%
filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>%
ggplot(aes(x = as.factor(year), y = pop, fill = country)) + geom_bar(stat = "identity") +
scale_fill_manual(values = Nordicflags)
Dodging option: use stat = "identity", position = "dodge" to give each factor level its own bar.
gapminder %>%
filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>%
filter(year < 1955 | year > 2005) %>%
ggplot(aes(x = as.factor(year), y = pop, fill = country)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = Nordicflags)
Lack of summary problem
Let’s talk about what is happening here: because we have an unplotted factor level/repeated measure, the barplots associated with these values are being layered below and you’re only observing the maximum value. We can see this here because I’ve made the bar color almost totally transparent (alpha).
When making barplots, it is always best to summarize your data first.
gapminder %>%
filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>%
ggplot(aes(x = country, y = pop)) +
geom_bar(stat = "identity", position = "dodge",color = "black", alpha = 0.01)
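A quick way to see why the bars pile up, as a sketch: count how many rows feed each bar. The names nordic and rows_per_bar are just for illustration.

```r
library(dplyr)
library(gapminder)

nordic <- c("Sweden", "Norway", "Finland", "Denmark", "Iceland")

# each country has one row per survey year, so each bar is drawn many times
rows_per_bar <- gapminder %>%
  filter(country %in% nordic) %>%
  count(country)

rows_per_bar
```

Any count above one per bar is a hint that the data should be summarized (or faceted or grouped by the extra variable) before plotting.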
This isn’t exactly the right situation for this type of plot, but we will pretend for example’s sake.
Once you’ve summarized the values, you can use geom_col() rather than geom_bar(). R documentation says:
There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
Please note that I also left the transparency intact so you could see that, with the summarized data, the bars are no longer layered.
In this example, I also used the forcats::fct_reorder() function to order the bars by the value of another variable (mean population, in this case).
gapminder %>%
filter(country %in% c("Sweden","Norway","Finland", "Denmark","Iceland")) %>%
group_by(country) %>%
summarize(popmean = mean(pop), sd = sd(pop)) %>%
ggplot(aes(x = fct_reorder(country,popmean), y = popmean)) +
geom_col(position = "dodge",color = "black", alpha = 0.01) +
geom_errorbar(aes(ymin = popmean - sd, ymax = popmean + sd), width = 0.3) +
theme(panel.grid = element_blank()) +
labs(x = "", y = "Population by country (1952 - 2007)")
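The geom_bar() versus geom_col() distinction from the documentation quoted above can be sketched side by side. Both plots below should show the number of countries per continent in 2007: the first lets geom_bar() do the counting, the second counts with dplyr and hands the result to geom_col(). The object names are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gap_2007 <- gapminder %>% filter(year == 2007)

# geom_bar(): no y aesthetic, stat_count() tallies the rows at each x
p_bar <- ggplot(gap_2007, aes(x = continent)) +
  geom_bar()

# geom_col(): summarize first, then plot the values as they are
continent_counts <- gap_2007 %>% count(continent)

p_col <- ggplot(continent_counts, aes(x = continent, y = n)) +
  geom_col()

continent_counts
```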
Ribbon is a great geom to know for time series analyses.
ribbon<-read_csv("https://raw.githubusercontent.com/LGCarlson/Intro-to-Tidyverse/master/ribbon_example.csv") %>% glimpse()
## Observations: 298
## Variables: 3
## $ time <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ value <dbl> -0.4442503, -1.0992872, -1.8173539, -1.9129397, -...
## $ variablility <dbl> 0.07238277, 0.17910953, 0.29610589, 0.31167993, 0...
Much like the errorbar geoms, geom_ribbon requires ymin and ymax aesthetics, which you must supply.
ggplot(ribbon,aes(time,value)) +
geom_ribbon(aes(ymin = value - variablility , ymax = value + variablility ),
fill = "#2171b5", alpha = 0.2) + geom_line(color = "#08519c")
You can also do the same thing with lines, but the filled ribbon looks nicer.
ggplot(ribbon, aes(time, value)) +
geom_line(aes(y = value - variablility, x = time), color="grey", linetype=2) +
geom_line(aes(y = value + variablility, x = time), color="grey", linetype=2) +
geom_line(color = "black") + theme(panel.grid = element_blank())
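If your data don’t already come with a variability column, you can build ymin and ymax from a summary first. Here is a minimal sketch using gapminder instead of the ribbon file: mean life expectancy per year with a band of plus or minus one standard deviation. The names life_by_year, mean_life, sd_life, and p_ribbon are my own.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# one row per year: the center line and the spread around it
life_by_year <- gapminder %>%
  group_by(year) %>%
  summarize(mean_life = mean(lifeExp), sd_life = sd(lifeExp))

p_ribbon <- ggplot(life_by_year, aes(year, mean_life)) +
  geom_ribbon(aes(ymin = mean_life - sd_life, ymax = mean_life + sd_life),
              fill = "#2171b5", alpha = 0.2) +
  geom_line(color = "#08519c")

p_ribbon
```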
Well, that’s all folks! You can find the ultimate tidyverse cheat sheet here and a variety of great documentation all around the web.