This is a quick introduction to dplyr and ggplot2, two widely used packages in the tidyverse collection of tools. Esquisse is also introduced as a friendly graphical interface for ggplot2. Make sure you have the ‘tidyverse’ and ‘esquisse’ packages installed.
If you are viewing this tutorial online, it is recommended to copy each code chunk to an R script file in RStudio and run it from there.
Let’s load the packages. Note that ‘library(tidyverse)’ attaches the core tidyverse packages, including dplyr and ggplot2. You can also load just dplyr and ggplot2 separately using the library command. The two datasets we will use in this tutorial ship with these packages: mpg with ggplot2 and starwars with dplyr. They are available as soon as the packages are attached, but to see them in our R global environment we have to load them explicitly using the ‘data’ function.
library(tidyverse)
## -- Attaching packages --------------------
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 1.0.0
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(esquisse)
data(mpg)
data(starwars)
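If you would rather not attach the whole tidyverse, a minimal sketch of the lighter alternative looks like this. It assumes dplyr and ggplot2 are already installed.

```r
# Attach only the two packages used in this tutorial
library(dplyr)
library(ggplot2)

# Copy the example datasets into the global environment
data(mpg)       # ships with ggplot2
data(starwars)  # ships with dplyr

# Quick check that both datasets are now available (rows, columns)
dim(mpg)
dim(starwars)
```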
ggplot2 is a very popular plotting package that can create highly intricate plots. You can visit the R Graph Gallery for examples, tutorials, and inspiration. You can also refer to the data visualization chapter of R for Data Science for more detailed syntax and usage.
A basic ggplot includes the data frame to be plotted, aesthetic settings, and a geometry. Aesthetics, set with ‘aes’, determine which variables are plotted on each axis. They can also map data to other characteristics of a plot such as color and size. The ‘+’ operator ties together the functions that determine the various plot characteristics into one plot.
ggplot(mpg, aes(x=displ, y=hwy,color = class)) + geom_point()
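To make the role of ‘+’ more concrete, here is a small sketch that builds a similar plot one piece at a time and stores it in an object. The object name p, the extra size mapping, and the label text are illustrative choices, not part of the original example.

```r
library(ggplot2)

# Start with the data and the aesthetic mappings; no points are drawn yet
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class, size = cyl))

# Add a geometry layer with '+' to draw the points
p <- p + geom_point(alpha = 0.6)

# Add axis and legend titles with another '+'
p <- p + labs(
  x = "Engine displacement (L)",
  y = "Highway miles per gallon",
  color = "Vehicle class",
  size = "Cylinders"
)

# Printing the object renders the plot
p
```

Storing the plot in an object is optional, but it makes it easy to try out additional layers without retyping the whole call.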
ggplot has many layers of options to be explored. For example, we can add labels above the points while removing overlapping labels, and change to a more color-blindness friendly color palette.
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_text(aes(label = manufacturer), check_overlap = TRUE, size = 3, nudge_y = 1) +
  scale_color_viridis_d(end = .7)
You can use esquisse to launch a more user-friendly GUI that lets you build and alter ggplots without code. You can then export the generated code into R and learn more about how these plots work.
esquisser(viewer = "browser")
Don’t be afraid to experiment! ggplot has a truly dizzying array of options and it is often best to look for a graph you’d like to emulate, then learn how that is constructed.
As one last example, you can plot a second set of data over the first and label a second axis in ggplot. The ‘~ .* 1’ formula specifies that the second axis uses the same scale as the first. Be careful when using a second axis with a different scale: ggplot does not rescale the data automatically, so you must transform the second set of plotted data with the inverse of the transformation applied to the axis.
ggplot(mpg) + geom_point(aes(x=displ, y=hwy,color = class)) + geom_point(shape = 4, aes(x=displ, y=cty,color = class)) + scale_color_viridis_d(end = .7) + scale_y_continuous(sec.axis = sec_axis( ~ .* 1, name = "cty"))
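For a second axis on a different scale, the idea can be sketched as follows. The scale factor of 2 is a hypothetical value chosen only to illustrate the mechanics: the second data set is multiplied by the factor, and the axis formula divides by the same factor so the labels read in the original units.

```r
library(ggplot2)

# Hypothetical scale factor, chosen only for illustration
scale_factor <- 2

p2 <- ggplot(mpg) +
  # First data set, drawn on the primary (left) axis
  geom_point(aes(x = displ, y = hwy)) +
  # Second data set, multiplied so it is drawn in primary-axis units
  geom_point(aes(x = displ, y = cty * scale_factor), shape = 4) +
  scale_y_continuous(
    name = "hwy",
    # The axis formula divides by the same factor, undoing the multiplication
    sec.axis = sec_axis(~ . / scale_factor, name = "cty")
  )

p2
```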
The dplyr package includes a set of functions and syntax that makes it easier to manipulate data in R. This section is adapted from the tidyverse tutorial for dplyr and the data transformation chapter of R for Data Science.
We will use the starwars dataset, which is automatically loaded with dplyr. You can preview it by typing it into the console.
starwars
The filter() function provides a simple way to filter a data frame based on its columns.
filter(starwars,height >= 160)
filter(starwars,hair_color == "none")
Notice that each filter works on the starwars dataset without overwriting it. To save filtered data, you can use ‘<-’ with the same variable name to overwrite it, or a different name to create a new variable.
alopecia<-filter(starwars,hair_color =="none")
alopecia
## # A tibble: 37 x 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Dart~ 202 136 none white yellow 41.9 male mascu~
## 2 IG-88 200 140 none metal red 15 none mascu~
## 3 Bossk 190 113 none green red 53 male mascu~
## 4 Lobot 175 79 none light blue 37 male mascu~
## 5 Ackb~ 180 83 none brown mot~ orange 41 male mascu~
## 6 Nien~ 160 68 none grey black NA male mascu~
## 7 Nute~ 191 90 none mottled g~ red NA male mascu~
## 8 Jar ~ 196 66 none orange orange 52 male mascu~
## 9 Roos~ 224 82 none grey orange NA male mascu~
## 10 Rugo~ 206 NA none green orange NA male mascu~
## # ... with 27 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
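filter() also accepts several conditions at once. The sketch below shows a few common patterns; the variable names and the particular thresholds are illustrative only.

```r
library(dplyr)

# Comma-separated conditions are combined with AND
tall_and_bald <- filter(starwars, height >= 160, hair_color == "none")

# Use | for OR and %in% to match any of several values
eyes_or_light <- filter(starwars, eye_color %in% c("blue", "brown") | mass < 50)

# Rows where a condition evaluates to NA are dropped,
# so missing values have to be requested explicitly
unknown_mass <- filter(starwars, is.na(mass))
```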
dplyr also offers an alternative to saving your modified data if you are simply going to plug it into another function. The ‘%>%’ operator is called a pipe. It will transfer the output of one function to the input of another.
filter(starwars,hair_color=="none") %>% filter(mass >= 90)
Note that when piping data, we no longer have to enter the data as the first argument since the pipe supplies it automatically. In fact, entering the data again in the second filter function will cause an error.
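One way to see what the pipe does is to write the same operation both ways and compare the results. This is a minimal sketch; the names nested and piped are illustrative.

```r
library(dplyr)

# Nested call: read from the inside out
nested <- filter(filter(starwars, hair_color == "none"), mass >= 90)

# Piped call: read from left to right, with the data supplied by the pipe
piped <- starwars %>%
  filter(hair_color == "none") %>%
  filter(mass >= 90)

# Compare the two results
all.equal(nested, piped)
```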
The arrange() function sorts the rows of a data frame by the values in one or more columns. When you give several columns, rows are sorted by the first column, and ties are broken by each later column in the order given.
arrange(starwars,mass)
arrange(starwars,desc(mass))
arrange(starwars,mass,height)
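The following sketch combines desc() with a second sorting column and shows where missing values end up. The object name is illustrative.

```r
library(dplyr)

# Heaviest characters first; ties in mass are broken by height, shortest first
by_mass_then_height <- arrange(starwars, desc(mass), height)

# Keep only a few columns so the ordering is easy to read
by_mass_then_height %>%
  select(name, mass, height) %>%
  head(5)

# Missing values are placed at the end, even when desc() is used
tail(by_mass_then_height$mass)
```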
The select function pulls out specific columns of data for subsetting. You can type the names of the columns to select manually. It also has several helper functions for selecting columns based on the column name, including starts_with, ends_with, contains, matches, and num_range. These helpers can be combined with logical operators for interesting results.
select(starwars,name,ends_with("color"))
select(starwars,name,matches("s$"))
The matches helper uses a syntax called regular expressions. Here, ‘s$’ matches column names that end in the letter s.
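A few more selections, sketched against the starwars column names, show how regular expressions and logical operators work with these helpers. The object name color_not_hair is illustrative.

```r
library(dplyr)

# "s$" is a regular expression: "$" anchors the match to the end of the name
names(select(starwars, matches("s$")))

# "^h" anchors the match to the start of the name instead
names(select(starwars, name, matches("^h")))

# Helpers can be combined with & (and), | (or), and ! (not)
color_not_hair <- select(starwars, ends_with("color") & !starts_with("hair"))
names(color_not_hair)
```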
Mutate can create new columns in the dataset that are calculated from other existing columns.
starwars %>% select(name:mass) %>%
mutate(bmi = mass / ((height / 100) ^ 2),bigbmi = bmi*2)
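Because the pipe passes a data frame along, the result of mutate() can also be handed straight to ggplot(). The sketch below is one possible way to combine the two packages; dropping rows with a missing bmi and the choice of axes are assumptions made for the illustration.

```r
library(dplyr)
library(ggplot2)

bmi_plot <- starwars %>%
  select(name:mass) %>%
  # Height is in centimeters, so convert to meters before squaring
  mutate(bmi = mass / ((height / 100)^2)) %>%
  # Drop characters whose height or mass is unknown
  filter(!is.na(bmi)) %>%
  # From here on, layers are added with '+', not '%>%'
  ggplot(aes(x = height, y = bmi)) +
  geom_point()

bmi_plot
```

Note the switch from ‘%>%’ to ‘+’ once ggplot() is called: the pipe passes data between functions, while ‘+’ adds layers to a plot.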
We won’t be covering group_by() and summarize() here, but these functions are covered in the dplyr tutorial and the R for Data Science chapter mentioned above.