Tidy R Intro

Intro

This is a quick introduction to dplyr and ggplot2, two widely used packages in the tidyverse. The esquisse package is also introduced as a friendly graphical interface for ggplot2. Make sure you have the packages ‘tidyverse’ and ‘esquisse’ installed.

If you are viewing this tutorial online, it is recommended to copy each code chunk into an R script file in RStudio and run it from there.

Let’s load the packages. Note that ‘library(tidyverse)’ loads the core tidyverse packages, which are listed in the startup message below. You can also just load dplyr and ggplot2 separately using the library command. The two datasets we will use in this tutorial ship with these packages: mpg comes with ggplot2 and starwars comes with dplyr. To see them in our R global environment, we have to load them explicitly using the ‘data’ function.

library(tidyverse)

## -- Attaching packages --------------------

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ---- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(esquisse)

data(mpg)
data(starwars)
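
As mentioned above, the two packages can also be loaded on their own. The following is a minimal sketch, assuming ggplot2 and dplyr are installed. The ‘package’ argument of the ‘data’ function is optional here and is only used to make explicit where each dataset comes from.

```r
# Load only the two packages used in this tutorial
library(ggplot2)
library(dplyr)

# mpg ships with ggplot2 and starwars ships with dplyr
data(mpg, package = "ggplot2")
data(starwars, package = "dplyr")

# Quick check of the dimensions (rows, columns) of each dataset
dim(mpg)
dim(starwars)
```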

ggplot2 and esquisse

ggplot2 is a very popular plotting package that can create highly intricate plots. You can visit the R Graph Gallery for examples, tutorials, and inspiration. You can also refer to the data visualization chapter of R for Data Science for more detailed syntax and usage.

A basic ggplot includes the data frame to be plotted, aesthetic settings, and a geometry. The aesthetics, or ‘aes’, determine what is plotted on each axis. They can also map data to other characteristics of a plot, such as color and size. The ‘+’ operator ties together the functions that determine the various plot characteristics into one plot.

ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point()
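
To illustrate mapping data to another characteristic of the plot, the sketch below also maps the number of cylinders (the cyl column of mpg) to point size. The alpha value is an arbitrary choice made here so that overlapping points are easier to see.

```r
library(ggplot2)

# Map engine displacement and highway mileage to the axes,
# vehicle class to color, and cylinder count to point size
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class, size = cyl)) +
  geom_point(alpha = 0.6)  # alpha chosen arbitrarily for readability

p
```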

ggplot has many layers of options to be explored. For example, we can add labels above the points while removing overlapping labels, and change to a more color-blindness friendly color palette.

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_text(aes(label = manufacturer), check_overlap = TRUE, size = 3, nudge_y = 1) +
  scale_color_viridis_d(end = .7)

You can use esquisse to open a more user-friendly GUI that lets you build and alter ggplots without writing code. You can then export the generated code into R and learn more about how these plots work.

esquisser(viewer = "browser")

Don’t be afraid to experiment! ggplot has a truly dizzying array of options and it is often best to look for a graph you’d like to emulate, then learn how that is constructed.

Just as one last example, you can even plot a second set of data over the first and label a second axis in ggplot. The ‘~ .* 1’ formula specifies that the second axis is scaled the same as the first. Be careful when using a second axis with a different scale. The plotted data do not transform automatically, so you will need to transform the second set of plotted data with the inverse of the transformation applied to the axis.

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class)) +
  geom_point(aes(x = displ, y = cty, color = class), shape = 4) +
  scale_color_viridis_d(end = .7) +
  scale_y_continuous(sec.axis = sec_axis( ~ .* 1, name = "cty"))
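
The warning about differently scaled axes can be sketched as follows. The factor of 2 (stored in the hypothetical variable scale_factor) is an arbitrary value chosen for illustration: the cty values are multiplied by it before plotting, and the second axis divides by it so that its labels show the original cty values.

```r
library(ggplot2)

# Hypothetical scaling factor, chosen only for illustration
scale_factor <- 2

p2 <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = class)) +
  # Second data set: multiplied by the factor before plotting
  geom_point(aes(x = displ, y = cty * scale_factor, color = class), shape = 4) +
  scale_color_viridis_d(end = .7) +
  # Second axis: divide by the same factor, the inverse of the step above
  scale_y_continuous(sec.axis = sec_axis(~ . / scale_factor, name = "cty"))

p2
```

If the two transformations are not inverses of each other, the plot will still draw, but the labels on the second axis will no longer correspond to the plotted cty values.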

dplyr

The dplyr package includes a set of functions and syntax that makes it easier to manipulate data in R. This section is adapted from the tidyverse tutorial for dplyr and the data transformation chapter of R for Data Science.

We will use the starwars dataset, which is automatically loaded with dplyr. You can preview it by typing it into the console.

starwars

Filter

The filter() function provides a simple way to filter a data frame based on its columns.

filter(starwars, height >= 160)
filter(starwars, hair_color == "none")

Notice that each filter works on the starwars dataset without overwriting it. To save filtered data, you can use ‘<-’ with the same variable name to overwrite it, or with a different name to create a new variable.

alopecia <- filter(starwars, hair_color == "none")

alopecia

## # A tibble: 37 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Dart~    202   136 none       white      yellow          41.9 male  mascu~
##  2 IG-88    200   140 none       metal      red             15   none  mascu~
##  3 Bossk    190   113 none       green      red             53   male  mascu~
##  4 Lobot    175    79 none       light      blue            37   male  mascu~
##  5 Ackb~    180    83 none       brown mot~ orange          41   male  mascu~
##  6 Nien~    160    68 none       grey       black           NA   male  mascu~
##  7 Nute~    191    90 none       mottled g~ red             NA   male  mascu~
##  8 Jar ~    196    66 none       orange     orange          52   male  mascu~
##  9 Roos~    224    82 none       grey       orange          NA   male  mascu~
## 10 Rugo~    206    NA none       green      orange          NA   male  mascu~
## # ... with 27 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
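
A quick way to confirm that the original data is untouched is to compare row counts. This is a minimal sketch using only base R functions on the objects created above. Note that filter() keeps only rows where the condition is TRUE, so rows with a missing hair_color are dropped as well.

```r
library(dplyr)

# filter() returns a new tibble; starwars itself is not modified
alopecia <- filter(starwars, hair_color == "none")

nrow(starwars)  # all characters
nrow(alopecia)  # only rows where hair_color is "none"

# Every remaining row satisfies the condition
all(alopecia$hair_color == "none")
```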

Pipe

dplyr also offers an alternative to saving your modified data if you are simply going to plug it into another function. The ‘%>%’ operator is called a pipe. It passes the output of one function to the next function as its first argument.

filter(starwars, hair_color == "none") %>% filter(mass >= 90)

Note that when piping data, we no longer have to enter the data as the first argument since it is provided automatically. In fact, entering the data again in the second filter function will cause an error.
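
To make the behavior of the pipe concrete, the sketch below writes the same two-step filter as a nested call and as a piped call, then compares the results. The variable names (nested, piped, combined) are only illustrative. It also shows that filter() accepts several conditions separated by commas, which are combined with a logical "and".

```r
library(dplyr)

# Nested call: read from the inside out
nested <- filter(filter(starwars, hair_color == "none"), mass >= 90)

# Piped call: read from left to right
piped <- starwars %>%
  filter(hair_color == "none") %>%
  filter(mass >= 90)

# Conditions separated by commas must all be TRUE for a row to be kept
combined <- starwars %>% filter(hair_color == "none", mass >= 90)

# All three approaches select the same characters
identical(nested$name, piped$name)
identical(piped$name, combined$name)
```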

Arrange

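Arrange sorts the rows of the data frame by the values of one or more columns, in ascending order by default. Wrap a column in desc() to sort it in descending order. When you sort on multiple columns, they are applied in the order they are given in the function, with later columns breaking ties in earlier ones.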

arrange(starwars, mass)
arrange(starwars, desc(mass))
arrange(starwars, mass, height)
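
One detail worth checking when sorting is where missing values end up. In dplyr, arrange() places missing values at the end, even when desc() is used. A small sketch, with heaviest_first as an illustrative variable name:

```r
library(dplyr)

# Sort by mass, heaviest first
heaviest_first <- arrange(starwars, desc(mass))

# The first rows hold the largest masses
head(heaviest_first$mass, 3)

# Rows with a missing mass are placed last, despite desc()
tail(heaviest_first$mass, 3)
```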

Select

The select function pulls out specific columns of a data frame. You can manually type the names of the columns to select. It also has several helper functions for selecting columns based on their names, including starts_with, ends_with, contains, matches, and num_range. These helpers can be combined with logical operators such as ‘&’, ‘|’, and ‘!’ for interesting results.

select(starwars, name, ends_with("color"))
select(starwars, name, matches("s$"))

The matches helper uses a syntax called regular expressions. Here, "s$" selects every column whose name ends with the letter s.
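
As a sketch of combining helpers with logical operators, the example below keeps the color columns except the one starting with "hair". It assumes dplyr 1.0.0 or later (the version shown in the startup message above), where ‘&’ and ‘!’ are supported inside select. The variable names are only illustrative.

```r
library(dplyr)

# Columns whose names end in "color", except those starting with "hair"
colors_no_hair <- select(starwars, name, ends_with("color") & !starts_with("hair"))
names(colors_no_hair)

# Regular expression: "$" anchors the match at the end of the column name
ends_in_s <- select(starwars, name, matches("s$"))
names(ends_in_s)
```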

Mutate

Mutate can create new columns in the dataset that are calculated from other existing columns.

starwars %>%
  select(name:mass) %>%
  mutate(bmi = mass / ((height / 100) ^ 2), bigbmi = bmi * 2)
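
The verbs covered so far can be chained into a single pipeline. The following is a sketch rather than a recommended analysis: the height threshold is reused from the filter example above, and tall_by_bmi is an illustrative variable name.

```r
library(dplyr)

tall_by_bmi <- starwars %>%
  select(name:mass) %>%                          # keep name, height, mass
  filter(height >= 160) %>%                      # also drops rows with missing height
  mutate(bmi = mass / ((height / 100) ^ 2)) %>%  # mass in kg, height in cm
  arrange(desc(bmi))                             # rows with missing bmi end up last

tall_by_bmi
```

Each step receives the result of the previous one, so the new bmi column is available to arrange() without saving any intermediate data.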

Other functions

We won’t be covering group_by() and summarize() here, but both functions are covered in the dplyr resources mentioned at the start of the dplyr section.