Setup

First, let’s load our packages. The p_load() function from pacman installs any missing packages and then loads them all. We will mostly be using the dplyr and ggplot2 packages, both of which load as part of the tidyverse. The ggthemes package contains extra themes that we can add to our ggplots. The ggridges package will be used to create a pretty cool graph towards the end.

library(pacman)
p_load(tidyverse, lubridate, ggridges, ggthemes, janitor)

For this lecture, we are using the mpg dataset that is built into the ggplot2 package (which loads with the tidyverse). It contains fuel economy data for 38 models of cars in 1999 and 2008. Let’s take a look at the data. Recall that we can preview the first rows of our data with the head() function and look at the variable names with names().

head(mpg, 10)
## # A tibble: 10 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
names(mpg)
##  [1] "manufacturer" "model"        "displ"        "year"        
##  [5] "cyl"          "trans"        "drv"          "cty"         
##  [9] "hwy"          "fl"           "class"
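
Another quick way to inspect a data frame is glimpse() from dplyr, which prints one line per variable, together with dim(), which reports the number of rows and columns. This is a small optional sketch and not part of the original lecture code:

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)    # provides glimpse()

# One line per variable, showing its type and the first few values
glimpse(mpg)

# Number of rows and columns
dim(mpg)
```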

Making Plots

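Before we start making ggplots, we should mention plot(), base R’s plotting function. The ggplot2 package also has a quick-plot shortcut, qplot(), whose syntax is modeled on plot(). Note that qplot() is deprecated as of ggplot2 3.4.0, so newer versions will print a warning when you use it.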

qplot(x = displ, y = hwy, data = mpg)
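
For comparison, a base R version of the same scatterplot can be drawn with plot(). This sketch is an addition for illustration only; the axis labels and title are my own choices:

```r
library(ggplot2)  # only needed here for the mpg dataset

# Base R scatterplot: engine displacement vs. highway miles per gallon
plot(x = mpg$displ, y = mpg$hwy,
     xlab = "displ", ylab = "hwy",
     main = "Base R scatterplot of mpg")
```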

This is a nice and informative plot, but building plots with ggplot() gives us far more control and lets us customize our plots to make some really awesome data visualizations. The basic setup of a ggplot requires three things: the data, the aesthetic mapping, and a geom. The aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms, such as which variables go on the axes and which variable determines the color or fill. The geoms tell R how to draw the data: as points, lines, columns, etc.

In general, we can make a ggplot by typing the following: ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

The way ggplot works is by adding layers. We can add a new layer with the + sign. Let’s build a ggplot step by step. First, start with ggplot() and tell R what data we are using.

ggplot(data = mpg)

Why did this make a blank graph? Well, we haven’t given R the aesthetic mapping yet so it doesn’t know what to put on top of the base layer. Let’s add the x and y variables.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

Now we have a graph with axes and gridlines but no information on the graph. To get data on the graph, we need to tell R how we want to draw the data with a geom. To make a scatterplot, we use geom_point().

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()

This looks like the plot we made earlier but with a lot of extra steps. So why did we do all this extra work to learn ggplot? Well, building the plot in layers makes it easy to customize. For example, we can color the points by a variable. We can also add a theme as another layer. Some themes are built into the ggplot2 package, and the ggthemes package has even more. You can also make your own custom theme!

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() + theme_minimal()
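
One detail worth illustrating is the difference between mapping a color and setting one. Inside aes(), color is tied to a variable; outside aes(), it is a fixed value applied to every point. The sketch below is an addition to the lecture code, and the object names p_mapped and p_fixed are only illustrative:

```r
library(ggplot2)

# Mapped: color is inside aes(), so it varies with the class variable
p_mapped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Set: color is outside aes(), so every point gets the same fixed color
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

p_mapped
p_fixed
```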

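We can also change the size of the dots by a variable using size. Note the warning this produces: class is a discrete variable with no natural order, and size is better suited to ordered or continuous variables.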

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class)) + theme_fivethirtyeight()
## Warning: Using size for a discrete variable is not advised.
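
To avoid that warning, size can be mapped to a continuous variable instead. The following is a minimal sketch, assuming city mileage (cty) is a reasonable variable to show; the alpha value is my own choice to reduce overplotting:

```r
library(ggplot2)

# Map point size to a continuous variable (city miles per gallon)
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty), alpha = 0.4)

p_size
```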

We can facet wrap, which makes a separate plot for each value of the variable we wrap by and then arranges the plots in a grid.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) + 
  facet_wrap(~ class, nrow = 2)
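
To facet by two variables at once, ggplot2 also provides facet_grid(), which lays panels out in rows and columns. This sketch is an addition; the choice of drv for rows and cyl for columns is just an example:

```r
library(ggplot2)

# Rows are drive types, columns are cylinder counts
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

p_grid
```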

Of course, we can make many different types of graphs besides scatterplots using ggplot. Here is how we would fit a smoothed line through the data points, with a separate line type for each drive type:

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
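
By default geom_smooth() draws a flexible loess curve here, as the message above says. If a straight regression line is wanted instead, one option is method = "lm". A minimal sketch, with se = FALSE added only to hide the confidence band:

```r
library(ggplot2)

# Straight least-squares line through the points instead of a loess curve
p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_lm
```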

We can also combine multiple geoms by adding multiple layers.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
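
When several geoms use the same mapping, it can be written once inside ggplot() and every layer will inherit it. This sketch should draw the same plot as the code above with less repetition:

```r
library(ggplot2)

# The mapping in ggplot() is shared by both geom_point() and geom_smooth()
p_shared <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

p_shared
```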

Some geoms, like boxplots, use categorical x variables.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() 
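
Categorical axes are sorted alphabetically by default. One possible refinement, sketched here as an addition, is to order the classes by their median highway mileage using base R’s reorder():

```r
library(ggplot2)

# Order the classes on the x axis by median hwy instead of alphabetically
p_box <- ggplot(data = mpg,
                mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(x = "class")

p_box
```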

Some Cool Extensions

Now that we have some basics, let’s make some cool graphs using some fun data. Tidy Tuesday is a weekly data science project that posts a new, accessible dataset to try your tidyverse skills on. We are going to use two datasets from this GitHub repo to make some interesting ggplots.

Bob Ross Data

The first dataset we are going to use is about Bob Ross paintings. We can download the data into R by simply copying and pasting the read_csv() code found on that page.

bob_ross <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-06/bob-ross.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   EPISODE = col_character(),
##   TITLE = col_character()
## )
## See spec(...) for full column specifications.
head(bob_ross, 10)
## # A tibble: 10 x 69
##    EPISODE TITLE APPLE_FRAME AURORA_BOREALIS  BARN BEACH  BOAT BRIDGE
##    <chr>   <chr>       <dbl>           <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1 S01E01  "\"A…           0               0     0     0     0      0
##  2 S01E02  "\"M…           0               0     0     0     0      0
##  3 S01E03  "\"E…           0               0     0     0     0      0
##  4 S01E04  "\"W…           0               0     0     0     0      0
##  5 S01E05  "\"Q…           0               0     0     0     0      0
##  6 S01E06  "\"W…           0               0     0     0     0      0
##  7 S01E07  "\"A…           0               0     0     0     0      0
##  8 S01E08  "\"P…           0               0     0     0     0      0
##  9 S01E09  "\"S…           0               0     0     1     0      0
## 10 S01E10  "\"M…           0               0     0     0     0      0
## # … with 61 more variables: BUILDING <dbl>, BUSHES <dbl>, CABIN <dbl>,
## #   CACTUS <dbl>, CIRCLE_FRAME <dbl>, CIRRUS <dbl>, CLIFF <dbl>,
## #   CLOUDS <dbl>, CONIFER <dbl>, CUMULUS <dbl>, DECIDUOUS <dbl>,
## #   DIANE_ANDRE <dbl>, DOCK <dbl>, DOUBLE_OVAL_FRAME <dbl>, FARM <dbl>,
## #   FENCE <dbl>, FIRE <dbl>, FLORIDA_FRAME <dbl>, FLOWERS <dbl>,
## #   FOG <dbl>, FRAMED <dbl>, GRASS <dbl>, GUEST <dbl>,
## #   HALF_CIRCLE_FRAME <dbl>, HALF_OVAL_FRAME <dbl>, HILLS <dbl>,
## #   LAKE <dbl>, LAKES <dbl>, LIGHTHOUSE <dbl>, MILL <dbl>, MOON <dbl>,
## #   MOUNTAIN <dbl>, MOUNTAINS <dbl>, NIGHT <dbl>, OCEAN <dbl>,
## #   OVAL_FRAME <dbl>, PALM_TREES <dbl>, PATH <dbl>, PERSON <dbl>,
## #   PORTRAIT <dbl>, RECTANGLE_3D_FRAME <dbl>, RECTANGULAR_FRAME <dbl>,
## #   RIVER <dbl>, ROCKS <dbl>, SEASHELL_FRAME <dbl>, SNOW <dbl>,
## #   SNOWY_MOUNTAIN <dbl>, SPLIT_FRAME <dbl>, STEVE_ROSS <dbl>,
## #   STRUCTURE <dbl>, SUN <dbl>, TOMB_FRAME <dbl>, TREE <dbl>, TREES <dbl>,
## #   TRIPLE_FRAME <dbl>, WATERFALL <dbl>, WAVES <dbl>, WINDMILL <dbl>,
## #   WINDOW_FRAME <dbl>, WINTER <dbl>, WOOD_FRAMED <dbl>

The dataframe has one row per Bob Ross episode and a set of dummy variables, each equal to one if that item appears in the episode’s painting. This wide format isn’t very useful for graphing, so we need to clean it up a bit. The cleaning code isn’t important for this class, but it is included below in case you’re curious.

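bob_ross_clean <- bob_ross %>% janitor::clean_names() %>%
  # reshape the wide dummy columns into long (element, present) pairs
  gather(element, present, -episode, -title) %>%
  # keep only the elements that actually appear in each painting
  filter(present == 1) %>%
  mutate(title = str_to_title(str_remove_all(title, '"')),
         # replace every underscore, not just the first (e.g. half_circle_frame)
         element = str_to_title(str_replace_all(element, "_", " "))) %>%
  dplyr::select(-present) %>%
  # split codes like "S01E01" into a season part and an episode part
  separate(episode, into = c("season", "episode"), sep = "E") %>% 
  mutate(season = str_extract(season, "[:digit:]+")) %>% 
  mutate_at(vars(season, episode), as.integer) %>%
  arrange(season, episode)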

head(bob_ross_clean, 10)
## # A tibble: 10 x 4
##    season episode title               element  
##     <int>   <int> <chr>               <chr>    
##  1      1       1 A Walk In The Woods Bushes   
##  2      1       1 A Walk In The Woods Deciduous
##  3      1       1 A Walk In The Woods Grass    
##  4      1       1 A Walk In The Woods River    
##  5      1       1 A Walk In The Woods Tree     
##  6      1       1 A Walk In The Woods Trees    
##  7      1       2 Mt. Mckinley        Cabin    
##  8      1       2 Mt. Mckinley        Clouds   
##  9      1       2 Mt. Mckinley        Conifer  
## 10      1       2 Mt. Mckinley        Mountain
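
The reshaping step is the heart of the cleaning code. The tidyr documentation describes gather() as superseded by pivot_longer(), so here is a minimal sketch of the same wide-to-long idea on a made-up two-episode table. The toy data frame and its values are hypothetical and only mimic the structure of the real data:

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature version of the cleaned-name Bob Ross data
toy <- tibble(
  episode = c("S01E01", "S01E02"),
  title   = c("Painting A", "Painting B"),
  tree    = c(1, 0),
  cabin   = c(0, 1)
)

# One row per (episode, element) pair, keeping only elements that are present
toy_long <- toy %>%
  pivot_longer(cols = -c(episode, title),
               names_to = "element",
               values_to = "present") %>%
  filter(present == 1) %>%
  select(-present)

toy_long
```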

Now our data has season, episode, title and element as variables. The element column is the one we are interested in: these are the items that appear in each painting. Let’s make a graph of the most commonly occurring elements across all of the Bob Ross paintings. To do so, we need the total count of each element. Let’s make a new dataframe.

counts <- bob_ross_clean %>% count(element, sort = TRUE) # sort = TRUE already orders by descending n

head(counts, 10)
## # A tibble: 10 x 2
##    element       n
##    <chr>     <int>
##  1 Tree        361
##  2 Trees       337
##  3 Deciduous   227
##  4 Conifer     212
##  5 Clouds      179
##  6 Mountain    160
##  7 Lake        143
##  8 Grass       142
##  9 River       126
## 10 Bushes      120

This counts dataframe has the elements and the total number of times that element appears, arranged greatest to least. Let’s plot the top 15 items in a bar chart.

plot1 <- counts %>% head(15) %>%
  ggplot(aes(element, n)) + geom_col() #+ coord_flip()

plot1

That is not a very pretty looking graph, so let’s use our ggplot skills to make it look nicer. We can pick a fill color, flip the coordinates so the item names are easier to read, and add a title and axis labels.

my_happy_little_palette <- 'deepskyblue'

plot1 <- counts %>% head(15) %>%
  ggplot(aes(element, n)) + geom_col(fill = my_happy_little_palette) + coord_flip() +
  theme_minimal() + 
  # labs() names the aesthetics, not the screen axes: x is still element after coord_flip()
  labs(title = "Most Popular Items in Bob Ross Paintings",
       x = "Item",
       y = "Number")
plot1

Beautiful!
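
One further refinement is to order the bars by count, since a categorical axis is otherwise sorted alphabetically. The sketch below is an addition that uses reorder() on a small stand-in table built from the first five rows of the counts output shown earlier; counts_toy is a hypothetical name:

```r
library(ggplot2)

# Stand-in for the top of the counts table (values copied from the output above)
counts_toy <- data.frame(
  element = c("Tree", "Trees", "Deciduous", "Conifer", "Clouds"),
  n       = c(361, 337, 227, 212, 179)
)

# reorder(element, n) sorts the bars by count; coord_flip() puts the largest on top
p_ordered <- ggplot(counts_toy, aes(x = reorder(element, n), y = n)) +
  geom_col(fill = "deepskyblue") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Item", y = "Number")

p_ordered
```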

UFO Sightings Data

The next dataset from the Tidy Tuesday repo contains information on global UFO sightings. You can find the information on the dataset here.

ufo_sightings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv")
## Parsed with column specification:
## cols(
##   date_time = col_character(),
##   city_area = col_character(),
##   state = col_character(),
##   country = col_character(),
##   ufo_shape = col_character(),
##   encounter_length = col_double(),
##   described_encounter_length = col_character(),
##   description = col_character(),
##   date_documented = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )
head(ufo_sightings, 10)
## # A tibble: 10 x 11
##    date_time city_area state country ufo_shape encounter_length
##    <chr>     <chr>     <chr> <chr>   <chr>                <dbl>
##  1 10/10/19… san marc… tx    us      cylinder              2700
##  2 10/10/19… lackland… tx    <NA>    light                 7200
##  3 10/10/19… chester … <NA>  gb      circle                  20
##  4 10/10/19… edna      tx    us      circle                  20
##  5 10/10/19… kaneohe   hi    us      light                  900
##  6 10/10/19… bristol   tn    us      sphere                 300
##  7 10/10/19… penarth … <NA>  gb      circle                 180
##  8 10/10/19… norwalk   ct    us      disk                  1200
##  9 10/10/19… pell city al    us      disk                   180
## 10 10/10/19… live oak  fl    us      disk                   120
## # … with 5 more variables: described_encounter_length <chr>,
## #   description <chr>, date_documented <chr>, latitude <dbl>,
## #   longitude <dbl>

Check out the variable names.

names(ufo_sightings)
##  [1] "date_time"                  "city_area"                 
##  [3] "state"                      "country"                   
##  [5] "ufo_shape"                  "encounter_length"          
##  [7] "described_encounter_length" "description"               
##  [9] "date_documented"            "latitude"                  
## [11] "longitude"

And let’s check out the class of this date_time variable.

class(ufo_sightings$date_time)
## [1] "character"

The date_time variable is stored as a character string. Wouldn’t it be cool if we could change this to a date-time format and use that in our plots? This would be helpful for time series data… Luckily, we can! The lubridate package allows us to format variables as dates and times. Let’s convert this variable to a date-time format and also filter out any rows where the country is NA.

ufo <- ufo_sightings %>% 
  mutate(date_time = parse_date_time(date_time, 'mdy_HM')) %>%
  # drop rows with a missing country; is.na() is the explicit test for NA
  filter(!is.na(country))

head(ufo, 5)
## # A tibble: 5 x 11
##   date_time           city_area state country ufo_shape encounter_length
##   <dttm>              <chr>     <chr> <chr>   <chr>                <dbl>
## 1 1949-10-10 20:30:00 san marc… tx    us      cylinder              2700
## 2 1955-10-10 17:00:00 chester … <NA>  gb      circle                  20
## 3 1956-10-10 21:00:00 edna      tx    us      circle                  20
## 4 1960-10-10 20:00:00 kaneohe   hi    us      light                  900
## 5 1961-10-10 19:00:00 bristol   tn    us      sphere                 300
## # … with 5 more variables: described_encounter_length <chr>,
## #   description <chr>, date_documented <chr>, latitude <dbl>,
## #   longitude <dbl>
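
To see what the conversion does on a single value, here is a minimal sketch using lubridate’s mdy_hm(), a shortcut for month/day/year hour:minute strings. The raw string below is a hypothetical example in that format, not a value copied from the dataset:

```r
library(lubridate)

# Hypothetical raw value in month/day/year hour:minute form
raw <- "10/10/1949 20:30"

# Parse the string into a date-time (POSIXct) value
parsed <- mdy_hm(raw)
parsed
class(parsed)

# Once parsed, individual components are easy to extract
year(parsed)
month(parsed)
hour(parsed)
```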

Now we are ready to make our plot. This plot uses the ggridges package to examine how UFO sightings are distributed across the months of the year in each country.

ufo %>% ggplot(aes(x = month(date_time), y = country, fill = country)) + geom_density_ridges() +
  theme_minimal()
## Picking joint bandwidth of 0.685

Because we converted the date_time variable to a date-time format, we can also look at the hourly distribution of UFO sightings by country.

ufo %>% ggplot(aes(x = hour(date_time), y = country, fill = country)) + geom_density_ridges() +
  theme_minimal()
## Picking joint bandwidth of 1.78

Next, let’s make a time series line plot to look at the total UFO sightings over time.

# name the grouping column "year" directly instead of renaming it afterwards
ufo_total <- ufo %>% group_by(year = year(date_time)) %>% summarize(total = n())

ggplot(aes(x = year, y = total), data = ufo_total) + geom_line() + 
  labs(x = "Year",
       y = "UFO Sightings",
       title = "Total Recorded UFO Sightings") +
  theme_linedraw()

Looks like UFO sightings have drastically increased in recent years. Is it more UFOs or better data reporting? I’ll let you decide…