Lab 3: ggplot

ggplot: to plot is to be human

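You may need to install the ggplot2 package, lubridate, and potentially more.

So run install.packages("ggplot2") and do the same for any other package you are missing.

Start off with the usual workflow (p_load will install anything on its list that is missing, but pacman itself has to be installed first):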

library(pacman)
p_load(tidyverse, ggplot2, ggthemes, lubridate, ggridges)

Let’s use the mpg dataset. The mpg dataset is built into the ggplot2 package. How do we go about getting information on a dataset that comes from a package?

?mpg
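Besides the help page, there are a couple of other quick ways to poke at a packaged dataset. This is just a small sketch, assuming ggplot2 is installed; ggplot2_datasets is a name I made up for the example.

```r
library(ggplot2)

# Names of every dataset that ships with ggplot2 (a character vector)
ggplot2_datasets <- data(package = "ggplot2")$results[, "Item"]
ggplot2_datasets

# Quick structural summaries of mpg
dim(mpg)  # number of rows and columns
str(mpg)  # column types and the first few values of each column
```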

I am going to save mpg as mpg_data so that the object shows up in the global environment. You really don’t have to do this, but it makes me feel warm and fuzzy.

mpg_data <- mpg

We can also look at our data using the head command we heard about last time:

head(mpg_data, 10)

## # A tibble: 10 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…

To see the names of our dataset, let’s use ‘names()’

names(mpg)

##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"

There are multiple ways to graph with ggplot, but a basic template needs three things:

1. the data
2. the aesthetic mapping
3. a geom

In an R script this will look like:

ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

First off, what the heck is an ‘aes’? (try ?aes!) Basically it tells ggplot how a variable is going to be represented visually. The aes can be included in the ‘ggplot()’ part or the ‘geom_BLANK()’ part. It is up to the R user to decide what goes into the aes, although each geom has certain required aesthetics (a scatter plot needs both an x and a y, for example).

If we want a quick visual snapshot of our data we can use ‘qplot’ (makes quick plots). Heads up: recent versions of ggplot2 mark qplot() as deprecated, so you may see a warning. Let’s look at engine size, or displacement, as the x variable and highway miles per gallon as the y variable:

qplot(x = displ, y = hwy, data = mpg)

We get a feel for the data and this is a great place to start, but we can do better than that with ggplot2!

First, try this:

ggplot(data = mpg)

All we get is a blank gray panel! This is because we have not told R what our mapping is, i.e. the aes. Now try:

ggplot(data = mpg, aes(x = displ, y = hwy))

Ok, now in the bottom right quadrant of RStudio we have axes but nothing on the graph. We need to add geoms! Let’s try a scatter plot with geom_point. To add a geom you must use a ‘+’ between each added ‘layer’:

ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy))

Cool, looks like the plot we made with qplot a second ago…

Note: we can also set the aes in ggplot() so that every layer we add will have this aes by default. However, we can still manually change the aes for any specific geom. This is probably more info than you need right now, but I wanted to put it on your radar.

Compare the following two:

ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy))

#and

ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point()

The reason we use ggplot is not to arbitrarily add steps, but because it has many more features than simple plot functions do. Let’s try coloring by class (of car) and making the plot a bit prettier:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) + theme_minimal()

Notice the + theme_minimal(). Adding a theme will change a bunch of the underlying defaults in ggplot (e.g. grid color and axes settings). theme_minimal() comes with ggplot2 itself, and many more themes have been written by other R users for you to use. For more themes check out the ggthemes package. Try ?ggthemes to read about the package, or help(package = "ggthemes") to get an index that lists the themes.
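One thing that trips people up with color: it only follows a variable when it sits inside aes(). Here is a minimal sketch of the difference, assuming ggplot2 is loaded (p_mapped and p_fixed are just names I picked for the example).

```r
library(ggplot2)

# Mapped: color is inside aes(), so it varies with the drv column and gets a legend
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

# Set: color is outside aes(), so every point is drawn in the same fixed color
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_mapped
p_fixed
```

If you put color = "blue" inside aes() by accident, ggplot treats "blue" as a one-level variable and picks its own color for it, which is probably not what you wanted.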

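We can also size by variables, although ggplot will warn us here because class is a categorical variable and point size implies an ordering: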

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class)) + theme_fivethirtyeight()

## Warning: Using size for a discrete variable is not advised.
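Size tends to make more sense for a numeric variable. As a rough sketch, here is the same plot sized by city mileage (cty) instead of class; the alpha value is just my choice so overlapping points stay visible.

```r
library(ggplot2)

# cty is numeric, so point size scales with city miles per gallon
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty), alpha = 0.5) +
  theme_minimal()

p_size
```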

Facet wrapping gives you a grid of small plots, one per value of the variable you choose to wrap with. In this example I am wrapping with class (of car). You can specify how many rows or columns you want:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) + 
  facet_wrap(~ class, nrow = 2)
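If you would rather fix the number of columns, swap nrow for ncol. A quick sketch of the same plot (p_facet is a name I made up); with seven car classes and two columns, ggplot works out the number of rows on its own.

```r
library(ggplot2)

# Same facets as before, but laid out in 2 columns instead of 2 rows
p_facet <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  facet_wrap(~ class, ncol = 2)

p_facet
```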

We can also do line plots with mpg.

ggplot(data = mpg) + 
  geom_line(mapping = aes(x = displ, y = hwy, linetype = drv, color=drv))

This is pretty ugly and not very informative, in my opinion. This is because there are many values of y for each x, so the line zigzags between them. Instead we might use geom_smooth. geom_smooth fits a smooth line and shades a confidence band around it (based on the standard error). Notice you can also specify linetype by a variable:

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color=drv))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We can hide the confidence interval if we don’t like how that looks

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color=drv), se=FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
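Another way to tame the jagged geom_line plot from earlier is to collapse the data to one y value per x before plotting. This is only a sketch of that idea, assuming dplyr is loaded; mpg_means, mean_hwy and p_means are names I made up, and taking the mean is my choice of summary, not the only reasonable one.

```r
library(dplyr)
library(ggplot2)

# Which (displ, drv) combinations appear more than once? These cause the zigzags.
mpg %>% count(displ, drv) %>% filter(n > 1)

# Collapse to one mean hwy per (drv, displ) pair
mpg_means <- mpg %>%
  group_by(drv, displ) %>%
  summarize(mean_hwy = mean(hwy), .groups = "drop")

# Now each line has a single y value at every x
p_means <- ggplot(data = mpg_means) +
  geom_line(mapping = aes(x = displ, y = mean_hwy, linetype = drv, color = drv))

p_means
```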

Excitingly, we can combine different types of geoms by just layering them on:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

or alternatively

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
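Putting the shared aes in ggplot() also makes it easy to give one layer something extra. In this sketch (p_layers is a made-up name) the points get colored by class, while the smooth line only inherits x and y, so it stays a single line.

```r
library(ggplot2)

# x and y are set once in ggplot() and inherited by every layer
p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +  # adds color for the points only
  geom_smooth(se = FALSE)                     # inherits x and y, has no color mapping

p_layers
```

If color = class were in ggplot() instead, geom_smooth would inherit it too and try to fit one line per class.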

Perhaps you are more of a boxplot person…

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

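What if we want to switch the axes? You might first think to just swap x and y in the aes. In older versions of ggplot2 this won’t work out (do you know why?), although versions 3.3.0 and later will usually figure out the orientation for you. A way that works either way is coord_flip():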

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()+
  coord_flip()

Now that we have some basics down, we can make some plots with cool datasets! We can get a free and cleaned data set from Tidy Tuesday every week on, you guessed it, Tuesday. This is a great place to get data if you want to practice ggplot or data cleaning skills. Ask Jenni Putz, she’s a HUGE advocate. Here is the website: https://github.com/rfordatascience/tidytuesday

I am getting a dataset about UFOs from TidyTuesday using the readr package (it’s in the tidyverse but can be loaded separately):

p_load(readr)

ufo_sightings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv")

## Parsed with column specification:
## cols(
##   date_time = col_character(),
##   city_area = col_character(),
##   state = col_character(),
##   country = col_character(),
##   ufo_shape = col_character(),
##   encounter_length = col_double(),
##   described_encounter_length = col_character(),
##   description = col_character(),
##   date_documented = col_character(),
##   latitude = col_double(),
##   longitude = col_double()
## )

Moving on to some UFO sightings. First check out the dataset:

head(ufo_sightings, 10)

## # A tibble: 10 x 11
##    date_time city_area state country ufo_shape encounter_length described_encou…
##    <chr>     <chr>     <chr> <chr>   <chr>                <dbl> <chr>           
##  1 10/10/19… san marc… tx    us      cylinder              2700 45 minutes      
##  2 10/10/19… lackland… tx    <NA>    light                 7200 1-2 hrs         
##  3 10/10/19… chester … <NA>  gb      circle                  20 20 seconds      
##  4 10/10/19… edna      tx    us      circle                  20 1/2 hour        
##  5 10/10/19… kaneohe   hi    us      light                  900 15 minutes      
##  6 10/10/19… bristol   tn    us      sphere                 300 5 minutes       
##  7 10/10/19… penarth … <NA>  gb      circle                 180 about 3 mins    
##  8 10/10/19… norwalk   ct    us      disk                  1200 20 minutes      
##  9 10/10/19… pell city al    us      disk                   180 3  minutes      
## 10 10/10/19… live oak  fl    us      disk                   120 several minutes 
## # … with 4 more variables: description <chr>, date_documented <chr>,
## #   latitude <dbl>, longitude <dbl>

names(ufo_sightings)

##  [1] "date_time"                  "city_area"                 
##  [3] "state"                      "country"                   
##  [5] "ufo_shape"                  "encounter_length"          
##  [7] "described_encounter_length" "description"               
##  [9] "date_documented"            "latitude"                  
## [11] "longitude"

Look at this date column:

class(ufo_sightings$date_time)

## [1] "character"

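It’s a character, but it would be nice if we could use these dates/times in our plots. So let’s convert it using the lubridate package. We also want to get rid of the rows where country is NA. (Comparing with country != "NA" happens to drop them too, because the comparison returns NA for missing values, but is.na() says what we mean.)

ufo <- ufo_sightings %>% 
  mutate(date_time = parse_date_time(date_time, 'mdy_HM')) %>% # gives hour and minute too
  filter(!is.na(country)) # drop rows with a missing country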

Now our date_time variable is in a usable form. Lubridate is WONDERFUL and can turn a useless dataset into something super valuable. Note that the class of date_time has changed in our new ufo object:

class(ufo$date_time)

## [1] "POSIXct" "POSIXt"
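If you want to see what the conversion and the NA filter are doing without downloading anything, here is a minimal sketch on a toy tibble. The rows in toy are invented for illustration and are not taken from the TidyTuesday file; toy and toy_clean are names I made up.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows shaped like the UFO data (values invented for illustration)
toy <- tibble(
  date_time = c("10/10/1949 20:30", "6/1/2008 05:15", "12/24/1999 23:00"),
  country   = c("us", NA, "gb")
)

toy_clean <- toy %>%
  mutate(date_time = parse_date_time(date_time, "mdy_HM")) %>%
  filter(!is.na(country))   # keeps only rows where country is present

class(toy_clean$date_time)  # a date-time class rather than character
hour(toy_clean$date_time)   # hour of the day
month(toy_clean$date_time)  # month number
year(toy_clean$date_time)   # calendar year
```

Once the column is a real date-time, accessors like hour(), month() and year() pull out whichever piece you want to plot.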

Let’s look at what time of day these ufo sightings are happening and let’s break it down by country.

ufo %>% ggplot(aes(x = hour(date_time), y = country, fill = country)) +
  geom_density_ridges() +
  theme_minimal()

## Picking joint bandwidth of 1.78

What about the time of the year? Let’s look at it monthly:

ufo %>% ggplot(aes(x = month(date_time), y = country, fill = country)) +
  geom_density_ridges() +
  theme_minimal()

## Picking joint bandwidth of 0.685

Now let’s look at total ufo sightings per year

ufo_total <- ufo %>% group_by(year(date_time)) %>% summarize(total = n())

names(ufo_total) <- c("year", "total")

ggplot(aes(x = year, y = total), data = ufo_total) + geom_line() + 
  labs(x = "Year",
       y = "UFO Sightings",
       title = "Total Recorded UFO Sightings") +
  theme_linedraw()
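As a side note, dplyr’s count() can do the grouping, the tallying and the column naming in one go. A small sketch on made-up timestamps (toy_ufo and toy_total are hypothetical names, and the dates are invented), assuming dplyr and lubridate are loaded:

```r
library(dplyr)
library(lubridate)

# Hypothetical timestamps standing in for ufo$date_time
toy_ufo <- tibble(
  date_time = ymd_hm(c("1999-07-04 21:00", "1999-10-31 23:30", "2008-01-15 06:45"))
)

# count() groups by year, tallies the rows, and names both columns in one step
toy_total <- toy_ufo %>% count(year = year(date_time), name = "total")

toy_total
```

The result has the same year and total columns as ufo_total, so it can go straight into the same ggplot call.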

Cool beans. Now go make pictures and make the world a better place!