First, let’s load our packages. We will mostly be using the dplyr and ggplot2 packages, and those load as part of the tidyverse. The ggthemes package contains themes that we can add to our ggplots. The ggridges package will be used to create a pretty cool graph towards the end. We also load lubridate for working with dates and janitor for cleaning up column names.
library(pacman)
p_load(tidyverse, lubridate, ggridges, ggthemes, janitor)
For this lecture, we are using the mpg dataset that is built into the ggplot2 package, which loads with the tidyverse. It contains fuel economy data for 38 models of cars in 1999 and 2008. Let’s take a look at the data. Recall that we can preview the first rows of our data with the head() function and look at variable names with names().
head(mpg, 10)
## # A tibble: 10 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
names(mpg)
## [1] "manufacturer" "model" "displ" "year"
## [5] "cyl" "trans" "drv" "cty"
## [9] "hwy" "fl" "class"
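Before we start making ggplots, we should talk about plot(), base R’s plotting function. The closest thing to it in ggplot2 is the qplot() function, which makes quick plots using a plot()-like syntax. Note that qplot() comes from ggplot2, not base R, and it has been deprecated as of ggplot2 3.4.0, so you may see a warning when you run it.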
qplot(x = displ, y = hwy, data = mpg)
This is a nice and informative plot, but the full ggplot() syntax is more flexible and lets us customize our plots to make some really awesome data visualizations. The basic setup of a ggplot requires three things: the data, the aesthetic mapping, and a geom. The aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms, like which variables are on the axes, the variable to color or fill by, etc. The geoms tell R how to draw the data: as points, lines, columns, etc.
In general, we can make a ggplot by typing the following: ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
The way ggplot works is by adding layers. We can add a new layer with the + sign. Let’s build a ggplot step by step. First, start with ggplot() and tell R what data we are using.
ggplot(data = mpg)
Why did this make a blank graph? Well, we haven’t given R the aesthetic mapping yet so it doesn’t know what to put on top of the base layer. Let’s add the x and y variables.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
Now we have a graph with axes and gridlines but no information on the graph. To get data on the graph, we need to tell R how we want to draw the data with a geom. To make a scatterplot, we use geom_point().
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point()
This looks like the plot we made earlier but with a lot of extra steps. So why did we do all this extra work to learn ggplot? Well, ggplot makes it easy to visualize data in ways that take much more work with base R’s plot(). For example, we can color the points by a variable. We can also add themes by adding a layer to the graph. There are some themes built into the ggplot2 package and the ggthemes package has even more. You can also make your own custom theme!
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() + theme_minimal()
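As a rough illustration of the custom theme idea, one common approach is to start from a built-in theme and override a few elements with theme(). The function name theme_lecture() and the particular settings below are made up for this sketch; they are not part of ggplot2 or ggthemes.

```r
library(ggplot2)

# Hypothetical custom theme: start from theme_minimal() and override a few elements
theme_lecture <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),  # bold plot titles
      panel.grid.minor = element_blank(),        # drop minor gridlines
      legend.position = "bottom"                 # move the legend below the plot
    )
}

# Use it like any other theme layer
p_custom <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_lecture()
p_custom
```

Wrapping the theme in a function means the same look can be reused on every plot in a project by adding one layer.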
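We can also change the size of the dots by a variable using size. Here we map size to class, which is a discrete variable, so ggplot draws the plot but prints the warning shown below; size generally works better with a numeric variable.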
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class)) + theme_fivethirtyeight()
## Warning: Using size for a discrete variable is not advised.
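To avoid that warning, size can be mapped to a numeric column instead. The sketch below uses cty (city miles per gallon) from mpg; the choice of cty and the alpha value are just for illustration.

```r
library(ggplot2)

# cty is numeric, so point size scales smoothly with it and no warning is raised
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty), alpha = 0.5)
p_size
```

The alpha argument sits outside aes() because it is a fixed setting for all points rather than a mapping to a variable.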
We can facet wrap, which will make a separate plot for each value of the variable we wrap by and then arrange them in a grid.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
facet_wrap(~ class, nrow = 2)
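If we want to facet by two variables at once, facet_grid() lays the panels out in rows and columns. This is a small sketch; the choice of drv and cyl as the faceting variables is only an example.

```r
library(ggplot2)

# Rows are drive types (drv), columns are cylinder counts (cyl)
p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
p_grid
```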
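Of course, we can make many different types of graphs besides scatterplots using ggplot. Here is how we would draw a smoothed trend line through the data points. By default geom_smooth() fits a flexible loess curve, not a straight line, and draws one curve per drv group here because of the linetype mapping: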
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
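If a straight regression line is what we want, geom_smooth() accepts a method argument. A minimal sketch, assuming we want a single least-squares line for the whole dataset:

```r
library(ggplot2)

# method = "lm" fits a straight line by least squares;
# se = FALSE hides the shaded confidence band
p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_lm
```

Spelling out formula = y ~ x also silences the message that geom_smooth() otherwise prints about the formula it picked.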
We can also combine multiple geoms by adding multiple layers.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
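Notice that the same aes() call is repeated in both layers above. A mapping given inside ggplot() is inherited by every layer, and each layer can still add its own aesthetics. The version below is a sketch of that pattern; coloring the points by drv is just an example.

```r
library(ggplot2)

# x and y are set once in ggplot() and shared by both geoms;
# color is mapped only in geom_point(), so there is still a single smooth curve
p_shared <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) +
  geom_smooth(method = "loess", formula = y ~ x)
p_shared
```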
Some geoms, like boxplots, use categorical x variables.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
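By default the categories appear in alphabetical order. One way to make the boxplot easier to read is to order the classes by their median highway mileage with base R's reorder(). This is a sketch of one option, not the only way to do it.

```r
library(ggplot2)

# reorder() turns class into a factor whose levels are sorted by median hwy
p_box <- ggplot(data = mpg,
                mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(x = "class")  # replace the long default axis label
p_box
```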
Now that we have some basics, let’s make some cool graphs using some fun data. Tidy Tuesday is a weekly data science project that has a new, accessible dataset to try your tidyverse skills on. We are going to use two datasets from this GitHub repo to make some interesting ggplots.
The first dataset we are going to use is the Bob Ross dataset. We can download the data into R by simply copying and pasting the read_csv() code found on that page.
bob_ross <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-06/bob-ross.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## EPISODE = col_character(),
## TITLE = col_character()
## )
## See spec(...) for full column specifications.
head(bob_ross, 10)
## # A tibble: 10 x 69
## EPISODE TITLE APPLE_FRAME AURORA_BOREALIS BARN BEACH BOAT BRIDGE
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 S01E01 "\"A… 0 0 0 0 0 0
## 2 S01E02 "\"M… 0 0 0 0 0 0
## 3 S01E03 "\"E… 0 0 0 0 0 0
## 4 S01E04 "\"W… 0 0 0 0 0 0
## 5 S01E05 "\"Q… 0 0 0 0 0 0
## 6 S01E06 "\"W… 0 0 0 0 0 0
## 7 S01E07 "\"A… 0 0 0 0 0 0
## 8 S01E08 "\"P… 0 0 0 0 0 0
## 9 S01E09 "\"S… 0 0 0 1 0 0
## 10 S01E10 "\"M… 0 0 0 0 0 0
## # … with 61 more variables: BUILDING <dbl>, BUSHES <dbl>, CABIN <dbl>,
## # CACTUS <dbl>, CIRCLE_FRAME <dbl>, CIRRUS <dbl>, CLIFF <dbl>,
## # CLOUDS <dbl>, CONIFER <dbl>, CUMULUS <dbl>, DECIDUOUS <dbl>,
## # DIANE_ANDRE <dbl>, DOCK <dbl>, DOUBLE_OVAL_FRAME <dbl>, FARM <dbl>,
## # FENCE <dbl>, FIRE <dbl>, FLORIDA_FRAME <dbl>, FLOWERS <dbl>,
## # FOG <dbl>, FRAMED <dbl>, GRASS <dbl>, GUEST <dbl>,
## # HALF_CIRCLE_FRAME <dbl>, HALF_OVAL_FRAME <dbl>, HILLS <dbl>,
## # LAKE <dbl>, LAKES <dbl>, LIGHTHOUSE <dbl>, MILL <dbl>, MOON <dbl>,
## # MOUNTAIN <dbl>, MOUNTAINS <dbl>, NIGHT <dbl>, OCEAN <dbl>,
## # OVAL_FRAME <dbl>, PALM_TREES <dbl>, PATH <dbl>, PERSON <dbl>,
## # PORTRAIT <dbl>, RECTANGLE_3D_FRAME <dbl>, RECTANGULAR_FRAME <dbl>,
## # RIVER <dbl>, ROCKS <dbl>, SEASHELL_FRAME <dbl>, SNOW <dbl>,
## # SNOWY_MOUNTAIN <dbl>, SPLIT_FRAME <dbl>, STEVE_ROSS <dbl>,
## # STRUCTURE <dbl>, SUN <dbl>, TOMB_FRAME <dbl>, TREE <dbl>, TREES <dbl>,
## # TRIPLE_FRAME <dbl>, WATERFALL <dbl>, WAVES <dbl>, WINDMILL <dbl>,
## # WINDOW_FRAME <dbl>, WINTER <dbl>, WOOD_FRAMED <dbl>
The dataframe has Bob Ross episodes and a bunch of dummy variables equal to one if that item is included in the painting for that episode. This isn’t a super useful format for graphing, so we need to clean it up a bit. The cleaning code isn’t important for this class, but if you’re curious about how I cleaned the data, it is below.
# Reshape from one 0/1 column per element to one row per (episode, element) pair
bob_ross_clean <- bob_ross %>% janitor::clean_names() %>%
  gather(element, present, -episode, -title) %>%
  filter(present == 1) %>%  # keep only elements that appear in the painting
  mutate(title = str_to_title(str_remove_all(title, '"')),
         # str_replace_all() so names with several underscores are fully cleaned
         element = str_to_title(str_replace_all(element, "_", " "))) %>%
  dplyr::select(-present) %>%
  separate(episode, into = c("season", "episode"), sep = "E") %>%
  mutate(season = str_extract(season, "[:digit:]+")) %>%
  mutate_at(vars(season, episode), as.integer) %>%
  arrange(season, episode)
head(bob_ross_clean, 10)
## # A tibble: 10 x 4
## season episode title element
## <int> <int> <chr> <chr>
## 1 1 1 A Walk In The Woods Bushes
## 2 1 1 A Walk In The Woods Deciduous
## 3 1 1 A Walk In The Woods Grass
## 4 1 1 A Walk In The Woods River
## 5 1 1 A Walk In The Woods Tree
## 6 1 1 A Walk In The Woods Trees
## 7 1 2 Mt. Mckinley Cabin
## 8 1 2 Mt. Mckinley Clouds
## 9 1 2 Mt. Mckinley Conifer
## 10 1 2 Mt. Mckinley Mountain
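The wide-to-long reshaping step is the heart of that cleaning code. In current versions of tidyr, pivot_longer() is the recommended replacement for gather(). The sketch below shows the same idea on a tiny made-up table (toy_wide is invented for illustration and is not the real Bob Ross data).

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide data: one row per episode, one 0/1 column per element
toy_wide <- tibble(
  episode = c("S01E01", "S01E02"),
  title   = c("A Walk In The Woods", "Mt. Mckinley"),
  bushes  = c(1, 0),
  cabin   = c(0, 1),
  tree    = c(1, 1)
)

# Stack every column except episode and title into element/present pairs,
# then keep only the elements that are present
toy_long <- toy_wide %>%
  pivot_longer(cols = -c(episode, title),
               names_to = "element", values_to = "present") %>%
  filter(present == 1) %>%
  select(-present)
toy_long
```

Each episode now has one row per element that appears in it, which is the shape count() and ggplot work with most easily.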
Now our data has season, episode, title and element as variables. The element column is what we are going to be interested in: these are items that appear in the painting. Let’s make a graph of the most commonly occurring elements in all of the Bob Ross paintings. To do so, we need the total count of each element. Let’s make a new dataframe.
# sort = TRUE already orders the result from most to least common
counts <- bob_ross_clean %>% count(element, sort = TRUE)
head(counts, 10)
## # A tibble: 10 x 2
## element n
## <chr> <int>
## 1 Tree 361
## 2 Trees 337
## 3 Deciduous 227
## 4 Conifer 212
## 5 Clouds 179
## 6 Mountain 160
## 7 Lake 143
## 8 Grass 142
## 9 River 126
## 10 Bushes 120
This counts dataframe has the elements and the total number of times each element appears, arranged greatest to least. Let’s plot the top 15 items in a bar chart.
plot1 <- counts %>% head(15) %>%
ggplot(aes(element, n)) + geom_col() #+ coord_flip()
plot1
That is not a very pretty looking graph, so let’s use our ggplot skills to make it look nicer. We can pick a fill color, flip the coordinates so the item names are readable, and add a title and labels.
my_happy_little_palette <- 'deepskyblue'
plot1 <- counts %>% head(15) %>%
  ggplot(aes(element, n)) + geom_col(fill = my_happy_little_palette) + coord_flip() +
  theme_minimal() +
  # labs() names aesthetics, not screen positions: x is still element after coord_flip()
  labs(title = "Most Popular Items in Bob Ross Paintings",
       x = "Item",
       y = "Number")
plot1
Beautiful!
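One more optional polish: the bars are drawn in alphabetical order, not in order of size. A sketch of one way to sort them uses reorder() on the element column. To keep the example self-contained it retypes the first five rows of the counts table printed above; top_counts is a name made up for this sketch.

```r
library(ggplot2)
library(tibble)

# First five rows of the counts table printed earlier
top_counts <- tribble(
  ~element,    ~n,
  "Tree",      361,
  "Trees",     337,
  "Deciduous", 227,
  "Conifer",   212,
  "Clouds",    179
)

# reorder(element, n) sorts the bars by count; after coord_flip()
# the most common item ends up at the top
p_ordered <- ggplot(top_counts, aes(x = reorder(element, n), y = n)) +
  geom_col(fill = "deepskyblue") +
  coord_flip() +
  theme_minimal() +
  labs(x = "Item", y = "Number")
p_ordered
```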
The next dataset from the Tidy Tuesday repo contains information on global UFO sightings. You can find the information on the dataset here.
ufo_sightings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv")
## Parsed with column specification:
## cols(
## date_time = col_character(),
## city_area = col_character(),
## state = col_character(),
## country = col_character(),
## ufo_shape = col_character(),
## encounter_length = col_double(),
## described_encounter_length = col_character(),
## description = col_character(),
## date_documented = col_character(),
## latitude = col_double(),
## longitude = col_double()
## )
head(ufo_sightings, 10)
## # A tibble: 10 x 11
## date_time city_area state country ufo_shape encounter_length
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 10/10/19… san marc… tx us cylinder 2700
## 2 10/10/19… lackland… tx <NA> light 7200
## 3 10/10/19… chester … <NA> gb circle 20
## 4 10/10/19… edna tx us circle 20
## 5 10/10/19… kaneohe hi us light 900
## 6 10/10/19… bristol tn us sphere 300
## 7 10/10/19… penarth … <NA> gb circle 180
## 8 10/10/19… norwalk ct us disk 1200
## 9 10/10/19… pell city al us disk 180
## 10 10/10/19… live oak fl us disk 120
## # … with 5 more variables: described_encounter_length <chr>,
## # description <chr>, date_documented <chr>, latitude <dbl>,
## # longitude <dbl>
Check out the variable names.
names(ufo_sightings)
## [1] "date_time" "city_area"
## [3] "state" "country"
## [5] "ufo_shape" "encounter_length"
## [7] "described_encounter_length" "description"
## [9] "date_documented" "latitude"
## [11] "longitude"
And let’s check out the class of this date_time variable.
class(ufo_sightings$date_time)
## [1] "character"
The date_time variable is a character. Wouldn’t it be cool if we could change this to a date format and use that in our plots? This would be helpful for time series data… Luckily, we can! The lubridate package allows us to format variables as dates and times. Let’s convert this variable to a date-time format and also filter out any NA values for the country.
ufo <- ufo_sightings %>%
  mutate(date_time = parse_date_time(date_time, 'mdy_HM')) %>%
  filter(!is.na(country))  # drop rows where the country is missing
head(ufo, 5)
## # A tibble: 5 x 11
## date_time city_area state country ufo_shape encounter_length
## <dttm> <chr> <chr> <chr> <chr> <dbl>
## 1 1949-10-10 20:30:00 san marc… tx us cylinder 2700
## 2 1955-10-10 17:00:00 chester … <NA> gb circle 20
## 3 1956-10-10 21:00:00 edna tx us circle 20
## 4 1960-10-10 20:00:00 kaneohe hi us light 900
## 5 1961-10-10 19:00:00 bristol tn us sphere 300
## # … with 5 more variables: described_encounter_length <chr>,
## # description <chr>, date_documented <chr>, latitude <dbl>,
## # longitude <dbl>
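To see what the conversion buys us, here is a small sketch on two made-up strings (not rows from the UFO data) in a month/day/year hour:minute layout. It uses mdy_hm(), a lubridate shortcut for that specific layout, and then pulls out pieces of the parsed date-times.

```r
library(lubridate)

# Made-up example strings in month/day/year hour:minute order
raw <- c("6/25/2019 21:15", "12/1/2008 03:40")
parsed <- mdy_hm(raw)

class(parsed)   # date-time class instead of character
year(parsed)    # 2019 2008
month(parsed)   # 6 12
hour(parsed)    # 21 3
```

Helpers like year(), month() and hour() are what let us use parts of a date-time directly as plot variables.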
Now we are ready to make our plot. This plot is going to use the ggridges package and examines the distribution of monthly UFO sightings by country.
ufo %>% ggplot(aes(x = month(date_time), y = country, fill = country)) + geom_density_ridges() +
theme_minimal()
## Picking joint bandwidth of 0.685
Because we converted the date_time variable to a date-time format, we can also look at the hourly distribution of UFO sightings by country.
ufo %>% ggplot(aes(x = hour(date_time), y = country, fill = country)) + geom_density_ridges() +
theme_minimal()
## Picking joint bandwidth of 1.78
Next, let’s make a time series line plot to look at the total UFO sightings over time.
# Naming the grouping expression gives us a "year" column directly, so no renaming is needed
ufo_total <- ufo %>% group_by(year = year(date_time)) %>% summarize(total = n())
ggplot(data = ufo_total, mapping = aes(x = year, y = total)) + geom_line() +
  labs(x = "Year",
       y = "UFO Sightings",
       title = "Total Recorded UFO Sightings") +
  theme_linedraw()
Looks like UFO sightings have drastically increased in recent years. Is it more UFOs or better data reporting? I’ll let you decide…