You may need to install ggplot2, lubridate, and possibly a few other packages first.
If so, run install.packages("ggplot2"), and do the same for any other package you are missing.
Start off with the usual workflow:
library(pacman)
p_load(tidyverse, ggplot2, ggthemes, lubridate, ggridges)
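If pacman is new to you, here is a rough base-R sketch of the work p_load() does: install a package if it is missing, then attach it. The pkgs vector and the loop are only illustrative, and p_load() itself does more than this, so treat it as an approximation rather than its actual implementation.

```r
# Packages this sketch should make available (a shortened, illustrative list).
pkgs <- c("ggplot2", "lubridate")

for (pkg in pkgs) {
  # Install only when the package is not already on this machine.
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() take the name from a variable.
  library(pkg, character.only = TRUE)
}
```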
Let’s use the mpg dataset, which is built into the ggplot2 package. How do we go about getting information on a dataset that ships with a package?
?mpg
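Besides the help page for a single dataset, you can ask a package what it contains. A small sketch (the name ggplot2_data is just something I made up for the example):

```r
library(ggplot2)

# List every dataset bundled with ggplot2; the Item column holds the names.
ggplot2_data <- data(package = "ggplot2")$results[, "Item"]
ggplot2_data

# Show the index of all help pages in the package.
help(package = "ggplot2")
```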
I am going to save mpg as mpg_data. This is nice because it puts the object in the global environment. You really don’t have to do this, but it makes me feel warm and fuzzy.
mpg_data <- mpg
We can also look at our data using the head command we heard about last time:
head(mpg_data, 10)
## # A tibble: 10 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manual… f 21 29 p comp…
## 3 audi a4 2 2008 4 manual… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto(a… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto(l… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manual… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto(a… f 18 27 p comp…
## 8 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p comp…
## 9 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p comp…
## 10 audi a4 quat… 2 2008 4 manual… 4 20 28 p comp…
To see the column names of our dataset, let’s use ‘names()’:
names(mpg)
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
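A few other quick ways to size up a dataset, sketched here on mpg. glimpse() comes from dplyr (loaded with the tidyverse); the others are base R.

```r
library(ggplot2)  # provides mpg
library(dplyr)    # provides glimpse()

dim(mpg)          # number of rows, then number of columns
glimpse(mpg)      # every column with its type and first few values
summary(mpg$hwy)  # quick numeric summary of highway mileage
```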
There are multiple ways to graph with ggplot, but a basic template needs three things:
* 1. the data
* 2. the aesthetic mapping
* 3. a geom
In an R script this will look like:
ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
First off, what the heck is an ‘aes’? (try ?aes!) Basically it tells ggplot how a variable is going to be represented visually. The aes can be included in the ‘ggplot()’ part or the ‘geom_BLANK()’ part. It is up to the R user to decide what goes into the aes, although each geom has certain required aesthetics.
If we want a quick visual snapshot of our data we can use ‘qplot’ (makes quick plots). Newer ggplot2 releases mark qplot() as deprecated, so you may see a warning. Let’s look at engine size, or displacement, as the x variable and highway miles per gallon as the y variable:
qplot(x = displ, y = hwy, data = mpg)
We get a feel for the data and this is a great place to start, but we can do better than that with ggplot2!
First, try this:
ggplot(data = mpg)
Nothing much happens: we just get a blank panel! This is because we need to tell R what our mapping is, i.e. the aes. Now try:
ggplot(data = mpg, aes( x = displ, y = hwy))
OK, now in the bottom right pane of RStudio we have axes but nothing on the graph. We need to add geoms! Let’s try a scatter plot with geom_point. To add a geom you must use a ‘+’ between each added ‘layer’:
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy))
Cool, looks like the plot we made with qplot a second ago…
Note: we can also set the aes in ggplot() so that every layer we add will have this aes by default. However, we can still manually change the aes for any specific geom. This is probably more info than you need right now, but I wanted to put it on your radar.
Compare the following two:
ggplot(data = mpg) + geom_point(aes(x = displ, y = hwy))
#and
ggplot(data = mpg, aes(x = displ, y = hwy)) + geom_point()
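To make the inheritance idea concrete, here is a small sketch where x and y are set once in ggplot() and a single layer adds its own color mapping on top. The object name p_layers is only for illustration.

```r
library(ggplot2)

# x and y are set once in ggplot() and inherited by both layers.
p_layers <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +  # color is added for this layer only
  geom_smooth(se = FALSE)         # no color mapping here, so one line overall
p_layers
```

Because color is mapped only inside geom_point(), the points are colored by drv while the smooth line is drawn once for all the data.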
The reason we use ggplot is not to arbitrarily add steps, but because it has many more features than simple plot functions do. Let’s try coloring by class (of car) and making the plot a bit prettier:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class)) + theme_minimal()
Notice the + theme_minimal(). Adding a theme will change a bunch of the underlying defaults in ggplot (e.g. grid color and axes settings). theme_minimal() ships with ggplot2 itself, and many more themes have been created by other R users for you to use. For more themes check out the ggthemes package. Try ?ggthemes, or help(package = "ggthemes") for an index of everything in the package.
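One thing that trips people up is the difference between mapping a color and setting one. A minimal sketch (p_mapped and p_set are just names I picked):

```r
library(ggplot2)

# Mapped: color sits inside aes(), so it follows the class column and gets a legend.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Set: color sits outside aes(), so every point is simply blue and there is no legend.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_mapped
p_set
```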
We can also size points by a variable:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class)) + theme_fivethirtyeight()
## Warning: Using size for a discrete variable is not advised.
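The warning shows up because class is a discrete variable and size implies an ordering. Two possible alternatives, sketched with columns that are already in mpg: map size to a numeric column, or use shape for a discrete one.

```r
library(ggplot2)

# cyl is numeric, so mapping it to size does not trigger the warning.
p_size <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cyl))

# For a discrete variable such as drv, shape is usually a better fit than size.
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv))

p_size
p_shape
```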
Facet wrapping gives you a grid of small plots, one per value of the variable you choose to wrap with. In this example I am wrapping with class (of car). You can specify how many rows or columns you want:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
facet_wrap(~ class, nrow = 2)
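If you want to facet on two variables at once, facet_grid() is the usual companion to facet_wrap(). A short sketch, assuming we want drive type in the rows and number of cylinders in the columns:

```r
library(ggplot2)

p_grid <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)  # rows: drive type, columns: number of cylinders
p_grid
```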
We can also do line plots with mpg.
ggplot(data = mpg) +
geom_line(mapping = aes(x = displ, y = hwy, linetype = drv, color=drv))
This is pretty ugly and not very informative imo. This is because there are many values of y for each x, so the line zigzags between them. Instead we might use geom_smooth, which fits a smooth line and shades a confidence band around it. Notice you can also specify linetype by a variable:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color=drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can hide the confidence interval if we don’t like how that looks:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color=drv), se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
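The message above tells us geom_smooth() picked loess on its own. If you would rather have a straight line per group, you can ask for a linear fit with method = "lm". A sketch along the same lines as the plots above:

```r
library(ggplot2)

p_lm <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # one straight fitted line per drv group
p_lm
```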
We can, excitingly, combine different types of geoms by just layering them on:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
or alternatively
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Perhaps you are more of a boxplot person…
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
What if we want to switch the axes? You might first think to just swap x and y in the aes, but in older versions of ggplot2 this won’t work out. Do you know why? Instead let’s use coord_flip():
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()+
coord_flip()
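In ggplot2 3.3.0 and later, geom_boxplot() can work out the orientation by itself, so putting the discrete variable on y gives a horizontal boxplot without coord_flip(). A sketch, assuming you have a recent enough version installed:

```r
library(ggplot2)

# Assumes ggplot2 3.3.0 or newer, where geom_boxplot() infers the orientation.
p_horizontal <- ggplot(data = mpg, mapping = aes(x = hwy, y = class)) +
  geom_boxplot()
p_horizontal
```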
Now that we have some basics down, we can make some plots with cool datasets! We can get a free and cleaned data set from Tidy Tuesday every week on, you guessed it, Tuesday. This is a great place to get data if you want to practice ggplot or data cleaning skills. Ask Jenni Putz, she’s a HUGE advocate. Here is the website: https://github.com/rfordatascience/tidytuesday
I am getting a dataset about UFOs from TidyTuesday using the readr package (it’s in the tidyverse but can be loaded separately):
p_load(readr)
ufo_sightings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-25/ufo_sightings.csv")
## Parsed with column specification:
## cols(
## date_time = col_character(),
## city_area = col_character(),
## state = col_character(),
## country = col_character(),
## ufo_shape = col_character(),
## encounter_length = col_double(),
## described_encounter_length = col_character(),
## description = col_character(),
## date_documented = col_character(),
## latitude = col_double(),
## longitude = col_double()
## )
Moving on to some ufo sightings. First check out the dataset:
head(ufo_sightings, 10)
## # A tibble: 10 x 11
## date_time city_area state country ufo_shape encounter_length described_encou…
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 10/10/19… san marc… tx us cylinder 2700 45 minutes
## 2 10/10/19… lackland… tx <NA> light 7200 1-2 hrs
## 3 10/10/19… chester … <NA> gb circle 20 20 seconds
## 4 10/10/19… edna tx us circle 20 1/2 hour
## 5 10/10/19… kaneohe hi us light 900 15 minutes
## 6 10/10/19… bristol tn us sphere 300 5 minutes
## 7 10/10/19… penarth … <NA> gb circle 180 about 3 mins
## 8 10/10/19… norwalk ct us disk 1200 20 minutes
## 9 10/10/19… pell city al us disk 180 3 minutes
## 10 10/10/19… live oak fl us disk 120 several minutes
## # … with 4 more variables: description <chr>, date_documented <chr>,
## # latitude <dbl>, longitude <dbl>
names(ufo_sightings)
## [1] "date_time" "city_area"
## [3] "state" "country"
## [5] "ufo_shape" "encounter_length"
## [7] "described_encounter_length" "description"
## [9] "date_documented" "latitude"
## [11] "longitude"
Look at the date_time column:
class(ufo_sightings$date_time)
## [1] "character"
It’s a character, but it would be nice if we could use these dates/times in our plots. So let’s convert it using the lubridate package. We also want to get rid of the NA values for country.
ufo <- ufo_sightings %>%
  mutate(date_time = parse_date_time(date_time, 'mdy_HM')) %>% # keeps hour and minute too
  filter(!is.na(country)) # drop rows with a missing country
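To see what the parsing step does without downloading anything, here is a tiny sketch on two made-up timestamps written in the month/day/year hour:minute order used above. The vector raw_times is invented for illustration and is not taken from the ufo data.

```r
library(lubridate)

# Made-up timestamps in month/day/year hour:minute order.
raw_times <- c("10/10/1949 20:30", "1/2/2005 3:15")
parsed <- parse_date_time(raw_times, orders = "mdy HM")

class(parsed)   # a date-time class rather than "character"
hour(parsed)    # hour of day for each timestamp
month(parsed)   # month number for each timestamp

# Comparing a missing value with a string gives NA, not TRUE or FALSE,
# so is.na() is the explicit way to test for missing values.
NA != "NA"
is.na(c("us", NA))
```

Once the column is a real date-time, helpers like hour(), month(), and year() can be used directly inside aes().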
Now our date_time variable is in a usable form. Lubridate is WONDERFUL: it can turn a useless dataset into something super valuable. Note that the class of date_time has changed in our new ufo object:
class(ufo$date_time)
## [1] "POSIXct" "POSIXt"
Let’s look at what time of day these ufo sightings are happening and let’s break it down by country.
ufo %>% ggplot(aes(x = hour(date_time), y = country, fill = country)) +
geom_density_ridges() +
theme_minimal()
## Picking joint bandwidth of 1.78
What about the time of the year? Let’s look at it monthly:
ufo %>% ggplot(aes(x = month(date_time), y = country, fill = country)) +
  geom_density_ridges() +
  theme_minimal()
## Picking joint bandwidth of 0.685
Now let’s look at total ufo sightings per year
ufo_total <- ufo %>%
  group_by(year = year(date_time)) %>% # name the grouping column directly
  summarize(total = n())
ggplot(data = ufo_total, aes(x = year, y = total)) + geom_line() +
labs(x = "Year",
y = "UFO Sightings",
title = "Total Recorded UFO Sightings") +
theme_linedraw()
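As a side note, dplyr’s count() can do the group-and-tally step in a single call. The sketch below uses a tiny made-up tibble called toy (not the real ufo data) so that it runs on its own:

```r
library(dplyr)
library(lubridate)

# A tiny made-up stand-in for the ufo data.
toy <- tibble(
  date_time = ymd_hm(c("1999-01-01 10:00", "1999-06-15 22:30", "2005-03-02 01:15"))
)

# Long form: group by year, then count the rows in each group.
long_form <- toy %>%
  group_by(year = year(date_time)) %>%
  summarize(total = n())

# Short form: count() groups and tallies in one step.
short_form <- toy %>%
  count(year = year(date_time), name = "total")

long_form
short_form
```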
Cool beans. Now go make pictures and make the world a better place!