Libraries

Your first line of code will usually load the packages you need.

tidyverse

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  • tidyverse is a package that bundles many other packages.
  • The output above lists the packages it attaches; this lecture uses ggplot2.
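
If plotting is all you need, loading ggplot2 on its own should be enough. The sketch below assumes ggplot2 is installed; note that later parts of this lecture also use filter() and mutate(), which come from dplyr, so the full tidyverse is the more convenient choice here.

```r
# Attach only ggplot2 rather than the whole tidyverse.
library(ggplot2)

# search() lists the attached packages; this checks that ggplot2 is among them.
ggplot2_attached <- "package:ggplot2" %in% search()
ggplot2_attached
```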

GGPlot

Car Mileage

We will look at the mpg data set, which records fuel economy and other details for a range of car models, so we can look for patterns in it.

data(mpg) #data() lets us load the data
head(mpg)
## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
  • mpg is the data set that’s been loaded.

  • head() shows the first 6 rows by default; pass n to change that, e.g. head(mpg, n = 10).

Additional info

  • view(mpg) opens the entire data set in a separate tab.

  • Putting ? in front of the name of a documented object (a function or a built-in data set) opens its help page. Example: ?mpg gives you a rundown of the data.
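
A few other quick ways to inspect a data set, as a rough sketch. The names first_ten and last_three are made up for this example, and glimpse() is assumed to come from dplyr (it is also available once the tidyverse is loaded).

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)    # provides glimpse()

first_ten  <- head(mpg, n = 10)  # first 10 rows instead of the default 6
last_three <- tail(mpg, n = 3)   # last 3 rows
first_ten
last_three

dim(mpg)      # number of rows and number of columns
glimpse(mpg)  # one line per column, with its type and first few values
```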

Plotting the data

ggplot(data=mpg) + 
  geom_point(mapping = aes(x=displ, y = hwy))

  • ggplot(): creates an empty plot and tells it to use the mpg data set.
  • geom_point(): plots points to make a scatterplot.
  • mapping = aes(x=z, y=j): tells ggplot which column goes on which axis (z and j stand for any two columns of the data). aes is short for aesthetic.

Now make it colourful!

ggplot(data = mpg) +
  geom_point(mapping = aes(x=displ, y = hwy, color = class))

  • By adding color = class, each point is now coloured by its class! This makes it easier for us to spot patterns in our scatter plot.

    Notice how color = class is inside mapping.

What if we want to change the colour of the points without separating it by a categorical variable?

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy), color= "blue")

We do it like this!

Notice how color = "blue" is outside mapping. This is because making the points blue is not something we are mapping from the data but a design choice. If we put it inside aes(), ggplot treats "blue" as a data value: the points come out in a default colour and a legend labelled "blue" appears.
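
To see the difference for yourself, here is a small sketch that builds both versions. The names mapped and fixed are just labels chosen for this example.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as a one-level categorical variable,
# so ggplot picks its own default colour and draws a legend.
mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): every point is simply drawn in blue.
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

mapped
fixed
```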

How would you create a smooth line instead?

ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, color=class))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  • To create a curve, we use a new function called geom_smooth. There are many other “geoms” that allow us to plot data differently.
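
geom_smooth() also takes arguments that change what kind of line is fitted. As a sketch, assuming straight lines are what you want, method = "lm" fits a linear model per class and se = FALSE hides the shaded band around each line. The name straight is made up for this example.

```r
library(ggplot2)

# method = "lm" fits a straight line per class instead of the default loess curve;
# se = FALSE hides the shaded confidence band.
straight <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = class),
              method = "lm", se = FALSE)
straight
```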

If we use geom_point and geom_smooth together it will create something like this…

ggplot(data=mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Let’s simplify this code a bit.

ggplot(data=mpg,mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  • A mapping given inside ggplot() is inherited by every geom that follows, so we don’t have to repeat it in each one.
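
A geom can still add to the inherited mapping. In this sketch (the name shared is made up), x and y are set once in ggplot(), while colour is mapped only for the points, so the smoother is fitted to all the cars at once rather than once per class.

```r
library(ggplot2)

# x and y are shared by both layers; colour is mapped only in geom_point(),
# so geom_smooth() draws a single curve through all the points.
shared <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
shared
```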

Facet

Facets allow us to create multiple scatterplots split by the variable of your choice. Sometimes, even with the different colours, it can be hard to make sense of data when it is all plotted in a single graph.

How do we do this?

ggplot(data=mpg,mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point() + geom_smooth() +
  facet_wrap(~class, nrow = 2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  • facet_wrap() lets you create multiple scatterplots split by a single variable, in this case, class.
  • In this formula notation, ~ is needed before the variable!
  • nrow = 2 lets ggplot know we want the plots laid out in 2 rows.
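
If the ~ notation feels odd, ggplot2 also accepts vars() for naming the faceting variable. A small sketch, using ncol instead of nrow to fix the number of columns (the name by_class is made up for this example):

```r
library(ggplot2)

by_class <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(vars(class), ncol = 4)  # vars(class) is another way to write ~class
by_class
```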

What happens if we use facet for a continuous variable?

Let’s use hwy as our variable.

ggplot(data=mpg,mapping = aes(x = displ, y = hwy, color = manufacturer)) + 
  geom_point() + geom_smooth() +
  facet_wrap(~hwy, nrow = 4)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

  • It made one panel for every distinct hwy value! This makes no sense, so don’t facet on a continuous variable unless it only takes a few distinct values (year, for example).
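
One way around this, sketched below, is to cut the continuous variable into a few bands first and facet on the bands. The break points (20 and 30) and the names mpg_binned, hwy_band and banded are made up for illustration.

```r
library(ggplot2)

# Hypothetical bands chosen for illustration: up to 20, 20 to 30, over 30.
mpg_binned <- mpg
mpg_binned$hwy_band <- cut(mpg_binned$hwy, breaks = c(0, 20, 30, Inf))

# One panel per band instead of one panel per distinct hwy value.
banded <- ggplot(data = mpg_binned, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~hwy_band, nrow = 1)
banded
```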

We can also facet two variables!

ggplot(data=mpg,mapping = aes(x = displ, y = hwy, color = manufacturer)) + 
  geom_point() + 
  facet_grid(drv ~ cyl)

  • facet_grid() takes two variables separated by ~ and facets on both: the first (drv) defines the rows and the second (cyl) defines the columns.

    drv is the type of drive train (4 wheel, front wheel, rear wheel), cyl is the number of cylinders.

  • The graph might seem hard to read but imagine it as battleship:

    • the top left graph plots cars that have 4 cylinders and 4 wheel drive
    • the bottom left graph is empty because no car has 4 cylinders and rear wheel drive.
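
You can check which panels will be empty by counting the cars in each combination. This is a quick sketch using base R's table(); the name drv_by_cyl is made up for this example.

```r
library(ggplot2)  # for the mpg data set

# Rows are drv (4, f, r) and columns are cyl, matching the layout of the grid.
# A zero means the corresponding panel has no points.
drv_by_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_by_cyl
```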

We can also use facet_grid on one variable by using . at the beginning! The . means "no variable for the rows".

ggplot(data=mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cyl)

Layering Plots

ggplots are built in layers made up of data, mapping, geom and optionally stats. Data can be broken down into subsets using facets. When we + something, we are adding a new component, usually a layer, to the ggplot.

Gapminder data

The gapminder dataset contains life expectancy, population and GDP per capita for countries around the world, recorded every five years from 1952 to 2007.

Now let’s load the library and the data!

library(gapminder)
data(gapminder)
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Filtering data

Let’s plot life expectancy against GDP per capita for 2007.

gm07 <- filter(gapminder, year == 2007)
head(gm07)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       2007    43.8 31889923      975.
## 2 Albania     Europe     2007    76.4  3600523     5937.
## 3 Algeria     Africa     2007    72.3 33333216     6223.
## 4 Angola      Africa     2007    42.7 12420476     4797.
## 5 Argentina   Americas   2007    75.3 40301927    12779.
## 6 Australia   Oceania    2007    81.2 20434176    34435.
  • filter() keeps only the rows that match a condition, so that we only have relevant information.
  • In this example, we’ve created a new data frame called gm07 that only contains the rows from 2007.

ggplot(gm07, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point()

  • Using the gm07 data, we have made a scatterplot in which the points are coloured by continent.
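
filter() can take more than one condition at a time. The sketch below keeps the 2007 rows for two continents only; the name gm07_eurasia is made up for this example, and %>% is the pipe that dplyr makes available.

```r
library(dplyr)
library(gapminder)

# Conditions separated by commas must all be true for a row to be kept.
gm07_eurasia <- gapminder %>%
  filter(year == 2007, continent %in% c("Asia", "Europe"))
head(gm07_eurasia)
```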

GDP per capita spans a very wide range, so most of the points end up squeezed against the left edge of the graph. To counter that, we can take the log (base 10) of GDP per capita, which spreads the points out more evenly.

gm07 <- mutate(gm07, log10GdpPercap = log10(gdpPercap))

ggplot(gm07, aes(x=log10GdpPercap, y = lifeExp, color = continent)) +
  geom_point()

  • We used the function mutate to add a new column, log10GdpPercap, to gm07. It holds the base-10 log of gdpPercap and is used for our x instead of gdpPercap.
  • Now it is easier to see a pattern!
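
An alternative, sketched here, is to leave the data alone and put the x axis itself on a log scale with scale_x_log10(). The points should land in the same places, but the axis labels stay in the original units. The name log_axis is made up for this example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gm07 <- filter(gapminder, year == 2007)

# The axis is transformed instead of the data, so no extra column is needed.
log_axis <- ggplot(gm07, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
log_axis
```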

Now we’re ready to see layering in action!

Let’s mutate gapminder first.

gapminder <- mutate(gapminder, log10GdpPercap = log10(gdpPercap))
p <- ggplot(gapminder, aes(x=log10GdpPercap, y=lifeExp, color = continent))
g1 <- geom_point(alpha=0.1)
p1 <- p + g1
  • The two pieces are the base plot from ggplot() and the geom_point() layer, stored as p and g1 respectively.
  • p1 is the combination of the two.
  • alpha controls how opaque the points are. Here each point is 10% opaque, so places where many points overlap show up darker.

We can look at each layer separately

p #base of ggplot 

g1 #the point layer on its own (a description of the layer, not a plot)
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity
p1 #the combination of the two

How do we put two plots side by side?

We have to load a new library called gridExtra, but let’s make a new plot first.

g2 <- geom_point(alpha = 0.5)
p2 <- p + g2
p2

Now that we’ve created a new plot, let’s look at them side by side!

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1, p2, nrow = 1)

  • grid.arrange() takes in the plots and the number of rows we want as the parameters

  • There are many other ways to transform your data, such as stat_smooth, which fits a smooth curve through the points.

p + stat_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
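
To tie the last two ideas together, here is a rough sketch that puts two smoothers on top of faint points and shows them side by side. The names gm, base, curved and straight, and the plot titles, are made up for this example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)
library(gridExtra)

# Rebuild the base plot with faint points, as in the layering section.
gm <- mutate(gapminder, log10GdpPercap = log10(gdpPercap))
base <- ggplot(gm, aes(x = log10GdpPercap, y = lifeExp, color = continent)) +
  geom_point(alpha = 0.1)

# Same points, two different smoothers layered on top.
curved   <- base + stat_smooth() + ggtitle("Default smoother")
straight <- base + stat_smooth(method = "lm") + ggtitle("Straight lines (lm)")

grid.arrange(curved, straight, nrow = 1)
```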