Week 2 - Functions in R and Basic Grammars: part 2 - ggplot2 verbs

Who are you, ggplot?

In this part, I will look at the necessary skill of data visualisation using the ggplot2 package. As mentioned in part 1, ggplot2 is one of the components of the Tidyverse and is widely used by R users, from beginners to data scientists. With ggplot2 and dplyr, even beginners can produce good-looking statistical graphics and boost the quality of their work. Let’s begin.

First of all, we need to load ggplot2 with library(ggplot2), together with dplyr and gapminder, which are used throughout this part.

library(ggplot2)
library(dplyr)
library(gapminder)

David Robinson, Chief Data Scientist at DataCamp, says “Visualisation and data wrangling are often intertwined. Thus, ggplot2 and dplyr packages work closely together to create informative graphs.” This is so true: the two go together like bread and butter, and each makes the other better.

Before getting to the plotting itself, I will again use the gapminder dataset and the dplyr skills from part 1. The first thing I want to cover is variable assignment. Most of the time, we need to create variables when analysing data. In part 1, I used the mutate() function from dplyr to create a new variable while keeping the original dataset. This time, I will show how to assign a new dataset, again without changing the original dataset.

gapminder_2 <- gapminder %>%
  filter(year == 2007)

gapminder_2
## # A tibble: 142 x 6
##        country continent  year lifeExp       pop  gdpPercap
##         <fctr>    <fctr> <int>   <dbl>     <int>      <dbl>
##  1 Afghanistan      Asia  2007  43.828  31889923   974.5803
##  2     Albania    Europe  2007  76.423   3600523  5937.0295
##  3     Algeria    Africa  2007  72.301  33333216  6223.3675
##  4      Angola    Africa  2007  42.731  12420476  4797.2313
##  5   Argentina  Americas  2007  75.320  40301927 12779.3796
##  6   Australia   Oceania  2007  81.235  20434176 34435.3674
##  7     Austria    Europe  2007  79.829   8199783 36126.4927
##  8     Bahrain      Asia  2007  75.635    708573 29796.0483
##  9  Bangladesh      Asia  2007  64.062 150448339  1391.2538
## 10     Belgium    Europe  2007  79.441  10392226 33692.6051
## # ... with 132 more rows

When assigning a variable in R, the assignment operator <-, a less-than sign followed by a minus sign, is used by convention. In the example above, the gapminder dataset has been filtered for the observations from the year 2007, and the result has been stored in the gapminder_2 dataset.
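As a quick check that assignment leaves the original data alone, the row counts of the two datasets can be compared. This is only a minimal sketch that reuses the names from the example above; it assumes the dplyr and gapminder packages are installed.

```r
library(dplyr)
library(gapminder)

# Assign the filtered rows to a new name; gapminder itself is not modified
gapminder_2 <- gapminder %>%
  filter(year == 2007)

nrow(gapminder)    # rows in the original dataset, all years
nrow(gapminder_2)  # rows in the 2007 subset, one per country
```

The first count should stay the same before and after the assignment, since filter() returns a new tibble instead of editing gapminder in place.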

Now let’s see an example using ggplot2.

ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point()

The code above draws a scatterplot of life expectancy against GDP per capita. To use ggplot, we need to know at least three components of it.

  • First, ggplot() creates the plot and takes the dataset to draw from, here gapminder_2.
  • Second, aes(x = , y = ) maps variables in the dataset to the x and y axes. “Aes” stands for aesthetic, by the way.
  • Last, the plus sign, +, adds a layer, and geom_point() draws a scatterplot. If you want to make a histogram rather than a scatterplot, use geom_histogram() in place of geom_point(); a histogram needs only an x aesthetic.
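The histogram variant mentioned in the last bullet could look roughly like this. It is a sketch under a few assumptions: the choice of lifeExp as the variable, the bins = 30 setting and the name p_hist are mine for illustration, not part of the original example.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gapminder_2 <- gapminder %>%
  filter(year == 2007)

# A histogram counts observations in bins, so only x is mapped in aes()
p_hist <- ggplot(gapminder_2, aes(x = lifeExp)) +
  geom_histogram(bins = 30)  # bins = 30 is an arbitrary choice for illustration

p_hist
```

The structure is the same as for the scatterplot: ggplot() with the data, aes() for the mapping, then + and a geom.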

It works very well, but a crucial problem with this graph is that most of the cases (countries) are crammed into the leftmost part of the x-axis. The graph is hard to read because of its linear “scale”. What I will introduce next, therefore, is the logarithmic scale.

The Usefulness of a Log Scale Graph

A log scale lets readers distinguish differences in a variable more easily and quickly when its values are spread over a wide range. Let’s have a look!

ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

As can be seen, the relationship now looks closer to linear, and the differences between countries are easier to make out. The only difference between the log scale code and the non-log scale code is scale_x_log10(). Just attach it with the plus sign, +, right after geom_point(). That’s it!

If you want to make a log-log scale graph, simply add scale_y_log10() to the end of the code above.
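Written out in full, the log-log version might look like the sketch below. The name p_loglog is hypothetical, and whether a log scale on life expectancy is actually useful for this dataset is a separate question; the code only shows where scale_y_log10() goes.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gapminder_2 <- gapminder %>%
  filter(year == 2007)

# Both axes on a base-10 log scale
p_loglog <- ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

p_loglog
```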

What about categorical variables in ggplot?

When handling data that contains categorical variables, such as survey and census data, R beginners often run into a wall that hinders their progress. Here, I will introduce an additional ‘aesthetic’ inside the aes() function for plotting categorical variables.

A great way to show a categorical variable in a scatterplot is colour. See the example below.

ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point() +
  scale_x_log10()

The only difference between the code right above and the original code is what is inside aes(). I added colour = continent to the aes() function of the original code. By adding it, we can easily see which points represent which continent.
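To see which categories the colour legend will contain, and how many points fall into each, dplyr can be used alongside the plot. This is a small sketch of one way to check; the name continent_counts is hypothetical.

```r
library(dplyr)
library(gapminder)

gapminder_2 <- gapminder %>%
  filter(year == 2007)

# One row per continent, with the number of countries (points) in each
continent_counts <- gapminder_2 %>%
  count(continent)

continent_counts
```

Each row of the result corresponds to one colour in the legend, which is a handy sanity check when a colour seems to be missing from the plot.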

To get deeper into ggplot, let’s add another variable, population (pop), to the scatterplot we have been using. Since pop is a numeric variable, you might be wondering how we could show population without adding a z-axis. It is still possible to work with two axes if you use size = in aes(), which scales the size of each point by the country’s population.

ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp, colour = continent, size = pop)) +
  geom_point() +
  scale_x_log10()

Again, the only difference in the code above is inside aes(): size = pop has been added.

For the last part of today’s SLICC work, I will introduce another, fancier way of illustrating a categorical variable, called faceting. Have a look at my example first.

ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() + 
  scale_x_log10() +
  facet_wrap(~ continent)

Again, by now you might have noticed what has been added to and removed from the code: facet_wrap(~ continent) has been added and colour = continent has been removed. Within facet_wrap(), you might wonder what the tilde, ~, stands for. It creates a formula in R and, by convention, can be read as “by”, so this line splits the plot into one panel per continent.
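To make the tilde a little less mysterious, the sketch below looks at what ~ continent is on its own, and then shows an equivalent spelling of the facet. The names f and p_facet are hypothetical, and the vars() form assumes a reasonably recent ggplot2 (version 3.0.0 or later).

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gapminder_2 <- gapminder %>%
  filter(year == 2007)

# The tilde builds a formula object that names the faceting variable
f <- ~ continent
class(f)     # "formula"
all.vars(f)  # "continent"

# Same faceting as facet_wrap(~ continent), written with vars()
p_facet <- ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(vars(continent))

p_facet
```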

To sum up, I learnt and introduced five components of the ggplot2 package: ggplot(), aes(), geom_point(), scale_x_log10() and facet_wrap(~ ). Without knowing them, R is less powerful. To become a professional R programmer, work hard and study hard!