ggplot2 verbs
In this part, I will look at the necessary skill of data visualisation using the ggplot2 package. As mentioned in part 1, ggplot2 is one of the components of the Tidyverse and is frequently used by R users from beginners to data scientists. With ggplot2 and dplyr, even beginners can present good-looking statistical results and boost the quality of their work. Let’s begin then.
First of all, we need to load ggplot2 with library(ggplot2), along with dplyr and gapminder.
library(ggplot2)
library(dplyr)
library(gapminder)
David Robinson, Chief Data Scientist at DataCamp, says “Visualisation and data wrangling are often intertwined. Thus, ggplot2 and dplyr packages work closely together to create informative graphs.” This is so true: the two go together like bread and butter, and each makes the other better.
Before heading directly to the job, I will again use the gapminder dataset from part 1 and the dplyr skills covered there. What I am going to start with is variable assignment. Most of the time, we need to create variables when analysing data. In part 1, I used the mutate() function from dplyr to create a new variable while keeping the original dataset. This time, I will show how to assign a new dataset, again without altering the original dataset.
gapminder_2 <- gapminder %>%
  filter(year == 2007)
gapminder_2
## # A tibble: 142 x 6
##        country continent  year lifeExp       pop  gdpPercap
##         <fctr>    <fctr> <int>   <dbl>     <int>      <dbl>
##  1 Afghanistan      Asia  2007  43.828  31889923   974.5803
##  2     Albania    Europe  2007  76.423   3600523  5937.0295
##  3     Algeria    Africa  2007  72.301  33333216  6223.3675
##  4      Angola    Africa  2007  42.731  12420476  4797.2313
##  5   Argentina  Americas  2007  75.320  40301927 12779.3796
##  6   Australia   Oceania  2007  81.235  20434176 34435.3674
##  7     Austria    Europe  2007  79.829   8199783 36126.4927
##  8     Bahrain      Asia  2007  75.635    708573 29796.0483
##  9  Bangladesh      Asia  2007  64.062 150448339  1391.2538
## 10     Belgium    Europe  2007  79.441  10392226 33692.6051
## # ... with 132 more rows
When assigning a variable in R, the arrow made of a less-than sign followed by a minus sign, <-, is by convention the most frequently used operator. In the example above, the gapminder dataset has been taken, filtered for the observations of the year 2007, and the result stored in the gapminder_2 dataset.
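To check that the assignment really leaves the original data untouched, one quick option is to compare row counts. This is only a minimal sketch, assuming the dplyr and gapminder packages are installed; it reuses the same names as above.

```r
library(dplyr)
library(gapminder)

# Filter into a new object; gapminder itself is not modified
gapminder_2 <- gapminder %>%
  filter(year == 2007)

nrow(gapminder_2)         # one row per country for 2007
nrow(gapminder)           # the full dataset, all years still present
unique(gapminder_2$year)  # only 2007 remains in the new dataset
```

Because the filtered result is stored under a new name, gapminder can still be used later for other years.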
Now let’s see an example using ggplot2.
ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
This is the code for the scatterplot above. To use ggplot, we need to know at least three components in it.
ggplot() creates the plot from a dataset using the ggplot2 package.
aes(x = , y = ) maps variables to the x and y axes. “Aes” stands for aesthetic, by the way.
+ and geom_point() add a layer that draws a scatterplot. If you want to make a histogram rather than a scatterplot, use geom_histogram() in place of geom_point(), mapping only x, since a histogram takes a single variable.
It works very well, but a crucial problem with the graph above is that most of the cases (countries) are crammed into the leftmost part of the x-axis. It is very painful to look at because of its scale. What I will introduce therefore is the logarithmic scale.
The log scale spreads out values that span several orders of magnitude, so readers can more easily and quickly distinguish differences between countries. Let’s have a look then!
ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()
As can be seen, the graph looks more linear and the differences are easier to make out. The only difference between the log-scale code and the non-log-scale code is scale_x_log10(). Just attach it with the plus sign, +, right after geom_point(). That’s it!
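Why does the log scale spread the points out? On a log10 axis, every tenfold increase takes up the same distance. A small base R sketch shows the idea, using made-up GDP values that are purely illustrative and not taken from the dataset:

```r
# Hypothetical GDP-per-capita values, each ten times the previous one
gdp <- c(100, 1000, 10000, 100000)

# On a linear axis the gaps between neighbours keep growing
diff(gdp)

# On a log10 axis every gap has the same size
diff(log10(gdp))
```

This is why poor and rich countries no longer pile up at one end of the x-axis.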
If you want to make a log-log scale graph, simply add scale_y_log10(), again with the plus sign, to the end of the code above.
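As a sketch of that log-log version, assuming the same packages and the same 2007 subset as above (the object name p_loglog is just an illustrative choice):

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gapminder_2 <- gapminder %>%
  filter(year == 2007)

# Same scatterplot, with both axes on a log10 scale
p_loglog <- ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

p_loglog
```

Storing the plot in an object first is optional; typing the object name then draws it.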
When handling data that contains categorical variables, such as survey and census data, R beginners often hit a wall that slows their progress. Here, I will introduce an additional aesthetic inside the aes() function for plotting categorical variables.
A great way to show a categorical variable in a scatterplot is colour. See the example below.
ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
  geom_point() +
  scale_x_log10()
The only difference between the code right above and the original code is the contents of aes(). I added colour = continent to the aes() function of the original code. By adding it, we can easily spot which point represents which continent.
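Before mapping a variable to colour, it can help to check how many categories it has, since each one gets its own colour in the legend. A minimal sketch, assuming the same 2007 subset as above:

```r
library(dplyr)
library(gapminder)

gapminder_2 <- gapminder %>%
  filter(year == 2007)

# continent is a factor; each level gets its own colour in the legend
levels(gapminder_2$continent)

# How many countries fall into each continent
count(gapminder_2, continent)
```

With only a handful of levels, the colours stay easy to tell apart.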
To get deeper into ggplot2, let’s add another variable, population (pop), to the scatterplot we have been using. Since pop is a numeric variable, you might be wondering how we could show population without adding a z-axis. But it is still possible to work with just two axes if you use size = in aes().
ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp, colour = continent, size = pop)) +
  geom_point() +
  scale_x_log10()
Again, the only difference in the code above is the contents of aes(): size = pop has been added.
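It matters that size = sits inside aes(). The sketch below contrasts the two cases, assuming the same packages and 2007 subset; the object names p_mapped and p_fixed and the fixed size of 3 are arbitrary choices for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

gapminder_2 <- gapminder %>%
  filter(year == 2007)

# Inside aes(): point size is mapped to the pop variable
p_mapped <- ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10()

# Outside aes(): every point gets the same fixed size
p_fixed <- ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(size = 3) +
  scale_x_log10()
```

Only the first version tells the reader anything about population; the second simply makes all points larger.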
For the last part of today’s SLICC work, I will introduce another, fancier way of illustrating a categorical variable, called faceting. Have a look at my example first.
ggplot(gapminder_2, aes(x = gdpPercap, y = lifeExp, size = pop)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(~ continent)
Again, by now you might notice what has been added to and removed from the code: facet_wrap(~ continent) has been added and colour = continent has been removed. Within facet_wrap(), you might wonder what the tilde, ~, stands for. It can be read as “by” in R, so the plot is split into one panel per continent.
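For the curious, the tilde is not special to ggplot2: in base R it builds a formula object, which facet_wrap() then reads to find the faceting variable. A tiny base R sketch (the name f is just for illustration):

```r
# The tilde builds a formula object; nothing is evaluated yet
f <- ~ continent

class(f)     # the object is a formula
all.vars(f)  # the variable name stored inside it
```

The same notation appears in modelling functions such as lm(), where it is also read as “by” or “as a function of”.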
To sum up, I learnt and introduced five components of the ggplot2 package: ggplot(), aes(), geom_point(), scale_x_log10() and facet_wrap(~ ). Without knowing them, R is less powerful. To be a professional R programmer, work hard and study hard!