library("ggplot2")
Let’s import two datasets about Game of Thrones:
da <- read.csv("character-deaths.csv")
bat <- read.csv("battles.csv")
da$Houses <- gsub("House ", "", da$Allegiances)
One of them is about characters, the other is about battles. Let’s explore them a bit:
str(da)
## 'data.frame': 917 obs. of 14 variables:
## $ Name : Factor w/ 916 levels "Addam Marbrand",..: 1 3 4 2 5 6 7 8 9 10 ...
## $ Allegiances : Factor w/ 21 levels "Arryn","Baratheon",..: 13 16 10 6 13 2 15 16 6 15 ...
## $ Death.Year : int NA 299 NA 300 NA NA 300 300 NA NA ...
## $ Book.of.Death : int NA 3 NA 5 NA NA 4 5 NA NA ...
## $ Death.Chapter : int NA 51 NA 20 NA NA 35 NA NA NA ...
## $ Book.Intro.Chapter: int 56 49 5 20 NA NA 21 59 11 0 ...
## $ Gender : int 1 1 1 1 1 1 1 0 1 1 ...
## $ Nobility : int 1 1 1 1 1 1 1 1 1 0 ...
## $ GoT : int 1 0 0 0 0 0 1 1 0 0 ...
## $ CoK : int 1 0 0 0 0 1 0 1 1 0 ...
## $ SoS : int 1 1 0 0 1 1 1 1 0 1 ...
## $ FfC : int 1 0 0 0 0 0 1 0 1 0 ...
## $ DwD : int 0 0 1 1 0 0 0 1 0 0 ...
## $ Houses : chr "Lannister" "None" "Targaryen" "Greyjoy" ...
Everything is rather simple: name, house, year, book, chapter of death, gender, nobility and appearance in every book.
Let’s assume that if Book.of.Death is NA, the character is not dead (at least not yet):
da$Alive <- is.na(da$Book.of.Death)
In addition, for most houses we have both House X and X, for example, House Baratheon and Baratheon. You can try to find some logic here (e.g., loyalty vs. belonging to a family), but for simplicity we will create a column without the House prefix:
da$Houses <- gsub("House ", "", da$Allegiances)
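A quick check on a made-up vector shows what gsub() does here. The sample values are invented for illustration and are not taken from the dataset:

```r
# Toy allegiances (invented, not read from character-deaths.csv)
allegiances <- c("House Baratheon", "Baratheon", "House Stark", "None")

# fixed = TRUE treats the pattern as a literal string rather than a regex
houses <- gsub("House ", "", allegiances, fixed = TRUE)

print(houses)
print(table(houses))
```

After the replacement, both spellings of a house fall into the same group.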
The last thing that I want to do is convert the columns Gender and Nobility from integer to factor:
da$Gender <- factor(da$Gender, labels = c("Female", "Male"))
da$Nobility <- factor(da$Nobility, labels = c("Not noble", "Noble"))
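How the labels are matched to the 0/1 codes is easy to get backwards, so here is a minimal sketch on an invented vector (gender_code is a made-up name):

```r
# Made-up codes in the same 0/1 style as the Gender column
gender_code <- c(1, 0, 1, 1)

# Levels are sorted first (0, then 1) and labels are matched in that order,
# so 0 becomes "Female" and 1 becomes "Male"
gender <- factor(gender_code, labels = c("Female", "Male"))

print(gender)
print(table(gender))
```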
str(bat)
## 'data.frame': 38 obs. of 25 variables:
## $ name : Factor w/ 38 levels "Battle at the Mummer's Ford",..: 13 1 7 14 18 10 25 5 3 17 ...
## $ year : int 298 298 298 298 298 298 298 299 299 299 ...
## $ battle_number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ attacker_king : Factor w/ 5 levels "","Balon/Euron Greyjoy",..: 3 3 3 4 4 4 3 2 2 2 ...
## $ defender_king : Factor w/ 7 levels "","Balon/Euron Greyjoy",..: 6 6 6 3 3 3 6 6 6 6 ...
## $ attacker_1 : Factor w/ 11 levels "Baratheon","Bolton",..: 10 10 10 11 11 11 10 9 9 9 ...
## $ attacker_2 : Factor w/ 8 levels "","Bolton","Frey",..: 1 1 1 1 8 8 1 1 1 1 ...
## $ attacker_3 : Factor w/ 3 levels "","Giants","Mormont": 1 1 1 1 1 1 1 1 1 1 ...
## $ attacker_4 : Factor w/ 2 levels "","Glover": 1 1 1 1 1 1 1 1 1 1 ...
## $ defender_1 : Factor w/ 13 levels "","Baratheon",..: 12 2 12 8 8 8 6 11 11 11 ...
## $ defender_2 : Factor w/ 3 levels "","Baratheon",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ defender_3 : logi NA NA NA NA NA NA ...
## $ defender_4 : logi NA NA NA NA NA NA ...
## $ attacker_outcome : Factor w/ 3 levels "","loss","win": 3 3 3 2 3 3 3 3 3 3 ...
## $ battle_type : Factor w/ 5 levels "","ambush","pitched battle",..: 3 2 3 3 2 2 3 3 5 2 ...
## $ major_death : int 1 1 0 1 1 0 0 0 0 0 ...
## $ major_capture : int 0 0 1 1 1 0 0 0 0 0 ...
## $ attacker_size : int 15000 NA 15000 18000 1875 6000 NA NA 1000 264 ...
## $ defender_size : int 4000 120 10000 20000 6000 12625 NA NA NA NA ...
## $ attacker_commander: Factor w/ 32 levels "","Asha Greyjoy",..: 8 6 9 22 16 18 6 30 2 28 ...
## $ defender_commander: Factor w/ 29 levels "","Amory Lorch",..: 7 4 10 28 12 14 15 1 1 1 ...
## $ summer : int 1 1 1 1 1 1 1 1 1 1 ...
## $ location : Factor w/ 28 levels "","Castle Black",..: 8 13 17 9 27 17 4 12 5 23 ...
## $ region : Factor w/ 7 levels "Beyond the Wall",..: 7 5 5 5 5 5 5 3 3 3 ...
## $ note : Factor w/ 6 levels "","Greyjoy's troop number based on the Battle of Deepwood Motte, in which Asha had 1000 soldier on 30 longships. T"| __truncated__,..: 1 1 1 1 1 1 1 1 1 2 ...
For battles we have many more columns but only a few rows (i.e. battles). The most important columns for us are attacker_size, defender_size and attacker_outcome.
It is important to know at least some basic plot functions. Base graphics are rather ugly and less flexible than ggplot2, but sometimes you just want to do something simple in one line.
plot() is a generic function. This means that it behaves differently depending on the input you give it. For example, for two numeric vectors it draws a scatterplot.
plot(bat$defender_size, bat$attacker_size)
Hint: try the function plot() on different objects and see what it returns.
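A few things to try, sketched with the built-in iris dataset so that it runs without the GoT files:

```r
# Built-in data only; each call dispatches to a different plot method
plot(iris$Species)                          # a factor: bar chart of counts
plot(iris$Sepal.Length)                     # one numeric vector: values against their index
plot(iris[, 1:4])                           # a data.frame: scatterplot matrix
plot(Sepal.Length ~ Species, data = iris)   # numeric by factor: box plots

# The class of the input is what decides which method is used
input_classes <- c(class(iris$Species), class(iris$Sepal.Length), class(iris[, 1:4]))
print(input_classes)
```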
Another easy function for plotting is hist():
hist(bat$defender_size)
The syntax is rather simple: there are a number of chart types, some parameters, etc. In general, it is not very different from MATLAB graphics or matplotlib for Python. However, it is time to go to the next level.
It’s time to use the whole power of R graphics with the external package ggplot2!
First, we need to install this package:
install.packages("ggplot2")
You need to install it only once, but you also need to attach the package using the function library():
library("ggplot2")
Remember to attach the packages you need every time (i.e. in every session) you use them!
gg in ggplot2 stands for “Grammar of Graphics”, a famous book by L. Wilkinson that lays out an outstanding theory of how to translate data into graphics. The book is not about creating fancy plots. It is more about the “grammar”, or rules, of visualization, based on object-oriented design and some math. I recommend this book if you like such things or want to understand ggplot2 better. Hadley Wickham (the developer of ggplot2, devtools, dplyr, stringr, readr, readxl, tidyr, lubridate and maybe the whole Universe) updated the idea with his “Layered Grammar of Graphics” and successfully implemented it in ggplot2.
We have already talked about different charts: scatterplots, boxplots etc. However, according to Wilkinson, thinking in terms of charts can limit our ability to create graphics. He gives an interesting example: the “Pie Chart”, which, as you of course know, is just a bar plot in a polar coordinate system.
pie <- ggplot(data = da, aes(x = "", fill = Gender))+
geom_bar(width = 1, position = "fill", color = "black")
pie
pie + coord_polar(theta = "y")+theme_void()
Think about it!
There are several abstract concepts that you need to understand to master ggplot2. First of all, there are layers. Each component of your graph (the underlying data, coordinate system, legend, title, additional lines, etc.) is a separate layer. This concept of a layer is rather similar to layers in Photoshop or in animation production.
“Aesthetics, in the original Greek sense, offers principles for relating sensory attributes (color, shape, sound, etc.) to abstractions.” (from “Grammar of Graphics”). In ggplot2, aesthetics (function aes()) are the elements of your plot that are associated with data:
- x position
- y position
- size of elements
- shape of elements
- color of elements
Geometries (geoms) are the actual graphical elements used in a plot. For example:
- points
- lines
- line segments
- bars
- text
Some of these geometries have their own aesthetics. For example, points can have different shapes and lines can have different line types.
Statistics is another important layer type. Sometimes you want to plot frequencies or some other things that you need to calculate before plotting. There are two major ways to do that:
Moreover, you can plot smoothing and regression lines, central tendencies and variability on your plot.
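As a rough illustration of statistics as a layer, the sketch below uses an invented data frame (toy and its columns are made up) and contrasts letting a geom compute the counts with plotting counts computed beforehand. Whether these are the two ways the text has in mind is an assumption.

```r
library(ggplot2)

# Invented data: one row per observation
toy <- data.frame(group = c("a", "a", "a", "b", "b", "c"))

# Option 1: the geom computes the statistic
# (stat = "count" is the default for geom_bar())
p_stat <- ggplot(toy, aes(x = group)) +
  geom_bar()

# Option 2: compute the counts yourself and plot them as they are
counts <- as.data.frame(table(group = toy$group))
p_pre <- ggplot(counts, aes(x = group, y = Freq)) +
  geom_col()   # geom_col() plots the y values without further computation

print(p_stat)
print(p_pre)
```

Both plots show the same bars; the difference is only in where the counting happens.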
Those were the most fundamental concepts of the grammar of graphics. In addition, there are scales (how your data are converted to specific geometric entities: the range, whether it is linear, etc.), coordinate systems (Cartesian by default, but we can change that) and facets (multiple subplots based on one or more variables).
Thus, the grammar of graphics allows you to explicitly describe components of any graphic. In addition, it simplifies complicated graphics to some very basic elements! We will show it with a pie-chart example.
First, we need to create a data layer. You need to use the function ggplot() with two important parameters:
- data - the dataset you want to plot (data.frame or data.table)
- aes() - a function that takes columns of your data.frame/data.table as arguments. You can leave it empty and specify aes() for every geom individually. Otherwise, geoms will inherit aes() from the main ggplot() call.
Let’s create this data layer for our bat dataset and set attacker_size and defender_size as aesthetics.
ggplot(data = bat, aes(x = attacker_size, y = defender_size))
Our current plot is empty. We need to specify a geometry. The best way to plot two continuous variables is a scatter plot, geom_point(). It is a layer. Each point of a scatter plot represents a case with two values: x and y. We need to use the ‘+’ operator to add a layer to a ggplot:
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point()
Technically, geom_point() is a function. It has many parameters, but it will inherit the essential ones (data and aes()) from ggplot() (unless they are specified inside geom_point()).
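To make the inheritance rule concrete, here is a minimal sketch on an invented data frame (toy, defender and attacker are made-up names standing in for the bat columns). Both versions draw the same points:

```r
library(ggplot2)

# Invented army sizes
toy <- data.frame(defender = c(100, 500, 2000),
                  attacker = c(300, 400, 1500))

# data and aes() given in ggplot(): every layer inherits them
p_inherit <- ggplot(toy, aes(x = defender, y = attacker)) +
  geom_point()

# Empty ggplot(): the geom gets its own data and aes()
p_local <- ggplot() +
  geom_point(data = toy, aes(x = defender, y = attacker))

print(p_inherit)
print(p_local)
```

The second form is handy when different layers should draw from different data frames.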
Our scatterplot is very boring and not really informative. First, let’s change colour to…hm… purple. I like purple!
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(colour = "purple")
And make points bigger.
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(colour = "purple", size = 3)
I don’t like circles.
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(colour = "purple", size = 3, shape = 8)
For now our scatterplot is less boring but it is still not informative.
Let’s add some colours based on the battle outcome. Now the colour depends on a variable, which means that we need to specify it as an aesthetic:
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)
We can also add labels, for example from the location column, with an additional layer:
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)+
geom_text(aes(label = location), size = 3)
Hm… We need to move these labels a little bit and delete overlapping labels:
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)+
geom_text(aes(label = location), size = 3, hjust = -0.2, check_overlap = T)
Not ideal, but it is much better.
Let’s try to add some summary statistics on the plot:
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)+
geom_text(aes(label = location), size = 3, hjust = -0.2, check_overlap = T)+
geom_smooth()
Rather ugly =( We have very few points and variability is very high.
We need to garnish our scatter plot.
First, I want to rescale this plot and change the limits of the x and y axes. There are many ways to do it, but I recommend the coord_cartesian() function with the xlim and ylim parameters. You need to specify a vector of two numbers for each parameter (the limits for each axis).
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)+
geom_text(aes(label = location), size = 3, hjust = -0.2, check_overlap = T)+
coord_cartesian(xlim = c(0, 100000), ylim = c(0, 100000) )
Ahhhggrrrhhh, the battle at Castle Black spoils everything! If we set other limits, we will skip this point:
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)+
geom_text(aes(label = location), size = 3, hjust = -0.2, check_overlap = T)+
coord_cartesian(xlim = c(0, 25000), ylim = c(0, 25000) )
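One reason to prefer coord_cartesian() is that it only zooms the view, while setting limits on the scales with xlim() and ylim() turns out-of-range values into missing values before anything is drawn or computed. A minimal sketch on invented data (toy is a made-up data frame):

```r
library(ggplot2)

# Three ordinary points and one far-away point
toy <- data.frame(x = c(1, 2, 3, 100), y = c(2, 4, 6, 200))

# Zoom only: all four rows stay in the data, the view is just cropped
p_zoom <- ggplot(toy, aes(x, y)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 10), ylim = c(0, 10))

# Scale limits: the point (100, 200) is out of bounds and is dropped,
# usually with a warning about removed rows
p_cut <- ggplot(toy, aes(x, y)) +
  geom_point() +
  xlim(0, 10) + ylim(0, 10)

print(p_zoom)
print(p_cut)
```

The two plots look the same here, but the difference matters as soon as a layer such as geom_smooth() computes something from the data.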
In general, dropping a point from the figure is a bad idea. Another option is a log scale. You can do it in ggplot2 in three ways. One of them is scale_x_log10() and scale_y_log10(); the example here uses coord_trans():
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)+
geom_text(aes(label = location), size = 3, hjust = -0.2, check_overlap = T)+
coord_trans(x = "log10", y = "log10", limx = c(1, 100000), limy = c(1,100000))
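For comparison, the scale-based variant can be sketched on an invented data frame (toy and its columns are made up). The scale functions transform the data before any statistics are computed, whereas coord_trans() transforms the coordinate system afterwards:

```r
library(ggplot2)

# Invented army sizes spanning several orders of magnitude
toy <- data.frame(defender = c(10, 100, 1000, 100000),
                  attacker = c(20, 300, 500, 100000))

# The axes are labelled in original units, but positions are log10 values
p_scale <- ggplot(toy, aes(x = defender, y = attacker)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

print(p_scale)
```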
Also, we need to change the title and the axis names. As usual, there are different ways to do it. I recommend the function labs() with the parameters title, x and y:
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)+
geom_text(aes(label = location), size = 3, hjust = -0.2, check_overlap = T)+
coord_trans(x = "log10", y = "log10", limx = c(1, 100000), limy = c(1,100000))+
labs(title = "Example 1. GoT battles: army sizes and outcome",
x = "Defender's army",
y = "Attacker's army")
…and change a legend:
ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)+
geom_text(aes(label = location), size = 3, hjust = -0.2, check_overlap = T)+
coord_trans(x = "log10", y = "log10", limx = c(1, 100000), limy = c(1,100000))+
labs(title = "Example 1. GoT battles: army sizes and outcome",
x = "Defender's army",
y = "Attacker's army")+
scale_colour_discrete(name="Outcome",
labels=c("Unknown","Defender's win", "Attacker's win"))
Note: you need to use functions such as scale_fill_manual, scale_colour_hue, scale_colour_manual, scale_shape_discrete, scale_linetype_discrete etc., depending on your variables and aesthetics. Therefore, changing legends can be very frustrating.
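As one example from this family, scale_colour_manual() sets the legend title, the labels and the colours in one place. A minimal sketch on invented data (toy and the chosen colours are made up):

```r
library(ggplot2)

# Invented outcomes
toy <- data.frame(x = c(1, 2, 3),
                  y = c(3, 1, 2),
                  outcome = c("loss", "win", "win"))

p_manual <- ggplot(toy, aes(x, y)) +
  geom_point(aes(colour = outcome), size = 3) +
  scale_colour_manual(name = "Outcome",
                      # a named vector ties each colour to a data value
                      values = c(loss = "firebrick", win = "steelblue"),
                      # labels follow the order of the levels: loss, win
                      labels = c("Defender's win", "Attacker's win"))

print(p_manual)
```

The function name has to match the aesthetic: scale_colour_* for colour, scale_fill_* for fill, and so on.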
ggplot(da, aes(x = Gender))+
geom_bar()
ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender))
Fill is not the same as colour!
ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender), colour = "grey20", stat = "count")
For bars, densities, histograms etc., colour changes the colour of the border lines and fill changes the colour of the figure itself.
Let’s add a title:
ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender), colour = "grey20")+
labs(title = "Example 2. Number of males and females")
… and rotate our graphic:
ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender), colour = "grey20")+
labs(title = "Example 2. Number of males and females")+
coord_flip()
It is interesting to plot gender by different houses:
ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender), colour = "grey20")+
labs(title = "Example 2. Number of males and females")+
coord_flip()+
facet_wrap(~Allegiances)
Default ggplot2 graphics are very nice (at least at first glance). Nevertheless, there are many more parameters that you can adjust. To change “deeper” parameters of ggplot2 graphics (like the background, the size and style of titles, etc.) you need to modify the theme.
To get a feel for the variety of theme parameters, you can try some built-in themes:
ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender), colour = "grey20", position = position_dodge(0.9))+
labs(title = "Example 2. Number of males and females")+
coord_flip()+
theme_classic()
ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender), colour = "grey20", position = position_dodge(0.9))+
labs(title = "Example 2. Number of males and females")+
coord_flip()+
theme_bw()
Moreover, we can modify a theme with the theme() function. It is rather challenging because of the large number of parameters that you can change. Some parameters are more general, and more specific parameters inherit from them. For more information, check the help for theme(): ?theme
Let’s make our graphic crazy!
ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender), colour = "grey20", position = position_dodge(0.9))+
labs(title = "Example 2. Number of males and females")+
coord_flip()+
theme(line = element_line(colour = "red",linetype = 2, size = 3),
rect = element_rect(colour = "blue", fill = "purple", size = 4),
text = element_text(face = "bold.italic", colour = "green", size = 30, angle = 5),
panel.background = element_rect(colour = "red", fill = "brown"))
Ha-aha-ha-ahaha! HAHAHAHAHAH! Ha! Hm… Sorry.
Let’s plot a histogram for army sizes.
We need to transform the data from wide to long format. I usually use the data.table package for that, but you can use tidyr (a companion of dplyr) if you prefer.
Installing data.table:
install.packages("data.table")
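If you need only one function, you can call it from a package without attaching the package. There is a special operator :: for that. Here bat is converted to a data.table first, because data.table::melt() is written for data.table objects and, depending on the version, may warn or fail on a plain data.frame:
batlong <- data.table::melt(data.table::as.data.table(bat), measure.vars = c("attacker_size", "defender_size"), variable.name = "battle_role", value.name = "army_size")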
Let’s have a look:
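To see what the long format looks like, here is a minimal sketch on a made-up three-battle table. The numbers and battle names are invented; only the column names mirror the bat data:

```r
library(data.table)

# Wide format: one row per battle, one column per army
wide <- data.table(name = c("Battle A", "Battle B", "Battle C"),
                   attacker_size = c(15000, 6000, NA),
                   defender_size = c(4000, 12625, 120))

# Long format: one row per battle and role
long <- melt(wide,
             measure.vars = c("attacker_size", "defender_size"),
             variable.name = "battle_role",
             value.name = "army_size")

print(long)
```

Every battle now appears twice, once per role, and the two size columns are stacked into army_size.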
Ok, we can plot:
ggplot(data = batlong, aes(x = army_size))+
geom_histogram(bins = 50)
Let’s plot two histograms separately:
ggplot(data = batlong, aes(x = army_size, fill = battle_role))+
geom_histogram(bins = 50, position="identity")
They are overlapping! Let’s make them transparent.
ggplot(data = batlong, aes(x = army_size, fill = battle_role))+
geom_histogram(bins = 50, alpha = 0.5, position="identity")
…and add title with appropriate legend:
ggplot(data = batlong, aes(x = army_size, fill = battle_role))+
geom_histogram(bins = 50, alpha = 0.5, position="identity")+
labs(title = "Example 3. GoT battles: Distribution of army sizes",
x = "Army size",
y = "")+
scale_fill_discrete(name="Battle role",
labels=c("Attacker","Defender"))
Actually, if you have overlapping elements, you can choose what to do:
- “stack” one on top of another (stack, the default for geom_histogram())
- let them overlap (identity, which we use)
- divide the full height of the plot proportionally between conditions (fill)
- add some “noise” (jitter, sometimes recommended for geom_point())
- place them side by side (dodge, recommended for bar plots with multiple conditions)
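The position options are easiest to compare on a small bar plot. A minimal sketch on invented data (toy, house and gender are made-up stand-ins):

```r
library(ggplot2)

# Invented data: house A has 1 female and 2 males, house B has 1 of each
toy <- data.frame(house = c("A", "A", "A", "B", "B"),
                  gender = c("Female", "Male", "Male", "Female", "Male"))

base <- ggplot(toy, aes(x = house, fill = gender))

p_stack <- base + geom_bar(position = "stack")   # one on top of another
p_dodge <- base + geom_bar(position = "dodge")   # side by side
p_fill  <- base + geom_bar(position = "fill")    # proportions, each bar reaches 1

print(p_stack)
print(p_dodge)
print(p_fill)
```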
Instead of a histogram (geom_histogram()) we can use a density plot (geom_density()). It will create something like a smoothed histogram:
ggplot(data = batlong, aes(x = army_size, fill = battle_role))+
geom_density(alpha = 0.5)+
labs(title = "Example 3. GoT battles: Distribution of army sizes",
x = "Army size",
y = "")+
scale_fill_discrete(name="Battle role",
labels=c("Attacker","Defender"))
Another popular geometry is the violin. It is very similar to geom_density(), but rotated, mirrored and shown as a separate figure for each group.
ggplot(data = batlong, aes(x = army_size, fill = battle_role))+
geom_violin(aes(x = battle_role, y = army_size), alpha = 0.5)+
labs(title = "Example 3. GoT battles: Distribution of army sizes",
x = "",
y = "Army size")+
scale_fill_discrete(name="Battle role",
labels=c("Attacker","Defender"))
Many people think that violin plots are much better than box plots, another popular graphic among psychologists. The “box and whiskers plot” is a rather convenient way to plot psychological data. The bottom and top of the box are the first and the third quartiles of the plotted distribution (the points below which 25% and 75% of the ranked data fall), and the line inside the box is the median (i.e. the second quartile). Whiskers represent… well, it varies. Wikipedia knows at least 5 possible variants for plotting whiskers. The defaults for geom_boxplot() whiskers can be found in the help for this function:
The upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge, where IQR is the inter-quartile range, or distance between the first and third quartiles. The lower whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. Data beyond the end of the whiskers are outliers and plotted as points (as specified by Tukey).
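The rule quoted above can be followed by hand. A minimal sketch on an invented vector (the values are made up, and quantile() is used with its default settings):

```r
# Invented army sizes with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q <- quantile(x, c(0.25, 0.75))   # hinges: first and third quartiles
iqr <- unname(q[2] - q[1])        # inter-quartile range

# Whiskers end at the most extreme values within 1.5 * IQR of the hinges
upper_whisker <- max(x[x <= q[2] + 1.5 * iqr])
lower_whisker <- min(x[x >= q[1] - 1.5 * iqr])

# Everything beyond the whiskers is drawn as a separate point
outliers <- x[x > upper_whisker | x < lower_whisker]

print(c(lower = lower_whisker, upper = upper_whisker))   # lower 1, upper 9
print(outliers)                                          # 50
```

Note that the whiskers stop at actual data values (1 and 9), not at the computed limits themselves.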
In addition, we can combine violins and box plots! You need to add the geometry geom_boxplot() for that:
ggplot(data = batlong, aes(x = battle_role, y = army_size))+
geom_violin(aes(fill = battle_role), alpha = 0.5)+
geom_boxplot(width = 0.25)+
labs(title = "Example 3. GoT battles: Distribution of army sizes",
x = "",
y = "Army size")+
scale_fill_discrete(name="Battle role",
labels=c("Attacker","Defender"))
In case you want more ggplot2 stuff:
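I recommend trying the cowplot package. It provides a minimalistic theme that is a good fit for scientific figures. Versions before 1.0 set this theme as the default when the package is attached; in later versions you add theme_cowplot() to a plot or call theme_set(theme_cowplot()) yourself.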
install.packages("cowplot")
library("cowplot")
ggplot(data = batlong, aes(x = battle_role, y = army_size))+
geom_violin(aes(fill = battle_role), alpha = 0.5)+
geom_boxplot(width = 0.25)+
labs(title = "Example 3. GoT battles: Distribution of army sizes",
x = "",
y = "Army size")+
scale_fill_discrete(name="Battle role",
labels=c("Attacker","Defender"))
If you want to restore the default theme:
theme_set(theme_grey())
Plotly is a JavaScript-based package for dynamic plots in R. Yes, interactive plots!
install.packages("plotly")
library(plotly)
plot_ly(bat, x = ~defender_size,
y = ~attacker_size,
color = ~defender_size,
size = ~attacker_size,
hoverinfo = 'text',
text = ~paste("Attackers:", attacker_size, "led by", attacker_1, '<br>Defenders:', defender_size, "led by", defender_1, "<br>Outcome for attackers:", attacker_outcome, '<br>Battle type:', battle_type, '<br>', location, " (",region,"), year", year))%>%
layout(title = "Army sizes in Game of Thrones battles",
xaxis = list(title = "Defending army size"),
yaxis = list(title = "Attacking army size")
)
In addition, you can just take a ggplot object and the simple function ggplotly() to make an interactive plot. Oh, did you know that you can save a ggplot2 figure as an object? Since the grammar of graphics allows you to describe every graphic, you can save it as an object and plot or modify it later:
p1 <- ggplot(data = bat, aes(x = defender_size, y = attacker_size))+
geom_point(aes(colour = attacker_outcome), size = 3, shape = 8)
ggplotly(p1)
p2 <- ggplot(da, aes(x = Gender))+
geom_bar(aes(fill = Gender), colour = "grey20")+
labs(title = "Example 2. Number of males and females")+
coord_flip()+
facet_wrap(~Allegiances)
ggplotly(p2)
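Saved plots can also be extended later with ordinary ggplot2 layers. A minimal sketch on invented data (toy and the title are made up):

```r
library(ggplot2)

# Invented army sizes
toy <- data.frame(defender = c(100, 500, 2000),
                  attacker = c(300, 400, 1500))

# Save the basic plot as an object
p <- ggplot(toy, aes(x = defender, y = attacker)) +
  geom_point()

# Adding to a saved plot returns a new plot; p itself is unchanged
p_titled <- p + labs(title = "Toy battles") + theme_bw()

print(p)
print(p_titled)
```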
Plotly is not the only tool for dynamic visualization; you can see more at ggplot2 extensions.