ggplot2 is a widely used data visualisation package for the R programming language, and is often regarded as one of the most user-friendly and versatile plotting tools in R. With ggplot2, you can map variables in your data to visual aesthetics such as colour, shape, size, and line type to display your data in different ways. ggplot2 also supports a wide range of plot types, including histograms, box plots, line graphs, density plots, violin plots, and many more.
ggplot2 follows a layered approach: you create a base for what will become your final plot, and then add layers on top of it to build up the finished visualisation. For example, you might start with a basic scatter plot, then add a regression line, and finally add labels. Each layer enhances your plot without making the code overly complicated.
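As a rough sketch of the layering idea (assuming ggplot2 is already installed, which is covered further down), a plot can be stored in a variable and extended one layer at a time. The object names here are made up purely for illustration.

```r
library(ggplot2)

# The base: only declares the data and which columns go on each axis
scatter_base <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width))

# Layer 1: draw the observations as points
with_points <- scatter_base + geom_point()

# Layer 2: add a straight regression line on top of the points
with_line <- with_points + geom_smooth(method = "lm", formula = y ~ x)

# Finish with readable axis labels
final_plot <- with_line + labs(x = "Petal length", y = "Petal width")

final_plot
```

Printing any of the intermediate objects shows the plot at that stage, which can be a handy way to see what each layer contributes.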
For this short introduction to ggplot2, we are going to use the built-in iris data set, which comes with base R itself (not RStudio), so it is available without installing anything extra.
As a quick introduction: the iris data set is a small data set, often used in machine learning and introductory data visualisation, which contains 150 flowers described by four measurements in centimetres (sepal length, sepal width, petal length, and petal width), as well as a species classification.
To have a look at the data set, you can use the base R function ‘head()’ to view the first six rows and get a feel for what the data contains.
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
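‘head()’ only shows the top of the table. As a small complement, a few other base R functions give a fuller picture of the data set: its size, the type of each column, and how the 150 flowers are split across the three species (50 each).

```r
# iris ships with R's built-in datasets package, so nothing needs loading
dim(iris)            # number of rows and columns
str(iris)            # type of each column, with a preview of its values
table(iris$Species)  # how many flowers there are of each species
```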
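Like any other package on CRAN, ggplot2 is installed once with ‘install.packages()’ and then loaded in each session with ‘library()’.
Also install and load ‘dplyr’. Loading it makes the pipe operator, %>%, available (the operator originally comes from the magrittr package), which is very useful for sending data into your ggplot: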
# install.packages("ggplot2")
library(ggplot2)
# install.packages('dplyr')
library(dplyr)
Note: if you do not have the packages installed already, remove the ‘#’ in front of the ‘install.packages’ lines before running them. You only need to install a package once, but you need to call ‘library()’ in every new R session.
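If you would rather not comment and uncomment lines by hand, one possible approach is to install only the packages that are missing. This is just a sketch; the variable names are illustrative, and ‘install.packages()’ will need an internet connection if it does run.

```r
# Packages this tutorial relies on
needed <- c("ggplot2", "dplyr")

# requireNamespace() returns TRUE when a package is already installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Install only what is missing, then load both packages
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

library(ggplot2)
library(dplyr)
```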
Now that we have installed the packages and understand some of the data that we get for free with R, we can start to play around with some of the graphs and smaller details of ggplot2 to make some cool visualisations!
In the code below, the iris data set is piped into ‘ggplot()’. The aesthetics are set inside ‘aes()’, which here maps the ‘x’ axis to ‘Sepal.Length’ (column names are case-sensitive). A ‘geom_histogram()’ layer is then added, which draws the data as a histogram so we can see the distribution of sepal length.
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram()
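Running the code above prints a message saying that 30 bins were used by default and suggesting a better value be picked with ‘binwidth’. As a small sketch, the bar width can be set explicitly; the value 0.25 is only an example, so try a few to see how the shape of the distribution changes.

```r
library(ggplot2)
library(dplyr)

# binwidth is in the units of the x axis, so each bar covers 0.25 cm here
hist_plot <- iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25)

hist_plot
```

Alternatively, ‘bins = 20’ (for example) sets the number of bars instead of their width.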
Great job making that first histogram! But how can we make it a little more appealing to someone looking at it for the first time? We can add some colour and clearer axis titles to better inform the viewer.
Adding colour = ‘white’ draws a white outline around each bar, which helps the reader see where the bars are separated. You can use whatever colour you like!
You can add even more colour with the ‘fill’ option, which sets the colour inside the bars.
You can also change your axis labels and title using the ‘labs’ function within the package, which lets you give the viewer more description.
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(colour = 'white', fill = 'lightpink') +
labs(x = 'Sepal Length',
y = 'Total Count',
title = 'Count of Sepal Length from Iris data set')
Now that we have used histograms to see the distribution of a single variable, let's look at a scatter plot to see the correlation between different variables within the data set.
A scatter plot is a chart that uses dots to show how two things are related to each other, like how height and weight might be related for a group of people.
Here we will explore the correlation between petal length and petal width as an example.
When plotting a scatter plot, the aesthetics take more than one mapping: both an x and a y variable, which are plotted against each other.
To finish the base plot, a ‘geom_point’ is added to draw a point for each observation.
iris %>%
ggplot(aes(x = Petal.Length, y = Petal.Width)) +
geom_point()
You might be used to seeing correlation plots with a regression line going through them. Don’t stress, one can easily be added to your plot with one extra line of code!
A ‘geom_smooth’ can be added to draw a line of best fit through your data, along with the method used to fit it; in our case, we will use a linear model (‘lm’). The grey band around the line is the 95% confidence interval for the fitted line, which is derived from its standard error.
iris %>%
ggplot(aes(x = Petal.Length, y = Petal.Width)) +
geom_point() +
geom_smooth(method = 'lm')
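To put numbers on what the plot shows, the same straight line can be fitted directly with base R's ‘lm()’, and the strength of the relationship measured with ‘cor()’. This is a minimal sketch; it also shows the ‘se = FALSE’ argument of ‘geom_smooth’, which hides the grey band if you find it distracting.

```r
library(ggplot2)
library(dplyr)

# Fit the same straight line that geom_smooth(method = 'lm') draws
fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(fit)   # intercept and slope of the line

# Pearson correlation between the two variables (between -1 and 1)
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
petal_cor

# The same plot as above, without the shaded confidence band
iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
```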
To showcase more plotting options, you can add some colour to this plot too! In this example, colour is used to show the differences between the species of flower. Because the colour now depends on a column in the data, the ‘colour’ argument must be placed inside ‘aes’.
iris %>%
ggplot(aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
geom_point()
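The difference between placing ‘colour’ inside and outside ‘aes’ is easy to mix up, so here is a short side-by-side sketch. The colour name ‘steelblue’ is just an example.

```r
library(ggplot2)
library(dplyr)

# Mapped: colour depends on a column, so it goes inside aes() and gets a legend
mapped_plot <- iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point()

# Fixed: one colour for every point, so it goes outside aes(), in the geom
fixed_plot <- iris %>%
  ggplot(aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(colour = 'steelblue')

mapped_plot
fixed_plot
```

This is the same rule the histogram followed earlier: colour = ‘white’ was a fixed value, so it sat inside ‘geom_histogram’ rather than ‘aes’.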
Finally, the last addition to the scatter plot is ‘size’ as another visual tool. Size can be used to show the magnitude of a variable. For example, on a plot of NBA players, a player who scores more points could be drawn as a larger point, which draws the viewer's attention to it.
In this example, size is determined by the sepal width, meaning flowers with a larger sepal width have larger points on the plot, and those with a smaller sepal width have smaller points.
iris %>%
ggplot(aes(x = Petal.Length,
y = Petal.Width,
colour = Species,
size = Sepal.Width)) +
geom_point()
Now that we have looked at how to plot distributions of data and compare variables to see some rough correlations, how can we take a more statistical look at the data?
To do this, we can use a boxplot, which shows a five-number summary of a variable in a single plot. A boxplot presents the minimum, first quartile, median, third quartile, and maximum, and also highlights any outliers within the data.
Just like the other plots, start off by piping your iris data set into the ggplot. This time, we will look at sepal width across the whole data set.
You can alter how the boxplot looks depending on whether you choose to place your data on the x or y axis:
y = Vertical boxplot
x = Horizontal boxplot
In the result, the box is made from the first and third quartiles, with the median as the line through the middle. The whiskers extend to the smallest and largest values that lie within 1.5 times the interquartile range of the box, and any values beyond that are drawn as individual points, which are the outliers.
iris %>%
ggplot(aes(y = Sepal.Width)) +
geom_boxplot()
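To see the numbers behind the plot, base R can compute the five-number summary and list the outlying values directly. Treat this as a rough cross-check rather than an exact reproduction: these functions calculate the quartiles slightly differently from ggplot2, so the values should closely match the boxplot but are not guaranteed to be identical for every data set.

```r
# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five_numbers <- fivenum(iris$Sepal.Width)
five_numbers

# Values more than 1.5 x IQR away from the box, drawn as separate points
outliers <- boxplot.stats(iris$Sepal.Width)$out
outliers
```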
After making the first boxplot, we can see that it is a little dull and could definitely tell us more. You can split the data into separate boxplots by species in a few different ways; I will show you two of them below.
The first will seem familiar from before: mapping colour to species (‘col’ is a short alias for ‘colour’) tells ggplot to draw a separate boxplot for each species.
iris %>%
ggplot(aes(y = Sepal.Width, col = Species)) +
geom_boxplot()
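With only the colour mapped, the legend is the only way to tell the boxes apart. A common alternative, sketched here, is to put the species on the x axis as well, so each box gets its own label. Using ‘fill’ instead of ‘colour’ is simply a style choice that shades the inside of each box.

```r
library(ggplot2)
library(dplyr)

# One box per species along the x axis, with sepal width on the y axis
species_box <- iris %>%
  ggplot(aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot()

species_box
```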
The second option is to use ‘facet_wrap’, a function that lets you create multiple small subplots (facets) based on one or more categorical variables.
It is used by adding it to the plot and writing a tilde ( ~ ) followed by your variable. I have also included an extra argument: setting ‘scales’ to ‘free’ allows each facet to have its own x and y scales, fitted to its own data.
iris %>%
ggplot(aes(y = Sepal.Width)) +
geom_boxplot() +
facet_wrap( ~ Species, scales = 'free')
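Free scales make each box fill its panel, but they also make the panels harder to compare, because the same height no longer means the same value. As a sketch of the alternative, leaving ‘scales’ out keeps one shared axis across all facets, which is the default behaviour.

```r
library(ggplot2)
library(dplyr)

# Default (fixed) scales: every facet shares the same y axis
shared_scales <- iris %>%
  ggplot(aes(y = Sepal.Width)) +
  geom_boxplot() +
  facet_wrap(~ Species)

shared_scales
```

Which version is better depends on whether you want to compare the species with each other or look at each one on its own.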
Themes in ggplot are basically like Instagram filters for your data visualisations. You can choose from different themes to change the background, grid lines, fonts, and overall style of your plot. It’s a cool way to make your charts match the look of your report or presentation.
For this example, we will go back to the histogram from the start to display some of the different themes ggplot offers.
Here are some examples of themes:
BW theme
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(colour = 'white', fill = 'lightpink') +
labs(title = 'bw theme') +
theme_bw()
Classic theme
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(colour = 'white', fill = 'lightpink') +
labs(title = 'Classic theme') +
theme_classic()
Dark theme
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(colour = 'white', fill = 'lightpink') +
labs(title = 'Dark theme') +
theme_dark()
Minimal theme
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(colour = 'white', fill = 'lightpink') +
labs(title = 'Minimal theme') +
theme_minimal()
Line draw theme
iris %>%
ggplot(aes(x = Sepal.Length)) +
geom_histogram(colour = 'white', fill = 'lightpink') +
labs(title = 'Line draw theme') +
theme_linedraw()
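The five examples above repeat the same histogram code each time. One way to cut down on the repetition, sketched here with a made-up variable name, is to build the plot once, store it, and then add a different theme to it. The last example also shows two optional tweaks: ‘base_size’ scales all the text in a theme, and ‘theme()’ adjusts individual elements.

```r
library(ggplot2)
library(dplyr)

# Build the histogram once and keep it in a variable
pink_hist <- iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(colour = 'white', fill = 'lightpink', bins = 30)

# Add a different theme to the same stored plot each time
pink_hist + theme_bw() + labs(title = 'bw theme')
pink_hist + theme_minimal() + labs(title = 'Minimal theme')

# Larger text overall, plus a bold title
custom_hist <- pink_hist +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(face = 'bold')) +
  labs(title = 'Classic theme, larger text')

custom_hist
```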
To conclude, ggplot2 is a massive package with a huge range of possibilities, and we have only scratched the surface of what it can do in terms of visualisation. There is plenty of help online to support your own journey with plotting awesome graphs and charts. Just remember to always choose a chart type that is appropriate for what your data is meant to show. I hope this helped introduce you to the ggplot2 package and gave you some ideas about what you can make and display with your very own data.