Image source: thenewstack
Why visualize?
Data visualization is the graphical representation of information and
data. By using visual elements like charts, graphs, and maps, data
visualization tools provide an accessible way to see and understand
trends, outliers, and patterns in data. Hidden within your data lie
important insights. But the challenge is that you can’t always connect
the dots by looking at raw numbers alone. When you look at your data
presented in a visual format, patterns, connections, and other “a-ha”
insights emerge that would otherwise remain out of sight.
Early forms of data visualization long predate the 17th century: they
can be traced back to the Egyptians, who largely used them to assist in
navigation.
As time progressed, people leveraged data visualizations for broader
applications, such as in economic, social, health, and environmental
disciplines.
There are many different ways to visualize data. Some of the most
common techniques are:
- Tables: These consist of rows and columns used to compare variables.
  Tables can show a great deal of information in a structured way, but
  they can also overwhelm users who are simply looking for high-level
  trends.
- Bar graphs and boxplots: These are used to represent and compare
  groups or categorical variables. Boxplots additionally show several
  descriptive statistics for each category, such as its median,
  quartiles, and range.
- Scatter plots: These visuals are useful for revealing the relationship
  between two variables, and they are commonly used in regression
  analysis. They are sometimes confused with bubble charts, which
  visualize three variables via the x-axis, the y-axis, and the size of
  the bubble.
- Pie charts and stacked bar charts: These graphs are divided into
  sections that represent parts (portions or percentages) of a whole.
  They provide a simple way to organize data and compare the size of
  each component to one another.
- Line charts and area charts: These visuals show change in one or more
  quantities by plotting a series of data points over time and are
  frequently used in predictive analytics. Line charts use lines to show
  these changes, while area charts connect data points with line
  segments, stack variables on top of one another, and use color to
  distinguish between variables.
- Histograms: This graph plots a distribution of numbers using a bar
  chart (with no spaces between the bars), where each bar represents the
  quantity of data (or frequency) that falls within a particular range.
  This visual makes it easy to identify outliers within a given dataset.
Important: By convention, the independent variable is plotted on the
x-axis and the dependent variable on the y-axis.
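As a small illustration of two of the chart types listed above, here is a minimal base R sketch. It assumes the built-in PlantGrowth dataset (one continuous variable, weight, and one categorical variable, group), which is not part of this lecture's data and is used only as a stand-in.

```r
# Base R sketch using the built-in PlantGrowth data
# (weight = continuous, group = categorical); the dataset choice is an
# assumption made for illustration only.

# Histogram: distribution of a single continuous variable
h <- hist(PlantGrowth$weight,
          main = "Distribution of plant weights",
          xlab = "Dried weight")

# Boxplot: independent (categorical) variable on the x-axis,
# dependent (continuous) variable on the y-axis
b <- boxplot(weight ~ group, data = PlantGrowth,
             xlab = "Treatment group", ylab = "Dried weight")

# The thick middle line of each box is the group median
b$stats[3, ]
```

Note how the boxplot follows the axis convention: the grouping variable sits on the x-axis and the measured quantity on the y-axis.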
So far we have been using base R visualization tools (e.g. when you
used the command: plot). However, there are various other
packages that we can use to produce nicer and more sophisticated graphs.
One of the most popular of such packages is called
ggplot2.
But what are R packages?
One of the primary reasons for R’s popularity is its extensive
package ecosystem. R’s main package repository, the Comprehensive R
Archive Network (CRAN), alone offers over 10,000 packages to choose
from. You can see the list of available packages on CRAN here.
Yet, when you first install R you only get a very limited set of core
packages “out of the box”. Any further packages that you’d like to use
you have to install yourself.
Installing packages
To be able to use a (non-base) package for the first time, we will
need to first install it on our system. Installing packages from CRAN
couldn’t be easier! Simply type install.packages() with the name of your
desired package in quotes as first argument.
install.packages("ggplot2")
Installing package into ‘/home/kazanjian/R/x86_64-pc-linux-gnu-library/4.3’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/ggplot2_3.4.4.tar.gz'
Content type 'application/x-gzip' length 3159578 bytes (3.0 MB)
==================================================
downloaded 3.0 MB
* installing *source* package ‘ggplot2’ ...
** package ‘ggplot2’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (ggplot2)
The downloaded source packages are in
‘/tmp/Rtmpxwm0IQ/downloaded_packages’
Note: Although I have included and executed the code within my
notebook here (to show you how it is done), it is usually advisable not
to include it in R notebooks but to run it directly in your console
(usually the box in the lower left corner of RStudio), because:
- You only need to install a package once on each computer you
use.
- It often produces very long output that makes your html (or
pdf) documents very messy.
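If you would still like installation to be handled from code, one option is to install a package only when it is missing. The following is a minimal sketch of that idea, not a required step; the variable names needed, is_installed and to_install are made up for this example.

```r
# Packages this notebook needs (only ggplot2 in this lecture)
needed <- c("ggplot2")

# requireNamespace() returns FALSE when a package is not installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

# Install only what is missing, so re-running this is quick and quiet
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

On a computer where ggplot2 is already installed, this does nothing and prints nothing.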
Once a package is installed on your system, you can load it to R by
using the command: library(package name). You will have
to do this every time after you restart R. Thus, unlike the
install.packages command, it is recommended to include this at
the top of each of your notebooks, to make sure the packages you need
are loaded.
library(ggplot2)
ggplot2
ggplot2 is a powerful data visualization package in the R
programming language. It is based on the grammar of graphics, which is a
way of describing and building graphs using a structured approach.
ggplot2 allows you to create a wide variety of plots and
graphics, including scatter plots, bar plots, line plots, histograms,
and more, with a high degree of customization and flexibility. It is now
over 10 years old and is widely used in the data science and statistical
communities for creating visually appealing and informative plots from
data.
While it is slightly more complex to use than the simple plots in
base R, once you understand its syntax (by practicing!) you will get the
hang of it quickly. It is best to imagine ggplot2 as making graphs out
of different layers. In most cases you start with ggplot() and supply a
dataset and an aesthetic mapping (with aes()). This will be the
background and axes of your graph. You then add on layers (like
geom_point() or geom_histogram()). These will be the points, lines or
bars on top of your background. On top of these you can add scales (like
scale_colour_brewer()), faceting specifications (like facet_wrap()) and
coordinate systems (like coord_flip()). We will cover some of these in
this lecture and more throughout our course.
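The layering idea described above can be sketched step by step. This example assumes the built-in iris dataset as a stand-in (it is not used elsewhere in this lecture) and adds one component per line so you can see what each one contributes.

```r
library(ggplot2)

# Background: dataset plus aesthetic mapping (axes only, nothing drawn yet)
p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species))

# Layer: draw the observations as points
p <- p + geom_point()

# Scale: choose a ColorBrewer palette for the colour aesthetic
p <- p + scale_colour_brewer(palette = "Dark2")

# Facets: one panel per species
p <- p + facet_wrap(~Species)

# Coordinate system: swap the x and y axes
p <- p + coord_flip()

# Print the finished plot
p
```

Printing p after any of the intermediate steps shows the graph as built so far, which is a handy way to explore what each component does.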
Let’s try it!
Now that we’ve loaded the ggplot2 package into R, let’s test it out.
But before we can put it to use, we need one more vital component that
is still missing: the actual data!
Let’s start with the dataset we used last week: fish_size.csv and
recreate the boxplot we made but with ggplot2 this time.
setwd("/home/kazanjian/Documents/R projects/ESS103/")
Warning: The working directory was changed to /home/kazanjian/Documents/R projects/ESS103 inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
fish <- read.csv("Data/fish_size.csv", header = TRUE, sep = ",")
head(fish)
In this dataset, we have 2 variables:
- 1 continuous variable: size (or length) of the fish
- 1 categorical variable: the type of the fish
As the size of a fish often depends on its type, we will assume that
the former is the dependent variable (so y-axis) and the latter is the
independent variable (so x-axis).
The simplest ggplot2 syntax follows this logic:
ggplot(dataset, aes(x-axis, y-axis)) + graph_type()
While typing the command in RStudio, you will often get a small pop-up
that helps you choose the syntax for the graph type you want.
So for our specific example, we need to rewrite the above as:
ggplot(fish, aes(fish_types, size)) + geom_boxplot()

Alternatively, if we do not want to view our data as a boxplot but
want to visualize all measurements as separate points, we can choose
geom_point() as the graph type instead of geom_boxplot():
ggplot(fish, aes(fish_types, size)) + geom_point()

Now let us make some modifications to the above graph. Assume our
points are too small, so we want to make them bigger, and make them red
to pop out! We can add these details within the parentheses of
geom_point():
ggplot(fish, aes(fish_types, size)) + geom_point(size=3, color="red")

The flexibility of ggplot2 comes to the fore with the
ability to add multiple visualization types at once. For the fun of it,
let’s try to combine both previous graphs (boxplot and points) in one
graph. We can even add some transparency to the points (alpha = 0 for
full transparency to 1 for no transparency) to not fully shield the
boxplots below them.
ggplot(fish, aes(fish_types, size)) +
geom_boxplot() +
geom_point(size=3, color="red", alpha = 0.2)

Our figure is still missing a title, however. This can be added with
the command labs. To also show our x- and y-axis titles more
professionally, we can set the x and y arguments of labs (the separate
commands xlab and ylab do the same). Lastly, you can experiment with
different themes like black and white (theme_bw), a dark theme
(theme_dark), a minimal theme (theme_minimal), etc. See below:
ggplot(fish, aes(fish_types, size)) +
geom_boxplot() +
geom_point(size=3, color="red", alpha = 0.2) +
labs(title= "Size of fishes by type",
x="Fish Types",
y="Length in cm") +
theme_bw()
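
The text above also mentions xlab, ylab and other themes, which the fish example does not show. As a hedged sketch of those alternatives, the following uses the built-in PlantGrowth dataset as a stand-in for the fish data (an assumption, since fish_size.csv is a course file); the labels are made up for the example.

```r
library(ggplot2)

# Built-in stand-in data (an assumption; fish_size.csv is not used here)
base_plot <- ggplot(PlantGrowth, aes(group, weight)) +
  geom_boxplot()

# Option 1: title and axis titles all inside labs()
p_labs <- base_plot +
  labs(title = "Plant weight by group", x = "Group", y = "Dried weight") +
  theme_bw()

# Option 2: the separate helper functions, with a different theme
p_alt <- base_plot +
  ggtitle("Plant weight by group") +
  xlab("Group") +
  ylab("Dried weight") +
  theme_minimal()   # theme_dark() is another built-in option

p_labs
p_alt
```

Both options produce the same title and axis titles; only the theme differs between the two plots.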

Now we have a professional-looking graph that we can export and use
anywhere.
How to export a graph
To export any graph produced by ggplot2, we can simply use
the ggsave command, as follows:
ggsave(filename of image to be saved, device = filetype)
where the file type can be e.g. png, jpeg, tiff, bmp, svg, or pdf.
Unless otherwise specified, it defaults to last plot displayed.
ggsave("Fish sizes by type.jpeg", device = "jpeg")
Saving 7 x 7 in image
The image should now appear in your working directory.
Alternatively, you can also right-click on the generated image and
select ‘Save image as…’. Plots produced by code executed in the console
appear in the ‘Plots’ tab in the bottom right window, where you can
also export them graphically.
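Rather than relying on the last plot displayed, ggsave can also be given the plot and its dimensions explicitly. Below is a minimal sketch of that usage; the PlantGrowth stand-in data, the file name plant_weights.pdf and the temporary output folder are all assumptions made for the example.

```r
library(ggplot2)

# A small plot to save (built-in data, used only for illustration)
p <- ggplot(PlantGrowth, aes(group, weight)) +
  geom_boxplot()

# Hypothetical output location: a file in R's temporary directory
out_file <- file.path(tempdir(), "plant_weights.pdf")

# Name the plot explicitly instead of relying on the last plot displayed,
# and fix the size so the export is reproducible
ggsave(filename = out_file, plot = p, width = 6, height = 4, units = "in")

# Check that the file was written
file.exists(out_file)
```

When the file name carries an extension such as .pdf or .png, ggsave picks the matching device on its own, so the device argument can be left out.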
Time series and line graphs
Recall that when we have continuous or time series data, the most
common type of graph to use is the line graph.
For this example, we will use the BOD dataset, which ships with base R
(in the datasets package).
head(BOD)
As you can see, we have 2 columns: 1 indicating time, the other
oxygen demand. We want to plot the O2 demand vs time. As O2 is the
dependent variable and time is the independent variable, the former goes
onto the y-axis.
Replace geom_point() with geom_line() to draw a
line instead of scatter points.
ggplot(BOD, aes(Time, demand))+
geom_line()

You can still show the points if you’d like to clarify when the
measurements were made. To do so, you can combine the geom_point()
and geom_line() commands in your code.
ggplot(BOD, aes(Time, demand))+
geom_line() +
# fixed values (not columns of the data) go outside aes()
geom_point(color = "red", alpha = 0.6, size = 2)

Visualization of multiparameter data
So far, we’ve only looked into 2 parameters (fish type vs size or one
parameter vs time). What if we had several parameters that we wanted to
visualize simultaneously?
Consider the dataset called CO2, which shows carbon dioxide uptake in
grass plants from an experiment on the cold tolerance of the grass
species Echinochloa crus-galli. The data has 5 columns as can
be seen below:
print(CO2)
As you can see, this dataset has 1 independent numeric variable
(conc), 1 dependent numeric variable (uptake), and 3 categorical
variables (Plant, Type, and Treatment)
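Instead of printing the whole dataset, the variable types described above can be checked more compactly. This is a minimal sketch using base R only; the helper name col_classes is made up for the example.

```r
# Compact overview: one line per column with its type and first values
str(CO2)

# Class of each column (first class only, since Plant is an ordered factor)
col_classes <- vapply(CO2, function(col) class(col)[1], character(1))
col_classes

# Levels of two of the categorical variables
levels(CO2$Type)
levels(CO2$Treatment)
```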
If we were to plot this like the first example, we’d plot
uptake vs concentration. As uptake is the dependent variable, it goes on
the y-axis. So we’ll have:
ggplot(CO2, aes(conc, uptake))+
geom_point()

But what if we also want to see a 3rd parameter (such as Treatment)
to investigate how it relates to the other two? We can add the Treatment
parameter to the above graph by representing it with different shapes.
ggplot(CO2, aes(conc, uptake, shape=Treatment))+
geom_point()

As you see, we have the exact same graph produced but instead of the
dots we have 2 different shapes, circle and triangle, representing the
nonchilled and chilled treatments, respectively.
Finally, we want to add a 4th parameter as well: Plant. For
this variable we will use different colors to distinguish the different
plants in our dataset.
In the previous graph, things were already slightly difficult to
distinguish due to the small sizes of the points. Now with the extra
parameter, it might be even harder to analyze. So to improve the above
graph, let’s make the dots slightly bigger and increase their
transparency:
ggplot(CO2, aes(conc, uptake, color=Plant, shape=Treatment))+
geom_point(size=4, alpha=0.6)

Facets
Sometimes, when you have a complex dataset with too many variables,
you’ll reach the limit to how many of the variables you can add into a
single graph while still keeping it simple enough to read, understand,
or analyze.
In those cases, sometimes it is better to split the data into subsets
and display them as multi panel plots. This functionality in
ggplot2 is called facets.
Take the above example, but now we also want to add Type to the
graph. We do that by splitting the data into 2 panels, distinguished by
the Type parameter.
ggplot(CO2, aes(conc, uptake, color=Plant, shape=Treatment))+
geom_point(size=4, alpha=0.6)+
facet_wrap(~Type)

Now to analyze, we can deduce that there is a clear distinction
between the Quebec and Mississippi types, with uptake being relatively
higher in the former. There is also a clear Treatment effect in
Mississippi plants, with chilled plants having lower uptake than
nonchilled plants. However, this distinction is less clear in the Quebec
plants.
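The visual reading above can be cross-checked with a quick numerical summary. A minimal sketch, using base R only, that computes the mean uptake for each combination of Type and Treatment (the name mean_uptake is made up for the example):

```r
# Mean CO2 uptake for every combination of Type (rows) and Treatment (columns)
mean_uptake <- with(CO2, tapply(uptake,
                                list(Type = Type, Treatment = Treatment),
                                mean))

# Rounded for easier reading
round(mean_uptake, 1)
```

Comparing the rows shows the difference between the two types, and comparing the columns within each row shows how large the treatment effect is for that type.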
Finally, to complete our graph, we want to add a title and fix the
axis labels:
ggplot(CO2, aes(conc, uptake, color=Plant, shape=Treatment))+
geom_point(size=4, alpha=0.6)+
facet_wrap(~Type)+
labs (title="Uptake of CO2 per plant and treatment type",
x="concentration",
y= "uptake")
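
The facet_wrap(~Type) call used above splits the data by a single variable. As a hedged variation, facet_grid can split by two variables at once, for example Treatment in rows and Type in columns, which frees the shape aesthetic. A minimal sketch:

```r
library(ggplot2)

# Rows split by Treatment, columns split by Type: 2 x 2 = 4 panels
p <- ggplot(CO2, aes(conc, uptake, color = Plant)) +
  geom_point(size = 3, alpha = 0.6) +
  facet_grid(Treatment ~ Type)

p
```

Whether this reads better than using shapes depends on the comparison you care about: panels make each subgroup easy to inspect, while a shared panel makes direct comparison easier.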

Try on your own
Load the dataset with data(economics). This dataset was produced from US
economic time series data available from https://fred.stlouisfed.org/.
It is a data frame with 574 rows and 6 variables.
First of all, view the first 6 rows of the dataset so you have an
idea what it looks like.
Now plot the number of unemployed vs date. Make sure the dependent
variable is on the y-axis. Add onto your figure the median duration of
unemployment (uempmed) as size of the dots and the personal
savings rate (psavert) as the color of the dots.
Export the figure. Analyze!
Tip: The figure should look like this: 
Thirsty for more? Here are some additional resources
There is a plethora of resources that you can find freely online for
data visualization in general and ggplot2 specifically. Here are a
few good suggestions if you want to delve deeper into the topic:
