Image source: thenewstack

Why visualize?

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Hidden within your data lie important insights. But the challenge is that you can’t always connect the dots by looking at raw numbers alone. When you look at your data presented in a visual format, patterns, connections, and other “a-ha” insights emerge that would otherwise remain out of sight.

Early forms of data visualization date back to well before the 17th century, for example to the ancient Egyptians, and were largely used to assist in navigation. Over time, people applied data visualization more broadly, in economic, social, health, and environmental disciplines.

There are many different ways to visualize data. Some of the most common techniques are:

Important: By convention, the independent variable is plotted on the x-axis and the dependent variable on the y-axis.

So far we have been using base R visualization tools (e.g. the plot command). However, various other packages can produce nicer and more sophisticated graphs. One of the most popular is ggplot2.

But what are R packages?

One of the primary reasons for R’s popularity is its extensive package ecosystem. R’s main package repository, the Comprehensive R Archive Network (CRAN), alone offers over 10,000 packages to choose from. You can see the list of available packages on CRAN here. Yet when you first install R, you get only a very limited set of core packages “out of the box”. Any further packages that you’d like to use, you have to install yourself.

Installing packages

To use a (non-base) package for the first time, we first need to install it on our system. Installing packages from CRAN couldn’t be easier! Simply type install.packages() with the name of your desired package in quotes as the first argument.

install.packages("ggplot2")
Installing package into ‘/home/kazanjian/R/x86_64-pc-linux-gnu-library/4.3’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/ggplot2_3.4.4.tar.gz'
Content type 'application/x-gzip' length 3159578 bytes (3.0 MB)
==================================================
downloaded 3.0 MB

* installing *source* package ‘ggplot2’ ...
** package ‘ggplot2’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (ggplot2)

The downloaded source packages are in
    ‘/tmp/Rtmpxwm0IQ/downloaded_packages’

Note: Although I have included and executed the code within my notebook here (to show you how it is done), it is usually advisable to run it directly in your console (usually the box in the lower left corner of RStudio) rather than include it in R notebooks, because:

  1. You only need to install a package once on each computer you use.
  2. Installation often produces very long output that makes your html (or pdf) output very messy.

Once a package is installed on your system, you can load it into R with the command library(package name). You have to do this every time you restart R. Thus, unlike the install.packages command, it is recommended to include this at the top of each of your notebooks, to make sure the packages you need are loaded.

library(ggplot2)
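
One possible way to combine the two steps, sketched here as an option rather than a requirement, is to install a package only when it is missing and then load it. The sketch relies on base R's requireNamespace(), which checks whether a package is available without loading it into your session.

```r
# Install ggplot2 only if it is not already available on this computer.
# requireNamespace() returns TRUE or FALSE instead of stopping with an error.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Loading is still needed in every new R session.
library(ggplot2)
```

With this pattern the long installation output only appears the first time the code runs on a given computer.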

ggplot2

ggplot2 is a powerful data visualization package in the R programming language. It is based on the grammar of graphics, which is a way of describing and building graphs using a structured approach. ggplot2 allows you to create a wide variety of plots and graphics, including scatter plots, bar plots, line plots, histograms, and more, with a high degree of customization and flexibility. It is now over 10 years old and is widely used in the data science and statistical communities for creating visually appealing and informative plots from data.

While it is slightly more complex than the simple plots in base R, you will get the hang of it quickly once you understand its syntax (by practicing!). It is best to think of ggplot2 as building a graph in layers. In most cases you start with ggplot() and supply a dataset and an aesthetic mapping (with aes()); this gives the background and axes of your graph. You then add layers (like geom_point() or geom_histogram()), which draw the points, lines, or bars on top of your background. Finally, you can add scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()), and coordinate systems (like coord_flip()). We will cover some of these in this lecture and more throughout our course.
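
The layering idea can be sketched step by step. This is only an illustration: it uses mtcars, a dataset that ships with R, rather than any course data, and the object names (base_plot, with_points, full_plot) are made up for this example.

```r
library(ggplot2)

# Step 1: dataset + aesthetic mapping give the empty background and axes.
# In mtcars, wt is car weight and mpg is fuel efficiency.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(gear)))

# Step 2: a geom layer draws the data on top of that background.
with_points <- base_plot + geom_point()

# Step 3: optional scale, facet, and coordinate specifications.
full_plot <- with_points +
  scale_colour_brewer(palette = "Set1") +  # colour scale
  facet_wrap(~cyl) +                       # one panel per cylinder count
  coord_flip()                             # swap the x and y axes

print(full_plot)
```

Printing base_plot on its own shows only the axes, which makes it easy to see what each added piece contributes.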

Let’s try it!

Now that we’ve loaded the ggplot2 package into R, let’s test it out. But before we can put it to use, we need one more vital component that is still missing: the actual data!

Let’s start with the dataset we used last week: fish_size.csv and recreate the boxplot we made but with ggplot2 this time.

setwd("/home/kazanjian/Documents/R projects/ESS103/")
Warning: The working directory was changed to /home/kazanjian/Documents/R projects/ESS103 inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
fish <- read.csv("Data/fish_size.csv", header = TRUE, sep = ",")
head(fish)

In this dataset, we have 2 variables: fish_types and size.

As the size of a fish often depends on its type, we will treat size as the dependent variable (so y-axis) and fish type as the independent variable (so x-axis).

The simplest ggplot2 syntax follows this logic:

ggplot(dataset, aes(x-axis, y-axis)) + graph_type()

While typing the command in RStudio, you will often get a small pop-up that helps you choose the syntax for the graph type you want.

So for our specific example, we need to rewrite the above as:

ggplot(fish, aes(fish_types, size)) + geom_boxplot()

Alternatively, if we do not want to view our data as a boxplot but want to show all measurements as separate points, we can choose geom_point() as the graph type instead of geom_boxplot().

ggplot(fish, aes(fish_types, size)) + geom_point()

Now let us make some modifications to the above graph. Assume our points are too small, so we want to make them bigger, and make them red so they pop out! We can add these details within the parentheses of geom_point():

ggplot(fish, aes(fish_types, size)) + geom_point(size=3, color="red")
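
Note where the size and color were placed: directly inside geom_point(), not inside aes(). A minimal sketch of the difference follows. It uses the built-in iris dataset as a stand-in (an assumption made only so the example runs without the course file), and the names p_set and p_map are made up for this illustration.

```r
library(ggplot2)

# Setting: colour is a fixed value, given outside aes().
# Every point is drawn in red.
p_set <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_point(size = 3, colour = "red")

# Mapping: colour depends on a column of the data, given inside aes().
# Each species gets its own colour and a legend is added.
p_map <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_point(aes(colour = Species), size = 3)

print(p_set)
print(p_map)
```

As a rule of thumb, fixed values go outside aes() and column names go inside it.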

The flexibility of ggplot2 comes to the fore with the ability to add multiple visualization types at once. For the fun of it, let’s combine both previous graphs (boxplot and points) in one graph. We can even add some transparency to the points (alpha ranges from 0 for full transparency to 1 for no transparency) so that they do not fully hide the boxplots below them.

ggplot(fish, aes(fish_types, size)) + 
  geom_boxplot() + 
  geom_point(size=3, color="red", alpha = 0.2)

Our figure is still missing a title, however. This can be added with the command labs. To show our x- and y-axis titles more professionally, we can use the x and y arguments of labs (or, equivalently, the separate commands xlab and ylab). Lastly, you can experiment with different themes such as black and white (theme_bw), a dark theme (theme_dark), a minimal theme (theme_minimal), etc. See below:

ggplot(fish, aes(fish_types, size)) + 
  geom_boxplot() + 
  geom_point(size=3, color="red", alpha = 0.2) + 
  labs(title= "Size of fishes by type", 
       x="Fish Types",
       y="Length in cm") +
  theme_bw()
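
To compare themes without retyping the whole command, one option is to store the plot in an object and add a different theme each time. The sketch below again uses the built-in iris dataset as a stand-in, and the object names are made up for this illustration.

```r
library(ggplot2)

# Build the plot once and store it in an object.
p <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_boxplot() +
  labs(title = "Sepal length by species",
       x = "Species",
       y = "Sepal length in cm")

# Add a different theme to the same stored plot each time.
p_bw      <- p + theme_bw()
p_dark    <- p + theme_dark()
p_minimal <- p + theme_minimal()

print(p_minimal)
```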

Now we have a professional-looking graph that we can export and use anywhere.

How to export a graph

To export any graph produced by ggplot2, we can simply use the ggsave command, as follows:

ggsave(filename of image to be saved, device = filetype)

The file type can be, for example, png, jpeg, tiff, bmp, svg, or pdf.

Unless otherwise specified, ggsave saves the last plot displayed.

ggsave("Fish sizes by type.jpeg", device = "jpeg")
Saving 7 x 7 in image

The image should now appear in your working directory.
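
If you prefer not to rely on the last plot displayed, a hedged sketch of a more explicit call is shown below. It stores a plot in an object, passes it through the plot argument, and sets the image size and resolution. The mtcars dataset and the temporary file path are assumptions made only to keep the example self-contained.

```r
library(ggplot2)

# Store the plot in an object so it can be passed to ggsave() explicitly.
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# tempfile() is used here only so the example does not clutter your project;
# in practice you would give a file name inside your working directory.
out_file <- tempfile(fileext = ".png")

# The file type is inferred from the ".png" extension of the file name.
ggsave(filename = out_file, plot = p,
       width = 6, height = 4, units = "in", dpi = 300)

file.exists(out_file)
```

Setting width, height, and dpi yourself makes the exported figure reproducible, independent of the size of your plot window.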

Alternatively, you can right-click on the generated image and select ‘Save image as…’. Plots produced by code executed in the console appear in the ‘Plots’ tab in the bottom right window, where you can also export them graphically.

Time series and line graphs

Recall that for continuous or time series data, the most common type of graph is the line graph.

For this example, we will use the BOD dataset, which ships with R (in the datasets package).

head(BOD)

As you can see, we have 2 columns: 1 indicating time, the other oxygen demand. We want to plot the O2 demand vs time. As O2 is the dependent variable and time is the independent variable, the former goes onto the y-axis.

Replace geom_point() with geom_line() to draw a line instead of scatter points.

ggplot(BOD, aes(Time, demand))+
  geom_line() 

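You can still show the points if you’d like to clarify when the measurements were made. To do so, use both the geom_line() and geom_point() commands in your code. Note that fixed values such as the color, transparency, and size of the points are given directly in geom_point(), not inside aes().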

ggplot(BOD, aes(Time, demand))+
  geom_line() +
  geom_point(color = "red", alpha = 0.6, size = 2)

Visualization of multiparameter data

So far, we’ve only looked into 2 parameters (fish type vs size or one parameter vs time). What if we had several parameters that we wanted to visualize simultaneously?

Consider the dataset called CO2, which shows carbon dioxide uptake in grass plants from an experiment on the cold tolerance of the grass species Echinochloa crus-galli. The data has 5 columns as can be seen below:

print(CO2)

As you can see, this dataset has 1 independent numeric variable (conc), 1 dependent numeric variable (uptake), and 3 categorical variables (Plant, Type, and Treatment)
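
Rather than printing all rows, the variable types can also be checked with a few base R commands. This is a small sketch; CO2 ships with R, so no package needs to be loaded.

```r
# Number of rows and columns in the dataset.
dim(CO2)

# Compact overview: the type and first few values of each column.
str(CO2)

# TRUE for numeric columns, FALSE for categorical ones.
sapply(CO2, is.numeric)
```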

If we were to plot this like the first example, we’d plot concentration vs uptake. As uptake is the dependent variable, it goes to the y-axis. So we’ll have:

ggplot(CO2, aes(conc, uptake))+
  geom_point()

But what if we also want to see a 3rd parameter (such as Treatment) and investigate how it relates to the other two? We can represent the Treatment parameter by different shapes in the above graph.

ggplot(CO2, aes(conc, uptake, shape=Treatment))+
  geom_point()

As you see, we have the exact same graph produced but instead of the dots we have 2 different shapes, circle and triangle, representing the nonchilled and chilled treatments, respectively.

Finally, we want to add a 4th parameter as well: Plant. For this variable we will use different colors to distinguish the different plants in our dataset.

In the previous graph, things were already slightly difficult to distinguish due to the small sizes of the points. Now with the extra parameter, it might be even harder to analyze. So to improve the above graph, let’s make the dots slightly bigger and increase their transparency:

ggplot(CO2, aes(conc, uptake, color=Plant, shape=Treatment))+
  geom_point(size=4, alpha=0.6)

Facets

Sometimes, when you have a complex dataset with too many variables, you’ll reach the limit to how many of the variables you can add into a single graph while still keeping it simple enough to read, understand, or analyze.

In those cases, sometimes it is better to split the data into subsets and display them as multi panel plots. This functionality in ggplot2 is called facets.

Take the above example, but now we also want to add Type to the graph. We do that by splitting the data into 2 panels, distinguished by the Type parameter.

ggplot(CO2, aes(conc, uptake, color=Plant, shape=Treatment))+
  geom_point(size=4, alpha=0.6)+
  facet_wrap(~Type)

Now to analyze, we can deduce that there is a clear distinction between the Quebec and Mississippi types, with uptake being relatively higher in the former. There is also a clear Treatment effect in Mississippi plants, with chilled plants having lower uptake than nonchilled plants. However, this distinction is less clear in the Quebec plants.
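
A visual impression like this can be cross-checked with a quick numerical summary. The sketch below uses base R's aggregate() to compute the mean uptake for each combination of Type and Treatment; the name mean_uptake is made up for this illustration.

```r
# Mean uptake for every combination of Type and Treatment.
mean_uptake <- aggregate(uptake ~ Type + Treatment, data = CO2, FUN = mean)

print(mean_uptake)
```

Comparing the four means shows whether the differences seen in the panels also hold on average.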

Finally, to complete our graph, we want to add a title and fix the axis labels:

ggplot(CO2, aes(conc, uptake, color=Plant, shape=Treatment))+
  geom_point(size=4, alpha=0.6)+
  facet_wrap(~Type)+
  labs (title="Uptake of CO2 per plant and treatment type", 
        x="concentration",
        y= "uptake") 
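
As an alternative layout, ggplot2 also offers facet_grid(), which arranges panels by two variables at once. The following sketch is one possible variation, not part of the example above: it moves Treatment from the shape of the points into the rows of the grid, and the name p_grid is made up for this illustration.

```r
library(ggplot2)

# facet_grid(rows ~ columns): Treatment defines the rows, Type the columns,
# giving one panel for each of the 4 combinations.
p_grid <- ggplot(CO2, aes(conc, uptake, color = Plant)) +
  geom_point(size = 4, alpha = 0.6) +
  facet_grid(Treatment ~ Type)

print(p_grid)
```

Whether this reads better than shapes within two panels depends on which comparison you want the reader to make first.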

Try on your own

Load the economics dataset with data(economics). This dataset, which ships with ggplot2, was produced from US economic time series data available from https://fred.stlouisfed.org/. It is a data frame with 6 variables and 574 rows.

First of all, view the first 6 rows of the dataset so you have an idea what it looks like.

Now plot the number of unemployed vs date. Make sure the dependent variable is on the y-axis. Add onto your figure the median duration of unemployment (uempmed) as size of the dots and the personal savings rate (psavert) as the color of the dots.

Export the figure. Analyze!

Tip: The figure should look like this: this

Thirsty for more? Here are some additional resources

There is a plethora of resources freely available online on data visualization in general and ggplot2 specifically. Here are a few good suggestions if you want to delve deeper into the topic:

