In this lesson, we will take a look at how to visualize data using the powerful ggplot2 package.

Materials adapted from this resource: https://psych252.github.io/psych252book/visualization-1.html

1 Load packages

Let’s first load the packages that we need for this lesson. You can click on the green arrow to execute the code chunk below.

You may need to install the package first. If so, you can type install.packages("<the package’s name>") into your console.

Example: install.packages("tidyverse")

library("knitr")     # for rendering the RMarkdown file
library("tidyverse") # for plotting (and many more cool things we'll discover later).
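
If you are not sure whether the packages are already installed, a check along the following lines can help. This is only a sketch: the names pkgs, is.installed, and missing.pkgs are made up for illustration, and the install step is left commented out so that nothing gets installed unless you decide to.

```r
# Packages used in this lesson
pkgs = c("knitr", "tidyverse")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is.installed = vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (possibly none)
missing.pkgs = pkgs[!is.installed]
print(missing.pkgs)

# Uncomment to install whatever is missing:
# if (length(missing.pkgs) > 0) install.packages(missing.pkgs)
```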

2 Why visualize data?

Anscombe’s quartet (see it here: https://seaborn.pydata.org/examples/anscombes_quartet.html) illustrates the importance of visualizing data. Even though the datasets I-IV have nearly identical summary statistics (mean, standard deviation, correlation), they look very different from each other once plotted.

Check out this resource to learn more: https://www.autodeskresearch.com/publications/samestats
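
We can check the claim about the summary statistics ourselves, because base R ships the quartet as the anscombe data frame (columns x1 to x4 and y1 to y4). The following is a minimal sketch; stats.anscombe is just an illustrative name.

```r
# anscombe is part of base R (datasets package), so no packages are needed
stats.anscombe = sapply(1:4, function(i) {
  x = anscombe[[paste0("x", i)]]
  y = anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    cor_xy = cor(x, y))
})
colnames(stats.anscombe) = c("I", "II", "III", "IV")

# Each column is one dataset; compare the values across columns
print(round(stats.anscombe, 2))
```

The four columns agree to about two decimal places, which is exactly why the summary statistics alone cannot tell the datasets apart.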

Tip: Always plot the data first!

3 Setting up a plot

3.1 Import data

Let’s first get some data. Here we assign the dataframe diamonds to the variable “df.diamonds”. After you run this chunk, you will see df.diamonds populate in your Environment.

df.diamonds = diamonds

The diamonds dataset comes with the ggplot2 package. We can get a description of the dataset by running the following command:

?diamonds

Let’s take a look at the data by clicking on it in your Environment.
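
If you prefer the console to clicking, the data can also be inspected with code. The sketch below loads ggplot2 directly (it is part of the tidyverse) so that it runs on its own.

```r
library("ggplot2")  # the diamonds data ships with ggplot2

df.diamonds = diamonds

dim(df.diamonds)    # number of rows and columns
head(df.diamonds)   # the first six rows
str(df.diamonds)    # column names and types
```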

3.2 Visualizing the Diamonds Data

The df.diamonds data frame contains information about roughly 54,000 diamonds (53,940 rows), including their price, carat value, size, etc. Let’s use visualization to get a better sense for this dataset.

We start by setting up the plot. To do so, we pass a data frame to the function ggplot() in the following way.

ggplot(data = df.diamonds)

This, by itself, only draws an empty panel. We also need to specify what to plot.

Let’s take a look at how much diamonds of different color cost. The help file says that diamonds labeled D have the best color, and diamonds labeled J the worst color. Let’s make a bar plot that shows the average price of diamonds for different colors.

We do so by specifying a mapping from the data to the plot aesthetics with the function aes(). We need to tell aes() what we would like to display on the x-axis, and the y-axis of the plot.

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price))

Here, we specified that we want to plot color on the x-axis, and price on the y-axis. As you can see, ggplot2 has already figured out how to label the axes. However, we still need to specify how to plot it.

3.2.1 Bar plot

Let’s make a bar graph:

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price)) +
  stat_summary(fun = "mean",
               geom = "bar")

These few lines of code produce an almost-publication-ready plot! Note how we used a + at the end of the ggplot() call to specify that there will be more. This is a very powerful concept underlying ggplot2. We can start simple and keep adding things to the plot step by step using ‘+’.

We used the stat_summary() function to define what we want to plot (the “mean”), and how (as a “bar” chart). Let’s take a closer look at that function.

help(stat_summary)

Not the easiest help file … We supplied two arguments to the function, fun = and geom =.

  1. The fun argument specifies what function we’d like to apply to the data for each value of x. Here, we said that we would like to take the mean and we specified that as a string.
  2. The geom (= geometric object) argument specifies how we would like to plot the result, namely as a “bar” plot.
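
To make concrete what stat_summary() computes here, the same per-color means can be calculated by hand with base R’s tapply(). The sketch below loads ggplot2 directly so that it runs on its own; mean.price and p are just illustrative names. It also shows that fun accepts the function itself as well as its name as a string.

```r
library("ggplot2")

df.diamonds = diamonds

# Mean price for each level of color: these are the bar heights in the plot
mean.price = tapply(df.diamonds$price, df.diamonds$color, mean)
print(round(mean.price))

# Same bar plot, passing the function itself rather than its name
p = ggplot(data = df.diamonds,
           mapping = aes(x = color,
                         y = price)) +
  stat_summary(fun = mean,
               geom = "bar")
print(p)
```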

Instead of showing the “mean”, we could show the “median”.

ggplot(data = df.diamonds,
       mapping = aes(x = color,
                     y = price)) +
  stat_summary(fun = "median",
               geom = "bar")

Another way to make a bar chart is using geom_bar(). Here is an example of how we might use geom_bar().

ggplot(data = df.diamonds,
       mapping = aes(x = color)) +
  geom_bar()

See how this plot shows the frequencies of each color!
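
By default, geom_bar() counts the rows for each value of x. As a rough check, the same counts can be computed with table() and then drawn with geom_col(), which plots values as they are. This is a sketch; color.counts, df.counts, and p are illustrative names, and Var1 and Freq are the column names that as.data.frame() gives a one-way table.

```r
library("ggplot2")

df.diamonds = diamonds

# Number of diamonds for each color
color.counts = table(df.diamonds$color)
print(color.counts)

# Turn the table into a data frame with columns Var1 (color) and Freq (count)
df.counts = as.data.frame(color.counts)

# geom_col() draws the precomputed counts directly
p = ggplot(data = df.counts,
           mapping = aes(x = Var1,
                         y = Freq)) +
  geom_col()
print(p)
```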

And instead of making a bar plot, we could plot some points.

ggplot(df.diamonds,
       aes(x = color,
           y = price)) +
  stat_summary(fun = "mean",
               geom = "point")

Tip: Take a look at the ggplot2 reference documentation to see what other geoms ggplot2 supports.

3.2.2 Scatter plot

Let’s say we want to see how the price of diamonds differs as a function of the carat value. Since we are interested in the relationship between two continuous variables, plotting a bar graph won’t work. Instead, let’s make a scatter plot. Let’s put the carat value on the x-axis, and the price on the y-axis.

ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price)) +
  geom_point()

Figure 3.1: Scatterplot.

That looks sensible! Diamonds with a higher carat value tend to have a higher price. Our dataset has 53940 rows. So the plot actually shows 53940 circles even though we can’t see all of them since they overlap.
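
One common way to deal with overlapping points is to make them partly transparent via the alpha argument of geom_point(). The sketch below uses alpha = 0.1, which is an arbitrary choice for illustration; try a few values to see what works for your data.

```r
library("ggplot2")

df.diamonds = diamonds

# alpha = 0.1 makes each point 90% transparent, so dense regions appear darker
p = ggplot(data = df.diamonds,
           mapping = aes(x = carat,
                         y = price)) +
  geom_point(alpha = 0.1)
print(p)
```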

Let’s make some progress on trying to figure out why the diamonds with the better color weren’t the most expensive ones on average. We’ll add some color to the scatter plot in Figure 3.1. We color each of the points based on the diamond’s color. To do so, we pass another argument to the aesthetics of the plot via aes().

ggplot(data = df.diamonds,
       mapping = aes(x = carat,
                     y = price,
                     color = color)) +
  geom_point()

Figure 3.2: Scatterplot with color.

Now we’ve got some color corresponding to the discrete variable color. Notice how in Figure 3.2 ggplot2 added a legend for us. From just eyeballing the plot, it looks like the diamonds with the best color (D) tended to have a lower carat value, and the ones with the worst color (J) tended to have the highest carat values.

So this likely explains why diamonds with better colors are less expensive on average: these diamonds have a lower carat value overall.
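
Eyeballing a plot can be misleading, so it is worth backing up the impression with numbers. The sketch below computes the mean carat value for each color with base R’s tapply(); mean.carat is just an illustrative name.

```r
library("ggplot2")

df.diamonds = diamonds

# Mean carat value for each color, from D (best) to J (worst)
mean.carat = tapply(df.diamonds$carat, df.diamonds$color, mean)
print(round(mean.carat, 2))
```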

There are many other things that we can define in aes(). Take a quick look at the vignette:

vignette("ggplot2-specs")

3.2.3 Line plot

What else do we know about the diamonds? The cut variable ranges from “Fair” to “Ideal”. Let’s take a look at the relationship between cut and price. This time, we’ll make a line plot instead of a bar plot (just because we can).

ggplot(data = df.diamonds,
       mapping = aes(x = cut,
                     y = price)) +
  stat_summary(fun = "mean",
               geom = "line")
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

All we did was replace x = color with x = cut, and geom = "bar" with geom = "line". However, the plot doesn’t look as expected (i.e. there is no real plot). What happened here? The line plot needs to know which points to connect. The message tells us that each group consists of only one observation. Let’s adjust the group aesthetic to fix this.

ggplot(data = df.diamonds,
       mapping = aes(x = cut,
                     y = price,
                     group = 1)) +
  stat_summary(fun = "mean",
               geom = "line")

By adding the parameter group = 1 to mapping = aes(), we specify that we would like all the levels in x = cut to be treated as coming from the same group. The reason for this is that cut (our x-axis variable) is a factor (and not a numeric variable), so, by default, ggplot2 tries to draw a separate line for each factor level.

Interestingly, there is no simple relationship between the quality of the cut and the price of the diamond. In fact, “Ideal” diamonds tend to be cheapest.
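
The means behind the line plot can be checked directly. This is a small sketch using base R’s tapply(); mean.price.cut is an illustrative name.

```r
library("ggplot2")

df.diamonds = diamonds

# Mean price for each level of cut, from "Fair" to "Ideal"
mean.price.cut = tapply(df.diamonds$price, df.diamonds$cut, mean)
print(round(mean.price.cut))

# Name of the cut with the lowest mean price
print(names(which.min(mean.price.cut)))
```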

3.2.4 Adding error bars

We often don’t just want to show the means but also give a sense for how much the data varies. ggplot2 has some convenient ways of specifying error bars. Let’s take a look at how much price varies as a function of clarity (another variable in our diamonds data frame).

ggplot(data = df.diamonds,
       mapping = aes(x = clarity,
                     y = price)) +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "pointrange") 

Figure 3.3: Relationship between diamond clarity and price. Error bars indicate 95% bootstrapped confidence intervals.

Here we have it: the average price of our diamonds for different levels of clarity together with bootstrapped 95% confidence intervals. How do we know that we have 95% confidence intervals? That’s what mean_cl_boot() computes by default. Note that mean_cl_boot() relies on the Hmisc package, so you may need to install it first. Let’s take a look at that function:

help(mean_cl_boot)

Note that I had to use the fun.data = argument here instead of fun = because the mean_cl_boot() function produces three values for each value on the x-axis (the mean, and the lower and upper bounds of the confidence interval).
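
To see what a function passed to fun.data = has to return, here is a sketch of a hand-written one. The helper mean.ci.normal() is hypothetical (it is not part of ggplot2), and it uses a normal approximation (mean plus or minus 1.96 standard errors) rather than the bootstrap, so its intervals will not exactly match those of mean_cl_boot(). The point is only the shape of the result: a one-row data frame with columns y, ymin, and ymax.

```r
library("ggplot2")

df.diamonds = diamonds

# Hypothetical helper: mean with an approximate 95% interval
# (normal approximation, NOT a bootstrap)
mean.ci.normal = function(values) {
  m = mean(values)
  se = sd(values) / sqrt(length(values))
  data.frame(y = m,
             ymin = m - 1.96 * se,
             ymax = m + 1.96 * se)
}

# stat_summary() calls the helper once for each level of clarity
p = ggplot(data = df.diamonds,
           mapping = aes(x = clarity,
                         y = price)) +
  stat_summary(fun.data = mean.ci.normal,
               geom = "pointrange")
print(p)
```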

3.2.5 Order matters

The order in which we add geoms to a ggplot matters! Generally, we want to plot error bars before the points that represent the means. To illustrate, let’s set the color in which we show the means to “red”.

ggplot(df.diamonds,
       aes(x = clarity,
           y = price)) +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange") +
  stat_summary(fun = "mean",
               geom = "point",
               color = "red")

Figure 3.4: This figure looks good. Error bars and means are drawn in the correct order.

Figure 3.4 looks good.

# I've changed the order in which the means and error bars are drawn.
ggplot(df.diamonds,
       aes(x = clarity,
           y = price)) +
  stat_summary(fun = "mean",
               geom = "point",
               color = "red") +
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange")

Figure 3.5: This figure looks bad. Error bars and means are drawn in the incorrect order.

Figure 3.5 doesn’t look good. The error bars are on top of the points that represent the means.

One cool feature about using stat_summary() is that we did not have to change anything about the data frame that we used to make the plots. We directly used our raw data instead of having to make separate data frames that contain the relevant information (such as the means and the confidence intervals).
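
For comparison, here is roughly what the manual route would look like: first compute the means in a separate data frame, then plot that data frame. This is a sketch using base R’s aggregate(); df.means and p are illustrative names.

```r
library("ggplot2")

df.diamonds = diamonds

# Step 1: build a separate data frame with one mean price per clarity level
df.means = aggregate(price ~ clarity,
                     data = df.diamonds,
                     FUN = mean)
print(df.means)

# Step 2: plot the precomputed means
p = ggplot(data = df.means,
           mapping = aes(x = clarity,
                         y = price)) +
  geom_point(color = "red")
print(p)
```

With stat_summary(), step 1 happens inside the plotting call, so there is no extra data frame to keep in sync with the raw data.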

That’s all for now! Believe it or not, we have just scratched the surface of visualization with ggplot2. In the assignment, we will rehash the basic points from this lesson. Remember that Google is your friend! When you don’t know how to do something - for example giving an axis a label - search something like “How to label x axis in ggplot2”. Stack Overflow is a great resource.
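
As a taste of what such a search turns up, axis labels can be set with labs(). The sketch below is one possible answer, and the label text is only a suggestion.

```r
library("ggplot2")

df.diamonds = diamonds

# labs() sets the axis labels; add it to the plot with + like any other layer
p = ggplot(data = df.diamonds,
           mapping = aes(x = color,
                         y = price)) +
  stat_summary(fun = "mean",
               geom = "bar") +
  labs(x = "Diamond color",
       y = "Mean price")
print(p)
```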