Visualizing Data in the Tidyverse

PRIMERA SEMNA

General Features of Plots

Good plots have a number of features. While not exhaustive, good plots have: Clearly-labeled axes. Text that are large enough to see. Axes that are not misleading. Data that are displayed appropriately considering the type of data you have. More specifically, however, there are two general approaches to data visualization: exploratory plots and explanatory plots.

Exploratory Plots

These are data displays to help you better understand and discover hidden patterns in the data you’re working with. These won’t be the prettiest plots, but they will be incredibly helpful. Exploratory visualizations have a number of general characteristics:

They are made quickly. You’ll make a large number of them. The axes and legends are cleaned up.

Below we have a graph where the axes are labeled and general pattern can be determined. This is a great example of an exploratory plot. It lets you the analyst know what’s going on in your data, but it isn’t yet ready for a big presentation.

Explanatory Plots

These are data displays that aim to communicate insights to others. These are plots that you spend a lot of time making sure they’re easily interpretable by an audience. General characteristics of explanatory plots:

They take a while to make. There are only a few of these for each project. You’ve spent a lot of time making sure the colors, labels, and sizes are all perfect for your needs.

Here we see an improvement upon the exploratory plot we looked at previously. Here, the axis labels are more descriptive. All of the text is larger. The legend has been moved onto the plot. The points on the plot are larger. And, there is a title. All of these changes help to improve the plot, making it an explanatory plot that would be presentation-ready.

Plot Types

Histogram Densityplot Scatterplot Barplot Grouped Barplot Stacked Barplot Boxplot Line Plots

Choose the Right Type of Plot

If your goal is to allow the viewer to compare values across groups, pie charts should largely be avoided. This is because it’s easier for the human eye to differentiate between bar heights than it is between similarly-sized slices of a pie. Thinking about the best way to visualize your data before making the plot is an important step in the process of data visualization.

Be Mindful When Choosing Colors

Choosing colors that work for the story you’re trying to convey with your visualization is important. Avoiding colors that are hard to see on a screen or when projected, such as pastels, is a good idea. Additionally, red-green color blindness is common and leads to difficulty in distinguishing reds from greens. Simply avoiding making comparisons between these two colors is a good first step when visualizing data.

Label the Axes

Whether you’re making an exploratory or explanatory visualization, labeled axes are a must. They help tell the story of the figure. Making sure the axes are clearly labeled is also important. Rather than labeling the graph below with “h” and “g,” we chose the labels “height” and “gender,” making it clear to the viewer exactly what is being plotted.

Make Sure the Numbers Add Up

When you’re making a plot that should sum to 100, make sure that it in fact does. Taking a look at visualizations after you make them to ensure that they make sense is an important part of the data visualization process.

Make Sure the Numbers and Plots Make Sense Together

Another common error is having labels that don’t reflect the underlying graphic. For example, here, we can see on the left that the turquoise piece is more than half the graph, and thus the label 45% must be incorrect. At right, we see that the labels match what we see in the figure.

Make Comparisons Easy on Viewers

There are many ways in which you can make comparisons easier on the viewer. For example, avoiding unnecessary whitespace between the bars on your graph can help viewers make comparisons between the bars on the barplot.

Use y-axes That Start at Zero

Often, in an attempt to make differences between groups look larger than they are, y-axis will be started at a value other than zero. This is misleading. Y-axis for numerical information should start at zero.

Keep It Simple

The goal of data visualization is to improve understanding of data. Sometimes complicated visualizations cannot be avoided; however, when possible, keep it simple.

Three Questions You Should Ask

What’s your point?

Whenever you have data you’re trying to plot, think about what you’re actually trying to show. Once you’ve taken a look at your data, a good title for the plot can be helpful. Your title should tell viewers what they’ll see when they look at the plot.

How can you emphasize your point in your chart?

We talked about it in the last lesson, but an incredibly important decision is choosing an appropriate chart for the type of data you have. In the next section of this lesson, we’ll discuss what type of data are appropriate for each type of plot in R; however, for now, we’ll just focus on an iPhone data example. With this example, we’ll discuss that you can emphasize your point by:

Adding data Highlighting data with color Annotating your plot

Adding data

In any plot that makes a specific claim, it usually important to show additional data as a reference for comparison. For example, if you were making a plot of that suggests that the iPhone has been Apple’s most successful product, it would be helpful for the plot to compare iPhone sales with other Apple products, say, the iPad or the iPod. By adding data about other Apple products over time, we can visualize just how successful the iPhone has been compared to other products.

Highlighting data with color

Colors help direct viewers’ eyes to the most important parts of the figure. Colors tell your readers where to focus their attention. Grays help to tell viewers where to focus less of their attention, while other colors help to highlight the point your trying to make.

Annotate your plot

By highlighting parts of your plot with arrows or text on your plot, you can further draw viewers’ attention to certain part of the plot. These are often details that are unnecessary in exploratory plots, where the goal is just to better understand the data, but are very helpful in explanatory plots, when you’re trying to draw conclusions from the plot.

What Does Your Final Chart Show?

A plot title should first tell viewers what they would see in the plot. The second step is to show them with the plot. The third step is to make it extra clear to viewers what they should be seeing with descriptions, annotations, and legends. You explain to viewers what they should be seeing in the plot and the source of your data. Again, these are important pieces of creating a complete explanatory plot, but are not all necessary when making exploratory plots.

Write precise descriptions

Whether it’s a figure legend at the bottom of your plot, a subtitle explaining what data are plotted, or clear axes labels, text describing clearly what’s going on in your plot is important. Be sure that viewers are able to easily determine what each line or point on a plot represents.

Add a source

When finalizing an explanatory plot, be sure to source your data. It’s always best for readers to know where you obtained your data and what data are being used to create your plot. Transparency is important.

SEGUNDA SEMANA

ggplot2 Background

The grammar of graphics implemented in ggplot2 is based on the idea that you can build any plot as long as you have a few pieces of information. To start building plots in ggplot2, we’ll need some data and we’ll need to know the type of plot we want to make. The type of plot you want to make in ggplot2 is referred to as a geom.

This will get us started, but the idea behind ggplot2 is that every new concept we introduce will be layered on top of the information you’ve already learned. In this way, ggplot2 is layered - layers of information add on top of each other as you build your graph. In code written to generate a ggplot2 figure, you will see each line is separated by a plus sign (+).

Think of each line as a different layer of the graph. We’re simply adding one layer on top of the previous layers to generate the graph. You’ll see exactly what we mean by this throughout each section in this lesson.

To get started, we’ll start with the two basics (data and a geom) and build additional layers from there.

When using ggplot2 to generate figures, you will always begin by calling the ggplot() function. You’ll then specify your dataset within the ggplot() function. Then, before making your plot you will also have to specify what geom type you’re interested in plotting. We’ll focus on a few basic geoms in the next section and give examples of each plot type (geom), but for now we’ll just work with a single geom: geom_point.

geom_point is most helpful for creating scatterplots. As a reminder from an earlier lesson, scatterplots are useful when you’re looking at the relationship between two numeric variables. Within geom you will specify the arguments needed to tell ggplot2 how you want your plot to look.

You will map your variables using the aesthetic argument aes. We’ll walk through examples below to make all of this clear. However, get comfortable with the overall look of the code now.

Example Dataset: diamonds

To build your first plot in ggplot2 we’ll make use of the fact that there are some datasets already available in R. One frequently-used dataset is known as diamonds. This dataset contains prices and other attributes of 53,940 diamonds, with each row containing information about a different diamond. If you look at the first few rows of data, you can get an idea of what data are included in this dataset.

diamonds <- as_tibble(diamonds)
tail(diamonds)

## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.72 Premium   D     SI1      62.7    59  2757  5.69  5.73  3.58
## 2  0.72 Ideal     D     SI1      60.8    57  2757  5.75  5.76  3.5 
## 3  0.72 Good      D     SI1      63.1    55  2757  5.69  5.75  3.61
## 4  0.7  Very Good D     SI1      62.8    60  2757  5.66  5.68  3.56
## 5  0.86 Premium   H     SI2      61      58  2757  6.15  6.12  3.74
## 6  0.75 Ideal     D     SI2      62.2    55  2757  5.83  5.87  3.64

Scatterplots: geom_point()

In ggplot2 we specify these by defining x and y within the aes() argument. The x argument defines which variable will be along the bottom of the plot. The y refers to which variable will be along the left side of the plot. If we wanted to understand the relationship between the number of carats in a diamond and that diamond’s price, we may do the following:

# generate scatterplot with geom_point()
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price))

In this plot, we see that, in general, the larger the diamond is (or the more carats it has), the more expensive the diamond is (price), which is probably what we would have expected. However, now, we have a plot that definitively supports this conclusion!

Aesthetics

What if we wanted to alter the size, color or shape of the points? Probably unsurprisingly, these can all be changed within the aesthetics argument. After all, something’s aesthetic refers to how something looks. Thus, if you want to change the look of your graph, you’ll want to play around with the plot’s aesthetics.

In fact, in the plots above you’ll notice that we specified what should be on the x and y axis within the aes() call. These are aesthetic mappings too! We were telling ggplot2 what to put on each axis, which will clearly affect how the plot looks, so it makes sense that these calls have to occur within aes(). Additionally now, we’ll focus on arguments within aes() that change how the points on the plot look.

Point color

In the scatterplot we just generated, we saw that there was a relationship between carat and price, such that the more carats a diamond has, generally, the higher the price. But, it’s not a perfectly linear trend. What we mean by that is that not all diamonds that were 2 carats were exactly the same price. And, not all 3 carat diamonds were exactly the same price. What if we were interested in finding out a little bit more about why this is the case?

Well, we could look at the clarity of the diamonds to see whether or not that affects the price of the diamonds? To add clarity to our plot, we could change the color of our points to differ based on clarity:

# adjusting color within aes
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price, color = clarity))

Here, we see that not only are the points now colored by clarity, ggplot2 has also automatically added a legend for us with the various classes and their corresponding point color.

The Help pages of the diamonds dataset (accessed using ?diamonds) state that clarity is “a measurement of how clear the diamond is.” The documentation also tells us that I1 is the worst clarity and IF is the best (Full scale: I1, SI1, SI2, VS1, VS2, VVS1, VVS2, IF). This makes sense with what we see in the plot. Small (<1 carat) diamonds that have the best clarity level (IF) are some of the most expensive diamonds. While, relatively large diamonds (diamonds between 2 and 3 carats) of the lowest clarity (I1) tend to cost less.

By coloring our points by a different variable in the dataset, we now understand our dataset better. This is one of the goals of data visualization! And, specifically, what we’re doing here in ggplot2 is known as mapping a variable to an aesthetic. We took another variable in the dataset, mapped it to a color, and then put those colors on the points in the plot. Well, we only told ggplot2 what variable to map. It took care of the rest!

Of course, we can also manually specify the colors of the points on our graph; however, manually specifying the colors of points happens outside of the aes() call. This is because ggplot2 does not have to go through the mapping the variable to an aesthetic process. In the code here, ggplot2 doesn’t have to go through the trouble of figuring out which level of the variable is going to be which color on the plot (the mapping to the aesthetic part of the process). Instead, it just colors every point red. Thus, manually specifying the color of your points happens outside of aes():

# manually control color point outside aes
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), color = "red")

Point size

As above, we can change the point size by mapping another variable to the size argument within aes:

# adjust point size within aes
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price, size = clarity))

As above, ggplot2 handles the mapping process. All you have to do is specify what variable you want mapped (clarity) and how you want ggplot2 to handle the mapping (change the point size). With this code, you do get a warning when you run it in R that using a “discrete variable is not advised.” This is because mapping to size is usually done for numeric variables, rather than categorical variables like clarity.

This makes sense here too. The relationship between clarity, carat, and price was easier to visualize when clarity was mapped to color than here where it is mapped to size.

Like the above example with color, the size of every point can be changed by calling size outside of aes:

# global control of point size
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), size = 4.5)

Point Shape

You can also change the shape of the points (shape). We’ve used solid, filled circles thus far (the default in geom_point), but we could specify a different shape for each clarity.

# map clarity to point shape within aes 
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price, shape = clarity))

## Warning: Using shapes for an ordinal variable is not advised

## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 8. Consider
## specifying shapes manually if you must have them.

## Warning: Removed 5445 rows containing missing values (geom_point).

#### Facets

In addition to mapping variables to different aesthetics, you can also opt to use facets to help make sense of your data visually. Rather than plotting all the data on a single plot and visually altering the point size or color of a third variable in a scatterplot, you could break each level of that third variable out into a separate subplot. To do this, you would use faceting. Faceting is particularly helpful for looking at categorical variables.

To use faceting, you would add an additional layer (+) to your code and use the facet_wrap() function. Within facet wrap, you specify the variable by which you want your subplots to be made:

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price)) + 
  # facet by clarity
  facet_wrap(~clarity, nrow = 2)

#### Geoms

Thus far in this lesson we’ve only looked at scatterplots, which means we’ve only called geom_point. However, there are many additional geoms that we could call to generate different plots. Simply, a geom is just a shape we use to represent the data. In the case of scatterplots, they don’t really use a geom since each actual point is plotted individually.

Other plots, such as the boxplots, barplots, and histograms we described in previous lessons help to summarize or represent the data in a meaningful way, without plotting each individual point. The shapes used in these different types of plots to represent what’s going on in the data is that plot’s geom.

To see exactly what we mean by geoms being “shapes that represent the data”, let’s keep using the diamonds dataset, but instead of looking at the relationship between two numeric variables in a scatterplot, let’s take a step back and take a look at a single numeric variable using a histogram.

Histograms: geom_histogram

To review, histograms allow you to quickly visualize the range of values your variable takes and the shape of your data. (Are all the numbers clustered around center? Or, are they all at the extremes of the range? Somewhere in between? The answers to these questions describe the “shape” of the values of your variable.)

For example, if we wanted to see what the distribution of carats was for these data, we could to the following.

# change geom_ to generate histogram
ggplot(data = diamonds) + 
  geom_histogram(mapping =  aes(carat))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

he code follows what we’ve seen so far in this lesson; however, we’ve now called geom_histogram to specify that we want to plot a histogram rather than a scatterplot.

Here, the rectangular boxes on the plot are geoms (shapes) that represent the number of diamonds that fall into each bin on the plot. Rather than plotting each individual point, histograms use rectangular boxes to summarize the data. This summarization helps us quickly understand what’s going on in our dataset.

Specifically here, we can quickly see that most of the diamonds in the dataset are less than 1 carat. This is not necessarily something we could be sure of from the scatterplots generated previously in this lesson (since some points could have been plotted directly on top of one another). Thus, it’s often helpful to visualize your data in a number of ways when you first get a dataset to ensure that you understand the variables and relationships between variables in your dataset!

Barplots: geom_bar

Barplots show the relationship between a set of numbers and a categorical variable. In the diamonds dataset, we may be interested in knowing how many diamonds there are of each cut of diamonds. There are five categories for cut of diamond. If we make a barplot for this variable, we can see the number of diamonds in each category.

# geom_bar for bar plots
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(cut))

Boxplots: geom_boxplot

Boxplots provide a summary of a numerical variable across categories. For example, if you were interested to see how the price of a diamond (a numerical variable) changed across different diamond color categories (categorical variable), you may want to use a boxplot. To do so, you would specify that using geom_boxplot:

# geom_boxplot for boxplots
ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(x = color, y = price))

In the figure itself we see that the median price (the black horizontal bar in the middle of each box represents the median for each category) increases as the diamond color increases from the worst category (J) to the best (D).

Now, if you wanted to change the color of this boxplot, it would just take a small addition to the code for the plot you just generated.

# fill globally changes bar color outside aes
ggplot(data = diamonds) + 
  geom_boxplot(mapping = aes(x = color, y = price), 
               fill = "green")

EDA Plots

As mentioned previously, an important step after you’ve read your data into R and wrangled it into a tidy format is to carry out Exploratory Data Analysis (EDA). EDA is the process of understanding the data in your dataset fully. To understand your dataset fully, you need a full understanding of the variables stored in your dataset, what information you have and what information you don’t have (missingness!). To gain this understanding, we’ve discussed using packages like skimr to get a quick idea of what information is stored in your dataset. However, generating plots is another critical step in this process. We encourage you to use ggplot2 to understand the distribution of each single variable as well as the relationship between each variable in your dataset.

In this process, using ggplot2 defaults is totally fine. These plots do not have to be the most effective visualizations for communication, so you don’t want to spend a ton of time making them visually perfect. Only spend as much time on these as you need to understand your data!

Colors

To get started, we’ll learn how to control color across plots in ggplot2. Previously, we discussed using color within aes() on a scatterplot to automatically color points by the clarity of the diamond when looking at the relationship between price and carat.

ggplot(diamonds) + 
  geom_point(mapping = aes(x = carat, y = price, color = cut))

However, what if we wanted to carry this concept over to a bar plot and look at how many diamonds we have of each clarity group?

# generate bar plot
ggplot(diamonds) + 
  geom_bar(aes(x = clarity))

As a general note, we’ve stopped including data = and mapping = here within our code. We included it so far to be explicit; however, in code you see in the world, the names of the arguments will typically be excluded and we want you to be familiar with code that appears as you see above.

OK, well that’s a start since we see the breakdown, but all the bars are the same color. What if we adjusted color within aes()?

# color changes outline of bar
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, color = clarity))

As expected, color added a legend for each level of clarity; however, it colored the lines around the bars on the plot, rather than the bars themselves. In order to color the bars themselves, you want to specify the more helpful argument fill:

# use fill to color bars
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = clarity))

Great! We now have a plot with bars of different colors, which was our first goal! However, adding colors here, while maybe making the plot prettier doesn’t actually give us any more information. We can see the same pattern of which clarity is most frequent among the diamonds in our dataset like we could see in the first plot we made.

Color is particularly helpful here, however, if we wanted to map a different variable onto each bar. For example, what if we wanted to see the breakdown of diamond “cut” within each “clarity” bar?

# fill by separate variable (cut) = stacked bar chart
ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut))

Now we’re getting some new information! We can see that each level in clarity appears to have diamonds of all levels of cut. Color here has really helped us understand more about the data.

But what if we were going to present these data? While there is no comparison between red and green (which is good!), there is a fair amount of yellow in this plot. Some projectors don’t handle projecting yellow well, and it will show up too light on the screen. To avoid this, let’s manually change the colors in this bar chart! To do so we’ll add an additional layer to the plot using scale_fill_manual.

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut)) +
  # manually control colors used
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))

Here, we’ve specified five different colors within the values argument of scale_fill_manual(), one for each cut of diamond. The names of these colors can be specified using the names explained on the third page of the cheatsheet here. (Note: There are other ways to specify colors within R. Explore the details in that cheatsheet to better understand the various ways!)

Additionally, it’s important to note that here we’ve used scale_fill_manual() to adjust the color of what was mapped using fill = cut. If we had colored our chart using color within aes(), there is a different function called scale_color_manual. This makes good sense! You use scale_fill_manual() with fill and scale_color_manual() with color. Keep that in mind as you adjust colors in the future!

Now that we have some sense of which clarity is most common in our diamonds dataset and now that we are able to successfully specify the colors we want manually in order to make this plot useful for presentation, what if we wanted to compare the proportion of each cut across the different clarities? Currently, that’s difficult because there is a different number within each clarity. In order to compare the proportion of each cut we have to use position adjustment.

What we’ve just generated is a stacked bar chart. It’s a pretty good name for this type of chart as the bars for cut are all stacked on top of one another. If you don’t want a stacked bar chart you could use one of the other position options: identity, fill, or dodge.

Returning to our question about proportion of each cut within each clarity group, we’ll want to use position = “fill” within geom_bar(). Building off of what we’ve already done:

ggplot(diamonds) + 
  # fill scales to 100%
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))

Here, we’ve specified how we want to adjust the position of the bars in the plot. Each bar is now of equal height and we can compare each colored bar across the different clarities. As expected, we see that among the best clarity group (IF), we see more diamonds of the best cut (“Ideal”)!

Briefly, we’ll take a quick detour to look at position = “dodge”. This position adjustment places each object next to one another. This will not allow for easy comparison across groups, as we just saw with the last group but will allow values within each clarity group to be visualized.

ggplot(diamonds) + 
  # dodge rather than stack produces grouped bar plot
  geom_bar(aes(x = clarity, fill = cut), position = "dodge") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4"))

Unlike in the first plot where we specified fill = cut, we can actually see the relationship between each cut within the lowest clarity group (I1). Before, when the values were stacked on top of one another, we were not able to visually see that there were more “Fair” and “Premium” cut diamonds in this group than the other cuts. Now, with position = “dodge”, this information is visually apparent.

Note: position = “identity” is not very useful for bars, as it places each object exactly where it falls within the graph. For bar charts, this will lead to overlapping bars, which is not visually helpful. However, for scatterplots (and other 2-Dimensional charts), this is the default and is exactly what you want.

Labels

Text on plots is incredibly helpful. A good title tells viewers what they should be getting out of the plot. Axis labels are incredibly important to inform viewers of what’s being plotted. Annotations on plots help guide viewers to important points in the plot. We’ll discuss how to control all of these now!

Titles

Now that we have an understanding of how to manually adjust color, let’s improve the clarity of our plots by including helpful labels by adding an additional labs() layer. We’ll return to the plot where we were comparing proportions of diamond cut across diamond clarity groups.

You can include a title, subtitle, and/or caption within the labs() function. Each argument, as per usual, will be specified by a comma.

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  # add titles
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut")

Axis labels

You may have noticed that our y-axis label says “count”, but it’s not actually a count anymore. In reality, it’s a proportion. Having appropriately labeled axes is so important. Otherwise, viewers won’t know what’s being plotted. So, we should really fix that now using the ylab() function. Note: we won’t be changing the x-axis label, but if you were interested in doing so, you would use xlab(“label”).

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  # add y axis label explicitly
  ylab("proportion")

#### Themes

To change the overall aesthetic of your graph, there are 8 themes built into ggplot2 that can be added as an additional layer in your graph

For example, if we wanted remove the gridlines and grey background from the chart, we would use theme_classic(). Building on what we’ve already generated:

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  # change plot theme
  theme_classic()

We now have a pretty good looking plot! However, a few additional changes would make this plot even better for communication.

Note: Additional themes are available from the ggthemes package . Users can also generate their own themes.

Custom Theme

In addition to using available themes, we can also adjust parts of the theme of our graph using an additional theme() layer. There are a lot of options within theme. To see them all, look at the help documentation within RStudio Cloud using: ?theme. We’ll simply go over the syntax for using a few of them here to get you comfortable with adjusting your theme. Later on, you can play around with all the options on your own to become an expert!

Altering text size

For example, if we want to increase text size to make our plots more easily to view when presenting, we could do that within theme. Notice here that we’re increasing the text size of the title, axis.text, axis.title, and legend.text all within theme()! The syntax here is important. Within each of the elements of the theme you want to alter, you have to specify what it is you want to change. Here, for all three, we want to alter the text, so we specify element_text(). Within that, we specify that it’s size that we want to adjust.

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  # control theme
  theme(title = element_text(size = 16), 
        axis.text = element_text(size =14),
        axis.title = element_text(size = 16),
        legend.text = element_text(size = 14))

Additional text alterations

Changing the size of text on your plot is not the only thing you can control within theme(). You can make text bold and change its color within theme(). Note here that multiple changes can be made to a single element. We can change size and make the text bold. All we do is separate each argument with a comma, per usual.

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        # additional control
        plot.subtitle = element_text(color = "gray30"))

#### Legends

At this point, all the text on the plot is pretty visible! However, there’s one thing that’s still not quite clear to viewers. In daily life, people refer to the “cut” of a diamond by terms like “round cut” or “princess cut” to describe the shape of the diamond.

That’s not what we’re talking about here when we’re discussing “cut”. In these data, “cut” refers to the quality of the diamond, not the shape. Let’s be sure that’s clear as well! We can change the name of the legend by using an additional layer and the guides() and guide_legend() functions of the ggplot2 package!

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  # control legend
  guides(fill = guide_legend("cut quality"))

This guides() function, as well as the guides_* functions also allow us to modify legends even further.

This is especially useful if you have many colors in your legend and you want to control how the legend is displayed in terms of the number of columns and rows using ncol and nrow respectively.

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  # control legend
  guides(fill = guide_legend("Cut Quality", 
                              ncol = 2))

Or, we can modify the font of the legend title using title.theme().

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  # control legend
  guides(fill = guide_legend("Cut Quality", 
                             title.theme = element_text(face = "bold")))

Alternatively, we can do this modification, as well as other legend modifications, like adding a rectangle around the legend, using the theme() function.

ggplot(diamonds) + 
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
# changing the legend title:
guides(fill = guide_legend("Cut Quality")) +
  theme_classic() +
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30"),
# changing the legend style:
        legend.title = element_text(face = "bold"),
        legend.background = element_rect(color = "black"))

At this point, we have an informative title, clear colors, a well-labeled legend, and text that is large enough throughout the graph. This is certainly a graph that could be used in a presentation. We’ve taken it from a graph that is useful to just ourselves (exploratory) and made it into a plot that can communicate our findings well to others (explanatory)!

We have touched on a number of alterations you can make by adding additional layers to a ggplot. In the rest of this lesson we’ll touch on a few more changes you can make within ggplot2.

Scales

There may be times when you want a different number of values to be displayed on an axis. The scale of your plot for continuous variables (i.e. numeric variables) can be controlled using scale_x_continuous or scale_y_continuous. Here, we want to increase the number of labels displayed on the y-axis, so we’ll use scale_y_continuous:

ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  # control scale for continuous variable
  scale_y_continuous(breaks = seq(0, 17000, by = 1000))

There is very handy argument called trans for the scale_y_continuous or the scale_x_continuous functions to change the scale of the axes. For example it can be very useful to show the logarithmic version of the scale if you have very high values with large differences.

According to the documentation for the trans argument: > Built-in transformations include “asn”, “atanh”, “boxcox”, “date”, “exp”, “hms”, “identity”, “log”, “log10”, “log1p”, “log2”, “logit”, “modulus”, “probability”, “probit”, “pseudo_log”, “reciprocal”, “reverse”, “sqrt” and “time”.

ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  # control scale for continuous variable
  scale_y_continuous(trans = "log10") +
    labs(y = "Count (log10 scale)",
         x = "Clarity")

Notice that the values are not changed, just the way they are plotted. Now the y-axis increases by a factor of 10 for each break.

We will create a plot of the price of the diamonds to demonstrate the utility of creating a plot with a log10 scaled y-axis.

ggplot(diamonds) + 
  geom_boxplot(aes(y = price, x = clarity))

ggplot(diamonds) + 
  geom_boxplot(aes(y = price, x = clarity)) + 
  scale_y_continuous(trans = "log10") + 
  labs(y = "Price (log10 scale)",
       x = "Diamond Clarity")

In the first plot, it is difficult to tell what values the boxplots correspond to and it is difficult to compare the boxplots (particularly for the last three clarity categories), however this is greatly improved in the second plot.

We can also use another argument of the scale_y_continuous() function to add specific labels to our plot. For example, it would be nice to add dollar signs to the y-axis. We can do so using the labels argument. A variety of label_* functions within the scales package can be used to modify axis labels. See here to take a look at the many options.

ggplot(diamonds) + 
  geom_boxplot(aes(y = price, x = clarity)) + 
  scale_y_continuous(trans = "log10", 
                    labels = scales::label_dollar()) +
  labs(y = "Price (log10 scale)",
       x = "Diamond Clarity")

Now we can more easily determine that the SI2 diamonds are the most expensive.

Another way to modify discrete variables (aka factors or categorical variables where there is a limited number of levels), is to use scale_x_discrete or scale_y_discrete. In this case we will just pick a few of the clarity categories to plot and we will specify the order.

ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  # control scale for discrete variable
  scale_x_discrete(limit = c("SI2", "SI1", "I1")) +
  scale_y_continuous(breaks = seq(0, 17000, by = 1000))

## Warning: Removed 30940 rows containing non-finite values (stat_count).

Coordinate Adjustment

There are times when you’ll want to flip your axis. This can be accomplished using coord_flip(). Adding an additional layer to the plot we just generated switches our x- and y-axes, allowing for horizontal bar charts, rather than the default vertical bar charts:

ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  scale_y_continuous(breaks = seq(0, 17000, by = 1000)) +
  scale_x_discrete(limit = c("SI2", "SI1", "I1")) +
  # flip coordinates
  coord_flip() +
  labs(title = "Clearer diamonds tend to be of higher quality cut",
       subtitle = "The majority of IF diamonds are an \"Ideal\" cut") +
  ylab("proportion") +
  theme_classic() +
  theme(title = element_text(size = 18), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  guides(fill = guide_legend("cut quality"))

## Warning: Removed 30940 rows containing non-finite values (stat_count).

## Warning: Removed 30940 rows containing non-finite values (stat_count).

It’s important to remember that all the additional alterations we already discussed can still be applied to this graph, due to the fact that ggplot2 uses layering!

p <- ggplot(diamonds) + 
  geom_bar(mapping = aes(x = clarity)) +
  scale_y_continuous(breaks = seq(0, 17000, by = 1000)) +
  scale_x_discrete(limit = c("SI2", "SI1", "I1")) +
  coord_flip() +
  labs(title = "Number of diamonds by diamond clarity",
       subtitle = "Subset of all diamonds, looking three levels of clarity") +
  theme_classic() +
  theme(title = element_text(size = 18), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30") )

Note: we could have accomplished this by adding an additional geom: geom_text. However, this requires creating a new dataframe, as explained here. This can also be used to label the points on your plot. Keep this reference in mind in case you have to do that in the future.

Annotation

Finally, there will be times when you’ll want to add text to a plot or to annotate points on your plot. We’ll discuss briefly how to accomplish that here!

To add text to your plot, we can use the function annotate. This requires us to specify that we want to annotate here with “text” (rather than say a shape, like a rectangle - “rect” - which you can also do!). Additionally, we have to specify what we’d like that text to say (using the label argument), where on the plot we’d like that text to show up (using x and y for coordinates), how we’d like the text aligned (using hjust for horizontal alignment where the options are “left”, “center”, or “right” and vjust for vertical alignment where the arguments are “top”, “center”, or “bottom”), and how big we’d like that text to be (using size):

ggplot(diamonds) + 
  geom_bar(aes(x = clarity)) +
  scale_y_continuous(breaks = seq(0, 17000, by = 1000)) +
  scale_x_discrete(limit = c("SI2", "SI1", "I1")) +
  coord_flip() +
  labs(title = "Number of diamonds by diamond clarity",
       subtitle = "Subset of all diamonds, looking three levels of clarity") +
  theme_classic() +
  theme(title = element_text(size = 18), 
        axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        legend.text = element_text(size = 14),
        plot.subtitle = element_text(color = "gray30")) +
  # add annotation
  annotate("text", label = "SI1 diamonds are among \n the most frequent clarity diamond", 
           y = 12800, x = 2.9, 
           vjust = "top", hjust = "right", 
           size = 6)

## Warning: Removed 30940 rows containing non-finite values (stat_count).

## Warning: Removed 30940 rows containing non-finite values (stat_count).

Vertical and Horizontal Lines

Sometimes it is very useful to add a line to our plot to indicate an important threshold. We can do so by using the geom_hline() function for a horizontal line and geom_vline() for a vertical line.

In each case, the functions require that a y or x intercept be specified respectively.

For example, it might be useful to add a horizontal line to indicate 50% of the total counts for each of the clarity categories. We will also use the scale_y_continuous() function to change the y-axis to show percentages.

ggplot(diamonds) + 
  # fill scales to 100%
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percent of diamonds") +
  geom_hline(yintercept = 0.5, color = "red", size = 1)

Now it’s easier to tell that slightly over half of the VVS2 diamonds have an Ideal cut. This would be much more difficult to see without the horizontal line.

To add a vertical line we would instead use the geom_vline() function and we would specify an x-axis intercept. Since this plot has a discrete x-axis, numeric values specify a categorical value based on the order, thus a value of 4 would create a line down the center of the VS2 bar. However, if we used 5.5 we could add a line offset from the center of a bar, as you can see in the following example:

ggplot(diamonds) + 
  # fill scales to 100%
  geom_bar(aes(x = clarity, fill = cut), position = "fill") +
  scale_fill_manual(values = c("red", "orange", "darkgreen", "dodgerblue", "purple4")) +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Percent of diamonds") +
  geom_hline(yintercept = 0.5, color = "red", size = 1 ) +
  geom_vline(xintercept = 5.5, color = "black", size = .5)

This would be helpful if we wanted to especially point out differences between the last three clarity categories of diamonds compared to the other categories.

SEMANA 3

While we have focused on figures here so far, tables can be incredibly informative at a glance too. If you are looking to display summary numbers, a table can also visually display information.

Using this same dataset, we can use a table to get a quick breakdown of how many males and females there are in the dataset and what percentage of each gender there is.

A few things to keep in mind when making tables is that it’s best to:

Limit the number of digits in the table Include a caption When possible, keep it simple.

Tables in R

Now that we have a good understanding of what to consider when making tables, we can to practice making good tables in R. To do this, we’ll continue to use the diamonds dataset (which is part of the ggplot2 package). As a reminder, this dataset contains prices and other information about ~54,000 different diamonds. If we want to provide viewers with a summary of these data, we may want to provide information about diamonds broken down by the quality of the diamond’s cut. To get our data in the form we want we will use the dplyr package, which we discussed in a lesson earlier.

Getting the Data in Order

To start figuring out how the quality of the cut of the diamond affects the price of that diamond, we first have to get the data in order. To do that we’ll use the dplyr package. This allows us to group the data by the quality of the cut (cut) before summarizing the data to determine the number of diamonds in each category (N), the minimum price of the diamonds in this category (min), the average price (avg), and the highest price in the category (max).

To get these data in order, you could use the following code. This code groups the data by cut (quality of the diamond) and then calculates the number of diamonds in each group (N), the minimum price across each group (min), the average price of diamonds across each group (avg), and the maximum price within each group (max):

# get summary data for table in order
df <- diamonds %>% 
  group_by(cut) %>%
  dplyr::summarize(
    N = n(), 
    min = min(price), 
    avg = mean(price), 
    max = max(price)
  )
## `summarise()` ungrouping output (override with `.groups` argument)

An Exploratory Table

df

## # A tibble: 5 × 5
##   cut           N   min   avg   max
##   <ord>     <int> <int> <dbl> <int>
## 1 Fair       1610   337 4359. 18574
## 2 Good       4906   327 3929. 18788
## 3 Very Good 12082   336 3982. 18818
## 4 Premium   13791   326 4584. 18823
## 5 Ideal     21551   326 3458. 18806

By getting the data summarized into a single object in R (df), we’re on our way to making an informative table. However, this is clearly just an exploratory table. The output in R from this code follows some of the good table rules above, but not all of them. At a glance, it will help you to understand the data, but it’s not the finished table you would want to send to your boss.

An Exploratory Table

# look at data
df

## # A tibble: 5 × 5
##   cut           N   min   avg   max
##   <ord>     <int> <int> <dbl> <int>
## 1 Fair       1610   337 4359. 18574
## 2 Good       4906   327 3929. 18788
## 3 Very Good 12082   336 3982. 18818
## 4 Premium   13791   326 4584. 18823
## 5 Ideal     21551   326 3458. 18806

From this output, you, the creator of the table, would be able to see that there are a number of good qualities:

there is a reasonable number of rows and columns - There are 5 rows and 5 columns. A viewer can quickly look at this table and determine what’s going on. the first column cut is organized logically - The lowest quality diamond category is first and then they are ordered vertically until the highest quality cut (ideal)) comparisons are made top to bottom - To compare between the groups, your eye only has to travel up and down, rather than from left to right. There are also things that need to be improved on this table:

column headers could be even more clear there’s no caption/title it could be more aesthetically pleasing

Improving the Table Output

By-default, R outputs tables in the Console using a monospaced font. However, this limits our ability to format the appearance of the table. To fix the remaining few problems with the table’s format, we’ll use the kable() function from the R package knitr and the additional formatting capabilities of the R packages kableExtra.

The first step to a prettier table just involves using the kable() function from the knitr package, which improves the readability of this table. The knitr package is not a core tidyverse package, so you’ll have to be sure it’s installed and loaded.

# install.packages("knitr")
library(knitr)

kable(df)

cut	N	min	avg	max
Fair	1610	337	4358.758	18574
Good	4906	327	3928.864	18788
Very Good	12082	336	3981.760	18818
Premium	13791	326	4584.258	18823
Ideal	21551	326	3457.542	18806

However, there are still a few issues we want to improve upon:

column names could be more informative too many digits in the avg column caption/title is missing source of data not included To begin addressing these issues, we can use the add_header_above function from kableExtra() to specify that the min, avg, and max columns refer to price in US dollars (USD). Additionally, kable() takes a digits argument to specify how many significant digits to display. This takes care of the display of too many digits in the avg column. Finally, we can also style the table so that every other row is shaded, helping our eye to keep each row’s information separate from the other rows using kable_styling() from kableExtra. These few changes really improve the readability of the table.

If you copy this code into your R console, the formatted table will show up in the Viewer tab at the bottom right-hand side of your RStudio console and the HTML code used to generate that table will appear in your console.

# install.packages("kableExtra")
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

# use kable_styling to control table appearance
kable(df, digits=0, "html") %>%
  kable_styling("striped", "bordered") %>% 
  add_header_above(c(" " = 2,  "price (USD)" = 3))

		price (USD)
cut	N	min	avg	max
Fair	1610	337	4359	18574
Good	4906	327	3929	18788
Very Good	12082	336	3982	18818
Premium	13791	326	4584	18823
Ideal	21551	326	3458	18806

Annotating Your Table

We mentioned earlier that captions and sourcing your data are incredibly important. The kable package allows for a caption argument. Below, an informative caption has been included. Additionally, kableExtra has a footnote() function. This allows you to include the source of your data at the bottom of the table. With these final additions, you have a table that clearly displays the data and could be confidently shared with your boss.

kable(df, digits=0, "html", caption="Table 1: Diamonds Price by Quality of Cut. Most Diamonds are of the highest quality cut and the most expensive diamonds are of the highest quality") %>%
  kable_styling("striped", "bordered") %>% 
  # add headers and footnote
  add_header_above(c(" " = 2,  "price (USD)" = 3)) %>% 
  footnote(general = "Diamonds dataset from ggplot2", general_title = "Source:", footnote_as_chunk = T)

Table 1: Diamonds Price by Quality of Cut. Most Diamonds are of the highest quality cut and the most expensive diamonds are of the highest quality
		price (USD)
cut	N	min	avg	max
Fair	1610	337	4359	18574
Good	4906	327	3929	18788
Very Good	12082	336	3982	18818
Premium	13791	326	4584	18823
Ideal	21551	326	3458	18806
Source: Diamonds dataset from ggplot2

Visualizing Data in the Tidyverse

RRR

8/10/2021

PRIMERA SEMNA

General Features of Plots

Exploratory Plots

Explanatory Plots

Plot Types

Choose the Right Type of Plot

Be Mindful When Choosing Colors

Label the Axes

Make Sure the Numbers Add Up

Make Sure the Numbers and Plots Make Sense Together

Make Comparisons Easy on Viewers

Use y-axes That Start at Zero

Keep It Simple

Three Questions You Should Ask

What’s your point?

How can you emphasize your point in your chart?

Adding data

Highlighting data with color

Annotate your plot

What Does Your Final Chart Show?

Write precise descriptions

Add a source

SEGUNDA SEMANA

ggplot2 Background

Example Dataset: diamonds

Scatterplots: geom_point()

Aesthetics

Barplots: geom_bar

Boxplots: geom_boxplot

EDA Plots

Colors

Labels

Custom Theme

Scales

Coordinate Adjustment

Annotation

Vertical and Horizontal Lines

SEMANA 3

Tables in R

Getting the Data in Order

An Exploratory Table

An Exploratory Table

Improving the Table Output

Annotating Your Table