1.Introduction

Hello there! Today, we will be learning how to make visualizations with R’s popular ggplot2 package. You can create simple visualizations with base R, but we are choosing to teach you ggplot2 because it is relatively easy to implement, well-documented, and supported by a community of data enthusiasts.

The ggplot2 package is part of the tidyverse, so it works alongside other data cleaning and processing tools. The package is designed with the grammar of graphics in mind. This grammar of graphics describes a general pattern to follow when you’re creating a visualization. In fact, that’s where the “gg” in ggplot comes from: grammar of graphics. ggplot2 has fostered a large programming community, so as you venture into making your own plots outside of the Codecademy platform, you’ll find lots of resources and examples to welcome you.

This lesson will teach you the basic grammar required to create a plot. After you learn the underlying structure or “philosophy” of ggplot2, you can extend the logic to create many types of plots. Once you get the basic structure down, our upcoming lessons will explore how to customize your plot and calculate statistics in your visualization. Let’s get started.

Instructions

Take a look at the image. Observe how ggplot layers elements of the plot to create a final visualization. All plots start as a blank canvas with data associated to them. Layers of geometries, labels, and scales are then added to display the information. Click next when you’re ready to proceed!

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Intro.png.png")

2.Layers and Geoms

When you learn grammar in school you learn about the basic units to construct a sentence. The basic units in the “grammar of graphics” consist of:

1.The data or the actual information you wish to visualize.

2.The geometries, shortened to “geoms”, describe the shapes that represent our data, whether dots on a scatter plot, bars on a bar chart, or a line through the data. The list goes on. Geoms are the shapes that “map” our data.

3.The aesthetics, or the visual attributes of the plot, including the scales on the axes, the color, the fill, and other attributes concerning appearance.

Another key component to understand is that in ggplot2, geoms are “added” as layers to the original canvas which is just an empty plot with data associated to it.

Once you learn these three basic grammatical units, you can create the equivalent of a basic sentence, or a basic plot. There are more units in the “grammar of graphics,” but in this lesson we’ll mostly be learning about these three.
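As a minimal sketch of how the three units fit together, the snippet below uses R's built-in mtcars data frame. That dataset is chosen here only for illustration and is not part of this lesson's movie data; the variable name units_plot is likewise just a placeholder.

```r
library(ggplot2)

# Data:       the built-in mtcars data frame (an assumption for illustration)
# Aesthetics: car weight (wt) on the x axis, miles per gallon (mpg) on the y axis
# Geom:       one point per row of the data
units_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Stating the variable name renders the plot
units_plot
```

Swapping geom_point() for another geom would change the shape used to draw the data while the data and the aesthetic mappings stay the same.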

Instructions

Take a look at the code that generates the plot. You will understand what every single line is doing by the end of this lesson! For now, focus on the plus signs: each plus sign adds a layer to the plot.

# load libraries and data
library(readr)
library(dplyr)
library(ggplot2)
movies <- read_csv("imdb.csv")
movies
# Observe layers being added with the + sign
viz <- ggplot(data=movies, aes(x=imdbRating, y=nrOfWins)) +
       geom_point(aes(color=nrOfGenre), alpha=0.5) + 
       labs(title="Movie Ratings Vs Award Wins", subtitle="From IMDB dataset", y="Number of Award Wins", x="Movie Rating", color = "Number of Genre")


# Prints the plot
viz

3.The ggplot() function

The first thing you’ll need to do to create a ggplot object is invoke the ggplot() function. Conceptualize this step as initializing the “canvas” of the visualization. In this step, it’s also standard to associate the data frame the rest of the visualization will use with the canvas. What do we mean by “the rest” of the visualization? We mean all the layers you’ll add as you build out your plot.

As we mentioned, at its heart, a ggplot visualization is a combination of layers that each display information or add style to the final graph. You “add” these layers to a starting canvas, or ggplot object, with a + sign. We’ll add geometries and aesthetics in the next exercises. For now, let’s stop to understand that any arguments inside the ggplot() function call are inherited by the rest of the layers on the plot.

Here we invoke ggplot() to create a ggplot object and assign the dataframe df, saving it inside a variable named viz:

df <- read_csv("imdb.csv")
## New names:
## Rows: 290 Columns: 45
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (6): fn, tid, title, wordsInTitle, url, type dbl (39): ...1, imdbRating,
## ratingCount, duration, year, nrOfWins, nrOfNomin...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
viz <- ggplot(data=df)

viz

Note: The code above assigns the value of the canvas to viz and then states the variable name viz after so that the visualization is rendered in the notebook.

Any layers we add to viz would have access to the dataframe. We mentioned the idea of aesthetics before. It’s important to understand that any aesthetics that you assign as the ggplot() arguments will also be inherited by other layers. We’ll explore what this means in depth too, but for now, it’s sufficient to conceptualize that arguments defined inside ggplot() are inherited by other layers.
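One way to see this inheritance in action is to ask ggplot2 what data a layer actually ended up with. The sketch below assumes the built-in mtcars data frame and uses layer_data(), a ggplot2 helper that returns the computed data for a given layer. The names canvas, with_points, and pts are hypothetical.

```r
library(ggplot2)

# The canvas carries the data frame and the x/y mappings, but no layers yet
canvas <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# geom_point() is given no arguments, so it inherits everything from the canvas
with_points <- canvas + geom_point()

# Inspect the data the point layer will draw: x comes from wt, y from mpg
pts <- layer_data(with_points, 1)
head(pts[, c("x", "y")])
```

Because the point layer received no data or mappings of its own, the x and y values it reports can only have come from the arguments passed to ggplot().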

Instructions

Take a look at the animated diagram.

1.Notice how initially, a ggplot object is created as a blank canvas. Notice that the aesthetics are then set inside the ggplot() function as an argument.

2.You could individually set the scales for each layer, but if all layers will use the same scales, it makes sense to set those arguments at the ggplot() “canvas” level.

3.When two subsequent layers are added, a scatter plot and a line of best fit, both of those layers are mapped onto the canvas using the same data and scales.

4.Associating the Data

Before we go any further, let’s stop to understand when the data gets bound to the visualization:

1.Data is bound to a ggplot2 visualization by passing a data frame as the first argument in the ggplot() function call. You can include the named argument like ggplot(data=df_variable) or simply pass in the data frame like ggplot(df_variable).

2.Because the data is bound at this step, this means that the rest of our layers, which are function calls we add with a + plus sign, all have access to the data frame and can use the column names as variables.

For example, assume we have a data frame sales with the columns cost and profit. In this example, we assign the data frame sales to the ggplot() object that is initialized:

viz <- ggplot(data=sales) + geom_point(aes(x=cost, y=profit))
viz # renders plot

In the example above:

1.The ggplot object or canvas was initialized with the data frame sales assigned to it.

2.The subsequent geom_point layer used the cost and profit columns to define the scales of the axes for that particular geom. Notice that it simply referred to those columns with their column names.

3.We state the variable name of the visualization ggplot object so we can see the plot.

Note: There are other ways to bind data to layers if you want each layer to have a different dataset, but the most readable and popular way to bind the dataframe happens at the ggplot() step and your layers use data from that dataframe.
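To illustrate the note above, here is a hedged sketch of a layer that brings its own dataset through the data argument that geom functions accept. Both data frames, sales and targets, are made up for this example, as are their values.

```r
library(ggplot2)

# Hypothetical data frames, invented purely for illustration
sales   <- data.frame(cost = c(10, 20, 30, 40), profit = c(3, 8, 9, 15))
targets <- data.frame(cost = c(15, 35), profit = c(5, 12))

viz <- ggplot(data = sales) +
  # This layer uses the data frame bound at the ggplot() step
  geom_point(aes(x = cost, y = profit)) +
  # This layer overrides it with its own data frame
  geom_point(data = targets, aes(x = cost, y = profit),
             color = "darkred", size = 3)

viz # renders plot
```

The first layer draws four points from sales and the second draws two from targets, which shows that a layer-level data argument replaces the canvas data for that layer only.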

Instructions

1.Create a new variable named viz and assign it the value of a new ggplot object that you create by invoking the ggplot() call and assigning it the dataframe movies as the data argument. After you’ve defined viz you need to state the variable name on a new line in order to see it.

Click run and watch your code render an empty canvas. Even though no data is displayed, the data is bound to the viz ggplot object!

#Define variable and print it
viz <- ggplot(data = movies)
viz

5.What are aesthetics?

In the context of ggplot, aesthetics are the instructions that determine the visual properties of the plot and its geometries.

Aesthetics can include things like the scales for the x and y axes, the color of the data on the plot based on a property or simply on a color preference, or the size or shape of different geometries.

There are two ways to set aesthetics: by manually specifying individual attributes or by providing aesthetic mappings. We’ll explore aesthetic mappings first and come back to manual aesthetics later in the lesson. Aesthetic mappings “map” variables from the data frame to visual properties in the plot. You can provide aesthetic mappings in two ways using the aes() mapping function:

1.At the canvas level: All subsequent layers on the canvas will inherit the aesthetic mappings you define when you create a ggplot object with ggplot().

2.At the geom level: Only that layer will use the aesthetic mappings you provide.

Let’s discuss inherited aesthetics first, or aesthetics you define at the canvas level. Here’s an example of code that assigns aes() mappings for the x and y scales at the canvas level:

viz <- ggplot(data=airquality, aes(x=Ozone, y=Temp)) +
       geom_point() +
       geom_smooth()

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/ozone.png")

In the example above:

1.The aesthetic mapping is wrapped in the aes() aesthetic mapping function as an additional argument to ggplot().

2.Both of the subsequent geom layers, geom_point() and geom_smooth() use the scales defined inside the aesthetic mapping assigned at the canvas level.

You should set aesthetics for subsequent layers at the canvas level if all layers will share those aesthetics.
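The sketch below compares the two approaches on the built-in airquality data. It assumes rows with missing Ozone values are dropped first, and it passes method = "lm" to geom_smooth() only to keep the example simple and quiet; the names air, shared, and repeated are placeholders.

```r
library(ggplot2)

# Drop rows with missing values so no warnings are raised (an assumption for this sketch)
air <- na.omit(airquality[, c("Ozone", "Temp")])

# Canvas level: the mapping is written once and inherited by both layers
shared <- ggplot(air, aes(x = Ozone, y = Temp)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Geom level: the same mapping has to be repeated for every layer
repeated <- ggplot(air) +
  geom_point(aes(x = Ozone, y = Temp)) +
  geom_smooth(aes(x = Ozone, y = Temp), method = "lm", formula = y ~ x)

shared
```

Both objects draw the same picture. The canvas-level version is shorter and leaves less room for the layers to drift out of sync.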

Instructions

1.In the visualization we will be creating, we want to plot the Movie Ratings (imdbRating) on the x axis and the number of awards (nrOfWins) on the y axis to see if there is a correlation between a movie rating and the number of awards it wins. We will use this scale on the subsequent layers, so create the aesthetic mappings at the canvas level.

#Create aesthetic mappings at the canvas level
viz <- ggplot(data=movies, aes(x = imdbRating, y = nrOfWins))
viz

6.Adding Geoms

Before we teach you how to add aesthetics specific to a geom layer, let’s create our first geom! As mentioned before, geometries or geoms are the shapes that represent our data.

In ggplot, there are many types of geoms for representing different relationships in data. You can read all about each one in the layers section of the ggplot2 documentation. Once you learn the basic grammar of graphics, all you’ll have to do is read the documentation of a particular geom and you’ll be prepared to make a plot with it following the general pattern. For simplicity’s sake, let’s start with the scatterplot geom, or geom_point(), which simply represents each datum as a point on the grid. Scatterplots are great for graphing paired numerical data or to detect a correlation between two variables.

The following code adds a scatterplot layer to the visualization:

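viz <- ggplot(data=df, aes(x=col1,y=col2)) + geom_point()

In the code above:

1.Notice the layer is being added by using a + sign, which comes after the ggplot object is created. If you break the code across several lines, the + must end the line it follows rather than start the next one, so that R knows the expression continues.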

2.The geom_point() function call is what adds the points layer to the plot. This call can take arguments but we are keeping it simple for now.

The code above would render the following plot:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Adding Geoms1.png")

Another popular layer is geom_smooth(), which lets you eye patterns in the data by fitting a smoothed trend line through it. By default, this layer comes with a gray band around the line that shows the confidence interval. You could add a smooth layer to the plot by typing the following:

viz <- ggplot(data=df, aes(x=col1,y=col2)) +
       geom_point() +
       geom_smooth()

1.Notice that you can add layers one on top of the other. We added the smooth line after adding the geom_point() layer. We could have just included the point layer, or just the line-of-best-fit layer. But the combination of the two enhances our visual understanding of the data, so they make a great pairing.

2.Putting each layer on its own line is not necessary, but it improves readability in the long run, especially if you’re collaborating with other people.

The code above would render the following plot:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Adding Geoms2.png")
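If you want a straight line of best fit rather than the default smoothed curve, geom_smooth() accepts a method argument. The following is a sketch on the built-in airquality data, assuming rows with missing values are dropped first; se = FALSE hides the gray band, and the names air and fit_plot are placeholders.

```r
library(ggplot2)

# Keep only complete rows so the example runs without warnings
air <- na.omit(airquality[, c("Ozone", "Temp")])

fit_plot <- ggplot(air, aes(x = Ozone, y = Temp)) +
  geom_point() +
  # method = "lm" fits a linear model; se = FALSE drops the confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

fit_plot
```

Leaving out method and se returns you to the default behaviour described above.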

Instructions

1.Add a scatter plot of the data to the viz ggplot object by using the geom_point() layer.

# Add a geom point layer
viz <- ggplot(data=movies, aes(x=imdbRating, y=nrOfWins)) + geom_point()


# Prints the plot
viz

7.Geom Aesthetics

In the previous exercises, we added geoms to the plot and explored the idea of layers inheriting the original aesthetic mappings of the canvas. Sometimes, you’ll want individual layers to have their own mappings. For example, what if we wanted the scatterplot layer to classify the points based on a data-driven property? We achieve this by providing an aesthetic mapping for that layer only.

Let’s explore the aesthetic mappings for the geom_point() layer. What if we wanted to color-code the points on the scatterplot based on a property? It’s possible to customize the color by passing in an aes() aesthetic mapping with the color based on a data-driven property. Observe this example:

viz <- ggplot(data=airquality, aes(x=Ozone, y=Temp)) +
       geom_point(aes(color=Month)) +
       geom_smooth()

The code above would only change the color of the point layer. It would not affect the color of the smooth layer, since the aes() aesthetic mapping is passed only to the point layer.

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Geom Aesthetics1.png")

Note: You can read about the individual aesthetics available for each geom in its documentation. Some aesthetics are shared across geoms and others are specific to particular ones.
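A variation worth trying is to wrap the mapped column in factor(), which asks ggplot2 to treat it as a set of categories with one distinct color each instead of a continuous gradient. This is a sketch on the built-in airquality data with incomplete rows dropped; the names air and viz are placeholders, and method = "lm" is used only to keep the smooth layer simple.

```r
library(ggplot2)

air <- na.omit(airquality[, c("Ozone", "Temp", "Month")])

viz <- ggplot(air, aes(x = Ozone, y = Temp)) +
  # The color mapping lives on the point layer only
  geom_point(aes(color = factor(Month))) +
  # The smooth layer inherits x and y, but not the color mapping
  geom_smooth(method = "lm", formula = y ~ x)

viz
```

The points take one color per month, while the smooth layer stays a single line in a single color because it never received the color mapping.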

Instructions

1.Inside our movies dataset, we have a column named nrOfGenre that describes the number of genres a movie is assigned. For example, the movie “Terminator” is classified as both “Action” and “Sci-Fi”, so its number of genres is equal to 2. What if we are wondering whether the number of genres a movie is assigned, in other words its versatility, is correlated with its movie rating or its number of wins? Are movies better off when they stick to one simple genre or when they explore multiple ones? We want to display this information on our plot.

Add an aesthetic mapping to the geom_point() layer that color coordinates the data based on nrOfGenre.

# Add a color aesthetic mapping to the point layer
viz <- ggplot(data=movies, aes(x=imdbRating, y=nrOfWins)) + geom_point(aes(color = nrOfGenre)) 

# Prints the plot
viz

8.Manual Aesthetics

We’ve reviewed how to assign data-driven aesthetic mappings at the canvas level and at the geom level. However, sometimes you’ll want to change an aesthetic based on visual preference and not data. You might think of this as “manually” changing an aesthetic.

If you have a pre-determined value in mind, you provide a named aesthetic parameter and the value for that property without wrapping it in an aes(). For example, if you wanted to make all the points on the scatter plot layer dark red because that’s in line with the branding of the visualization you are preparing, you could simply pass in a color parameter with the manual value "darkred", or any other color value, like so:

viz <- ggplot(data=airquality, aes(x=Ozone, y=Temp)) +
       geom_point(color="darkred")

Note that we did not wrap the color argument inside aes() because we are manually setting that aesthetic. Here are more aesthetics for the geom_point() layer: x, y, alpha, color, fill, group, shape, size, stroke. The alpha aesthetic describes the opacity of the points, and the shape aesthetic lets you draw each point as something other than a round dot. Read more about the values each of these aesthetics takes in the geom_point() layer documentation.

The code above would render the following plot:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Manual Aesthetics1.png")

We advise that your aesthetic choices have intention behind them. Too much styling can overcomplicate the appearance of a plot, making it difficult to read.
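A common slip is to put a manual value inside aes(). The sketch below, on the built-in airquality data with incomplete rows dropped, contrasts the two forms; the names air, manual, and mapped are placeholders.

```r
library(ggplot2)

air <- na.omit(airquality[, c("Ozone", "Temp")])

# Manual aesthetic: every point is drawn in dark red
manual <- ggplot(air, aes(x = Ozone, y = Temp)) +
  geom_point(color = "darkred")

# Inside aes(), "darkred" is treated as a data value (a one-level category),
# so ggplot2 picks a color from its default palette and adds a legend
mapped <- ggplot(air, aes(x = Ozone, y = Temp)) +
  geom_point(aes(color = "darkred"))

manual
```

If a plot shows an unexpected color and a legend labeled with your color name, a manual value wrapped in aes() is a likely cause.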

Instructions

1.There seems to be some crowding in our movie scatterplot. Let’s make our points semi-transparent by setting their opacity to 0.5. We can accomplish this by manually assigning the alpha value of the geom_point() layer.

# Set a manual alpha value on the point layer
viz <- ggplot(data = movies, aes(x = imdbRating, y = nrOfWins)) +
  geom_point(aes(color = nrOfGenre), alpha = 0.5)

# Prints the plot
viz

9.Labels

So far, we’ve reviewed how to add geometries to represent our data. We’ve also learned how to modify aesthetic values in our plot, whether those aesthetics are data-driven or assigned manually. Another big part of creating a plot is making sure it has reader-friendly labels. The ggplot2 package automatically uses the name of the variable mapped to each component of the plot as its initial label. Unfortunately, variable names from code are not always legible to outside readers with no context.

If you wish to customize your labels, you can add a labs() function call to your ggplot object. Inside the function call to labs() you can provide new labels for the x and y axes as well as a title, subtitle, or caption. You can check out the list of available label arguments in the labs() documentation.

The following labs() function call and these specified arguments would render the following plot:

viz <- ggplot(df, aes(x=rent, y=size_sqft)) +
       geom_point() +
       labs(title="Monthly Rent vs Apartment Size in Brooklyn, NY", subtitle="Data by StreetEasy (2017)", x="Monthly Rent ($)", y="Apartment Size (sq ft.)")
viz

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Labels.png")
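Since the rent dataset above is not bundled with R, here is a runnable sketch of the same idea on the built-in mtcars data frame. It also shows the caption argument and how naming an aesthetic (color here) relabels its legend. The label text and the name labeled are choices made for this example.

```r
library(ggplot2)

labeled <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl))) +
  labs(
    title    = "Fuel Efficiency vs Weight",
    subtitle = "From the built-in mtcars dataset",
    caption  = "Source: 1974 Motor Trend US magazine",
    x        = "Weight (1000 lbs)",
    y        = "Miles per Gallon",
    color    = "Cylinders"   # relabels the legend for the color aesthetic
  )

labeled
```

Putting each label on its own line is optional, but it keeps long labs() calls easier to scan.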

Instructions

1.The labels on the plot we’ve been building could definitely use an update!

Add a labs() function call and change the following arguments:

1.Change the title to “Movie Ratings Vs Award Wins” to contextualize the goal of the plot

2.Contextualize details about where the data comes from inside the subtitle by adding “From IMDB dataset”

3.Change the x label to “Movie Rating” and the y label to “Number of Award Wins”

4.Change the legend label by providing a color argument with the string value of “Number of Genre”

# Add labels as specified
viz <- ggplot(data=movies, aes(x=imdbRating, y=nrOfWins)) +
       geom_point(aes(color=nrOfGenre), alpha=0.5) +
       labs(title = "Movie Ratings Vs Award Wins",subtitle = "From IMDB dataset", x = "Movie Rating", y = "Number of Award Wins", color = "Number of Genre")


# Prints the plot
viz

10.Extending The Grammar

We’ve gone over each of the basic units in the grammar of graphics: data, geometries, and aesthetics. Let’s extend this new knowledge to create a new type of plot: the bar chart. Bar charts are great for showing the distribution of categorical data. Typically, one of the axes on a bar chart will have numerical values and the other will have the names of the different categories you wish to understand.

Let’s build a bar chart by using some of the R built-in datasets. These are data frames that you can readily access in your code to explore and create visualizations. They are handy because these built-in datasets usually include nicely distributed categorical data.

The geom_bar() layer adds a bar chart to the canvas. Typically when creating a bar chart, you assign an aes() aesthetic mapping with a single categorical variable on the x axis, and geom_bar() will compute the count for each category and display the count values on the y axis.

Since we’re extending the grammar of graphics, let’s also learn about how to save our visuals as local image files.

The following code maps the count of each category in the Language column in a dataset of 100 popular books to a bar length and then saves the visualization as a .png file named “bar-example.png”:

bar <- ggplot(books, aes(x=Language)) + geom_bar()
bar
ggsave("bar-example.png")

Note: The ggsave() function allows you to save visualizations as a local file with the name of your choice. It’s a useful function when developing visualizations locally.

The code above outputs the following plot:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Extending The Grammar.png")
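The books dataset above is not bundled with R, so the sketch below repeats the pattern on the mpg dataset that ships with ggplot2. It checks that the bar heights match a plain table() count and saves the plot to a temporary file. The output path, the width and height values, and the names bar_counts and out_file are assumptions made for this example.

```r
library(ggplot2)

# Only x is mapped; geom_bar() counts the rows in each class
bar <- ggplot(mpg, aes(x = class)) + geom_bar()

# The computed bar heights, one per vehicle class
bar_counts <- layer_data(bar, 1)$count

# The same counts, computed directly from the data for comparison
table(mpg$class)

# Save to a temporary location; passing plot = bar makes the target explicit
out_file <- file.path(tempdir(), "bar-example.png")
ggsave(out_file, plot = bar, width = 6, height = 4)
```

When plot is omitted, ggsave() saves the last plot that was displayed, which is why the shorter call in the lesson works right after printing bar.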

Instructions

1.The mpg dataset is a built-in dataset that ships with the ggplot2 package. It describes fuel economy data from 1999 and 2008 for 38 popular models of cars.

Inspect the built-in dataset mpg by printing its head(). Take special note of the class column which describes vehicle class for the cars with a total of 7 types (compact, SUV, minivan etc.)

# Inspect the mpg builtin dataset
library(ggplot2)
head(mpg)

2.Create a variable bar that is equal to a ggplot() object with the mpg built-in dataset associated as its data argument.

#Create a bar chart
bar <- ggplot(mpg)

3.We want to understand the breakdown of the types of vehicles in the dataset, so provide the canvas, or the ggplot() object, with an aesthetic mapping aes() that makes the x axis represent the categorical values of the class column in the dataframe. Once a geom_bar() layer is added, ggplot2 will count the rows for each unique value in the class column and automagically plot that count on the y axis.

#Create a bar chart
bar <- ggplot(mpg, aes(x = class))
bar

4.Add a geom_bar() layer to bar. Be sure to type bar after you’ve declared the variable and added the layer so that the plot can render in your R notebook output.

#Create a bar chart
bar <- ggplot(mpg, aes(x = class)) + geom_bar()
bar

5.Let’s add some color to the bar chart by adding an aes() aesthetic mapping to the geom_bar() layer that fills each bar with a color based on the class value.

#Create a bar chart
bar <- ggplot(mpg, aes(x = class)) + geom_bar(aes(fill = class))
bar

6.Our plot could use some context. Let’s add a title and a subtitle so that users can understand more about what we are displaying with this bar chart and the mpg dataset.

Use the labs() function to assign a new title stating that this plot illustrates the Types of Vehicles and a subtitle describing the data as From fuel economy data for popular car models (1999-2008)

#Create a bar chart
bar <- ggplot(mpg, aes(x = class)) + geom_bar(aes(fill = class)) + labs(title = "Types of Vehicles", subtitle = "From fuel economy data for popular car models (1999-2008)")
bar

11.Review

You’ve completed the introduction to ggplot lesson! You’re ready to follow the general pattern for creating a visualization:

1.Determine what relationship you wish to explore in your data

2.Find the right geom(s) in the ggplot2 documentation to display that relationship and read about the arguments and aesthetics specific to that geom

3.Extend the grammar of graphics to follow the pattern learned in this lesson to add layers and create a visualization. Improve graph legibility by polishing labels and styles.

Some of the key concepts you learned in this lesson include:

1.The basic units of grammar include data, geoms, and aesthetics.

2.Calling the ggplot() function with a dataframe creates a ggplot object, known as the canvas, with that data associated to it.

3.The geometries or geoms are the shapes that display the data. Geometries become layers as you add them to your ggplot object.

4.The aesthetics are visual instructions you provide the plot. Aesthetics can be inherited or specified at the geom level.

5.Aesthetic mappings are data-driven visual instructions for the plot.

6.You can add context to your plot by customizing its labels with the labs() function

Instructions

Feel free to customize the plot you just created by modifying labels or assigning new manual or mapped aesthetics. What other relationships in the mpg data could you display and what geoms could you use to show them? Continue when you’re ready!

#Create a bar chart
bar <- ggplot(mpg, aes(x = manufacturer)) + geom_bar(aes(fill = class)) + labs(title = "Types & Manufacturer of Vehicles", subtitle = "From fuel economy data for popular car models (1999-2008)")
bar
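As one possible starting point for this exploration, the sketch below switches to a scatterplot of two numeric columns in mpg, engine displacement (displ) and highway miles per gallon (hwy), and combines a mapped aesthetic, a manual aesthetic, and custom labels. The label wording and the name scatter are choices made for this example.

```r
library(ggplot2)

scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Mapped aesthetic: color by vehicle class; manual aesthetic: fixed opacity
  geom_point(aes(color = class), alpha = 0.6) +
  labs(
    title    = "Engine Size vs Highway Fuel Economy",
    subtitle = "From fuel economy data for popular car models (1999-2008)",
    x        = "Engine Displacement (litres)",
    y        = "Highway Miles per Gallon",
    color    = "Vehicle Class"
  )

scatter
```

Every piece here follows the pattern from the lesson: data and shared mappings on the canvas, a geom layer with its own aesthetics, and labels added last.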