Hello there! Today, we will be learning how to make visualizations with R’s popular ggplot2 package. You can create simple visualizations with base R, but we are choosing to teach you ggplot2 because it is relatively easy to implement, well-documented, and supported by a community of data enthusiasts.
The ggplot2 package is part of the tidyverse, a collection of R packages that also includes tools for cleaning and processing data. The package is designed with the grammar of graphics in mind. This grammar describes a general pattern to follow when you’re creating a visualization. In fact, that’s where the “gg” in ggplot comes from: grammar of graphics. ggplot2 has fostered a large programming community, so as you venture into making your own plots outside of the Codecademy platform, you’ll find lots of resources and examples to welcome you.
This lesson will teach you the basic grammar required to create a plot. After you learn the underlying structure or “philosophy” of ggplot2, you can extend the logic to create many types of plots. Once you get the basic structure down, our upcoming lessons will explore how to customize your plot and calculate statistics in your visualization. Let’s get started.
Take a look at the image. Observe how ggplot layers elements of the plot to create a final visualization. All plots start as a blank canvas with data associated with them. Layers of geometries, labels, and scales are then added to display the information.
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Intro.png.png")
When you learn grammar in school, you learn about the basic units used to construct a sentence. The basic units in the “grammar of graphics” consist of:
1. The data, or the actual information you wish to visualize.
2. The geometries, shortened to “geoms”, which describe the shapes that represent our data: dots on a scatter plot, bars on a bar chart, a line tracing the data, and so on. Geoms are the shapes that “map” our data.
3. The aesthetics, or the visual attributes of the plot, including the scales on the axes, the color, the fill, and other attributes concerning appearance.
Another key component to understand is that in ggplot2, geoms are “added” as layers to the original canvas, which is just an empty plot with data associated with it.
Once you learn these three basic grammatical units, you can create the equivalent of a basic sentence, or a basic plot. There are more units in the “grammar of graphics,” but in this lesson we’ll mostly be learning about these three.
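As a minimal sketch of how the three units fit together, the snippet below uses R’s built-in mtcars data frame as a stand-in for the lesson’s movie data. That substitution, and the variable name units_plot, are assumptions made only so the example runs without a CSV file.

```r
library(ggplot2)

# 1. Data: the built-in mtcars data frame (a stand-in for the lesson's data)
# 2. Aesthetics: map car weight to the x axis and miles per gallon to the y axis
# 3. Geometry: draw one point per row of the data
units_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Stating the variable name renders the plot
units_plot
```

Swapping geom_point() for another geom changes the shapes on the plot while the data and aesthetics stay the same.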
Take a look at the code that generates the plot. By the end of this lesson, you will understand what every single line is doing! For now, focus on the plus signs: each one adds a layer to the plot.
# load libraries and data
library(readr)
library(dplyr)
library(ggplot2)
movies <- read_csv("imdb.csv")
movies
# Observe layers being added with the + sign
viz <- ggplot(data=movies, aes(x=imdbRating, y=nrOfWins)) +
  geom_point(aes(color=nrOfGenre), alpha=0.5) +
  labs(title="Movie Ratings Vs Award Wins", subtitle="From IMDB dataset",
       y="Number of Award Wins", x="Movie Rating", color="Number of Genre")
# Prints the plot
viz
The first thing you’ll need to do to create a ggplot object is invoke the ggplot() function. Conceptualize this step as initializing the “canvas” of the visualization. In this step, it’s also standard to associate the data frame the rest of the visualization will use with the canvas. What do we mean by “the rest” of the visualization? We mean all the layers you’ll add as you build out your plot.
As we mentioned, at its heart, a ggplot visualization is a combination of layers that each display information or add style to the final graph. You “add” these layers to a starting canvas, or ggplot object, with a + sign. We’ll add geometries and aesthetics in the next exercises. For now, let’s stop to understand that any arguments inside the ggplot() function call are inherited by the rest of the layers on the plot.
Here we invoke ggplot() to create a ggplot object and associate the data frame df with it, saving the result in a variable named viz:
df <- read_csv("imdb.csv")
## New names:
## • `` -> `...1`
## Rows: 290 Columns: 45
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): fn, tid, title, wordsInTitle, url, type
## dbl (39): ...1, imdbRating, ratingCount, duration, year, nrOfWins, nrOfNomin...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
viz <- ggplot(data=df)
viz
Note: The code above assigns the canvas to viz and then states the variable name viz on its own line so that the visualization is rendered in the notebook.
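The same store-then-print pattern can be sketched with R’s built-in mtcars data frame, used here as an assumed stand-in for df. Stating the variable name works at the console and in notebook chunks. Inside a function, where R does not print values automatically, an explicit print() call does the same job. The names canvas and show_plot are hypothetical.

```r
library(ggplot2)

# Create the canvas and store it; nothing is drawn at this point
canvas <- ggplot(data = mtcars)

# Stating the variable name at the top level renders the (empty) canvas
canvas

# Inside a function, values are not printed automatically,
# so print() is called explicitly to render the plot
show_plot <- function(p) {
  print(p)
  invisible(p)
}

show_plot(canvas)
```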
Any layers we add to viz will have access to the data frame. We mentioned the idea of aesthetics before. It’s important to understand that any aesthetics you assign as ggplot() arguments will also be inherited by other layers. We’ll explore what this means in more depth later, but for now, it’s enough to remember that arguments defined inside ggplot() are inherited by the layers added to it.
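One way to see this inheritance is to inspect what the plot object and its layers store. The sketch below again assumes R’s built-in mtcars data frame in place of df, and the names base and inherited are hypothetical.

```r
library(ggplot2)

# Data and aesthetics set in ggplot() live at the canvas level
base <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# This layer names no data and no x or y of its own, so it inherits both;
# the color mapping is added for this layer only
inherited <- base + geom_point(aes(color = factor(cyl)))

# Aesthetics defined at the canvas level are stored on the plot object
names(base$mapping)

# The layer stores only the aesthetic it added itself
names(inherited$layers[[1]]$mapping)

inherited
```

Because x and y are defined once in ggplot(), any further layer added to base would use them without repeating the aes() call.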
Take a look at the animated diagram.
1. Notice how a ggplot object is initially created as a blank canvas. The aesthetics are then set inside the ggplot() function as an argument.
2. You could set the scales individually for each layer, but if all layers will use the same scales, it makes sense to set those arguments at the ggplot() “canvas” level.
3. When two subsequent layers are added, a scatter plot and a line of best fit, both of those layers are mapped onto the canvas using the same data and scales.
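The steps in the diagram can be sketched in code as follows. This is an illustration under assumptions, not the diagram’s own code: it uses R’s built-in mtcars data frame, and it draws the line of best fit with geom_smooth(method = "lm"), which is one common way to do so.

```r
library(ggplot2)

# Canvas level: the data and the x/y aesthetics are set once
two_layers <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  # Layer 1: scatter plot, inheriting data and aesthetics from the canvas
  geom_point() +
  # Layer 2: linear line of best fit, inheriting the same data and aesthetics
  geom_smooth(method = "lm", formula = y ~ x)

two_layers
```

Neither layer repeats the data or the aes() call, yet both are drawn against the same axes.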