The structure of your data dictates how you construct plots in ggplot2. In this chapter, you’ll explore the iris dataset from several different perspectives to illustrate this concept. Through several R data visualization examples, you’ll see that shaping your data to match the plot you have in mind makes visualization much easier.
These courses are about understanding data visualization in the context of the grammar of graphics. To gain a better appreciation of ggplot2 and to understand how it operates differently from the base package, it’s useful to make some comparisons.
In the video, you already saw one example of how to make a (poor) multivariate plot in the base package. In this series of exercises you’ll take a look at a better way, using the equivalent version in ggplot2.
First, let’s focus on the base package. You want to make a plot of mpg (miles per gallon) against wt (weight in thousands of pounds) in the mtcars data frame, but this time you want the dots colored according to the number of cylinders, cyl. How would you do that in the base package? You can use a little trick to color the dots by specifying a factor variable as a color. This works because factors are just a special class of the integer type.
Instructions:
1. Using the base package plot(), make a scatter plot with mtcars$wt on the x-axis and mtcars$mpg on the y-axis, colored according to mtcars$cyl (use the col argument). You can specify data = but you’ll just do it the long way here.
2. Add a new column, fcyl, to the mtcars data frame. This should be cyl converted to a factor.
3. Create a similar plot to instruction 1, but this time, use fcyl (which is cyl as a factor) to set the col.
Hint: plot() takes three arguments: mtcars$wt, mtcars$mpg, and col, which is set to mtcars$cyl. You can use as.factor() to define a factor variable. To solve the third instruction, use col = mtcars$fcyl.
It’s all about that base! Recall that under the hood, factors are simply integer type vectors, so the second plot uses colors 1, 2, and 3 (the factor’s integer codes), whereas the first plot used colors 4, 6, and 8 (the raw cyl values).
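As a minimal sketch of the three instructions above, assuming only the built-in mtcars data frame, the two plots and the new column could be written as follows. The final line is an extra check, not part of the exercise, that shows the integer codes behind the factor.

```r
# Plot 1: colour the points by the raw cyl values (colours 4, 6 and 8)
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)

# Add fcyl: cyl converted to a factor
mtcars$fcyl <- as.factor(mtcars$cyl)

# Plot 2: colour by the factor, so base R uses its integer codes
plot(mtcars$wt, mtcars$mpg, col = mtcars$fcyl)

# Extra check (not part of the exercise): the integer codes behind fcyl
sort(unique(as.integer(mtcars$fcyl)))
```

The two plots show the same points; only the colours differ, because the factor replaces the values 4, 6 and 8 with its level codes.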
If you want to add a linear model to your plot, you can define it with lm() and then plot the resulting linear model with abline(). However, if you want a model for each subgroup, according to cylinders, then you have a couple of options.
You can subset your data, and then calculate the lm() and plot each subset separately. Alternatively, you can vectorize over the cyl variable using lapply() and combine this all in one step. This option is already prepared for you.
The provided code contains a call to the function lapply(), which you might not have seen before. This function takes as input a vector and a function. Then lapply() applies the function it was given to each element of the vector and returns the results in a list. In this case, lapply() takes each element of mtcars$cyl and calls the function defined in the second argument. This function takes a value of mtcars$cyl and then subsets the data so that only rows with cyl == x are used. Then it fits a linear model to the filtered dataset and uses that model to add a line to the plot with the abline() function.
Now that you have an interesting plot, there is a very important aspect missing - the legend!
In the base package you have to take care of this yourself using the legend() function. This has been done for you in the predefined code.
Instructions:
1. Fill in the lm() function to calculate a linear model of mpg described by wt and save it as an object called carModel.
2. Draw the linear model on the scatterplot. Write code that calls abline() with carModel as the first argument. Set the line type by passing the argument lty = 2. Run the code that generates the basic plot and the call to abline() all at once by highlighting both parts of the script and hitting control/command + enter on your keyboard. These lines must all be run together in the DataCamp R console so that R will be able to find the plot you want to add a line to.
3. Run the code already given to generate the plot with a different model for each group. You don’t need to modify any of this.
Hint: You can use lm() as follows: lm(var1 ~ var2, data = dataset). Fill in the correct variables (mpg and wt) and dataset (mtcars). abline() takes a linear model object (i.e. the output of lm()) as the first argument. Other graphical parameters can also be set, e.g. lty = 2 for a dashed line.
Phew! Notice how the legend had to be set manually. In general, ggplot2 makes it easier to polish plots compared to base graphics.
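The base-package workflow described above can be sketched roughly as follows. This is not the exercise's predefined code: it loops over the unique cylinder values rather than every element of mtcars$cyl, so each group is fitted only once, and cyl_values is a hypothetical helper name. The legend position is also an assumption.

```r
# Fit a single linear model of mpg described by wt
carModel <- lm(mpg ~ wt, data = mtcars)

# Basic plot plus the overall model as a dashed line
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
abline(carModel, lty = 2)

# Assumption: loop over the unique cyl values so each group is fitted once
cyl_values <- sort(unique(mtcars$cyl))  # hypothetical helper name

# New plot with one linear model per cylinder group
plot(mtcars$wt, mtcars$mpg, col = mtcars$cyl)
invisible(lapply(cyl_values, function(x) {
  # Keep only the rows with cyl == x, fit the model, and draw its line
  abline(lm(mpg ~ wt, data = mtcars, subset = (cyl == x)), col = x)
}))

# Base graphics does not draw a legend automatically, so add it by hand
legend("topright", legend = cyl_values, col = cyl_values, pch = 1, bty = "n")
```

Wrapping lapply() in invisible() only suppresses the printed list of return values; the lines are drawn as a side effect of abline().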
In this exercise you’ll recreate the base package plot in ggplot2.
The code for base R plotting is given at the top. The first line of code already converts the cyl variable of mtcars to a factor.
Instructions:
Plot 1: add geom_point() in order to make a scatter plot.
Plot 2: copy and paste Plot 1.
Add a linear model for each subset according to cyl by adding a geom_smooth() layer.
Inside this geom_smooth(), set method to "lm" and se to FALSE.
Note: geom_smooth() will automatically draw a line per cyl subset. It recognizes the groups you want to identify by color in the aes() call within the ggplot() command.
Plot 3: copy and paste Plot 2.
Plot a linear model for the entire dataset by adding another geom_smooth() layer.
Set the group aesthetic inside this geom_smooth() layer to 1. This has to be set within the aes() function.
Set method to "lm", se to FALSE and linetype to 2. These have to be set outside the aes() of the geom_smooth().
Note: the group aesthetic will tell ggplot() to draw a single linear model through all the points.
Hint: For Plot 1, you have to add geom_point() to the ggplot() command without any arguments. Use +.
For Plot 2, you should expand the previous ggplot() command with a geom_smooth() layer. The arguments you have to set are described in the instructions. You don’t have to add anything to draw a line per cyl group; ggplot() will do this automatically, as described in the instructions.
For Plot 3, expand the previous ggplot() command with another geom_smooth() layer. This time the first argument should be an aes() mapping with group = 1. Don’t forget to set the linetype as defined in the instructions. This is an argument of geom_smooth(), not of aes(). method and se should be set as in the previous command.
Peachy plotting! Plots like this are actually much easier to implement in ggplot2 than in the base package!
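The three plots described in the instructions above might look like the following sketch. It assumes the ggplot2 package is installed; the object names p1, p2 and p3 are hypothetical and are used only to build each plot on top of the previous one instead of copying and pasting.

```r
library(ggplot2)

# Convert cyl to a factor so that it maps to discrete colours
mtcars$cyl <- as.factor(mtcars$cyl)

# Plot 1: scatter plot coloured by cyl
p1 <- ggplot(mtcars, aes(x = wt, y = mpg, col = cyl)) +
  geom_point()

# Plot 2: one linear model per cyl group (grouping follows the colour aesthetic)
p2 <- p1 +
  geom_smooth(method = "lm", se = FALSE)

# Plot 3: a dashed linear model through all points (group = 1 inside aes())
p3 <- p2 +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, linetype = 2)

p3
```

Note that the legend appears without any extra code, in contrast to the base-package version.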
ggplot2 has become very popular and for many people it’s the go-to plotting package in R. Which of these statements about ggplot2 is most accurate?
Possible answers:
1. ggplot2 creates plotting objects, which can be manipulated.
2. ggplot2 takes care of a lot of the leg work for you, such as choosing nice color palettes and making legends.
3. ggplot2 is built upon the grammar of graphics plotting philosophy, making it more flexible and intuitive for understanding the relationship between your visuals and your data.
4. Options 1, 2, and 3.
5. ggplot2 is effectively a replacement for all base-package plotting functions.
Hint: ggplot2 is an expansion of base R in many ways, but that doesn’t mean it’s a replacement!
Yes! ggplot2 has all of these advantages and you’ll get familiar with them throughout the courses.
In the video, Rick showed you different ggplot2 calls to plot two groups of data onto the same plot:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() + geom_point(aes(x = Petal.Length, y = Petal.Width), col = "red")
ggplot(iris.wide, aes(x = Length, y = Width, col = Part)) + geom_point()
Which one is preferable? Both iris and iris.wide are available in the workspace, so you can experiment in the R Console straight away!
Possible answers:
1. Option 1.
2. Option 2.
3. Both are equally preferable.
Correct! You’re starting to grasp the ggplot2 philosophy; that’s great!
So far you’ve seen four different forms of the iris dataset: iris, iris.wide, iris.wide2 and iris.tidy. Don’t let all these different forms confuse you! It’s exactly the same data, just rearranged so that your plotting functions become easier.
To see this in action, consider the plot in the graphics device at right. Which form of the dataset would be the most appropriate to use here?
Instructions:
1. Look at the structures of iris, iris.wide and iris.tidy using str().
2. Fill in the ggplot function with the appropriate data frame and variable names. The variable names of the aesthetics of the plot will match the ones you found using the str() command in the previous step.
Hint: To check the structure of the datasets, use str(). Use the iris.tidy dataset. For the x, y and color aesthetics, take a look at the labels and legend of the plot and use these variables. Measure is used for the facet layer.
Tidy work! ggplot2 always wants one measurement per row of the data frame.
In the last exercise you saw how iris.tidy was used to make a specific plot. It’s important to know how to rearrange your data in this way so that your plotting functions become easier. In this exercise you’ll use functions from the tidyr package to convert iris to iris.tidy.
The resulting iris.tidy data should look as follows:
Species Part Measure Value
1 setosa Sepal Length 5.1
2 setosa Sepal Length 4.9
3 setosa Sepal Length 4.7
4 setosa Sepal Length 4.6
5 setosa Sepal Length 5.0
6 setosa Sepal Length 5.4
...
You can have a look at the iris dataset by typing head(iris) in the console.
Note: If you’re not familiar with %>%, gather() and separate(), you may want to take the Cleaning Data in R course. In a nutshell, a dataset is called tidy when every row is an observation and every column is a variable. The gather() function moves information from the columns to the rows. It takes multiple columns and gathers them into a single column by adding rows. The separate() function splits one column into two or more columns according to a pattern you define. Lastly, the %>% (or “pipe”) operator passes the result of the left-hand side as the first argument of the function on the right-hand side.
Instructions:
You’ll use two functions from the tidyr package:
1. gather() rearranges the data frame by specifying the columns that are categorical variables with a - notation. Complete the command. Notice that only one variable is categorical in iris.
2. separate() splits up the new key column, which contains the former headers, at the period (.). The new column names "Part" and "Measure" are given in a character vector. Don’t forget the quotes.
Hint: For the first instruction, you have one categorical variable: Species. You want to place this after the -. You created a key column with the gather() command. This column contains values like Sepal.Width. You want to separate this into the "Part" (Sepal in this case) and the "Measure" (Width in this case). Fill in "Part" and "Measure" in the correct place in order to achieve this.
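A minimal sketch of the reshaping step, assuming the tidyr and ggplot2 packages are installed. The key column is simply called key here, and the aesthetics in the plot (Species on x, Value on y, Part as color) along with the jittered points and the name p_tidy are assumptions for illustration; only the use of Measure for the facets comes from the hint in the earlier exercise.

```r
library(tidyr)
library(ggplot2)

iris.tidy <- iris %>%
  gather(key, Value, -Species) %>%            # gather every column except Species
  separate(key, c("Part", "Measure"), "\\.")  # split e.g. "Sepal.Length" at the period

# One measurement per row: Species, Part, Measure, Value
head(iris.tidy)

# Assumed aesthetics (hypothetical mapping); Measure drives the facets
p_tidy <- ggplot(iris.tidy, aes(x = Species, y = Value, col = Part)) +
  geom_jitter() +
  facet_grid(. ~ Measure)
p_tidy
```

The separator is written as "\\." because separate() treats it as a regular expression, where an unescaped period would match any character.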
Here you’ll take a look at another plot variant, shown on the right. Which of your data frames would be used to produce this plot?
Instructions:
1. Look at the heads of iris, iris.wide and iris.tidy using head().
2. Fill in the ggplot function with the appropriate data frame and variable names. The names of the aesthetics of the plot will match the variable names in your dataset. The previous instruction will help you match variable names in datasets with the ones in the plot.
Hint: To check the head of the datasets, use head(). Use the iris.wide dataset. For the x, y and color aesthetics, take a look at the labels and legend of the plot and use these variables. Species is used for the facet layer.
Marvelous! In this case “one measurement per row” means something different to the previous exercises.
In the last exercise you saw how iris.wide was used to make a specific plot. You also saw previously how you can derive iris.tidy from iris. Now you’ll move on to produce iris.wide.
The head of iris.wide should look like this in the end:
Species Part Length Width
1 setosa Petal 1.4 0.2
2 setosa Petal 1.4 0.2
3 setosa Petal 1.3 0.2
4 setosa Petal 1.5 0.2
5 setosa Petal 1.4 0.2
6 setosa Petal 1.7 0.4
...
You can have a look at the iris dataset by typing head(iris) in the console.
Instructions:
1. Before you begin, you need to add a new column called Flower that contains a unique identifier for each row in the data frame. This is because you’ll rearrange the data frame afterwards and you need to keep track of which row, or which specific flower, each value came from. It’s done for you, no need to add anything yourself.
2. gather() rearranges the data frame by specifying the columns that are categorical variables with a - notation. In this case, Species and Flower are categorical. Complete the command.
3. separate() splits up the new key column, which contains the former headers, at the period (.). The new column names "Part" and "Measure" are given in a character vector.
4. The last step is to use spread() to distribute the new Measure column and associated value column into two columns.
Hint: The first instruction is programmed for you! For the second instruction, you have two categorical variables: Species and Flower. You want to place these after the - signs. You created a key column with the gather() command. This column contains values like Sepal.Width. You want to separate this into the "Part" (Sepal in this case) and the "Measure" (Width in this case). Fill in "Part" and "Measure" in the correct place in order to achieve this. You created a value column with gather(); for the last instruction you want to distribute this value column over the different possibilities of Measure.
Top tidying! The tidyr package is excellent for preparing your datasets to use with ggplot2.
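The four steps for building iris.wide can be sketched as below, again assuming tidyr and ggplot2 are installed. The column names key and value, the use of geom_point(), and the object name p_wide are assumptions for illustration. The result of this sketch keeps the Flower column and is ordered by Species, Flower and Part, so its head() will not necessarily match the listing shown earlier row for row.

```r
library(tidyr)
library(ggplot2)

# Step 1: a unique identifier per row, so each value can be traced to its flower
iris$Flower <- seq_len(nrow(iris))

iris.wide <- iris %>%
  gather(key, value, -Species, -Flower) %>%       # Step 2: gather the four measurement columns
  separate(key, c("Part", "Measure"), "\\.") %>%  # Step 3: split "Sepal.Length" into Part and Measure
  spread(Measure, value)                          # Step 4: Length and Width become columns

head(iris.wide)

# Length against Width, coloured by Part, with Species used for the facets
p_wide <- ggplot(iris.wide, aes(x = Length, y = Width, col = Part)) +
  geom_point() +
  facet_grid(. ~ Species)
p_wide
```

Without the Flower column, spread() could not tell which Length belongs with which Width, because several rows would share the same Species and Part.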