Today, we are going to refresh some concepts in R. We will also start to explore ggplot2, the library of functions we will be working with all semester to help us visualize data.
Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.
Once we have opened up an RMarkdown file, the first thing we need to do is load the data. We will access data from a variety of sources in this course, but today we will be loading our data from the internet. R has some commands that allow us to do that, and we will explore those first.
To start, we need to install a package, since the data set we will be using today is contained inside an R package. A package is a collection of related functions and, sometimes, data sets. When you load a package into R, you give R access to all of the functions within the package. We will be using several packages as we move through this course.
To install a package, look at the top of your RStudio screen and find "Tools". From the drop down menu, choose "Install Packages". The next step is to tell R exactly what package you want to use. In the white box, type palmerpenguins, and then hit install. Palmerpenguins is the name of the package where the data for our lab today is stored.
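The menu steps above can also be done with code typed into the Console. The sketch below is one way to do it. Note that install_if_missing is a hypothetical helper name (it is not part of R), and the repos address assumes the standard CRAN cloud mirror is reachable from your computer.

```r
# Hypothetical helper: install a package only if R does not already have it.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Assumption: the CRAN cloud mirror is reachable from your computer.
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
  # Report (invisibly) whether the package is now available.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

install_if_missing("palmerpenguins")
```

Because of the check inside the helper, running it a second time does nothing if the package is already installed.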
Now that we have the package installed, it is time to actually get the data from that package. To do that, we need to create a space in our RMarkdown file to create code. Remember, Markdown can do three things: (1) create regular text, (2) run code, and (3) create math equations. Right now, we want (2).
To tell RMarkdown we are about to give it code, we need to create a Code chunk, or chunk for short. Look at the top of your Markdown file, and find Insert. Click it. From the drop-down menu, choose R. Click it! A gray box should appear in your Markdown file. This is a chunk.
Anything we put inside a chunk will be treated as computer code. Let's put in the code to load the data now.
library(palmerpenguins)
data("penguins")
Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button.
Now, look on the upper right hand panel of your RStudio screen. See how you now have a data set called penguins? Great! We are ready to go!
Now that you have loaded the data, you are ready to actually start the lab. Hit enter (return on a Mac) to leave one line of blank space between your code chunk where we loaded the penguins data and what you are about to type. This space between text and code chunks is necessary in order for your document to format properly.
You are going to create a section header for the first lab question. Type two pound signs (##) on line 17, put a space, and then type Question 1. In other words, you should have ## Question 1. This creates a new section in your Markdown called Question 1. Go ahead and knit; see that you have created a section.
You can create as many sections as you like. Under a section header, you have the ability to type, just like you would in a word document, so you can type the responses to questions. You can also insert a code chunk and run code. Remember, a new chunk can be inserted by clicking on the Insert button and choosing R.
Knit your document now, and make sure that so far, your document contains one section, Question 1. We are now ready to begin the lab.
As a reminder, to load the data you should have the following two lines of code in a chunk at the top of your RMarkdown file:
library(palmerpenguins)
data("penguins")
This data set contains information on n = 344 penguins from three different species. We have a client who is interested in building a model for Y = the body mass of a penguin in grams.
In addition to this response variable, we have information on 7 features:
species - the type of penguin.
island - the island where the penguin lives.
bill_length_mm - the length of the penguin bill in millimeters.
bill_depth_mm - the depth of the penguin bill in millimeters.
flipper_length_mm - the flipper length of the penguin in millimeters.
sex - the biological sex of the penguin.
year - the year the penguin was measured.
Client 1 is interested in determining the relationship between the flipper length and the body mass of a penguin.
Client 2 would like us to build a model that can be used to estimate the body mass of penguins once other certain characteristics, such as flipper length, are measured.
Now, we could use one model to try and suit the needs of both clients, or we could build two separate models, one designed for each task. As statisticians, it is our job to determine what is appropriate, and what is possible with our resources. More on that later.
Once we have the data, and have information on the goal of the analysis, the next step is to start to look at the data itself. Along with the client goals, the characteristics of the data will help inform our choice of model. We refer to the process of exploring the data as exploratory data analysis, or EDA. This means that we use graphs, charts, tables, and other such approaches to explore the data. We will start with a numerical summary of our response variable.
In R, to tell the computer we wish to work with a single column in a data set (like body_mass_g), we use $. So, to tell the computer "Please go into the penguins data set and look at only the body mass column", we use penguins$body_mass_g. This means that to get a summary of that single column, we use:
summary(penguins$body_mass_g)
If you look at your summary, you will notice one small problem with our data set that needs to be addressed before we begin modeling. We have some penguins that are missing information on the response variable, body mass. When we fit most models, we need each row in the data to be complete, meaning we need to know X and Y for every row.
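To see what missing values look like in a summary, here is a small sketch using a made-up vector. The name mass and its values are invented for illustration and are not taken from the penguins data.

```r
# Made-up body masses in grams; NA marks a missing measurement.
mass <- c(3750, 3800, NA, 3450)

summary(mass)               # the summary reports the number of NA's
sum(is.na(mass))            # count the missing values
mean(mass)                  # NA: one missing value makes the mean missing
mean(mass, na.rm = TRUE)    # mean of the three observed values only
```

The same commands work on a column of a data set, such as penguins$body_mass_g.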
There are many ways of handling missing data, and in this course we are going to explore several. However, we are not quite there yet, so we are stuck with only one option - removing the rows that contain missing data. To do this, run the following code:
penguinsClean <- na.omit(penguins)
The command na.omit removes every row of the penguins data set that has a missing value in any column, not just in body mass. We then store the new data set (<-) under the name penguinsClean. This new data set does not contain any missing data.
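A small sketch of what na.omit does, using a made-up data set so the effect is easy to see. The names toy and toyClean and all of the values are invented for illustration.

```r
# Made-up data: two of the four rows have a missing value somewhere.
toy <- data.frame(
  flipper = c(181, 186, NA, 193),
  mass    = c(3750, NA, 3250, 3450)
)

colSums(is.na(toy))        # number of missing values in each column
toyClean <- na.omit(toy)   # keep only the complete rows

nrow(toy)                  # number of rows before cleaning
nrow(toyClean)             # number of rows after cleaning
```

Notice that a row is dropped if it is missing a value in either column, which is why removing rows can cost us more data than we might expect.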
How many penguins are in the original penguins data set? How many penguins are we left with in the penguinsClean data set?
Note: We will use the penguinsClean data set for the rest of this lab.
While removing the rows is an easy way to handle missing data, we will learn that it has disadvantages. More on that later!
The process of handling missing data, creating features, subsetting the data, etc., is called data cleaning. We will see this process in action as we work with data in this course.
Now that we have handled our missing data, let's start adding some graphs to our EDA. If you have already installed ggplot2, you can skip the next section.
To make visualizations, we will be working with a group of functions in the ggplot2 package in R.
The package ggplot2 is a powerful collection of R functions for creating flexible, professional graphics. The first step to using ggplot2 is to install the ggplot2 package. Go to the top of your RStudio window and find "Tools". From there, click on "Install Packages." In the blank box, type in ggplot2, and hit install. The computer should automatically begin to load in the packages that you need, but this may take a minute.
Now, some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, copy and paste the following into a chunk and hit play. Nothing will seem to happen, and that's okay.
library(rlang)
Note that this process of installing a package is one you need to do only once. Think of this as teaching R a new set of skills. Once it knows the skills, you don't have to teach it again.
Once you have installed the ggplot2 package, you need to tell R that you would like to begin using its functions by loading the library. Remember that we said installing a package is like teaching R a skill? Loading a library is how we tell R we want it to use those skills. To do this, create a chunk in your RMarkdown, copy and paste the following, and hit play.
suppressMessages(library(ggplot2))
Note that loading a library is something you have to do once each time you start a lab or project. This tells R "Hey, remember those skills we taught you? Use them."
You are now ready to begin EDA.
To start off with, let's create a visual to see the distribution of the response variable. To do this, we could use a histogram, a box plot, or another plot that works for one numeric variable. We're going to start with a histogram.
To create the histogram of penguin body mass, paste the following code in a chunk and hit play.
ggplot(penguinsClean, aes(x=body_mass_g)) + geom_histogram(bins = 20, fill='blue', col = 'black')
The creation of plots in ggplot2 requires building the plot in layers. First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code. Once we have built the background, we are ready to plot our data. The command we will use for this depends upon the data type we are working with. In this case, we want a histogram. The command we use to build a histogram is geom_histogram(). In this same layer, we specify that we want the bars to be filled (fill) in blue and outlined (col) in black. The only other part of the command is bins = 20, which sets the number of bins. If we leave it out, ggplot2 falls back on a default of 30 bins and prints a message suggesting that we choose a value ourselves.
Notice that to add each layer to the graph, in the code we use a plus sign. We add the background AND THEN the bars to make the final graph. Let's break that down in more detail:
ggplot(penguinsClean, aes(x=body_mass_g)): This part of the code creates the background of the plot. The two arguments are the data set we are using (penguinsClean) and the variable(s) that will be used to define the axis/axes (aes). In this case, we defined only that the x-axis would contain information on body mass (aes(x=body_mass_g)).
geom_histogram(bins = 20, fill='blue', col = 'black'): Once the axes are set, we add on (+) the actual data. In this case, we want a histogram, so we add bars (geom_histogram). We also specify that we want 20 bars, and that those bars should be colored blue and outlined in black.
Let's try it.
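One way to see the layers at work is to store the background under a name and then add the bars to it in a second step. This is only a sketch; the names background and histogram are chosen for illustration, and the first three lines repeat the setup from earlier in the lab so the code runs on its own.

```r
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# Layer 1: the background (data set and x-axis), stored under a name.
background <- ggplot(penguinsClean, aes(x = body_mass_g))

# Layer 2: add the bars on top of the background.
histogram <- background +
  geom_histogram(bins = 20, fill = "blue", col = "black")

# Typing the name of a stored plot draws it.
histogram
```

If you type background on its own and press play, you will see the empty grid with no bars, which is exactly the first layer described above.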
We can add on another layer to our plot by adding x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.
ggplot(penguinsClean, aes(x=body_mass_g)) + geom_histogram(fill='blue', col = 'black', bins = 20) + labs(title="Figure 1:", x = "Body Mass (in grams)", y = "Frequency")
This ggplot syntax actually mimics the way a person would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add labels. Thinking through the steps in this manner will help you understand the syntax of this package.
One VERY important thing to remember when we make plots is to make sure the axes are easily interpretable by your reader. You do not want to use default variable names, like "body_mass_g". Instead, we want clear labels like "Body Mass (in grams)". This is going to be a requirement for all graphs you make in this course - label your axes appropriately and title your graphs.
Now let's try a new plot: a box plot.
Instead of geom_histogram, we want geom_boxplot, and box plots do not have bins.
Now that we have our graphs, let's think about what information they yield.
All of these questions can help us determine the type of model we might want to consider.
Recall that Client 1 is particularly interested in exploring the relationship between flipper length in penguins and the body mass in grams (Y). This means that we want to create a graph to examine the relationship between these two variables, and as both are numeric, we need a scatter plot.
In ggplot2, we could create the necessary plot using the following code:
ggplot(penguinsClean, aes(x=flipper_length_mm, y = body_mass_g)) + geom_point()
Just as before, we have two layers. The first draws the background and the second adds on the graph. Here, in the first layer, we specify both the x and y axis of the graph, as we have two variables that we are working with.
(Note that points do not have a fill, but a color.) Title your plot Figure 4, and label the x axis "Flipper Length (in mm)" and the y axis "Body Mass (in grams)".
We can also add a line or curve to a graph. To do this, we add on another layer. The layer is added with the code + stat_smooth(method = "lm", formula = y ~ x, size = 1, se = FALSE). This code will (1) fit a least squares linear regression (LSLR) line and (2) plot it on top of your scatter plot.
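Putting the pieces together, a scatter plot with the fitted line might look like the sketch below. The name scatter is chosen for illustration. We leave out size = 1 here because recent versions of ggplot2 use linewidth rather than size to control the thickness of lines; the plot is otherwise built the same way.

```r
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

scatter <- ggplot(penguinsClean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +                                            # layer 2: the dots
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)   # layer 3: the line

scatter
```

Setting se = FALSE hides the shaded band that stat_smooth would otherwise draw around the line.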
Client 1 now also informs us that when building the model to explore the relationship between flipper length and body mass, it is important to control for bill length, species, and sex. This means that, ideally, we'd like to make a plot to explore the relationship between each of these features and the response. That starts to take up a lot of space in a report. One nice way to present multiple graphs is to stack them.
Suppose I want to make a histogram of body mass. The code I need for that is:
ggplot(penguinsClean, aes(body_mass_g)) + geom_histogram(bins = 20, fill = "blue", col = "white")
If I put that in a chunk and press play, one histogram will appear on my screen. Let's suppose I want to show a box plot along with this histogram. I can create both graphs and have them both print out separately in my knit document. However, I can also tell the computer to print more than one graph at once to save space. Try the following:
g1 <- ggplot(penguinsClean, aes(body_mass_g)) + geom_histogram(bins = 20, fill = "blue", col = "white")+ labs(title ="Figure 1")
g2 <- ggplot(penguinsClean, aes(body_mass_g)) + geom_boxplot(fill = "white", col = "blue") + labs(title ="Figure 2")
gridExtra::grid.arrange(g1,g2, ncol = 2)
If you get a warning message about not having gridExtra, this means you will need to install the package using the same steps we did above to install ggplot2. We have to install packages all the time with R, depending on the version of R you have and your computer system.
What you will notice is that we have stored each of the two graphs under a name. Our histogram is stored under g1 and our box plot is stored under g2. Then, we use a function called gridExtra::grid.arrange to help us arrange the graphs in a grid. In our case, we have two graphs, and we want them side by side. This means we want the graphs in a 1 (row) by 2 (column) grid.
To create the 1 by 2 grid, we feed the computer our two graphs, and then tell it we want the figures in 2 columns (next to each other) by specifying ncol = 2. In other words, the number of columns we want is 2!
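The same idea lets us change the layout. As a sketch, asking for a single column places the two graphs on top of each other instead of side by side. The name stacked is chosen for illustration, and the code assumes gridExtra is installed.

```r
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

g1 <- ggplot(penguinsClean, aes(body_mass_g)) +
  geom_histogram(bins = 20, fill = "blue", col = "white") +
  labs(title = "Figure 1")
g2 <- ggplot(penguinsClean, aes(body_mass_g)) +
  geom_boxplot(fill = "white", col = "blue") +
  labs(title = "Figure 2")

# One column means a 2 (row) by 1 (column) grid.
stacked <- gridExtra::grid.arrange(g1, g2, ncol = 1)
```

With more graphs, the number of columns you choose determines how many rows are needed to fit them all.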
Stack the 4 graphs you would use to explore the relationship between each of the 4 features (flipper length, bill length, species, and sex) and the response (so flipper length vs. body mass, then bill length vs. body mass, etc.). You need to stack the graphs in a 2 x 2 grid.
While we are thinking about professional formatting, let's talk about tables. If we want to know about the species or islands in this data set, do we really want to see hundreds of rows of information on species or islands? Probably not. What we actually want is some sort of summary of the data. For categorical data like this, a table is the most useful summary.
There are several ways to make tables in R, but we will discuss two. The first is very direct. We tell R we want to use the penguins data set and the variable species, by using the code penguins$species (dataset$variable). Then, we use the table(whatWeWantToMakeATableWith) command to actually make the table.
table(penguins$species)
However, this makes a table that is not particularly pretty or professional when you knit. A second option that does create professional tables is:
knitr::kable(table(penguins$species), col.names=c("Species", "Count") )
The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit. See how nicely the table gets formatted?
Okay, so now we can summarize the data by looking at the islands and the species. What if we want to look at them together? In other words, what if I want to know which species of penguins are on which islands? Can I do that?
knitr::kable(table(penguins$species, penguins$island))
Why would this be important? Well, when we are building models, we might need to know whether or not our groups were balanced before choosing a model. What if we had only male penguins from Dream Island, and then tried to fit a model looking at the relationship between penguin sex and bill length on this island? We couldn't do it, because we would only have information on male penguins. This is why taking the time to perform exploratory data analysis, and really dig into the data, is so important.
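As a sketch of how we might check this kind of balance, the code below tables sex against island and then appends totals. The name sexByIsland is chosen for illustration; addmargins is a base R function that adds row and column sums to a table.

```r
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# Rows are sex, columns are islands.
sexByIsland <- table(penguinsClean$sex, penguinsClean$island)

# addmargins() appends row and column totals to the table.
addmargins(sexByIsland)
```

A cell with a count of zero (or close to it) would be a warning sign that a model comparing those groups cannot be fit reliably.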
At this point in the lab, we have explored numeric and categorical variables separately. What if we want to make a plot that shows the relationship among flipper length, species, and body mass, all at once? We could, perhaps, make a scatter plot, and then just use colors to separate out the different types of penguins. To do this, we include a color command in the specification of the axes.
ggplot(penguinsClean, aes(x=flipper_length_mm, y = body_mass_g, color = species)) + geom_point()
Here, the command color=species tells R that we want each dot to be colored in a way that corresponds to the species of the penguin. If we want to change the shape of the dots instead of the color, or in addition to changing the color, we can use shape=species (ggplot2 also accepts the base R name, pch=species).
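For instance, a sketch that changes both the color and the shape of the dots by species could look like this. It uses shape, which is ggplot2's own name for what base R calls pch; the name byShape is chosen for illustration.

```r
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

byShape <- ggplot(penguinsClean,
                  aes(x = flipper_length_mm, y = body_mass_g,
                      color = species, shape = species)) +
  geom_point()

byShape
```

Using both color and shape can help readers who print in black and white or who have difficulty telling certain colors apart.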
There is another option that allows us to compare the three variables. Instead of trying to create one plot, we can create a faceted plot. Faceted plots take a particular variable, such as species, and create plots that are divided by that variable. To see an example, run the code below.
ggplot(penguinsClean, aes(x=flipper_length_mm, y = body_mass_g, color = species)) + geom_point() + facet_wrap( ~ species, ncol=3)
This is the same code we used above, with the addition of the line facet_wrap( ~ species, ncol=3). Let's break this addition down. The command facet_wrap tells R that we are going to separate our graphs based on some categorical variable. The specific variable is then chosen with the code ~species. We are then able to specify how we want the graphs to be stacked. We want to allow 3 columns (ncol=3).
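The variable after the ~ does not have to be the one used for color. As a sketch, we could keep the colors for species and split the panels by sex instead, which puts four variables on one figure. The name facetBySex is chosen for illustration.

```r
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

facetBySex <- ggplot(penguinsClean,
                     aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  facet_wrap(~ sex, ncol = 2)   # one panel per sex, side by side

facetBySex
```

Since sex has two categories in the cleaned data, two columns is enough to place the panels next to each other.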
One last step before we knit. Look at the very first chunk in your Markdown file. You should see something like knitr::opts_chunk$set(echo = TRUE). Change this to knitr::opts_chunk$set(echo = FALSE). What this will do is hide all the code you have created from your final document. All your plots will still show up, but your code will not.
Today, we started to explore this penguin data set, and some of what we can do in R. As we move through our semester, we will use the types of commands we learned today over and over. Next, we will start building models once our EDA is complete!