STA 363/663 Lab 1
Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.
The Goal
We are going to be using R extensively in this course. Because of this, our lab today is designed to remind us of some of the code that we will be using frequently, including code for data visualization. Most of the coding should feel familiar.
We will also practice some of the vocabulary terms we covered in our first class. These terms will be used throughout the semester, so let’s take some time to practice them now!
Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.
Loading the Data
Once we have opened up an RMarkdown file, the first thing we need to do is load the data. We will access data from a variety of sources in this course, but today we will be loading data from inside an R package. A package is a collection of related code, and it often includes data sets as well. When you load a package into R, you give R access to all of the functions within the package. This means that our first step is to remind ourselves how to install an R package.
To install a package:
- Look at the top of your RStudio screen and find “Tools”.
- From the drop down menu, choose “Install Packages”.
- In the white box, type palmerpenguins, and then hit install.
Now that we have the package installed, it is time to actually get the data from that package. To do that, we need to create a space in our RMarkdown file to create code. Remember, Markdown can do three things: (1) create regular text, (2) run code, and (3) create math equations. Right now, we want (2).
To tell RMarkdown we are about to give it code, we need to create a code chunk, or chunk for short. Look at the top of your Markdown file, and find Insert. Click it. From the drop-down menu, choose R. Click it! A gray box should appear in your Markdown file. This is a chunk.
Anything we put inside a chunk will be treated as computer code. Let’s put in some code to load the data now.
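Here is a minimal sketch of what that chunk could contain. It assumes the palmerpenguins package was installed using the steps above; the data set inside the package is called penguins.

```r
# Load the package so R can use everything inside it
library(palmerpenguins)

# Make the penguins data set available in the current session
data("penguins")
```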
Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button (press play).
And that’s it! The data is now loaded and ready to use.
The Data
Our data set for today contains information on n = 344 penguins from three different species. We have a client who is interested in building a model for Y = the body mass of a penguin in grams.
Question 1
Based on the information we have so far, how do we know that we are dealing with a supervised learning task?
Question 2
Based on the information we have so far, are we dealing with a regression or classification modeling task?
So, there’s 3 vocab terms already:
- Supervised Learning
- Classification Task
- Regression Task
Starting next week, there will be a list of vocabulary terms on Canvas to show which terms we have covered so far. Hopefully, this will be helpful for studying!
In addition to the response variable, the data set contains information on 7 features:
- species: the type of penguin.
- island: the island where the penguin lives.
- bill_length_mm: the length of the penguin bill in millimeters.
- bill_depth_mm: the depth of the penguin bill in millimeters.
- flipper_length_mm: the flipper length of the penguin in millimeters.
- sex: the biological sex of the penguin.
- year: the year the penguin was measured.
Question 3
Client 1 is interested in determining the relationship between the flipper length and the body mass of a penguin. For Client 1, are we dealing with a prediction or association task?
Question 4
Client 2 would like us to build a model that can be used to estimate the body mass of penguins once other certain characteristics, such as flipper length, are measured. For Client 2, are we dealing with a prediction or association task?
Now, we could use one model to try and suit the needs of both clients, or we could build two separate models, one designed for each task. As statisticians, it is our job to determine what is appropriate, and what is possible with our resources. More on that later.
We have just gotten started with the analysis, and we have already run through our vocab list:
- Supervised Learning
- Classification Task
- Regression Task
- Association Task
- Prediction Task
- Feature
These are key terms you will be hearing constantly as we move through the course. If you have any questions about them, please let me know!
Exploratory Data Analysis (EDA)
Once we have the data and know the goal of the analysis, the next step is to start to look at the data. We refer to the process of exploring the data as exploratory data analysis, or EDA. This means using graphs, charts, tables, and other such approaches to explore the data.
Let’s start by creating a summary of our response variable. In R, to tell the computer we wish to work with a single column in a data set (like body_mass_g), we use $. So, to tell the computer “Please go into the penguins data set and look at only the body mass column”, we use penguins$body_mass_g. This means that to get a summary of that single column, we use:
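A minimal sketch of that summary, assuming the palmerpenguins package is installed:

```r
library(palmerpenguins)

# Summarize only the body mass column of the penguins data set
summary(penguins$body_mass_g)
```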
We can also use the code below to obtain the same result.
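A sketch of the square-bracket version, under the same assumption that palmerpenguins is installed. The blank before the comma means "keep every row", and the name after the comma picks the column. The output is laid out a little differently, but the summary values are the same.

```r
library(palmerpenguins)

# Select the body mass column by name, keeping all rows, then summarize it
summary(penguins[ , "body_mass_g"])
```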
This means that to select one column in R you have two choices:
data$nameOfColumn
data[ , "nameOfColumn"]
Question 5
Based on the summary, what is the smallest body mass in the data set?
Missing Data
In our case, the summary highlights one small problem with our data set that needs to be addressed before we begin modeling. We have some penguins that are missing information on the response variable, body mass. When we fit most models, we need each row in the data to be complete, meaning we need to know X and Y for every row.
Question 6
How many penguins in the data set do not have a body mass recorded?
There are many ways of handling missing data, and in this course we are going to explore several, starting next class!!!
For today, we are going to choose a method called complete case analysis, which means removing all the rows that contain missing data. We’ll talk about pros and cons of this method next time, but for today, let’s use it.
We are going to create a new data set called penguinsClean, which contains all of the rows from the original penguins data set except those that contain missing data. To do this, run the following code:
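A minimal sketch of this step, assuming the penguins data set comes from the palmerpenguins package as above:

```r
library(palmerpenguins)

# Drop every row with at least one missing value,
# and store the result under the name penguinsClean
penguinsClean <- na.omit(penguins)
```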
The command na.omit removes all rows from the penguins data set that contain missing data. We then store the new data set (<-) under the name penguinsClean. This new data set does not contain any missing data.
Question 7
In total, how many rows contained missing data in the penguins data set? How many penguins are we left with in the penguinsClean data set?
Note: We will use the penguinsClean data set for the rest of this lab.
While removing the rows is an easy way to handle missing data, we will learn that it has disadvantages. More on that next class.
The process of handling missing data, creating features, subsetting the data, etc., is called data cleaning. We will see this process in action as we work with data in this course.
Creating graphs
Now that we have handled our missing data, let’s start adding some graphs to our EDA. If you have already installed ggplot2, you can skip the next section.
Installing ggplot2
We will use the package ggplot2 to create graphs in our course. The first step to using ggplot2 is to install the ggplot2 package using the same steps we used to install the palmerpenguins package earlier.
Now, some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, copy and paste the following into a chunk and hit play. Nothing will seem to happen, and that’s okay.
Note that this process of installing a package is one you need to do only once. Think of this as teaching R a new set of skills. Once it knows the skills, you don’t have to teach it again.
Once you have installed the ggplot2 package, you need to tell R that you would like to begin using its functions by loading the library. Remember that we said installing a package is like teaching R a skill? Loading a library is how we tell R we want it to use those skills. To do this, create a chunk in your RMarkdown and copy and paste the following, and hit play.
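A minimal sketch of the chunk, assuming ggplot2 has already been installed:

```r
# Tell R we want to use the functions in the ggplot2 package
library(ggplot2)
```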
Note that this process of loading a library is one you have to do ONCE each time we start a lab or project. This tells R “Hey, remember those skills we taught you? Use them.”
Plotting Two Numeric Variables
We are going to begin with creating graphs to answer Client 1’s question. Recall that Client 1 is interested in determining the relationship between the flipper length and the body mass of a penguin.
Question 8
We have a numeric feature (flipper length) and a numeric response variable (body mass). What type of plot could we use to examine the relationship between these two variables?
Why do we want to start off with a graph? Well, many models, including linear regression, involve assumptions about the shape of the relationship between \(X\) and \(Y\). If that shape is not something we can easily model, we may want to use one of the non-parametric techniques we will learn in this course. These techniques will not have a shape assumption.
To create the plot we need, paste the following code in a chunk and hit play.
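Here is a sketch of the plotting code, built from the same pieces used later in this lab. The first three lines only repeat earlier setup so that the chunk runs on its own; if you have already loaded the libraries and created penguinsClean, you only need the ggplot lines.

```r
# Setup repeated from earlier steps so this chunk is self-contained
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# Background layer (data and axes) plus a layer of blue points
ggplot(penguinsClean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(col = 'blue')
```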
The creation of plots in ggplot2 requires building the plot in layers. First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code. Once we have built the background, we are ready to plot our data.
The command we will use for this depends upon the data type we are working with. In this case, the command we use is geom_point(). We specify that we want the dots to be colored ('col') blue.
Notice that to add each layer to the graph, in the code we use a plus sign. We add the background AND THEN the points to make the final graph.
Question 9
Create a plot to explore the relationship between X = bill depth and Y = body mass. Make the color of the points on the graph purple and show the resulting plot.
We can add on another layer to our plot by adding x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.
ggplot(penguinsClean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(col = 'blue') +
  labs(title = "Figure 1:", x = "Flipper Length (in mm)",
       y = "Body Mass (in grams)",
       caption = "A scatter plot of flipper length versus body mass")
This ggplot syntax actually mimics the ways humans would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add a label. Thinking through the steps in this manner will help you understand the syntax of this package.
Question 10
Copy the code you used to make the graph from Question 9. Now, add the title “Figure 2:” and add appropriate labels to the x and y axes.
One VERY important thing to remember when we make plots is to make sure the axes are easily interpretable by your reader. You do not want to use default variable names, like “body_mass_g”. Instead, we want clear labels like “Body Mass (in grams)”. This is going to be a requirement for all graphs you make in this course - label your axes appropriately and title your graphs.
Now that we have our graph, let’s think about what information it yields.
Question 11
Based on the graph of X = bill depth versus Y = body mass, does it look like using a parametric model is appropriate? In other words, is the relationship a clear shape like a line or a curve?
These questions can help us determine the type of model we might want to consider, and we will be using them as we move through the semester to build models.
Quick Side Note: Themes
For those of you who like to consider different colors in R, there are really cool packages called themes or palettes that allow you to personalize your graphs. For instance, let’s say I want my graph to look like Barbie drew it (yep, it’s exactly what you think):
If you are prompted during the install process to type something, type 3. Once you have finished installing, you need to put a # in front of the install command: #remotes::install_github("MatthewBJane/theme_park"). If you do not do this, your document will not knit.
# Load the theme package (assumed here to be named ThemePark once installed)
library(ThemePark)
# Make the scatter plot
ggplot(penguinsClean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = barbie_theme_colors["medium"]) +
  labs(title = 'Barbie Scatter Plot', x = "Flipper Length (mm)", y = "Body Mass (grams)") +
  theme_barbie()
If you run this code, you will see other themes. Try a few out!! If you find a few you like, play with using them for the rest of the lab!
Adding Lines or Curves
One way to assess whether a parametric model might be a good fit to the data is to actually add a line or curve on a graph. To add a line or curve onto the graph, we add on another layer. The layer is added with the code + stat_smooth(method = "lm", formula = y ~ x, se = FALSE). This code will (1) fit an LSLR line and (2) plot it on top of your scatter plot.
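As a sketch, here is that layer added to the flipper length scatter plot from earlier. The setup lines are repeated only so the chunk runs on its own, and the axis labels are the ones used for Figure 1.

```r
# Setup repeated from earlier steps so this chunk is self-contained
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# Scatter plot with a least squares regression line drawn on top
ggplot(penguinsClean, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(col = 'blue') +
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Flipper Length (in mm)", y = "Body Mass (in grams)")
```

Setting se = FALSE draws only the line, without the shaded uncertainty band around it.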
Question 12
Add an LSLR line to the plot for X = bill length vs Y = body mass. Title the resulting graph Figure 3.
Question 13
Suppose a new client wants you to build a model for estimating Y = bill depth using X = bill length. Using EDA, explain to your client whether or not a linear model might be a reasonable choice.
Note: You do NOT need to check residual plots or QQ plots here; just focus on whether the shape is reasonable for LSLR.
Stacking Graphs
Client 1 now also informs us that when building the model to explore the relationship between flipper length and body mass, it is important to control for bill length, species, and sex. This means that, ideally, we’d like to make a plot to explore the relationship between each of these features and the response. That starts to take up a lot of space in a report. One nice way to present multiple graphs is to stack them.
Suppose I want to make a histogram of body mass. The code I need for that is:
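A sketch of the histogram code, using the same settings that appear in the stacked example later in this section. The setup lines are repeated only so the chunk runs on its own.

```r
# Setup repeated from earlier steps so this chunk is self-contained
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# Histogram of body mass with 20 bins, blue bars, and white outlines
ggplot(penguinsClean, aes(body_mass_g)) +
  geom_histogram(bins = 20, fill = "blue", col = "white") +
  labs(x = "Body Mass (in grams)")
```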
If I put that in a chunk and press play, one histogram will appear on my screen. Let’s suppose I want to show a box plot along with this histogram. I can create both graphs and have them both print out separately in my knit document. However, I can also tell the computer to print more than one graph at once to save space. Try the following:
# First graph
g1<- ggplot(penguinsClean, aes(body_mass_g)) +
geom_histogram(bins = 20, fill = "blue", col = "white")+
labs(title ="Figure 1")
# Second graph
g2 <- ggplot(penguinsClean, aes(body_mass_g)) +
geom_boxplot(fill = "white", col = "blue") +
labs(title ="Figure 2")
# Stack the two graphs
gridExtra::grid.arrange(g1,g2, ncol = 2)
If you get a warning message about not having gridExtra, this means you will need to install the package using the same steps we did above to install ggplot2. We have to install packages all the time with R, depending on the version of R you have and your computer system.
What you will notice is that we have stored each of the two graphs under a name. Our histogram is stored under g1 and our box plot is stored under g2. Then, we use a special code called gridExtra::grid.arrange to help us arrange the graphs in a grid. In our case, we have two graphs, and we want them side by side. This means we want the graphs in a 1 (row) by 2 (column) grid.
To create the 1 by 2 grid, we feed the computer our two graphs, and then tell it we want the figures in 2 columns (next to each other) by specifying ncol = 2. In other words, the number of columns we want is 2!
Question 14
Stack the 4 graphs you would use to explore the relationship between each of the 4 features (flipper length, bill length, species, and sex) versus the response (so flipper length vs. body mass, and then bill length vs. body mass, etc.). You need to stack the graphs in a 2 x 2 grid.
Making Tables
While we are thinking about professional formatting, let’s talk about tables. If we want to know about the species or islands in this data set, do we really want to see hundreds of rows of information on species or islands? Probably not. What we actually want to do is look at some sort of summary of the data. For categorical data like this, a table is the most useful summary.
There are several ways to make tables in R, but we will discuss two. The first is very direct. We tell R we want to use the penguinsClean data set and the variable species, by using the code penguinsClean$species (dataset$variable). Then, we use the table(whatWeWantToMakeATableWith) command to actually make the table.
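Putting those two pieces together, a minimal sketch looks like this. The setup lines are repeated only so the chunk runs on its own.

```r
# Setup repeated from earlier steps so this chunk is self-contained
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# Count how many penguins of each species are in the cleaned data
table(penguinsClean$species)
```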
However, this makes a table that is not particularly pretty or professional when you knit. A second option that does create professional tables is:
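One common way to do this is with the kable function from the knitr package. The sketch below assumes that is the approach intended here, and the column labels "Species" and "Count" are only illustrative choices.

```r
# Setup repeated from earlier steps so this chunk is self-contained
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# Wrap the same table in kable and give the columns readable names
knitr::kable(table(penguinsClean$species),
             col.names = c("Species", "Count"))
```

The inner table(...) call is exactly the first method; kable only changes how the result is displayed when you knit.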
The code is more complex, but the heart of it is the same table. This table will not look very pretty when you press play, but go ahead and knit. See how nicely the table gets formatted?
Question 15
Create a table, using the second way to make a table, for the island where the penguins live. Label the columns appropriately.
Okay, so now we can summarize the data by looking at the islands and the species. What if we want to look at them together? In other words, what if I want to know which species of penguins are on which islands? Can I do that?
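Yes! Giving table two variables instead of one produces a two-way table. The sketch below uses species and sex purely as an illustration (that pairing is my choice for the example, not one of the questions below); swap in whichever two variables you need.

```r
# Setup repeated from earlier steps so this chunk is self-contained
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# Rows are species, columns are sex; each cell counts the penguins in that group
table(penguinsClean$species, penguinsClean$sex)

# The same two-way table, formatted for a knit document
knitr::kable(table(penguinsClean$species, penguinsClean$sex))
```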
Question 16
Do we have the same number of male and female penguins on each island? Create a table to find out. Show the table, and answer the question in a complete sentence.
Question 17
Our client wants to know if we should include the feature island in the model, or if including species is enough. Create a table to explore the relationship between species and island. Show the table, and respond to your client in a complete sentence.
Why would this be important? Well, when we are building models, we might need to know whether or not our groups were balanced before choosing a model. What if we had only male penguins from Dream Island, and then tried to fit a model looking at the relationship between penguin sex and bill length on this island? We couldn’t do it, because we only have information on male penguins. This is why taking the time to perform exploratory data analysis, and really dig into the data, is so important.
Combining Categorical and Numeric Variables
At this point in the lab, we have explored numeric and categorical variables separately. What if we want to make a plot that shows the relationship among flipper length, species, and body mass, all at once? We could, perhaps, make a scatter plot, and then just use colors to separate out the different types of penguins. To do this, we include a color command in the specification of the axes.
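A sketch of such a plot, matching the first two lines of the facet code later in this section. The setup lines are repeated only so the chunk runs on its own.

```r
# Setup repeated from earlier steps so this chunk is self-contained
library(ggplot2)
library(palmerpenguins)
penguinsClean <- na.omit(penguins)

# color = species goes inside aes(), so each species gets its own color
ggplot(penguinsClean, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  labs(x = "Flipper Length (in mm)", y = "Body Mass (in grams)")
```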
Here, the command color=species tells R that we want each dot to be colored in a way that corresponds to the species of the penguin. If we want to change the shape of the dots instead of the color, or in addition to changing the color, we can use pch=species.
Question 18
Create a scatter plot with bill depth on the x axis, body mass on the y axis and color the dots by island. Make sure to use appropriate titles/labels.
There is another option that allows us to compare the three variables. Instead of trying to create one plot, we can facet our plot. Faceted plots take a particular variable, such as species, and create plots that are divided by that variable. To see an example, run the code below.
ggplot(penguinsClean, aes(x=flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
facet_wrap( ~ species, ncol=3)
This is the same code we used above, with the addition of the line facet_wrap( ~ species, ncol=3). Let’s break this addition down. The command facet_wrap tells R that we are going to separate our graphs based on some categorical variable. The specific variable is then chosen with the code ~species. We are then able to specify how we want the graphs to be stacked. We want to allow 3 columns (ncol=3).
Question 19
What command would you use if you wanted only two columns? Show the resultant plot, and add appropriate labels.
Question 20
What command would you use if you wanted to add trained LSLR lines to the facet plot? Show the resultant plot, and make sure the axes are appropriately labeled.
Before you submit
One last step before we knit. Look at the very first chunk in your Markdown file. You should see something like knitr::opts_chunk$set(echo = TRUE). Change this to knitr::opts_chunk$set(echo = FALSE). What this will do is hide all the code you have created from your final document. All your plots will still show up, but your code will not.
Next Steps
For today, we started to explore this penguin data set, and some of what we can do in R. As we move through our semester, we will use the types of commands we learned today over and over. Let me know if you have any questions!
Turning in your assignment
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.
References
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 January 7.
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.