STA 214 Lab 1

Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.

Goal

We have two goals for our lab today. We are going to refresh the process of using R to create data visualizations, and we are also going to refresh how to use R to build linear regression models. Both of these will prove helpful as we move into our first new model next class.

Getting Started:

Note: This lab assumes that you have already installed the software and gone through the RMarkdown tutorial. If you have not done so, make sure to do these before starting the lab.

Clearing the Work space

For most of us, we used R in STA 112. This means that when you open RStudio, the “data environment” (the upper right hand panel of RStudio) where all our data sets are stored may be very full! If your Environment is blank, don’t worry about it! If it’s not blank, there is a way to clear your Environment, i.e., remove data sets that you no longer need.

To clear all of the data in the upper right hand panel, look at the top of the panel. Beside “Import Dataset” there is an image that looks like a small broom. Pushing that will “clean” your space, meaning that it will remove all the contents within the panel. It is good practice to do this before you start each lab. This keeps us from having a lot of unnecessary stuff cluttering up this window, making the data sets you need easier to find.

Loading the Data

Once we have opened up an RMarkdown file, the first thing we need to do is load the data. We will access data from a variety of sources in this course, but today we will be loading our data from an R package.

A package is a collection of codes that relate to one another. When you load a package into R, you give R access to all of the functions within the package. We will be using several packages as we move through this course.

To install a package, look at the top of your RStudio screen and find “Tools”. From the drop down menu, choose “Install Packages”. The next step is to tell R exactly what package you want to use. In the white box, type palmerpenguins, and then hit install. Palmerpenguins is the name of the package where the data for our lab today is stored.

Now that we have the package installed, it is time to actually get the data from that package. To do that, we need to create a space in our RMarkdown file to create code. Remember, Markdown can do three things: (1) create regular text, (2) run code, and (3) create math equations. Right now, we want (2).

To tell RMarkdown we are about to give it code, we need to create a Code chunk, or chunk for short. Look at the top of your Markdown file, and find Insert. Click it. From the drop-down menu, choose R. Click it! A gray box should appear in your Markdown file. This is a chunk.

Anything we put inside a chunk will be treated as computer code. Let’s put in a code to load the data now.

library(palmerpenguins) 
data("penguins")

Now, look at the right hand side of the chunk and find the little green triangle symbol. We will call this the play button in this course. Go ahead and press the play button (press play).

Now, look on the upper right hand panel of your RStudio screen. See how you now have a data set called penguins? Great! We are ready to go!

Exploratory Data Analysis

We are going to be working with the same data set from class, the penguins data set, which we have just loaded. This data set contains information on n = 344 penguins from three different species. We have a client who is interested in building a model for Y = the body mass of a penguin in grams.

In addition to this response variable, we have information on 7 explanatory variables:

species - the type penguin: Chinstrap, Gentoo, or Adelie
island - the island where the penguin lives: Biscoe, Torgeson, or Dream
bill_length_mm - the length of the penguin bill in millimeters.
bill_depth_mm - the depth of the penguin bill in millimeters.
flipper_length_mm - the flipper length of the penguin in millimeters.
sex - the biological sex of the penguin: male or female
year - the year the penguin was measured.

EDA: Numerical Summary

Now that we have the data, the next step is to start to look at the data. In our course, the characteristics of the response variable especially will help inform our choice of model.

We refer to the process of exploring the data as exploratory data analysis, or EDA. This means that we use graphs, charts, tables, and other such approaches to explore the data.

We will start our EDA process by creating a numerical summary of our response variable.

Remember that R works using the structure command(object). The command is what you want R to actually do. In this case, we will use summary as our command because we want to create a 6 number summary.

The object is what you want the command to be executed on. In other words, we want a summary of what? In our case, we want a summary of the column body mass in the penguins data set.

In R, to tell the computer we wish to work with a single column in a data set (like body_mass_g), we use $. So, to tell the computer “Please go into the penguins data set and look at only the body mass column”, we use penguins$body_mass_g.

Putting this call together, to get a summary of body mass, we use:

summary(penguins$body_mass_g)

Question 1

What is the smallest body mass in the data set?

Question 2

50% of penguins have a body mass that is less than or equal to how many grams?

If you look at your summary, you will notice one small problem with our data set that needs to be addressed before we begin modeling. We have some penguins that are missing information on the response variable, body mass. When we fit most models, we need each row in the data to be complete, meaning we need to know X and Y for every row.

Question 3

How many penguins in the data set do not have a body mass recorded? Is this the only variable with missing data, or are there others?

Hint: Open the data set to figure out the answer to the second part of the question.

There are many ways of handling missing data, but in this course, we are going to choose a method called complete cases analysis, which means removing all the rows that contain missing data. This means we are going to create a new data set called penguinsClean which contains only the rows from the original penguins data set, but without the rows that contain the missing data.

To do this, run the following code:

penguinsClean <- na.omit(penguins)

The command na.omit removes all rows from the penguins data set that contain missing data. We then store the new data set ( <-) under the name penguinsClean. This new data set does not contain any missing data.

Question 4

In total, how many rows contained missing data in the penguins data set? How many penguins are we left with in the penguinsClean data set?

Note: We will use the penguinsClean data set for the rest of this lab.

The process of handling missing data, creating features, subsetting the data, etc., is called data cleaning. We will see this process in action as we work with data in this course.

Now that we have handled our missing data, let’s start adding some graphs to our EDA. If you have already installed ggplot2, you can skip the next section.

Installing ggplot2

To make visualizations in this course, we will be working with a group of functions in the ggplot2 package in R. The package ggplot2 is a powerful collection of R functions for creating flexible, professional graphics.

The first step to using ggplot2 is to install the ggplot2 package. Go to the top of your RStudio window and find “Tools”. From there, click on “Install Packages.” In the blank box, type in ggplot2, and hit install. The computer should automatically begin to load in the packages that you need, but this may take a minute.

Note that this process of installing a package is one you need to do only once. Think of this as teaching R a new set of skills. Once it knows the skills, you don’t have to teach it again.

Now, some of you may see an error about language parsing, or an error involving rlang. If you do, go ahead and install the rlang package. Then, copy and paste the following into a chunk and hit play. Nothing will seem to happen, and that’s okay.

library(rlang)

EDA: Histograms

Once you have installed the ggplot2 package, you need to tell R that you would like to begin using the function by loading the library. Remember that we said installing a package is like teaching R a skill? Loading a library is how we tell we R we want it to use those skills. To do this, create a chunk in your RMarkdown and copy and paste the following, and hit play.

suppressMessages(library(ggplot2))

Note that this process of loading a library is one you have to do ONCE each time we start a lab or project.This tells R “Hey, remember those skills we taught you? Use them.”

Let’s create a plot to see the distribution the response variable. To do this, we can use a histogram.

To create a histogram of penguin body mass, paste the following code in a chunk and hit play.

ggplot(penguinsClean, aes(x=body_mass_g)) +  
  geom_histogram(bins = 20, fill='blue', col = 'black')

Hmm, that seems like a lot. Let’s break this down.

The creation of plots in ggplot2 requires building the plot in layers. First, we build the background, the grid on which we will be building our graph. This is the job of the ggplot() part of the code.

ggplot()

Once we have built the background, we need to tell R what data we want to use to create the graph. We need to tell it the name of the data set we want to use (penguinsClean) and then the variable we want to work with (body_mass_g). The aes part of the code stands for “axis”, we want body mass to be on the x-axis of our graph.

ggplot( penguinsClean, aes(x=body_mass_g) )

The next step is to tell R what type of graph we want to make using this variable. In this case, we want a histogram. The command we use to build a histogram is + geom_histogram().

ggplot( penguinsClean, aes(x=body_mass_g) )
  + geom_histogram()

Now, we want to specify some things about our histogram. We want the bars to be filled ( 'fill') in blue and we want them outlined ( 'col') in black. We also want 20 bins (bins=20) in our histogram.

ggplot(penguinsClean, aes(x=body_mass_g)) +  
  geom_histogram(bins = 20, fill='blue', col = 'black')

Let’s try it.

Question 5

Create a histogram of flipper length, using 15 bins. Make the bars of the histogram cyan and outline them in white. Show your result.

Labels and Titles

We can add on another layer to our plot by adding x and y axis labels, as well as a title for the plot. The command labs, which stands for labels, is used for this.

ggplot(penguinsClean, aes(x=body_mass_g)) + 
  geom_histogram(fill='blue', col = 'black', bins = 20) +
  labs(title="Figure 1:", x = "Body Mass (in grams)", y = "Frequency")

This ggplot syntax actually mimics the ways humans would draw a graph by hand. First, you draw the axes. Then, you add on your data. Finally, you add a label. Thinking through the steps in this manner will help you understand the syntax of this package.

Question 6

Copy the code you used to make the graph from the previous question. Now, add the title “Figure 2:” and add appropriate labels to the x and y axis.

One VERY important thing to remember when we make plots is to make sure the axes are easily interpretable by your reader. You do not want to use default variable names, like “body_mass_g”. Instead, we want clear labels like “Body Mass (in grams)”.

This is a requirement for all graphs you make in this course - your must label your axes appropriately and title your graphs.

EDA: Plotting Two Numeric Variables

Now, we stated that a goal of today was to practice regression models. This means that in addition to our Y variable, we need an X.

We have a client who is interested in exploring the relationship between X = flipper length in penguins and Y = the body mass in grams. This means that we want to create a graph to examine the relationship between these two variables. Because both X and Y are numeric, we need a scatter plot.

We can create the necessary plot using the following code:

ggplot(penguinsClean, aes(x=flipper_length_mm, y = body_mass_g))
  + geom_point()

Just as before, we have two layers. The first draws the background and the second adds on the graph. Here, in the first layer, we specify both the x and y axis of the graph, as we have two variables that we are working with.

Question 7

Start with the code above to create a scatter plot for flipper length (X) vs body mass (Y). Color the dots purple.(Hint: This time we are not specifying a fill, but a color). Title your plot Figure 3, and label the x axis “Flipper Length (in mm)” and the y axis “Body Mass (in grams)”.

Modeling

Based on what we see in the scatter plot, it looks like the relationship between flipper length and body mass is a linear relationship. With this information, and the fact that body mass is a numeric variable, we should reach for least-squares linear regression as a tool to model the relationship between flipper length (X) and body mass (Y).

Choosing linear regression means that we assume the relationship between X and Y can be well described by a line. To check this, it is helpful to actually add the least-squares linear regression line to the scatter plot we created in the previous section.

To add the LSLR line onto a scatter plot, we add on another layer to the ggplot code. The layer is added with the code + stat_smooth(method = "lm", formula = y ~ x, se = FALSE). This code will (1) fit an LSLR line and (2) plot it on top of your scatter plot.

Question 8

Add an LSLR line to the scatter plot of flipper length versus body mass, and add an appropriate title.

A Second Model

Since we already worked with the model for flipper length and body mass in class, let’s try a new relationship. Our client now wants to examine the relationship between X = bill depth and Y = body mass.

Question 9

Create a scatter plot to look at the relationship between bill depth and body mass. Describe the relationship you see.

From the plot, you will notice something different from the previous scatter plot. Instead of one clump of points with a nice linear shape, we have two different groups of points!

Typically, this means that we have a grouping variable involved, and this is going to be an important thing to consider as we move into linear mixed effects models next class. For today, let’s think about what grouping variables might be involved for these data.

Grouping variables are typically categorical variables in the data set.

Question 10

What categorical variables do we have in our data set?

One way to determine which variable may be related to the two distinct groups that we see is by changing the color of the dots on the scatter plot graph based on the value of the categorical variable for that point. To do this, we can adapt our scatter plot code:

ggplot(penguinsClean, aes( x =, y = , col = )) + 
  geom_point() + labs( x = , y = , title =)

The categorical variable you want to use to define the color of the points goes in the col = part of the code. For example, we could have col = sex.

Question 11

Create a scatter plot of bill depth vs body mass with the points colored by sex. Make sure to add appropriate labels. Does it look like this variable is the grouping variable which defines the two clusters on the graph?

Ideally, we’d like to make a plot where we explore each of the categorical variables as the possible grouping variable before we make our decision. That starts to take up a lot of space in a report. One nice way to present multiple graphs is to stack them.

Stacking Graphs

Suppose I want to make a histogram of body mass. The code I need for that is:

ggplot(penguinsClean, aes(body_mass_g)) +
  geom_histogram(bins = 20, fill = "blue", col = "white")

If I put that in a chunk and press play, one histogram will appear on my screen. Let’s suppose I want to show a box plot along with this histogram. I can create both graphs and have them both print out separately in my knit document. However, I can also tell the computer to print more than one graph at once to save space. Try the following:

# First graph 
g1<- ggplot(penguinsClean, aes(body_mass_g)) +
  geom_histogram(bins = 20, fill = "blue", col = "white") +
  labs(title ="Figure 1") 

# Second graph 
g2 <- ggplot(penguinsClean, aes(body_mass_g)) +
  geom_boxplot(fill = "white", col = "blue") +
  labs(title ="Figure 2") 

# Stack the two graphs 
gridExtra::grid.arrange(g1,g2, ncol = 2)

If you get a warning message about not having gridExtra, this means you will need to install the package using the same steps we did above to install ggplot2. We have to install packages all the time with R, depending on the version of R you have and your computer system.

What you will notice is that we have stored each of the two graphs under a name. Our histogram is stored (<-) under the name g1 and our box plot is stored (<-) under the name g2. Then, we use a special code called gridExtra::grid.arrange to help us arrange the graphs in a grid. In our case, we have to graphs, and we want them side by side. This means we want the graphs in a 1 (row) by 2 (column) grid.

So, to create the 1 by 2 grid, we feed the computer our two graphs, and then tell it we want the figures in 2 columns (next to each other) by specifying ncol = 2. In other words, the number of columns we want is 2!

Question 12

Create and stack together 3 scatter plots:

Bill depth (X) vs body mass (Y) by sex (group)
Bill depth (X) vs body mass (Y) by island (group)
Bill depth (X) vs body mass (Y) by species (group)

Make sure you have appropriate labels!

Question 13

Based on what you see, there are two possible variables that could be defining the two clusters in the scatter plot. Which variables are they?

Question 14

Why is it that two variables seem to defining the same groups?? And does it matter which of the two variables we use to define the groups?

Back to the model

Now, from our results so far it looks like a specific level of one variable is defining the group. This means it would be helpful to put into our model a specific indicator variable that suggests that penguins are in that group.

For instance, suppose that the grouping variable is sex (note: it isn’t!) and I want to indicate whether the penguins is male. To do this, I add into my model the following:

Model1 <- lm( body_mass_g ~ bill_depth_mm +I(sex == "male"), data = penguinsClean)

This code pairs up with the following model:

\[BodyMass_i = \beta_0 + \beta_1 BillDepth_i+ \beta_2 SexMale_i + \epsilon_i, \epsilon_i \stackrel{iid}{\sim} N(0,\sigma)\]

Here, $SexMale_i$ is an indicator variable, meaning it is a 1 if the penguin $i$ is a male and 0 otherwise. The code I(sex == "male") creates this indicator variable for male penguins in the fitted model.

To create all this lovely math notation, put the following into the white space in your Markdown file (where you type, not in a chunk):

$$BodyMass_i = \beta_0 + \beta_1 BillDepth_i+ \beta_2 SexMale_i + \epsilon_i,$$ $$\epsilon_i \stackrel{iid}{\sim} N(0,\sigma)$$

Question 15

The grouping variable is not sex! Using the template above, write down the equation for the correct model (meaning change the grouping variable).

Question 16

What is the systematic component of this linear regression model?

Question 17

What is the random component of this linear regression model?

Question 18

Using the code template provided for Model 1, fit the linear regression model you wrote down in Question 15. Call the model Model2. Run the code below to print out the information from the fitted model:

knitr::kable(summary(Model2)$coefficients)

Question 19

Write down the fitted model (using the estimates above) using proper notation. Round to two decimal places.

Hint: To write math notation in Markdown, go to the white space in your Markdown and type: $$\widehat{BodyMass}_i = Intercept + Slope(BodyDepth_i) + ...$$. Just fill in the missing pieces!

Great! So we now have a model. Let’s use it.

Using the Model: Interpretations

Question 20

Interpret the coefficient for bill depth (which should be $\hat{\beta}_1$ in your fitted model.)

Using the Model: Predictions

Question 21

Using your fitted model, what is the predicted average body mass for a penguin with the characteristic of the 2nd penguin in the data set?

Question 22

What is the residual for the 2nd penguin in the data set? Interpret what this value tells us.

Using the Model: Assessing Model Fit

The last thing we want to do today is assess how well our model is doing at fitting the data. We will do this a little differently for all of the models in our course, but the goal is the same - how much of the variability in $Y$ are we capturing with the systematic component of our model?

We will start with a breakdown of the sum of squares.

anova(Model2)

Question 23

What is the RSS, MSS, and TSS for our model?

Question 24

Compute the $R^2$ and adjusted $R^2$ for our model.

Hint: $R^2_{adj} = 1 - \left( \frac{RSS}{TSS} \times \frac{n-p}{n-1}\right)$, where $p$ is the number of $\beta$ terms in the model including the intercept.

Next Time

We have refreshed a bit on how to use R to create graphs as well as how to build LSLR models in R. We are going to build on these skills next time to discuss our first new class of models: linear mixed effects models.

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2023 17 August.