STA 111 Lab 2
Complete all questions and submit your final PDF or HTML file under Assignments in Canvas.
Goal
In the last lab, we started to explore how statistical computing can help us answer questions using data. In this lab, we are going to see how measures of center and measures of spread can be used to describe data and we will also use data visualizations to explore a data set.
If you want a reminder on how to use R or R markdown, look back at Lab 1!
The Data
As we learned in the last lab, the first thing we need to start an analysis is data. Our data for today comes from the movie series Star Wars. We are going to look at information on different characters from the movies. In order to do that, we need a data set.
The Star Wars data set is stored in a library called dplyr. Your first question is probably “what is a library?”
A library in R is a collection of code that performs a certain task. For example, the library ggplot2 contains code to create professional data visualizations. As we work in R, sometimes we need to load libraries in order to access the code that we want. So, how do we do that?
Look at the very top of your RStudio screen. You should see an option called Tools. Click on Tools and from the drop-down menu select Install Packages. In the prompt box that appears, type dplyr, and then hit Install. Now, a whole bunch of output is going to start to appear on your screen. We can ignore all of it. This is just how R tells us it is installing the library.
Once the package installs, there is one more step we need to do to access the Star Wars data. Create a chunk in R, copy and paste the following, and hit play.
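A minimal version of this chunk, assuming dplyr installed without errors, might look like the following. The library command loads dplyr for the current session, and the data command makes the starwars data set show up in your Environment.

```r
# Load the dplyr library (it must already be installed)
library(dplyr)

# Make the starwars data set available in the Environment tab
data("starwars")
```

You need to run library(dplyr) once each time you restart R, but you only need to install the library once.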
If you look in the Environment tab in the upper right-hand corner of your RStudio screen, you should see a data set called starwars. Let’s start to explore this data set.
Question 1
How many cases (rows) are in this data set? How many variables?
Question 2
Classify the variables in the data set as either numeric or categorical. For the categorical variables, are they ordered or un-ordered? Are any binary? Note: You only need to do this for columns 2 - 6.
You will notice something interesting in the 4th column of the data set. Some of the entries are recorded as NA. NA stands for not available and is one way that people indicate in a data set that information is missing. In other words, when we see NA in a data set, this tells us that a certain piece of information is not known.
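To see how R treats NA, here is a small sketch using a made-up vector. The name heights_cm and its values are hypothetical and are not taken from the Star Wars data.

```r
# Hypothetical heights, with one missing value
heights_cm <- c(150, NA, 170)

# is.na() flags which entries are missing (TRUE means missing)
is.na(heights_cm)

# Counting the TRUE values tells us how many entries are missing
sum(is.na(heights_cm))

# By default, the mean of data containing NA is itself NA
mean(heights_cm)

# na.rm = TRUE tells R to drop the missing values before computing
mean(heights_cm, na.rm = TRUE)
```

The idea is that R will not guess what a missing value should be, so you have to tell it explicitly to leave those entries out.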
Considering Height
Let’s start off by exploring the height (in centimeters) of different characters in Star Wars. This is an interesting question to explore because there are multiple different species in the Star Wars universe. Whenever you start to explore a variable, remember that the tools you use will depend on the type of variable you are exploring. The tools we use to describe a categorical variable are different from the tools we use to describe a numeric variable.
In R, one powerful command for helping us explore a numeric variable is the summary command. Copy and paste the code from the chunk below and press play.
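A sketch of what this chunk could look like, assuming the starwars data set has been loaded from dplyr as described earlier:

```r
# Load dplyr so that the starwars data set is available
library(dplyr)

# Summarize the height column of the starwars data set
summary(starwars$height)
```

The output lists the minimum, first quartile, median, mean, third quartile, and maximum, along with a count of NA values if any are present.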
Question 3
For how many characters in this data set is their height unknown?
Question 4
For how many characters in this data set is their mass unknown? Hint: This requires changing the code above slightly.
Measures of Center
The summary command provides a lot of information about the variable height, including measures of center. There are two measures of center provided in the summary output.
Question 5
What are the names and values of the two different measures of center provided in the summary output for height?
To determine which measure of center is a more appropriate tool for these data, we need to know something about the distribution of heights in this data set. Remember that a distribution just means what values are possible for a particular variable and how often those different values occur. Because height is a numeric variable, one plot that we can use to visualize the distribution of height is a histogram.
To create a histogram in R, we use the command hist.
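The general pattern is hist(object, col = ..., xlab = ...). Here is a minimal runnable sketch of that pattern. The vector quiz_scores, the color, and the axis label are hypothetical and chosen only for illustration.

```r
# Hypothetical data, made up purely for illustration
quiz_scores <- c(4, 6, 7, 7, 8, 8, 8, 9, 9, 10)

# General pattern: hist(object, col = ..., xlab = ...)
hist(quiz_scores,           # object: the data we want to plot
     col = "skyblue",       # col: the color of the bars
     xlab = "Quiz score")   # xlab: the text on the x-axis
```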
Recall that:
hist is the command; we want to make a histogram.
object is the data we want to make a histogram of. You need to replace this with the name of the data you want to plot.
col controls the color of the bars.
xlab controls what text appears on the x-axis.
Question 6
Adapt the code above to create a histogram of the height of the Star Wars characters in this data set (Hint 1: Height is the object in the code above). Color the bars of your graph gold and label the x axis “Height of characters (in centimeters)”. Hint 2: Remember that to grab only one column in a data set, we use dataset$column.
When we look at a histogram, there are two things we are looking for. The first is modality, which just means how many peaks are in the distribution. One main peak means the distribution is unimodal. More than one main peak means the distribution is multimodal. The second thing we look for is skew. We tend to describe unimodal distributions as right skewed (long right tail), left skewed (long left tail), or symmetric (balanced around the peak).
Question 7
Is the distribution of heights unimodal or multimodal?
Question 8
Is the distribution of heights skewed right, skewed left, or symmetric?
Sometimes it can be tricky to determine skew when looking at a histogram. One clue is that right skewed distributions tend to have means that are higher than the median. Left skewed distributions tend to have medians that are higher than the mean. Symmetric distributions tend to have means and medians that are very similar to one another.
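As a quick illustration of this clue, consider a small made-up data set with one unusually large value, which gives it a long right tail. The name commute_minutes and its values are hypothetical.

```r
# Hypothetical commute times in minutes; the value 90 creates a long right tail
commute_minutes <- c(10, 12, 15, 15, 18, 20, 22, 25, 90)

# The mean is pulled upward by the one very large value
mean(commute_minutes)

# The median is the middle value, so it is barely affected
median(commute_minutes)
```

Here the mean comes out noticeably larger than the median, which is what we expect for right skewed data.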
For skewed distributions, the appropriate measure of center is the median. For symmetric distributions, the mean and the median are about the same, but generally we use the mean. Why don’t we just use the median all the time? This is because the mean has some very nice mathematical properties that make it easier to work with than the median in a lot of situations. We will see this fairly soon as we continue to move through our course.
Question 9
What measure of center would you use to describe the height of the Star Wars characters in this data set? State the value of that measure of center and briefly explain your choice.
Boxplots
Histograms are very useful, but they are not the only tool that we use to visualize the distribution of a numeric variable. Another tool we use is a boxplot, which visualizes the center and spread of a distribution quite differently from a histogram. Specifically, boxplots show the first quartile, median, and third quartile of a variable, and also make it easier to see outliers, i.e., unusually large or small values of the variable.
To make a boxplot in R, the command we need is boxplot.
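A minimal runnable sketch of the general pattern, boxplot(object, col = ..., xlab = ..., horizontal = TRUE), is shown here. The vector quiz_scores, the color, and the axis label are hypothetical and chosen only for illustration.

```r
# Hypothetical data, made up purely for illustration
quiz_scores <- c(4, 6, 7, 7, 8, 8, 8, 9, 9, 10)

# General pattern: boxplot(object, col = ..., xlab = ..., horizontal = TRUE)
boxplot(quiz_scores,          # object: the data we want to plot
        col = "skyblue",      # col: the color of the box
        xlab = "Quiz score",  # xlab: the text on the x-axis
        horizontal = TRUE)    # draw the boxplot horizontally
```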
This is the same setup we used for the histogram. The only new part of the code is horizontal = TRUE. The horizontal = TRUE part of the code just tells R that we want a horizontal boxplot. If you want a vertical boxplot, just remove this piece of the code.
We are starting to see that when we have a primary command, like boxplot or hist, we can then add arguments, or extra pieces of the command that help us to personalize our plot. Adding color, adding labels to the axes, and choosing whether the plot is vertical or horizontal are all examples of things we can specify using arguments.
Question 10
Adapt the code above to create a boxplot of the height of the Star Wars characters in this data set. Label the x axis “Height of characters (in centimeters)”. Make the plot any color you like, but do not use gold, black, or white! For suggestions, look at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
Question 11
Which measure of center is depicted in a boxplot: the mean or the median?
Question 12
Based on the boxplot, are there any outliers in terms of the characters’ heights? If so, is there one outlier, just a handful of outliers, or many outliers? State whether these outliers are abnormally large, abnormally small, or if both types of outliers are present.
Question 13
Adapt the code above to create a boxplot of the mass of the Star Wars characters in this data set. Label the x axis “Mass of characters (in kilograms)”. Make the plot any color you like, but do not use gold, black, or white!
Question 14
Based on the boxplot, there is one very large outlier in terms of mass. Which Star Wars character has this very large body mass? Hint: You will need to open your data set!
Measures of Spread
In addition to measures of center, boxplots can be used to visualize measures of spread, specifically the Interquartile Range (IQR).
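Recall that the IQR is the third quartile minus the first quartile, which is the width of the box in a boxplot. A small sketch with made-up numbers (the name wait_times and its values are hypothetical):

```r
# Hypothetical waiting times in minutes
wait_times <- c(2, 4, 6, 8, 10)

# First quartile (25%) and third quartile (75%)
quartiles <- quantile(wait_times, c(0.25, 0.75))
quartiles

# IQR = third quartile - first quartile
unname(quartiles[2] - quartiles[1])

# R also has a built-in command that gives the same value
IQR(wait_times)
```

For these numbers the first quartile is 4 and the third quartile is 8, so the IQR is 4.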
Question 15
What is the IQR of height in this data set? Hint: remember that we have already created a summary of height, and that might prove useful for answering this question.
Question 16
We have now seen two different visualizations of the distribution of height. What pieces of information about the distribution of height are provided in the histogram but not the boxplot, and vice versa?
References
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 July 13.
The starwars data set used in this lab is from the dplyr library in R: Wickham H, François R, Henry L, Müller K, Vaughan D (2025). dplyr: A Grammar of Data Manipulation. R package version 1.1.4.9000, https://github.com/tidyverse/dplyr.