STA 111 Lab 2

Complete all questions and submit your final PDF or HTML file under Assignments in Canvas.

Goal

In the last lab, we started to explore how statistical computing can help us answer questions using data. In this lab, we are going to see how measures of center and measures of spread can be used to describe data and we will also use data visualizations to explore a data set.

If you want a reminder on how to use R or R markdown, look back at Lab 1!

The Data

As we learned in the last lab, the first thing we need to start an analysis is data. Our data for today comes from the movie series Star Wars. We are going to look at information on different characters from the movies. In order to do that, we need a data set.

The Star Wars data set is stored in a library called dplyr. Your first question is probably “what is a library?” A library in R (also called a package) is a collection of functions and data sets built for a particular kind of task. For example, the library ggplot2 contains functions for creating professional data visualizations. As we work in R, we sometimes need to load libraries in order to access the functions or data that we want. So, how do we do that?

Look at the very top of your RStudio screen. You should see a menu called Tools. Click on Tools and select Install Packages from the drop-down menu. In the prompt box that appears, type dplyr, and then hit Install. A whole bunch of output is going to start to appear on your screen. We can ignore all of it. This is just how R tells us it is installing a library.
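If you prefer typing to clicking, the same installation can usually be done from the Console with install.packages. The sketch below is one way to do this, assuming you only want to install when the library is missing. The name needs_install is made up for illustration.

```r
# Check whether dplyr is already installed (TRUE means it is missing)
needs_install <- !requireNamespace("dplyr", quietly = TRUE)

# Install only when missing, and only in an interactive session,
# so the install does not run again every time the document is knit
if (needs_install && interactive()) {
  install.packages("dplyr")
}
```

Installing only needs to happen once per computer. Loading with library(dplyr) needs to happen every time you start a new R session.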

Once the package installs, there is one more step we need to do to access the Star Wars data. Create a chunk in R, copy and paste the following, and hit play.

library(dplyr)
data("starwars")

If you look in the Environment tab in the upper right-hand corner of your RStudio screen, you should see a data set called starwars. Let’s start to explore this data set.
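Besides the Environment tab, a few commands can describe a data set from inside a chunk. The sketch below uses a tiny made-up data set called toy (not the Star Wars data), so the names toy, name, and height_cm are purely illustrative.

```r
# A tiny made-up data set with 3 rows (cases) and 2 columns (variables)
toy <- data.frame(
  name = c("A", "B", "C"),
  height_cm = c(150, 172, 188)
)

nrow(toy)      # number of rows: 3
ncol(toy)      # number of columns: 2
names(toy)     # the column names
str(toy)       # the type of each column

toy$height_cm  # pull out a single column with dataset$column
```

The same commands work on any data set that appears in your Environment tab.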

Question 1

How many cases (rows) are in this data set? How many variables?

Question 2

Classify the variables in the data set as either numeric or categorical. For the categorical variables, are they ordered or un-ordered? Are any binary? Note: You only need to do this for columns 2 - 6.

You will notice something interesting in the 4th column of the data set. Some of the entries are recorded as NA. NA stands for not available and is one way that people indicate in a data set that information is missing. In other words, when we see NA in a data set, this tells us that a certain piece of information is not known.
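To see how R treats NA, here is a small sketch with made-up numbers (toy_values is not part of the Star Wars data). It shows one way to count missing values and what happens when you try to average a variable that contains them.

```r
# Made-up values; two of them are missing
toy_values <- c(10, NA, 30, NA, 50)

is.na(toy_values)                # TRUE wherever a value is missing
sum(is.na(toy_values))           # counts the missing values: 2

mean(toy_values)                 # NA, because some values are unknown
mean(toy_values, na.rm = TRUE)   # 30, after dropping the missing values
```

The argument na.rm = TRUE tells R to remove the missing values before calculating.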

Considering Height

Let’s start off by exploring the height (in centimeters) of different characters in Star Wars. This is an interesting question to explore because there are multiple different species in the Star Wars universe. Whenever you start to explore a variable, remember that the tools you use will depend on the type of variable you are exploring. The tools we use to describe a categorical variable are different from the tools we use to describe a numeric variable.

In R, one powerful command for helping us explore a numeric variable is the summary command. Copy and paste the code from the chunk below and press play.

summary(starwars$height)
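If you want to practice reading summary output before looking at the real data, the sketch below runs the same command on a short made-up vector (toy_heights is illustrative only). It also shows that you can save the summary and pull out a single piece by its label.

```r
# Made-up heights, with one missing value
toy_heights <- c(150, 160, 165, 172, 180, 202, NA)

# Save the summary so we can reuse it
toy_summary <- summary(toy_heights)
toy_summary

# Pull out individual pieces by their labels
toy_summary[["Min."]]   # smallest value: 150
toy_summary[["Max."]]   # largest value: 202
toy_summary[["NA's"]]   # number of missing values: 1
```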

Question 3

For how many characters in this data set is the height unknown?

Question 4

For how many characters in this data set is the mass unknown? Hint: This requires changing the code above slightly.

Measures of Center

The summary command provides a lot of information about the variable height, including measures of center. There are two measures of center provided in the summary output.

Question 5

What are the names and values of the two different measures of center provided in the summary output for height?

To determine which measure of center is a more appropriate tool for these data, we need to know something about the distribution of heights in this data set. Remember that a distribution just means what values are possible for a particular variable and how often those different values occur. Because height is a numeric variable, one plot that we can use to visualize the distribution of height is a histogram.

To create a histogram in R, we use the command hist.

hist(object, col = "some color", xlab = "the X axis label")

Recall that

  • hist is the command; we want to make a histogram.
  • object is the data we want to make a histogram of. You need to replace this with the name of the data you want to plot.
  • col controls the color of the bars.
  • xlab controls what text appears on the x-axis.
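As a sketch of how the pieces above fit together, here is the template filled in for a short made-up vector. The name toy_values, the color, and the label are all placeholders chosen for illustration.

```r
# Made-up values, just to show the pieces of the command
toy_values <- c(12, 15, 16, 18, 19, 21, 22, 24, 30)

toy_hist <- hist(
  toy_values,                          # the object to plot
  col = "skyblue",                     # color of the bars
  xlab = "Toy values (made-up units)"  # text on the x-axis
)

# hist also hands back the bar heights it drew
toy_hist$counts
```

Saving the result in toy_hist is optional. Running hist on its own draws the same plot.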

Question 6

Adapt the code above to create a histogram of the height of the Star Wars characters in this data set (Hint 1: Height is the object in the code above). Color the bars of your graph gold and label the x axis “Height of characters (in centimeters)”. Hint 2: Remember that to grab only one column in a data set, we use dataset$column.

When we look at a histogram, there are two things we are looking for. The first is modality, which just means how many peaks are in the distribution. One main peak means the distribution is unimodal. More than one main peak means the distribution is multimodal. The second thing we look for is skew. We tend to describe unimodal distributions as right skewed (long right tail), left skewed (long left tail), or symmetric (balanced around the peak).
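To get a feel for these shapes, the sketch below draws histograms of simulated (made-up) data. The variable names and settings are illustrative assumptions, and because the values are random, the pictures will only roughly match the textbook shapes.

```r
set.seed(111)  # makes the random values reproducible

# Simulated data with three different shapes
unimodal_symmetric <- rnorm(200, mean = 50, sd = 5)
bimodal <- c(rnorm(100, mean = 40, sd = 3), rnorm(100, mean = 60, sd = 3))
right_skewed <- rexp(200, rate = 1)

par(mfrow = c(1, 3))  # draw three plots side by side
hist(unimodal_symmetric, col = "skyblue", main = "One peak, symmetric", xlab = "Value")
hist(bimodal, col = "tomato", main = "Two peaks", xlab = "Value")
hist(right_skewed, col = "seagreen", main = "Long right tail", xlab = "Value")
par(mfrow = c(1, 1))  # go back to one plot at a time
```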

Question 7

Is the distribution of heights unimodal or multimodal?

Question 8

Is the distribution of heights skewed right, skewed left, or symmetric?

Sometimes it can be tricky to determine skew when looking at a histogram. One clue is that right skewed distributions tend to have means that are higher than the median. Left skewed distributions tend to have medians that are higher than the mean. Symmetric distributions tend to have means and medians that are very similar to one another.
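The clue above can be checked on small made-up vectors. In this sketch, the three vectors are invented for illustration and are not related to the Star Wars data.

```r
# Made-up data with three different shapes
symmetric  <- c(1, 2, 3, 4, 5)
right_skew <- c(1, 2, 2, 3, 3, 3, 4, 50)       # one very large value
left_skew  <- c(1, 47, 48, 48, 49, 49, 49, 50) # one very small value

mean(symmetric);  median(symmetric)   # 3 and 3: about the same
mean(right_skew); median(right_skew)  # 8.5 and 3: mean is higher
mean(left_skew);  median(left_skew)   # 42.625 and 48.5: median is higher
```

Notice how a single extreme value pulls the mean toward it while the median barely moves.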

For skewed distributions, the appropriate measure of center is the median. For symmetric distributions, the mean and the median are about the same, but generally we use the mean. Why don’t we just use the median all the time? Well, this is because the mean has some very nice mathematical properties that make it easier to work with than the median in a lot of situations. We will see this fairly soon as we continue to move through our course.

Question 9

What measure of center would you use to describe the height of the Star Wars characters in this data set? State the value of that measure of center and briefly explain your choice.

Boxplots

Histograms are very useful, but they are not the only tool that we use to visualize the distribution of a numeric variable. Another tool we use is a boxplot, which visualizes the center and spread of a distribution quite differently from a histogram. Specifically, boxplots show the first quartile, median, and third quartile of a variable, and also make it easier to see outliers, i.e., unusually large or small values of the variable.

To make a boxplot in R, the command we need is boxplot.

boxplot(object, col = "some color", xlab = "X axis label", horizontal = TRUE)

This is the same setup we used for the histogram. The only new part of the code is horizontal = TRUE, which just tells R that we want a horizontal boxplot. If you want a vertical boxplot, just remove this piece of the code.

We are starting to see that when we have a primary command, like boxplot or hist, we can then add arguments, or extra pieces of the command that help us to personalize our plot. Adding color, adding labels to the axes, and choosing whether the plot is vertical or horizontal are all examples of things we can specify using arguments.
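As a sketch of these arguments in action, here is the boxplot template filled in for a short made-up vector. The name toy_values, the color, and the label are placeholders chosen for illustration.

```r
# Made-up values, including one unusually large value
toy_values <- c(12, 15, 16, 18, 19, 21, 22, 24, 60)

toy_box <- boxplot(
  toy_values,                           # the object to plot
  col = "orchid",                       # color of the box
  xlab = "Toy values (made-up units)",  # text on the x-axis
  horizontal = TRUE                     # draw the box sideways
)

# Values that R drew as separate points beyond the whiskers
toy_box$out
```

Saving the result in toy_box is optional, but it lets you check which values R flagged as outliers instead of reading them off the plot.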

Question 10

Adapt the code above to create a boxplot of the height of the Star Wars characters in this data set. Label the x axis “Height of characters (in centimeters)”. Make the plot any color you like, but do not use gold, black, or white! For suggestions, look at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

Question 11

Which measure of center is depicted in a boxplot: the mean or the median?

Question 12

Based on the boxplot, are there any outliers in terms of the characters’ heights? If so, is there one outlier, just a handful of outliers, or many outliers? State whether these outliers are abnormally large, abnormally small, or if both types of outliers are present.

Question 13

Adapt the code above to create a boxplot of the mass of the Star Wars characters in this data set. Label the x axis “Mass of characters (in kilograms)”. Make the plot any color you like, but do not use gold, black, or white!

Question 14

Based on the boxplot, there is one very large outlier in terms of mass. Which Star Wars character has this very large body mass? Hint: You will need to open your data set!

Measures of Spread

In addition to measures of center, boxplots can be used to visualize measures of spread, specifically the Interquartile Range (IQR).
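The IQR is the third quartile minus the first quartile. The sketch below shows two ways to get it in R, using a made-up vector (toy_values is illustrative only).

```r
# Made-up values
toy_values <- c(12, 15, 16, 18, 19, 21, 22, 24, 60)

# Option 1: subtract the quartiles reported by summary
toy_summary <- summary(toy_values)
toy_summary[["3rd Qu."]] - toy_summary[["1st Qu."]]  # 22 - 16 = 6

# Option 2: ask for the IQR directly
IQR(toy_values)                                      # 6

# If the variable has missing values, IQR needs na.rm = TRUE
IQR(c(toy_values, NA), na.rm = TRUE)                 # still 6
```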

Question 15

What is the IQR of height in this data set? Hint: remember that we have already created a summary of height, and that might prove useful for answering this question.

Question 16

We have now seen two different visualizations of the distribution of height. What pieces of information about the distribution of height are provided in the histogram but not the boxplot, and vice versa?

References

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 July 13.

The starwars data set used in this lab is from the dplyr library in R: Wickham H, François R, Henry L, Müller K, Vaughan D (2025). dplyr: A Grammar of Data Manipulation. R package version 1.1.4.9000, https://github.com/tidyverse/dplyr.