Complete all questions and submit your final PDF or HTML file (either works!) under Assignments in Canvas.
In the last lab, we started to explore how statistical computing can help us answer questions using data. In this lab, we are going to see how measures of center and measures of spread can be used to describe data and we will also use data visualizations to explore a data set.
As we learned in the last lab, the first thing we need to start an analysis is data. Our data for today comes from the movie series Star Wars. We are going to look at information on different characters from the movies. In order to do that, we need a data set.
The Star Wars data set is stored in a package called “dplyr”. Your first question is probably “what is a package?” A package in R is a collection of code (and sometimes data) built for a particular set of tasks. For example, the package “ggplot2” contains functions for creating professional data visualizations. You will also hear packages called “libraries”, because the command that loads a package is “library”. As we work in R, sometimes we need to install and load packages in order to access the code or data that we want. So, how do we do that?
Look at the very top of your RStudio screen. You should see an option called Tools. Click on Tools and, from the drop-down menu, select Install Packages. In the prompt box that appears, type dplyr, and then hit Install. A whole bunch of output is going to appear on your screen. We can ignore all of it. This is just how R tells us it is installing the package.
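If you prefer typing to clicking, the same installation can usually be done from the Console. The sketch below is one way to do it. The check with “requireNamespace” is an optional extra (not part of the menu steps above) so that the package is only downloaded when it is missing, and the code assumes your computer is connected to the internet.

```r
# Install dplyr only if it is not already installed on this computer.
# (Assumes an internet connection; the check is an optional extra step.)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
```

You only need to install a package once per computer, but you need to load it with “library” every time you start a new R session.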
Once the package is installed, there is one more step we need to do to access the Star Wars data. Create a chunk in R, copy and paste the following, and hit play.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data("starwars")
If you look in your Environment Tab on the upper right-hand corner of your RStudio screen, you should see a data set called “starwars”. Let’s start to explore this data set.
How many cases (rows) are in this data set? How many variables?
Classify the variables in the data set as either numeric or categorical. For the categorical variables, are they ordinal or nominal? Are any binary? (Note: You only need to do this for columns 2 - 6).
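If you would like to check your counts with code rather than by eye, the sketch below shows three base R commands on a tiny made-up data set. The data frame “pets” and its columns are hypothetical and only there for illustration; you would swap in the name of your own data set.

```r
# A tiny, made-up data set (hypothetical, for illustration only)
pets <- data.frame(
  name    = c("Rex", "Tom", "Goldie"),
  species = c("dog", "cat", "fish"),
  weight  = c(30.5, 4.2, 0.1)
)

n_rows <- nrow(pets)   # number of cases (rows)
n_cols <- ncol(pets)   # number of variables (columns)
n_rows
n_cols

# str() lists each variable along with how R stores it (chr, num, int, ...)
str(pets)
```

Keep in mind that how R stores a variable is only a starting point. Deciding whether a categorical variable is ordinal, nominal, or binary still requires thinking about what the values mean.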
Note - You will notice something interesting in the 4th column of the data set. Some of the entries are recorded as “NA”. “NA” stands for “not available” and is one way that people indicate in a data set that information is missing. In other words, when we see “NA” in a data set, this tells us that a certain piece of information is not known.
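To see how R treats “NA” values, here is a small sketch using a made-up vector. The name “ages” and its values are hypothetical; the commands “is.na” and “sum” are part of base R.

```r
# A made-up vector with two missing values (hypothetical data)
ages <- c(23, NA, 31, 45, NA, 52)

# is.na() returns TRUE for each missing entry and FALSE otherwise
is.na(ages)

# TRUE counts as 1 and FALSE as 0, so sum() counts the missing values
n_missing <- sum(is.na(ages))
n_missing
```

Here the count is 2, because two of the six entries are “NA”.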
Let’s start off by exploring the height (in centimeters) of different characters in Star Wars. This is an interesting question to explore because there are multiple different species in the Star Wars universe. Whenever you start to explore a variable, remember that the tools you use will depend on the type of variable you are exploring. The tools we use to describe a categorical variable are different from the tools we use to describe a numeric variable.
In R, one powerful command for helping us explore a numeric variable is the “summary” command. Copy and paste the code from the chunk below and press play.
summary(starwars$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 66.0 167.0 180.0 174.6 191.0 264.0 6
For how many characters in this data set is the “height” unknown? (In other words, how many missing (“NA”) values are in the height variable?)
starwars$height
## [1] 172 167 96 202 150 178 165 97 183 182 188 180 228 180 173 175 170 180 66
## [20] 170 183 200 190 177 175 180 150 NA 88 160 193 191 170 185 196 224 206 183
## [39] 137 112 183 163 175 180 178 79 94 122 163 188 198 196 171 184 188 264 188
## [58] 196 185 157 183 183 170 166 165 193 191 183 168 198 229 213 167 96 193 191
## [77] 178 216 234 188 178 206 NA NA NA NA NA
summary(starwars$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 66.0 167.0 180.0 174.6 191.0 264.0 6
For how many characters in this data set is their “mass” unknown? (Hint: This requires changing the code above slightly.)
starwars$mass
## [1] 77.0 75.0 32.0 136.0 49.0 120.0 75.0 32.0 84.0 77.0
## [11] 84.0 NA 112.0 80.0 74.0 1358.0 77.0 110.0 17.0 75.0
## [21] 78.2 140.0 113.0 79.0 79.0 83.0 NA NA 20.0 68.0
## [31] 89.0 90.0 NA 45.0 66.0 82.0 NA NA NA 40.0
## [41] NA NA 80.0 NA 55.0 15.0 45.0 NA 65.0 84.0
## [51] 82.0 87.0 NA 50.0 NA NA 80.0 NA 85.0 NA
## [61] NA 80.0 56.2 50.0 NA 80.0 NA 79.0 55.0 102.0
## [71] 88.0 NA NA NA 48.0 NA 57.0 159.0 136.0 79.0
## [81] 48.0 80.0 NA NA NA NA NA
summary(starwars$mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 15.00 55.60 79.00 97.31 84.50 1358.00 28
The “summary” command provides a lot of information about the variable height, including “measures of center”. There are two measures of center provided in the summary output.
What are the names and values of the two different measures of center provided in the summary output for height?
Note - To determine which measure of center is a more appropriate tool for these data, we need to know something about the “distribution” of heights in this data set. Remember that a distribution means what values are possible for a particular variable and how often those different values occur. Because height is a numeric variable, one plot that we can use to visualize the distribution of height is a “histogram”.
To create a histogram in R, we use the command “hist”.
# hist(object, col = "some color", xlab = "the X axis label")
Recall that:
“hist” is the command; we want to make a histogram.
“object” is the data we want to make a histogram of. You need to replace this with the name of the data you want to plot.
“col” controls the color of the bars.
“xlab” controls what text appears on the x-axis.
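As a quick illustration of how these pieces fit together, here is a sketch that fills in the template with a made-up vector. The name “exam_scores”, its values, and the color are hypothetical choices for illustration only.

```r
# Made-up exam scores (hypothetical data, for illustration only)
exam_scores <- c(55, 62, 68, 70, 71, 74, 75, 78, 80, 83, 85, 91)

# Draw the histogram; saving the result in h is optional
h <- hist(exam_scores, col = "skyblue", xlab = "Exam score (points)")

# The saved object records how many values fell in each bar
h$counts
```

Saving the histogram in “h” is not required to draw the plot. It is shown here only because it lets you check the bar heights as numbers.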
Adapt the code above to create a histogram of the height of the Star Wars characters in this data set (Hint 1: the height column takes the place of “object” in the code above). Color the bars of your graph “gold” and label the x axis “Height of characters (in centimeters)”. (Hint 2: Remember that to grab only one column in a data set, we use dataset$column.)
hist(starwars$height, col = "gold", xlab = "Height of characters (in centimeters)", main = "Histogram of Height")
Note - When we look at a histogram, there are two things we are looking for. The first is “modality”, which just means how many peaks are in the distribution. One main peak means the distribution is “unimodal”. More than one main peak means the distribution is “multimodal”. The second thing we look for is skew. We tend to describe “unimodal” distributions as right skewed (long right tail), left skewed (long left tail), or symmetric (balanced around the peak).
Is the distribution of heights “unimodal” or “multimodal”?
Is the distribution of heights “skewed right”, “skewed left”, or “symmetric”?
Note - Sometimes it can be tricky to determine skew when looking at a histogram. One clue is that right skewed distributions tend to have means that are higher than the median. Left skewed distributions tend to have medians that are higher than the mean. Symmetric distributions tend to have means and medians that are very similar to one another.
For skewed distributions, the appropriate measure of center is the median, because the mean gets pulled toward the long tail. For symmetric distributions, the mean and the median are about the same, but generally we use the mean. Why don’t we just use the median all the time? The mean has some very nice mathematical properties that make it easier to work with than the median in a lot of situations. We will see this fairly soon as we continue to move through our course.
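To see why skew matters for the choice between the mean and the median, here is a small sketch with made-up numbers. The vector “incomes” is hypothetical: seven similar values plus one very large value, which creates a long right tail.

```r
# Made-up incomes in thousands of dollars (hypothetical data)
incomes <- c(30, 32, 35, 38, 40, 42, 45, 250)

mean_income   <- mean(incomes)    # pulled upward by the one very large value
median_income <- median(incomes)  # barely affected by it
mean_income
median_income

# If a variable has missing values, add na.rm = TRUE to drop them first
mean(c(incomes, NA), na.rm = TRUE)
```

The mean here is 64 while the median is 39. Only one of the eight values is above 64, so the median gives a better picture of a typical value in this made-up example.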
What measure of center would you use to describe the height of the Star Wars characters in this data set? State the value of that measure of center and briefly explain your choice.
Histograms are very useful, but they are not the only tool that we use to visualize the distribution of a numeric variable. Another tool we use is a “boxplot”, which visualizes the center and spread of a distribution quite differently from a histogram. Specifically, boxplots show the “first quartile”, “median”, and “third quartile” of a variable, and also make it easier to see “outliers”, i.e., unusually large or small values of the variable.
To make a “boxplot” in R, the command we need is “boxplot”.
# boxplot(object, col = "some color", xlab = "X axis label", horizontal = TRUE)
Note - This is the same setup we used for the histogram. The only new part of the code is “horizontal = TRUE”, which just tells R that we want a horizontal boxplot. If you want a vertical boxplot, just remove this piece of the code or change it to “horizontal = FALSE”.
We are starting to see that when we have a primary command, like “boxplot” or “hist”, we can then add “arguments”, or extra pieces of the command that help us to personalize our plot. Adding color, adding labels to the axes, and choosing whether the plot is vertical or horizontal are all examples of things we can specify using arguments.
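Here is a sketch of the boxplot template filled in with a made-up vector, to show the arguments in action. The name “wait_times”, its values, and the color are hypothetical choices for illustration only.

```r
# Made-up waiting times in minutes (hypothetical data)
wait_times <- c(2, 3, 3, 4, 4, 5, 5, 6, 7, 25)

# Draw a horizontal boxplot; saving the result in b is optional
b <- boxplot(wait_times, col = "orchid", xlab = "Waiting time (minutes)",
             horizontal = TRUE)

# Values that R drew as separate points (potential outliers)
b$out
```

In this made-up example, the value 25 sits far from the rest of the data, so R draws it as a separate point beyond the whisker.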
Adapt the code above to create a “boxplot” of the height of the Star Wars characters in this data set. Label the x axis “Height of characters (in centimeters)”. Make the plot any color you like, but do not use gold, black, or white! (Refer to this link for more colors: https://leesj.sites.oasis.unc.edu/FETCH/GRAB/Vignettes/colors_in_R.pdf)
boxplot(starwars$height, col = "lightgreen", xlab = "Height of characters (in centimeters)", main = "Boxplot of Height", horizontal = TRUE)
Which measure of center is depicted in a boxplot: the mean or the median?
Based on the boxplot, are there any outliers in terms of the characters’ heights? If so, is there one outlier, just a handful of outliers, or many outliers? State whether these outliers are abnormally large, abnormally small, or if both types of outliers are present.
Adapt the code above to create a boxplot of the mass of the Star Wars characters in this data set. Label the x axis “Mass of characters (in kilograms)”. Make the plot any color you like, but do not use gold, black, or white!
boxplot(starwars$mass, col = "red", xlab = "Mass of characters (in kilograms)", main = "Boxplot of Mass", horizontal = TRUE)
Based on the boxplot, there is one very large outlier in terms of mass. Which Star Wars character has this very large body mass? (Hint: You will need to open your data set!)
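Opening the data set and scrolling works well for a small table. For larger data sets, one possible alternative is to let R find the row for you. The sketch below uses a tiny made-up data frame; the name “trucks” and its columns are hypothetical and only there for illustration.

```r
# A tiny, made-up data set (hypothetical, for illustration only)
trucks <- data.frame(
  model  = c("A", "B", "C", "D"),
  weight = c(2100, NA, 9800, 2500)
)

# which.max() gives the row number of the largest value (NAs are skipped)
heaviest_row <- which.max(trucks$weight)

# Use that row number to look up the matching name
heaviest_model <- trucks$model[heaviest_row]
heaviest_model
```

The same two steps should carry over to other data sets by swapping in the data set and column names.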
In addition to measures of center, boxplots can be used to visualize measures of spread, specifically the Interquartile Range (IQR).
What is the IQR of height in this data set? (Hint: remember that we have already created a “summary” of height, and that might prove useful for answering this question.)
summary(starwars$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 66.0 167.0 180.0 174.6 191.0 264.0 6
\[\text{IQR} = Q_3 - Q_1\]
\[\text{IQR} = 191 - 167 = 24\]
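The subtraction can also be done in R. The sketch below first repeats the calculation using the quartiles copied from the summary output, and then shows the base R commands “quantile” and “IQR” on a small made-up vector (“x” is hypothetical). Note that there are several slightly different ways to define quartiles, so software and hand calculations can occasionally disagree by a small amount.

```r
# Quartiles of height, copied from the summary output above
q1 <- 167
q3 <- 191
iqr_height <- q3 - q1
iqr_height

# The same idea on a small made-up vector (hypothetical data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
quantile(x, c(0.25, 0.75))   # first and third quartiles
iqr_x <- IQR(x)              # third quartile minus first quartile
iqr_x

# With missing values, add na.rm = TRUE, e.g. IQR(x, na.rm = TRUE)
```

For the made-up vector, the quartiles are 3 and 7, so the IQR is 4.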
We have now seen two different visualizations of the distribution of height. What pieces of information about the distribution of height are provided in the histogram but not the boxplot, and vice versa?