STA 111 Lab 2

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data("starwars")
starwars

## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Question 1

How many cases (rows) are in this data set? How many variables?

There are 87 cases (rows) and 14 variables: name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species, films, vehicles, and starships.
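The counts above can be cross-checked in code rather than read off the printout. A minimal sketch, assuming the dplyr package (which supplies `starwars`) is installed; the object names `n_rows`, `n_cols`, and `var_names` are illustrative only.

```r
library(dplyr)  # provides the starwars data set

n_rows <- nrow(starwars)      # number of cases (rows)
n_cols <- ncol(starwars)      # number of variables (columns)
var_names <- names(starwars)  # the variable names, in column order

n_rows
n_cols
var_names
```

Printing `var_names` is a quick way to make sure no column is left out when listing the variables by hand.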

Question 2

Classify the variables in the data set as either numeric or categorical. For the categorical variables, are they ordered or un-ordered? Are any binary? Note: You only need to do this for columns 2 - 6.

Among columns 2-6 (height through eye_color), height and mass are numeric, while hair_color, skin_color, and eye_color are categorical. All three categorical variables are nominal (un-ordered), since their values have no meaningful order. None of them is binary, because each takes more than two distinct values.
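The storage type of each column gives a first hint for this classification, although deciding whether a categorical variable is ordered still takes judgment. A minimal sketch, assuming dplyr is installed; `col_types` and `n_levels` are illustrative names.

```r
library(dplyr)  # provides the starwars data set

# Storage class of columns 2-6: integer/numeric suggests a numeric variable,
# character suggests a categorical one.
col_types <- sapply(starwars[2:6], class)
col_types

# Number of distinct non-missing values in each categorical column;
# a binary variable would have exactly 2.
cat_cols <- c("hair_color", "skin_color", "eye_color")
n_levels <- sapply(starwars[cat_cols], function(x) length(unique(na.omit(x))))
n_levels
```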

summary(starwars$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    66.0   167.0   180.0   174.6   191.0   264.0       6

Question 3

For how many characters in this data set is their height unknown?

Height is unknown for 6 characters (the NA's count in the summary above).

summary(starwars$mass)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   15.00   55.60   79.00   97.31   84.50 1358.00      28

Question 4

For how many characters in this data set is their mass unknown? Hint: This requires changing the code above slightly.

Mass is unknown for 28 characters (the NA's count in the summary above).
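The missing-value counts can also be obtained directly instead of reading the NA's column of `summary()`. A minimal sketch, assuming dplyr is installed; `na_counts` is an illustrative name.

```r
library(dplyr)  # provides the starwars data set

# is.na() marks each missing entry as TRUE; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(starwars[c("height", "mass")]))
na_counts
```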

Question 5

What are the names and values of the two different measures of center provided in the summary output for height?

Median = 180.0 centimeters

Mean = 174.6 centimeters

Question 6

Adapt the code above to create a histogram of the height of the Star Wars characters in this data set (Hint 1: Height is the object in the code above). Color the bars of your graph gold and label the x axis “Height of characters (in centimeters)”. Hint 2: Remember that to grab only one column in a data set, we use dataset$column.

hist(starwars$height, col = "gold", xlab = "Height of characters (in centimeters)")

Question 7

Is the distribution of heights unimodal or multimodal?

Unimodal (one peak)

Question 8

Is the distribution of heights skewed right, skewed left, or symmetric?

Appears skewed left

Question 9

What measure of center would you use to describe the height of the Star Wars characters in this data set? State the value of that measure of center and briefly explain your choice.

I would use the median (180 cm). The histogram shows a skew to the left, and the summary of height agrees: the median (180.0) is greater than the mean (174.6), which is what happens when a tail of small values pulls the mean down. Because the median is resistant to skew and outliers, it is the most appropriate measure for describing the center of the height distribution.
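The comparison between the two measures of center can be reproduced in code. A minimal sketch, assuming dplyr is installed; `na.rm = TRUE` is needed because some heights are missing, and `center` is an illustrative name.

```r
library(dplyr)  # provides the starwars data set

h <- starwars$height

# Both measures of center, ignoring the missing heights.
center <- c(mean = mean(h, na.rm = TRUE), median = median(h, na.rm = TRUE))
center

# TRUE when the median exceeds the mean, which is consistent with a left skew.
center[["median"]] > center[["mean"]]
```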

Question 11

Adapt the code above to create a boxplot of the height of the Star Wars characters in this data set. Label the x axis “Height of characters (in centimeters)”. Make the plot any color you like, but do not use gold, black, or white!

boxplot(starwars$height, col = "blue", xlab = "Height of characters (in centimeters)", horizontal = TRUE)

Question 12

Which measure of center is depicted in a boxplot: the mean or the median?

The median

Question 13

Based on the boxplot, are there any outliers in terms of the characters’ heights? If so, is there one outlier, just a handful of outliers, or many outliers? State whether these outliers are abnormally large, abnormally small, or if both types of outliers are present.

There are many outliers. Many are abnormally small, lying more than 1.5 × IQR below the first quartile, and a handful are abnormally large, lying more than 1.5 × IQR above the third quartile. (By default, boxplot() flags points that fall more than 1.5 × IQR beyond the quartiles; it does not use standard deviations from the mean.)
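The outliers drawn on the boxplot can be listed and counted in code. A minimal sketch, assuming dplyr is installed and the default boxplot rule of 1.5 × IQR beyond the quartiles; `lower`, `upper`, `low_out`, and `high_out` are illustrative names.

```r
library(dplyr)  # provides the starwars data set

h <- starwars$height

# Quartiles and the 1.5 * IQR fences used to flag outliers.
q <- quantile(h, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# The points boxplot() would draw individually (missing values are dropped).
out <- boxplot.stats(h)$out
sort(out)

low_out <- sum(out < lower)    # abnormally small heights
high_out <- sum(out > upper)   # abnormally large heights
c(small = low_out, large = high_out)
```

Sorting the flagged values makes it easy to see how many fall on each side of the box.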

Question 14

Adapt the code above to create a boxplot of the mass of the Star Wars characters in this data set. Label the x axis “Mass of characters (in kilograms)”. Make the plot any color you like, but do not use gold, black, or white!

boxplot(starwars$mass, col = "purple", xlab = "Mass of characters (in kilograms)", horizontal = TRUE)

Question 15

Based on the boxplot, there is one very large outlier in terms of mass. Which Star Wars character has this very large body mass? Hint: You will need to open your data set!

Jabba Desilijic Tiure aka Jabba the Hutt
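Instead of scanning the data set by eye, the heaviest character can be looked up directly. A minimal sketch, assuming dplyr is installed; `heaviest` is an illustrative name.

```r
library(dplyr)  # provides the starwars data set

# which.max() returns the position of the largest mass, skipping missing values.
heaviest <- starwars$name[which.max(starwars$mass)]
heaviest
```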

Question 16

What is the IQR of height in this data set? Hint: remember that we have already created a summary of height, and that might prove useful for answering this question.

summary(starwars$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    66.0   167.0   180.0   174.6   191.0   264.0       6

IQR = 3rd quartile - 1st quartile = 191.0 - 167.0 = 24.0

IQR (height) = 24.0 cm
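As a cross-check on the hand subtraction, base R's `IQR()` computes the interquartile range directly. A minimal sketch for height, the variable the question asks about, assuming dplyr is installed; `na.rm = TRUE` is needed because some heights are missing, and `height_iqr` is an illustrative name.

```r
library(dplyr)  # provides the starwars data set

# Interquartile range of height: 3rd quartile minus 1st quartile.
height_iqr <- IQR(starwars$height, na.rm = TRUE)
height_iqr
```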

Question 17

We have now seen two different visualizations of the distribution of height. What pieces of information about the distribution of height is provided in the histogram but not the boxplot, and vice versa?

Having used both the histogram and the boxplot throughout this lab, I found that each shows things the other does not. The histogram shows the overall shape of the distribution: the number of modes, the skewness, and how frequently values fall in each range. The boxplot is less convenient for judging shape and frequency, but it marks the median, the 1st and 3rd quartiles (and therefore the IQR), and whiskers that reach to the most extreme points within 1.5 × IQR of the quartiles, with anything beyond them drawn as an individual outlier. Because of this, the boxplot is convenient for reading off the median and IQR and for identifying outliers, which the histogram does not show directly.