library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data("starwars")
starwars
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Here is the text of the first question.
How many cases (rows) are in this data set? How many variables?
There are 87 cases (rows) and 14 variables: name, height, mass, hair_color, skin_color, eye_color, birth_year, sex, gender, homeworld, species, films, vehicles, and starships.
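These counts can also be read off in code rather than from the printed tibble header. A minimal sketch, assuming the dplyr package (which ships the starwars data) is installed; the object names `n_rows`, `n_cols`, and `var_names` are illustrative and not part of the lab:

```r
library(dplyr)  # provides the starwars data set

n_rows    <- nrow(starwars)   # number of cases (rows)
n_cols    <- ncol(starwars)   # number of variables (columns)
var_names <- names(starwars)  # variable names, in column order

n_rows
n_cols
var_names
```

Printing `var_names` is a quick way to make sure no variable is missed when listing them by hand.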
Here is the text of the second question.
Classify the variables in the data set as either numeric or categorical. For the categorical variables, are they ordered or un-ordered? Are any binary? Note: You only need to do this for columns 2 - 6.
Among columns 2-6 (height through eye_color), height and mass are numeric, while hair_color, skin_color, and eye_color are categorical. All three categorical variables are nominal (un-ordered), since their values have no meaningful order. None of them is binary, because each takes more than two distinct values.
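The numeric versus categorical split, and the check for binary variables, can be supported with a short inspection of the columns. This is a sketch under the assumption that character columns are the categorical ones; whether a categorical variable is ordered is a judgment about its meaning and cannot be read from the storage class. The names `cols_2_to_6`, `col_classes`, and `distinct_counts` are illustrative:

```r
library(dplyr)  # provides the starwars data set

cols_2_to_6 <- starwars[2:6]

# Storage class of each column: integer and numeric columns hold numeric
# variables, character columns hold categorical ones.
col_classes <- sapply(cols_2_to_6, class)

# Number of distinct non-missing values in each categorical column;
# a binary variable would have exactly 2.
categorical_names <- names(col_classes)[col_classes == "character"]
distinct_counts <- sapply(cols_2_to_6[categorical_names],
                          function(x) length(unique(na.omit(x))))

col_classes
distinct_counts
```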
summary(starwars$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 66.0 167.0 180.0 174.6 191.0 264.0 6
Here is the text of the third question.
For how many characters in this data set is their height unknown?
6 characters
summary(starwars$mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 15.00 55.60 79.00 97.31 84.50 1358.00 28
Here is the text of the fourth question.
For how many characters in this data set is their mass unknown? Hint: This requires changing the code above slightly.
28 characters
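The NA's entry in each summary can be cross-checked by counting missing values directly. A minimal sketch, assuming dplyr is installed; `missing_counts` is an illustrative name. The two counts should agree with the NA's values in the height and mass summaries:

```r
library(dplyr)  # provides the starwars data set

# is.na() flags missing entries; colSums() adds up the flags per column.
missing_counts <- colSums(is.na(starwars[c("height", "mass")]))

missing_counts
```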
Here is the text of the fifth question.
What are the names and values of the two different measures of center provided in the summary output for height?
Median = 180.0 centimeters
Mean = 174.6 centimeters
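Both measures can also be computed on their own. A small sketch, assuming dplyr is installed; `center` is an illustrative name. Because height contains missing values, `na.rm = TRUE` is needed, otherwise both functions return `NA`:

```r
library(dplyr)  # provides the starwars data set

height <- starwars$height

# na.rm = TRUE drops the missing heights before computing each measure.
center <- c(
  mean   = mean(height, na.rm = TRUE),
  median = median(height, na.rm = TRUE)
)

round(center, 1)
```

The rounded values should match the Mean and Median entries in the summary of height.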
Here is the text of the sixth question.
Adapt the code above to create a histogram of the height of the Star Wars characters in this data set (Hint 1: Height is the object in the code above). Color the bars of your graph gold and label the x axis “Height of characters (in centimeters)”. Hint 2: Remember that to grab only one column in a data set, we use dataset$column.
hist(starwars$height, col = "gold", xlab = "Height of characters (in centimeters)")
Here is the text of the seventh question.
Is the distribution of heights unimodal or multimodal?
Unimodal (one peak)
Here is the text of the eighth question.
Is the distribution of heights skewed right, skewed left, or symmetric?
Appears skewed left
Here is the text of the ninth question.
What measure of center would you use to describe the height of the Star Wars characters in this data set? State the value of that measure of center and briefly explain your choice.
I would use the median (180 cm). The histogram shows a left skew, and the summary of height agrees: the median (180.0) is greater than the mean (174.6), which is what happens when a tail of small values pulls the mean down. Because the median is resistant to skew and outliers, it is the more appropriate measure of center for the height distribution.
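The visual impression of skew can be backed up with two rough numeric checks. This is a sketch, not part of the lab: the moment-based skewness below is one of several common definitions, and the names `mean_minus_median` and `skewness` are illustrative. Negative values of either quantity would be consistent with a left skew:

```r
library(dplyr)  # provides the starwars data set

# Keep only the known heights.
height <- starwars$height[!is.na(starwars$height)]

# Gap between mean and median: negative when a lower tail pulls the mean down.
mean_minus_median <- mean(height) - median(height)

# Moment-based skewness: the average cubed z-score.
z <- (height - mean(height)) / sd(height)
skewness <- mean(z^3)

mean_minus_median
skewness
```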
Here is the text of the eleventh question.
Adapt the code above to create a boxplot of the height of the Star Wars characters in this data set. Label the x axis “Height of characters (in centimeters)”. Make the plot any color you like, but do not use gold, black, or white!
boxplot(starwars$height, col = "blue", xlab = "Height of characters (in centimeters)", horizontal = TRUE)
Here is the text of the twelfth question.
Which measure of center is depicted in a boxplot: the mean or the median?
The median
Here is the text of the thirteenth question.
Based on the boxplot, are there any outliers in terms of the characters’ heights? If so, is there one outlier, just a handful of outliers, or many outliers? State whether these outliers are abnormally large, abnormally small, or if both types of outliers are present.
There are many outliers, and both types are present. Many are abnormally small, falling below the lower whisker (more than 1.5 × IQR below the first quartile), and a handful are abnormally large, falling above the upper whisker (more than 1.5 × IQR above the third quartile).
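The points drawn individually on the boxplot can be listed and counted with `boxplot.stats()`, which applies the same default rule as `boxplot()` (points more than 1.5 times the box length beyond the box). A sketch, assuming dplyr is installed; note that `boxplot.stats()` uses hinges from `fivenum()`, which can differ slightly from the quartiles reported by `summary()`. The names `fences`, `low_count`, and `high_count` are illustrative:

```r
library(dplyr)  # provides the starwars data set

height <- starwars$height

# boxplot.stats() omits missing values and uses coef = 1.5 by default.
stats  <- boxplot.stats(height)
hinges <- stats$stats[c(2, 4)]  # lower and upper hinge (approximately Q1 and Q3)

# Fences: 1.5 box lengths below the lower hinge and above the upper hinge.
fences <- c(lower = hinges[1] - 1.5 * diff(hinges),
            upper = hinges[2] + 1.5 * diff(hinges))

outliers   <- stats$out
low_count  <- sum(outliers < fences[["lower"]])  # abnormally small heights
high_count <- sum(outliers > fences[["upper"]])  # abnormally large heights

sort(outliers)
c(low = low_count, high = high_count)
```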
Here is the text of the fourteenth question.
Adapt the code above to create a boxplot of the mass of the Star Wars characters in this data set. Label the x axis “Mass of characters (in kilograms)”. Make the plot any color you like, but do not use gold, black, or white!
boxplot(starwars$mass, col = "purple", xlab = "Mass of characters (in kilograms)", horizontal = TRUE)
Here is the text of the fifteenth question.
Based on the boxplot, there is one very large outlier in terms of mass. Which Star Wars character has this very large body mass? Hint: You will need to open your data set!
Jabba Desilijic Tiure aka Jabba the Hutt
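Instead of scrolling through the data set, the row with the largest mass can be pulled out directly. A minimal sketch, assuming dplyr is installed; `heaviest` is an illustrative name:

```r
library(dplyr)  # provides the starwars data set

# which.max() ignores NA values and returns the position of the largest mass.
heaviest <- starwars[which.max(starwars$mass), c("name", "mass")]

heaviest
```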
Here is the text of the sixteenth question.
What is the IQR of height in this data set? Hint: remember that we have already created a summary of height, and that might prove useful for answering this question.
summary(starwars$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 66.0 167.0 180.0 174.6 191.0 264.0 6
IQR = 3rd quartile - 1st quartile = 191.0 - 167.0 = 24.0
IQR (height) = 24 cm
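Since the question asks for the IQR of height, the subtraction can be checked against base R's `IQR()` function, which uses the same default quantile method as `summary()`. A sketch, assuming dplyr is installed; `iqr_by_hand` and `iqr_builtin` are illustrative names:

```r
library(dplyr)  # provides the starwars data set

height <- starwars$height

# First and third quartiles, ignoring missing heights.
quartiles <- quantile(height, probs = c(0.25, 0.75), na.rm = TRUE)

iqr_by_hand <- unname(quartiles[2] - quartiles[1])  # Q3 - Q1
iqr_builtin <- IQR(height, na.rm = TRUE)            # same quantity in one call

quartiles
iqr_by_hand
iqr_builtin
```

Both values should agree with the difference between the 3rd Qu. and 1st Qu. entries in the summary of height.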
Here is the text of the seventeenth question.
We have now seen two different visualizations of the distribution of height. What pieces of information about the distribution of height is provided in the histogram but not the boxplot, and vice versa?
After using both the histogram and the boxplot throughout this lab, I found that each plot has its own uses. The histogram shows the overall shape of the distribution: it is the better plot for judging modality and skewness, and it shows how frequently values fall within each range of heights. The boxplot does not show frequencies or modality, and skewness is harder to judge from it, but it does mark the median, the 1st and 3rd quartiles (and therefore the IQR), and whiskers that extend to the most extreme values within 1.5 × IQR of the quartiles. Points beyond the whiskers are drawn individually, so the boxplot is the more convenient plot for reading off the median and IQR and for identifying outliers, none of which the histogram shows directly.
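The difference between the two plots shows up in the numbers each one is built from. The sketch below, which assumes dplyr is installed, extracts the bin counts behind a histogram and the five-number summary behind a boxplot without drawing either plot; `bin_table` and `five_number` are illustrative names:

```r
library(dplyr)  # provides the starwars data set

height <- starwars$height

# hist(plot = FALSE) computes the bins without drawing; counts are the bar heights.
bins <- hist(height, plot = FALSE)
bin_table <- data.frame(
  lower = head(bins$breaks, -1),
  upper = tail(bins$breaks, -1),
  count = bins$counts
)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum.
# The box and the median line of a boxplot are drawn from the middle three.
five_number <- fivenum(height)

bin_table
five_number
```

The bin table carries the frequency information that the boxplot leaves out, while the five-number summary carries the median and quartile positions that the histogram does not mark.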