Marshall Meisel Stats Lab #2

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data(starwars)

Question 1

How many cases (rows) are in this data set? How many variables? 87 rows and 14 variables.
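The counts can also be checked in code rather than read off the data viewer. This is a minimal sketch, assuming the `starwars` data shipped with the installed version of dplyr; the counts may differ between versions, and the names `n_rows` and `n_cols` are illustrative only.

```r
library(dplyr)
data(starwars)

# Number of cases (rows) and number of variables (columns)
n_rows <- nrow(starwars)
n_cols <- ncol(starwars)

# dim() reports both at once: rows first, then columns
dim(starwars)
```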

Question 2

Classify the variables in the data set as either numeric or categorical. For the categorical variables, are they ordered or un-ordered? Are any binary? Note: You only need to do this for columns 2 - 6.

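Columns 2 and 3 (height and mass) are numeric. Columns 4-6 (hair_color, skin_color, and eye_color) are categorical and un-ordered, as is column 1 (name). None of columns 2-6 is binary, but the gender column (column 9) is, with the values masculine and feminine.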
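One way to back up this classification is to ask R how each column is stored and how many distinct values it holds. The sketch below assumes columns 2 through 6 are the ones of interest, as the question states; `col_types` and `n_levels` are illustrative names. Storage class alone does not say whether a categorical variable is ordered, but plain character columns carry no ordering.

```r
library(dplyr)
data(starwars)

# Storage class of columns 2 through 6:
# "integer" or "numeric" suggests a numeric variable,
# "character" suggests a categorical one
col_types <- sapply(starwars[2:6], class)
col_types

# Number of distinct non-missing values per column;
# a binary variable would have exactly 2
n_levels <- sapply(starwars[2:6], function(x) length(unique(na.omit(x))))
n_levels
```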

summary(starwars$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    66.0   167.0   180.0   174.4   191.0   264.0       6

Question 3

For how many characters in this data set is their height unknown? 6

summary(starwars$mass)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   15.00   55.60   79.00   97.31   84.50 1358.00      28

Question 4

For how many characters in this data set is their mass unknown? Hint: This requires changing the code above slightly. 28
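The NA's entry in the summary output can be cross-checked by counting missing values directly. A minimal sketch, assuming the same `starwars` data; `missing_height` and `missing_mass` are illustrative names.

```r
library(dplyr)
data(starwars)

# is.na() flags each missing value; sum() counts the TRUEs
missing_height <- sum(is.na(starwars$height))
missing_mass <- sum(is.na(starwars$mass))

c(height = missing_height, mass = missing_mass)
```

If the data match the version used in this lab, these counts should agree with the NA's columns in the two summaries.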

Question 5

What are the names and values of the two different measures of center provided in the summary output for height? The mean height is 174.4 cm and the median height is 180.0 cm.

Question 6

Adapt the code above to create a histogram of the height of the Star Wars characters in this data set (Hint 1: Height is the object in the code above). Color the bars of your graph gold and label the x axis “Height of characters (in centimeters)”. Hint 2: Remember that to grab only one column in a data set, we use dataset$column.

hist(starwars$height, col = "gold", xlab = "Height of characters (in centimeters)")

Question 7

Is the distribution of heights unimodal or multimodal? Unimodal

Question 8

Is the distribution of heights skewed right, skewed left, or symmetric? It is skewed slightly to the left, which is consistent with the mean (174.4) being below the median (180.0).

Question 9

What measure of center would you use to describe the height of the Star Wars characters in this data set? State the value of that measure of center and briefly explain your choice. I would use the median, which is 180.0, because it is less affected by outliers than the mean, and this data set contains outliers.

boxplot(starwars$height, col = "red", xlab = "Height of characters (in centimeters)", horizontal = TRUE)

Question 11

Adapt the code above to create a boxplot of the height of the Star Wars characters in this data set. Label the x axis “Height of characters (in centimeters)”. Make the plot any color you like, but do not use gold, black, or white! For suggestions, look at

Question 12

Which measure of center is depicted in a boxplot: the mean or the median? Median

Question 13

Based on the boxplot, are there any outliers in terms of the characters’ heights? If so, is there one outlier, just a handful of outliers, or many outliers? State whether these outliers are abnormally large, abnormally small, or if both types of outliers are present. There are many outliers, and both types are present: some are abnormally large and some are abnormally small.
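The points drawn beyond the whiskers can be listed instead of counted by eye. This sketch uses `boxplot.stats()` from base R, which applies the same 1.5 x IQR rule that `boxplot()` uses by default; the names `out`, `n_low`, and `n_high` are illustrative.

```r
library(dplyr)
data(starwars)

# Heights that boxplot() would draw as individual outlier points
out <- boxplot.stats(starwars$height)$out
sort(out)

# Split the outliers into abnormally small and abnormally large
med <- median(starwars$height, na.rm = TRUE)
n_low <- sum(out < med)
n_high <- sum(out > med)

c(small = n_low, large = n_high)
```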

Question 14

Adapt the code above to create a boxplot of the mass of the Star Wars characters in this data set. Label the x axis “Mass of characters (in kilograms)”. Make the plot any color you like, but do not use gold, black, or white!

boxplot(starwars$mass, col = "orange", xlab = "Mass of characters (in kilograms)", horizontal = TRUE)

Question 15

Based on the boxplot, there is one very large outlier in terms of mass. Which Star Wars character has this very large body mass? Hint: You will need to open your data set! Jabba Desilijic Tiure
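Instead of scrolling through the data set, the heaviest character can be looked up in code. A minimal sketch, assuming the same `starwars` data; `heaviest` is an illustrative name.

```r
library(dplyr)
data(starwars)

# which.max() returns the position of the largest mass, ignoring NAs;
# that position is then used to pull the matching name
heaviest <- starwars$name[which.max(starwars$mass)]
heaviest
```

The result should match the character named in the answer to Question 15.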

Question 16

What is the IQR of height in this data set? Hint: remember that we have already created a summary of height, and that might prove useful for answering this question. 24
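The answer comes from subtracting the first quartile from the third quartile in the height summary: 191.0 - 167.0 = 24.0. The same calculation can be sketched in R, assuming the same `starwars` data; `iqr_manual` and `iqr_builtin` are illustrative names.

```r
library(dplyr)
data(starwars)

# First and third quartiles of height, skipping missing values
q <- quantile(starwars$height, c(0.25, 0.75), na.rm = TRUE)

# IQR by hand: third quartile minus first quartile
iqr_manual <- unname(q[2] - q[1])

# IQR using the built-in function
iqr_builtin <- IQR(starwars$height, na.rm = TRUE)

c(manual = iqr_manual, builtin = iqr_builtin)
```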

Question 17

We have now seen two different visualizations of the distribution of height. What pieces of information about the distribution of height is provided in the histogram but not the boxplot, and vice versa?

The histogram does a better job of showing the overall shape of the distribution, including its skewness and modality. The boxplot shows the outliers and the quartiles. It also suggests that the median is the better measure of center here, because outliers are present.