---
title: "Project 2"
author: "Courtney Erickson and Rachel K. Erickson"
output: html_document
---

For this project you will use the starwars dataset that comes with the dplyr package. According to swapi.com: The Star Wars API is the world’s first quantified and programmatically-accessible data source for all the data from the Star Wars canon universe!

Load the dplyr package and any other packages you may need in the code chunk below. You only need to load the package. You do not need to install anything. (Note: You also might want the ggplot2 and mosaic packages.)

There are entries in the starwars data set that have missing values. We need to remove these cases to make our analysis easier. Delete the # in the code chunk below to remove cases with missing values.

In the questions below, you will be asked to write hypotheses in terms of appropriate symbols. Here are some symbols you may need: \(\mu\), \(\rho\), \(\sigma\), \(\leq\), \(\geq\), \(\neq\), <, >. You can also write Greek letters with subscripts like this: \(\mu_{1}\).

Part 1: Height by Gender

  1. (4 pts.) Construct a plot that shows the effect of gender on height. Be sure your graph has appropriate labels!
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

## $y
## [1] "Height"
## 
## attr(,"class")
## [1] "labels"
  1. (4 pts.) Does there appear to be a difference in the distribution of height for males and females? Explain your answer.

ANSWER: Yes, the plot shows that male characters tend to be taller than female characters.
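One way to draw such a plot is a side-by-side boxplot. The sketch below is only an illustration, not the plot produced for this project. It assumes the `gender` and `height` columns of `dplyr::starwars`, and the names `sw` and `height_plot` are made up for the example. Every layer, including the axis labels, is chained with `+`; a label call such as `ylab("Height")` left on its own line is printed to the console instead of being added to the plot.

```r
library(dplyr)
library(ggplot2)

# Keep only the rows where both gender and height are recorded
sw <- starwars %>%
  select(name, gender, height) %>%
  na.omit()

# Side-by-side boxplots of height for each gender category
height_plot <- ggplot(sw, aes(x = gender, y = height)) +
  geom_boxplot() +
  labs(
    title = "Height of Star Wars Characters by Gender",
    x = "Gender",
    y = "Height (cm)"
  )

height_plot
```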

  1. (2 pts.) Suppose we are interested in determining if there is a difference in the mean height for male and female characters in Star Wars. Write the null and alternative hypotheses using appropriate symbols.

\(H_0\): \(\mu_m - \mu_f = 0\)

\(H_a\): \(\mu_m - \mu_f \neq 0\)

  1. (2 pts.) Define any parameters you used above.

ANSWER: \(\mu_m\) is the mean height of all male Star Wars characters and \(\mu_f\) is the mean height of all female Star Wars characters. The hypotheses are about the difference between these two means.

  1. (3 pts.) Copy the data into StatKey.

Find the p-value and give an interpretation in context.

(Hint: Copy and paste the last line of code into the console). The second line makes a subset for which only males or females are included. The na.omit function removes any observations with missing (“NA”) values from the dataset.

Be sure to uncheck the “Data has header row” box in StatKey.

##              
##  male   , 172
##  male   , 202
##  female , 150
##  male   , 178
##  female , 165
##  male   , 183
##  male   , 182
##  male   , 188
##  male   , 228
##  male   , 180
##  male   , 173
##  male   , 170
##  male   , 180
##  male   ,  66
##  male   , 170
##  male   , 183
##  male   , 190
##  male   , 177
##  male   , 175
##  male   , 180
##  male   ,  88
##  male   , 160
##  male   , 193
##  male   , 191
##  male   , 196
##  male   , 224
##  male   , 112
##  male   , 175
##  female , 178
##  male   ,  94
##  male   , 163
##  male   , 188
##  male   , 198
##  male   , 196
##  female , 184
##  male   , 188
##  male   , 185
##  male   , 183
##  female , 170
##  female , 166
##  male   , 193
##  male   , 183
##  female , 168
##  male   , 198
##  male   , 229
##  male   ,  79
##  male   , 193
##  female , 178
##  male   , 216
##  male   , 234
##  male   , 188
##  female , 178
##  male   , 206
##  female , 165

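ANSWER: P-value = 0.474. Because this p-value is large, there is no convincing evidence of a difference in mean height between male and female characters. A difference as large as the one observed could plausibly occur by random chance alone.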
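The kind of randomization test that StatKey runs can be approximated directly in R. The sketch below is an illustration under assumptions, not the procedure used to obtain the p-value reported above. It uses the heights transcribed from the listing in this question and a reallocation-style shuffle (one common randomization scheme; StatKey's own settings may differ). The names `male`, `female`, `perm_diffs`, and `p_value` are made up for the example.

```r
# Heights (cm) transcribed from the male/female listing above
male <- c(172, 202, 178, 183, 182, 188, 228, 180, 173, 170, 180, 66, 170,
          183, 190, 177, 175, 180, 88, 160, 193, 191, 196, 224, 112, 175,
          94, 163, 188, 198, 196, 188, 185, 183, 193, 183, 198, 229, 79,
          193, 216, 234, 188, 206)
female <- c(150, 165, 178, 184, 170, 166, 168, 178, 178, 165)

# Observed difference in sample means (male minus female)
observed <- mean(male) - mean(female)

# Reallocate the pooled heights to the two groups many times
set.seed(1)
pooled <- c(male, female)
n_male <- length(male)
perm_diffs <- replicate(5000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:n_male]) - mean(shuffled[-(1:n_male)])
})

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(abs(perm_diffs) >= abs(observed))
p_value
```

Because the shuffles are random, the p-value changes a little from run to run unless the seed is fixed.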

Part 2: BMI

  1. (2 pts.) The data set contains measurements of height in centimeters and mass in kilograms. We will use these variables to calculate Body Mass Index, BMI for all of The Star Wars characters. \(BMI = \frac{mass}{Height^2 }\) where mass is in kilograms and height is in meters. I have created a variable called heightm that converts centimeters to meters in the code chunk below. Delete the # to use the heightm variable. Notice that we ensure heightm is added to the starwars dataset by calling it starwars$heightm
heightm <- starwars$height / 100

Create a variable called BMI. Do this by dividing the mass variable from the starwars data set by the square of the heightm variable you created above. Notice that we ensure BMI is added to the starwars dataset by calling it starwars$BMI

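# BMI = mass in kilograms divided by the square of height in meters
BMI <- starwars$mass / (heightm^2)
# Store the result as a new column of the starwars data set
starwars$BMI <- BMI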

Once you add the variable to the dataset you may want to use the View() function to see some of the results.

  1. (4 pts.) Create a graph to display the distribution of BMI. Be sure to include proper axis labels and titles!

  1. (4 pts.) What measure of center and spread would you use to summarize the distribution of BMI? Explain your answer.

ANSWER: I would use the median as the measure of center and the IQR (from the five-number summary) as the measure of spread, because the distribution is strongly skewed to the right. The one extreme outlier distorts the mean and standard deviation.

  1. (3 pts.) BMI contains one extreme outlier. What case number does this correspond to? What is the character’s name?
which(starwars$BMI == max(starwars$BMI))
## [1] 12
starwars$name[which(starwars$BMI == max(starwars$BMI))]
## [1] "Jabba Desilijic Tiure"

ANSWER: Case number 12, Jabba Desilijic Tiure

  1. (4 pts.) The starwars data set clearly contains one outlier. First, we will create a new dataset that does not contain the extreme outlier. The code in the chunk below copies the original starwars data set, without case 0, and saves it into a new data set called starwars2. Remove the # from the beginning of the code chunk below. Change the 0 to the case number that you looked up in question 9. Keep the negative sign. The negative sign is used to exclude this case.
starwars2 <- starwars[-(which(starwars$BMI == max(starwars$BMI))),]

Recreate the plot showing the distribution of BMI with the outlier removed. What measure of center and spread would you use to summarize the distribution of BMI?

ggplot(data = starwars2, aes(x = BMI)) +
  geom_histogram(binwidth = 10, fill = "Orange") +
  ggtitle("Distribution of BMI in Star Wars without the Outlier") +
  xlab("BMI")

ANSWER: I would use the median to describe the center and the IQR to describe the spread.

  1. (4 pts.) Calculate the mean and standard deviation of BMI with and without the outlier. Comment on how removing the outlier changed these values.
mean(starwars$BMI)
## [1] 32.05795
mean(starwars2$BMI)
## [1] 24.57848
sd(starwars$BMI)
## [1] 56.36346
sd(starwars2$BMI)
## [1] 6.699268

ANSWER: With the outlier, the mean is 32.05795 and the standard deviation is 56.36346. Without the outlier, the mean is 24.57848 and the standard deviation is 6.699268. Removing the outlier lowered the mean and sharply reduced the standard deviation, because the outlier's BMI was so much higher than every other value that it pulled the mean upward and inflated the spread.

  1. (2 pts.) Explain why a confidence interval for the mean BMI is a more appropriate statistical inference than a hypothesis test.

ANSWER: A hypothesis test answers a yes-or-no question about a specific claimed value, and here there is no claimed value of mean BMI to test. A confidence interval instead answers the question of how much, by giving a range of plausible values for the mean BMI.

  1. (4 pts.) Copy the BMI data (without the outlier) into StatKey. Be sure to uncheck the “Data has header row” box in StatKey. Construct a 95% confidence interval for BMI and interpret it here.
##          
##  26.02758
##  33.33007
##  21.77778
##  37.87401
##  27.54821
##  25.08286
##  23.24598
##  23.76641
##  21.54509
##  24.69136
##  24.72518
##  26.64360
##  33.95062
##  39.02663
##  25.95156
##  23.35095
##  35.00000
##  31.30194
##  25.21625
##  25.79592
##  25.61728
##  25.82645
##  26.56250
##  23.89326
##  24.67038
##  17.18034
##  16.34247
##  31.88776
##  26.12245
##  17.35892
##  50.92802
##  24.46460
##  23.76641
##  20.91623
##  22.64681
##  14.76843
##  22.63468
##  24.83565
##  23.88844
##  19.44637
##  18.14487
##  21.47709
##  23.58984
##  19.48696
##  26.01775
##  16.78076
##  24.03461
##  12.88625
##  17.99015
##  34.07922
##  24.83746
##  22.35174
##  15.14960
##  18.85192
##  16.52893

ANSWER: -23.871 to 23.481. I am 95% confident that the mean BMI of the Star Wars characters is between -23.871 and 23.481.
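A bootstrap confidence interval for the mean can also be sketched in R. The code below is an illustration under assumptions, not the StatKey computation itself. It uses the BMI values transcribed from the listing in this question and the percentile method, so StatKey may report a slightly different interval. The names `bmi`, `boot_means`, and `ci` are made up for the example.

```r
# BMI values transcribed from the listing above (outlier already removed)
bmi <- c(26.02758, 33.33007, 21.77778, 37.87401, 27.54821,
         25.08286, 23.24598, 23.76641, 21.54509, 24.69136,
         24.72518, 26.64360, 33.95062, 39.02663, 25.95156,
         23.35095, 35.00000, 31.30194, 25.21625, 25.79592,
         25.61728, 25.82645, 26.56250, 23.89326, 24.67038,
         17.18034, 16.34247, 31.88776, 26.12245, 17.35892,
         50.92802, 24.46460, 23.76641, 20.91623, 22.64681,
         14.76843, 22.63468, 24.83565, 23.88844, 19.44637,
         18.14487, 21.47709, 23.58984, 19.48696, 26.01775,
         16.78076, 24.03461, 12.88625, 17.99015, 34.07922,
         24.83746, 22.35174, 15.14960, 18.85192, 16.52893)

# Resample the data with replacement and record each sample mean
set.seed(1)
boot_means <- replicate(5000, mean(sample(bmi, replace = TRUE)))

# Percentile method: middle 95% of the bootstrap means
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```

A useful sanity check on any interval for a mean is that it should contain the sample mean, here `mean(bmi)`.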

BONUS: (up to 4 bonus points on this assignment) What happens to the bootstrap distribution if you include the outlier? Why does it have this particular shape? Would it be appropriate to construct a 95% confidence interval in this case?

ANSWER: If you include the outlier, the bootstrap distribution is much more spread out, because the bootstrap samples differ greatly depending on how many times each one includes the outlier. It has this shape because a sample is less likely to draw the outlier than the smaller “normal” values, so most samples contain it only a few times or not at all. It would not be appropriate to construct a 95% confidence interval with the outlier, because the bootstrap distribution is then not normal.

Part 3: Height and Mass

  1. (4 pts.) Construct a plot that displays the relationship between height and mass, for the entire starwars dataset (include the outlier). Use height as the explanatory variable. Be sure to label your plot.

  1. (3 pts.) Summarize the relationship between height and mass in terms of strength, direction, and trend.

ANSWER: The plot shows a moderately strong, positive relationship between height and mass.

  1. (1 pt.) Calculate the correlation for height and mass (including the outlier).
cor(starwars$height, starwars$mass)
## [1] 0.1136525
  1. (2 pts.) Calculate the correlation for height and mass using the starwars2 dataset. How does the correlation change once the outlier is removed?
cor(starwars2$height, starwars2$mass)
## [1] 0.7376548

ANSWER: The correlation increases dramatically once the outlier is removed, from about 0.114 to about 0.738. The data no longer include such an extreme value, so the remaining points fall closer together and follow a more linear pattern.

  1. (2 pts.) Fit a regression model with height as the explanatory variable and mass as the response variable including the outlier. Display the summary statistics for the regression coefficients.
reg.model <- lm(height ~ mass, data=starwars2)
summary.lm(reg.model)$coefficients 
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 110.7263422   8.877073 12.473294 2.143763e-17
## mass          0.8580839   0.107884  7.953765 1.326426e-10
my.plot <- ggplot(data = starwars2, aes(x = height, y = mass)) + geom_point(color = "blue") + 
  ggtitle("Distribution of Mass over Height without the Outlier") +
  xlab("Height") + 
  ylab("Mass") 
my.plot + geom_smooth(method = "lm")

  1. (2 pts.) Add the regression line to the plot from question 14. You may do this in the code chunk for question 14.

  2. (4 pts.) Construct the regression model for height and mass with the outlier removed. Display the summary statistics for the regression coefficients. Also create a plot displaying the relationship and the regression line. (Hint: You have already done these steps using the starwars dataset. You should be able to copy your previous code and modify it to perform these tasks.)

reg.model <- lm(height ~ mass, data=starwars2)
summary.lm(reg.model)$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 110.7263422   8.877073 12.473294 2.143763e-17
## mass          0.8580839   0.107884  7.953765 1.326426e-10
my.plot <- ggplot(data = starwars2, aes(x = height, y = mass)) + geom_point(color = "blue") + 
  ggtitle("Distribution of Mass over Height") +
  xlab("Height") + 
  ylab("Mass") 
my.plot + geom_smooth(method = "lm")

  1. (2 pts.) Comment on how removing the outlier changed the regression line.

ANSWER: When we removed the outlier, the relationship described by the regression line became stronger and more clearly positive.
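In R's formula syntax the response variable goes on the left of `~` and the explanatory variable on the right, so a model with mass as the response and height as the explanatory variable is written `mass ~ height`. The sketch below fits that model with and without the most extreme BMI value. It is an illustration under assumptions: `sw` and `sw2` are hypothetical stand-ins built from `dplyr::starwars`, not the exact `starwars` and `starwars2` objects used in this project, so the coefficients may differ from those shown above.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned data: complete height/mass cases
sw <- na.omit(starwars[, c("name", "height", "mass")])
sw$BMI <- sw$mass / (sw$height / 100)^2

# Drop the single case with the largest BMI
sw2 <- sw[-which.max(sw$BMI), ]

# Response on the left of ~, explanatory variable on the right
fit_with    <- lm(mass ~ height, data = sw)
fit_without <- lm(mass ~ height, data = sw2)

# Coefficient tables for the two fits
summary(fit_with)$coefficients
summary(fit_without)$coefficients
```

Comparing the `height` row of the two tables shows how the slope and its standard error respond to the outlier.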

  1. (2 pts.) For humans, there is a positive correlation between height and mass. We would like to test if there is a positive correlation between height and mass for characters in the Star Wars universe. Using appropriate symbols, what null and alternative hypotheses would we use?

\(H_0\): \(\rho = 0\)

\(H_a\): \(\rho > 0\)

  1. (1 pt.) Define any symbols you used above.

ANSWER: \(\rho\) is the correlation between height and mass for all characters in the Star Wars universe.

  1. (4 pts.) Copy the data into StatKey with the outlier removed, and conduct the test. Be sure to uncheck the “Data has header row” box in StatKey.

Report the p-value, and give an interpretation in context.

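ANSWER: The p-value is 0.164. Because this p-value is not small, we fail to reject the null hypothesis. There is not convincing evidence of a positive correlation between height and mass for characters in the Star Wars universe.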
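A one-sided randomization test for a positive correlation can be sketched in R as well. This is an illustration under assumptions, not the StatKey run that produced the p-value reported above. `sw2` is a hypothetical stand-in built from `dplyr::starwars` with the largest-BMI case removed, and shuffling one variable is one common way to simulate the case of no association.

```r
library(dplyr)

# Hypothetical stand-in for the data with the outlier removed
sw <- na.omit(starwars[, c("height", "mass")])
sw$BMI <- sw$mass / (sw$height / 100)^2
sw2 <- sw[-which.max(sw$BMI), ]

# Observed sample correlation between height and mass
observed_r <- cor(sw2$height, sw2$mass)

# Shuffle mass to break any link with height, then recompute r
set.seed(1)
perm_r <- replicate(5000, cor(sw2$height, sample(sw2$mass)))

# One-sided p-value: share of shuffles with r at least as large as observed
p_value <- mean(perm_r >= observed_r)
p_value
```

Only the upper tail is counted because the alternative hypothesis is that the correlation is positive.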