---
title: "Project 2"
author: "Courtney Erickson and Rachel K. Erickson"
output: html_document
---
For this project you will use the starwars dataset that comes with the dplyr package. According to swapi.com: The Star Wars API is the world’s first quantified and programmatically-accessible data source for all the data from the Star Wars canon universe!
Load the dplyr package and any other packages you may need in the code chunk below. You only need to load the package. You do not need to install anything. (Note: You also might want the ggplot2 and mosaic packages.)
There are entries in the starwars data set that have missing values. We need to remove these cases to make our analysis easier. Delete the # in the code chunk below to remove cases with missing values.
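As a minimal sketch of what removing incomplete cases does, the following uses a small hypothetical data frame and base R's `na.omit()`. The `toy` object and its values are illustrative assumptions, not rows of the starwars data.

```r
# Hypothetical toy data standing in for a few starwars-like columns
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  height = c(172, NA, 150, 178),
  mass   = c(77, 75, NA, 120)
)

# na.omit() drops every row that has at least one missing (NA) value
toy_complete <- na.omit(toy)

# Only rows "A" and "D" have no missing values
print(toy_complete)
```

Any row with a missing value in any column is dropped, so it is worth checking how many cases remain afterwards.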
In the questions below, you will be asked to write hypotheses in terms of appropriate symbols. Here are some symbols you may need: \(\mu\), \(\rho\), \(\sigma\), \(\leq\), \(\geq\), \(\neq\), <, >. You can also write Greek letters with subscripts like this: \(\mu_{1}\).
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
## $y
## [1] "Height"
##
## attr(,"class")
## [1] "labels"
ANSWER: Yes, the plot suggests that male characters are more likely to be tall than female characters.
\(H_0\): \(\mu_{m} - \mu_{f} = 0\)
\(H_a\): \(\mu_{m} - \mu_{f} \neq 0\)
ANSWER: The parameter of interest is the difference in mean height between male and female Star Wars characters. \(\mu_{m}\) is the mean height of males and \(\mu_{f}\) is the mean height of females.
Find the p-value and give an interpretation in context.
(Hint: Copy and paste the last line of code into the console). The second line makes a subset for which only males or females are included. The na.omit function removes any observations with missing (“NA”) values from the dataset.
Be sure to uncheck the “Data has header row” box in StatKey.
##
## male , 172
## male , 202
## female , 150
## male , 178
## female , 165
## male , 183
## male , 182
## male , 188
## male , 228
## male , 180
## male , 173
## male , 170
## male , 180
## male , 66
## male , 170
## male , 183
## male , 190
## male , 177
## male , 175
## male , 180
## male , 88
## male , 160
## male , 193
## male , 191
## male , 196
## male , 224
## male , 112
## male , 175
## female , 178
## male , 94
## male , 163
## male , 188
## male , 198
## male , 196
## female , 184
## male , 188
## male , 185
## male , 183
## female , 170
## female , 166
## male , 193
## male , 183
## female , 168
## male , 198
## male , 229
## male , 79
## male , 193
## female , 178
## male , 216
## male , 234
## male , 188
## female , 178
## male , 206
## female , 165
ANSWER: P-value = 0.474. With a p-value this large, there is little evidence that the difference in mean height between males and females is due to anything other than random chance.
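A randomization test of this kind can also be approximated directly in R. The sketch below is one possible version, not the assignment's required method: it re-enters the 54 heights printed above, shuffles them between the two groups, and counts how often a shuffled difference in means is at least as extreme as the observed one. The seed, the number of shuffles (`n_reps`), and all variable names are assumptions chosen for illustration, so the resulting p-value will not match StatKey exactly.

```r
# Heights (cm) copied from the printed output above, split by group
male <- c(172, 202, 178, 183, 182, 188, 228, 180, 173, 170, 180, 66, 170,
          183, 190, 177, 175, 180, 88, 160, 193, 191, 196, 224, 112, 175,
          94, 163, 188, 198, 196, 188, 185, 183, 193, 183, 198, 229, 79,
          193, 216, 234, 188, 206)
female <- c(150, 165, 178, 184, 170, 166, 168, 178, 178, 165)

heights  <- c(male, female)
n_male   <- length(male)
observed <- mean(male) - mean(female)

set.seed(1)     # assumed seed, only so the sketch is reproducible
n_reps <- 5000  # assumed number of shuffles

# Shuffle the heights, then recompute the difference in group means
diffs <- replicate(n_reps, {
  shuffled <- sample(heights)
  mean(shuffled[1:n_male]) - mean(shuffled[-(1:n_male)])
})

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(abs(diffs) >= abs(observed))
p_value
```

Because the group labels are shuffled at random, the p-value changes slightly from run to run unless the seed is fixed.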
# to use the heightm variable. Notice that we ensure heightm is added to the starwars dataset by calling it starwars$heightm
heightm <- starwars$height / 100
Create a variable called BMI. Do this by dividing the mass variable from the starwars data set by the square of the heightm variable you created above. Notice that we ensure BMI is added to the starwars dataset by calling it starwars$BMI
BMI <- starwars$mass / (heightm^2)
starwars$BMI = BMI
Once you add the variable to the dataset you may want to use the View() function to see some of the results.
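To make the BMI arithmetic concrete, here is a small worked sketch for a single character. The values (172 cm and 77 kg) and the variable names are illustrative assumptions; BMI is mass in kilograms divided by the square of height in meters.

```r
# Hypothetical single character: height in centimeters, mass in kilograms
height_cm <- 172
mass_kg   <- 77

# Convert height to meters, then divide mass by height squared
height_m <- height_cm / 100
bmi      <- mass_kg / height_m^2

round(bmi, 5)
# [1] 26.02758
```

Note that the height must be converted to meters before squaring; dividing by the height in centimeters would give values 10,000 times too small.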
ANSWER: You would use the median and the five-number summary to summarize the distribution, because it is strongly skewed to the right. The single outlier distorts the mean and the standard deviation.
which(starwars$BMI == max(starwars$BMI))
## [1] 12
starwars$name[which(starwars$BMI == max(starwars$BMI))]
## [1] "Jabba Desilijic Tiure"
ANSWER: Case number 12, Jabba Desilijic Tiure
0, and saves it into a new data set called starwars2. Remove the # from the beginning of the code chunk below. Change the 0 to the case number that you looked up in question 9. Keep the negative sign. The negative sign is used to exclude this case.
starwars2 <- starwars[-(which(starwars$BMI == max(starwars$BMI))),]
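The negative-index idea can be seen on a small example. This is a sketch on a hypothetical three-row data frame (`toy` and its values are assumptions); it uses `which.max()`, which returns the position of the first maximum, as a compact alternative to `which(x == max(x))`.

```r
# Hypothetical toy data with one extreme BMI value
toy <- data.frame(
  name = c("A", "B", "C"),
  BMI  = c(26.0, 443.4, 21.8)
)

# Row number of the largest BMI
outlier_row <- which.max(toy$BMI)

# A negative row index keeps every row except that one
toy2 <- toy[-outlier_row, ]
print(toy2)
```

The original data frame is left unchanged, so both versions stay available for comparison.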
Recreate the plot showing the distribution of BMI with the outlier removed. What measure of center and spread would you use to summarize the distribution of BMI?
ggplot(data = starwars2, aes(x = BMI)) +
geom_histogram(binwidth = 10, fill = "Orange") +
ggtitle("Distribution of BMI in Star Wars without the Outlier") +
xlab("BMI")
ANSWER: You would use the median to describe the center and the IQR to describe the spread.
mean(starwars$BMI)
## [1] 32.05795
mean(starwars2$BMI)
## [1] 24.57848
sd(starwars$BMI)
## [1] 56.36346
sd(starwars2$BMI)
## [1] 6.699268
ANSWER: With the outlier, the mean is 32.05795 and the standard deviation is 56.36346. Without the outlier, the mean is 24.57848 and the standard deviation is 6.699268. Removing the outlier changed both values because its BMI was so much higher than all the others: it pulled the mean upward and greatly inflated the standard deviation, so without it the values are much closer together.
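The same effect shows up on a tiny example. The sketch below uses a hypothetical vector (all values and names are assumptions) to compare how much the mean and standard deviation move relative to the median and IQR when one extreme value is dropped.

```r
# Hypothetical values with one extreme observation
x_with    <- c(20, 22, 24, 26, 28, 400)
x_without <- x_with[x_with < 100]

# Helper that collects the four summaries for one vector
summarise_vec <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
}

stats <- rbind(
  with_outlier    = summarise_vec(x_with),
  without_outlier = summarise_vec(x_without)
)
round(stats, 2)
```

In this toy case the mean and standard deviation shift far more than the median and IQR, which is why the resistant measures are preferred for skewed data.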
ANSWER: A hypothesis test answers a yes-or-no question, whereas a confidence interval answers the question of how much by giving a range of plausible values.
##
## 26.02758
## 33.33007
## 21.77778
## 37.87401
## 27.54821
## 25.08286
## 23.24598
## 23.76641
## 21.54509
## 24.69136
## 24.72518
## 26.64360
## 33.95062
## 39.02663
## 25.95156
## 23.35095
## 35.00000
## 31.30194
## 25.21625
## 25.79592
## 25.61728
## 25.82645
## 26.56250
## 23.89326
## 24.67038
## 17.18034
## 16.34247
## 31.88776
## 26.12245
## 17.35892
## 50.92802
## 24.46460
## 23.76641
## 20.91623
## 22.64681
## 14.76843
## 22.63468
## 24.83565
## 23.88844
## 19.44637
## 18.14487
## 21.47709
## 23.58984
## 19.48696
## 26.01775
## 16.78076
## 24.03461
## 12.88625
## 17.99015
## 34.07922
## 24.83746
## 22.35174
## 15.14960
## 18.85192
## 16.52893
ANSWER: -23.871 to 23.481. I am 95% confident that the mean BMI of the Star Wars characters is between -23.871 and 23.481.
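A bootstrap interval for the mean BMI can also be sketched in R. The code below re-enters the 55 BMI values printed above, resamples them with replacement, and takes the middle 95% of the bootstrap means (the percentile method). The seed, the number of resamples (`n_boot`), and the choice of the percentile method are assumptions; StatKey may use different settings, so the endpoints will differ somewhat.

```r
# BMI values copied from the printed output above (outlier already removed)
bmi <- c(26.02758, 33.33007, 21.77778, 37.87401, 27.54821, 25.08286,
         23.24598, 23.76641, 21.54509, 24.69136, 24.72518, 26.64360,
         33.95062, 39.02663, 25.95156, 23.35095, 35.00000, 31.30194,
         25.21625, 25.79592, 25.61728, 25.82645, 26.56250, 23.89326,
         24.67038, 17.18034, 16.34247, 31.88776, 26.12245, 17.35892,
         50.92802, 24.46460, 23.76641, 20.91623, 22.64681, 14.76843,
         22.63468, 24.83565, 23.88844, 19.44637, 18.14487, 21.47709,
         23.58984, 19.48696, 26.01775, 16.78076, 24.03461, 12.88625,
         17.99015, 34.07922, 24.83746, 22.35174, 15.14960, 18.85192,
         16.52893)

set.seed(2)     # assumed seed, only so the sketch is reproducible
n_boot <- 5000  # assumed number of bootstrap samples

# Each bootstrap sample has the same size as the data, drawn with replacement
boot_means <- replicate(n_boot, mean(sample(bmi, replace = TRUE)))

# Percentile interval: middle 95% of the bootstrap means
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```

A useful sanity check is that the interval should contain the sample mean, `mean(bmi)`, since the bootstrap distribution is centered near it.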
BONUS: (up to 4 bonus points on this assignment) What happens to the bootstrap distribution if you include the outlier? Why does it have this particular shape? Would it be appropriate to construct a 95% confidence interval in this case?
ANSWER: If you include the outlier, the bootstrap means vary much more from one bootstrap sample to the next. The distribution has this shape because a bootstrap sample is much less likely to draw the outlier than one of the smaller, more typical values, so the means cluster according to how many times the outlier was drawn. It would not be appropriate to construct a 95% confidence interval with the outlier included, because the bootstrap distribution is then not approximately normal.
starwars dataset (include the outlier). Use height as the explanatory variable. Be sure to label your plot.
ANSWER: The plot shows a moderately strong, positive relationship between height and mass.
cor(starwars$height, starwars$mass)
## [1] 0.1136525
starwars2 dataset. How does the correlation change once the outlier is removed?
cor(starwars2$height, starwars2$mass)
## [1] 0.7376548
ANSWER: The correlation increases dramatically once the outlier is removed, because the data no longer include such an extreme value. The remaining points lie closer together and follow a more linear pattern.
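A small sketch shows how a single extreme point can mask an otherwise strong linear pattern. The heights and masses below are hypothetical values chosen for illustration, not rows of the starwars data.

```r
# Hypothetical data: five points on a near-linear trend plus one extreme mass
height <- c(150, 160, 170, 180, 190, 175)
mass   <- c(52, 57, 66, 75, 80, 1358)

# Correlation using all six points
cor_with <- cor(height, mass)

# Correlation after dropping the point with the largest mass
keep        <- mass < max(mass)
cor_without <- cor(height[keep], mass[keep])

c(with_outlier = cor_with, without_outlier = cor_without)
```

The correlation coefficient is not resistant to outliers, so it is worth plotting the data before relying on it.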
reg.model <- lm(height ~ mass, data=starwars2)
summary.lm(reg.model)$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 110.7263422   8.877073 12.473294 2.143763e-17
## mass          0.8580839   0.107884  7.953765 1.326426e-10
my.plot <- ggplot(data = starwars2, aes(x = height, y = mass)) + geom_point(color = "blue") +
ggtitle("Distribution of Mass over Height without the Outlier") +
xlab("Height") +
ylab("Mass")
my.plot + geom_smooth(method = "lm")
(2 pts.) Add the regression line to the plot from question 14. You may do this in the code chunk for question 14.
(4 pts.) Construct the regression model for height and mass with the outlier removed. Display the summary statistics for the regression coefficients. Also create a plot displaying the relationship and the regression line. (Hint: You have already done these steps using the starwars dataset. You should be able to copy your previous code and modify it to perform these tasks.)
reg.model <- lm(height ~ mass, data=starwars2)
summary.lm(reg.model)$coefficients
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 110.7263422   8.877073 12.473294 2.143763e-17
## mass          0.8580839   0.107884  7.953765 1.326426e-10
my.plot <- ggplot(data = starwars2, aes(x = height, y = mass)) + geom_point(color = "blue") +
ggtitle("Distribution of Mass over Height") +
xlab("Height") +
ylab("Mass")
my.plot + geom_smooth(method = "lm")
ANSWER: When we removed the outlier, the linear relationship became stronger and the slope of the regression line became more positive.
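In an R model formula the response goes on the left of `~` and the explanatory variable on the right, so a model with height as the explanatory variable is written `mass ~ height`. The sketch below illustrates this on a hypothetical five-point data frame; `toy`, its values, and the prediction at 175 cm are assumptions for illustration only.

```r
# Hypothetical data: height (cm) as explanatory, mass (kg) as response
toy <- data.frame(
  height = c(150, 160, 170, 180, 190),
  mass   = c(52, 57, 66, 75, 80)
)

# response ~ explanatory
fit <- lm(mass ~ height, data = toy)

# Intercept and slope of the fitted line
coef(fit)

# Predicted mass for a character 175 cm tall
pred <- predict(fit, newdata = data.frame(height = 175))
pred
```

Swapping the two sides of the formula fits a different line (height predicted from mass), so the coefficient table would no longer describe the line that `geom_smooth(method = "lm")` draws on a plot with height on the x-axis.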
\(H_0\): \(\rho = 0\)
\(H_a\): \(\rho \neq 0\)
15). (1 pt.) Define any symbols you used above.
ANSWER: The parameter is the correlation between height and mass for Star Wars characters. \(\rho\) is the population correlation between height and mass.
Report the p-value, and give an interpretation in context.
ANSWER: The p-value is 0.164. We therefore fail to reject the null hypothesis; the data do not provide convincing evidence of an association between height and mass.
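For comparison, base R can test a correlation with `cor.test()`, which by default performs a t-based test of whether the Pearson correlation is zero. This is a different technique from a randomization test, so its p-value need not match one obtained by simulation. The data below are hypothetical values used only to show the call.

```r
# Hypothetical data, not the starwars values
toy <- data.frame(
  height = c(150, 160, 170, 180, 190),
  mass   = c(52, 57, 66, 75, 80)
)

# Two-sided test of H0: correlation = 0 (Pearson, t-based)
result <- cor.test(toy$height, toy$mass)

result$estimate  # sample correlation
result$p.value   # p-value for the two-sided test
```

A small p-value here would be evidence against a correlation of zero; a large one would mean the data are consistent with no linear association.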