5.6 Working backwards, Part II. A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

# The sample mean is the midpoint of the confidence interval, so we add the lower and upper bounds and divide by two
mean <- (65 + 77)/2

# The formula for CI is: CI = mean +/- (Margin of Error) --> (Margin of Error) = (upper bound) - mean
margin.error <- 77 - mean

# We will now look up in a T table the two-tailed critical value for (25 observations - 1) = 24 degrees of freedom that corresponds with a 90% CI.
# T = 1.711
# Now let's calculate the sample standard deviation.
# Remember, (Margin of Error) = T * (standard deviation)/sqrt(n observations)
st.dev <- (margin.error/1.711) * sqrt(25)

paste("Sample Mean: ", mean)
## [1] "Sample Mean:  71"
paste("Margin of Error: ", margin.error)
## [1] "Margin of Error:  6"
paste("Sample Standard Deviation: ", round(st.dev,2))
## [1] "Sample Standard Deviation:  17.53"
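
The critical value can also be computed with `qt()` instead of being read from a table. The following is a minimal sketch of the same working-backwards calculation; the variable names (`ci.lower`, `t.star`, and so on) are our own and not part of the exercise. As a check, it rebuilds the interval from the recovered mean and standard deviation.

```r
# Given: 90% CI of (65, 77) from a sample of n = 25
ci.lower <- 65
ci.upper <- 77
n <- 25

# A 90% two-sided interval leaves 5% in each tail, so use the 0.95 quantile
t.star <- qt(0.95, df = n - 1)

x.bar <- (ci.lower + ci.upper) / 2   # midpoint of the interval
me <- (ci.upper - ci.lower) / 2      # half the width of the interval
s <- me * sqrt(n) / t.star           # solve ME = t* x s / sqrt(n) for s

# Check: rebuilding the interval should return the original bounds
ci.check <- x.bar + c(-1, 1) * t.star * s / sqrt(n)

round(c(t.star = t.star, mean = x.bar, margin = me, sd = s), 3)
```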

5.14 SAT scores. SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

  1. Raina wants to use a 90% confidence interval. How large a sample should she collect?

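It is important to remember to round up to the next whole number if the result has a fraction (which is done in the code) for this exercise, because rounding down would leave the margin of error slightly above 25 points.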

# Margin of error = T * (standard deviation) / sqrt(n observations)
# In this case, we will use a Z value in place of T: the population standard deviation is given, and the sample will be well over 30 students, where the T values approach the Z values
# Look for the Z value for 90% confidence interval. Given that this is two tailed, will need to use z value for .95
z <- qnorm(.95, mean = 0, sd = 1)
sat.sd <- 250
margin.error2 <- 25
sat.size <- (sat.sd/(margin.error2/z))^2
paste("Raina should collect", ceiling(sat.size), "students.")
## [1] "Raina should collect 271 students."
  1. Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.
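# Luke's sample will need to be larger. A 99% confidence interval uses a larger critical value (about 2.58 versus 1.64 for 90%), so with the same standard deviation and the same 25 point margin of error, n has to increase.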
  1. Calculate the minimum required sample size for Luke.
# For a 99% confidence interval, which is two tailed, we need the z value for .995
z1 <- qnorm(.995, mean = 0, sd = 1)
sat.sd <- 250
margin.error3 <- 25
sat.size1 <- (sat.sd/(margin.error3/z1))^2
paste("Luke should collect", ceiling(sat.size1), "students.")
## [1] "Luke should collect 664 students."
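
Both sample sizes come from the same formula, so it can be wrapped in a small helper. This is only a sketch: `required.n` is a hypothetical function name of our own, and it assumes a known population standard deviation so that a z critical value applies. Using `ceiling()` guarantees rounding up regardless of the fractional part.

```r
# Minimum n so that z* x sd / sqrt(n) <= margin (hypothetical helper)
required.n <- function(conf.level, sd, margin) {
  # Two-tailed critical value, e.g. the 0.95 quantile for 90% confidence
  z.star <- qnorm(1 - (1 - conf.level) / 2)
  # Always round up: rounding down would exceed the target margin of error
  ceiling((z.star * sd / margin)^2)
}

# Required sample sizes at 90% (Raina), 95%, and 99% (Luke) confidence
sizes <- sapply(c(0.90, 0.95, 0.99), required.n, sd = 250, margin = 25)
sizes
```

The required sample size grows with the confidence level, which matches the reasoning about Luke's sample.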

5.20 High School and Beyond, Part I. The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

  1. Is there a clear difference in the average reading and writing scores?
# There is no clear difference in the average reading and writing scores on visual inspection.
  1. Are the reading and writing scores of each student independent of each other?
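# No. Each student has both a reading and a writing score, so the two scores are paired, and a student's reading score is likely related to the same student's writing score. What is independent is one student from another: the 200 students are a simple random sample and make up less than 10% of the students in the survey from the National Center of Education Statistics.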
  1. Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?
# Null hypothesis: The difference in the average scores of students in the reading and writing exam == 0
# Alternative hypothesis: The difference in the average scores of students in the reading and writing exam != 0
  1. Check the conditions required to complete this test.
# Because the students are a simple random sample and represent less than 10% of the entire survey, the differences in scores are independent from one student to the next. The differences should also be nearly normal. The plots demonstrate little skew, and even if there is some skew, it is not a concern given that there are 200 students in the sample.
  1. The average observed difference in scores is $\bar{x}_{read-write} = 0.545$, and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?
# Null hypothesis: Average score (read - write) == 0
# Calculate the standard error of the mean difference
st.er <- 8.887/sqrt(200)
# Calculate the Tscore. Mean will be zero, given assumption of null hypothesis
Tscore <- (0.545 - 0)/st.er
paste("T score:", round(Tscore,3))
## [1] "T score: 0.867"
# Look up the T score on a table for degrees of freedom 199 (two-tailed)
# On a table, this falls between the .50 and .20 range for p values.
# Therefore, we cannot reject the null hypothesis
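
Instead of bracketing the p value with a table, it can be computed with `pt()`. A minimal sketch using the summary statistics from the question; the variable names are our own.

```r
# Summary statistics for the paired differences (read - write)
diff.mean <- 0.545
diff.sd <- 8.887
n.students <- 200

# Test statistic under H0: mean difference = 0
t.stat <- (diff.mean - 0) / (diff.sd / sqrt(n.students))

# Two-tailed p value: twice the area beyond |t| with n - 1 degrees of freedom
p.value <- 2 * pt(-abs(t.stat), df = n.students - 1)

round(c(t = t.stat, p = p.value), 3)
```

The p value comes out at roughly 0.39, inside the .20 to .50 range read from the table and well above 0.05.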
  1. What type of error might we have made? Explain what the error means in the context of the application.
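# This may potentially be a type II error: failing to reject the null hypothesis when a difference actually exists. In this context, there may be a real difference between the average reading and writing scores that this sample did not detect.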
  1. Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.
# Yes. Since we failed to reject H0 at the 5% level, a 95% confidence interval for the average difference would include the null value of 0.
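
The interval itself can be sketched to confirm this. The code below assumes a 95% confidence level to match a two-tailed test at alpha = 0.05; the variable names are our own.

```r
# 95% CI for the mean paired difference (read - write)
diff.mean <- 0.545
diff.se <- 8.887 / sqrt(200)

# Two-tailed critical value with 199 degrees of freedom
t.star <- qt(0.975, df = 199)

diff.ci <- diff.mean + c(-1, 1) * t.star * diff.se
round(diff.ci, 2)

# TRUE if the interval covers the null value of 0
diff.ci[1] < 0 & diff.ci[2] > 0
```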

5.32 Fuel efficiency of manual and automatic cars, Part I. Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year. Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

# To detect a difference, we will use an alpha = 0.05. 
# Null hypothesis: Automatic MPG == Manual MPG, or in other words, difference == 0
# Alternative hypothesis: Automatic MPG != Manual MPG
# Search for two-tailed p value
# Given sample size < 30, will use T test instead of Z test
manual.mean <- 19.85
manual.sd <- 4.51
manual.n <- 26
automatic.mean <- 16.12
automatic.sd <- 3.58
automatic.n <- 26
mean.diff.car <- manual.mean - automatic.mean

# Find the standard error of the difference in means
car.se <- sqrt((automatic.sd^2/automatic.n) + (manual.sd^2/manual.n))

# Search for T value
car.T <- (mean.diff.car - 0)/car.se
paste("T score:", round(car.T,2))
## [1] "T score: 3.3"
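# Both samples have 26 cars, so we use min(26 - 1, 26 - 1) = 25 degrees of freedom
# Look up on chart for the corresponding T value for 25 degrees of freedom
# The two-tailed p value is between .005 and .002, which is below alpha = 0.05, thus we can reject the null hypothesis.
# This does provide strong evidence of a difference in average city mileage between manual and automatic cars. In these samples, manual cars averaged 3.73 MPG more.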
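
The p value can also be computed directly. The sketch below uses the conservative degrees of freedom, min(n1 - 1, n2 - 1), and, for comparison only, the Welch-Satterthwaite approximation that `t.test()` uses by default for two samples. The variable names are our own.

```r
# Summary statistics from the exercise
m.mean <- 19.85; m.sd <- 4.51; m.n <- 26   # manual
a.mean <- 16.12; a.sd <- 3.58; a.n <- 26   # automatic

# Variance of each sample mean, then the test statistic
m.var <- m.sd^2 / m.n
a.var <- a.sd^2 / a.n
t.stat <- (m.mean - a.mean) / sqrt(m.var + a.var)

# Conservative degrees of freedom and two-tailed p value
df.cons <- min(m.n, a.n) - 1
p.cons <- 2 * pt(-abs(t.stat), df = df.cons)

# Welch-Satterthwaite degrees of freedom (an alternative, shown for comparison)
df.welch <- (m.var + a.var)^2 / (m.var^2 / (m.n - 1) + a.var^2 / (a.n - 1))
p.welch <- 2 * pt(-abs(t.stat), df = df.welch)

round(c(t = t.stat, p.cons = p.cons, df.welch = df.welch, p.welch = p.welch), 4)
```

Either choice of degrees of freedom gives a p value far below 0.05, so the conclusion does not depend on it.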

5.48 Work hours and education. The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

  1. Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.
# Null hypothesis: The average hours worked is equal across all 5 groups (mu1 = mu2 = mu3 = mu4 = mu5).
# Alternative hypothesis: At least one group's average hours worked differs from the others.
  1. Check conditions and describe any assumptions you must make to proceed with the test.
# It is reasonable to assume that the 1172 respondents did not influence one another's answers, making the observations independent. Also, given that the respondents are drawn from the entire country, it is safe to assume that the sample is < 10% of the population. The standard deviations look fairly similar to each other, so the groups have similar variance. We also need to assume that hours worked is nearly normal within each group. However, even if the distributions are skewed, the large total sample of 1172 respondents makes the ANOVA less sensitive to this.
  1. Below is part of the output associated with this test. Fill in the empty cells.

Credit to www.khanacademy.com. They had great explanations on how to calculate the MSG, MSE, and F values. Another website with great information on how to calculate the MSG, MSE, and F values is: http://oak.ucc.nau.edu/rh232/courses/EPS625/Handouts/One-Way%20ANOVA/Hand%20Calculation%20of%20ANOVA.pdf

#             | Df    | Sum Sq  | Mean Sq | F value | Pr(>F) 
#_____________________________________________________________________________________
# degree      | 4     | 2004.1  | 501.03  | 2.19    | 0.0682
# Residuals   | 1167  | 267382  | 229.12  |
# Total       | 1171  | 269386.1
  1. What is the conclusion of the test?
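# At a significance level of 0.05, we fail to reject the null hypothesis, because our P value of 0.0682 is greater than 0.05. The data do not provide convincing evidence that the average hours worked differs across the five groups.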
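
The table entries are linked by a few identities: Df for groups is k - 1, Df for residuals is n - k, each Mean Sq is its Sum Sq divided by its Df, and F is the ratio of the two Mean Sq values. A minimal sketch that recomputes the row values from the two sums of squares in the table above, taking those as given; the variable names are our own.

```r
k <- 5            # number of educational attainment groups
n.total <- 1172   # total respondents

ssg <- 2004.1     # Sum Sq for degree, as entered in the table above
sse <- 267382     # Sum Sq for residuals

df.group <- k - 1
df.resid <- n.total - k

msg <- ssg / df.group   # mean square between groups
mse <- sse / df.resid   # mean square error
f.stat <- msg / mse

# Upper-tail area of the F distribution gives Pr(>F)
p.value <- pf(f.stat, df.group, df.resid, lower.tail = FALSE)

round(c(MSG = msg, MSE = mse, F = f.stat, p = p.value), 4)
```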