5.6

A 90% confidence interval for a population mean is (65,77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations.

Calculate the sample mean, the margin of error, and the sample standard deviation.

#First, the sample mean is the center of the confidence interval, so we can find it by averaging the two endpoints.

sample_mean <- mean(c(77,65))

#Next, let's look at the spread. Half the width of the interval tells us how far the endpoints sit from the mean, i.e. the margin of error.

spread <- 77 - 65
margin <- spread/2



#The population standard deviation is unknown and n = 25, so the interval is based on the t-distribution with n - 1 = 24 degrees of freedom. In a 90% confidence interval each tail holds 5%, so the critical value is the 95th percentile.

n <- 25
t_star <- qt(.95, df = n - 1)

#The margin of error equals the critical value times the standard error, so dividing it by the critical value gives the standard error.

SE <- margin / t_star

# Finally, now that we know the standard error, we can rearrange the SE formula (SE = sd/sqrt(n)) to find the sample SD.

sample_sd <- SE * sqrt(n)

df <- data.frame(Value = round(c(sample_mean, margin, sample_sd), 2))
row.names(df) <- c("Sample Mean", "Margin of Error","Sample SD")

df
##                 Value
## Sample Mean     71.00
## Margin of Error  6.00
## Sample SD       17.53
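
As a sanity check, the steps can be wrapped in a small helper and run in reverse. This is only a sketch: `ci_to_summary` is a hypothetical name that does not come from the exercise, and it assumes a t-based interval with n - 1 degrees of freedom (population SD unknown).

```r
# Hypothetical helper (not part of the exercise): invert a t-based interval
# to recover the sample mean, margin of error, and sample SD.
ci_to_summary <- function(lower, upper, n, conf) {
  t_star <- qt(1 - (1 - conf) / 2, df = n - 1)  # critical value with n - 1 df
  me <- (upper - lower) / 2                     # half the interval width
  list(mean = (lower + upper) / 2,
       margin = me,
       sd = me / t_star * sqrt(n))              # from ME = t* * sd / sqrt(n)
}

res <- ci_to_summary(lower = 65, upper = 77, n = 25, conf = 0.90)

# Round trip: rebuilding the interval from the recovered values
# should return the original endpoints.
rebuilt <- res$mean + c(-1, 1) * qt(0.95, df = 24) * res$sd / sqrt(25)
print(rebuilt)
```

If the rebuilt endpoints match 65 and 77, the recovered mean, margin, and SD are consistent with each other.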

5.14

SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project.

They want their margin of error to be no more than 25 points.

(a) Raina wants to use a 90% confidence interval. How large a sample should she collect?

sample_calc <- function(ci, sd, me) {
  z <- qnorm(1-((1 - ci)/2))
  
  n <- ((z * sd)^2)/me^2
  
  return(ceiling(n))
}

sample_calc(ci = .90, sd = 250, me = 25)
## [1] 271

(b) Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina's.

Luke's sample would need to be larger. A higher confidence level means a larger critical value z*. Since ME = z* × sd/sqrt(n) and the margin of error stays capped at 25, n must increase to compensate: ME is inversely proportional to sqrt(n).

(c) Calculate the minimum required sample size for Luke.

sample_calc(ci = .99, sd = 250, me = 25)
## [1] 664
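
One way to check that these really are the minimum sample sizes is to compute the margin of error just below and at each answer. The sketch below is self-contained, and `me_for_n` is a hypothetical helper name introduced only for this check.

```r
# Hypothetical helper: margin of error for a given sample size,
# using the normal critical value (population SD is given as 250).
me_for_n <- function(n, ci, sd) {
  z <- qnorm(1 - (1 - ci) / 2)
  z * sd / sqrt(n)
}

# Raina (90%): one observation fewer than the answer should miss the target
raina <- c(me_for_n(270, 0.90, 250), me_for_n(271, 0.90, 250))

# Luke (99%): same check around his answer
luke <- c(me_for_n(663, 0.99, 250), me_for_n(664, 0.99, 250))

print(round(raina, 3))
print(round(luke, 3))
```

In each pair the first value should sit just above 25 and the second just below it, which is why the function rounds n up with `ceiling()`.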

5.20

The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects.

Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

  1. Is there a clear difference in the average reading and writing scores?

Not quite. The medians differ, but the IQRs are very similar, so it is unclear how different the averages actually are.

  2. Are the reading and writing scores of each student independent of each other?

The students come from a simple random sample of the survey, so we can assume each student is independent of the others.

The reading and writing scores of a given student, however, are not independent of each other: they are paired data.

  3. Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

H0: mu_read - mu_write = 0

Ha: mu_read - mu_write != 0

  4. Check the conditions required to complete this test.

The sample is a simple random sample.

Each observation/case (student) is independent of the next.

The distribution of the differences is nearly normal.

  5. The average observed difference in scores is x̄_read-write = -0.545, and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?
n <- 200
diff <- -0.545
sd <- 8.887

SE <- sd/sqrt(n) 

t_stat <- diff/SE

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom, to match Ha
round(2 * pt(-abs(t_stat), df = n - 1), 2)
## [1] 0.39

The observed difference is less than one standard error from the null value of 0, giving a two-sided p-value of about 0.39 (if the null hypothesis were true, we would see a difference at least this extreme about 39% of the time). Hence, we fail to reject the null hypothesis.

  6. What type of error might we have made? Explain what the error means in the context of the application.

A Type 2 error. A Type 2 error is a failure to reject the null hypothesis when the alternative hypothesis is actually true. In this application, it would mean that there really is a difference between the average reading and writing scores, but our test failed to detect it.

  7. Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

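Yes. We failed to reject the null hypothesis in a two-sided test, so a confidence interval at the matching confidence level (95% for the usual 5% significance level) should include 0. The observed mean difference is less than one standard error from 0, while the interval extends roughly two standard errors on either side.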
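
The interval itself can be computed from the same summary statistics. This is a minimal sketch assuming a 95% confidence level, which the exercise does not state explicitly; the variable names are chosen here for illustration.

```r
# Summary statistics given in the exercise
n <- 200
x_bar_diff <- -0.545
s_diff <- 8.887

se <- s_diff / sqrt(n)

# Assumed 95% confidence level, t critical value with n - 1 df
t_star <- qt(0.975, df = n - 1)

ci <- x_bar_diff + c(-1, 1) * t_star * se
print(ci)

# TRUE when the interval contains 0
contains_zero <- ci[1] < 0 && ci[2] > 0
print(contains_zero)
```

The lower endpoint is negative and the upper endpoint is positive, which agrees with failing to reject the null hypothesis.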

5.32

Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year.

Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.

mean_auto <- 16.12
mean_man <- 19.85
sd_auto <- 3.58
sd_man <- 4.15
n <- 26
df <- n-1

mean_diff <- mean_auto - mean_man

SE_diff <- sqrt((sd_auto^2/n) + (sd_man^2/n))

t_stat <- mean_diff/SE_diff

# Two-sided p-value, since the question asks about a difference in either direction
p <- 2 * pt(-abs(t_stat), df)

p
## [1] 0.001902395

We have a minuscule p-value, so a difference in city MPG this large would be very unlikely to arise by chance alone if the two population means were equal. We reject the null hypothesis that the average city mileage is the same for automatic and manual transmissions.
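
A confidence interval for the difference tells the same story and shows the size of the gap. The sketch below assumes a 95% confidence level and reuses the conservative df = n - 1 = 25 from the test above; the variable names are illustrative.

```r
# Summary statistics given in the exercise (n = 26 cars in each group)
mean_auto <- 16.12
mean_man <- 19.85
sd_auto <- 3.58
sd_man <- 4.15
n_each <- 26

mean_diff <- mean_auto - mean_man
se_diff <- sqrt(sd_auto^2 / n_each + sd_man^2 / n_each)

# Assumed 95% confidence level with df = n_each - 1
t_star <- qt(0.975, df = n_each - 1)

ci_diff <- mean_diff + c(-1, 1) * t_star * se_diff
print(round(ci_diff, 2))
```

Both endpoints are negative, so the interval excludes 0 and points to lower average city mileage for automatic transmissions in these samples.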

5.48

The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents.

Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

  1. Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

H0: There is no difference in the means of the groups

Ha: There is at least one group mean that differs from the others

  2. Check conditions and describe any assumptions you must make to proceed with the test.

We are assuming the respondents are independent and randomly sampled from the population.

We are assuming a normal distribution of responses within the groups

Finally we see that variance (SD) among the groups is similar, though not identical.

  3. Below is part of the output associated with this test. Fill in the empty cells.
k <- 5
n <- 1172

dfg <- k - 1
dfe <- n - k

df1 <- dfg
df2 <- dfe

MSG <- 501.54
SSE <- 267382

# MSE = SSE / dfe, and F = MSG / MSE

MSE <- SSE/dfe

f <- MSG/MSE

# MSG = 1/df1 * SSG, therefore MSG * df1 = SSG

SSG <- MSG * df1

df <- data.frame(c(dfg,dfe, dfg+dfe),c(SSG,SSE,SSG+SSE),c(MSG,round(MSE,2),""),c(round(f,2),"",""),c(0.0682,"",""))
colnames(df) <-c("Df", "Sum Sq", "Mean Sq", "F Value", "Pr(>F)")
row.names(df) <- c("degree", "Residuals","Total")
df
##             Df    Sum Sq Mean Sq F Value Pr(>F)
## degree       4   2006.16  501.54    2.19 0.0682
## Residuals 1167 267382.00  229.12               
## Total     1171 269388.16
  4. What is the conclusion of the test?

The p-value (0.0682) is greater than .05, so we fail to reject the null hypothesis, i.e. the data do not provide convincing evidence that the average hours worked differ across the five groups.
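
The table can be cross-checked by going from the F statistic back to the p-value. This sketch uses only the values given in the exercise (mean square for groups, residual sum of squares, k = 5 groups, n = 1172 respondents); the lowercase variable names are chosen here for illustration.

```r
# Values given in the exercise
k <- 5
n <- 1172
msg <- 501.54    # mean square between groups
sse <- 267382    # residual sum of squares

mse <- sse / (n - k)   # mean square error
f_stat <- msg / mse    # F statistic

# Upper-tail area of the F distribution with (k - 1, n - k) df
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

print(round(f_stat, 2))
print(round(p_value, 4))
```

The resulting p-value should agree with the reported 0.0682 up to rounding, which confirms that the filled-in cells are mutually consistent.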