A 90% confidence interval for a population mean is (65,77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations.
Calculate the sample mean, the margin of error, and the sample standard deviation.
#First, the sample mean is the center of the confidence interval, so we can get it by averaging the two bounds.
sample_mean <- mean(c(77,65))
#Next, let's look at the spread. Half the width of the interval tells us how far the bounds sit from the mean, i.e. the margin of error.
spread <- 77 - 65
margin <- spread/2
#In a 90% confidence interval, each tail holds 5%. Because the population standard deviation is unknown and n = 25, the critical value comes from the t-distribution with n - 1 = 24 degrees of freedom, evaluated at the 95th percentile.
t_crit <- qt(.95, df = 25 - 1)
#The margin of error equals the critical value times the standard error, so dividing it by the critical value gives the standard error.
SE <- margin / t_crit
#Finally, since SE = sd/sqrt(n), multiplying the standard error by sqrt(n) gives the sample SD.
sd <- SE * sqrt(25)
df <- data.frame(Value = round(c(sample_mean, margin, sd), 2))
row.names(df) <- c("Sample Mean", "Margin of Error","Sample SD")
df
## Value
## Sample Mean 71.00
## Margin of Error 6.00
## Sample SD 17.53
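As a sanity check on the choice of critical value, the sketch below recovers the sample SD from the same interval using both the t and the normal critical value. The helper name `sd_from_ci` and its `use_t` argument are hypothetical, introduced only for this illustration. With the population SD unknown and n = 25, the t-based result is the appropriate one.

```r
# Hypothetical helper: recover the sample SD from a confidence interval for a mean.
sd_from_ci <- function(lower, upper, n, conf, use_t = TRUE) {
  margin <- (upper - lower) / 2          # half-width of the interval
  p <- 1 - (1 - conf) / 2                # upper-tail percentile, e.g. 0.95 for a 90% CI
  crit <- if (use_t) qt(p, df = n - 1) else qnorm(p)
  (margin / crit) * sqrt(n)              # SE = margin / crit, sd = SE * sqrt(n)
}
sd_t <- sd_from_ci(65, 77, n = 25, conf = 0.90)
sd_z <- sd_from_ci(65, 77, n = 25, conf = 0.90, use_t = FALSE)
round(c(t_based = sd_t, z_based = sd_z), 2)
```

The t-based SD (about 17.53) is smaller than the z-based one (about 18.24) because the t critical value is larger than the normal one for the same tail area.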
SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project.
They want their margin of error to be no more than 25 points.
(a) Raina wants to use a 90% confidence interval. How large a sample should she collect?
# Solve ME = z * sd/sqrt(n) for n, then round up to the next whole observation
sample_calc <- function(ci, sd, me) {
  z <- qnorm(1 - (1 - ci) / 2)
  n <- (z * sd / me)^2
  return(ceiling(n))
}
sample_calc(ci = .90, sd = 250, me = 25)
## [1] 271
(b) Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina's.
Luke's sample size would need to be larger. A higher confidence level means a larger z-score, and since ME = z * sd/sqrt(n), ‘n’ must increase as well to keep the margin of error capped at 25.
(c) Calculate the minimum required sample size for Luke.
sample_calc(ci = .99, sd = 250, me = 25)
## [1] 664
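One way to double-check both sample sizes is to plug them back into the margin-of-error formula. The sketch below assumes the same known population SD of 250; the helper name `margin_of_error` is hypothetical and not part of the original solution.

```r
# Margin of error for a known population SD: ME = z * sd / sqrt(n)
margin_of_error <- function(ci, sd, n) {
  z <- qnorm(1 - (1 - ci) / 2)
  z * sd / sqrt(n)
}
# Evaluate at the computed sample size and at one observation fewer
me_90 <- margin_of_error(ci = .90, sd = 250, n = c(270, 271))
me_99 <- margin_of_error(ci = .99, sd = 250, n = c(663, 664))
round(rbind(me_90, me_99), 3)
```

In each row the first entry (one observation fewer) is just above 25 and the second is just below it, which is what we expect from a minimum sample size.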
The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects.
Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.
Not quite. There is certainly a difference in the medians, but the IQRs are very similar, so it's unclear how different the two distributions actually are.
The students were surveyed, and this sample is a simple random sample from that survey, so we can assume each student is independent of the others.
The reading and writing scores of each student are not independent of each other; they are paired data.
H0: mu_read - mu_write = 0
Ha: mu_read - mu_write != 0
The sample is a simple random sample. Each observation/case (student) is independent of the next. The distribution of the differences is nearly normal.
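n <- 200
diff <- -0.545
sd <- 8.887
SE <- sd/sqrt(n)
t_stat <- diff/SE
# Ha is two-sided, so use both tails of the t-distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)
round(p_value, 2)
## [1] 0.39
The observed difference is less than one standard error away from the null value of 0, giving a two-sided p-value of about 0.39 (a 39% chance we'd see a mean difference at least this extreme if the null hypothesis were true). Hence, we fail to reject the null hypothesis.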
A Type 2 error. A Type 2 error is a failure to reject the null hypothesis when the alternative hypothesis is actually true. In this application it would mean failing to detect a real difference between the average reading and writing scores. Note that the p-value by itself does not tell us how likely such an error is.
Yes. We failed to reject the null hypothesis, and the observed mean difference is less than one standard error away from 0, so a confidence interval for the average difference would include 0.
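To make the confidence interval argument concrete, here is a minimal sketch that builds the interval for the mean difference from the summary statistics used in the test. The 95% level is an assumption chosen for illustration; the exercise itself does not fix a level here.

```r
# Summary statistics for the paired differences (read - write), from the text
n <- 200
mean_diff <- -0.545
sd_diff <- 8.887
se_diff <- sd_diff / sqrt(n)
# Assumed 95% level: mean_diff +/- t* x SE, with t* from the t-distribution on n - 1 df
t_crit <- qt(0.975, df = n - 1)
ci <- mean_diff + c(-1, 1) * t_crit * se_diff
round(ci, 2)
```

The interval runs from roughly -1.78 to 0.69, so it contains 0, consistent with failing to reject the null hypothesis.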
Each year the US Environmental Protection Agency (EPA) releases fuel economy data on cars manufactured in that year.
Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference between the average fuel efficiency of cars with manual and automatic transmissions in terms of their average city mileage? Assume that conditions for inference are satisfied.
mean_auto <- 16.12
mean_man <- 19.85
sd_auto <- 3.58
sd_man <- 4.15
n <- 26
# Both samples have n = 26, so the conservative degrees of freedom are n - 1
df <- n-1
mean_diff <- mean_auto - mean_man
SE_diff <- sqrt((sd_auto^2/n) + (sd_man^2/n))
t_stat <- mean_diff/SE_diff
# The question asks about a difference in either direction, so the test is two-sided
p <- 2 * pt(-abs(t_stat), df)
round(p, 4)
## [1] 0.0019
We have a minuscule p-value (about 0.002), so we reject the null hypothesis that the average city MPG is the same for automatic and manual transmissions. The data provide strong evidence of a difference in average city mileage between the two transmission types.
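The degrees of freedom n - 1 = 25 used above are a conservative choice. As a hedged alternative, the sketch below uses the Welch-Satterthwaite approximation for the degrees of freedom, which is what R's `t.test` applies by default when raw data are available. The variable names are illustrative, and the summary statistics are the ones given in the exercise.

```r
# Summary statistics from the text (n = 26 cars per group)
mean_auto <- 16.12; sd_auto <- 3.58; n_auto <- 26
mean_man  <- 19.85; sd_man  <- 4.15; n_man  <- 26
# Per-group variance of the sample mean
v_auto <- sd_auto^2 / n_auto
v_man  <- sd_man^2 / n_man
t_stat <- (mean_auto - mean_man) / sqrt(v_auto + v_man)
# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_auto + v_man)^2 /
  (v_auto^2 / (n_auto - 1) + v_man^2 / (n_man - 1))
# Two-sided p-value under the approximate degrees of freedom
p_welch <- 2 * pt(-abs(t_stat), df_welch)
c(t = t_stat, df = df_welch, p = p_welch)
```

The approximate degrees of freedom come out near 49 rather than 25, and the conclusion (reject the null hypothesis) is the same either way.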
The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents.
Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.
H0: There is no difference in the means of the groups
Ha: There is at least one group mean that differs from the others
We are assuming the respondents are independent and randomly sampled from the population.
We are assuming a nearly normal distribution of responses within each group.
Finally, we see that the variability (SD) is similar across the groups, though not identical.
k <- 5
n <- 1172
# Degrees of freedom: groups (k - 1) and residuals (n - k)
dfg <- k - 1
dfe <- n - k
MSG <- 501.54
SSE <- 267382
# MSG = SSG/dfg, therefore SSG = MSG * dfg
SSG <- MSG * dfg
# MSE = SSE/dfe, and the F statistic is the ratio MSG/MSE
MSE <- SSE / dfe
f <- MSG / MSE
df <- data.frame(c(dfg,dfe, dfg+dfe),c(SSG,SSE,SSG+SSE),c(MSG,round(MSE,2),""),c(round(f,2),"",""),c(0.0682,"",""))
colnames(df) <-c("Df", "Sum Sq", "Mean Sq", "F Value", "Pr(>F)")
row.names(df) <- c("degree", "Residuals","Total")
df
## Df Sum Sq Mean Sq F Value Pr(>F)
## degree 4 2006.16 501.54 2.19 0.0682
## Residuals 1167 267382.00 229.12
## Total 1171 269388.16
The p-value (0.0682) is greater than .05, so we fail to reject the null hypothesis, i.e. the data do not provide convincing evidence of a difference among the group means.
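As a cross-check on the table, the p-value can be recomputed from the F statistic with `pf`. This is a sketch using only the quantities given in the exercise (5 groups, 1,172 respondents, MSG = 501.54, residual sum of squares = 267382); the variable names are illustrative.

```r
# Values from the ANOVA table: 5 groups, 1,172 respondents
dfg <- 5 - 1
dfe <- 1172 - 5
MSG <- 501.54
MSE <- 267382 / dfe
f_stat <- MSG / MSE
# The upper-tail area of the F distribution gives the p-value
p_value <- pf(f_stat, dfg, dfe, lower.tail = FALSE)
round(c(F = f_stat, p = p_value), 4)
```

The recomputed p-value agrees with the reported 0.0682 to rounding, which confirms that the table entries are internally consistent.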