Working backwards, Part II. (5.24, p. 203) A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.
p <- 0.9
sample_mean <- (65+77)/2
MOE <- (77-65)/2
t <- qt((p + (1-p)/2), 25 - 1)
StdErr <- MOE / t
stdev <- StdErr*sqrt(25)
paste("Sample Mean : ", sample_mean)
## [1] "Sample Mean : 71"
paste("Margin of Error: ", MOE)
## [1] "Margin of Error: 6"
paste("Standard Deviation: ", round(stdev, 5))
## [1] "Standard Deviation: 17.53481"
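As a quick check on the arithmetic above, the interval can be rebuilt from the recovered sample mean and standard deviation. This is only a sketch; the variable names (`n`, `conf`, `x_bar`, `rebuilt`, and so on) are chosen here for illustration and are not part of the exercise.

```r
# Inputs given in the exercise
n <- 25
conf <- 0.90
lower <- 65
upper <- 77

# Work backwards: midpoint, half-width, then the sample standard deviation
x_bar <- (lower + upper) / 2
moe <- (upper - lower) / 2
t_star <- qt(1 - (1 - conf) / 2, df = n - 1)
s <- moe / t_star * sqrt(n)

# Work forwards again: the rebuilt interval should match (65, 77)
rebuilt <- x_bar + c(-1, 1) * t_star * s / sqrt(n)
rebuilt
```

If the rebuilt endpoints agree with the original interval, the recovered standard deviation is consistent with the stated confidence level and sample size.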
SAT scores. (7.14, p. 261) SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.
std <- 250
MOE <- 25
z <- 1.645
sample <- ((std/MOE)*z)^2
ceiling(sample)
## [1] 271
Luke's sample should be larger. The margin of error is still capped at 25 points, but a higher confidence level uses a larger critical value (2.576 instead of 1.645), so more observations are needed to keep the interval just as narrow.
std <- 250
MOE <- 25
z <- 2.576
sample <- ((std/MOE)*z)^2
ceiling(sample)
## [1] 664
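Both sample sizes come from the same formula, n = (z* x sigma / MOE)^2, rounded up. A small helper makes the dependence on the confidence level explicit. The function name `required_n` and its arguments are hypothetical, introduced only for this sketch, and it uses `qnorm()` rather than the rounded table values 1.645 and 2.576.

```r
# Smallest n whose margin of error is at most `moe`
# for a known population standard deviation `sigma`
required_n <- function(sigma, moe, conf) {
  z_star <- qnorm(1 - (1 - conf) / 2)  # two-sided critical value
  ceiling((z_star * sigma / moe)^2)    # round up to a whole observation
}

n_90 <- required_n(sigma = 250, moe = 25, conf = 0.90)
n_99 <- required_n(sigma = 250, moe = 25, conf = 0.99)
c(n_90, n_99)
```

With these inputs the helper should reproduce the 271 and 664 computed above, since the exact critical values differ from the rounded ones only in the fourth decimal place.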
High School and Beyond, Part I. (7.20, p. 266) The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.
There is not a clear difference in the average reading and writing scores. The histogram of reading - writing differences is nearly normal and centered close to 0, which suggests that the two sets of scores are quite similar. The box plots show that the scores differ by only a small margin and that the two distributions are relatively similar, with reading’s spread being somewhat greater.
The reading and writing scores of a given student are not independent. Both scores belong to the same student, so they are paired, and a student who does well on one exam is likely to do well on the other.
Null: The average reading and writing scores are equal, so the mean difference (reading - writing) is 0. Alternative: The average reading and writing scores differ, so the mean difference is not 0.
The students come from a simple random sample, so the differences are independent from one student to the next (the two scores of a single student are paired, not independent). The histogram of reading - writing differences follows a near normal distribution and the sample is large (n = 200). The independence and normality conditions are reasonably satisfied.
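null <- 0
stdev <- 8.887
sample_mean <- -.545
sample <- 200
StdErr <- stdev / sqrt(sample)
t <- (sample_mean - null)/StdErr
# Two-sided p-value: the alternative is a difference in either direction
fin <- 2 * pt(-abs(t), df = sample - 1)
round(fin, 4)
## [1] 0.3868
The calculated p-value 0.3868 > 0.05, so the null hypothesis is not rejected. The data do not provide convincing evidence of a difference between the average reading and writing scores.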
Because the null hypothesis was not rejected, the only error we could have made is a Type II error, or a false negative. In context, a Type II error would mean concluding that there is no evidence of a difference between the average reading and writing scores when the averages actually do differ.
Yes. Because the null hypothesis was not rejected at the 0.05 level, I would expect the corresponding 95% confidence interval for the mean difference to include 0, the null value. The interval does not need to be centered near 0; it is centered at the sample mean difference, -0.545.
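The expectation about the confidence interval can be checked directly. The sketch below builds a 95% interval for the mean reading - writing difference from the same summary statistics used in the test; the 95% level is an assumption chosen to match a 0.05 significance level, and the variable names are illustrative.

```r
# Summary statistics for the paired differences (reading - writing)
n <- 200
d_bar <- -0.545
s_d <- 8.887

# 95% t-interval for the mean difference
se <- s_d / sqrt(n)
t_star <- qt(0.975, df = n - 1)
ci <- d_bar + c(-1, 1) * t_star * se
ci

# TRUE when the null value 0 lies inside the interval
contains_zero <- ci[1] < 0 && ci[2] > 0
contains_zero
```

The interval is centered at the sample mean difference rather than at 0, but it still covers 0, which is what a failure to reject at the matching significance level implies.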
Fuel efficiency of manual and automatic cars, Part II. (7.28, p. 276) The table provides summary statistics on highway fuel economy of cars manufactured in 2012. Use these statistics to calculate a 98% confidence interval for the difference between average highway mileage of manual and automatic cars, and interpret this interval in the context of the data.
sample <- 26
auto_sd <- 5.29
auto_mean <- 22.92
man_sd <- 5.01
man_mean <- 27.88
StdErr <- sqrt(((auto_sd^2)/sample) + ((man_sd^2)/sample))
mean <- man_mean - auto_mean
t <- abs(qt((1 - 0.98)/2, df = (sample - 1)))
low <- mean - t*StdErr
high <- mean + t*StdErr
paste("Interval: ", low, high)
## [1] "Interval: 1.40907836730634 8.51092163269366"
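We are 98% confident that, on average, manual cars manufactured in 2012 get between 1.41 and 8.51 more highway miles per gallon than automatic cars. Because the entire interval lies above 0, the data provide evidence that manual cars have better average highway fuel economy than their automatic counterparts.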
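The calculation above uses the conservative degrees of freedom, min(n1 - 1, n2 - 1) = 25. As a point of comparison only, and not the method used in this solution, the sketch below recomputes the interval with the Welch-Satterthwaite approximation for the degrees of freedom. Variable names are illustrative.

```r
# Summary statistics from the exercise
n_auto <- 26; s_auto <- 5.29; x_auto <- 22.92
n_man  <- 26; s_man  <- 5.01; x_man  <- 27.88

# Per-group variance of the sample mean
v_auto <- s_auto^2 / n_auto
v_man  <- s_man^2 / n_man
se <- sqrt(v_auto + v_man)

# Welch-Satterthwaite degrees of freedom (generally not an integer)
df_welch <- (v_auto + v_man)^2 /
  (v_auto^2 / (n_auto - 1) + v_man^2 / (n_man - 1))

# 98% interval for mean(manual) - mean(automatic)
t_star <- qt(0.99, df = df_welch)
ci_welch <- (x_man - x_auto) + c(-1, 1) * t_star * se
df_welch
ci_welch
```

Because the Welch degrees of freedom are larger than 25, the critical value is smaller and the resulting interval is somewhat narrower than the conservative one, while leading to the same conclusion.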
Email outreach efforts. (7.34, p. 284) A medical research group is recruiting people to complete short surveys about their medical history. For example, one survey asks for information on a person’s family history in regards to cancer. Another survey asks about what topics were discussed during the person’s last visit to a hospital. So far, as people sign up, they complete an average of just 4 surveys, and the standard deviation of the number of surveys is about 2.2. The research group wants to try a new interface that they think will encourage new enrollees to complete more surveys, where they will randomize each enrollee to either get the new interface or the current interface. How many new enrollees do they need for each interface to detect an effect size of 0.5 surveys per enrollee, if the desired power level is 80%?
alpha <- 0.05
siglvlz <- 1.96
z <- 0.842
stdev <- 2.2
eff_size <- 0.5
StdErr <- eff_size/(z + siglvlz)
sample <- ((2*stdev^2)/StdErr^2)
ceiling(sample)
## [1] 304
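The hand calculation uses the normal approximation with rounded critical values. Base R's `power.t.test()` from the `stats` package solves the same problem using the t distribution, so it can serve as a cross-check. This is a sketch that assumes a two-sided, two-sample test at the 0.05 significance level, matching the values used above.

```r
# Solve for n per group; leaving n unspecified tells power.t.test to compute it
res <- power.t.test(delta = 0.5, sd = 2.2, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")

# Round up to a whole number of enrollees per interface
n_per_group <- ceiling(res$n)
n_per_group
```

The t-based answer can come out slightly larger than the normal-approximation answer, because the t distribution has heavier tails than the normal.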
Work hours and education. The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.
Null: The average number of hours worked is the same across all five educational attainment groups. Alternative: The average number of hours worked differs for at least one of the groups.
The sample size is large, and the respondents should be independent both within and across education levels. With groups this large, moderate departures from normality within each group are not a concern. The spreads are also similar across the groups.
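# Values given in the partial ANOVA table
p <- 0.0682
meansq_deg <- 501.54
sumsq_residual <- 267382
# Degrees of freedom: k - 1 for groups, n - k for residuals, n - 1 in total
df_deg <- 5 - 1
df_residual <- 1172 - 5
df_tot <- 1172 - 1
# Fill in the remaining entries of the table
sumsq_deg <- meansq_deg*df_deg
sumsq_tot <- sumsq_residual + sumsq_deg
meansq_residual <- sumsq_residual/df_residual
f_value <- meansq_deg/meansq_residual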
col1 <- c("Degree", "Residuals", "Total")
col2 <- c(df_deg, df_residual, df_tot)
col3 <- c(sumsq_deg, sumsq_residual, sumsq_tot)
col4 <- c(meansq_deg, meansq_residual, "")
col5 <- c(f_value, "", "")
col6 <- c(p, "", "")
final_df <- data.frame(col1,col2,col3,col4,col5,col6)
colnames(final_df) <- c("", "Df", "Sum Sq", "Mean Sq", "F-value", "Pr(>F)")
final_df
The conclusion of this test follows from the p-value, Pr(>F) = 0.0682, which is greater than 0.05 (assuming a significance level of 0.05). For that reason, the null hypothesis cannot be rejected, and the data do not provide convincing evidence that the average number of hours worked varies by education level.
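The p-value in the table was taken as given. As a consistency check, it can be recomputed from the F statistic with `pf()`. The sketch below rebuilds the F statistic from the same table entries; the variable names are illustrative.

```r
# Entries given in the partial ANOVA table
ms_degree <- 501.54
ss_resid <- 267382
k <- 5      # number of education groups
n <- 1172   # number of respondents

# F statistic and its upper-tail probability under the null hypothesis
f_stat <- ms_degree / (ss_resid / (n - k))
p_val <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
c(F = f_stat, p = p_val)
```

The recomputed value should land close to the reported 0.0682, with any small difference attributable to rounding in the table entries.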