Chapter 7 - Inference for Numerical Data

Working backwards, Part II. (5.24, p. 203) A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.

p <- 0.9

sample_mean <- (65+77)/2
MOE <- (77-65)/2
t <- qt((p + (1-p)/2), 25 - 1)
StdErr <- MOE / t
stdev <- StdErr*sqrt(25)

paste("Sample Mean : ", sample_mean)

## [1] "Sample Mean :  71"

paste("Margin of Error: ", MOE)

## [1] "Margin of Error:  6"

paste("Standard Deviation: ", round(stdev, 5))

## [1] "Standard Deviation:  17.53481"
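As a sanity check, the recovered values can be plugged back into the usual t-interval formula to see whether they reproduce the stated interval. This is a minimal sketch; the names `n`, `conf`, `t_star`, and `ci` are introduced here for illustration only.

```r
# Values recovered in the solution above
n <- 25
conf <- 0.90
sample_mean <- 71
stdev <- 17.53481

# Critical t value for a two-sided 90% interval with n - 1 degrees of freedom
t_star <- qt(1 - (1 - conf) / 2, df = n - 1)

# Rebuild the interval: mean +/- t* times s / sqrt(n)
ci <- sample_mean + c(-1, 1) * t_star * stdev / sqrt(n)
round(ci, 2)
```

The rounded endpoints should match the original interval (65, 77).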

SAT scores. (7.14, p. 261) SAT scores of students at an Ivy League college are distributed with a standard deviation of 250 points. Two statistics students, Raina and Luke, want to estimate the average SAT score of students at this college as part of a class project. They want their margin of error to be no more than 25 points.

Raina wants to use a 90% confidence interval. How large a sample should she collect?

std <- 250
MOE <- 25
z <- 1.645

sample <- ((std/MOE)*z)^2

ceiling(sample)

## [1] 271

Luke wants to use a 99% confidence interval. Without calculating the actual sample size, determine whether his sample should be larger or smaller than Raina’s, and explain your reasoning.

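His sample should be larger. A 99% confidence level uses a larger critical value (z* = 2.576 instead of 1.645), which widens the interval for any given sample size. To keep the margin of error at 25 points with the larger z*, the standard error must shrink, and that requires more observations.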

Calculate the minimum required sample size for Luke.

std <- 250
MOE <- 25
z <- 2.576

sample <- ((std/MOE)*z)^2

ceiling(sample)

## [1] 664
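Both sample sizes come from the same rule, n >= (z* x sigma / ME)^2, so the two calculations can be folded into one function that looks up z* with `qnorm()` instead of a table value. The helper `required_n` below is a hypothetical name used only for this sketch and is not part of the original solution.

```r
# Hypothetical helper: smallest n that keeps the margin of error
# at or below `moe` for a known population standard deviation `sigma`
required_n <- function(sigma, moe, conf) {
  z_star <- qnorm(1 - (1 - conf) / 2)  # two-sided critical value
  ceiling((z_star * sigma / moe)^2)
}

raina_n <- required_n(sigma = 250, moe = 25, conf = 0.90)
luke_n  <- required_n(sigma = 250, moe = 25, conf = 0.99)

c(raina = raina_n, luke = luke_n)
```

With these inputs the function returns the same sample sizes as above, 271 and 664, and makes it visible that only the critical value changes between the two students.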

High School and Beyond, Part I. (7.20, p. 266) The National Center of Education Statistics conducted a survey of high school seniors, collecting test data on reading, writing, and several other subjects. Here we examine a simple random sample of 200 students from this survey. Side-by-side box plots of reading and writing scores as well as a histogram of the differences in scores are shown below.

Is there a clear difference in the average reading and writing scores?

There is not a clear difference in the average reading and writing scores. The histogram of the differences (reading - writing) is roughly symmetric and centered near 0, which suggests that a typical student scores about the same on both exams. The box plots show medians that differ by only a small margin and broadly similar distributions, with reading scores slightly more spread out.

Are the reading and writing scores of each student independent of each other?

No. Each reading score is paired with a writing score from the same student, so the two scores are not independent of each other. A student who does well on one exam is likely to do well on the other.

Create hypotheses appropriate for the following research question: is there an evident difference in the average scores of students in the reading and writing exam?

Null: The average difference between reading and writing scores is zero (mu_diff = 0). Alternative: The average difference between reading and writing scores is not zero (mu_diff != 0).

Check the conditions required to complete this test.

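Independence: the students are a simple random sample, and 200 students is less than 10% of all high school seniors, so the differences for different students are independent of one another (even though the two scores within a student are paired). Normality: the histogram of reading - writing differences is nearly normal with no extreme skew, and the sample is large. Both conditions appear to be satisfied.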

The average observed difference in scores is \(\bar{x}_{read-write} = -0.545\), and the standard deviation of the differences is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams?

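null <- 0
stdev <- 8.887
sample_mean <- -.545
sample <- 200
StdErr <- stdev / sqrt(sample)
t <- (sample_mean - null)/StdErr

# Two-sided p-value, matching the two-sided alternative hypothesis
fin <- 2 * pt(-abs(t), df = sample - 1)

round(fin, 4)

## [1] 0.3868

The two-sided p-value 0.3868 > 0.05, so the null hypothesis is not rejected. These data do not provide convincing evidence of a difference between the average reading and writing scores.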

What type of error might we have made? Explain what the error means in the context of the application.

Because we failed to reject the null hypothesis, the only error we could have made is a Type II error, or a false negative. In context, a Type II error would mean that we concluded there is no convincing evidence of a difference when the average reading and writing scores of students actually do differ.

Based on the results of this hypothesis test, would you expect a confidence interval for the average difference between the reading and writing scores to include 0? Explain your reasoning.

Yes. When a two-sided test fails to reject the null hypothesis at significance level 0.05, the corresponding 95% confidence interval contains the null value, which here is 0. The interval does not need to be centered at 0; it is centered at the observed difference of -0.545 and is simply wide enough to include 0 as a plausible value.
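The link between the test and the interval can be checked directly by computing a 95% confidence interval for the mean difference from the summary statistics given in the exercise. This is a sketch; the variable names are chosen for illustration, and the 95% level is an assumption that matches a 0.05 significance level.

```r
# Summary statistics from the exercise
n <- 200
mean_diff <- -0.545
sd_diff <- 8.887

# Standard error and critical value for a 95% interval
se <- sd_diff / sqrt(n)
t_star <- qt(0.975, df = n - 1)

# Interval: observed difference +/- t* times SE
ci <- mean_diff + c(-1, 1) * t_star * se
ci

# TRUE when 0 lies inside the interval
includes_zero <- ci[1] < 0 && ci[2] > 0
includes_zero
```

The interval is centered at -0.545 rather than at 0, but it extends past 0 on the upper side, which agrees with the decision not to reject the null hypothesis.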

Fuel efficiency of manual and automatic cars, Part II. (7.28, p. 276) The table provides summary statistics on highway fuel economy of cars manufactured in 2012. Use these statistics to calculate a 98% confidence interval for the difference between average highway mileage of manual and automatic cars, and interpret this interval in the context of the data.

sample <- 26
auto_sd <- 5.29
auto_mean <- 22.92

man_sd <- 5.01
man_mean <- 27.88

StdErr <- sqrt(((auto_sd^2)/sample) + ((man_sd^2)/sample))
mean <- man_mean - auto_mean

t <- abs(qt((1 - 0.98)/2, df = (sample - 1)))

low <- mean - t*StdErr
high <- mean + t*StdErr

paste("Interval: ", low, high)

## [1] "Interval:  1.40907836730634 8.51092163269366"

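We are 98% confident that, among cars manufactured in 2012, the average highway mileage of manual cars is between 1.41 and 8.51 miles per gallon higher than that of automatic cars. Because the entire interval is above 0, the data suggest that manual cars have better average highway fuel economy.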
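The interval above uses df = 26 - 1 = 25, the conservative choice of the smaller group's n - 1. For comparison, the sketch below uses the Welch-Satterthwaite approximation for the degrees of freedom, which is the approach taken by software such as R's `t.test()` when raw data are available. The variable names are illustrative and not from the original solution.

```r
# Summary statistics from the exercise
n_auto <- 26; auto_sd <- 5.29; auto_mean <- 22.92
n_man  <- 26; man_sd  <- 5.01; man_mean  <- 27.88

# Per-group variance of the sample mean
v_auto <- auto_sd^2 / n_auto
v_man  <- man_sd^2 / n_man

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_auto + v_man)^2 /
  (v_auto^2 / (n_auto - 1) + v_man^2 / (n_man - 1))

# 98% interval for the difference (manual - automatic)
diff_means <- man_mean - auto_mean
t_welch <- qt(1 - (1 - 0.98) / 2, df = df_welch)
ci_welch <- diff_means + c(-1, 1) * t_welch * sqrt(v_auto + v_man)

df_welch
ci_welch
```

The Welch degrees of freedom are larger than 25, so this interval is somewhat narrower than the conservative one, but it leads to the same conclusion because it also lies entirely above 0.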

Email outreach efforts. (7.34, p. 284) A medical research group is recruiting people to complete short surveys about their medical history. For example, one survey asks for information on a person’s family history in regards to cancer. Another survey asks about what topics were discussed during the person’s last visit to a hospital. So far, as people sign up, they complete an average of just 4 surveys, and the standard deviation of the number of surveys is about 2.2. The research group wants to try a new interface that they think will encourage new enrollees to complete more surveys, where they will randomize each enrollee to either get the new interface or the current interface. How many new enrollees do they need for each interface to detect an effect size of 0.5 surveys per enrollee, if the desired power level is 80%?

alpha <- 0.05
siglvlz <- 1.96
z <- 0.842
stdev <- 2.2
eff_size <- 0.5
StdErr <- eff_size/(z + siglvlz)
sample <- ((2*stdev^2)/StdErr^2)
ceiling(sample)

## [1] 304
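One way to check this answer is to plug n = 304 per group back into the normal-approximation power formula and confirm that the power is close to 80%. The sketch below also cross-checks against `power.t.test()` from base R's stats package, which uses the t distribution and can therefore ask for a slightly larger sample. The variable names are illustrative.

```r
sd_surveys <- 2.2
effect <- 0.5
alpha <- 0.05
n_per_group <- 304

# Standard error of the difference in means with n per group
se <- sqrt(2 * sd_surveys^2 / n_per_group)

# Approximate power of a two-sided z-test (ignoring the negligible far tail)
z_crit <- qnorm(1 - alpha / 2)
power_approx <- pnorm(effect / se - z_crit)
power_approx

# Cross-check: solve for n with the t-based calculation
pt_res <- power.t.test(delta = effect, sd = sd_surveys,
                       sig.level = alpha, power = 0.80)
ceiling(pt_res$n)
```

The two approaches should agree to within a few enrollees per group.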

Work hours and education. The General Social Survey collects data on demographics, education, and work, among many other characteristics of US residents. Using ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below are the distributions of hours worked by educational attainment and relevant summary statistics that will be helpful in carrying out this analysis.

Write hypotheses for evaluating whether the average number of hours worked varies across the five groups.

Null: The average number of hours worked is the same across all five educational attainment groups. Alternative: The average number of hours worked differs for at least one of the groups.

Check conditions and describe any assumptions you must make to proceed with the test.

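Independence: we assume the respondents were sampled randomly, so observations are independent within and across education levels. Normality: the distributions within each group should be nearly normal, and with a large overall sample, moderate departures from normality are less of a concern. Constant variance: the spreads appear similar across the groups.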

Below is part of the output associated with this test. Fill in the empty cells.

p <- 0.0682
meansq_deg <- 501.54
sumsq_residual <- 267382

df_deg <- 5 - 1            # groups - 1
df_residual <- 1172 - 5    # respondents - groups
df_tot <- 1172 - 1         # respondents - 1


sumsq_deg <- meansq_deg*df_deg
sumsq_tot <- sumsq_residual + sumsq_deg
meansq_residual <- sumsq_residual/df_residual
f_value <- meansq_deg/meansq_residual


col1 <- c("Degree", "Residuals", "Total")
col2 <- c(df_deg, df_residual, df_tot)
col3 <- c(sumsq_deg, sumsq_residual, sumsq_tot)
col4 <- c(meansq_deg, meansq_residual, "")
col5 <- c(f_value, "", "")
col6 <- c(p, "", "")

final_df <- data.frame(col1,col2,col3,col4,col5,col6)
colnames(final_df) <- c("", "Df", "Sum Sq", "Mean Sq", "F-value", "Pr(>F)")

final_df

What is the conclusion of the test?

The p-value, Pr(>F) = 0.0682, is greater than 0.05 (assuming a significance level of 0.05), so we fail to reject the null hypothesis. The data do not provide convincing evidence that the average number of hours worked varies by educational attainment.
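The reported p-value can be cross-checked from the reconstructed table, since Pr(>F) is the upper tail of an F distribution with the degree and residual degrees of freedom. This is a brief sketch using the given mean square and residual sum of squares; the variable names are illustrative.

```r
# Given values from the partial ANOVA output
ms_degree <- 501.54
ss_residual <- 267382

# Degrees of freedom: 5 groups, 1,172 respondents
df_degree <- 5 - 1
df_residual <- 1172 - 5

# F statistic and its upper-tail probability
f_value <- ms_degree / (ss_residual / df_residual)
p_value <- pf(f_value, df_degree, df_residual, lower.tail = FALSE)

c(F = f_value, p = p_value)
```

The computed tail probability should agree with the given Pr(>F) of 0.0682 up to rounding.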

Shane Hylton