DACSS-603
The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population. Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?
To calculate the confidence intervals for the bypass and angiography procedures, we use the formula x +/- z*(s/sqrt(n)), where x is the sample mean, z is the z-value for the chosen confidence level, s is the sample standard deviation, and n is the sample size.
For the first procedure (bypass surgery), x, z, s, and n hold the following values:
x = 19,
z = 1.645 (for 90% confidence interval),
s = 10,
n = 539.
19+1.645*(10/(sqrt(539)))
[1] 19.70855
19-1.645*(10/(sqrt(539)))
[1] 18.29145
Confidence Interval = 18.29145, 19.70855
For the second procedure (angiography), x, z, s, and n hold the following values:
x = 18,
z = 1.645,
s = 9,
n = 847
18+1.645*(9/(sqrt(847)))
[1] 18.50871
18-1.645*(9/(sqrt(847)))
[1] 17.49129
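Confidence Interval = 17.49129, 18.50871
The interval for the second procedure (angiography, n = 847) is narrower: its width is about 1.02 days (18.50871 - 17.49129), versus about 1.42 days (19.70855 - 18.29145) for the first procedure (n = 539). The second sample is both larger and less variable (s = 9 vs. s = 10), so its standard error is smaller.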
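The two calculations above share the same structure, so they can be wrapped in one function. The following is a minimal sketch, not part of the original solution: `z_interval` is a hypothetical helper name, and it uses `qnorm` for the z-value instead of the rounded 1.645, so the bounds may differ from those above in the fourth or fifth decimal place.

```r
# Hypothetical helper (not from the assignment): z-based confidence
# interval computed from summary statistics.
z_interval <- function(xbar, s, n, conf_level = 0.90) {
  z  <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  me <- z * s / sqrt(n)                  # margin of error
  c(lower = xbar - me, upper = xbar + me, width = 2 * me)
}

ci_first  <- z_interval(xbar = 19, s = 10, n = 539)  # first procedure
ci_second <- z_interval(xbar = 18, s = 9,  n = 847)  # second procedure

print(round(ci_first, 3))
print(round(ci_second, 3))
```

Comparing the `width` entries answers the "which interval is narrower" part of the question directly.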
A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.
The point estimate of p is the sample proportion, 567/1031 ≈ 0.54995. To construct the 95% confidence interval around it, we use the following formula:
p +/- z*(sqrt(p(1 - p)/n))
in which p represents the sample proportion, z represents the chosen z-value, and n represents sample size. The values these variables take are listed in the code chunk below.
p <-0.54995
z <-1.96
n <-1031
p+1.96*(sqrt(p*(1-p)/n))
[1] 0.5803182
p-1.96*(sqrt(p*(1-p)/n))
[1] 0.5195818
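Confidence Interval = 0.5195818, 0.5803182
Interpretation: we can be 95% confident that the proportion of all adult Americans who believe a college education is essential for success is between about 0.52 and 0.58.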
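As a sketch of the same calculation starting from the raw counts rather than a rounded proportion, the code below computes the point estimate and the interval in one pass. The variable names are illustrative assumptions. Note that R's built-in `prop.test` uses a different (score-based) interval, so its output would not match these bounds exactly.

```r
# Counts taken from the problem statement
x <- 567    # respondents who said college is essential
n <- 1031   # total respondents

p_hat <- x / n                           # point estimate
z     <- qnorm(0.975)                    # critical value for 95% confidence
me    <- z * sqrt(p_hat * (1 - p_hat) / n)

ci <- c(lower = p_hat - me, upper = p_hat + me)
print(round(p_hat, 5))
print(round(ci, 4))
```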
Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per quarter for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range. Assuming the significance level to be 5%, what should be the size of the sample?
We start from the margin of error of the confidence interval, which must be at most $5:
z*(s/sqrt(n)) = 5, in which
z = chosen z-value (1.96 for a 5% significance level), s = standard deviation, and n = sample size. The standard deviation is taken as a quarter of the stated range: (200 - 30)/4 = 42.5.
Because we are looking for n (sample size), we rearrange the equation:
n = (z*s/5)^2
f <-function(n, z = 1.96, s = 42.5) {
res <- z*s/sqrt(n)
return(res)
}
vec <- vapply(1:300, FUN = f, FUN.VALUE = 5.0)
which(vec < 5)[1]  # smallest n whose margin of error is below $5
[1] 278
The code evaluates the margin of error for every sample size from 1 to 300 and returns the smallest n for which it falls below $5.
The result indicates that the sample must contain at least 278 students for the estimate to be within $5 of the true population mean.
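The search over candidate sample sizes can be cross-checked with the rearranged formula n = (z*s/E)^2, rounded up to the next whole number. This is a minimal sketch; the names `E` and `n_required` are illustrative, and `qnorm(0.975)` is used in place of the rounded 1.96.

```r
z <- qnorm(0.975)      # critical value for a 5% significance level
s <- (200 - 30) / 4    # assumed standard deviation: a quarter of the range
E <- 5                 # desired margin of error in dollars

# Round up, since the sample size must be a whole number
n_required <- ceiling((z * s / E)^2)
print(n_required)
```

Both approaches should agree on the minimum sample size.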
(Exercise 6.7, Chapter 6 of SMSS, Agresti 2018) According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.
a. Test whether the mean income of female employees differs from $500/week. Include assumptions, hypotheses, test statistic, and p-value. Interpret the result.
b. Report the p-value for \(H_{a}\) : μ < 500. Interpret.
c. Report and interpret the P-value for \(H_{a}\) : μ > 500.
(Hint: The P-values for the two possible one-sided tests must sum to 1.)
In order to test whether or not the mean income for female employees differs from $500/week, we must first conduct a one-sample, two-sided t-test.
The assumptions, given values, and hypotheses are as follows:
1. The sample is random and the population has a normal distribution
2. The mean income for all senior-level workers = $500/week
3. From the random sample of 9 female employees, the mean income = $410/week
4. Standard deviation = 90
5. Null Hypothesis: \(H_{0}\): μ = 500
6. Alternative Hypothesis: \(H_{a}\): μ ≠ 500
(410 - 500)/(90/sqrt(9))
[1] -3
The test statistic/t-test value is -3.
Now onto the P-value
random_sample_n <- 9
df_n <- (random_sample_n - 1)
t_test <- (410 - 500)/(90/sqrt(9))
p_value <- pt(t_test, df_n)*2
print(p_value)
[1] 0.01707168
Interpretation: We know that the p-value is 0.01707. Assuming α = 0.05, we can see that 0.01707 < 0.05, meaning we can reject the null hypothesis. As a result, we have sufficient evidence to assert that the mean income for female employees differs from the general mean of $500/week.
The next part of the question asks us to report the p-value for Ha: μ < 500, and then interpret.
Hypotheses
\(H_{0}\) : μ = $500/week
\(H_{a}\) : μ < $500/week
P-value = P(t < t_test) = P(t < -3)
P-value for \(H_{a}\): μ < 500 (left-tail test), using pt(q, df, lower.tail=TRUE, log.p=FALSE):
q <- -3
random_sample_n <- 9
df_n <- (random_sample_n-1)
left_p_value <- pt(q,df_n,lower.tail=TRUE,log.p=FALSE)
print(left_p_value)
[1] 0.008535841
P = 0.0085, which can be rounded to 0.01. This is strong evidence against \(H_{0}\) and in favor of \(H_{a}\): the mean weekly income of female employees is less than $500.
Next we must calculate the P-value for \(H_{a}\) : μ > 500 (right-tail test)
q <- -3
random_sample_n <- 9
df_n <- (random_sample_n-1)
right_p_value <- pt(q,df_n,lower.tail = FALSE,log.p=FALSE)
print(right_p_value)
[1] 0.9914642
The P-value for \(H_{a}\) : μ > 500 is 0.99. Such a large P-value means the data provide no evidence that μ > 500; the sample mean is in fact well below 500. To confirm these findings, we check that the sum of left_p_value and right_p_value = 1.
left_p_value <- pt(-3, 8, lower.tail = TRUE)
right_p_value <- pt(-3, 8, lower.tail = FALSE)
total_sum_lr <- left_p_value + right_p_value
print(total_sum_lr)
[1] 1
As is shown above, the sum of the left and right tails is 1.
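Parts a through c all reuse the same test statistic and degrees of freedom, so the three P-values can be produced together. The sketch below assumes a hypothetical helper, `t_from_summary`, that works from summary statistics only; it is not a base R function.

```r
# Hypothetical helper (not a base R function): one-sample t-test
# computed from the sample mean, standard deviation, and size.
t_from_summary <- function(ybar, mu0, s, n) {
  t_stat    <- (ybar - mu0) / (s / sqrt(n))
  df        <- n - 1
  p_less    <- pt(t_stat, df)                      # Ha: mu < mu0
  p_greater <- pt(t_stat, df, lower.tail = FALSE)  # Ha: mu > mu0
  list(t = t_stat, df = df,
       p_two_sided = 2 * min(p_less, p_greater),   # Ha: mu != mu0
       p_less = p_less, p_greater = p_greater)
}

res <- t_from_summary(ybar = 410, mu0 = 500, s = 90, n = 9)
print(res)
```

Because `p_less` and `p_greater` are the two tails of the same distribution at the same point, they sum to 1 by construction, which is the hint given in the exercise.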
(Exercise 6.23, Chapter 6 of SMSS, Agresti 2018) Jones and Smith separately conduct studies to test H0: μ = 500 against Ha : μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.
a. Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.
Jones
We will start with the t statistic, using the formula: t = (\(\overline{y}\) - μ)/se
\(\overline{y}\) = 519.5, μ = 500, se = 10.0
t_test <- (519.5-500)/10.0
print((519.5-500)/10.0)
[1] 1.95
We have thus shown that for Jones, t = 1.95. Now, onto the p-value.
n <- 1000
df_j <- (n - 1)
t_test <- (519.5-500)/10.0
p_value <- pt(t_test, df_j,lower.tail = FALSE,log.p = FALSE)*2
print(p_value)
[1] 0.05145555
The presented p-value of 0.051 is in fact correct, as the math shows. The next step is doing the same calculations for Smith, in which t = 1.97 and P-value = 0.049
Smith
T-test:
\(\overline{y}\) = 519.7, μ = 500, se = 10.0
t_test <- (519.7 - 500)/10.0
print(t_test)
[1] 1.97
Once again the presented value for t has been confirmed as correct (t=1.97). Now that we have this information, we can calculate and hopefully confirm the p-value as well.
P-value:
n <- 1000
df_s <- (n - 1)
t_test <- (519.7-500)/10.0
p_value <- pt(t_test, df_s,lower.tail = FALSE,log.p = FALSE)*2
print(p_value)
[1] 0.04911426
Smith’s p-value is correct at 0.049.
b. Using α = 0.05, for each study indicate whether the result is “statistically significant.”
jones_p_value = 0.051
smith_p_value = 0.049
\(\alpha\) = 0.05
In order for a result to be statistically significant at \(\alpha\) = 0.05, its p-value must be less than or equal to 0.05. Smith’s p-value is 0.049, which is just below 0.05, so Smith’s result is statistically significant. Jones’s p-value is 0.051, which is just above 0.05, so Jones’s result is not statistically significant.
c. Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0 ,” without reporting the actual P-value.
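Both studies yielded nearly identical results (sample means of 519.5 and 519.7 with the same standard error), yet only Smith’s result is statistically significant at α = 0.05. Reporting only “P ≤ 0.05” versus “P > 0.05,” or “reject H0” versus “do not reject H0,” would make the two studies appear to disagree. Reporting the actual P-values (0.051 and 0.049) shows that both provide essentially the same moderate evidence against \(H_{0}\).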
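To make the comparison concrete, the two studies can be evaluated side by side. This is a minimal sketch with illustrative variable names; it uses the values given in the exercise (μ = 500 under the null, se = 10.0, n = 1000).

```r
ybar <- c(Jones = 519.5, Smith = 519.7)  # sample means from the exercise
se   <- 10.0
df   <- 1000 - 1

t_stat <- (ybar - 500) / se
p_val  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)  # two-sided

# A binary decision hides how close the two P-values are
significant <- p_val <= 0.05
print(data.frame(t = t_stat, p = round(p_val, 3), significant = significant))
```

The `significant` column differs between the two rows even though the t statistics differ by only 0.02, which is the misleading aspect the exercise asks about.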
Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.
Assumptions:
Sample size: gas_taxes_sample <- 18
Degrees of freedom: df_gt <- gas_taxes_sample - 1
Confidence level: 95%
Significance level: \(\alpha\) = 0.05 (based on the confidence level)
To start, we must calculate the t-score to find the upper and lower bounds of the confidence interval for the mean of gas_taxes
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
gas_taxes_sample <- 18
df_gt <- gas_taxes_sample - 1
mean_gt <- mean(gas_taxes)
tscore_gt <- qt(p=0.05,df=df_gt,lower.tail=FALSE)
gas_sd <- sd(gas_taxes)
me_gas_taxes <- tscore_gt*gas_sd/sqrt(gas_taxes_sample)  # margin of error (positive)
lower_int_gt <- (mean_gt - me_gas_taxes)
print(lower_int_gt)
[1] 37.0461
The lower bound of gas_taxes = 37.0461
Now to find the upper bound:
upper_int_gt <- (mean_gt + me_gas_taxes)
print(upper_int_gt)
[1] 44.67946
The upper bound is 44.67946. Because the t-score puts 0.05 in a single tail, [37.0461, 44.67946] is a two-sided 90% confidence interval; equivalently, 44.67946 is a one-sided 95% upper confidence bound for the mean.
The sample mean of about 40.86 cents per gallon is below 45 cents. As a check, we also run R’s built-in t.test on the same data.
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
mean(gas_taxes)
[1] 40.86278
t.test(gas_taxes,conf.level=0.95)
One Sample t-test
data: gas_taxes
t = 18.625, df = 17, p-value = 9.555e-13
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
36.23386 45.49169
sample estimates:
mean of x
40.86278
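Called without a mu argument, t.test tests \(H_{0}\): μ = 0, so the t statistic and p-value above are not relevant to this question; only the two-sided 95% confidence interval, [36.23386, 45.49169], is. That interval contains 45, so a two-sided test at \(\alpha\) = 0.05 would not reject μ = 45. The question, however, is one-sided (is the mean less than 45 cents?), and the matching quantity is the one-sided 95% upper confidence bound, mean + qt(0.95, 17)*sd/sqrt(18) = 44.67946. Since 44.67946 is below 45, there is enough evidence at the 5% significance level to conclude that the average tax per gallon of gas in the US in 2005 was less than 45 cents, although only by a narrow margin.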
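Since the question asks whether the mean is less than 45 cents, one direct way to check it is a one-sided t-test of \(H_{0}\): μ = 45 against \(H_{a}\): μ < 45. The sketch below uses the `mu` and `alternative` arguments of `t.test`; the name `res` is illustrative, and the output is deliberately not reproduced here.

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48,
               35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72,
               33.41, 45.02)

# One-sided test: H0: mu = 45 versus Ha: mu < 45
res <- t.test(gas_taxes, mu = 45, alternative = "less", conf.level = 0.95)

print(res$statistic)  # t statistic
print(res$p.value)    # one-sided p-value
print(res$conf.int)   # one-sided interval: (-Inf, upper bound)
```

With `alternative = "less"`, the reported confidence interval is one-sided, so its upper limit can be compared with 45 directly, and the p-value can be compared with \(\alpha\) = 0.05.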