Homework # 1 questions and answers for DACSS 603: Introduction to Quantitative Analysis
The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population. Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling(), scroll_box(), and the %>% pipe
procedure <- c('Bypass', 'Angiography')
samplesize <- c(539, 847)
meanwait <- c(19, 18)
standev <- c(10, 9)
surgdata <- data.frame(procedure, samplesize, meanwait, standev)
kable(surgdata, col.names = c("Surgical Procedure", "Sample Size", "Mean Wait Time", "Standard Deviation"),
      align = c('c', 'c', 'c', 'c')) %>%
  kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "100%")
| Surgical Procedure | Sample Size | Mean Wait Time | Standard Deviation |
|---|---|---|---|
| Bypass | 539 | 19 | 10 |
| Angiography | 847 | 18 | 9 |
We have the sample size, mean, and standard deviation for each procedure, and we assume the sample is representative of the Ontario population. Because the population standard deviation is unknown and must be estimated by the sample standard deviation, we use the t-distribution to produce an interval estimate for the true mean wait time of each procedure. According to the text (SMSS, section 5.6), “confidence intervals using the t-distribution apply with any n but assume a normal population distribution.”
Using the formula to calculate the confidence intervals:
I will use qt() to calculate the t-score, since the population standard deviation is unknown and is estimated from the sample
I will set p to .05 in qt(), which leaves .05 in the left tail and .05 in the right tail and so yields the 90% confidence interval; note that qt(.05, df) returns the negative (lower-tail) t-score
df=n-1 for the t-score
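\(\bar{y} \pm t \cdot se\), where \(se = \frac{s}{\sqrt{n}}\)
# Bypass: the t-score from qt(.05, df) is negative, so adding t*se gives the lower bound
ybarB + tB*seB
[1] 18.29029
ybarB - tB*seB
[1] 19.70971
# Angiography: same calculation
ybarA + tA*seA
[1] 17.49078
ybarA - tA*seA
[1] 18.50922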
Reporting the confidence intervals:
We can be 90% confident that the population mean wait time for the bypass procedure is between 18.29029 and 19.70971 days.
We can be 90% confident that the population mean wait time for the angiography procedure is between 17.49078 and 18.50922 days.
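The interval calculation can also be wrapped in a small helper. This is a minimal sketch, not the code that produced the output above: the function name `t_ci` and its argument names are hypothetical, and it uses the upper quantile `qt(1 - alpha/2, df)` so that the margin of error is positive.

```r
# Hypothetical helper: t-based confidence interval from summary statistics.
t_ci <- function(ybar, s, n, conf.level = 0.90) {
  alpha <- 1 - conf.level
  se    <- s / sqrt(n)                       # standard error of the mean
  tcrit <- qt(1 - alpha / 2, df = n - 1)     # positive critical t-score
  c(lower = ybar - tcrit * se, upper = ybar + tcrit * se)
}

ci_bypass <- t_ci(ybar = 19, s = 10, n = 539)  # bypass surgery
ci_angio  <- t_ci(ybar = 18, s = 9,  n = 847)  # angiography
ci_bypass
ci_angio
```

The two results should agree with the bounds reported above, and `diff(ci_bypass)` and `diff(ci_angio)` give the interval widths directly.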
Which confidence interval is narrower?
# Bypass confidence interval width (upper bound minus lower bound; tB is negative)
(ybarB - tB*seB)-(ybarB + tB*seB)
[1] 1.419421
# Angiography confidence interval width (upper bound minus lower bound; tA is negative)
(ybarA - tA*seA)-(ybarA + tA*seA)
[1] 1.018436
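The confidence interval for the angiography procedure (width of about 1.02 days) is narrower than the confidence interval for the bypass procedure (width of about 1.42 days). This is expected, since the angiography sample is larger and has a smaller standard deviation.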
A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.
Because the quantity of interest is a proportion, we will use prop.test() to calculate the point estimate and the 95% confidence interval.
prop.test(567, 1031, conf.level = .95)
1-sample proportions test with continuity correction
data: 567 out of 1031, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5189682 0.5805580
sample estimates:
p
0.5499515
The point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success is 567/1031 = 0.5499515, or about 55%.
We can be 95% confident that the population proportion who believe that a college education is essential for success is between 0.5189682 and 0.5805580.
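As a cross-check, the point estimate and an approximate interval can be computed by hand with the large-sample formula p ± z·se, where se = sqrt(p(1 − p)/n). This is a sketch using a different method from prop.test(), which reports a continuity-corrected interval, so the bounds are expected to differ slightly; the variable names here are illustrative.

```r
# Large-sample (Wald) interval for a proportion, computed by hand.
n    <- 1031                          # sample size
x    <- 567                           # number who agreed
phat <- x / n                         # point estimate of the proportion
se   <- sqrt(phat * (1 - phat) / n)   # estimated standard error
z    <- qnorm(0.975)                  # z-score for 95% confidence
wald_ci <- c(lower = phat - z * se, upper = phat + z * se)
phat
wald_ci
```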
Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per quarter for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range. Assuming the significance level to be 5%, what should be the size of the sample?
The formula used to determine the sample size for estimating a mean is \(n = σ^{2}\left(\frac{z}{M}\right)^{2}\), where M is the desired margin of error.
The financial aid office estimates the population standard deviation to be about a quarter of the range, which is \(\frac{(200-30)}{4} = 42.5\).
The office wants the confidence interval to have a length of 10 dollars or less. Since confidence interval = point estimate ± margin of error, the margin of error in this case will be \(\frac{10}{2}\).
With the significance level set at 5%, z=1.96.
# Computing sample size
stdevBooks <- (200-30)/4
margerrorBooks <- (10/2)
zBooks <- 1.96
stdevBooks^2 * (zBooks/margerrorBooks)^2
[1] 277.5556
To estimate the mean cost of books with a 95% confidence interval no wider than $10, we round 277.5556 up to the next whole number: the sample size should be at least 278.
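The same calculation can be written as a reusable function that rounds up automatically. This is a minimal sketch; the function name `sample_size_mean` is hypothetical, and it uses `qnorm()` for the z-score rather than the rounded value 1.96.

```r
# Hypothetical helper: sample size needed to estimate a mean within a given margin of error.
sample_size_mean <- function(sigma, margin, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # z-score for the chosen confidence level
  ceiling(sigma^2 * (z / margin)^2)      # round up to a whole observation
}

n_books <- sample_size_mean(sigma = (200 - 30) / 4, margin = 10 / 2)
n_books
```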
(Exercise 6.7, Chapter 6 of SMSS, Agresti 2018) According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.
(Hint: The P-values for the two possible one-sided tests must sum to 1.)
a.
To test whether the mean income of female employees differs from $500, we will perform a one-sample, two-sided significance test for a mean, which uses the t-statistic.
We assume that the sample is random and that the population has a normal distribution.
Null hypothesis: \(H_{0}\): μ = 500
Alternative hypothesis: \(H_{a}\): μ ≠ 500
The test statistic is t=\(\frac{ȳ-μ_{0}}{se}\), where se=\(\frac{s}{\sqrt{n}}\)
We will reject the null hypothesis at a p-value less than or equal to α=.05
# Calculate t-statistic:
(410-500)/(90/sqrt(9))
[1] -3
# Calculate p-value: multiply the lower-tail probability by 2 to account for two tails
pt(-3, 8)*2
[1] 0.01707168
The test statistic is t=-3 and the p-value is P=0.01707168. With an α-level of .05, the p-value is substantially less than .05, thus we will reject the null hypothesis. There is strong evidence that the mean income of female employees is not equal to $500.
b.
# Calculate p-value for Ha: μ < 500
pt(-3, 8, lower.tail = TRUE)
[1] 0.008535841
The p-value for \(H_{a}\): μ < 500 is P=0.008535841. With an α-level of .05, the p-value is substantially less than .05, thus we will reject the null hypothesis. There is evidence that the mean income of female employees is less than $500.
c.
# Calculate p-value for Ha: μ > 500
pt(-3, 8, lower.tail = FALSE)
[1] 0.9914642
The p-value for \(H_{a}\): μ > 500 is P=0.9914642. With an α-level of .05, we fail to reject the null hypothesis. The data provide no evidence that the mean income of female employees is greater than $500. As the hint states, the two one-sided P-values sum to 1 (0.008535841 + 0.9914642 ≈ 1).
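Parts a through c can be reproduced together from the summary statistics. The following is a sketch under the same assumptions as above (random sample, normal population); the function name `t_test_summary` and the names of its list elements are hypothetical.

```r
# Hypothetical helper: one-sample t-test from summary statistics.
t_test_summary <- function(ybar, s, n, mu0) {
  se     <- s / sqrt(n)                  # standard error of the mean
  t_stat <- (ybar - mu0) / se            # test statistic
  df     <- n - 1
  p_less    <- pt(t_stat, df)                      # Ha: mu < mu0
  p_greater <- pt(t_stat, df, lower.tail = FALSE)  # Ha: mu > mu0
  p_two     <- 2 * min(p_less, p_greater)          # Ha: mu != mu0
  list(t = t_stat, df = df, p_two_sided = p_two,
       p_less = p_less, p_greater = p_greater)
}

res <- t_test_summary(ybar = 410, s = 90, n = 9, mu0 = 500)
res$t
res$p_two_sided
res$p_less + res$p_greater   # the two one-sided p-values sum to 1
```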
(Exercise 6.23, Chapter 6 of SMSS, Agresti 2018) Jones and Smith separately conduct studies to test H0: μ = 500 against Ha : μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.
a.
# Jones t=1.95, P=.051
JonesT <- (519.5-500)/10
JonesT
[1] 1.95
JonesP <- pt(1.95, 999, lower.tail = FALSE)*2
JonesP
[1] 0.05145555
# Smith t=1.97, P=.049
SmithT <- (519.7-500)/10
SmithT
[1] 1.97
SmithP <- pt(1.97, 999, lower.tail = FALSE)*2
SmithP
[1] 0.04911426
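The two studies can also be compared side by side in one vectorized calculation. This is a sketch with illustrative variable names; it adds a column showing the strict reject/fail-to-reject decision at α = .05.

```r
# Jones and Smith: same null value, same standard error, nearly identical sample means.
ybar <- c(Jones = 519.5, Smith = 519.7)
se   <- 10
mu0  <- 500
df   <- 1000 - 1

t_stat <- (ybar - mu0) / se
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)   # two-sided p-values

data.frame(t = t_stat, p = round(p_val, 3), reject_at_05 = p_val <= 0.05)
```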
b.
With an α-level of .05, the P-values found by Jones (P=.051) and Smith (P=.049) are nearly identical. Jones’ P-value is slightly greater than α=.05 and Smith’s is slightly less, but results this close should lead to essentially the same substantive conclusion: both provide moderate evidence against the null hypothesis that the mean equals 500. Interpreted strictly against the α=.05 cutoff, however, Jones’ test fails to reject the null hypothesis while Smith’s test rejects it, so Smith’s result is “statistically significant” and Jones’ is not.
c.
If we fail to report the P-value and simply state whether it is less than or equal to, or greater than, the chosen significance level, the reader cannot determine the strength of the evidence. For example, a P-value of .009 at a significance level of .05 provides much stronger evidence against the null than a P-value of .045; however, both values allow rejection of the null at the .05 level. In the Jones/Smith example, reporting the results only as “P ≤ 0.05” versus “P > 0.05” leads to different conclusions about very similar results (rejecting versus failing to reject the null).
Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.
t.test(gas_taxes, mu = 18.4, conf.level = .95)
One Sample t-test
data: gas_taxes
t = 10.238, df = 17, p-value = 1.095e-08
alternative hypothesis: true mean is not equal to 18.4
95 percent confidence interval:
36.23386 45.49169
sample estimates:
mean of x
40.86278
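The 95% confidence interval for the mean tax per gallon is 36.23386 through 45.49169 cents. This interval does not depend on the mu = 18.4 passed to t.test(); that argument only affects the reported t statistic and p-value, which compare the sample mean with the federal tax rate rather than with 45 cents. Based on this two-sided interval, we cannot conclude with 95% confidence that the mean tax is less than 45 cents, since the interval contains values above 45 cents (up to 45.49169). Note that a two-sided interval is a conservative check for a one-sided question.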
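Because the question is directional (less than 45 cents), a one-sided test of H0: μ = 45 against Ha: μ < 45 addresses it more directly than a two-sided interval. The following is a sketch of that alternative approach, not the analysis reported above; the object name `one_sided` is illustrative.

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

# One-sided, one-sample t-test: is the mean tax less than 45 cents?
one_sided <- t.test(gas_taxes, mu = 45, alternative = "less", conf.level = 0.95)
one_sided$statistic   # t statistic against mu = 45
one_sided$p.value     # lower-tail p-value
one_sided$conf.int    # one-sided interval: (-Inf, upper bound)
```

With these data the one-sided p-value comes out below .05 and the one-sided upper bound falls below 45 cents, so this approach would support the claim. It differs from the two-sided interval because a one-sided test places the entire 5% in the lower tail.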