Homework 1

Statistical Inference

Claire Battaglia https://rpubs.com/clairebattaglia (DACSS 603 Introduction to Quantitative Analysis)
Feb. 23, 2022

Question 1

Question

The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population.

| Surgical Procedure | Sample Size (\(n\)) | Mean Wait Time (\(\overline{x}\)), days | Standard Deviation (\(s\)), days |
|---|---|---|---|
| bypass | 539 | 19 | 10 |
| angiography | 847 | 18 | 9 |

Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?

Answer

For those undergoing bypass surgery the estimated mean wait time is between 18.29 and 19.71 days (19 \(\pm\) .71 days).

For those undergoing angiography surgery the estimated mean wait time is between 17.49 and 18.51 days (18 \(\pm\) .51 days).

The confidence interval is narrower for the population of patients undergoing angiography surgery.

Solution

There are two populations: patients who have undergone bypass surgery and those who have undergone angiography surgery. I have a sample from each population and the mean and standard deviation of each sample.

While I don’t know the sampling method(s), I am told each sample is representative of its corresponding population and I can see that each is large (\(n>30\)). Because the variable wait time is measured in days, I am working with discrete interval data, which I need to know in order to construct the confidence interval correctly.

A confidence interval is essentially a “best guess” of a population parameter (called the point estimate), plus or minus a margin of error. The confidence level, which is frequently expressed as a percentage, is the proportion of trials in which the confidence interval would contain the true population mean. As a statement of proportion in the long run, it is only meaningful within the frequentist approach to statistics.
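The long-run reading of the confidence level can be illustrated with a small simulation. This is only a sketch: the population mean, standard deviation, sample size, and number of repetitions below are hypothetical values chosen for illustration, not figures from the Ontario data.

```r
# Simulate repeated sampling to see how often a 90% t-interval covers the true mean.
# All population values here are hypothetical and chosen only for illustration.
set.seed(603)
trueMean <- 19
trueSD <- 10
n <- 50
reps <- 2000

covers <- replicate(reps, {
  x <- rnorm(n, mean = trueMean, sd = trueSD)
  moe <- qt(p = .95, df = n - 1) * sd(x) / sqrt(n)
  (mean(x) - moe <= trueMean) && (trueMean <= mean(x) + moe)
})

# proportion of simulated intervals that contain the true mean
coverage <- mean(covers)
coverage
```

The resulting proportion should land close to .90, which is what the confidence level promises over many repeated samples.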

The confidence interval (\(CI\)) is a \[{point\;estimate}\pm{margin\;of\;error}\] where the \[{margin\;of\;error} = t*{standard\;error}\] and the \[{standard\;error} = \frac{\sigma}{\sqrt{n}}\] Because I don’t know the true population mean (\(\mu\)), I’ll use the sample mean (\(\overline{x}\)) as my estimator (\(\hat{\mu}\)) of the population mean. Similarly, because I don’t know the true population standard deviation (\(\sigma\)), I’ll use the sample standard deviation (\(s\)) as my estimator (\(\hat{\sigma}\)) of the population standard deviation. Because I don’t know the population standard deviation and am instead using the sample standard deviation as an estimate, I’m going to use the \(t\) distribution, which means I’ll be using a \(t\)-score instead of a \(z\)-score. It’s worth noting, however, that both samples are large enough that the \(t\) distribution will be essentially identical to the standard normal distribution.

This finally gives me the formula \[CI=\overline{x}\;\pm\;\left(t*\frac{s}{\sqrt{n}}\right)\] To calculate the \(t\)-score for a 90% confidence interval, I’m going to use R. Because the confidence level is 90% (.9), the error probability (\(\alpha\)) is .1. That leaves \(\alpha/2=.05\) in the right tail and .05 in the left tail, so I need the .95 quantile of the \(t\) distribution with \(n-1\) degrees of freedom.

# calculate quantile for t distribution
tScoreBypass <- qt(p = .95, df = 538)

# view
tScoreBypass
[1] 1.647691

So the final equation is \[CI_{90}={19}\;\pm\;\left(1.647691*\frac{10}{\sqrt{539}}\right)\]

# calculate lower bound
lowerBypass <- 19 - (1.647691*(10/sqrt(539)))

# calculate upper bound
upperBypass <- 19 + (1.647691*(10/sqrt(539)))

# calculate range
rangeBypass <- upperBypass - lowerBypass

The lower bound is 18.29029 and upper bound is 19.70971, making the range 1.41942.

I’ll do the same calculations for my second population (patients who have undergone angiography surgery).

# calculate quantile for t distribution
tScoreAngio <- qt(p = .95, df = 846)

# view
tScoreAngio
[1] 1.646657

So the final equation is \[CI_{90}={18}\;\pm\;\left(1.646657*\frac{9}{\sqrt{847}}\right)\]

# calculate lower bound
lowerAngio <- 18 - (1.646657*(9/sqrt(847)))

# calculate upper bound
upperAngio <- 18 + (1.646657*(9/sqrt(847)))

# calculate range
rangeAngio <- upperAngio - lowerAngio

The lower bound is 17.49078 and the upper bound is 18.50922, making the range 1.01844.

This means that if I were to sample repeatedly from these two populations, 90% of the intervals created with these models would contain the true population means.

The confidence interval is narrower for the angiography population (1.01844 \(<\) 1.41942). This is to be expected because the angiography sample size is larger and the standard deviation is smaller.
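The two calculations above follow the same steps, so they can be wrapped in one small function. This is a minimal sketch; `tInterval` is an illustrative helper name of my own, not a base R function.

```r
# t-based confidence interval from summary statistics
# (tInterval is an illustrative helper, not part of base R)
tInterval <- function(xbar, s, n, level = .90) {
  alpha <- 1 - level
  tScore <- qt(p = 1 - alpha/2, df = n - 1)  # quantile with alpha/2 in each tail
  moe <- tScore * s / sqrt(n)                # margin of error
  c(lower = xbar - moe, upper = xbar + moe, width = 2 * moe)
}

ciBypass <- tInterval(xbar = 19, s = 10, n = 539)
ciAngio <- tInterval(xbar = 18, s = 9, n = 847)

# compare the two intervals side by side
round(rbind(bypass = ciBypass, angiography = ciAngio), 2)
```

Writing the margin of error as \(t*s/\sqrt{n}\) makes it easy to see why the angiography interval is narrower: its \(s\) is smaller and its \(n\) is larger.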

Question 2

Question

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.

Answer

The point estimate is 0.55 and the 95% confidence interval is 0.52 to 0.58. That is, the estimated proportion of adult Americans who believe that college education is essential for success is 55% \(\pm\) 3%.

Solution

The population is all adult Americans and I have a representative sample (\(n\)) of 1031. Of those sampled, 567 believe that a college education is essential for success. This is enough information to calculate the sample proportion (\(\hat{\pi}\)), which I’ll then use to construct a 95% confidence interval for the true population proportion.

To construct the confidence interval, I’ll be using the formula \[CI=\hat{\pi}\;\pm\;z\sqrt{\left(\frac{\hat{\pi}\;(1-\hat{\pi})}{n}\right)}\] First, I’ll calculate the sample proportion, which will serve as my estimator for the true population proportion.

# calculate sample proportion
sampProp <- 567/1031

Thus the sample proportion is 0.5499515

In a standard normal distribution 95% of all values fall within 1.96 standard deviations of the mean. Because I’m constructing a 95% confidence interval, then, the \(z\)-score I’ll be using is 1.96.

Thus \[CI_{95}=0.5499515\;\pm\;1.96\sqrt{\left(\frac{0.5499515\;(1-0.5499515)}{1031}\right)}\]

# calculate lower bound
lowerBound <- sampProp - (1.96 * sqrt(((sampProp * (1-sampProp)/1031))))

# calculate upper bound
upperBound <- sampProp + (1.96 * sqrt(((sampProp * (1-sampProp)/1031))))

Thus the lower bound of the confidence interval is 0.52 and the upper bound is 0.58.

This means that, using the sample proportion of 0.5499515 as the estimator (\(\hat{\pi}\)) for the true population proportion, the 95% confidence interval runs from 0.52 to 0.58 (\(\hat{\pi}\pm.0304\)). If I were to sample repeatedly and construct an interval this way each time, I can expect that 95% of those intervals would contain the true population proportion.

Alternatively, I can use the R function prop.test to calculate everything needed to answer this question. Since I’ve already calculated everything, I’ll simply use it to validate my calculations.

# calculate prop test
prop.test(567, 1031, p = 0.5499515)

    1-sample proportions test with continuity correction

data:  567 out of 1031, null probability 0.5499515
X-squared = 0, df = 1, p-value = 1
alternative hypothesis: true p is not equal to 0.5499515
95 percent confidence interval:
 0.5194543 0.5800778
sample estimates:
        p 
0.5499515 
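The bounds reported by prop.test are close to, but not identical with, the hand calculation. As far as I understand it, this is because prop.test builds its interval with the score (Wilson) method rather than the Wald formula used above. The sketch below puts the two side by side; passing `correct = FALSE` to switch off the continuity correction is my choice for the comparison, not something the question requires.

```r
x <- 567
n <- 1031
pHat <- x / n

# Wald interval: the hand formula used above
z <- qnorm(p = .975)
moe <- z * sqrt(pHat * (1 - pHat) / n)
waldCI <- c(lower = pHat - moe, upper = pHat + moe)

# score (Wilson) interval from prop.test, without continuity correction
scoreCI <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

round(waldCI, 4)
round(scoreCI, 4)
```

With a sample this large the two methods agree to about three decimal places, so the rounded answer of 0.52 to 0.58 is the same either way.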

Question 3

Question

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per quarter for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range. Assuming the significance level to be 5%, what should be the size of the sample?

Answer

The sample size should be 278.

Solution

The formula for determining the sample size for estimating a population mean (\(\mu\)) is \[n=\sigma^2\left(\frac{z}{M}\right)^2\] where \(n\) is the sample size (what I’m trying to find), \(\sigma\) is the population standard deviation, \(z\) is the \(z\)-score for the chosen confidence level, and \(M\) is the margin of error.

While I don’t know the true population standard deviation (\(\sigma\)), I am given an estimate, which will suffice. The suspected range is 170 and I am told the population standard deviation is estimated to be about a quarter of that, which is 42.50.

If the significance level (\(\alpha\)) is 5%, then the confidence level is 95% (\(1-{confidence\;level}=\alpha\)). Assuming a normal distribution, the \(z\)-score for a 95% confidence level is 1.96.

The margin of error is 5.

Thus \[n=42.5^2\left(\frac{1.96}{5}\right)^2\]

# calculate sample size
n <- (42.5^2)*((1.96/5)^2)

which indicates that a sample size of 277.5556 is required to estimate the population mean with a 95% confidence interval.

Because the units in the sample are people (i.e. a discrete unit), I’ll round up to the nearest whole number, giving me a total of 278.

This means that I would need to sample a minimum of 278 people to estimate the mean cost of textbooks per student per quarter within plus or minus $5. If I repeatedly sample at least this number of people, 95% of the intervals constructed would contain the true population mean.
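The same calculation can be written as a reusable function, which also makes it easy to see how the required sample size reacts to the margin of error. This is a sketch under the same assumptions as above (normal approximation, \(\sigma\) guessed as a quarter of the range); `sampleSizeMean` is an illustrative name of my own.

```r
# required sample size for estimating a mean within a given margin of error
# (sampleSizeMean is an illustrative helper, not part of base R)
sampleSizeMean <- function(sigma, moe, level = .95) {
  z <- qnorm(p = 1 - (1 - level)/2)  # two-sided z-score for the confidence level
  ceiling((sigma * z / moe)^2)       # round up to a whole number of people
}

# standard deviation guessed as a quarter of the suspected range
sigmaGuess <- (200 - 30) / 4

sampleSizeMean(sigma = sigmaGuess, moe = 5)

# how the requirement changes with tighter or looser margins
sapply(c(2.5, 5, 10), function(m) sampleSizeMean(sigma = sigmaGuess, moe = m))
```

Because the margin of error is squared in the denominator, halving it roughly quadruples the number of students needed.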

Question 4

Question

According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.

  1. Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.
  2. Report the P-value for Ha : μ < 500. Interpret.
  3. Report and interpret the P-value for Ha: μ > 500. (Hint: The P-values for the two possible one-sided tests must sum to 1.)

Answer

  1. There is sufficient evidence to reject the null hypothesis and thereby accept the alternative hypothesis that the mean income for female employees does not equal $500 per week. The test statistic is -3. The \(P\)-value is .017.

  2. For \(H_a:\mu<500\), the \(P\)-value is .009. Because the \(P\)-value is less than the most stringent significance level of .01, we can reject the null hypothesis (\(H_0:\mu\ge500\)) and thereby accept the alternative hypothesis.

  3. For \(H_a:\mu>500\), the \(P\)-value is .991. Because the \(P\)-value is greater than the significance level of .01 (and even the more lenient .05), we cannot reject the null hypothesis (\(H_0:\mu\le500\)) at this time.

Solution

To solve this problem, I’m going to assume:

  1. A normal population distribution.
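  2. The sample of female employees is a random sample, as stated in the question. (With only nine observations the sample is not large, which is why the normality assumption matters and why I use the \(t\) distribution.)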

To begin, I’ll identify the null hypothesis (\(H_0\)) and alternative hypothesis (\(H_a\)).

The null hypothesis is that there is no difference between the mean for female employees and the mean for all employees. That is, the mean for female employees is also 500.

\[{H_0:}\;\mu = 500\] The alternative hypothesis is that there is a difference. That is, the mean for female employees is not 500. Because I’m interested in any kind of difference (\(\mu < 500\) or \(\mu > 500\)), I’ll be using a two-sided test.

\[{H_a:}\;\mu\neq 500\] Conducting a hypothesis test asks whether what we’ve observed (\(\overline{y}=410\)) would be so unlikely if the null hypothesis (\(H_0:\mu=500\)) were true that we are obligated to reject it. If we find that what we observed is not that unlikely and could reasonably be explained by sample variability, we will not be able to reject the null hypothesis at this time.

The formula for calculating the test statistic is \[t=\frac{\overline{y}-\mu_0}{se}\] where \[se=\frac{s}{\sqrt{n}}\] Thus the standard error is \[se=\frac{90}{\sqrt{9}}\] and \[t=\frac{410-500}{30}\]

# calculate estimated standard error
se <- 90/sqrt(9)

# calculate test statistic
testStat <- (410-500)/se

\(t=\) -3 and \(|t|=\) 3

Next I’ll calculate the \(P\)-value.

# calculate 2-sided P value
pValue <- 2 * (1-pt(q = 3, df = 8))

Thus the \(P\)-value \(=\) 0.0170717

This indicates a 0.0170717 probability that we would observe a sample mean at least as far from 500 as the one we observed (\(\overline{y}=410\)) if the null hypothesis were true. That is, if the mean for female employees (\(\mu\)) were really 500. While this finding isn’t significant at the .01 level, it is significant at the .05 level and I can conclude that if the mean for female employees were 500, I’d be rather unlikely to end up with a sample mean this far from 500. I feel comfortable, then, in rejecting the null hypothesis and accepting the alternative hypothesis.

Now I’ll look at the probabilities that the sample mean would be above or below the population mean separately.

# calculate P value for Ha > 500, right-tail: P(T >= -3)
pValueG <- pt(q = -3, df = 8, lower.tail = FALSE)

# calculate P value for Ha < 500, left-tail: P(T <= -3)
pValueL <- pt(q = -3, df = 8)

For \(H_a:\mu<500\), \(P=\) 0.009. This indicates a 0.009 probability that we would have observed a \(t\)-score equal to or less than what we did in fact observe if the null hypothesis (\(H_0:\mu\ge500\)) were true. Put more simply, if the null hypothesis were true, it is highly unlikely we would have observed what we observed. We can reject the null hypothesis.

For \(H_a:\mu>500\), \(P=\) 0.991. This indicates a 0.991 probability that we would have observed a \(t\)-score equal to or greater than what we did in fact observe if the null hypothesis (\(H_0:\mu\le500\)) were true. That is, if the null hypothesis were true, it is highly likely we would have observed a \(t\)-score at least this large. We cannot reject the null hypothesis.

Taken together these lend strong support to the claim that the mean income for female employees is less than the mean for all employees.
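All three \(P\)-values come from the same \(t\)-score, so they can be computed together from the summary statistics. The following is a minimal sketch; `tFromSummary` is an illustrative helper name of my own, and it relies on the same assumptions listed above.

```r
# one-sample t test from summary statistics
# (tFromSummary is an illustrative helper, not part of base R)
tFromSummary <- function(ybar, s, n, mu0) {
  se <- s / sqrt(n)
  tStat <- (ybar - mu0) / se
  df <- n - 1
  c(t = tStat,
    pTwoSided = 2 * pt(q = -abs(tStat), df = df),              # Ha: mu != mu0
    pLess = pt(q = tStat, df = df),                            # Ha: mu < mu0
    pGreater = pt(q = tStat, df = df, lower.tail = FALSE))     # Ha: mu > mu0
}

res <- tFromSummary(ybar = 410, s = 90, n = 9, mu0 = 500)
round(res, 3)

# the two one-sided P-values sum to 1, as the hint in the question states
res[["pLess"]] + res[["pGreater"]]
```

Using `-abs(tStat)` for the two-sided value keeps the function correct whether the sample mean falls below or above \(\mu_0\).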

Question 5

Question

Jones and Smith separately conduct studies to test H0: μ = 500 against Ha : μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.

  1. Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.
  2. Using α = 0.05, for each study indicate whether the result is “statistically significant.”
  3. Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0 ,” without reporting the actual P-value.

Solution

The formula for calculating the test statistic is \[t=\frac{\overline{y}-\mu_0}{se}\] Since they both assume the population mean (\(\mu\)) to be 500 (the null hypothesis) and they both got a standard error of 10, the only difference between their tests is the mean of each of their samples (\(\overline{y}\)).

Thus Jones’s test statistic is \[t=\frac{{519.5}-{500}}{10}\] and Smith’s is \[t=\frac{{519.7}-{500}}{10}\]

# calculate test stat Jones
tStatJ <- (519.5-500)/10

# calculate test stat Smith
tStatS <- (519.7-500)/10

For Jones, \(t=\) 1.95

For Smith, \(t=\) 1.97

Now I’ll calculate the \(P\)-value for each of them. This will tell me the probability of observing data at least as extreme as what they actually observed if the null hypothesis is true. Since the alternative hypothesis is non-directional, I’ll calculate the two-sided probability.

# calculate P value Jones
pJones <- 2*pt(q = tStatJ, df = 999, lower.tail=FALSE)

# calculate P value Smith
pSmith <- 2*pt(q = tStatS, df = 999, lower.tail=FALSE)

For Jones, \(P=\) 0.051

For Smith, \(P=\) 0.049

Since the significance level (\(\alpha\)) is .05, Jones is unable to reject the null hypothesis at this time (\(.051>.05\)) and his/her results are not statistically significant. Smith, however, is able to reject the null hypothesis (\(.049<.05\)) and can claim that his/her results are statistically significant (at the level of .05).

This example illustrates the danger of living and dying by whether or not the \(P\)-value is statistically significant. For example, if a statistically significant finding is requisite for publishing, then only Smith’s finding would make its way to a larger audience. If the \(P\)-value were not included, the reader might wrongly assume that the evidence for rejecting the null hypothesis is strong, when in reality it only nudges us towards that conclusion.

Alternatively, if both findings were published and the \(P\)-values were not included, the reader would see that Jones does not reject the null hypothesis but that Smith does and might wrongly believe that their findings were contradictory. Including the \(P\)-values, however, would allow the reader to see that the findings are actually not contradictory.

It’s important to remember that the significance level is an arbitrary demarcation. This is ultimately a problem of using a binary framework (reject/fail to reject) in a world in which very few things (if any) are actually binary. Maintaining a larger perspective is paramount—statistics can and should inform our understanding of the world but significance testing is not a substitute for critical thinking.
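One way to see how similar the two studies are is to report a 95% confidence interval for each instead of a reject/fail-to-reject verdict. This is a sketch that goes slightly beyond what the question asks; it uses only the sample means, the standard error, and \(n\) given in the question.

```r
se <- 10
df <- 999
tCrit <- qt(p = .975, df = df)  # two-sided critical value at alpha = .05

# 95% confidence intervals for each study
ciJones <- 519.5 + c(-1, 1) * tCrit * se
ciSmith <- 519.7 + c(-1, 1) * tCrit * se

round(rbind(Jones = ciJones, Smith = ciSmith), 2)
```

The two intervals are the same width and are shifted by only 0.2, yet Jones’s lower bound falls just below 500 and Smith’s falls just above it, which is the entire difference between “not significant” and “significant.”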

Question 6

Question

Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.

Answer

Yes, there is sufficient evidence to reject the null hypothesis and thereby accept the alternative hypothesis that in 2005 the average gas tax in the United States was less than 45.00 cents.

Solution

To answer this question, I’m going to conduct a one-sided significance test. To do so, I’m going to assume a random sample and normal distribution.

The null hypothesis is that the average tax per gallon is 45.00 cents. \[H_0:\mu=45.00\] The alternative hypothesis is that the average tax per gallon is less than 45.00 cents. \[H_a:\mu<45.00\] Conducting a hypothesis test asks whether the observed data (the gas taxes for the 18 cities in our sample) would be so unlikely if the null hypothesis were true that we are forced to reject it.

First I’ll load the given sample data and calculate some summary statistics.

# create vector of sample data
gasTaxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

# calculate mean
sampMean <- mean(gasTaxes)

# calculate sd
sampSD <- sd(gasTaxes)

The sample mean (\(\overline{y}\)) is 40.8627778 and the sample standard deviation (\(s\)) is 9.3083168.

The formula for calculating the test statistic is \[t=\frac{\overline{y}-\mu_0}{se}\] where \[se=\frac{s}{\sqrt{n}}\] Thus \[se=\frac{9.3083168}{\sqrt{18}}\] and \[t=\frac{40.8627778-45.00}{2.193991}\]

# calculate standard error
se <- 9.3083168/sqrt(18)

# calculate t statistic
tStatGas <- (40.8627778-45.00)/se

Thus the test statistic is -1.8857058.

Finally, I’ll calculate the \(P\)-value.

# calculate P value
pGas <- pt(q = tStatGas, df = 17, lower.tail=TRUE)

Thus the \(P\)-value is 0.038

This means that if the null hypothesis were true there is a 0.038 probability of observing the data we observed or data more extreme.

A 95% confidence level corresponds to a significance level of .05 (\(1-{confidence\;level}=\alpha\)). Given that the \(P\)-value is less than the significance level of .05, I can say that, yes, at the 95% confidence level there is sufficient evidence to reject the null hypothesis and accept the alternative hypothesis that in 2005 the average gas tax in the United States was less than 45.00 cents.

Alternatively, I can use the R function t.test to calculate everything needed to answer this question. Since I’ve already calculated everything, I’ll simply use it to validate my calculations.

t.test(gasTaxes, mu = 45.00, alternative = "less")

    One Sample t-test

data:  gasTaxes
t = -1.8857, df = 17, p-value = 0.03827
alternative hypothesis: true mean is less than 45
95 percent confidence interval:
     -Inf 44.67946
sample estimates:
mean of x 
 40.86278 

Another approach to hypothesis testing is constructing the appropriate confidence interval and determining whether the mean of the null hypothesis is contained within that interval or not.

The t.test function has returned the confidence interval and since I’m looking at it only to confirm my decision to reject the null hypothesis, I won’t calculate it manually. Because the alternative hypothesis is directional, the confidence interval is likewise one-sided. In this case it is \((-\infty, 44.67946]\). This interval does not contain the mean of the null hypothesis (45.00), which confirms my decision to reject the null hypothesis.
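For completeness, the upper bound of the one-sided interval can be reproduced by hand. This is a minimal sketch using the same sample data: because all of \(\alpha=.05\) sits in one tail, the bound uses the .95 quantile of the \(t\) distribution rather than the .975 quantile.

```r
gasTaxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
              48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

n <- length(gasTaxes)
se <- sd(gasTaxes) / sqrt(n)

# upper bound of the one-sided 95% confidence interval
upperGas <- mean(gasTaxes) + qt(p = .95, df = n - 1) * se
upperGas

# TRUE if the bound falls below the null value of 45
upperGas < 45
```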