HW1_DACSS603

Question 1

The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population.

  surgical_procedure sample_size mean_wait_time standard_deviation
1             Bypass         539             19                 10
2        Angiography         847             18                  9

Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?

I will do the following to calculate the confidence intervals:

Calculate the mean
Calculate the standard error of the mean
Calculate n
Determine a confidence level
Find the t-score
Calculate interval

I will start with the values we know.

# we already know the mean, sample size, and standard deviation
# creating variables for all the values we do know
bypass_n <- 539
bypass_mean_wait_time <- 19
bypass_sd <- 10
angio_n <- 847
angio_mean_wait_time <- 18
angio_sd <- 9

Next, I will specify the confidence level and use that to calculate the tail area.

# specify confidence level
# calculate tail area - we can use this for both the angio and bypass confidence intervals

confidence_level <- 0.90
tail_area <- (1 - confidence_level)/2 # divide by two because we care about both sides.

tail_area

[1] 0.05

Then I will use the tail areas to calculate the t-scores and confidence intervals. I will start with the bypass surgery.

# bypass
# calculate t-score
bypass_t_score <- qt(p = 1 - tail_area, df = bypass_n - 1)

bypass_t_score

[1] 1.647691

# bypass 
# calculate confidence interval

bypass_lower <- bypass_mean_wait_time - bypass_t_score * bypass_sd / sqrt(bypass_n)
bypass_upper <- bypass_mean_wait_time + bypass_t_score * bypass_sd / sqrt(bypass_n)

print(c(bypass_lower, bypass_upper))

[1] 18.29029 19.70971

The confidence interval for the bypass surgery is between 18.29029 and 19.70971 days.

Next, I will do the same for angiography.

# angio
# calculate t-score

angio_t_score <- qt(p = 1 - tail_area, df = angio_n - 1)

angio_t_score

[1] 1.646657

# margin of error and confidence interval - angio

angio_lower <- angio_mean_wait_time - angio_t_score * angio_sd / sqrt(angio_n)
angio_upper <- angio_mean_wait_time + angio_t_score * angio_sd / sqrt(angio_n)

print(c(angio_lower, angio_upper))

[1] 17.49078 18.50922

The confidence interval for the angiography surgery is between 17.49078 and 18.50922 days.
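The two calculations above follow the same steps, so they can be wrapped in one function. The sketch below is only an illustration: `t_ci_from_summary` is a hypothetical helper name, not something from the assignment, and it assumes the same t-based interval used above.

```r
# Hypothetical helper: t-based confidence interval from summary statistics.
# s is the sample standard deviation, n the sample size.
t_ci_from_summary <- function(mean, s, n, conf_level = 0.90) {
  tail_area <- (1 - conf_level) / 2
  t_score <- qt(p = 1 - tail_area, df = n - 1)
  margin <- t_score * s / sqrt(n)
  c(lower = mean - margin, upper = mean + margin, width = 2 * margin)
}

bypass_ci <- t_ci_from_summary(mean = 19, s = 10, n = 539)
angio_ci  <- t_ci_from_summary(mean = 18, s = 9,  n = 847)

# One row per procedure, with the interval width in the last column
rbind(bypass = bypass_ci, angiography = angio_ci)
```

Returning the width alongside the bounds makes the "which interval is narrower" comparison a direct read from the last column.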

The confidence interval for mean days waiting is narrower for angiography (about 1.02 days wide) than for bypass surgery (about 1.42 days wide), because the angiography sample is larger and its standard deviation is smaller. This is also easy to see in graph form:

# add confidence intervals to df
surgical_procedure = c('Bypass', 'Angiography')
sample_size = c(bypass_n, angio_n)
mean_wait_time = c(bypass_mean_wait_time, angio_mean_wait_time)
standard_deviation = c(bypass_sd, angio_sd)
lower = c(bypass_lower, angio_lower)
upper = c(bypass_upper, angio_upper)

df <- data.frame(surgical_procedure, sample_size, mean_wait_time, standard_deviation, lower, upper)

# compare confidence intervals - plot
library(ggplot2) # needed for ggplot(), geom_point(), geom_errorbar(), etc.

ggplot(df) +   
  geom_point(aes(x = surgical_procedure, y = mean_wait_time), color = "#9784c2", size = 3) +
  geom_errorbar(aes(x = surgical_procedure, ymin = lower, ymax = upper), color = "#9784c2", width = 0.5) +
  labs(x = "Surgical Procedure", y = "Mean Wait Time (Days)") +
  geom_text(aes(x = surgical_procedure, y = upper, label = round(upper, digits = 2)), 
            family = "Avenir", size=3, color = "#33475b", hjust = -3) +
  geom_text(aes(x = surgical_procedure, y = lower, label = round(lower, digits = 2)), 
            family = "Avenir", size=3, color = "#33475b", hjust = -3) +
  geom_text(aes(x = surgical_procedure, y = mean_wait_time, label = mean_wait_time), 
            family = "Avenir", size=3, color = "#33475b", hjust = -1) +
  theme(axis.text.x = element_text(family = "Avenir", color = "#33475b", size=10),
        axis.text.y = element_text(family = "Avenir", color = "#33475b", size=8),
        axis.title.y = element_text(family = "Avenir", color = "#33475b", size=13),
        axis.title.x = element_text(family = "Avenir", color = "#33475b", size=13))

Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.

First, we want to find the point estimate (p) and then construct the confidence interval, because an interval conveys the uncertainty around the estimate, which a single point cannot.

#find the point estimate

college_education_essential <- 567
survey_n <- 1031

point_estimate <- college_education_essential/survey_n

point_estimate

[1] 0.5499515

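The next thing I am going to do is calculate the margin of error on either side of the point estimate. For a 95% confidence interval, alpha is 0.05, so we need the z-score at the cumulative probability 1-(0.05/2) = 0.975, which is qnorm(0.975), about 1.96. We can use the z-score because the sample is large: the counts of “successes” (567) and “failures” (464) are both far above the usual rule-of-thumb minimums, so the sampling distribution of the sample proportion is approximately normal.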

# calculate the error

error <- qnorm(0.975)*sqrt(point_estimate*(1-point_estimate)/survey_n)

error

[1] 0.03036761

# calculate the confidence interval

upper2 <- point_estimate + error
lower2 <- point_estimate - error

print(c(lower2, upper2))

[1] 0.5195839 0.5803191

print(c(round(lower2, digits = 3), round(upper2, digits = 3))) # round

[1] 0.52 0.58

Here we can see that our point estimate of the population proportion is 0.5499515, or about 55%. The 95% confidence interval runs from .520 to .580; in other words, we are 95% confident that between 52% and 58% of adult Americans believe a college education is essential for success. The 95% refers to the procedure: if we drew many representative samples and built an interval from each one in the same way, about 95% of those intervals would contain the true population proportion.
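The repeated-sampling reading of a confidence interval can be checked with a small simulation. This is only a sketch under stated assumptions: the "true" proportion of 0.55, the number of simulated surveys, and the seed are all arbitrary choices made for illustration, and the `sim_` variable names are not from the assignment.

```r
set.seed(603) # arbitrary seed, used only so the run is reproducible

sim_true_p <- 0.55 # assumed "true" population proportion (illustrative)
sim_n <- 1031      # same sample size as the survey
sim_reps <- 2000   # number of simulated surveys

# Number of "successes" in each simulated survey, and the sample proportions
sim_successes <- rbinom(sim_reps, size = sim_n, prob = sim_true_p)
sim_p_hat <- sim_successes / sim_n

# Build the same kind of 95% interval for every simulated survey
sim_error <- qnorm(0.975) * sqrt(sim_p_hat * (1 - sim_p_hat) / sim_n)
sim_covered <- (sim_p_hat - sim_error <= sim_true_p) &
  (sim_true_p <= sim_p_hat + sim_error)

# Share of intervals that contain the assumed true proportion
coverage <- mean(sim_covered)
coverage
```

The share of intervals that cover the assumed true value should come out close to 0.95, which is what the confidence level promises about the procedure rather than about any single interval.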

Alternative Approach

We could also use prop.test() with the same numbers, which reports both the point estimate and the confidence interval automatically.

conf.level = 0.95 (this is also the default for prop.test() but we will still specify)

x = the number of “successes” (in this case it is the number of survey respondents who say that college education is needed)

n = number of survey respondents.

# calculate the confidence interval

prop.test(x = 567, n = 1031, conf.level = 0.95)


    1-sample proportions test with continuity correction

data:  567 out of 1031, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5189682 0.5805580
sample estimates:
        p 
0.5499515

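NOTE: There is a slight difference between the two intervals (less than .001 at each end). This is not a rounding issue: prop.test() uses a different method. It computes a Wilson score interval and, by default, applies a continuity correction (as the output header says), whereas the manual calculation above is the simpler Wald interval. If we round to the nearest hundredth, both approaches give 0.52 to 0.58.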
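To see where the small difference comes from, the intervals can be put side by side. The sketch below recomputes the manual (Wald) interval and calls prop.test() with and without its continuity correction, using the `correct` argument; the variable names are illustrative.

```r
x <- 567
n <- 1031
p_hat <- x / n

# Wald interval: the manual calculation, point estimate +/- z * standard error
wald <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

# prop.test() intervals with the default continuity correction and without it
with_correction <- prop.test(x = x, n = n, conf.level = 0.95)$conf.int
without_correction <- prop.test(x = x, n = n, conf.level = 0.95,
                                correct = FALSE)$conf.int

rbind(wald = wald,
      prop_test_corrected = as.numeric(with_correction),
      prop_test_uncorrected = as.numeric(without_correction))
```

With a sample this large, all three intervals agree to about two decimal places, so the choice of method does not change the conclusion here.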

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per quarter for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range. Assuming the significance level to be 5%, what should be the size of the sample?

Right away we know a few key things:

range = the difference between $30 and $200 (170)

z-value = the significance level is 5%, so alpha is 0.05 and we need the z-score at the cumulative probability 1-(0.05/2) = 0.975, which is about 1.96

margin = the estimate is useful within $5 on either side, so the margin is 5.

The first thing we will do is calculate the standard deviation, which we know is a quarter of the range.

#range - difference between $30 and $200

range <- 170

#significance level is 5%, alpha is 0.05
z <- qnorm(1-(0.05/2))

#margin within $5 of the true population mean - margin = 5

margin <- 5

#calculate sd ("standard deviation is a quarter of the range")

sd <- range*0.25

sd

[1] 42.5

Now that I have the standard deviation, I can calculate the sample size.

#Now I can calculate the sample size with the formula n = (z-value * sd / margin)^2, using the z-value rounded to 1.96.

sample_size <- ((1.96*sd)/margin)^2

sample_size

[1] 277.5556

ceiling(sample_size) #round up to the next whole person so the margin is no larger than $5

[1] 278

Rounding up to the next whole person, we get 278. In other words, for the financial aid office to estimate the mean cost of textbooks to within $5 at the 95% confidence level (a 5% significance level), they should sample at least 278 students.
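The same calculation can be written as a small function, which makes it easy to try other margins or standard deviation guesses. This is a sketch: `sample_size_for_mean` is a hypothetical helper name, and it uses the unrounded z-value from qnorm() rather than 1.96.

```r
# Hypothetical helper: sample size needed to estimate a mean to within
# +/- margin, given a guess at the population standard deviation.
sample_size_for_mean <- function(sd_guess, margin, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  # Round up: rounding down would leave the margin slightly too wide
  ceiling((z * sd_guess / margin)^2)
}

n_needed <- sample_size_for_mean(sd_guess = 42.5, margin = 5)
n_needed
```

Because the margin is squared in the denominator, halving the margin roughly quadruples the required sample size.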

Question 4

(Exercise 6.7, Chapter 6 of SMSS, Agresti 2018) According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.

Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.
Report the P-value for Ha: μ < 500. Interpret.
Report and interpret the P-value for Ha: μ > 500. (Hint: The P-values for the two possible one-sided tests must sum to 1.)

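Assumptions: the nine employees are a random sample, and weekly income is approximately normally distributed in the population (this matters because the sample is small).

Hypotheses and known values:

null hypothesis (H0) is that the mean weekly earnings for the population of women at the company is $500 per week: μ = 500

alternative hypothesis (Ha) is that the mean weekly earnings for the population of women at the company is not $500 per week: μ ≠ 500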

sample size = 9

sample mean = 410

sample standard deviation = 90

First, I will calculate the t-statistic and then use it to find the p-value. I will use a 5% significance level (95% confidence level). The test statistic is t = (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n)).

# start with what we know

salary_mean <- 410
salary_sd <- 90
salary_n <- 9

# find the t-score t = (sample mean - hypothesized mean) / (sample standard deviation / sqrt(n))
# hypothesized mean = null hypothesis = 500

salary_t_score <- (salary_mean - 500)/(salary_sd/sqrt(salary_n))

salary_t_score

[1] -3

# now I will find the p-value using pt()

2 * pt(q = -abs(salary_t_score), df = salary_n-1) # doubled because this is a two-tailed test; -abs() keeps this correct for any sign of t

[1] 0.01707168

At the 5% significance level (95% confidence level), we reject the null hypothesis that mean income is $500 in favor of the alternative that it is not $500, because the p-value (0.017) is less than 0.05. Equivalently, the 95% confidence interval for the mean, 410 ± 2.306 × 30, or roughly $341 to $479, does not contain $500.

Now we’ll look at the p-value if the alternative hypothesis is μ < 500.

#p-value of the left (lower) tail only (less than 500)

pt(q = salary_t_score, df = salary_n-1)

[1] 0.008535841

At the 5% significance level, we reject the null hypothesis that mean income is $500 in favor of the alternative that mean income is less than $500, because the p-value is 0.0085, which is less than 0.05.

Now we’ll look at the p-value if the alternative hypothesis is μ > 500.

#p-value of the right (upper) tail only (greater than 500)

pt(q = salary_t_score, df = salary_n-1, lower.tail = FALSE)

[1] 0.9914642

At the 5% significance level, we cannot reject the null hypothesis that mean income is $500 in favor of the alternative that mean income is greater than $500, because the p-value is 0.99, which is far greater than 0.05. As the hint says, the two one-sided p-values sum to 1: 0.0085 + 0.9915 = 1.
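All three p-values come from the same t-statistic, so they can be computed together. The sketch below uses a hypothetical helper, `t_test_from_summary`, that works from summary statistics only (the raw salaries are not available in the exercise).

```r
# Hypothetical helper: one-sample t-test from summary statistics.
t_test_from_summary <- function(sample_mean, sample_sd, n, mu0) {
  se <- sample_sd / sqrt(n)
  t_stat <- (sample_mean - mu0) / se
  df <- n - 1
  c(t = t_stat,
    p_two_sided = 2 * pt(-abs(t_stat), df = df),            # Ha: mu != mu0
    p_less = pt(t_stat, df = df),                           # Ha: mu < mu0
    p_greater = pt(t_stat, df = df, lower.tail = FALSE))    # Ha: mu > mu0
}

salary_test <- t_test_from_summary(sample_mean = 410, sample_sd = 90,
                                   n = 9, mu0 = 500)
salary_test

# The two one-sided p-values always sum to 1
salary_test[["p_less"]] + salary_test[["p_greater"]]
```

Keeping the three alternatives in one place makes it harder to mix up which tail belongs to which hypothesis.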

Question 5

(Exercise 6.23, Chapter 6 of SMSS, Agresti 2018) Jones and Smith separately conduct studies to test H0: μ = 500 against Ha: μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.

Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.
Using α = 0.05, for each study indicate whether the result is “statistically significant.”
Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0 ,” without reporting the actual P-value.

Starting with Jones, we will find the t and p values:

t = (sample mean - null hypothesis)/(sample standard error).

se = 10

sample mean = 519.5

null hypothesis = 500

We will find the p-value using pt(). Since this is a two-tailed test, we will add the lower-tail and upper-tail areas together.

# t = (y-bar - mu0)/se
# Jones got a sample mean of 519.5 with a standard error of 10.0

jones_t <- (519.5-500)/10

# now we are conducting a 2-sided test. Find the area to the right of 1.95 and the area to the left of -1.95 to get p-value
# use pt(); degrees of freedom are n-1 = 999
# pt() finds the area to the left of a value by default
df_js <- 999 # named df_js so it does not overwrite the data frame df from Question 1

jones_p <- pt(q = -jones_t, df = df_js) + pt(q = jones_t, df = df_js, lower.tail = FALSE) #lower tail + upper tail

jones_t

[1] 1.95

jones_p

[1] 0.05145555

We get the same results as the ones we are given; t = 1.95, p = 0.051.

Now we will verify the t and p values for Smith:

t = (sample mean - null hypothesis)/(sample standard error).

se = 10

sample mean = 519.7

null hypothesis = 500

We will find the p-value using pt(). Since this is a two-tailed test, we will add the lower-tail and upper-tail areas together.

# Smith got a sample mean of 519.7 with a standard error of 10.0

smith_t <- (519.7-500)/10

# now we are conducting a 2-sided test. Find the area to the right of 1.97 and the area to the left of -1.97 to get p-value

smith_p <- pt(q = -smith_t, df = 999) + pt(q = smith_t, df = 999, lower.tail = FALSE) #lower tail + upper tail

smith_t

[1] 1.97

smith_p

[1] 0.04911426

We get the same results as the ones we are given; t = 1.97, p = 0.049.

Using a significance level of 0.05, Jones will not be able to reject the null hypothesis because his p-value is greater than 0.05 (0.051), and his results are deemed “not statistically significant.” Smith, with a p-value of 0.049, is able to reject the null hypothesis and say his results are “statistically significant.”

The issue is that the two studies have nearly identical results, but because one p-value is just below 0.05, that study is deemed significant and might get published, while the other might not. Additionally, Smith reporting a result of “p < 0.05” rather than the actual p-value can be misleading because the p-value is so close to 0.05: a reader who doesn’t know the actual p-value may assume that the result is more significant than it actually is.
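One way to make the similarity of the two studies visible is to report a 95% confidence interval for each instead of a reject/do-not-reject label. The sketch below assumes the same t-distribution with 999 degrees of freedom used above; the variable names are illustrative.

```r
se_js <- 10
df_js <- 1000 - 1
t_crit <- qt(0.975, df = df_js) # critical value for a 95% two-sided interval

jones_ci <- 519.5 + c(-1, 1) * t_crit * se_js
smith_ci <- 519.7 + c(-1, 1) * t_crit * se_js

# One row per study: lower bound, upper bound
rbind(jones = jones_ci, smith = smith_ci)

# Does each interval contain the null value of 500?
c(jones = jones_ci[1] <= 500 & 500 <= jones_ci[2],
  smith = smith_ci[1] <= 500 & 500 <= smith_ci[2])
```

The two intervals have the same width and are shifted by only 0.2, yet one barely includes 500 and the other barely excludes it, which is exactly the distinction the significant/not significant label exaggerates.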

Question 6

Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.

First we calculate what we can and use those values to find the critical t-value. We will use a significance level of alpha = 0.05 because we are working at the 95% confidence level.

# calculate values that we can

mean_gt <- mean(gas_taxes) # mean
sd_gt <- sd(gas_taxes) # standard deviation
n_gt <- length(gas_taxes) # sample size (18)
se_gt <- sd_gt/sqrt(n_gt) # standard error

# find the critical t-value using qt()
# alpha = 0.05 because we are testing one side at the 95% confidence level
# df = 18 - 1
# we use the lower tail, which is the default, because the question is "less than 45 cents"; this makes the t-value negative

t_gt <- qt(p = 0.05, df = 17)

t_gt

[1] -1.739607

From here we can calculate the error margin on each side of the mean: error = t * sd/sqrt(n)

# calculate margin of error |t| * sd/sqrt(n)
# abs() keeps the margin positive, since the lower-tail t-value is negative

me_gt <- abs(t_gt) * sd_gt/sqrt(n_gt)

# then we get the lower and upper bounds of the confidence interval

lower3 <- mean_gt - me_gt
upper3 <- mean_gt + me_gt

print(c(lower3, upper3))

[1] 37.04610 44.67946

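Based on the sample from the 18 cities, the interval for the average tax per gallon of gas in the US runs from 37.05 cents to 44.68 cents. Because the t-value puts 5% in each tail, this is a 90% two-sided confidence interval; equivalently, 44.68 cents is the 95% one-sided upper confidence bound for the mean, which is the bound that matters for a “less than” question.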

The null hypothesis is that the average tax per gallon of gas in the US in 2005 was 45 cents (μ = 45), and the alternative hypothesis is that it was less than 45 cents (μ < 45). Because the 95% upper bound of 44.68 cents is below 45 cents, we reject the null hypothesis at the 5% significance level: there is enough evidence to conclude that the average tax was less than 45 cents, although the margin is slim.
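The same conclusion can be cross-checked with R's built-in t.test(), which runs the one-sided test directly on the raw data. This is a sketch of an alternative route rather than the method used above; `gas_test` is an illustrative variable name.

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

# One-sided test of H0: mu = 45 against Ha: mu < 45
gas_test <- t.test(gas_taxes, mu = 45, alternative = "less", conf.level = 0.95)

gas_test$statistic # t-statistic
gas_test$p.value   # one-sided p-value
gas_test$conf.int  # one-sided interval: (-Inf, 95% upper bound)
```

With alternative = "less", the reported interval is one-sided, so its upper limit should match the 44.68-cent bound found by hand, and a p-value below 0.05 leads to the same decision.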