Statistical Inference II & Comparing two means

Question 1

The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population.

library(distill)
library(dplyr)
library(tidyverse)
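Problem1 <- read.csv('homework1_prob1.csv', header = TRUE, sep = ',', na.strings = "N/A", fileEncoding = "UTF-8-BOM") # fileEncoding strips the byte-order mark from the first column name
Problem1

  Surgical.Procedure Sample.Size Mean.Wait.Time Standard.Deviation
1             Bypass         539             19                 10
2        Angiography         847             18                  9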

Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?

For Bypass group:

#Our values are: 
Bypass.mean<- 19
Bypass.sd<-10
Bypass.n<-539
Bypass.se <- Bypass.sd/sqrt(Bypass.n) #This is standard error of the mean

#Find the t.score for CI 90%
alpha = 0.1
degrees.freedom = Bypass.n - 1
Bypass.t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
print(Bypass.t.score)

[1] 1.647691

#Calculate margin of error
Bypass.margin.error <- Bypass.t.score * Bypass.se
print(Bypass.margin.error)

[1] 0.7097107

#Calculate the 90% confidence interval for Bypass
  lower.bound <- Bypass.mean - Bypass.margin.error
  upper.bound <- Bypass.mean + Bypass.margin.error
  print(c(lower.bound,upper.bound))

[1] 18.29029 19.70971

For Angiography group:

#Our values are: 
Angio.mean<- 18
Angio.sd<-9
Angio.n<-847
Angio.se <- Angio.sd/sqrt(Angio.n) #This is standard error of the mean

#Find the t.score for CI 90%
alpha = 0.1
degrees.freedomA = Angio.n - 1
Angio.t.score = qt(p=alpha/2, df=degrees.freedomA,lower.tail=F)
print(Angio.t.score)

[1] 1.646657

#Calculate margin of error
Angio.margin.error <- Angio.t.score * Angio.se
print(Angio.margin.error)

[1] 0.5092182

#Calculate the 90% confidence interval for Angiography
  lower.boundA <- Angio.mean - Angio.margin.error
  upper.boundA <- Angio.mean + Angio.margin.error
  print(c(lower.boundA,upper.boundA))

[1] 17.49078 18.50922

Answer for Question #1-Is the confidence interval narrower for angiography or bypass surgery?:

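The 90% confidence interval for the mean wait time of angiography patients is (17.49, 18.51) days, a width of about 1.02 days. It is NARROWER than the interval for bypass patients, (18.29, 19.71) days, which has a width of about 1.42 days. The angiography interval is narrower because that sample is larger (847 vs. 539) and its standard deviation is smaller (9 vs. 10), which gives a smaller standard error.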
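The two calculations above follow the same steps, so they can be wrapped in one helper. This is a minimal sketch: the function name ci_from_summary and its argument names are hypothetical and not part of the original analysis. It recomputes both intervals from the summary statistics and compares their widths.

```r
# Hypothetical helper: t-based confidence interval from summary statistics.
ci_from_summary <- function(xbar, s, n, conf.level = 0.90) {
  se <- s / sqrt(n)                                     # standard error of the mean
  t.score <- qt(1 - (1 - conf.level) / 2, df = n - 1)   # two-sided critical value
  margin <- t.score * se                                # margin of error
  c(lower = xbar - margin, upper = xbar + margin, width = 2 * margin)
}

bypass.ci <- ci_from_summary(xbar = 19, s = 10, n = 539)
angio.ci  <- ci_from_summary(xbar = 18, s = 9,  n = 847)

print(round(rbind(Bypass = bypass.ci, Angiography = angio.ci), 3))

# TRUE when the angiography interval is the narrower one
angio.ci["width"] < bypass.ci["width"]
```

Writing the interval as a function of the mean, standard deviation, and sample size makes it easy to see that the width depends only on s, n, and the confidence level, not on the mean.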

Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.

#Point Estimate, p
n<-1031
k<-567
p<-k/n
p

[1] 0.5499515

Interpretation of point estimate p:

The sample proportion of adult Americans who believed that college education is essential for success is 0.5499515 or 55%. This represents our point estimate for the population (adult Americans) proportion.

#Construct 95% confidence interval for p

S.margin <- qnorm(0.975)*sqrt(p*(1-p)/n)  #calculate margin of error
  S.lower.bound <- p-S.margin
  S.upper.bound <- p+S.margin
  print(c(S.lower.bound,S.upper.bound))

[1] 0.5195839 0.5803191

Interpretation of 95% confidence interval for p:

The 95% confidence interval for the population (adult Americans) proportion is [0.5195839, 0.5803191]. We are 95% confident that between 52.0% and 58.0% of all adult Americans believe that a college education is essential for success.
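As a cross-check, base R's prop.test() can compute a score (Wilson) interval for the same counts. This is a sketch, not the method used above: the Wald interval in the original code and the score interval are different formulas, so the bounds should agree only approximately. The argument correct = FALSE turns off the continuity correction.

```r
n <- 1031   # respondents
k <- 567    # respondents who consider a college education essential

# Wald interval, the same formula as in the original code
p <- k / n
wald <- p + c(-1, 1) * qnorm(0.975) * sqrt(p * (1 - p) / n)

# Score (Wilson) interval from prop.test, without continuity correction
wilson <- as.numeric(prop.test(x = k, n = n, conf.level = 0.95,
                               correct = FALSE)$conf.int)

print(rbind(Wald = wald, Wilson = wilson))
```

With a sample this large and a proportion near 0.5, the two intervals should be nearly the same; they diverge more for small samples or proportions near 0 or 1.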

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per quarter for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range. Assuming the significance level to be 5%, what should be the size of the sample?

n = (z_{0.05/2} · σ / E)^2, where σ is the population standard deviation and E = $5 is the desired margin of error

sd_of_pop =(200-30)/4 #This is standard deviation of population
sd_of_pop

[1] 42.5

sample_size <- ((1.96*sd_of_pop)/5)^2  #Using Z score 1.96 for significance level 5%
sample_size

[1] 277.5556

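ANSWER: A sample of at least 278 students is needed to estimate the mean textbook cost to within $5 at the 95% confidence level (5% significance level). The computed value 277.56 is rounded up because the sample size must be a whole number.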
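The same calculation can be sketched as a small function that takes the critical value from qnorm() instead of the rounded 1.96 and rounds up automatically. The function name sample_size_for_mean and its arguments are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: smallest n whose margin of error is at most `margin`.
sample_size_for_mean <- function(sigma, margin, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)   # about 1.96 for 95% confidence
  ceiling((z * sigma / margin)^2)        # always round up to a whole student
}

sigma.guess <- (200 - 30) / 4            # a quarter of the range, as in the question
n.needed <- sample_size_for_mean(sigma = sigma.guess, margin = 5)
n.needed
```

Rounding up with ceiling() rather than round() matters: a sample of 277 would leave the margin of error slightly above $5.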

Question 4

(Exercise 6.7, Chapter 6 of SMSS, Agresti 2018) According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.

Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

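Assumptions: the nine employees are a random sample, and weekly income is approximately normally distributed. Because only summary statistics are given, the analysis below simulates nine incomes with rnorm (seed set at 123). The simulated sample's mean and standard deviation differ from ȳ = 410 and s = 90, so its test results only approximate those implied by the summary values.
H0: μ = 500
Ha: μ ≠ 500
ȳ = 410
s = 90
n = 9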

set.seed(123)
Mean_income <- c(rnorm(9, mean = 410, sd = 90)) 
Mean_income

[1] 359.5572 389.2840 550.2837 416.3458 421.6359 564.3558 451.4825
[8] 296.1445 348.1832

t.test(Mean_income, mu = 500) # Ho: mu=500


    One Sample t-test

data:  Mean_income
t = -2.6213, df = 8, p-value = 0.03059
alternative hypothesis: true mean is not equal to 500
95 percent confidence interval:
 353.2313 490.6071
sample estimates:
mean of x 
 421.9192

Interpretation of one sample t-test result:

With p-value = 0.03059, which is below the conventional α = 0.05, the result is statistically significant and we reject the null hypothesis. The data provide evidence that the mean weekly income of female employees differs from $500.
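Because the question supplies only ȳ, s, and n, the test statistic can also be computed directly from those values without simulating data. The sketch below does this; the variable names are illustrative. Its results differ from the t.test() output above, which is based on a simulated sample whose mean is 421.9 rather than 410.

```r
y.bar <- 410   # sample mean from the question
s     <- 90    # sample standard deviation
n     <- 9     # sample size
mu0   <- 500   # value of the mean under H0

se     <- s / sqrt(n)                        # 90 / 3 = 30
t.stat <- (y.bar - mu0) / se                 # (410 - 500) / 30 = -3
p.two  <- 2 * pt(-abs(t.stat), df = n - 1)   # two-sided p-value

cat("t =", t.stat, " df =", n - 1,
    " two-sided p-value =", round(p.two, 4), "\n")
```

The conclusion is the same as with the simulated data (reject H0 at α = 0.05), but these values correspond exactly to the statistics reported in the exercise.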

Report the P-value for Ha : μ < 500. Interpret.

t.test(Mean_income, mu=500, alternative = 'less') # Ha: mu<500


    One Sample t-test

data:  Mean_income
t = -2.6213, df = 8, p-value = 0.01529
alternative hypothesis: true mean is less than 500
95 percent confidence interval:
     -Inf 477.3087
sample estimates:
mean of x 
 421.9192

Interpretation of Ha : μ < 500:

With p-value = 0.01529 < α = 0.05, the result is statistically significant: we reject the null hypothesis in favor of Ha: μ < 500. The data provide evidence that the mean weekly income of female employees is less than $500.

Report and interpret the P-value for Ha: μ > 500.

t.test(Mean_income, mu=500, alternative = "greater") # Ha: mu>500


    One Sample t-test

data:  Mean_income
t = -2.6213, df = 8, p-value = 0.9847
alternative hypothesis: true mean is greater than 500
95 percent confidence interval:
 366.5297      Inf
sample estimates:
mean of x 
 421.9192

Interpretation of Ha: μ > 500:

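With p-value = 0.9847, the result is NOT statistically significant and we fail to reject the null hypothesis. The data provide no evidence that the mean income of female employees is greater than $500. Failing to reject H0 does not prove H0 or disprove Ha; it only means this sample does not support Ha: μ > 500.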

(Hint: The P-values for the two possible one-sided tests must sum to 1.)

total_p_value=0.9847+0.01529
print(c("Total p-values for the two possible one-sided tests is",total_p_value))

[1] "Total p-values for the two possible one-sided tests is"
[2] "0.99999"
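The sum printed above is 0.99999 rather than 1 only because the two p-values were typed in after rounding. A sketch of the same check with unrounded values follows. It computes the test statistic from the summary statistics in the question, which is an assumption of this example, since the analysis above uses a simulated sample.

```r
t.stat <- (410 - 500) / (90 / sqrt(9))   # test statistic from y-bar, s, and n

p.less    <- pt(t.stat, df = 8)                       # Ha: mu < 500
p.greater <- pt(t.stat, df = 8, lower.tail = FALSE)   # Ha: mu > 500

cat("P(less) =", p.less, " P(greater) =", p.greater, "\n")

# The two tail areas cover the whole t distribution, so they sum to 1
all.equal(p.less + p.greater, 1)
```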

Question 5

(Exercise 6.23, Chapter 6 of SMSS, Agresti 2018) Jones and Smith separately conduct studies to test H0: μ = 500 against Ha : μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.

tab <- matrix(c(519.5, 519.7, 10, 10), ncol=2, byrow=TRUE)
colnames(tab) <- c("Jones","Smith")
rownames(tab) <- c("y_hat","se")
tab <- as.table(tab)
print(tab)

      Jones Smith
y_hat 519.5 519.7
se     10.0  10.0

Calculate t and p-value

t.stat <- (y_hat - mu)/sample.se

p.value = pt(q=abs(t.stat), df=degrees.freedom,lower.tail=F) * 2

Show that t = 1.95 and P-value = 0.051 for Jones.

t.stat <- (519.5 - 500)/10
print(c("t.stat",t.stat))

[1] "t.stat" "1.95"

degrees.freedom = 1000 - 1
p.value = pt(q=abs(t.stat), df=degrees.freedom,lower.tail=F) * 2
print(c("Two-sided p-value",p.value))

[1] "Two-sided p-value"  "0.0514555476459477"

Show that t = 1.97 and P-value = 0.049 for Smith.

Smith.t.stat <- (519.7 - 500)/10
print(c("t.stat",Smith.t.stat))

[1] "t.stat" "1.97"

degrees.freedom = 1000 - 1
Smith.p.value = pt(q=abs(Smith.t.stat), df=degrees.freedom,lower.tail=F) * 2
print(c("Two-sided p-value",Smith.p.value))

[1] "Two-sided p-value"  "0.0491142565416521"

Using α = 0.05, for each study indicate whether the result is “statistically significant.”

  α = 0.05
  H0: μ = 500
  Ha : μ ≠ 500

For Jones Data: Since p = 0.051 > 0.05, we fail to reject the null hypothesis (H0: μ = 500). The result is NOT statistically significant. This does not mean that Ha: μ ≠ 500 is rejected, only that the evidence against H0 falls just short of the cutoff.

For Smith Data: Since p = 0.049 < 0.05, we reject the null hypothesis (H0: μ = 500) in favor of the alternative hypothesis (Ha: μ ≠ 500). The result is statistically significant.
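The two calculations can be placed side by side in one vectorized sketch, which makes it easier to see how little separates the studies. The object names below are illustrative and not part of the original code.

```r
y.bar <- c(Jones = 519.5, Smith = 519.7)   # sample means
se    <- 10                                # same standard error in both studies
mu0   <- 500                               # value of the mean under H0
dof   <- 1000 - 1                          # degrees of freedom

t.stat  <- (y.bar - mu0) / se
p.value <- 2 * pt(abs(t.stat), df = dof, lower.tail = FALSE)

results <- data.frame(t = t.stat,
                      p.value = round(p.value, 3),
                      significant = p.value <= 0.05)
print(results)
```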

Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0 ,” without reporting the actual P-value.

print(tab)

      Jones Smith
y_hat 519.5 519.7
se     10.0  10.0

#α = 0.05
#H0: μ = 500
#Ha : μ ≠ 500
#Zc= 1.96

Calculate z-score = (ȳ − μ) / se

Jones.z <- (519.5-500)/10
Jones.z

[1] 1.95

Smith.z <- (519.7-500)/10
Smith.z

[1] 1.97

Interpretation without reporting the actual P-value:

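Reporting only “P ≤ 0.05” versus “P > 0.05,” or “reject H0” versus “do not reject H0,” is misleading here. Comparing each test statistic with the critical value 1.96 gives the same dichotomy as the p-values: Jones's z-score of 1.95 < 1.96 means we fail to reject the null hypothesis (H0: μ = 500), while Smith's z-score of 1.97 > 1.96 means we reject it in favor of Ha: μ ≠ 500. Yet the two studies are practically identical: the sample means differ by only 0.2, the standard errors are equal, and the p-values (0.051 and 0.049) fall on either side of the cutoff by a tiny margin. Reporting only the decision suggests that the studies disagree; reporting the actual p-values shows that they provide essentially the same moderate evidence against H0.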
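One way to see how close the two studies are is to report 95% confidence intervals alongside the decisions. A minimal sketch, assuming df = 999 as in the p-value calculations above; the object names are illustrative.

```r
y.bar  <- c(Jones = 519.5, Smith = 519.7)   # sample means
se     <- 10                                # common standard error
t.crit <- qt(0.975, df = 999)               # critical value, slightly above 1.96

ci <- cbind(lower = y.bar - t.crit * se,
            upper = y.bar + t.crit * se)
print(round(ci, 2))

# Does each interval contain the null value 500?
ci[, "lower"] <= 500 & 500 <= ci[, "upper"]
```

The two intervals overlap almost entirely; only their lower ends fall on opposite sides of 500, which is exactly what the reject / do-not-reject labels exaggerate.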

Question 6

Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.

#H0: μ = 45
#Ha : μ < 45 (one-sided test, using the argument alternative = "less")

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
t.test(gas_taxes,mu=45, alternative = 'less')


    One Sample t-test

data:  gas_taxes
t = -1.8857, df = 17, p-value = 0.03827
alternative hypothesis: true mean is less than 45
95 percent confidence interval:
     -Inf 44.67946
sample estimates:
mean of x 
 40.86278

Interpretation of one sample t-test:

Yes, there is enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents. With the p-value of 0.03827 < α = 0.05, we reject the null hypothesis that μ = 45 and accept the alternative hypothesis that μ < 45 cents.
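The t.test() output can be reproduced step by step, which makes the link between the sample mean, the standard error, and the p-value explicit. A sketch using only base R; the intermediate variable names are illustrative.

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

mu0 <- 45                                  # value of the mean under H0
n   <- length(gas_taxes)                   # 18 cities
se  <- sd(gas_taxes) / sqrt(n)             # standard error of the mean

t.stat <- (mean(gas_taxes) - mu0) / se     # test statistic
p.less <- pt(t.stat, df = n - 1)           # lower-tail p-value for Ha: mu < 45
upper  <- mean(gas_taxes) + qt(0.95, df = n - 1) * se   # one-sided 95% upper bound

cat("t =", round(t.stat, 4), " p-value =", round(p.less, 5),
    " upper bound =", round(upper, 5), "\n")
```

The one-sided upper bound lies below 45, which is another way of stating the same conclusion as the p-value.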