Homework One

DACSS 603

Cynthia Hester
February 13, 2022

Question 1

The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population.

Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?


Solution

First, we assign variable names to the sample summary statistics:
bypass_sample<-539 #bypass_sample
angio_sample<-847  #angiography_sample
bypass_mean<-19    #bypass_sample_wait_time_mean
angio_mean<-18     #angiography_sample_wait_time_mean
bypass_sd<-10      #bypass_standard_deviation
angio_sd<-9        #angiography_standard_deviation
We then construct the t confidence interval for each sample. To do this, we start by calculating the degrees of freedom for each sample (bypass and angiography).
Calculating the degrees of freedom of the bypass and angiography samples:
bypass_df<-bypass_sample - 1  #bypass degrees of freedom
angio_df<-angio_sample - 1    #angiography degrees of freedom
Now that we have the degrees of freedom, we can calculate the t-critical values for the respective 90% intervals of each sample.
#Calculating the t-critical value for the **angiography** sample

angio_sample<-847
angio_df<-angio_sample - 1
t_score_angio<-qt(p=0.05, df=angio_df,lower.tail=F)
print(t_score_angio)
[1] 1.646657
# Calculating the t-critical value for the **bypass** sample

bypass_sample<-539
bypass_df<-bypass_sample - 1
t_score_bypass<-qt(p=0.05, df=bypass_df,lower.tail=F)
print(t_score_bypass)
[1] 1.647691
# We now find the margin of error for both samples


margin_angio<-t_score_angio*angio_sd/sqrt(angio_sample)     #margin of error angiography
print(margin_angio)
[1] 0.5092182
margin_bypass<-t_score_bypass*bypass_sd/sqrt(bypass_sample) #margin of error bypass
print(margin_bypass)
[1] 0.7097107
# To calculate the lower bound and upper bound of the angiography sample

lower_bound_angio<-angio_mean-margin_angio
print(lower_bound_angio)
[1] 17.49078
upper_bound_angio<-angio_mean+margin_angio
print(upper_bound_angio)
[1] 18.50922
# To calculate the lower bound and upper bound of the bypass sample

lower_bound_bypass<-bypass_mean-margin_bypass
print(lower_bound_bypass)
[1] 18.29029
upper_bound_bypass<-bypass_mean+margin_bypass
print(upper_bound_bypass)
[1] 19.70971

To determine which confidence interval is narrower, I subtract the lower bound from the upper bound of each procedure to get the interval width.

The angiography 90% confidence interval is narrower, with a width of about 1.02 days compared to about 1.42 days for bypass.
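For reference, the widths can be computed directly from the bounds found above:

upper_bound_angio - lower_bound_angio     #interval width = 2 * margin of error, about 1.02 days (angiography)
upper_bound_bypass - lower_bound_bypass   #interval width = 2 * margin of error, about 1.42 days (bypass)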

Analysis

We see that the 90% confidence interval for the mean wait time runs from 17.49 to 18.51 days for the angiography procedure, whereas the 90% confidence interval for the bypass procedure runs from 18.29 to 19.71 days. This gives a narrower interval width of 1.02 days for angiography compared to 1.42 days for bypass. The difference can be attributed to the larger sample size for angiography (847 versus 539 for bypass) as well as its smaller standard deviation (9 versus 10 for bypass).
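One way to see this numerically is to compare the standard errors of the two sample means (using the variables defined above), since a smaller standard error produces a narrower interval:

angio_sd / sqrt(angio_sample)     #standard error for angiography: 9 / sqrt(847), about 0.31
bypass_sd / sqrt(bypass_sample)   #standard error for bypass: 10 / sqrt(539), about 0.43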


Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success.

Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success.

Construct and interpret a 95% confidence interval for p.

Solution

First, we find the point estimate, p, of the proportion.

# Specify sample successes (x), sample size (n), and confidence level


x<- 567                     # survey respondents  (successes)
n<- 1031                    # total surveyed
confidence_level<-0.95      # confidence level
point_estimate<-x/n         # the point estimate is the sample proportion

Now, to determine the 95% confidence interval, I must find alpha, the critical z-value, the standard error, and the margin of error.

alpha<-(1-confidence_level)
critical_z<-qnorm(1-alpha/2)
standard_error<-sqrt(point_estimate*(1-point_estimate)/n)
margin_of_error<-critical_z*standard_error 

The lower bound and upper bound of the confidence interval are calculated.

lower_bound<-point_estimate-margin_of_error 
upper_bound<-point_estimate+margin_of_error

Results

sprintf("Point Estimate: %0.3f", point_estimate)
[1] "Point Estimate: 0.550"
sprintf("Critical Z-value: %0.3f", critical_z)
[1] "Critical Z-value: 1.960"
sprintf("Margin of Error: %0.3f", margin_of_error)
[1] "Margin of Error: 0.030"
sprintf("Confidence Interval: [%0.3f,%0.3f]", lower_bound,upper_bound)
[1] "Confidence Interval: [0.520,0.580]"
sprintf("The %0.1f%% confidence interval for the population proportion is:", confidence_level*100)
[1] "The 95.0% confidence interval for the population proportion is:"
sprintf("between %0.4f and %0.4f",lower_bound,upper_bound)
[1] "between 0.5196 and 0.5803"

Confidence interval interpretation

We have 95% confidence that the interval from the lower bound to the upper bound, [0.5196, 0.5803], contains the true population proportion of adult Americans who believe a college education is essential for success.


Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per quarter for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range. Assuming the significance level to be 5%, what should be the size of the sample?

Solution

Here’s what we know:

desired margin of error = +/- $5

range of the data = upper value - lower value = 200 - 30 = 170

population standard deviation is about a quarter of the range: sigma = 170/4 = 42.5

significance level (alpha) = 0.05

From this we can calculate the critical z-value and the required sample size:

critical_z<-qnorm(1-0.05/2) #using the significance level or alpha we calculate z-score                 
print(critical_z)
[1] 1.959964
n_sample_size<-((1.96*42.5)/5)**2      #n = ((z-score * standard deviation) / margin of error)^2
print(n_sample_size)                   #sample size
[1] 277.5556

Thus, we would need a minimum sample size of 278 (277.5556 rounded up to the next whole observation).
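If this calculation were needed repeatedly, it could be wrapped in a small helper function (the name required_n is purely illustrative):

required_n <- function(sigma, margin, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)   #two-sided critical z-value
  ceiling((z * sigma / margin)^2)        #round up to the next whole observation
}
required_n(sigma = 42.5, margin = 5)     #278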


Question 4

(Exercise 6.7, Chapter 6 of SMSS, Agresti 2018) According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.

a) Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

b) Report the P-value for Ha: μ < 500. Interpret.

c) Report and interpret the P-value for Ha: μ > 500.

(Hint: The P-values for the two possible one-sided tests must sum to 1.)

Solution

a) Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.

Here’s what we know:

\(\mu\): mean income for all senior-level workers (the union norm) = $500/week

\(\bar{y}\): sample mean income of the 9 female employees = $410/week

s: sample standard deviation = 90

n: number of randomly sampled female employees = 9

Assumptions: the nine employees are a random sample and weekly incomes are approximately normally distributed, so a one-sample t test is appropriate.


Hypotheses

The null and alternative hypotheses are:

\(H_0\): \(\mu\) = $500 per week

The null hypothesis states that the mean weekly income for female employees equals $500 (the union norm).

\(H_a\): \(\mu\) ≠ $500 per week

The alternative hypothesis states that the mean weekly income differs from $500 (two-sided).

Test Statistic

t_test_income<-(410-500)/(90/sqrt(9))          #Test statistic using t-test
print(t_test_income)
[1] -3

P-Value

n_random_sample<-9                      #random sample female employees
df_sample<-(n_random_sample-1)          #degrees of freedom
t_test_income<-(410-500)/(90/sqrt(9))   #test statistic
p_val<-pt(t_test_income,df_sample)*2    #p-value
print(p_val)
[1] 0.01707168

Interpretation:

part a

Assuming a significance level of α = 0.05: the p-value 0.0171 is less than 0.05, so we reject the null hypothesis. There is therefore sufficient evidence to conclude that the mean weekly income differs from $500.
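As an optional cross-check (this assumes the BSDA package is installed, which is not part of the assignment), the same test can be run directly from the summary statistics:

# install.packages("BSDA")    #if the package is not already installed
library(BSDA)                  #provides tsum.test() for t tests from summary statistics
tsum.test(mean.x = 410, s.x = 90, n.x = 9, mu = 500,
          alternative = "two.sided")   #should reproduce t = -3, p of about 0.017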


part b:

Report the P-value for Ha: μ < 500. Interpret.

Hypotheses

\(H_0\): \(\mu\) = $500 per week

\(H_a\): \(\mu\) < $500 per week (left-tail test)

p-value = P(t < t_test_income) = P(t < -3)

P-value for \(H_a\): \(\mu\) < $500 per week (left-tail test)

#using the formula: pt(q,df,lower.tail=TRUE,log.p=FALSE) to find the p-value


q<-(-3)
n_random_sample<-9
df_sample<-(n_random_sample-1)                            #degrees of freedom 
left_p_value<-pt(q,df_sample,lower.tail = T,log.p = F)    #p value for alternative hypothesis
print(left_p_value)         
[1] 0.008535841

Interpretation:

part b

Since the p-value 0.0085 is less than the presumed significance level α = 0.05, I reject the null hypothesis \(H_0\). This suggests there is sufficient evidence to conclude that the mean is less than $500.


part c:

Report and interpret the P-value for Ha: μ > 500. (Hint: The P-values for the two possible one-sided tests must sum to 1.)

Hypotheses

\(H_0\): \(\mu\) = $500 per week

\(H_a\): \(\mu\) > $500 per week (right-tail test)

p-value = P(t > t_test_income) = P(t > -3)

P-value for \(H_a\): \(\mu\) > $500 per week (right-tail test)

#using the formula: pt(q,df,lower.tail=TRUE,log.p=FALSE) to find the p-value


q<-(-3)
n_random_sample<-9
df_sample<-(n_random_sample-1)                                  #degrees of freedom 
right_p_value<-pt(q,df_sample,lower.tail = F,log.p = F)         #p-value for alternative hypothesis
print(right_p_value)   
[1] 0.9914642
#verification sum of left and right tail p-values equal 1

left_p_value<-pt(q,df_sample,lower.tail = T,log.p = F) 
right_p_value<-pt(q,df_sample,lower.tail = F,log.p = F)
left_right_sum<-(left_p_value+right_p_value)
print(left_right_sum)
[1] 1

Interpretation:

part c

Since the p-value 0.9915 is greater than the presumed significance level α = 0.05, we do not reject the null hypothesis \(H_0\). There is insufficient evidence to support the claim that the mean \(\mu\) is greater than $500.


Question 5

(Exercise 6.23, Chapter 6 of SMSS, Agresti 2018) Jones and Smith separately conduct studies to test H0: μ = 500 against Ha : μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7,with se = 10.0.

  1. Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.

  2. Using α = 0.05, for each study indicate whether the result is “statistically significant.”

  3. Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0 ,” without reporting the actual P-value.


Solution

  1. Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.

Here’s what we know :

sample size n = 1000 for each of Jones and Smith

degrees of freedom = 1000 - 1 = 999 for each study

Jones

\(\bar{y}\) = 519.5

t = 1.95

p-value = 0.051

se = 10.0

Smith

\(\bar{y}\) = 519.7

t = 1.97

p-value = 0.049

se = 10.0

Hypotheses

\(H_0\): \(\mu\) = 500

\(H_a\): \(\mu\) ≠ 500

Jones

We were already given both the test statistic 1.95 and the p-value 0.051 for Jones, so we are just verifying both.

Test statistic: t = (\(\bar{y}\) - \(\mu\)) / se, with se = 10

df_sample_n<-(1000-1)  #degrees of freedom

t_test_jones<-(519.5-500)/10    #Test statistic using t-test
print(t_test_jones)
[1] 1.95
p_value_jones<-pt(t_test_jones,df_sample_n,lower.tail = F,log.p = F)*2
print(p_value_jones)
[1] 0.05145555

Smith

We were already given both the test statistic 1.97 and the p-value 0.049 for Smith, so we are verifying both.

Test statistic: t = (\(\bar{y}\) - \(\mu\)) / se, with se = 10

df_sample_n<-(1000-1)            #degrees of freedom

t_test_smith<-(519.7-500)/10    #Test statistic using t-test
print(t_test_smith)
[1] 1.97
p_value_smith<-pt(t_test_smith,df_sample_n,lower.tail = F,log.p = F)*2       #p-value for smith
print(p_value_smith)                                                         #output smith
[1] 0.04911426

  2. Using α = 0.05, for each study indicate whether the result is “statistically significant.”

For α = 0.05, we reject the null hypothesis \(H_0\) if the p-value is less than or equal to 0.05 and do not reject it if the p-value is greater than 0.05. In the case of Jones, the p-value of 0.051 is (negligibly) larger than alpha, so the result is not statistically significant and we fail to reject the null hypothesis \(H_0\). In the case of Smith, the p-value of 0.049 is less than α = 0.05, so the result is statistically significant and we reject the null hypothesis.
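A quick sketch of this decision rule in R, using the p-values computed above:

p_value_jones <= 0.05   #FALSE: Jones's result is not statistically significant at alpha = 0.05
p_value_smith <= 0.05   #TRUE:  Smith's result is statistically significant at alpha = 0.05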

  3. Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0,” without reporting the actual P-value.

In this example the p-values for Jones and Smith are nearly identical (0.051 versus 0.049). Nevertheless, because Smith's p-value falls at or below the predetermined significance level of 0.05, his null hypothesis is rejected and we conclude there is statistical evidence for the alternative hypothesis \(H_a\); Jones's p-value falls just above 0.05, so we fail to reject his null hypothesis and conclude there is insufficient evidence. Reporting only “P ≤ 0.05” versus “P > 0.05,” or “reject H0” versus “Do not reject H0,” hides how close these two results actually are; reporting the exact p-values makes the similarity of the evidence clear.
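A small sketch that reports the exact p-values side by side (using the values computed above) makes the point concrete:

sprintf("Jones: t = %.2f, two-sided p = %.4f", t_test_jones, p_value_jones)
sprintf("Smith: t = %.2f, two-sided p = %.4f", t_test_smith, p_value_smith)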


Question 6

Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.

Solution

Here’s what we know:

sample size n = 18 cities

degrees of freedom = 18 - 1 = 17

95% confidence level (presumptive)

significance level alpha = 0.05 based on a 95% confidence level

t critical value of about 2.110, i.e. qt(0.975, 17), for a two-sided 95% interval

From this information we can determine the following:

#manual calculation of the t critical value used for finding the lower and upper bounds of the gas taxes sample

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

gas_taxes_sample<-18                                           #gas taxes sample size

df_gas_taxes<-gas_taxes_sample-1                               #degrees of freedom
mean_gas_taxes<-mean(gas_taxes)                                #mean gas taxes
t_score_gas_taxes<-qt(p = 0.975,df=df_gas_taxes)               #t critical value for a two-sided 95% interval
sd_gas<-sd(gas_taxes)                                          #standard deviation
m_error_gas_taxes<-t_score_gas_taxes*sd_gas/sqrt(gas_taxes_sample)  #margin of error gas taxes




 
#Now that all of the needed parameters for lower and upper bounds have been calculated, I can find the confidence interval for the gas taxes sample.



mean_gas_taxes<-mean(gas_taxes)  
m_error_gas_taxes<-t_score_gas_taxes*sd_gas/sqrt(gas_taxes_sample)

lower_gas_tax<-(mean_gas_taxes-m_error_gas_taxes) #lower bound using mean and margin of error
print(lower_gas_tax)
[1] 36.23386
upper_gas_tax<-(mean_gas_taxes+m_error_gas_taxes) #upper bound using mean and margin of error
print(upper_gas_tax)
[1] 45.49169

The 95% confidence interval is [36.2339, 45.4917] using manual calculations.

Because 45 cents lies inside the 95% confidence interval, the data do not rule out a mean of 45 cents or more. We therefore cannot conclude that the average tax per gallon was less than 45 cents.

Verification using t.test:

gas_taxes<- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

mean(gas_taxes)
[1] 40.86278
t.test(gas_taxes,conf.level = 0.95)                            #one sample t-test

    One Sample t-test

data:  gas_taxes
t = 18.625, df = 17, p-value = 9.555e-13
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 36.23386 45.49169
sample estimates:
mean of x 
 40.86278 

The 95% confidence interval is [36.23386, 45.49169].
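As a small sketch, the check of whether 45 cents falls inside this interval can be read directly off the t.test result:

ci <- t.test(gas_taxes, conf.level = 0.95)$conf.int   #extract the 95% confidence interval
ci[1] <= 45 & 45 <= ci[2]                              #TRUE: 45 cents lies inside the interval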

There is not enough evidence to conclude at the 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents, because 45 cents is inside the confidence interval. The interval also contains values above 45 cents, so a mean at or above 45 cents cannot be ruled out.


Note that the manual calculation and the t.test function give the same 95% confidence interval once the two-sided critical value qt(0.975, 17) is used; both outcomes are included above for completeness.