Homework 1 for DACSS 603
The time between the date a patient was recommended for heart surgery and the surgery date for cardiac patients in Ontario was collected by the Cardiac Care Network (“Wait Times Data Guide,” Ministry of Health and Long-Term Care, Ontario, Canada, 2006). The sample mean and sample standard deviation for wait times (in days) of patients for two cardiac procedures are given in the accompanying table. Assume that the sample is representative of the Ontario population.
#create matrix
tab <- matrix(c(539,19,10,847,18,9), ncol=3, byrow=TRUE)
#define column names and row names of matrix
colnames(tab) <- c('Sample Size', 'Mean Wait Time', 'Standard Deviation')
rownames(tab) <- c('bypass', 'angiography')
#convert matrix to table
tab <- as.table(tab)
#view table
tab
            Sample Size Mean Wait Time Standard Deviation
bypass              539             19                 10
angiography         847             18                  9
Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?
Because the population standard deviation is unknown, we build the interval from the Student t distribution, using the sample standard deviation in its place. The critical t-value comes from qt(p/2, df, lower.tail=FALSE), where p is the significance level (1 minus the confidence level, here .10), df is the degrees of freedom (n - 1), and lower.tail=FALSE returns the quantile that leaves probability p/2 in the upper tail. Dividing the significance level by 2 splits it evenly between the two tails, which is what a two-sided interval requires.
# set up the sample
# angiography
a <- 847
mean1 <- 18
s1 <- 9
# bypass
b <- 539
mean2 <- 19
s2 <- 10
# critical t values
ta <- qt(p=.1/2, df=846, lower.tail=FALSE)
ta
[1] 1.646657
tb <- qt(p=.1/2, df=538, lower.tail=FALSE)
tb
[1] 1.647691
# margin of error is t * standard deviation/sqrt(sample size)
# calculate the lower and upper bounds of the confidence interval
CIa <- c(mean1 - ta* s1 / sqrt(a),
mean1 + ta * s1 / sqrt(a))
CIa
[1] 17.49078 18.50922
CIb <- c(mean2 - tb * s2 / sqrt(b),
mean2 + tb * s2 / sqrt(b))
CIb
[1] 18.29029 19.70971
Here we can see that the critical t value for angiography is 1.647 and for bypass is 1.648. We then find the lower and upper bounds of the confidence interval with CI <- c(s_mean - t_score * s_sd / sqrt(s_size), s_mean + t_score * s_sd / sqrt(s_size)).
For a 90% confidence interval for the population mean:
angiography is between 17.5 and 18.5 days
bypass is between 18.3 and 19.7 days
Therefore the confidence interval is narrower for angiography (about 1.02 days wide, versus about 1.42 days for bypass).
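The calculation above can be wrapped in a small helper so that both procedures go through the same steps. This is only a sketch: the function name t_ci and its arguments are assumptions introduced here for illustration, not part of the assignment. Only qt() and sqrt() come from base R.

```r
# Hypothetical helper (not from the assignment): t-based confidence interval
# computed from summary statistics.
t_ci <- function(xbar, s, n, level = 0.90) {
  alpha <- 1 - level
  # Critical value that leaves alpha/2 in the upper tail
  t_crit <- qt(alpha / 2, df = n - 1, lower.tail = FALSE)
  margin <- t_crit * s / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

ci_angio  <- t_ci(xbar = 18, s = 9,  n = 847)
ci_bypass <- t_ci(xbar = 19, s = 10, n = 539)

# Interval widths: the narrower interval is the more precise estimate
width_angio  <- unname(diff(ci_angio))
width_bypass <- unname(diff(ci_bypass))
print(c(angiography = width_angio, bypass = width_bypass))
```

Comparing the two widths directly answers which interval is narrower without reading the bounds by eye.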
As an alternative, since both samples are large, I’m going to use the normal distribution as an approximation to the Student t distribution. As before, we start from the mean, standard deviation, and sample size of each sample.
For a 90% confidence level, alpha is .1, so alpha/2 equals .05. The z-score for both procedures is therefore qnorm(.95).
The standard error is the standard deviation divided by the square root of the sample size, and the margin of error is the standard error multiplied by the z-score: margin <- (s/sqrt(n)) * qnorm(.95). I am approximating the population standard deviation (sigma) with the sample standard deviation (s). Adding and subtracting the margin of error from the mean gives the upper and lower bounds of the confidence interval.
#calculate margin of error for angiography (stored as ase)
ase <- qnorm(.95) * s1 / sqrt(a)
ase
[1] 0.5086606
#calculate lower and upper bounds of the confidence interval
ahigh <- ase + mean1
ahigh
[1] 18.50866
alow <- (ase - mean1) * -1
alow
[1] 17.49134
#calculate margin of error for bypass (stored as bse)
bse <- qnorm(.95) * s2 / sqrt(b)
bse
[1] 0.7084886
#calculate lower and upper bounds of the confidence interval
bhigh <- bse + mean2
blow <- (bse - mean2) * -1
bhigh
[1] 19.70849
blow
[1] 18.29151
R has no ± operator, so the upper and lower bounds are computed separately. The expression (ase - mean1) * -1 is just another way of writing mean1 - ase.
For a 90% confidence interval for the population mean using a z-score:
angiography is between 17.49 and 18.51 days, 18 ± 0.509 (margin of error)
bypass is between 18.29 and 19.71 days, 19 ± 0.708 (margin of error)
Therefore the confidence interval is again narrower for angiography.
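To see how little the normal approximation changes the answer at these sample sizes, the two margins of error can be put side by side. This is a rough sketch using the summary statistics from the table. The vector names (n, s, se, margin_z, margin_t) are my own and are only for illustration.

```r
# Summary statistics from the table
n <- c(angiography = 847, bypass = 539)
s <- c(angiography = 9,   bypass = 10)

se <- s / sqrt(n)                        # standard error of each mean
margin_z <- qnorm(0.95) * se             # normal approximation
margin_t <- qt(0.95, df = n - 1) * se    # Student t with n - 1 df

# The t margin is always slightly larger, but the gap is tiny here
print(round(rbind(z = margin_z, t = margin_t, diff = margin_t - margin_z), 4))
```

With several hundred observations per group the two approaches agree to about two decimal places, which is why both lead to the same conclusion.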
A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success. Find the point estimate, p, of the proportion of all adult Americans who believe that a college education is essential for success. Construct and interpret a 95% confidence interval for p.
To find the confidence interval for p, we first look at what we know. Because we are estimating a proportion, we can also check the result with prop.test().
Since the sample is large and assumed to be representative, the sample proportion is approximately normally distributed, so we can use z-scores from the standard normal distribution (mean 0, standard deviation 1). We are using a confidence level of 95%, which means alpha is .05 and alpha/2 is .025. The z-score is therefore the .975 quantile (1 - .025).
To find the sample’s point estimate, you divide the number of people who believe college is essential by the total sample size. This value is .550.
We then find the margin of error by multiplying the z score by the square root of the sample proportion times 1 minus the sample proportion divided by the total sample size (margin <- qnorm(0.975) * sqrt(p * (1 - p) / n)). This allows us to then calculate the upper and lower bounds of the confidence interval.
#find total sample size
n <- 1031
#number of agreements in college education
k <- 567
#find sample proportion
p <- k/n
p
[1] 0.5499515
#calculate margin of error
margin <- qnorm(0.975) * sqrt(p * (1 - p) / n)
margin
[1] 0.03036761
#calculate lower and upper bounds of confidence interval
low <- p - margin
low
[1] 0.5195839
high <- p + margin
high
[1] 0.5803191
Our sample proportion, the point estimate, is .550, and the 95% confidence interval for the population proportion is .520 to .580. We are 95% confident that the true proportion of adult Americans who believe a college education is essential for success lies in this interval: if we repeatedly drew representative samples and built an interval this way each time, about 95% of those intervals would contain the true proportion.
We could also run prop.test() on the same numbers, which reports the point estimate and confidence interval automatically. Here we pass .95 as conf.level, since prop.test() takes the confidence level directly rather than alpha/2. x is the number of successes (in this case, respondents who say a college education is essential) and n is the number of people surveyed. The interval differs slightly from the manual one (by less than .001 at each bound) because prop.test() applies a continuity correction, as its output header notes, and uses a Wilson score interval rather than the simple normal-approximation formula used above.
prop.test(x = 567, n=1031, alternative = 'two.sided',
conf.level = 0.95)
1-sample proportions test with continuity correction
data: 567 out of 1031, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.5189682 0.5805580
sample estimates:
p
0.5499515
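The header of the output mentions a continuity correction. For a single proportion, prop.test() reports a Wilson score interval with that correction, which is not the same formula as p ± z * sqrt(p(1-p)/n). The sketch below tries to reproduce that interval by hand. The function name wilson_cc is hypothetical, and the formula is my reading of how the interval is built, so treat it as an illustration rather than the reference implementation.

```r
# Hypothetical helper: Wilson score interval with continuity correction,
# intended to mirror the interval prop.test() reports for one proportion.
wilson_cc <- function(x, n, level = 0.95, p_null = 0.5) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - level) / 2)
  yates <- min(0.5, abs(x - n * p_null))   # continuity correction
  z22n <- z^2 / (2 * n)
  half <- function(pc) z * sqrt(pc * (1 - pc) / n + z22n / (2 * n))
  p_lo <- p_hat - yates / n                # shifted estimate for lower bound
  p_hi <- p_hat + yates / n                # shifted estimate for upper bound
  lower <- (p_lo + z22n - half(p_lo)) / (1 + 2 * z22n)
  upper <- (p_hi + z22n + half(p_hi)) / (1 + 2 * z22n)
  c(lower = max(0, lower), upper = min(1, upper))
}

manual  <- wilson_cc(567, 1031)
builtin <- as.numeric(prop.test(x = 567, n = 1031, conf.level = 0.95)$conf.int)
print(rbind(manual = manual, builtin = builtin))
```

If the two rows agree, the small gap between the hand-calculated interval and the prop.test() interval is explained by the choice of formula, not by a calculation error.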
Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per quarter for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range. Assuming the significance level to be 5%, what should be the size of the sample?
Our margin of error is $5 because the confidence interval should have a length of $10 or less. Our significance level is .05, so we will use qnorm(.975). We also need the standard deviation, which we estimate as a quarter of the range: (200 - 30)/4 = 42.5.
To find the sample size we use the formula n = qnorm(.975)^2 * sigma^2 / E^2, where qnorm(.975) is the z-score, sigma is the standard deviation, and E is the margin of error.
#standard deviation
s <- (200-30)/4
s
[1] 42.5
sample <- qnorm(.975)^2 * s^2/ 5^2
sample
[1] 277.5454
Rounding up to the next whole student, this gives us a sample size of 278 students.
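The same calculation can be written as a reusable function that rounds up automatically. This is a minimal sketch: sample_size_mean and its argument names are assumptions made for illustration.

```r
# Hypothetical helper: sample size needed to estimate a mean within a
# given margin of error, using the normal approximation.
sample_size_mean <- function(sigma, margin, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  ceiling((z * sigma / margin)^2)   # round up so the margin is guaranteed
}

sigma_guess <- (200 - 30) / 4       # one quarter of the stated range
n_needed <- sample_size_mean(sigma = sigma_guess, margin = 5)
print(n_needed)

# Check: margin of error actually achieved with n_needed students
achieved <- qnorm(0.975) * sigma_guess / sqrt(n_needed)
print(achieved)
```

Rounding up rather than to the nearest integer matters because a sample of 277 would leave the margin of error slightly above $5.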
(Exercise 6.7, Chapter 6 of SMSS, Agresti 2018) According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.
Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.
Report the P-value for Ha : μ < 500. Interpret.
Report and interpret the P-value for Ha: μ > 500. (Hint: The P-values for the two possible one-sided tests must sum to 1.)
Our null hypothesis is that the mean income for female employees equals the $500 per week agreed for all senior-level workers (H0: μ = 500). The alternative hypothesis is that the mean income does not equal $500 (Ha: μ ≠ 500).
We also know that the mean income in a random sample of nine female employees is $410 and the sample standard deviation is 90. With such a small sample, the t test assumes that the sample is random and that incomes are approximately normally distributed.
First we need to find the t-score. To do that we use the formula t = (sample mean - hypothesized mean) / (sample standard deviation / square root of n).
t.score <- (410-500)/(90/sqrt(9))
t.score
[1] -3
This gives us a t score of -3. Next we find the p-value using pt(). The general form is pt(q, df, lower.tail=FALSE), where q is the t-score, df is the degrees of freedom (n - 1 = 8), and lower.tail=FALSE returns the probability in the right tail. The right and left tail probabilities must sum to 1, so I added right + left together to check. Because the t-score is negative, the two-sided p-value is twice the left-tail probability, pt(t.score, 8) * 2.
p.value <- pt(t.score, 8) * 2
p.value
[1] 0.01707168
right <- pt(t.score, 8, lower.tail=FALSE)
right
[1] 0.9914642
left <- pt((t.score), 8, lower.tail=TRUE)
left
[1] 0.008535841
right + left
[1] 1
To answer the questions asked:
If we use the standard alpha level of .05, we reject the null hypothesis that the mean income of female employees is $500 in favor of the two-sided alternative that it is not $500, because the p-value (.017) is less than .05. I used the two-tailed test.
For the alternative hypothesis that the mean income is less than $500, we reject the null hypothesis that the mean is $500 because the p-value is .009, which is less than .05. The data support the alternative hypothesis that the mean income is less than $500. I used the left-tailed test.
For the alternative hypothesis that the mean income is greater than $500, we fail to reject the null hypothesis because the p-value is .99, which is much larger than .05. I used the right-tailed test.
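All three p-values can be produced in one pass from the summary statistics. The sketch below is an assumption-labeled illustration: t_test_summary is a hypothetical helper, not a base R function. It uses 2 * min(left, right) for the two-sided p-value so that it works whether the t-score is negative or positive.

```r
# Hypothetical helper: one-sample t test from summary statistics
t_test_summary <- function(xbar, mu0, s, n) {
  t_stat <- (xbar - mu0) / (s / sqrt(n))
  df <- n - 1
  p_less    <- pt(t_stat, df)                      # Ha: mu < mu0
  p_greater <- pt(t_stat, df, lower.tail = FALSE)  # Ha: mu > mu0
  p_two     <- 2 * min(p_less, p_greater)          # Ha: mu != mu0
  c(t = t_stat, df = df, two_sided = p_two, less = p_less, greater = p_greater)
}

res <- t_test_summary(xbar = 410, mu0 = 500, s = 90, n = 9)
print(round(res, 4))
```

The two one-sided p-values always sum to 1, which matches the hint in the exercise.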
(Exercise 6.23, Chapter 6 of SMSS, Agresti 2018). Jones and Smith separately conduct studies to test H0: μ = 500 against Ha: μ ≠ 500, each with n = 1000. Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7, with se = 10.0.
Show that t = 1.95 and P-value = 0.051 for Jones. Show that t = 1.97 and P-value = 0.049 for Smith.
Using α = 0.05, for each study indicate whether the result is “statistically significant.”
Using this example, explain the misleading aspects of reporting the result of a test as “P ≤ 0.05” versus “P > 0.05,” or as “reject H0” versus “Do not reject H0,” without reporting the actual P-value.
Here we run one test for Jones and a separate test for Smith against the null hypothesis that the mean is 500. To find the t value we use t = (sample mean - hypothesized mean) / (standard error). The standard error is the standard deviation divided by the square root of the sample size; here it is given directly as 10.0.
We also need the p-value, which we get from pt() for one tail and multiply by two because this is a two-tailed test. The exercise states the t and p values for Jones (t = 1.95 and P-value = 0.051) and Smith (t = 1.97 and P-value = 0.049). We will check whether we receive the same results.
#check t value (Jones: ybar = 519.5, Smith: ybar = 519.7)
jones <- (519.5-500)/10
jones
[1] 1.95
smith <- (519.7-500)/10
smith
[1] 1.97
#check p-value (two-tailed, df = n - 1 = 999)
jonesp <- pt(jones, 999, lower.tail=FALSE) * 2
jonesp
[1] 0.05145555
smithp <- pt(smith, 999, lower.tail=FALSE) * 2
smithp
[1] 0.04911426
Here we can see that the numbers match the t and p values stated in the exercise.
At a significance level of .05, Jones (P = .051) does not reject the null hypothesis, while Smith (P = .049) does, because only Smith’s P-value falls below .05.
The issue is that the two studies are so similar that there is realistically little difference between their results. However, because a fixed cutoff is applied to the p-value, one of them rejects the null hypothesis while the other does not. Reporting the actual P-values shows that both studies provide about the same moderate evidence against the null hypothesis, since each is only about .001 away from the significance level of .05.
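One way to make this point concrete is to print the exact p-value next to the binary verdict for each study. The sketch below uses the numbers from the exercise. The data frame studies and its column names are my own choices for illustration.

```r
# Study summaries from the exercise
studies <- data.frame(
  study = c("Jones", "Smith"),
  ybar  = c(519.5, 519.7),
  se    = c(10, 10)
)
mu0 <- 500
df  <- 1000 - 1

studies$t <- (studies$ybar - mu0) / studies$se
studies$p_value <- 2 * pt(abs(studies$t), df, lower.tail = FALSE)
studies$significant <- studies$p_value <= 0.05   # verdict at alpha = 0.05
print(studies)
```

The significant column flips between the two rows even though the p_value column shows the studies differ only in the third decimal place.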
Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.
We first enter gas_taxes into our R code. From the data we can compute the sample mean with mean() and the sample standard deviation with sd(). Because the population standard deviation is unknown and the sample is small (n = 18), we use the t distribution with 17 degrees of freedom. The critical value comes from qt(p, df, lower.tail), where p is the tail probability, df is the degrees of freedom, and lower.tail selects which tail the probability refers to. For a one-sided question at the 95% confidence level, p is .05.
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
sd <- sd(gas_taxes)
sd
[1] 9.308317
mean <- mean(gas_taxes)
mean
[1] 40.86278
# critical t value for a one-sided question at the 5% level (upper tail)
tgas <- qt(p=.05, df=17, lower.tail=FALSE)
tgas
[1] 1.739607
# margin of error is t * standard deviation/sqrt(sample size)
# calculate the lower and upper bounds of the interval
CIa <- c(mean - tgas* sd / sqrt(18),
mean + tgas * sd / sqrt(18))
CIa
[1] 37.04610 44.67946
For the gas tax data, the sample mean is 40.86 cents, the sample standard deviation is 9.308, and the sample size is 18. Because the question is one-sided (is the mean less than 45 cents?), the critical t value at the 5% significance level with 17 degrees of freedom is 1.74 in absolute value.
The resulting bounds are 37.05 and 44.68 cents. Taken together they form a 90% two-sided interval; taken alone, 44.68 cents is the 95% one-sided upper confidence bound for the average tax per gallon.
The null hypothesis is that the average tax per gallon of gas in the US in 2005 was 45 cents, and the alternative hypothesis is that it was less than 45 cents. Because the 95% upper confidence bound of 44.68 cents is below 45 cents, we reject the null hypothesis in favor of the alternative: there is enough evidence at the 95% confidence level to conclude that the average tax was less than 45 cents.
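As a cross-check, the same one-sided question can be put to t.test() directly. This is a sketch of one way to do it, using the alternative = "less" option of the base R function. The names res, upper_bound, and p_value are mine.

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

# One-sided, one-sample t test of H0: mu = 45 against Ha: mu < 45
res <- t.test(gas_taxes, mu = 45, alternative = "less", conf.level = 0.95)
print(res)

upper_bound <- res$conf.int[2]   # 95% one-sided upper confidence bound
p_value     <- res$p.value       # left-tail p-value
```

With alternative = "less", the reported interval runs from -Inf to the upper bound, so comparing upper_bound with 45 and comparing p_value with .05 lead to the same decision.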