Homework 1 DACSS 603

Descriptive Statistics, Probability, Statistical Inference, and Comparing Two Means

Kristina Becvar
2022-02-23

Question 1

Surgical Procedure - Representative Sample

Surgical Procedure Sample Size Mean Wait Time Standard Deviation
Bypass 539 19 10
Angiography 847 18 9

Answer 1

I calculated the answer by first calculating the standard error for each procedure given the mean, standard deviation, and sample size for each. I do so using 0.95 for the qnorm function so that I can determine the 5% confidence level for both the right and left side of the normal distribution, since the sample is larger than n=30. By calculating the 5% margin for each side of the distribution, this gives me the effective 90% confidence interval overall.

#Calculate the actual mean wait time for the bypass:

xbar1 <- 19 #sample mean
sd1a <- 10 #sample standard deviation
n1a <- 539 #sample size

error1a <- qnorm(0.95)*sd1a/sqrt(n1a)
error1a
[1] 0.7084886
lower1a <- xbar1-error1a
upper1a <- xbar1+error1a

lower1a
[1] 18.29151
upper1a
[1] 19.70849
#Calculate the actual mean wait time for the angiography:

xbar1b <- 18 #sample mean
sd1b <- 9 #sample standard deviation
n1b <- 847 #sample size

error1b <- qnorm(0.95)*sd1b/sqrt(n1b)
error1b
[1] 0.5086606
lower1b <- xbar1b-error1b
upper1b <- xbar1b+error1b

lower1b
[1] 17.49134
upper1b
[1] 18.50866

Next, I created a data frame with the information.

Show code
#Create a data frame with the information:

df1 <- data.frame( c('Bypass', 'Angiography')
                   ,c(19, 18)
                   ,c(539, 847)
                   ,c(10, 9)
                   ,c(18.29151, 17.49134)
                   ,c(19.70849, 18.50866))
names(df1) <- c('Procedure', 'Mean', 'Sample', 'SD', 'Lower', 'Upper')

df1
    Procedure Mean Sample SD    Lower    Upper
1      Bypass   19    539 10 18.29151 19.70849
2 Angiography   18    847  9 17.49134 18.50866

Finally, I created a plot to visualize the results. The visualization communicates that for each procedure, there is a range of values where we can expect 90% of the estimates to include the population mean given the sample mean and standard deviation.

Show code
gg1 <- ggplot(data = df1)
gg1 <- gg1 + geom_point(aes(x = Procedure, y = Mean), size = 5, color = "blue")
gg1 <- gg1 + geom_errorbar(aes(x = Procedure, y = Mean, ymin=Lower, ymax=Upper), width=.1, color = "blue")
gg1 <- gg1 + geom_text(aes(x = Procedure, y = Lower, label = round(Lower, 2)), hjust = -1)
gg1 <- gg1 + geom_text(aes(x = Procedure, y = Upper, label = round(Upper, 2)), hjust = -1)
gg1 <- gg1 + geom_text(aes(x = Procedure, y = Mean, label = round(Mean, 2)), hjust = -0.5)
gg1 <- gg1 + labs(x = "Procedure", y = "Mean")
gg1 <- gg1 + labs(title = "Chart Showing Mean With Upper and Lower Confidence Intervals at 90%")
gg1 <- gg1 + theme_classic()

gg1

Question 2

A survey of 1031 adult Americans was carried out by the National Center for Public Policy. Assume that the sample is representative of adult Americans. Among those surveyed, 567 believed that college education is essential for success.

Answer 2

To construct the 95% confidence interval, I began by calculating the point sample estimate of the population proportion:

#Calculate the point sample estimate of the population proportion using (p = x/n)

x2 = 567 #affirmative response size
n2 = 1031 #sample size - survey participants

p2 <- x2/n2
p2
[1] 0.5499515

Since np >= 5 and n(1-p) >= 5, I know that I will calculate the confidence interval of that population proportion as follows: p +/- z * square root of (p) x (1-p)/n

#Calculate the confidence interval for p:

error2 <-qnorm(0.975)*sqrt(p2*(1-p2)/n2)
error2
[1] 0.03036761
lower2 <- p2-error2
upper2 <- p2+error2

lower2
[1] 0.5195839
upper2
[1] 0.5803191

Alternatively, using the R prop.test function, I can compare the calculation to my manual calculation and find it is the same for finding the point, but slightly different (at 4 decimal points) on the calculation of the confidence interval.

prop.test(x2, n2)

    1-sample proportions test with continuity correction

data:  x2 out of n2, null probability 0.5
X-squared = 10.091, df = 1, p-value = 0.00149
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5189682 0.5805580
sample estimates:
        p 
0.5499515 

Question 3

Suppose that the financial aid office of UMass Amherst seeks to estimate the mean cost of textbooks per quarter for students. The estimate will be useful if it is within $5 of the true population mean (i.e. they want the confidence interval to have a length of $10 or less). The financial aid office is pretty sure that the amount spent on books varies widely, with most values between $30 and $200. They think that the population standard deviation is about a quarter of this range.

Answer 3

I will start by taking the information given and calculating the variables I need to know.

If the aid office believes the amount spent on books is between $30 and $200, I have a range of $170 to consider.

I want to be 95% confident that the interval estimate contains the population mean, so my z = 1.96.

I also know that the margin of error should be no more than +/- $5 on each end of the estimate, so my margin of error = 5

error3 <- (5)

#I need to calculate the standard deviation at the given estimate that it is a quarter of the range:

sd3 <- 170*0.25 #range * 25%
sd3 #standard deviation = 42.5
[1] 42.5
#Now I can calculate the sample size with the formula n=(zσ/M)2.

ss3 <- ((1.96*sd3)/error3)^2
ss3 #sample size
[1] 277.5556

Question 4

According to a union agreement, the mean income for all senior-level workers in a large service company equals $500 per week. A representative of a women’s group decides to analyze whether the mean income μ for female employees matches this norm. For a random sample of nine female employees, ȳ = $410 and s = 90.

(Hint: The P-values for the two possible one-sided tests must sum to 1.)

Answer 4

  1. H0: The mean weekly earnings for the population of women at the company is μ=$500#

  2. Ha: The mean weekly earnings for the population of women at the company is μ≠$500

I cannot assume that the population distribution is normal as I have a low sample size of 9 and it is not stated it is a random sample, but a representative sample.

My other assumptions are:

I also need to make a decision about using a significance level of 5%.

I will use the test statistic formula to find the t-value: [t = (x̄) - (μ) / (sd/sqrt(n)]

a4 <- 500
n4 <- 9
xbar4 <- 410
sd4 <- 90

#Test statistic:

t4 <-(xbar4-a4)/(sd4/sqrt(n4))

t4
[1] -3

Given that my test statistic = (-3), I can determine the p-value is .00135*2 (to get the sum of both tail probabilities) or 0.0027.

Since my confidence level is 0.05 and my p-value of 0.0027 < 0.05, I can reject the null hypothesis that the mean weekly earnings for the population of women at the company is μ=$500.

#Then I take the t-statistic result (-3) and the degrees of freedom by taking "sample size - 1" or ("9" - 1) = 8.

pval4a <- pt(-3, 8)

pval4a
[1] 0.008535841
pval4b <- pt(-3, 8, lower.tail=FALSE)

pval4b
[1] 0.9914642

Per the hint, I can confirm that these are logical answers by adding the two probabilities together and confirming they equal 1:

pval4a+pval4b
[1] 1

Question 5

Jones and Smith separately conduct studies to test H0: μ = 500 against Ha : μ ≠ 500, each with n = 1000.

Jones gets ȳ = 519.5, with se = 10.0. Smith gets ȳ = 519.7.

Answer 5

#Test statistic for Jones:

ybar5a <- 519.5
n5 <- 1000
a5 <- 500
se5 <- 10.0

t5a <-(ybar5a-a5)/(se5)

t5a
[1] 1.95
#Test statistic for Smith:

ybar5b <- 519.7
n5 <- 1000
a5 <- 500
se5 <- 10.0

t5b <-(ybar5b-a5)/(se5)

t5b
[1] 1.97
#For Jones' results:

pval5a <- pt(1.95, 999, lower.tail = FALSE) * 2

pval5a
[1] 0.05145555
#For Smith's results:

pval5b <- pt(1.97, 999, lower.tail = FALSE) * 2

pval5b
[1] 0.04911426

Hypothesis tests tell us that if the p-value < α, we reject H0 and if p-value ≥ α, we do not reject H0. Given this general statistical guidance, only Smith’s results would be considered “statistically significant”.

Question 6

Are the taxes on gasoline very high in the United States? According to the American Petroleum Institute, the per gallon federal tax that was levied on gasoline was 18.4 cents per gallon. However, state and local taxes vary over the same period. The sample data of gasoline taxes for 18 large cities is given below in the variable called gas_taxes.

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

Answer

To answer whether there is enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents, we need to use a left-tailed t-test. We will use the sample data in the variable “gas_taxes” in the t.test function. We know that the t.test() function uses the 95% confidence interval as a default, but it also uses a two-sided test as the default, so we need to provide the alternative argument “less”, indicating a left-sided test.

gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

t.test(gas_taxes, alternative = "less")

    One Sample t-test

data:  gas_taxes
t = 18.625, df = 17, p-value = 1
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
     -Inf 44.67946
sample estimates:
mean of x 
 40.86278 

This t-test gives us a confidence interval of [inf - 44.67946]. Since the confidence interval includes amounts that are all less than 45 cents, we have enough evidence to conclude that, at that confidence level, the average tax was less than 45 cents.