DACSS 603 Homework 1
library(distill)
library(dplyr)
library(tidyverse)
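Problem1 <- read.csv('homework1_prob1.csv', header = TRUE, sep = ',', na.strings = "N/A", fileEncoding = "UTF-8-BOM") #fileEncoding drops the byte-order mark that otherwise garbles the first column name
Problem1
  Surgical.Procedure Sample.Size Mean.Wait.Time Standard.Deviation
1             Bypass         539             19                 10
2        Angiography         847             18                  9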
Construct the 90% confidence interval to estimate the actual mean wait time for each of the two procedures. Is the confidence interval narrower for angiography or bypass surgery?
#Our values are:
Bypass.mean<- 19
Bypass.sd<-10
Bypass.n<-539
Bypass.se <- Bypass.sd/sqrt(Bypass.n) #This is standard error of the mean
#Find the t.score for CI 90%
alpha = 0.1
degrees.freedom = Bypass.n - 1
Bypass.t.score = qt(p=alpha/2, df=degrees.freedom,lower.tail=F)
print(Bypass.t.score)
[1] 1.647691
#Calculate margin of error
Bypass.margin.error <- Bypass.t.score * Bypass.se
print(Bypass.margin.error)
[1] 0.7097107
#Calculate the 90% confidence interval for Bypass
lower.bound <- Bypass.mean - Bypass.margin.error
upper.bound <- Bypass.mean + Bypass.margin.error
print(c(lower.bound,upper.bound))
[1] 18.29029 19.70971
#Our values are:
Angio.mean<- 18
Angio.sd<-9
Angio.n<-847
Angio.se <- Angio.sd/sqrt(Angio.n) #This is standard error of the mean
#Find the t.score for CI 90%
alpha = 0.1
degrees.freedomA = Angio.n - 1
Angio.t.score = qt(p=alpha/2, df=degrees.freedomA,lower.tail=F)
print(Angio.t.score)
[1] 1.646657
#Calculate margin of error
Angio.margin.error <- Angio.t.score * Angio.se
print(Angio.margin.error)
[1] 0.5092182
#Calculate the 90% confidence interval for Angiography
lower.boundA <- Angio.mean - Angio.margin.error
upper.boundA <- Angio.mean + Angio.margin.error
print(c(lower.boundA,upper.boundA))
[1] 17.49078 18.50922
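At the 90% confidence level, the mean wait time for angiography patients is between 17.49078 and 18.50922 days (a width of about 1.02 days), while the mean wait time for bypass patients is between 18.29029 and 19.70971 days (a width of about 1.42 days). The confidence interval is therefore NARROWER for angiography, which has both the larger sample and the smaller standard deviation.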
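Since the same steps are repeated for both procedures, they could be wrapped in one small helper. This is only a sketch: the function name `ci_from_summary` and its argument names are made up for illustration and are not part of the assignment. It assumes a t-based interval computed from the summary statistics in the table.

```r
# Hypothetical helper: t-based confidence interval from summary statistics.
ci_from_summary <- function(xbar, s, n, conf.level = 0.90) {
  se <- s / sqrt(n)                                    # standard error of the mean
  t.score <- qt(1 - (1 - conf.level) / 2, df = n - 1)  # two-sided critical value
  margin <- t.score * se
  c(lower = xbar - margin, upper = xbar + margin)
}

bypass.ci <- ci_from_summary(xbar = 19, s = 10, n = 539)
angio.ci  <- ci_from_summary(xbar = 18, s = 9,  n = 847)

# Compare interval widths directly instead of reading the bounds by eye.
bypass.width <- unname(diff(bypass.ci))
angio.width  <- unname(diff(angio.ci))
print(c(bypass = bypass.width, angio = angio.width))
```

Comparing the two widths answers the "which is narrower" question without relying on the printed bounds.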
#Point Estimate, p
n<-1031
k<-567
p<-k/n
p
[1] 0.5499515
The sample proportion of adult Americans who believed that college education is essential for success is 0.5499515 or 55%. This represents our point estimate for the population (adult Americans) proportion.
#Construct 95% confidence interval for p
S.margin <- qnorm(0.975)*sqrt(p*(1-p)/n) #calculate margin of error
S.lower.bound <- p-S.margin
S.upper.bound <- p+S.margin
print(c(S.lower.bound,S.upper.bound))
[1] 0.5195839 0.5803191
The 95% confidence interval for the population (adult Americans) proportion is [0.5195839, 0.5803191]. This means we are 95% confident that between 52.0% and 58.0% of adult Americans believe that college education is essential for success.
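The interval above is the normal-approximation (Wald) interval. As a cross-check, base R's `prop.test` with `correct = FALSE` returns the Wilson score interval, which is a different method and will not match the Wald bounds exactly. The sketch below is a comparison only, not the method used in this homework; the names `wilson` and `wilson.ci` are illustrative.

```r
# Cross-check with the Wilson score interval (no continuity correction).
k <- 567    # respondents who said college education is essential
n <- 1031   # sample size
wilson <- prop.test(x = k, n = n, conf.level = 0.95, correct = FALSE)
wilson.ci <- as.numeric(wilson$conf.int)
print(wilson.ci)
```

With a sample this large and a proportion near 0.5, the two methods should give very similar bounds.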
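Required sample size: $n = \left(\frac{z_{0.05/2}\,\sigma}{E}\right)^2$, where $\sigma$ is the standard deviation of the population and $E = 5$ is the margin of error (within \$5 of the true population mean).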
sd_of_pop =(200-30)/4 #This is standard deviation of population
sd_of_pop
[1] 42.5
sample_size=((1.96*sd_of_pop)/5)**2 #Using Z score 1.96 for significance level 5%; the result must be rounded up to the next whole number
sample_size
[1] 277.5556
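A sample size has to be a whole number, and rounding down would miss the target margin of error. A minimal sketch of the same calculation with the rounding step made explicit (the names `margin`, `z` and `required_n` are illustrative, and `qnorm(0.975)` is used in place of the rounded 1.96):

```r
sd_of_pop <- (200 - 30) / 4   # range / 4 as a rough estimate of the standard deviation
margin <- 5                   # desired margin of error in dollars
z <- qnorm(0.975)             # two-sided 95% critical value

# Round up so the margin of error is at most the target.
required_n <- ceiling((z * sd_of_pop / margin)^2)
print(required_n)
```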
Test whether the mean income of female employees differs from $500 per week. Include assumptions, hypotheses, test statistic, and P-value. Interpret the result.
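Assumption: the nine incomes are a random sample from a normally distributed population, simulated here with rnorm after setting the seed to 123.
H0: μ = 500
Ha : μ ≠ 500
Sample mean ȳ = 410
sd = 90
n = 9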
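The call that generated the sample is not shown here. The nine values printed next are consistent with the following reconstruction, which assumes seed 123, `rnorm` with mean 410 and standard deviation 90, and R's default random number generator:

```r
# Reconstruction under the stated assumptions; not copied from the original chunk.
set.seed(123)
Mean_income <- rnorm(n = 9, mean = 410, sd = 90)
print(Mean_income)
```

Because the sample is simulated, its mean (about 421.9) differs from the reported sample mean of 410.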
[1] 359.5572 389.2840 550.2837 416.3458 421.6359 564.3558 451.4825
[8] 296.1445 348.1832
t.test(Mean_income, mu = 500) # Ho: mu=500
One Sample t-test
data: Mean_income
t = -2.6213, df = 8, p-value = 0.03059
alternative hypothesis: true mean is not equal to 500
95 percent confidence interval:
353.2313 490.6071
sample estimates:
mean of x
421.9192
With a p-value of 0.03059 < α = 0.05, the result is statistically significant and we reject the null hypothesis. We conclude that the mean weekly income of female employees differs from $500.
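The test above is run on a simulated sample, so its statistic depends on the random draw. As a cross-check, the same hypotheses can be tested directly from the reported summary statistics (mean 410, sd 90, n = 9). A sketch with illustrative variable names:

```r
y_bar <- 410   # reported sample mean
s <- 90        # reported sample standard deviation
n <- 9         # sample size
mu0 <- 500     # value under H0

se <- s / sqrt(n)               # standard error: 90 / 3 = 30
t.stat <- (y_bar - mu0) / se    # test statistic
deg.free <- n - 1
p.two.sided <- 2 * pt(abs(t.stat), df = deg.free, lower.tail = FALSE)
print(c(t = t.stat, df = deg.free, p.value = p.two.sided))
```

The statistic works out to t = -3 on 8 degrees of freedom, and the two-sided p-value is also below 0.05, so the conclusion does not depend on the simulated draw.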
t.test(Mean_income, mu=500, alternative = 'less') # Ha: mu<500
One Sample t-test
data: Mean_income
t = -2.6213, df = 8, p-value = 0.01529
alternative hypothesis: true mean is less than 500
95 percent confidence interval:
-Inf 477.3087
sample estimates:
mean of x
421.9192
With a p-value of 0.01529 < α = 0.05, the result is statistically significant: we reject the null hypothesis in favor of the alternative and conclude that the mean weekly income of female employees is less than $500.
t.test(Mean_income, mu=500, alternative = "greater") # Ha: mu>500
One Sample t-test
data: Mean_income
t = -2.6213, df = 8, p-value = 0.9847
alternative hypothesis: true mean is greater than 500
95 percent confidence interval:
366.5297 Inf
sample estimates:
mean of x
421.9192
With a p-value of 0.9847, the result is NOT statistically significant and we fail to reject the null hypothesis. There is no evidence that the mean weekly income of female employees is greater than $500.
(Hint: The P-values for the two possible one-sided tests must sum to 1.)
total_p_value=0.9847+0.01529
print(c("Total p-values for the two possible one-sided tests is",total_p_value))
[1] "Total p-values for the two possible one-sided tests is"
[2] "0.99999"
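The sum prints as 0.99999 only because the two p-values were typed in after rounding. A sketch that computes both tails from the same statistic (t and df taken from the t-test output above; the variable names are illustrative) shows the relationship without the rounding error:

```r
t.stat <- -2.6213   # test statistic reported by t.test
deg.free <- 8       # degrees of freedom reported by t.test

p.less <- pt(t.stat, df = deg.free)                         # Ha: mu < 500
p.greater <- pt(t.stat, df = deg.free, lower.tail = FALSE)  # Ha: mu > 500
print(p.less + p.greater)
```

The two one-sided p-values are the lower and upper tail areas at the same point, so they always add up to 1.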
tab <- matrix(c(519.5, 519.7, 10, 10), ncol=2, byrow=TRUE)
colnames(tab) <- c("Jones","Smith")
rownames(tab) <- c("y_hat","se")
tab <- as.table(tab)
print(tab)
      Jones Smith
y_hat 519.5 519.7
se     10.0  10.0
#General form: t.stat <- (y_hat - mu)/sample.se, and the two-sided p.value is 2 * pt(q=abs(t.stat), df=degrees.freedom, lower.tail=F)
mu <- 500
t.stat <- (519.5 - mu)/10 #Jones: t = 1.95
degrees.freedom = 1000 - 1
p.value = pt(q=abs(t.stat), df=degrees.freedom,lower.tail=F) * 2
print(c("Two-sided p-value",p.value))
[1] "Two-sided p-value" "0.0514555476459477"
Smith.t.stat <- (519.7 - 500)/10 #Smith: t = 1.97
degrees.freedom = 1000 - 1
Smith.p.value = pt(q=abs(Smith.t.stat), df=degrees.freedom,lower.tail=F) * 2
print(c("Two-sided p-value",Smith.p.value))
[1] "Two-sided p-value" "0.0491142565416521"
α = 0.05
H0: μ = 500
Ha : μ ≠ 500
For Jones Data: Since p = 0.051 > α = 0.05, we fail to reject the null hypothesis (H0: μ = 500). The result is NOT statistically significant, so there is not enough evidence to conclude that μ ≠ 500.
For Smith Data: Since p = 0.049 < α = 0.05, we reject the null hypothesis (H0: μ = 500) in favor of the alternative hypothesis and conclude that μ ≠ 500. The result is statistically significant.
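Both studies can also be handled in one vectorized step, which puts the two p-values side by side. A sketch, assuming n = 1000 for each study as in the degrees of freedom used above (the variable names are illustrative):

```r
y_bar <- c(Jones = 519.5, Smith = 519.7)   # sample means
se <- 10                                   # standard error, same for both
mu0 <- 500                                 # value under H0
deg.free <- 1000 - 1

t.stat <- (y_bar - mu0) / se
p.value <- 2 * pt(abs(t.stat), df = deg.free, lower.tail = FALSE)
print(round(p.value, 4))
```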
print(tab)
      Jones Smith
y_hat 519.5 519.7
se     10.0  10.0
#α = 0.05
#H0: μ = 500
#Ha : μ ≠ 500
#Zc= 1.96
Jones.z <- (519.5-500)/10
Jones.z
[1] 1.95
Smith.z <- (519.7-500)/10
Smith.z
[1] 1.97
Reporting only whether a result is "P ≤ 0.05" or "P > 0.05" can be misleading: the two studies are nearly identical, yet they fall on opposite sides of the cutoff. Reporting the test statistic (or the actual p-value) makes this visible. Using the critical value z = 1.96 for the same 95% confidence level, the Jones z-score of 1.95 < 1.96 means we fail to reject the null hypothesis (H0: μ = 500), while the Smith z-score of 1.97 > 1.96 means we reject the null hypothesis in favor of the alternative (Ha : μ ≠ 500). The two statistics differ by only 0.02, so the evidence against H0 is practically the same in both studies.
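Confidence intervals make the same point in the units of the data. A sketch, again assuming n = 1000 for each study and using illustrative variable names:

```r
y_bar <- c(Jones = 519.5, Smith = 519.7)   # sample means
se <- 10                                   # standard error, same for both
deg.free <- 1000 - 1
t.crit <- qt(0.975, df = deg.free)         # two-sided 95% critical value

# One column per study, rows are the lower and upper bounds.
ci <- rbind(lower = y_bar - t.crit * se, upper = y_bar + t.crit * se)
print(round(ci, 2))
```

The two intervals overlap almost completely; one just includes 500 and the other just excludes it.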
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02, 48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)
Is there enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents? Explain.
#H0: μ = 45
#Ha : μ < 45 (one-sided test, using the argument alternative = 'less')
t.test(gas_taxes,mu=45, alternative = 'less')
One Sample t-test
data: gas_taxes
t = -1.8857, df = 17, p-value = 0.03827
alternative hypothesis: true mean is less than 45
95 percent confidence interval:
-Inf 44.67946
sample estimates:
mean of x
40.86278
Yes, there is enough evidence to conclude at a 95% confidence level that the average tax per gallon of gas in the US in 2005 was less than 45 cents. With the p-value of 0.03827 < α = 0.05, we reject the null hypothesis that μ = 45 and accept the alternative hypothesis that μ < 45 cents.
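To see where the numbers in the t.test output come from, the statistic and the one-sided p-value can be computed by hand. A sketch with illustrative variable names:

```r
gas_taxes <- c(51.27, 47.43, 38.89, 41.95, 28.61, 41.29, 52.19, 49.48, 35.02,
               48.13, 39.28, 54.41, 41.66, 30.28, 18.49, 38.72, 33.41, 45.02)

n <- length(gas_taxes)
x_bar <- mean(gas_taxes)
se <- sd(gas_taxes) / sqrt(n)       # standard error of the mean
t.stat <- (x_bar - 45) / se         # H0: mu = 45
p.less <- pt(t.stat, df = n - 1)    # lower tail, Ha: mu < 45
print(c(mean = x_bar, t = t.stat, p.value = p.less))
```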