The average time to graduate with a college degree in the U.S. is 6 years. Suppose you collect data from a normally-distributed sample of 40 NYU students who average a degree completion time of 5.4 years with standard deviation of 1.5 years. Based on this sample, test the hypothesis that the average time to earn a college degree from NYU is different from the U.S. average time to degree completion.
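Since the U.S. average time to graduate is 6 years, we build the hypotheses around that value.
H0: the mean NYU completion time is equal to 6 years. H1: the mean NYU completion time is not equal to 6 years.
We are not given the population variance, but the problem states that the data are normally distributed. Therefore we use a one-sample t-test. It is two-sided because the hypothesis asks whether the NYU average is different from the U.S. average, not specifically above or below it.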
The numbers would be plugged in as follows:
NYU avg = 5.4 NYU sd = 1.5 Country avg = 6 n = 40 degrees of freedom = 40 - 1 = 39
#sqrt(40) is 6.32455
t-stat = (5.4-6)/(1.5/6.32455) t-stat = -0.6/0.2372 t-stat = -2.5298
# t-stat is -2.53
pt(-2.53,40-1)*2
## [1] 0.01555661
# The p-value is 0.0156, which is under 0.05
We would reject the null hypothesis.
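The hand calculation above can be reproduced in R directly from the summary statistics. This is a minimal sketch: the variable names (`xbar`, `mu0`, `s`, `n`) are chosen here for illustration and are not part of the original solution.

```r
# Summary statistics taken from the problem statement
xbar <- 5.4   # sample mean (years)
mu0  <- 6     # hypothesized population mean (years)
s    <- 1.5   # sample standard deviation (years)
n    <- 40    # sample size

se     <- s / sqrt(n)               # standard error of the mean
t_stat <- (xbar - mu0) / se         # one-sample t statistic
df     <- n - 1                     # degrees of freedom
p_val  <- 2 * pt(-abs(t_stat), df)  # two-sided p-value

round(c(t = t_stat, df = df, p = p_val), 4)
```

Using `-abs(t_stat)` keeps the two-sided p-value correct whether the statistic is negative or positive.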
first <- qt(0.975,40-1)
first
## [1] 2.022691
# or
alternative<- qt(0.025,40-1)
alternative
## [1] -2.022691
To calculate the 95% confidence interval around the sample mean of 5.4 we compute the following:
5.4 + 2.02(0.2372) = 5.4 + 0.4791 = 5.8791
5.4 - 2.02(0.2372) = 5.4 - 0.4791 = 4.9209
Therefore, the confidence interval runs from 4.9209 to 5.8791 and does not contain 6. This supports the rejection of the null hypothesis.
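As a cross-check, the same interval can be computed in R from the sample mean of 5.4 given in the problem statement, using the unrounded critical value. This is a sketch with illustrative variable names, not part of the original solution.

```r
# 95% confidence interval for the mean from summary statistics
xbar <- 5.4   # sample mean (years)
s    <- 1.5   # sample standard deviation (years)
n    <- 40    # sample size

se     <- s / sqrt(n)           # standard error of the mean
t_crit <- qt(0.975, df = n - 1) # two-sided 95% critical value
margin <- t_crit * se           # margin of error

ci <- c(lower = xbar - margin, upper = xbar + margin)
ci

# Does the interval contain the hypothesized value of 6?
contains_6 <- ci[["lower"]] <= 6 && 6 <= ci[["upper"]]
contains_6
```

The small difference from the hand calculation comes from rounding the critical value to 2.02.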
With the larger sample, the degrees of freedom change and sqrt(40) becomes sqrt(80). This increases the denominator of the standard error and therefore decreases the width of the confidence interval.
The width would decrease, because the interval would cover less area.
The width would increase. This is the opposite of the sample-size effect, since the standard deviation is in the numerator of the standard error.
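The three effects described above can be checked numerically. The sketch below uses a hypothetical helper, `ci_width`, written for this illustration. The sample size of 80 follows the sqrt(80) mentioned in the text, while the 90% confidence level and the standard deviation of 3.0 are assumed values chosen only to show the direction of each change.

```r
# Width of a two-sided t-based confidence interval for a mean
ci_width <- function(s, n, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # critical value
  2 * t_crit * s / sqrt(n)                       # full interval width
}

base      <- ci_width(1.5, 40)                # original setting
bigger_n  <- ci_width(1.5, 80)                # larger sample
lower_lvl <- ci_width(1.5, 40, level = 0.90)  # assumed lower confidence level
bigger_sd <- ci_width(3.0, 40)                # assumed larger standard deviation

round(c(base = base, bigger_n = bigger_n,
        lower_lvl = lower_lvl, bigger_sd = bigger_sd), 4)
```

A larger sample and a lower confidence level both narrow the interval, while a larger standard deviation widens it.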
load('hanes_subset.RData')
Write code to create subsets of the data by the following groups:
GENDER (1 vs 2)
# GENDER (1 vs 2)
Gender1 <- subset(hanes, GENDER == 1)
Gender2 <- subset(hanes, GENDER == 2)
# SMOKER (1 vs 2)
Smoker1 <- subset(hanes, SMOKER == 1)
Smoker2 <- subset(hanes, SMOKER == 2)
# SPAGE (under 35 vs 35 and over)
Spage_l35 <- subset(hanes, SPAGE < 35)
Spage_g35 <- subset(hanes, SPAGE >= 35)
# INCOME (under 37 vs 37 and over)
inc_l37 <- subset(hanes, INCOME < 37)
inc_g37 <- subset(hanes, INCOME >= 37)
hist(Gender1$CHOLESTEROLTOTAL)
hist(Gender2$CHOLESTEROLTOTAL)
hist(Smoker1$CHOLESTEROLTOTAL)
hist(Smoker2$CHOLESTEROLTOTAL)
hist(Spage_l35$CHOLESTEROLTOTAL)
hist(Spage_g35$CHOLESTEROLTOTAL)
hist(inc_l37$CHOLESTEROLTOTAL)
hist(inc_g37$CHOLESTEROLTOTAL)
II. Write a null and alternative hypothesis for each of the four sub-group analyses.
H0: mean gender1 = mean gender2 H1: mean gender1 not equal to mean gender2
H0: mean smoker1 = mean smoker2 H1: mean smoker1 not equal to mean smoker2
H0: mean spage_under_35 = mean spage_over_or_equal_35 H1: mean spage_under_35 not equal to mean spage_over_or_equal_35
H0: mean income_under_37 = mean income_over_or_equal_37 H1: mean income_under_37 not equal to mean income_over_or_equal_37
Parametric, because the histograms show that most of the data look approximately normally distributed.
The groups meet the normality assumption based on the histograms, and the groups are independent of each other. Therefore we would use the independent-samples t-test.
Because we are testing multiple hypotheses, we need the Bonferroni correction. (In real life we could be more lenient and use other correction methods.)
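The tests that follow pair `var.test` with `t.test` for each pair of groups. One way to tie the two steps together is sketched below. The helper `compare_groups` is hypothetical, and the data are simulated with `rnorm` purely for illustration; they are not the `hanes` data.

```r
# Run an F test for equal variances, then pick the matching t-test:
# pooled (var.equal = TRUE) if the F test does not reject, Welch otherwise.
compare_groups <- function(x, y, alpha = 0.05) {
  f_test    <- var.test(x, y)
  equal_var <- f_test$p.value >= alpha
  t.test(x, y, var.equal = equal_var)
}

# Simulated example data (assumed values, not from the hanes data set)
set.seed(1)
group_a <- rnorm(100, mean = 180, sd = 40)
group_b <- rnorm(120, mean = 190, sd = 40)

res <- compare_groups(group_a, group_b)
res$method   # reports which version of the t-test was used
res$p.value
```

Letting the F-test result set `var.equal` avoids running a pooled t-test on groups whose variances differ significantly.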
var.test(Gender1$CHOLESTEROLTOTAL,Gender2$CHOLESTEROLTOTAL)
##
## F test to compare two variances
##
## data: Gender1$CHOLESTEROLTOTAL and Gender2$CHOLESTEROLTOTAL
## F = 1.0701, num df = 248, denom df = 330, p-value = 0.5648
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8489603 1.3540366
## sample estimates:
## ratio of variances
## 1.070061
t.test(Gender1$CHOLESTEROLTOTAL,Gender2$CHOLESTEROLTOTAL,var.equal = T)
##
## Two Sample t-test
##
## data: Gender1$CHOLESTEROLTOTAL and Gender2$CHOLESTEROLTOTAL
## t = -0.031087, df = 578, p-value = 0.9752
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.253865 6.058983
## sample estimates:
## mean of x mean of y
## 183.2530 183.3505
# The p-value is 0.9752, well above 0.05, so we fail to reject the null hypothesis.
var.test(Smoker1$CHOLESTEROLTOTAL,Smoker2$CHOLESTEROLTOTAL)
##
## F test to compare two variances
##
## data: Smoker1$CHOLESTEROLTOTAL and Smoker2$CHOLESTEROLTOTAL
## F = 1.1121, num df = 86, denom df = 492, p-value = 0.4909
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.8182402 1.5695098
## sample estimates:
## ratio of variances
## 1.112128
t.test(Smoker1$CHOLESTEROLTOTAL,Smoker2$CHOLESTEROLTOTAL,var.equal = T)
##
## Two Sample t-test
##
## data: Smoker1$CHOLESTEROLTOTAL and Smoker2$CHOLESTEROLTOTAL
## t = 4.7791, df = 578, p-value = 2.236e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 11.99678 28.73750
## sample estimates:
## mean of x mean of y
## 200.6207 180.2535
# The p-value is 2.236e-06, far below 0.05; therefore, we reject the null hypothesis.
var.test(Spage_l35$CHOLESTEROLTOTAL,Spage_g35$CHOLESTEROLTOTAL)
##
## F test to compare two variances
##
## data: Spage_l35$CHOLESTEROLTOTAL and Spage_g35$CHOLESTEROLTOTAL
## F = 0.78588, num df = 283, denom df = 295, p-value = 0.04126
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6239042 0.9904665
## sample estimates:
## ratio of variances
## 0.7858812
t.test(Spage_l35$CHOLESTEROLTOTAL,Spage_g35$CHOLESTEROLTOTAL,var.equal = T)
##
## Two Sample t-test
##
## data: Spage_l35$CHOLESTEROLTOTAL and Spage_g35$CHOLESTEROLTOTAL
## t = -6.928, df = 578, p-value = 1.142e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -26.51913 -14.80405
## sample estimates:
## mean of x mean of y
## 172.7641 193.4257
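# The p-value is 1.142e-11, far below 0.05; therefore, we reject the null hypothesis.
# Note: the F test above (p-value = 0.04126) suggests unequal variances, so Welch's t-test (var.equal = FALSE) would be the more appropriate choice for this comparison.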
var.test(inc_l37$CHOLESTEROLTOTAL,inc_g37$CHOLESTEROLTOTAL)
##
## F test to compare two variances
##
## data: inc_l37$CHOLESTEROLTOTAL and inc_g37$CHOLESTEROLTOTAL
## F = 0.97737, num df = 272, denom df = 306, p-value = 0.8479
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7760489 1.2328718
## sample estimates:
## ratio of variances
## 0.9773652
t.test(inc_l37$CHOLESTEROLTOTAL,inc_g37$CHOLESTEROLTOTAL,var.equal = T)
##
## Two Sample t-test
##
## data: inc_l37$CHOLESTEROLTOTAL and inc_g37$CHOLESTEROLTOTAL
## t = -0.91403, df = 578, p-value = 0.3611
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.939772 3.261592
## sample estimates:
## mean of x mean of y
## 181.8059 184.6450
# The income p-value is 0.3611, which is over 0.05; therefore, we fail to reject the null hypothesis.
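The Bonferroni correction mentioned earlier can be applied to the four p-values reported above with `p.adjust`. This is a sketch: the p-values are copied from the printed `t.test` output, and the vector names are labels chosen here for readability.

```r
# Raw p-values copied from the four t-test outputs above
p_raw <- c(gender = 0.9752,
           smoker = 2.236e-06,
           age    = 1.142e-11,
           income = 0.3611)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj

# Comparisons that remain significant at the 0.05 level after correction
significant <- names(p_adj)[p_adj < 0.05]
significant
```

Comparing the adjusted p-values with 0.05 is equivalent to comparing the raw p-values with 0.05 / 4 = 0.0125, and the conclusions for all four comparisons stay the same.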
The Wilcoxon signed-rank test should be used because of the small sample size and the two correlated (paired) samples.
cake <- c(7,6,10,4,6,9,6,8)
chips <- c(8,10,7,6,7,8,7,9)
wilcox.test(cake,chips,paired=TRUE)
## Warning in wilcox.test.default(cake, chips, paired = TRUE): cannot compute exact
## p-value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: cake and chips
## V = 10, p-value = 0.2815
## alternative hypothesis: true location shift is not equal to 0
# The p-value is 0.2815
We fail to reject the null hypothesis because the p-value is higher than 0.05.
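To see where the statistic V = 10 comes from, the signed-rank calculation can be done by hand. This is a sketch with illustrative variable names: it ranks the absolute paired differences (ties share an average rank) and sums the ranks belonging to positive differences.

```r
cake  <- c(7, 6, 10, 4, 6, 9, 6, 8)
chips <- c(8, 10, 7, 6, 7, 8, 7, 9)

d     <- cake - chips   # paired differences (none are zero here)
ranks <- rank(abs(d))   # ranks of |d|, with ties given the average rank
V     <- sum(ranks[d > 0])  # sum of ranks for the positive differences

data.frame(cake, chips, d, ranks)
V
```

The tied absolute differences in this table are the reason `wilcox.test` warns that it cannot compute an exact p-value.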