The average time to graduate with a college degree in the U.S. is 6 years. Suppose you collect data from a normally-distributed sample of 40 NYU students who average a degree completion time of 5.4 years with standard deviation of 1.5 years. Based on this sample, test the hypothesis that the average time to earn a college degree from NYU is different from the U.S. average time to degree completion.

  1. Write a null and alternative hypothesis.

Since the U.S. average time to degree is 6 years, we base the hypotheses on that value, where "mean" is the mean time to degree at NYU.

H0: mean is equal to 6 years
H1: mean is not equal to 6 years

  1. What statistical test would you use and why?

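We are not given the population variance, only the sample standard deviation, and the problem states that the data are normally distributed. Therefore we use a one-sample t-test. The test is two-sided because the hypothesis asks whether the NYU average is different from the U.S. average, in either direction.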

  1. Calculate the test statistic and corresponding p-value.

The numbers are plugged in as follows:

NYU avg = 5.4 NYU sd = 1.5 Country avg = 6 n = 40 degrees of freedom = 40 - 1 = 39

#sqrt(40) is 6.32455

t-stat = (5.4 - 6)/(1.5/6.32455)
t-stat = -0.6/0.23717
t-stat = -2.5298

# t-stat is -2.53

pt(-2.53,40-1)*2
## [1] 0.01555661
# p-value is 0.0156, which is below 0.05
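The hand calculation can be cross-checked directly from the summary statistics. The following is a minimal sketch; the variable names (`xbar`, `mu0`, `s`, `n`) are chosen here for illustration and are not part of the original solution.

```r
# Summary statistics taken from the problem statement
xbar <- 5.4   # sample mean (years)
mu0  <- 6     # hypothesized mean under H0 (years)
s    <- 1.5   # sample standard deviation (years)
n    <- 40    # sample size

se     <- s / sqrt(n)               # standard error of the mean
t_stat <- (xbar - mu0) / se         # one-sample t statistic
df     <- n - 1                     # degrees of freedom
p_val  <- 2 * pt(-abs(t_stat), df)  # two-sided p-value

round(c(se = se, t = t_stat, df = df, p = p_val), 4)
```

Using `-abs(t_stat)` keeps the two-sided p-value correct regardless of the sign of the statistic.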
  1. Assuming a Type I error rate of 0.05, would you reject or fail to reject the null hypothesis?

We would reject the null hypothesis, because the p-value (0.0156) is below 0.05.

  1. Assuming a Type I error rate of 0.05, calculate a confidence interval about the sample mean. Does the interval include the population mean?
first <- qt(0.975,40-1)
first
## [1] 2.022691
# or
alternative<- qt(0.025,40-1)
alternative
## [1] -2.022691

To calculate the 95% confidence interval, we take the sample mean (5.4) plus or minus the critical t value times the standard error (1.5/sqrt(40) = 0.23717):

5.4 + 2.0227(0.23717) = 5.4 + 0.4797 = 5.8797

5.4 - 2.0227(0.23717) = 5.4 - 0.4797 = 4.9203

Therefore, the 95% confidence interval runs from 4.9203 to 5.8797. It does not contain the population mean of 6, which supports the rejection of the null hypothesis.
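The interval can also be computed in one step from the summary statistics in the problem statement (mean 5.4, standard deviation 1.5, n = 40). This is a minimal sketch; the variable names are illustrative assumptions, not part of the original solution.

```r
# Summary statistics from the problem statement
xbar  <- 5.4    # sample mean (years)
s     <- 1.5    # sample standard deviation (years)
n     <- 40     # sample size
alpha <- 0.05   # Type I error rate

se     <- s / sqrt(n)                     # standard error of the mean
t_crit <- qt(1 - alpha / 2, df = n - 1)   # two-sided critical value
ci     <- xbar + c(-1, 1) * t_crit * se   # lower and upper bounds

ci
ci[1] <= 6 && 6 <= ci[2]   # does the interval contain the U.S. mean of 6?
```

The last line returns a single logical value, which makes the "does the interval include 6" check explicit.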

  1. The sample size increases from 40 to 80 NYU students.

Replacing n = 40 with n = 80 raises the degrees of freedom from 39 to 79 and changes sqrt(40) to sqrt(80) in the denominator of the standard error. The standard error shrinks and the critical t value drops slightly, so the confidence interval becomes narrower.

  1. The Type I error rate increases from 0.05 to 0.1.

The confidence interval would become narrower: a 90% interval covers less area than a 95% interval, so the critical t value is smaller.

  1. The standard deviation increases from 1.5 years to 2 years.

The confidence interval would become wider. This is the opposite of the sample-size effect, since the standard deviation is in the numerator of the standard error.
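The three scenarios can be compared numerically through the half-width of the interval, t critical value times s/sqrt(n). The sketch below uses a helper named `ci_half_width`, which is a hypothetical function written for this illustration.

```r
# Half-width of a two-sided t confidence interval for a mean
# (hypothetical helper, written for this illustration)
ci_half_width <- function(s, n, alpha) {
  qt(1 - alpha / 2, df = n - 1) * s / sqrt(n)
}

base      <- ci_half_width(s = 1.5, n = 40, alpha = 0.05)  # original setup
larger_n  <- ci_half_width(s = 1.5, n = 80, alpha = 0.05)  # n: 40 -> 80
larger_a  <- ci_half_width(s = 1.5, n = 40, alpha = 0.10)  # alpha: 0.05 -> 0.1
larger_sd <- ci_half_width(s = 2.0, n = 40, alpha = 0.05)  # sd: 1.5 -> 2

round(c(base = base, larger_n = larger_n,
        larger_alpha = larger_a, larger_sd = larger_sd), 3)
```

A smaller half-width than `base` means a narrower interval; a larger one means a wider interval.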


  1. In the following questions, conduct exploratory data analysis using hypothesis testing:
  1. Load into R the file “hanes_subset.Rdata”.
load('hanes_subset.RData')
  1. Write code to create subsets of the data by the following groups:

  1. GENDER (1 vs 2)
  2. SMOKER (1 vs 2)
  3. SPAGE (median cut-off of less than 35 vs. greater than or equal to 35)
  4. INCOME (median cut-off of less than 37 vs. greater than or equal to 37)
# Two subsets per grouping variable
Gender1   <- subset(hanes, GENDER == 1)
Gender2   <- subset(hanes, GENDER == 2)
Smoker1   <- subset(hanes, SMOKER == 1)
Smoker2   <- subset(hanes, SMOKER == 2)
# Median cut-offs: 35 for SPAGE, 37 for INCOME
Spage_l35 <- subset(hanes, SPAGE < 35)
Spage_g35 <- subset(hanes, SPAGE >= 35)
inc_l37   <- subset(hanes, INCOME < 37)
inc_g37   <- subset(hanes, INCOME >= 37)
  1. Generate 8 histograms of CHOLESTEROLTOTAL by each subgroup. Which subgroups follow a normal distribution?
hist(Gender1$CHOLESTEROLTOTAL)

hist(Gender2$CHOLESTEROLTOTAL)

hist(Smoker1$CHOLESTEROLTOTAL)

hist(Smoker2$CHOLESTEROLTOTAL)

hist(Spage_l35$CHOLESTEROLTOTAL)

hist(Spage_g35$CHOLESTEROLTOTAL)

hist(inc_l37$CHOLESTEROLTOTAL)

hist(inc_g37$CHOLESTEROLTOTAL)

  1. Write a null and alternative hypothesis for each of the four sub-group analyses.

H0: mean gender1 = mean gender2
H1: mean gender1 not equal to mean gender2

H0: mean smoker1 = mean smoker2
H1: mean smoker1 not equal to mean smoker2

H0: mean spage_under_35 = mean spage_over_or_equal_35
H1: mean spage_under_35 not equal to mean spage_over_or_equal_35

H0: mean income_under_37 = mean income_over_or_equal_37
H1: mean income_under_37 not equal to mean income_over_or_equal_37

  1. Should a parametric or non-parametric test be used in each of the four subgroup analyses? Justify your answer.

A parametric test, because the histograms show that CHOLESTEROLTOTAL looks approximately normally distributed in most of the subgroups.
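A visual check of histograms can be complemented by a formal normality test per group. The sketch below uses the Shapiro-Wilk test, which is not part of the original analysis, on simulated stand-in data: the data frame `toy`, its `GROUP` column, and the simulated cholesterol values are all assumptions made for illustration. With the real data, the same pattern would be applied to `hanes`.

```r
set.seed(1)  # reproducible toy data

# Simulated stand-in for the real data set (all values are hypothetical)
toy <- data.frame(
  GROUP = rep(c(1, 2), times = c(120, 150)),
  CHOLESTEROLTOTAL = c(rnorm(120, mean = 185, sd = 35),
                       rnorm(150, mean = 180, sd = 35))
)

# Split the outcome by group, then run Shapiro-Wilk on each piece
by_group  <- split(toy$CHOLESTEROLTOTAL, toy$GROUP)
shapiro_p <- sapply(by_group, function(x) shapiro.test(x)$p.value)

shapiro_p  # a small p-value would suggest a departure from normality
```

With large groups, the Shapiro-Wilk test can flag small departures that matter little for a t-test, so it is best read alongside the histograms rather than instead of them.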

  1. Which statistical test(s) should be used in each of the four subgroup analyses? Justify your answers.

Based on the histograms, the groups are approximately normal, and the two groups in each comparison are independent (not correlated). Therefore we would use the independent two-sample t-test.

  1. What Type I error rate should be used?

Because we are testing multiple hypotheses, we need the Bonferroni correction: with four tests, the per-test Type I error rate is 0.05/4 = 0.0125. (In practice, we could be more lenient and use a less conservative correction.)

  1. Perform each of the four statistical tests you selected above using R. Report a p-value. Would you reject or fail to reject each null hypothesis?
var.test(Gender1$CHOLESTEROLTOTAL,Gender2$CHOLESTEROLTOTAL) 
## 
##  F test to compare two variances
## 
## data:  Gender1$CHOLESTEROLTOTAL and Gender2$CHOLESTEROLTOTAL
## F = 1.0701, num df = 248, denom df = 330, p-value = 0.5648
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8489603 1.3540366
## sample estimates:
## ratio of variances 
##           1.070061
t.test(Gender1$CHOLESTEROLTOTAL,Gender2$CHOLESTEROLTOTAL,var.equal = T)
## 
##  Two Sample t-test
## 
## data:  Gender1$CHOLESTEROLTOTAL and Gender2$CHOLESTEROLTOTAL
## t = -0.031087, df = 578, p-value = 0.9752
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.253865  6.058983
## sample estimates:
## mean of x mean of y 
##  183.2530  183.3505
# The p-value is 0.9752, far above 0.05, so we fail to reject the null hypothesis.


var.test(Smoker1$CHOLESTEROLTOTAL,Smoker2$CHOLESTEROLTOTAL) 
## 
##  F test to compare two variances
## 
## data:  Smoker1$CHOLESTEROLTOTAL and Smoker2$CHOLESTEROLTOTAL
## F = 1.1121, num df = 86, denom df = 492, p-value = 0.4909
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8182402 1.5695098
## sample estimates:
## ratio of variances 
##           1.112128
t.test(Smoker1$CHOLESTEROLTOTAL,Smoker2$CHOLESTEROLTOTAL,var.equal = T)
## 
##  Two Sample t-test
## 
## data:  Smoker1$CHOLESTEROLTOTAL and Smoker2$CHOLESTEROLTOTAL
## t = 4.7791, df = 578, p-value = 2.236e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  11.99678 28.73750
## sample estimates:
## mean of x mean of y 
##  200.6207  180.2535
# The p-value is 2.236e-06, far below 0.05 (and below the Bonferroni-adjusted 0.05/4 = 0.0125), so we reject the null hypothesis.

var.test(Spage_l35$CHOLESTEROLTOTAL,Spage_g35$CHOLESTEROLTOTAL) 
## 
##  F test to compare two variances
## 
## data:  Spage_l35$CHOLESTEROLTOTAL and Spage_g35$CHOLESTEROLTOTAL
## F = 0.78588, num df = 283, denom df = 295, p-value = 0.04126
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6239042 0.9904665
## sample estimates:
## ratio of variances 
##          0.7858812
t.test(Spage_l35$CHOLESTEROLTOTAL,Spage_g35$CHOLESTEROLTOTAL,var.equal = T) 
## 
##  Two Sample t-test
## 
## data:  Spage_l35$CHOLESTEROLTOTAL and Spage_g35$CHOLESTEROLTOTAL
## t = -6.928, df = 578, p-value = 1.142e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -26.51913 -14.80405
## sample estimates:
## mean of x mean of y 
##  172.7641  193.4257
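# The p-value is 1.142e-11, far below 0.05 (and below the Bonferroni-adjusted 0.05/4 = 0.0125), so we reject the null hypothesis.
# Note: var.test above gave p = 0.04126, so equal variances are doubtful for this comparison; Welch's t-test (var.equal = FALSE) would be the safer choice.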
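When the F test casts doubt on equal variances, as it does for the SPAGE comparison (p = 0.04126), Welch's t-test is a common alternative to the pooled test. The sketch below shows the call on simulated data only: the group sizes 284 and 296 follow from the F-test degrees of freedom reported above, but the means, standard deviations, and the vectors `younger` and `older` are hypothetical.

```r
set.seed(42)  # reproducible toy data

# Simulated stand-ins for the two age groups (values are hypothetical)
younger <- rnorm(284, mean = 173, sd = 33)
older   <- rnorm(296, mean = 193, sd = 38)

# Welch's test does not assume equal variances; the pooled test does
welch  <- t.test(younger, older, var.equal = FALSE)
pooled <- t.test(younger, older, var.equal = TRUE)

# Welch adjusts the degrees of freedom downward when variances differ
c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter))
c(welch_p = welch$p.value, pooled_p = pooled$p.value)
```

On the real data, the same change amounts to replacing `var.equal = T` with `var.equal = FALSE` in the `t.test` call.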


var.test(inc_l37$CHOLESTEROLTOTAL,inc_g37$CHOLESTEROLTOTAL) 
## 
##  F test to compare two variances
## 
## data:  inc_l37$CHOLESTEROLTOTAL and inc_g37$CHOLESTEROLTOTAL
## F = 0.97737, num df = 272, denom df = 306, p-value = 0.8479
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7760489 1.2328718
## sample estimates:
## ratio of variances 
##          0.9773652
t.test(inc_l37$CHOLESTEROLTOTAL,inc_g37$CHOLESTEROLTOTAL,var.equal = T) 
## 
##  Two Sample t-test
## 
## data:  inc_l37$CHOLESTEROLTOTAL and inc_g37$CHOLESTEROLTOTAL
## t = -0.91403, df = 578, p-value = 0.3611
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.939772  3.261592
## sample estimates:
## mean of x mean of y 
##  181.8059  184.6450
# The income p-value is 0.3611, which is above 0.05, so we fail to reject the null hypothesis.
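The four reported p-values can be checked against a Bonferroni-corrected threshold in one step. This is a minimal sketch: the vector `p_raw` simply copies the p-values printed in the output above, and its element names are labels chosen for illustration.

```r
# p-values copied from the four t-test outputs above
p_raw <- c(gender = 0.9752, smoker = 2.236e-06,
           age = 1.142e-11, income = 0.3611)

alpha      <- 0.05
alpha_bonf <- alpha / length(p_raw)   # per-test threshold: 0.05 / 4

# Option 1: compare raw p-values to the corrected threshold
reject_bonf <- p_raw < alpha_bonf

# Option 2 (equivalent): adjust the p-values and compare to alpha
p_adj <- p.adjust(p_raw, method = "bonferroni")

data.frame(p_raw, p_adj, reject = reject_bonf)
```

Both options lead to the same decisions: the smoker and age comparisons are rejected, and the gender and income comparisons are not.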
  1. You are trying to determine whether a sample of study participants prefer sweet or savory snacks. You give 8 participants cake and ask them to rate their satisfaction with the snack on a 1-10 scale (where 1 is the most unsatisfied and 10 is the most satisfied). You then provide these same participants with potato chips and ask them to also rate their satisfaction with the snack on a 1 to 10 scale. You collect the following data:
  1. Which hypothesis test should be used and why?

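The Wilcoxon signed-rank test should be used, because the sample is small (8 participants), the ratings are on an ordinal 1-10 scale, and the two samples are paired (the same participants rated both snacks).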

  1. Use R to calculate a p-value.
cake <- c(7,6,10,4,6,9,6,8)
chips <- c(8,10,7,6,7,8,7,9)


wilcox.test(cake,chips,paired=TRUE)
## Warning in wilcox.test.default(cake, chips, paired = TRUE): cannot compute exact
## p-value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  cake and chips
## V = 10, p-value = 0.2815
## alternative hypothesis: true location shift is not equal to 0
# The p-value is 0.2815
  1. Should you reject or fail to reject the null hypothesis?

We fail to reject the null hypothesis, because the p-value (0.2815) is higher than 0.05.
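The statistic V = 10 reported by `wilcox.test` can be reproduced by hand: rank the absolute paired differences, then sum the ranks that belong to positive differences. The sketch below reuses the `cake` and `chips` vectors from the solution; the intermediate names `d`, `ranks`, and `V` are illustrative.

```r
cake  <- c(7, 6, 10, 4, 6, 9, 6, 8)
chips <- c(8, 10, 7, 6, 7, 8, 7, 9)

d <- cake - chips        # paired differences
d <- d[d != 0]           # zero differences are dropped (none occur here)

ranks <- rank(abs(d))    # tied absolute differences share an average rank
V     <- sum(ranks[d > 0])   # sum of ranks for positive differences

V
# → 10
```

Five of the eight absolute differences equal 1, which is the source of the "cannot compute exact p-value with ties" warning in the output.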