Purpose

In this project, students will demonstrate their understanding of inference on numerical data using the t-distribution. Unless stated otherwise, assume a significance level of 0.05.


Preparation

Store the ncbirths dataset in your environment in the following R chunk. Explore it by using the str() function, viewing the data frame, and reading its documentation to familiarize yourself with the variables. None of this will be graded; it is just for your own preparation.

Load Openintro Library

library("openintro")
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees

Store ncbirths in environment

ncbirths<-ncbirths
# View the structure of ncbirths
str(ncbirths)
## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...

The ncbirths data frame contains 1000 observations of 13 variables.

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Find and store the mean of weight gained
mean(ncbirths$gained, na.rm = TRUE)
## [1] 30.3258
wtgmean <- mean(ncbirths$gained, na.rm = TRUE)

# Find and store the standard deviation of weight gained
sd(ncbirths$gained, na.rm = TRUE)
## [1] 14.2413
wtgsd <- sd(ncbirths$gained, na.rm = TRUE)

# Find and store the sample size of weight gained
table(is.na(ncbirths$gained))
## 
## FALSE  TRUE 
##   973    27
wtgsample<-973
# Calculate t-critical value for 90% confidence
abs(qt(p = 0.05, df = 972))
## [1] 1.646423
t90 <- abs(qt(p = 0.05, df = 972))
# Calculate and store the standard error of the mean
wtgsd/sqrt(wtgsample)
## [1] 0.456555
SE <- wtgsd/sqrt(wtgsample)

The mean of weight gained is 30.33, with a standard deviation of 14.24 and a sample size of 973.

Bounds of Confidence Interval: \(\bar{x} \pm t^{*} \frac{s}{\sqrt{n}}\)

# Calculate the lower bound of a 90% C.I.
wtgmean - t90*SE
## [1] 29.57411
# Calculate the upper bound of a 90% C.I.
wtgmean + t90*SE
## [1] 31.07748

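The lower bound of the 90% C.I. is 29.57 and the upper bound is 31.08. We are 90% confident that the mean weight gained during pregnancy by North Carolina mothers is between 29.57 and 31.08 pounds.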
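The same interval can be reproduced from the summary statistics alone. The sketch below is a minimal illustration; `t_ci` is a hypothetical helper name, and its inputs are the rounded values printed above, so the result may differ from the output above in the last decimal places.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
t_ci <- function(xbar, s, n, conf = 0.95) {
  alpha <- 1 - conf
  t_star <- qt(1 - alpha / 2, df = n - 1)  # two-sided t-critical value
  se <- s / sqrt(n)                        # standard error of the mean
  c(lower = xbar - t_star * se, upper = xbar + t_star * se)
}

# Values printed above: mean 30.3258, sd 14.2413, n = 973
ci90 <- t_ci(30.3258, 14.2413, 973, conf = 0.90)
print(round(ci90, 2))
```

With the raw data, `t.test(ncbirths$gained, conf.level = 0.90)$conf.int` should give the same bounds, since `t.test()` drops missing values before computing the interval.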

Question 2 - Single Sample t-confidence interval

  1. Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.
# Calculate and store t-critical value for 95% confidence
abs(qt(p = 0.025, df = 972))
## [1] 1.962408
t95<-abs(qt(p = 0.025, df = 972))

# Calculate the lower bound of a 95% C.I.
wtgmean - t95*SE
## [1] 29.42985
# Calculate the upper bound of a 95% C.I.
wtgmean + t95*SE
## [1] 31.22174
  2. How does that confidence interval compare to the one in Question #1?

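The 95% confidence interval (29.43, 31.22) is wider than the 90% interval (29.57, 31.08) because a higher confidence level requires a larger t-critical value (1.96 versus 1.65). Both intervals are centered on the same sample mean, and the difference in width is small: about 1.79 pounds versus 1.50 pounds.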
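To see how the width grows with the confidence level, the sketch below recomputes the interval width from the standard error and degrees of freedom printed in Question 1. The 99% level is added purely for illustration and is not part of the assignment.

```r
# Standard error and degrees of freedom taken from the Question 1 output
se <- 0.456555
df <- 972

conf_levels <- c(0.90, 0.95, 0.99)
t_star <- qt(1 - (1 - conf_levels) / 2, df = df)  # critical value per level
widths <- 2 * t_star * se                          # full width of each interval

print(data.frame(conf_level = conf_levels,
                 t_star = round(t_star, 3),
                 width = round(widths, 2)))
```

Because the standard error is fixed, the width depends only on the critical value, which increases with the confidence level.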

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

  1. Write hypotheses

\(H_0: \mu_{ncbw} = 7.7\)

\(H_A: \mu_{ncbw} \neq 7.7\)

  2. Test by p-value and decision
# Sample statistics (sample mean, standard deviation, and size)

# Find and store mean of birth weights
mean(ncbirths$weight, na.rm = TRUE)
## [1] 7.101
birthwtmean <- mean(ncbirths$weight, na.rm = TRUE)

# Find and store standard deviation of birth weights
sd(ncbirths$weight, na.rm = TRUE)
## [1] 1.50886
birthwtsd <- sd(ncbirths$weight, na.rm = TRUE)

# Find and store sample size of birth weights
table(is.na(ncbirths$weight))
## 
## FALSE 
##  1000
birthwtsample<-1000

The mean of the NC babies’ birth weights is 7.10, with a standard deviation of 1.51, and a sample size of 1000.

# Test statistic
(birthwtmean-7.7)/(birthwtsd/sqrt(birthwtsample))
## [1] -12.55388

The test statistic for the NC babies’ birthweights is -12.55.

# Two-sided p-value for the test statistic
pt(q = -12.55, df = 999)*2
## [1] 1.184446e-33

The two-sided p-value is about 1.18e-33, which is far below the significance level of 0.05.

  3. Conclusion

There is sufficient evidence that the mean birth weight of NC babies differs from the European average of 7.7 lbs. We reject the null hypothesis in favor of the alternative hypothesis.
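The whole test can be reproduced from the three summary statistics. This is a minimal sketch using the rounded values printed above; the variable names are illustrative.

```r
# Summary statistics printed above and the hypothesized mean
xbar <- 7.101
s <- 1.50886
n <- 1000
mu0 <- 7.7

t_stat <- (xbar - mu0) / (s / sqrt(n))        # one-sample t statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
reject <- p_value < 0.05                      # decision at the 0.05 level

print(c(t = t_stat, p = p_value))
print(reject)
```

Using `-abs(t_stat)` makes the two-sided calculation correct whether the statistic is negative or positive. With the raw data, `t.test(ncbirths$weight, mu = 7.7)` performs the same test in one call.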

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

  1. Write hypotheses

\(H_0 : \mu_{fage} - \mu_{mage} = 0\)

\(H_A : \mu_{fage} - \mu_{mage} \neq 0\)

  2. Test by confidence interval or p-value and decision
# Create a new column with the father's age minus the mother's age
ncbirths$diff <- ncbirths$fage - ncbirths$mage

# Calculate and store mean of age difference
mean(ncbirths$diff, na.rm = TRUE)
## [1] 2.652593
agemean <- mean(ncbirths$diff, na.rm = TRUE)

# Calculate and store standard deviation of age difference
sd(ncbirths$diff, na.rm = TRUE)
## [1] 4.321604
agesd <- sd(ncbirths$diff, na.rm = TRUE)

# Find the sample size of age difference
table(is.na(ncbirths$diff))
## 
## FALSE  TRUE 
##   829   171

The mean of the age difference is 2.65, with a standard deviation of 4.32, and a sample size of 829.

# Calculate and store t-critical value for 95% confidence 
abs(qt(p = 0.025, df = 828))
## [1] 1.962833
tcv<-abs(qt(p = 0.025, df = 828))

# Calculate and store the standard error (stored as ME; the margin of error is tcv * ME)
agesd/sqrt(829)
## [1] 0.1500955
ME<-agesd/sqrt(829)

# Calculate the lower bound of a 95% C.I.
agemean - tcv * ME
## [1] 2.357981
# Calculate the upper bound of a 95% C.I.
agemean + tcv * ME
## [1] 2.947206

The t-critical value (tcv) is 1.96 for a 95% C.I. The standard error (stored as ME) is 0.15, which gives a margin of error of about 0.29. The lower bound is 2.36 and the upper bound is 2.95.

  3. Conclusion

There is sufficient evidence of a significant age difference between the fathers and mothers of NC babies: fathers are about 2.65 years older on average. We reject the null hypothesis because 0 does not lie within the 95% confidence interval: \(2.36 < \mu_{fage} - \mu_{mage} < 2.95\)
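A paired t-test is a one-sample t-test on the within-pair differences, which is why the analysis above works on a single difference column. The sketch below checks this on a small made-up data set; the ages are hypothetical and are not taken from ncbirths.

```r
# Hypothetical ages for six couples; NA marks a missing father's age
father_age <- c(31, 28, NA, 40, 25, 33)
mother_age <- c(29, 27, 22, 35, 26, 30)

# Paired test on the two columns (incomplete pairs are dropped)
paired <- t.test(father_age, mother_age, paired = TRUE)

# One-sample test on the differences
one_sample <- t.test(father_age - mother_age, mu = 0)

print(c(paired = unname(paired$statistic),
        one_sample = unname(one_sample$statistic)))
```

Both calls drop the couple with the missing age, so the degrees of freedom are the number of complete pairs minus one.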

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

  1. Write hypotheses

\(H_0: \mu_{smokers} - \mu_{nonsmokers} = 0\)

\(H_A: \mu_{smokers} - \mu_{nonsmokers} \neq 0\)

  2. Test by confidence interval or p-value and decision
# Create a subset of Smokers and Nonsmokers
Smokers<- subset(ncbirths, ncbirths$habit == "smoker")
Nonsmokers<-subset(ncbirths, ncbirths$habit == "nonsmoker")

# Find and store the mean difference in length of pregnancy
mean(Smokers$weeks, na.rm = TRUE) - mean(Nonsmokers$weeks, na.rm = TRUE)
## [1] 0.1256371
Meandiff<-mean(Smokers$weeks, na.rm = TRUE) - mean(Nonsmokers$weeks, na.rm = TRUE) 

# Find sample size of Smokers
table(is.na(Smokers$weeks))
## 
## FALSE 
##   126
# Find sample size of Nonsmokers
table(is.na(Nonsmokers$weeks))
## 
## FALSE  TRUE 
##   872     1
# Find and store standard error of weeks
sqrt((sd(Smokers$weeks, na.rm = TRUE)^2/126)+(sd(Nonsmokers$weeks, na.rm = TRUE)^2/872))
## [1] 0.2420771
SEdiff<-sqrt((sd(Smokers$weeks, na.rm = TRUE)^2/126)+(sd(Nonsmokers$weeks, na.rm = TRUE)^2/872))

# Test Statistic
(Meandiff-0)/SEdiff
## [1] 0.5189962
# Find the two-sided p-value, using the conservative df = min(126, 872) - 1 = 125
pt(-0.5189962, df = 125)*2
## [1] 0.604681

The difference in mean pregnancy length (smokers minus nonsmokers) is 0.13 weeks, with a standard error of 0.24. The test statistic is 0.52 and the two-sided p-value is 0.60. There are 126 smokers and 872 nonsmokers with a recorded pregnancy length.

  3. Conclusion

There is not sufficient evidence to show that there is a significant difference in pregnancy lengths based upon whether or not the mother is a smoker. The p-value is greater than 0.05, so we fail to reject the null hypothesis.
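The p-value above uses the conservative degrees of freedom, min(n1, n2) - 1. R's `t.test()` instead uses the Welch-Satterthwaite approximation by default. The sketch below compares the two on simulated stand-in data: the group sizes match the output above, but the means and standard deviations are hypothetical, and `welch_df` is an illustrative helper name.

```r
# Welch-Satterthwaite degrees of freedom from group SDs and sample sizes
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

# Simulated stand-in data (hypothetical means and SDs, not the real weeks values)
set.seed(42)
smokers <- rnorm(126, mean = 38.4, sd = 2.8)
nonsmokers <- rnorm(872, mean = 38.3, sd = 2.9)

df_welch <- welch_df(sd(smokers), length(smokers),
                     sd(nonsmokers), length(nonsmokers))
df_conservative <- min(length(smokers), length(nonsmokers)) - 1
df_builtin <- unname(t.test(smokers, nonsmokers)$parameter)  # Welch by default

print(c(welch = df_welch, builtin = df_builtin, conservative = df_conservative))
```

The conservative choice never exceeds the Welch value, so it gives a slightly larger p-value and a slightly wider interval.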

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

  1. Write hypotheses

\(H_0: \mu_{younger} = \mu_{mature}\)

\(H_A: \mu_{younger} > \mu_{mature}\)

  2. Test by confidence interval or p-value and decision
# Create a subset of Younger and Mature moms
Mature<-subset(ncbirths, ncbirths$mature == "mature mom")
Younger<-subset(ncbirths, ncbirths$mature == "younger mom")

# Find and store the mean difference of weights
mean(Younger$weight) - mean(Mature$weight)
## [1] -0.02833208
MeandiffW<-mean(Younger$weight) - mean(Mature$weight)

# Find the sample size of Younger moms' weights
table(is.na(Younger$weight))
## 
## FALSE 
##   867
# Find the sample size of Mature moms' weights
table(is.na(Mature$weight))
## 
## FALSE 
##   133
# Find and store the standard error for weight
sqrt((sd(Mature$weight)^2/133)+(sd(Younger$weight)^2/867))
## [1] 0.1524501
SEW<-sqrt((sd(Mature$weight)^2/133)+(sd(Younger$weight)^2/867))

# Test Statistic
(MeandiffW - 0)/SEW
## [1] -0.1858449
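# Find the two-sided p-value
pt(-0.1858449, df=132)*2
## [1] 0.8528517

The difference in mean weight between younger and mature moms is -0.03 lbs, with a standard error of 0.15, and the test statistic is -0.19. The code above reports the two-sided p-value (0.85). Because the alternative hypothesis is one-sided (younger > mature), the relevant p-value is the upper-tail probability, 1 - 0.85/2, or about 0.57. There are 867 younger moms and 133 mature moms. Note that weight is the baby's birth weight; the mother's weight gain is stored in gained, so these numbers compare birth weights rather than weight gained.

  3. Conclusion

There is not sufficient evidence that the mean weight for younger mothers is greater than the mean weight for mature mothers. The one-sided p-value is greater than 0.05, so we fail to reject the null hypothesis.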
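For a one-sided alternative of the form "greater than", the p-value is the upper-tail probability of the test statistic rather than twice the lower tail. The sketch below computes both from the statistic and degrees of freedom printed above. The optional second part, which runs only if the openintro package is installed, shows one way the same one-sided test could be run on the gained variable; the variable names are illustrative.

```r
# Test statistic and conservative degrees of freedom from the output above
t_stat <- -0.1858449
df <- 132

p_two_sided <- 2 * pt(-abs(t_stat), df = df)          # H_A: means differ
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)  # H_A: younger > mature

print(round(c(two_sided = p_two_sided, greater = p_greater), 2))

# Optional: one-sided Welch test on mothers' weight gain, if openintro is available
if (requireNamespace("openintro", quietly = TRUE)) {
  births <- openintro::ncbirths
  younger_gain <- births$gained[births$mature == "younger mom"]
  mature_gain <- births$gained[births$mature == "mature mom"]
  print(t.test(younger_gain, mature_gain, alternative = "greater"))
}
```

`t.test()` removes missing values from each group and uses Welch degrees of freedom, so its p-value will differ slightly from a hand calculation with the conservative df.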

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

# Create a subset of baby weights
Low<-subset(ncbirths, ncbirths$lowbirthweight == "low")
Notlow<-subset(ncbirths, ncbirths$lowbirthweight == "not low")

# Find summary for Low weight
summary(Low$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.095   4.560   4.035   5.160   5.500
# Find summary for Notlow weight
summary(Notlow$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.560   6.750   7.440   7.484   8.130  11.750

The heaviest baby classified as “low” weighs 5.50 lbs and the lightest baby classified as “not low” weighs 5.56 lbs, so the cutoff lies between 5.50 and 5.56 lbs. This is consistent with the conventional low-birth-weight threshold of 2,500 g (about 5.5 lbs). Because the summary() function reports the minimum and maximum of each category and the two ranges do not overlap, these values reliably bracket the cutoff.
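The same bracketing idea can be written as a single grouped summary. The sketch below uses a hypothetical miniature data set with the same two column names as ncbirths; the weights are illustrative, chosen so that the boundary values match the ones reported above.

```r
# Hypothetical miniature data set (illustrative values, not rows from ncbirths)
toy <- data.frame(
  weight = c(4.2, 5.5, 5.56, 7.1, 3.9, 8.0),
  lowbirthweight = factor(c("low", "low", "not low", "not low", "low", "not low"))
)

# Minimum and maximum weight within each category
ranges <- sapply(split(toy$weight, toy$lowbirthweight), range)
rownames(ranges) <- c("min", "max")
print(ranges)

# The cutoff lies between the heaviest "low" and the lightest "not low" baby
cutoff_interval <- c(ranges["max", "low"], ranges["min", "not low"])

# Conventional 2,500 g threshold expressed in pounds (1 lb = 453.59237 g)
threshold_lbs <- 2500 / 453.59237
print(threshold_lbs)
```

On the real data, replacing `toy` with `ncbirths` gives the same table in one step instead of two subsets and two summaries.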

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

  1. Question

Is there a difference in the number of weeks white moms and not white moms carry a pregnancy?

  2. Write hypotheses

\(H_0 : \mu_{white} - \mu_{notwhite} = 0\)

\(H_A : \mu_{white} - \mu_{notwhite} \neq 0\)

  3. Test by confidence interval or p-value and decision
# Create a subset for white moms and not white moms
White<-subset(ncbirths, ncbirths$whitemom == "white")
Notwhite<-subset(ncbirths, ncbirths$whitemom == "not white")

# Find and store the mean difference of weeks carried
mean(White$weeks, na.rm = TRUE) - mean(Notwhite$weeks)
## [1] 0.6358799
MD<-mean(White$weeks, na.rm = TRUE) - mean(Notwhite$weeks)

# Find the sample size for white moms and not white moms
table(is.na(White$weeks))
## 
## FALSE  TRUE 
##   712     2
table(is.na(Notwhite$weeks))
## 
## FALSE 
##   284
# Find and store the t-critical value for a 95% C.I.
abs(qt(p = 0.025, df = 711))
## [1] 1.963306
tcrit<-abs(qt(p = 0.025, df = 711))
# Calculate and store the standard error of the difference
sqrt((sd(White$weeks, na.rm = TRUE)^2)/712 + (sd(Notwhite$weeks)^2/284))
## [1] 0.2380229
ER<-sqrt((sd(White$weeks, na.rm = TRUE)^2)/712 + (sd(Notwhite$weeks)^2/284))

# Calculate the lower bound of a 95% C.I. (mean difference minus t-critical value times standard error)
round(MD - tcrit*ER, 2)
## [1] 0.17
# Calculate the upper bound of a 95% C.I.
round(MD + tcrit*ER, 2)
## [1] 1.1

The difference in mean pregnancy length (white moms minus not white moms) is 0.64 weeks, with a standard error of 0.24 and a t-critical value of 1.96. The sample size is 712 for white moms and 284 for not white moms.

  4. Conclusion

There is sufficient evidence of a significant difference between the mean number of weeks a white mom and a not white mom carry a pregnancy. We reject the null hypothesis because 0 does not lie within the 95% confidence interval: \(0.17 < \mu_{white} - \mu_{notwhite} < 1.10\)
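As a cross-check, the comparison can also be run as a p-value test using only the summary numbers printed above. This minimal sketch assumes the conservative degrees of freedom, min(n1, n2) - 1, as in Questions 5 and 6; the variable names are illustrative.

```r
# Mean difference, standard error, and group sizes from the output above
md <- 0.6358799
se <- 0.2380229
n_white <- 712
n_notwhite <- 284

t_stat <- md / se                                # test statistic for H_0: difference = 0
df_cons <- min(n_white, n_notwhite) - 1          # conservative degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df_cons)    # two-sided p-value
ci <- md + c(-1, 1) * qt(0.975, df = df_cons) * se  # 95% C.I.: estimate +/- t* x SE

print(round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 3))
```

The interval is built as the estimate plus or minus the critical value times the standard error, so the multiplication applies only to the standard error and not to the mean difference.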