In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
library("openintro")
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
##
## cars, trees
ncbirths<-ncbirths
# View the structure of ncbirths
str(ncbirths)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
The ncbirths dataset contains 1000 observations of 13 variables.
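Several variables contain missing values (fage and gained both show NA in the str() output), so it helps to count non-missing values per column before settling on a sample size. A minimal sketch on a small made-up data frame; toy is a hypothetical stand-in, not the real data:

```r
# Hypothetical toy data frame standing in for ncbirths (values are made up)
toy <- data.frame(
  fage   = c(NA, 19, 21, NA, 18),
  gained = c(38, 20, NA, 34, 27),
  weight = c(7.63, 7.88, 6.63, 8.00, 6.38)
)

# Number of missing and non-missing values in each column
n_missing <- colSums(is.na(toy))
n_present <- colSums(!is.na(toy))
print(rbind(n_missing, n_present))
```

The same two colSums() calls should work unchanged on ncbirths, where n_present gives the sample size available for each variable.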
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Find and store the mean of weight gained
mean(ncbirths$gained, na.rm = TRUE)
## [1] 30.3258
wtgmean <- mean(ncbirths$gained, na.rm = TRUE)
# Find and store the standard deviation of weight gained
sd(ncbirths$gained, na.rm = TRUE)
## [1] 14.2413
wtgsd <- sd(ncbirths$gained, na.rm = TRUE)
# Find and store the sample size of weight gained (non-missing values only)
table(is.na(ncbirths$gained))
##
## FALSE TRUE
## 973 27
wtgsample <- sum(!is.na(ncbirths$gained))
# Calculate and store the t-critical value for 90% confidence
abs(qt(p = 0.05, df = wtgsample - 1))
## [1] 1.646423
t90 <- abs(qt(p = 0.05, df = wtgsample - 1))
# Calculate and store the standard error
wtgsd/sqrt(wtgsample)
## [1] 0.456555
SE <- wtgsd/sqrt(wtgsample)
The mean of weight gained is 30.33, with a standard deviation of 14.24 and a sample size of 973.
Bounds of the confidence interval: \(\bar{x} \pm t^{*}_{df} \frac{s}{\sqrt{n}}\)
# Calculate the lower bound of a 90% C.I.
wtgmean - t90*SE
## [1] 29.57411
# Calculate the upper bound of a 90% C.I.
wtgmean + t90*SE
## [1] 31.07748
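The lower bound of the 90% C.I. is 29.57 and the upper bound is 31.08. We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 and 31.08 pounds.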
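The bounds follow the pattern of estimate plus or minus critical value times standard error. A minimal sketch that wraps those steps into one function, assuming only the summary statistics printed above; t_ci is a hypothetical helper name, not part of openintro:

```r
# Hypothetical helper: t confidence interval from summary statistics
t_ci <- function(xbar, s, n, conf = 0.90) {
  se    <- s / sqrt(n)                         # standard error of the mean
  tcrit <- qt(1 - (1 - conf) / 2, df = n - 1)  # two-sided critical value
  c(lower = xbar - tcrit * se, upper = xbar + tcrit * se)
}

# Summary statistics reported above for weight gained
ci90 <- t_ci(xbar = 30.3258, s = 14.2413, n = 973, conf = 0.90)
print(round(ci90, 2))
```

Changing conf to 0.95 reuses the same mean, standard deviation, and sample size with a larger critical value.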
# Calculate and store t-critical value for 95% confidence
abs(qt(p = 0.025, df = 972))
## [1] 1.962408
t95<-abs(qt(p = 0.025, df = 972))
# Calculate the lower bound of a 95% C.I.
wtgmean - t95*SE
## [1] 29.42985
# Calculate the upper bound of a 95% C.I.
wtgmean + t95*SE
## [1] 31.22174
The 95% confidence interval (29.43 to 31.22) is wider than the 90% confidence interval (29.57 to 31.08), but the difference between the two is small.
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
\(H_0: \mu_{ncbw} = 7.7\)
\(H_A: \mu_{ncbw} \neq 7.7\)
# Sample statistics (sample mean, standard deviation, and size)
# Find and store mean of birth weights
mean(ncbirths$weight, na.rm = TRUE)
## [1] 7.101
birthwtmean <- mean(ncbirths$weight, na.rm = TRUE)
# Find and store standard deviation of birth weights
sd(ncbirths$weight, na.rm = TRUE)
## [1] 1.50886
birthwtsd <- sd(ncbirths$weight, na.rm = TRUE)
# Find and store sample size of birth weights (non-missing values only)
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
birthwtsample <- sum(!is.na(ncbirths$weight))
The mean of the NC babies’ birth weights is 7.10, with a standard deviation of 1.51, and a sample size of 1000.
# Test statistic
(birthwtmean-7.7)/(birthwtsd/sqrt(birthwtsample))
## [1] -12.55388
The test statistic for the NC babies’ birthweights is -12.55.
# Two-sided p-value for the test statistic
pt(q = -12.55, df = 999)*2
## [1] 1.184446e-33
The p-value, about 1.18e-33, is far below the significance level of 0.05.
There is sufficient evidence that the average birth weight of NC babies differs from the European average of 7.7 lbs. We reject the null hypothesis in favor of the alternative hypothesis.
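The test statistic and p-value can be reproduced from the three summary statistics alone. A minimal sketch, assuming the rounded values printed above; one_sample_t is a hypothetical helper name:

```r
# Hypothetical helper: two-sided one-sample t-test from summary statistics
one_sample_t <- function(xbar, s, n, mu0) {
  se     <- s / sqrt(n)                       # standard error of the mean
  t_stat <- (xbar - mu0) / se                 # distance from mu0 in SE units
  p_val  <- 2 * pt(-abs(t_stat), df = n - 1)  # area in both tails
  list(t = t_stat, df = n - 1, p.value = p_val)
}

res <- one_sample_t(xbar = 7.101, s = 1.50886, n = 1000, mu0 = 7.7)
print(res$t)
print(res$p.value < 0.05)
```

With the raw data available, a call such as t.test(ncbirths$weight, mu = 7.7) would be expected to report the same statistic.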
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
\(H_0: \mu_{fage} - \mu_{mage} = 0\)
\(H_A: \mu_{fage} - \mu_{mage} \neq 0\)
# Create a new column of the difference between fathers' and mothers' ages (father minus mother)
ncbirths$diff <- (ncbirths$fage) - (ncbirths$mage)
# Calculate and store mean of age difference
mean(ncbirths$diff, na.rm = TRUE)
## [1] 2.652593
agemean <- mean(ncbirths$diff, na.rm = TRUE)
# Calculate and store standard deviation of age difference
sd(ncbirths$diff, na.rm = TRUE)
## [1] 4.321604
agesd <- sd(ncbirths$diff, na.rm = TRUE)
# Find the sample size of age difference
table(is.na(ncbirths$diff))
##
## FALSE TRUE
## 829 171
The mean of the age difference is 2.65, with a standard deviation of 4.32, and a sample size of 829.
# Calculate and store t-critical value for 95% confidence
abs(qt(p = 0.025, df = 828))
## [1] 1.962833
tcv<-abs(qt(p = 0.025, df = 828))
# Calculate and store the standard error
agesd/sqrt(829)
## [1] 0.1500955
SEage <- agesd/sqrt(829)
# Calculate the lower bound of a 95% C.I.
agemean - tcv * SEage
## [1] 2.357981
# Calculate the upper bound of a 95% C.I.
agemean + tcv * SEage
## [1] 2.947206
The t-critical value (tcv) is 1.96 for a 95% C.I. The standard error is 0.15, which gives a margin of error of about 0.29. The lower bound is 2.36 and the upper bound is 2.95.
There is sufficient evidence of a significant age difference between the fathers and mothers of NC babies. We reject the null hypothesis because 0 does not lie within the 95% confidence interval: \(2.36 < \mu_{fage} - \mu_{mage} < 2.95\). On average, fathers are between 2.36 and 2.95 years older than mothers.
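Because each father is matched to one mother, this is a paired analysis: a one-sample interval on the differences. A minimal sketch on made-up ages (not taken from ncbirths) comparing the manual interval with the built-in t.test() using paired = TRUE:

```r
# Made-up ages for five couples (not taken from ncbirths)
father <- c(31, 25, 40, 28, 35)
mother <- c(29, 26, 36, 27, 30)

# Manual 95% interval on the paired differences
d     <- father - mother
se_d  <- sd(d) / sqrt(length(d))
tcrit <- qt(0.975, df = length(d) - 1)
manual_ci <- c(mean(d) - tcrit * se_d, mean(d) + tcrit * se_d)

# Built-in paired t-test on the same vectors
builtin <- t.test(father, mother, paired = TRUE, conf.level = 0.95)

print(manual_ci)
print(builtin$conf.int)
```

The two intervals agree, which is why working with a single column of differences is equivalent to running a paired test.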
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
\(H_0: \mu_{smokers} - \mu_{nonsmokers} = 0\)
\(H_A: \mu_{smokers} - \mu_{nonsmokers} \neq 0\)
# Create a subset of Smokers and Nonsmokers
Smokers<- subset(ncbirths, ncbirths$habit == "smoker")
Nonsmokers<-subset(ncbirths, ncbirths$habit == "nonsmoker")
# Find and store the mean difference in length of pregnancy
mean(Smokers$weeks, na.rm = TRUE) - mean(Nonsmokers$weeks, na.rm = TRUE)
## [1] 0.1256371
Meandiff<-mean(Smokers$weeks, na.rm = TRUE) - mean(Nonsmokers$weeks, na.rm = TRUE)
# Find sample size of Smokers
table(is.na(Smokers$weeks))
##
## FALSE
## 126
# Find sample size of Nonsmokers
table(is.na(Nonsmokers$weeks))
##
## FALSE TRUE
## 872 1
# Find and store standard error of weeks
sqrt((sd(Smokers$weeks, na.rm = TRUE)^2/126)+(sd(Nonsmokers$weeks, na.rm = TRUE)^2/872))
## [1] 0.2420771
SEdiff<-sqrt((sd(Smokers$weeks, na.rm = TRUE)^2/126)+(sd(Nonsmokers$weeks, na.rm = TRUE)^2/872))
# Test Statistic
(Meandiff-0)/SEdiff
## [1] 0.5189962
# Find the p-value
pt(-0.5189962, df = 125)*2
## [1] 0.604681
The mean pregnancy length of Smokers minus that of Nonsmokers is 0.13 weeks, with a standard error of 0.24. The test statistic is 0.52 and the two-sided p-value is 0.60, using 125 degrees of freedom (the smaller sample size minus 1). There are 126 Smokers and 872 Nonsmokers.
There is not sufficient evidence of a significant difference in pregnancy length between mothers who smoke and mothers who do not. The p-value is greater than 0.05, so we fail to reject the null hypothesis.
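The p-value depends only on the difference, its standard error, and the degrees of freedom. A minimal sketch using the values printed above; two_sided_p is a hypothetical helper name, and df = 125 is the conservative choice of the smaller sample size minus 1:

```r
# Hypothetical helper: two-sided p-value from a difference and its standard error
two_sided_p <- function(diff, se, df) {
  t_stat <- diff / se              # test statistic under H0: difference = 0
  2 * pt(-abs(t_stat), df = df)    # area in both tails
}

# Values printed above; df = min(126, 872) - 1 = 125
p_smoke <- two_sided_p(diff = 0.1256371, se = 0.2420771, df = 125)
print(round(p_smoke, 2))
```

R's t.test() defaults to Welch's approximation for the degrees of freedom, so its p-value may differ slightly from this conservative version.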
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
\(H_0: \mu_{younger} = \mu_{mature}\)
\(H_A: \mu_{younger} > \mu_{mature}\)
# Create a subset of Younger and Mature moms
Mature<-subset(ncbirths, ncbirths$mature == "mature mom")
Younger<-subset(ncbirths, ncbirths$mature == "younger mom")
# Find and store the mean difference of weights
mean(Younger$weight) - mean(Mature$weight)
## [1] -0.02833208
MeandiffW<-mean(Younger$weight) - mean(Mature$weight)
# Find the sample size of Younger moms' weights
table(is.na(Younger$weight))
##
## FALSE
## 867
# Find the sample size of Mature moms' weights
table(is.na(Mature$weight))
##
## FALSE
## 133
# Find and store the standard error for weight
sqrt((sd(Mature$weight)^2/133)+(sd(Younger$weight)^2/867))
## [1] 0.1524501
SEW<-sqrt((sd(Mature$weight)^2/133)+(sd(Younger$weight)^2/867))
# Test Statistic
(MeandiffW - 0)/SEW
## [1] -0.1858449
# Find the one-sided (upper-tail) p-value, matching H_A: younger > mature
round(pt(-0.1858449, df = 132, lower.tail = FALSE), 2)
## [1] 0.57
The mean difference in weight between the Younger and Mature groups is -0.03 with a standard error of 0.15. The test statistic is -0.19 and the one-sided p-value is 0.57. There are 867 Younger moms and 133 Mature moms.
There is not sufficient evidence that the average for younger mothers is greater than the average for mature mothers. The p-value is greater than 0.05, so we fail to reject the null hypothesis. Because the chunks above use the weight column (the baby's birth weight) rather than gained, this conclusion is about birth weight; it does not yet address the weight gained by the mother.
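For a one-sided alternative such as younger greater than mature, only the upper tail counts toward the p-value, and the variable tested should be the one named in the question (gained for weight gained). A minimal sketch on simulated values, which are made up and not the ncbirths data, showing how the alternative argument of t.test() changes the p-value:

```r
set.seed(1)
# Simulated weight-gain values (made up; not the ncbirths data)
younger <- rnorm(40, mean = 31, sd = 14)
mature  <- rnorm(25, mean = 29, sd = 13)

# H_A: mean(younger) > mean(mature), so only the upper tail counts
upper <- t.test(younger, mature, alternative = "greater")
lower <- t.test(younger, mature, alternative = "less")
both  <- t.test(younger, mature, alternative = "two.sided")

print(c(greater = upper$p.value, less = lower$p.value, two.sided = both$p.value))
```

The two one-sided p-values sum to 1, and the two-sided p-value is twice the smaller of them. On the real data, a call such as t.test(gained ~ mature, data = ncbirths, alternative = "less") would be one way to run the test, since the factor levels are ordered "mature mom" then "younger mom" and the formula interface takes the first level minus the second; treat that call as an untested assumption here.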
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
# Create a subset of baby weights
Low<-subset(ncbirths, ncbirths$lowbirthweight == "low")
Notlow<-subset(ncbirths, ncbirths$lowbirthweight == "not low")
# Find summary for Low weight
summary(Low$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.095 4.560 4.035 5.160 5.500
# Find summary for Notlow weight
summary(Notlow$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.560 6.750 7.440 7.484 8.130 11.750
The largest weight in the “low” group is 5.50 lbs and the smallest weight in the “not low” group is 5.56 lbs, so the cutoff lies between 5.50 and 5.56 lbs: babies at or below 5.50 lbs are classified as “low” and babies at or above 5.56 lbs as “not low”. The summary() function reports the minimum and maximum of each group, which is why it pins down the boundary between the two categories.
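The same bracketing can be done in one step by splitting the weights by label and taking the range of each group. A minimal sketch on a made-up data frame; toy and its values are hypothetical:

```r
# Made-up birth weights and labels (not the ncbirths data)
toy <- data.frame(
  weight = c(4.2, 5.5, 5.56, 6.9, 3.1, 7.4),
  label  = factor(c("low", "low", "not low", "not low", "low", "not low"))
)

# Minimum and maximum weight within each label, as a named list
ranges <- lapply(split(toy$weight, toy$label), range)
print(ranges)

# The cutoff must lie between the largest "low" and the smallest "not low" weight
bracket <- c(max_low = ranges[["low"]][2], min_not_low = ranges[["not low"]][1])
print(bracket)
```

If the two groups ever overlapped, max_low would exceed min_not_low, which would signal that a single weight cutoff does not define the labels.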
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Is there a difference in the number of weeks white moms and not white moms carry a pregnancy?
\(H_0: \mu_{white} - \mu_{notwhite} = 0\)
\(H_A: \mu_{white} - \mu_{notwhite} \neq 0\)
# Create a subset for white moms and not white moms
White<-subset(ncbirths, ncbirths$whitemom == "white")
Notwhite<-subset(ncbirths, ncbirths$whitemom == "not white")
# Find and store the mean difference of weeks carried
mean(White$weeks, na.rm = TRUE) - mean(Notwhite$weeks)
## [1] 0.6358799
MD<-mean(White$weeks, na.rm = TRUE) - mean(Notwhite$weeks)
# Find the sample size for white moms and not white moms
table(is.na(White$weeks))
##
## FALSE TRUE
## 712 2
table(is.na(Notwhite$weeks))
##
## FALSE
## 284
# Find and store the t-critical value for a 95% C.I.
abs(qt(p = 0.025, df = 711))
## [1] 1.963306
tcvW <- abs(qt(p = 0.025, df = 711))
# Calculate and store the standard error
sqrt((sd(White$weeks, na.rm = TRUE)^2)/712 + (sd(Notwhite$weeks)^2/284))
## [1] 0.2380229
ER <- sqrt((sd(White$weeks, na.rm = TRUE)^2)/712 + (sd(Notwhite$weeks)^2/284))
# Calculate the lower bound of a 95% C.I.
round(MD - tcvW*ER, 2)
## [1] 0.17
# Calculate the upper bound of a 95% C.I.
round(MD + tcvW*ER, 2)
## [1] 1.1
The mean difference in pregnancy weeks between white moms and not white moms is 0.64, with a standard error of 0.24 and a t-critical value of 1.96. The sample size is 712 for white moms and 284 for not white moms.
There is sufficient evidence of a significant difference between the number of weeks a white mom and a not white mom carry a pregnancy. We reject the null hypothesis because 0 does not lie within the 95% confidence interval: \(0.17 < \mu_{white} - \mu_{notwhite} < 1.10\)
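In a confidence interval for a difference in means, the critical value multiplies only the standard error, and that product is then added to and subtracted from the point estimate. A minimal sketch using the difference, standard error, and degrees of freedom printed above; diff_ci is a hypothetical helper name:

```r
# Hypothetical helper: t interval for a difference in means
diff_ci <- function(diff, se, df, conf = 0.95) {
  tcrit  <- qt(1 - (1 - conf) / 2, df = df)  # two-sided critical value
  margin <- tcrit * se                       # margin of error
  c(lower = diff - margin, upper = diff + margin)
}

# Values printed above for weeks carried (white minus not white)
ci_weeks <- diff_ci(diff = 0.6358799, se = 0.2380229, df = 711)
print(round(ci_weeks, 2))
```

Checking whether 0 falls between lower and upper gives the same decision as a two-sided test at the matching significance level.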