In this project, students will demonstrate their understanding of inference for numerical data using the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Explore it with the str() function, view the data frame, and read its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
str(ncbirths)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
mean_gained <- mean(ncbirths$gained, na.rm=TRUE)
sd_gained <- sd(ncbirths$gained, na.rm=TRUE)
table(is.na(ncbirths$gained))
##
## FALSE TRUE
## 973 27
# Store the number of non-missing observations and the degrees of freedom
n_gained <- sum(!is.na(ncbirths$gained))  # 973, matching the table above
df_gained <- n_gained - 1
# Calculate t-critical value for 90% confidence
t_gained <- qt(0.05, df_gained,lower.tail=FALSE)
t_gained
## [1] 1.646423
Our t-critical value is 1.646423.
# Calculate margin of error
me_gained <- (t_gained*sd_gained) / (sqrt(n_gained))
me_gained
## [1] 0.7516826
The margin of error is 0.7516826.
# Boundaries of confidence interval
mean_gained - me_gained
## [1] 29.57411
mean_gained + me_gained
## [1] 31.07748
We can be 90% confident that the mean weight gained during pregnancy by all North Carolina mothers is between 29.57 and 31.08 pounds.
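The manual steps above can be wrapped in a small helper and checked against `t.test()`. This is only a sketch: the function name `t_ci` is a hypothetical choice, and the vector `x` holds just the first ten values of `gained` shown in the `str()` output, so its interval will not match the full-sample result.

```r
# Hypothetical helper: t-based confidence interval for a mean, ignoring NAs
t_ci <- function(x, conf = 0.90) {
  x <- x[!is.na(x)]                             # drop missing values
  n <- length(x)
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)  # upper critical value
  me <- t_crit * sd(x) / sqrt(n)                # margin of error
  c(lower = mean(x) - me, upper = mean(x) + me)
}

# First ten values of `gained` from the str() output (one is NA)
x <- c(38, 20, 38, 34, 27, 22, 76, 15, NA, 52)

ci_manual  <- t_ci(x, conf = 0.90)
ci_builtin <- t.test(x, conf.level = 0.90)$conf.int

# Both approaches should agree up to floating-point tolerance
all.equal(unname(ci_manual), as.numeric(ci_builtin))
```

Passing `conf.level = 0.90` to `t.test()` is what makes the built-in interval comparable to the manual 90% interval.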
# Find the confidence interval with the t.test function (conf.level defaults to 0.95)
t.test(ncbirths$gained, mu=0, alternative="two.sided")$conf.int
## [1] 29.42985 31.22174
## attr(,"conf.level")
## [1] 0.95
We can be 95% confident that the mean weight gained during pregnancy by all North Carolina mothers is between 29.43 and 31.22 pounds.
The 95% confidence interval is wider than the 90% interval. This is expected: a higher confidence level requires a larger t-critical value, which increases the margin of error.
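To see why the interval widens, it may help to compare the t-critical values directly. The sketch below reuses df = 972 from the weight-gained sample; the vector names `conf_levels` and `t_crit` are illustrative.

```r
# Critical values for several confidence levels with df = 972 (n = 973)
conf_levels <- c(0.90, 0.95, 0.99)
t_crit <- qt(1 - (1 - conf_levels) / 2, df = 972)
names(t_crit) <- conf_levels

# Critical values increase with the confidence level, so the margin of
# error t_crit * sd / sqrt(n) increases too for a fixed sample
t_crit
```

The first entry reproduces the 1.646423 computed earlier for the 90% interval.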
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
\(H_0: \mu_{weight} = 7.7\)
\(H_A: \mu_{weight} \neq 7.7\)
# Sample statistics (sample mean, standard deviation, and size)
mean_weight <- mean(ncbirths$weight)
sd_weight <- sd(ncbirths$weight)
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
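# Store the number of observations and the degrees of freedom
n_weight <- sum(!is.na(ncbirths$weight))  # 1000
# df must come from n_weight (999); the output shown below was computed with n_gained - 1 = 972
df_weight <- n_weight - 1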
# Test statistic
t_weight <-(mean_weight-7.7) / (sd_weight/sqrt(n_weight))
t_weight
## [1] -12.55388
Our test statistic is -12.55388.
# Two-sided p-value: twice the tail area beyond the test statistic
2 * pt(-abs(t_weight), df_weight)
## [1] 1.310452e-33
The p-value is far smaller than 0.05, so we reject the null hypothesis in favor of the alternative hypothesis.
The data provide strong evidence that the mean birth weight of all North Carolina babies differs from 7.7 lbs, the average birth weight of European babies.
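The same one-sample test can be cross-checked against `t.test()`. The sketch below uses only the first ten birth weights visible in the `str()` output, so the numbers differ from the full-sample result; the names `w` and `mu0` are illustrative.

```r
# First ten values of `weight` from the str() output
w <- c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)
mu0 <- 7.7  # hypothesized mean (European average from the prompt)

# Test statistic computed by hand
t_stat <- (mean(w) - mu0) / (sd(w) / sqrt(length(w)))

# Two-sided p-value: using -abs() makes this correct whatever the sign of t
p_manual <- 2 * pt(-abs(t_stat), df = length(w) - 1)

# Built-in equivalent
res <- t.test(w, mu = mu0, alternative = "two.sided")

all.equal(t_stat, unname(res$statistic))
all.equal(p_manual, res$p.value)
```

Doubling `pt(t, df)` without `-abs()` gives the right answer only when the test statistic is negative, as it happens to be for the full dataset.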
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
\(H_0: \mu_{mothers} - \mu_{fathers} = 0\)
\(H_A: \mu_{mothers} - \mu_{fathers} \neq 0\)
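# Calculate the p-value with a paired t-test: each mother's age is matched with the father's age for the same birth (pairs with a missing age are dropped)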
t.test(ncbirths$mage,ncbirths$fage, mu=0, alternative="two.sided", paired=TRUE)$p.value
## [1] 1.504608e-59
The p-value is 1.504608e-59, which is far below 0.05. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.
The data provide strong evidence that the mean ages of North Carolina mothers and fathers differ.
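A paired t-test is equivalent to a one-sample t-test on the within-pair differences, which also makes the handling of missing values explicit. The sketch below uses the first ten `mage` and `fage` values from the `str()` output; only five of those pairs are complete, so this only illustrates the mechanics.

```r
# First ten mother/father ages from the str() output
mage <- c(13, 14, 15, 15, 15, 15, 15, 15, 16, 16)
fage <- c(NA, NA, 19, 21, NA, NA, 18, 17, NA, 20)

# Paired test on the raw vectors (incomplete pairs are dropped)
paired_p <- t.test(mage, fage, paired = TRUE)$p.value

# Same test written as a one-sample test on the differences
keep <- complete.cases(mage, fage)
age_diff <- mage[keep] - fage[keep]
diff_p <- t.test(age_diff, mu = 0)$p.value

all.equal(paired_p, diff_p)
```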
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
\(H_0: \mu_{smokers} - \mu_{nonsmokers} = 0\)
\(H_A: \mu_{smokers} - \mu_{nonsmokers} \neq 0\)
#creating subsets for each category
nc_smokers <- subset(ncbirths, ncbirths$habit=="smoker")
nc_nonsmokers <- subset(ncbirths, ncbirths$habit=="nonsmoker")
#Finding p-value using t.test function, calling on the variable 'weeks' for pregnancy length
t.test(nc_smokers$weeks, nc_nonsmokers$weeks, mu=0, alternative="two.sided", paired=FALSE)$p.value
## [1] 0.6043917
Our p-value is 0.6044, which is greater than the significance level of 0.05. We therefore fail to reject the null hypothesis.
There is not sufficient evidence that the average pregnancy length for North Carolina smokers differs from the average pregnancy length for North Carolina non-smokers.
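With `paired=FALSE` and no `var.equal` argument, `t.test()` runs Welch's unequal-variance test. A minimal sketch of that calculation follows; the two vectors are made-up pregnancy lengths for illustration, not values from ncbirths.

```r
# Made-up pregnancy lengths in weeks (illustrative only)
smoker    <- c(39, 37, 38, 40, 36, 41)
nonsmoker <- c(39, 42, 37, 41, 39, 38, 37, 35)

# Squared standard errors of each group mean
v1 <- var(smoker) / length(smoker)
v2 <- var(nonsmoker) / length(nonsmoker)

# Welch test statistic
t_stat <- (mean(smoker) - mean(nonsmoker)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(smoker) - 1) + v2^2 / (length(nonsmoker) - 1))

p_manual <- 2 * pt(-abs(t_stat), df = df_welch)

# Built-in equivalent (var.equal = FALSE is the default)
res <- t.test(smoker, nonsmoker)

all.equal(df_welch, unname(res$parameter))
all.equal(p_manual, res$p.value)
```

The degrees of freedom are generally not an integer here, which is why the reported df from `t.test()` can look unusual.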
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
\(H_0: \mu_{younger} = \mu_{mature}\)
\(H_A: \mu_{younger} > \mu_{mature}\)
#creating subsets for each category
nc_younger <- subset(ncbirths, ncbirths$mature=="younger mom")
nc_mature <- subset(ncbirths, ncbirths$mature=="mature mom")
#Finding p-value using t.test function
t.test(nc_younger$gained, nc_mature$gained, mu=0, alternative="greater", paired=FALSE)$p.value
## [1] 0.0852137
Our p-value is 0.0852. Because this is greater than the significance level of 0.05, we fail to reject the null hypothesis.
There is not sufficient evidence to conclude that younger North Carolina mothers gain more weight on average than mature North Carolina mothers.
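For a one-sided test, the direction of `alternative` refers to the first argument minus the second, so the order of the two vectors matters. The sketch below shows how the one-sided and two-sided p-values relate; both vectors are made-up weight gains, not values from ncbirths.

```r
# Made-up weight gains in pounds (illustrative only)
younger <- c(38, 20, 38, 34, 27, 22, 40, 31)
mature  <- c(30, 25, 28, 35, 22, 27)

p_two     <- t.test(younger, mature, alternative = "two.sided")$p.value
p_greater <- t.test(younger, mature, alternative = "greater")$p.value
p_less    <- t.test(younger, mature, alternative = "less")$p.value

# The two one-sided p-values are complementary tail areas
all.equal(p_greater + p_less, 1)

# The smaller one-sided p-value is half the two-sided p-value
all.equal(min(p_greater, p_less), p_two / 2)
```

Swapping the argument order to `t.test(mature, younger, ...)` would require `alternative = "less"` to test the same hypothesis.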
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
# Creating subsets for each category
nc_low <- subset(ncbirths, ncbirths$lowbirthweight=="low")
nc_notlow <- subset(ncbirths, ncbirths$lowbirthweight=="not low")
# Find the minimum and maximum weight in each subset using summary()
summary(nc_low$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.095 4.560 4.035 5.160 5.500
summary(nc_notlow$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.560 6.750 7.440 7.484 8.130 11.750
The maximum weight in the “low birthweight” subset is 5.50 pounds. The minimum weight in the “not low birthweight” subset is 5.56 pounds. The cutoff therefore lies between these two values and is most likely 5.50 pounds: a weight of 5.50 pounds or less is “low” and anything above is “not low.” Because the two ranges do not overlap, a single cutoff classifies every baby in the dataset correctly.
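The same comparison can be done without creating separate subsets by splitting the weights by class. The sketch below uses a toy data frame named `births` (a hypothetical name) built from the first ten rows of the `str()` output, so its boundary values differ from the full-data 5.50 and 5.56.

```r
# Toy data: first ten weight / lowbirthweight values from the str() output
births <- data.frame(
  weight = c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94),
  lowbirthweight = factor(c("not low", "not low", "not low", "not low", "not low",
                            "low", "not low", "low", "not low", "not low"))
)

# Split weights by class, then take the boundary value on each side
by_class <- split(births$weight, births$lowbirthweight)
max_low     <- max(by_class[["low"]])
min_not_low <- min(by_class[["not low"]])

# The cutoff must lie in the gap between these two values
c(max_low = max_low, min_not_low = min_not_low)

# If the ranges do not overlap, one cutoff separates the classes perfectly
max_low < min_not_low
```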
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Is there a significant difference in the number of hospital visits during pregnancy between mothers of premature babies and mothers of full-term babies?
\(H_0: \mu_{premature} - \mu_{fullterm} = 0\)
\(H_A: \mu_{premature} - \mu_{fullterm} \neq 0\)
# Creating subsets for each category
nc_premie <- subset(ncbirths, ncbirths$premie=="premie")
nc_full <- subset(ncbirths, ncbirths$premie=="full term")
# Calculate the p-value using the t.test function
t.test(nc_premie$visits, nc_full$visits, mu=0, alternative="two.sided", paired=FALSE)$p.value
## [1] 0.000108382
Our p-value for this test is 0.0001084, which is far less than 0.05. We can therefore reject the null hypothesis in favor of the alternative hypothesis.
There is strong evidence that the mean number of hospital visits differs between North Carolina mothers of premature babies and North Carolina mothers of full-term babies.
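Since the prompt also allows a confidence interval, the size of the difference could be estimated with the same `t.test()` call. The sketch below uses the formula interface on a toy data frame named `visits_df`; its name and values are made up for illustration.

```r
# Made-up visit counts (illustrative only)
visits_df <- data.frame(
  visits = c(10, 15, 11, 6, 9, 19, 12, 9, 13, 5, 8, 7),
  premie = factor(c(rep("full term", 9), rep("premie", 3)))
)

# Formula interface: groups follow the factor level order,
# so the interval is for mean(full term) - mean(premie)
res <- t.test(visits ~ premie, data = visits_df)
res$estimate
res$conf.int

# Equivalent call with two vectors in the same order
full <- visits_df$visits[visits_df$premie == "full term"]
prem <- visits_df$visits[visits_df$premie == "premie"]
res_vec <- t.test(full, prem)

all.equal(as.numeric(res$conf.int), as.numeric(res_vec$conf.int))
```

Reversing the group order flips the sign of both interval endpoints. A 95% interval that excludes 0 is consistent with rejecting the null hypothesis at the 0.05 level.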