In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
str(ncbirths)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
(mean_NC_weight_gained <- mean(ncbirths$gained, na.rm = TRUE))
## [1] 30.3258
(sd_NC_weight_gained <- sd(ncbirths$gained, na.rm = TRUE))
## [1] 14.2413
(size_weight_gained <- sum(!is.na(ncbirths$gained)))
## [1] 973
The mean of the variable “gained” is 30.33, the standard deviation is 14.24, and the sample size is 973.
# Calculate t-critical value for 90% confidence
t90 <- abs(qt(p = 0.05, df = 972))
The two-tailed t-critical value for a 90% confidence interval is 1.65.
# Calculate standard error
SE <- sd_NC_weight_gained/sqrt(size_weight_gained)
The standard error of the mean of the “gained” variable is 0.46.
# Boundaries of confidence interval
mean_NC_weight_gained - t90*SE
## [1] 29.57411
mean_NC_weight_gained + t90*SE
## [1] 31.07748
We can be 90% confident that the population mean of the variable “gained” is between 29.57 and 31.08; that is, we are 90% confident that mothers in North Carolina gained between 29.57 and 31.08 lbs on average.
#Calculate the two-tailed t-critical value for the 95% confidence interval for the variable "gained".
t95 <- abs(qt(p = 0.025, df = 972))
The two-tailed t-critical value for the 95% confidence interval is 1.96.
#Find the boundaries of the 95% confidence interval
mean_NC_weight_gained - t95*SE
## [1] 29.42985
mean_NC_weight_gained + t95*SE
## [1] 31.22174
We can be 95% confident that the population mean of the variable “gained” is between 29.43 and 31.22; that is, we are 95% confident that mothers in North Carolina gained between 29.43 and 31.22 lbs on average.
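The interval arithmetic above can be wrapped in a small function. This is a minimal sketch that works from the summary statistics reported above rather than from the raw data; `t_interval()` is a hypothetical helper name, not part of openintro or base R.

```r
# Sketch: t-based confidence interval for a mean, from summary statistics.
# t_interval() is a hypothetical helper, not a function from any package.
t_interval <- function(xbar, s, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # two-sided t-critical value
  se <- s / sqrt(n)                        # standard error of the mean
  c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
}

# Mean, standard deviation, and non-missing sample size of "gained",
# copied from the output above
ci90 <- t_interval(xbar = 30.3258, s = 14.2413, n = 973, conf_level = 0.90)
ci95 <- t_interval(xbar = 30.3258, s = 14.2413, n = 973, conf_level = 0.95)

print(round(ci90, 2))
print(round(ci95, 2))
```

Both intervals should agree with the boundaries computed above to about two decimal places. With the dataset loaded, `t.test(ncbirths$gained, conf.level = 0.90)$conf.int` is another way to cross-check, since `t.test()` drops missing values before computing the interval.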
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
\(H_0 : \mu_{NCbabies} = 7.7\)
\(H_A : \mu_{NCbabies} \neq 7.7\)
Our Null hypothesis is that the mean weight of NC babies is the same as the mean weight of European babies, which is 7.7 lbs. The alternative hypothesis is that the average weight of NC babies is different from the average weight of European babies.
# Sample statistics (sample mean, standard deviation, and size)
mean(ncbirths$weight)
## [1] 7.101
sd(ncbirths$weight)
## [1] 1.50886
table(!is.na(ncbirths$weight))
##
## TRUE
## 1000
The mean of the variable “weight” in the dataset ncbirths is 7.101, the standard deviation is 1.51, and the size is 1000.
# Test statistic
(7.101 - 7.7)/(1.51/sqrt(1000))
## [1] -12.5444
The test statistic is -12.54.
# Probability of test statistic by chance
pt(q = -12.54, df = 999)*2
## [1] 1.320923e-33
Since the p-value is smaller than the significance level of 0.05, we reject the Null hypothesis.
The evidence shows that the average weight of the NC babies is different than the average weight of the European babies.
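The test statistic and p-value above were computed from rounded inputs. As a hedged sketch, the same two-sided one-sample t-test can be written as a function of the summary statistics; `one_sample_t()` is a hypothetical helper name, and the inputs are the unrounded values printed above.

```r
# Sketch: two-sided one-sample t-test from summary statistics.
# one_sample_t() is a hypothetical helper, not a function from any package.
one_sample_t <- function(xbar, s, n, mu0) {
  se <- s / sqrt(n)              # standard error of the sample mean
  t_stat <- (xbar - mu0) / se    # distance from mu0 in standard errors
  # Two-sided p-value: probability of a statistic at least this extreme
  p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
  list(statistic = t_stat, df = n - 1, p_value = p_value)
}

# Sample mean, standard deviation, and size of "weight" from the output above
res <- one_sample_t(xbar = 7.101, s = 1.50886, n = 1000, mu0 = 7.7)
print(res$statistic)
print(res$p_value < 0.05)
```

Using `abs(t_stat)` with `lower.tail = FALSE` keeps the p-value calculation correct whether the statistic is negative or positive. With the dataset loaded, `t.test(ncbirths$weight, mu = 7.7)` performs the same test directly.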
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
\(H_0 : \mu_{fathers} = \mu_{mothers}\)
\(H_A : \mu_{fathers} \neq \mu_{mothers}\)
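The Null hypothesis is that the average age of fathers is the same as the average age of mothers. The alternative hypothesis is that the average age of fathers differs from the average age of mothers. Because each birth record pairs one mother with one father, we test the mean of the within-pair age differences.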
#First we need to find the sample sizes for "mage" and "fage" variables.
table(!is.na(ncbirths$mage))
##
## TRUE
## 1000
table(!is.na(ncbirths$fage))
##
## FALSE TRUE
## 171 829
The sample size of the “mage” variable is 1000, and the sample size of the “fage” variable is 829. Our sample size in this case is 829, because in 171 cases the father's age is missing and the age difference cannot be computed.
#Create the "age_diff" column in the dataset.
ncbirths$age_diff <- ncbirths$fage - ncbirths$mage
#Find the test statistic.
(mean(ncbirths$age_diff, na.rm = TRUE) - 0)/(sd(ncbirths$age_diff, na.rm= TRUE)/sqrt(829))
## [1] 17.6727
#Find the probability of getting that test statistic.
pt(17.6727, df = 828, lower.tail = FALSE)*2
## [1] 1.504649e-59
Because the p-value is smaller than the significance level of 0.05, we reject the Null hypothesis.
The evidence shows that the average age of fathers is significantly different than the average age of mothers in NC.
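The calculation above is a paired t-test: a one-sample test on the within-pair differences. The sketch below illustrates that equivalence on a small made-up data set; the `fage` and `mage` vectors here are hypothetical toy values, not taken from ncbirths.

```r
# Hypothetical toy data (NOT from ncbirths); NA marks a missing father's age
fage <- c(31, NA, 27, 35, 24, 40, NA, 29)
mage <- c(28, 19, 27, 30, 25, 36, 22, 26)

# Manual paired test: one-sample t-test on the differences
d <- fage - mage          # NA wherever either age is missing
d <- d[!is.na(d)]         # keep complete pairs only
n <- length(d)
t_manual <- mean(d) / (sd(d) / sqrt(n))
p_manual <- 2 * pt(abs(t_manual), df = n - 1, lower.tail = FALSE)

# Built-in equivalent: t.test() drops incomplete pairs when paired = TRUE
res <- t.test(fage, mage, paired = TRUE)

print(c(manual = t_manual, builtin = unname(res$statistic)))
print(c(manual = p_manual, builtin = res$p.value))
```

The manual and built-in results should match, which suggests `t.test(ncbirths$fage, ncbirths$mage, paired = TRUE)` as a cross-check for the analysis above.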
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
\(H_0 : \mu_{smokers} = \mu_{nonsmokers}\)
\(H_A : \mu_{smokers} \neq \mu_{nonsmokers}\)
The Null hypothesis is that there is no difference in the average length of pregnancy between smoking and nonsmoking mothers. The alternative hypothesis is that the average length of pregnancy differs between smoking and nonsmoking mothers.
#First we need to create subsets for smokers and non smokers, and then we need to find the mean difference between the average lengths of both subsets.
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
mean(smokers$weeks) - mean(nonsmokers$weeks, na.rm = TRUE)
## [1] 0.1256371
The difference in means (smokers minus nonsmokers) is 0.1256 weeks.
#Now we have to find sample sizes
table(!is.na(smokers$weeks))
##
## TRUE
## 126
table(!is.na(nonsmokers$weeks))
##
## FALSE TRUE
## 1 872
The sample sizes are 126 and 872 for smokers and nonsmokers, respectively.
#Find test statistic
0.1256371/sqrt((sd(smokers$weeks)^2)/126 + (sd(nonsmokers$weeks, na.rm = TRUE)^2)/872)
## [1] 0.5189961
The test statistic is 0.519.
#Find the p-value.
(pt(q = 0.519, df = 125, lower.tail = FALSE))*2
## [1] 0.6046784
The p-value is greater than the significance level of 0.05, so we fail to reject the Null hypothesis.
There is not enough evidence that shows that the average length of pregnancies of mothers who smoke is significantly different than the average length of pregnancies of mothers who don’t smoke.
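The p-value above uses the conservative degrees of freedom, min(n1, n2) - 1 = 125. R's `t.test()` uses the Welch-Satterthwaite approximation instead by default. The sketch below compares the two on made-up data; `welch_t()` is a hypothetical helper, and the `smk` and `non` vectors are toy values, not taken from ncbirths.

```r
# Sketch: two-sample t-test with unpooled variances.
# welch_t() is a hypothetical helper, not a function from any package.
welch_t <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Conservative df: smaller sample size minus one
  df_cons <- min(length(x), length(y)) - 1
  # Welch-Satterthwaite df, the default in t.test()
  df_welch <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p <- function(df) 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  list(statistic = t_stat,
       df_cons = df_cons, p_cons = p(df_cons),
       df_welch = df_welch, p_welch = p(df_welch))
}

# Hypothetical toy pregnancy lengths in weeks (NOT from ncbirths)
smk <- c(38, 39, 37, 40, 36, 39)
non <- c(39, 40, 38, 41, 39, 40, 37, 39, NA, 40)

manual <- welch_t(smk, non)
builtin <- t.test(smk, non)   # var.equal = FALSE by default

print(c(manual = manual$statistic, builtin = unname(builtin$statistic)))
print(c(p_conservative = manual$p_cons, p_welch = manual$p_welch))
```

The conservative degrees of freedom are never larger than the Welch-Satterthwaite value, so the conservative p-value is never smaller. A result that fails to reject under one approach can still be checked under the other.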
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
\(H_0 : \mu_{younger} = \mu_{mature}\)
\(H_A : \mu_{younger} > \mu_{mature}\)
The Null hypothesis states that the average weight gained by younger mothers is the same as the average weight gained by mature mothers. The alternative hypothesis states that the average weight gained by younger mothers is greater than the average weight gained by mature mothers.
#First we need to create subsets for younger mothers and mature mothers, and then we need to find the difference between the average weight gained by younger mothers and by mature mothers.
younger_mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
mature_mothers <- subset(ncbirths, ncbirths$mature == "mature mom")
mean(younger_mothers$gained, na.rm = TRUE) - mean(mature_mothers$gained, na.rm = TRUE)
## [1] 1.769729
The difference in average weight gained (younger moms minus mature moms) is 1.7697 lbs.
#Now we have to find sample sizes
table(!is.na(younger_mothers$gained))
##
## FALSE TRUE
## 23 844
table(!is.na(mature_mothers$gained))
##
## FALSE TRUE
## 4 129
The sample size for younger moms is 844, and the sample size for mature moms is 129.
#Find the SE
sqrt((sd(younger_mothers$gained, na.rm = TRUE)^2)/844 +(sd(mature_mothers$gained, na.rm = TRUE)^2/129))
## [1] 1.285689
The standard error is 1.2857.
#Find the t-critical value for a one-sided test at the 0.05 significance level (this is also the critical value for a two-sided 90% confidence interval)
qt(p = 0.05, df = 128, lower.tail = FALSE)
## [1] 1.656845
The t-critical value is 1.6568.
#Find the boundaries of the 90% confidence interval.
1.7697 - 1.6568*1.2857
## [1] -0.3604478
1.7697 + 1.6568*1.2857
## [1] 3.899848
We can be 90% confident that the difference in mean weight gained between younger moms and mature moms is between -0.36 and 3.90 lbs. Because the lower bound is below 0, the interval contains 0, and we fail to reject the Null hypothesis in the one-sided test at the 0.05 significance level.
There is not enough evidence that shows that younger moms gain on average more weight during pregnancies than the mature moms.
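The one-sided hypotheses above can also be checked with a test statistic and p-value instead of an interval. This is a minimal sketch that reuses the difference in means, the standard error, and the conservative degrees of freedom (128) printed above; the variable names are illustrative.

```r
# Values copied from the output above
diff_means <- 1.769729   # younger minus mature, in lbs
se <- 1.285689           # standard error of the difference
df <- 128                # conservative df: min(844, 129) - 1

# Test statistic under H0: mu_younger - mu_mature = 0
t_stat <- diff_means / se

# One-sided (upper-tail) p-value for H_A: mu_younger > mu_mature
p_value <- pt(t_stat, df = df, lower.tail = FALSE)

# Equivalent decision rule: compare with the one-sided critical value
t_crit <- qt(0.95, df = df)
reject <- t_stat > t_crit

print(c(t_stat = t_stat, t_crit = t_crit, p_value = p_value))
print(reject)
```

Only the upper tail is used because the alternative hypothesis is directional. The statistic falls below the critical value, so this route reaches the same decision: fail to reject the Null hypothesis.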
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
#First we need to create subsets for babies with low weight and babies with the weight that is not considered low.
babies_low_weight <- subset(ncbirths, ncbirths$lowbirthweight == "low")
babies_not_low_weight <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
#Now we have to find the weight cutoffs for both groups using the summary() function.
summary(babies_low_weight$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.095 4.560 4.035 5.160 5.500
summary(babies_not_low_weight$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.560 6.750 7.440 7.484 8.130 11.750
First we split the dataset into two subsets by the “lowbirthweight” variable: “low” and “not low”. Using the summary() function, we determined that the minimum weight of the babies classified as low weight in the ncbirths dataset is 1 lb and the maximum weight is 5.5 lbs. Using the same function, we found that the minimum weight of the babies classified as not low weight is 5.56 lbs and the maximum weight is 11.75 lbs. The two ranges do not overlap, so the cutoff lies between 5.5 and 5.56 lbs: babies weighing 5.5 lbs or less are classified as “low”, and babies weighing 5.56 lbs or more are classified as “not low”.
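The same idea can be expressed more compactly by comparing the heaviest “low” baby with the lightest “not low” baby. The sketch below uses a small made-up data frame; `toy` and its values are hypothetical and stand in for ncbirths only to show the pattern.

```r
# Hypothetical toy data with the same two columns as ncbirths
toy <- data.frame(
  weight = c(4.69, 5.38, 5.50, 5.56, 6.38, 7.63),
  lowbirthweight = factor(c("low", "low", "low",
                            "not low", "not low", "not low"))
)

# Minimum and maximum weight within each class
ranges <- tapply(toy$weight, toy$lowbirthweight, range)
print(ranges)

# The cutoff must lie between these two values
heaviest_low <- max(toy$weight[toy$lowbirthweight == "low"])
lightest_not_low <- min(toy$weight[toy$lowbirthweight == "not low"])

# If the classes do not overlap, one threshold separates them
no_overlap <- heaviest_low < lightest_not_low
print(c(heaviest_low = heaviest_low, lightest_not_low = lightest_not_low))
print(no_overlap)
```

Checking `no_overlap` matters because a single cutoff only exists if every “low” weight is below every “not low” weight. Replacing `toy` with `ncbirths` would apply the same check to the real data.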
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Question
Are the boys born in North Carolina on average heavier than the girls?
Write hypotheses
\(H_0 : \mu_{girls' weight} = \mu_{boys' weight}\)
\(H_A : \mu_{girls' weight} < \mu_{boys' weight}\)
Our Null hypothesis is that the average weight of girls born in North Carolina is the same as the average weight of boys born in North Carolina. The alternative hypothesis is that boys born in North Carolina are on average heavier than girls.
#Split gender variable into two separate subsets and find the mean difference between their average weight.
girls <- subset(ncbirths, ncbirths$gender == "female")
boys <- subset(ncbirths, ncbirths$gender == "male")
mean(boys$weight) - mean(girls$weight)
## [1] 0.3986264
The difference between the mean weight of the boys and the mean weight of the girls is 0.3986.
#Find the t-critical value for a one-sided test at the 0.05 significance level (this is also the critical value for a two-sided 90% confidence interval).
abs(qt(p = 0.05, df = 496))
## [1] 1.647932
The t-critical value is 1.65.
#Find the SE.
sqrt((sd(boys$weight)^2)/497 + (sd(girls$weight)^2/503))
## [1] 0.0946563
The standard error is 0.0947.
#Calculate the 90% confidence interval.
0.3986 - 1.65*0.0947
## [1] 0.242345
0.3986 + 1.65*0.0947
## [1] 0.554855
Because the 90% confidence interval lies entirely above 0, we reject the Null hypothesis in the one-sided test at the 0.05 significance level. Because both bounds are positive, boys are on average heavier than girls.
At the 0.05 significance level, we have enough evidence to conclude that boys born in North Carolina weigh more on average at birth than girls do.