In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Explore it by using the str() function, viewing the data frame, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is for your own preparation.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
str(ncbirths)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of weight gained (missing values excluded)
mean_avgwghtgained <- mean(ncbirths$gained, na.rm = TRUE)
sd_avgwghtgained <- sd(ncbirths$gained, na.rm = TRUE)
sample_size <- sum(!is.na(ncbirths$gained))
# Calculate t-critical value for 90% confidence
abs(qt(.05, df=972))
## [1] 1.646423
# Calculate margin of error
1.646423 * sd_avgwghtgained/sqrt(sample_size)
## [1] 0.7516827
# Boundaries of confidence interval
mean_avgwghtgained - 0.7516827
## [1] 29.57411
mean_avgwghtgained + 0.7516827
## [1] 31.07748
We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 and 31.08 pounds.
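The manual steps above (critical value, margin of error, interval bounds) can be wrapped in one function. The sketch below is only an illustration: `t_ci` is a hypothetical helper that is not part of the original project, and the example vector is just the first ten `gained` values printed in the `str()` output, not the full sample.

```r
# Hypothetical helper (an assumption, not from the original project):
# t-confidence interval for a mean, ignoring missing values.
t_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]                                    # drop missing values
  n <- length(x)                                       # usable sample size
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # t-critical value
  moe <- t_star * sd(x) / sqrt(n)                      # margin of error
  c(lower = mean(x) - moe, upper = mean(x) + moe)
}

# First ten `gained` values shown by str(ncbirths); illustration only
first_gained <- c(38, 20, 38, 34, 27, 22, 76, 15, NA, 52)
t_ci(first_gained, conf_level = 0.90)
```

With the openintro package loaded, `t_ci(ncbirths$gained, 0.90)` would be expected to agree with the interval computed step by step, since it applies the same formula.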
# Calculate t-critical value for 95% confidence
abs(qt(.025, df=972))
## [1] 1.962408
# Calculate margin of error
1.962408*sd_avgwghtgained/sqrt(sample_size)
## [1] 0.8959472
# Boundaries of confidence interval
mean_avgwghtgained - 0.8959472
## [1] 29.42985
mean_avgwghtgained + 0.8959472
## [1] 31.22174
We are 95% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.43 and 31.22 pounds.
# Width of the 90% confidence interval (twice the margin of error)
2 * 0.7516827
## [1] 1.503365
# Width of the 95% confidence interval (twice the margin of error)
2 * 0.8959472
## [1] 1.791894
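The 95% interval is wider than the 90% interval (about 1.79 pounds versus about 1.50 pounds) because a higher confidence level requires a larger critical value. We are 95% confident that the average weight gained by all North Carolina mothers is between 29.43 and 31.22 pounds.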
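The relationship between confidence level and interval width can be written out directly. The short derivation below uses only the two critical values computed earlier (1.646423 and 1.962408, both with 972 degrees of freedom); the symbols \(s\), \(n\), and \(t^{*}\) are the usual sample standard deviation, sample size, and critical value.

```latex
\[
\text{width} = 2\, t^{*}\, \frac{s}{\sqrt{n}},
\qquad
\frac{\text{width}_{95\%}}{\text{width}_{90\%}}
  = \frac{t^{*}_{95\%}}{t^{*}_{90\%}}
  = \frac{1.962408}{1.646423}
  \approx 1.19
\]
```

Because \(s\) and \(n\) are the same for both intervals, the 95% interval is about 19% wider than the 90% interval purely because of the larger critical value.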
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
Write hypotheses
\(H_0: \mu = 7.7\) \(H_A: \mu \neq 7.7\)
Test by p-value and decision
# Sample statistics (sample mean, standard deviation, and size)
mean_avgwght <- mean(ncbirths$weight, na.rm = TRUE)
sd_avgwght <- sd(ncbirths$weight, na.rm = TRUE)
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
The sample has a mean of approximately 7.101 pounds, a standard deviation of approximately 1.509 pounds, and a sample size of 1000.
# Test statistic
(mean_avgwght - 7.7)/(sd_avgwght/sqrt(1000))
## [1] -12.55388
The t-score is -12.55388.
# Probability of test statistic by chance
pt(-12.55388, df = 999)*2
## [1] 1.135354e-33
Because the p-value (1.135354e-33) is less than alpha (0.05), we reject the null hypothesis in favor of the alternative hypothesis.
C. Conclusion
The data suggest that there is indeed a difference between the average birthweight of European babies and the average birthweight of babies in North Carolina.
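The test statistic divides the difference between the sample mean and the null value by the standard error \(s/\sqrt{n}\), so the standard error needs its own parentheses in R. The following sketch shows the whole one-sample calculation in one place; `one_sample_t` is a hypothetical helper, and the example vector holds only the first ten `weight` values printed by `str()`, so its result is not the result for the full sample.

```r
# Hypothetical helper (an assumption, not from the original project):
# two-sided one-sample t-test computed by hand.
one_sample_t <- function(x, mu0) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)                 # standard error of the mean
  t_stat <- (mean(x) - mu0) / se        # parentheses keep sqrt(n) inside the SE
  p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
  c(t = t_stat, df = n - 1, p_value = p_value)
}

# First ten `weight` values shown by str(ncbirths); illustration only
first_weights <- c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)
one_sample_t(first_weights, mu0 = 7.7)

# Cross-check against base R's built-in test
t.test(first_weights, mu = 7.7)
```

On the full data, `t.test(ncbirths$weight, mu = 7.7)` offers the same kind of cross-check for the hand calculation.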
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
Write hypotheses
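\(H_0: \mu_d = 0\) \(H_A: \mu_d \neq 0\), where \(\mu_d\) is the population mean of the paired age differences (father's age minus mother's age).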
Test by confidence interval or p-value and decision
# Calculate the paired age difference (father's age minus mother's age)
ncbirths$diff <- ncbirths$fage - ncbirths$mage
# Count complete pairs (FALSE) and pairs with a missing age (TRUE)
table(is.na(ncbirths$diff))
##
## FALSE TRUE
## 829 171
In the “diff” column of the ncbirths dataset, a positive value indicates that the mother is younger than the father, while a negative value indicates that the mother is older than the father.
mean_diff <- mean(ncbirths$diff, na.rm = TRUE)
sd_diff <- sd(ncbirths$diff, na.rm = TRUE)
length_diff <- 829
df_diff <- 829 - 1
# Test stat
stat_diff <- (mean_diff-0) / (sd_diff/sqrt(length_diff))
# Two-sided p-value (abs() keeps the calculation valid for either sign of the statistic)
2 * pt(abs(stat_diff), df_diff, lower.tail = FALSE)
## [1] 1.504608e-59
Because the p-value (1.504608e-59) is less than alpha (0.05), we reject the null hypothesis in favor of the alternative hypothesis.
The data suggest that there is a significant difference between the mean ages of mothers and fathers in the ncbirths dataset.
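Because each father is matched with one mother, this is a paired design: the test is a one-sample t-test on the differences. The sketch below checks that equivalence on the first ten `fage` and `mage` values printed by `str()` (an illustration only; half of these rows have a missing father's age). The variable names are chosen for the example and are not part of the original project.

```r
# First ten `fage` and `mage` values shown by str(ncbirths); illustration only
fage <- c(NA, NA, 19, 21, NA, NA, 18, 17, NA, 20)
mage <- c(13, 14, 15, 15, 15, 15, 15, 15, 16, 16)

# Manual paired test: one-sample t-test on the complete differences
d <- fage - mage
d <- d[!is.na(d)]
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))
p_manual <- 2 * pt(abs(t_manual), df = length(d) - 1, lower.tail = FALSE)

# Built-in paired test; incomplete pairs are dropped automatically
paired <- t.test(fage, mage, paired = TRUE)

c(manual_t = t_manual, builtin_t = unname(paired$statistic))
c(manual_p = p_manual, builtin_p = paired$p.value)
```

On the full data, `t.test(ncbirths$fage, ncbirths$mage, paired = TRUE)` would be the corresponding one-line check.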
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
Write hypotheses
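\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\), where \(\mu_1\) is the mean pregnancy length for smokers and \(\mu_2\) is the mean pregnancy length for nonsmokers.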
Test by confidence interval or p-value and decision
#Creating subsets for both smokers and nonsmokers
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
#Storing the mean lengths of pregnancies and standard deviations of pregnancy lengths for smokers and nonsmokers
mn_smoker <- mean(smokers$weeks, na.rm = TRUE)
mn_nonsmoker <- mean(nonsmokers$weeks, na.rm=TRUE)
sd_smokers <- sd(smokers$weeks, na.rm=TRUE)
sd_nonsmokers <- sd(nonsmokers$weeks, na.rm=TRUE)
#Finding the sample size of each group
summary(ncbirths$habit)
## nonsmoker smoker NA's
## 873 126 1
#Storing the standard error
se <- sqrt((sd_smokers^2/126)+(sd_nonsmokers^2/873))
#Finding the test-statistic
(((mn_smoker)-(mn_nonsmoker))-0)/se
## [1] 0.5190483
# Calculate the p-value
pt(0.5190483, df=125, lower.tail=FALSE)*2
## [1] 0.6046448
The p-value (0.605) is greater than alpha (0.05), so we fail to reject the null hypothesis.
C. Conclusion
There is not sufficient evidence of a difference in mean pregnancy length (in weeks) between smoking and nonsmoking mothers.
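The calculation above uses the conservative degrees of freedom, the smaller sample size minus one (126 - 1 = 125). Base R's `t.test()` instead uses the Welch-Satterthwaite approximation by default. The sketch below computes both; `two_sample_t` is a hypothetical helper, and the two vectors are simulated stand-ins with the same group sizes as the smoker and nonsmoker groups, not the actual ncbirths values.

```r
# Hypothetical helper (an assumption, not from the original project):
# unpooled two-sample t statistic with two choices of degrees of freedom.
two_sample_t <- function(x1, x2) {
  x1 <- x1[!is.na(x1)]
  x2 <- x2[!is.na(x2)]
  v1 <- var(x1) / length(x1)            # squared standard error, group 1
  v2 <- var(x2) / length(x2)            # squared standard error, group 2
  t_stat <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)
  df_conservative <- min(length(x1), length(x2)) - 1
  df_welch <- (v1 + v2)^2 /
    (v1^2 / (length(x1) - 1) + v2^2 / (length(x2) - 1))
  c(t = t_stat, df_conservative = df_conservative, df_welch = df_welch)
}

# Simulated data (assumed values, for illustration only)
set.seed(1)
sim_smokers <- rnorm(126, mean = 38, sd = 3)
sim_nonsmokers <- rnorm(873, mean = 38, sd = 3)

res <- two_sample_t(sim_smokers, sim_nonsmokers)
res

# Two-sided p-values under each choice of degrees of freedom
2 * pt(abs(res[["t"]]), df = res[["df_conservative"]], lower.tail = FALSE)
2 * pt(abs(res[["t"]]), df = res[["df_welch"]], lower.tail = FALSE)
```

The Welch degrees of freedom are never smaller than the conservative value, so the conservative p-value is the larger of the two. A result from `t.test()` on the real data may therefore differ slightly from the hand calculation without changing the method.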
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
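Write hypotheses
\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 > 0\), where \(\mu_1\) is the mean weight gained by younger mothers and \(\mu_2\) is the mean weight gained by mature mothers.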
Test by confidence interval or p-value and decision
#Creating subsets for the weight gained by younger mothers and mature mothers
younger_mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
mature_mothers <- subset(ncbirths, ncbirths$mature == "mature mom")
#Storing the mean, standard deviation, and mean difference for the younger mothers and mature mothers subsets
mn_younger_mothers <- mean(younger_mothers$gained, na.rm=TRUE)
mn_mature_mothers <- mean(mature_mothers$gained, na.rm=TRUE)
sd_younger_mothers <- sd(younger_mothers$gained, na.rm=TRUE)
sd_mature_mothers <- sd(mature_mothers$gained, na.rm=TRUE)
mn_diff_gained <- (mn_younger_mothers - mn_mature_mothers)
#Finding the sample sizes of the younger mothers and mature mothers
table(is.na(younger_mothers$gained))
##
## FALSE TRUE
## 844 23
table(is.na(mature_mothers$gained))
##
## FALSE TRUE
## 129 4
# Storing the standard error
se_gained <- sqrt((sd_younger_mothers^2/844)+(sd_mature_mothers^2/129))
# Calculate the test statistic
(mn_diff_gained - 0)/se_gained
## [1] 1.376483
# Calculate the p-value
pt(1.376483, df=128, lower.tail=FALSE)
## [1] 0.08553763
The p-value (0.086) is greater than alpha (0.05), and we therefore fail to reject the null hypothesis.
There is not sufficient evidence that the average weight gained by younger mothers is greater than the average weight gained by mature mothers.
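Since the question asks whether younger mothers gain more, only the upper tail of the t-distribution counts toward the p-value. The sketch below takes the test statistic (1.376483) and degrees of freedom (128) reported above and shows how the choice of alternative hypothesis changes the tail that is used; the variable names are illustrative.

```r
# Test statistic and degrees of freedom reported in the analysis above
t_stat <- 1.376483
df <- 128

# One-sided p-value for H_A: mu_1 - mu_2 > 0 (upper tail)
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)

# One-sided p-value for H_A: mu_1 - mu_2 < 0 (lower tail)
p_less <- pt(t_stat, df = df)

# Two-sided p-value for H_A: mu_1 - mu_2 != 0 (both tails)
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

c(greater = p_greater, less = p_less, two_sided = p_two_sided)
```

The two-sided p-value is exactly twice the upper-tail value here, so neither alternative leads to rejecting the null hypothesis at the 0.05 level.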
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you arrived at your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task.)
#Creating subsets of low and not low birth weights
low_birthwght <- subset(ncbirths, ncbirths$lowbirthweight == "low" )
notlow_birthwght <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
#Using the summary function to observe values within the classifications of "low" and "not low" weights
summary(low_birthwght$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.095 4.560 4.035 5.160 5.500
summary(notlow_birthwght$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.560 6.750 7.440 7.484 8.130 11.750
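The summaries show that the heaviest baby classified as “low” weighs 5.50 pounds and the lightest baby classified as “not low” weighs 5.56 pounds. The two groups do not overlap, so the cutoff lies between 5.50 and 5.56 pounds. This is consistent with the conventional low-birth-weight threshold of 2,500 grams (about 5.51 pounds).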
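The same idea can be expressed without reading values off two summaries: take the maximum weight in the “low” group and the minimum weight in the “not low” group. The sketch below does this on the first ten rows printed by `str()` (illustration only, so the bounds are much looser than for the full dataset); `cutoff_bounds` is a hypothetical helper name.

```r
# First ten `weight` and `lowbirthweight` values shown by str(ncbirths)
weight <- c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)
lowbirthweight <- factor(c(2, 2, 2, 2, 2, 1, 2, 1, 2, 2),
                         levels = 1:2, labels = c("low", "not low"))

# Hypothetical helper (an assumption, not from the original project):
# the cutoff must lie between the heaviest "low" baby and the
# lightest "not low" baby.
cutoff_bounds <- function(weight, label) {
  c(max_low = max(weight[label == "low"], na.rm = TRUE),
    min_not_low = min(weight[label == "not low"], na.rm = TRUE))
}

cutoff_bounds(weight, lowbirthweight)
# max_low = 5.38, min_not_low = 6.38 for these ten rows
```

Applied to the full data as `cutoff_bounds(ncbirths$weight, ncbirths$lowbirthweight)`, the two bounds should match the maximum and minimum reported by the summaries above.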
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Question
Is there a significant difference between the average weights of male and female babies?
Write hypotheses
\(H_0: \mu_{male} - \mu_{female} = 0\) \(H_A: \mu_{male} - \mu_{female} \neq 0\)
Test by confidence interval or p-value and decision
#Creating a subset of male and female babies
male_babies <- subset(ncbirths, ncbirths$gender == "male")
female_babies <- subset(ncbirths, ncbirths$gender == "female")
#Storing the mean weight and standard deviation of male and female babies
mn_male_wt <- mean(male_babies$weight, na.rm = TRUE)
mn_female_wt <- mean(female_babies$weight, na.rm = TRUE)
sd_male_wt <- sd(male_babies$weight, na.rm = TRUE)
sd_female_wt <- sd(female_babies$weight, na.rm = TRUE)
#Storing the mean difference of male and female babies
mn_diff_babywt <- (mn_male_wt - mn_female_wt)
#Calculating the sample size of male and female babies
table(is.na(male_babies$weight))
##
## FALSE
## 497
table(is.na(female_babies$weight))
##
## FALSE
## 503
#Storing the standard error
SE_b_wt <- sqrt((sd_male_wt^2/497)+(sd_female_wt^2/503))
#Finding the test statistic
(mn_diff_babywt - 0)/SE_b_wt
## [1] 4.211303
#Finding the p-value
pt(4.211303, df=496, lower.tail = FALSE)*2
## [1] 3.015765e-05
The p-value (3.015765e-05) is less than alpha (0.05), and we therefore reject the null hypothesis in favor of the alternative hypothesis.
The data suggest that there is indeed a difference between the average weights of male and female babies.