In this project, students will demonstrate their understanding of inference on numerical data using the t-distribution. Unless stated otherwise, students will assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
#Analyzing the ncbirths dataset
str(ncbirths)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
#data.frame(ncbirths)
#In this project, the data.frame call above has been commented out to keep the knit document neater.
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
mn_gained <- mean(ncbirths$gained, na.rm = TRUE)
sd_gained <- sd(ncbirths$gained, na.rm = TRUE)
#Counting the NA values recorded for the gained variable to find the accurate sample size
table(is.na(ncbirths$gained))
##
## FALSE TRUE
## 973 27
ss_gained <- 973
# Calculate t-critical value for 90% confidence
abs(qt(.05, df=972))
## [1] 1.646423
# Calculate margin of error
1.646423 * sd_gained/sqrt(ss_gained)
## [1] 0.7516827
# Boundaries of confidence interval
mn_gained - 0.7516827
## [1] 29.57411
mn_gained + 0.7516827
## [1] 31.07748
We are 90% confident that the average weight gained by all North Carolina mothers during pregnancy is between 29.57 and 31.08 pounds.
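The interval above hard-codes the sample size and the critical value. A minimal sketch of the same calculation wrapped in a function is shown here, assuming a hypothetical helper named `t_interval` (not part of openintro) and using only the first ten values of `gained` printed by str() so that it runs without the full dataset. The built-in t.test() is used as a cross-check.

```r
# Sketch: t-confidence interval for a mean, with n taken from the data.
# `t_interval` is a hypothetical helper introduced for illustration only.
t_interval <- function(x, conf_level = 0.90) {
  x <- x[!is.na(x)]                                # drop missing values once
  n <- length(x)                                   # sample size after removing NAs
  t_crit <- qt((1 + conf_level) / 2, df = n - 1)   # two-sided critical value
  margin <- t_crit * sd(x) / sqrt(n)               # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# First ten values of `gained` as printed by str(ncbirths) above
gained_head <- c(38, 20, 38, 34, 27, 22, 76, 15, NA, 52)

ci_manual  <- t_interval(gained_head, conf_level = 0.90)
ci_builtin <- t.test(gained_head, conf.level = 0.90)$conf.int

print(ci_manual)
print(as.numeric(ci_builtin))
```

With the full dataset loaded, `t_interval(ncbirths$gained, 0.90)` should reproduce the bounds computed by hand above, since it follows the same formula.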
#Calculating the t-critical value for the 95% confidence interval
abs(qt(.025, df=ss_gained-1))
## [1] 1.962408
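#Calculating the upper and lower bounds of the confidence interval
#Note: the multiplier 1.962341 used below differs slightly from the critical value 1.962408 computed above; the bounds are unaffected at two decimal places.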
mn_gained - 1.962341*sd_gained/sqrt(ss_gained)
## [1] 29.42988
mn_gained + 1.962341*sd_gained/sqrt(ss_gained)
## [1] 31.22171
We are 95% confident that the average weight gained by all North Carolina mothers during pregnancy is between 29.43 and 31.22 pounds.
The 95% confidence interval is slightly wider than the 90% confidence interval: a higher confidence level requires a larger critical value, so the interval must cover more values.
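The widening comes entirely from the critical value, since the standard error is the same for both intervals. A short sketch of how the critical value grows with the confidence level follows, using the 972 degrees of freedom from above. The 99% level is added only for illustration and is not part of the assignment.

```r
# Sketch: two-sided t critical values at several confidence levels.
df_gained   <- 973 - 1                     # sample size minus one, as above
conf_levels <- c(0.90, 0.95, 0.99)         # 0.99 is an extra, illustrative level

t_crit <- qt((1 + conf_levels) / 2, df = df_gained)
names(t_crit) <- c("90%", "95%", "99%")

# A larger critical value gives a larger margin of error
print(round(t_crit, 6))
```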
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
\(H_0\): \(\mu\) = 7.7
\(H_A\): \(\mu\) \(\neq\) 7.7
# Sample statistics (sample mean, standard deviation, and size)
mn_weight <- mean(ncbirths$weight, na.rm = TRUE)
sd_weight <- sd(ncbirths$weight, na.rm=TRUE)
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
The sample has a mean of approximately 7.101 pounds, a standard deviation of approximately 1.509 pounds, and a sample size of 1000.
# Test statistic
(mn_weight - 7.7)/(sd_weight/sqrt(1000))
## [1] -12.55388
The t-score is -12.55388.
# Two-sided p-value: probability of a test statistic at least this extreme if the null hypothesis is true
pt(-12.55388, df=999)*2
## [1] 1.135354e-33
Because the p-value (1.135354e-33) is less than alpha (0.05), we reject the null hypothesis in favor of the alternate hypothesis.
The data suggest that there is indeed a difference between the average birthweight of European babies and the average birthweight of babies in North Carolina.
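The test above can also be sketched from the rounded summary statistics alone. The example below assumes the rounded values reported in the text (mean 7.101, standard deviation 1.509, n = 1000), so its test statistic matches the one above only to about two decimal places.

```r
# Sketch: one-sample t-test from rounded summary statistics.
x_bar <- 7.101    # sample mean birthweight (rounded, from the text)
s     <- 1.509    # sample standard deviation (rounded, from the text)
n     <- 1000     # sample size
mu_0  <- 7.7      # null value: average European birthweight

se      <- s / sqrt(n)                       # standard error of the mean
t_stat  <- (x_bar - mu_0) / se               # test statistic
p_value <- 2 * pt(abs(t_stat), df = n - 1,   # two-sided p-value
                  lower.tail = FALSE)

print(t_stat)
print(p_value)
```

Using `abs(t_stat)` with `lower.tail = FALSE` gives the same two-sided p-value whether the statistic is negative or positive, which avoids having to pick the tail by hand.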
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
\(H_0 : \mu_d = 0\)
\(H_A : \mu_d \neq 0\)
#Creating a column of differences in age between mothers and fathers
ncbirths$dif <- ncbirths$fage - ncbirths$mage
In the “dif” column of the ncbirths dataset, a positive value indicates a case where the mother is younger than the father, while a negative value indicates a case where the mother is older than the father.
#Finding how many N.A. values exist in the "dif" column of the ncbirths dataset
table(is.na(ncbirths$dif))
##
## FALSE TRUE
## 829 171
#The sample size is 829.
#Calculating the test statistic
(mean(ncbirths$dif,na.rm=TRUE)-0)/(sd(ncbirths$dif,na.rm=TRUE)/sqrt(829))
## [1] 17.6727
#The t-score is 17.6727
#Finding the two-sided p-value for the test statistic
pt(17.6727, df=828, lower.tail=FALSE)*2
## [1] 1.504649e-59
Because the p-value (1.504649e-59) is less than alpha (0.05), we reject the null hypothesis in favor of the alternate hypothesis.
The data suggest that there is indeed a significant difference between the average ages of mothers and fathers.
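Because each row pairs one father with one mother, this is a paired test on the differences. As a cross-check on that approach, the sketch below compares the manual calculation with t.test(..., paired = TRUE), using only the first ten values of `fage` and `mage` printed by str() so that it runs without the full dataset. The variable names ending in `_head` are introduced here for illustration.

```r
# Sketch: paired t-test by hand versus t.test(), on the first ten rows.
fage_head <- c(NA, NA, 19, 21, NA, NA, 18, 17, NA, 20)   # from str(ncbirths)
mage_head <- c(13, 14, 15, 15, 15, 15, 15, 15, 16, 16)   # from str(ncbirths)

# Differences (father minus mother); pairs with a missing age are dropped
d   <- fage_head - mage_head
d   <- d[!is.na(d)]
n_d <- length(d)

# Manual test statistic for H0: mu_d = 0
t_manual <- mean(d) / (sd(d) / sqrt(n_d))

# Built-in paired test; it also keeps only the complete pairs
paired <- t.test(fage_head, mage_head, paired = TRUE)

print(t_manual)
print(paired$statistic)
print(paired$p.value)
```

Counting the complete pairs with `length(d)` rather than typing the sample size by hand keeps the degrees of freedom in step with the data.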
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
\(H_0 : \mu_1 - \mu_2 = 0\)
\(H_A : \mu_1 - \mu_2 \neq 0\)
#Creating subsets for both smokers and nonsmokers
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
#Storing the mean lengths of pregnancies and standard deviations of pregnancy lengths for smokers and nonsmokers
mn_smoker <- mean(smokers$weeks, na.rm = TRUE)
mn_nonsmoker <- mean(nonsmokers$weeks, na.rm=TRUE)
sd_smokers <- sd(smokers$weeks, na.rm=TRUE)
sd_nonsmokers <- sd(nonsmokers$weeks, na.rm=TRUE)
#Finding the sample size of each group
summary(ncbirths$habit)
## nonsmoker smoker NA's
## 873 126 1
#Storing the standard error
SE <- sqrt((sd_smokers^2/126)+(sd_nonsmokers^2/873))
#Finding the test-statistic
(((mn_smoker)-(mn_nonsmoker))-0)/SE
## [1] 0.5190483
#Finding the two-sided p-value, using the conservative degrees of freedom min(126, 873) - 1 = 125
pt(0.5190483, df=125, lower.tail=FALSE)*2
## [1] 0.6046448
The p-value (.605) is greater than alpha (0.05), and we therefore fail to reject the null hypothesis.
There is not sufficient evidence to say that there is a significant difference between the number of weeks the pregnancies of smoking and nonsmoking mothers lasted.
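The degrees of freedom used above follow the conservative rule for two independent samples: the smaller sample size minus one. A minimal sketch of that step follows, taking the group sizes and test statistic from the output above.

```r
# Sketch: conservative degrees of freedom for a two-sample t-test.
n_smoker    <- 126         # group sizes from summary(ncbirths$habit)
n_nonsmoker <- 873
t_stat      <- 0.5190483   # test statistic computed above

# Conservative rule: smaller sample size minus one
df_conservative <- min(n_smoker, n_nonsmoker) - 1

# Two-sided p-value
p_value <- 2 * pt(abs(t_stat), df = df_conservative, lower.tail = FALSE)

print(df_conservative)
print(p_value)
```

This rule tends to understate the degrees of freedom, so the resulting p-value is, if anything, slightly too large. With a test statistic this small the conclusion would not be expected to change under a less conservative choice.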
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
\(H_0 : \mu_y \leq \mu_m\)
\(H_A : \mu_y > \mu_m\)
#Creating subsets for the weight gained by younger mothers and mature mothers
younger_mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
mature_mothers <- subset(ncbirths, ncbirths$mature == "mature mom")
#Storing the mean, standard deviation, and mean difference for the younger mothers and mature mothers subsets
mn_younger_mothers <- mean(younger_mothers$gained, na.rm=TRUE)
mn_mature_mothers <- mean(mature_mothers$gained, na.rm=TRUE)
sd_younger_mothers <- sd(younger_mothers$gained, na.rm=TRUE)
sd_mature_mothers <- sd(mature_mothers$gained, na.rm=TRUE)
mn_diff_gained <- (mn_younger_mothers - mn_mature_mothers)
#Finding the sample sizes of the younger mothers and mature mothers
table(is.na(younger_mothers$gained))
##
## FALSE TRUE
## 844 23
table(is.na(mature_mothers$gained))
##
## FALSE TRUE
## 129 4
#Storing the standard error
SE_gained <- sqrt((sd_younger_mothers^2/844)+(sd_mature_mothers^2/129))
#Finding the test statistic
(mn_diff_gained - 0)/SE_gained
## [1] 1.376483
#Finding the one-sided (upper-tail) p-value, using the conservative degrees of freedom min(844, 129) - 1 = 128
pt(1.376483, df=128, lower.tail=FALSE)
## [1] 0.08553763
The p-value (0.086) is greater than alpha (0.05), and we therefore fail to reject the null hypothesis.
There is not sufficient evidence that the average weight gained by younger mothers is greater than the average weight gained by mature mothers.
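Unlike the earlier tests, this one is one-sided, so the p-value is a single upper-tail area and is not doubled. The sketch below contrasts the two calculations for the test statistic above. The two-sided value is shown only for comparison and is not the answer to this question.

```r
# Sketch: one-sided versus two-sided p-value for the same test statistic.
t_stat  <- 1.376483   # test statistic computed above
df_cons <- 128        # conservative degrees of freedom: min(844, 129) - 1

# H_A: mu_y > mu_m, so only the upper tail counts
p_greater <- pt(t_stat, df = df_cons, lower.tail = FALSE)

# For comparison only: the p-value a two-sided alternative would give
p_two_sided <- 2 * pt(abs(t_stat), df = df_cons, lower.tail = FALSE)

print(p_greater)
print(p_two_sided)
```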
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
#Creating subsets of low and not low birth weights
low_wt <- subset(ncbirths, ncbirths$lowbirthweight == "low" )
notlow_wt <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
#Using the summary function to observe values within the classifications of "low" and "not low" weights
summary(low_wt$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.095 4.560 4.035 5.160 5.500
summary(notlow_wt$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.560 6.750 7.440 7.484 8.130 11.750
#Creating a dataframe to show the frequency of the birthweights in each classification.
data.frame(table(low_wt$weight))
## Var1 Freq
## 1 1 2
## 2 1.19 1
## 3 1.31 1
## 4 1.38 3
## 5 1.44 2
## 6 1.5 2
## 7 1.56 1
## 8 1.63 1
## 9 1.69 2
## 10 1.88 1
## 11 2.19 1
## 12 2.25 2
## 13 2.5 1
## 14 2.63 1
## 15 2.69 2
## 16 2.88 3
## 17 2.94 1
## 18 3 1
## 19 3.19 1
## 20 3.25 1
## 21 3.31 1
## 22 3.44 1
## 23 3.56 1
## 24 3.63 2
## 25 3.75 2
## 26 3.81 1
## 27 3.94 2
## 28 4 2
## 29 4.06 2
## 30 4.13 2
## 31 4.19 1
## 32 4.25 1
## 33 4.31 1
## 34 4.44 3
## 35 4.5 3
## 36 4.56 4
## 37 4.63 2
## 38 4.69 4
## 39 4.75 6
## 40 4.88 1
## 41 4.94 2
## 42 5 4
## 43 5.06 3
## 44 5.13 2
## 45 5.19 2
## 46 5.25 4
## 47 5.38 8
## 48 5.44 7
## 49 5.5 7
data.frame(table(notlow_wt$weight))
## Var1 Freq
## 1 5.56 5
## 2 5.63 8
## 3 5.69 4
## 4 5.75 4
## 5 5.81 9
## 6 5.88 12
## 7 5.94 15
## 8 6 15
## 9 6.06 10
## 10 6.13 6
## 11 6.19 11
## 12 6.25 16
## 13 6.31 15
## 14 6.38 16
## 15 6.44 8
## 16 6.5 15
## 17 6.56 13
## 18 6.63 9
## 19 6.69 13
## 20 6.75 23
## 21 6.81 12
## 22 6.88 25
## 23 6.94 14
## 24 7 22
## 25 7.06 20
## 26 7.13 24
## 27 7.19 21
## 28 7.25 22
## 29 7.31 24
## 30 7.38 21
## 31 7.44 30
## 32 7.5 24
## 33 7.56 19
## 34 7.63 18
## 35 7.69 17
## 36 7.75 17
## 37 7.81 20
## 38 7.88 25
## 39 7.94 16
## 40 8 19
## 41 8.06 13
## 42 8.13 17
## 43 8.19 17
## 44 8.25 16
## 45 8.31 12
## 46 8.38 20
## 47 8.44 14
## 48 8.5 15
## 49 8.56 11
## 50 8.63 5
## 51 8.69 4
## 52 8.75 14
## 53 8.81 12
## 54 8.88 9
## 55 8.94 4
## 56 9 8
## 57 9.06 6
## 58 9.13 7
## 59 9.19 7
## 60 9.25 6
## 61 9.31 5
## 62 9.38 2
## 63 9.5 4
## 64 9.56 3
## 65 9.63 3
## 66 9.69 1
## 67 9.75 2
## 68 9.81 1
## 69 9.88 4
## 70 9.94 1
## 71 10.06 2
## 72 10.13 2
## 73 10.19 1
## 74 10.25 1
## 75 10.38 1
## 76 11.63 1
## 77 11.75 1
By observing the summaries of the weights of babies classified as having “low” and “not low” birthweights, it can be seen that the maximum value of “low” birthweight babies is 5.500 pounds and the minimum weight for “not low” babies is 5.560. This implies that the cutoff of what classifies a baby’s weight as “low” or “not low” is somewhere at or between these two values.
I believe that 5.500 pounds is the cutoff value, such that babies at or below this weight are classified as having a “low” birthweight and any babies above this weight are classified as “not low”. I believe this answer is correct because, as seen in the dataframes, the seven babies weighing 5.500 pounds were all grouped as having “low” birthweights, while the five weighing 5.560 pounds, only 0.06 pounds more, were all grouped as “not low”. This is a difference of less than one tenth of a pound, showing that the two values are very close yet fall on different sides of the cutoff. A cutoff value of 5.500 is exactly halfway between 5 and 6 pounds, making it a more natural choice of cutoff than other possible decimal values.
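The comparison behind this argument is the largest “low” weight against the smallest “not low” weight. A minimal sketch of that check follows, using only the first ten rows printed by str() so that it runs without the full dataset. The names ending in `_head` are introduced here for illustration.

```r
# Sketch: bracket the cutoff between the largest "low" weight and the
# smallest "not low" weight, on the first ten rows from str(ncbirths).
weight_head <- c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)
lowbw_head  <- factor(c(2, 2, 2, 2, 2, 1, 2, 1, 2, 2),
                      levels = 1:2, labels = c("low", "not low"))

max_low     <- max(weight_head[lowbw_head == "low"])
min_not_low <- min(weight_head[lowbw_head == "not low"])

# The groups do not overlap, so the cutoff lies in [max_low, min_not_low)
print(c(max_low = max_low, min_not_low = min_not_low))
print(max_low < min_not_low)
```

On these ten rows the bracket is wide (5.38 to 6.38). On the full dataset the same two calls give the much tighter 5.50 and 5.56 reported in the summaries above.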
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Is there a significant difference between the average weights of male and female babies?
\(H_0 : \mu_{male} - \mu_{female} = 0\)
\(H_A : \mu_{male} - \mu_{female} \neq 0\)
#Creating a subset of male and female babies
male_babies <- subset(ncbirths, ncbirths$gender == "male")
female_babies <- subset(ncbirths, ncbirths$gender == "female")
#Storing the mean weight and standard deviation of male and female babies
mn_male_wt <- mean(male_babies$weight, na.rm = TRUE)
mn_female_wt <- mean(female_babies$weight, na.rm = TRUE)
sd_male_wt <- sd(male_babies$weight, na.rm = TRUE)
sd_female_wt <- sd(female_babies$weight, na.rm = TRUE)
#Storing the mean difference of male and female babies
mn_diff_babywt <- (mn_male_wt - mn_female_wt)
#Calculating the sample size of male and female babies
table(is.na(male_babies$weight))
##
## FALSE
## 497
table(is.na(female_babies$weight))
##
## FALSE
## 503
#Storing the standard error
SE_b_wt <- sqrt((sd_male_wt^2/497)+(sd_female_wt^2/503))
#Finding the test statistic
(mn_diff_babywt - 0)/SE_b_wt
## [1] 4.211303
#Finding the two-sided p-value, using the conservative degrees of freedom min(497, 503) - 1 = 496
pt(4.211303, df=496, lower.tail = FALSE)*2
## [1] 3.015765e-05
The p-value (3.015765e-05) is less than alpha (0.05), and we therefore reject the null hypothesis in favor of the alternate hypothesis.
The data suggest that there is indeed a difference between the average weights of male and female babies.
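The same two-sample comparison can be run with the built-in t.test(), which by default performs a Welch test: it uses the same unpooled standard error as above but estimates the degrees of freedom from the data instead of taking the conservative value. The sketch below is an alternative to the manual method, not a reproduction of it, and uses only the first ten rows printed by str() so that it runs without the full dataset. Note that t.test() subtracts in factor-level order (female minus male), so its statistic has the opposite sign to a male-minus-female difference.

```r
# Sketch: Welch two-sample t-test via t.test(), checked against the
# manual statistic, on the first ten rows from str(ncbirths).
babies <- data.frame(
  weight = c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94),
  gender = factor(c(2, 2, 1, 2, 1, 2, 2, 2, 2, 1),
                  levels = 1:2, labels = c("female", "male"))
)

# Built-in Welch test (var.equal = FALSE is the default)
welch <- t.test(weight ~ gender, data = babies)

# Manual statistic with the same unpooled standard error
w_f <- babies$weight[babies$gender == "female"]
w_m <- babies$weight[babies$gender == "male"]
se  <- sqrt(var(w_f) / length(w_f) + var(w_m) / length(w_m))
t_manual <- (mean(w_f) - mean(w_m)) / se

print(t_manual)
print(welch$statistic)
print(welch$parameter)   # Welch degrees of freedom, usually not an integer
```

With the full dataset, `t.test(weight ~ gender, data = ncbirths)` would be the corresponding call. Its degrees of freedom would differ from the conservative 496 used above, so its p-value would not match exactly.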